{"title": "A General Purpose Image Processing Chip: Orientation Detection", "book": "Advances in Neural Information Processing Systems", "page_first": 873, "page_last": 879, "abstract": "", "full_text": "Asymptotic Theory for Regularization: One-Dimensional Linear Case\n\nPetri Koistinen\n\nRolf Nevanlinna Institute, P.O. Box 4, FIN-00014 University of Helsinki, Finland. Email: Petri.Koistinen@rni.helsinki.fi\n\nAbstract\n\nThe generalization ability of a neural network can sometimes be improved dramatically by regularization. To analyze the improvement one needs more refined results than the asymptotic distribution of the weight vector. Here we study the simple case of one-dimensional linear regression under quadratic regularization, i.e., ridge regression. We study the random design, misspecified case, where we derive expansions for the optimal regularization parameter and the ensuing improvement. It is possible to construct examples where it is best to use no regularization.\n\n1  INTRODUCTION\n\nSuppose that we have available training data $(X_1, Y_1), \\dots, (X_n, Y_n)$ consisting of pairs of vectors, and we try to predict $Y_i$ on the basis of $X_i$ with a neural network with weight vector $w$. One popular way of selecting $w$ is by the criterion\n\n(1)  $\\frac{1}{n} \\sum_{i=1}^{n} \\ell(X_i, Y_i, w) + \\lambda Q(w) = \\min!$,\n\nwhere the loss $\\ell(x, y, w)$ is, e.g., the squared error $\\|y - g(x, w)\\|^2$, the function $g(\\cdot, w)$ is the input/output function of the neural network, the penalty $Q(w)$ is a real function which takes on small values when the mapping $g(\\cdot, w)$ is smooth and high values when it changes rapidly, and the regularization parameter $\\lambda$ is a nonnegative scalar (which might depend on the training sample). 
We refer to the setup (1) as (training with) regularization, and to the same setup with the choice $\\lambda = 0$ as training without regularization. Regularization has been found to be very effective for improving the generalization ability of a neural network, especially when the sample size $n$ is of the same order of magnitude as the dimensionality of the parameter vector $w$; see, e.g., the textbooks (Bishop, 1995; Ripley, 1996).\n\nIn this paper we deal with asymptotics in the case where the architecture of the network is fixed but the sample size grows. To fix ideas, let us assume that the training data is part of an i.i.d. (independent, identically distributed) sequence $(X, Y); (X_1, Y_1), (X_2, Y_2), \\dots$ of pairs of random vectors, i.e., for each $i$ the pair $(X_i, Y_i)$ has the same distribution as the pair $(X, Y)$ and the collection of pairs is independent ($X$ and $Y$ can be dependent). Then we can define the (prediction) risk of a network with weights $w$ as the expected value\n\n(2)  $r(w) := \\mathbb{E}\\, \\ell(X, Y, w)$.\n\nLet us denote the minimizer of (1) by $w_n(\\lambda)$, and a minimizer of the risk $r$ by $w^*$. The quantity $r(w_n(\\lambda))$ is the average prediction error for data independent of the training sample. This quantity $r(w_n(\\lambda))$ is a random variable which describes the generalization performance of the network: it is bounded below by $r(w^*)$ and the more concentrated it is about $r(w^*)$, the better the performance. We will quantify this concentration by a single number, the expected value $\\mathbb{E}\\, r(w_n(\\lambda))$. We are interested in quantifying the gain (if any) in generalization for training with versus training without regularization defined by\n\n(3)  $\\mathbb{E}\\, r(w_n(0)) - \\mathbb{E}\\, r(w_n(\\lambda))$.\n\nWhen regularization helps, this is positive. 
\n\nHowever, relatively little can be said about the quantity (3) without specifying in detail how the regularization parameter is determined. We show in the next section that provided $\\lambda$ converges to zero sufficiently quickly (at the rate $o_p(n^{-1/2})$), then $\\mathbb{E}\\, r(w_n(0))$ and $\\mathbb{E}\\, r(w_n(\\lambda))$ are equal to leading order. It turns out that the optimal regularization parameter resides in this asymptotic regime. For this reason, delicate analysis is required in order to get an asymptotic approximation for (3). In this article we derive the needed asymptotic expansions only for the simplest possible case: one-dimensional linear regression where the regularization parameter is chosen independently of the training sample.\n\n2  REGULARIZATION IN LINEAR REGRESSION\n\nWe now specialize the setup (1) to the case of linear regression and a quadratic smoothness penalty, i.e., we take $\\ell(x, y, w) = [y - x^T w]^2$ and $Q(w) = w^T R w$, where now $y$ is scalar, $x$ and $w$ are vectors, and $R$ is a symmetric, positive definite matrix. It is well known (and easy to show) that then the minimizer of (1) is\n\n(4)  $w_n(\\lambda) = \\left[ \\frac{1}{n} \\sum_{i=1}^{n} X_i X_i^T + \\lambda R \\right]^{-1} \\frac{1}{n} \\sum_{i=1}^{n} X_i Y_i$.\n\nThis is called the generalized ridge regression estimator, see, e.g., (Titterington, 1985); ridge regression corresponds to the choice $R = I$, see (Hoerl and Kennard, 1988) for a survey. Notice that (generalized) ridge regression is usually studied in the fixed design case, where the $X_i$ are nonrandom. Further, it is usually assumed that the model is correctly specified, i.e., that there exists a parameter $w^*$ such that $Y_i = X_i^T w^* + \\epsilon_i$, and such that the distribution of the noise term $\\epsilon_i$ does not depend on $X_i$. In contrast, we study the random design, misspecified case. 
\nAssuming that $\\mathbb{E}\\, \\|X\\|^2 < \\infty$ and that $\\mathbb{E}\\, [X X^T]$ is invertible, the minimizer of the risk (2) and the risk itself can be written as\n\n(5)  $w^* = A^{-1} \\mathbb{E}\\, [XY]$, with $A := \\mathbb{E}\\, [X X^T]$,\n\n(6)  $r(w) = r(w^*) + (w - w^*)^T A (w - w^*)$.\n\nIf $Z_n$ is a sequence of random variables, then the notation $Z_n = o_p(n^{-\\alpha})$ means that $n^{\\alpha} Z_n$ converges to zero in probability as $n \\to \\infty$. For this notation and the mathematical tools needed for the following proposition see, e.g., (Serfling, 1980, Ch. 1) or (Brockwell and Davis, 1987, Ch. 6).\n\nProposition 1  Suppose that $\\mathbb{E}\\, Y^4 < \\infty$, $\\mathbb{E}\\, \\|X\\|^4 < \\infty$ and that $A = \\mathbb{E}\\, [X X^T]$ is invertible. If $\\lambda = o_p(n^{-1/2})$, then both $\\sqrt{n}\\,(w_n(0) - w^*)$ and $\\sqrt{n}\\,(w_n(\\lambda) - w^*)$ converge in distribution to $N(0, C)$, a normal distribution with mean zero and covariance matrix $C$.\n\nThe previous proposition also generalizes to the nonlinear case (under more complicated conditions). Given this proposition, it follows (under certain additional conditions) by Taylor expansion that both $\\mathbb{E}\\, r(w_n(\\lambda)) - r(w^*)$ and $\\mathbb{E}\\, r(w_n(0)) - r(w^*)$ admit the expansion $\\beta_1 n^{-1} + o(n^{-1})$ with the same constant $\\beta_1$. Hence, in the regime $\\lambda = o_p(n^{-1/2})$ we need to consider higher order expansions in order to compare the performance of $w_n(\\lambda)$ and $w_n(0)$.\n\n3  ONE-DIMENSIONAL LINEAR REGRESSION\n\nWe now specialize the setting of the previous section to the case where $x$ is scalar. Also, from now on, we only consider the case where the regularization parameter for given sample size $n$ is deterministic; especially $\\lambda$ is not allowed to depend on the training sample. This is necessary, since coefficients in the following type of asymptotic expansions depend on the details of how the regularization parameter is determined. 
The deterministic case is the easiest one to analyze.\nWe develop asymptotic expansions for the criterion\n\n(7)  $J_n(k) := \\mathbb{E}\\, r(w_n(k)) - r(w^*)$,\n\nwhere now the regularization parameter $k$ is deterministic and nonnegative. The expansions we get turn out to be valid uniformly for $k \\ge 0$. We then develop asymptotic formulas for the minimizer of $J_n$, and also for $J_n(0) - \\inf J_n$. The last quantity can be interpreted as the average improvement in generalization performance gained by the optimal level of regularization, when the regularization constant is allowed to depend on $n$ but not on the training sample.\n\nFrom now on we take $Q(w) = w^2$ and assume that $A = \\mathbb{E}\\, X^2 = 1$ (which could be arranged by a linear change of variables). Referring back to formulas in the previous section, we see that\n\n(8)  $r(w_n(k)) - r(w^*) = (\\bar{V}_n - k w^*)^2 / (\\bar{U}_n + 1 + k)^2 =: h(\\bar{U}_n, \\bar{V}_n, k)$,\n\nwhence $J_n(k) = \\mathbb{E}\\, h(\\bar{U}_n, \\bar{V}_n, k)$, where we have introduced the function $h$ (used heavily in what follows) as well as the arithmetic means $\\bar{U}_n$ and $\\bar{V}_n$:\n\n(9)  $\\bar{U}_n := \\frac{1}{n} \\sum_{i=1}^{n} U_i$, with $U_i := X_i^2 - 1$,\n\n(10)  $\\bar{V}_n := \\frac{1}{n} \\sum_{i=1}^{n} V_i$, with $V_i := X_i Y_i - w^* X_i^2$.\n\nFor convenience, also define $U := X^2 - 1$ and $V := XY - w^* X^2$. Notice that $U; U_1, U_2, \\dots$ are zero mean i.i.d. random variables, and that $V; V_1, V_2, \\dots$ satisfy the same conditions. Hence $\\bar{U}_n$ and $\\bar{V}_n$ converge to zero, and this leads to the idea of using the Taylor expansion of $h(u, v, k)$ about the point $(u, v) = (0, 0)$ in order to get an expansion for $J_n(k)$.\n\nTo outline the ideas, let $T_j(u, v, k)$ be the degree $j$ Taylor polynomial of $(u, v) \\mapsto h(u, v, k)$ about $(0, 0)$, i.e., $T_j(u, v, k)$ is a polynomial in $u$ and $v$ whose coefficients are functions of $k$ and whose degree with respect to $u$ and $v$ is $j$. 
Then $\\mathbb{E}\\, T_j(\\bar{U}_n, \\bar{V}_n, k)$ depends on $n$ and moments of $U$ and $V$. By deriving an upper bound for the quantity $\\mathbb{E}\\, |h(\\bar{U}_n, \\bar{V}_n, k) - T_j(\\bar{U}_n, \\bar{V}_n, k)|$ we get an upper bound for the error committed in approximating $J_n(k)$ by $\\mathbb{E}\\, T_j(\\bar{U}_n, \\bar{V}_n, k)$. It turns out that for odd degrees $j$ the error is of the same order of magnitude in $n$ as for degree $j - 1$. Therefore we only consider even degrees $j$. It also turns out that the error bounds are uniform in $k \\ge 0$ whenever $j \\ge 2$. To proceed, we need to introduce assumptions.\n\nAssumption 1  $\\mathbb{E}\\, |X|^r < \\infty$ and $\\mathbb{E}\\, |Y|^s < \\infty$ for high enough $r$ and $s$.\n\nAssumption 2  Either (a) for some constant $\\beta > 0$ almost surely $|X| \\ge \\beta$ or (b) $X$ has a density which is bounded in some neighborhood of zero.\n\nAssumption 1 guarantees the existence of high enough moments; the values $r = 20$ and $s = 8$ are sufficient for the following proofs. E.g., if the pair $(X, Y)$ has a normal distribution or a distribution with compact support, then moments of all orders exist and hence in this case assumption 1 would be satisfied. Without some condition such as assumption 2, $J_n(0)$ might fail to be meaningful or finite. The following technical result is stated without proof.\n\nProposition 2  Let $p > 0$ and let $0 < \\mathbb{E}\\, X^2 < \\infty$. If assumption 2 holds, then\n\nwhere the expectation on the left is finite (a) for $n \\ge 1$ (b) for $n > 2p$ provided that assumption 2 (a), respectively 2 (b) holds.\n\nProposition 3  Let assumptions 1 and 2 hold. Then there exist constants $n_0$ and $M$ such that\n\n$J_n(k) = \\mathbb{E}\\, T_2(\\bar{U}_n, \\bar{V}_n, k) + R(n, k)$, where\n\n$\\mathbb{E}\\, T_2(\\bar{U}_n, \\bar{V}_n, k) = \\frac{(w^*)^2 k^2}{(1+k)^2} + n^{-1} \\left[ \\frac{\\mathbb{E}\\, V^2}{(1+k)^2} + 3 \\frac{(w^*)^2 k^2\\, \\mathbb{E}\\, U^2}{(1+k)^4} + 4 \\frac{w^* k\\, \\mathbb{E}\\, [UV]}{(1+k)^3} \\right]$,\n\n$|R(n, k)| \\le M n^{-3/2} (k+1)^{-1}$, $\\forall n \\ge n_0$, $k \\ge 0$.\n\nPROOF SKETCH  The formula for $\\mathbb{E}\\, T_2(\\bar{U}_n, \\bar{V}_n, k)$ follows easily by integrating the degree two Taylor polynomial term by term. To get the upper bound for $R(n, k)$, consider the residual\n\nwhere we have omitted four similar terms. Using the bound\n\nthe $L_1$ triangle inequality, and the Cauchy-Schwarz inequality, we get\n\n$|R(n, k)| = |\\mathbb{E}\\, [h(\\bar{U}_n, \\bar{V}_n, k) - T_2(\\bar{U}_n, \\bar{V}_n, k)]|$\n$\\le (k+1)^{-4} \\left\\{ \\mathbb{E} \\left[ \\left( \\frac{1}{n} \\sum_{i=1}^{n} X_i^2 \\right)^{-4} \\right] \\right\\}^{1/2} \\left\\{ 2(k+1)^3 [\\mathbb{E}\\, (|\\bar{U}_n|^2 |\\bar{V}_n|^4)]^{1/2} + 4 (w^*)^2 k^2 (k+1) [\\mathbb{E}\\, |\\bar{U}_n|^6]^{1/2} + \\dots \\right\\}$.\n\nBy proposition 2, here $\\mathbb{E} \\left[ \\left( \\frac{1}{n} \\sum_{i=1}^{n} X_i^2 \\right)^{-4} \\right] = O(1)$. Next we use the following fact, cf. (Serfling, 1980, Lemma B, p. 68).\n\nFact 1  Let $\\{Z_i\\}$ be i.i.d. with $\\mathbb{E}\\, [Z_i] = 0$ and with $\\mathbb{E}\\, |Z_1|^\\nu < \\infty$ for some $\\nu \\ge 2$. Then\n\n$\\mathbb{E}\\, \\left| \\frac{1}{n} \\sum_{i=1}^{n} Z_i \\right|^\\nu = O(n^{-\\nu/2})$.\n\nApplying the Cauchy-Schwarz inequality and this fact, we get, e.g., that\n\n$[\\mathbb{E}\\, (|\\bar{U}_n|^2 |\\bar{V}_n|^4)]^{1/2} \\le [(\\mathbb{E}\\, |\\bar{U}_n|^4)^{1/2} (\\mathbb{E}\\, |\\bar{V}_n|^8)^{1/2}]^{1/2} = O(n^{-3/2})$.\n\nGoing through all the terms carefully, we see that the bound holds. $\\Box$\n\nProposition 4  Let assumptions 1 and 2 hold, assume that $w^* \\ne 0$, and set\n\n$\\alpha_1 := (\\mathbb{E}\\, V^2 - 2 w^* \\mathbb{E}\\, [UV]) / (w^*)^2$.\n\nIf $\\alpha_1 > 0$, then there exists a constant $n_1$ such that for all $n \\ge n_1$ the function $k \\mapsto \\mathbb{E}\\, T_2(\\bar{U}_n, \\bar{V}_n, k)$ has a unique minimum on $[0, \\infty)$ at the point $k_n^*$ admitting the expansion\n\n$k_n^* = \\alpha_1 n^{-1} + O(n^{-2})$;\n\nfurther,\n\n$J_n(0) - \\inf\\{J_n(k) : k \\ge 0\\} = J_n(0) - J_n(\\alpha_1 n^{-1}) = \\alpha_1^2 (w^*)^2 n^{-2} + O(n^{-5/2})$.\n\nIf $\\alpha_1 \\le 0$, then\n\nPROOF SKETCH  The proof is based on perturbation expansion considering $1/n$ a small parameter. By the previous proposition, $S_n(k) := \\mathbb{E}\\, T_2(\\bar{U}_n, \\bar{V}_n, k)$ is the sum of $(w^*)^2 k^2 / (1+k)^2$ and a term whose supremum over $k \\ge k_0 > -1$ goes to zero as $n \\to \\infty$. Here the first term has a unique minimum on $(-1, \\infty)$ at $k = 0$. Differentiating $S_n$ we get\n\n$S_n'(k) = [2 (w^*)^2 k (k+1)^2 + n^{-1} p_2(k)] / (k+1)^5$,\n\nwhere $p_2(k)$ is a second degree polynomial in $k$. 
The numerator polynomial has three roots, one of which converges to zero as $n \\to \\infty$. A regular perturbation expansion for this root, $k_n^* = \\alpha_1 n^{-1} + \\alpha_2 n^{-2} + \\dots$, yields the stated formula for $\\alpha_1$. This point is a minimum for all sufficiently large $n$; further, it is greater than zero for all sufficiently large $n$ if and only if $\\alpha_1 > 0$.\n\nThe estimate for $J_n(0) - \\inf\\{J_n(k) : k \\ge 0\\}$ in the case $\\alpha_1 > 0$ follows by noticing that\n\n$J_n(0) - J_n(k) = \\mathbb{E}\\, [h(\\bar{U}_n, \\bar{V}_n, 0) - h(\\bar{U}_n, \\bar{V}_n, k)]$,\n\nwhere we now use a third degree Taylor expansion about $(u, v, k) = (0, 0, 0)$:\n\n$h(u, v, 0) - h(u, v, k) = 2 w^* k v - (w^*)^2 k^2 - 4 w^* k u v + 2 (w^*)^2 k^2 u + 2 k v^2 - 4 w^* k^2 v + 2 (w^*)^2 k^3 + r(u, v, k)$.\n\nFigure 1: Illustration of the asymptotic approximations in the situation of equation (11). Horizontal axis: $k$; vertical axis: $J_n(k)$ and its asymptotic approximations. Legend: markers, $J_n(k)$; solid line, $\\mathbb{E}\\, T_2(\\bar{U}_n, \\bar{V}_n, k)$; dashed line, $\\mathbb{E}\\, T_4(\\bar{U}_n, \\bar{V}_n, k)$.\n\nUsing the techniques of the previous proposition, it can be shown that $\\mathbb{E}\\, |r(\\bar{U}_n, \\bar{V}_n, k_n^*)| = O(n^{-5/2})$. Integrating the Taylor polynomial and using this estimate gives\n\n$J_n(0) - J_n(\\alpha_1/n) = \\alpha_1^2 (w^*)^2 n^{-2} + O(n^{-5/2})$.\n\nFinally, by the mean value theorem,\n\n$J_n(0) - \\inf\\{J_n(k) : k \\ge 0\\} = J_n(0) - J_n(\\alpha_1/n) + \\frac{d}{dk} [J_n(0) - J_n(k)] \\big|_{k=\\theta} (k_n^* - \\alpha_1/n)$\n$= J_n(0) - J_n(\\alpha_1/n) + O(n^{-1}) O(n^{-2})$,\n\nwhere $\\theta$ lies between $k_n^*$ and $\\alpha_1/n$, and where we have used the fact that the indicated derivative evaluated at $\\theta$ is of order $O(n^{-1})$, as can be shown with moderate effort. $\\Box$\n\nRemark  In the preceding we assumed that $A = \\mathbb{E}\\, X^2$ equals 1. 
If this is not the case, then the formula for $\\alpha_1$ has to be divided by $A$; again, if $\\alpha_1 > 0$, then $k_n^* = \\alpha_1 n^{-1} + O(n^{-2})$.\n\nIf the model is correctly specified in the sense that $Y = w^* X + \\epsilon$, where $\\epsilon$ is independent of $X$ and $\\mathbb{E}\\, \\epsilon = 0$, then $V = X \\epsilon$ and $\\mathbb{E}\\, [UV] = 0$. Hence we have $\\alpha_1 = \\mathbb{E}\\, [\\epsilon^2] / (w^*)^2$, and this is strictly positive except in the degenerate case where $\\epsilon = 0$ with probability one. This means that here regularization helps provided the regularization parameter is chosen around the value $\\alpha_1/n$ and $n$ is large enough. See Figure 1 for an illustration in the case\n\n(11)  $X \\sim N(0, 1)$, $Y = w^* X + \\epsilon$, $\\epsilon \\sim N(0, 1)$, $w^* = 1$,\n\nwhere $\\epsilon$ and $X$ are independent. $J_n(k)$ is estimated on the basis of 1000 repetitions of the task for $n = 8$. In addition to $\\mathbb{E}\\, T_2(\\bar{U}_n, \\bar{V}_n, k)$ the function $\\mathbb{E}\\, T_4(\\bar{U}_n, \\bar{V}_n, k)$ is also plotted. The latter can be shown to give $J_n(k)$ correctly up to order $O(n^{-5/2} (k+1)^{-3})$. Notice that although $\\mathbb{E}\\, T_2(\\bar{U}_n, \\bar{V}_n, k)$ does not give that good an approximation for $J_n(k)$, its minimizer is near the minimizer of $J_n(k)$, and both of these minimizers lie near the point $\\alpha_1/n = 0.125$ as predicted by the theory. In the situation (11) it can actually be shown by lengthy calculations that the minimizer of $J_n(k)$ is exactly $\\alpha_1/n$ for each sample size $n \\ge 1$.\n\nIt is possible to construct cases where $\\alpha_1 < 0$. For instance, take\n\n$X \\sim \\mathrm{Uniform}(a, b)$, $a = \\frac{1}{2}$, $b = \\frac{1}{4}(3\\sqrt{5} - 1)$,\n$Y = c/X + d + Z$, $c = -5$, $d = 8$,\n\nand $Z \\sim N(0, \\sigma^2)$ with $Z$ and $X$ independent and $0 \\le \\sigma < 1.1$. In such a case regularization using a positive regularization parameter only makes matters worse; using a properly chosen negative regularization parameter would, however, help in this particular case. 
This would, however, amount to rewarding rapidly changing functions. In the case (11) regularization using a negative value for the regularization parameter would be catastrophic.\n\n4  DISCUSSION\n\nWe have obtained asymptotic approximations for the optimal regularization parameter in (1) and the amount of improvement (3) in the simple case of one-dimensional linear regression when the regularization parameter is chosen independently of the training sample. It turned out that the optimal regularization parameter is, to leading order, given by $\\alpha_1 n^{-1}$ and the resulting improvement is of order $O(n^{-2})$. We have also seen that if $\\alpha_1 < 0$ then regularization only makes matters worse.\n\nAlso (Larsen and Hansen, 1994) have obtained asymptotic results for the optimal regularization parameter in (1). They consider the case of a nonlinear network; however, they assume that the neural network model is correctly specified.\n\nThe generalization of the present results to the nonlinear, misspecified case might be possible using, e.g., techniques from (Bhattacharya and Ghosh, 1978). Generalization to the case where the regularization parameter is chosen on the basis of the sample (say, by cross validation) would be desirable.\n\nAcknowledgements\n\nThis paper was prepared while the author was visiting the Department for Statistics and Probability Theory at the Vienna University of Technology with financial support from the Academy of Finland. I thank F. Leisch for useful discussions.\n\nReferences\n\nBhattacharya, R. N. and Ghosh, J. K. (1978). On the validity of the formal Edgeworth expansion. The Annals of Statistics, 6(2):434-451.\n\nBishop, C. M. (1995). Neural Networks for Pattern Recognition. Oxford University Press.\n\nBrockwell, P. J. and Davis, R. A. (1987). 
Time Series: Theory and Methods. Springer series in statistics. Springer-Verlag.\n\nHoerl, A. E. and Kennard, R. W. (1988). Ridge regression. In Kotz, S., Johnson, N. L., and Read, C. B., editors, Encyclopedia of Statistical Sciences. John Wiley & Sons, Inc.\n\nLarsen, J. and Hansen, L. K. (1994). Generalization performance of regularized neural network models. In Vlontzos, J., Hwang, J.-N., and Wilson, E., editors, Proc. of the 4th IEEE Workshop on Neural Networks for Signal Processing, pages 42-51. IEEE Press.\n\nRipley, B. D. (1996). Pattern Recognition and Neural Networks. Cambridge University Press.\n\nSerfling, R. J. (1980). Approximation Theorems of Mathematical Statistics. John Wiley & Sons, Inc.\n\nTitterington, D. M. (1985). Common structure of smoothing techniques in statistics. International Statistical Review, 53:141-170.\n\n\fA General Purpose Image Processing Chip: Orientation Detection\n\nRalph Etienne-Cummings and Donghui Cai\n\nDepartment of Electrical Engineering\nSouthern Illinois University\nCarbondale, IL 62901-6603\n\nAbstract\n\nAn 80 x 78 pixel general purpose vision chip for spatial focal plane processing is presented. The size and configuration of the processing receptive field are programmable. The chip's architecture allows the photoreceptor cells to be small and densely packed by performing all computation on the read-out, away from the array. In addition to the raw intensity image, the chip outputs four processed images in parallel. Also presented is an application of the chip to line segment orientation detection, as found in the retinal receptive fields of toads.\n\n1  INTRODUCTION\n\nThe front-end of the biological vision system is the retina, which is a layered structure responsible for image acquisition and pre-processing. 
The early processing is used to extract spatiotemporal information which helps perception and survival. This is accomplished with cells having feature detecting receptive fields, such as the edge detecting center-surround spatial receptive fields of the primate and cat bipolar cells [Spillmann, 1990]. In toads, the receptive fields of the retinal cells are even more specialized for survival by detecting \"prey\" and \"predator\" (from size and orientation filters) at this very early stage [Spillmann, 1990].\n\nThe receptive fields of the retinal cells perform a convolution with the incident image in parallel and in continuous time. This has inspired many engineers to develop retinomorphic vision systems which also imitate these parallel processing capabilities [Mead, 1989; Camp, 1994]. While this approach is ideal for fast early processing, it is not space efficient. That is, in realizing the receptive field within each pixel, considerable die area is required to implement the convolution kernel. In addition, should programmability be required, the complexity of each pixel increases drastically. The space constraints are eliminated if the processing is performed serially during read-out. The benefits of this approach are 1) each pixel can be as small as possible to allow high resolution imaging, 2) a single processor unit is used for the entire retina, thus reducing mismatch problems, 3) programmability can be obtained with no impact on the density of the imaging array, and 4) compact general purpose focal plane visual processing is realizable. The space 
The  space \nconstrains are then  transfonned  into  temporal  restrictions  since  the  scanning clock  speed \nand  response  time  of  the  processing  circuits  must  scale  with  the  size  of  the  array. \nDividing  the  array  into  sub-arrays  which  are  scanned  in  parallel  can  help  this  problem. \nClearly this approach departs from  the architecture of its  biological  counterpart,  however, \nthis method capitalizes on  the  main  advantage  of silicon  which  is  its  speed.  This  is  an \nexample  of mixed signal  neuromorphic  engineering,  where  biological  ideas  are  mapped \nonto silicon not using direct imitation (which has been the preferred approach  in  the  past) \nbut rather by  realizing  their  essence  with  the  best  silicon  architecture  and computational \ncircuits. \n\nThis paper presents a general  purpose  vision  chip  for  spatial  focal  plane  processing.  Its \narchitecture allows the photoreceptor cells to  be  small  and  densely  packed  by  performing \nall  computation  on  the  read-out,  away  from  the  array.  Performing  computation  during \nread-out  is  ideal  for  silicon  implementation  since  no  additional  temporal  over-head  is \nrequired,  provided that  the  processing  circuits  are  fast  enough.  The  chip  uses  a  single \nconvolution kernel, per parallel  sub-array,  and  the  scanning  bit  pattern  to  realize  various \nreceptive  fields.  This  is  different  from  other  focal  plane  image  processors  which  am \nusually  restricted  to  hardwired  convolution  kernels,  such  as  oriented  20  Gabor  filters \n[Camp,  1994]. \nIn  addition  to  the  raw  intensity  image,  the  chip  outputs  four  processed \nversions  per sub-array.  Also  presented  is  an  application  of  the  chip  to  line  segment \norientation detection, as found in the retinal receptive fields of toads [Spillmann,  1990]. 
\n\n2  THE GENERAL PURPOSE IMAGE PROCESSING CHIP\n\n2.1  System Overview\n\nThis chip has an 80 row by 78 column photocell array partitioned into four independent sub-arrays, which are scanned and output in parallel (see Figure 1). Each block is 40 rows by 39 columns, and has its own convolution kernel and output circuit. The scanning circuit includes three parts: virtual ground, control signal generator (CSG), and scanning output transformer. Each block has its own virtual ground and scanning output transformer in both the x direction (horizontal) and the y direction (vertical). The control signal generator is shared among blocks.\n\n2.2  Hardware Implementation\n\nThe photocell is composed of a phototransistor, a photocurrent amplifier, and an output control. The phototransistor performs light transduction, while the amplifier magnifies the photocurrent by three orders of magnitude. The output control provides multiple copies of the amplified photocurrent, which are subsequently used for focal plane image processing.\n\nThe phototransistor is a parasitic PNP transistor in an Nwell CMOS process. The current amplifier uses a pair of diode connected pmosfets to obtain a logarithmic relationship between light intensity and output current. This circuit also amplifies the photocurrent from nanoamperes to microamperes. The photocell sends three copies of the output current into three independent buses. The connections from the photocell to the buses are controlled by pass transistors, as shown in Fig. 2. The three current outputs allow the image to be processed using multiple receptive field organizations (convolution kernels), while the raw image is also output. 
The row (column) buses provide currents for extracting horizontally (vertically) oriented image features, while the original bus provides the logarithmically compressed intensity image.\n\nThe scanning circuit addresses the photocell array by selecting groups of cells at one time. Since the outputs of the cells are currents, virtual ground circuits are used on each bus to mask the >1 pF capacitance of the buses. The CSG, implemented with shift registers, produces signals which select photocells and control the scanning output transformer. The scanning output transformer converts currents from all row buses into $I_{perx}$ and $I_{cenx}$, and converts currents from all column buses into $I_{pery}$ and $I_{ceny}$. This transformation is required to implement the various convolution kernels discussed later.\n\nFigure 1: Block diagram of the chip. 
\n\nThe output transformer circuits are controlled by a central CSG and a peripheral CSG. These two generators have identical structures but different initial values. Each consists of an n-bit shift register in the x direction (horizontal) and an m-bit shift register in the y direction (vertical). A feedback circuit is used to restore the scanning pattern into the x shift register after each row scan is completed. This is repeated until all the rows in each block are scanned.\n\nThe control signals from the peripheral and central CSGs select all the cells covered by a 2D convolution mask (receptive field). The selected cells send $I_{xy}$ to the original bus, $I_{xp}$ to the row bus, and $I_{yp}$ to the column bus. The function of the scanning output transformer is to identify which rows (columns) are considered as the center ($I_{cenx}$ or $I_{ceny}$) or periphery ($I_{perx}$ or $I_{pery}$) of the convolution kernel, respectively. Figure 3 shows how a 3x3 convolution kernel can be constructed.\n\nFigure 4 shows how the output transformer works for a 3x3 mask. Only the row bus transformation is shown in this example, but the same mechanism applies to the column bus as well. The photocell array is m rows by n columns, and the mask size is 3x3. XC (x center) and YC (y center) come from the central CSG, while XP (x peripheral) and YP (y peripheral) come from the peripheral CSG. After loading the CSG, the initial values of XP and YP are both 00011...1. The initial values of XC and YC are both 10111...1. This identifies the central cell as location (2, 2). The currents from the central row (column) are summed to form $I_{cenx}$ and $I_{ceny}$, while all the peripheral cells are summed to form $I_{perx}$ and $I_{pery}$. This is achieved by activating the switches labeled XC, YC, XP and YP in Figure 2. $XP_i$ ($YP_i$) {i = 1, 2, ... 
, n} controls whether the output current of one cell goes to the row (column) bus. Since $XP_i$ ($YP_i$) is connected to the gate of a pmos switch, a 0 in $XP_i$ ($YP_i$) turns it on. $YC_i$ ($XC_i$) {i = 1, 2, ..., n} controls whether a row (column) bus connects to the $I_{cenx}$ bus in the same way. On the other hand, the connection from a row (column) bus to the $I_{perx}$ bus is controlled by an nmos and a pmos switch. The connection is made if and only if $YC_i$ ($XC_i$), an nmos switch, is 1 and $YP_i$ ($XP_i$), a pmos switch, is 0. The intensity image is obtained directly when $XC_i$ and $YC_i$ are both 0. Hence, $I_{ori} = I(2,2)$, $I_{cenx} = I_{row2} = I(2,1) + I(2,2) + I(2,3)$ and $I_{perx} = I_{row1} + I_{row3} = I(1,1) + I(1,2) + I(1,3) + I(3,1) + I(3,2) + I(3,3)$.\n\nFigure 2: Connections between a photocell and the current buses.\n\nFigure 3: Constructing a 3x3 receptive field.\n\nThe convolution kernel can be programmed to perform many image processing tasks by loading the scanning circuit with the appropriate bit pattern. This is illustrated by configuring the chip to perform image smoothing and edge extraction (x edge, y edge, and 2D edge), which are all computed simultaneously on read-out. The convolution kernel circuit receives five inputs ($I_{ori}$, $I_{cenx}$, $I_{perx}$, $I_{ceny}$, $I_{pery}$) from the scanning circuit and produces five outputs ($I_{ori}$, $I_{edgex}$, $I_{edgey}$, $I_{smooth}$, $I_{edge2D}$). The kernel (receptive field) size is programmable among 3x3, 5x5, 7x7, 9x9 and 11x11. Fig. 5 shows the 3x3 masks for this processing. 
Repeating the above steps\nfor 5x5, 7x7, 9x9, and 11x11 masks, we can get similar results.\n\nFigure 4: Scanning output transformer for an m row by n column photocell array.\n\n(a) smooth\n 1  1  1\n 1  1  1\n 1  1  1\n\n(b) edge_x\n-1 -1 -1\n 2  2  2\n-1 -1 -1\n\n(c) edge_y\n-1  2 -1\n-1  2 -1\n-1  2 -1\n\n(d) edge_2D\n 0 -1  0\n-1  4 -1\n 0 -1  0\n\nFigure 5: 3x3 convolution masks for various image processing tasks.\n\nIn general, the convolution results under different mask sizes can be expressed as follows:\n\nIsmooth = Icenx + Iperx\nIedgex = K1d * Icenx - Iperx\nIedgey = K1d * Iceny - Ipery\nIedge2D = K2d * Iori - Icenx - Iceny\n\nwhere K1d and K2d are the programmable coefficients (from 2-6 and 6-14, respectively) for\n1D edge extraction and 2D edge extraction, respectively. By varying the locations of the\n0's in the scanning circuits, different types of receptive fields (convolution kernels) can be\nrealized.\n\n2.3 Results\n\nThe chip contains 65K transistors in a footprint of 4.6 mm x 4.7 mm. There are 80 x 78\nphotocells in the chip, each of which is 45.6 \u00b5m x 45 \u00b5m with a fill factor of 15%. The\nconvolution kernel occupies 690.6 \u00b5m x 102.6 \u00b5m. The power consumption of the chip\nfor a 3x3 (11x11) receptive field, indoor light, and a 5 V power supply is < 2 mW (8 mW).\n\nTo capitalize on the programmability of this chip, an A/D card in a Pentium 133 MHz PC\nis used to load the scanning circuit and to collect data. The card, which has a maximum\nanalog throughput of 100 kHz, limits the frame rate of the chip to 12 frames per second. 
\nAt this rate, five processed versions of the image are collected and displayed. The scanning\nand processing circuits can operate at 10 MHz (6250 fps); however, the phototransistors\nhave much slower dynamics. Temporal smoothing (smear) can be observed on the scope\nwhen the frame rate exceeds 100 fps.\n\nThe chip displays a logarithmic relationship between light intensity and output current\n(unprocessed image) from 0.1 lux (100 nA) to 6000 lux (10 \u00b5A). The fixed pattern\nnoise, defined as standard-deviation/mean, decreases abruptly from 25% in the dark to 2%\nat room light (800 lux). This behavior is expected since the variation of individual pixel\ncurrent is large compared to the mean output when the mean is small. The logarithmic\nresponse of the photocell results in high sensitivity at low light, thus increasing the\nmean value sharply. Little variation is observed between chips.\n\nThe contrast sensitivity of the edge detection masks is also measured for the 3x3 and 5x5\nreceptive fields. Here contrast is defined as (Imax - Imin)/(Imax + Imin) and sensitivity is given\nas a percentage of the maximum output. The measurements are performed for normal\nroom and bright lighting conditions. Since the two conditions correspond to the\nsaturated part of the logarithmic transfer function of the photocells, a linear\nrelationship between output response and contrast is expected. Figure 6 shows the contrast\nsensitivity plot. Figure 7 shows examples of the chip's outputs. The top two images are the\nraw and smoothed (5x5) images. The bottom two are the 1D edge_x (left) and 2D edge\n(right) images. The pixels with positive values have been thresholded to white. The\nvertical black line in the image is not visible in the edge_x image, but can be clearly seen\nin the edge_2D image.\n\n
Figure 6: Contrast sensitivity function of the x edge detection mask (x-axis: Contrast [%]; curves: 5x5 Bright, 5x5 Normal, 3x3 Bright, 3x3 Normal).\n\nFigure 7: (Clockwise) Raw image, 5x5 smoothed image, edge_2D and edge_x.\n\n3 APPLICATION: ORIENTATION DETECTION\n\n3.1 Algorithm Overview\n\nThis vision chip can be elegantly used to measure the orientation of line segments which\nfall across the receptive field of each pixel. The output of the 1D Laplacian operators,\nedge_x and edge_y, shown in figure 5, can be used to determine the orientation of edge\nsegments. Consider a continuous line through the origin, represented by a delta function\nin 2D space by \u03b4(y - x tan\u03b8). If the origin is the center of the receptive field, the response\nof the edge_x kernel can be computed by evaluating the convolution equation (1), where\nW(x) = u(x+m) - u(x-m) is the x window over which smoothing is performed, 2m+1 is the\nwidth of the window and 2n+1 is the number of coefficients realizing the discrete\nLaplacian operator. In our case, n = m. Evaluating this equation and substituting the\norigin for the pixel location yields equation (2), which indicates that the output of the 1D\nedge_x (edge_y) detectors has a discretized linear relationship to orientation from 0\u00b0 to\n45\u00b0 (45\u00b0 to 90\u00b0). At 0\u00b0, the second term in equation (2) is zero. As \u03b8 increases, more\nterms are subtracted until all terms are subtracted at 45\u00b0. Above 45\u00b0 (below 45\u00b0), the\nedge_x (edge_y) detectors output zero since equal numbers of positive and negative\ncoefficients are summed. Provided that contrast can be normalized, 
the output of the\ndetectors can be used to extract the orientation of the line. Clearly these responses are\neven about the x- and y-axis, respectively. Hence, a second pair of edge detectors, oriented\nat 45\u00b0, is required to uniquely extract the angle of the line segment.\n\nFigure 8: Measured orientation transfer function of edge_x detectors (output versus angle [\u00b0], one curve for each of four illumination levels).\n\nO_{edge\\_x}(x, y) = [2n W(x \\pm m) \\delta(y) - \\sum_{i=1}^{n} W(x \\pm m) \\delta(y \\pm i)] * \\delta(y - x \\tan\\theta)    (1)\n\nO_{edge\\_x}(0, 0) = 2n - [\\sum_{i=1}^{n} W(i / \\tan\\theta) + \\sum_{i=1}^{n} W(-i / \\tan\\theta)]    (2)\n\n3.2 Results\n\nFigure 8 shows the measured output of the edge_x detectors for various lighting\nconditions as a line is rotated. The average positive outputs are plotted. As expected, the\noutput is maximum for bright ambients when the line is horizontal. As the line is\nrotated, the output current decreases linearly and levels off at approximately 45\u00b0. On the\nother hand, the edge_y output (not shown) begins its linear increase at 45\u00b0 and maximizes at 90\u00b0.\nAfter normalizing for brightness, the four curves are very similar (not shown).\n\nTo further demonstrate orientation detection with this chip, a character consisting of a\ncircle and some straight lines is presented. The intensity image of the character is shown\nin figure 9(a). 
Figures 9(b) and 9(c) show the outputs of the edge_x and edge_y\ndetectors, respectively. Since a 7x7 receptive field is used in this experiment, some outer\npixels of each block are lost. The orientation selectivity of the 1D edge detectors is\nclearly visible in the figures, where edge_x highlights horizontal edges and edge_y vertical\nedges. Figure 9(d) shows the reported angles. A program is written which takes the two\n1D edge images, finds the location of the edges from the edge_2D image and the intensity at\nthe edges (positive lobe), and then computes the angle of the edge segment. In figure 9(d),\nthe black background is chosen for locations where no edges are detected, white is used for\n0\u00b0 and gray for 90\u00b0.\n\nFigure 9: Orientation detection using 1D Laplacian operators, panels (a) to (d).\n\n4 CONCLUSION\n\nAn 80x78 pixel general purpose vision chip for spatial focal plane processing has been\npresented. The size and configuration of the processing receptive field are programmable.\nIn addition to the raw intensity image, the chip outputs four processed images in parallel.\nThe chip has been successfully used for compact line segment orientation detection,\nwhich can be used in character recognition. The programmability and relatively low\npower consumption make it ideal for many visual processing tasks.\n\nReferences\n\nCamp W. and J. Van der Spiegel, \"A Silicon VLSI Optical Sensor for Pattern\nRecognition,\" Sensors and Actuators A, Vol. 43, No. 1-3, pp. 188-195, 1994.\n\nMead C. and M. Ismail (Eds.), Analog VLSI Implementation of Neural Networks,\nKluwer Academic Press, Norwell, MA, 1989.\n\nSpillmann L. and J. Werner (Eds.), Visual Perception: The Neurophysiological\nFoundations, Academic Press, San Diego, CA, 1990. 
\n\n\f", "award": [], "sourceid": 1455, "authors": [{"given_name": "Ralph", "family_name": "Etienne-Cummings", "institution": null}, {"given_name": "Donghui", "family_name": "Cai", "institution": null}]}