{"title": "Adaptive Spline Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 675, "page_last": 683, "abstract": null, "full_text": "ADAPTIVE SPLINE NETWORKS \n\nJerome H. Friedman \nDepartment of Statistics and \nStanford Linear Accelerator Center \nStanford University \nStanford, CA 94305 \n\nAbstract \n\nA network based on splines is described. It automatically adapts the number of units, unit parameters, and the architecture of the network for each application. \n\n1 INTRODUCTION \n\nIn supervised learning one has a system under study that responds to a set of simultaneous input signals {x_1, ..., x_n}. The response is characterized by a set of output signals {y_1, y_2, ..., y_m}. The goal is to learn the relationship between the inputs and the outputs. This exercise generally has two purposes: prediction and understanding. With prediction one is given a set of input values and wishes to predict or forecast likely values of the corresponding outputs without having to actually run the system. Sometimes prediction is the only purpose. Often, however, one wishes to use the derived relationship to gain understanding of how the system works. Such knowledge is often useful in its own right, for example in science, or it may be used to help improve the characteristics of the system, as in industrial or engineering applications. \nThe learning is accomplished by taking training data. One observes the outputs produced by the system in response to varying sets of input values \n\n{y_{1i}, ..., y_{mi} | x_{1i}, ..., x_{ni}}_{i=1}^{N}.     (1) \n\nThese data (1) are then used to train an \"artificial\" system (usually a computer program) to learn the input/output relationship. The underlying framework or model is usually taken to be \n\ny_k = f_k(x_1, ..., x_n) + e_k,   k = 1, ..., m     (2) \n\n675 \n\nwith ave(e_k | x_1, ..., x_n) = 0. 
Here (2) y_k is the kth responding output signal, f_k is a single valued deterministic function of an n-dimensional argument (the inputs), and e_k is a random (stochastic) component that reflects the fact that (if nonzero) y_k is not completely specified by the observed inputs, but is also responding to other quantities that are neither controlled nor observed. In this framework the learning goal is to use the training data to derive a function f^_k(x_1, ..., x_n) that can serve as a reasonable approximation (estimate) of the true underlying (\"target\") function f_k (2). The supervised learning problem can in this way be viewed as one of function or surface approximation, usually in high dimensions (n >> 2). \n\n2 SPLINES \n\nThere is an extensive literature on the theory of function approximation (see Cheney [1986] and Chui [1988], and references therein). From this literature spline methods have emerged as being among the most successful (see de Boor [1978] for a nice introduction to spline methods). Loosely speaking, spline functions have the property that they are the smoothest for a given flexibility, and vice versa. This is important if one wishes to operate under the least restrictive assumptions concerning f_k(x_1, ..., x_n) (2), namely, that it is relatively smooth compared to the noise e_k but is otherwise arbitrary. A spline approximation is characterized by its order q [q = 1 (linear), q = 2 (quadratic), and q = 3 (cubic) are the most popular orders]. The procedure is to first partition the input variable space into a set of disjoint regions. The approximation f^(x_1, ..., x_n) is taken to be a separate n-dimensional polynomial in each region with maximum degree q in any one variable, constrained so that f^ and all of its derivatives to order q - 1 are continuous across all region boundaries. 
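The continuity constraint just described can be verified numerically. The following sketch (illustrative only, not from the paper; the weights and knot location are invented) builds a one-dimensional order q = 2 spline from truncated power functions and checks that its value and first derivative are continuous across the knot:

```python
def trunc_pow(x, t, q):
    # truncated power function (x - t)_+^q: zero to the left of knot t
    return (x - t) ** q if x > t else 0.0

def spline(x, q=2, knot=1.0, w=(0.5, -1.0, 0.25, 2.0)):
    # w0 + w1*x + w2*x^2 + w3*(x - knot)_+^2, with hypothetical weights w
    return w[0] + w[1] * x + w[2] * x ** 2 + w[3] * trunc_pow(x, knot, q)

eps = 1e-6
# value and slope (finite differences) agree on both sides of the knot,
# even though the curvature jumps there
value_jump = abs(spline(1.0 - eps) - spline(1.0 + eps))
slope_left = (spline(1.0) - spline(1.0 - eps)) / eps
slope_right = (spline(1.0 + eps) - spline(1.0)) / eps
print(value_jump < 1e-5, abs(slope_left - slope_right) < 1e-4)  # True True
```

Only the q-th derivative is discontinuous at the knot, which is exactly the order-(q - 1) smoothness stated above.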
\nThus, a particular spline approximation is determined by a choice for q, which tends not to be very important, and by the particular set of chosen regions, which tends to be crucial. The central problem associated with spline approximations is how to choose a good set of associated regions for the problem at hand. \n\n2.1 TENSOR-PRODUCT SPLINES \n\nThe most popular method for partitioning the input variable space is by the tensor or outer product of interval sets on each of the n axes. Each input axis is partitioned into K + 1 intervals delineated by K points (\"knots\"). The regions in the n-dimensional space are taken to be the (K + 1)^n intersections of all such intervals. Figure 1 illustrates this procedure for K = 4 knots on each of two axes, producing 25 regions in the corresponding two-dimensional space. \n\nOwing to the regularity of tensor-product representations, the corresponding spline approximation can be represented in a simple form as a basis function expansion. Let x = (x_1, ..., x_n). Then \n\nf^(x) = sum_t w_t B_t(x)     (3) \n\nwhere {w_t} are the coefficients (weights) for each respective basis function B_t(x), and the basis function set {B_t(x)} is obtained by taking the tensor product of the set of functions \n\n{1, x_j, x_j^2, ..., x_j^q, (x_j - t_{1j})_+^q, ..., (x_j - t_{Kj})_+^q}     (4) \n\nover all of the axes, j = 1, ..., n. That is, each of the K + q + 1 functions on each axis j (j = 1, ..., n) is multiplied by all of the functions (4) corresponding to all of the other axes k (k = 1, ..., n; k != j). As a result the total number of basis functions (3) defining the tensor-product spline approximation is \n\n(K + q + 1)^n.     (5) \n\nThe functions comprising the second set in (4) are known as the truncated power functions: \n\n(x_j - t_{kj})_+^q = 0 if x_j <= t_{kj}, and (x_j - t_{kj})^q if x_j > t_{kj},     (6) \n\nand there is one for each knot location t_{kj} (k = 1, ..., K) on each input axis j (j = 1, ..., n). 
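As a concrete illustration of (3)-(6) (a sketch, not from the paper; the knot placements below are invented), the tensor-product basis can be enumerated directly. For n = 2 axes, K = 4 knots per axis, and cubic splines (q = 3), this yields (K + q + 1)^n = 8^2 = 64 basis functions:

```python
from itertools import product
from math import prod

def axis_basis(knots, q):
    # the K + q + 1 one-axis functions of (4): 1, x, ..., x^q, then one
    # truncated power function (x - t_k)_+^q per knot t_k
    funcs = [lambda x, p=p: x ** p for p in range(q + 1)]
    funcs += [lambda x, t=t: max(x - t, 0.0) ** q for t in knots]
    return funcs

def tensor_basis(knots_per_axis, q):
    # one n-variate basis function per choice of one factor on each axis
    axes = [axis_basis(k, q) for k in knots_per_axis]
    return [lambda x, fs=fs: prod(f(xi) for f, xi in zip(fs, x))
            for fs in product(*axes)]

knots = [0.2, 0.4, 0.6, 0.8]            # hypothetical knot locations
B = tensor_basis([knots, knots], q=3)   # n = 2, K = 4, q = 3
print(len(B))  # 64, i.e. (K + q + 1)^n
```

The first element of `B` is the constant function 1 (the product of the two x^0 factors), and every other element is a product of one factor per axis.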
\nAlthough conceptually quite simple, tensor-product splines have severe limitations that preclude their use in high dimensional settings (n >> 2). These limitations stem from the exponentially large number of basis functions that are required (5). For cubic splines (q = 3) with five inputs (n = 5) and only five knots per axis (K = 5), 59049 basis functions are required. For n = 6 that number is 531441, and for n = 10 it is approximately 3.5 x 10^9. This poses severe statistical problems in fitting the corresponding number of weights unless the training sample is large compared to these numbers, and computational problems in any case, since the computation grows as the cube of the number of weights (basis functions). These are typical manifestations of the so-called \"curse-of-dimensionality\" (Bellman [1961]) that afflicts nearly all high-dimensional problems. \n\n3 ADAPTIVE SPLINES \n\nThis section gives a very brief overview of an adaptive strategy that attempts to overcome the limitations of the straightforward application of tensor-product splines, making practical their use in high-dimensional settings. This method, called MARS (multivariate adaptive regression splines), is described in detail in Friedman [1991] along with many examples of its use involving both real and artificially generated data. (A FORTRAN program implementing the method is available from the author.) \nThe method (conceptually) begins by generating a tensor-product partition of the input variable space using a large number of knots, K < N, on each axis. Here N (1) is the training sample size. This induces a very large number, (K + 1)^n, of regions. The procedure then uses the training data to select particular unions of these (initially large number of) regions to define a relatively small number of (larger) regions most suitable for the problem at hand. 
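The basis-function counts quoted above follow directly from (5); a quick check of (K + q + 1)^n for q = 3 and K = 5:

```python
def n_basis(n, K=5, q=3):
    # total number of tensor-product basis functions, equation (5)
    return (K + q + 1) ** n

print(n_basis(5), n_basis(6), n_basis(10))  # 59049 531441 3486784401
```

The last value, 3486784401, is the "approximately 3.5 x 10^9" quoted in the text.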
\n\nThis strategy is implemented through the basis function representation of spline approximations (3). The idea is to select a relatively small subset of basis functions \n\n{B_m(x)}_{small} ⊂ {B_l(x)}_{large}     (7) \n\nfrom the very large set (3) (4) (5) induced by the initial tensor-product partition. The particular subset for a problem at hand is obtained through standard statistical variable subset selection, treating the basis functions as the \"variables\". At the first step the best single basis function is chosen. The second step chooses the basis function that works best in conjunction with the first. At the mth step, the one that works best with the m - 1 already selected is chosen, and so on. The process stops when including additional basis functions fails to improve the approximation. \n\n3.1 ADAPTIVE SPLINE NETWORKS \n\nThis section describes a network implementation that approximates the adaptive spline strategy described in the previous section. The goal is to synthesize a good set of spline basis functions (7) to approximate a particular system's input/output relationship, using the training data. For the moment, consider only one output y; this is generalized later. The basic observation leading to this implementation is that the approximation takes the form of sums of products of very simple functions, namely the truncated power functions (6), each involving a single input variable, \n\nB_m(x) = prod_{k=1}^{K_m} (x_{j(k)} - t_{km})_+^q     (8) \n\nand \n\nf^(x) = sum_{m=0}^{M} w_m B_m(x).     (9) \n\nHere K_m is the number of truncated power factors comprising the mth basis function, and j(k) labels the input variable entering its kth factor. \n\nBy using q > 0 splines, continuous approximations are produced. This generally results in a dramatic increase in accuracy. In addition, all unit outputs are eligible to contribute to the final adder, not just the terminal ones; and finally, all previous unit outputs are eligible to be selected as inputs for new units, not just the currently terminal ones. 
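The product form (8) and weighted sum (9) can be sketched directly as follows (illustrative only; the variable choices, knots, and weights below are invented, and q = 1 is used for simplicity):

```python
from math import prod

def mars_basis(factors, q=1):
    # factors: list of (input index j(k), knot t_k) pairs; an empty list
    # gives the constant basis function B_0(x) = 1 (the empty product)
    def B(x):
        return prod(max(x[j] - t, 0.0) ** q for j, t in factors)
    return B

def mars_model(terms):
    # terms: list of (weight w_m, basis B_m) pairs, as in (9)
    def f(x):
        return sum(w * B(x) for w, B in terms)
    return f

f = mars_model([
    (3.0, mars_basis([])),                      # intercept term
    (2.0, mars_basis([(0, 0.5)])),              # single factor (x_1 - 0.5)_+
    (-1.0, mars_basis([(0, 0.5), (1, 0.2)])),   # two-factor interaction
])
print(f((1.0, 1.0)))  # 3.6
```

Each factor involves a single input, so a basis function with K_m factors represents an interaction of order K_m, and the constant term corresponds to m = 0.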
\n\nBoth additive and CART approximations have been highly successful in largely complementary situations: additive modeling when the true underlying function is close to additive, and CART when it predominantly involves high order interactions among the input variables. MARS unifies both into a single framework. This lends hope that MARS will be successful at both of these extremes, as well as over the broad spectrum of situations in between where neither works well. \n\nMultiple response outputs y_1, ..., y_m (1) (2) are incorporated in a straightforward manner. The internal units and their interconnections are the same as described above and shown in Figures 2 and 3. Only the final weighted adder unit (Figure 2) is modified to incorporate a set of weights \n\n{w_{mk}}_{m=0}^{M}     (14) \n\nfor each response output (k = 1, ..., m). The approximation for each output is \n\nf^_k(x) = sum_{m=0}^{M} w_{mk} B_m(x),   k = 1, ..., m. \n\nThe numerator in the GCV criterion (12) is replaced by \n\n(1/(mN)) sum_{k=1}^{m} sum_{i=1}^{N} (y_{ik} - f^_{ik})^2 \n\nand it is minimized with respect to the internal network parameters (10) and all of the weights (14). \n\n4 DISCUSSION \n\nThis section (briefly) compares and contrasts the MARS approach with radial basis functions and with sigmoid \"back-propagation\" networks. An important consequence of the MARS strategy is input variable subset selection. Each unit individually selects the best system input so that it can best contribute to the approximation. It is often the case that some or many of the inputs are never selected. These will be inputs that tend to have little or no effect on the output(s). In this case excluding them from the approximation will greatly increase statistical accuracy. It also aids in the interpretation of the produced model. In addition to global variable subset selection, MARS is able to do input variable subset selection locally in different regions of the input variable space. 
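The multiresponse lack-of-fit criterion just described can be sketched as follows (a sketch under assumptions: the full criterion (12) is not reproduced in this excerpt, so the denominator below uses the standard generalized cross-validation form of Craven and Wahba [1979], with an effective parameter count C(M) supplied by the caller):

```python
def gcv(y, y_hat, C_M):
    # y, y_hat: m sequences of N observed / fitted responses; the numerator
    # is the squared error averaged over all m outputs and N observations
    m, N = len(y), len(y[0])
    sse = sum((y[k][i] - y_hat[k][i]) ** 2 for k in range(m) for i in range(N))
    return (sse / (m * N)) / (1.0 - C_M / N) ** 2

# two outputs (m = 2), two observations (N = 2), made-up values
print(gcv([[1.0, 2.0], [0.0, 1.0]], [[1.1, 1.9], [0.2, 1.0]], C_M=1.0))
```

Minimizing this quantity jointly over the internal network parameters and all of the weights (14) trades training error against model size through C(M).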
This is a consequence of the restricted support (nonzero value) of the basis functions produced. Thus, if in any local region the target function (2) depends on only a few of the inputs, MARS is able to use this to advantage even if the relevant inputs are different in different local regions. Also, MARS is able to produce approximations of low interaction order even if the number of selected inputs is large. \n\nRadial basis functions are not able to do local (or usually even global) input variable subset selection as a part of the procedure. All basis functions involve all of the inputs at the same relative strength everywhere in the input variable space. If the target function (2) is of this nature they will perform well, in that no competing procedure will do better, or likely even as well. If this is not the case, radial basis functions are not able to take advantage of the situation to improve accuracy. Also, radially symmetric basis functions produce approximations of the highest possible interaction order (everywhere in the input space). This results in a marked disadvantage if the target function tends to predominantly involve interactions among at most a few of the inputs (such as the additive functions (13)). \n\nStandard networks based on sigmoidal units of linear combinations of inputs share the properties described above for radial basis functions. Including \"weight elimination\" (Rumelhart [1988]) provides an (important) ability to do global (but not local) input variable subset selection. The principal differences between MARS and this approach center on the use of splines rather than sigmoids, and of products rather than linear combinations of the input variables. Splines tend to be more flexible in that two spline functions can closely approximate any sigmoid, whereas it can take many sigmoids to approximate some splines. 
MARS' use of product expansions enables it to produce approximations that are local in nature. Local approximations have the property that if the target function is badly behaved in any local region of the input space, the quality of the approximation is not affected in the other regions. Also, as noted above, MARS can produce approximations of low interaction order. This is difficult for approximations based on linear combinations. \n\nBoth radial basis functions and sigmoidal networks produce approximations that are difficult to interpret. Even in situations where they produce high accuracy, they provide little information concerning the nature of the target function. MARS approximations, on the other hand, can often provide considerable interpretable information. Interpreting MARS models is discussed in detail in Friedman [1991]. Finally, training MARS networks tends to be computationally much faster than other types of learning procedures. \n\nReferences \n\nBellman, R. E. (1961). Adaptive Control Processes. Princeton University Press, Princeton, NJ. \n\nBreiman, L., Friedman, J. H., Olshen, R. A. and Stone, C. J. (1984). Classification and Regression Trees. Wadsworth, Belmont, CA. \n\nCheney, E. W. (1986). Multivariate Approximation Theory: Selected Topics. Monograph: SIAM CBMS-NSF Regional Conference Series in Applied Mathematics, Vol. 51. \n\nChui, C. K. (1988). Multivariate Splines. Monograph: SIAM CBMS-NSF Regional Conference Series in Applied Mathematics, Vol. 54. \n\nCraven, P. and Wahba, G. (1979). Smoothing noisy data with spline functions. Estimating the correct degree of smoothing by the method of generalized cross-validation. Numerische Mathematik 31, 377-403. \n\nde Boor, C. (1978). A Practical Guide to Splines. Springer-Verlag, New York, NY. \n\nFriedman, J. H. (1991). Multivariate adaptive regression splines (with discussion). Annals of Statistics, March. 
\nRumelhart, D. E. (1988). Learning and generalization. IEEE International Conference on Neural Networks, San Diego, plenary address. \n\n[Figures 1-4] \n", "award": [], "sourceid": 408, "authors": [{"given_name": "Jerome", "family_name": "Friedman", "institution": null}]}