{"title": "Some Approximation Properties of Projection Pursuit Learning Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 936, "page_last": 943, "abstract": null, "full_text": "Some Approximation Properties of Projection Pursuit Learning Networks \n\nYing Zhao    Christopher G. Atkeson \n\nThe Artificial Intelligence Laboratory \nMassachusetts Institute of Technology \nCambridge, MA 02139 \n\nAbstract \n\nThis paper addresses an important question in machine learning: what kind of network architectures work better on what kind of problems? A projection pursuit learning network has a structure very similar to that of a one hidden layer sigmoidal neural network. A general method based on a continuous version of projection pursuit regression is developed to show that projection pursuit regression works better on angular smooth functions than on Laplacian smooth functions. There also exists a ridge function approximation scheme that avoids the curse of dimensionality for approximating functions in L2(φ_d). \n\n1  INTRODUCTION \n\nProjection pursuit is a nonparametric statistical technique for finding \"interesting\" low dimensional projections of high dimensional data sets. It has been used for nonparametric fitting and other data-analytic purposes (Friedman and Stuetzle, 1981; Huber, 1985). Its approximation properties have been studied by Diaconis and Shahshahani (1984) and Donoho and Johnstone (1989). It was first introduced into the context of learning networks by Barron and Barron (1988). A one hidden layer sigmoidal feedforward neural network approximates f(x) using the structure (Figure 1(a)): \n\ng(x) = Σ_{j=1}^{n} α_j σ(β_j θ_j^T x + δ_j)    (1) \n\n936 \n\n\fSome Approximation Properties of Projection Pursuit Learning Networks  937 \n\nFigure 1: (a) A One Hidden Layer Feedforward Neural Network. (b) A Projection Pursuit Learning Network. 
\n\nwhere σ is a sigmoidal function, the θ_j are direction parameters with ||θ_j|| = 1, and α_j, β_j, δ_j are function parameters. A projection pursuit learning network based on projection pursuit regression (PPR) (Friedman and Stuetzle, 1981) or on a ridge function approximation scheme (RA) has a very similar structure (Figure 1(b)): \n\ng(x) = Σ_{j=1}^{n} g_j(θ_j^T x)    (2) \n\nwhere the θ_j are again direction parameters with ||θ_j|| = 1. The corresponding function parameters are the ridge functions g_j, which may be any smooth functions to be learned from the data. Since σ is replaced by a more general smooth function g_j, projection pursuit learning networks can be viewed as a generalization of one hidden layer sigmoidal feedforward neural networks. This paper discusses some approximation properties of PPR: \n\n1. Projection pursuit learning networks work better on angular smooth functions than on Laplacian smooth functions. Here \"work better\" means that for a fixed complexity of the hidden unit functions and a given accuracy, fewer hidden units are required. For the two dimensional case (d = 2), Donoho and Johnstone (1989) show this result using equispaced directions. Equispaced directions may not be available when d > 2, so we instead use sets of directions generated from zeros of orthogonal polynomials, as well as uniformly distributed directions on the unit sphere. The analysis method in Donoho and Johnstone's paper is limited to two dimensions; we apply the theory of spherical harmonics (Muller, 1966) to develop a continuous ridge function representation of an arbitrary smooth function and then employ different numerical integration schemes to discretize it for the cases d > 2. \n\n2. The curse of dimensionality can be avoided when a proper ridge function approximation is applied. 
Once a continuous ridge function representation is established for any function in L2(φ_d), a Monte Carlo type of integration scheme can be applied which has an RMS error convergence rate O(N^{-1/2}), where N is the number of ridge functions in the linear combination. This is similar to Barron's result (Barron, 1991), except that we place fewer restrictions on the underlying function class. \n\n\f938  Zhao and Atkeson \n\nFigure 2: (a) A radial basis element J_{0j4}; (b) a harmonic basis element J_{8j0}. \n\n2  SMOOTHNESS CLASSES AND AN L2(φ_d) BASIS \n\nWe use L2(φ_d) as our underlying function space, with Gaussian measure φ_d = (2π)^{-d/2} e^{-||x||^2/2} and ||f||^2 = ∫_{R^d} f^2 φ_d dx. The smoothness classes characterize the rates of convergence. Let Δ_d be the Laplacian operator and Δ*_d be the Laplace-Beltrami operator (Muller, 1966). The smoothness classes can be defined as: \n\nDefinition 1  The function f ∈ L2(φ_d) will be said to have Cartesian smoothness of order p if it has p derivatives and these derivatives are all in L2(φ_d). It will be said to have angular smoothness of order q if (Δ*_d)^{q/2} f ∈ L2(φ_d). It will be said to have Laplacian smoothness of order r if Δ_d^{r/2} f ∈ L2(φ_d). Let F_p be the class of functions with Cartesian smoothness p, A_p^q be the class of functions with Cartesian smoothness p and angular smoothness q, and C_p^r be the class of functions with Cartesian smoothness p and Laplacian smoothness r. \n\nWe derive an orthogonal basis in L2(φ_d) from the eigenfunctions of a self-adjoint operator. The basis element is defined as: \n\nJ_{njm}(x) = γ_m r^n L_m^{(α)}(r^2/2) S_{nj}(ξ)    (3) \n\nwhere x = rξ, n = 0, ..., ∞, m = 0, ..., ∞, j = 1, ..., N(d, n), γ_m = (-2)^m m!, and α = n + (d-2)/2. The S_{nj}(ξ) are linearly independent spherical harmonics of degree n in d dimensions (Muller, 1966), and L_m^{(α)} is a Laguerre polynomial. The advantage of the basis comes from its representation as a product of a spherical harmonic and a radial polynomial. 
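As a one dimensional analogue of this construction, orthogonality with respect to a Gaussian weight can be checked numerically with Hermite polynomials (a sketch only; the basis J_{njm} additionally involves spherical harmonics and Laguerre polynomials, and the degree 40 quadrature rule is an arbitrary choice):

```python
import numpy as np

# One dimensional analogue of the orthogonality of the basis: Hermite
# polynomials are orthogonal with respect to the Gaussian weight exp(-x^2).
# Checked with Gauss-Hermite quadrature; the degree 40 rule integrates
# these low-degree products exactly.

nodes, weights = np.polynomial.hermite.hermgauss(40)

def inner(p, q):
    """Inner product <p, q> with respect to the weight exp(-x^2)."""
    return np.sum(weights * p(nodes) * q(nodes))

H2 = np.polynomial.hermite.Hermite([0, 0, 1])     # H_2
H3 = np.polynomial.hermite.Hermite([0, 0, 0, 1])  # H_3

print(abs(inner(H2, H3)) < 1e-8)  # True: distinct degrees are orthogonal
print(inner(H2, H2) > 0.0)        # True: each element has positive norm
```

Distinct degrees integrate to (numerically) zero against the Gaussian weight, the one dimensional counterpart of the orthogonality of the J_{njm} under the Gaussian measure.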
Specifically, J_{0jm} is a radial polynomial for n = 0 and J_{nj0} is a harmonic polynomial for m = 0. Figure 2(a),(b) show a radial basis element and a harmonic basis element when n + 2m = 8. The basis elements J_{njm} satisfy an orthogonality relation: \n\nE[J_{njm} J_{n'j'm'}] = ||J_{njm}||^2 δ_{nn'} δ_{jj'} δ_{mm'}    (4) \n\nwhere E denotes expectation with respect to φ_d. Since the J_{njm} form a basis in L2(φ_d), any function f ∈ L2(φ_d) has an expansion in terms of the basis elements: \n\nf = Σ_{n,j,m} c_{njm} J_{njm}    (5) \n\nThe circular harmonic e^{inθ} is a special case of the spherical harmonic S_{nj}(ξ). In two dimensions, d = 2, N(d, n) = 2 and x = (r cos θ, r sin θ). The spherical harmonic S_{nj}(ξ) reduces to S_{n1}(ξ) = (1/√π) cos nθ, S_{n2}(ξ) = (1/√π) sin nθ, which are the circular harmonics. \n\nSmoothness classes can also be characterized qualitatively from the expansions of functions in terms of the basis elements J_{njm}. Since ||f||^2 = Σ c_{njm}^2 ||J_{njm}||^2, one can think of P_{njm}(f) = c_{njm}^2 ||J_{njm}||^2 / ||f||^2 as representing the distribution of energy in f among the different modes of oscillation J_{njm}. If f is Cartesian smooth, P_{njm}(f) peaks around small n + 2m. If f is angular smooth, P_{njm}(f) peaks around small n. If f is Laplacian smooth, P_{njm}(f) peaks around small m. To explain why PPR works better on angular smooth functions than on Laplacian smooth functions, we first examine how to represent these L2(φ_d) basis elements systematically in terms of ridge functions, and then use the expansion (5) to derive an error bound of RA for general smooth functions. \n\n3  CONTINUOUS RIDGE FUNCTION SCHEMES \n\nThere exists a continuous ridge function representation for any function f(x) ∈ L2(φ_d): an integral of ridge functions over all possible directions, \n\nf(x) = ∫_{Ω_d} g(x^T η, η) dω_d(η).    (6) \n\nThis works intuitively because any object is determined by an infinite set of radiographs. 
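The finite sum of ridge functions that such an integral representation discretizes into, i.e., the PPR structure (2), can be sketched as follows (a minimal illustration; the directions, weights, and ridge functions are hand-picked for the example rather than learned from data):

```python
import numpy as np

# Minimal sketch of the PPR structure (2): f_n(x) = sum_j w_j * g_j(theta_j^T x).
# The directions, weights, and ridge functions below are hand-picked
# placeholders, not quantities fitted from data.

def ridge_approximation(x, thetas, ws, gs):
    """Evaluate a finite sum of ridge functions at the rows of x (shape (N, d))."""
    projections = x @ thetas.T  # one scalar projection per direction
    return sum(w * g(projections[:, j]) for j, (w, g) in enumerate(zip(ws, gs)))

# Example: x^2 + y^2 is exactly representable with two ridge directions.
thetas = np.array([[1.0, 0.0], [0.0, 1.0]])  # unit direction vectors theta_j
ws = [1.0, 1.0]                              # combination weights w_j
gs = [np.square, np.square]                  # ridge functions g_j(t) = t^2

x = np.array([[1.0, 2.0], [3.0, -1.0]])
print(ridge_approximation(x, thetas, ws, gs))  # -> [ 5. 10.]
```

In PPR proper, both the directions theta_j and the one dimensional functions g_j would be estimated from the data.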
More precisely, any function f(x) ∈ L2(φ_d) can be approximated arbitrarily well by a linear combination of ridge functions Σ_k g(x^T η_k, η_k), provided infinitely many combination units are allowed (Jones, 1987). As k → ∞, we obtain (6). The natural discrete approximation to (6) has the form f_n(x) = Σ_{j=1}^{n} w_j g(x^T η_j, η_j), which is the usual PPR (2). We proved a continuous ridge function representation of the basis elements J_{njm}, shown in Lemma 1. \n\nLemma 1  The continuous ridge function representation of J_{njm} is: \n\nJ_{njm}(x) = A_{nmd} ∫_{Ω_d} H_{n+2m}(η^T x) S_{nj}(η) dω_d(η)    (7) \n\nwhere A_{nmd} is a constant and H_{n+2m}(t) is a Hermite polynomial. \n\nTherefore any function f ∈ L2(φ_d) has a continuous ridge function representation (6) with \n\ng(x^T η, η) = Σ_{n,j,m} c_{njm} A_{nmd} H_{n+2m}(x^T η) S_{nj}(η)    (8) \n\nGaussian quadrature and Monte Carlo integration schemes can be used to discretize (6). \n\n4  GAUSSIAN QUADRATURE \n\nSince ∫_{Ω_d} g(x^T η, η) dω_d(η) = ∫_{Ω_{d-1}} ∫_{-1}^{1} g(x^T η, η)(1 - t_{d-1}^2)^{(d-3)/2} dt_{d-1} dω_{d-1}(η_{d-1}), simple product rules using Gaussian quadrature formulae can be used here. Let t_{ij}, i = d-1, ..., 1, j = 1, ..., n, be zeros of orthogonal polynomials with weight (1 - t^2)^{(i-2)/2}. Then N = n^{d-1} points on the unit sphere Ω_d can be formed from the t_{ij} through \n\nη = (t_{d-1}, √(1 - t_{d-1}^2) t_{d-2}, ..., √(1 - t_{d-1}^2) ··· √(1 - t_1^2))^T    (9) \n\nIf g(x^T η, η) is a polynomial of degree at most 2n - 1 (in terms of t_1, ..., t_{d-1}), then N = n^{d-1} points (directions) are sufficient to represent the integral exactly. This can be demonstrated with two examples, taking d = 3. \n\n\f940  Zhao and Atkeson \n\nFigure 3: Directions (a) for a radial polynomial; (b) for a harmonic polynomial. \n\nExample 1: a radial function \n\nx^4 + y^4 + z^4 + 2x^2y^2 + 2x^2z^2 + 2y^2z^2 = c_1 ∫_{Ω_3} (x^T η)^4 dω_3(η)    (10) \n\nHere d = 3 and n = 3, so n^2 = 9 directions from (9) are sufficient to represent this polynomial, with t_2 = 0, ±√(3/5) (zeros of a degree 3 Legendre polynomial) and t_1 = 0, ±√3/2 (zeros of a degree 3 Tschebyscheff polynomial). More directions are needed to represent a harmonic function with exactly the same number of monomial terms but with different coefficients. \n\nExample 2: a harmonic function \n\n(1/8)(8x^4 + 3y^4 + 3z^4 - 24x^2y^2 - 24x^2z^2 + 6y^2z^2) = c_2 ∫_{Ω_3} (x^T η)^4 S_{41}(η) dω_3(η)    (11) \n\nwhere S_{41}(η) = (1/8)(35t^4 - 30t^2 + 3) and t is the first component of η. Here n = 5, so n^2 = 25 directions from (9) are sufficient to represent the polynomial, with t_2 = 0, ±0.90618, ±0.53847 (zeros of a degree 5 Legendre polynomial) and t_1 = cos((2j - 1)π/10), j = 1, ..., 5 (zeros of a degree 5 Tschebyscheff polynomial). The distribution of these directions on the unit sphere is shown in Figure 3(a) and (b). \n\nIn general, N = (n + m + 1)^{d-1} directions are sufficient to represent J_{njm} exactly using zeros of orthogonal polynomials. If p = n + 2m (the degree of the basis element) is fixed, N = (p - m + 1)^{d-1} is minimized at (p/2 + 1)^{d-1} when n = 0, which corresponds to the radial basis element; N is maximized when m = 0, which is the harmonic element. Using the definitions of the smoothness classes in Section 2, we can show that ridge function approximation works better on angular smooth functions. The basic result is as follows: \n\nTheorem 1  If f ∈ A_p^q, let R_N f denote a sum of ridge functions which best approximates f using a set of directions generated by zeros of orthogonal polynomials. Then \n\n(12) \n\nThis error bound says that ridge function approximation does take advantage of angular smoothness. Radial functions are the most angular smooth functions, with q = +∞, and harmonic functions are the least angular smooth functions when the Cartesian smoothness p is fixed. Therefore ridge function approximation works better on angular smooth functions than on Laplacian smooth functions. 
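For d = 3, direction sets of this kind can be generated as in the examples above (a sketch under one plausible reading of (9), pairing zeros of a Legendre polynomial for t_2 with zeros of a Tschebyscheff polynomial for t_1 as in Example 1; it only checks that the construction yields unit vectors, not the exactness of the quadrature):

```python
import numpy as np

# Sketch for d = 3: build n^(d-1) = 9 directions on the unit sphere from
# zeros of orthogonal polynomials, one plausible reading of (9). Legendre
# zeros supply t2 and Tschebyscheff (Chebyshev) zeros supply t1, as in
# Example 1; the exactness claims of the text are not verified here.

n = 3
t2 = np.polynomial.legendre.legroots([0, 0, 0, 1])            # zeros of P_3: 0, +-sqrt(3/5)
t1 = np.cos((2 * np.arange(1, n + 1) - 1) * np.pi / (2 * n))  # Chebyshev zeros: +-sqrt(3)/2, 0

dirs = np.array([(a, np.sqrt(1 - a**2) * b, np.sqrt(1 - a**2) * np.sqrt(1 - b**2))
                 for a in t2 for b in t1])

print(dirs.shape)                                      # (9, 3)
print(np.allclose(np.linalg.norm(dirs, axis=1), 1.0))  # True: all are unit vectors
```

Each direction has norm a^2 + (1 - a^2)b^2 + (1 - a^2)(1 - b^2) = 1 by construction, so the points all lie on the unit sphere.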
Radial and harmonic functions are the two extreme cases. \n\n5  UNIFORMLY DISTRIBUTED DIRECTIONS ON Ω_d \n\nInstead of using directions from zeros of orthogonal polynomials, N uniformly distributed directions on Ω_d are an alternative way of generalizing equispaced directions. This is a Monte Carlo type of integration scheme on Ω_d. To approximate the integral (7), N uniformly distributed directions η_1, η_2, ..., η_N on Ω_d, drawn from the density f(η) = 1/ω_d on Ω_d, are used: \n\nĴ_{njm}(x) = (ω_d/N) A_{nmd} Σ_{k=1}^{N} H_{n+2m}(x^T η_k) S_{nj}(η_k)    (13) \n\nE[Ĵ_{njm}(x)] = J_{njm}(x)    (14) \n\nThe variance is \n\nσ_N^2(x) = σ^2(x)/N    (15) \n\nwhere σ^2(x) = ∫_{Ω_d} [A_{nmd} ω_d H_{n+2m}(x^T η) S_{nj}(η) - J_{njm}(x)]^2 (1/ω_d) dω_d(η). Therefore Ĵ_{njm}(x) approximates J_{njm}(x) with a rate σ(x)/√N. The difference between a radial basis element and a harmonic basis element lies in σ(x). Averaging σ(x) with respect to φ_d gives \n\n||σ(x)||^2 = ||J_{njm}||^2 a_{njm}    (16) \n\nwhere a_{njm} is an explicit constant. For fixed n + 2m = p, ||σ(x)||^2 is minimized at n = 0 (a radial element) and maximized at m = 0 (a harmonic element). \n\nThe same justification can be carried out for a general function f ∈ L2(φ_d) with (6) and (8). An RA scheme is f̂_N(x) = (ω_d/N) Σ_{k=1}^{N} g(x^T η_k, η_k), with m_N(x) = f(x), σ_N^2(x) = σ^2(x)/N, σ^2(x) = ∫_{Ω_d} ω_d g^2(x^T η, η) dω_d(η) - f^2(x), and ||σ(x)||^2 = Σ c_{njm}^2 ||J_{njm}||^2 a_{njm}. Since a_{njm} is small when n is small and large when m is small, and recalling the energy distribution P_{njm}(f) of Section 2, ||σ||^2/||f||^2 is small when f is smooth. Among such smooth functions, if f is angular smooth, ||σ||^2/||f||^2 is smaller than if f is Laplacian smooth. The RMS error convergence rate ||f - f̂_N|| = ||σ||/√N is consequently smaller for f angular smooth than for f Laplacian smooth. But both rates are O(N^{-1/2}), no matter what class the underlying function belongs to. 
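The O(N^{-1/2}) behaviour of such spherical Monte Carlo averages can be illustrated numerically (a toy sketch: uniform directions are drawn as normalized Gaussian vectors, and the integrand (x^T η)^2, whose spherical average is ||x||^2/d, stands in for the ridge integrands above; none of these choices come from the paper):

```python
import numpy as np

# Toy Monte Carlo on the sphere: uniform directions eta (normalized Gaussian
# vectors) approximate the spherical average of (x^T eta)^2, which equals
# ||x||^2 / d. The integrand and the point x are illustrative choices only.

rng = np.random.default_rng(0)
d = 3
x = np.array([1.0, 2.0, 2.0])  # ||x||^2 = 9, so the true average is 3.0

def mc_estimate(N):
    g = rng.normal(size=(N, d))
    etas = g / np.linalg.norm(g, axis=1, keepdims=True)  # uniform on the sphere
    return np.mean((etas @ x) ** 2)

# The absolute error shrinks roughly like N^(-1/2) as N grows.
for N in (100, 10_000, 1_000_000):
    print(N, abs(mc_estimate(N) - 3.0))
```

The constant in front of N^{-1/2} is the standard deviation of the integrand over directions, which plays the role of ||σ|| in the text.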
The difference is the constant, which is related to the distribution of energy in f among the different modes of oscillation (angular or Laplacian). The radial and harmonic functions are the two extremes. \n\n6  THE CURSE OF DIMENSIONALITY IN RA \n\nMore generally, let N directions η_1, η_2, ..., η_N on Ω_d be drawn from any density p(η) on the sphere Ω_d to approximate (6): \n\nf̂_N(x) = (1/N) Σ_{k=1}^{N} g(x^T η_k, η_k)/p(η_k)    (17) \n\nwith m_N(x) = f(x) and σ_N^2(x) = σ^2(x)/N, where σ^2(x) = ∫_{Ω_d} (1/p(η)) g^2(x^T η, η) dω_d(η) - f^2(x). Averaging over φ_d, \n\n||σ(x)||^2 = ∫_{Ω_d} (1/p(η)) ||g(·^T η, η)||^2 dω_d(η) - ||f||^2 = c_f    (18) \n\nThen \n\n||f - f̂_N|| = √(c_f/N)    (19) \n\nThat is, f̂_N(x) → f(x) with a rate O(N^{-1/2}). Equation (19) shows that there is no curse of dimensionality if a ridge function approximation scheme (17) is used for f(x). The same conclusion can be drawn when sigmoidal hidden unit neural networks are applied to Barron's class of underlying functions (Barron, 1991). But our function class here is the class of functions that can be represented by a continuous ridge function (6), which is a much larger class than Barron's. Any function f ∈ L2(φ_d) has a representation (6) (Section 3). Therefore, for any function f ∈ L2(φ_d), there exists a node function g(x^T η, η) and a related ridge function approximation scheme (17) that approximate f(x) with a rate O(N^{-1/2}), which has no curse of dimensionality. In other words, if we are allowed to choose a node function g(x^T η, η) according to the properties of the data, which is the characteristic of PPR, then a ridge function approximation scheme can avoid the curse of dimensionality. This generalizes Barron's result that the curse of dimensionality goes away when certain types of node functions (e.g., cos and σ) are considered. The smoothness of the underlying function determines the size of the constant c_f. 
As shown in the previous section, if p(η) = 1/ω_d (i.e., uniformly distributed directions), then angular smooth functions have smaller c_f than Laplacian smooth functions do. Choosing a different p(η) does not change this conclusion, but a properly chosen p(η) reduces c_f in general. If f(x) is smooth enough, the node function g(x^T η, η) can be computed from the Radon transform R_η f of f in the direction η, defined as \n\nR_η f(t) = ∫_{x^T η = t} f(x) dx    (20) \n\nand we proved g(x^T η, η) = F^{-1}(F_η(t)|t|^{d-1})|_{t = x^T η}, where F_η(t) is the Fourier transform of R_η f(·) and F^{-1} denotes the inverse Fourier transform. In practice, learning g(x^T η, η) is usually replaced by a smoothing step which seeks the one dimensional function of x^T η that best fits the residual in this direction (Friedman and Stuetzle, 1981; Zhao and Atkeson, 1991). \n\n7  CONCLUSION \n\nAs we showed by discretizing a continuous ridge function representation, PPR works better on angular smooth functions than on Laplacian smooth functions. PPR can avoid the curse of dimensionality by learning node functions from data. \n\nAcknowledgments \n\nSupport was provided under Air Force Office of Scientific Research grant AFOSR-89-0500. Support for CGA was provided by a National Science Foundation Presidential Young Investigator Award, an Alfred P. Sloan Research Fellowship, and the W. M. Keck Foundation Associate Professorship in Biomedical Engineering. Special thanks go to Prof. Zhengfang Zhou and Prof. Peter Huber of the Mathematics Department at MIT, who provided useful discussions. \n\nReferences \n\nBarron, A. R. and Barron, R. L. (1988) \"Statistical Learning Networks: A Unifying View\". Computing Science and Statistics: Proceedings of the 20th Symposium on the Interface, E. Wegman, editor, Amer. Statist. Assoc., Washington, D.C., 192-203. \nBarron, A. R. 
(1991) \"Universal Approximation Bounds for Superpositions of a Sigmoidal Function\". Technical Report 58, Dept. of Statistics, Univ. of Illinois at Urbana-Champaign. \nDonoho, D. L. and Johnstone, I. (1989) \"Projection-based Approximation and a Duality with Kernel Methods\". Ann. Statist., 17, 58-106. \nDiaconis, P. and Shahshahani, M. (1984) \"On Nonlinear Functions of Linear Combinations\". SIAM J. Sci. Stat. Comput., 5, 175-191. \nFriedman, J. H. and Stuetzle, W. (1981) \"Projection Pursuit Regression\". J. Amer. Statist. Assoc., 76, 817-823. \nHuber, P. J. (1985) \"Projection Pursuit (with discussion)\". Ann. Statist., 13, 435-475. \nJones, L. (1987) \"On a Conjecture of Huber Concerning the Convergence of Projection Pursuit Regression\". Ann. Statist., 15, 880-882. \nMuller, C. (1966) Spherical Harmonics. Lecture Notes in Mathematics, no. 17, Springer. \nZhao, Y. and Atkeson, C. G. (1991) \"Projection Pursuit Learning\". Proc. IJCNN-91-Seattle. \n", "award": [], "sourceid": 493, "authors": [{"given_name": "Ying", "family_name": "Zhao", "institution": null}, {"given_name": "Christopher", "family_name": "Atkeson", "institution": null}]}