{"title": "A Polygonal Line Algorithm for Constructing Principal Curves", "book": "Advances in Neural Information Processing Systems", "page_first": 501, "page_last": 507, "abstract": null, "full_text": "A Polygonal Line Algorithm for Constructing \n\nPrincipal Curves \n\nBalazs Kegl, Adam Krzyzak \nDept. of Computer Science \n\nConcordia University \n\n1450 de Maisonneuve Blvd. W. \nMontreal, Canada H3G IM8 \n\nkegl@cs.concordia.ca \n\nkrzyzak@cs.concordia.ca \n\nTamas Linder \n\nDept. of Mathematics \n\nand Statistics \n\nQueen's University \nKingston, Ontario \nCanada K7L 3N6 \n\nlinder@mast.queensu.ca \n\nKenneth Zeger \n\nDept. of Electrical and \nComputer Engineering \nUniversity of California \n\nSan Diego, La Jolla \n\nCA 92093-0407 \nzeger@ucsd.edu \n\nAbstract \n\nPrincipal curves have  been  defined  as  \"self consistent\" smooth  curves \nwhich pass through the \"middle\" of a d-dimensional probability distri(cid:173)\nbution or data cloud.  Recently, we  [1]  have offered a new approach by \ndefining principal curves as continuous curves of a  given length which \nminimize the expected squared distance between the curve and points of \nthe space randomly chosen according to  a given distribution.  The new \ndefinition made it possible to carry out a theoretical analysis of learning \nprincipal curves from training data.  In this paper we propose a practical \nconstruction based on the new definition.  Simulation results demonstrate \nthat the new algorithm compares favorably with previous methods both \nin terms of performance and computational complexity. \n\n1  Introduction \n\nHastie [2] and Hastie and Stuetzle [3] (hereafter HS) generalized the self consistency prop(cid:173)\nerty  of principal components and introduced the  notion of principal curves.  Consider a \nd-dimensional  random vector X =  (X(I), ... ,X(d))  with  finite  second  moments,  and  let \nf{t) = (II (t), ... ,!d(t)) be a smooth curve in 1{,d  parameterized by t E 1{,. For any x E 1{,d \nlet tf(x)  denote the parameter value t for which the  distance between x  and f(t)  is  mini(cid:173)\nmized.  By the  HS  definition, f(t)  is a  principal curve if it does not intersect itself and is \nself consistent, that is, f(t) = E(Xltr(X) = t).  Intuitively speaking, self-consistency means \nthat each point of f is the average (under the distribution of X) of points that project there. \nBased  on their  defining  property  HS  developed  an  algorithm  for  constructing  principal \ncurves for distributions or data sets,  and described an  application in the  Stanford Linear \nCollider Project [3]. \n\n\f502 \n\nB.  Keg/,  A.  Krzyiak,  T.  Linder and K.  Zeger \n\nPrincipal curves have been applied by Banfield and Raftery  [4]  to identify the outlines of \nice floes in satellite images. Their method of clustering about principal curves led to a fully \nautomatic method for identifying ice floes and their outlines.  On the theoretical side, Tib(cid:173)\nshirani [5] introduced a semiparametric model for principal curves and proposed a method \nfor estimating principal curves using the EM algorithm.  Recently, Delicado [6]  proposed \nyet another definition based on a property of the  first principal components of multivari(cid:173)\nate normal distributions.  Close connections between principal curves and Kohonen's self(cid:173)\norganizing maps were pointed out by Mulier and Cherkas sky  [7].  Self-organizing maps \nwere also used by Der et al.  [8] for constructing principal curves. 
\n\nThere is an unsatisfactory aspect of the definition of principal curves in the original HS paper as well as in subsequent works. Although principal curves have been defined to be nonparametric, their existence for a given distribution or probability density is an open question, except for very special cases such as elliptical distributions. This also makes it difficult to theoretically analyze any learning scheme for principal curves.\n\nRecently, we [1] have proposed a new definition of principal curves which resolves this problem. In the new definition, a curve f* is called a principal curve of length L for X if f* minimizes Δ(f) = E[inf_t ||X − f(t)||^2] = E||X − f(t_f(X))||^2, the expected squared distance between X and the curve, over all curves of length less than or equal to L. It was proved in [1] that for any X with finite second moments there always exists a principal curve in the new sense.\n\nA theoretical algorithm has also been developed to estimate principal curves based on a common model in statistical learning theory (e.g., see [9]). Suppose that the distribution of X is concentrated on a closed and bounded convex set K ⊂ R^d, and we are given n training points X_1, ..., X_n drawn independently from the distribution of X. Let S denote the family of curves taking values in K and having length not greater than L. For k ≥ 1 let S_k be the set of polygonal (piecewise linear) curves in K which have k segments and whose lengths do not exceed L. Let\n\nΔ(x, f) = min_t ||x − f(t)||^2 (1)\n\ndenote the squared distance between x and f. For any f ∈ S the empirical squared error of f on the training data is the sample average Δ_n(f) = (1/n) Σ_{i=1}^n Δ(X_i, f). Let the theoretical algorithm choose an f_{k,n} ∈ S_k which minimizes the empirical error, i.e., let f_{k,n} = argmin_{f ∈ S_k} Δ_n(f). It was shown in [1] that if k is chosen to be proportional to n^{1/3}, then the expected squared loss of the empirically optimal polygonal curve with k segments and length at most L converges, as n → ∞, to the squared loss of the principal curve of length L at a rate Δ(f_{k,n}) − Δ(f*) = O(n^{−1/3}).\n\nAlthough amenable to theoretical analysis, the algorithm in [1] is computationally burdensome to implement. In this paper we develop a suboptimal algorithm for learning principal curves. This practical algorithm produces polygonal curve approximations to the principal curve just as the theoretical method does, but global optimization is replaced by a less complex iterative descent method. We give simulation results and compare our algorithm with previous work. In general, on examples considered by HS the performance of the new algorithm is comparable with that of the HS algorithm, while it proves to be more robust to changes in the data generating model.\n\n2 A Polygonal Line Algorithm\n\nGiven a set of data points X_n = {x_1, ..., x_n} ⊂ R^d, the task of finding the polygonal curve with k segments and length L which minimizes (1/n) Σ_{i=1}^n Δ(x_i, f) is computationally difficult. We propose a suboptimal method with reasonable complexity. The basic idea is to start with a straight line segment f_{1,n} (k = 1) and in each iteration of the algorithm to increase the number of segments by one by adding a new vertex to the polygonal curve f_{k,n} produced by the previous iteration. After adding a new vertex, the positions of all vertices are updated in an inner loop.
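\n\nTo make the outer loop concrete, the following minimal sketch (Python with NumPy) illustrates one way the description above could be organized. The helper names optimize_vertices, mean_squared_distance, and add_vertex are placeholders for the steps detailed in the remainder of this section, not part of any published implementation, and lam_k stands for the stopping-condition coefficient λ_k introduced below.\n\nimport numpy as np\n\ndef first_pc_segment(X):\n    # Initialization: the shortest segment of the first principal\n    # component line that contains all of the projected data points.\n    c = X.mean(axis=0)\n    direction = np.linalg.svd(X - c)[2][0]\n    t = (X - c) @ direction\n    return np.vstack((c + t.min() * direction, c + t.max() * direction))\n\ndef polygonal_line(X, lam_k=0.3):\n    # Outer loop: optimize the vertices, test the stopping rule,\n    # and otherwise add a new vertex.\n    n = len(X)\n    r = 0.5 * max(np.linalg.norm(x - y) for x in X for y in X)\n    V = first_pc_segment(X)  # vertex array; k = len(V) - 1 segments\n    while True:\n        V = optimize_vertices(X, V)          # inner loop (projection + optimization)\n        delta = mean_squared_distance(X, V)  # empirical error of f_{k,n}\n        if len(V) - 1 > lam_k * n ** (1 / 3) * r / np.sqrt(delta):\n            return V\n        V = add_vertex(X, V)                 # split the busiest segment\n\nThe test in the loop implements the rule k > λ_k n^{1/3} Δ_n(f_{k,n})^{−1/2} r given under STOPPING CONDITION below.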
\n\nFigure 1: The curves f_{k,n} produced by the polygonal line algorithm for n = 100 data points. The data was generated by adding independent Gaussian errors to both coordinates of a point chosen randomly on a half circle. (a) f_{1,n}, (b) f_{2,n}, (c) f_{4,n}, (d) f_{k,n} (the output of the algorithm).\n\nFigure 2: The flow chart of the polygonal line algorithm.\n\nThe inner loop consists of a projection step and an optimization step. In the projection step the data points are partitioned into \"Voronoi regions\" according to the segment or vertex onto which they project. In the optimization step the new position of each vertex is determined by minimizing an average squared distance criterion penalized by a measure of the local curvature. These two steps are iterated until convergence is achieved and f_{k,n} is produced. Then a new vertex is added.\n\nThe algorithm stops when k exceeds a threshold c(n, Δ). This stopping criterion is based on a heuristic complexity measure, determined by the number of segments k, the number of data points n, and the average squared distance Δ_n(f_{k,n}).\n\nTHE INITIALIZATION STEP. To obtain f_{1,n}, take the shortest segment of the first principal component line which contains all of the projected data points.\n\nTHE PROJECTION STEP. Let f denote a polygonal curve with vertices v_1, ..., v_{k+1} and closed line segments s_1, ..., s_k, such that s_i connects vertices v_i and v_{i+1}. In this step the data set X_n is partitioned into (at most) 2k + 1 disjoint sets V_1, ..., V_{k+1} and S_1, ..., S_k, the Voronoi regions of the vertices and segments of f, in the following manner. For any x ∈ R^d let Δ(x, s_i) be the squared distance from x to s_i (see definition (1)), and let Δ(x, v_i) = ||x − v_i||^2. Then let\n\nV_i = {x ∈ X_n : Δ(x, v_i) = Δ(x, f), Δ(x, v_i) < Δ(x, v_m), m = 1, ..., i − 1}.\n\nUpon setting V = ∪_{i=1}^{k+1} V_i, the S_i sets are defined by\n\nS_i = {x ∈ X_n : x ∉ V, Δ(x, s_i) = Δ(x, f), Δ(x, s_i) < Δ(x, s_m), m = 1, ..., i − 1}.\n\nThe resulting partition is illustrated in Figure 3.\n\nFigure 3: The Voronoi partition induced by the vertices and segments of f.
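\n\nAs an illustration, the projection step can be implemented directly from these definitions. In the following Python/NumPy sketch the function names are chosen here for exposition. A data point always has a segment at least as close as its nearest vertex, so Δ(x, v_i) = Δ(x, f) holds exactly when x projects onto a vertex; the sketch therefore resolves ties in favor of the vertices, and np.argmin returns the smallest index on further ties, matching the m = 1, ..., i − 1 clauses.\n\nimport numpy as np\n\ndef seg_sq_dist(x, a, b):\n    # Squared distance from point x to the closed segment [a, b].\n    d = b - a\n    t = np.clip(np.dot(x - a, d) / np.dot(d, d), 0.0, 1.0)\n    p = a + t * d\n    return np.dot(x - p, x - p)\n\ndef project_partition(X, V):\n    # Partition the data among the k+1 vertices and k segments of the\n    # polygonal curve with vertex array V (the sets V_i and S_i).\n    k = len(V) - 1\n    vert_sets = [[] for _ in range(k + 1)]\n    seg_sets = [[] for _ in range(k)]\n    for x in X:\n        dv = [np.dot(x - v, x - v) for v in V]\n        ds = [seg_sq_dist(x, V[i], V[i + 1]) for i in range(k)]\n        iv, js = int(np.argmin(dv)), int(np.argmin(ds))\n        if dv[iv] <= ds[js]:  # x projects onto a vertex\n            vert_sets[iv].append(x)\n        else:  # x projects onto the interior of a segment\n            seg_sets[js].append(x)\n    return vert_sets, seg_sets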
\n\nTHE VERTEX OPTIMIZATION STEP. In this step we iterate over the vertices, and relocate each vertex while all the others are kept fixed. For each vertex, we minimize Δ_n(v_i) + λ_p P(v_i), a local average squared distance criterion penalized by a measure of the local curvature, by using a gradient (steepest descent) method.\n\nThe local measure of the average squared distance is calculated from the data points which project to v_i or to the line segment(s) starting at v_i (see THE PROJECTION STEP). Accordingly, let σ_+(v_i) = Σ_{x ∈ S_i} Δ(x, s_i), σ_−(v_i) = Σ_{x ∈ S_{i−1}} Δ(x, s_{i−1}), and ν(v_i) = Σ_{x ∈ V_i} Δ(x, v_i). Now define the local average squared distance as a function of v_i by\n\nΔ_n(v_i) = (ν(v_i) + σ_+(v_i)) / (|V_i| + |S_i|) if i = 1,\nΔ_n(v_i) = (σ_−(v_i) + ν(v_i) + σ_+(v_i)) / (|S_{i−1}| + |V_i| + |S_i|) if 1 < i < k + 1,\nΔ_n(v_i) = (σ_−(v_i) + ν(v_i)) / (|S_{i−1}| + |V_i|) if i = k + 1. (2)\n\nIn the theoretical algorithm the average squared distance Δ_n(f) is minimized subject to the constraint that f is a polygonal curve with k segments and length not exceeding L. One could use a Lagrangian formulation and attempt to find a new position for v_i (while all other vertices are fixed) such that the penalized squared error Δ_n(f) + λ l(f)^2 is minimized, where l(f) denotes the length of f. However, we have observed that this approach is very sensitive to the choice of λ, and reproduces the estimation bias of the HS algorithm, which flattens the curve in areas of high curvature. So, instead of directly penalizing the lengths of the line segments, we chose to penalize sharp angles to obtain a smooth curve solution. Nonetheless, note that if only one vertex is moved at a time, penalizing sharp angles will indirectly penalize long line segments. At inner vertices v_i, 3 ≤ i ≤ k − 1, we penalize the sum of the cosines of the three angles at vertices v_{i−1}, v_i, and v_{i+1}. The cosine function was picked because of its regular behavior around π, which makes it especially suitable for the steepest descent algorithm. To make the algorithm invariant under scaling, we multiply the cosines by the squared radius of the data, that is, r = (1/2) max_{x,y ∈ X_n} ||x − y||. At the endpoints and at their immediate neighbors (v_i, i = 1, 2, k, k + 1), where penalizing sharp angles does not translate to penalizing long line segments, the penalty on a nonexistent angle is replaced by a direct penalty on the squared length of the first (or last) segment. Formally, let γ_i denote the angle at vertex v_i, let π(v_i) = r^2(1 + cos γ_i), let μ_+(v_i) = ||v_i − v_{i+1}||^2, and let μ_−(v_i) = ||v_i − v_{i−1}||^2. Then the penalty at vertex v_i is\n\nP(v_i) = 2μ_+(v_i) + π(v_{i+1}) if i = 1,\nP(v_i) = μ_−(v_i) + π(v_i) + π(v_{i+1}) if i = 2,\nP(v_i) = π(v_{i−1}) + π(v_i) + π(v_{i+1}) if 3 ≤ i ≤ k − 1,\nP(v_i) = π(v_{i−1}) + π(v_i) + μ_+(v_i) if i = k,\nP(v_i) = π(v_{i−1}) + 2μ_−(v_i) if i = k + 1.
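\n\nFor illustration, the penalized criterion for an inner vertex can be written as follows (a Python/NumPy sketch of the formulas above; it reuses the hypothetical seg_sq_dist helper from the projection sketch, lam_p stands for λ_p, whose data-dependent choice is discussed in the next paragraph, and a steepest descent step over v would complete the optimization):\n\nimport numpy as np\n\ndef cos_penalty(v_prev, v, v_next, r):\n    # pi(v) = r^2 (1 + cos gamma), gamma being the angle at v; the\n    # penalty is near 0 when the curve is locally flat (gamma near pi).\n    a, b = v_prev - v, v_next - v\n    c = np.dot(a, b) / (np.linalg.norm(a) * np.linalg.norm(b))\n    return r * r * (1.0 + c)\n\ndef inner_vertex_criterion(v, i, V, Vi, Sprev, Snext, lam_p, r):\n    # Delta_n(v_i) + lam_p * P(v_i) for an inner vertex (3 <= i <= k - 1\n    # in the paper's indexing), with the sets V_i, S_{i-1}, and S_i held\n    # fixed from the last projection step; v is the trial position.\n    W = V.copy()\n    W[i] = v\n    sq = sum(np.dot(x - v, x - v) for x in Vi)\n    sq += sum(seg_sq_dist(x, W[i - 1], v) for x in Sprev)\n    sq += sum(seg_sq_dist(x, v, W[i + 1]) for x in Snext)\n    cnt = max(len(Vi) + len(Sprev) + len(Snext), 1)\n    pen = (cos_penalty(W[i - 2], W[i - 1], v, r)\n           + cos_penalty(W[i - 1], v, W[i + 1], r)\n           + cos_penalty(v, W[i + 1], W[i + 2], r))\n    return sq / cnt + lam_p * pen\n\nA full optimization step would loop over all vertices, substituting the endpoint variants of P(v_i) where needed, and alternate with the projection step until the inner loop converges.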
\n\nOne important issue is the amount of smoothing required for a given data set. In the HS algorithm one needs to set the penalty coefficient of the spline smoother, or the span of the scatterplot smoother. In our algorithm, the corresponding parameter is the curvature penalty factor λ_p. If some a priori knowledge about the distribution is available, one can use it to determine the smoothing parameter. However, in the absence of such knowledge, the coefficient should be data-dependent. Intuitively, λ_p should increase with the number of segments and the size of the average squared error, and it should decrease with the data size. Based on heuristic considerations and after carrying out practical experiments, we set λ_p = λ'_p n^{−1/3} Δ_n(f_{k,n})^{1/2} r^{−1}, where λ'_p is a parameter of the algorithm which can be kept fixed for substantially different data sets.\n\nADDING A NEW VERTEX. We start with the optimized f_{k,n} and choose the segment that has the largest number of data points projecting to it. If more than one such segment exists, we choose the longest one. The midpoint of this segment is selected as the new vertex. Formally, let I = {i : |S_i| ≥ |S_j|, j = 1, ..., k}, and let ℓ = argmax_{i ∈ I} ||v_i − v_{i+1}||. Then the new vertex is v_new = (v_ℓ + v_{ℓ+1})/2.
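\n\nA short sketch of this rule, reusing the hypothetical project_partition helper from the projection sketch (illustrative code, not the authors' implementation):\n\nimport numpy as np\n\ndef add_vertex(X, V):\n    # Split the segment with the most projecting points at its midpoint,\n    # breaking ties in favor of the longest such segment.\n    seg_sets = project_partition(X, V)[1]\n    counts = np.array([len(s) for s in seg_sets])\n    tied = np.flatnonzero(counts == counts.max())\n    lengths = np.linalg.norm(V[tied + 1] - V[tied], axis=1)\n    j = int(tied[np.argmax(lengths)])\n    return np.insert(V, j + 1, 0.5 * (V[j] + V[j + 1]), axis=0)\n\nAdding one vertex per outer iteration in this way is what the complexity analysis below assumes.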
\n\nSTOPPING CONDITION. According to the theoretical results of [1], the number of segments k should be proportional to n^{1/3} to achieve the O(n^{−1/3}) convergence rate for the expected squared distance. Although the theoretical bounds are not tight enough to determine the optimal number of segments for a given data size, we found that k ~ n^{1/3} also works in practice. To achieve robustness we need to make k sensitive to the average squared distance. The stopping condition blends these two considerations. The algorithm stops when k exceeds c(n, Δ_n(f_{k,n})) = λ_k n^{1/3} Δ_n(f_{k,n})^{−1/2} r.\n\nCOMPUTATIONAL COMPLEXITY. The complexity of the inner loop is dominated by the complexity of the projection step, which is O(nk). Increasing the number of segments by one at a time (as described in Section 2), and using the stopping condition of Section 2, the computational complexity of the algorithm becomes O(n^{5/3}). This is slightly better than the O(n^2) complexity of the HS algorithm. The complexity can be dramatically decreased if, instead of adding only one vertex at a time, a new vertex is placed at the midpoint of every segment, giving O(n^{4/3} log n), or if k is set to be a constant, giving O(n). These simplifications work well in certain situations, but the original algorithm is more robust.\n\n3 Experimental Results\n\nWe have extensively tested our algorithm on two-dimensional data sets. In most experiments the data was generated by a commonly used (see, e.g., [3], [5], [7]) additive model X = Y + e, where Y is uniformly distributed on a smooth planar curve (hereafter called the generating curve) and e is bivariate additive noise which is independent of Y.\n\nSince the \"true\" principal curve is not known (note that the generating curve in the model X = Y + e is in general not a principal curve either in the HS sense or in our definition), it is hard to give an objective measure of performance. For this reason, in what follows, the performance is judged subjectively, mainly on the basis of how closely the resulting curve follows the shape of the generating curve.\n\nIn general, in simulation examples considered by HS the performance of the new algorithm is comparable with that of the HS algorithm. Due to the data-dependence of the curvature penalty factor and the stopping condition, our algorithm turns out to be more robust to alterations in the data generating model, as well as to changes in the parameters of the particular model.\n\nWe use varying generating shapes, noise parameters, and data sizes to demonstrate the robustness of the polygonal line algorithm. All of the plots in Figure 4 show the generating curve (Generator Curve), the curve produced by our polygonal line algorithm (Principal Curve), and the curve produced by the HS algorithm with spline smoothing (HS Principal Curve), which we have found to perform better than the HS algorithm using scatterplot smoothing. For closed generating curves we also include the curve produced by the Banfield and Raftery (BR) algorithm [4], which extends the HS algorithm to closed curves (BR Principal Curve). The two coefficients of the polygonal line algorithm are set in all experiments to the constant values λ_k = 0.3 and λ'_p = 0.1. All plots have been normalized to fit in a 2 x 2 square. The parameters given below refer to values before this normalization.\n\n(Figure 4 contains six scatter plots of the sample points with the fitted curves overlaid; the panel titles in the original indicate the generating shape, sample size, and noise level of each data set, e.g. \"Circle, 100 points with medium noise\".)\n\nFigure 4: (a) The Circle Example: the BR and the polygonal line algorithms show less bias than the HS algorithm. (b) The Half Circle Example: the HS and the polygonal line algorithms produce similar curves. (c) and (d) Transformed Data Sets: the polygonal line algorithm still follows fairly closely the \"distorted\" shapes. (e) Small Noise Variance and (f) Large Sample Size: the curves produced by the polygonal line algorithm are nearly indistinguishable from the generating curves.\n\nIn Figure 4(a) the generating curve is a circle of radius r = 1, and e = (e_1, e_2) is a zero mean bivariate uncorrelated Gaussian with variance E(e_i^2) = 0.04, i = 1, 2. The performance of the three algorithms (HS, BR, and the polygonal line algorithm) is comparable, although the HS algorithm exhibits more bias than the other two. Note that the BR algorithm [4] has been tailored to fit closed curves and to reduce the estimation bias. In Figure 4(b), only half of the circle is used as a generating curve and the other parameters remain the same. Here, too, both the HS and our algorithm behave similarly.
\n\nWhen we depart from these usual settings the polygonal line algorithm exhibits better behavior than the HS algorithm. In Figure 4(c) the data set of Figure 4(b) was linearly transformed by a fixed 2 x 2 matrix. In Figure 4(d) a different linear transformation was applied to a data set generated by an S-shaped generating curve, consisting of two half circles of unit radii, to which the same Gaussian noise was added as in Figure 4(b). In both cases the polygonal line algorithm produces curves that fit the generator curve more closely. This is especially noticeable in Figure 4(c), where the HS principal curve fails to follow the shape of the distorted half circle.\n\nThere are two situations when we expect our algorithm to perform particularly well. If the distribution is concentrated on a curve, then according to both the HS and our definitions the principal curve is the generating curve itself. Thus, if the noise variance is small, we expect both algorithms to approximate the generating curve very closely. The data in Figure 4(e) was generated using the same additive Gaussian model as in Figure 4(a), but the noise variance was reduced to E(e_i^2) = 0.001 for i = 1, 2. In this case we found that the polygonal line algorithm outperformed both the HS and the BR algorithms.\n\nThe second case is when the sample size is large. Although the generating curve is not necessarily the principal curve of the distribution, it is natural to expect the algorithm to approximate the generating curve well as the sample size grows. Such a case is shown in Figure 4(f), where n = 10000 data points were generated (but only a small subset of these was actually plotted). Here the polygonal line algorithm approximates the generating curve with much better accuracy than the HS algorithm.\n\nThe Java implementation of the algorithm is available at the WWW site http://www.cs.concordia.ca/~grad/kegl/pcurvedemo.html\n\n4 Conclusion\n\nWe offered a new definition of principal curves and presented a practical algorithm for constructing principal curves for data sets. One significant difference between our method and previous principal curve algorithms ([3], [4], and [8]) is that, motivated by the new definition, our algorithm minimizes a distance criterion (2) between the data points and the polygonal curve rather than a distance criterion between the data points and the vertices of the polygonal curve. This, together with the introduction of the data-dependent smoothing factor λ_p, made our algorithm more robust to variations in the data distribution, while keeping the computational complexity low.\n\nAcknowledgments\n\nThis work was supported in part by NSERC grant OGP000270, Canadian National Networks of Centers of Excellence grant 293, and the National Science Foundation.\n\nReferences\n\n[1] B. Kegl, A. Krzyzak, T. Linder, and K. Zeger, \"Principal curves: Learning and convergence,\" in Proceedings of IEEE Int. Symp. on Information Theory, p. 387, 1998.\n\n[2] T. Hastie, Principal curves and surfaces. PhD thesis, Stanford University, 1984.\n\n[3] T. Hastie and W. Stuetzle, \"Principal curves,\" Journal of the American Statistical Association, vol. 84, no. 406, pp. 502-516, 1989.\n\n[4] J. D. Banfield and A. E.
Raftery, \"Ice floe identification in satellite images using mathematical morphology and clustering about principal curves,\" Journal of the American Statistical Association, vol. 87, no. 417, pp. 7-16, 1992.\n\n[5] R. Tibshirani, \"Principal curves revisited,\" Statistics and Computing, vol. 2, pp. 183-190, 1992.\n\n[6] P. Delicado, \"Principal curves and principal oriented points,\" Tech. Rep. 309, Departament d'Economia i Empresa, Universitat Pompeu Fabra, 1998. http://www.econ.upf.es/deehome/what/wpapers/postscripts/309.pdf\n\n[7] F. Mulier and V. Cherkassky, \"Self-organization as an iterative kernel smoothing process,\" Neural Computation, vol. 7, pp. 1165-1177, 1995.\n\n[8] R. Der, U. Steinmetz, and G. Balzuweit, \"Nonlinear principal component analysis,\" tech. rep., Institut für Informatik, Universität Leipzig, 1998. http://www.informatik.uni-leipzig.de/~der/Veroeff/npcafin.ps\n\n[9] V. N. Vapnik, The Nature of Statistical Learning Theory. New York: Springer-Verlag, 1995.\n", "award": [], "sourceid": 1627, "authors": [{"given_name": "Bal\u00e1zs", "family_name": "K\u00e9gl", "institution": null}, {"given_name": "Adam", "family_name": "Krzyzak", "institution": null}, {"given_name": "Tam\u00e1s", "family_name": "Linder", "institution": null}, {"given_name": "Kenneth", "family_name": "Zeger", "institution": null}]}