{"title": "Non-Intrusive Gaze Tracking Using Artificial Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 753, "page_last": 760, "abstract": null, "full_text": "Non-Intrusive Gaze Tracking Using Artificial Neural Networks\n\nShumeet Baluja\nbaluja@cs.cmu.edu\n\nSchool of Computer Science\nCarnegie Mellon University\nPittsburgh, PA 15213\n\nDean Pomerleau\npomerleau@cs.cmu.edu\n\nSchool of Computer Science\nCarnegie Mellon University\nPittsburgh, PA 15213\n\nAbstract\n\nWe have developed an artificial neural network based gaze tracking system\nwhich can be customized to individual users. Unlike other gaze trackers,\nwhich normally require the user to wear cumbersome headgear or to use a\nchin rest to ensure head immobility, our system is entirely non-intrusive.\nCurrently, the best intrusive gaze tracking systems are accurate to approximately\n0.75 degrees. In our experiments, we have been able to achieve an\naccuracy of 1.5 degrees, while allowing head mobility. In this paper we\npresent an empirical analysis of the performance of a large number of artificial\nneural network architectures for this task.\n\n1 INTRODUCTION\n\nThe goal of gaze tracking is to determine where a subject is looking from the appearance\nof the subject's eye. The interest in gaze tracking exists because of the large number of\npotential applications. Three of the most common uses of a gaze tracker are as an alternative\nto the mouse as an input modality [Ware & Mikaelian, 1987], as an analysis tool for\nhuman-computer interaction (HCI) studies [Nodine et al., 1992], and as an aid for the\nhandicapped [Ware & Mikaelian, 1987]. 
\nViewed in the context of machine vision, successful gaze tracking requires techniques to\nhandle imprecise data, noisy images, and a potentially infinitely large image set. The most\naccurate gaze tracking has come from intrusive systems. These systems either use devices\nsuch as chin rests to restrict head motion, or require the user to wear cumbersome equipment,\nranging from special contact lenses to a camera placed on the user's head. The system\ndescribed here attempts to perform non-intrusive gaze tracking, in which the user is\nneither required to wear any special equipment, nor required to keep his/her head still.\n\n2 GAZE TRACKING\n\n2.1 TRADITIONAL GAZE TRACKING\n\nIn standard gaze trackers, an image of the eye is processed in three basic steps. First, the\nspecular reflection of a stationary light source is found in the eye's image. Second, the\npupil's center is found. Finally, the relative position of the light's reflection to the pupil's\ncenter is calculated. The gaze direction is determined from information about the relative\npositions, as shown in Figure 1. In many of the current gaze tracker systems, the user is\nrequired to remain motionless, or wear special headgear to maintain a constant offset\nbetween the position of the camera and the eye.\n\n[Figure 1 panels: Looking Above Light, Looking Below Light, Looking Left of Light, Looking at Light. The specular reflection is marked.]\n\nFigure 1: Relative position of specular reflection and pupil. This diagram assumes that\nthe light is placed in the same location as the observer (or camera).\n\n2.2 ARTIFICIAL NEURAL NETWORK BASED GAZE TRACKING\n\nOne of the primary benefits of an artificial neural network based gaze tracker is that it is\nnon-intrusive; the user is allowed to move his head freely. 
In order to account for the shifts\nin the relative positions of the camera and the eye, the eye must be located in each image\nframe. In the current system, the right eye is located by searching for the specular reflection\nof a stationary light in the image of the user's face. The reflection can usually be recognized\nas a small bright region surrounded by a very dark region. The reflection's location is used\nto limit the search for the eye in the next frame. A window surrounding the reflection is\nextracted; the image of the eye is located within this window.\nTo determine the coordinates of the point the user is looking at, the pixels of the extracted\nwindow are used as the inputs to the artificial neural network. The forward pass is simulated\nin the ANN, and the coordinates of the gaze are determined by reading the output\nunits. The output layer consists of 50 units for specifying the X coordinate\nand 50 units for the Y coordinate. A Gaussian output representation, similar to that used in\nthe ALVINN autonomous road following system [Pomerleau, 1993], is used for the X and\nY axis output units. Gaussian encoding represents the network's response by a Gaussian\nshaped activation peak in a vector of output units. The position of the peak within the vector\nrepresents the gaze location along either the X or Y axis. The number of hidden units\nand the structure of the hidden layer necessary for this task are explored in section 3.\nThe training data is collected by instructing the user to visually track a moving cursor. The\ncursor moves in a predefined path. The image of the eye is digitized, and paired with the\n(X,Y) coordinates of the cursor. A total of 2000 image/position pairs are gathered. All of\nthe networks described in this paper are trained with the same parameters for 260 epochs,\nusing standard error back propagation. 
The training procedure is described in greater\ndetail in the next section.\n\n3 THE ARTIFICIAL NEURAL NETWORK IMPLEMENTATION\n\nIn designing a gaze tracker, the most important attributes are accuracy and speed. The\nneed for balancing these attributes arises in deciding the number of connections in the\nANN, the number of hidden units needed, and the resolution of the input image. This section\ndescribes several architectures tested, and their respective performances.\n\n3.1 EXAMINING ONLY THE PUPIL AND CORNEA\n\nMany of the traditional gaze trackers look only at a high resolution picture of the subject's\npupil and cornea. Although we use low resolution images, our first attempt also only used\nan image of the pupil and cornea as the input to the ANN. Some typical input images are\nshown below, in Figure 2(a). The size of the images is 15x15 pixels. The ANN architecture\nused is shown in Figure 2(b). This architecture was used with varying numbers of hidden\nunits in the single, divided, hidden layer; experiments with 10, 16 and 20 hidden units\nwere performed.\nAs mentioned before, 2000 image/position pairs were gathered for training. The cursor\nautomatically moved in a zig-zag motion horizontally across the screen, while the user\nvisually tracked the cursor. In addition, 2000 image/position pairs were also gathered for\ntesting. These pairs were gathered while the user tracked the cursor as it followed a vertical\nzig-zag path across the screen. The results reported in this paper, unless noted otherwise,\nwere all measured on the 2000 testing points. The results for training the ANN on\nthe three architectures mentioned above, as a function of epochs, are shown in Figure 3. 
Each\nline in Figure 3 represents the average of three ANN training trials (with random initial\nweights) for each of the two users tested.\nUsing this system, we were able to reduce the average error to approximately 2.1 degrees,\nwhich corresponds to 0.6 inches at a comfortable sitting distance of approximately 17\ninches. In addition to these initial attempts, we have also attempted to use the position of\nthe cornea within the eye socket to aid in making finer discriminations. These experiments\nare described in the next section.\n\n[Figure 2 diagram labels: 15 x 15 Input Retina, 50 X Output Units, 50 Y Output Units.]\n\nFigure 2: (a-left) 15 x 15 Input to the ANN. Target outputs also shown. (b-right)\nthe ANN architecture used. A single divided hidden layer is used.\n\nFigure 3: Error vs. Epochs for the 15x15\nimages. Errors shown for the 2000\nimage test set. Each line represents\nthree ANN trainings per user; two\nusers are tested.\n\n[Figure 3 plot: 15x15 Images; curves for 10 Hidden, 16 Hidden, and 20 Hidden.]\n\n3.2 USING THE EYE SOCKET FOR ADDITIONAL INFORMATION\n\nIn addition to using the information present from the pupil and cornea, it is possible to\ngain information about the subject's gaze by analyzing the position of the pupil and cornea\nwithin the eye socket. Two sets of experiments were performed using the expanded eye\nimage. The first set used the network described in the next section. The second set of\nexperiments used the same architecture shown in Figure 2(b), with a larger input image\nsize. A sample image used for training is shown below, in Figure 4.\n\nFigure 4: Image of the pupil and the\neye socket, and the corresponding\ntarget outputs. 15 x 40 input image\nshown.\n\n3.2.1. 
Using a Single Continuous Hidden Layer\nOne of the remaining issues in creating the ANN to be used for analyzing the position of\nthe gaze is the structure of the hidden unit layer. In this study, we have limited our exploration\nof ANN architectures to simple 3 layer feed-forward networks. In the previous\narchitecture (using 15 x 15 images) the hidden layer was divided into 2 separate parts, one\nfor predicting the x-axis, and the other for the y-axis. Selecting this architecture over a\nfully connected hidden layer makes the assumption that the features needed for accurate\nprediction of the x-axis are not related to the features needed for predicting the y-axis. In\nthis section, this assumption is tested. This section explores a network architecture in\nwhich the hidden layer is fully connected to the inputs and the outputs.\nIn addition to deciding the architecture of the ANN, it is necessary to decide on the size of\nthe input images. Several input sizes were attempted: 15x30, 15x40 and 20x40. Surprisingly,\nthe 20x40 input image did not provide the most accuracy. Rather, it was the 15x40\nimage which gave the best results. Figure 5 provides two charts showing the performance\nof the 15x40 and 20x40 image sizes as a function of the number of hidden units and\nepochs. The 15x30 graph is not shown due to space restrictions; it can be found in [Baluja\n& Pomerleau, 1994]. The accuracy achieved by using the eye socket information, for the\n15x40 input images, is better than using only the pupil and cornea; in particular, the 15x40\ninput retina worked better than both the 15x30 and 20x40. 
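As a rough illustration of the two hidden layer layouts compared in this section, the sketch below counts the weights in a fully connected and a divided hidden layer for the 15x40 input retina. It assumes that every hidden unit is connected to every input pixel, that the divided layer gives half of its units to each axis, and that bias weights are ignored; none of these details are specified in the text, and the function names are hypothetical.

```python
# Hypothetical sketch: count the weights in the two hidden layer layouts.
# Assumptions (not stated in the paper): every hidden unit sees every input
# pixel, bias weights are ignored, and the divided layer gives half of the
# hidden units to the X outputs and the other half to the Y outputs.


def fully_connected_weights(n_inputs, n_hidden, n_outputs_per_axis=50):
    # Every hidden unit feeds all X and all Y output units.
    input_to_hidden = n_inputs * n_hidden
    hidden_to_output = n_hidden * 2 * n_outputs_per_axis
    return input_to_hidden + hidden_to_output


def divided_weights(n_inputs, n_hidden, n_outputs_per_axis=50):
    # Each half of the hidden layer feeds the output units of one axis only.
    if n_hidden % 2 != 0:
        raise ValueError('n_hidden must be even to split it into two halves')
    input_to_hidden = n_inputs * n_hidden
    hidden_to_output = 2 * (n_hidden // 2) * n_outputs_per_axis
    return input_to_hidden + hidden_to_output


if __name__ == '__main__':
    n_inputs = 15 * 40  # the 15x40 input retina
    for n_hidden in (10, 16, 20):
        full = fully_connected_weights(n_inputs, n_hidden)
        split = divided_weights(n_inputs, n_hidden)
        print(n_hidden, full, split, full - split)
```

Under these assumptions the divided layout removes 50 weights per hidden unit, and the saving lies entirely in the hidden-to-output connections.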
\n\n[Figure 5 plots: 15 x 40 Image and 20 x 40 Image; x-axis: Epochs; curves for 10 Hidden, 16 Hidden, and 20 Hidden.]\n\nFigure 5: Performance of 15x40 and 20x40 input image sizes as a function of\nepochs and number of hidden units. Each line is the average of 3 runs. Data\npoints taken every 20 epochs, between 20 and 260 epochs.\n\n3.2.2. Using a Divided Hidden Layer\nThe final set of experiments used 15x40 input images and 3\ndifferent hidden unit architectures: 5x2, 8x2 and 10x2. The hidden unit layer was divided\nin the manner described in the first network, shown in Figure 2(b). Two experiments were\nperformed, with the only difference between experiments being the selection of training\nand testing images. The first experiment was similar to the experiments described previously.\nThe training and testing images were collected in two different sessions, one in\nwhich the user visually tracked the cursor as it moved horizontally across the screen and\nthe other in which the cursor moved vertically across the screen. The training of the ANN\nwas on the \"horizontally\" collected images, and the testing of the network was on the\n\"vertically\" collected images. In the second experiment, a random sample of 1000 images\nfrom the horizontally collected images and a random sample of 1000 vertically collected\nimages were used as the training set. The remaining 2000 images from both sets were used\nas the testing set. The second method yielded reduced tracking errors. 
If the images from\nonly one session were used, the network was not trained to accurately predict gaze position\nindependently of head position. As the two sets of data were collected in two separate\nsessions, the head positions from one session to the other would have changed slightly.\nTherefore, using both sets should have helped the network in two ways. First, the presentation\nof different head positions and different head movements should have improved the\nability of the network to generalize. Secondly, the network was tested on images which\nwere gathered from the same sessions as those on which it was trained. The use of mixed\ntraining and testing sets will be explored in more detail in section 3.2.3.\nThe results of the first and second experiments are presented in Figure 6. In order to\ncompare this architecture with the previous architectures mentioned, it should be noted\nthat this architecture, with 10 hidden units, more accurately predicted\ngaze location than the architecture mentioned in section 3.2.1, in which a single continuous\nhidden layer was used. In comparing the performance of the architectures with 16 and\n20 hidden units, the performances were very similar. Another valuable feature of using the\ndivided hidden layer is that the reduced number of connections decreases the training and\nsimulation times. This architecture operates at approximately 15 Hz with 10 and 16 hidden\nunits, and slightly slower with 20 hidden units.\n\n[Figure 6 plots: Error (degrees) vs. Epochs for the separate hidden layer with 15x40 images, on Test Set 1 and Test Set 2; curves for 10 Hidden, 16 Hidden, and 20 Hidden.]\n\nFigure 6: (Left) The average of 2 users with the 15x40 images, and a divided hidden\nlayer architecture, using test setup #1. (Right) The average performance tested on\n5 users, with test setup #2. Each line represents the average of three ANN\ntrainings per user per hidden unit architecture.\n\n3.2.3. Mixed Training and Testing Sets\nIt was hypothesized, above, that there are two reasons for the improved performance of a\nmixed training and testing set. First, the network's ability to generalize is improved, as it is\ntrained with more than a single head position. Second, the network is tested on images\nwhich are similar, with respect to head position, to those on which it was trained. In this\nsection, the first hypothesized benefit is examined in greater detail using the experiments\ndescribed below.\nFour sets of 2000 images were collected. In each set, the user had a different head position\nwith respect to the camera. The first two sets were collected as previously described. The\nfirst set of 2000 images (horizontal train set 1) was collected by visually tracking the cursor\nas it made a horizontal path across the screen. The second set (vertical test set 1) was\ncollected by visually tracking the cursor as it moved in a vertical path across the screen.\nFor the third and fourth image sets, the camera was moved, and the user was seated in a\ndifferent location with respect to the screen than during the collection of the first training\nand testing sets. The third set (horizontal train set 2) was again gathered from tracking the\ncursor's horizontal path, while the fourth (vertical test set 2) was from the vertical path of\nthe cursor.\nThree tests were performed. 
In the first test, the ANN was trained using only the 2000\nimages in horizontal training set 1. In the second test, the network was trained using the\n2000 images in horizontal training set 2. In the third test, the network was trained with a\nrandom selection of 1000 images from horizontal training set 1, and a random selection of\n1000 images from horizontal training set 2. The performance of these networks was tested on\nboth of the vertical test sets. The results are reported below, in Figure 7. The last experiment,\nin which samples were taken from both training sets, provides more accurate results\nwhen testing on vertical test set 1 than the network trained alone on horizontal training set\n1. When testing on vertical test set 2, the combined network performs almost as well as the\nnetwork trained only on horizontal training set 2.\nThese three experiments provide evidence for the network's increased ability to generalize\nif sets of images which contain multiple head positions are used for training. These experiments\nalso show the sensitivity of the gaze tracker to movements of the camera; if the\ncamera is moved between training and testing, the errors in simulation will be large.\n\n[Figure 7 plots: Error (degrees) vs. Epochs on the vertical test sets; curves for combined, train set 1, and train set 2.]\n\nFigure 7: Comparing the performance between networks trained with only one head\nposition (horizontal train set 1 & 2), and a network trained with both. 
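The combined training set used in the third test can be assembled as in the sketch below, which draws an equal random sample, without replacement, from each of the two horizontal training sets. The function name, the seed parameter and the (set, index) representation are assumptions made for illustration; the paper does not describe how the random selection was implemented.

```python
import random


def mixed_training_indices(n_set1, n_set2, n_per_set, seed=0):
    # Return (set_id, image_index) pairs for a mixed training set.
    # Sampling is without replacement, so no image is used twice.
    rng = random.Random(seed)
    picks1 = rng.sample(range(n_set1), n_per_set)
    picks2 = rng.sample(range(n_set2), n_per_set)
    mixed = [(1, i) for i in picks1] + [(2, i) for i in picks2]
    rng.shuffle(mixed)  # interleave the two head positions
    return mixed


if __name__ == '__main__':
    # 1000 images from each of the two 2000-image horizontal training sets.
    train = mixed_training_indices(2000, 2000, 1000)
    print(len(train))
```

Fixing the seed keeps the selection reproducible across the repeated training trials; a different seed gives a different but equally sized mixture.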
\n\n4 USING THE GAZE TRACKER\n\nThe experiments described to this point have used static test sets which are gathered over\na period of several minutes, and then stored for repeated use. Using the same test set has\nbeen valuable in gauging the performance of different ANN architectures. However, a\nuseful gaze tracker must produce accurate on-line estimates of gaze location. The use of\nan \"offset table\" can increase the accuracy of on-line gaze prediction. The offset table is a\ntable of corrections to the output made by a gaze tracker. The network's gaze predictions\nfor each image are hashed into the 2D offset table, which performs an additive correction\nto the network's prediction. The offset table is filled after the network is fully trained. The\nuser manually moves and visually tracks the cursor to regions in which the ANN is not\nperforming accurately. The offset table is updated by subtracting the predicted position of\nthe cursor from the actual position. This procedure can also be automated, with the cursor\nmoving in a similar manner to the procedure used for gathering testing and training\nimages. However, manually moving the cursor can help to concentrate effort on areas\nwhere the ANN is not performing well, thereby reducing the time required for offset table\ncreation.\nWith the use of the offset table, the current system works at approximately 15 Hz. The best\non-line accuracy we have achieved is 1.5 degrees. Although we have not yet matched the\nbest gaze tracking systems, which have achieved approximately 0.75 degree accuracy, our\nsystem is non-intrusive, and does not require the expensive hardware which many other\nsystems require. 
We have used the gaze tracker in several forms: as an\ninput modality to replace the mouse, as a method of selecting windows in an X-Window\nenvironment, and as a tool to report gaze direction for human-computer interaction studies.\n\nThe gaze tracker is currently trained for 260 epochs, using standard back propagation.\nTraining the 8x2 hidden layer network using the 15x40 input retina, with 2000 images,\ntakes approximately 30-40 minutes on a Sun SPARC 10 machine.\n\n5 CONCLUSIONS\n\nWe have created a non-intrusive gaze tracking system which is based upon a simple ANN.\nUnlike other gaze-tracking systems which employ more traditional vision techniques,\nsuch as edge detection and circle fitting, this system develops its own features for\nsuccessfully completing the task. The system's average on-line accuracy is 1.7 degrees. It has\nsuccessfully been used in HCI studies and as an input device. Potential extensions to the\nsystem, to achieve head-position and user independence, are presented in [Baluja &\nPomerleau, 1994].\nAcknowledgments\nThe authors would like to gratefully acknowledge the help of Kaari Flagstad, Tammy\nCarter, Greg Nelson, and Ulrike Harke for letting us scrutinize their eyes, and being\n\"willing\" subjects. Profuse thanks are also due to Henry Rowley for aid in revising this paper.\n\nShumeet Baluja is supported by a National Science Foundation Graduate Fellowship. This\nresearch was supported by the Department of the Navy, Office of Naval Research under\nGrant No. N00014-93-1-0806. The views and conclusions contained in this document are\nthose of the authors and should not be interpreted as representing the official policies,\neither expressed or implied, of the National Science Foundation, ONR, or the U.S.\ngovernment. 
\nReferences\nBaluja, S. & Pomerleau, D.A. (1994) \"Non-Intrusive Gaze Tracking Using Artificial Neural Networks\"\nCMU-CS-94.\nJochem, T.M., D.A. Pomerleau & C.E. Thorpe (1993) \"MANIAC: A Next Generation Neurally\nBased Autonomous Road Follower\". In Proceedings of the International Conference on Intelligent\nAutonomous Systems (IAS-3).\nNodine, C.P., H.L. Kundel, L.C. Toto & E.A. Krupinski (1992) \"Recording and analyzing eye-position\ndata using a microcomputer workstation\", Behavior Research Methods, Instruments & Computers\n24 (3) 475-584.\nPomerleau, D.A. (1991) \"Efficient Training of Artificial Neural Networks for Autonomous Navigation\",\nNeural Computation 3:1, Terrence Sejnowski (Ed).\nPomerleau, D.A. (1993) Neural Network Perception for Mobile Robot Guidance. Kluwer Academic\nPublishing.\nPomerleau, D.A. (1993) \"Input Reconstruction Reliability Estimation\", Neural Information Processing\nSystems 5. Hanson, Cowan, Giles (eds.) Morgan Kaufmann, pp. 270-286.\nStarker, I. & R. Bolt (1990) \"A Gaze-Responsive Self Disclosing Display\", In CHI-90. Addison\nWesley, Seattle, Washington.\nWaibel, A., Sawai, H. & Shikano, K. (1990) \"Consonant Recognition by Modular Construction of\nLarge Phonemic Time-Delay Neural Networks\". Readings in Speech Recognition. Waibel and Lee.\nWare, C. & Mikaelian, H. (1987) \"An Evaluation of an Eye Tracker as a Device for Computer\nInput\", In J. Carrol and P. Tanner (ed.) Human Factors in Computing Systems - IV. Elsevier.\n", "award": [], "sourceid": 863, "authors": [{"given_name": "Shumeet", "family_name": "Baluja", "institution": null}, {"given_name": "Dean", "family_name": "Pomerleau", "institution": null}]}