{"title": "3D Social Saliency from Head-mounted Cameras", "book": "Advances in Neural Information Processing Systems", "page_first": 422, "page_last": 430, "abstract": null, "full_text": "3D Social Saliency from Head-mounted Cameras\n\nHyun Soo Park\n\nCarnegie Mellon University\nhyunsoop@cs.cmu.edu\n\nEakta Jain\n\nTexas Instruments\ne-jain@ti.com\n\nAbstract\n\nYaser Sheikh\n\nCarnegie Mellon University\n\nyaser@cs.cmu.edu\n\nA gaze concurrence is a point in 3D where the gaze directions of two or more\npeople intersect. It is a strong indicator of social saliency because the attention\nof the participating group is focused on that point. In scenes occupied by large\ngroups of people, multiple concurrences may occur and transition over time. In\nthis paper, we present a method to construct a 3D social saliency \ufb01eld and lo-\ncate multiple gaze concurrences that occur in a social scene from videos taken by\nhead-mounted cameras. We model the gaze as a cone-shaped distribution emanat-\ning from the center of the eyes, capturing the variation of eye-in-head motion. We\ncalibrate the parameters of this distribution by exploiting the \ufb01xed relationship\nbetween the primary gaze ray and the head-mounted camera pose. The result-\ning gaze model enables us to build a social saliency \ufb01eld in 3D. We estimate the\nnumber and 3D locations of the gaze concurrences via provably convergent mode-\nseeking in the social saliency \ufb01eld. Our algorithm is applied to reconstruct mul-\ntiple gaze concurrences in several real world scenes and evaluated quantitatively\nagainst motion-captured ground truth.\n\nIntroduction\n\n1\nScene understanding approaches have largely focused on understanding the physical structure of a\nscene: \u201cwhat is where?\u201d [1]. 
In social scenes, i.e., scenes occupied by people, this definition of understanding needs to be expanded to include interpreting what is socially salient in that scene, such as who people interact with, where they look, and what they attend to. While classic structural scene understanding is an objective interpretation of the scene (e.g., 3D reconstruction [2], object recognition [3], or human affordance identification [4]), social scene understanding is subjective as it depends on the beholder and the particular group of people occupying the scene. For example, when we first enter a foyer during a party, we quickly look at different people and the groups they have formed, search for personal friends or acquaintances, and choose a group to join. Consider instead an artificial agent, such as a social robot, that enters the same room: how should it interpret the social dynamics of the environment? The subjectivity of social environments makes the identification of quantifiable and measurable representations of social scenes difficult. In this paper, we aim to recover a representation of saliency in social scenes that approaches objectivity through the consensus of multiple subjective judgements.

Humans transmit visible social signals about what they find important, and these signals are powerful cues for social scene understanding [5]. For instance, humans spontaneously orient their gaze to the target of their attention. When multiple people simultaneously pay attention to the same point in three dimensional space, e.g., at an obnoxious customer at a restaurant, their gaze rays¹ converge to a point that we refer to as a gaze concurrence. Gaze concurrences are foci of the 3D social saliency field of a scene. A gaze concurrence is an effective approximation of social saliency because, although an individual's gaze indicates only what he or she is subjectively interested in, a gaze concurrence encodes the consensus of multiple individuals. 
In a scene occupied by a larger number of people, multiple such concurrences may emerge as social cliques form and dissolve. In this paper, we present a method to reconstruct a 3D social saliency field and localize 3D gaze concurrences from videos taken by head-mounted cameras on multiple people (Figure 1(a)). Our method automatically finds the number and location of gaze concurrences that may occur as people form social cliques in an environment.

¹ A gaze ray is a three dimensional ray emitted from the center of the eyes and oriented to the point of regard, as shown in Figure 1(b).

(a) Input and output  (b) Head top view  (c) Gaze ray model  (d) Gaze distribution

Figure 1: (a) In this paper, we present a method to reconstruct 3D gaze concurrences from videos taken by head-mounted cameras. (b) The primary gaze ray is a fixed 3D ray with respect to the head coordinate system and the gaze ray can be described by an angle with respect to the primary gaze ray. (c) The variation of the eye orientation is parameterized by a Gaussian distribution of points on the plane, \Pi, which is normal to the primary gaze ray, l, at unit distance from p. (d) The gaze ray model results in a cone-shaped distribution of the point of regard.

Why head-mounted cameras? Estimating 3D gaze concurrences requires accurate estimates of the gaze of people who are widely distributed over the social space. For a third person camera, i.e., an outside camera looking into a scene, state-of-the-art face pose estimation algorithms cannot produce reliable face orientation and location estimates beyond approximately 45 degrees of a head facing the camera directly [6]. Furthermore, as they are usually fixed, third person views introduce spatial biases (i.e., head pose estimates would be better for people closer to and facing the camera) and limit the operating space. In contrast, head-mounted cameras instrument people rather than the scene. 
Therefore, one camera is used to estimate each head pose. As a result, 3D pose estimation of head-mounted cameras provides accurate and spatially unbiased estimates of the primary gaze ray². Head-mounted cameras are poised to broadly enter our social spaces, and many collaborative teams (such as search and rescue teams [8], police squads, military patrols, and surgery teams [9]) are already required to wear them. Head-mounted camera systems are becoming smaller, and will soon be seamlessly integrated into daily life [10].

Contributions. The core contribution of this paper is an algorithm to estimate the 3D social saliency field of a scene and its modes from head-mounted cameras, as shown in Figure 1(a). This is enabled by a new model of gaze rays that represents the variation due to eye-in-head motion via a cone-shaped distribution. We present a novel method to calibrate the parameters of this model by leveraging the fact that the primary gaze ray is fixed with respect to the head-mounted camera in 3D. Given the collection of gaze ray distributions in 3D space, we automatically estimate the number and 3D locations of multiple gaze concurrences via mode-seeking in the social saliency field. We prove that the sequence of mode-seeking iterations converges. We evaluate our algorithm quantitatively using motion capture data, and apply it to real world scenes where social interactions frequently occur, such as meetings, parties, and theatrical performances.

2 Related Work

Humans transmit and respond to many different social signals when they interact with others. Among these signals, gaze direction is one of the most prominent visible signals because it usually indicates what the individual is interested in. In this context, gaze direction estimation has been widely studied in robotics, human-computer interaction, and computer vision [6, 11-22]. Gaze direction can be precisely estimated from the eye orientation. 
Wang and Sung [11] presented a system that estimates the direction of the iris circle from a single image using the geometry of the iris. Guestrin and Eizenman [12] and Hennessey and Lawrence [13] utilized corneal reflections and the vergence of the eyes, respectively, to infer the eye geometry and its motion. A head-mounted eye tracker is often used to determine the eye orientation [14, 15]. Although all these methods can estimate gaze direction with high accuracy, they either are restricted to a laboratory setting or require a device that occludes the viewer's field of view.

² The primary gaze ray is a fixed eye orientation with respect to the head. It has been shown that this orientation is a unique pose, independent of gravity, head posture, horizon, and the fusion reflex [7].

While the eyes are the primary source of gaze direction, Emery [16] notes that the head orientation is a strong indicator of the direction of attention. For head orientation estimation, there are two approaches: outside-in and inside-out [23]. An outside-in system takes, as input, a third person image from a particular vantage point and estimates face orientation based on a face model. Murphy-Chutorian and Trivedi [6] have summarized this approach. Geometric modeling of the face has been used to orient the head by Gee and Cipolla [17] and Ballard and Stockman [18]. Rae and Ritter [19] estimated the head orientation via neural networks, and Robertson and Reid [20] presented a method to estimate face orientation by learning 2D face features from different views in a low resolution video. 
With these approaches, a large number of cameras would need to be placed to cover a space large enough to contain all the people. Also, the size of faces in these videos is often small, leading to head pose estimates whose bias depends on the distance from the camera. Instead of the outside-in approach, an inside-out approach estimates head orientation directly from a head-mounted camera looking out at the environment. Munn and Pelz [22] and Takemura et al. [15] estimated the head-mounted camera motion in 3D by feature tracking and visual SLAM, respectively. Pirri et al. [24] presented a gaze calibration procedure based on the eye geometry using 4 head-mounted cameras. We adopt an inside-out approach as it does not suffer from space limitations and biased estimation.

Gaze in a group setting has been used to identify social interaction or to measure social behavior. Stiefelhagen [25] and Smith et al. [26] estimated the point of interest in a meeting scene and a crowd scene, respectively. Bazzani et al. [27] introduced a 3D representation of the visual field of view, which enabled them to locate the convergence of views. Cristani et al. [28] adopted the F-formation concept, which enumerates all possible spatial and orientation configurations of people, to define the region of interest. However, these methods rely on data captured from the third person view point, i.e., outside-in systems; therefore, their capture space is limited and the accuracy of head pose estimation degrades with distance from the camera. Our method is not subject to the same limitations. For an inside-out approach, Fathi et al. [29] present a method that uses a single first person camera to recognize discrete interactions within the wearer's immediate social clique. Their method is complementary to ours as it analyzes the faces within a single person's field of view. 
In contrast, our approach analyzes an entire environment where several social cliques may form or dissolve over time.

3 Method

The videos from the head-mounted cameras are collected and reconstructed in 3D via structure from motion. Each person wears a camera on the head and performs a predefined motion for gaze ray calibration based on our gaze ray model (Section 3.1). After the calibration (Section 3.2), they may move freely and interact with other people. From the reconstructed camera poses in conjunction with the gaze ray model, we estimate multiple gaze concurrences in 3D via mode-seeking (Section 3.3).

Our camera pose registration in 3D is based on structure from motion as described in [2, 30, 31]. We first scan the area of interest (for example, the room or the auditorium) with a camera to reconstruct the reference structure. The 3D poses of the head-mounted cameras are recovered relative to the reference structure using a RANSAC [32] embedded Perspective-n-Point algorithm [33]. When some camera poses cannot be reconstructed because of a lack of features or motion blur, we interpolate the missing camera poses based on the epipolar constraint between consecutive frames.

3.1 Gaze Ray Model

We represent the direction of the viewer's gaze as a 3D ray that is emitted from the center of the eyes and is directed towards the point of regard, as shown in Figure 1(b). The center of the eyes is fixed with respect to the head position and therefore, the orientation of the gaze ray in the world coordinate system is a composite of the head orientation and the eye orientation (eye-in-head motion). A head-mounted camera does not contain sufficient information to estimate the gaze ray because it can capture only the head position and orientation but not the eye orientation. 
However, when the motion of the point of regard is stabilized, i.e., when the point of regard is stationary or slowly moving with respect to the head pose, the eye orientation varies by a small degree [34-36] from the primary gaze ray. We represent the variation of the gaze ray with respect to the primary gaze ray by a Gaussian distribution on a plane normal to the primary gaze ray. The point of regard (and consequently, the gaze ray) is more likely to be near the primary gaze ray.

(a) Cone  (b) Apex candidate  (c) Cone estimation

Figure 2: (a) We parameterize our cone, C, with an apex, p, and the ratio, \xi, of the radius, r, to the height, h. (b) An apex can lie on the orange colored half line, i.e., behind p_0; otherwise some of the points are invisible. (c) An apex can be parameterized as p = p_0 - \alpha v where \alpha \geq 0. Equation (2) allows us to locate the apex accurately.

Let us define the primary gaze ray l by the center of the eyes, p \in R^3, and the unit direction vector, v \in R^3, in the world coordinate system, W, as shown in Figure 1(b). Any point on the primary gaze ray can be written as p + \lambda v where \lambda > 0.

Let \Pi be the plane normal to the primary gaze ray l at unit distance from p, as shown in Figure 1(c). A point d in \Pi can be written as d = d_1 v_1^\perp + d_2 v_2^\perp, where v_1^\perp and v_2^\perp are two orthogonal vectors normal to v, and d_1 and d_2 are scalars drawn from a Gaussian distribution, i.e., d_1, d_2 \sim N(0, \sigma^2). This point d corresponds to the ray l_d in 3D. Thus, the distribution of points on the plane maps to a distribution of gaze rays by parameterizing the 3D ray as l_d(p, v_d) = p + \lambda v_d, where v_d = v + d and \lambda > 0. 
The resulting distribution of 3D points of regard is a cone-shaped distribution whose central axis is the primary gaze ray, i.e., the point distribution on any plane normal to the primary gaze ray is a scaled Gaussian centered at the intersection between l and the plane, as shown in Figure 1(d).

3.2 Gaze Ray Calibration Algorithm

When a person wears a head-mounted camera, it may not be aligned with the direction of the primary gaze ray. In general, its center may not coincide with the center of the eyes either, as shown in Figure 1(d). The orientation and position offsets between the head-mounted camera and the primary gaze ray must be calibrated to estimate where the person is looking.

The relative transform between the primary gaze ray and the camera pose is constant across time because the camera is, for the most part, stationary with respect to the head, H, as shown in Figure 1(d). Once the relative transform and camera pose have been estimated, the primary gaze ray can be recovered. We learn the primary gaze ray parameters, p and v, with respect to the camera pose, and the standard deviation \sigma of eye-in-head motion.

We ask people to form pairs and instruct each pair to look at each other's camera. While doing so, they are asked to move back and forth and side to side. Suppose two people A and B form a pair. If the cameras of A and B are temporally synchronized and reconstructed in 3D simultaneously, the camera center of B is the point of regard of A. Let y_W (the camera center of B) be the point of regard of A, and let R and C be the camera orientation and the camera center of A, respectively. y_W is represented in the world coordinate system, W. We can transform y_W to A's camera centered coordinate system by y = R y_W - R C. 
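The cone-shaped model above lends itself to a direct numerical sketch: draw d_1, d_2 from N(0, \sigma^2), perturb the primary direction v by the offset d on the unit-distance plane, and walk out along the perturbed ray. The snippet below is a minimal illustration in NumPy; the function name and the fixed-depth sampling are our own choices, not from the paper:

```python
import numpy as np

def sample_gaze_rays(p, v, sigma, n_samples, depth, rng=None):
    """Sample points of regard at a fixed depth along gaze rays drawn from the
    cone-shaped model: perturb the primary direction v by a Gaussian offset
    d = d1*v1 + d2*v2 on the plane at unit distance from the eye center p."""
    rng = np.random.default_rng(rng)
    v = v / np.linalg.norm(v)
    # Build two unit vectors orthogonal to v (a basis of the plane Pi).
    a = np.array([1.0, 0.0, 0.0]) if abs(v[0]) < 0.9 else np.array([0.0, 1.0, 0.0])
    v1 = np.cross(v, a)
    v1 /= np.linalg.norm(v1)
    v2 = np.cross(v, v1)
    d = rng.normal(0.0, sigma, size=(n_samples, 2))   # d1, d2 ~ N(0, sigma^2)
    v_d = v + d[:, :1] * v1 + d[:, 1:] * v2           # perturbed gaze directions
    return p + depth * v_d                            # points of regard

pts = sample_gaze_rays(np.zeros(3), np.array([0.0, 0.0, 1.0]), 0.05, 2000, depth=4.0, rng=0)
```

The lateral spread of the sampled points grows linearly with depth (roughly depth times sigma), which is exactly the cone shape of Figure 1(d).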
From {y_i}_{i=1,...,n}, where n is the number of points of regard, we can infer the primary gaze ray parameters with respect to the camera pose. If there were no eye-in-head motion, all {y_i} would form a line, which is the primary gaze ray. Due to the eye-in-head motion, {y_i} will be contained in a cone whose central axis is the direction of the primary gaze ray, v, and whose apex is the center of the eyes, p.

We first estimate the primary gaze line and then find the center of the eyes on the line to completely describe the primary gaze ray. To estimate the primary gaze line robustly, we embed line estimation by two points in the RANSAC framework [32]³. This gives us a 3D line, l(p_c, v), where p_c is the projection of the camera center onto the line and v is the direction vector of the line. The projections of {y_i} onto the line will be distributed on a half line with respect to p_c, which enables us to determine the sign of v. Given this line, we find a 3D cone, C(p, \xi), that encapsulates

³ We estimate a 3D line by randomly selecting two points at each iteration and finding the line that produces the maximum number of inlier points.

(a) Geometry  (b) Gaze model  (c) Social saliency field and mean trajectories

Figure 3: (a) \hat{x}_i is the projection of x onto the primary gaze ray, l_i, and d is the perspective distance vector defined in Equation (4). (b) Our gaze ray representation results in the cone-shaped distribution in 3D. (c) Two gaze concurrences are formed by seven gaze rays. 
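The two-point RANSAC line estimation of footnote 3 can be sketched as follows. This is a generic implementation of the stated idea; the iteration count and inlier threshold are hypothetical parameter choices, not values from the paper:

```python
import numpy as np

def ransac_line(points, n_iters=500, inlier_thresh=0.02, rng=None):
    """Fit a 3D line by RANSAC: pick two points, form the line through them,
    and keep the hypothesis with the most inliers."""
    rng = np.random.default_rng(rng)
    best = (None, None, -1)
    for _ in range(n_iters):
        i, j = rng.choice(len(points), size=2, replace=False)
        v = points[j] - points[i]
        nv = np.linalg.norm(v)
        if nv < 1e-12:
            continue
        v = v / nv
        # Perpendicular distance of every point to the candidate line.
        r = points - points[i]
        dist = np.linalg.norm(r - (r @ v)[:, None] * v, axis=1)
        n_in = int((dist < inlier_thresh).sum())
        if n_in > best[2]:
            best = (points[i], v, n_in)
    return best  # (point on line, unit direction, inlier count)

# Noisy points along the z-axis, plus a few gross outliers.
rng = np.random.default_rng(1)
t = rng.uniform(0, 5, size=100)
pts = np.outer(t, [0.0, 0.0, 1.0]) + rng.normal(0, 0.005, (100, 3))
pts = np.vstack([pts, rng.uniform(-2, 2, (10, 3))])
p_on_line, v, n_in = ransac_line(pts, rng=2)
```

A least-squares refit over the inliers (and the half-line test on the projections, to fix the sign of v) would follow in the calibration described above.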
High density is observed around the intersections of rays. Note that the maximum intensity projection is used to visualize the 3D density field. Our mean-shift algorithm allows any random initial points to converge accurately to the highest density point.

all {y_i}, where p is the apex and \xi is the ratio of the radius, r, to the height, h, as shown in Figure 2(a).

The apex can lie on a half line which originates from the closest point, p_0, to the center of the eyes and is oriented in the -v direction; otherwise some y_i are invisible. In Figure 2(b), the apex must lie on the orange half line. p_0 can be obtained as follows:

    p_0 = p_c + \min\{ v^T(y_1 - p_c), ..., v^T(y_n - p_c) \} v.    (1)

Then, the apex can be written as p = p_0 - \alpha v where \alpha \geq 0, as shown in Figure 2(c).

There are an infinite number of cones which contain all the points, e.g., any apex sufficiently far behind all the points can yield a solution. Among these solutions, we want to find the tightest cone, for which the minimum of \xi is achieved. This alone leads to a degenerate solution where \xi = 0 and \alpha = \infty. We add a regularization term to avoid the \alpha = \infty solution. The minimization can be written as

    \minimize_{\xi, \alpha}  \xi + \lambda \alpha
    subject to  a_i / (b_i + \alpha) \leq \xi,  i = 1, ..., n,
                \alpha \geq 0,    (2)

where a_i = \| (I - v v^T)(y_i - p_0) \| and b_i = v^T (y_i - p_0) (Figure 2(c)), which are all known once v and p_0 are known. 
The constraint a_i / (b_i + \alpha) \leq \xi ensures that the cone encapsulates all points of regard {y_i}, and \alpha \geq 0 is the condition that the apex must be behind p_0. \lambda is a parameter that controls how far the apex is from p_0. Equation (2) is a convex optimization problem (see Appendix in the supplementary material). Once the cone C(p, \xi) is estimated from {y_i}, \sigma is set to the standard deviation of the perspective distances, \sigma = std{ \|d(l, y_i)\| }_{i=1,...,n}, and will be used in Equation (3) as the bandwidth for the kernel density function.

3.3 Gaze Concurrence Estimation via Mode-seeking

3D gaze concurrences are formed at the intersections of multiple gaze rays, not at the intersections of multiple primary gaze rays (see Figure 1(b)). If we knew the 3D gaze rays, and which of the rays shared a gaze concurrence, the point of intersection could be estimated directly, via least squares for example. In our setup, neither of these is known, nor do we know the number of gaze concurrences. With a head-mounted camera, only the primary gaze ray is computable; the eye-in-head motion is an unknown quantity. This precludes estimating a 3D gaze concurrence by directly finding a point of intersection. In this section, we present a method to estimate the number and the 3D locations of gaze concurrences given primary gaze rays.

Our observations from head-mounted cameras are primary gaze rays. The gaze ray model discussed in Section 3.1 produces a distribution of points of regard for each primary gaze ray. The superposition of these distributions yields a 3D social saliency field. 
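Equation (2) can be handed to any convex solver; one convenient reduction, used here purely as an illustration, is that for a fixed \alpha the smallest feasible \xi is max_i a_i/(b_i + \alpha), so the two-variable program collapses to a 1D search over \alpha. A sketch in NumPy (the grid search and all parameter values are our own assumptions, not the paper's solver):

```python
import numpy as np

def fit_cone(y, p0, v, lam=0.1, alpha_max=10.0, n_grid=10001):
    """Estimate the tightest cone of Equation (2): for a fixed apex offset
    alpha, the smallest feasible ratio is xi(alpha) = max_i a_i / (b_i + alpha),
    so we minimize xi(alpha) + lam * alpha over a 1D grid of alpha > 0."""
    v = v / np.linalg.norm(v)
    r = y - p0
    b = r @ v                                        # b_i = v^T (y_i - p0)
    a = np.linalg.norm(r - b[:, None] * v, axis=1)   # a_i = ||(I - v v^T)(y_i - p0)||
    alphas = np.linspace(1e-3, alpha_max, n_grid)
    xi = (a[None, :] / (b[None, :] + alphas[:, None])).max(axis=1)
    k = int(np.argmin(xi + lam * alphas))
    return p0 - alphas[k] * v, float(xi[k]), float(alphas[k])  # apex, ratio, offset

# Points lying exactly on a cone with apex at the origin and ratio r/h = 0.1.
h = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.stack([0.1 * h, np.zeros(5), h], axis=1)
apex, xi, alpha = fit_cone(y, p0=np.array([0.0, 0.0, 1.0]),
                           v=np.array([0.0, 0.0, 1.0]), lam=0.02)
```

With p_0 one unit in front of the true apex, the search recovers an opening ratio near 0.1 and an offset alpha near 1, i.e., an apex near the origin.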
We seek modes in this saliency field via a mean-shift algorithm. The modes correspond to the gaze concurrences. The mean-shift algorithm [37] finds the modes by evaluating the weights between the current mean and the observed points. We derive the closed form of the mean-shift vector directly from the observed primary gaze rays. While the observations are rays, the estimated modes are points in 3D. This formulation differs from the classic mean-shift algorithm, where the observations and the modes lie in the same space.

For any point x \in R^3, a density function (the social saliency field), f, is generated by our gaze ray model. f is the average of Gaussian kernel density functions K which evaluate the perspective distance vector between the point, x, and the primary gaze rays l_i as follows:

    f(x) = (1/n) \sum_{i=1}^{n} (1/\sigma_i) K( d(l_i, x) / \sigma_i )
         = (1/n) \sum_{i=1}^{n} (c/\sigma_i) k( \|d(l_i, x)\|^2 / \sigma_i^2 )
         = (1/n) \sum_{i=1}^{n} (c/\sigma_i) \exp( -\|d(l_i, x)\|^2 / (2 \sigma_i^2) ),    (3)

where n is the number of gaze rays and \sigma_i is a bandwidth set to the standard deviation of eye-in-head motion obtained from the gaze ray calibration (Section 3.2) for the i-th gaze ray. k is the profile of the kernel density function, i.e., K(x) = c k(\|x\|^2), and c is a scaling constant. 
d \in R^3 is a perspective distance vector defined as

    d(l_i(p_i, v_i), x) = (x - \hat{x}_i) / (v_i^T (x - p_i))  for  v_i^T (x - p_i) > 0,  and  \infty  otherwise,    (4)

where \hat{x}_i = p_i + (v_i^T (x - p_i)) v_i is the projection of x onto the primary gaze ray, as shown in Figure 3(a). p_i is the center of the eyes and v_i is the direction vector for the i-th primary gaze ray. Note that when v_i^T (x - p_i) \leq 0, the point is behind the eyes, and therefore is not visible. This distance vector directly captures the distance between l and l_d in the gaze ray model (Section 3.1), and therefore this kernel density function yields a cone-shaped density field (Figure 1(d) and Figure 3(b)). Figure 3(c) shows a social saliency field (density field) generated by seven gaze rays. The regions of high density are the gaze concurrences. Note that the maximum intensity projection of the density field is used to illustrate the 3D density field.

The updated mean is the location where the maximum density increase can be achieved from the current mean. Thus, it moves along the gradient direction of the density function evaluated at the current mean. 
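Equations (3) and (4) can be evaluated together directly. The sketch below (NumPy; the function name and the choice c = 1 are our assumptions) computes the saliency field value at a query point from a set of primary gaze rays:

```python
import numpy as np

def saliency(x, P, V, sigmas):
    """Evaluate the social saliency field f(x) of Equation (3): an average of
    Gaussian kernels on the perspective distance d(l_i, x) of Equation (4)."""
    vals = []
    for p, v, s in zip(P, V, sigmas):
        depth = v @ (x - p)
        if depth <= 0:                 # point behind the eyes: not visible
            vals.append(0.0)
            continue
        x_hat = p + depth * v          # projection of x onto the primary ray
        d = (x - x_hat) / depth        # perspective distance vector
        vals.append(np.exp(-0.5 * (d @ d) / s**2) / s)
    return float(np.mean(vals))

# Two rays whose lines intersect at (0, 0, 2).
P = np.array([[-1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
V = np.array([[1.0, 0.0, 2.0], [-1.0, 0.0, 2.0]])
V = V / np.linalg.norm(V, axis=1, keepdims=True)
sigmas = [0.05, 0.05]
f_target = saliency(np.array([0.0, 0.0, 2.0]), P, V, sigmas)
f_off = saliency(np.array([0.0, 0.0, 3.0]), P, V, sigmas)
```

The field peaks at the ray intersection (f_target) and decays away from it (f_off), which is what the mode-seeking below exploits.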
The gradient of the density function, f(x), is

    \nabla_x f(x) = (2c/n) ( \sum_{i=1}^{n} w_i ) ( ( \sum_{i=1}^{n} w_i \tilde{x}_i ) / ( \sum_{i=1}^{n} w_i ) - x ),    (5)

where

    w_i = g( \|d(l_i, x)\|^2 / \sigma_i^2 ) / ( \sigma_i^3 (v_i^T (x - p_i))^2 ),
    \tilde{x}_i = \hat{x}_i + ( \|x - \hat{x}_i\|^2 / (v_i^T (x - p_i)) ) v_i,

and g(y) = -k'(y). \tilde{x}_i is the location that the gradient at x points to with respect to l_i, as shown in Figure 3(a). Note that the gradient direction at x is perpendicular to the ray connecting x and p_i. The last term of Equation (5) is the difference between the current mean estimate and the weighted mean. The new mean location, x_{m+1}, is obtained by adding this difference to the current mean estimate, x_m:

    x_{m+1} = ( \sum_{i=1}^{n} w_i \tilde{x}_i ) / ( \sum_{i=1}^{n} w_i ).    (6)

Figure 3(c) shows how our mean-shift vector moves random initial points according to the gradient information. 
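The update of Equations (5)-(6) iterates a weighted average of the per-ray targets \tilde{x}_i. A minimal sketch (NumPy; the Gaussian profile g(y) = (1/2) e^{-y/2}, the stopping rule, and the iteration cap are our assumptions):

```python
import numpy as np

def mean_shift(x, P, V, sigmas, n_iters=500, tol=1e-8):
    """Mean-shift on primary gaze rays (Equations (5)-(6)): the new mean is
    the w-weighted average of the per-ray targets x_tilde_i."""
    for _ in range(n_iters):
        w, xt = [], []
        for p, v, s in zip(P, V, sigmas):
            b = v @ (x - p)                     # depth along ray i
            if b <= 0:
                continue                        # behind the eyes: no contribution
            x_hat = p + b * v                   # projection of x onto ray i
            d2 = ((x - x_hat) @ (x - x_hat)) / b**2
            g = 0.5 * np.exp(-0.5 * d2 / s**2)  # g = -k' for the Gaussian profile
            w.append(g / (s**3 * b**2))
            xt.append(x_hat + (((x - x_hat) @ (x - x_hat)) / b) * v)
        if not w:
            break
        x_new = np.average(np.array(xt), axis=0, weights=np.array(w))
        if np.linalg.norm(x_new - x) < tol:
            return x_new
        x = x_new
    return x

# Two rays whose lines intersect at (0, 0, 2); start the mean off both rays.
P = np.array([[-1.0, 0.0, 0.0], [1.0, 0.0, 0.0]])
V = np.array([[1.0, 0.0, 2.0], [-1.0, 0.0, 2.0]])
V = V / np.linalg.norm(V, axis=1, keepdims=True)
mode = mean_shift(np.array([0.3, 0.2, 1.5]), P, V, sigmas=[0.3, 0.3])
```

Each iteration moves the mean toward the weighted average of points lying on the rays, so the sequence drifts into the ray intersection, consistent with Theorem 1 below.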
The mean-shift algorithm always converges, as shown in the following theorem.

Theorem 1. The sequence { f(x_m) }_{m=1,2,...} provided by Equation (6) converges to a local maximum of the density field.

See Appendix in the supplementary material for the proof.

4 Results

We evaluate our algorithm quantitatively using a motion capture system to provide ground truth, and apply it to real world examples where social interactions frequently occur. We use GoPro HD Hero2 cameras (www.gopro.com) with the head mounting unit provided by GoPro. We synchronize the cameras using audio signals, e.g., a clap. In the calibration step, we ask people to form pairs, and move back and forth and side to side at least three times to allow the gaze ray model to be accurately estimated. For the initial points of the mean-shift algorithm, we sample several points on the primary gaze rays. This sampling results in convergence of the mean-shift because the local maxima form around the rays. If the weights of an estimated mode are dominated by only one gaze ray, we reject the mode, i.e., more than one gaze ray must contribute to a gaze concurrence.

4.1 Validation with Motion Capture Data

We compare the 3D gaze concurrences estimated by our method with ground truth obtained from a motion capture system (capture volume: 8.3m x 17.7m x 4.3m). We attached several markers to a camera and reconstructed the camera motion using structure from motion and the motion capture system simultaneously. From the reconstructed camera trajectory, we recovered the similarity transform (scale, orientation, and translation) between the two reconstructions. We placed two static markers and asked six people to move freely while looking at the markers. 
Therefore, the 3D gaze concurrences estimated by our algorithm should coincide with the 3D positions of the static markers. The top row in Figure 4(a) shows the trajectories of the gaze concurrences (solid lines) overlaid with the static marker positions (dotted lines). The mean error is 10.1cm with a standard deviation of 5.73cm. The bottom row in Figure 4(a) shows the gaze concurrences (orange and red points) with the ground truth positions (green and blue points) and the confidence regions (pink regions) where a high value of the saliency field is achieved (the region whose value is higher than 80% of the local maximum). The ground truth locations are always inside these regions.

4.2 Real World Scenes

We apply our method to reconstruct 3D gaze concurrences in three real world scenes: a meeting, a musical, and a party. Figures 4(b), 5(a), and 5(b) show the reconstructed gaze concurrences and the projections of the 3D gaze concurrences onto the head-mounted camera planes (top row). 3D renderings of the gaze concurrences (red dots) with the associated confidence regions (salient regions) are drawn in the middle row, and the cone-shaped gaze ray models are also shown. The trajectories of the gaze concurrences are shown in the bottom row. The transparency of the trajectories encodes the timing.

Meeting scene: There were 11 people forming two groups: 6 in one group and 5 in the other, as shown in Figure 4(b). The people in each group started to discuss among themselves at the beginning (2 gaze concurrences). After a few minutes, all the people faced the presenter in the middle (50th frame: 1 gaze concurrence), and then they went back to their groups to discuss again (445th frame: 2 gaze concurrences), as shown in Figure 4(b).

Musical scene: 7 audience members wore head-mounted cameras and watched the song "Summer Nights" from the musical Grease. 
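Recovering the similarity transform between the structure-from-motion frame and the motion capture frame from a corresponded trajectory is a standard closed-form alignment (an Umeyama-style absolute orientation). The sketch below illustrates one such alignment; it is a generic implementation, not the paper's exact procedure:

```python
import numpy as np

def similarity_transform(X, Y):
    """Closed-form scale/rotation/translation (s, R, t) minimizing
    ||Y - (s * R @ X + t)|| over corresponding 3D points (columns are points).
    Standard SVD-based absolute orientation with a reflection guard."""
    mx, my = X.mean(1, keepdims=True), Y.mean(1, keepdims=True)
    Xc, Yc = X - mx, Y - my
    U, S, Vt = np.linalg.svd(Yc @ Xc.T)
    D = np.diag([1.0, 1.0, np.sign(np.linalg.det(U @ Vt))])  # avoid reflections
    R = U @ D @ Vt
    s = np.trace(np.diag(S) @ D) / (Xc**2).sum()
    t = my - s * R @ mx
    return s, R, t

# Recover a known similarity transform from synthetic correspondences.
rng = np.random.default_rng(0)
X = rng.normal(size=(3, 20))
c_, s_ = np.cos(0.5), np.sin(0.5)
R_true = np.array([[c_, -s_, 0.0], [s_, c_, 0.0], [0.0, 0.0, 1.0]])
Y = 2.5 * R_true @ X + np.array([[1.0], [-2.0], [3.0]])
s, R, t = similarity_transform(X, Y)
```

Applied to the doubly reconstructed camera trajectory, this maps the structure-from-motion coordinates into the motion capture frame for comparison against the static markers.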
There were two groups of actors, “the pink ladies” (the women’s group) and “the T-birds” (the men’s group), who sang the song in alternation, as shown in Figure 5(a). In the figure, we show the reconstruction of two frames: one when the pink ladies sang (41st frame) and one when the T-birds sang (390th frame).
Party scene: There were 11 people forming 4 groups: 3 sat on couches, 3 talked to each other at the table, 3 played table tennis, and 2 played pool (178th frame: 4 gaze concurrences), as shown in Figure 5(b). Then, all moved to watch the table tennis game (710th frame: one gaze concurrence). Our method correctly locates the gaze concurrences where people look. All results are best seen in the videos on the project website (http://www.cs.cmu.edu/~hyunsoop/gaze_concurrence.html).

5 Discussion
In this paper, we present a novel representation for social scene understanding in terms of 3D gaze concurrences. We model individual gazes as a cone-shaped distribution that captures the variation of eye-in-head motion. We reconstruct the head-mounted camera poses in 3D using structure from motion and estimate the relationship between the camera pose and the gaze ray. Our mode-seeking algorithm finds multiple time-varying gaze concurrences in 3D. We show that our algorithm can accurately estimate the gaze concurrences.
When people’s gaze rays are almost parallel, as in the musical scene (Figure 5(a)), the estimated gaze concurrences become poorly conditioned: the confidence region is stretched along the direction of the primary gaze rays. This occurs when the point of regard is very far away while people look at it from almost the same vantage point. For such a scene, head-mounted cameras from different points of view can help localize the gaze concurrences precisely.
Recognizing gaze concurrences is critical to collaborative activity.
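The poor conditioning noted above for nearly parallel gaze rays can be illustrated with a small least-squares ray-intersection experiment. This sketch is not the paper's estimator; the scene layout and thresholds are purely illustrative.

```python
# Hedged sketch: the least-squares intersection of 3D rays, and how its normal
# matrix becomes near-singular as the rays become parallel (viewers standing
# close together looking at a distant point).
import numpy as np

def intersect_rays(origins, dirs):
    """Least-squares point closest to all rays p = o_i + t d_i (d_i unit)."""
    A = np.zeros((3, 3))
    b = np.zeros(3)
    for o, d in zip(origins, dirs):
        P = np.eye(3) - np.outer(d, d)  # projector orthogonal to the ray
        A += P
        b += P @ o
    return np.linalg.solve(A, b), np.linalg.cond(A)

def unit(v):
    return v / np.linalg.norm(v)

target = np.array([0.0, 0.0, 10.0])

# Well-separated viewers: rays meet at a wide angle.
wide = [np.array([-5.0, 0.0, 0.0]), np.array([5.0, 0.0, 0.0])]
p1, cond_wide = intersect_rays(wide, [unit(target - o) for o in wide])

# Viewers standing almost side by side: rays nearly parallel.
narrow = [np.array([-0.05, 0.0, 0.0]), np.array([0.05, 0.0, 0.0])]
p2, cond_narrow = intersect_rays(narrow, [unit(target - o) for o in narrow])

# Both recover the target in the noiseless case, but the narrow configuration
# is far worse conditioned, so noise stretches the estimate along the rays.
assert np.allclose(p1, target) and np.allclose(p2, target)
assert cond_narrow > 100 * cond_wide
```

The near-zero eigenvalue of the normal matrix lies along the shared ray direction, which is exactly the axis along which the confidence region stretches in the musical scene.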
A future application of this work will be to use gaze concurrences to allow artificial agents, such as robots, to become collaborative team members that recognize and respond to social cues, rather than passive tools that require prompting. The ability to objectively measure gaze concurrences in 3D will also enable new investigations into social behavior, such as group dynamics, group hierarchies, and gender interactions, and research into behavioral disorders, such as autism. We are interested in studying the spatiotemporal characteristics of the birth and death of gaze concurrences and how they relate to the groups in the scene.

(a) Quantitative result (b) Meeting scene

Figure 4: (a) Top: the solid lines (orange and red) are the trajectories of the gaze concurrences and the dotted lines (green and blue) are the ground truth marker positions. The colored bands are one standard deviation wide and are centered at the trajectory means. Bottom: there are two gaze concurrences with six people. (b) We reconstruct the gaze concurrences for the meeting scene. 11 head-mounted cameras were used to capture the scene. Top row: images with the reprojection of the gaze concurrences, middle row: rendering of the 3D gaze concurrences with cone-shaped gaze models, bottom row: the trajectories of the gaze concurrences.

(a) Musical scene (b) Party scene

Figure 5: (a) We reconstruct the gaze concurrences from the musical audience. 7 head-mounted cameras were used to capture the scene. (b) We reconstruct the gaze concurrences for the party scene. 11 head-mounted cameras were used to capture the scene. Top row: images with the reprojection of the gaze concurrences, bottom row: rendering of the 3D gaze concurrences with cone-shaped gaze models.

Acknowledgement
This work was supported by a Samsung Global Research Outreach Program, Intel ISTC-EC, NSF IIS 1029679, and NSF RI 0916272.
We thank Jessica Hodgins, Irfan Essa, and Takeo Kanade for comments and suggestions on this work.

References
[1] D. Marr. Vision: A Computational Investigation into the Human Representation and Processing of Visual Information. Phenomenology and the Cognitive Sciences, 1982.
[2] N. Snavely, M. Seitz, and R. Szeliski. Photo tourism: Exploring photo collections in 3D. TOG, 2006.
[3] R. Fergus, P. Perona, and A. Zisserman. Object class recognition by unsupervised scale-invariant learning. In CVPR, 2003.
[4] A. Gupta, S. Satkin, A. A. Efros, and M. Hebert. From scene geometry to human workspace. In CVPR, 2011.
[5] A. Vinciarelli, M. Pantic, and H. Bourlard. Social signal processing: Survey of an emerging domain. Image and Vision Computing, 2009.
[6] E. Murphy-Chutorian and M. M. Trivedi. Head pose estimation in computer vision: A survey. TPAMI, 2009.
[7] R. S. Jampel and D. X. Shi. The primary position of the eyes, the resetting saccade, and the transverse visual head plane: head movements around the cervical joints. Investigative Ophthalmology and Vision Science, 1992.
[8] R. R. Murphy. Human-robot interaction in rescue robotics. IEEE Trans. on Systems, Man and Cybernetics, 2004.
[9] S. Marks, B. Wünsche, and J. Windsor. Enhancing virtual environment-based surgical teamwork training with non-verbal communication. In GRAPP, 2009.
[10] N. Bilton. A rose-colored view may come standard: Google glass. The New York Times, April 2012.
[11] J.-G. Wang and E. Sung. Study on eye gaze estimation. IEEE Trans. on Systems, Man and Cybernetics, 2002.
[12] E. D. Guestrin and M. Eizenman. General theory of remote gaze estimation using the pupil center and corneal reflection. IEEE Trans. on Biomedical Engineering, 2006.
[13] C. Hennessey and P. Lawrence. 3D point-of-gaze estimation on a volumetric display. In ETRA, 2008.
[14] D. Li, J. Babcock, and D. J. Parkhurst. openEyes: a low-cost head-mounted eye-tracking solution. In ETRA, 2006.
[15] K. Takemura, Y. Kohashi, T. Suenaga, J. Takamatsu, and T. Ogasawara. Estimating 3D point-of-regard and visualizing gaze trajectories under natural head movements. In ETRA, 2010.
[16] N. J. Emery. The eyes have it: the neuroethology, function and evolution of social gaze. Neuroscience and Biobehavioral Reviews, 2000.
[17] A. H. Gee and R. Cipolla. Determining the gaze of faces in images. Image and Vision Computing, 1994.
[18] P. Ballard and G. C. Stockman. Controlling a computer via facial aspect. IEEE Trans. on Systems, Man and Cybernetics, 1995.
[19] R. Rae and H. J. Ritter. Recognition of human head orientation based on artificial neural networks. IEEE Trans. on Neural Networks, 1998.
[20] N. M. Robertson and I. D. Reid. Estimating gaze direction from low-resolution faces in video. In ECCV, 2006.
[21] B. Noris, K. Benmachiche, and A. G. Billard. Calibration-free eye gaze direction detection with gaussian processes. In GRAPP, 2006.
[22] S. M. Munn and J. B. Pelz. 3D point-of-regard, position and head orientation from a portable monocular video-based eye tracker. In ETRA, 2008.
[23] G. Welch and E. Foxlin. Motion tracking: no silver bullet, but a respectable arsenal. IEEE Computer Graphics and Applications, 2002.
[24] F. Pirri, M. Pizzoli, and A. Rudi. A general method for the point of regard estimation in 3D space. In CVPR, 2011.
[25] R. Stiefelhagen, M. Finke, J. Yang, and A. Waibel. From gaze to focus of attention. In VISUAL, 1999.
[26] K. Smith, S. O. Ba, J.-M. Odobez, and D. Gatica-Perez. Tracking the visual focus of attention for a varying number of wandering people. TPAMI, 2008.
[27] L. Bazzani, D. Tosato, M. Cristani, M. Farenzena, G. Pagetti, G. Menegaz, and V. Murino. Social interactions by visual focus of attention in a three-dimensional environment. Expert Systems, 2011.
[28] M. Cristani, L. Bazzani, G. Paggetti, A. Fossati, D. Tosato, A. Del Bue, G. Menegaz, and V. Murino. Social interaction discovery by statistical analysis of F-formations. In BMVC, 2011.
[29] A. Fathi, J. K. Hodgins, and J. M. Rehg. Social interaction: A first-person perspective. In CVPR, 2012.
[30] R. I. Hartley and A. Zisserman. Multiple View Geometry in Computer Vision. Cambridge University Press, 2004.
[31] T. Shiratori, H. S. Park, L. Sigal, Y. Sheikh, and J. K. Hodgins. Motion capture from body-mounted cameras. TOG, 2011.
[32] M. A. Fischler and R. C. Bolles. Random sample consensus: a paradigm for model fitting with applications to image analysis and automated cartography. Communications of the ACM, 1981.
[33] V. Lepetit, F. Moreno-Noguer, and P. Fua. EPnP: An accurate O(n) solution to the PnP problem. IJCV, 2009.
[34] H. Misslisch, D. Tweed, and T. Vilis. Neural constraints on eye motion in human eye-head saccades. Journal of Neurophysiology, 1998.
[35] E. M. Klier, H. Wang, A. G. Constantin, and J. D. Crawford. Midbrain control of three-dimensional head orientation. Science, 2002.
[36] D. E. Angelaki and B. J. M. Hess. Control of eye orientation: where does the brain's role end and the muscle's begin? European Journal of Neuroscience, 2004.
[37] K. Fukunaga and L. D. Hostetler. The estimation of the gradient of a density function, with applications in pattern recognition. IEEE Trans. on Information Theory, 1975.