{"title": "An ideal observer model for identifying the reference frame of objects", "book": "Advances in Neural Information Processing Systems", "page_first": 514, "page_last": 522, "abstract": "The object people perceive in an image can depend on its orientation relative to the scene it is in (its reference frame). For example, the images of the symbols $\\times$ and $+$ differ by a 45 degree rotation. Although real scenes have multiple images and reference frames, psychologists have focused on scenes with only one reference frame. We propose an ideal observer model based on nonparametric Bayesian statistics for inferring the number of reference frames in a scene and their parameters. When an ambiguous image could be assigned to two conflicting reference frames, the model predicts two factors should influence the reference frame inferred for the image: The image should be more likely to share the reference frame of the closer object ({\\em proximity}) and it should be more likely to share the reference frame containing the most objects ({\\em alignment}). We confirm people use both cues using a novel methodology that allows for easy testing of human reference frame inference.", "full_text": "An ideal observer model for identifying\n\nthe reference frame of objects\n\nJoseph L. Austerweil\n\nDepartment of Psychology\n\nUniversity of California, Berkeley\n\nBerkeley, CA 94720\n\nDepartment of Computer Science and Engineering\n\nAbram L. Friesen\n\nUniversity of Washington\n\nSeattle, WA 98195\n\nJoseph.Austerweil@gmail.com\n\nafriesen@cs.washington.edu\n\nThomas L. Grif\ufb01ths\n\nDepartment of Psychology\n\nUniversity of California, Berkeley\n\nBerkeley, CA 94720\n\nTom Griffiths@berkeley.edu\n\nAbstract\n\nThe object people perceive in an image can depend on its orientation relative to\nthe scene it is in (its reference frame). For example, the images of the symbols\n\u00d7 and + differ by a 45 degree rotation. 
Although real scenes have multiple im-\nages and reference frames, psychologists have focused on scenes with only one\nreference frame. We propose an ideal observer model based on nonparametric\nBayesian statistics for inferring the number of reference frames in a scene and\ntheir parameters. When an ambiguous image could be assigned to two con\ufb02icting\nreference frames, the model predicts two factors should in\ufb02uence the reference\nframe inferred for the image: The image should be more likely to share the refer-\nence frame of the closer object (proximity) and it should be more likely to share\nthe reference frame containing the most objects (alignment). We con\ufb01rm people\nuse both cues using a novel methodology that allows for easy testing of human\nreference frame inference.\n\n1\n\nIntroduction\n\nWhen are the objects in two images the same?1 Although people recognize and categorize objects\nsuccessfully and effortlessly, object recognition in machine learning is an incredibly dif\ufb01cult prob-\nlem and people\u2019s success is a puzzle to cognitive scientists. To solve this problem, object recognition\ntechniques typically generate a set of features using a prede\ufb01ned procedure (e.g., SIFT descriptors\n[1] or textons [2]) or learn features (e.g., deep belief networks [3]) from the images. The general\ngoal of these methods is to extract features from images that are useful for identifying the objects\nthat generated the images after whatever transformations occurred while producing them (e.g., view-\npoint changes). This is a sensible strategy given that people typically perceive the same object even\nwhen it is transformed in its image (e.g., translations). However, not all transformations should\nbe ignored: The perceived identity of some objects depends on the orientation of its features with\nrespect to the scene it is in (e.g., \u00d7 vs. 
+ differ only in orientation), but for other objects it does\n\n1In this paper, we use the following terminology for scene, image, and object. The entire visual input of an\nobserver is a scene. A scene contains a set of images. An image is a part of the visual input that is generated\nby a single object, which is ambiguous as two or more objects could generate the same image. An object is the\nitem in the world that generates an image in the visual input.\n\n1\n\n\fnot. Developing proper object recognition and fully understanding how people do it depend on\nexplaining how people determine the orientation of objects with respect to the scene they are in.\n\nThe importance of orientation for object recognition leads us to the following question: If two objects\nproject to the same image under different viewing conditions (e.g., + and \u00d7 after 45 degree\nrotations), how do people infer which object is in the image? In psychology, there are two main\ntheories for how people solve this problem: the invariant feature hypothesis [4], which is essentially\nthe strategy taken by current object recognition techniques (use features that preserve object identity\nover the possible transformations that generate images of the object), and the reference frame\nhypothesis, which posits that objects are embedded in coordinate axes [5]. The coordinate axes set the\norientation and scale of the objects, and thus + and \u00d7 can be identi\ufb01ed as different objects. Though\nthey may produce the same image, they will have different coordinate axes.\n\nIn some situations the orientation of an image\u2019s reference frame is simply the orientation of the\nretina; however, this is not the case when we rotate our heads (as our retinal image rotates) or\nlook at a rotated object (e.g., a person lying on a bench or a document rotated on a desk). Thus,\nthe reference frame of an image is ambiguous without additional information. 
However, if there\nis another object in the scene whose orientation is unambiguous (like a 5), then the orientation of\nthe ambiguous image can be inferred.2 We demonstrate that people use the orientation of other\nimages in the scene to determine the orientation of an ambiguous image by asking participants to\nsolve arithmetic problems, where the operator image is ambiguous and the two numbers \ufb02anking the\noperator are either oriented upright or rotated 45 degrees. The solution people adopt is indicative of\nthe reference frame they inferred for the operator (multiplication implies an upright reference frame\nand addition implies a diagonal reference frame). This is a novel experimental method that allows\nus to explore reference frame inference in a wide range of contexts.\n\nIn real life, we typically view scenes with multiple reference frames. For example, some books\non a bookshelf might be upright, other books could be tilted diagonally (for support), while other\nbooks might lie \ufb02at. Yet there has been little work investigating how people infer the number of\nreference frames, their orientations, and which images belong to each reference frame. To solve this\nproblem, we note that each image in a scene belongs to a single reference frame, and thus reference\nframes form a partition of the images in a scene (where each block in the partition corresponds to a\nreference frame). Using a standard nonparametric Bayesian model for partitions, we formulate an\nideal observer model to infer multiple reference frames and their parameters. The model predicts\nthat people should be sensitive to two cues when inferring the reference frames of a scene: the\nproximity of the ambiguous image to two unambiguous \ufb02anking images in con\ufb02icting orientations,\nand the difference in the number of objects aligned in the competing reference frames. 
We con\ufb01rm\npeople are sensitive to both cues using the novel method described above.\n\nThe article is organized as follows. First, Section 2 summarizes relevant psychological research\non how orientation affects the objects perceived in ambiguous images. Next, Section 3 develops a\nnovel method for online testing of the reference frame people infer for an image and establishes\nits ef\ufb01cacy. Section 4 presents an ideal observer model for reference frame inference in scenes\nwith multiple reference frames. The model predicts that the ambiguous image\u2019s proximity to other\nreference frames should affect the inferred reference frame, and Section 5 con\ufb01rms that people act in\naccordance with this prediction in a behavioral experiment. The model also predicts that the number\nof aligned objects in a reference frame should affect the reference frame inferred for an ambiguous\nimage. Section 6 con\ufb01rms this prediction in a behavioral experiment. Section 7 concludes the paper\nand highlights some directions for future research.\n\n2 Orientation in psychological theories of object representation\n\nThough the perceived object of some images does not depend on its orientation (like a 5), there are\nmany examples where the perceived object does depend on its orientation [7, 8], including + vs. \u00d7\nor a square vs. a diamond, and other effects of orientation on object recognition [9, 10]. This has led\npsychologists to believe that people represent objects within a reference frame (a set of coordinate\naxes).3 Figure 1 (a) shows that reference frames predict the image + is interpreted as a + when\n\n2We view the ambiguity of a reference frame as essentially the same as the strength of the intrinsic axes [6].\n3Though coordinate axes have other properties (e.g., scale), we focus on orientation in this article.\n\n2\n\n\f[Figure 1 graphic omitted: panels (a)-(d); see caption.]\n\nFigure 1: Reference frames. 
(a) The ambiguity of the + image can be resolved using reference\nframes: a + with horizontal orientation (solid axes) or a \u00d7 rotated 45 degrees (dashed axes). (b)\nOther images are unambiguous, like a 5. (c) The reference frame of ambiguous objects is in\ufb02uenced\nby objects with unambiguous reference frames. (d) The group of objects is seen as either all + or\nall \u00d7, but not some + and some \u00d7. This establishes one reference frame per group.\n\nthe coordinate axes are aligned with the document\u2019s axes and as \u00d7 when the coordinate axes are\ndiagonal to the document\u2019s axes. For objects that are rotationally invariant, there is only one object\nthat generates the observed image and so it is identi\ufb01able in any orientation (see Figure 1 (b)). The\ndependence of object perception on orientation is a well established norm and has been demonstrated\nwith novel and familiar 2-D objects, faces, handwriting [8, 9], and 3-D objects [10, 11].\n\nCentral to the reference frame hypothesis is the ability of our perceptual system to infer a reference\nframe for a given image. As more than one reference frame may be consistent with an observed\nimage, psychologists have explored how people infer the appropriate reference frame for an image.\nThough reference frame inference is strongly in\ufb02uenced by the top-down axis of the retinal image\nand by the axis of gravity (given by our proprioceptive and vestibular senses) [8], the scene itself\ncan in\ufb02uence the inferred reference frame. Objects grouped together in the world tend to be affected\nby the same transformation when they generate images (e.g., the text on a poster as the poster is\nrotated), and so it is sensible that the inferred reference frame for an ambiguous image is in\ufb02uenced\nby the orientations of the images surrounding it. 
Figures 1 (c) and (d) are phenomenological demon-\nstrations of how the alignment of the orientations of other objects in a scene can bias the inferred\nreference frame for an image whose reference frame is ambiguous (and there is strong corroborating\nempirical evidence for this principle [12, 13]).4 Figure 1 (c) is biased towards being interpreted as\n\u00d7 based on the surrounding context and the images in Figure 1 (d) are interpreted as either all + or\nall tilted \u00d7, but it is dif\ufb01cult to interpret some as + and others as tilted \u00d7 simultaneously [14]. Thus,\nthere is one reference frame shared by all the objects in a group.\n\nAlthough there is a wealth of research into reference frame inference for scenes containing a single\nreference frame, to the best of our knowledge, there has not been any research into how people de-\ntermine the reference frame of ambiguously oriented images when there is more than one reference\nframe in the scene (and both are consistent with the images). Before exploring what cues in\ufb02u-\nence human reference frame inference in scenes with multiple reference frames, we develop a novel\nmethod for testing human reference frame inference.\n\n3 Testing reference frame inference using arithmetic\n\nTo test how different factors in\ufb02uence the reference frame people infer for an image, we ask people\nto solve an arithmetic problem without specifying the appropriate operation. If people view \u00d7 and\ntheir response is the multiplication answer, then their reference frame for \u00d7 is aligned with the\nhorizontal and vertical axes of the page. 
Alternatively, if people view the same \u00d7, but their response\nis the addition answer, then their reference frame for \u00d7 is aligned with the axes diagonal to the page\n(and thus, relative to its own reference frame, it is treated as +).5 We use this new method instead of\nprevious techniques (e.g., explicitly asking for the image\u2019s orientation and recording how frequently\neach orientation, compatible or con\ufb02icting with the tested hypothesis, is chosen [15]) due to its\nability to be used in a wide range of contexts and to demonstrate the robust importance of reference\n\n4We use slightly different terminology than previous work has done and refer to this principle as alignment\n\nrather than symmetry to avoid the ambiguity in the word symmetry (i.e., which symmetry we are referring to).\n\n5Although we use + and \u00d7 as the ambiguous images, this method works with any ambiguous images by\n\nteaching the participant to use addition in one orientation of the image and multiplication in the other.\n\n3\n\n\f[Figure 2 graphic omitted: panels (a)-(d) showing Axis Oriented and Diagonal Oriented stimuli and response-frequency histograms; see caption.]\n\nFigure 2: Effect of the orientations of other objects in the same reference frame. (a) 5s aligned\nwith axes implies that the operator is \u00d7. (b) 5s aligned with diagonal implies the operator is +\nat a diagonal orientation. (c) Frequency of answers to (a) given by participants. Most participants\nrespond with 25, the solution to the product of 5 and 5, meaning their reference frame is aligned\nwith the axes of the page. (d) Frequency of answers to (b) given by participants. Most participants\nrespond with 10, meaning their reference frame is aligned with the diagonals of the page.\n\nframe inference on a seemingly unrelated cognitive behavior (solving an arithmetic problem). 
We\ncon\ufb01rm its validity by reproducing a previously found effect \u2013 the in\ufb02uence of the orientations of\nother images in the scene [12].\n\nWhen the reference frame for an image is ambiguous, one factor that in\ufb02uences the inferred reference\nframe is the orientation of other images it is grouped with, especially when those images are\nidenti\ufb01able in any orientation. Thus, if we ask people to solve an arithmetic problem, where the\noperator \u00d7 is paired with the numbers 5 aligned with the top-down axes of the page (Figure 2 (a)),\nthey should respond 25, the result of multiplication. Alternatively, if people solve the same problem\nexcept the numbers 5 are aligned diagonally, they should infer the diagonal axes to be the reference\nframe and respond 10, the result of addition (Figure 2 (b)).\n\nTo test this method, we recruited 20 participants online, who answered one arithmetic problem in\nexchange for a small monetary reward. The participants were counterbalanced over the axis-oriented\nand diagonally oriented conditions (Figures 2 (a) and (b), respectively) and all participants gave either\nthe addition (10) or multiplication (25) solution. By changing the orientation of the numbers, the\nsolutions to the arithmetic problems given by participants in Figures 2 (a) and (b) are different\ndespite having identical numbers and the identical operator image. Figures 2 (c) and (d) show that\nthe responses of two groups of participants who answered the arithmetic problem in (a) and (b)\ndiffered as predicted (\u03c72(1) = 5.208, p < 0.05, using Yates\u2019 chi-square correction). 
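The reported statistic can be checked with a Yates-corrected chi-square test of independence on the 2x2 table of condition by response. A minimal Python sketch (not the authors' analysis code; the counts below are hypothetical, chosen only to illustrate the computation, since the raw response counts are not given here):

```python
def yates_chi_square(table):
    """Chi-square statistic for a 2x2 contingency table with Yates' continuity correction."""
    (a, b), (c, d) = table
    n = a + b + c + d
    num = n * max(abs(a * d - b * c) - n / 2.0, 0.0) ** 2
    den = (a + b) * (c + d) * (a + c) * (b + d)
    return num / den

# Hypothetical counts for 20 participants (rows: axis vs. diagonal condition;
# columns: multiplication "25" vs. addition "10" responses). Not the actual data.
chi2 = yates_chi_square([[9, 1], [3, 7]])
```

For real analyses, scipy.stats.chi2_contingency applies the same continuity correction to 2x2 tables by default.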
Thus, asking\nparticipants to solve arithmetic problems is an effective method for testing reference frame inference\nand perceived orientations can in\ufb02uence higher level cognition.\n\n4 Modeling reference frame inference\n\nBefore describing our model of reference frame inference with multiple reference frames, we \ufb01rst\npresent a probabilistic model for scenes of multiple images with only a single reference frame.\n\n4.1 Reference frame inference for scenes with one reference frame\n\nWe assume that a vocabulary of possible objects of size V is known ahead of time and that there\nare R possible rotations. Each scene (e.g., Figure 2 (a) is one scene) consists of a set of images\n(e.g., 5, \u00d7, and 5 are the images of Figure 2 (a)). For each image i in a scene, the model is given\nits visual properties yi and its spatial location xi = (xi1, xi2). The visual properties of the image yi\nare generated by an unknown object vi rotated by r, the orientation of the scene\u2019s reference frame.\nA V \u00d7 R binary image-object alignment matrix A(i) encodes the object-rotation pairs consistent\nwith the observed image yi such that A(i)(v, r) = 1 if the image of object v rotated r degrees\nis consistent with yi. The model assumes that the spatial locations of the images are independent\nand identically distributed draws from a Gaussian distribution with shared parameters \u00b5, the center point\nfor the reference frame, and \u03a3, the spread of objects around its center point. The unobserved\nobjects and the orientation of the reference frame r are drawn from independent discrete distributions\n\n4\n\n\fwith parameters \u03c6 and \u03b8, the priors over objects and reference frame orientations, respectively. 
The\nfollowing generative model de\ufb01nes our statistical model:\n\nr | \u03b8 \u223c Discrete(\u03b8)\nvi | \u03c6 iid\u223c Discrete(\u03c6)\nxi | \u00b5, \u03a3 iid\u223c Gaussian(\u00b5, \u03a3)\nP(yi | vi, r) = A(i)(vi, r)\n\nIf the model assumes there are three types of objects (5, +, and \u00d7) and two possible rotations (0 and\n45 degrees), the model captures the sensitivity of participants in the demonstration (Figure 2). In\nFigure 2 (a), the 5s are oriented at 0 degrees. A(5, r) is only non-zero when r = 0 because no other\nobject can produce an image consistent with the observed image of the 5. r = 0 implies that the\noperator is \u00d7, which is consistent with participant responses (Figure 2 (c)). When the 5s are oriented\nat 45 degrees (Figure 2 (b)), A(5, r) is only non-zero when r = 45 for the same reason as before.\nr = 45 implies that the operator is +, which is consistent with participant responses (Figure 2 (d)).\n\n4.2 Extending the model for scenes with multiple reference frames\n\nAlthough the model de\ufb01ned in the previous section succeeds in inferring the reference frame of an\nambiguous image using other images it is grouped with, it cannot handle scenes containing multiple\nreference frames, such as the scenes in Figure 3. We extend the model by partitioning the images of\na scene into reference frames, where each image of the scene belongs to exactly one reference frame\nand a reference frame is a block of the partition. From this perspective, inferring multiple reference\nframes for a scene of images is equivalent to partitioning the scene or clustering the images.\n\nWith the insight that grouping images into reference frames is like \ufb01nding a partition of a scene, we\ncan extend our model to select the reference frames of a scene (with an unknown number of reference\nframes). 
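Marginalizing out the unknown objects, the single-reference-frame model above yields a posterior over the shared rotation: P(r | scene) is proportional to theta_r times, for each image i, sum_v phi_v A(i)(v, r). A minimal sketch of this computation in Python (not the authors' code; the vocabulary and rotation encodings follow the running 5/+/x example and are illustrative):

```python
import numpy as np

# Toy vocabulary and rotation set from the running example (hypothetical encoding).
OBJECTS = ["5", "+", "x"]
ROTATIONS = [0, 45]

def alignment_matrix(consistent_pairs):
    """Binary V x R matrix A with A[v, r] = 1 iff object v rotated by r matches the image."""
    A = np.zeros((len(OBJECTS), len(ROTATIONS)))
    for obj, rot in consistent_pairs:
        A[OBJECTS.index(obj), ROTATIONS.index(rot)] = 1.0
    return A

def rotation_posterior(scene, theta=None, phi=None):
    """P(r | scene): prior theta over rotations times, per image, the object-marginal likelihood."""
    theta = np.full(len(ROTATIONS), 1.0 / len(ROTATIONS)) if theta is None else theta
    phi = np.full(len(OBJECTS), 1.0 / len(OBJECTS)) if phi is None else phi
    post = theta.copy()
    for A in scene:  # multiply in sum_v phi_v * A[v, r] for each image
        post = post * (phi @ A)
    return post / post.sum()

# Figure 2 (a): two upright 5s flanking an ambiguous x-shaped operator image.
five_upright = alignment_matrix([("5", 0)])          # a 5 is only consistent with r = 0
x_shaped = alignment_matrix([("x", 0), ("+", 45)])   # x at 0 degrees, or + rotated 45 degrees
posterior = rotation_posterior([five_upright, x_shaped, five_upright])
# All posterior mass falls on r = 0, so the operator is read as multiplication.
```

Replacing the upright 5s with diagonally oriented 5s moves all of the posterior mass to r = 45, reproducing the flip from the multiplication to the addition reading.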
First, we generate a partition of the images in the scene from the Chinese restaurant process\n(CRP) [16] with parameter \u03b1, an exchangeable distribution over partitions. The CRP is de\ufb01ned\nthrough the following sequential construction:\n\nP(ci = k | c1, . . . , ci\u22121) = nk / (\u03b1 + i \u2212 1) for k \u2264 K, and \u03b1 / (\u03b1 + i \u2212 1) for k = K + 1\n\nwhere K is the current number of reference frames and nk is the number of objects assigned to\nreference frame k. ci denotes the reference frame that object i is assigned to and if ci = K + 1, it\nis assigned a new reference frame containing none of the previous objects and K increments by one\n(to initialize, the \ufb01rst object starts its own reference frame and K = 1). This gives us an assignment\nvector c, where ci = j denotes that reference frame j contains image i. Each block in the partition\n(reference frame) j is associated with a rotation rj and is embedded in the spatial layout of the\nscene with a center position \u00b5j and spread \u03a3j (each of which is generated from a Gaussian-Inverse-Wishart distribution with shared parameters). Thus, we have de\ufb01ned the following generative model\nfor a set of images in a scene:\n\nc | \u03b1 \u223c CRP(\u03b1)\nrj | \u03b8 iid\u223c Discrete(\u03b8)\n\u00b5j, \u03a3j | \u00b50, \u03a30, k0, \u03bd0 iid\u223c GIW(\u00b50, \u03a30, k0, \u03bd0)\nvi | \u03c6 iid\u223c Discrete(\u03c6)\nxi | ci, \u00b5ci, \u03a3ci \u223c Gaussian(\u00b5ci, \u03a3ci)\nP(yi | vi, rci, ci) = A(i)(vi, rci)\n\nwhere GIW signi\ufb01es the Gaussian-Inverse-Wishart distribution, and \u03b1, \u00b50, \u03a30, k0, \u03bd0, \u03b8, and \u03c6 are\nhyperparameters of our model.\n\nWe use Gibbs sampling for inference [17], which gives us the cluster assignments for each image\nand the updated parameters \u03c8j = (\u00b5j, \u03a3j, rj) for each cluster j. We begin by assigning each\nimage to its own reference frame and then iterating. 
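For intuition, the sequential construction of the CRP translates directly into code. A minimal Python sketch of drawing a partition from the prior (hypothetical helper names, not the authors' implementation):

```python
import random

def sample_crp_partition(n, alpha, rng=None):
    """Draw reference-frame assignments for n images from a CRP with concentration alpha."""
    rng = rng or random.Random(0)
    assignments, counts = [], []
    for i in range(n):
        # Existing frame k is chosen with probability n_k / (alpha + i); a brand-new
        # frame with probability alpha / (alpha + i), where i images have been seen so far.
        weights = counts + [alpha]
        k = rng.choices(range(len(weights)), weights=weights)[0]
        if k == len(counts):
            counts.append(0)  # open a new reference frame
        counts[k] += 1
        assignments.append(k)
    return assignments

frames = sample_crp_partition(10, alpha=0.5)
```

With small alpha nearly all images end up sharing one reference frame, while larger alpha favors more frames, matching alpha's role as a concentration parameter.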
For each observed image, we resample ci\nfrom the set of existing clusters and m = 2 newly drawn clusters. After all ci values have been\nresampled, we discard any empty clusters and update the parameters of the remaining clusters by\ndrawing them from their posterior distribution given the objects assigned to that reference frame,\np(\u03c8j | {xi, yi : ci = j}), where {xi, yi : ci = j} is the set of images and their locations in reference\nframe j.\n\n4.3 Predictions for human reference frame inference\n\nWhat factors in\ufb02uence the reference frame assigned to an ambiguous image according to our ideal\nobserver model? Two factors it predicts should in\ufb02uence the image\u2019s inferred reference frame are\n\n5\n\n\f[Figure 3 graphic omitted: four panels showing Operator Position -2, -1, 1, and 2; see caption.]\n\nFigure 3: Trials from Experiment 1 showing the possible positions of the operators for the main\nfactor of the experiment. Other factors randomized over trials are the numbers in the problem\n(always single digits), which of the two numbers was rotated, the diagonal that the numbers and\noperator are aligned on (positive diagonal shown in the \ufb01gure, but numbers and operator aligned on\nthe negative diagonal as well), and the rotation of the operator.\n\n[Figure 4 graphic omitted: (a) participant responses and (b) model predictions for the proximity experiment, plotting percent grouped left against operator position; see caption.]\n\nFigure 4: Proximity effects: (a) Human results and (b) Model results. 
The closer the operator is to\nthe left number, the more likely it is to take the left number\u2019s orientation.\n\nproximity, or how close the image is to unambiguous images (as images in the same reference frame\nare coupled in spatial location), and alignment, or the difference in the number of images assigned\nto each reference frame. The general paradigm we use to test the predictions is to have the + or \u00d7\noperator \ufb02anked on each side by a number, with the two numbers in different orientations (see examples in Figure 3).\nIt is clear that each of the two numbers should have its own reference frame, but it is ambiguous which\nreference frame the operator should be assigned to. We compare how each of these factors in\ufb02uences\nthe reference frames inferred in the scene by people and our model in two behavioral experiments.\n\n5 Experiment 1: Proximity effects on reference frame inference\n\nWhen the reference frame for an image is ambiguous and there are two con\ufb02icting neighboring\nreference frames, our model predicts that proximity, or the distance of the ambiguous image to the\ntwo con\ufb02icting reference frames, should affect the reference frame adopted by the ambiguous image.\nWe explore this question using the method presented above: participants are asked to solve\nan arithmetic problem in which the operator is ambiguous between + and \u00d7 and the two numbers have\ncon\ufb02icting reference frames (orientations). This allows us to deduce the reference frame inferred for\nthe operator image from the answer given by participants. We manipulate proximity by changing\nthe location of the operator such that it is closer to one of the two numbers, as shown in Figure 3.\n\n5.1 Methods\n\nA total of 134 participants completed the experiment online through Amazon Mechanical Turk in\nexchange for $0.20 USD. Four participants did not give a correct solution to the arithmetic problem\n(neither the addition nor the multiplication solution), leaving 130 participants for analysis. 
Participants\nwere asked to maximize their window before answering the arithmetic problem. All factors were\nmanipulated between subjects as preliminary testing demonstrated a strong effect of trial order on\nthe selected reference frame (probably because reference frames rarely change in the world).\n\n6\n\n\fThe primary factor of interest in the experiment was the position of the operator, scored from -2\n(far to the left) to 2 (far to the right), which was counterbalanced over participants (without the\n0 position). The problem was viewed through a simulated aperture (to minimize the effect of the\nmonitor\u2019s reference frame). See Figure 3 for example trials with the operator in each position.\nThere were several other factors that were randomized over participants: the numbers in the problem\n(randomly chosen single digit numbers), which number was rotated (left or right), the diagonal that\nthe numbers and operator were aligned on (positive diagonal, as shown in Figure 3, or negative\ndiagonal), and the rotation of the operator (+ or \u00d7).\n\n5.2 Results and Discussion\n\nFigure 4 (a) shows that participants are more likely to infer the orientation of the left number for the\noperator the closer it is to the left number. The results con\ufb01rm our hypothesis: the closer the operator\nis to an image with an unambiguous reference frame, the more likely participants are to infer that\nreference frame for the operator (\u03c72(1) = 3.99, p < 0.05 for -2 vs. 2). A probit regression analysis\ncorroborates this result as the regression coef\ufb01cient is signi\ufb01cantly different from zero (p < 0.05).\nThe model results were generated using Gibbs sampling (as previously described) and are shown in\nFigure 4 (b). For each trial, we ran the sampler for 50 burn-in iterations, recorded 750 samples,\nand then thinned the samples by keeping every 5th sample. 
This left 150 samples that formed\nour estimate of the proportion of times the operator grouped with the left reference frame. The\nparameters were initialized to: \u03b1 = 0.001, \u00b50 = [264.7, 261.94], \u03a30 = 1000I (scenes are 550\u00d7550\npixels with the bottom-left corner as origin), where I is the identity matrix, k0 = 0.2, and \u03bd0 = 110.\nThe discrete distributions encoding the priors on objects and orientations, \u03c6 and \u03b8, were uniform\nover all V and R possibilities. The model and human results clearly exhibit the same qualitative\nbehavior: As the distance between the operator and the left number decreased, the probability the\noperator took the orientation of the left number increased.\n\n6 Experiment 2: Alignment effects on reference frame inference\n\nOur model also predicts that the difference in the number of unambiguous images assigned to the\ncon\ufb02icting reference frames should affect the reference frame adopted by the operator image. In this\nexperiment, we test the prediction using the same method as above, but manipulate the number of\nextra oriented unambiguous objects in each of the competing reference frames (see Figure 5 (a)).\n\n6.1 Methods\n\nA total of 80 people participated online through Amazon Mechanical Turk in exchange for $0.20\nUSD. There were 12 participants who gave an incorrect answer, leaving 68 participants for analysis.\nThe instructions and design were identical to the previous experiment, except that there were two\nextra factors manipulating the context of the left and right numbers (5 on the left and 1 on the right,\nor vice versa) and there were only two operator positions (-2 and 2). 
Figure 5 (a) illustrates example\ntrials of the context manipulations for the operator in position -2.\n\n6.2 Results and Discussion\n\nFigure 5 (b) shows that participants were more likely to infer the operator\u2019s orientation to be the\norientation of whichever side had more objects and was closer to the operator, replicating the effect of Experiment\n1 (\u03c72(1) = 12.8728, p < 0.0005). Model results were generated using the same procedure and parameter values as Experiment 1 (except \u03bd0 = 10, to account for the increased number of objects) and\nFigure 5 (c) shows their similarity to the participant results.\n\n7 Conclusions and future directions\n\nIn this paper, we introduced the \ufb01rst study of how people infer the reference frame of images in\nscenes with multiple reference frames. We presented an implicit method for testing reference frame\ninference, an ideal observer model that predicts people should be sensitive to two scene cues, and\n\n7\n\n\f[Figure 5 graphic omitted: (a) example 5L1R and 1L5R stimuli; (b) participant and (c) model responses, plotting percent grouped left at operator positions -2 and 2; see caption.]\n\nFigure 5: Alignment effects. The operator is more likely to take the orientation of the side with more\nobjects. 5L1R denotes \ufb01ve objects in the left reference frame and one object in the right, and 1L5R\nindicates the opposite arrangement. (a) Example stimuli, (b) Human results, and (c) Model results.\n\nbehavioral evidence supporting its predictions. 
Because the objects people perceive depend on the\norientation of their images in the scene, these results improve our understanding of how the con\ufb01g-\nuration of objects in scenes affects object perception.\n\nWe plan to extend our model to capture other cues identi\ufb01ed by perceptual psychologists. A \ufb01rst\nstep is to include the bias towards using the up-down axis of the input image [8] by using a non-\nuniform distribution over rotations (estimating \u03b8). We can capture the elongation cue (that the\norientation of the spread of images in a scene biases the orientation of the reference frame of the\nimages in the scene [5]) by coupling the covariance matrix (\u03a3) and rotation (r) of a reference frame.\nCurrently, our model assumes the positions of images in a reference frame are Gaussian distributed;\nhowever, people have strong expectations about the arrangement of images in a scene [18]. We plan\nto compare people\u2019s bias to a sophisticated scene segmentation model [19]. We are also interested\nin cues that depend on the structure of the images or the orientation of the agent in the world, like\naxes of symmetry [5] or gravitational axes [8].\n\nAnother direction for future work is to address an assumption of the model: How do people learn the\nset of objects and whether or not those objects are orientation-invariant? A potential solution is to\ncombine our model with previous work that presented a nonparametric Bayesian model for learning\nfeatures and the transformations they are allowed to undergo [20]. 
Hopefully, incorporating our model into this feature learning method will yield better inferred features and, in turn, will help create better feature generation and object recognition techniques by providing a better understanding of how people perceive objects from ambiguous image data.

Finally, we plan to explore how the presented principles scale to more realistic scenes, with objects more complex than + and × and with more orientations. Our paradigm provides a principled starting point for investigating how reference frames are identified in scenes with multiple reference frames. It is easily extended to more complex scenes by associating different orientations (or rotations in depth) of an ambiguous image with different arithmetic operators. Our hope is that this leads to a better understanding of object identification and reference frame identification.

Acknowledgements We thank Karen Schloss, Stephen Palmer, Anna Rafferty, David Whitney, and the Computational Cognitive Science Lab at Berkeley for discussions, and AFOSR grant FA-9550-10-1-0232 for support.

References

[1] D. G. Lowe. Object recognition from local scale-invariant features. In Proceedings of the International Conference on Computer Vision, volume 2, pages 1150–1157, 1999.

[2] D. R. Martin, C. C. Fowlkes, and J. Malik. Learning to detect natural image boundaries using local brightness, color, and texture cues. IEEE Transactions on Pattern Analysis and Machine Intelligence, 26(5):530–549, 2005.

[3] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313:504–507, 2006.

[4] O. G. Selfridge and U. Neisser. Pattern recognition by machine. In Computers and Thought, pages 235–267. McGraw-Hill, New York, 1963.

[5] S. E. Palmer. Reference frames in the perception of shape and orientation. In Object Perception: Structure and Process, pages 121–163.
Lawrence Erlbaum Associates, Hillsdale, NJ, 1989.

[6] M. Wiser. The role of intrinsic axes in shape recognition. In Proceedings of the Third Annual Meeting of the Cognitive Science Society, pages 184–186, San Mateo, CA, 1981. Morgan Kaufmann.

[7] E. Mach. The Analysis of Sensations. Open Court, Chicago, 1914/1959.

[8] I. Rock. Orientation and Form. Academic Press, New York, 1973.

[9] P. Jolicoeur. The time to name disoriented natural objects. Memory & Cognition, 13:289–303, 1985.

[10] M. J. Tarr, P. Williams, W. G. Hayward, and I. Gauthier. Three-dimensional object recognition is viewpoint dependent. Nature Neuroscience, 1(4):275–277, 1998.

[11] I. Rock, J. DiVita, and R. Barbeito. The effect on form perception of change of orientation in the third dimension. Journal of Experimental Psychology: Human Perception and Performance, 7:719–732, 1981.

[12] S. E. Palmer. What makes triangles point: Local and global effects in configurations of ambiguous triangles. Cognitive Psychology, 12:285–305, 1980.

[13] S. E. Palmer. The role of symmetry in shape perception. Acta Psychologica, 59:67–90, 1985.

[14] F. Attneave. Triangles as ambiguous figures. American Journal of Psychology, 81:447–453, 1968.

[15] S. E. Palmer and N. M. Bucher. Configural effects in perceived pointing of ambiguous triangles. Journal of Experimental Psychology: Human Perception and Performance, 7(1):88–114, 1981.

[16] J. Pitman. Combinatorial Stochastic Processes. Notes for the Saint Flour Summer School, 2002.

[17] R. M. Neal. Markov chain sampling methods for Dirichlet process mixture models. Journal of Computational and Graphical Statistics, 9:249–265, 2000.

[18] S. E. Palmer. Vision Science. MIT Press, Cambridge, MA, 1999.

[19] E. Sudderth and M. I. Jordan. Shared segmentation of natural scenes using dependent Pitman-Yor processes. In D. Koller, D. Schuurmans, Y. Bengio, and L.
Bottou, editors, Advances in Neural Information Processing Systems 21, pages 1585–1592, 2009.

[20] J. L. Austerweil and T. L. Griffiths. Learning invariant features using the transformed Indian buffet process. In J. Lafferty, C. K. I. Williams, J. Shawe-Taylor, R. S. Zemel, and A. Culotta, editors, Advances in Neural Information Processing Systems 23, pages 82–90, 2010.

", "award": [], "sourceid": 373, "authors": [{"given_name": "Joseph", "family_name": "Austerweil", "institution": null}, {"given_name": "Abram", "family_name": "Friesen", "institution": null}, {"given_name": "Thomas", "family_name": "Griffiths", "institution": null}]}