Daphna Weinshall, Shimon Edelman, Heinrich Bülthoff
We demonstrate the ability of a two-layer network of thresholded summation units to support representation of 3D objects in which several distinct 2D views are stored for each object. Using unsupervised Hebbian relaxation, the network learned to recognize ten objects from different viewpoints. The training process led to the emergence of compact representations of the specific input views. When tested on novel views of the same objects, the network exhibited a substantial generalization capability. In simulated psychophysical experiments, the network's behavior was qualitatively similar to that of human subjects.
1 Background
Model-based object recognition involves, by definition, a comparison between the input image and models of different objects that are internal to the recognition system. The form in which these models are best stored depends on the kind of information available in the input, and on the trade-off between the amount of memory allocated for the storage and the degree of sophistication required of the recognition process.
In computer vision, a distinction can be made between representation schemes that use 3D object-centered coordinate systems and schemes that store viewpoint-specific information such as 2D views of objects. In principle, storing enough 2D views would
A Self-Organizing Multiple-View Representation of 3D Objects
allow the system to use simple recognition techniques such as template matching. If only a few views of each object are remembered, the system must have the capability to normalize the appearance of an input object, by carrying out appropriate geometrical transformations, before it can be directly compared to the stored representations.
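The template-matching end of this trade-off can be illustrated with a minimal sketch. It is not the model described in this paper: the function names, the use of normalized correlation as the similarity measure, the acceptance threshold of 0.9, and the toy four-element "views" are all assumptions made for illustration only.

```python
import math

def normalized_correlation(a, b):
    """Cosine similarity between two flattened views (lists of floats)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # an empty view matches nothing
    return dot / (norm_a * norm_b)

def recognize(view, stored_views, threshold=0.9):
    """Compare an input view against every stored 2D view of every object.

    stored_views maps an object label to a list of stored views.
    Returns (label, score) for the best match, or (None, score) when
    no stored view is similar enough (threshold is a hypothetical value).
    """
    best_label, best_score = None, -1.0
    for label, templates in stored_views.items():
        for template in templates:
            score = normalized_correlation(view, template)
            if score > best_score:
                best_label, best_score = label, score
    if best_score < threshold:
        return None, best_score
    return best_label, best_score

# Toy data: two stored views of one object, one stored view of another.
stored = {
    "cube": [[1, 1, 0, 0], [1, 0, 1, 0]],
    "wedge": [[0, 0, 1, 1]],
}
label, score = recognize([0.9, 1.0, 0.1, 0.0], stored)
```

With many stored views per object, a novel view is likely to fall close to one of them and the simple comparison succeeds. With few stored views, the best score drops below the threshold unless the input is first normalized by a geometrical transformation.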
What representation strategy is employed by the human visual system? The notion that objects are represented in viewpoint-dependent fashion is supported by the finding that commonplace objects are more readily recognized from certain so-called canonical vantage points than from other, random viewpoints (Palmer et al. 1981). Namely, canonical views are identified more quickly (and more accurately) than others, with response times decreasing monotonically with increasing subjective goodness.¹
The monotonic increase in the recognition latency with misorientation of the object relative to a canonical view prompts the interpretation of the recognition process in terms of a mechanism related to mental rotation. In the classical mental rotation task (see Shepard & Cooper 1982), the subject is required to decide whether two simultaneously presented images are two views of the same 3D object. The average latency of correct response in this task is linearly dependent on the difference in the 3D attitude of the object in the two images. This dependence is commonly accounted for by postulating a process that attempts to rotate the 3D shapes perceived in the two images into congruence before making the identity decision. The rotation process is sometimes claimed to be analog, in the sense that the representation of the object appears to pass through intermediate orientation stages as the rotation progresses (Shepard & Cooper 1982).
Psychological findings seem to support the involvement of some kind of mental rotation in recognition by demonstrating the dependence of recognition latency for an unfamiliar view of an object on the distance to its closest familiar view. There is, however, an important qualification. Practice with specific objects appears to cause this strategy to be abandoned in favor of a more memory-intensive, less time-consuming direct comparison strategy. Under direct comparison, many views of the objects are stored and recognition proceeds in essentially constant time, provided that the presented views are sufficiently close to one of the stored views (Tarr & Pinker 1989, Edelman et al. 1989).
From the preceding outline, it appears that a faithful model of object representation in the human visual system should provide both for the ability to "rotate" 3D objects and for the fast direct-comparison strategy that supersedes mental rotation for highly familiar objects. Surprisingly, it turns out that mental rotation in recognition can be replicated by a self-organizing memory-intensive model based on direct comparison. The rest of the present paper describes such a model, called CLF (conjunctions of localized features; see Edelman & Weinshall 1989).
¹ Canonical views of objects can be reliably identified in subjective judgement as well as in recognition tasks. For example, when asked to form a mental image of an object, people usually imagine it as seen from a canonical perspective.
INPUT (feature) LAYER