{"title": "Visual Grammars and Their Neural Nets", "book": "Advances in Neural Information Processing Systems", "page_first": 428, "page_last": 435, "abstract": null, "full_text": "Visual Grammars and their Neural Nets \n\nEric Mjolsness \n\nDepartment of Computer Science \n\nYale University \n\nNew Haven, CT 06520-2158 \n\nAbstract \n\nI exhibit a systematic way to derive neural nets for vision problems. It \ninvolves formulating a vision problem as Bayesian inference or decision \non a comprehensive model of the visual domain given by a probabilistic \ngrammar. \n\n1 \n\nINTRODUCTION \n\nI show how systematically to derive optimizing neural networks that represent quan(cid:173)\ntitative visual models and match them to data. This involves a design methodology \nwhich starts from first principles, namely a probabilistic model of a visual domain, \nand proceeds via Bayesian inference to a neural network which performs a visual \ntask. The key problem is to find probability distributions sufficiently intricate to \nmodel general visual tasks and yet tractable enough for theory. This is achieved \nby probabilistic and expressive grammars which model the image-formation pro(cid:173)\ncess, including heterogeneous sources of noise each modelled with a grammar rule. \nIn particular these grammars include a crucial \"relabelling\" rule that removes the \nundetectable internal labels (or indices) of detectable features and substitutes an \nuninformed labeling scheme used by the perceiver. \n\nThis paper is a brief summary of the contents of [Mjolsness, 1991] . \n428 \n\n\f\u2022 \n\ninstance \n\n\u2022 \n\nI \n\n\u2022 \n\u2022 \u2022 3 \n\n2 \n\nVisual Grammars and their Neural Nets \n\n429 \n\nI \u2022 \n\u2022 \n\n3 \n\n\u2022 \n\n\u2022 \n\n2 \n\n\u2022 \n\u2022 \n\n\u2022 \n\n\u2022 \n\n(unordered dots) \n\n3 \u2022 \n\n\u2022 \n\n\u2022 I \n2 \u2022 \n(permuted dots) \n\nFigure 1: Operation of random dot grammar. 
The first arrow illustrates dot placement; the next shows dot jitter; the next arrow shows the pure, un-numbered feature locations; and the final arrow is the uninformed renumbering scheme of the perceiver. \n\n2 EXAMPLE: A RANDOM-DOT GRAMMAR \n\nThe first example grammar is a generative model of pictures consisting of a number of dots (e.g. a sum of delta functions) whose relative locations are determined by one out of M stored models. But the dots are subject to unknown independent jitter and an unknown global translation, and the identities of the dots (their numerical labels) are hidden from the perceiver by a random permutation operation. For example, each model might represent an imaginary asterism of equally bright stars whose locations have been corrupted by instrument noise. One useful task would be to recognize which model generated the image data. The random-dot grammar is shown in (1). \n\n$\\Gamma_0$ (model and location): $\\mathrm{root} \\to \\mathrm{instance}(\\alpha, x)$, with $E_0(x) = \\frac{1}{2\\sigma_x^2}|x|^2$. \n\n$\\Gamma_1$ (dot locations): $\\mathrm{instance}(\\alpha, x) \\to \\{\\mathrm{dotloc}(\\alpha, m, x_m = x + u_m^\\alpha)\\}$, with $E_1(\\{x_m\\}) = -\\log \\prod_m \\delta(x_m - x - u_m^\\alpha) = \\lim_{\\sigma_\\delta \\to 0} \\frac{1}{2\\sigma_\\delta^2} \\sum_m |x_m - x - u_m^\\alpha|^2 + C(\\sigma_\\delta)$, where $\\langle u_m^\\alpha \\rangle_m = 0$. \n\n$\\Gamma_2$ (dot jitter): $\\mathrm{dotloc}(\\alpha, m, x_m) \\to \\mathrm{dot}(m, x_m')$, with $E_2(x_m') = \\frac{1}{2\\sigma_{\\mathrm{jit}}^2}|x_m' - x_m|^2$. \n\n$\\Gamma_3$ (scramble all dots): $\\{\\mathrm{dot}(m, x_m)\\} \\to \\{\\mathrm{imagedot}(x_i = \\sum_m P_{m,i} x_m)\\}$, with $E_3(\\{x_i\\}) = -\\log [\\Pr(P) \\prod_i \\delta(x_i - \\sum_m P_{m,i} x_m)]$, where $P$ is a permutation. \\qquad (1) \n\nThe final joint probability distribution for this grammar allows recognition and other problems to be posed as Bayesian inference and solved by neural network optimization of the resulting objective $E_{\\mathrm{final}}(P, \\alpha, x)$. A sum over all permutations has been approximated by the optimization over near-permutations, as usual for Mean Field Theory networks [Yuille, 1990], resulting in a neural network implementable as an analog circuit. The fact that $P$ appears only linearly in $E_{\\mathrm{final}}$ makes the optimization problems easier; it is a generalized \"assignment\" problem. \n\n2.1 APPROXIMATE NEURAL NETWORK WITHOUT MATCH VARIABLES \n\nShort of approximating a $P$ configuration sum via Mean Field Theory neural nets, there is a simpler, cheaper, less accurate approximation that we have used on matching problems similar to the model recognition problem (find $\\alpha$ and $x$) for the dot-matching grammar. Under this approximation, \n\n$\\mathrm{argmax}_{\\alpha,x} \\Pi(\\alpha, x \\mid \\{x_i\\}) \\approx \\mathrm{argmax}_{\\alpha,x} \\sum_{m,i} \\exp -\\frac{1}{T} \\left( \\frac{1}{2N\\sigma_x^2} |x|^2 + \\frac{1}{2\\sigma_{\\mathrm{jit}}^2} |x_i - x - u_m^\\alpha|^2 \\right) \\qquad (3)$ \n\nfor $T = 1$. This objective function has a simple interpretation when $\\sigma_x \\to \\infty$: it minimizes the Euclidean distance between two Gaussian-blurred images containing the $x_i$ dots and a shifted version of the $u_m$ dots respectively: \n\n$\\mathrm{argmin}_{\\alpha,x} \\int dz \\, |G * I_1(z) - G * I_2(z - x)|^2$ \n$= \\mathrm{argmin}_{\\alpha,x} \\int dz \\, |G_{\\sigma/\\sqrt{2}} * \\sum_i \\delta(z - x_i) - G_{\\sigma/\\sqrt{2}} * \\sum_m \\delta(z - x - u_m^\\alpha)|^2$ \n$= \\mathrm{argmin}_{\\alpha,x} \\left[ C_1 - 2 \\sum_{mi} \\int dz \\, \\exp -\\frac{1}{\\sigma^2} \\left( |z - x_i|^2 + |z - x - u_m^\\alpha|^2 \\right) \\right]$ \n$= \\mathrm{argmax}_{\\alpha,x} \\sum_{mi} \\exp -\\frac{1}{2\\sigma^2} |x_i - x - u_m^\\alpha|^2 \\qquad (4)$ \n\nDeterministic annealing from $T = \\infty$ down to $T = 1$, which is a good strategy for finding global maxima in equation (3), corresponds to a coarse-to-fine correlation matching algorithm: the global shift $x$ is computed by repeated local optimization while gradually decreasing the Gaussian blur parameter $\\sigma$ down to $\\sigma_{\\mathrm{jit}}$. 
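As a concrete illustration of the coarse-to-fine scheme just described, the following Python sketch (not the paper's original code; the dot counts, noise levels, and annealing schedule are assumptions chosen for illustration) runs the random-dot grammar forward to generate a permuted, jittered instance, then recovers the global shift by optimizing the permutation-free objective (3) in the flat-prior limit sigma_x -> infinity:

```python
import numpy as np

def sample_instance(u_model, sigma_x=1.0, sigma_jit=0.01, rng=None):
    """Run the grammar forward: choose a global shift (rule Gamma_0),
    place and jitter the dots (Gamma_1, Gamma_2), then hide the labels
    with a random permutation (Gamma_3)."""
    rng = np.random.default_rng() if rng is None else rng
    x = sigma_x * rng.normal(size=u_model.shape[1])
    dots = u_model + x + sigma_jit * rng.normal(size=u_model.shape)
    return x, rng.permutation(dots)

def recover_shift(x_img, u_model, sigma_jit=0.05,
                  sigma_schedule=(1.0, 0.5, 0.25, 0.1)):
    """Coarse-to-fine matching for objective (3) in the flat-prior limit
    (sigma_x -> infinity): at each blur width sigma, iterate the
    weighted-mean fixed point of
        sum_{m,i} exp(-|x_i - shift - u_m|^2 / (2 sigma^2)),
    then anneal sigma down to sigma_jit (i.e. T down to 1)."""
    shift = np.zeros(x_img.shape[1])
    for sigma in list(sigma_schedule) + [sigma_jit]:
        for _ in range(100):
            offsets = x_img[None, :, :] - u_model[:, None, :]   # (M, I, d)
            d2 = np.sum((offsets - shift) ** 2, axis=-1)
            w = np.exp(-d2 / (2 * sigma ** 2))
            new = np.sum(w[..., None] * offsets, axis=(0, 1)) / (w.sum() + 1e-300)
            done = np.linalg.norm(new - shift) < 1e-9
            shift = new
            if done:
                break
    return shift
```

Setting the gradient of the blurred objective to zero gives shift = sum_{mi} w_{mi} (x_i - u_m) / sum_{mi} w_{mi}, so each inner loop is a mean-shift-style ascent at fixed blur; the outer loop over sigma is the deterministic-annealing continuation described in the text.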
\n\nThe approximation (3) has the effect of eliminating the discrete $P_{mi}$ variables, rather than replacing them with continuous versions $V_{mi}$. The same can be said for the \"elastic net\" method [Durbin and Willshaw, 1987]. Compared to the elastic net, the present objective function is simpler, more symmetric between rows and columns, has a nicer interpretation in terms of known algorithms (correlation matching in scale space), and is expected to be less accurate. \n\n3 EXPERIMENTS IN IMAGE REGISTRATION \n\nEquation (3) is an objective function for recovering the global two-dimensional (2D) translation of a model consisting of arbitrarily placed dots, to match up with similar dots with jittered positions. We use it instead to find the best 2D rotation and horizontal translation for two images which actually differ by a horizontal 3D translation with roughly constant camera orientation. The images consist of line segments rather than single dots, some of which are missing and some of which are spurious extra data. In addition, there are strong boundary effects due to parts of the scene being translated outside the camera's field of view. The jitter is replaced by whatever positional inaccuracies come from an actual camera producing a 128 x 128 image [Williams and Hanson, 1988] which is then processed by a high-quality line-segment finding algorithm [Burns, 1986]. Better results would be expected of objective functions derived from grammars which explicitly model more of these noise processes, such as the grammars described in Section 4. \n\nWe experimented with minimizing this objective function with respect to unknown global translations and (sometimes) rotations, using the continuation method and sets of line segments derived from real images. The results are shown in Figures 2, 3 and 4. 
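A miniature version of these registration experiments can be sketched as follows (a hedged illustration, not the experiments' actual code: it uses synthetic dot sets rather than extracted line segments, and the blur schedule, iteration counts, and angle-search window are assumptions). It recovers a 2D rotation and translation by continuation on the blur width, alternating a local search over the angle with a weighted-mean update of the translation:

```python
import numpy as np

def rot(theta):
    """2-D rotation matrix."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

def blur_score(x_img, moved, sigma):
    """Correlation-in-scale-space score: sum of Gaussian affinities
    between image dots and transformed model dots (cf. equation (4))."""
    d2 = np.sum((x_img[None, :, :] - moved[:, None, :]) ** 2, axis=-1)
    return np.sum(np.exp(-d2 / (2 * sigma ** 2)))

def register(x_img, u_model, sigma_schedule=(0.5, 0.2, 0.1, 0.05)):
    """Continuation method: at each blur width sigma, alternately refine
    the rotation angle by a local grid search and the translation by the
    weighted-mean fixed point of the blurred-match objective."""
    theta, t = 0.0, np.zeros(2)
    for sigma in sigma_schedule:
        for _ in range(50):
            # local search over a small window of candidate angles
            cands = theta + np.linspace(-0.05, 0.05, 11)
            scores = [blur_score(x_img, u_model @ rot(a).T + t, sigma)
                      for a in cands]
            theta = float(cands[int(np.argmax(scores))])
            # weighted-mean update of the translation at this angle
            offsets = x_img[None, :, :] - (u_model @ rot(theta).T)[:, None, :]
            w = np.exp(-np.sum((offsets - t) ** 2, axis=-1) / (2 * sigma ** 2))
            t = np.sum(w[..., None] * offsets, axis=(0, 1)) / (w.sum() + 1e-300)
    return theta, t
```

As in the text, the coarse blur levels give a smooth, nearly unimodal objective that pulls the pose estimate into the right basin, and the fine levels lock individual dots onto their counterparts.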
\n\n4 MORE GRAMMARS \n\nGoing beyond the random-dot grammar, we have studied several grammars of increasing complexity. One can add rotation and dot deletion as new sources of noise, or introduce a two-level hierarchy, in which models are sets of clusters of dots. In [Mjolsness et al., 1991] we present a grammar for multiple curves in a single image, each of which is represented in the image as a set of dots that may be hard to group into their original curves. This grammar illustrates how flexible objects can be handled in our formalism. \n\nWe approach a modest plateau of generality by augmenting the hierarchical version of the random-dot grammar with multiple objects in a single scene. This degree of complexity is sufficient to introduce many interesting features of knowledge representation in high-level vision, such as multiple instances of a model in a scene, as well as requiring segmentation and grouping as part of the recognition process. We have shown [Mjolsness, 1991] that such a grammar can yield neural networks nearly identical to the \"Frameville\" neural networks we have previously studied as a means of mixing simple Artificial Intelligence frame systems (or semantic networks) with optimization-based neural networks. What is more, the transformation leading to Frameville is very natural: it simply pushes the permutation matrix as far back into the grammar as possible. \n\nFigure 2: A simple image registration problem. (a) Stair image. (b) Long line segments derived from stair image. (c) Two unregistered line segment images derived from two images taken from two horizontally translated viewpoints in three dimensions. The images are a pair of successive frames in an image sequence. 
(d) Registered versions of the same data: superposed long line segments extracted from two stair images (taken from viewpoints differing by a small horizontal translation in three dimensions) that have been optimally registered in two dimensions. \n\nFigure 3: Continuation method (deterministic annealing). (a) Objective function at $\\sigma = .0863$. (b) Objective function at $\\sigma = .300$. (c) Objective function at $\\sigma = .105$. (d) Objective function at $\\sigma = .0364$. \n\nFigure 4: Image sequence displacement recovery. Frame 2 is matched to frames 3-8 in the stair image sequence. Horizontal displacements are recovered. Other starting frames yield similar results except for frame 1, which was much worse. (a) Horizontal displacement recovered, assuming no 2-d rotation. Recovered displacement as a function of frame number is monotonic. (b) Horizontal displacement recovered, along with 2-d rotation, which is found to be small except for the final frame. Displacements are in qualitative agreement with (a), more so for small displacements. (c) Objective function before and after displacement is recovered (upper and lower curves), without rotation. Note the gradual decrease in dE with frame number (and hence with displacement). (d) Objective function before and after displacement is recovered (upper and lower curves), with rotation. \n\nAcknowledgements \n\nCharles Garrett performed the computer simulations and helped formulate the line-matching objective function used therein. \n\nReferences \n\n[Burns, 1986] Burns, J. B. (1986). Extracting straight lines. IEEE Trans. PAMI, 8(4):425-455. \n\n[Durbin and Willshaw, 1987] Durbin, R. and Willshaw, D. (1987). An analog approach to the travelling salesman problem using an elastic net method. Nature, 326:689-691.
 \n\n[Mjolsness, 1991] Mjolsness, E. (1991). Bayesian inference on visual grammars by neural nets that optimize. Technical Report YALEU/DCS/TR-854, Yale University Department of Computer Science. \n\n[Mjolsness et al., 1991] Mjolsness, E., Rangarajan, A., and Garrett, C. (1991). A neural net for reconstruction of multiple curves with a visual grammar. In International Joint Conference on Neural Networks, Seattle. \n\n[Williams and Hanson, 1988] Williams, L. R. and Hanson, A. R. (1988). Translating optical flow into token matches and depth from looming. In Second International Conference on Computer Vision, pages 441-448. (Source of the staircase test image sequence.) \n\n[Yuille, 1990] Yuille, A. L. (1990). Generalized deformable models, statistical physics, and matching problems. Neural Computation, 2(1):1-24.", "award": [], "sourceid": 499, "authors": [{"given_name": "Eric", "family_name": "Mjolsness", "institution": null}]}