Part of Advances in Neural Information Processing Systems 9 (NIPS 1996)

*Elie Bienenstock, Stuart Geman, Daniel Potter*

Images are ambiguous at each of many levels of a contextual hi(cid:173) erarchy. Nevertheless, the high-level interpretation of most scenes is unambiguous, as evidenced by the superior performance of hu(cid:173) mans. This observation argues for global vision models, such as de(cid:173) formable templates. Unfortunately, such models are computation(cid:173) ally intractable for unconstrained problems. We propose a composi(cid:173) tional model in which primitives are recursively composed, subject to syntactic restrictions, to form tree-structured objects and object groupings. Ambiguity is propagated up the hierarchy in the form of multiple interpretations, which are later resolved by a Bayesian, equivalently minimum-description-Iength, cost functional.

1 Bayesian decision theory and compositionaiity

In his Essay on Probability, Laplace (1812) devotes a short chapter-his "Sixth Principle" -to what we call today the Bayesian decision rule. Laplace observes that we interpret a "regular combination," e.g., an arrangement of objects that displays some particular symmetry, as having resulted from a "regular cause" rather than arisen by chance. It is not, he argues, that a symmetric configuration is less likely to happen by chance than another arrangement. Rather, it is that among all possible combinations, which are equally favored by chance, there are very few of the regular type: "On a table we see letters arranged in this order, Constantinople, and we judge that this arrangement is not the result of chance, not because it is less possible than the others, for if this word were not employed in any language

Compositionality, MDL Priors, and Object Recognition

839

we should not suspect it came from any particular cause, but this word being in use amongst us, it is incomparably more probable that some person has thus arranged the aforesaid letters than that this arrangement is due to chance." In this example, regularity is not a mathematical symmetry. Rather, it is a convention shared among language users, whereby Constantinople is a word, whereas Jpctneolnosant, a string containing the same letters but arranged in a random order, is not.

Central in Laplace's argument is the observation that the number of words in the language is smaller, indeed "incomparably" smaller, than the number of possible arrangements of letters. Indeed, if the collection of 14-letter words in a language made up, say, half of all 14-letter strings- a rich language indeed-we would, upon seeing the string Constantinople on the table, be far less inclined to deem it a word, and far more inclined to accept it as a possible coincidence. The sparseness of al(cid:173) lowed combinations can be observed at all linguistic articulations (phonetic-syllabic, syllabic-lexical, lexical-syntactic, syntactic-pragmatic, to use broadly defined levels), and may be viewed as a form of redundancy-by analogy to error-correcting codes. This redundancy was likely devised by evolution to ensure efficient communication in spite of the ambiguity of elementary speech signals. The hierarchical composi(cid:173) tional structure of natural visual scenes can also be thought of as redundant: the rules that govern the composition of edge elements into object boundaries, of in(cid:173) tensities into surfaces etc., all the way to the assembly of 2-D projections of named objects, amount to a collection of drastic combinatorial restrictions. Arguably, this is why in all but a few-generally hand-crafted-cases, natural images have a unique high-level interpretation in spite of pervasive low-level ambiguity-this being amply demonstrated by the performances of our brains.

In sum, compositionality appears to be a fundamental aspect of cognition (see also von der Malsburg 1981, 1987; Fodor and Pylyshyn 1988; Bienenstock, 1991, 1994, 1996; Bienenstock and Geman 1995). We propose here to account for mental com(cid:173) putation in general and scene interpretation in particular in terms of elementary composition operations, and describe a mathematical framework that we have de(cid:173) veloped to this effect. The present description is a cursory one, and some notions are illustrated on two simple examples rather than formally defined-for a detailed account, see Geman et al. (1996), Potter (1997). The binary-image example refers to an N x N array of binary-valued pixels, while the Laplace-Table example refers to a one-dimensional array of length N, where each position can be filled with one of the 26 letters of the alphabet or remain blank.

2 Labels and composition rules

The objects operated upon are denoted Wi, i = 1,2, ... , k. Each composite object W carries a label, I = L(w), and the list of its constituents, (Wt,W2,·· .). These uniquely determine w, so we write W = I (WI, W2, .•. ) . A scene S is a collection of primitive objects. In the binary-image case, a scene S consists of a collection of black pixels in the N x N array. All these primitives carry the same label, L(w) = p (for "Point"), and a parameter 7r(w) which is the position in the image. In Laplace's Table, a scene S consists of an arrangement of characters on the table. There are 26 primitive labels, "A", "B" , ... , "Z" , and the parameter of a primitive W is its position 1 ~ 7r(w) ~ N (all primitives in such a scene must have different positions).

An example of a composite W in the binary-image case is an arrangement composed

840

E. Bienenstock. S. Geman and D. Potter

of a black pixel at any position except on the rightmost column and another black pixel to the immediate right of the first one. The label is "Horizontal Linelet," denoted L(w) = hl, and there are N(N - 1) possible horizontallinelets. Another non-primitive label, "Vertical Linelet," or vl, is defined analogously. An example of a composite W for Laplace's Table is an arrangement of 14 neighboring primi(cid:173) tives carrying the labels "G", "0", "N", "S", ... , "E" in that order, wherever that arrangement will fit. We then have L(w) = Ganstantinople, and there are N - 13 possible Constantinople objects.

The composition rule for label type 1 consists of a binding junction, B" and a set of allowed binding-function values, or binding support, S,: denoting by 0 the set of all objects in the model, we have, for any WI, ' .. ,Wk E 0, B, (WI. ... ,Wk) E s, ¢:} l(WI"" ,Wk) E O. In the binary-image example, Bhl(WI,W2) = Bv,(WI,W2) = (L(WI),L(W2),7I'(W2) -7I'(WI)), Sh' = {(P,p,(I,O))} and Sv' = {(p,p,(O,I))} define the hl- and vl-composition rules, p+p -+ hl and p+p -+ vl. In Laplace's Table, G + 0+· .. + E -+ Ganstantinpole is an example of a 14-ary composition rule, where we must check the label and position of each constituent. One way to define the binding function and support for this rule is: B(WI, ' " ,WI4) = (L(WI),' " ,L(WI4), 71'(W2) - 71'(Wt} , 71'(W3) - 71'(WI),"', 71'(W14) - 71'(WI)) and S = (G,"', E, 1,2"",13).

We now introduce recursive labels and composition rules: the label of the composite object is identical to the label of one or more of its constituents, and the rule may be applied an arbitrary number of times, to yield objects of arbitrary complexity. In the binary-image case, we use a recursive label c, for Curve, and an associated binding function which creates objects of the form hl + p -+ c, vl + p -+ c, c + p -+ c, p + hl -+ c, p + vl -+ c, p + c -+ c, and c + c -+ c. The reader may easily fill in the details, i.e., define a binding function and binding support which result in "c" -objects being precisely curves in the image, where a curve is of length at In the previous examples, primitives were least 3 and may be self-intersecting. composed into compositions; here compositions are further composed into more complex compositions. In general, an object W is a labeled tree, where each vertex carries the name of an object, and each leaf is associated with a primitive (the association is not necessarily one-to-one, as in the case of a self-intersecting curve).

Let M be a model-Le., a collection of labels with their binding functions and binding supports-and 0 the set of all objects in M . We say that object W E o covers S if S is precisely the set of primitives that make up w's leaves. An interpretation I of S is any finite collection of objects in 0 such that the union of the sets of primitives they cover is S. We use the convention that, for all M and S, 10 denotes the trivial interpretation, defined as the collection of (unbound) primitives in S. In most cases of interest, a model M will allow many interpretations for a scene S . For instance, given a long curve in the binary-image model, there will be many ways to recursively construct a "c"-labeled tree that covers exactly that curve.

3 The MDL formulation

In Laplace's Table, a scene consisting of the string Constantinople admits, in addition to 10 , the interpretation II = {WI}, where WI is a "G anstantinople" - object. We wish to define a probability distribution D on interpretations such that D(I1 ) » D(Io), in order to realize Laplace's "incomparably more probable". Our

Compositionality, MDL Priors, and Object Recognition

841

definition of D will be motivated by the following use of the Minimum Description Length (MDL) principle (Rissanen 1989). Consider a scene S and pretend we want to transmit S as quickly as possible through a noiseless channel, hence we seek to encode it as efficiently as possible, i.e., with the shortest possible binary code c. We can always use the trivial interpretation 10: the codeword c(Io) is a mere list of n locations in S. We need not specify labels, since there is only one primitive label in this example. The length, or cost, of this code for S is Ic(Io)1 = nlog2 (N 2 ).

Now however we want to take advantage of regularities, in the sense of Laplace, that we expect to be present in S. We are specifically interested in compositional regularities, where some arrangements that occur more frequently than by chance can be interpreted advantageously using an appropriate compositional model M. Interpretation I is advantageous if Ic(I)1 < Ic(Io)l. An example in the binary-image case is a linelet scene S. The trivial encoding of this scene costs us Ic(Io)1 = 2[log2 3+ log2(N2)] bits, whereas the cost of the compositional interpretation II = {wI} is Ic(Idl = log2 3+log2 (N(N -1)), where WI is an hI or vl object, as the case may be. The first log23 bits encode the label L(WI) E {p, hi, vi}, and the rest encodes the position in the image. The compositional {p, hl, vl} model is therefore advantageous for a linelet scene, since It affords us a gain in encoding cost of about 2log2 N bits. In general, the gain realized by encoding {w} = {I (WI, W2)} instead of {WI, W2} may be viewed as a binding energy, measuring the affinity that WI and W2 exhibit for each other as they assemble into w. This binding energy is c, = IC(WI)I + IC(W2)1 - I c( I (WI, W2) ) I, and an efficient M is one that contains judiciously chosen cost-saving composition rules. In effect, if, say, linelets were very rare, we would be better off with the trivial model. The inclusion of non-primitive labels would force us to add at least one bit to the code of every object-to specify its label-and this would increase the average encoding cost, since the infrequent use of non-primitive labels would not balance the extra small cost incurred on primitives. In practical applications, the construction of a sound M is no trivial issue. Note however the simple rationale for including a rule such as p + p --7 hl. Giving ourselves the label hi renders redundant the independent encoding of the positions of horizontally adjacent pixels. In general, a good model should allow one to hierarchically compose with each other frequently occurring arrangements of objects.

This use of MDL leads in a straightforward way to an equivalent Bayesian formula(cid:173) tion. Setting P'(w) = 2- lc(w)lj L:w'EO 2- lc(w')I yields a probability distribution P' on n for which c is approximately a Shannon code (Cover and Thomas 1991). With this definition, the decision to include the label hl-or the label Con8tantinople(cid:173) would be viewed, in principle, as a statement about the prior probability of finding horizontal linelets-or Constantinople strings-in the scene to be interpreted.

4 The observable-measure formulation

The MDL formulation however has a number of shortcomings; in particular, com(cid:173) puting the binding energy for composite objects can be problematic. We outline now an alternative approach (Geman et al. 1996, Potter 1997), where a probabil(cid:173) ity distribution P(w) on n is defined through a family of observable measures Q,. These measures assign probabilities to each possible binding-function value, s E S" and also to the primitives. We require L:'EM L:sEsr Q,(8) = 1, where the notion of binding function has been extended to primitives via Bprim (w) = 7r(w) for primitive

842

E. Bienenstoc/c, S. Geman and D. Potter

w. The probabilities induced on 0 by Q, are given by P(w) = Qprim(Bprim(w)) for a primitive w, and P(w) = Q,(B,(WI,W2))P2(WI,W2IB,(WI,W2)) for a composite object w = l(wI, W2).1 Here p 2 = P X P is the product probability, i.e., the free, or not-bound, distribution for the pair (WI, W2) E 0 2. For instance, with C + ... + E -? Canstantinople, p 14 (WI,W2,'" ,w14IBcons ... (W1, ... ,W14) = (C, 0,···,13)) is the conditional probability of observing a particular string Constantinople, under the free distribution, given that (WI, ... , W14) constitutes such a string. With the rea(cid:173) sonable assumption that, under Q, primitives are uniformly distributed over the table, this conditional probability is simply the inverse of the number of possible Constantinople strings, Le., 1/(N - 13).

The binding energy, defined, by analogy to the MDL approach, as [, = log2(P(w)/(P(wdP(w2))), now becomes [, = log2(P x P(B'(Wl,W2)))' Finally, if I is the collection of all finite interpretations / c 0, we define the probability of / E I as D(/) = IIwElP(w)/Z, with Z = L:I'EI IIwEl'P(w), Thus, the probability of an interpretation containing several free objects is obtained by assuming that these objects occurred in the scene independently of each other. Given a scene S, recognition is formulated as the task of maximizing D over all the l's in I that are interpretations of S.

log2(Q,(B,(wI,w2))) -

We now illustrate the use of D on our two examples. In the binary-image example with model M = {p, hi, vi}, we use a parameter q, 0 ~ q ~ 1, to adjust the prior probability of linelets. Thus, Qprim(Bprim(W)) = (1 - q)/N2 for primitives, and Qh'«P,p,O, 1)) = Qv'«P,p, 1,0)) = q/2 for linelets. It is easily seen that regardless of the normalizing constant Z, the binding energy of two adjacent pixels into a linelet is [h' = [v, = log2(q/2) - log2[{lNf N(N - 1)]. Interestingly, as long as q =1= 0 and q =1= 1, the binding energy, for large N, is approximately 2log2 N, which is independent of q. Thus, the linelet interpretation is "incomparably" more likely than the independent occurrence of two primitives at neighboring positions. We leave it to the reader to construct a prior P for the model {p, hl, vI, c}, e.g. by distributing the Q-mass evenly between all composition rules. Finally, in Laplace's Table, if there are M equally likely non-primitive labels-say city names-and q is their total mass, the binding energy for Constantinople is [Cons ... = log2 M(r! -13) - log2[~~.&]14, and the "regular" cause is again "incomparably" more likely.

There are several advantages to this reformulation from codewords into probabilities using the Q-parameters. First, the Q-parameters can in principle be adjusted to better account for a particular world of images. Second, we get an explicit formula for the binding energy, (namely log2 (Q / P x P)). Of course, we need to evaluate the product probability P x P, and this can be highly non-trivial-one approach is through sampling, as demonstrated in Potter (1997). Fi~ally, this formulation is well-suited for parameter estimation: the Q's, which are the parameters of the distribution P, are indeed observables, Le., directly available empirically.

5 Concluding remarks

The approach described here was applied by X. Xing to the recognition of "on(cid:173) line" handwritten characters, using a binary-image-type model as above, enriched

IThis is actually an implicit definition. Under reasonable conditions, it is well defined(cid:173)

see Geman et al. (1996).

Compositionality, MDL Priors, and Object Recognition

843

with higher-level labels including curved lines, straight lines, angles, crossings, T(cid:173) junctions, L-junctions {right angles}, and the 26 letters of the alphabet. In such a model, the search for an optimal solution cannot be done exhaustively. We ex(cid:173) perimented with a number of strategies, including a two-step algorithm which first generates all possible objects in the scene, and then selects the "best" objects, Le., the objects with highest total binding energy, using a greedy method, to yield a final scene interpretation. (The total binding energy of W is the sum of the binding ener(cid:173) gies £, over all the composition rules I used in the composition of w. Equivalently, the total binding energy is the log-likelihood ratio log2{P{w}/IIi P{Wi)), where the product is taken over all the primitives Wi covered by w.}

The first step of the algorithm typically results in high-level objects partly over(cid:173) lapping on the set of primitives they cover, i.e., competing for the interpretation of shared primitives. Ambiguity is thus propagated in a "bottom-up" fashion. The ambiguity is resolved in the second "top-down" pass, when high-level composition rules are used to select the best compositions, at all levels including the lower ones. A detailed account of our experiments will be given elsewhere. We found the re(cid:173) sults quite encouraging, particularly in view of the potential scope of the approach. In effect, we believe that this approach is in principle capable of addressing unre(cid:173) stricted vision problems, where images are typically very ambiguous at lower levels for a variety of reasons-including occlusion and mutual overlap of objects-hence purely bottom-up segmentation is impractical.

Turning now to biological implications, note that dynamic binding in the nervous system has been a subject of intensive research and debate in the last decade. Most interesting in the present context is the suggestion, first clearly articulated by von der Malsburg {1981}, that composition may be performed thanks to a dual mech(cid:173) anism of accurate synchronization of spiking activity-not necessarily relying on periodic firing-and fast reversible synaptic plasticity. If there is some neurobio(cid:173) logical truth to the model described in the present paper, the binding mechanism proposed by von der Malsburg would appear to be an attractive implementation. In effect, the use of fine temporal structure of neural activity opens up a large realm of possible high-order codes in networks of neurons.

In the present model, constituents always bind in the service of a new object, an operation one may refer to as triangular binding. Composite objects can engage in further composition, thus giving rise to arbitrarily deep tree-structured constructs. Physiological evidence of triangular binding in the visual system can be found in Sil(cid:173) lito et al. {1994}; Damasio {1989} describes an approach derived from neuroanatom(cid:173) ical data and lesion studies that is largely consistent with the formalism described here.

An important requirement for the neural representation of the tree-structured ob(cid:173) jects used in our model is that the doing and undoing of links operating on some constituents, say Wi and W2, while affecting in some useful way the high-order pat(cid:173) terns that represent these objects, leaves these patterns, as representations of Wi and W2, intact. A family of biologically plausible patterns that would appear to satisfy this requirement is provided by synfire patterns {Abeles 1991}. We hypothesized elsewhere {Bienenstock 1991, 1994, 1996} that synfire chains could be dynamically bound via weak synaptic couplings; such couplings would synchronize the wave-like activities of two synfire chains, in much the same way as coupled oscillators lock

844

E. Bienenstock, S. Geman and D. Potter

their phases. Recursiveness of compositionality could, in principle, arise from the further binding of these composite structures.

Acknow ledgements

Supported by the Army Research Office (DAAL03-92-G-0115), the National Science Foundation (DMS-9217655), and the Office of Naval Research (N00014-96-1-0647).

Do not remove: This comment is monitored to verify that the site is working properly