{"title": "Learning in Compositional Hierarchies: Inducing the Structure of Objects from Data", "book": "Advances in Neural Information Processing Systems", "page_first": 285, "page_last": 292, "abstract": null, "full_text": "Learning in Compositional Hierarchies: \n\nInducing the Structure of Objects from Data \n\nJoachim Utans \n\nOregon Graduate Institute \n\nDepartment of Computer Science and Engineering \n\nP.O. Box 91000 \n\nPortland, OR 97291-1000 \n\nutans@cse.ogi.edu \n\nAbstract \n\nI propose a learning algorithm for learning hierarchical models for ob(cid:173)\nject recognition. The model architecture is a compositional hierarchy \nthat represents part-whole relationships: parts are described in the lo(cid:173)\ncal context of substructures of the object. The focus of this report is \ninducing the structure of \nlearning hierarchical models from data, i.e. \nmodel prototypes from observed exemplars of an object. At each node \nin the hierarchy, a probability distribution governing its parameters must \nbe learned. The connections between nodes reflects the structure of the \nobject. The formulation of substructures is encouraged such that their \nparts become conditionally independent. The resulting model can be \ninterpreted as a Bayesian Belief Network and also is in many respects \nsimilar to the stochastic visual grammar described by Mjolsness. \n\n1 INTRODUCTION \n\nModel-based object recognition solves the problem of invariant recognition by relying on \nstored prototypes at unit scale positioned at the origin of an object-centered coordinate \nsystem. Elastic matching techniques are used to find a correspondence between features of \nthe stored model and the data and can also compute the parameters of the transformation the \nobserved instance has undergone relative to the stored model. 
An example is the TRAFFIC system (Zemel, Mozer and Hinton, 1990) or the Frameville system (Mjolsness, Gindi and Anandan, 1989; Gindi, Mjolsness and Anandan, 1991; Utans, 1992). \n\n285 \n\n\f286 \n\nUtans \n\nFigure 1: Example of a compositional hierarchy. The simple figure can be represented as a hierarchical composition of parts. The hierarchy can be represented as a graph (a tree in this case). Nodes represent parts and edges represent the structural relationship. Nodes at the bottom represent individual parts of the object; nodes at higher levels denote more complex substructures. The single node at the top of the tree represents the entire object. \n\nFrameville stores models as compositional hierarchies and by matching at each level in the hierarchy reduces the combinatorics of the match. \n\nThe attractive feature of feed-forward neural networks for object recognition is the relative ease with which their parameters can be learned from training data. Multilayer feed-forward networks are typically trained on input/output pairs (supervised learning) and thus are tuned to recognize instances of objects as seen during training. Difficulties arise if the observed object appears at a different position in the input image, is scaled or rotated, or has been subject to distortions. Some of these problems can be overcome by suitable preprocessing or judicious choice of features. Other possibilities are weight sharing (LeCun, Boser, Denker, Henderson, Howard, Hubbard and Jackel, 1989) or invariant distance measures (Simard, LeCun and Denker, 1993). 
\n\nFew attempts have been reported in the neural network literature to learn the prototype models for model-based recognition from data. For example, the Frameville system uses hand-designed models. However, models learned from data and reflecting the statistics of the data should be superior to the hand-designed models used previously. Segen (1988a; 1988b) reports an approach to learning structural descriptions where features are clustered into substructures using a Minimum Description Length (MDL) criterion to obtain a sparse representation. Saund (1993) has proposed an algorithm for constructing tree representations with multiple \"causes\" where observed data is accounted for by multiple substructures at higher levels in the hierarchy. Ueda and Suzuki (1993) have developed an algorithm for learning models from shape contours using multiscale convex/concave structure matching to find a prototype shape typical for exemplars from a given class. \n\n2 LEARNING COMPOSITIONAL HIERARCHIES \n\nThe algorithm described here merges parts by means of grouping variables to form substructures. The model architecture is a compositional hierarchy, i.e. a part-whole hierarchy (an example is shown in Figure 1). A prominent advocate of such models has been Marr (1982) and models of this type are used in the Frameville system (Mjolsness et al., 1989; Gindi et al., 1991; Utans, 1992). \n\nFigure 2: Examples of different compositional hierarchies for the same object (the digit 9 on a seven-segment LED display). One model emphasizes the parallel lines making up the square in the top part of the figure while for another model angles are chosen as intermediate substructures. The example on the right shows a hierarchy that \"reuses\" parts. \n\nThe nodes in the graph represent parts and substructures, the arcs describe the structure of the object. The arcs can be regarded as \"part-of\" or \"ina\" relationships (similar to the notion used in semantic networks). At each node a probability density for part parameters such as position, size and orientation is stored. \n\nThe model represents a typical prototype object at unit scale in an object-centered coordinate system. Parameters of parts are specified relative to parameters of the parent node in the hierarchy. Substructures thus provide a local context for their parts and decouple their parts from other parts and substructures in the model. The advantages of this representation are sparseness, invariance with respect to viewpoint transformations and the ability to model local deformations. In addition, the model explicitly represents the structure of an object and emphasizes the importance of structure for recognition (Cooper, 1989). \n\nLearning requires estimating the parameters of the distributions at each node (the mean and variance in the case of Gaussians) and finding the structure of the model. The emphasis in this report is on learning structure from exemplars. The parameterization of substructures may differ from that of the parts at the lowest level and may require more parameters as the substructures themselves become more complex. The representation as a compositional hierarchy can avoid overfitting since at higher levels in the hierarchy more exemplars are available for parameter estimation due to the grouping of parts (Omohundro, 1991). \n\n2.1 Structure and Conditional Independence: Bayesian Networks \n\nIn what way should substructures be allocated? 
Figure 2 shows examples of different compositional hierarchies for the same object (the digit 9 on a seven-segment LED display). One model emphasizes the parallel lines making up the square in the top part of the figure while for another model angles are chosen as intermediate substructures. It is not clear which of these models to choose. \n\nThe important benefit of a hierarchical representation of structure is that parts belonging to different substructures become decoupled, i.e. they are assigned to a different local context. The problem of constructing structured descriptions of data that reflect this independence relationship has been studied previously in the field of Machine Learning (see (Pearl, 1988) for a comprehensive introduction). The resulting models are Bayesian Belief Networks. Central to the idea of Bayesian Networks is the assumption that objects can be regarded as being composed of components that only sparsely interact, and the network captures the probabilistic dependency of these components. The network can be represented as an interaction graph augmented with conditional probabilities. The structure of the graph represents the dependence of variables, i.e. dependent variables are connected with an arc. \n\nFigure 3: Bayesian Networks and conditional independence (see text). \n\nFigure 4: The model architecture. Circles denote the grouping variables ina (here a possible valid model after learning is shown). \n\nThe strength of the dependence is expressed as a forward conditional probability. Conditional independence is represented by the absence of an arc between two nodes and leads to the sparseness of the model. \n\nThe notion of conditional independence in the context studied here manifests itself as follows. By just observing two parts in the image, one must assume that they, i.e. their parameters such as position, are dependent and must be modeled using their joint distribution. 
However, if one knows that these two parts are grouped to form a substructure, then, knowing the parameters of the substructure, the parts become conditionally independent, namely conditioned on the parameters of the substructure. Thus, the internal nodes representing the substructures summarize the interaction of their child nodes. The correlation between the child nodes is summarized in the parent node and what remains is, for example, independent noise in observed instances of the child nodes. \n\nThe probability of observing an instance can be calculated from the model by starting at the root node and multiplying with the conditional probabilities of nodes traversed until the leaf nodes are reached. For example, given the graph in Figure 3, the joint distribution can be factored as \n\nP(x1, y1, y2, z1, z2, z3, z4) = P(x1) P(y1|x1) P(y2|x1) P(z1|y1) P(z2|y1) P(z3|y2) P(z4|y2)   (1) \n\n(note that the hidden nodes are treated just like the nodes corresponding to observable parts). \n\nNote that the stochastic visual grammar described by Mjolsness (1991) is equivalent to this model. The model used there is a stochastic forward (generative) model where each level of the compositional hierarchy corresponds to a stochastic production rule that generates nodes in the next lower level. The distribution of parameters at the next lower level is conditioned on the parameters of the parent node. Thus, the model obtained from constructing a Bayesian network is equivalent to the stochastic grammar if the network is constrained to a directed acyclic graph (DAG). \n\nIf all the nodes of the network correspond to observable events, techniques exist for finding the structure of the Bayesian Network and estimating its parameters (Pearl, 1988) (see also (Cooper and Herskovits, 1992)). 
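The tree factorization of equation (1) is straightforward to compute. The following sketch evaluates the log of the joint density for the graph of Figure 3 by summing one conditional term per arc; the Gaussian conditionals of the form child = parent + offset + noise, and all offsets, variances, and node values below, are illustrative assumptions, not parameters from the paper.

```python
import math

# Hypothetical tree for Figure 3: x1 is the root, y1 and y2 are hidden
# substructures, z1..z4 are observable parts.  Each arc carries an offset
# and a noise variance; a child is Gaussian around (parent + offset).
PARENT = {"y1": "x1", "y2": "x1", "z1": "y1", "z2": "y1", "z3": "y2", "z4": "y2"}
OFFSET = {"y1": -1.0, "y2": 1.0, "z1": -0.5, "z2": 0.5, "z3": -0.5, "z4": 0.5}
VAR = {"x1": 1.0, "y1": 0.2, "y2": 0.2, "z1": 0.1, "z2": 0.1, "z3": 0.1, "z4": 0.1}
ROOT_MEAN = 0.0

def log_gauss(x, mean, var):
    """Log-density of a scalar Gaussian."""
    return -0.5 * math.log(2.0 * math.pi * var) - 0.5 * (x - mean) ** 2 / var

def log_joint(vals):
    """log P(x1, y1, y2, z1..z4), factored along the tree as in eq. (1)."""
    total = log_gauss(vals["x1"], ROOT_MEAN, VAR["x1"])      # P(x1)
    for node, parent in PARENT.items():                       # P(node | parent)
        total += log_gauss(vals[node], vals[parent] + OFFSET[node], VAR[node])
    return total

vals = {"x1": 0.0, "y1": -1.0, "y2": 1.0,
        "z1": -1.5, "z2": -0.5, "z3": 0.5, "z4": 1.5}
print(log_joint(vals))
```

Moving any leaf away from the value its parent predicts lowers the log-density, which is exactly the locality the factorization encodes.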
However, for the hierarchical models considered here, only the nodes at the lowest layer (the leaves of the tree) correspond to observable instances of parts of the object in the training data. The learning algorithm must induce hidden, unobservable substructures. That is, it is assumed that the observables are \"caused\" by internal nodes not directly accessible. These are represented as nodes in the network just like the observables and their parameters must be estimated as well. See (Pearl, 1988) for an extensive discussion and examples of this idea. \n\nLearning Bayesian networks is a hard problem when the network contains hidden nodes, but a construction algorithm exists if it is known that the data is in fact tree-decomposable (Pearl, 1988). The method is based on computing the correlations rho between child nodes and on constraints on the correlation coefficients dictated by a particular structure. The entire tree can be constructed recursively using this method. Here, the case of Normal-distributed real-valued random variables is of interest: \n\np(x1, ..., xn) = (1 / ((2 pi)^(n/2) sqrt(det Sigma))) exp(-(1/2) (x - mu)^T Sigma^(-1) (x - mu))   (2) \n\nwhere x = (x1, x2, ..., xn) with mean mu = E{x} and covariance matrix Sigma = E{(x - mu)(x - mu)^T}. The method is based on a condition under which a set of random variables is star-decomposable. The question one asks is whether a set of n random variables can be represented as the marginal distribution of n + 1 variables x1, ..., xn, w such that the x1, ..., xn are conditionally independent given w, i.e. \n\np(x1, ..., xn) = Integral p(x1, ..., xn, w) dw   (3) \n\np(x1, ..., xn | w) = p(x1|w) p(x2|w) ... p(xn|w)   (4) \n\nIn the graph representation of the Bayesian Network, w is the central node relating the x1, ..., xn, hence the name star-decomposable. 
In the general case of n variables this is hard to verify, but a result by Xu and Pearl (1987) is available for 3 variables: a necessary and sufficient condition for 3 random variables with a joint normal distribution to be star-decomposable is that the pairwise correlation coefficients satisfy the triangle inequality \n\nrho_jk >= rho_ji rho_ik   (5) \n\nfor all i, j, k in {1, 2, 3} and i != j != k. Equality holds if node w coincides with node i. For the lowest level of the hierarchy, nodes j and k represent parts and node i = w represents the common substructure. \n\n2.2 An Objective Function for Grouping Parts \n\nThe algorithm proposed here is based on \"soft\" grouping by means of grouping variables ina where both the grouping variables and the parameter estimates are updated concurrently. The learning algorithms described in (Pearl, 1988) incrementally construct a Bayesian network, and decisions made at early stages cannot be reversed. It is hoped that the method proposed here is more robust with regard to inaccuracies of the estimates. However, if the true distribution is not a star-decomposable normal distribution it can only be approximated. \n\nLet ina_ij be a binary variable associated with the arc connecting node i and node j; ina_ij = 1 if the arc is present in the network (ina is the adjacency matrix of the graph describing the structure of the model). The model architecture is restricted to a compositional hierarchy (a departure from the more general structure of a Bayesian Network), i.e. nodes are preassigned to levels of the hierarchy (see Figure 4). 
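The triangle condition of equation (5) can be checked directly from estimated correlation coefficients. A minimal sketch follows; the two correlation matrices are illustrative assumptions, the first consistent with a star and the second a valid correlation matrix that violates the inequality.

```python
from itertools import permutations

def star_decomposable(rho):
    """Check the triangle condition of equation (5) for three jointly
    normal variables: rho[j][k] >= rho[j][i] * rho[i][k] for all distinct
    i, j, k.  `rho` is a symmetric 3x3 correlation matrix."""
    return all(rho[j][k] >= rho[j][i] * rho[i][k] - 1e-12
               for i, j, k in permutations(range(3)))

# Correlations consistent with a star: rho_12 factors as rho_10 * rho_02
# (equality holds, i.e. the central node w coincides with variable 0).
star = [[1.0, 0.6, 0.5],
        [0.6, 1.0, 0.3],
        [0.5, 0.3, 1.0]]

# Violates the inequality (rho_02 = 0 < rho_01 * rho_12 = 0.49),
# so no single hidden w can explain these correlations.
not_star = [[1.0, 0.7, 0.0],
            [0.7, 1.0, 0.7],
            [0.0, 0.7, 1.0]]

print(star_decomposable(star), star_decomposable(not_star))  # True False
```

In practice the rho values would be sample correlations estimated from training exemplars, so a small tolerance (as above) is needed when testing the inequality.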
Based on the condition in equation (5), a cost function term for the grouping variables ina is \n\nE_rho = Sum_{w, j, k != j} ina_wj ina_wk (rho_wj rho_wk - rho_jk)^2   (6) \n\nThe term penalizes the grouping of two part nodes to the same parent if the term in parentheses is large (j and k index part nodes, w indexes nodes at the next higher level in the hierarchy). The ina_wj can be regarded as assignment variables that assign child nodes j to parent nodes w. The parameters at each node and the assignment variables ina are estimated using an EM algorithm (Dempster, Laird and Rubin, 1977; Utans, 1993; Yuille, Stolorz and Utans, 1994). For the details of the implementation of grouping with match networks see (Mjolsness et al., 1989; Mjolsness, 1991; Gindi et al., 1991; Utans, 1992; Utans, 1994). \n\nAt each node, for each parameter, a probability distribution is stored. Nodes at the lowest level of the hierarchy represent parts in the input data. For the Gaussian distributions used here for all nodes, the parameters are the mean mu and the variance sigma and can be estimated from data. Each part node can potentially be grouped to any substructure at the next higher level in the hierarchy. The parameters of the distributions at this level are estimated from data as well, but using the current value of the grouping variables ina_ij to weight the contribution from each part node. Because each child node j can have only one parent node i, an additional constraint for a unique assignment is Sum_w ina_wj = 1. \n\n3 AN EXAMPLE \n\nInitial simulations of the proposed algorithm were performed using a hierarchical model for dot clusters. The training data was generated using the three-level model shown in Figure 5. Each node is parameterized by its position (x, y). The node at the top level represents the entire dot cluster. At the intermediate level nodes represent subcluster centers. 
The leaf nodes at the lowest level represent individual dots that are output by the model and observed in the image. The top level node represents the position of the entire cluster. At each level l + 1, stored offsets d_ij^(l+1) are added to the parent coordinates x_i^l to obtain the coordinates of the child nodes. Then, independent, zero-mean Gaussian distributed noise xi is added: \n\nx_j^(l+1) = x_i^l + d_ij^(l+1) + xi \n\nThe training data consists of a vector of positions at the lowest level {x_j} with x_j = (x_j, y_j), j = 1 ... 9, for each exemplar. \n\nThe identity of the parts in the training data is assumed known. In addition, the data consists of parts from a single object. For the simulations, the model architecture is restricted to a three-level hierarchy. Since at the top level a single node represents the entire object, only the grouping variables from the lowest to the intermediate level are unknown (the nodes at the intermediate level are implicitly grouped to the single node at the top level). In the current implementation the parameters of a parent node are defined as the average over the parameters of its child nodes: \n\nx_i^l = (1/N_i) Sum_j ina_ij x_j^(l+1) \n\nwhere N_i is the number of child nodes assigned to node i. For this problem the algorithm has recovered the structure of the model that generated the training data. Thus in this case it is possible to use the correlation coefficients to learn the structure of an object from noisy training exemplars. However, the algorithm does not recover the same parameter values x used in the generative model at the intermediate layers. These cannot be uniquely specified due to the ambiguity between the parameters x_i and offsets d_ij (a different choice for x_i leads to different values for d_ij). 
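The dot-cluster experiment can be sketched in a few lines. The code below is a simplified 1-D version with illustrative offsets and noise levels (not the parameters used in the paper): it generates exemplars from a three-level model, takes each candidate parent's trace as the mean of its assigned leaves as in the implementation described above, and compares the equation (6) cost of the generating grouping against a grouping that mixes subclusters. The generating grouping should yield the smaller cost.

```python
import random
import statistics

random.seed(0)

# Hypothetical 1-D three-level dot-cluster model (illustrative values).
CLUSTER_OFFSETS = [-6.0, 0.0, 6.0]    # d: global position -> subcluster centers
DOT_OFFSETS = [-1.0, 0.0, 1.0]        # d: subcluster center -> individual dots
S_TOP, S_MID, S_LEAF = 1.0, 1.0, 0.3  # std. dev. of zero-mean Gaussian noise

def sample_exemplar():
    """x_child = x_parent + offset + noise, down to the 9 observed dots."""
    top = random.gauss(0.0, S_TOP)
    dots = []
    for dc in CLUSTER_OFFSETS:
        mid = top + dc + random.gauss(0.0, S_MID)
        dots.extend(mid + dd + random.gauss(0.0, S_LEAF) for dd in DOT_OFFSETS)
    return dots

data = [sample_exemplar() for _ in range(2000)]

def corr(a, b):
    """Sample correlation coefficient."""
    ma, mb = statistics.mean(a), statistics.mean(b)
    cov = sum((x - ma) * (y - mb) for x, y in zip(a, b))
    va = sum((x - ma) ** 2 for x in a)
    vb = sum((y - mb) ** 2 for y in b)
    return cov / (va * vb) ** 0.5

def grouping_cost(groups):
    """Equation (6) with hard 0/1 ina assignments: sum over parents w and
    assigned pairs j, k != j of (rho_wj * rho_wk - rho_jk)^2, the parent
    trace being the mean of its assigned leaves."""
    cost = 0.0
    for members in groups:
        parent = [statistics.mean(ex[j] for j in members) for ex in data]
        cols = {j: [ex[j] for ex in data] for j in members}
        for j in members:
            for k in members:
                if j != k:
                    cost += (corr(parent, cols[j]) * corr(parent, cols[k])
                             - corr(cols[j], cols[k])) ** 2
    return cost

true_grouping = [(0, 1, 2), (3, 4, 5), (6, 7, 8)]   # matches the generator
mixed_grouping = [(0, 3, 6), (1, 4, 7), (2, 5, 8)]  # crosses subclusters
print(grouping_cost(true_grouping) < grouping_cost(mixed_grouping))
```

For the generating grouping the star condition nearly holds (rho_wj * rho_wk is close to rho_jk), so its cost is small; a grouping that mixes subclusters leaves large residuals in the parenthesized term.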
\n\nFigure 5: The model used to generate training data. The structure of the model is a three-level hierarchy (global position, cluster centers, dots). The model parameters are chosen such that the generated dot clusters spatially overlap. On the left, an example of an instance of a dot cluster generated from the model is shown (these constitute the training data). \n\n4 EXTENSIONS \n\nThe results of the initial experiments are encouraging, but more research needs to be done before the algorithm can be applied to real data. For the example used here, the training data was generated by a hierarchical model. Thus the distribution of the training exemplars could, in principle, be learned exactly using the proposed model architecture. I plan to study the effect of approximating the distribution of real-world data by applying the method to the problem of learning models for handwritten digit recognition. \n\nThe model should be extended to include provisions to deal with missing data. Instead of being binary variables, ina_ij could be the conditional probability that part j is present in a typical instance of the object given that the parent node i itself is present (similar to the dot deletion rule described in (Mjolsness, 1991)). These probabilities must also be estimated from data. Under this interpretation the ina_ij are similar to the mixture coefficients in the mixture of experts model (Jordan and Jacobs, 1993). \n\nThe robustness of the algorithm can be improved when the desired locality of the model is explicitly favored via an additional constraint: \n\nE_local = lambda Sum_{i, j, k} ina_ij ina_ik |x_j - x_k|^2 \n\nIn this sense, the toy problem shown here is unnecessarily difficult. 
Preliminary experiments indicate that including this term reduces the sensitivity to spurious correlations between parts that are far apart. \n\nAs described, the algorithm performs unsupervised grouping; learning the hierarchical model does not take into account the recognition performance obtained when using the model. While the problem of learning and representing models in a hierarchical form is interesting in its own right, the final criterion for judging the model in the context of a recognition problem should be recognition performance. The assumption is that the model should pick up substructures that are specific to a particular class of objects and maximally discriminate between objects belonging to other classes. For example, after an initial model is obtained that roughly captures the structure of the training data, it can be refined on-line during the recognition stage. \n\nAcknowledgements \nInitial work on this project was performed while the author was with the International Computer Science Institute, Berkeley, CA. At OGI, support was provided in part under grant ONR N00014-92-J-4062. Discussions with S. Knerr, E. Mjolsness and S. Omohundro were helpful in preparing this work. \n\nReferences \nCooper, G. F. and Herskovits, E. (1992), 'A Bayesian method for induction of probabilistic networks from data', Machine Learning 9, 309-347. \nCooper, P. R. (1989), Parallel Object Recognition from Structure (The Tinkertoy Project), PhD thesis, University of Rochester, Computer Science. Also Technical Report No. 301. \nDempster, A. P., Laird, N. M. and Rubin, D. B. (1977), 'Maximum likelihood from incomplete data via the EM algorithm', J. Royal Statist. Soc. B 39, 1-39. \nGindi, G., Mjolsness, E. and Anandan, P. (1991), Neural networks for model based recognition, in 'Neural Networks: Concepts, Applications and Implementations', Prentice-Hall, pp. 144-173. \nJordan, M. I. and Jacobs, R. A. 
(1993), Hierarchical mixtures of experts and the EM algorithm, Technical Report 9301, MIT Computational Cognitive Science. \nLeCun, Y., Boser, B., Denker, J. S., Henderson, D., Howard, R. E., Hubbard, W. and Jackel, L. D. (1989), 'Backpropagation applied to handwritten zip code recognition', Neural Computation 1, 541-551. \nMarr, D. (1982), Vision, W. H. Freeman and Co., New York. \nMjolsness, E. (1991), Bayesian inference on visual grammars by neural nets that optimize, Technical Report YALEU-DCS-TR-854, Yale University, Dept. of Computer Science. \nMjolsness, E., Gindi, G. R. and Anandan, P. (1989), 'Optimization in model matching and perceptual organization', Neural Computation 1(2). \nOmohundro, S. M. (1991), Bumptrees for efficient function, constraint, and classification learning, in R. Lippmann, J. Moody and D. Touretzky, eds, 'Advances in Neural Information Processing 3', Morgan Kaufmann Publishers, San Mateo, CA. \nPearl, J. (1988), Probabilistic Reasoning in Intelligent Systems: Networks of Plausible Inference, Morgan Kaufmann Publishers, Inc., San Mateo, CA. \nSaund, E. (1993), A multiple cause mixture model for unsupervised learning, Technical report, Xerox PARC, Palo Alto, CA. Preprint, submitted to Neural Computation. \nSegen, J. (1988a), Learning graph models of shape, in 'Proceedings of the 5th International Conference on Machine Learning'. \nSegen, J. (1988b), 'Learning structural description of shape', Machine Vision pp. 257-269. \nSimard, P., LeCun, Y. and Denker, J. (1993), Efficient pattern recognition using a new transformation distance, in S. J. Hanson, J. Cowan and L. Giles, eds, 'Advances in Neural Information Processing 5', Morgan Kaufmann Publishers, San Mateo, CA. \nUeda, N. and Suzuki, S. 
(1993), 'Learning visual models from shape contours using multiscale convex/concave structure matching', IEEE Transactions on Pattern Analysis and Machine Intelligence 15(4), 337-352. \nUtans, J. (1992), Neural Networks for Object Recognition within Compositional Hierarchies, PhD thesis, Department of Electrical Engineering, Yale University, New Haven, CT 06520. \nUtans, J. (1993), Mixture models and the EM algorithm for object recognition within compositional hierarchies. Part 1: Recognition, Technical Report TR-93-004, International Computer Science Institute, 1947 Center St., Berkeley, CA 94708. \nUtans, J. (1994), 'Mixture models for learning and recognition in compositional hierarchies', in preparation. \nXu, L. and Pearl, J. (1987), Structuring causal tree models with continuous variables, in 'Proceedings of the 3rd Workshop on Uncertainty in AI', pp. 170-179. \nYuille, A., Stolorz, P. and Utans, J. (1994), 'Statistical physics, mixtures of distributions and the EM algorithm', to appear in Neural Computation. \nZemel, R. S., Mozer, M. C. and Hinton, G. E. (1990), Traffic: Recognizing objects using hierarchical reference frame transformations, in D. S. Touretzky, ed., 'Advances in Neural Information Processing 2', Morgan Kaufmann Publishers, San Mateo, CA. \n\n\f", "award": [], "sourceid": 840, "authors": [{"given_name": "Joachim", "family_name": "Utans", "institution": null}]}