{"title": "Dynamical And-Or Graph Learning for Object Shape Modeling and Detection", "book": "Advances in Neural Information Processing Systems", "page_first": 242, "page_last": 250, "abstract": "This paper studies a novel discriminative part-based model to represent and recognize object shapes with an \u201cAnd-Or graph\u201d. We define this model consisting of three layers: the leaf-nodes with collaborative edges for localizing local parts, the or-nodes specifying the switch of leaf-nodes, and the root-node encoding the global verification. A discriminative learning algorithm, extended from the CCCP [23], is proposed to train the model in a dynamical manner: the model structure (e.g., the configuration of the leaf-nodes associated with the or-nodes) is automatically determined with optimizing the multi-layer parameters during the iteration. The advantages of our method are two-fold. (i) The And-Or graph model enables us to handle well large intra-class variance and background clutters for object shape detection from images. (ii) The proposed learning algorithm is able to obtain the And-Or graph representation without requiring elaborate supervision and initialization. We validate the proposed method on several challenging databases (e.g., INRIA-Horse, ETHZ-Shape, and UIUC-People), and it outperforms the state-of-the-arts approaches.", "full_text": "Dynamical And-Or Graph Learning for Object Shape\n\nModeling and Detection\n\nXiaolong Wang\n\nSun Yat-Sen University\n\nGuangzhou, P.R. China 510006\ndragonwxl123@gmail.com\n\nLiang Lin\u2217\n\nSun Yat-Sen University\n\nGuangzhou, P.R. China 510006\n\nlinliang@ieee.org\n\nAbstract\n\nThis paper studies a novel discriminative part-based model to represent and rec-\nognize object shapes with an \u201cAnd-Or graph\u201d. 
We define this model as consisting of three layers: the leaf-nodes with collaborative edges for localizing local parts, the or-nodes specifying the switch among leaf-nodes, and the root-node encoding the global verification. A discriminative learning algorithm, extended from the CCCP [23], is proposed to train the model in a dynamical manner: the model structure (e.g., the configuration of the leaf-nodes associated with the or-nodes) is automatically determined while the multi-layer parameters are optimized during the iterations. The advantages of our method are two-fold. (i) The And-Or graph model enables us to handle large intra-class variance and background clutter well in object shape detection from images. (ii) The proposed learning algorithm is able to obtain the And-Or graph representation without requiring elaborate supervision and initialization. We validate the proposed method on several challenging databases (e.g., INRIA-Horse, ETHZ-Shape, and UIUC-People), and it outperforms the state-of-the-art approaches.

1 Introduction

Part-based and hierarchical representations have been widely studied in computer vision and have led to some elegant frameworks for complex object detection and recognition. However, most methods address hierarchical decomposition only with tree-structure models [5, 25] and oversimplify the reconfigurability (i.e., the structural switch) in the hierarchy, which is the key to handling large intra-class variance in object detection. In addition, the interactions of parts are often omitted in learning and detection. And-Or graph models were recently explored in [26, 27] to hierarchically model object categories via “and-nodes” and “or-nodes” that represent, respectively, compositions of parts and structural variations of parts. 
Their main limitation is that the learning process is strongly supervised and the model structure needs to be manually annotated.

The key contribution of this work is a novel And-Or graph model whose parameters and structure can be jointly learned in a weakly supervised manner. We achieve superior performance on the task of detecting and localizing shapes in cluttered backgrounds, compared to the state-of-the-art approaches. As Fig. 3(a) illustrates, the proposed And-Or graph model consists of three layers, described as follows.

The leaf-nodes in the bottom layer represent a batch of local classifiers of contour fragments. We provide a partial matching scheme that can recognize the accurate part of a contour, to deal with the problem that the true contours of objects are often connected to background clutter due to unreliable edge extraction.

* Corresponding author is Liang Lin. This work was supported by the National Natural Science Foundation of China (no. 61173082), the Fundamental Research Funds for the Central Universities (no. 2010620003162041), and the Guangdong Natural Science Foundation (no. S2011010001378). This work was also partially funded by the SYSU-Sugon high performance computing typical application project.

The or-nodes in the middle layer are “switch” variables specifying the activation of their children leaf-nodes. We utilize the or-nodes to account for alternate ways of composition, rather than just defining multi-layer compositional detectors, which is shown to better handle the intra-class variance and the inconsistency caused by unreliable edge detection. Each or-node is used to select one contour from the candidates detected via the associated leaf-nodes in the bottom layer. Moreover, during detection, location displacement is allowed for each or-node to tackle part deformation.

The root-node (i.e., 
the and-node) in the top layer is a global classifier capturing the holistic deformation of the object. The contours selected via the or-nodes are further verified as a whole, in order to make detection robust against background clutter.

The collaborative edges between leaf-nodes are defined by the probabilistic co-occurrence of local classifiers, which relaxes the conditional independence assumption commonly used in previous tree-structure models. Concretely, our model allows nearby contours to interact with each other.

The key problem in training our And-Or graph model is automatic structure determination. We propose a novel learning algorithm, namely dynamical CCCP (dCCCP), extended from the concave-convex procedure (CCCP) [23, 22] by embedding structural reconfiguration. It iterates to dynamically determine the production of leaf-nodes associated with the or-nodes, a step that is often simplified by manual fixing in previous methods [25, 16]. The other structure attributes (e.g., the layout of or-nodes and the activation of leaf-nodes) are implicitly inferred with the latent variables.

2 Related Work

Remarkable progress has been made in shape-based object detection [6, 10, 9, 11, 19]. By employing shape descriptors and matching schemes, many works represent and recognize object shapes as a loose collection of local contours. For example, Ferrari et al. [6] used a codebook of PAS (pairwise adjacent segments) to localize objects of interest; Maji et al. [11] proposed maximum margin Hough voting for hypothesis regions, combined with an intersection kernel SVM (IKSVM) for verification; Yang and Latecki [19] constructed shape models in a fully connected graph form with partially-supervised learning, and detected objects via a Particle Filter (PF) framework.

Recently, tree-structure latent models [25, 5] have provided significant improvements in object detection. 
Based on these methods, Srinivasan et al. [16] trained a descriptive contour-based detector by using latent-SVM learning; Song et al. [15] integrated context information with the learning, namely Context-SVM. Schnitzspan et al. [14] further combined latent discriminative learning with conditional random fields using multiple features.

Knowledge representation with And-Or graphs was first introduced for modeling visual patterns by Zhu and Mumford [27]. Its general idea, i.e., using configurable graph structures with And and Or nodes, has been applied in object and scene parsing [26, 18, 24] and action classification [20].

3 And-Or Graph Representation for Object Shape

The And-Or graph model is defined as G = (V, E), where V represents three types of nodes and E the graph edges. As Fig. 3(a) illustrates, the square on the top is the root-node representing the complete object instances. The dashed circles derived from the root are z or-nodes arranged in a layout of b1 × b2 blocks, representing the object parts. Each or-node comprises an unfixed number of leaf-nodes (denoted by the solid circles on the bottom); the leaf-nodes are allowed to be dynamically created and removed during learning. For simplicity, we set a maximum number m of leaf-nodes affiliated with one or-node, and set the parameters of non-existing leaf-nodes to zero. The maximum number of all nodes in the model is then 1 + n = 1 + z + z × m. We use i = 0 to index the root node, i = 1, ..., z the or-nodes, and j = z + 1, ..., n the leaf-nodes. We also define j ∈ ch(i) to index the child nodes of node i. The horizontal graph edges (i.e., collaborative edges) are defined between leaf-nodes that are associated with different or-nodes, in order to encode the compatibility of object parts. 
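The index bookkeeping above (root node 0, or-nodes 1..z, leaf-nodes z+1..n with n = z + z × m) can be made concrete in a small sketch. This is only an illustration: the paper fixes the index ranges but not the layout of leaf slots, so the consecutive-slot convention below is an assumption made for clarity.

```python
def child_leaf_indices(i, z, m):
    """Return ch(i), the leaf-node indices of or-node i, assuming each
    or-node's m leaf slots are laid out consecutively after the z or-nodes
    (an illustrative convention; nonexistent leaf-nodes keep zero parameters).
    Root is node 0, or-nodes are 1..z, leaf-nodes are z+1..n with n = z + z*m."""
    assert 1 <= i <= z
    start = z + 1 + (i - 1) * m
    return list(range(start, start + m))
```

With the UIUC-People setting of the paper (z = 8 or-nodes, at most m = 4 leaf-nodes each), the model has at most 1 + 8 + 32 = 41 nodes.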
The definitions of G are presented as follows.

Leaf-node: Each leaf-node L_j, j = z + 1, ..., n is a local classifier of contours, whose placement is decided by its parent or-node (the localized block). Suppose a contour fragment c on the edge map X is captured by the block located at p_i = (p_i^x, p_i^y), as the input of the classifier. We denote by φ^l(p_i, c) the feature vector using the Shape Context descriptor [3]. For any classifier, only the part of c falling into the block is taken into account, and we set φ^l(p_i, c) = 0 if c is entirely outside. The response of classifier L_j at location p_i of the edge map X is defined as:

R_{L_j}(X, p_i) = max_{c ∈ X} ω_j^l · φ^l(p_i, c),    (1)

where ω_j^l is a parameter vector, which is set to zero if the corresponding leaf-node L_j is nonexistent. We can then detect the contour from the edge map X via the classifier, c_j = argmax_{c ∈ X} ω_j^l · φ^l(p_i, c).

Or-node: Each or-node U_i, i = 1, ..., z is proposed to specify a proper contour from a set of candidates detected via its children leaf-nodes. Note that we can also view the or-node as activating one leaf-node. The or-nodes are allowed to perturb slightly with respect to the root. For each or-node U_i, we define the deformation feature as φ^s(p_0, p_i) = (dx, dy, dx^2, dy^2), where (dx, dy) is the displacement of the or-node position p_i from the expected position p_0 determined by the root-node. Then the cost of locating U_i at p_i is:

Cost_i(p_0, p_i) = −ω_i^s · φ^s(p_0, p_i),    (2)

where ω_i^s is a 4-dimensional parameter vector corresponding to φ^s(p_0, p_i). In our method, each or-node contains at most m leaf-nodes, among which one is to be activated during inference. For each leaf-node L_j associated with U_i, we introduce an indicator variable v_j ∈ {0, 1} representing whether it is activated or not. 
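Eqs.(1) and (2) are both linear scores; Eq.(1) additionally maximizes over candidate contours. A minimal sketch of these two quantities (not the authors' implementation; it assumes the Shape Context features φ^l(p_i, c) have already been computed for each candidate contour, with an all-zero vector standing for a contour entirely outside the block):

```python
import numpy as np

def leaf_response(omega_l, candidate_features):
    """Eq.(1): R_{L_j}(X, p_i) = max_c  omega_j^l . phi^l(p_i, c).

    omega_l            -- parameter vector of leaf-node L_j (all zeros if
                          the leaf-node is nonexistent)
    candidate_features -- list of phi^l(p_i, c) vectors, one per candidate
                          contour fragment c on the edge map
    Returns (best score, index of the maximizing contour c_j).
    """
    scores = [float(np.dot(omega_l, phi)) for phi in candidate_features]
    best = int(np.argmax(scores))
    return scores[best], best

def deformation_cost(omega_s, dx, dy):
    """Eq.(2): Cost_i = -omega_i^s . (dx, dy, dx^2, dy^2) for the displacement
    of or-node U_i from its expected position under the root-node."""
    return -float(np.dot(omega_s, np.array([dx, dy, dx * dx, dy * dy])))
```

Since a nonexistent leaf-node has ω_j^l = 0, every candidate scores zero and that node never influences detection.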
We then derive the auxiliary “switch” vector for U_i, v_i = (v_{j_1}, v_{j_2}, ..., v_{j_m}), where ||v_i|| = 1. Thus, the response of the or-node U_i is defined as

R_{U_i}(X, p_0, p_i, v_i) = Σ_{j ∈ ch(i)} R_{L_j}(X, p_i) · v_j + Cost_i(p_0, p_i).    (3)

Collaborative Edge: For any pair of leaf-nodes (L_j, L_{j′}) respectively associated with two different or-nodes, we define a collaborative edge between them according to their contextual co-occurrence, that is, how likely it is that the object contains contours detected via the two leaf-nodes. The response of the pairwise potentials is parameterized as

R^E(V) = Σ_{j=z+1}^{n} Σ_{j′ ∈ neigh(j)} ω^e_{(j,j′)} · v_j · v_{j′},    (4)

where neigh(j) is defined as the neighboring leaf-nodes from the other or-nodes spatially adjacent to L_j, and V is the joint vector of all v_i: V = (v_1, ..., v_z) = (v_{z+1}, ..., v_n). ω^e_{(j,j′)} indicates the compatibility between L_j and L_{j′}.

Root-node: The root-node represents a global classifier to verify the ensemble of contour fragments C^r = {c_1, ..., c_z} proposed by the or-nodes. The response of the root-node is parameterized as

R^T(C^r) = ω^r · φ^r(C^r),    (5)

where φ^r(C^r) is the feature vector of C^r and ω^r the corresponding parameter vector.

Therefore, the overall response of the And-Or graph is:

R_G(X, P, V) = Σ_{i=1}^{z} R_{U_i}(X, p_0, p_i, v_i) + R^E(V) + R^T(C^r)
             = Σ_{i=1}^{z} [ Σ_{j ∈ ch(i)} ω_j^l · φ^l(p_i, c_j) · v_j − ω_i^s · φ^s(p_0, p_i) ] + Σ_{j=z+1}^{n} Σ_{j′ ∈ neigh(j)} ω^e_{(j,j′)} · v_j · v_{j′} + ω^r · φ^r(C^r),    (6)

where P = (p_0, p_1, ..., p_z) is the vector of positions of the or-nodes. For better understanding, we refer to H = (P, V) as the latent variables during inference, where P implies the deformation of the parts represented by the or-nodes and V implies the discrete distribution of leaf-nodes (i.e., which leaf-nodes are activated for detection). Eq.(6) can be further simplified as:

R_G(X, H) = ω · φ(X, H),    (7)

where ω includes the complete parameters of the And-Or graph,

ω = (ω^l_{z+1}, ..., ω^l_n, −ω^s_1, ..., −ω^s_z, ω^e_{(z+1;z+1+m)}, ..., ω^e_{(n−m;n)}, ω^r),    (8)

and φ(X, H) is the feature vector,

φ(X, H) = (φ^l(p_1, c_{z+1}) · v_{z+1}, ..., φ^l(p_z, c_n) · v_n, φ^s(p_0, p_1), ..., φ^s(p_0, p_z), v_{z+1} · v_{z+1+m}, ..., v_{n−m} · v_n, φ^r(C^r)).    (9)

Figure 1: Illustration of dynamical structure learning. Parts of the model, two or-nodes (U_1, U_6), are visualized in three intermediate steps. (a) The initial structure, i.e., the regular layout of an object. Two new structures are dynamically generated during the iterations. (b) A leaf-node associated with U_1 is removed. (c) A new leaf-node is created and assigned to U_6.

4 Inference

The inference task is to localize the optimal contour fragments within the detection window, which is slid across all scales and positions of the edge map X. Assuming the root-node is located at p_0, the object shape is localized by maximizing R_G(X, H) defined in (6):

S(p_0, X) = max_H R_G(X, H).    (10)

The inference procedure integrates bottom-up testing and top-down verification.

Bottom-up testing: For each or-node U_i, its children leaf-nodes (i.e., 
the local classifiers) are utilized to detect contour fragments within the edge map X. Assume that leaf-node L_j, j ∈ ch(i) associated with U_i is activated, v_j = 1; the optimal contour fragment c_j is then localized by maximizing the response in Eq.(3), where the optimal location p*_{i;j} is also determined. We thus generate a set of candidates {c_j, p*_{i;j}} for each or-node, each of which is one contour fragment detected via the leaf-nodes. These sets of candidates are passed to the top-down step, where the leaf-node activation v_i for U_i can be further validated. We calculate the response for the bottom-up step as

R_{bot}(V) = Σ_{i=1}^{z} R_{U_i}(X, p_0, p*_i, v_i),    (11)

where V = {v_i} denotes a hypothesis of leaf-node activations for all or-nodes. In practice, we can further prune the candidate contours by setting a threshold on R_{bot}(V). Thus, given V = {v_i}, we can select an ensemble of contours C^r = {c_1, ..., c_z}, each of which is detected by an activated leaf-node L_j with v_j = 1.

Top-down verification: Given the ensemble of contours C^r, we then apply the global classifier at the root-node to verify C^r by Eq.(5), as well as the accumulated pairwise potentials on the collaborative edges defined in Eq.(4).

By incorporating the bottom-up and top-down steps, we obtain the response of the And-Or graph model by Eq.(6). The final detection is acquired by selecting the maximum score in Eq.(10).

5 Discriminative Learning for And-Or Graph

We formulate the learning of the And-Or graph model as a joint optimization task over model structure and parameters, which can be solved by an iterative method extended from the CCCP framework [22]. 
This algorithm iterates to determine the And-Or graph structure in a dynamical manner: given the inferred latent variables H = (P, V) in each step, leaf-nodes can be automatically created or removed to generate a new structural configuration. To be specific, a new leaf-node is encouraged to be created as the local detector for contours that cannot be handled by the current model (Fig. 1(c)); a leaf-node is encouraged to be removed if it has a similar discriminative ability to other ones (Fig. 1(b)). We thus call this procedure dynamical CCCP (dCCCP).

5.1 Optimization Formulation

Suppose a set of positive and negative training samples (X_1, y_1), ..., (X_N, y_N) is given, where X is the edge map and y = ±1 is the label indicating positive and negative samples. We assume the samples indexed from 1 to K are the positive samples, and define the feature vector for each sample (X, y) as

φ(X, y, H) = { φ(X, H) if y = +1;  0 if y = −1 },    (12)

where H is the latent variables. Thus, Eq.(10) can be rewritten as a discriminative function,

S_ω(X) = argmax_{y,H} (ω · φ(X, y, H)).    (13)

The optimization of this function can be solved by using structural SVM with latent variables,

min_ω (1/2)||ω||^2 + D Σ_{k=1}^{N} [ max_{y,H} (ω · φ(X_k, y, H) + L(y_k, y, H)) − max_H (ω · φ(X_k, y_k, H)) ],    (14)

where D is a penalty parameter (set to 0.005 empirically), and L(y_k, y, H) is the loss function. We define L(y_k, y, H) = 0 if y_k = y, and 1 if y_k ≠ y.

The optimization target in Eq.(14) is non-convex. 
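For a concrete view of Eqs.(12)-(14), the following toy sketch evaluates the objective when the (y, H) search spaces are small enough to enumerate explicitly. This is an assumption made purely for illustration: the paper maximizes over H with the inference of Sec. 4 rather than by enumeration, and the data layout here is hypothetical.

```python
import numpy as np

def latent_ssvm_objective(omega, samples, D=0.005):
    """Eq.(14): 0.5*||w||^2 + D * sum_k [ max_{y,H}(w.phi + L) - max_H w.phi ].

    samples -- list of (augmented, truth) pairs, one per training example:
               augmented: list of (phi(X_k, y, H), L(y_k, y, H)) over all (y, H)
               truth:     list of phi(X_k, y_k, H) over all H for the true label
    """
    obj = 0.5 * float(np.dot(omega, omega))
    for augmented, truth in samples:
        # loss-augmented prediction term (convex part)
        loss_aug = max(float(np.dot(omega, phi)) + L for phi, L in augmented)
        # best completion of the latent variables for the true label (concave part)
        best_true = max(float(np.dot(omega, phi)) for phi in truth)
        obj += D * (loss_aug - best_true)
    return obj
```

The objective is non-convex precisely because the second max enters with a negative sign, which is what the CCCP machinery of the next subsection addresses.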
The CCCP framework [23] was recently utilized in [22, 25] to provide a local optimum solution by iteratively solving for the latent variables H and the model parameters ω. However, CCCP does not address the or-nodes in the hierarchy, i.e., it assumes the structure configuration is fixed. In the following, we propose the dCCCP by embedding a structural reconfiguration step.

5.2 Optimization with dynamical CCCP

Following the original CCCP framework, we convert the function in Eq.(14) into the difference of a convex and a concave form:

min_ω [ (1/2)||ω||^2 + D Σ_{k=1}^{N} max_{y,H} (ω · φ(X_k, y, H) + L(y_k, y, H)) ] − [ D Σ_{k=1}^{N} max_H (ω · φ(X_k, y_k, H)) ]    (15)
= min_ω [ f(ω) − g(ω) ],    (16)

where f(ω) represents the first two terms and g(ω) the last term in (15).

The original CCCP includes two iterative steps: (I) fixing the model parameters, estimate the latent variables H* for each positive sample; (II) compute the model parameters by the traditional structural SVM method. In our method, besides the inferred H*, we need to further determine the graph configuration, i.e., the production of leaf-nodes associated with the or-nodes, to obtain the complete structure. Thus, we insert one step between the two original ones to perform the structural reconfiguration. The three iterative steps are presented as follows.

(I) For optimization, we first find a hyperplane q_t to upper bound the concave part −g(ω) in Eq.(16):

−g(ω) ≤ −g(ω_t) + (ω − ω_t) · q_t, ∀ω,    (17)

where ω_t includes the model parameters obtained in the previous iteration. We construct q_t by calculating the optimal latent variables H*_k = argmax_H (ω_t · φ(X_k, y_k, H)). Since φ(X_k, y_k, H) = 0 when y_k = −1, we only take the positive training samples into account during computation. The hyperplane is then constructed as q_t = −D Σ_{k=1}^{N} φ(X_k, y_k, H*_k).

(II) In this step, we adjust the model structure by reconfiguring the leaf-nodes. In our model, each leaf-node is mapped to several feature dimensions of the vector φ(X, y, H*). Thus, the process of reconfiguration is equivalent to reorganizing the feature vector φ(X, y, H*). The hyperplane q_t would accordingly change with φ(X, y, H*), which would lead to non-convergence of the learning. Therefore, we operate on φ(X, y, H*) guided by Principal Component Analysis (PCA). That is, we allow adjustment only of the non-principal components (dimensions) of φ(X, y, H*), in terms of preserving the significant information of φ(X, y, H*) [8]. As a result, q_t can be assumed to be unaltered. This step of model reconfiguration is divided into two sub-steps.

(i) Feature refactoring guided by PCA. Given φ(X_k, y_k, H*_k) of all positive samples, we apply PCA on them:

φ(X_k, y_k, H*_k) ≈ u + Σ_{i=1}^{K} β_{k;i} e_i,    (18)

where K is the number of eigenvectors, and e_i is an eigenvector with coefficient β_{k;i}. We set K to a large number so that ||φ(X_k, y_k, H*_k) − (u + Σ_{i=1}^{K} β_{k;i} e_i)||^2 < σ, ∀k. We consider the jth bin of the feature vector non-principal only if e_{i;j} < δ and u_j < δ for all e_i and u (σ = 2.0, δ = 0.001 in the experiments).

For each or-node U_i, a set of detected contour fragments {c_i^1, ..., c_i^K} is obtained from the given H*_k of all positive samples. The feature vectors for these contours generated by the leaf-nodes, {φ^l(p_i^1, c_i^1), ..., φ^l(p_i^K, c_i^K)}, are mapped to different parts of the complete feature vectors {φ(X_1, y_1, H*_1), ..., φ(X_K, y_K, H*_K)}. More specifically, once we select the jth bin of a feature vector φ^l, it can be either principal or not in different vectors φ. For each feature vector φ^l, we select the non-principal bins to form a new vector. We thus refactor the feature vectors of these contours as {φ′(p_i^1, c_i^1), ..., φ′(p_i^K, c_i^K)}.

(ii) Structural reconfiguration by clustering. To trigger structural reconfiguration, for each or-node U_i, we perform clustering over the detected contour fragments represented by the newly formed feature vectors. We first group the contours detected by the same leaf-node into the same cluster as a temporary partition. Re-clustering is then performed by applying the ISODATA algorithm with the Euclidean distance, so that close contours are grouped into the same cluster. According to the new partition, we can reorganize the feature vectors, i.e., represent similar contours with the same bins of the complete feature vector φ. Recall that the vector of one contour is part of φ. We present a toy example for illustration in Fig. 2. The selected (non-principal) feature vector φ′(p_i^2) = ⟨φ_6, φ_8, φ_9⟩ of X_2 is grouped from one cluster to another; by comparing (a) with (c), we can observe that ⟨φ_6, φ_8, φ_9⟩ is moved to ⟨φ_1, φ_3, φ_4⟩.

Figure 2: A toy example of structural clustering. We consider 4 samples, X_1, ..., X_4, for training the structure of U_i. (a) shows the feature vectors φ of the samples associated with U_i; the intensity of a feature bin indicates the feature value. The red and green bounding boxes on the vectors indicate the non-principal features representing the contour fragments detected via two different leaf-nodes. (b) illustrates the clustering performed with φ′. The vector ⟨φ_6, φ_8, φ_9⟩ of X_2 is grouped from the right cluster to the left one. (c) shows the adjusted feature vectors according to the clustering. Note that clustering can result in structural reconfiguration, as discussed in the text. This figure is best viewed in the electronic version.

With the reorganization of the feature vectors, we can accordingly reconfigure the leaf-nodes corresponding to the clusters of contours. There are two typical cases:

• New leaf-nodes are created once more clusters are generated than before. Their parameters can be learned based on the feature vectors of the contours within the clusters.
• A leaf-node is removed when the feature bins related to it are all zero, which implies that the contours detected by that leaf-node have been grouped into another cluster.

In practice, we constrain the extent of structural reconfiguration, i.e., only a few leaf-nodes can be created or removed for each or-node per iteration. After the structural reconfiguration, we denote the adjusted feature vectors φ(X_k, y_k, H*_k) as φ^d(X_k, y_k, H*_k). The new hyperplane is then generated as q^d_t = −D Σ_{k=1}^{N} φ^d(X_k, y_k, H*_k).

(III) Given the newly generated model structure represented by the feature vectors φ^d(X_k, y_k, H*_k), we can learn the model parameters by solving ω_{t+1} = argmin_ω [f(ω) + ω · q^d_t]. By substituting −g(ω) with the upper bound hyperplane q^d_t, the optimization task in Eq. 
(15) can be rewritten as:

min_ω (1/2)||ω||^2 + D Σ_{k=1}^{N} [ max_{y,H} (ω · φ(X_k, y, H) + L(y_k, y, H)) − ω · φ^d(X_k, y_k, H*_k) ].    (19)

This is a standard structural SVM problem, whose solution is given by

ω* = D Σ_{k,y,H} α*_{k;y;H} Δφ(X_k, y, H),    (20)

where Δφ(X_k, y, H) = φ^d(X_k, y_k, H*_k) − φ(X_k, y, H). We calculate α* by maximizing the dual function:

max_α Σ_{k,y,H} α_{k;y;H} L(y_k, y, H) − (D/2) Σ_{k,k′} Σ_{y,H,y′,H′} α_{k;y;H} α_{k′;y′;H′} Δφ(X_k, y, H) · Δφ(X_{k′}, y′, H′).    (21)

This is a dual problem of the standard SVM, which can be solved by applying the cutting plane method [1] and Sequential Minimal Optimization [13]. We thus obtain the updated parameters ω_{t+1}, and continue the 3-step iteration until the function in Eq.(16) converges.

Figure 3: The And-Or graph model trained on the UIUC-People dataset. (a) visualizes the three-layer model, where the images on the top illustrate the verification via the root-node. (b) exhibits the leaf-nodes associated with the or-nodes U_1, ..., U_8; for a practical detection, the activated leaf-nodes are highlighted in red. (c) shows the average precision (AP) results generated by the And-Or tree (AOT) model and the And-Or graph (AOG) model.

5.3 Initialization

At the beginning of learning, the And-Or graph model is initialized as follows. For each training sample (whose contours have been extracted), we partition it into a regular layout of several blocks, each of which corresponds to one or-node. The contours falling into a block are treated as the input for learning. If there are more than two contours in one block, we select the one with the largest length. The leaf-nodes are then generated by clustering the selected contours without any constraints, and we can thus obtain the initial feature vector φ^d for each sample.

6 Experiments

We evaluate our method for object shape detection using three benchmark datasets: UIUC-People [17], ETHZ-Shape [7], and INRIA-Horse [7].

Implementation settings. We fix the number of or-nodes in the And-Or model to 8 for the UIUC-People dataset, and to 6 in the other experiments. The initial layout is a regular partition (e.g., 4 × 2 blocks for the UIUC-People dataset and 2 × 3 for the others). There are at most m = 4 leaf-nodes for each or-node. For positive samples, we extract their clutter-free object contours; for negative samples, we compute their edge maps by using the Pb edge detector [12] with an edge linking method. The convergence of our learning algorithm takes 6 ∼ 9 iterations. 
During detection, the edge maps of test images are extracted in the same way as for the negative training samples, and the object is searched at 6 different scales, 2 per octave. For each contour input to a leaf-node, we sample 20 points and compute the Shape Context descriptor for each point; the descriptor is quantized with 6 polar angles and 2 radial bins. We adopt the testing criterion defined in the PASCAL VOC challenge: a detection is counted as correct if its intersection over union with the groundtruth is at least 50%.

Experiment I. The UIUC-People dataset contains 593 images (346 for training, 247 for testing). Most of the images contain one person playing badminton. Fig. 3(b) shows the trained And-Or graph model (AOG), in which each of the 8 or-nodes is associated with 2 ∼ 4 leaf-nodes. To evaluate the benefit of the collaborative edges, we degenerate our model to an And-Or Tree (AOT) by removing them. As Fig. 3(c) illustrates, the average precisions (AP) of detection with the AOG and the AOT are 56.20% and 53.84%, respectively. We then compare our model with the state-of-the-art detectors in [18, 2, 4, 5], some of which use manually labeled models. Following the metric mentioned in [18], to calculate the detection accuracy we only consider the detection with the highest score on each image, for all methods. As Table 1(a) reports, our method outperforms the other approaches.

[Figure 3: (a) the and-node/or-node/leaf-node structure; (b) the trained And-Or graph with 8 or-nodes; (c) AP versus learning iteration on UIUC-People for the AOG and the AOT.]

Method                  Accuracy
Our AOG                 0.680
Our AOT                 0.660
Wang et al. [18]        0.668
Andriluka et al. [2]    0.506
Felz et al. [5]         0.486
Bourdev et al. [4]      0.458

(a)

Method                  Applelogos  Bottles  Giraffes  Mugs   Swans  Average
Our method              0.910       0.926    0.803     0.885  0.968  0.898
Ma et al. [10]          0.881       0.920    0.756     0.868  0.959  0.877
Srinivasan et al. [16]  0.845       0.916    0.787     0.888  0.922  0.872
Maji et al. [11]        0.869       0.724    0.742     0.806  0.716  0.771
Felz et al. [5]         0.891       0.950    0.608     0.721  0.391  0.712
Lu et al. [9]           0.844       0.641    0.617     0.643  0.798  0.709

(b)

Table 1: (a) Comparisons of detection accuracy on the UIUC-People dataset. (b) Comparisons of average precision (AP) on the ETHZ-Shape dataset.

Figure 4: (a) Experimental results with the recall-FPPI measurement on the INRIA-Horse database. (b), (c), and (d) show a few object shape detections obtained by applying our method on the three datasets; false positives are annotated by blue frames.

Experiment II. The INRIA-Horse dataset consists of 170 horse images and 170 images without horses. Among them, 50 positive examples and 80 negative examples are used for training, and the remaining 210 images for testing. Fig. 4(a) reports the plots of false positives per image (FPPI) vs. recall. Our system substantially outperforms the recent methods: the AOG and AOT models achieve detection rates of 89.6% and 88.0% at 1.0 FPPI, respectively; in contrast, the competing methods achieve 87.3% in [21], 85.27% in [11], 80.77% in [7], and 73.75% in [6].

Experiment III. We test our method on more object categories from the ETHZ-Shape dataset: Applelogos, Bottles, Giraffes, Mugs, and Swans. For each category (containing 32 ∼ 87 images), half of the images are randomly selected as positive examples, and 70 ∼ 90 negative examples are obtained from the other categories as well as backgrounds. The trained model for each category is tested on the remaining images.
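For concreteness, the recall-at-FPPI operating points reported for Experiment II can be computed from raw detector outputs roughly as in the sketch below, which scores detections as true or false positives under the 50% intersection-over-union criterion stated above. This is a minimal illustration under stated assumptions, not the evaluation code used in the paper; the function names and data layout are hypothetical.

```python
def iou(a, b):
    """Intersection-over-union of two axis-aligned boxes (x1, y1, x2, y2)."""
    x1, y1 = max(a[0], b[0]), max(a[1], b[1])
    x2, y2 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, x2 - x1) * max(0.0, y2 - y1)
    area = lambda r: (r[2] - r[0]) * (r[3] - r[1])
    union = area(a) + area(b) - inter
    return inter / union if union > 0 else 0.0

def recall_at_fppi(detections, groundtruth, n_images, target_fppi=1.0):
    """detections: list of (image_id, score, box);
    groundtruth: dict image_id -> list of boxes.
    Sweeps the score threshold from high to low and returns the recall
    achieved while false positives per image stay <= target_fppi."""
    dets = sorted(detections, key=lambda d: -d[1])
    n_gt = sum(len(v) for v in groundtruth.values())
    matched = set()  # (image_id, gt_index) pairs already claimed
    tp = fp = 0
    best_recall = 0.0
    for img, _, box in dets:
        hit = None
        for j, gt in enumerate(groundtruth.get(img, [])):
            # a detection is correct if IoU with an unclaimed GT box >= 50%
            if (img, j) not in matched and iou(box, gt) >= 0.5:
                hit = j
                break
        if hit is not None:
            matched.add((img, hit))
            tp += 1
        else:
            fp += 1
        if fp / n_images <= target_fppi:
            best_recall = tp / n_gt
    return best_recall
```

Sweeping `target_fppi` over a range of values yields the full recall-vs-FPPI curve plotted in Fig. 4(a).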
Table 1(b) reports the results evaluated by the average precision. Compared with the current methods [11, 16, 5, 9, 10], our model achieves very competitive results. A few detection results are visualized in Fig. 4(b), (c), and (d) for Experiments I, II, and III, respectively.

7 Conclusion

This paper proposes a discriminative contour-based object model with an And-Or graph representation. The model can be trained in a dynamical manner, in which the model structure as well as the parameters are automatically determined during the iterations. Our method achieves state-of-the-art object shape detection performance on challenging datasets.

References

[1] Y. Altun, I. Tsochantaridis, and T. Hofmann, Hidden Markov support vector machines, In ICML, 2003.
[2] M. Andriluka, S. Roth, and B. Schiele, Pictorial structures revisited: People detection and articulated pose estimation, In CVPR, 2009.
[3] S. Belongie, J. Malik, and J. Puzicha, Shape Matching and Object Recognition using Shape Contexts, IEEE TPAMI, 24(4): 509-522, 2002.
[4] L. Bourdev, S. Maji, T. Brox, and J. Malik, Detecting people using mutually consistent poselet activations, In ECCV, 2010.
[5] P. F. Felzenszwalb, R. B. Girshick, D. McAllester, and D. Ramanan, Object Detection with Discriminatively Trained Part-based Models, IEEE TPAMI, 2010.
[6] V. Ferrari, F. Jurie, and C. Schmid, From Images to Shape Models for Object Detection, Int'l J. of Computer Vision, 2009.
[7] V. Ferrari, L. Fevrier, F. Jurie, and C. Schmid, Groups of Adjacent Contour Segments for Object Detection, IEEE TPAMI, 30(1): 36-51, 2008.
[8] N. Kambhatla and T. K. Leen, Dimension Reduction by Local Principal Component Analysis, Neural Computation, 9: 1493-1516, 1997.
[9] C. Lu, L. J. Latecki, N. Adluru, X. Yang, and H. Ling, Shape Guided Contour Grouping with Particle Filters, In ICCV, 2009.
[10] T. Ma and L. J. Latecki, From Partial Shape Matching through Local Deformation to Robust Global Shape Similarity for Object Detection, In CVPR, 2011.
[11] S. Maji and J. Malik, Object Detection using a Max-Margin Hough Transform, In CVPR, 2009.
[12] D. R. Martin, C. C. Fowlkes, and J. Malik, Learning to detect natural image boundaries using local brightness, color, and texture cues, IEEE TPAMI, 26(5): 530-549, 2004.
[13] J. C. Platt, Using analytic QP and sparseness to speed training of support vector machines, In Advances in Neural Information Processing Systems, pages 557-563, 1998.
[14] P. Schnitzspan, M. Fritz, S. Roth, and B. Schiele, Discriminative structure learning of hierarchical representations for object detection, In CVPR, 2009.
[15] Z. Song, Q. Chen, Z. Huang, Y. Hua, and S. Yan, Contextualizing Object Detection and Classification, In CVPR, 2010.
[16] P. Srinivasan, Q. Zhu, and J. Shi, Many-to-one Contour Matching for Describing and Discriminating Object Shape, In CVPR, 2010.
[17] D. Tran and D. Forsyth, Improved human parsing with a full relational model, In ECCV, 2010.
[18] Y. Wang, D. Tran, and Z. Liao, Learning Hierarchical Poselets for Human Parsing, In CVPR, 2011.
[19] X. Yang and L. J. Latecki, Weakly Supervised Shape Based Object Detection with Particle Filter, In ECCV, 2010.
[20] B. Yao, A. Khosla, and L. Fei-Fei, Classifying Actions and Measuring Action Similarity by Modeling the Mutual Context of Objects and Human Poses, In ICML, 2011.
[21] P. Yarlagadda, A. Monroy, and B. Ommer, Voting by Grouping Dependent Parts, In ECCV, 2010.
[22] C.-N. J. Yu and T. Joachims, Learning structural SVMs with latent variables, In ICML, 2009.
[23] A. Yuille and A. Rangarajan, The concave-convex procedure (CCCP), In NIPS, pages 1033-1040, 2001.
[24] Y. B. Zhao and S. C. Zhu, Image Parsing via Stochastic Scene Grammar, In NIPS, 2011.
[25] L. Zhu, Y. Chen, A. Yuille, and W. Freeman, Latent Hierarchical Structural Learning for Object Detection, In CVPR, 2010.
[26] L. Zhu, Y. Chen, Y. Lu, C. Lin, and A. Yuille, Max Margin AND/OR Graph Learning for Parsing the Human Body, In CVPR, 2008.
[27] S. C. Zhu and D. Mumford, A stochastic grammar of images, Foundations and Trends in Computer Graphics and Vision, 2(4): 259-362, 2006.