{"title": "Articulated Pose Estimation by a Graphical Model with Image Dependent Pairwise Relations", "book": "Advances in Neural Information Processing Systems", "page_first": 1736, "page_last": 1744, "abstract": "We present a method for estimating articulated human pose from a single static image based on a graphical model with novel pairwise relations that make adaptive use of local image measurements. More precisely, we specify a graphical model for human pose which exploits the fact that local image measurements can be used both to detect parts (or joints) and also to predict the spatial relationships between them (Image Dependent Pairwise Relations). These spatial relationships are represented by a mixture model. We use Deep Convolutional Neural Networks (DCNNs) to learn conditional probabilities for the presence of parts and their spatial relationships within image patches. Hence our model combines the representational flexibility of graphical models with the efficiency and statistical power of DCNNs. Our method significantly outperforms the state-of-the-art methods on the LSP and FLIC datasets and also performs very well on the Buffy dataset without any training.", "full_text": "Articulated Pose Estimation by a Graphical Model with Image Dependent Pairwise Relations\n\nXianjie Chen\nUniversity of California, Los Angeles\nLos Angeles, CA 90024\ncxj@ucla.edu\n\nAlan Yuille\nUniversity of California, Los Angeles\nLos Angeles, CA 90024\nyuille@stat.ucla.edu\n\nAbstract\n\nWe present a method for estimating articulated human pose from a single static image based on a graphical model with novel pairwise relations that make adaptive use of local image measurements. More precisely, we specify a graphical model for human pose which exploits the fact that local image measurements can be used both to detect parts (or joints) and also to predict the spatial relationships between them (Image Dependent Pairwise Relations). 
These spatial relationships are represented by a mixture model. We use Deep Convolutional Neural Networks (DCNNs) to learn conditional probabilities for the presence of parts and their spatial relationships within image patches. Hence our model combines the representational flexibility of graphical models with the efficiency and statistical power of DCNNs. Our method significantly outperforms the state-of-the-art methods on the LSP and FLIC datasets and also performs very well on the Buffy dataset without any training.\n\n1 Introduction\n\nArticulated pose estimation is one of the fundamental challenges in computer vision. Progress in this area can immediately be applied to important vision tasks such as human tracking [2], action recognition [25] and video analysis.\n\nMost work on pose estimation has been based on graphical models [8, 6, 27, 1, 10, 2, 4]. The graph nodes represent the body parts (or joints), and the edges model the pairwise relationships between the parts. The score function, or energy, of the model contains unary terms at each node, which capture the local appearance cues of the part, and pairwise terms defined at the edges, which capture the local contextual relations between the parts. Recently, DeepPose [23] advocates modeling pose in a holistic manner and captures the full context of all body parts in a Deep Convolutional Neural Network (DCNN) [12] based regressor.\n\nIn this paper, we present a graphical model with image dependent pairwise relations (IDPRs). As illustrated in Figure 1, we can reliably predict the relative positions of a part's neighbors (as well as the presence of the part itself) by only observing the local image patch around it. So in our model the local image patches give input to both the unary and pairwise terms. 
This gives stronger pairwise terms because data independent relations are typically either too loose to be helpful or too strict to model highly variable poses.\n\nOur approach requires us to have a method that can extract information about pairwise part relations, as well as part presence, from local image patches. We require this method to be efficient and to share features between different parts and part relationships. To do this, we train a DCNN to output estimates for the part presence and spatial relationships, which are used in the unary and pairwise terms of our score function. The weight parameters of the different terms in the model are trained using a Structural Support Vector Machine (S-SVM) [24]. In summary, our model combines the representational flexibility of graphical models, including the ability to represent spatial relationships, with the data driven power of DCNNs.\n\nFigure 1: Motivation. The local image measurements around a part, e.g., in an image patch, can reliably predict the relative positions of all its neighbors (as well as detect the part). Center Panel: The local image patch centered at the elbow can reliably predict the relative positions of the shoulder and wrist, and the local image patch centered at the wrist can reliably predict the relative position of the elbow. Left & Right Panels: We define different types of pairwise spatial relationships (i.e., a mixture model) for each pair of neighboring parts. The Left Panel shows typical spatial relationships the elbow can have with its neighbors, i.e., the shoulder and wrist. The Right Panel shows typical spatial relationships the wrist can have with its neighbor, i.e., the elbow.\n\nWe perform experiments on two standard pose estimation benchmarks: the LSP dataset [10] and the FLIC dataset [20]. Our method outperforms the state-of-the-art methods by a significant margin on both datasets. 
We also do cross-dataset evaluation on the Buffy dataset [7] (without training on this dataset) and obtain strong results, which shows the ability of our model to generalize.\n\n2 The Model\n\nThe Graphical Model and its Variables: We represent human pose by a graphical model G = (V, E), where the nodes V specify the positions of the parts (or joints) and the edges E indicate which parts are spatially related. For simplicity, we impose that the graph structure forms a K-node tree, where K = |V|. The positions of the parts are denoted by l, where l_i = (x, y) specifies the pixel location of part i, for i ∈ {1, . . . , K}. For each edge in the graph (i, j) ∈ E, we specify a discrete set of spatial relationships indexed by t_ij, which corresponds to a mixture of different spatial relationships (see Figure 1). We denote the set of spatial relationships by t = {t_ij, t_ji | (i, j) ∈ E}. The image is written as I. We will define a score function F(l, t|I) as a sum of unary and pairwise terms.\n\nUnary Terms: The unary terms give local evidence for part i ∈ V to lie at location l_i and are based on the local image patch I(l_i). They are of the form:\n\nU(l_i|I) = w_i φ(i|I(l_i); θ),   (1)\n\nwhere φ(.|.; θ) is the (scalar-valued) appearance term with θ as its parameters (specified in the next section), and w_i is a scalar weight parameter.\n\nImage Dependent Pairwise Relational (IDPR) Terms: These IDPR terms capture our intuition that neighboring parts (i, j) ∈ E can roughly predict their relative spatial positions using only local information (see Figure 1). In our model, the relative positions of parts i and j are discretized into several types t_ij ∈ {1, . . . , T_ij} (i.e., a mixture of different relationships) with corresponding mean relative positions r_ij^{t_ij}, plus small deformations which are modeled by the standard quadratic deformation term. 
More formally, the pairwise relational score of each edge (i, j) ∈ E is given by:\n\nR(l_i, l_j, t_ij, t_ji|I) = ⟨w_ij^{t_ij}, ψ(l_j − l_i − r_ij^{t_ij})⟩ + w_ij ϕ(t_ij|I(l_i); θ) + ⟨w_ji^{t_ji}, ψ(l_i − l_j − r_ji^{t_ji})⟩ + w_ji ϕ(t_ji|I(l_j); θ),   (2)\n\nwhere ψ(Δl = [Δx, Δy]) = [Δx Δx² Δy Δy²]^T are the standard quadratic deformation features, ϕ(.|.; θ) is the Image Dependent Pairwise Relational (IDPR) term with θ as its parameters (specified in the next section), and w_ij^{t_ij}, w_ij, w_ji^{t_ji}, w_ji are the weight parameters. The notation ⟨., .⟩ specifies the dot product and boldface indicates a vector.\n\nThe Full Score: The full score F(l, t|I) is a function of the part locations l, the pairwise relation types t, and the input image I. It is expressed as the sum of the unary and pairwise terms:\n\nF(l, t|I) = Σ_{i∈V} U(l_i|I) + Σ_{(i,j)∈E} R(l_i, l_j, t_ij, t_ji|I) + w_0,   (3)\n\nwhere w_0 is a scalar weight on constant 1 (i.e., the bias term).\n\nThe model consists of three sets of parameters: the mean relative positions r = {r_ij^{t_ij}, r_ji^{t_ji} | (i, j) ∈ E} of the different pairwise relation types; the parameters θ of the appearance terms and IDPR terms; and the weight parameters w (i.e., w_i, w_ij^{t_ij}, w_ij, w_ji^{t_ji}, w_ji and w_0). See Section 4 for the learning of these parameters.\n\n2.1 Image Dependent Terms and DCNNs\n\nThe appearance terms and IDPR terms depend on the image patches. In other words, a local image patch I(l_i) not only gives evidence for the presence of a part i, but also about the relationship t_ij between it and its neighbors j ∈ N(i), where j ∈ N(i) if, and only if, (i, j) ∈ E. This requires us to learn a distribution for the state variables i, t_ij conditioned on the image patches I(l_i). 
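As a concrete illustration, the quadratic deformation features ψ and the per-edge score of Equation 2 can be sketched in a few lines of Python. This is a minimal sketch with hypothetical helper names, not the authors' implementation; the IDPR log-probability values and the relation types are assumed to be fixed and precomputed.

```python
def deformation_features(dx, dy):
    # psi(dl) = [dx, dx^2, dy, dy^2] from Equation 2
    return [dx, dx * dx, dy, dy * dy]

def edge_score(li, lj, r_ij, r_ji, w_def_ij, w_def_ji, w_ij, w_ji, idpr_ij, idpr_ji):
    # R(l_i, l_j, t_ij, t_ji | I) for one edge with the types already chosen:
    # two deformation dot-products plus the two weighted IDPR log-terms.
    d_ij = deformation_features(lj[0] - li[0] - r_ij[0], lj[1] - li[1] - r_ij[1])
    d_ji = deformation_features(li[0] - lj[0] - r_ji[0], li[1] - lj[1] - r_ji[1])
    dot = lambda a, b: sum(x * y for x, y in zip(a, b))
    return dot(w_def_ij, d_ij) + w_ij * idpr_ij + dot(w_def_ji, d_ji) + w_ji * idpr_ji
```

With the deformation weights negative (penalizing displacement from the mean relative position), the score is highest when part j sits exactly at l_i + r_ij^{t_ij}.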
In order to specify this distribution we must define the state space more precisely, because the number of pairwise spatial relationships varies for different parts with different numbers of neighbors (see Figure 1), and we also need to consider the possibility that the patch does not contain a part.\n\nWe define c to be the random variable which denotes which part is present, c = i for i ∈ {1, . . . , K}, or c = 0 if no part is present (i.e., the background). We define m_cN(c) to be the random variable that determines the spatial relation types of c and takes values in M_cN(c). If c = i has one neighbor j (e.g., the wrist), then M_iN(i) = {1, . . . , T_ij}. If c = i has two neighbors j and k (e.g., the elbow), then M_iN(i) = {1, . . . , T_ij} × {1, . . . , T_ik}. If c = 0, then we define M_0N(0) = {0}.\n\nThe full space is represented as:\n\nS = ∪_{c=0}^{K} {c} × M_cN(c)   (4)\n\nThe size of the space is |S| = Σ_{c=0}^{K} |M_cN(c)|. Each element in this space corresponds to a part with all the types of its pairwise relationships, or the background.\n\nWe use a DCNN [12] to learn the conditional probability distribution p(c, m_cN(c)|I(l_i); θ). A DCNN is suitable for this task because it is very efficient and enables us to share features. See Section 4 for more details.\n\nWe specify the appearance terms φ(.|.; θ) and IDPR terms ϕ(.|.; θ) in terms of p(c, m_cN(c)|I(l_i); θ) by marginalization:\n\nφ(i|I(l_i); θ) = log(p(c = i|I(l_i); θ))   (5)\nϕ(t_ij|I(l_i); θ) = log(p(m_ij = t_ij|c = i, I(l_i); θ))   (6)\n\n2.2 Relationship to other models\n\nWe now briefly discuss how our method relates to standard models.\n\nPictorial Structure: We recover pictorial structure models [6] by only allowing one relationship type (i.e., T_ij = 1). In this case, our IDPR term conveys no information. Our model reduces to standard unary and (image independent) pairwise terms. 
The only slight difference is that we use a DCNN to learn the unary terms instead of using HOG filters.\n\nMixtures-of-parts: [27] describes a model with a mixture of templates for each part, where each template is called a "type" of the part. The "type" of each part is defined by its relative position with respect to its parent. This can be obtained by restricting each part in our model to only predict the relative position of its parent (i.e., T_ij = 1, if j is not the parent of i). In this case, each part is associated with only one informative IDPR term, which can be merged with the appearance term of each part to define the different "types" of part in [27]. Also, this method does not use DCNNs.\n\nConditional Random Fields (CRFs): Our model is also related to the conditional random field literature on data-dependent priors [18, 13, 15, 19]. The data-dependent priors and unary terms are typically modeled separately in CRFs. In this paper, we efficiently model all the image dependent terms (i.e., unary terms and IDPR terms) together in a single DCNN by exploiting the fact that local image measurements are reliable for predicting both the presence of a part and the pairwise relationships of a part with its neighbors.\n\n3 Inference\n\nTo detect the optimal configuration for each person, we search for the configurations of the locations l and types t that maximize the score function: (l*, t*) = arg max_{l,t} F(l, t|I). Since our relational graph is a tree, this can be done efficiently via dynamic programming.\n\nLet K(i) be the set of children of part i in the graph (K(i) = ∅, if part i is a leaf), and S_i(l_i|I) be the maximum score of the subtree rooted at part i with part i located at l_i. 
The maximum score of each subtree can be computed as follows:\n\nS_i(l_i|I) = U(l_i|I) + Σ_{k∈K(i)} max_{l_k, t_ik, t_ki} (R(l_i, l_k, t_ik, t_ki|I) + S_k(l_k|I))   (7)\n\nUsing Equation 7, we can recursively compute the overall best score of the model, and the optimal configuration of locations and types can be recovered by the standard backward pass of dynamic programming.\n\nComputation: Since our pairwise term is a quadratic function of the locations l_i and l_j, the max operation over l_k in Equation 7 can be accelerated by using the generalized distance transforms [6]. The resulting approach is very efficient, taking O(T²LK) time once the image dependent terms are computed, where T is the number of relation types, L is the total number of locations, and K is the total number of parts in the model. This analysis assumes that all the pairwise spatial relationships have the same number of types, i.e., T_ij = T_ji = T, ∀(i, j) ∈ E.\n\nThe computation of the image dependent terms is also efficient. They are computed over all the locations by a single DCNN. Applying a DCNN in a sliding fashion is inherently efficient, since the computations common to overlapping regions are naturally shared [22].\n\n4 Learning\n\nNow we consider the problem of learning the model parameters from images with labeled part locations, which is the data available in most human pose datasets [17, 7, 10, 20]. We derive the type labels t_ij from the part location annotations and adopt a supervised approach to learn the model. Our model consists of three sets of parameters: the mean relative positions r of the different pairwise relation types; the parameters θ of the image dependent terms; and the weight parameters w. 
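As an aside, the subtree recursion of Equation 7 (Section 3) can be sketched as a brute-force tree recursion in Python. This is a hypothetical illustration with made-up helper names, without the generalized distance-transform speed-up; the pairwise score R is assumed to be already maximized over the relation types.

```python
def subtree_score(i, li, children, locations, U, R):
    # Equation 7: S_i(l_i) = U(i, l_i) + sum over children k of
    #             max over l_k of [ R(i, l_i, k, l_k) + S_k(l_k) ]
    # children[i]: list of child part ids; locations[k]: candidate locations of k;
    # U(i, li): unary score; R(i, li, k, lk): pairwise score (types maximized out).
    total = U(i, li)
    for k in children.get(i, []):
        total += max(R(i, li, k, lk) + subtree_score(k, lk, children, locations, U, R)
                     for lk in locations[k])
    return total
```

Evaluating the root over all its candidate locations and backtracking the argmax choices recovers (l*, t*), as in the standard backward pass described above.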
They are learnt separately, by the K-means algorithm for r, a DCNN for θ, and an S-SVM for w.\n\nMean Relative Positions and Type Labels: Given the labeled positive images {(I^n, l^n)}_{n=1}^{N}, let d_ij be the relative position from part i to its neighbor j. We cluster the relative positions over the training set {d_ij^n}_{n=1}^{N} to get T_ij clusters (in the experiments T_ij = 11 for all pairwise relations). Each cluster corresponds to a set of instances of part i that share a similar spatial relationship with its neighbor part j. Thus we define each cluster as a pairwise relation type t_ij from part i to j in our model, and use the center of each cluster as the mean relative position r_ij^{t_ij} associated with each type. In this way, the mean relative positions of the different pairwise relation types are learnt, and the type label t_ij^n for each training instance is derived from its cluster index. We use K-means in our experiments by setting K = T_ij to do the clustering.\n\nParameters of Image Dependent Terms: After deriving the type labels, each local image patch I(l^n) centered at an annotated part location is labeled with a category label c^n ∈ {1, . . . , K}, that indicates which part is present, and also the type labels m^n_{c^nN(c^n)} that indicate its relation types with all its neighbors. In this way, we get a set of labelled patches {I(l^n), c^n, m^n_{c^nN(c^n)}}_{n=1}^{KN} from the positive images (each positive image provides K part patches), and also a set of background patches {I(l^n), 0, 0} sampled from negative images.\n\nGiven the labelled part patches and background patches, we train a multi-class DCNN classifier by standard stochastic gradient descent using the softmax loss. The DCNN consists of five convolutional layers, two max-pooling layers and three fully-connected layers with a final |S|-dimensional softmax output, which is defined as our conditional probability distribution, i.e., p(c, m_cN(c)|I(l_i); θ). 
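The marginalization of Equations 5 and 6, which turns this softmax output into the appearance and IDPR log-terms, can be sketched as follows. Function and variable names are hypothetical, and for simplicity the sketch treats a part with a single neighbor, so m is a scalar type label.

```python
import math

def image_dependent_terms(p, part, types):
    # p: softmax output as a dict mapping (c, m) -> probability over the space S.
    # Equation 5: appearance term = log p(c = part | patch), marginalized over m.
    # Equation 6: IDPR term = log p(m = t | c = part, patch) for each type t.
    p_part = sum(prob for (c, m), prob in p.items() if c == part)
    appearance = math.log(p_part)
    idpr = {t: math.log(sum(prob for (c, m), prob in p.items()
                            if c == part and m == t) / p_part)
            for t in types}
    return appearance, idpr
```

For a part with two neighbors, m would be a pair (t_ij, t_ik) and the conditional for t_ij would additionally marginalize over t_ik.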
The architecture of our network is summarized in Figure 2.\n\nWeight Parameters: Each pose in the positive images is now labeled with annotated part locations and derived type labels: (I^n, l^n, t^n). We use an S-SVM to learn the weight parameters w. The structured prediction problem is simplified by using a 0−1 loss, that is, the training examples either have all dimensions of their labels correct or all dimensions of their labels wrong. We denote the former as pos examples, and the latter as neg examples. Since the full score function (Equation 3) is linear in the weight parameters w, we write the optimization function as:\n\nmin_w (1/2)⟨w, w⟩ + C Σ_n max(0, 1 − y_n⟨w, Φ(I^n, l^n, t^n)⟩),   (8)\n\nwhere y_n ∈ {1, −1}, and Φ(I^n, l^n, t^n) is a sparse feature vector representing the n-th example and is the concatenation of the image dependent terms (calculated from the learnt DCNN), the spatial deformation features, and the constant 1. Here y_n = 1 if n ∈ pos, and y_n = −1 if n ∈ neg.\n\n5 Experiments\n\nThis section introduces the datasets, clarifies the evaluation metrics, describes our experimental setup, presents comparative evaluation results and gives diagnostic experiments.\n\n5.1 Datasets and Evaluation Metrics\n\nWe perform our experiments on two publicly available human pose estimation benchmarks: (i) the "Leeds Sports Poses" (LSP) dataset [10], which contains 1000 training and 1000 testing images from sport activities with annotated full-body human poses; (ii) the "Frames Labeled In Cinema" (FLIC) dataset [20], which contains 3987 training and 1016 testing images from Hollywood movies with annotated upper-body human poses. We follow previous work and use the observer-centric annotations on both benchmarks. 
To train our models, we also use the negative training images from the INRIAPerson dataset [3] (these images do not contain people).\n\nWe use the most popular evaluation metrics to allow comparison with previous work. Percentage of Correct Parts (PCP) is the standard evaluation metric on several benchmarks including the LSP dataset. However, as discussed in [27], there are several alternative interpretations of PCP that can lead to severely different results. In this paper, we use the stricter version unless otherwise stated, that is, we evaluate only a single highest-scoring estimation result for each test image, and a body part is considered correct if both of its segment endpoints (joints) lie within 50% of the ground-truth segment length from their annotated locations (each test image on the LSP dataset contains only one annotated person). We refer to this version of PCP as strict PCP.\n\nOn the FLIC dataset, we use both strict PCP and the evaluation metric specified with it [20]: Percentage of Detected Joints (PDJ). PDJ measures the performance using a curve of the percentage of correctly localized joints obtained by varying the localization precision threshold. The localization precision threshold is normalized by the scale (defined as the distance between the left shoulder and right hip) of each ground truth pose to make it scale invariant. There are multiple people in the FLIC images, so each ground truth person is also annotated with a torso detection box.\n\nFigure 2: Architectures of our DCNNs. The size of the input patch is 36 × 36 pixels on the LSP dataset, and 54 × 54 pixels on the FLIC dataset. The DCNNs consist of five convolutional layers, two max-pooling layers and three fully-connected (dense) layers with a final |S|-dimensional output. We use dropout, local response normalization (norm) and overlapping pooling (pool) as described in [12].
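The strict PCP criterion described above can be sketched as a small helper. This is a hypothetical illustration, not the official evaluation code.

```python
import math

def strict_pcp_correct(pred_a, pred_b, gt_a, gt_b, frac=0.5):
    # A limb is correct iff BOTH predicted endpoints (joints) lie within
    # frac (default 50%) of the ground-truth limb length of their annotations.
    limb = math.dist(gt_a, gt_b)
    return (math.dist(pred_a, gt_a) <= frac * limb and
            math.dist(pred_b, gt_b) <= frac * limb)
```

Under the looser Buffy-toolkit variant mentioned later (Section 5.4), the two endpoint distances would instead be averaged before thresholding.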
During evaluation, we return a single highest-scoring estimation result for each ground truth person by restricting our neck part to be localized inside a window defined by the provided torso box.\n\n5.2 Implementation details\n\nData Augmentation: Our DCNN has millions of parameters, while only a few thousand training images are available. In order to reduce overfitting, we augment the training data by rotating the positive training images through 360°. These images are also horizontally flipped to double the training images. This increases the number of training examples of body parts with different spatial relationships with their neighbors (see the elbows along the diagonal of the Left Panel in Figure 1). We hold out random positive images as a validation set for the DCNN training. The weight parameters w are also trained on this held-out set to reduce overfitting to the training data.\n\nNote that our DCNN is trained using local part patches and background patches instead of the whole images. This naturally increases the number of training examples by a factor of K (the number of parts). Although the number of dimensions of the DCNN's final output also increases linearly with the number of parts, the number of parameters increases only slightly in the last fully-connected layer. This is because most of the parameters are shared between different parts, and thus we can benefit from a larger K by having more training examples per parameter. In our experiments, we increase K by adding the midway points between annotated parts, which results in 26 parts on the LSP dataset and 18 parts on the FLIC dataset. Covering a person with more parts also reduces the distance between neighboring parts, which is good for modeling foreshortening [27].\n\nGraph Structure: We define a full-body graph structure for the LSP dataset, and an upper-body graph structure for the FLIC dataset, respectively. 
The graph connects the annotated parts and their midway points to form a tree (see the skeletons in Figure 5 for clarification).\n\nSettings: We use the same number of types for all pairs of neighbors for simplicity. We set it to 11 on all datasets (i.e., T_ij = T_ji = 11, ∀(i, j) ∈ E), which results in 11 spatial relation types for the parts with one neighbor (e.g., the wrist), 11² spatial relation types for the parts with two neighbors (e.g., the elbow), and so forth (recall Figure 1). The patch size of each part is set to 36 × 36 pixels on the LSP dataset, and 54 × 54 pixels on the FLIC dataset, as the FLIC images are of higher resolution. We use similar DCNN architectures on both datasets, which differ in the first layer because of the different input patch sizes. Figure 2 visualizes the architectures we used. We use the Caffe [9] implementation of DCNNs in our experiments.\n\n5.3 Benchmark results\n\nWe show strict PCP results on the LSP dataset in Table 1, and on the FLIC dataset in Table 2. We also show PDJ results on the FLIC dataset in Figure 3. As is shown, our method outperforms the state-of-the-art methods by a significant margin on both datasets (see the captions for detailed analysis). Figure 5 shows some estimation examples on the LSP and FLIC datasets.\n\nMethod | Torso | Head | U.arms | L.arms | U.legs | L.legs | Mean\nOurs | 92.7 | 87.8 | 69.2 | 55.4 | 82.9 | 77.0 | 75.0\nPishchulin et al. [16] | 88.7 | 85.6 | 61.5 | 44.9 | 78.8 | 73.4 | 69.2\nOuyang et al. [14] | 85.8 | 83.1 | 63.3 | 46.6 | 76.5 | 72.2 | 68.6\nDeepPose* [23] | - | - | 56 | 38 | 77 | 71 | -\nPishchulin et al. [15] | 87.5 | 78.1 | 54.2 | 33.9 | 75.7 | 68.0 | 62.9\nEichner&Ferrari [4] | 86.2 | 80.1 | 56.5 | 37.4 | 74.3 | 69.3 | 64.3\nYang&Ramanan [26] | 84.1 | 77.1 | 52.5 | 35.9 | 69.5 | 65.6 | 60.8\n\nTable 1: Comparison of strict PCP results on the LSP dataset. Our method improves on all parts by a significant margin, and outperforms the best previously published result [16] by 5.8% on average. Note that DeepPose uses Person-Centric annotations and is trained with an extra 10,000 images.\n\nMethod | U.arms | L.arms | Mean\nOurs | 97.0 | 86.8 | 91.9\nMODEC [20] | 84.4 | 52.1 | 68.3\n\nTable 2: Comparison of strict PCP results on the FLIC dataset. Our method significantly outperforms MODEC [20].\n\nFigure 3: Comparison of PDJ curves of elbows and wrists on the FLIC dataset. The legend shows the PDJ numbers at the threshold of 0.2.\n\n5.4 Diagnostic Experiments\n\nWe perform diagnostic experiments to show the cross-dataset generalization ability of our model, and to better understand the influence of each term in our model.\n\nCross-dataset Generalization: We directly apply the model trained on the FLIC dataset to the official test set of the Buffy dataset [7] (i.e., no training on the Buffy dataset), which also contains upper-body human poses. The Buffy test set includes a subset of people whose upper bodies can be detected. We get the newest detection windows from [5], and compare our results to previously published work on this subset.\n\nMost previous work was evaluated with the official evaluation toolkit of Buffy, which uses a less strict PCP implementation¹. We refer to this version of PCP as Buffy PCP and report it along with the strict PCP in Table 3. We also show the PDJ curves in Figure 4. 
As is shown by both criteria, our method significantly outperforms the state of the art, which shows the good generalization ability of our method. Note that both DeepPose [23] and our method are trained on the FLIC dataset. Compared with Figure 3, the margin between our method and DeepPose increases significantly in Figure 4, which implies that our model generalizes better to the Buffy dataset.\n\nMethod | U.arms | L.arms | Mean\nOurs* | 96.8 | 89.0 | 92.9\nOurs* strict | 94.5 | 84.1 | 89.3\nYang [27] | 97.8 | 68.6 | 83.2\nYang [27] strict | 94.3 | 57.5 | 75.9\nSapp [21] | 95.3 | 63.0 | 79.2\nFLPM [11] | 93.2 | 60.6 | 76.9\nEichner [5] | 93.2 | 60.3 | 76.8\n\nTable 3: Cross-dataset PCP results on the Buffy test subset. The PCP numbers are Buffy PCP unless otherwise stated. Note that our method is trained on the FLIC dataset.\n\nFigure 4: Cross-dataset PDJ curves on the Buffy test subset. The legend shows the PDJ numbers at the threshold of 0.2. Note that both our method and DeepPose [23] are trained on the FLIC dataset.\n\n¹A part is considered correctly localized if the average distance between its endpoints (joints) and the ground truth is less than 50% of the length of the ground-truth annotated segment.\n\n[Figure 3/4 plot data, PDJ at threshold 0.2 — Figure 3 (FLIC): elbows MODEC 75.5%, DeepPose 91.0%, Ours 94.9%; wrists MODEC 57.9%, DeepPose 80.9%, Ours 92.0%. Figure 4 (Buffy): elbows Yang 80.4%, MODEC 77.0%, DeepPose* 83.4%, Ours* 93.2%; wrists Yang 57.4%, MODEC 58.8%, DeepPose* 64.6%, Ours* 89.4%.]\n\nMethod | Torso | Head | U.arms | L.arms | U.legs | L.legs | Mean\nUnary-Only | 56.3 | 66.4 | 28.9 | 15.5 | 50.8 | 45.9 | 40.5\nNo-IDPRs | 87.4 | 74.8 | 60.7 | 43.0 | 73.2 | 65.1 | 64.6\nFull Model | 92.7 | 87.8 | 69.2 | 55.4 | 82.9 | 77.0 | 75.0\n\nTable 4: Diagnostic term analysis with strict PCP results on the LSP dataset. The unary term alone is still not powerful enough to get good results, even though it is trained by a DCNN classifier. The No-IDPRs method, whose pairwise terms are not dependent on the image (see Terms Analysis in Section 5.4), achieves performance comparable with the state of the art, and adding IDPR terms significantly boosts our final performance to 75.0%.\n\nTerms Analysis: We design two experiments to better understand the influence of each term in our model. In the first experiment, we use only the unary terms, so all the parts are localized independently. In the second experiment, we replace the IDPR terms with image independent priors (i.e., in Equation 2, w_ij ϕ(t_ij|I(l_i); θ) and w_ji ϕ(t_ji|I(l_j); θ) are replaced with scalar prior terms b_ij^{t_ij} and b_ji^{t_ji} respectively), and retrain the weight parameters along with the new prior terms. In this case, our pairwise relational terms do not depend on the image, but are instead a mixture of Gaussian deformations with image independent biases. We refer to the first experiment as Unary-Only and the second one as No-IDPRs, short for No IDPR terms. The experiments are done on the LSP dataset using identical appearance terms for a fair comparison. We show strict PCP results in Table 4. As is shown, all terms in our model significantly improve the performance (see the caption for details).\n\n6 Conclusion\n\nWe have presented a graphical model for human pose which exploits the fact that local image measurements can be used both to detect parts (or joints) and also to predict the spatial relationships between them (Image Dependent Pairwise Relations). 
These spatial relationships are represented by a mixture model over the types of spatial relationships. We use DCNNs to learn conditional probabilities for the presence of parts and their spatial relationships within image patches. Hence our model combines the representational flexibility of graphical models with the efficiency and statistical power of DCNNs. Our method outperforms the state-of-the-art methods on the LSP and FLIC datasets and also performs very well on the Buffy dataset without any training.\n\nFigure 5: Results on the LSP and FLIC datasets. We show the part localization results along with the graph skeleton we used in the model. The last row shows some failure cases, which are typically due to large foreshortening, occlusions and distractions from clothing or overlapping people.\n\n7 Acknowledgements\n\nThis research has been supported by grants ONR MURI N000014-10-1-0933, ONR N00014-12-1-0883 and ARO 62250-CS. The GPUs used in this research were generously donated by the NVIDIA Corporation.\n\nReferences\n[1] X. Chen, R. Mottaghi, X. Liu, S. Fidler, R. Urtasun, and A. Yuille. Detect what you can: Detecting and representing objects using holistic models and body parts. In Computer Vision and Pattern Recognition (CVPR), 2014.\n[2] N.-G. Cho, A. L. Yuille, and S.-W. Lee. Adaptive occlusion state estimation for human pose tracking under self-occlusions. Pattern Recognition, 2013.\n[3] N. Dalal and B. Triggs. Histograms of oriented gradients for human detection. In Computer Vision and Pattern Recognition (CVPR), 2005.\n[4] M. Eichner and V. Ferrari. Appearance sharing for collective human pose estimation. In Asian Conference on Computer Vision (ACCV), 2012.\n[5] M. Eichner, M. Marin-Jimenez, A. Zisserman, and V. Ferrari. 2d articulated human pose estimation and retrieval in (almost) unconstrained still images. International Journal of Computer Vision (IJCV), 2012.\n[6] P. F. Felzenszwalb and D. P. Huttenlocher. Pictorial structures for object recognition. International Journal of Computer Vision (IJCV), 2005.\n[7] V. Ferrari, M. Marin-Jimenez, and A. Zisserman. Progressive search space reduction for human pose estimation. In Computer Vision and Pattern Recognition (CVPR), 2008.\n[8] M. A. Fischler and R. A. Elschlager. The representation and matching of pictorial structures. IEEE Transactions on Computers, 1973.\n[9] Y. Jia. Caffe: An open source convolutional architecture for fast feature embedding. http://caffe.berkeleyvision.org/, 2013.\n[10] S. Johnson and M. Everingham. Clustered pose and nonlinear appearance models for human pose estimation. In British Machine Vision Conference (BMVC), 2010.\n[11] L. Karlinsky and S. Ullman. Using linking features in learning non-parametric part models. In European Conference on Computer Vision (ECCV), 2012.\n[12] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Neural Information Processing Systems (NIPS), 2012.\n[13] J. Lafferty, A. McCallum, and F. C. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In International Conference on Machine Learning (ICML), 2001.\n[14] W. Ouyang, X. Chu, and X. Wang. Multi-source deep learning for human pose estimation. In Computer Vision and Pattern Recognition (CVPR), 2014.\n[15] L. Pishchulin, M. Andriluka, P. Gehler, and B. Schiele. Poselet conditioned pictorial structures. In Computer Vision and Pattern Recognition (CVPR), 2013.\n[16] L. Pishchulin, M. Andriluka, P. Gehler, and B. Schiele. Strong appearance and expressive spatial models for human pose estimation. In International Conference on Computer Vision (ICCV), 2013.\n[17] D. Ramanan. Learning to parse images of articulated bodies. In Neural Information Processing Systems (NIPS), 2006.\n[18] C. Rother, V. Kolmogorov, and A. 
Blake. Grabcut: Interactive foreground extraction using iterated graph\n\ncuts. In ACM Transactions on Graphics (TOG), 2004.\n\n[19] B. Sapp, C. Jordan, and B. Taskar. Adaptive pose priors for pictorial structures. In Computer Vision and\n\nPattern Recognition (CVPR), 2010.\n\n[20] B. Sapp and B. Taskar. Modec: Multimodal decomposable models for human pose estimation. In Com-\n\nputer Vision and Pattern Recognition (CVPR), 2013.\n\n[21] B. Sapp, A. Toshev, and B. Taskar. Cascaded models for articulated pose estimation.\n\nConference on Computer Vision (ECCV), 2010.\n\nIn European\n\n[22] P. Sermanet, D. Eigen, X. Zhang, M. Mathieu, R. Fergus, and Y. LeCun. Overfeat: Integrated recogni-\ntion, localization and detection using convolutional networks. In International Conference on Learning\nRepresentations (ICLR), 2014.\n\n[23] A. Toshev and C. Szegedy. Deeppose: Human pose estimation via deep neural networks. In Computer\n\nVision and Pattern Recognition (CVPR), 2014.\n\n[24] I. Tsochantaridis, T. Hofmann, T. Joachims, and Y. Altun. Support vector machine learning for interde-\npendent and structured output spaces. In International Conference on Machine Learning (ICML), 2004.\n[25] C. Wang, Y. Wang, and A. L. Yuille. An approach to pose-based action recognition. In Computer Vision\n\nand Pattern Recognition (CVPR), 2013.\n\n[26] Y. Yang and D. Ramanan. Articulated pose estimation with \ufb02exible mixtures-of-parts. In Computer Vision\n\nand Pattern Recognition (CVPR), 2011.\n\n[27] Y. Yang and D. Ramanan. Articulated human detection with \ufb02exible mixtures of parts. IEEE Transactions\n\non Pattern Analysis and Machine Intelligence (TPAMI), 2013.\n\n9\n\n\f", "award": [], "sourceid": 908, "authors": [{"given_name": "Xianjie", "family_name": "Chen", "institution": "UCLA"}, {"given_name": "Alan", "family_name": "Yuille", "institution": "UCLA"}]}