{"title": "(RF)^2 -- Random Forest Random Field", "book": "Advances in Neural Information Processing Systems", "page_first": 1885, "page_last": 1893, "abstract": "We combine random forest (RF) and conditional random field (CRF) into a new computational framework, called random forest random field (RF)^2. Inference of (RF)^2 uses the Swendsen-Wang cut algorithm, characterized by Metropolis-Hastings jumps. A jump from one state to another depends on the ratio of the proposal distributions, and on the ratio of the posterior distributions of the two states. Prior work typically resorts to a parametric estimation of these four distributions, and then computes their ratio. Our key idea is to instead directly estimate these ratios using RF. RF collects in leaf nodes of each decision tree the class histograms of training examples. We use these class histograms for a non-parametric estimation of the distribution ratios. We derive the theoretical error bounds of a two-class (RF)^2. (RF)^2 is applied to a challenging task of multiclass object recognition and segmentation over a random field of input image regions. In our empirical evaluation, we use only the visual information provided by image regions (e.g., color, texture, spatial layout), whereas the competing methods additionally use higher-level cues about the horizon location and 3D layout of surfaces in the scene. Nevertheless, (RF)^2 outperforms the state of the art on benchmark datasets, in terms of accuracy and computation time.", "full_text": "(RF)2 \u2014 Random Forest Random Field\n\nNadia Payet and Sinisa Todorovic\n\nSchool of Electrical Engineering and Computer Science\n\nOregon State University\n\npayetn@onid.orst.edu, sinisa@eecs.oregonstate.edu\n\nAbstract\n\nWe combine random forest (RF) and conditional random \ufb01eld (CRF) into a new\ncomputational framework, called random forest random \ufb01eld (RF)2. 
Inference of (RF)2 uses the Swendsen-Wang cut algorithm, characterized by Metropolis-Hastings jumps. A jump from one state to another depends on the ratio of the proposal distributions, and on the ratio of the posterior distributions of the two states. Prior work typically resorts to a parametric estimation of these four distributions, and then computes their ratio. Our key idea is to instead directly estimate these ratios using RF. RF collects in leaf nodes of each decision tree the class histograms of training examples. We use these class histograms for a non-parametric estimation of the distribution ratios. We derive the theoretical error bounds of a two-class (RF)2. (RF)2 is applied to a challenging task of multiclass object recognition and segmentation over a random field of input image regions. In our empirical evaluation, we use only the visual information provided by image regions (e.g., color, texture, spatial layout), whereas the competing methods additionally use higher-level cues about the horizon location and 3D layout of surfaces in the scene. Nevertheless, (RF)2 outperforms the state of the art on benchmark datasets, in terms of accuracy and computation time.\n\n1 Introduction\n\nThis paper presents a new computational framework, called random forest random field (RF)2, which provides a principled way to jointly reason about multiple, statistically dependent random variables and their attributes. We derive theoretical performance bounds of (RF)2, and demonstrate its utility on a challenging task of conjoint object recognition and segmentation.\n\nIdentifying subimage ownership among occurrences of distinct object classes in an image is a fundamental, and one of the most actively pursued problems in computer vision, machine learning, and artificial intelligence [1–11]. The goal is to assign the label of one of multiple semantic classes to each image pixel. 
Our approach builds on the following common recognition strategies: (i) Labels of neighboring image parts are likely to be correlated – one of the main principles of perceptual organization; and (ii) Recognized objects dictate which other objects to expect in the scene, and their scale and spatial configuration – one of the main principles of context-driven recognition that “binds” all object detections in a coherent scene interpretation. We formalize perceptual grouping and context by a graphical model aimed at capturing statistical dependencies among random variables (i.e., labels or attributes) associated with different pixel neighborhoods. Thus, we derive a unified framework for combined object recognition and segmentation, as a graph-structured prediction of all random variables in a single, consistent model of the scene.\n\nThe graphical model we use is Conditional Random Field (CRF) [12] – one of the most popular models for structured inference over pixels [2, 3], patches [4, 5], or image regions [6–8], for object recognition and segmentation. CRF defines a posterior distribution of hidden random variables Y (labels), given observed image features X, in a factored form: p(Y|X;\theta) = \frac{1}{Z(\theta)} \prod_c \psi_c(Y_c, X; \theta). Each potential \psi_c is a function over a subset Y_c \subseteq Y, conditioned on X, and parameterized by \theta. The potentials are often defined as linear functions of parameters, \psi_c(Y_c, X; \theta) = \theta^T \Psi_c, where \Psi_c is the output of some detectors over observables X [2–4]. This means that p(Y|X;\theta) is modeled as a log-linear function, which is not adequate when the detector outputs do not provide a linear separability of the classes. Learning \theta is hard, because computation of the partition function Z(\theta) is intractable for most graphs (except for chains and trees). 
Inference is typically posed as the joint MAP assignment that minimizes the energy \sum_c \psi_c(Y_c, X; \theta), which is also intractable for general graphs. The intractability of CRF learning and inference often motivates prior work to resort to approximate algorithms, e.g., graph-cuts, and loopy belief propagation (LBP). The effect of these approximations on the original semantics of CRF is poorly understood. For example, an approximate inference stuck in a local maximum may not represent the intended consistent scene interpretation.\n\nMotivation: Some of the aforementioned shortcomings can be addressed when CRF inference is conducted using the Metropolis-Hastings (MH) algorithm. MH draws samples Y^{(t)} from the CRF’s posterior, p(Y|X), and thus generates a Markov chain in which state Y^{(t+1)} depends only on the previous state Y^{(t)}. The jumps between the states are reversible, and governed by a proposal density q(Y^{(t)} \to Y^{(t+1)}). The proposal is accepted if the acceptance rate, \alpha, drawn from U(0,1), satisfies \alpha < \min\left\{1, \frac{q(Y^{(t+1)} \to Y^{(t)})\, p(Y^{(t+1)}|X)}{q(Y^{(t)} \to Y^{(t+1)})\, p(Y^{(t)}|X)}\right\}. MH provides strong theoretical guarantees of convergence to the globally optimal state. As can be seen, the entire inference process is regulated by ratios of the proposal and posterior distributions. Consequently, the bottleneck of every CRF learning and inference – namely, computing the partition function Z – is eliminated in MH.\n\nOur key idea is to directly estimate the ratios of the proposal and posterior distributions, instead of computing each individual distribution for conducting MH jumps. 
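Since the acceptance test uses only these two ratios, the partition function Z never needs to be computed. A minimal sketch of such a ratio-driven MH step (the `propose`, `ratio_q`, and `ratio_p` callables are illustrative placeholders, not the paper's implementation):

```python
import random

def mh_step(state, propose, ratio_q, ratio_p):
    """One Metropolis-Hastings jump.

    propose(state)    -> candidate next state
    ratio_q(new, old) -> q(new -> old) / q(old -> new)  (proposal ratio)
    ratio_p(new, old) -> p(new | X) / p(old | X)        (posterior ratio)
    Only the ratios are needed, so the partition function Z cancels out.
    """
    candidate = propose(state)
    accept_rate = min(1.0, ratio_q(candidate, state) * ratio_p(candidate, state))
    if random.random() < accept_rate:
        return candidate
    return state

# Toy usage: two states with p(1)/p(0) = 3 and a symmetric proposal;
# the chain should visit state 1 about 75% of the time.
random.seed(0)
state, visits = 0, [0, 0]
for _ in range(20000):
    state = mh_step(state,
                    propose=lambda s: 1 - s,
                    ratio_q=lambda new, old: 1.0,
                    ratio_p=lambda new, old: 3.0 if new == 1 else 1.0 / 3.0)
    visits[state] += 1
```

In the framework described here, the two ratio callables would correspond to the RF-based estimates of Sec. 4.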
Previous work on MH for CRFs usually commits to linear forms of the potential functions, and spends computational resources on estimating the four distributions: q(Y^{(t+1)} \to Y^{(t)}), q(Y^{(t)} \to Y^{(t+1)}), p(Y^{(t+1)}|X), and p(Y^{(t)}|X). In contrast, our goal is to directly estimate the two ratios, \frac{q(Y^{(t+1)} \to Y^{(t)})}{q(Y^{(t)} \to Y^{(t+1)})} and \frac{p(Y^{(t+1)}|X)}{p(Y^{(t)}|X)}, in a non-parametric manner, since the acceptance rate of MH jumps depends only on these ratios. To this end, we use the random forests (RF) [13]. Given a training set of labeled examples, RF grows many decision trees. We view the trees as a way of discriminatively structuring evidence about the class distributions in the training set. In particular, each leaf of each tree in RF stores a histogram of the number of training examples from each class that reached that leaf. When a new example is encountered, it is “dropped” down each of the trees in the forest, until it reaches a leaf in every tree. The class histograms stored in all these leaves can then be used as a robust estimate of the ratio of that example’s posterior distributions. This is related to recent work on Hough forests for object detection and localization [14], where leaves collect information on locations and sizes of bounding boxes of objects in training images. However, they use this evidence to predict a spatial distribution of bounding boxes in a test image, whereas we use the evidence stored in tree leaves to predict the distribution ratios. Evidence trees are also used in [15], but only as a first stage of a stacked-classifier architecture which replaces the standard majority voting of RF.\n\nRF is difficult to analyze [13, 16]. Regarding consistency of RF, it is known that their rate of convergence to the optimal Bayes’ rule depends only on the number of informative variables. 
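The leaf-histogram ratio estimate described above can be written compactly as follows (a sketch only: trees are abstracted as callables returning a leaf's class-count histogram, and the names are illustrative, not the authors' code):

```python
def posterior_ratio(x, label_b, label_a, trees):
    """Estimate p(y = label_b | x) / p(y = label_a | x) from a random forest.

    Each tree is a callable: tree(x) -> dict mapping a class label to the
    count of training examples of that class stored in the leaf x reaches.
    The ratio is formed directly from summed leaf counts, with no
    parametric model of either posterior.
    """
    count_b = sum(tree(x).get(label_b, 0) for tree in trees)
    count_a = sum(tree(x).get(label_a, 0) for tree in trees)
    return count_b / max(count_a, 1e-9)  # guard against an empty histogram

# Usage with two stub "trees": class 1 collects 20 examples in the reached
# leaves, class 0 collects 10, so the estimated ratio is 2.0.
stub_trees = [lambda x: {0: 5, 1: 15}, lambda x: {0: 5, 1: 5}]
ratio = posterior_ratio(None, 1, 0, stub_trees)
```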
It is also shown that RF that cuts down to pure leaves uses a weighted, layered, nearest neighbor rule [16]. We are not aware of any theoretical analysis of RF as an estimator of ratios of posterior distributions.\n\nContributions: We combine RF and CRF into a new, principled and elegant computational framework (RF)2. Learning is efficiently conducted by RF which collects the class histograms of training examples in leaf nodes of each decision tree. This evidence is then used for the non-parametric estimation of the ratios of the proposal and posterior distributions, required by MH-based inference of (RF)2. We derive the theoretical error bounds of estimating distribution ratios by a two-class RF, which is then used to derive the theoretical performance bounds of a two-class (RF)2.\n\nPaper Organization: Sections 2–4 specify the CRF model, its MH-based inference, and RF-based learning. Sections 5–6 present our experimental evaluation, and theoretical analysis of (RF)2.\n\n2 CRF Model\n\nWe formulate multiclass object recognition and segmentation as the MAP inference of a CRF, defined over a set of multiscale image regions. Regions are used as image features, because they are dimensionally matched with 2D object occurrences in the image, and thus facilitate modeling of various perceptual-organization and contextual cues (e.g., continuation, smoothness, containment, and adjacency) that are often used in recognition [6–11]. Access to regions is provided by the state-of-the-art, multiscale segmentation algorithm of [17], which detects and closes object (and object-part) boundaries using the domain knowledge. Since the right scale at which objects occur is unknown, we use all regions from all scales.\n\nThe extracted regions are organized in a graph, G = (V, E), where V and E are the sets of nodes and edges. The nodes i = 1, ..., N correspond to multiscale segments, and edges (i, j) \in E capture their spatial relations. 
Each node i is characterized by a descriptor vector, x_i, whose elements describe photometric and geometric properties of the corresponding region (e.g., color, shape, filter responses). A pair of regions can have one of the following relationships: (1) ascendent/descendent, (2) touching, and (3) far. Since the segmentation algorithm of [17] is strictly hierarchical, region i is descendent of region j, if i is fully embedded as subregion within ancestor j. Two regions i and j touch if they share a boundary part. Finally, if i and j are not in the hierarchical and touch relationships then they are declared as far. Edges connect all node pairs, E = V \times V, |E| = N^2. Each edge (i, j) is associated with a tag, e_{ij}, indicating the relationship type between i and j.\n\nCRF is defined as the graphical model over G. Let Y = {y_i} denote all random variables associated with the nodes, indicating the class label of the corresponding region, y_i \in {0, 1, ..., K}, where K denotes the total number of object classes, and label 0 is reserved for the background class. Let p_i = p(y_i|x_i) and p_{ij} = p(y_i, y_j|x_i, x_j, e_{ij}) denote the posterior distributions over nodes and pairs of nodes. Then, we define CRF as\n\np(Y|G) = \prod_{i \in V} p(y_i|x_i) \prod_{(i,j) \in E} p(y_i, y_j|x_i, x_j, e_{ij}) = \prod_{i \in V} p_i \prod_{(i,j) \in E} p_{ij} .    (1)\n\nMulti-coloring of CRF is defined as the joint MAP assignment Y^* = \arg\max_Y p(Y|G). In the following section, we explain how to conduct this inference.\n\n3 CRF Inference\n\nFor CRF inference, we use the Swendsen-Wang cut algorithm (SW-cut), presented in [18]. SW-cut iterates the Metropolis-Hastings (MH) reversible jumps through the following two steps. (1) Graph clustering: SW-cut probabilistically samples connected components, CC’s, where each CC represents a subset of nodes with the same color. 
This is done by probabilistically cutting\nedges between all graph nodes that have the same color based on their posterior distributions\npij = p(yi, yj|xi, xj, eij). (2) Graph relabeling: SW-cut randomly selects one of the CC\u2019s ob-\ntained in step (1), and randomly \ufb02ips the color of all nodes in that CC, and cuts their edges with the\nrest of the graph nodes having that same color. In each iteration, SW-cut probabilistically decides\nwhether to accept the new coloring of the selected CC, or to keep the previous state. Unlike other\nMCMC methods that consider one node at a time (e.g., Gibbs sampler), SW-cut operates on a num-\nber of nodes at once. Consequently, SW-cut converges faster and enables inference on relatively\nlarge graphs. Below, we review steps (1) and (2) of SW-cut, for completeness.\n\nIn step (1), edges of G are probabilistically sampled. This re-connects all nodes into new connected\ncomponents CC. If two nodes i and j have different labels, they cannot be in the same CC, so\ntheir edge remains intact. If i and j have the same label, their edge is probabilistically sampled\naccording to posterior distribution pij. If in the latter case edge (i, j) is not sampled, we say that\nit has been probabilistically \u201ccut\u201d. Step (1) results in a state A. In step (2), we choose at random\na connected component CC from step (1), and randomly reassign a new color to all nodes in that\nCC. To separate the re-colored CC from the rest of the graph, we cut existing edges that connect\nCC to the rest of the graph nodes with that same color. Step (2) results in a new state B. SW-cut\naccepts state B if the acceptance rate is suf\ufb01ciently large via a random thresholding. 
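Schematically, one SW-cut iteration combines the two steps above with an MH accept/reject decision. A sketch only, with the clustering and ratio computations left abstract as callables (names are illustrative, not the authors' implementation):

```python
import random

def swcut_iteration(labels, sample_cc, new_color, accept_ratio):
    """One SW-cut reversible jump over a labeling {node: color}.

    sample_cc(labels)               -> a connected component (set of nodes)
                                       obtained by probabilistically cutting
                                       same-color edges (step 1)
    new_color(labels, cc)           -> a randomly proposed color for the CC
                                       (step 2)
    accept_ratio(labels, cc, color) -> q(B->A)p(Y=B|G) / (q(A->B)p(Y=A|G))
    """
    cc = sample_cc(labels)                 # step (1): graph clustering
    color = new_color(labels, cc)          # step (2): graph relabeling
    alpha = min(1.0, accept_ratio(labels, cc, color))
    if random.random() < alpha:            # MH accept/reject
        labels = dict(labels)              # keep the old state intact
        for node in cc:
            labels[node] = color
    return labels

# Toy usage: ratio > 1 forces acceptance, so the sampled CC {0, 1} is recolored.
start = {0: 0, 1: 0, 2: 0}
result = swcut_iteration(start,
                         sample_cc=lambda L: {0, 1},
                         new_color=lambda L, cc: 5,
                         accept_ratio=lambda L, cc, c: 10.0)
```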
Let q(A \to B) be the proposal probability for moving from state A to B, and let q(B \to A) denote the converse. The acceptance rate, \alpha(A \to B), of the move from A to B is defined as\n\n\alpha(A \to B) = \min\left(1, \frac{q(B \to A)\, p(Y = B|G)}{q(A \to B)\, p(Y = A|G)}\right) .    (2)\n\nThe computation complexity of each move is relatively low. The ratio \frac{q(B \to A)}{q(A \to B)} in (2) involves only those edges that are “cut” around CC in states A and B – not all edges. Also, the ratio \frac{p(Y = B|G)}{p(Y = A|G)} accounts only for the recolored nodes in CC – not the entire graph G, since all other probabilities have not changed from state A to state B. Thus, from Eq. (1), the ratios of the proposal and posterior distributions characterizing states A and B can be specified as\n\n\frac{q(B \to A)}{q(A \to B)} = \frac{\prod_{(i,j) \in Cut_B} (1 - p^B_{ij})}{\prod_{(i,j) \in Cut_A} (1 - p^A_{ij})} , \quad \text{and} \quad \frac{p(Y = B|G)}{p(Y = A|G)} = \prod_{i \in CC} \frac{p^B_i}{p^A_i} \cdot \prod_{j \in N(i)} \frac{p^B_{ij}}{p^A_{ij}} ,    (3)\n\nwhere Cut_A and Cut_B denote the sets of “cut” edges in states A and B, and N(i) is the set of neighbors of node i, N(i) = {j : j \in V, (i, j) \in E}.\n\nAs shown in [18], SW-cut is relatively insensitive to different initializations. In our experiments, we initialize all nodes in the CRF with label 0. Next, we show how to compute the ratios in Eq. (3).\n\n4 Learning\n\nRF can be used for estimating the ratios of the proposal and posterior distributions, given by Eq. (3), since RF provides near Bayesian optimal decisions, as theoretically shown by Breiman [13]. In the following, we describe how to build RF, and use it for computing the ratios in Eq. (3).\n\nOur training data represent a set of M labeled regions. If region i falls within the bounding box of an object in class y \in {1, 2, ..., K}, it receives label y. 
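For reference, the two ratios of Eq. (3) amount to simple products over the cut edges and the recolored component. A sketch with per-node and per-edge probabilities stored in plain dicts (illustrative helpers, not the authors' code):

```python
def proposal_ratio(cut_b, cut_a, p_edge_b, p_edge_a):
    """q(B->A)/q(A->B) from Eq. (3): products of (1 - p_ij) over cut edges."""
    num = 1.0
    for e in cut_b:
        num *= 1.0 - p_edge_b[e]
    den = 1.0
    for e in cut_a:
        den *= 1.0 - p_edge_a[e]
    return num / den

def posterior_ratio_cc(cc, neighbors, p_node_b, p_node_a, p_edge_b, p_edge_a):
    """p(Y=B|G)/p(Y=A|G) from Eq. (3): only the recolored CC contributes."""
    r = 1.0
    for i in cc:
        r *= p_node_b[i] / p_node_a[i]
        for j in neighbors[i]:
            r *= p_edge_b[(i, j)] / p_edge_a[(i, j)]
    return r

# Toy usage with a single cut edge and a one-node CC.
q_ratio = proposal_ratio([("x", "y")], [("x", "y")],
                         {("x", "y"): 0.5}, {("x", "y"): 0.75})
p_ratio = posterior_ratio_cc([1], {1: [2]},
                             {1: 0.8}, {1: 0.4},
                             {(1, 2): 0.5}, {(1, 2): 0.25})
```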
If i covers a number of bounding boxes of different classes then i is added to the training set multiple times to account for all distinct class labels it covers. Each region i is characterized by a d-dimensional descriptor vector, x_i \in R^d, which encodes the photometric and geometric properties of i. The training dataset {(x_i, y_i) : i = 1, ..., M} is used to learn an ensemble of T decision trees representing RF.\n\nIn particular, each training sample is passed through every decision tree from the ensemble until it reaches a leaf node. Each leaf l records a class histogram, \Phi_l = \{\phi_l(y) : y = 1, ..., K\}, where \phi_l(y) counts the number of training examples belonging to class y that reached l. The total number of training examples in l is then \|\Phi_l\|. Also, for each pair of leaves (l, l'), we record a two-class histogram, \Psi_{ll'} = \{\psi_{ll'}(y, y', e) : y, y' = 1, ..., K; e = 1, 2, 3\}, where \psi_{ll'}(y, y', e) counts the number of pairs of training examples belonging to classes y and y' that reached leaves l and l', and also have the relationship type e – namely, ascendent/descendent, touching, or far relationship.\n\nGiven \Phi_l and \Psi_{ll'}, we are in a position to estimate the ratios of the proposal and posterior distributions, defined in (3), which control the Metropolis-Hastings jumps in the SW-cut. Suppose two regions, represented by their descriptors x_i and x_j, are labeled as y^A_i and y^A_j in state A, and y^B_i and y^B_j in state B of one iteration of the SW-cut. Also, after passing x_i and x_j through T decision trees of the learned RF, suppose they reached leaves l^t_i and l^t_j in each tree t = 1, ..., T. 
Then, we compute\n\n\frac{p^B_i}{p^A_i} = \frac{\sum_{t=1}^T \phi_{l^t_i}(y^B_i)}{\sum_{t=1}^T \phi_{l^t_i}(y^A_i)} , \quad \frac{p^B_{ij}}{p^A_{ij}} = \frac{\sum_{t=1}^T \psi_{l^t_i l^t_j}(y^B_i, y^B_j, e_{ij})}{\sum_{t=1}^T \psi_{l^t_i l^t_j}(y^A_i, y^A_j, e_{ij})} , \quad \text{for estimating} \quad \frac{p(Y = B|G)}{p(Y = A|G)} .    (4)\n\nTo estimate the ratio of the proposal distributions, \frac{q(B \to A)}{q(A \to B)}, it is necessary to compute each individual probability p_{ij}, since the numerator and denominator of \frac{q(B \to A)}{q(A \to B)} do not contain the same set of “cut” edges, Cut_A \neq Cut_B, as specified in (3). Thus, we compute\n\np_{ij} = \frac{\sum_{t=1}^T \psi_{l^t_i l^t_j}(y_i, y_j, e_{ij})}{\sum_{t=1}^T \|\Phi_{l^t_i}\| \|\Phi_{l^t_j}\|} , \quad \text{for estimating} \quad \frac{q(B \to A)}{q(A \to B)} .    (5)\n\nIn the following, we first present our empirical evaluation of (RF)2, and then derive the theoretical performance bounds of a simple, two-class (RF)2.\n\n5 Results\n\n(RF)2 is evaluated on the task of object recognition and segmentation on two benchmark datasets. First, the MSRC dataset consists of 591 images showing objects from 21 categories [3]. We use the standard split of MSRC into training and test images [3]. Second, the Street-Scene dataset consists of 3547 images of urban environments, and has manually annotated regions [6, 19]. As in [6], one fifth of the Street-Scene images are used for testing, and the rest, for training. Both datasets provide labels of bounding boxes around object occurrences as ground truth.\n\nImages are segmented using the multiscale segmentation algorithm of [17], which uses the perceptual significance of a region boundary, Pb \in [0, 100], as an input parameter. We vary Pb = 30:10:150, and thus obtain a hierarchy of regions for each image. 
A region is characterized by a\ndescriptor vector consisting of the following properties: (i) 30-bin color histogram in the CIELAB\nspace; (ii) 250-dimensional histogram of \ufb01lter responses of the MR8 \ufb01lter bank, and the Laplacian of\nGaussian \ufb01lters computed at each pixel, and mapped to 250 codewords whose dictionary is obtained\nby K-means over all training images; (iii) 128-dimensional region boundary descriptor measuring\noriented contour energy along 8 orientations of each cell of a 4 \u00d7 4 grid overlaid over the region\u2019s\nbounding box; (iv) coordinates of the region\u2019s centroid normalized to the image size. Regions ex-\ntracted from training images are used for learning RF. A region that falls within a bounding box is\nassigned the label of that box. If a region covers a number of bounding boxes of different classes,\nit is added to the training set multiple times to account for each distinct label. We use the standard\nrandom splits of training data to train 100 decision trees of RF, constructed in the top-down way.\nThe growth of each tree is constrained so its depth is less than 30, and its every leaf node contains at\nleast 20 training examples. To recognize and segment objects in a new test image, we \ufb01rst extract a\nhierarchy of regions from the image by the segmentation algorithm of [17]. Then, we build the fully\nconnected CRF graph from the extracted regions (Sec. 2), and run the SW-cut inference (Sec. 4).\n\nWe examine the following three variants of (RF)2: (RF)2-1 \u2014 The spatial relationships of regions,\neij, are not accounted for when computing pij in Eq. (4) and Eq. (5); (RF)2-2 \u2014 The region rela-\ntionships touching and far are considered, while the ascendent/descendent relationship is not cap-\ntured; and (RF)2-3 \u2014 All three types of region layout and structural relationships are modeled. 
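For concreteness, the region descriptor enumerated above concatenates to d = 30 + 250 + 128 + 2 = 410 dimensions. A tally sketch (the helper is illustrative; the actual MR8/LoG filter-bank and contour-energy feature extraction is not reimplemented here):

```python
def region_descriptor(color_hist, texture_hist, boundary_desc, centroid):
    """Concatenate the four region properties into one descriptor vector.

    color_hist    : 30-bin CIELAB color histogram
    texture_hist  : 250-bin codeword histogram of filter responses
    boundary_desc : 128-dim oriented contour energy (8 orientations x 4x4 grid)
    centroid      : (x, y) region centroid normalized to the image size
    """
    assert len(color_hist) == 30 and len(texture_hist) == 250
    assert len(boundary_desc) == 128 and len(centroid) == 2
    return (list(color_hist) + list(texture_hist)
            + list(boundary_desc) + list(centroid))

# Usage: the assembled descriptor is 410-dimensional.
x_i = region_descriptor([0.0] * 30, [0.0] * 250, [0.0] * 128, (0.5, 0.5))
```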
In this paper, we consider (RF)2-3 as our default variant, and explicitly state when the other two are used instead. Note that considering region layouts and structure changes only the class histograms recorded by leaf nodes of the learned decision trees, but it does not increase complexity.\n\nFor quantitative evaluation, we compute the pixel-wise classification accuracy averaged across all test images, and object classes. This metric is suitable, because it does not favor object classes that occur in images more frequently. Tab. 1 and Tab. 2 show our pixel-wise classification accuracy on MSRC and Street-Scene images. Tab. 2 also compares the three variants of (RF)2 on MSRC and Street-Scene images. The additional consideration of the region relationships touching and far increases performance relative to that of (RF)2-1, as expected. Our performance is the best when all three types of region relationships are modeled. The tables also present the pixel-wise classification accuracy of the state-of-the-art CRF models [3, 6, 20, 21]. Note that the methods of [6, 21] additionally use higher-level cues about the horizon location and 3D scene layout in their object recognition and segmentation. As can be seen, (RF)2 outperforms the latest CRF models on both datasets.\n\nOur segmentation results on example MSRC and Street-Scene images are shown in Fig. 1. Labels of the finest-scale regions are depicted using distinct colors, since pixels get labels of the finest-scale regions. As can be seen, (RF)2 correctly identifies groups of regions that belong to the same class.\n\nSince the depth of each decision tree in RF is less than 30, the complexity of dropping an instance through one tree is O(1), and through RF with T trees is O(T). Our C-implementation of the RF-
Our C-implementation of the RF-\n\nd\no\nh\nt\ne\n\nM\n[10]\n[22]\n[23]\n[20]\n[3]\nOurs\n\ne\nn\na\nl\np\no\nr\ne\nA\n88\n82\n83\n100\n60\n100\n\ne\nl\nc\ny\nc\ni\nB\n91\n72\n79\n98\n75\n99\n\nd\nr\ni\n\nB\n34\n24\n30\n11\n19\n42\n\nt\na\no\nB\n49\n18\n27\n63\n7\n69\n\ny\nd\no\nB\n54\n66\n67\n55\n62\n68\n\ng\nn\ni\nd\nl\ni\nu\nB\n30\n49\n69\n73\n62\n74\n\nk\no\no\nB\n93\n93\n80\n78\n92\n95\n\nr\na\nC\n82\n74\n70\n88\n63\n88\n\nt\na\nC\n56\n75\n68\n11\n54\n77\n\nr\ni\na\nh\nC\n74\n51\n45\n80\n15\n80\n\nw\no\nC\n68\n97\n78\n74\n58\n99\n\ng\no\nD\n54\n35\n52\n43\n19\n61\n\nr\ne\nw\no\nl\nF\n\n90\n74\n47\n72\n63\n93\n\ne\nc\na\nF\n\n77\n87\n84\n72\n74\n91\n\ns\ns\na\nr\nG\n71\n88\n96\n96\n97\n99\n\nd\na\no\nR\n31\n78\n78\n76\n86\n78\n\np\ne\ne\nh\nS\n\n64\n97\n80\n90\n50\n99\n\nn\ng\ni\nS\n\n82\n36\n61\n92\n35\n93\n\ny\nk\nS\n\n84\n78\n95\n50\n83\n96\n\ne\ne\nr\nT\n69\n79\n87\n76\n86\n90\n\nr\ne\nt\na\n\nW\n58\n54\n67\n61\n53\n68\n\nTable 1: The average pixel-wise classi\ufb01cation accuracy on the MSRC dataset. (RF)2 yields the best\nperformance for all object classes except one.\n\n5\n\n\fFigure 1: Our object recognition and segmentation results on example images from the MSRC\ndataset (top two rows), and the Street-Scene dataset (bottom two rows). The \ufb01gure depicts bound-\naries of the \ufb01nest-scale regions found by the multiscale algorithm of [17], and the color-coded labels\nof these regions inferred by (RF)2. The results are good despite the presence of partial occlusion,\nand changes in illumination and scale. 
(best viewed in color)\n\nMethod | MSRC | StreetScene | Test time\n(RF)2-1 | 69.5% ± 13.7% | 78.2% ± 0.5% | 45s\n(RF)2-2 | 80.2% ± 14.4% | 86.7% ± 0.5% | 31s\n(RF)2-3 | 82.9% ± 15.8% | 89.8% ± 0.6% | 31s\n[20] | 70.0% | N/A | N/A\n[21] | 76.4% | 83.0% | N/A\n[6] | N/A | 84.2% | N/A\n[3] | 70.0% | N/A | 10-30s\n\nTable 2: The average pixel-wise classification accuracy and average computation times on the MSRC and Street-Scene datasets of the three variants of our approach, compared with those of the state-of-the-art CRF-based methods.\n\nFigure 2: The probability of classification error of (RF)2, P(\epsilon), given by Eq. (6) and Theorem 1 as a function of the margin, \gamma, of RF. [Plot not reproduced: P(\epsilon) on the vertical axis, 0 to 1, versus \gamma on the horizontal axis, 0.15 to 0.5.]\n\nguided SW-cut inference of CRF takes 10s–30s on a 2.40GHz PC with 3.48GB RAM for MSRC and Street-Scene images. Table 2 shows that our average running times are comparable to those of the other CRF methods that use approximate inference [3, 6, 20, 21].\n\n6 Theoretical Analysis\n\nWe are interested in a theoretical explanation of the good performance of (RF)2 presented in the previous section. In particular, we derive the theoretical performance bounds of a two-class (RF)2, for simplicity. As explained in Sec. 3, we use the SW-cut for (RF)2 inference. The SW-cut iterates the Metropolis-Hastings (MH) reversible jumps, and thus explores the state-space of solutions. An MH jump between states A and B is controlled by the acceptance rate \alpha(A \to B), which depends on the ratios of the proposal and posterior distributions, \frac{q(B \to A)\, p(Y = B|G)}{q(A \to B)\, p(Y = A|G)}. Below, we show that the error made by the two-class RF in estimating these ratios is bounded. 
Our derivation of the error bounds of RF is based on the theoretical analysis of evidence trees, presented in [15].\n\n6.1 An Upper Error Bound of (RF)2\n\nAn error occurs along MH jumps when a balanced reversible jump is encountered, i.e., when there is no preference between jumping from state A to state B and reverse, \frac{q(B \to A)}{q(A \to B)} = 1, and RF wrongly predicts that the posterior distribution of state B is larger than that of A, \frac{p(Y = B|G)}{p(Y = A|G)} \geq 1. In this case, \alpha(A \to B) = 1, and the SW-cut will erroneously visit state B. We are interested in finding the probability of this error, specified as\n\nP(\epsilon) = P\left( \frac{p(Y = B|G)}{p(Y = A|G)} \geq 1 \right) = P\left( \prod_{i \in CC} \frac{p^B_i}{p^A_i} \cdot \prod_{j \in N(i)} \frac{p^B_{ij}}{p^A_{ij}} \geq 1 \right) .    (6)\n\nFrom Eq. (6), P(\epsilon) can be computed using the probability density function of a product of random variables Z_i = p^B_i / p^A_i \in [0, \infty), and W_{ij} = p^B_{ij} / p^A_{ij} \in [0, \infty), within a specific connected component CC, where |CC| = n, i = 1, ..., n, and j \in N(i). As we will prove in the sequel, all random variables Z_i have the same exponential distribution f_{Z_i}(z) = \lambda_1 \exp(-\lambda_1 z). Also, we will prove that all random variables W_{ij} have the same exponential distribution f_{W_{ij}}(w) = \lambda_2 \exp(-\lambda_2 w). Then, it follows that the product Z = \prod_{i=1}^n Z_i = (Z_i)^n has the distribution f_Z(z) = \frac{\lambda_1}{n} z^{\frac{1-n}{n}} \exp(-\lambda_1 z^{\frac{1}{n}}), and the product W = \prod_{i=1}^n \prod_{j \in N(i)} W_{ij} = (W_{ij})^{nk} \approx (W_{ij})^n has the distribution f_W(w) = \frac{\lambda_2}{n} w^{\frac{1-n}{n}} \exp(-\lambda_2 w^{\frac{1}{n}}), where we approximate that the number of edges within CC is the same as the number of nodes in CC, as a result of the probabilistic “cutting” of graph edges by the SW-cut algorithm. Given f_Z(z) and f_W(w), from Eq. 
(6), we analytically derive the probability that (RF)2 makes a wrong prediction, P(\epsilon) = P(Z \cdot W \geq 1), as stated in the following theorem.\n\nTheorem 1. The probability that (RF)2 makes a wrong prediction is P(\epsilon) = P(Z \cdot W \geq 1) = \lambda K_1(\lambda), where Z \in [0, \infty) and W \in [0, \infty) are random variables characterized by the probability density functions f_Z(z) = \frac{\lambda_1}{n} z^{\frac{1-n}{n}} \exp(-\lambda_1 z^{\frac{1}{n}}) and f_W(w) = \frac{\lambda_2}{n} w^{\frac{1-n}{n}} \exp(-\lambda_2 w^{\frac{1}{n}}), with parameters \lambda_1 and \lambda_2, and where K_1 is the modified Bessel function of the second kind, and \lambda = 2\sqrt{\lambda_1 \lambda_2}.\n\nProof. Define H = Z \cdot W. Then, f_H(h) = \int_0^\infty \frac{1}{z} f_Z(z) f_W(\frac{h}{z}) dz = \frac{\lambda^2}{2n} h^{\frac{1-n}{n}} K_0(\lambda h^{\frac{1}{2n}}), where K_0 is the modified Bessel function of the second kind. It follows that P(\epsilon) = P(H \geq 1) = 1 - \int_0^1 f_H(h) dh = \lambda K_1(\lambda).\n\nAs we will show in the following section, the parameter \lambda is directly proportional to a measure of accuracy of RF predictions, referred to as probabilistic margin. Since K_1(\lambda) is a decreasing function, it follows that the probability that (RF)2 makes a wrong prediction is upper bounded, and decreases as the probabilistic margin of RF increases.\n\n6.2 A Mathematical Model of RF Performance\n\nIn this section, we derive that the RF estimates of the ratios of posteriors Z_i and W_{ij} have the exponential distribution. We consider a binary classification problem, for simplicity, where training and test instances may have positive and negative labels. We assume that the two classes are balanced, P(y = +1) = P(y = -1) = 1/2. We define \pi to be a fraction of pairs of instances that have certain relationship, corresponding to a particular spatial or structural relationship between pairs of regions, defined in Sec. 2. 
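Theorem 1's bound can be evaluated numerically with the standard library alone, computing K_1 from its integral representation K_1(\lambda) = \int_0^\infty e^{-\lambda \cosh t} \cosh t \, dt; the parameters \lambda_1 = 8CT\gamma^4 and \lambda_2 = 8C^2 T \pi^4 \gamma^8 are plugged in from Sec. 6.2 (a sketch, with the simple trapezoid integration as an assumption of this illustration):

```python
import math

def bessel_k1(lam, t_max=30.0, steps=20000):
    """K_1(lam) via its integral representation (trapezoid rule on [0, t_max])."""
    h = t_max / steps
    total = 0.0
    for k in range(steps + 1):
        t = k * h
        w = 0.5 if k in (0, steps) else 1.0
        total += w * math.exp(-lam * math.cosh(t)) * math.cosh(t)
    return total * h

def error_bound(C, T, gamma, pi):
    """P(eps) = lam * K_1(lam), with lam = 2*sqrt(lam1*lam2) from Theorem 1."""
    lam1 = 8 * C * T * gamma**4            # from Proposition 1
    lam2 = 8 * C**2 * T * pi**4 * gamma**8  # from Proposition 2
    lam = 2 * math.sqrt(lam1 * lam2)
    return lam * bessel_k1(lam)

# The Fig. 2 setting: C = 20, T = 100, pi = 0.1, gamma swept upward.
bounds = [error_bound(20, 100, g, 0.1) for g in (0.2, 0.3, 0.4, 0.5)]
```

With these values the computed P(\epsilon) falls from near 1 toward 0 as \gamma grows, consistent with the trend reported in Fig. 2.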
The learning algorithm that creates RF is not modeled. Instead, we assume that the learned decision trees have the following properties. Each leaf node of a decision tree: (i) stores a total of C training instances that reach the leaf; and (ii) has a probabilistic margin \gamma \in [0, 1/2). By margin, we mean that in every leaf reached by C training instances a fraction of 1/2 + \gamma of the training instances will belong to one class (e.g., positive), and a fraction 1/2 - \gamma of them will belong to the other class (e.g., negative). We say that a leaf is positive if a majority of the training instances collected by the leaf is positive, or otherwise, we say that the leaf is negative. It is straightforward to show that when a positive instance is dropped through one of the decision trees in RF, it will reach a positive leaf with probability 1/2 + \gamma, and a negative leaf with probability 1/2 - \gamma [15]. Similarly holds for negative instances. A new test instance is classified by dropping it through T decision trees, and taking a majority vote of the labels of all C \cdot T training instances stored in the leaves reached by the test instance. We refer to this classification procedure as evidence voting [15], as opposed to decision voting over the leaf labels in the standard RF [13]. The following proposition states that the probability that evidence voting misclassifies an instance, P(\epsilon_1), is upper bounded.\n\nProposition 1. The probability that RF with T trees, where every leaf stores C training instances, incorrectly classifies an instance is upper bounded, P(\epsilon_1) \leq \exp(-8CT\gamma^4).\n\nProof. Evidence voting for labeling an instance can be formalized as drawing a total of C \cdot T independent Bernoulli random variables, with the success rate p_1, whose outcomes are {-1, +1}, where +1 is received for correct, and -1 for incorrect labeling of the instance. 
Let S1 denote the sum of these Bernoulli random variables. Thus, a positive instance is incorrectly labeled if S1 ≤ 0, and a negative instance is misclassified if S1 > 0. Since the two classes are balanced, by applying the standard Chernoff bound, we obtain P(ε1) = P(S1 ≤ 0) ≤ exp[−2CT(p1 − 1/2)^2]. The success rate p1 can be derived as follows. When a positive (negative) instance is dropped through a decision tree, it will fall in a positive (negative) leaf with probability 1/2 + γ, where it will be labeled as positive (negative) with probability 1/2 + γ; else, the positive (negative) instance will be routed to a negative (positive) leaf with probability 1/2 − γ, where it will be labeled as positive (negative) with probability 1/2 − γ. Consequently, the probability that an instance is correctly labeled, i.e., the success rate of the associated Bernoulli random variable, is p1 = (1/2+γ)(1/2+γ) + (1/2−γ)(1/2−γ) = 1/2 + 2γ^2. □

Evidence voting is also used for labeling pairs of instances. The probability that evidence voting misclassifies a pair of test instances, P(ε2), is upper bounded, as stated in Proposition 2.

Proposition 2. Given RF as in Proposition 1, the probability that RF incorrectly labels a pair of instances having a certain relationship is upper bounded, P(ε2) ≤ exp(−8C^2Tπ^4γ^8).

Proof. Evidence voting for labeling a pair of instances can be formalized as drawing a total of C^2·T independent Bernoulli random variables, with success rate p2, whose outcomes are {−1, +1}, where +1 is received for correct, and −1 for incorrect labeling of the instance pair. Let S2 denote the sum of these Bernoulli random variables.
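Both success rates admit a quick exact check: p1 = 1/2 + 2γ^2 above, and p2 = 1/2 + 2π^2γ^4 obtained by completing the expansion in the proof of Proposition 2 below. Our sketch uses Python's exact rational arithmetic from the standard library (helper names are ours), and also confirms that the Chernoff exponents 2CT(p1 − 1/2)^2 and 2C^2T(p2 − 1/2)^2 reduce to 8CTγ^4 and 8C^2Tπ^4γ^8.

```python
from fractions import Fraction as F


def p1(gamma):
    # Per-vote success rate from the proof of Proposition 1.
    return (F(1, 2) + gamma)**2 + (F(1, 2) - gamma)**2


def p2(gamma, pi):
    # Per-vote success rate for instance pairs (proof of Proposition 2).
    return (pi * (F(1, 2) + gamma**2) * (F(1, 2) + pi * gamma**2)
            + pi * (F(1, 2) - gamma**2) * (F(1, 2) - pi * gamma**2)
            + (1 - pi) * F(1, 2))


# Exact checks of the closed forms at a few rational parameter values.
for g in (F(1, 10), F(1, 4), F(2, 5)):
    assert p1(g) == F(1, 2) + 2 * g**2
    # Chernoff exponent per vote: 2*(p1 - 1/2)^2 = 8*gamma^4.
    assert 2 * (p1(g) - F(1, 2))**2 == 8 * g**4
    for pi in (F(1, 10), F(1, 2)):
        assert p2(g, pi) == F(1, 2) + 2 * pi**2 * g**4
        # Chernoff exponent per vote: 2*(p2 - 1/2)^2 = 8*pi^4*gamma^8.
        assert 2 * (p2(g, pi) - F(1, 2))**2 == 8 * pi**4 * g**8
```

Using rationals rather than floats makes these polynomial identities exact equalities, so the asserts verify the algebra rather than an approximation of it.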
Then, P(ε2) = P(S2 ≤ 0) ≤ exp[−2C^2T(p2 − 1/2)^2]. Similar to the proof of Proposition 1, by considering the three possible cases of correct labeling of a pair of instances when dropping the pair through a decision tree, the success rate p2 can be derived as p2 = π(1/2+γ^2)(1/2+πγ^2) + π(1/2−γ^2)(1/2−πγ^2) + (1−π)(1/2) = 1/2 + 2π^2γ^4, where π is the fraction of pairs of instances that have the same type of relationship. □

From Proposition 1, it follows that the probability that RF makes a wrong prediction about the posterior ratio of an instance is upper bounded, P(Zi ≥ 1) = P(ε1) = exp(−8CTγ^4), ∀i ∈ CC. This gives the probability density function fZi(z) = λ1 exp(−λ1 z), where λ1 = 8CTγ^4. In addition, from Proposition 2, it follows that the probability that RF makes a wrong prediction about the posterior ratio of a pair of instances is upper bounded, P(Wij ≥ 1) = P(ε2) = exp(−8C^2Tπ^4γ^8), ∀i ∈ CC and j ∈ N(i). This gives the probability density function fWij(w) = λ2 exp(−λ2 w), where λ2 = 8C^2Tπ^4γ^8. By plugging these results into Theorem 1, we complete the derivation of the upper error bound of (RF)^2. From Theorem 1, P(ε) decreases when any of the following parameters increases: C, T, γ, and π. Fig. 2 shows the influence of γ on P(ε), when the other parameters are fixed to their typical values: C = 20, T = 100, and π = 0.1.

7 Conclusion

We have presented (RF)^2 – a framework that uses the random forest (RF) for the MCMC-based inference of a conditional random field (CRF).
Our key idea is to employ RF to directly compute the ratios of the proposal and posterior distributions of the states visited along the Metropolis-Hastings reversible jumps, instead of estimating each individual distribution, and thus improve the convergence rate and accuracy of the CRF inference. Such a non-parametric formulation of CRF and its inference has been demonstrated to outperform, in terms of computation time and accuracy, existing parametric CRF models on the task of multiclass object recognition and segmentation. We have also derived the upper error bounds of the two-class RF and (RF)^2, and showed that the classification error of (RF)^2 decreases as any of the following RF parameters increases: the number of decision trees, the number of training examples stored in every leaf node, and the probabilistic margin.

References

[1] L.-J. Li, R. Socher, and L. Fei-Fei, “Towards total scene understanding: Classification, annotation and segmentation in an automatic framework,” in CVPR, 2009.
[2] X. He, R. S. Zemel, and M. A. Carreira-Perpinan, “Multiscale conditional random fields for image labeling,” in CVPR, 2004, pp. 695–702.
[3] J. Shotton, J. Winn, C. Rother, and A. Criminisi, “TextonBoost: Joint appearance, shape and context modeling for multi-class object recognition and segmentation,” in ECCV, 2006, pp. 1–15.
[4] J. Verbeek and B. Triggs, “Scene segmentation with CRFs learned from partially labeled images,” in NIPS, 2007.
[5] A. Torralba, K. P. Murphy, and W. T. Freeman, “Contextual models for object detection using boosted random fields,” in NIPS, 2004.
[6] S. Gould, T. Gao, and D. Koller, “Region-based segmentation and object detection,” in NIPS, 2009.
[7] A. Rabinovich, A. Vedaldi, C. Galleguillos, E. Wiewiora, and S. Belongie, “Objects in context,” in ICCV, 2007.
[8] N. Payet and S.
Todorovic, \u201cFrom a set of shapes to object discovery,\u201d in ECCV, 2010.\n[9] S. Todorovic and N. Ahuja, \u201cUnsupervised category modeling, recognition, and segmentation in images,\u201d\n\nIEEE TPAMI, vol. 30, no. 12, pp. 1\u201317, 2008.\n\n[10] J. J. Lim, P. Arbelaez, C. Gu, and J. Malik, \u201cContext by region ancestry,\u201d in ICCV, 2009.\n[11] J. Sivic, B. C. Russell, A. Zisserman, W. T. Freeman, and A. A. Efros, \u201cUnsupervised discovery of visual\n\nobject class hierarchies,\u201d in CVPR, 2008.\n\n[12] J. Lafferty, A. McCallum, and F. Pereira, \u201cConditional random \ufb01elds: Probabilistic models for segmenting\n\nand labeling sequence data,\u201d in ICML, 2001, pp. 282\u2013289.\n\n[13] L. Breiman, \u201cRandom forests,\u201d Mach. Learn., vol. 45, no. 1, pp. 5\u201332, 2001.\n[14] J. Gall and V. Lempitsky, \u201cClass-speci\ufb01c hough forests for object detection,\u201d in CVPR, 2009.\n[15] G. Martinez-Munoz, N. Larios, E. Mortensen, W. Zhang, A. Yamamuro, R. Paasch, N. Payet, D. Lytle,\nL. Shapiro, S. Todorovic, A. Moldenke, and T. Dietterich, \u201cDictionary-free categorization of very similar\nobjects via stacked evidence trees,\u201d in CVPR, 2009.\n\n[16] Y. Lin and Y. Jeon, \u201cRandom forests and adaptive nearest neighbors,\u201d Journal of the American Statistical\n\nAssociation, pp. 101\u2013474, 2006.\n\n[17] C. F. P. Arbelaez, M. Maire and J. Malik, \u201cFrom contours to regions: An empirical evaluation,\u201d in CVPR,\n\n2009.\n\n[18] A. Barbu and S.-C. Zhu, \u201cGraph partition by Swendsen-Wang cuts,\u201d in ICCV, 2003, p. 320.\n[19] S. Bileschi and L. Wolf, \u201cA uni\ufb01ed system for object detection, texture recognition, and context analysis\n\nbased on the standard model feature set,\u201d in BMVC, 2005.\n\n[20] C. Galleguillos, B. McFee, S. Belongie, and G. R. G. Lanckriet, \u201cMulti-class object localization by com-\n\nbining local contextual interactions,\u201d in CVPR, 2010.\n\n[21] S. Gould, R. 
Fulton, and D. Koller, “Decomposing a scene into geometric and semantically consistent regions,” in ICCV, 2009.
[22] J. Shotton, M. Johnson, and R. Cipolla, “Semantic texton forests for image categorization and segmentation,” in CVPR, 2008.
[23] Z. Tu and X. Bai, “Auto-context and its application to high-level vision tasks and 3D brain image segmentation,” IEEE TPAMI, vol. 99, 2009.