{"title": "Occlusion Detection and Motion Estimation with Convex Optimization", "book": "Advances in Neural Information Processing Systems", "page_first": 100, "page_last": 108, "abstract": "We tackle the problem of simultaneously detecting occlusions and estimating optical flow. We show that, under standard assumptions of Lambertian reflection and static illumination, the task can be posed as a convex minimization problem. Therefore, the solution, computed using efficient algorithms, is guaranteed to be globally optimal, for any number of independently moving objects, and any number of occlusion layers. We test the proposed algorithm on benchmark datasets, expanded to enable evaluation of occlusion detection performance.", "full_text": "Occlusion Detection and Motion Estimation\n\nwith Convex Optimization\n\nAlper Ayvaci, Michalis Raptis, Stefano Soatto\n\nUniversity of California, Los Angeles\n{ayvaci, mraptis, soatto}@cs.ucla.edu\n\nAbstract\n\nWe tackle the problem of simultaneously detecting occlusions and estimating op-\ntical \ufb02ow. We show that, under standard assumptions of Lambertian re\ufb02ection\nand static illumination, the task can be posed as a convex minimization problem.\nTherefore, the solution, computed using ef\ufb01cient algorithms, is guaranteed to be\nglobally optimal, for any number of independently moving objects, and any num-\nber of occlusion layers. We test the proposed algorithm on benchmark datasets,\nexpanded to enable evaluation of occlusion detection performance.\n\n1\n\nIntroduction\n\nOptical \ufb02ow refers to the deformation of the domain of an image that results from ego- or scene\nmotion. It is, in general, different from the motion \ufb01eld, that is the projection onto the image plane\nof the spatial velocity of the scene [28], unless three conditions are satis\ufb01ed: (a) Lambertian re-\n\ufb02ection, (b) constant illumination, and (c) constant visibility properties of the scene. 
Most surfaces with benign reflectance properties (diffuse/specular) can be approximated as Lambertian almost everywhere under sparse illuminants (e.g., the sun). In any case, widespread violation of Lambertian reflection does not enable correspondence [23], so we will embrace (a) as customary. Similarly, (b) constant illumination is a reasonable assumption for ego-motion (the scene is not moving relative to the light source), and even for objects moving (slowly) relative to the light source.1 Assumption (c) is the most critical, as it is needed for the motion field to be defined.2 It is often taken for granted in the optical flow literature because, in the limit where two images are sampled infinitesimally close in time, there are no occluded regions, and one can focus solely on motion discontinuities. Thus, most variational motion estimation approaches provide an estimate of a dense flow field at each location on the image domain, including occluded regions. Alas, in occluded regions the problem is not that optical flow is discontinuous, or forward-backward inconsistent; it is simply not defined. Motion in occluded regions can be hallucinated; however, whatever motion is assigned to an occluded region cannot be validated from the data. In defense of these methods, it can be argued that, even without taking the limit, for small parallax (slow-enough motion, or far-enough objects, or fast-enough temporal sampling) occluded areas are small. However, small does not mean unimportant, as occlusions are critical to perception [8] and a key for developing representations for recognition [22].\n\nFor this reason, we focus on issues of visibility in optical flow computation. 
We show that forgoing assumption (c) and explicitly representing occlusions is not only conceptually correct, but also algorithmically advantageous, for the resulting optimization problem can be shown to become convex once occlusions are explicitly modeled. Therefore, one can guarantee convergence to a globally optimal solution regardless of initial conditions (sect. 2). We adapt Nesterov's efficient optimization scheme to our problem (sect. 3), and test the resulting algorithm on benchmark datasets (sect. 4), including evaluation of occlusion detection (sect. 1.2).\n\n1Assumption (b) is also made for convenience, as modeling illumination changes would require modeling reflectance, which significantly complicates the picture.\n\n2If the domain of an image portrays a portion of the scene that is not visible in another image, the two cannot be put into correspondence.\n\n1.1 Related Work\n\nThe most common approach to handling occlusions in the optical flow literature is to define them as regions where forward and backward motion estimates are inconsistent [19, 1]. Most approaches return estimates of motion in the occluded regions, where they cannot be invalidated: as we have already pointed out, in an occluded region one cannot determine a motion field that maps one image onto another, because the scene is not visible in one of the two. Some approaches [11, 4], while also exploiting motion symmetry, discount occlusions by weighting the data fidelity with a monotonically decreasing function. The resulting problem is non-convex, and therefore the proposed alternating minimization techniques can be prone to local minima. An alternative approach [15, 14, 25] is to formulate joint motion estimation and occlusion detection in a discrete setting, where it is NP-hard. 
Various approximate solutions using combinatorial optimization require fine quantization and therefore suffer from a large number of labels, which results in loose approximation bounds. Another class of methods uses the motion estimation residual to classify a location as occluded or visible, either with a direct threshold on the residual [30] or with a more elaborate probabilistic model [24]. In each case, the resulting optimization is non-convex.\n\n1.2 Evaluation\n\nOptical flow estimation is a mature area of computer vision, and benchmark datasets have been developed, e.g., [2]. Unfortunately, no existing benchmark provides ground truth for occluded regions, nor a scoring mechanism to evaluate occlusion detection performance. Motion estimates are scored even in the occluded regions, where the data does not support them. Since our primary goal is to detect occlusions, we have produced a new benchmark by taking a subset of the training data in the Middlebury dataset and hand-labeling occluded regions. We then use the same evaluation method as Middlebury for the (ground truth) regions that are co-visible in at least two images. This provides a motion estimation score. Then, we provide a separate score for occlusion detection, in terms of precision-recall curves.\n\n2 Joint Occlusion Detection and Optical Flow Estimation\n\nIn this section, we show how the assumptions (a)-(b) can be used to formulate occlusion detection and optical flow estimation as a joint optimization problem. Let $I : D \subset \mathbb{R}^2 \times \mathbb{R}^+ \to \mathbb{R}^+$; $(x, t) \mapsto I(x, t)$ be a grayscale time-varying image defined on a domain $D$. 
Under the assumptions (a)-(b), the relation between two consecutive frames in a video $\{I(x, t)\}_{t=0}^{T}$ is given by\n\n$I(x, t) = \begin{cases} I(w(x, t), t + dt) + n(x, t), & x \in D \setminus \Omega(t; dt) \\ \rho(x, t), & x \in \Omega(t; dt) \end{cases}$  (1)\n\nwhere $w : D \times \mathbb{R}^+ \to \mathbb{R}^2$; $x \mapsto w(x, t) \doteq x + v(x, t)$ is the domain deformation mapping $I(x, t)$ onto $I(x, t + dt)$ everywhere except at occluded regions. Usually optical flow denotes the incremental displacement $v(x, t) \doteq w(x, t) - x$. The occluded region $\Omega$ can change over time depending on the temporal sampling interval $dt$ and is not necessarily simply-connected; so even if we call $\Omega$ the occluded region (singular), it is understood that it can be made of several disconnected portions. Inside $\Omega$, the image can take any value $\rho : \Omega \times \mathbb{R}^+ \to \mathbb{R}^+$ that is in general unrelated to $I(w(x), t + dt)|_{x \in \Omega}$. In the limit $dt \to 0$, $\Omega(t; dt) = \emptyset$. Because of (almost-everywhere) continuity of the scene and its motion (i), and because the additive term $n(x, t)$ compounds the effects of a large number of independent phenomena3 and therefore we can invoke the Law of Large Numbers (ii), in general we have that\n\n(i) $\lim_{dt \to 0} \Omega(t; dt) = \emptyset$, and (ii) $n \overset{IID}{\sim} \mathcal{N}(0, \lambda)$  (2)\n\n3$n(x, t)$ collects all unmodeled phenomena including deviations from Lambertian reflection, illumination changes, quantization error, sensor noise, and later also linearization error. It does not capture occlusions, since those are explicitly modeled.\n\ni.e., the additive uncertainty is normally distributed in space and time with an isotropic and small variance $\lambda > 0$. 
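The behavior posited by (1)-(2), a residual that is Gaussian and small on the co-visible domain and unrelated (hence generically large) on $\Omega$, can be checked on a 1-D toy version of the model. This sketch is ours, not part of the paper; the translation, the occluded interval, and all magnitudes are arbitrary choices:

```python
import numpy as np

rng = np.random.default_rng(0)
N, v = 100, 1                          # lattice size and constant integer flow
I1 = rng.normal(size=N + v)            # frame at t + dt (slightly larger support)
I0 = I1[v:v + N].copy()                # frame at t: I0(x) = I1(w(x)), w(x) = x + v
occ = np.zeros(N, dtype=bool)
occ[80:90] = True                      # Omega: pixels of I0 with no correspondent
I0[occ] = rng.normal(size=occ.sum()) + 5.0   # rho: values unrelated to I1
I0 = I0 + 0.01 * rng.normal(size=N)    # small dense noise n(x, t)

e = I0 - I1[v:v + N]                   # residual e(x) = I(x, t) - I(w(x, t), t + dt)
print(np.abs(e[~occ]).max())           # small: pure noise outside Omega
print(np.abs(e[occ]).mean())           # large: occlusion dominates on Omega
```

The residual is statistically indistinguishable from the noise $n$ exactly on the co-visible set, which is what makes its support a usable proxy for $\Omega$.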
We define the residual $e : D \to \mathbb{R}$ on the entire image domain $x \in D$, via\n\n$e(x, t; dt) \doteq I(x, t) - I(w(x, t), t + dt)$  (3)\n\nwhich we can write as the sum of two terms, $e_1 : D \to \mathbb{R}$ and $e_2 : D \to \mathbb{R}$, also defined on the entire domain $D$, in such a way that\n\n$\begin{cases} e_1(x, t; dt) \doteq \rho(x, t) - I(w(x, t), t + dt), & x \in \Omega \\ e_2(x, t; dt) \doteq n(x, t), & x \in D \setminus \Omega. \end{cases}$  (4)\n\nNote that $e_2$ is undefined in $\Omega$, and $e_1$ is undefined in $D \setminus \Omega$, in the sense that they can take any value there, including zero, which we will assume henceforth. We can then write, for any $x \in D$,\n\n$I(x, t) = I(w(x, t), t + dt) + e_1(x, t; dt) + e_2(x, t; dt)$  (5)\n\nand note that, because of (i), $e_1$ is large but sparse,4 while because of (ii), $e_2$ is small but dense.4 We will use this as an inference criterion for $w$, seeking to optimize a data fidelity term that minimizes the number of nonzero elements of $e_1$ (a proxy of the area of $\Omega$), and the negative log-likelihood of $n$:\n\n$\psi_{data}(w, e_1) \doteq \|e_1\|_{L^0(D)} + \frac{1}{\lambda}\|e_2\|_{L^2(D)}$ subject to (5) $= \frac{1}{\lambda}\|I(x, t) - I(w(x, t), t + dt) - e_1\|_{L^2(D)} + \|e_1\|_{L^0(D)}$  (6)\n\nwhere $\|f\|_{L^0(D)} \doteq |\{x \in D \,|\, f(x) \neq 0\}|$ and $\|f\|_{L^2(D)} \doteq \int_D |f(x)|^2 dx$. Unfortunately, we do not know anything about $e_1$ other than the fact that it is sparse, and that what we are looking for is $\chi(\Omega) \propto e_1$, where $\chi : D \to \mathbb{R}^+$ is the characteristic function that is non-zero when $x \in \Omega$, i.e., where the occlusion residual is non-zero. 
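To see concretely why the criterion (6) separates occlusions from noise, note that for a fixed total residual $e = e_1 + e_2$ it decouples per pixel: choosing $(e_1)_i = e_i$ costs 1 (the $\ell^0$ count), while choosing $(e_1)_i = 0$ costs $e_i^2/\lambda$, so the minimizer is a hard threshold at $\sqrt{\lambda}$. A minimal numpy illustration (the helper name `split_residual` is ours, not the paper's):

```python
import numpy as np

def split_residual(e, lam):
    """Elementwise minimizer of ||e1||_0 + (1/lam) * sum((e - e1)**2):
    keep (e1)_i = e_i where e_i**2 > lam, i.e. where the count cost 1
    is cheaper than the quadratic cost e_i**2 / lam; else (e1)_i = 0."""
    e1 = np.where(e**2 > lam, e, 0.0)
    return e1, e - e1

e = np.array([0.01, -0.02, 3.0, 0.05, -2.5])   # dense noise plus two outliers
e1, e2 = split_residual(e, lam=0.04)
print(e1)   # sparse and large: only the two outliers survive (occlusion proxy)
print(e2)   # dense and small: the noise component
```

The full problem is of course coupled through $w$, but this scalar view explains why the support of $e_1$ tracks the occluded region.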
So, the data fidelity term depends on $w$ but also on the characteristic function of the occlusion domain $\Omega$.5 For a sufficiently small $dt$, we can approximate, for any $x \in D \setminus \Omega$,\n\n$I(x, t + dt) = I(x, t) + \nabla I(x, t) v(x, t) + n(x, t)$  (9)\n\nwhere the linearization error has been incorporated into the uncertainty term $n(x, t)$. Therefore, following the same steps as before, we have\n\n$\psi_{data}(v, e_1) = \|\nabla I v + I_t - e_1\|_{L^2(D)} + \lambda \|e_1\|_{L^0(D)}.$  (10)\n\nSince we typically do not know the variance $\lambda$ of the process $n$, we will treat it as a tuning parameter, and because $\psi_{data}$ or $\lambda \psi_{data}$ yield the same minimizer, we have attributed the multiplier $\lambda$ to the second term. In addition to the data term, because the unknown $v$ is infinite-dimensional and the problem is ill-posed, we need to impose regularization, for instance by requiring that the total variation (TV) be small:\n\n$\psi_{reg}(v) = \mu \|v_1\|_{TV} + \mu \|v_2\|_{TV}$  (11)\n\nwhere $v_1$ and $v_2$ are the first and second components of the optical flow $v$, $\mu$ is a multiplier factor to weight the strength of the regularizer, and the weighted isotropic TV norm is defined by\n\n$\|f\|_{TV(D)} = \int_D \sqrt{(g_1(x)\nabla_x f(x))^2 + (g_2(x)\nabla_y f(x))^2}\, dx,$\n\n4Sparse stands for almost everywhere zero on $D$; similarly, dense stands for almost everywhere non-zero.\n\n5In a digital image, both domains $D$ and $\Omega$ are discretized into a lattice, and $dt$ is fixed. Therefore, spatial and temporal derivative operators are approximated, typically, by first-order differences. 
We use the formal notation\n\n$\nabla I(x, t) \doteq \left[ I\left(x + \begin{bmatrix} 1 \\ 0 \end{bmatrix}, t\right) - I(x, t), \;\; I\left(x + \begin{bmatrix} 0 \\ 1 \end{bmatrix}, t\right) - I(x, t) \right]^T$  (7)\n\n$I_t(x, t) \doteq I(x, t + dt) - I(x, t).$  (8)\n\nwhere $g_1(x) \approx \exp(-\beta|\nabla_x I(x)|)$ and $g_2(x) \approx \exp(-\beta|\nabla_y I(x)|)$; $\beta$ is a normalizing factor. TV is desirable in the context of occlusion detection because it does not penalize motion discontinuities significantly. The overall problem can then be written as the minimization of the cost functional $\psi = \psi_{data} + \psi_{reg}$, which is\n\n$\hat{v}_1, \hat{v}_2, \hat{e}_1 = \arg\min_{v_1, v_2, e_1} \underbrace{\|\nabla I v + I_t - e_1\|^2_{L^2(D)} + \lambda\|e_1\|_{L^0(D)} + \mu\|v_1\|_{TV(D)} + \mu\|v_2\|_{TV(D)}}_{\psi(v_1, v_2, e_1)}$  (12)\n\nIn a digital image, the domain $D$ is quantized into an $M \times N$ lattice $\Lambda$, so we can write (12) in matrix form as\n\n$\hat{v}_1, \hat{v}_2, \hat{e}_1 = \arg\min_{v_1, v_2, e_1} \frac{1}{2}\|A[v_1, v_2, e_1]^T + b\|^2_{\ell^2} + \lambda\|e_1\|_{\ell^0} + \mu\|v_1\|_{TV} + \mu\|v_2\|_{TV}$  (13)\n\nwhere $e_1 \in \mathbb{R}^{MN}$ is the vector obtained from stacking the values of $e_1(x, t)$ on the lattice $\Lambda$ on top of one another (column-wise), and similarly with the vector field components $\{v_1(x, t)\}_{x \in \Lambda}$ and $\{v_2(x, t)\}_{x \in \Lambda}$ stacked into $MN$-dimensional vectors $v_1, v_2 \in \mathbb{R}^{MN}$. The spatial derivative matrix $A$ is given by $A = [\mathrm{diag}(\nabla_x I)\;\; \mathrm{diag}(\nabla_y I)\;\; {-\mathbb{I}}]$, where $\mathbb{I}$ is the $MN \times MN$ identity matrix, and the temporal derivative values $\{I_t(x, t)\}_{x \in \Lambda}$ are stacked into $b$. 
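The structure of $A$ can be sanity-checked numerically: applying $A$ to the stacked unknowns reproduces the pointwise linearized residual $\nabla I\, v + I_t - e_1$. A small dense sketch (in practice $A$ would be stored sparse; sizes and values here are arbitrary):

```python
import numpy as np

M, N = 4, 5                        # small M x N lattice
rng = np.random.default_rng(1)
Ix = rng.normal(size=M * N)        # stacked spatial derivatives of I
Iy = rng.normal(size=M * N)
b = rng.normal(size=M * N)         # stacked temporal derivative I_t

# A = [diag(Ix)  diag(Iy)  -Identity], shape (MN, 3MN)
A = np.hstack([np.diag(Ix), np.diag(Iy), -np.eye(M * N)])

v1 = rng.normal(size=M * N)
v2 = rng.normal(size=M * N)
e1 = rng.normal(size=M * N)

lhs = A @ np.concatenate([v1, v2, e1]) + b
rhs = Ix * v1 + Iy * v2 - e1 + b   # elementwise: grad(I) . v + I_t - e1
print(np.allclose(lhs, rhs))       # -> True
```

Since each row of $A$ touches only one pixel's unknowns, $A^T A$ is block-diagonal up to permutation, which is what makes the first-order scheme of sect. 3 cheap per iteration.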
For finite-dimensional vectors $u \in \mathbb{R}^{MN}$, $\|u\|_{\ell^2} = \sqrt{\langle u, u \rangle}$, $\|u\|_{\ell^0} = |\{u_i \,|\, u_i \neq 0\}|$, and $\|u\|_{TV} = \sum_i \sqrt{((g_1)_i(u_{i+1} - u_i))^2 + ((g_2)_i(u_{i+M} - u_i))^2}$, where $g_1$ and $g_2$ are the stacked versions of $\{g_1(x)\}_{x \in \Lambda}$ and $\{g_2(x)\}_{x \in \Lambda}$.\n\nIn practice, (13) is NP-hard. Therefore, as customary, we relax it by minimizing the weighted-$\ell^1$ norm of $e_1$, instead of $\ell^0$, such that\n\n$\hat{v}_1, \hat{v}_2, \hat{e}_1 = \arg\min_{v_1, v_2, e_1} \frac{1}{2}\|A[v_1, v_2, e_1]^T + b\|^2_{\ell^2} + \lambda\|W e_1\|_{\ell^1} + \mu\|v_1\|_{TV} + \mu\|v_2\|_{TV}$  (14)\n\nwhere $W$ is a diagonal weight matrix and $\|u\|_{\ell^1} = \sum_i |u_i|$. When $W$ is the identity, (14) becomes a standard convex relaxation of (13) and its globally optimal solution can be reached efficiently [27]. However, the $\ell^0$ norm can also be approximated by reweighting $\ell^1$, as proposed by Candes et al. [5], by setting the diagonal elements of $W$ to $w_i \approx 1/(|(e_1)_i| + \epsilon)$, $\epsilon$ small, after each iteration of (14). The data term of the standard (unweighted) relaxation of (13) can be interpreted as a Huber norm [10]. We favor the more general (14), as the resulting estimate of $e_1$ is more stable and sparse.\n\nThe model (9) is valid to the extent in which $dt$ is sufficiently small relative to $v$ (or $v$ sufficiently slow relative to $dt$), so the linearization error does not alter the statistics of the residual $n$. When this is not the case, remedies must be enacted to restore proper sampling conditions [22] and therefore differentiate contributions to the residual coming from sampling artifacts (aliasing), rather than occlusions. 
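The effect of the reweighting in (14) is easiest to see on a decoupled scalar version of the $e_1$ subproblem, $\min_z \frac{1}{2}(e_i - z)^2 + \lambda w_i |z|$, whose solution is soft-thresholding at $\lambda w_i$. With $w_i = 1/(|(e_1)_i| + \epsilon)$, large entries are shrunk less on each pass, while small (noise) entries acquire a huge threshold and stay at zero, mimicking $\ell^0$. A toy sketch under those assumptions (the closed-form scalar step stands in for the full coupled solve):

```python
import numpy as np

def soft(x, t):
    """Soft-threshold: argmin_z 0.5*(x - z)**2 + t*|z|."""
    return np.sign(x) * np.maximum(np.abs(x) - t, 0.0)

e = np.array([0.02, -0.03, 3.0, 0.04, -2.5])   # residual: noise plus two spikes
lam, eps = 0.05, 1e-3

w = np.ones_like(e)              # W = identity: plain l1 relaxation first
e1 = soft(e, lam * w)
for _ in range(5):               # reweighting as in Candes et al. [5]
    w = 1.0 / (np.abs(e1) + eps) # small entries -> huge weight -> stay zero
    e1 = soft(e, lam * w)        # per-element threshold lam * w_i

print(e1)                        # spikes nearly unshrunk, noise exactly zero
```

The plain $\ell^1$ pass shrinks the spikes by the full $\lambda$; after reweighting the shrinkage bias on large entries nearly vanishes, which is the stability gain claimed for (14).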
This can be done by solving (14) in scale-space, as customary, with coarser scales used to initialize $\hat{v}_1, \hat{v}_2$ so the increment is properly sampled, and the occlusion term $e_1$ added at the finest scale.\n\nThe residual term $e_1$ in (5) has been characterized in some literature as modeling illumination changes [21, 16, 26, 13]. Note that, even if the model (5) appears similar, the priors on $e_1$ are rather different: sparsity in our case, smoothness in theirs. While sparsity is clearly motivated by (i), for illumination changes to be properly modeled, a reflectance function is necessary, which is absent in all models of the form (5) (see [23]).\n\n3 Optimization with Nesterov's Algorithm\n\nIn this section, we describe an efficient algorithm to solve (14) based on Nesterov's first-order scheme [17], which provides $O(1/k^2)$ convergence in $k$ iterations, whereas for standard gradient descent it is $O(1/k)$: a considerable advantage for a large-scale problem such as (14). To simplify the notation, we let $(e_1)_i \doteq w_i (e_1)_i$, so that $A \doteq [\mathrm{diag}(\nabla_x I)\;\; \mathrm{diag}(\nabla_y I)\;\; {-W^{-1}}]$. We then have:\n\nInitialize $v_1^0, v_2^0, e_1^0$. For $k \geq 0$:\n\n1. Compute $\nabla\psi(v_1^k, v_2^k, e_1^k)$,\n\n2. Compute $\alpha_k = \frac{1}{2}(k + 1)$, $\tau_k = 2/(k + 3)$,\n\n3. Compute $y^k = [v_1^k, v_2^k, e_1^k]^T - (1/L)\nabla\psi(v_1^k, v_2^k, e_1^k)$,\n\n4. Compute $z^k = [v_1^0, v_2^0, e_1^0]^T - (1/L)\sum_{i=0}^{k} \alpha_i \nabla\psi(v_1^i, v_2^i, e_1^i)$,\n\n5. Update $[v_1^{k+1}, v_2^{k+1}, e_1^{k+1}]^T = \tau_k z^k + (1 - \tau_k) y^k$.\n\nStop when the solution converges.\n\n$\psi(v_1, v_2, e_1) = \psi_1(v_1, v_2, e_1) + \lambda\psi_2(e_1) + \mu\psi_3(v_1) + \mu\psi_4(v_2),$\n\nIn order to implement this scheme, we need to address the nonsmooth nature of $\ell^1$ in the computation of $\nabla\psi$ [18], a common problem in sparse optimization [3]. 
We write $\psi(v_1, v_2, e_1)$ as the sum above and compute the gradient of each term separately. $\nabla_{v_1, v_2, e_1}\psi_1(v_1, v_2, e_1)$ is straightforward:\n\n$\nabla_{v_1, v_2, e_1}\psi_1(v_1, v_2, e_1) = A^T A [v_1, v_2, e_1]^T + A^T b.$\n\nThe other three terms require smoothing. $\psi_2(e_1) = \|e_1\|_{\ell^1}$ can be rewritten as $\psi_2(e_1) = \max_{\|u\|_\infty \leq 1} \langle u, e_1 \rangle$ in terms of its conjugate. [18] proposes the smooth approximation\n\n$\psi_2^\sigma(e_1) = \max_{\|u\|_\infty \leq 1} \langle u, e_1 \rangle - \frac{1}{2}\sigma\|u\|^2_{\ell^2},$  (15)\n\nand shows that (15) is differentiable with $\nabla_{e_1}\psi_2^\sigma(e_1) = u^\sigma$, where $u^\sigma$ is the solution of (15):\n\n$u^\sigma_i = \begin{cases} \sigma^{-1}(e_1)_i, & |(e_1)_i| < \sigma, \\ \mathrm{sgn}((e_1)_i), & \text{otherwise.} \end{cases}$  (16)\n\nFollowing [3], $\nabla_{v_1}\psi_3$ is given by $\nabla_{v_1}\psi_3^\sigma(v_1) = G^T u^\sigma$, where $G = [G_1, G_2]^T$, $G_1$ and $G_2$ are weighted horizontal and vertical differentiation operators, and $u^\sigma$ has the form $[u^1, u^2]$ with\n\n$u^{1,2}_i = \begin{cases} \sigma^{-1}(G_{1,2} v_1)_i, & \|[(G_1 v_1)_i \; (G_2 v_1)_i]^T\|_{\ell^2} < \sigma, \\ \|[(G_1 v_1)_i \; (G_2 v_1)_i]^T\|^{-1}_{\ell^2} (G_{1,2} v_1)_i, & \text{otherwise.} \end{cases}$  (17)\n\n$\nabla_{v_2}\psi_4$ can be computed in the same way. Once we have computed each term, $\nabla\psi(v_1, v_2, e_1)$ is\n\n$\nabla\psi(v_1, v_2, e_1) = \nabla\psi_1 + [\lambda\nabla_{e_1}\psi_2, \; \mu\nabla_{v_1}\psi_3, \; \mu\nabla_{v_2}\psi_4]^T.$  (18)\n\nWe also need the Lipschitz constant $L$ to compute the auxiliary variables $y^k$ and $z^k$. Since $\|G^T G\|_2$ is bounded above by 8 [7], given the coefficients $\lambda$ and $\mu$, $L$ is given by\n\n$L = \max(\lambda, 8\mu)/\sigma + \|A^T A\|_2.$\n\nA crucial element of the scheme is the selection of $\sigma$, which trades off accuracy and speed of convergence. 
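Under the stated smoothing, the boxed scheme can be exercised end-to-end on a small synthetic instance, with a random $A$ and $b$ standing in for the image-derived quantities and the smoothed $\ell^1$ gradient taken from (16); there are no TV terms here, so $L$ reduces to $\lambda/\sigma + \|A^T A\|_2$. A sketch, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(2)
n = 20
A = rng.normal(size=(n, n)) / np.sqrt(n)
b = rng.normal(size=n)
lam, sigma = 0.1, 0.01

def grad_smoothed_l1(x):
    # gradient of the smoothed l1 term, eq. (16): x / sigma clipped to [-1, 1]
    return np.clip(x / sigma, -1.0, 1.0)

def grad_psi(x):
    return A.T @ (A @ x + b) + lam * grad_smoothed_l1(x)

def psi(x):
    # smoothed objective: 0.5*||Ax + b||^2 + lam * Huber-like smoothed l1
    h = np.abs(x)
    smooth = np.where(h < sigma, x**2 / (2 * sigma), h - sigma / 2)
    return 0.5 * np.sum((A @ x + b) ** 2) + lam * smooth.sum()

L = lam / sigma + np.linalg.norm(A.T @ A, 2)   # Lipschitz constant of grad_psi
x0 = np.zeros(n)
x, acc = x0.copy(), np.zeros(n)                # acc accumulates alpha_i * grad_i
for k in range(300):
    g = grad_psi(x)
    alpha, tau = 0.5 * (k + 1), 2.0 / (k + 3)
    y = x - g / L                              # step 3
    acc += alpha * g
    z = x0 - acc / L                           # step 4
    x = tau * z + (1 - tau) * y                # step 5

print(psi(x) < psi(x0))                        # objective decreased
```

The running sum in step 4 is what distinguishes this scheme from plain gradient descent: each iterate aggregates all past gradients, which yields the $O(1/k^2)$ rate.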
A large $\sigma$ yields a smooth solution, which is undesirable when minimizing the $\ell^1$ norm; a small $\sigma$ causes slow convergence. We have chosen $\sigma$ empirically, although the continuation algorithm proposed in [3] could be employed to adapt $\sigma$ during convergence.\n\n4 Experiments\n\nTo evaluate occlusion detection (Sect. 1.2), we start from [2] and generate occlusion maps as follows: for each training sequence, the residual computed from the given ground-truth motion is used as a discriminant to determine ground-truth occlusions, fixing obvious errors in the occlusion maps by hand. We therefore restrict the evaluation of motion to the co-visible regions, and evaluate occlusion detection as a standard binary classification task. We compare our algorithm to [29], an example of robust motion estimation, and to [14], a representative of the approaches described in Sect. 1.1.\n\nIn our implementation6, we first solve (14) with standard relaxation ($W$ is the identity) and then with reweighted-$\ell^1$. To handle large motion, we use a pyramid with scale factor 0.5 and up to 4 levels; $\lambda$ and $\mu$ are fixed at 0.002 and 0.001 (Flower Garden) and 0.0006 and 0.0003 (Middlebury), respectively. To make the comparison with [29] fair, we modify the code provided online7 to include anisotropic regularization (Fig. 1).\n\n6The source code is available at http://vision.ucla.edu/~ayvaci/occlusion-detection/\n\n7http://gpu4vision.icg.tugraz.at\n\n
Note that no occlusion is present in the residual of the motion field computed by TV-L1, and consequently the motion estimates are less precise around occluding boundaries (top-left corner of the Flower Garden, plane on the left in Venus).\n\nFigure 1: Comparison with TV-L1 [29] on \u201cVenus\u201d from [2] and \u201cFlower Garden.\u201d The first column shows the motion estimates by TV-L1, color-coded as in [29]; the second, its residual $I(x, t) - I(w(x), t + dt)$; the third shows our motion estimates, and the fourth our residual $e_1$ defined in (14).\n\nOther frames of the Flower Garden sequence are shown in Fig. 2, where we have regularized the occluded region by minimizing a unilateral energy on $e_1$ with graph cuts.\n\nFigure 2: Motion estimates for more frames of the Flower Garden sequence (left), residual $e$ (middle), and occluded region (right).\n\nWe have also compared motion estimates obtained with our method and [29] in the co-visible regions for the Middlebury dataset (Table 1). Since occlusions can only be determined at the finest scale absent proper sampling conditions, in this experiment we minimize the same functional as [29] at coarse scales, and switch to (14) at the finest scale. To evaluate occlusion detection performance, we again use the Middlebury, and compare $e_1$ to ground-truth occlusions using precision/recall curves (Fig. 3) and average precision values (Table 2). We also show in Table 2 the improvement in detection performance when we use reweighted-$\ell^1$. We have compared our occlusion detection results to [14], using the code provided online by the authors (Table 3). Comparing motion estimates gives an unfair\n\nVenus RubberWhale Hydrangea Grove2 Grove3 Urban2 Urban3\n6.41\n4.37\n7.12\n5.28\n0.30\n0.84\n0.89\n0.33\n\nAAE (ours)\nAAE (L1TV)\nAEPE (ours)\nAEPE (L1TV)\nTable 1: Quantitative comparison of our algorithm with TV-L1 [29]. 
Average Angular Error (AAE) and Average End Point Error (AEPE) of motion estimates in co-visible regions.\n\n2.35\n2.44\n0.19\n0.20\n\n5.42\n4.49\n0.18\n0.13\n\n2.32\n3.45\n0.16\n0.24\n\n5.72\n7.66\n0.59\n0.74\n\n3.60\n3.57\n0.39\n0.46\n\nFigure 3: Left to right: representative samples of motion estimates from the Middlebury dataset, labeled ground-truth occlusions, error term estimate $e_1$, and precision-recall curves for our occlusion detection.\n\nadvantage to our algorithm because their approach is based on quantized disparity values, yielding lower accuracy.\n\nVenus RubberWhale Hydrangea Grove2 Grove3 Urban2 Urban3\n0.67\n0.69\n\n0.80\n$\ell^1$\nreweighted-$\ell^1$\n0.80\nTable 2: Average precision of our approach on Middlebury data with and without re-weighting.\n\n0.55\n0.57\n\n0.70\n0.70\n\n0.60\n0.61\n\n0.72\n0.73\n\n0.48\n0.49\n\nIt takes 186 seconds for a Matlab/C++ implementation of Nesterov's algorithm to converge to a solution on a 288 \u00d7 352 frame from the Flower Garden sequence. We have also compared Nesterov's algorithm to the split Bregman method [9] for minimization of (14) in terms of convergence speed, and reported the results in [20].\n\nVenus RubberWhale Hydrangea Grove2 Grove3 Urban2 Urban3\n0.61\n0.66\n0.69\n\nPrecision [14]\nRecall [14]\nPrecision (ours)\nTable 3: Comparison with [14] on Middlebury. Since Kolmogorov et al. 
provide a binary output, we display our precision at their same recall value.\n\n0.79\n0.45\n0.86\n\n0.68\n0.20\n0.96\n\n0.26\n0.50\n0.95\n\n0.56\n0.51\n0.94\n\n0.72\n0.55\n0.96\n\n0.46\n0.20\n0.91\n\n5 Discussion\n\nWe have presented an algorithm to detect occlusions and establish correspondence between two images. It leverages a formulation that, starting from standard assumptions (Lambertian reflection, constant diffuse illumination), arrives at a convex optimization problem. Our approach does not assume a rigid scene, nor a single moving object. It also does not assume that the occluded region is simply connected: occlusions in natural scenes can be very complex (see Fig. 3) and should therefore, in general, not be spatially regularized. The fact that occlusion detection reduces to a two-phase segmentation of the domain into either occluded ($\Omega$) or visible ($D \setminus \Omega$) should not confuse the reader familiar with the image segmentation literature, whereby two-phase segmentation of one object (foreground) from the background can be posed as a convex optimization problem [6] but breaks down in the presence of multiple objects, or \u201cphases.\u201d Note that in [6] the problem can be made convex only in $e_1$, but not jointly in $e_1$ and $v$. We focus on inter-frame occlusion detection; temporal consistency of occlusion \u201clayers\u201d was addressed in [12].\n\nThe limitations of our approach lie mostly in its dependence on the regularization coefficients $\lambda$ and $\mu$. In the absence of some estimate of the variance coefficient $\lambda$, one is left with tuning it by trial-and-error. 
Similarly, $\mu$ is a parameter that, like in any classification problem, trades off missed detections and false alarms, and therefore no single value is \u201coptimal\u201d in any meaningful sense. These limitations are shared by most variational optical flow estimation algorithms.\n\nAcknowledgement: This work was supported by AFOSR FA9550-09-1-0427, ARO 56765-CI, and ONR N00014-08-1-0414.\n\nReferences\n\n[1] L. Alvarez, R. Deriche, T. Papadopoulo, and J. Sánchez. Symmetrical dense optical flow estimation with occlusions detection. International Journal of Computer Vision, 75(3):371\u2013385, 2007.\n\n[2] S. Baker, D. Scharstein, J. Lewis, S. Roth, M. Black, and R. Szeliski. A database and evaluation methodology for optical flow. In Proceedings of the International Conference on Computer Vision, volume 5, 2007.\n\n[3] S. Becker, J. Bobin, and E. Candès. NESTA: a fast and accurate first-order method for sparse recovery. arXiv preprint, 904, 2009.\n\n[4] R. Ben-Ari and N. Sochen. Variational stereo vision with sharp discontinuities and occlusion handling. In ICCV, pages 1\u20137. IEEE Computer Society, 2007.\n\n[5] E. Candès, M. Wakin, and S. Boyd. Enhancing sparsity by reweighted $\ell^1$ minimization. Journal of Fourier Analysis and Applications, 14(5):877\u2013905, 2008.\n\n[6] T. Chan, S. Esedoglu, and M. Nikolova. Algorithms for finding global minimizers of denoising and segmentation models. SIAM J. Appl. Math., 66:1632\u20131648, 2006.\n\n[7] J. Dahl, P. Hansen, S. Jensen, and T. Jensen. Algorithms and software for total variation image reconstruction via first-order methods. Numerical Algorithms, pages 67\u201392, 2009.\n\n[8] J. J. Gibson. The ecological approach to visual perception. LEA, 1984.\n\n[9] T. Goldstein and S. Osher. The split Bregman method for L1 regularized problems. SIAM Journal on Imaging Sciences, 2(2):323\u2013343, 2009.\n\n[10] P. Huber and E. Ronchetti. Robust statistics. 
John Wiley & Sons Inc, 2009.\n\n[11] S. Ince and J. Konrad. Occlusion-aware optical flow estimation. IEEE Transactions on Image Processing, 17(8):1443\u20131451, 2008.\n\n[12] J. Jackson, A. J. Yezzi, and S. Soatto. Dynamic shape and appearance modeling via moving and deforming layers. International Journal of Computer Vision, 2008.\n\n[13] Y. Kim, A. Martínez, and A. Kak. Robust motion estimation under varying illumination. Image and Vision Computing, 23(4):365\u2013375, 2005.\n\n[14] V. Kolmogorov and R. Zabih. Computing visual correspondence with occlusions via graph cuts. In International Conference on Computer Vision, volume 2, pages 508\u2013515, 2001.\n\n[15] K. Lim, A. Das, and M. Chong. Estimation of occlusion and dense motion fields in a bidirectional Bayesian framework. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 712\u2013718, 2002.\n\n[16] S. Negahdaripour. Revised definition of optical flow: integration of radiometric and geometric cues for dynamic scene analysis. IEEE Transactions on Pattern Analysis and Machine Intelligence, pages 961\u2013979, 1998.\n\n[17] Y. Nesterov. A method for unconstrained convex minimization problem with the rate of convergence O(1/k^2). In Doklady AN SSSR, volume 269, pages 543\u2013547, 1983.\n\n[18] Y. Nesterov. Smooth minimization of non-smooth functions. Mathematical Programming, 103(1):127\u2013152, 2005.\n\n[19] M. Proesmans, L. Van Gool, and A. Oosterlinck. Determination of optical flow and its discontinuities using a non-linear diffusion. In European Conference on Computer Vision, 1994.\n\n[20] M. Raptis, A. Ayvaci, and S. Soatto. Occlusion detection and motion estimation via convex optimization. Technical report, UCLA CAM 10-36, June 2010.\n\n[21] D. Shulman and J. Herve. Regularization of discontinuous flow fields. In Proc. of Workshop on Visual Motion, pages 81\u201386, 1989.\n\n[22] S. Soatto. 
Steps Towards a Theory of Visual Information. Technical report, UCLA-CSD100028, September 2010.\n\n[23] S. Soatto, A. J. Yezzi, and H. Jin. Tales of shape and radiance in multiview stereo. In Intl. Conf. on Comp. Vision, pages 974\u2013981, October 2003.\n\n[24] C. Strecha, R. Fransens, and L. Van Gool. A probabilistic approach to large displacement optical flow and occlusion detection. In ECCV Workshop SMVP, pages 71\u201382. Springer, 2004.\n\n[25] J. Sun, Y. Li, S. Kang, and H. Shum. Symmetric stereo matching for occlusion handling. In IEEE Conference on Computer Vision and Pattern Recognition, volume 2, page 399, 2005.\n\n[26] C. Teng, S. Lai, Y. Chen, and W. Hsu. Accurate optical flow computation under non-uniform brightness variations. Computer Vision and Image Understanding, 97(3):315\u2013346, 2005.\n\n[27] R. Tibshirani. Regression shrinkage and selection via the lasso. Journal of the Royal Statistical Society, Series B (Methodological), 58(1):267\u2013288, 1996.\n\n[28] A. Verri and T. Poggio. Motion field and optical flow: qualitative properties. IEEE Transactions on Pattern Analysis and Machine Intelligence, 11(5):490\u2013498, 1989.\n\n[29] A. Wedel, T. Pock, C. Zach, H. Bischof, and D. Cremers. An improved algorithm for TV-L1 optical flow. In Statistical and Geometrical Approaches to Visual Motion Analysis: International Dagstuhl Seminar. Springer, 2009.\n\n[30] J. Xiao, H. Cheng, H. Sawhney, C. Rao, M. Isnardi, et al. Bilateral filtering-based optical flow estimation with occlusion detection. Lecture Notes in Computer Science, 3951:211, 2006.\n", "award": [], "sourceid": 82, "authors": [{"given_name": "Alper", "family_name": "Ayvaci", "institution": null}, {"given_name": "Michalis", "family_name": "Raptis", "institution": null}, {"given_name": "Stefano", "family_name": "Soatto", "institution": null}]}