{"title": "Convex Relaxation of Mixture Regression with Efficient Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 1491, "page_last": 1499, "abstract": "We develop a convex relaxation of maximum a posteriori estimation of a mixture of regression models. Although our relaxation involves a semidefinite matrix variable, we reformulate the problem to eliminate the need for general semidefinite programming. In particular, we provide two reformulations that admit fast algorithms. The first is a max-min spectral reformulation exploiting quasi-Newton descent. The second is a min-min reformulation consisting of fast alternating steps of closed-form updates. We evaluate the methods against Expectation-Maximization in a real problem of motion segmentation from video data.", "full_text": "Convex Relaxation of Mixture Regression with\n\nEf\ufb01cient Algorithms\n\nNovi Quadrianto, Tib\u00b4erio S. Caetano, John Lim\n\nNICTA - Australian National University\n{\ufb01rstname.lastname}@nicta.com.au\n\nCanberra, Australia\n\nDale Schuurmans\nUniversity of Alberta\nEdmonton, Canada\ndale@cs.ualberta.ca\n\nAbstract\n\nWe develop a convex relaxation of maximum a posteriori estimation of a mixture\nof regression models. Although our relaxation involves a semide\ufb01nite matrix vari-\nable, we reformulate the problem to eliminate the need for general semide\ufb01nite\nprogramming. In particular, we provide two reformulations that admit fast algo-\nrithms. The \ufb01rst is a max-min spectral reformulation exploiting quasi-Newton de-\nscent. The second is a min-min reformulation consisting of fast alternating steps of\nclosed-form updates. We evaluate the methods against Expectation-Maximization\nin a real problem of motion segmentation from video data.\n\nIntroduction\n\n1\nRegression is a foundational problem in machine learning and statistics. 
In practice, however, data is often better modeled by a mixture of regressors, as demonstrated by the prominence of mixture regression in a number of application areas. Gaffney and Smyth [1], for example, use mixture regression to cluster trajectories, i.e. sets of short sequences of data such as cyclone or object movements in video sequences as a function of time. Each trajectory is believed to have been generated from one of a number of components, where each component is associated with a regression model. Finney et al. [2] have employed an identical mixture regression model in the context of planning: regression functions are strategies for a given planning problem. Elsewhere, the mixture of regressors model has been shown to be useful in addressing covariate shift, i.e. the situation where the distribution of the training set used for modeling does not match the distribution of the test set in which the model will be used. Storkey and Sugiyama [3] model the covariate shift process in a mixture regression setting by assuming a shift in the mixing proportions of the components.
In each of these problems, one must estimate k distinct latent regression functions; that is, estimate functions whose values correspond to the mean of response variables, under the assumption that the response variable is generated by a mixture of k components. This estimation problem can be easily tackled if it is known to which component each response variable belongs (yielding k independent regression problems). However, in general the component of a given observation is not known and is modeled as a latent variable. A commonly adopted approach for maximum-likelihood estimation with latent variables (in this case, component membership for each response variable) is Expectation-Maximization (EM) [4]. Essentially, EM iterates inference over the hidden variables and parameter estimation of the resulting decoupled models until a local optimum is reached.
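The "known memberships" observation above can be made concrete with a short sketch (an illustration of our own, not from the paper; all names and data are hypothetical): given the component of each response, mixture regression decomposes into k independent least-squares problems.

```python
import numpy as np

# Illustration (our own, hypothetical data): with known component
# memberships, mixture regression reduces to k independent
# least-squares problems, one per component.
rng = np.random.default_rng(0)
t, n, k = 300, 2, 3
X = rng.normal(size=(t, n))
W_true = rng.normal(size=(k, n))          # one weight vector per component
z = rng.integers(0, k, size=t)            # known memberships (the easy case)
y = np.einsum("ij,ij->i", X, W_true[z]) + 0.01 * rng.normal(size=t)

# Solve the k decoupled regression problems.
W_hat = np.stack([np.linalg.lstsq(X[z == j], y[z == j], rcond=None)[0]
                  for j in range(k)])
assert np.allclose(W_hat, W_true, atol=0.05)
```

The difficulty addressed in this paper arises precisely because z is not observed.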
We are not aware of any approach to maximum likelihood estimation of a mixture of regression models that is not based on the non-convex marginal likelihood objective of EM.
In this paper we present a convex relaxation of maximum a posteriori estimation of a mixture of regression models. Recently, convex relaxations have gained considerable attention in machine learning (c.f. [5, 6]). By exploiting convex duality, we reformulate a relaxation of mixture regression as a semidefinite program. To achieve a scalable approach, however, we propose two reformulations that admit fast algorithms. The first is a max-min optimization problem which can be solved by iterations of quasi-Newton steps and eigenvector computations. The second is a min-min optimization problem solvable by iterations of closed-form solutions. We present experimental results comparing our methods against EM, both in synthetic problems and real computer vision problems, and show some benefits of a convex approach over a local solution method.

Related work Goldfeld and Quandt [7] introduced a mixture regression model with two components called switching regressions. The problem is re-cast into a single composite regression equation by introducing a switching variable. A consistent estimator is then produced by a continuous relaxation of this switching variable. An EM algorithm for switching regressions was first presented by Hosmer [8]. Späth [9] introduced a problem called clusterwise linear regression, consisting of finding a k-partition of the data such that a least squares regression criterion within those partitions becomes a minimum. A non-probabilistic algorithm similar to k-means was proposed. Subsequently, the general k-partition case employing EM was developed (c.f. [10, 11, 1]) and extended to various situations, including the use of variable-length trajectory data and non-parametric regression models. In the extreme, each individual can have its own specific regression model, coupled at a higher level by a mixture on the regression parameters [12]. An EM algorithm is again employed to handle the hidden data, in this case the group membership of parameters. The Hierarchical Mixtures of Experts model [13] also shares some similarity with mixture regression, in that gating networks containing mixtures of generalized linear models are defined. In principle, our algorithmic advances can be applied to many of these formulations.

2 The Model
Notation In the following we use uppercase letters (X, Π, Ψ) to denote matrices and lowercase letters (x, y, w, π, ψ, c) to denote vectors. We use t to denote the sample size, n the dimensionality of the data and k the number of mixture components. Λ(a) denotes a diagonal matrix whose diagonal is equal to the vector a, and diag(A) is a vector equal to the diagonal of the matrix A. Finally, we let 1 denote the vector of all ones, use ⊙ to denote the Hadamard (componentwise) matrix product, and use ⊗ to denote the Kronecker product.
We are given a matrix of regressors X ∈ R^{t×n} and a vector of regressands y ∈ R^{t×1}, where the response variable y is generated by a mixture of k components, but we do not know which component of the mixture generates each response yi. We therefore use the matrix Π ∈ {0,1}^{t×k}, Π1 = 1, to denote the hidden assignment of mixture labels to each observation: Πij = 1 iff observation i has mixture label j. We use xi to denote the ith row of X (i.e. observation i as a row vector), πi to denote the ith row of Π and yi to denote the ith element of y. We assume a linear generative model for yi on a feature representation ψi = πi ⊗ xi, under i.i.d.
sampling:

  yi | xi, πi = ψi w + εi,   εi ∼ N(0, σ²),   (1)

where w ∈ R^{nk×1} is the vector of stacked parameter vectors of the components. We therefore have the likelihood

  p(yi | xi, πi; w) = (1/√(2πσ²)) exp[ −(ψi w − yi)² / (2σ²) ]   (2)

for a single observation i (recalling that ψi depends on both xi and πi). We further impose a Gaussian prior on w for capacity control. Also, one may want to constrain the size of the largest mixture component. For that purpose one could constrain the solutions Π such that max(diag(ΠᵀΠ)) ≤ γt, where γt is an upper bound on the size of the largest component (γ is an upper bound on the proportion of the largest component). Combining these assumptions and adopting matrix notation, we obtain the optimization problem: minimize the negative log-posterior of the entire sample,

  min_{Π,w} [ Σᵢ A(ψi, w) − (1/σ²) yᵀΨw + (1/(2σ²)) yᵀy + (α/2) wᵀw ],   (3)
  where  A(ψi, w) = (1/(2σ²)) wᵀψᵢᵀψᵢ w + (1/2) log(2πσ²).   (4)

Here Ψ is the matrix whose rows are the vectors ψi = πi ⊗ xi. Since X is observed, note that the optimization only runs over Π in Ψ. The constraint max(diag(ΠᵀΠ)) ≤ γt may also be added. Eliminating constant terms, our final task will be to solve

  min_{Π,w} [ (1/(2σ²)) wᵀΨᵀΨw − (1/σ²) yᵀΨw + (α/2) wᵀw ].   (5)

Although marginally convex in w, this objective is not jointly convex in w and Π (and involves non-convex constraints on Π owing to its discreteness). The lack of joint convexity makes the optimization difficult. The typical approach in such situations is to use an alternating descent strategy, such as EM.
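Such an alternating scheme can be sketched as follows (a minimal illustration of our own, using hard assignments in the spirit of the k-means-like clusterwise methods of [9] rather than EM's soft responsibilities; all names are hypothetical):

```python
import numpy as np

# Alternating-descent sketch for objective (5): our own illustration, with
# hard assignments (k-means-style) rather than EM's soft responsibilities.
def alternate_mixreg(X, y, k, alpha=1e-3, iters=50, seed=0, z_init=None):
    rng = np.random.default_rng(seed)
    t, n = X.shape
    z = rng.integers(0, k, size=t) if z_init is None else z_init.copy()
    W = np.zeros((k, n))
    for _ in range(iters):
        for j in range(k):                       # M-step: ridge fit per component
            mask = (z == j)
            if mask.any():
                Xj, yj = X[mask], y[mask]
                W[j] = np.linalg.solve(Xj.T @ Xj + alpha * np.eye(n), Xj.T @ yj)
        resid = (y[:, None] - X @ W.T) ** 2      # E-step: t x k squared residuals
        z = resid.argmin(axis=1)                 # hard assignment
    return W, z
```

Each sweep cannot increase the (hard-assignment) objective, but the procedure only reaches a local optimum, which is the motivation for the convex relaxation developed below.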
Instead, in the following we develop a convex relaxation for problem (5).

3 Semidefinite Relaxation
To obtain a convex relaxation we proceed in three steps. First, we dualize the first term in (5).

Lemma 1 Define A(Ψw) := (1/(2σ²)) wᵀΨᵀΨw. Then the Fenchel dual of A(Ψw) is A*(c) = (σ²/2) cᵀc, and therefore A(Ψw) = max_c cᵀΨw − (σ²/2) cᵀc.

Proof From the definition of the Fenchel dual we have A*(u) := max_w uᵀw − (1/(2σ²)) wᵀΨᵀΨw. Differentiating with respect to w and equating to zero, we obtain u = (1/σ²) ΨᵀΨw. Therefore u is only realizable if there exists a c such that u = Ψᵀc. Solving for A*(c) we obtain A*(c) = (σ²/2) cᵀc, and therefore by definition of Fenchel duality A(Ψw) = max_c cᵀΨw − (σ²/2) cᵀc.

A second lemma is required to further establish the relaxation.

Lemma 2 The following set inclusion holds:

  {ΠΠᵀ : Π ∈ {0,1}^{t×k}, Π1 = 1, max(diag(ΠᵀΠ)) ≤ γt}   (6)
  ⊆ {M : M ∈ R^{t×t}, tr M = t, γtI ⪰ M ⪰ 0}.   (7)

Proof Let ΠΠᵀ be an element of the first set. First notice that [ΠΠᵀ]ij ∈ {0,1}, since Π ∈ {0,1}^{t×k} and Π1 = 1 together imply that Π has a single 1 per row (and the rest are zeros). In particular [ΠΠᵀ]ii = 1 for all i, i.e. tr M = t. Finally, note that (ΠΠᵀ)Π = Π(ΠᵀΠ), where ΠᵀΠ is a diagonal matrix; therefore its diagonal elements are the eigenvalues of ΠΠᵀ, and in particular max(diag(ΠᵀΠ)) ≤ γt means that the largest possible eigenvalue of ΠΠᵀ is γt, which implies γtI ⪰ ΠΠᵀ.
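As an aside, the inclusion in Lemma 2 is easy to verify numerically (a check of our own, not the paper's code): for any valid assignment matrix Π, the Gram matrix ΠΠᵀ has trace t and its largest eigenvalue equals the size of the largest component.

```python
import numpy as np

# Numerical sanity check of Lemma 2 (our own illustration): for a random
# assignment matrix Pi, M = Pi Pi^T has trace t, and its eigenvalues are
# the component sizes (plus zeros), hence gamma*t*I >= M >= 0.
rng = np.random.default_rng(0)
t, k = 12, 3
z = rng.integers(0, k, size=t)
Pi = np.eye(k)[z]                              # Pi in {0,1}^{t x k}, one 1 per row
M = Pi @ Pi.T
gamma_t = np.bincount(z, minlength=k).max()    # size of the largest component

assert np.isclose(np.trace(M), t)
eigs = np.linalg.eigvalsh(M)
assert eigs.min() >= -1e-9                     # M >= 0
assert eigs.max() <= gamma_t + 1e-9            # gamma*t*I >= M
```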
Since ΠΠᵀ is by construction positive semidefinite, we have γtI ⪰ ΠΠᵀ ⪰ 0. Therefore ΠΠᵀ is also a member of the second set.

The above two lemmas allow us to state our first main result.

Theorem 3 The following convex optimization problem

  min_{M : tr M = t, γtI ⪰ M ⪰ 0} max_c [ −(σ²/2) cᵀc − (1/(2α)) (y/σ² − c)ᵀ (M ⊙ XXᵀ)(y/σ² − c) ]   (8)

is a relaxation of (5) only in the sense that domain (6) is replaced by domain (7).

Proof We first use Lemma 1 in order to rewrite the objective (5) and obtain

  min_{Π,w} max_c [ cᵀΨw − (σ²/2) cᵀc − (1/σ²) yᵀΨw + (α/2) wᵀw ].   (9)

Second, using the distributivity of the (max, +) semiring, the max_c can be pulled out, and we then use Sion's minimax theorem [14], which allows us to interchange max_c with min_w:

  min_Π max_c min_w [ cᵀΨw − (σ²/2) cᵀc − (1/σ²) yᵀΨw + (α/2) wᵀw ],   (10)

and we can solve for w first, obtaining

  w = (1/α) Ψᵀ(y/σ² − c).   (11)

Substituting (11) into the objective of (10) results in

  min_Π max_c [ −(σ²/2) cᵀc − (1/(2α)) (y/σ² − c)ᵀ ΨΨᵀ (y/σ² − c) ].   (12)

We now note the critical fact that Ψ only shows up in the expression ΨΨᵀ which, from the definition ψi = πi ⊗ xi, is seen to be equivalent to
ΠΠᵀ ⊙ XXᵀ. Therefore the minimization over Π effectively takes place over ΠΠᵀ (since X is observed), and we have that (12) can be rewritten as

  min_{ΠΠᵀ} max_c [ −(σ²/2) cᵀc − (1/(2α)) (y/σ² − c)ᵀ (ΠΠᵀ ⊙ XXᵀ)(y/σ² − c) ].   (13)

So far no relaxation has taken place. By finally replacing the constraint (6) with constraint (7) from Lemma 2, we obtain the claimed semidefinite relaxation.

4 Max-Min Reformulation
By upper bounding the inner maximization in (8) and applying a Schur complement, problem (8) can be re-expressed as a semidefinite program. Unfortunately, such a formulation is computationally expensive to solve, requiring O(t⁶) for typical interior-point methods. Instead, we can reformulate problem (8) to allow for a fast algorithmic approach, without the introduction of any additional relaxation. The basis of our development is the following classical result.

Theorem 4 ([15]) Let V ∈ R^{t×t}, V = Vᵀ, have eigenvalues λ₁ ≥ λ₂ ≥ ··· ≥ λ_t. Let P be the matrix whose columns are the normalized eigenvectors of V, i.e. PᵀVP = Λ((λ₁, …, λ_t)). Let q ∈ {1, …, t} and let P_q be the matrix comprising the top q eigenvectors of P. Then

  max_{M : tr M = q, I ⪰ M ⪰ 0} tr(MVᵀ) = Σ_{i=1}^{q} λᵢ,   (14)
and
  argmax_{M : tr M = q, I ⪰ M ⪰ 0} tr(MVᵀ) ∋ P_q P_qᵀ.   (15)

Proof See [15] for a proof of a slightly more general result (Theorem 3.4).

We will now show how the optimization on M for problem (8) can be cast in the terms of Theorem 4.
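Theorem 4 is straightforward to check numerically (an illustration of our own, not the paper's code): over the spectrahedron {M : tr M = q, I ⪰ M ⪰ 0}, tr(MVᵀ) is maximized by the projector onto the top-q eigenspace, with value equal to the sum of the top q eigenvalues.

```python
import numpy as np

# Numerical illustration of Theorem 4 (our own check): the maximum of
# tr(M V^T) over {M : tr M = q, I >= M >= 0} is the sum of the top-q
# eigenvalues of V, attained at Pq Pq^T.
rng = np.random.default_rng(0)
t, q = 8, 3
A = rng.normal(size=(t, t))
V = (A + A.T) / 2                        # symmetric V
lam, P = np.linalg.eigh(V)               # eigh returns ascending eigenvalues
lam, P = lam[::-1], P[:, ::-1]           # sort descending
Pq = P[:, :q]
M_star = Pq @ Pq.T                       # a maximizer per (15)

assert np.isclose(np.trace(M_star), q)                     # feasible: trace = q
assert np.all(np.linalg.eigvalsh(M_star) <= 1 + 1e-9)      # feasible: I >= M >= 0
assert np.isclose(np.trace(M_star @ V.T), lam[:q].sum())   # attains (14)
```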
This will turn out to be critical for the efficiency of the optimization procedure, since Theorem 4 describes a purely spectral optimization routine, which is far more efficient (O(t³)) than standard interior-point methods used for semidefinite programming (O(t⁶)).

Proposition 5 Define ȳ := y/σ². The following optimization problem

  max_c [ −(σ²/2) cᵀc − (1/(2α)) max_{M : tr M = t, γtI ⪰ M ⪰ 0} tr(M(XXᵀ ⊙ (ȳ − c)(ȳ − c)ᵀ)) ]   (16)

is equivalent to optimization problem (8).

Proof By Sion's minimax theorem [14], min_M and max_c in (8) can be interchanged:

  max_c min_{M : tr M = t, γtI ⪰ M ⪰ 0} [ −(σ²/2) cᵀc − (1/(2α)) (ȳ − c)ᵀ (M ⊙ XXᵀ)(ȳ − c) ],   (17)

which, by distributivity of the (min, +) semiring, is equivalent to

  max_c [ −(σ²/2) cᵀc + (1/(2α)) min_{M : tr M = t, γtI ⪰ M ⪰ 0} −(ȳ − c)ᵀ (M ⊙ XXᵀ)(ȳ − c) ].   (18)

Now, define K := XXᵀ. The objective of the minimization in (18) can then be written as

  −(ȳ − c)ᵀ (M ⊙ K)(ȳ − c) = −tr( (M ⊙ K) [(ȳ − c)(ȳ − c)ᵀ] )   (19)
  = −Σᵢⱼ (Mᵢⱼ Kᵢⱼ) [(ȳ − c)(ȳ − c)ᵀ]ᵢⱼ = −Σᵢⱼ Mᵢⱼ ( Kᵢⱼ [(ȳ − c)(ȳ − c)ᵀ]ᵢⱼ )   (20)
  = −tr( M (K ⊙ (ȳ − c)(ȳ − c)ᵀ) ) = −tr( M (XXᵀ ⊙ (ȳ − c)(ȳ − c)ᵀ) ).   (21)

Finally, by writing min_M −f(M) as −max_M f(M), we obtain the claim.

We can now exploit the result in Theorem 4 for the purpose of our optimization problem.

Proposition 6 Let q = max{u ∈ {1, …, t} : u ≤ γ⁻¹}. The following optimization problem

  max_c [ −(σ²/2) cᵀc − (t/(2αq)) max_{M̄ : tr M̄ = q, I ⪰ M̄ ⪰ 0} tr(M̄(XXᵀ ⊙ (ȳ − c)(ȳ − c)ᵀ)) ]   (22)

is equivalent to optimization problem (16).

Proof The only differences between (16) and (22) are (i) the factor t/q in the second term of (22) and (ii) the constraints {M : tr M = t, γtI ⪰ M ⪰ 0} in (16) versus {M̄ : tr M̄ = q, I ⪰ M̄ ⪰ 0} in (22). These differences are simply the result of a proper rescaling of M: if we define M̄ := (q/t)M, then I ⪰ M̄ ⪰ 0 since q ≤ γ⁻¹, and tr M̄ = q. The result follows.

And finally we have the second main result.

Theorem 7 Optimization problem (22) is equivalent to optimization problem (8).

Proof The equivalence follows directly from Propositions 5 and 6.

Note that, crucially, the objective in (22) is concave in c. Our strategy is now clear. Instead of solving (8), which demands O(t⁶) operations, we instead solve (22), whose inner optimization is a maximum-eigenvalue problem demanding only O(t³) operations. In the next section we describe an algorithm to jointly optimize M and c in (22), which essentially consists of alternating the efficient spectral solution over M with a subgradient optimization over c.

4.1 Max-Min Algorithm
Algorithm 1 describes how we solve optimization problem (22). The idea of the algorithm is the following. First, having noted that (22) is concave in c, we can simply initialize c arbitrarily and pursue a fast subgradient ascent algorithm (e.g. nonsmooth BFGS [16]). So at each step we solve the eigenvalue problem and recompute a subgradient, until convergence to c*. We then need to recover M* such that (c*, M*) is a saddle point (note that problem (22) is concave in c and convex in M). For that purpose we use (15). If M* = P_q P_qᵀ is such that P_q is unique, then we are done, and the labeling solution of mixture membership is M* (subject to roundoff). If P_q is not unique, then we have multiplicity of eigenvalues and we need to proceed as follows. Define P_l = [p₁ … p_q … p_l], l > q, where each of the additional l − q eigenvectors has an associated eigenvalue equal to the eigenvalue of one of the previous q eigenvectors. We then have that at the saddle point there must exist a diagonal matrix Λ such that M* = P_l Λ P_lᵀ, subject to Λ ⪰ 0 and tr Λ = q (if this were not the case, there would be an ascent direction at c*, contradicting the hypothesis that c* is optimal). To find such a Λ, and therefore recover the correct M, we need to enforce that we are at the optimal c (c*), i.e. we must have

  ‖ d/dc [ −(σ²/2) cᵀc − (t/(2αq)) max_{M : tr M = q, I ⪰ M ⪰ 0} tr(M(XXᵀ ⊙ (ȳ − c)(ȳ − c)ᵀ)) ] ‖₂² = 0.   (23)

This condition can be pursued by minimizing the above norm, which gives a quadratic program:

  min_{λ ≥ 0, λᵀ1 = q} ‖ σ²c* + (t/(αq)) (P_l Λ(λ) P_lᵀ ⊙ XXᵀ)(c* − ȳ) ‖₂².   (24)

We can then recover the final solution (subject to roundoff) by M* = P_l Λ(λ*) P_lᵀ, where λ* is the optimizer of (24). The optimal value of (24) should be very close to zero (since it is the norm of the derivative at the point c*). The pseudocode appears in Algorithm 1.

Algorithm 1
1: Input: γ, σ, α, XXᵀ
2: Output: (c*, M*)
3: Initialize c = 0
4: repeat
5:   Solve for the maximum value in the inner maximization of (22) using (14)
6:   Solve the outer maximization in (22) using nonsmooth BFGS [16], obtaining a new c
7: until c has converged (c = c*)
8: At c*, solve for the maximizer(s) P_q in the inner maximization of (22) using (15)
9: if P_q is unique then
10:   return M* = P_q P_qᵀ
11: else
12:   Assemble the top l eigenvectors in P_l
13:   Solve (24)
14:   return M* = P_l Λ(λ*) P_lᵀ
15: end if

Algorithm 2
1: Input: γ, σ, α, XXᵀ
2: Output: (c*, M*)
3: Initialize M = Λ((1/(γt), …, 1/(γt)))
4: repeat
5:   Solve for the minimum value in the inner minimization of (25), obtaining A
6:   Solve the outer minimization in (25) given the SVD of A, using Theorem 4.1 of [18], obtaining a new M
7: until M has converged (M = M*)
8: Recover c* = −(1/σ²) diag(X(A*)ᵀ)

5 Min-Min Reformulation
Although the max-min formulation appears satisfactory, the recent literature on multitask learning [17, 18] has developed an alternate strategy for bypassing general semidefinite programming. Specifically, work in this area leads to convex optimization problems expressed jointly over two matrix variables, where each step is an alternating min-min descent that can be executed in closed form or by a very fast algorithm. Although it is not immediately apparent that this algorithmic strategy is applicable to the problem at hand, with some further reformulation of (8) we discover that in fact the same min-min algorithmic approach can be applied to our mixture of regression problem.

Theorem 8 The following optimization problem

  min_{M : I ⪰ M ⪰ 0, tr M = 1/γ} min_A [ (1/σ²) yᵀ diag(XAᵀ) + (1/(2σ²)) diag(XAᵀ)ᵀ diag(XAᵀ) + (α/(2γt)) tr(AᵀM⁻¹A) ]   (25)

is equivalent to optimization problem (8).

Proof

  min_{M : I ⪰ M ⪰ 0, tr M = 1/γ} max_c [ −(σ²/2) cᵀc − (γt/(2α)) (c − ȳ)ᵀ (M ⊙ XXᵀ)(c − ȳ) ]   (26)
  = min_{M : I ⪰ M ⪰ 0, tr M = 1/γ} max_{c, C : C = Λ(c − ȳ)X} [ −(σ²/2) cᵀc − (γt/(2α)) tr(CᵀMC) ]   (27)
  = min_{M : I ⪰ M ⪰ 0, tr M = 1/γ} max_{c, C} min_A [ −(σ²/2) cᵀc − (γt/(2α)) tr(CᵀMC) + tr(AᵀC) − tr(AᵀΛ(c − ȳ)X) ].   (28)

We can then solve for c and C, obtaining c = −(1/σ²) diag(XAᵀ) and C = (α/(γt)) M⁻¹A.
Substituting those two variables into (28) proves the claim.

5.1 Min-Min Algorithm
Problem (25) is jointly convex in A and M [14], and Algorithm 2 describes how to solve it. It is important to note that although each iteration of Algorithm 2 is efficient, many iterations are required to reach a desired tolerance, since it is only first-order convergent. We observe in our experiments that the concave-convex max-min approach in Algorithm 1 is more efficient, simply because it has the same iteration cost but exploits a quasi-Newton descent in the outer optimization, which converges faster.

Remark 9 In practice, similarly to [17], a regularizer on M is added to avoid singularity, resulting in the following regularized objective function:

  min_{M : I ⪰ M ⪰ 0, tr M = 1/γ} min_A [ (1/σ²) yᵀ diag(XAᵀ) + (1/(2σ²)) diag(XAᵀ)ᵀ diag(XAᵀ) + (α/(2γt)) tr(AᵀM⁻¹A) + ε tr(M⁻¹) ].   (29)

The problem is still jointly convex in M and A.

6 Experiments
Our primary objective in formulating this convex approach to mixture regression is to tackle a difficult problem in video analysis (see below). However, to initially evaluate the different approaches, we conducted some synthetic experiments. We generated 30 synthetic data points according to yi = (πi ⊗ xi)w + εi, with xi ∈ R, εi ∼ N(0, 1) and w ∼ U(0, 1). The response variable yi is assumed to be generated from a mixture of 5 components. We compared the quality of the relaxation in (22) to EM; the max-min algorithm is used in this experiment. For EM, 100 random restarts were used to help avoid poor local optima. The experiment was repeated 10 times. The error rates are 0.347 ± 0.086 and 0.280 ± 0.063 for EM and the convex relaxation, respectively. A visualization of the recovered membership for one of the runs is given in Figure 1.
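The synthetic generative process described above can be sketched as follows (our own reconstruction of the stated setup; the paper's exact data and seeds are not available, so all concrete values here are hypothetical):

```python
import numpy as np

# Sketch of the synthetic setup: y_i = (pi_i kron x_i) w + eps_i, with
# scalar x_i and k = 5 components. Our own reconstruction; the paper's
# exact data is not available.
rng = np.random.default_rng(0)
t, k = 30, 5
x = rng.normal(size=t)                   # x_i in R
w = rng.uniform(0, 1, size=k)            # stacked parameters, w ~ U(0, 1)
z = rng.integers(0, k, size=t)           # hidden component of each point
Pi = np.eye(k)[z]                        # assignment matrix, Pi 1 = 1
Psi = Pi * x[:, None]                    # row i is pi_i kron x_i (x_i scalar)
y = Psi @ w + rng.normal(size=t)         # eps_i ~ N(0, 1)

assert Psi.shape == (t, k) and np.allclose(Psi.sum(axis=1), x)
```

Note that with scalar xi, the Kronecker product πi ⊗ xi reduces to the one-hot row πi scaled by xi.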
This demonstrates that the relaxation can retain much of the structure of the problem.

6.1 Vision Experiment
In a dynamic scene, various static and moving objects are viewed by a possibly moving observer. For example, consider a moving, hand-held camera filming a scene of several cars driving down the road. Each car has a separate motion, and even the static objects, such as trees, appear to move in the video due to the self-motion of the camera. The task of segmenting each object according to its motion, estimating the parameters of each motion, and recovering the structure of the scene is known as the multibody structure and motion problem. This is a missing variable problem: if the motions have been segmented correctly, it is easy to estimate the parameters of each motion. Naturally, models employing EM have been proposed to tackle such problems (c.f. [19, 20]).
From epipolar geometry, given a pair of corresponding points pi and qi from two images (pi, qi ∈ R^{3×1}), we have the epipolar equation qᵢᵀFpᵢ = 0. The fundamental matrix F encapsulates information about the translation and rotation, relative to the scene points, between the positions of the camera where the two images were captured, as well as camera calibration parameters such as the focal length. In a static scene, where only the camera is moving, there is only one fundamental matrix, which arises from the camera self-motion. However, if some of the scene points are moving independently under multiple different motions, there are several fundamental matrices. If there are k motion groups, the epipolar equation can be expressed in terms of the multibody fundamental matrix [21], i.e. ∏_{j=1}^{k} (qᵢᵀ Fⱼ pᵢ) = 0. An algebraic method was proposed to recover this matrix via Generalized PCA [21]. An alternative approach, which we follow here, is by Li [22], who casts the problem as a mixture of fundamental matrices, i.e. qᵢᵀ(Σ_{j=1}^{k} πᵢⱼ Fⱼ)pᵢ = 0, where the membership variable πᵢⱼ = 1 when image point i belongs to motion group j, and zero otherwise. Furthermore, since qᵢᵀFpᵢ = 0 is bilinear in the image points, we can rewrite it as xᵢᵀwⱼ = 0, with column vectors xᵢ = [qᵢˣpᵢˣ, qᵢˣpᵢʸ, …]ᵀ and wⱼ = vec(Fⱼᵀ). Thus, we end up with the following linear equation: Σ_{j=1}^{k} πᵢⱼ xᵢᵀwⱼ = 0. The weight vector wⱼ for motion group j can be recovered easily if the indicator variable πᵢⱼ is known.
We are interested in assessing the effectiveness of EM-based and convex relaxation-based methods for this multibody structure and motion problem. We used the Hopkins 155 dataset [23]. The experimental results are summarized in Table 1. All hyperparameters (EM: α and σ; convex relaxation: α, σ, and γ) were tuned, and the best performances for each learning algorithm are reported. The EM algorithm was run with 100 random restarts to help avoid poor local optima. In terms of computation time, max-min runs comparably to the EM algorithm, while min-min runs on the order of 3 to 4 times slower. As an illustration, on a Pentium 4 3.6 GHz machine, the elapsed times (in seconds) for the two cranes dataset are 16.880, 23.536, and 60.003 for EM, max-min and min-min, respectively. Rounding for the convex versions was done by k-means, which introduces some differences in the final results for the two algorithms. Noticeably, both max-min and min-min outperform the EM algorithm. Visualizations of the motion segmentation on the two cranes, three cars, and cars2 07 datasets are given in Figure 2 (for kanatani2 and articulated, please refer to the Appendix).

7 Conclusion
The mixture regression problem is pervasive in many applications, and known approaches for parameter estimation rely on variants of EM, which naturally have issues with local minima.
In this paper we introduced a semidefinite relaxation for the mixture regression problem, thus obtaining a convex formulation which does not suffer from local minima. In addition, we showed how to avoid the use of expensive interior-point methods typically needed to solve semidefinite programs. This was achieved by introducing two reformulations amenable to faster algorithms. Experimental results with synthetic data as well as with real computer vision data suggest the proposed methods can substantially improve on EM, while one of the methods in addition has comparable runtimes.

Table 1: Error rate on several datasets from the Hopkins 155

Data set       m    EM      Max-Min Convex  Min-Min Convex
three cars     173  0.0532  0.0347          0.0289
kanatani2      63   0.0000  0.0000          0.0000
cars2 07       212  0.3396  0.2594          0.2642
two cranes     94   0.0532  0.0106          0.0213
articulated    150  0.0000  0.0000          0.0000

Figure 1: Recovered membership on synthetic data with EM and the convex relaxation. 30 data points are generated according to yi = (πi ⊗ xi)w + εi, with xi ∈ R, εi ∼ N(0, 1) and w ∼ U(0, 1). Panels: (a) Ground Truth, (b) EM, (c) Convex Relaxation.

Figure 2: Resulting motion segmentations produced by the various techniques on the Hopkins 155 dataset. 2(a)-2(d): two cranes, 2(e)-2(h): three cars, and 2(i)-2(l): cars2 07; in each row the panels show Ground Truth, EM, Max-Min Convex, and Min-Min Convex. In two cranes (first row), EM produces more segmentation errors at the left crane. In three cars (second row), the max-min method gives the least segmentation error (at the front side of the middle car) and EM produces more segmentation errors at the front side of the left car.
The contrast of EM and convex methods is apparent for cars2 07 (third row): the convex methods correctly segment the static grass field object, while EM makes mistakes. Further, the min-min method can almost perfectly segment the car in the middle of the scene from the static tree background.

References
[1] S. Gaffney and P. Smyth. Trajectory clustering with mixtures of regression models. In ACM SIGKDD, volume 62, pages 63-72, 1999.
[2] S. Finney, L. Kaelbling, and T. Lozano-Perez. Predicting partial paths from planning problem parameters. In Proceedings of Robotics: Science and Systems, Atlanta, GA, USA, June 2007.
[3] A. J. Storkey and M. Sugiyama. Mixture regression for covariate shift. In Schölkopf, editor, Advances in Neural Information Processing Systems 19, pages 1337-1344, 2007.
[4] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. Journal of the Royal Statistical Society, Series B (Methodological), 39(1):1-38, 1977.
[5] T. De Bie, N. Cristianini, P. Bennett, and E. Parrado-Hernández. Fast SDP relaxations of graph cut clustering, transduction, and other combinatorial problems. JMLR, 7:1409-1436, 2006.
[6] Y. Guo and D. Schuurmans. Convex relaxations for latent variable training. In Platt et al., editor, Advances in Neural Information Processing Systems 20, pages 601-608, 2008.
[7] S. M. Goldfeld and R. E. Quandt. Nonlinear Methods in Econometrics. Amsterdam: North-Holland Publishing Co., 1972.
[8] D. W. Hosmer. Maximum likelihood estimates of the parameters of a mixture of two regression lines. Communications in Statistics, 3(10):995-1006, 1974.
[9] H. Späth. Algorithm 39: clusterwise linear regression. Computing, 22:367-373, 1979.
[10] W. S. DeSarbo and W. L. Cron. A maximum likelihood methodology for clusterwise linear regression.
Journal of Classification, 5(1):249-282, 1988.
[11] P. N. Jones and G. J. McLachlan. Fitting finite mixture models in a regression context. Austral. J. Statistics, 34(2):233-240, 1992.
[12] S. Gaffney and P. Smyth. Curve clustering with random effects regression mixtures. In AISTATS, 2003.
[13] M. I. Jordan and R. A. Jacobs. Hierarchical mixtures of experts and the EM algorithm. Neural Computation, 6:181-214, 1994.
[14] S. Boyd and L. Vandenberghe. Convex Optimization. Cambridge University Press, 2004.
[15] M. Overton and R. Womersley. Optimality conditions and duality theory for minimizing sums of the largest eigenvalues of symmetric matrices. Mathematical Programming, 62:321-357, 1993.
[16] J. Yu, S. V. N. Vishwanathan, S. Günter, and N. Schraudolph. A quasi-Newton approach to nonsmooth convex optimization. In ICML, 2008.
[17] A. Argyriou, T. Evgeniou, and M. Pontil. Convex multi-task feature learning. Machine Learning, 73:243-272, 2008.
[18] J. Chen, L. Tang, J. Liu, and J. Ye. A convex formulation for learning shared structures from multiple tasks. In ICML, 2009.
[19] N. Vasconcelos and A. Lippman. Empirical Bayesian EM-based motion segmentation. In CVPR, 1997.
[20] P. Torr. Geometric motion segmentation and model selection. Philosophical Transactions of the Royal Society of London, 356(1740):1321-1340, 1998.
[21] R. Vidal, Y. Ma, S. Soatto, and S. Sastry. Two-view multibody structure from motion. IJCV, 68(1):7-25, 2006.
[22] H. Li. Two-view motion segmentation from linear programming relaxation.
In CVPR, 2007.\n[23] http://www.vision.jhu.edu/data/hopkins155/.\n", "award": [], "sourceid": 832, "authors": [{"given_name": "Novi", "family_name": "Quadrianto", "institution": null}, {"given_name": "John", "family_name": "Lim", "institution": null}, {"given_name": "Dale", "family_name": "Schuurmans", "institution": null}, {"given_name": "Tib\u00e9rio", "family_name": "Caetano", "institution": null}]}