{"title": "Subspace Clustering via Tangent Cones", "book": "Advances in Neural Information Processing Systems", "page_first": 6744, "page_last": 6753, "abstract": "Given samples lying on any of a number of subspaces, subspace clustering is the task of grouping the samples based on their corresponding subspaces. Many subspace clustering methods operate by assigning a measure of affinity to each pair of points and feeding these affinities into a graph clustering algorithm. This paper proposes a new paradigm for subspace clustering that computes affinities based on the corresponding conic geometry. The proposed conic subspace clustering (CSC) approach considers the convex hull of a collection of normalized data points and the corresponding tangent cones. The union of subspaces underlying the data imposes a strong association between the tangent cone at a sample $x$ and the original subspace containing $x$. In addition to describing this novel geometric perspective, this paper provides a practical algorithm for subspace clustering that leverages this perspective, where a tangent cone membership test is used to estimate the affinities. This algorithm is accompanied by deterministic and stochastic guarantees on the properties of the learned affinity matrix, namely the true and false positive rates and their spread, which directly translate into the overall clustering accuracy.", "full_text": "Subspace Clustering via Tangent Cones\n\nAmin Jalali\nWisconsin Institute for Discovery\nUniversity of Wisconsin\nMadison, WI 53715\namin.jalali@wisc.edu\n\nRebecca Willett\nDepartment of Electrical and Computer Engineering\nUniversity of Wisconsin\nMadison, WI 53706\nwillett@discovery.wisc.edu\n\nAbstract\n\nGiven samples lying on any of a number of subspaces, subspace clustering is the task of grouping the samples based on their corresponding subspaces. 
Many subspace clustering methods operate by assigning a measure of affinity to each pair of points and feeding these affinities into a graph clustering algorithm. This paper proposes a new paradigm for subspace clustering that computes affinities based on the corresponding conic geometry. The proposed conic subspace clustering (CSC) approach considers the convex hull of a collection of normalized data points and the corresponding tangent cones. The union of subspaces underlying the data imposes a strong association between the tangent cone at a sample x and the original subspace containing x. In addition to describing this novel geometric perspective, this paper provides a practical algorithm for subspace clustering that leverages this perspective, where a tangent cone membership test is used to estimate the affinities. This algorithm is accompanied by deterministic and stochastic guarantees on the properties of the learned affinity matrix, namely the true and false positive rates and their spread, which directly translate into the overall clustering accuracy.\n\n1 Introduction\n\nFinding a low-dimensional representation of high-dimensional data is central to many tasks in science and engineering. Union-of-subspaces models have been a popular data representation tool for the past decade. These models, while still parsimonious, offer more flexibility and better approximations to non-linear data manifolds than single-subspace models. To fully leverage union-of-subspaces models, we must be able to determine which data point lies in which subspace. This subproblem is referred to as subspace clustering [16].\nFormally, given a set of points x1, . . . , xN ∈ R^n lying on k linear subspaces S1, . . . , Sk ⊂ R^n, subspace clustering is the pursuit of partitioning those points into k clusters so that all points in each cluster lie within the same subspace among S1, . . . , Sk. 
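For concreteness, this setup is easy to simulate. The sketch below (a hypothetical helper, not code from the paper) draws unit-norm samples from k random subspaces, matching the normalization onto the unit sphere used throughout the paper:

```python
import numpy as np

def sample_union_of_subspaces(n, dims, n_per, rng=None):
    """Draw unit-norm samples from a union of random linear subspaces.

    n      : ambient dimension
    dims   : list of subspace dimensions d_1, ..., d_k
    n_per  : number of samples per subspace
    Returns X (n x N) with unit-norm columns and the ground-truth labels.
    """
    rng = np.random.default_rng(rng)
    cols, labels = [], []
    for t, d in enumerate(dims):
        # Random orthonormal basis for S_t (uniform w.r.t. the Haar measure).
        basis, _ = np.linalg.qr(rng.standard_normal((n, d)))
        pts = basis @ rng.standard_normal((d, n_per))
        pts /= np.linalg.norm(pts, axis=0)  # normalize onto the unit sphere
        cols.append(pts)
        labels += [t] * n_per
    return np.hstack(cols), np.array(labels)

# Mirrors the experiment scale used later in the paper: k = 3 subspaces of
# dimension 5 in R^10, 30 samples each (sizes here are only illustrative).
X, labels = sample_union_of_subspaces(n=10, dims=[5, 5, 5], n_per=30, rng=0)
```

Here `X` plays the role of the sample matrix [x1, . . . , xN] and `labels` is the ground-truth grouping that a subspace clustering method should recover.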
Once the points have been clustered into subspaces, standard dimensionality reduction methods such as principal component analysis can be used to identify the underlying subspaces. A generic approach in the literature is to construct a graph with each vertex corresponding to one of the given samples and each edge indicating whether (or the degree to which) a pair of points could have come from the same subspace. We refer to the (weighted) adjacency matrix of this graph as the affinity matrix. An ideal affinity matrix A would have A(i, j) = 1 if and only if xi and xj are in the same subspace, and otherwise A(i, j) = 0. Given an estimated affinity matrix, a variety of graph clustering methods, such as spectral clustering [17], can be used to cluster the samples, so forming the affinity matrix is a critical step.\nMany existing methods for subspace clustering with provable guarantees leverage the self-expressive property of the data. Such approaches pursue a representation of each data point in terms of the other data points, and then the representation coefficients are used to construct an affinity matrix.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nFor example, the celebrated sparse subspace clustering (SSC) approach of [3] seeks a representation of each sample as a weighted combination of the other points, with minimal ℓ1 norm. However, such sparse self-expression can lead to graph connectivity issues, e.g., see [10, 8, 20, 5, 19, 18], where clusters can be arbitrarily broken into separate components. This paper proposes a new paradigm for devising subspace clustering algorithms:\n\nConic Subspace Clustering (CSC): exploiting the association of the tangent cones to the convex hull of normalized samples with the original subspaces for computing affinities and subsequent clustering.\n\nCSC leverages new insights into the geometry of subspace clustering. 
One of the key effects of this approach is that the learned affinity matrix is generally denser among samples from the same subspace, which in turn can mitigate graph connectivity issues.\nIn Proposition 1 below, we hint at what we mean by the strong association of the tangent cones with the underlying subspaces for an ideal dataset. In Section 2, we show how a similar idea can be implemented with a finite number of samples. Given a set of nonzero samples from a union of linear subspaces, we normalize them to fall on the unit sphere and henceforth assume X = {x1, . . . , xN} ⊂ S^{n−1} is the set of samples. We further overload the notation to define X = [x1, x2, . . . , xN] ∈ R^{n×N}. Data hull refers to the convex hull of the samples. The tangent cone at x ∈ conv(X) with respect to conv(X) is defined as\n\nT(x) := cl conv cone(X + {−x}) = cl { Σ_{x′∈X} λ_{x′}(x′ − x) : λ_{x′} ≥ 0, x′ ∈ X },\n\nwhere the Minkowski sum of two sets A and B is denoted by A + B, while A + {x} may be simplified to A + x. The linear space of a cone C is defined as lin C := C ∩ (−C). We term the intersection of a subspace S with the unit sphere a ring R = S ∩ S^{n−1}.\nProposition 1. For a union of rings, namely X = (S1 ∪ . . . ∪ Sk) ∩ S^{n−1}, and for every x ∈ X,\n\nS(x) = span{x} + lin T(x),\n\nwhere S(x) is the convex hull of the union of all subspaces Si, i = 1, . . . , k, to which x belongs.\n\n1.1 Our contributions\n\nWe introduce a new paradigm for subspace clustering, conic subspace clustering (CSC), inspired by ideas from convex geometry. 
More specifically, we propose to consider the convex hull of normalized samples, and to exploit the structure of the tangent cone to this convex body at each sample to estimate the relationships between pairs of samples (to construct an affinity matrix for clustering).\nWe provide an algorithm which implements CSC (Section 2), along with deterministic guarantees on how to choose the single parameter in this algorithm, β, guaranteeing no false positives (Section 5) and any desired true positive rate (Section 4), in the range allowed by the provided samples. We specialize our results to random models, to showcase our guarantees in terms of the few parameters defining said random generative models and to compare with existing methods. Aside from statistical guarantees, we also provide different optimization programs for implementing our algorithm that can be used for faster computation and increased robustness (Section 7).\nIn Section 6, we elaborate on the true positive rate and spread for CSC and compare it to what is known about a sparsity-based subspace clustering approach, namely sparse subspace clustering, SSC [3]. This comparison provides insight into situations where methods such as SSC would face the so-called graph connectivity issue, demonstrating the advantage of CSC in such situations.\n\n2 Conic Subspace Clustering (CSC) via Rays: Intuition and Algorithm\n\nIn this section, we discuss an intuitive algorithm for subspace clustering under the proposed conic subspace clustering paradigm. We present the underlying idea without worrying about computational aspects, and relegate such discussions to Section 7. 
All proofs are presented in the Appendix. Henceforth, lower case letters represent vectors, while specific letters such as x and x′ are reserved to represent columns of X, and x is commonly used as the reference point.\nStart by considering Figure 1(a) and the point x ∈ R := (S1 ∪ · · · ∪ Sk) ∩ S^{n−1} from which all the rays are emanating. Moreover, define R_t := S_t ∩ S^{n−1} for t = 1, . . . , k, which gives R = R1 ∪ . . . ∪ R_k.\n\nFigure 1: Illustration of the idea behind our implementation of Conic Subspace Clustering (CSC) via rays: (a) x + cl cone(R − x); (b) x′ ∈ S(x); (c) x′ ∉ S(x). The union of the red and blue rings is R, and x is the point from which all the rays are emanating. The orange wedge represents x + W(x). 1(a) The union of the red and blue surfaces is x + cl cone(R − x). 1(b) When x′ and x are from the same subspace, the points d_β(x, x′) for different values of β ≥ 0 lie within cl cone(R − x), specifically, in the blue shaded cone associated with the blue ring. 1(c) When x′ and x are from different subspaces, the points d_β(x, x′) lie outside cl cone(R − x) for large enough values of β.\n\nOnly two subspaces are shown and the reference point x is in R1. The thin red and blue rays correspond to elements of x + cone(R − x) = x + cone({x′ − x : x′ ∈ R}), where cone(A) := {λy : y ∈ A, λ ≥ 0} (note that this is not the same as a conic hull). We leverage the geometry of this cone to determine subspace membership. Specifically, Figure 1(b) considers a point x′ ∈ R1 different from x. The dashed line segment represents points −sign(⟨x, x′⟩)βx′ for different values of β ≥ 0, where sign(0) can be arbitrarily chosen as ±1. The vectors emanating from x and reaching these points represent\n\nd_β(x, x′) := −sign(⟨x, x′⟩)βx′ − x.   (1)\n\nFor x, x′ ∈ R1, this illustration shows that d_β(x, x′) ∈ cl cone(R − x) for any β ≥ 0. In contrast, Figure 1(c) considers x′ ∈ R2, while x ∈ R1. In this case, there exists β > 0 such that d_β(x, x′) ∉ cl cone(R − x), indicating that x′ ∉ S(x). Formally,\nProposition 2. For any x, x′ ∈ R and any scalar value β ≥ 0,\n\nx′ ∈ S(x) ⟺ {d_β(x, x′) : β ≥ 0} ⊂ cl cone(R − x).   (2)\n\nEquivalently, x′ ∈ S(x) if and only if {β ∈ R : βx′ − x ∈ cl cone(R − x)} is unbounded.\nIn other words, we can test whether or not x′ ∈ S(x) by testing the cone membership for d_β(x, x′). Of course, such a test would not be practical: we cannot compute d_β(x, x′) for an infinite set of β values, the set cl cone(R − x) is generally non-convex (in Figure 1(a), the cone is the union of the red and blue surfaces), and cl cone(R − x) is not known exactly because we only observe a finite collection of points from R instead of all of R. We now develop an alternative test to (2) that addresses these challenges and can be computed within a convex optimization framework. We first address the convexity issue:\nProposition 3. For the closed convex cone W(x) := conv cl cone(R − x), and for any x, x′ ∈ R,\n\nx′ ∈ S(x) ⟹ {d_β(x, x′) : β ≥ 0} ⊂ W(x).   (3)\n\nIn other words, x′ ∈ S(x) implies that {β ∈ R : d_β(x, x′) ∈ W(x)} is unbounded.\nNext, we formulate the test as a convex optimization program, when a finite number of samples are given. Specifically, using the samples in X ⊂ R instead of all the points in R, we can define an approximation of W(x) as\n\nW_N(x) := {(X − x1_N^T)λ : λ ∈ R_+^N},   (4)\n\nwhich is the tangent cone (also known as the descent cone) at x with respect to the data hull conv(X). The implementation of CSC via rays, as sketched above and detailed below, is based on testing the membership of d_β(x, x′) in the tangent cone W_N(x) for all pairs of samples x, x′ to determine their affinity. More specifically, the cone membership test can be stated as a feasibility program, tagged as the Cone Representability (CR) program:\n\nmin_{λ∈R^N} 0 subject to d_β(x, x′) = (X − x1_N^T)λ, λ ≥ 0_N.   (CR)\n\nIf there exists a β ≥ 0 for which (CR) is infeasible, then we conclude x′ ∉ S(x). Later, in our theoretical results in Sections 4 and 5, we characterize a range (dependent on a target error rate) of possible values of β such that for any single β from this range, checking the feasibility of (CR) for all x, x′ reveals the true relationships within a target error rate. In Section 7, we discuss a number of variations of the above optimization program. While our upcoming guarantees are all concerned with the cone membership test itself and not the specific implementation, these variations provide better algorithmic options and are more robust to noise. 
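To make the membership question behind (CR) concrete, here is a minimal numerical sketch (not the paper's implementation, and `in_cone`/`d_beta` are hypothetical names): instead of an exact linear feasibility program, it minimizes ‖(X − x1ᵀ_N)λ − d_β(x, x′)‖² over λ ≥ 0 with projected gradient descent and declares membership when the residual is negligible.

```python
import numpy as np

def in_cone(A, b, iters=5000, tol=1e-6):
    """Approximate test of whether b lies in the cone {A @ lam : lam >= 0}.

    Minimizes ||A @ lam - b||^2 over lam >= 0 by projected gradient descent;
    b is declared a member when the residual is (near) zero. This is a
    stand-in for solving the (CR) linear feasibility program exactly.
    """
    lam = np.zeros(A.shape[1])
    step = 1.0 / np.linalg.norm(A, 2) ** 2  # 1/L step for the smooth part
    for _ in range(iters):
        lam = np.maximum(lam - step * (A.T @ (A @ lam - b)), 0.0)
    return np.linalg.norm(A @ lam - b) <= tol * max(1.0, np.linalg.norm(b))

def d_beta(x, xp, beta):
    """d_beta(x, x') = -sign(<x, x'>) * beta * x' - x, as in Eq. (1)."""
    s = np.sign(x @ xp) or 1.0  # sign(0) chosen arbitrarily as +1
    return -s * beta * xp - x

# Demo on a transparent example: the cone spanned by e1 and e2 inside R^3.
A = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]])
print(in_cone(A, np.array([1.0, 2.0, 0.0])))  # → True
print(in_cone(A, np.array([0.0, 0.0, 1.0])))  # → False
```

An exact implementation would instead solve (CR), or one of its variations from Section 7, with an LP solver; the iterative sketch above only approximates the in/out decision.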
Specifically, we choose to use a variation (in the box below) that is a bounded feasible linear program for our implementation of the cone membership test.\nWe refer to solving any of the variations of the cone membership test for an ordered pair of samples (x, x′) and a fixed value of β as CSC1(β, x, x′):\n\nCompute γ̂(x, x′) = min {γ : (1 − γ)(βx′ − x) = (X − x1_N^T)λ, γ ≥ 0, λ ≥ 0}.\nSet A(x, x′) ∈ {0, 1} by rounding 1 − γ̂(x, x′) to either 0 or 1, whichever is closest.\n\nWe refer to the optimization program used in the above as the Robust Cone Membership (RCM) program. Similarly, solving a collection of these tests for all samples x′ and a fixed x, or for all pairs x, x′, is referred to as CSC1(β, x) and CSC1(β), respectively. When CSC1(β) is followed by spectral clustering on the constructed affinity matrix, we refer to the whole process as CSC(β). It is worth mentioning that the linear program used in CSC1(β, x, x′) is equivalent to (CR) in a sense made clear in Section 7, and provides the same affinity matrix with a variety of algorithmic advantages, as discussed there.\n\n3 Theoretical Guarantees\n\nIn this section, we discuss our approach to providing theoretical guarantees for the aforementioned implementation of CSC via rays. Let us first set some conventions. We refer to a declaration x′ ∈ S(x) (or x′ ∉ S(x)) as positive (or negative), regardless of the ground truth. Hence, a true positive is an affinity of 1 when the samples are from the same subspace, and a false positive is an affinity of 1 when the samples are from different subspaces. We provide guarantees for CSC1(β) to give no false positives. This makes the affinity matrix a permuted block diagonal matrix. 
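For intuition about why a permuted block diagonal 0/1 affinity matrix suffices (a sketch for illustration, not part of the paper's pipeline): the number of near-zero eigenvalues of the graph Laplacian of such a matrix equals the number of diagonal blocks, which is the mechanism spectral clustering exploits.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [4, 5, 6]  # three clusters of different sizes
labels = np.repeat(np.arange(len(sizes)), sizes)
A = (labels[:, None] == labels[None, :]).astype(float)  # ideal 0/1 affinity
perm = rng.permutation(len(labels))
A = A[np.ix_(perm, perm)]  # a permutation hides the block structure

# Unnormalized graph Laplacian: its number of (near-)zero eigenvalues equals
# the number of connected components of the affinity graph, i.e. the blocks.
L = np.diag(A.sum(axis=1)) - A
n_clusters = int(np.sum(np.linalg.eigvalsh(L) < 1e-8))
print(n_clusters)  # → 3
```

The same count is recovered regardless of the permutation, since permuting rows and columns is a similarity transformation and preserves the spectrum.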
In this case, if there are enough well-spread ones in each row of the affinity matrix, spectral clustering or any other reasonable clustering algorithm will be able to perfectly recover the underlying grouping; see graph connectivity in the spectral clustering literature [17]. These two phenomena, no false positives and enough well-spread true positives per sample, are the focus of our theoretical results in Sections 4 and 5. In a nutshell, the guarantees boil down to characterizing a range of β's for which CSC has controlled degrees of error: no false positives and a certain true positive rate per row. We also examine the distribution of true positives recovered by our method and illustrate a favorable spread.\nThrough the intuition behind the cone membership test, namely (CR), it is easy to observe that the number of true positives and the number of false positives are monotonically non-increasing in β (which can be observed in Figure 2 as well). Hence, to obtain a high number of true positives we need an upper bound on β, and to obtain few false positives we need a lower bound on β.\nTo assess the strength of our deterministic results, we assume probabilistic models on the subspaces and/or samples and study the ranges of β for which CSC1(β) has controlled errors of both types, with high probability. For the random models, we take the number of subspaces to be fixed, namely k. However, CSC1(β) need not know the number of subspaces, and spectral clustering can use the gap in the eigenvalues of the Laplacian matrix to determine the number of clusters; e.g., see [15]. In the random-sample model, we assume k subspaces are given and samples from each subspace are drawn uniformly at random from the unit sphere on that subspace. 
In the random-subspace model, each subspace is chosen independently and uniformly at random with respect to the Haar measure.\n\n3.1 Examples\n\nIn this section, we illustrate the performance of the CSC method on some small examples. First, we examine the role of the parameter β in CSC1(β, x, x′) and its effect on the false positive and true positive rates in practice. In the first experiment, we have k = 5 subspaces, each with dimension d = 5, in an n = 10 dimensional space, and we draw 30 samples from each of the k 5-dimensional subspaces. We then run CSC1(β, x, x′) for a variety of values of β between one and six over 15 random trials. In Figures 2(b), 2(c), and 2(d), we show the results of each trial in thin lines and the means across trials in thick lines (Figure 2(c) shows the median). Figure 2(a) shows, for each value of β, the histogram of true positive rates across rows. Superimposed on this histogram plot are the empirical mean of the histogram (green solid line) and our theoretical bound from Theorem 6 (purple dashed curve corresponding to the purple dashed curve in Figure 2(b)): for each value of β, the true positive rate will be above this curve, with high probability.\nOur theoretical bounds correspond to sufficient but not necessary conditions. 
While we observe the tightness of the theory for the minimum per-row true positive rate in relation to β, the wide distribution of per-row true positive rates above the theoretical bound (Figures 2(a) and 2(b)), as well as the spectral clustering step, provide us with good error rates (Figure 2(c)) outside the range of β's for which we have guarantees (Figure 2(d)).\n\nFigure 2: Illustration of the role of β (horizontal axis) in determining (a) the histogram of true positive rates across rows, with the empirical mean (solid line) and the theoretical bound (dashed curve, corresponding to the dashed curve in (b)), (b) maximum, mean, and minimum (across rows of the affinity matrix) true positive rates along with the theoretical bound (dashed curve), (c) the clustering mismatch rate after performing spectral clustering, and (d) the false positive rate. This experiment is in a 10 dimensional space, with 30 random samples from each of 5 random 5-dimensional subspaces, over 15 random trials. Bold curves correspond to averages across trials in (a), (b), and (d), but to the median in (c).\n\nNext, we look at learned affinity matrices output by the proposed CSC method and SSC [3], which is a widely-used benchmark and the foundation of much current subspace clustering research. As described at length in Section D, the true positive rate of SSC is necessarily bounded because of the ℓ1 regularization used to learn the affinity matrix. 
This is not true of CSC; in fact, β can be used to control the true positive rate (in an admissible range) as long as it exceeds some lower bound (β ≥ β_L). The difference between the true positive rates of SSC and CSC is illustrated in Figure 3. In this experiment, CSC naturally outputs a 0/1 affinity matrix, while the affinity matrix of SSC has a broader diversity of values. We show this matrix and a thresholded version for comparison purposes, where the threshold is set to correspond to a 5% false positive rate.\n\nFigure 3: Affinity matrices for two toy models in an ambient dimension n = 12. (a-c) k = 3 subspaces, each of rank d = 4 and each with 3d = 12 samples. (a) Result of CSC1(β). (b) Result of SSC. (c) Thresholded version of (b). (d-f) k = 3 subspaces, each of rank d = 6 and each with 3d = 18 samples. (d) Result of CSC1(β). (e) Result of SSC. (f) Thresholded version of (e), with threshold set so that the false positive rate is 5%. As predicted by the theory, CSC achieves higher true positive rates than SSC can.\n\n4 Guarantees on True Positive Rates\n\nWe study conditions under which a fraction ρ ∈ (0, 1) of samples x′ ∈ S(x) are declared as such. As discussed before, the number of true positives is non-increasing in β. Therefore, we are interested in an upper bound β_{U,ρ} on β so that CSC1(β, x) for β ≤ β_{U,ρ} returns at least ρN_t true positives (N_t is the number of samples from S_t) for any x ∈ X^t := X ∩ S_t and t = 1, . . . , k. Consider {x}^⊥ := {y : ⟨x, y⟩ = 0}. For a closed convex set A containing the origin, denote by r(A) the radius of the largest Euclidean sphere in span(A) that is centered at the origin and is a subset of A.\nTheorem 4 (Deterministic condition for any true positive rate). The conic subspace clustering algorithm at x with parameter β, namely CSC1(β, x), returns a ratio ρ ∈ (0, 1) of relationships between x ∈ X^t and other samples as true positives, provided that β ≤ β^x_{U,ρ}, where\n\nβ^x_{U,ρ} := sin²(θ^x_t) / (cos(θ^x_t) − cos(θ̃^x_t)),   (5)\n\nin which, for m := ⌈ρ(N − 1)⌉, cos(θ̃^x_t) is the (m + 1)-st largest value among |⟨x, x′⟩| for x′ ∈ X^t, and, for r(·) denoting the inner radius, θ^x_t := arctan(r((x + W^t_N(x)) ∩ {x}^⊥)). Then, CSC1(β) is guaranteed to return a fraction ρ of true positives per sample provided that β ≤ β_{U,ρ} := min_{x∈X} β^x_{U,ρ}.\nAs can be seen from the above characterization, θ^x_t and β^x_{U,ρ} can vary from sample to sample even within the same subspace. When samples are drawn uniformly at random from a given subspace (the random-sample model), the next theorem provides a uniform lower bound on the inner radius and θ^x_t for all such samples. Note that β^x_{U,ρ} is non-decreasing in θ^x_t and non-increasing in θ̃^x_t.\nTheorem 5. Under a random-sample model, and for a choice p_t ∈ (0, 1), with probability at least 1 − p_t, a solution θ to\n\n(cos θ)^{d_t−1} / (6 √(d_t) sin θ) = log(N_t/p_t) / N_t\n\nis a lower bound on θ^x_t, which is defined in Theorem 4 and is a function of the inradius of a base of the t-th cone W^t_N(x).\nTheorem 5 is proved in the Appendix using ideas from inversive geometry [1]. In a random-sample model, we can quantify the aforementioned m-th order statistic. Therefore, we can explicitly compute the upper bound β_{U,ρ} (with high probability) in terms of the quantities d_t and N_t. The final result is given in Theorem 6. 
Note that both the inradius and the m-th order statistic are random variables defined through the samples, hence are dependent. Therefore, a union bound is used.\nTheorem 6. Under a random-sample model, CSC1(β, x) for any x ∈ X^t yields a fraction ρ of true positives with high probability, provided that β ≤ β^x_{u,ρ}, where β^x_{u,ρ} is computed similarly to (5), using the lower bound on θ^x_t from Theorem 5 and θ̃^x_t = (π/2)(m/(N + Δ)). The probability is at least I(m/(N + Δ); m, N − m) − p_t, where I(·; ·, ·) denotes the incomplete Beta function.\n\n5 Guarantees for Zero False Positives\n\nIn this section, we provide guarantees for CSC1(β, x) to return no false positives, in terms of the value of β. Specifically, we guarantee this by examining a lower bound β_L for β in CSC1(β, x). For a fixed column x of the data matrix X, we will use x′ as a pointer to any other column of X. Recall d_β(x, x′) from (1) and consider\n\nβ_L(x) := inf {β ≥ 0 : d_β(x, x′) ∉ W_N(x) ∀x′ ∉ S(x)} = sup {β ≥ 0 : d_β(x, x′) ∈ W_N(x) for some x′ ∉ S(x)}.   (6)\n\nIf the above value is finite, then using any value even slightly larger than it would declare any x′ ∉ S(x) correctly as such, hence no false positives. However, the above supremum may not be finite for a general configuration. In other words, there might be a sample x′ ∉ S(x) for which d_β(x, x′) ∈ W_N(x) for all values of β ≥ 0. The following condition prohibits such a situation.\nTheorem 7 (Deterministic condition for zero false positives). For x ∈ X, without loss of generality, suppose S(x) = S1 ∪ . . . ∪ Sj for some j < k. Provided that all of the columns of X that are not in S(x) are also not in W_N(x), then β_L(x) in (6) is finite. This condition is equivalent to S_t ∩ W_N(x) = {0} for all t = j + 1, . . . , k and all x ∈ X \ S_t. In case this condition holds for all x ∈ X, we define\n\nβ_L := max_{x∈X} β_L(x).   (7)\n\nIf this condition is met and β ≥ β_L is used, then CSC1(β) will return no false positives.\nWe note that the condition of Theorem 7 becomes harder to satisfy as the number of samples grows (which makes W_N(x) larger). While this is certainly not desired, such an artifact is present in other subspace clustering algorithms. See the discussion after Theorem 1 in [11] for examples.\nNext, we specialize Theorem 7 to a random-subspace model. Under such a model, for t = j + 1, . . . , k, S_t and W_N(x) are two random objects and are dependent (all samples, including those from S_t, take part in forming W_N(x), hence the orientation and the dimension of S_t affect the definition of W_N(x)), which makes the analysis harder. However, these two can be decoupled by massaging the condition of Theorem 7 from S_t ∩ W_N(x) = {0} into an equivalent condition S_t ∩ W^{−t}_N(x) = {0}, where W^{−t}_N(x) = conv cone{x′ − x : x′ ∉ S_t}; see Lemma 10 in the Appendix. Next, the event of a random subspace and a cone having trivial intersection can be studied using the notion of the statistical dimension of the cone and the brilliant Gordon's Lemma (escape through a mesh) [6]. The statistical dimension of a closed convex cone C ⊂ R^n is defined as δ(C) := E sup_{y∈C∩S^{n−1}} ⟨y, g⟩² ≤ n, where g ∼ N(0, I_n). Now, we can state the following lemma based on Gordon's Lemma.\nLemma 8. 
With the notation in Theorem 7, and under the random-subspace model, β_L is finite provided that δ(W^{−t}_N(x)) + dim(S_t) ≤ n for t = 1, . . . , k.\nFurthermore, for the above to hold, it is sufficient to have Σ_{t=1}^k dim(S_t) < n (Lemma 11 in the Appendix). Under the above conditions, we are guaranteed that a finite β exists such that, with high probability, CSC1(β) results in zero false positives. It is easy to compute β_L for certain configurations of subspaces. For example, when the subspaces are independent (the dimension of their Minkowski sum is equal to the sum of their dimensions) we have β_L = 1 (Lemma 13 in the Appendix). Independent subspaces have been assumed before in the subspace clustering literature for providing guarantees; e.g., [2, 3]. Also see [21] and references therein. However, it remains an open question how one should compute this value for more general configurations. We provide some theoretical tools for such computation in Appendix B.5.\nFinally, if β_L does not exceed β_{U,ρ} from above, then CSC1(β) successfully returns a (permuted) block diagonal matrix with a density of ones (per row) of at least ρ. This allows us to have a good idea about the performance of the post-processing step (e.g., spectral clustering) and hence CSC(β).\n\n6 True Positives' Rate and Distribution\n\nBecause sparse subspace clustering (SSC) relies upon sparse representations, the number of true positives is inherently limited. In fact, it can be shown that SSC will find a representation of each column x as a weighted sum of columns that correspond to the extreme rays of W_N(x), as shown in Lemma 17 in the Appendix. This phenomenon is closely linked to the graph connectivity issues associated with SSC, mentioned before. 
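As a brief aside on the statistical dimension δ(C) used in Lemma 8: it can be estimated numerically. The sketch below (an illustration, not from the paper) uses the standard equivalent form δ(C) = E‖Π_C(g)‖² for a closed convex cone, where Π_C is the Euclidean projection; for the nonnegative orthant R^n_+, the projection clips coordinates at zero and δ = n/2.

```python
import numpy as np

rng = np.random.default_rng(0)
n, trials = 10, 20000

# delta(C) = E || proj_C(g) ||^2 for a closed convex cone C, g ~ N(0, I_n).
# For C = R^n_+ the projection is coordinate-wise clipping at zero, and each
# coordinate contributes E[g_i^2 ; g_i > 0] = 1/2, so delta(R^n_+) = n/2.
g = rng.standard_normal((trials, n))
delta_est = np.mean(np.sum(np.maximum(g, 0.0) ** 2, axis=1))
# delta_est is within Monte Carlo error of n/2 = 5.0
```

The same recipe applies to any cone whose projection is computable, which is how one might probe the condition δ(W^{−t}_N(x)) + dim(S_t) ≤ n empirically.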
In particular, under a random-sample model, the true positive rate for SSC will go to zero as N_t/d_t grows, where N_t is the number of samples from S_t with dim(S_t) = d_t. In contrast, the true positive behavior for CSC has several favorable characteristics. First, if the subspaces are all independent, then the true positive rate ρ can approach one. Second, in unfavorable settings in which the true positive rate is low, it can be shown that the true positives are distributed in such a way that precludes graph connectivity issues (see Section D.3 for more details). Specifically, in the random-sample model, for each subspace S_t, there is a matrix A_sub, defined below, whose support is contained within the true positive support of the output of CSC(β) for β ∈ (β_L, β_{U,ρ}). Let X_t have i.i.d. standard normal entries, and let ϵ be the m-th largest element of |X_t^T X_t|. Then, A_sub is defined by (A_sub)_{i,j} = |X_t^T X_t|_{i,j} when |X_t^T X_t|_{i,j} > ϵ, and zero otherwise. The distribution of A_sub when the columns of X_t are drawn uniformly at random from the unit sphere ensures that graph connectivity issues are avoided with high probability as soon as the true positive rate ρ exceeds O(log N_t/N_t). As a result, even if the values of ρ which provide β_{U,ρ} > β_L are small, there is still the potential of perfect clustering. These distributional arguments cannot be made for sparsity-based methods like SSC. We refer to Appendix D for more details.\n\n7 CSC Optimization and Variations\n\nIn Table 1, we provide a number of optimization programs that implement the cone membership test. These formulations possess different computational and robustness properties. Let us introduce a notation of equivalence. 
We say an optimization program P, implementing the cone membership test, is in CR-class if the possible set of its optimal values can be divided into two disjoint sets O_in and O_out corresponding to whether d_β(x, x′) ∈ W_N(x) or d_β(x, x′) ∉ W_N(x), respectively. Then we write [[P : O_in, O_out]]; e.g., [[(CR) : 0, infeasible]]. All of the problems in Table 1 are in CR-class.

Table 1: Different formulations for the cone membership test (second column) with their sets of output values when d_β(x, x′) ∈ W_N(x) (third column) and when d_β(x, x′) ∉ W_N(x) (fourth column). In all of the variations, A = X − x 1_N^T, y and b = d_β(x, x′) live in R^n, and λ lives in R^N.

Tag   Formulation                                               O_in   O_out
P1    min_λ 0       s.t.  b = Aλ ,  λ ≥ 0                       {0}    infeasible
P2    min_y ⟨y, b⟩  s.t.  y^T A ≥ 0                             {0}    unbounded
P3    min_y ⟨y, b⟩  s.t.  y^T A ≥ 0 ,  ⟨y, b⟩ ≥ −ε              {0}    {−ε}
P4    min_{γ,λ} γ   s.t.  (1 − γ)b = Aλ ,  γ ≥ 0 ,  λ ≥ 0       {0}    {1}

The first optimization problem (P1) is merely the statement of the cone membership test as a linear feasibility program, and (P2) is its Lagrangian dual. (P2) looks for a certificate y ∈ W_N^⋆(x) (in the dual cone) that rejects the membership of d_β(x, x′) in W_N(x). However, neither (P1) nor (P2) is robust or computationally appealing. Next, observe that restricting y to any set with the origin in its relative interior yields a program that is in CR-class. (P3) is defined by augmenting (P2) with a linear constraint, which not only makes the problem bounded and feasible, but the freedom in choosing ε also allows for controlling the bit-length of the optimal solution and hence for optimizing the computational complexity of solving (P3) via interior point methods.
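For concreteness, the last row of Table 1 can be prototyped with an off-the-shelf LP solver. The sketch below is our own illustration, not code from the paper; it implements (P4) with scipy.optimize.linprog by rewriting the constraint (1 − γ)b = Aλ as γb + Aλ = b. Per the table, the optimal value is 0 when b lies in the cone generated by the columns of A, and 1 otherwise.

```python
# Sketch of the cone membership test via formulation (P4) from Table 1:
#   min_{gamma, lambda} gamma  s.t. (1 - gamma) b = A lambda, gamma >= 0, lambda >= 0.
import numpy as np
from scipy.optimize import linprog

def cone_membership_p4(A, b):
    """Optimal value of (P4): 0.0 if b is in cone(A), 1.0 otherwise."""
    n, N = A.shape
    # Decision variable z = [gamma, lambda]; the equality constraint
    # (1 - gamma) b = A lambda becomes gamma * b + A lambda = b.
    c = np.zeros(N + 1)
    c[0] = 1.0                                  # minimize gamma
    A_eq = np.hstack([b.reshape(-1, 1), A])
    res = linprog(c, A_eq=A_eq, b_eq=b,
                  bounds=[(0, None)] * (N + 1), method="highs")
    return res.fun

A = np.array([[1.0, 0.0], [0.0, 1.0]])          # cone(A) = nonnegative orthant
print(cone_membership_p4(A, np.array([2.0, 3.0])))   # inside the cone: 0.0
print(cone_membership_p4(A, np.array([-1.0, 1.0])))  # outside the cone: 1.0
```

On this toy cone (the nonnegative orthant in R^2), the returned values match O_in = {0} and O_out = {1}; note that (γ, λ) = (1, 0_N) is always feasible, so the solver never reports infeasibility.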
Furthermore, this program can be solved approximately, up to a precision ε′ ∈ (0, ε), and provides the same desired set of results: an ε′-inexact solution for (P3) has a nonnegative objective value if and only if d_β(x, x′) ∈ W_N(x). If we dualize (P3) and divide the objective by −ε we get (P4), which can also be derived by hand-tweaking (P1). However, the duality relationship with (P3) is helpful in understanding the dual space and devising efficient optimization algorithms. Notice that (γ, λ) = (1, 0_N) is always feasible, and the optimal value is in [0, 1]. The latter property makes (P4) a suitable candidate for noisy setups without modification. Moreover, [[(P4) : 0, 1]], which makes it a desirable candidate as a proxy for x′ ∉ S(x). We use this program in our experiments reported in Section 3.1.

8 Discussions and Future Directions

This paper describes a new paradigm for understanding subspace clustering in relation to the underlying conic geometry. With this new perspective, we design an algorithm, CSC via rays, with guarantees on false and true positive rates and spreads, that sidesteps the graph connectivity issues that arise with methods like sparse subspace clustering. This paper should be seen as the first introduction to the idea of, and tools for, conic subspace clustering, rather than establishing CSC as the new state-of-the-art, and as a means to ignite future work in several directions in subspace clustering. We focus on our novel geometric perspective and its potential to lead to new algorithms by providing a rigorous theoretical understanding (statistical and computational), and we hope that this publication will spur discussions and insights that can inform the suggested future work. A cone membership test is just one approach to exploiting this geometry and implementing the conic subspace clustering paradigm.

Remaining Questions.
While more extensive theoretical comparisons with existing methods are necessary, many comparisons are non-trivial because CSC reveals important properties of subspace clustering methods (e.g., the spread of true positives) that are not understood for other methods. The limited small-scale experiments were simply intended to illustrate these properties.
Our study of the parameter choice is theoretical in nature and goes beyond heuristics for implementation, but some questions remain open. First, while we have a clear deterministic characterization for β_U, tighter characterizations would lead to a larger range for β. In Figure 2(b), such a pursuit would result in a new theoretical curve (instead of the current dashed purple curve) that stays closer to the minimum true positive rate across rows (the lowest thick solid curve). On the other hand, outside of the case of independent subspaces, where β_L = 1, we only have a deterministic guarantee on the finiteness of β_L, and computing it for the random-sample model is a topic of current research. Therefore, we do not have a guarantee on the non-triviality of the resulting range (β_L, β_U). However, as observed in the small numerical examples in Section 3.1, as well as in our more extensive experiments that are not reported here, there often exists a wide range of β with which we can get perfect clustering.

Extensions. While the presented algorithm assumes noiseless data points from the underlying subspaces, our intuition and simulations (synthetic and real data) indicate stability to stochastic noise. Moreover, the current analysis is suggestive of algorithmic variants that exhibit robust empirical performance in the presence of stochastic noise. Hence, similar to advances in other subspace clustering methods, we hope that the analysis for the noiseless setup provides essential insights to provably generalize the method to noisy settings.
Furthermore, there remain several other open avenues for exploration, particularly with respect to theoretical and large-scale empirical comparisons with other methods, and extensions to measurements corrupted by adversarial perturbations, with outliers among the data points, as well as with missing entries in the data points. By design, SSC and other similar methods require full knowledge of the data points. CSC imposes the same requirement, and an open question is how to extend the CSC framework when some entries are missing from the data points.

References

[1] David E. Blair. Inversion theory and conformal mapping, volume 9 of Student Mathematical Library. American Mathematical Society, Providence, RI, 2000.

[2] João Paulo Costeira and Takeo Kanade. A multibody factorization method for independently moving objects. Int. J. Comput. Vis., 29(3):159–179, 1998.

[3] Ehsan Elhamifar and René Vidal. Sparse subspace clustering. In 2009 IEEE Conference on Computer Vision and Pattern Recognition, pages 2790–2797. IEEE, 2009.

[4] Ehsan Elhamifar and René Vidal. Clustering disjoint subspaces via sparse representation. In 2010 IEEE International Conference on Acoustics, Speech and Signal Processing, pages 1926–1929. IEEE, 2010.

[5] Ehsan Elhamifar and René Vidal. Sparse subspace clustering: Algorithm, theory, and applications. IEEE Trans. Pattern Anal. Mach. Intell., 35(11):2765–2781, 2013.

[6] Yehoram Gordon. On Milman's inequality and random subspaces which escape through a mesh in R^n. In Geometric aspects of functional analysis, volume 1317 of Lecture Notes in Math., pages 84–106. Springer, 1988.

[7] Reinhard Heckel and Helmut Bölcskei. Robust subspace clustering via thresholding. IEEE Trans. Inf. Theory, 61(11):6320–6342, 2015.

[8] Can-Yi Lu, Hai Min, Zhong-Qiu Zhao, Lin Zhu, De-Shuang Huang, and Shuicheng Yan. Robust and efficient subspace segmentation via least squares regression. In European Conference on Computer Vision, pages 347–360. Springer, 2012.

[9] Jean-Jacques Moreau. Décomposition orthogonale d'un espace hilbertien selon deux cônes mutuellement polaires. C. R. Acad. Sci. Paris, 255:238–240, 1962.

[10] Behrooz Nasihatkon and Richard Hartley. Graph connectivity in sparse subspace clustering. In Computer Vision and Pattern Recognition (CVPR), 2011 IEEE Conference on, pages 2137–2144. IEEE, 2011.

[11] Dohyung Park, Constantine Caramanis, and Sujay Sanghavi. Greedy subspace clustering. In Advances in Neural Information Processing Systems, pages 2753–2761, 2014.

[12] James Renegar. "Efficient" subgradient methods for general convex optimization. SIAM J. Optim., 26(4):2649–2676, 2016.

[13] Fritz Scholz. Confidence bounds and intervals for parameters relating to the binomial, negative binomial, Poisson and hypergeometric distributions with applications to rare events. 2008.

[14] Mahdi Soltanolkotabi and Emmanuel J. Candès. A geometric analysis of subspace clustering with outliers. Ann. Statist., 40(4):2195–2238, 2012.

[15] Mahdi Soltanolkotabi, Ehsan Elhamifar, and Emmanuel J. Candès. Robust subspace clustering. Ann. Statist., 42(2):669–699, 2014.

[16] René Vidal. Subspace clustering. IEEE Signal Process. Mag., 28(2):52–68, 2011.

[17] Ulrike von Luxburg. A tutorial on spectral clustering. Statistics and Computing, 17(4):395–416, 2007.

[18] Yining Wang, Yu-Xiang Wang, and Aarti Singh. Graph connectivity in noisy sparse subspace clustering. In Artificial Intelligence and Statistics, pages 538–546, 2016.

[19] Yu-Xiang Wang and Huan Xu. Noisy sparse subspace clustering. J. Mach. Learn. Res., 17(12):1–41, 2016.

[20] Yu-Xiang Wang, Huan Xu, and Chenlei Leng. Provable subspace clustering: When LRR meets SSC. In Advances in Neural Information Processing Systems, pages 64–72, 2013.

[21] Jingyu Yan and Marc Pollefeys. A general framework for motion segmentation: Independent, articulated, rigid, non-rigid, degenerate and non-degenerate. In European Conference on Computer Vision, pages 94–106. Springer, 2006.