{"title": "Understanding Regularized Spectral Clustering via Graph Conductance", "book": "Advances in Neural Information Processing Systems", "page_first": 10631, "page_last": 10640, "abstract": "This paper uses the relationship between graph conductance and spectral clustering to study (i) the failures of spectral clustering and (ii) the benefits of regularization.  The explanation is simple.  Sparse and stochastic graphs create several ``dangling sets'', or small trees that are connected to the core of the graph by only one edge.  Graph conductance is sensitive to these noisy dangling sets and spectral clustering inherits this sensitivity.  The second part of the paper starts from a previously proposed form of regularized spectral clustering and shows that it is related to the graph conductance on a ``regularized graph''.  When graph conductance is computed on the regularized graph, we call it CoreCut.  Based upon previous arguments that relate graph conductance to spectral clustering (e.g. Cheeger inequality), minimizing CoreCut relaxes to regularized spectral clustering.  Simple inspection of CoreCut reveals why it is less sensitive to dangling sets.   Together, these results show that unbalanced partitions from spectral clustering can be understood as overfitting to noise in the periphery of a sparse and stochastic graph.  Regularization fixes this overfitting.  In addition to this statistical benefit, these results also demonstrate how regularization can improve the computational speed of spectral clustering.  
We provide  simulations and data examples to illustrate these results.", "full_text": "Understanding Regularized Spectral Clustering via\n\nGraph Conductance\n\nYilin Zhang\n\nDepartment of Statistics\n\nUniversity of Wisconsin-Madison\n\nMadison, WI 53706\n\nyilin.zhang@wisc.edu\n\nKarl Rohe\n\nDepartment of Statistics\n\nUniversity of Wisconsin-Madison\n\nMadison, WI 53706\n\nkarl.rohe@wisc.edu\n\nAbstract\n\nThis paper uses the relationship between graph conductance and spectral clustering\nto study (i) the failures of spectral clustering and (ii) the bene\ufb01ts of regularization.\nThe explanation is simple. Sparse and stochastic graphs create several \u201cdangling\nsets\u201d, or small trees that are connected to the core of the graph by only one edge.\nGraph conductance is sensitive to these noisy dangling sets and spectral clustering\ninherits this sensitivity. The second part of the paper starts from a previously pro-\nposed form of regularized spectral clustering and shows that it is related to the graph\nconductance on a \u201cregularized graph\u201d. When graph conductance is computed on\nthe regularized graph, we call it CoreCut. Based upon previous arguments that re-\nlate graph conductance to spectral clustering (e.g. Cheeger inequality), minimizing\nCoreCut relaxes to regularized spectral clustering. Simple inspection of CoreCut\nreveals why it is less sensitive to dangling sets. Together, these results show that\nunbalanced partitions from spectral clustering can be understood as over\ufb01tting to\nnoise in the periphery of a sparse and stochastic graph. Regularization \ufb01xes this\nover\ufb01tting. In addition to this statistical bene\ufb01t, these results also demonstrate\nhow regularization can improve the computational speed of spectral clustering. 
We\nprovide simulations and data examples to illustrate these results.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n1 Introduction\n\nSpectral clustering partitions the nodes of a graph into groups based upon the eigenvectors of the\ngraph Laplacian [19, 20]. Despite the claims of spectral clustering being \u201cpopular\u201d, in applied\nresearch using graph data, spectral clustering (without regularization) often returns a partition of the\nnodes that is uninteresting, typically finding a large cluster that contains most of the data and many\nsmaller clusters, each with only a few nodes. These applications involve brain graphs [2] and social\nnetworks from Facebook [21] and Twitter [22]. One key motivation for spectral clustering is that\nit relaxes a discrete optimization problem of minimizing graph conductance. Previous research has\nshown that across a wide range of social and information networks, the clusters with the smallest\ngraph conductance are often rather small [15]. Figure 1 illustrates the leading singular vectors on\na communication network from Facebook during the 2012 French presidential election [21]. The\nsingular vectors localize on a few nodes, which leads to a highly unbalanced partition.\n[1] proposed regularized spectral clustering which adds a weak edge on every pair of nodes with\nedge weight \u03c4/N, where N is the number of nodes in the network and \u03c4 is a tuning parameter. [5]\nproposed a related technique. Figure 1 illustrates how regularization changes the leading singular\nvectors in the Facebook example. The singular vectors are more spread across nodes.\nMany empirical networks have a core-periphery structure, where nodes in the core of the graph\nare more densely connected and nodes in the periphery are sparsely connected [3]. In Figure 1,\nregularized spectral clustering leads to a \u201cdeeper cut\u201d into the core of the graph. 
In this application,\nregularization helps spectral clustering provide a more balanced partition, revealing a more salient\npolitical division.\n\nFigure 1: This figure shows the leading singular vectors of the communication network. In the left\npanel, the singular vectors from vanilla spectral clustering are localized on a few nodes. In the right\npanel, the singular vectors from regularized spectral clustering provide a more balanced partition.\n\nPrevious research has studied how regularization improves the spectral convergence of the graph\nLaplacian [17, 9, 11]. This paper aims to provide an alternative interpretation of regularization by\nrelating it to graph conductance. We call spectral clustering without regularization Vanilla-SC and\nwith edge-wise regularization Regularized-SC [1].\nThis paper demonstrates (1) what makes Vanilla-SC fail and (2) how Regularized-SC fixes that\nproblem. One key motivation for Vanilla-SC is that it relaxes a discrete optimization problem of\nminimizing graph conductance [7]. Yet, this graph conductance problem is fragile to small cuts in the\ngraph. The fundamental fragility of graph conductance that is studied in this paper comes from the\ntype of subgraph illustrated in Figure 2 and defined here.\n\nDefinition 1.1. In an unweighted graph G = (V, E), subset S \u2282 V is g-dangling if and only if the\nfollowing conditions hold.\n\n- S contains exactly g nodes.\n- There are exactly g \u2212 1 edges within S and they do not form any cycles (i.e. the node induced\nsubgraph from S is a tree).\n- There is exactly one edge between nodes in S and nodes in S^c.\n\nFigure 2: 6-dangling set.\n\nThe argument in this paper is structured as follows:\n\n1) A g-dangling set has a small graph conductance, approximately (2g)^{\u22121} (Section 3.1).\n2) For any fixed g, graphs sampled from a sparse inhomogeneous model with N nodes have\n\u0398(N) g-dangling sets in expectation (Theorem 3.4). 
As such, g-dangling sets are created as\nan artifact of the sparse and stochastic noise.\n3) This creates \u0398(N) eigenvalues of the normalized graph Laplacian that have an average\nvalue less than (g \u2212 1)^{\u22121} (Theorem 3.5) and reveal only noise. These small eigenvalues are\nso numerous that they conceal good cuts to the core of the graph.\n4) \u0398(N) eigenvalues smaller than 1/g also make the eigengap exceptionally small. This slows\ndown the numerical convergence for computing the eigenvectors and eigenvalues.\n5) CoreCut, which is graph conductance on the regularized graph, does not assign a small value\nto small sets of nodes. This prevents all of the statistical and computational consequences\nlisted above for g-dangling sets and any other small noisy subgraphs that have a small\nconductance. Regularized-SC inherits the advantages of CoreCut.\n\nThe penultimate section evaluates the overfitting of spectral clustering in an experiment with several\nempirical graphs from SNAP [14]. This experiment randomly divides the edges into a training set\nand a test set, runs spectral clustering using the training edges and, with the resulting partition,\ncompares \u201ctraining edge conductance\u201d to \u201ctesting edge conductance.\u201d This shows that Vanilla-SC\noverfits and Regularized-SC does not. Moreover, Vanilla-SC tends to identify highly unbalanced\npartitions, while Regularized-SC provides a balanced partition.\nThe paper concludes with a discussion which illustrates how these results might help inform the\nconstruction of neural architectures for a generalization of Convolutional Neural Networks to cases\nwhere the input data has an estimated dependence structure that is represented as a graph [12, 4, 10,\n16].\n\n2 Notation\nGraph notation The graph or network G = (V, E) consists of node set V = {1, . . . , N} and edge\nset E = {(i, j) : i and j connect with each other}. 
For a weighted graph, the edge weight w_ij can\ntake any non-negative value for (i, j) \u2208 E and define w_ij = 0 if (i, j) \u2209 E. For an unweighted graph,\ndefine the edge weight w_ij = 1 if (i, j) \u2208 E and w_ij = 0 otherwise. For each node i, we denote its\ndegree as d_i = \u2211_j w_ij. Given S \u2282 V, the node induced subgraph of S in G is a graph with vertex set\nS that includes every edge whose end points are both in S, i.e. its edge set is {(i, j) \u2208 E : i, j \u2208 S}.\nGraph cut notation For any subset S \u2282 V, we denote |S| = number of nodes in S, and its volume\nin graph G as vol(S, G) = \u2211_{i\u2208S} d_i. Note that any non-empty subset S \u228a V forms a partition of V\nwith its complement S^c. We denote the cut for such a partition on graph G as\n\ncut(S, G) = \u2211_{i\u2208S, j\u2208S^c} w_ij,\n\nand denote the graph conductance of any subset S \u2282 V with vol(S, G) \u2264 vol(S^c, G) as\n\n\u03c6(S, G) = cut(S, G) / vol(S, G).\n\nWithout loss of generality, we focus on non-empty subsets S \u228a V with vol(S, G) \u2264 vol(S^c, G).\nNotation for Vanilla-SC and Regularized-SC We denote the adjacency matrix A \u2208 R^{N\u00d7N}\nwith A_ij = w_ij, and the degree matrix D \u2208 R^{N\u00d7N} with D_ii = d_i and D_ij = 0 for i \u2260 j. The\nnormalized graph Laplacian matrix is\n\nL = I \u2212 D^{\u22121/2} A D^{\u22121/2},\n\nwith eigenvalues 0 = \u03bb_1 \u2264 \u03bb_2 \u2264 . . . \u2264 \u03bb_N \u2264 2 (here and elsewhere, \u201cleading\u201d refers to the smallest\neigenvalues). Let v_1, . . . , v_N : V \u2192 R represent the eigenvectors/eigenfunctions for L corresponding\nto eigenvalues \u03bb_1, . . . , \u03bb_N.\nThere is a broad class of spectral clustering algorithms which represent each node i in R^K with\n(v_1(i), . . . , v_K(i)) and cluster the nodes by clustering their representations in R^K with some algorithm.\nFor simplicity, this paper focuses on the setting of K = 2 and only uses v_2. 
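
As a rough illustration of the notation above, the following Python sketch computes the cut, volume and conductance of a set on a small toy graph and approximates the second eigenvalue and eigenvector of L. The toy graph (two triangles joined by one edge), the helper names and the use of power iteration are assumptions made for this example only; they are not taken from the paper.

```python
import math

def degrees(W):
    # d_i = sum_j w_ij
    return [sum(row) for row in W]

def conductance(W, S):
    # phi(S, G) = cut(S, G) / vol(S, G), evaluated on the side with the smaller volume
    n, S = len(W), set(S)
    d = degrees(W)
    cut = sum(W[i][j] for i in S for j in range(n) if j not in S)
    vol_S = sum(d[i] for i in S)
    return cut / min(vol_S, sum(d) - vol_S)

def second_eigenpair(W, iters=500):
    # Power iteration on M = (I + D^{-1/2} W D^{-1/2}) / 2, whose eigenvalues are
    # 1 - lambda_k / 2. The known eigenvector D^{1/2} 1 (lambda_1 = 0) is projected
    # out at every step, so the iteration converges to the second eigenvector of L.
    n = len(W)
    s = [math.sqrt(x) for x in degrees(W)]
    norm_s = math.sqrt(sum(x * x for x in s))
    u1 = [x / norm_s for x in s]

    def norm_adj(v):
        # Multiply by D^{-1/2} W D^{-1/2}
        return [sum(W[i][j] * v[j] / (s[i] * s[j]) for j in range(n)) for i in range(n)]

    v = [math.sin(i + 1.0) for i in range(n)]  # arbitrary deterministic start
    for _ in range(iters):
        dot = sum(a * b for a, b in zip(v, u1))
        v = [a - dot * b for a, b in zip(v, u1)]
        w = [0.5 * (a + b) for a, b in zip(v, norm_adj(v))]
        nrm = math.sqrt(sum(x * x for x in w))
        v = [x / nrm for x in w]
    lam2 = 1.0 - sum(a * b for a, b in zip(v, norm_adj(v)))  # Rayleigh quotient of L
    return lam2, v

# Hypothetical toy graph: triangles {0, 1, 2} and {3, 4, 5} joined by the edge (2, 3).
edges = [(0, 1), (1, 2), (0, 2), (3, 4), (4, 5), (3, 5), (2, 3)]
W = [[0.0] * 6 for _ in range(6)]
for i, j in edges:
    W[i][j] = W[j][i] = 1.0

phi = conductance(W, [0, 1, 2])   # cut = 1, vol = 2 + 2 + 3 = 7
lam2, v2 = second_eigenpair(W)
print(phi, lam2, [round(x, 3) for x in v2])
```

In this toy graph the sign pattern of the second eigenvector separates the two triangles, which is the partition with conductance 1/7. For large sparse graphs a sparse eigensolver would be the natural choice; plain power iteration is used here only to keep the sketch free of dependencies.
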
We refer to the algorithm that returns the set S_i solving\n\nmin_i \u03c6(S_i, G), where S_i = {j : v_2(j) \u2265 v_2(i)},     (2.1)\n\nas Vanilla-SC. This construction of a partition appears both in [19] and in the proof of the Cheeger inequality [6, 7],\nwhich says that\n\nCheeger inequality: h_G^2 / 2 \u2264 \u03bb_2 \u2264 2h_G, where h_G = min_S \u03c6(S, G).\n\nEdge-wise regularization [1] adds \u03c4/N to every element of the adjacency matrix, where \u03c4 > 0 is\na tuning parameter. It replaces A by the matrix A_\u03c4 \u2208 R^{N\u00d7N}, where [A_\u03c4]_ij = A_ij + \u03c4/N, and the\nnode degree matrix D by D_\u03c4, which is computed with the row sums of A_\u03c4 (instead of the row\nsums of A) to get [D_\u03c4]_ii = D_ii + \u03c4. We define G_\u03c4 to be the weighted graph with adjacency matrix\nA_\u03c4 as defined above. Regularized-SC partitions the graph using the K leading eigenvectors of\nL_\u03c4 = I \u2212 D_\u03c4^{\u22121/2} A_\u03c4 D_\u03c4^{\u22121/2}, which we represent by v_1^\u03c4, . . . , v_K^\u03c4 : V \u2192 R. Similarly, we only use v_2^\u03c4\nwhen K = 2. We refer to the algorithm that returns the set S_i solving\n\nmin_i \u03c6(S_i, G_\u03c4), where S_i = {j : v_2^\u03c4(j) \u2265 v_2^\u03c4(i)},\n\nas Regularized-SC.\n\n3 Vanilla-SC and the periphery of sparse and stochastic graphs\n\nFor notational simplicity, this section only considers unweighted graphs.\n\n3.1 Dangling sets have small graph conductance.\n\nThe following fact follows from the definition of a g-dangling set: the cut is the single edge leaving S,\nand vol(S, G) = 2(g \u2212 1) + 1 = 2g \u2212 1.\nFact 3.1. If S is a g-dangling set, then its graph conductance is \u03c6(S) = (2g \u2212 1)^{\u22121}.\nTo interpret the scale of this graph conductance, imagine that a graph is generated from a Stochastic\nBlockmodel with two equal-size blocks, where any two nodes from the same block connect with\nprobability p and two nodes from different blocks connect with probability q [8]. 
Then, the graph\nconductance of one of the blocks is q/(p + q) (up to random fluctuations). If there is a g-dangling set\nwith g > p/(2q) + 1, then the g-dangling set will have a smaller graph conductance than the block,\nsince (2g \u2212 1)^{\u22121} < q/(p + q) rearranges to g > p/(2q) + 1.\n\n3.2 There are many dangling sets in sparse and stochastic social networks.\n\nWe consider random graphs sampled from the following model which generalizes Stochastic\nBlockmodels. Its key assumption is that edges are independent.\nDefinition 3.2. A graph is generated from an inhomogeneous random graph model if the vertex\nset contains N nodes and all edges are independent. That is, for any two nodes i, j \u2208 V, i connects\nto j with some probability p_ij and this event is independent of the formation of any other edges. We\nonly consider undirected graphs with no self-loops.\nDefinition 3.3. Node i is a peripheral node in an inhomogeneous random graph with N nodes if\nthere exists some constant b > 0 such that p_ij < b/N for all other nodes j, where we allow N \u2192 \u221e.\nFor example, an Erd\u0151s-R\u00e9nyi graph is an inhomogeneous random graph. If the Erd\u0151s-R\u00e9nyi edge\nprobability is specified by p = \u03bb/N for some fixed \u03bb > 0, then all nodes are peripheral. As another\nexample, a common assumption in the statistical literature on Stochastic Blockmodels is that the\nminimum expected degree grows faster than log N. Under this assumption, there are no peripheral\nnodes in the graph. That log N assumption is perhaps controversial because empirical graphs often\nhave many low-degree nodes.\nTheorem 3.4. Suppose an inhomogeneous random graph model such that for some \u03b5 > 0, p_ij >\n(1 + \u03b5)/N for all nodes i, j. 
If that model contains a non-vanishing fraction of peripheral nodes\nV_p \u2282 V, such that |V_p| > \u03b7N for some \u03b7 > 0, then the expected number of distinct g-dangling sets\nin the sampled graph grows proportionally to N.\n\nTheorem 3.4 studies graphs sampled from an inhomogeneous random graph model with a\nnon-vanishing fraction of peripheral nodes. Throughout the paper, we refer to these graphs more simply\nas graphs with a sparse and stochastic periphery and, in fact, the proof of Theorem 3.4 only relies on\nthe randomness of the edges in the periphery, i.e. the edges that have an end point in V_p. The proof\ndoes not rely on the distribution of the node-induced subgraph of the \u201ccore\u201d V_p^c. Combined with Fact\n3.1, Theorem 3.4 shows that graphs with a sparse and stochastic periphery generate an abundance of\ng-dangling sets, which creates an abundance of cuts with small conductance that might only reveal\nnoise. [15] also shows with real datasets that a substantial fraction of nodes barely connect\nto the rest of the graph, especially 1-whiskers, which generalize g-dangling sets.\nTheorem 3.5. If a graph contains Q g-dangling sets, and the rest of the graph has volume at least\n4g^2, then there are at least Q/2 eigenvalues that are smaller than (g \u2212 1)^{\u22121}.\nTheorem 3.5 shows that every two dangling sets lead to a small eigenvalue. Due to the abundance of\ng-dangling sets (Theorem 3.4), there are many small eigenvalues and their corresponding eigenvectors\nare localized on a small set of nodes. This explains what we see in the data example in Figure 1. 
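
The linear growth in Theorem 3.4 can be checked informally by simulation. The Python sketch below samples sparse Erd\u0151s-R\u00e9nyi graphs with edge probability 2/N (so every node is peripheral and the condition p_ij > (1 + \u03b5)/N holds) and counts 1-dangling and 2-dangling sets. The choice of model, the values of N, the seed and the function names are assumptions for illustration, and only g = 1 and g = 2 are counted to keep the example short.

```python
import random

def sparse_er_adjacency(N, lam, rng):
    # Erdos-Renyi graph with edge probability lam / N, stored as adjacency sets
    adj = [set() for _ in range(N)]
    p = lam / N
    for i in range(N):
        for j in range(i + 1, N):
            if rng.random() < p:
                adj[i].add(j)
                adj[j].add(i)
    return adj

def count_dangling(adj):
    # g = 1: a single node with exactly one edge to the rest of the graph.
    # g = 2: an edge (u, v) where u has degree one and v has degree two, so the
    #        pair has one internal edge and exactly one edge leaving it.
    one = sum(1 for nbrs in adj if len(nbrs) == 1)
    two = sum(1 for nbrs in adj
              if len(nbrs) == 1 and len(adj[next(iter(nbrs))]) == 2)
    return one, two

rng = random.Random(0)
results = {}
for N in (500, 1000, 2000):
    adj = sparse_er_adjacency(N, lam=2.0, rng=rng)
    results[N] = count_dangling(adj)
    one, two = results[N]
    print(N, one, two, round(one / N, 3), round(two / N, 3))
```

If the counts grow proportionally to N, the printed ratios stay roughly constant as N increases. For g = 1 the ratio should be close to 2 exp(-2), about 0.27, which is the limiting probability that a node has degree one in this model.
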
Each\nof these many localized eigenvectors is costly to compute (due to the small eigengaps) and then one needs to\ndecide which are localized (which requires another tuning).\n\n4 CoreCut ignores small cuts and relaxes to Regularized-SC.\nSimilar to the graph conductance \u03c6(\u00b7 , G) which relaxes to Vanilla-SC [7, 19, 20], we introduce a\nnew graph conductance CoreCut which relaxes to Regularized-SC. The following sketch illustrates\nthe relations. This section compares \u03c6(\u00b7 , G) and CoreCut. For ease of exposition, we continue to\nfocus our attention on partitioning into two sets.\n\n\u03c6(\u00b7 , G)    --- with G_\u03c4 --->   CoreCut\n    |                               |\nrelaxes to                      relaxes to\n    v                               v\nVanilla-SC  --- with G_\u03c4 --->   Regularized-SC\n\nDefinition 4.1. Given a subset S \u2282 V with vol(S, G_\u03c4) \u2264 vol(S^c, G_\u03c4), we define its CoreCut as\n\nCoreCut_\u03c4(S) = (cut(S, G) + (\u03c4/N)|S||S^c|) / (vol(S, G) + \u03c4|S|).\n\nFact 4.2. For any S \u2282 V with vol(S, G_\u03c4) \u2264 vol(S^c, G_\u03c4), it follows that CoreCut_\u03c4(S) = \u03c6(S, G_\u03c4).\nWith Fact 4.2, we can apply the Cheeger inequality to G_\u03c4 in order to relate the optimum CoreCut to the\nsecond eigenvalue of L_\u03c4, which we denote by \u03bb_2(L_\u03c4):\n\nh_\u03c4^2 / 2 \u2264 \u03bb_2(L_\u03c4) \u2264 2h_\u03c4, where h_\u03c4 = min_S CoreCut_\u03c4(S).\n\nThe fundamental property of CoreCut is that the regularizer \u03c4 has a larger effect on smaller sets. For\nexample, in Figure 3a, the sets S_{\u03b51}, ..., S_{\u03b55} are small peripheral sets and S_1, S_2 are core sets, each\nwith roughly half of all nodes. From Figure 3b, all five peripheral sets have smaller \u03c6(\u00b7 , G) than\nthe two core sets. Minimizing \u03c6(\u00b7 , G) tends to cut the periphery rather than cutting the core. 
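
The contrast between \u03c6(\u00b7 , G) and CoreCut can be reproduced on a much smaller graph than the one in Figure 3. The Python sketch below evaluates Definition 4.1 directly and also checks Fact 4.2 by computing the conductance on the regularized graph. The 15-node toy graph, the value of tau and all function names are assumptions chosen for illustration; this is not the network shown in Figure 3.

```python
def cut_and_volume(W, S):
    n, S = len(W), set(S)
    cut = sum(W[i][j] for i in S for j in range(n) if j not in S)
    vol = sum(sum(W[i]) for i in S)
    return cut, vol

def conductance(W, S):
    # phi(S, G); assumes S is the side of the partition with the smaller volume
    cut, vol = cut_and_volume(W, S)
    return cut / vol

def corecut(W, S, tau):
    # CoreCut_tau(S) = (cut(S, G) + (tau / N) |S| |S^c|) / (vol(S, G) + tau |S|)
    n, s = len(W), len(set(S))
    cut, vol = cut_and_volume(W, S)
    return (cut + tau / n * s * (n - s)) / (vol + tau * s)

def regularize(W, tau):
    # Adjacency matrix of G_tau: add tau / N to every entry
    n = len(W)
    return [[w + tau / n for w in row] for row in W]

# Hypothetical core-periphery toy graph with N = 15 nodes:
#   core: two 6-cliques {0..5} and {6..11} joined by 9 cross edges
#   periphery: the path 12-13-14 attached to node 5, which is a 3-dangling set
N = 15
W = [[0.0] * N for _ in range(N)]

def add_edge(i, j):
    W[i][j] = W[j][i] = 1.0

for block in (range(0, 6), range(6, 12)):
    for i in block:
        for j in block:
            if i < j:
                add_edge(i, j)
for i in range(0, 3):
    for j in range(6, 9):
        add_edge(i, j)
for i, j in [(5, 12), (12, 13), (13, 14)]:
    add_edge(i, j)

S_core, S_periph, tau = list(range(6, 12)), [12, 13, 14], 2.0
for name, S in [('core', S_core), ('periphery', S_periph)]:
    print(name,
          round(conductance(W, S), 4),                     # phi(S, G)
          round(corecut(W, S, tau), 4),                    # CoreCut_tau(S)
          round(conductance(regularize(W, tau), S), 4))    # phi(S, G_tau)
```

On this toy graph the dangling path has the smaller conductance (1/5 against 9/39 for the core set), so minimizing \u03c6(\u00b7 , G) would cut the periphery. With tau = 2 the ordering flips: the CoreCut of the path is about 0.53 while that of the core set is about 0.32, and both agree with the conductance computed on G_\u03c4.
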
With \u03c4 = 2 in Figure 3, the CoreCut of all five peripheral sets increases significantly relative to \u03c6(\u00b7 , G),\nwhile the CoreCut of the two core sets remains similar to their \u03c6(\u00b7 , G). In the end, CoreCut will cut the\ncore of the graph because all five peripheral sets have larger CoreCut than the two core sets S_1, S_2.\n\n(a) A core-periphery network.\n\n(b) Graph conductances on different sets.\n\nFigure 3: Figure (b) shows the CoreCut with \u03c4 = 2, and \u03c6(\u00b7 , G) on different sets in the\ncore-periphery network in Figure (a). CoreCut is very close to \u03c6(\u00b7 , G) on the core sets S_1 and S_2. But\non the peripheral sets, \u03c6(\u00b7 , G) assigns small values, while CoreCut assigns much larger values.\nMinimizing \u03c6(\u00b7 , G) will yield a peripheral set, while minimizing CoreCut will cut the core of the\ngraph.\n\nCoreCut will succeed if \u03c4 overwhelms the peripheral sets, but is negligible to core sets. Corollary 4.7\nbelow makes this intuition precise. It requires the following assumptions, where you should imagine\nS_\u03b5 to be a peripheral cut and S to be a cut to the core of the graph that we wish to detect.\nWe define the mean degree for any subset S\u2032 \u2282 V on G as d\u0304(S\u2032, G) = vol(S\u2032, G)/|S\u2032|.\nAssumption 4.3. For a graph G = (V, E) and subsets S_\u03b5 \u2282 V and S \u2282 V, there exist \u03b5, \u03b1 > 0\nsuch that\n\n1. |S_\u03b5| < \u03b5|V| and vol(S_\u03b5, G) < \u03b5 vol(V, G),\n2. d\u0304(S_\u03b5, G) < (1 \u2212 \u03b5) / (2(1 + \u03b1)) \u00b7 d\u0304(S, G),\n3. \u03c6(S, G) < \u03b1(1 \u2212 \u03b5) / (1 + \u03b1).\n\nRemark 4.4. Assumption 1 indicates that the peripheral set S_\u03b5 is a very small part of G in terms of\nnumber of nodes and number of edges. Assumption 2 requires S to be reasonably dense. 
Assumption 3 requires S and S^c to form a good partition.\nProposition 4.5. Given graph G = (V, E), for any set S_\u03b5 \u2282 V satisfying Assumption 1, for some\nconstant \u03b1 > 0, if we choose \u03c4 such that \u03c4 \u2265 \u03b1 d\u0304(S_\u03b5, G), then\n\nCoreCut_\u03c4(S_\u03b5) > \u03b1(1 \u2212 \u03b5) / (1 + \u03b1).\n\nProposition 4.5 shows that CoreCut of a peripheral set is lower bounded away from zero.\nProposition 4.6. Given graph G = (V, E), for any set S \u2282 V, for some constant \u03b4 > 0, if we\nchoose \u03c4 \u2264 \u03b4 d\u0304(S, G), then\n\nCoreCut_\u03c4(S) < \u03c6(S, G) + \u03b4.\n\nWhen S is reasonably large, \u03c4 can be chosen such that \u03b4 is small. Proposition 4.6 shows that with \u03c4\nnot being too large, the CoreCut of a reasonably large set is close to \u03c6(\u00b7 , G).\nCorollary 4.7 follows directly from Propositions 4.5 and 4.6.\nCorollary 4.7. Given graph G = (V, E), for any subsets S_\u03b5, S \u2282 V satisfying the three assumptions\nin Assumption 4.3, if we choose \u03c4 such that\n\n\u03b1 d\u0304(S_\u03b5, G) \u2264 \u03c4 \u2264 \u03b4 d\u0304(S, G),\n\nwhere \u03b4 = \u03b1(1 \u2212 \u03b5)/(1 + \u03b1) \u2212 \u03c6(S, G), then\n\nCoreCut_\u03c4(S) < CoreCut_\u03c4(S_\u03b5).\n\nCorollary 4.7 indicates the lower bound and upper bound of \u03c4 for CoreCut to ignore a cut to the\nperiphery and prefer a cut to the core. These bounds on \u03c4 lead to a deeper understanding of CoreCut.\nHowever, they are difficult to implement in practice.\n\n5 Real data examples\n\nThis section provides real data examples to show three things. First, Regularized-SC finds a more\nbalanced partition. Second, Vanilla-SC is prone to \u201ccatastrophic overfitting\u201d. Third, computing the\nsecond eigenvector of L_\u03c4 takes less time than computing the second eigenvector of L. This section\nstudies 37 example networks from http://snap.stanford.edu/data [14]. 
These networks are\nselected to be relatively easy to interpret and handle. The largest graph used is wiki-Talk and has only\n2,388,953 nodes in the largest component. The complete list of graphs used is given below. Before\ncomputing anything, directed edges are symmetrized and nodes not connected to the largest connected\ncomponent are removed. Throughout all simulations, the regularization parameter \u03c4 is set to be the\naverage degree of the graph. This is not optimized, but is instead a previously proposed heuristic [17].\nAs defined in Section 2 Equation 2.1, the partitions are constructed by scanning through the second\neigenvector. Even though we argue that regularized approaches are trying to minimize CoreCut,\nevery notion of conductance in this section is computed on the unregularized graph G, including\nthe scanning through the second eigenvector. All eigen-computations are performed with rARPACK\n[13, 18].\nIn this simulation, half of the edges are removed from the graph and placed into a \u201ctesting-set\u201d.\nRefer to the remaining edges as the \u201ctraining-edges\u201d. On the training-edges, the largest connected\ncomponent is again identified. Based upon that subset of the training-edges, the spectral partitions\nare formed.\nEach figure in this section corresponds to a different summary value (balance, training conductance,\ntesting conductance, and running time). In all figures, each point corresponds to a single network.\nThe x-axis corresponds to the summary value for Regularized-SC and the y-axis corresponds to\nthe summary value for Vanilla-SC. Each figure includes a black line, which is the line x = y. All\nplots are on the log-log scale. The size of each point corresponds to the number of nodes in the graph.\nIn Figure 4, the summary value is the number of nodes in the smaller partition set. Notice that the\nscales of the axes are entirely different. 
Vanilla-SC tends to identify sets with 100s of nodes or\nsmaller. However, regularizing tends to increase the size of the sets into the 1000s.\nIn Figure 5a, the summary value is the conductance computed on the training-edges. Because this is\nthe quantity that Vanilla-SC approximates, it is not surprising that it finds partitions with a smaller\nconductance. However, Figure 5b shows that if the conductance is computed using only edges in the\ntesting-set, then sometimes the vanilla sets have no internal edges (\u03c6(\u00b7 , G) = 1). We refer to this as\ncatastrophic overfitting.\nIn these simulations (and others), we find that the partitions produced by both forms of regularization\n[1] and [5] are exactly equivalent. We find it easier to implement fast code for [5] and moreover,\nour implementations of it run faster. Implementing [1] to take advantage of the sparsity in the\ngraph requires defining a function which quickly multiplies a vector x by L_\u03c4. Expanding\nL_\u03c4 = I \u2212 D_\u03c4^{\u22121/2} A_\u03c4 D_\u03c4^{\u22121/2} with A_\u03c4 = A + (\u03c4/N) 1 1^T, this can be done via\nL_\u03c4 x = x \u2212 D_\u03c4^{\u22121/2} A D_\u03c4^{\u22121/2} x \u2212 (\u03c4/N) D_\u03c4^{\u22121/2} 1 (1^T D_\u03c4^{\u22121/2} x), where 1 is a vector of 1\u2019s. However, with a\nuser defined matrix multiplication, the eigensolver in rARPACK runs slightly slower. Because the\nregularized form from [5] simply defines L_\u03c4 = I \u2212 D_\u03c4^{\u22121/2} A D_\u03c4^{\u22121/2}, it can use the same eigensolver\nas Vanilla-SC and, as such, the running times are more comparable. Figure 6 uses this definition of\nRegularized-SC. Running times are from rARPACK computing two eigenvectors of D_\u03c4^{\u22121/2} A D_\u03c4^{\u22121/2}\nand D^{\u22121/2} A D^{\u22121/2} using the default settings. A line of regression is added to Figure 6. 
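
The matrix-free multiplication by L_\u03c4 described above can be sketched in Python as follows. The edge-list representation, the function names and the toy input are assumptions made for this example; the experiments in this section use rARPACK, not this code. The sketch follows the Section 2 definition of L_\u03c4, so the rank-one term is scaled by the regularized degrees on both sides, and it compares the result with a dense reference computation.

```python
import math

def make_regularized_laplacian_matvec(edges, N, tau):
    # Returns a function x -> L_tau x that touches each edge once, so one
    # product costs O(|E| + N) instead of O(N^2) for the dense matrix A_tau.
    deg = [0.0] * N
    for i, j in edges:
        deg[i] += 1.0
        deg[j] += 1.0
    inv_sqrt = [1.0 / math.sqrt(d + tau) for d in deg]  # diagonal of D_tau^{-1/2}

    def matvec(x):
        y = [a * b for a, b in zip(inv_sqrt, x)]        # y = D_tau^{-1/2} x
        Ay = [0.0] * N
        for i, j in edges:                              # sparse product A y
            Ay[i] += y[j]
            Ay[j] += y[i]
        shift = tau / N * sum(y)                        # every entry of (tau / N) 1 (1^T y)
        return [x[i] - inv_sqrt[i] * (Ay[i] + shift) for i in range(N)]

    return matvec

def dense_regularized_laplacian(edges, N, tau):
    # Reference implementation that forms A_tau = A + (tau / N) 1 1^T explicitly
    A = [[tau / N] * N for _ in range(N)]
    for i, j in edges:
        A[i][j] += 1.0
        A[j][i] += 1.0
    d = [sum(row) for row in A]                         # regularized degrees d_i + tau
    return [[(1.0 if i == j else 0.0) - A[i][j] / math.sqrt(d[i] * d[j])
             for j in range(N)] for i in range(N)]

# Hypothetical toy input: a path on 5 nodes plus one chord; tau = average degree.
edges = [(0, 1), (1, 2), (2, 3), (3, 4), (0, 2)]
N = 5
tau = 2.0 * len(edges) / N
x = [1.0, -2.0, 0.5, 3.0, -1.0]

fast = make_regularized_laplacian_matvec(edges, N, tau)(x)
L_tau = dense_regularized_laplacian(edges, N, tau)
slow = [sum(L_tau[i][j] * x[j] for j in range(N)) for i in range(N)]
max_diff = max(abs(a - b) for a, b in zip(fast, slow))
print(max_diff)
```

A function of this shape is what an iterative eigensolver needs when it accepts a user-defined matrix-vector product. The dense reference is only there to check the sketch and would not be formed for a large graph.
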
The slope of the regression line in Figure 6 is roughly 1.01 and its intercept is roughly 0.83.\nThe list of SNAP networks is given here: amazon0302, amazon0312, amazon0505, amazon0601,\nca-AstroPh, ca-CondMat, ca-GrQc, ca-HepPh, ca-HepTh, cit-HepPh, cit-HepTh, com-amazon.ungraph,\ncom-youtube.ungraph, email-EuAll, email-Eu-core, facebook-combined, p2p-Gnutella04,\np2p-Gnutella05, p2p-Gnutella06, p2p-Gnutella08, p2p-Gnutella09, p2p-Gnutella24, p2p-Gnutella25,\np2p-Gnutella30, p2p-Gnutella31, roadNet-CA, roadNet-PA, roadNet-TX, soc-Epinions1,\nsoc-Slashdot0811, soc-Slashdot0902, twitter-combined, web-Google, web-NotreDame, web-Stanford,\nwiki-Talk, wiki-Vote.\n\nFigure 4: Regularized-SC identifies clusters that are more balanced. That is, the smallest set in the\npartition has more nodes.\n[Plot: balance under Regularized-SC (x-axis) against Vanilla-SC (y-axis); point size indicates N. Panel title: Balance vs balance. Regularization increases balance.]\n\n(a)\n\n(b)\n\nFigure 5: Vanilla-SC finds cuts with a smaller conductance. However, on the testing edges, it can\nhave a catastrophic failure, where there are no internal edges to the smallest set. This corresponds to\n\u03c6(\u00b7 , G) = 1.\n\nFigure 6: The line of regression suggests that Regularized-SC runs roughly eight times faster than\nVanilla-SC in rARPACK [18].\n\n6 Discussion\n\nThe results in this paper provide a refined understanding of how regularized spectral clustering\nprevents overfitting. This paper suggests that spectral clustering overfits to g-dangling sets (and,\nperhaps, other small sets) because they occur as noise in sparse and stochastic graphs and they have a\nvery small cost function \u03c6. Regularized spectral clustering optimizes a relaxation of CoreCut (a cost\nfunction very much related to \u03c6) that assigns a higher cost to small sets like g-dangling sets. 
As such,\nwhen a graph is sparse and stochastic, the patterns identified by regularized spectral clustering are\nmore likely to persist in another sample of the graph from the same distribution.\nSuch overfitting on peripheries may also happen in many other machine learning methods with graph\ndata. There has been an interest in generalizing Convolutional Neural Networks beyond images,\nto more general graph dependence structures. In these settings, the architecture of the first layer\nshould identify a localized region of the graph [12, 4, 10, 16]. While spectral approaches have been\nproposed, our results herein suggest potential benefits from regularization.\n\n[Plot panels for Figures 5 and 6, each with Regularized-SC on the x-axis, Vanilla-SC on the y-axis and point size indicating N. Panel titles: Training conductance. Vanilla cuts have much smaller conductance. Testing conductance. Vanilla cuts are sometimes awful. Running time in seconds. Regularized runs ~8x faster.]\n\nAcknowledgements\n\nThe authors gratefully acknowledge support from NSF grant DMS-1612456 and ARO grant\nW911NF-15-1-0423. We thank Yeganeh Ali Mohammadi and Mobin YahyazadehJeloudar for their helpful\ncomments.\n\nReferences\n[1] Arash A Amini, Aiyou Chen, Peter J Bickel, Elizaveta Levina, et al. Pseudo-likelihood methods\nfor community detection in large sparse networks. The Annals of Statistics, 41(4):2097\u20132122,\n2013.\n\n[2] Norbert Binkiewicz, Joshua T Vogelstein, and Karl Rohe. Covariate-assisted spectral clustering.\nBiometrika, 104(2):361\u2013377, 2017.\n\n[3] Stephen P Borgatti and Martin G Everett. Models of core/periphery structures. Social networks,\n21(4):375\u2013395, 2000.\n\n[4] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally\nconnected networks on graphs. arXiv preprint arXiv:1312.6203, 2013.\n\n[5] Kamalika Chaudhuri, Fan Chung, and Alexander Tsiatas. 
Spectral clustering of graphs with\ngeneral degrees in the extended planted partition model. In Conference on Learning Theory,\npages 35\u20131, 2012.\n\n[6] Fan RK Chung. Laplacians of graphs and cheeger\u2019s inequalities. Combinatorics, Paul Erdos is\n\nEighty, 2(157-172):13\u20132, 1996.\n\n[7] Fan RK Chung. Spectral graph theory. Number 92. American Mathematical Soc., 1997.\n\n[8] Paul W Holland, Kathryn Blackmond Laskey, and Samuel Leinhardt. Stochastic blockmodels:\n\nFirst steps. Social networks, 5(2):109\u2013137, 1983.\n\n[9] Antony Joseph, Bin Yu, et al. Impact of regularization on spectral clustering. The Annals of\n\nStatistics, 44(4):1765\u20131791, 2016.\n\n[10] Thomas N Kipf and Max Welling. Semi-supervised classi\ufb01cation with graph convolutional\n\nnetworks. arXiv preprint arXiv:1609.02907, 2016.\n\n[11] Can M Le, Elizaveta Levina, and Roman Vershynin. Sparse random graphs: regularization and\n\nconcentration of the laplacian. arXiv preprint arXiv:1502.03049, 2015.\n\n[12] Yann LeCun, Yoshua Bengio, et al. Convolutional networks for images, speech, and time series.\n\nThe handbook of brain theory and neural networks, 3361(10):1995, 1995.\n\n[13] Richard B Lehoucq, Danny C Sorensen, and Chao Yang. ARPACK users\u2019 guide: solution of\nlarge-scale eigenvalue problems with implicitly restarted Arnoldi methods, volume 6. Siam,\n1998.\n\n[14] Jure Leskovec and Andrej Krevl. SNAP Datasets: Stanford large network dataset collection.\n\nhttp://snap.stanford.edu/data, June 2014.\n\n[15] Jure Leskovec, Kevin J Lang, Anirban Dasgupta, and Michael W Mahoney. Community\nstructure in large networks: Natural cluster sizes and the absence of large well-de\ufb01ned clusters.\nInternet Mathematics, 6(1):29\u2013123, 2009.\n\n[16] Ron Levie, Federico Monti, Xavier Bresson, and Michael M Bronstein. Cayleynets:\nGraph convolutional neural networks with complex rational spectral \ufb01lters. arXiv preprint\narXiv:1705.07664, 2017.\n\n[17] Tai Qin and Karl Rohe. 
Regularized spectral clustering under the degree-corrected stochastic\nblockmodel. In Advances in Neural Information Processing Systems, pages 3120\u20133128, 2013.\n\n[18] Yixuan Qiu, Jiali Mei, and authors of the ARPACK library. See file AUTHORS for details.\nrARPACK: Solvers for Large Scale Eigenvalue and SVD Problems, 2016. R package version\n0.11-0.\n\n[19] Jianbo Shi and Jitendra Malik. Normalized cuts and image segmentation. IEEE Transactions\non pattern analysis and machine intelligence, 22(8):888\u2013905, 2000.\n\n[20] Ulrike Von Luxburg. A tutorial on spectral clustering. Statistics and computing, 17(4):395\u2013416,\n2007.\n\n[21] Yilin Zhang, Marie Poux-Berthe, Chris Wells, Karolina Koc-Michalska, and Karl Rohe. Discovering\npolitical topics in facebook discussion threads with spectral contextualization. arXiv\npreprint arXiv:1708.06872, 2017.\n\n[22] Yini Zhang, Chris Wells, Song Wang, and Karl Rohe. Attention and amplification in the hybrid\nmedia system: The composition and activity of donald trump\u2019s twitter following during the\n2016 presidential election. New Media & Society, page 1461444817744390, 2017.\n", "award": [], "sourceid": 6781, "authors": [{"given_name": "Yilin", "family_name": "Zhang", "institution": "University of Wisconsin-Madison"}, {"given_name": "Karl", "family_name": "Rohe", "institution": "UW-Madison"}]}