{"title": "General Tensor Spectral Co-clustering for Higher-Order Data", "book": "Advances in Neural Information Processing Systems", "page_first": 2559, "page_last": 2567, "abstract": "Spectral clustering and co-clustering are well-known techniques in data analysis, and recent work has extended spectral clustering to square, symmetric tensors and hypermatrices derived from a network. We develop a new tensor spectral co-clustering method that simultaneously clusters the rows, columns, and slices of a nonnegative three-mode tensor and generalizes to tensors with any number of modes. The algorithm is based on a new random walk model which we call the super-spacey random surfer. We show that our method out-performs state-of-the-art co-clustering methods on several synthetic datasets with ground truth clusters and then use the algorithm to analyze several real-world datasets.", "full_text": "General Tensor Spectral Co-clustering\n\nfor Higher-Order Data\n\nTao Wu\n\nPurdue University\nwu577@purdue.edu\n\nAustin R. Benson\nStanford University\n\narbenson@stanford.edu\n\nDavid F. Gleich\nPurdue University\n\ndgleich@purdue.edu\n\nAbstract\n\nSpectral clustering and co-clustering are well-known techniques in data analysis,\nand recent work has extended spectral clustering to square, symmetric tensors\nand hypermatrices derived from a network. We develop a new tensor spectral\nco-clustering method that simultaneously clusters the rows, columns, and slices\nof a nonnegative three-mode tensor and generalizes to tensors with any number of\nmodes. The algorithm is based on a new random walk model which we call the\nsuper-spacey random surfer. 
We show that our method outperforms state-of-the-art co-clustering methods on several synthetic datasets with ground truth clusters and then use the algorithm to analyze several real-world datasets.

1 Introduction

Clustering is a fundamental task in machine learning that aims to assign closely related entities to the same group. Traditional methods optimize some aggregate measure of the strength of pairwise relationships (e.g., similarities) between items. Spectral clustering is a particularly powerful technique for computing the clusters when the pairwise similarities are encoded into the adjacency matrix of a graph. However, many graph-like datasets are more naturally described by higher-order connections among several entities. For instance, multilayer or multiplex networks describe the interactions between several graphs simultaneously with node-node-layer relationships [17]. Nonnegative tensors are a common representation for many of these higher-order datasets. For instance, the (i, j, k) entry in a third-order tensor might represent the similarity between items i and j in layer k.

Here we develop the General Tensor Spectral Co-clustering (GTSC) framework for clustering tensor data. The algorithm takes as input a nonnegative tensor, which may be sparse, non-square, and asymmetric, and outputs subsets of indices from each dimension (co-clusters). Underlying our method is a new stochastic process that models higher-order Markov chains, which we call a super-spacey random walk. This is used to generalize ideas from spectral clustering based on random walks. We introduce a variant on the well-known conductance measure from spectral graph partitioning [24] that we call biased conductance and describe how this provides a tensor partition quality metric; this is akin to Chung's use of circulations to spectrally partition directed graphs [7].
Essentially, biased conductance is the exit probability from a set under our new super-spacey random walk model. We use experiments on both synthetic and real-world problems to validate the effectiveness of our method.¹ For the synthetic experiments, we devise a "planted cluster" model for tensors and show that GTSC has superior performance compared to other state-of-the-art clustering methods in recovering the planted clusters. In real-world tensor data experiments, we find that our GTSC framework identifies stop words and semantically independent sets in n-gram tensors as well as worldwide and regional airlines and airports in a flight multiplex network.

¹Code and data for this paper are available at: https://github.com/wutao27/GtensorSC

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

1.1 Related work

The Tensor Spectral Clustering (TSC) algorithm [4], another generalization of spectral methods to higher-order graph data, is closely related. Both the perspective and high-level view are similar, but the details differ in important ways. First, TSC was designed for the case when the higher-order tensor records the occurrences of small subgraph patterns within a network. This imposes limitations: because the tensor arises from some underlying graph, the partitioning metric was designed explicitly for that graph. Thus, the applications are limited in scope and cannot model, for example, the airline-airport multiplex network we analyze in Section 3.2. Second, for sparse data, the model used by TSC requires a correction term with magnitude proportional to the sparsity of the tensor. In sparse tensors, this makes it difficult to accurately identify clusters, which we show in Section 3.1.

Most other approaches to tensor clustering proceed by using low-rank factorizations [15, 21] or a k-means objective [16].
In contrast, our work is based on a stochastic interpretation (escape probabilities from a set), in the spirit of random walks in spectral clustering for graphs. There are also several methods specific to clustering multiplex networks [27, 20] and to clustering graphs with multiple entities [11, 2]. Our method handles general tensor data, which includes these types of datasets as special cases. Hypergraph clustering [14] can also model the higher-order structure of the data, but in the case of tensor data the hypergraph is approximated by a standard weighted graph.

1.2 Background on spectral clustering of graphs from the perspective of random walks

We first review graph clustering methods from the view of graph cuts and random walks, and then review the standard spectral clustering method using sweep cuts. In Section 2, we generalize these notions to higher-order data in order to develop our GTSC framework.

Let A ∈ R^{n×n}_+ be the adjacency matrix of an undirected graph G = (V, E) and let n = |V| be the number of nodes in the graph. Define the diagonal matrix of degrees of vertices in V as D = diag(Ae), where e is the vector of all ones. The graph Laplacian is L = D − A and the transition matrix is P = A^T D^{-1} (= A D^{-1} since A is symmetric). The transition matrix represents the transition probabilities of a random walk on the graph: if a walker is at node j, it transitions to node i with probability P_{ij} = A_{ji}/D_{jj}.

Conductance. One of the most widely used quality metrics for partitioning a graph's vertices into two sets S and S̄ = V \ S is conductance [24]. Intuitively, conductance measures the ratio of the number of edges in the graph that go between S and S̄ to the number of edges in S or S̄.
Formally, we define conductance as

    φ(S) = cut(S) / min(vol(S), vol(S̄)),    (1)

where

    cut(S) = Σ_{i∈S, j∈S̄} A_{ij}   and   vol(S) = Σ_{i∈S, j∈V} A_{ij}.    (2)

A set S with small conductance is a good partition (S, S̄). The following well-known observation relates conductance to random walks on the graph.

Observation 1 ([18]) Let G be undirected, connected, and not bipartite. Start a random walk (Z_t)_{t∈N} where the initial state Z_0 is randomly chosen following the stationary distribution of the random walk. Then for any set S ⊆ V,

    φ(S) = max{ Pr(Z_1 ∈ S̄ | Z_0 ∈ S), Pr(Z_1 ∈ S | Z_0 ∈ S̄) }.

This provides an alternative view of conductance: it measures the probability that one step of a random walk will traverse between S and S̄. This random walk view, in concert with the super-spacey random walk, will serve as the basis for our biased conductance idea to partition tensors in Section 2.4.

Partitioning with a sweep cut. Finding the set of minimum conductance is an NP-hard combinatorial optimization problem [26]. However, there are real-valued relaxations of the problem that are tractable to solve and provide a guaranteed approximation [19, 9]. The most well-known computes an eigenvector called the Fiedler vector and then uses a sweep cut to identify a partition based on this eigenvector.

The Fiedler eigenvector z solves Lz = λDz, where λ is the second-smallest generalized eigenvalue. This can be equivalently formulated in terms of the random walk transition matrix P. Specifically,

    Lz = λDz  ⇔  (I − D^{-1}A)z = λz  ⇔  z^T P = (1 − λ)z^T.

The sweep cut procedure to identify a low-conductance set S from z is as follows:

1. Sort the vertices by z as z_{σ1} ≤ z_{σ2} ≤ ··· ≤ z_{σn}.
2.
Consider the n − 1 candidate sets S_k = {σ_1, σ_2, ..., σ_k} for 1 ≤ k ≤ n − 1.
3. Choose S = argmin_{S_k} φ(S_k) as the solution set.

The solution set S from this algorithm satisfies the celebrated Cheeger inequality [19, 8]: φ(S) ≤ 2·sqrt(φ_opt), where φ_opt = min_{S⊂V} φ(S) is the minimum conductance over any set of nodes. Computing φ(S_k) for all k takes only time linear in the number of edges of the graph because S_{k+1} and S_k differ only in the vertex σ_{k+1}.

To summarize, the spectral method requires two components: the second left eigenvector of P and the conductance criterion. We generalize these ideas to tensors in the following section.

2 A higher-order spectral method for tensor co-clustering

We now generalize the ideas from spectral graph partitioning to nonnegative tensor data. We first review our notation for tensors and then show how tensor data can be interpreted as a higher-order Markov chain. We briefly review Tensor Spectral Clustering [4] before introducing the new super-spacey random walk that we use here. This super-spacey random walk will allow us to compute a vector akin to the Fiedler vector for a tensor and to generalize conductance to tensors. Furthermore, we generalize the ideas from co-clustering of bipartite graph data [10] to rectangular tensors.

2.1 Preliminaries and tensor notation

We use T to denote a tensor. As a generalization of a matrix, T has m indices (making T an mth-order or m-mode tensor), with the (i_1, i_2, ..., i_m) entry denoted T_{i_1,i_2,...,i_m}. We work with nonnegative tensors, where T_{i_1,i_2,...,i_m} ≥ 0. We call a subset of the tensor entries with all but the first index fixed a column of the tensor. For instance, the (j, k) column of a three-mode tensor T is T_{:,j,k}.
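As a concrete reference for the graph machinery of Section 1.2, the conductance score, the Fiedler vector, and the sweep cut can be sketched as follows. This is an illustrative dense-matrix sketch (function names are ours, not from the paper's code); the Fiedler vector is obtained through the symmetric normalized Laplacian N = I − D^{-1/2} A D^{-1/2}, whose eigenvector u for the second-smallest eigenvalue gives z = D^{-1/2} u.

```python
import numpy as np

def conductance(A, S):
    """phi(S) = cut(S) / min(vol(S), vol(S_bar)), Equations (1)-(2),
    for a symmetric nonnegative adjacency matrix A."""
    n = A.shape[0]
    S = np.asarray(sorted(S))
    Sbar = np.setdiff1d(np.arange(n), S)
    cut = A[np.ix_(S, Sbar)].sum()           # weight of edges leaving S
    return cut / min(A[S].sum(), A[Sbar].sum())

def fiedler_vector(A):
    """Solve L z = lambda D z via N = I - D^{-1/2} A D^{-1/2}."""
    d = A.sum(axis=1)
    dinv = 1.0 / np.sqrt(d)
    N = np.eye(len(d)) - dinv[:, None] * A * dinv[None, :]
    vals, vecs = np.linalg.eigh(N)           # eigenvalues ascending
    return dinv * vecs[:, 1]                 # second-smallest eigenpair

def sweep_cut(A, z):
    """Scan prefixes of vertices sorted by z; return the minimum-conductance
    prefix set S_k = {sigma_1, ..., sigma_k}."""
    order = np.argsort(z)
    best_S, best_phi = None, np.inf
    for k in range(1, len(z)):
        phi = conductance(A, order[:k])
        if phi < best_phi:
            best_S, best_phi = set(order[:k].tolist()), phi
    return best_S, best_phi
```

On a "barbell" of two triangles joined by one edge, the sweep recovers one triangle with φ = 1/7 (cut 1, volume 7 on each side). The incremental O(|E|) sweep described in the text is replaced here by a simple quadratic loop for clarity.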
A tensor is square if the dimensions of all the modes are equal and rectangular if not, and a square tensor is symmetric if it is invariant under any permutation of the indices. For simplicity, in the remainder of our exposition we focus on three-mode tensors; however, all of our ideas generalize to an arbitrary number of modes. (See, e.g., the work of Gleich et al. [13] and Benson et al. [5] for representative examples of how these generalizations work.) Finally, we use two operations between a tensor and a vector. First, a tensor-vector product with a three-mode tensor can output a vector, which we denote by

    y = T x^2  ⇔  y_i = Σ_{j,k} T_{i,j,k} x_j x_k.

Second, a tensor-vector product can also produce a matrix, which we denote by

    A = T[x]  ⇔  A_{i,j} = Σ_k T_{i,j,k} x_k.

2.2 Forming higher-order Markov chains from nonnegative tensor data

Recall from Section 1.2 that we can form the transition matrix of a Markov chain from a square nonnegative matrix A by normalizing the columns of the matrix A^T. We can generalize this idea to define a higher-order Markov chain by normalizing a square tensor. This leads to a probability transition tensor P:

    P_{i,j,k} = T_{i,j,k} / Σ_i T_{i,j,k},    (3)

where we assume Σ_i T_{i,j,k} > 0. In Section 2.3, we will discuss the sparse case where the column T_{:,j,k} may be entirely zero. When that case does not arise, the entries of P can be interpreted as the transition probabilities of a second-order Markov chain (Z_t)_{t∈N}:

    P_{i,j,k} = Pr(Z_{t+1} = i | Z_t = j, Z_{t−1} = k).

In other words, if the last two states were j and k, then the next state is i with probability P_{i,j,k}.

It is possible to turn any higher-order Markov chain into a first-order Markov chain on the product state space of all ordered pairs (i, j). The new Markov chain moves to the state-pair (i, j) from (j, k) with probability P_{i,j,k}.
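The two tensor-vector products and the column normalization of Equation (3) can be sketched in NumPy as follows (the function names t_xx, t_x, and transition_tensor are our own labels):

```python
import numpy as np

def t_xx(T, x):
    """y = T x^2, i.e., y_i = sum_{j,k} T[i,j,k] * x_j * x_k."""
    return np.einsum('ijk,j,k->i', T, x, x)

def t_x(T, x):
    """A = T[x], i.e., A_{i,j} = sum_k T[i,j,k] * x_k."""
    return np.einsum('ijk,k->ij', T, x)

def transition_tensor(T):
    """P_{i,j,k} = T_{i,j,k} / sum_i T_{i,j,k} (Equation (3)).
    Columns T[:, j, k] that are entirely zero are left as zero; that
    sparse case is the subject of Section 2.3."""
    col = T.sum(axis=0)                    # col[j, k] = sum_i T[i, j, k]
    safe = np.where(col > 0, col, 1.0)     # avoid dividing by zero
    return T / safe
```

A useful identity relating the two products, used later in Section 2.4, is T[x] x = T x^2, which holds by summing A_{i,j} x_j over j.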
Computing the Fiedler vector associated with this chain would be one approach to tensor clustering. However, there are two immediate problems. First, the eigenvector is of size n^2, which quickly becomes infeasible to store. Second, the eigenvector gives information about the product space, not the original state space. (In future work we plan to explore insights from marginals of this distribution.)

Recent work uses the spacey random walk and spacey random surfer stochastic processes to circumvent these issues [5]. The process is non-Markovian and generates a sequence of states X_t as follows. After arriving at state X_t, the walker promptly "spaces out" and forgets the state X_{t−1}, yet it still wants to transition according to the higher-order transitions P. Thus, it invents a state Y_t by drawing a random state from its history and then transitions to state X_{t+1} with probability P_{X_{t+1}, X_t, Y_t}. Let Ind{·} denote the indicator function and H_t the history of the process up to time t;² then

    Pr(Y_t = j | H_t) = (1 + Σ_{r=1}^t Ind{X_r = j}) / (t + n).    (4)

In this case, we assume that the process has a non-zero probability of picking any state by inflating its history count by 1 visit. The spacey random surfer is a generalization where the walk follows the above process with probability α and teleports at random following a stochastic vector v with probability 1 − α. This is akin to how the PageRank random walk includes teleportation. Limiting stationary distributions are solutions to the multilinear PageRank problem [13]:

    αP x^2 + (1 − α)v = x,    (5)

and the limiting distribution x represents the stationary distribution of the transition matrix P[x] [5]. The transition matrix P[x] asymptotically approximates the spacey walk or spacey random surfer. Thus, it is feasible to compute an eigenvector of the matrix P[x] and use it with the sweep cut procedure on a generalized notion of conductance.
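The fixed point of Equation (5) can be computed by directly iterating the map x ← αP x^2 + (1 − α)v; a minimal sketch (the function name is ours):

```python
import numpy as np

def multilinear_pagerank(P, v, alpha, tol=1e-12, max_iter=10000):
    """Fixed-point iteration for Equation (5): x = alpha*P x^2 + (1-alpha)*v.
    P is a column-stochastic 3-mode transition tensor and v a stochastic
    teleportation vector."""
    x = v.copy()
    for _ in range(max_iter):
        x_new = alpha * np.einsum('ijk,j,k->i', P, x, x) + (1 - alpha) * v
        if np.abs(x_new - x).sum() < tol:
            return x_new
        x = x_new
    return x
```

For a three-mode tensor the iteration map is a contraction on the probability simplex when α is small enough (α < 1/2 suffices for this case; see [13] for the precise conditions), and larger values of α often still converge in practice.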
However, this derivation assumes that all n^2 columns of T are non-zero, which does not occur in real-world datasets. The TSC method adjusts the tensor T and replaces any column of all zeros with the uniform distribution vector [4]. Because the number of zero columns may be large, this strategy dilutes the information in the eigenvector (see Appendix D.1). We deal with this issue more generally in the following section, and note that our new solution outperforms TSC in our experiments (Section 3).

2.3 A stochastic process for sparse tensors

Here we consider another model of the random surfer that avoids the issue of undefined transitions (which correspond to columns of T that are all zero) entirely. If the surfer attempts to use an undefined transition, then the surfer moves to a random state drawn from history. Formally, define the set of feasible states by

    F = {(j, k) | Σ_i T_{i,j,k} > 0}.    (6)

The set F contains all the columns of T that are non-zero. The transition probabilities of our proposed stochastic process are given by

    Pr(X_{t+1} = i | X_t = j, H_t) = (1 − α)v_i + α Σ_k Pr(X_{t+1} = i | X_t = j, Y_t = k, H_t) Pr(Y_t = k | H_t),    (7)

where

    Pr(X_{t+1} = i | X_t = j, Y_t = k, H_t) = T_{i,j,k} / Σ_i T_{i,j,k}                       if (j, k) ∈ F,
                                              (1 + Σ_{r=1}^t Ind{X_r = i}) / (n + t)          if (j, k) ∉ F,    (8)

and v_i is the teleportation probability. Again, Y_t is chosen according to Equation (4). We call this process the super-spacey random surfer because when the transitions are not defined it picks a random state from history.

This process is a (generalized) vertex-reinforced random walk [3]. Let P be the normalized tensor with P_{i,j,k} = T_{i,j,k} / Σ_i T_{i,j,k} for the columns in F and all other entries zero. Stationary distributions of the stochastic process must satisfy the following equation:

    αP x^2 + α(1 − ||P x^2||_1)x + (1 − α)v = x,    (9)

where x is a probability distribution vector (see Appendix A.1 for a proof). At least one solution vector x must exist, which follows directly from Brouwer's fixed-point theorem. Here we give a sufficient condition for it to be unique and easily computable.

Theorem 2.1 If α < 1/(2m − 1), then there is a unique solution x to (9) for a general m-mode tensor. Furthermore, the fixed-point iteration

    x_{k+1} = αP x_k^2 + α(1 − ||P x_k^2||_1)x_k + (1 − α)v    (10)

converges at least linearly to this solution.

This is a nonlinear setting and tighter convergence results are currently unknown, but the conditions are unlikely to be tight on real-world data. We found that high values of α (e.g., 0.95) do not impede convergence; we use α = 0.8 for all our experiments.

In the following section, we show how to form a Markov chain from x and then develop our spectral clustering technique by operating on the corresponding transition matrix.

²Formally, H_t is the σ-algebra generated by the states X_1, ..., X_t.

2.4 First-order Markov approximations and biased conductance for tensor partitions

From Observation 1 in Section 1.2, we know that conductance may be interpreted as the exit probability between the two sets that form a partition of the nodes of a graph. In this section, we derive an equivalent first-order Markov chain from the stationary distribution of the super-spacey random surfer. If this Markov chain were guaranteed to be reversible, then we could apply the standard definitions of conductance and the Fiedler vector.
This will not generally be the case, and so we introduce a biased conductance measure to partition this non-reversible Markov chain with respect to starting in the stationary distribution of the super-spacey random walk. We use the second-largest, real-valued eigenvector of the Markov chain as an approximate Fiedler vector. Thus, we can use the sweep cut procedure described in Section 1.2 to identify the partition.

Forming a first-order Markov chain approximation. In the following derivation, we use the property of the two tensor-vector products that P[x]x = P x^2. The stationary distribution x of the super-spacey random surfer is equivalently the stationary distribution of the Markov chain with transition matrix

    α(P[x] + x(e^T − e^T P[x])) + (1 − α)ve^T.

(Here we have used the fact that x ≥ 0 and e^T x = 1.) This transition matrix transitions according to a first-order Markov chain with probability α, and according to the fixed vector v with probability 1 − α. We introduce the following first-order Markov chain:

    P̃ = P[x] + x(e^T − e^T P[x]),

which represents a useful (but crude) approximation of the higher-order structure in the data. First, we determine how often we visit states using the super-spacey random surfer to get a vector x. Then the Markov chain P̃ will tend to have a large probability of spending time in states where the higher-order information concentrates. This matrix represents a first-order Markov chain on which we can compute an eigenvector and run a sweep cut.

Biased conductance. Consider a random walk (Z_t)_{t∈N}. We define the biased conductance φ_p(S) of a set S ⊂ {1, ..., n} to be

    φ_p(S) = max{ Pr(Z_1 ∈ S̄ | Z_0 ∈ S), Pr(Z_1 ∈ S | Z_0 ∈ S̄) },

where Z_0 is chosen according to a fixed distribution p.
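The pieces introduced so far, the super-spacey fixed point (10), the first-order chain P̃, and biased conductance, can be sketched together as follows. This is a dense NumPy illustration with our own function names; the paper's implementation works only with the non-zeros of P (Appendix B.1).

```python
import numpy as np

def super_spacey_x(P, v, alpha=0.8, tol=1e-12, max_iter=10000):
    """Iterate Equation (10): x <- alpha*P x^2 + alpha*(1-||P x^2||_1)*x
    + (1-alpha)*v. P may have all-zero columns (the sparse case)."""
    x = v.copy()
    for _ in range(max_iter):
        y = np.einsum('ijk,j,k->i', P, x, x)          # P x^2
        x_new = alpha * y + alpha * (1 - y.sum()) * x + (1 - alpha) * v
        if np.abs(x_new - x).sum() < tol:
            return x_new
        x = x_new
    return x

def first_order_chain(P, x):
    """P~ = P[x] + x (e^T - e^T P[x]): the column-stochastic first-order
    approximation built from the super-spacey stationary vector x."""
    Px = np.einsum('ijk,k->ij', P, x)                  # P[x]
    return Px + np.outer(x, 1.0 - Px.sum(axis=0))

def biased_conductance(Pt, p, S):
    """phi_p(S) = max{Pr(Z1 in S_bar | Z0 in S), Pr(Z1 in S | Z0 in S_bar)}
    with Z0 ~ p, for a column-stochastic matrix Pt."""
    n = Pt.shape[0]
    S = np.asarray(sorted(S))
    Sbar = np.setdiff1d(np.arange(n), S)
    out_S = (Pt[np.ix_(Sbar, S)] * p[S]).sum() / p[S].sum()
    out_Sbar = (Pt[np.ix_(S, Sbar)] * p[Sbar]).sum() / p[Sbar].sum()
    return max(out_S, out_Sbar)
```

Note that each iterate of (10) stays on the probability simplex, and the columns of P̃ sum to one whenever e^T x = 1, which is what makes the biased conductance of P̃ with p = x well defined.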
Just as with the standard definition of conductance, we can interpret biased conductance as an escape probability. However, the initial state Z_0 is not chosen following the stationary distribution (as in the standard definition with a reversible chain) but following p instead; this is why we call it biased conductance. We apply this measure to P̃ with p = x (the stationary distribution of the super-spacey walk). This choice emphasizes the higher-order information. Our idea of biased conductance is equivalent to how Chung defines a conductance score for a directed graph [7].

We use the eigenvector of P̃ with the second-largest real eigenvalue as an analogue of the Fiedler vector. If the chain were reversible, this would be exactly the Fiedler vector. When it is not, the vector coordinates still encode indications of state clustering [25]; hence, this vector serves as a principled heuristic. It is important to note that although P̃ is a dense matrix, we can implement the two operations we need with P̃ in time and space that depend only on the number of non-zeros of the sparse tensor P, using standard iterative eigenvalue methods (see Appendix B.1).

2.5 Handling rectangular tensor data

So far, we have only considered square, symmetric tensor data. However, tensor data are often rectangular. This is usually the case when the different modes represent different types of data. For example, in Section 3.2, we examine a tensor T ∈ R^{p×n×n} of airline flight data, where T_{i,j,k} represents whether there is a flight from airport j to airport k on airline i. Our approach is to embed the rectangular tensor into a larger square tensor and then symmetrize this tensor, using approaches developed by Ragnarsson and Van Loan [23]. After the embedding, we can run our algorithm to simultaneously cluster the rows, columns, and slices of the tensor.
This approach is similar in style to the symmetrization of bipartite graphs for co-clustering proposed by Dhillon [10]. Let U be an n-by-m-by-ℓ rectangular tensor. Then we embed U into a square three-mode tensor T with n + m + ℓ dimensions, where U_{i,j,k} = T_{i,j+n,k+n+m}. This is illustrated in Figure 1 (left). Then we symmetrize the tensor by using all permutations of the indices, as in Figure 1 (right). When viewed as a 3-by-3-by-3 block tensor, the only non-zero blocks of T are the six permuted copies of U: block (1, 2, 3) holds U^{(1,2,3)} = U, and for each permutation σ of (1, 2, 3), block (σ(1), σ(2), σ(3)) holds U^{(σ)}, where U^{(σ)} is a generalized transpose of U with the dimensions permuted; all other blocks are zero.

Figure 1: The tensor is first embedded into a larger square tensor (left) and then this square tensor is symmetrized (right).

2.6 Summary of the algorithm

Our GTSC algorithm works by recursively applying the sweep cut procedure, similar to recursive bisection procedures for clustering matrix-based data [6]. Formally, for each cut we:
1. Compute the super-spacey stationary vector x (Equation (9)) and form P[x].
2. Compute the second-largest left, real-valued eigenvector z of P̃ = P[x] + x(e^T − e^T P[x]).
3. Sort the vertices by the eigenvector z as z_{σ1} ≤ z_{σ2} ≤ ··· ≤ z_{σn}.
4. Find the set S_k = {σ_1, ..., σ_k} for which the biased conductance φ_x(S_k) on the transition matrix P̃ is minimized.

We continue partitioning as long as the clusters are large enough or we can get good enough splits. Specifically, if a cluster has dimension less than a specified minimum size, we do not consider it for splitting. Otherwise, the algorithm recursively splits the cluster if either (1) its dimension is above some threshold or (2) the biased conductance of a new split is less than a target value φ*.³
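The embedding and symmetrization step of Section 2.5 can be sketched as follows (a dense sketch; the real implementation would only touch the non-zeros of a sparse U):

```python
import numpy as np
from itertools import permutations

def embed_and_symmetrize(U):
    """Embed an n x m x l tensor U into an N^3 tensor, N = n + m + l,
    via U_{i,j,k} = T_{i, j+n, k+n+m}, then symmetrize by filling in all
    permutations of each index triple (in the style of Ragnarsson and
    Van Loan [23])."""
    n, m, l = U.shape
    N = n + m + l
    T = np.zeros((N, N, N))
    for i, j, k in zip(*np.nonzero(U)):
        for a, b, c in permutations((i, j + n, k + n + m)):
            T[a, b, c] = U[i, j, k]
    return T
```

Because the three index ranges are disjoint, the six permuted copies of each entry never collide, so the result is exactly symmetric under every permutation of its modes.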
The overall algorithm, along with its complexity, is summarized in Appendix B. Essentially, the algorithm scales linearly in the number of non-zeros of the tensor for each cluster that is produced.

3 Experiments

We now demonstrate the efficacy of our method by clustering synthetic and real-world data. We find that our method is better at recovering planted cluster structure in synthetically generated tensor data compared to other state-of-the-art methods. Please refer to Appendix C for the parameter details.

3.1 Synthetic data

We generate tensors with planted clusters and try to recover the clusters. For each dataset, we generate 20 groups of nodes that will serve as our planted clusters, where the number of nodes in each group is drawn from a truncated normal distribution with mean 20 and variance 5, so that each group has at least 4 nodes. For each group we also assign a weight that depends on the group number: for group i, the weight is w_i = (σ·sqrt(2π))^{-1} exp(−(i − 10.5)^2/(2σ^2)), where σ varies by experiment. Non-zeros correspond to interactions between three indices (triples). We generate tw triples whose indices are within a group and ta triples whose indices span more than one group.

³We tested φ* from 0.3 to 0.4 and found that the experimental results are not very sensitive to the value of φ*.

Table 1: Adjusted Rand Index (ARI), Normalized Mutual Information (NMI), and F1 scores for various clustering methods recovering synthetically generated tensor data with planted cluster structure.
The ± entries are the standard deviation over 5 trials.

Square tensor with σ = 4
Method   | ARI        | NMI        | F1
GTSC     | 0.99±0.01  | 0.99±0.00  | 0.99±0.01
TSC      | 0.42±0.05  | 0.60±0.04  | 0.45±0.04
PARAFAC  | 0.82±0.05  | 0.94±0.02  | 0.83±0.04
MulDec   | 0.99±0.01  | 0.99±0.01  | 0.99±0.01
SC       | 0.48±0.05  | 0.66±0.03  | 0.51±0.05

Rectangular tensor with σ = 4
Method   | ARI        | NMI        | F1
GTSC     | 0.97±0.06  | 0.98±0.03  | 0.97±0.05
TSC      | 0.38±0.17  | 0.53±0.15  | 0.41±0.16
PARAFAC  | 0.81±0.04  | 0.90±0.02  | 0.82±0.04
MulDec   | 0.91±0.06  | 0.94±0.04  | 0.91±0.06
SC       | 0.27±0.06  | 0.39±0.05  | 0.32±0.05

Square tensor with σ = 2
Method   | ARI        | NMI        | F1
GTSC     | 0.78±0.13  | 0.89±0.06  | 0.79±0.12
TSC      | 0.41±0.11  | 0.60±0.09  | 0.44±0.10
PARAFAC  | 0.48±0.08  | 0.67±0.04  | 0.50±0.07
MulDec   | 0.43±0.07  | 0.66±0.04  | 0.47±0.06
SC       | 0.19±0.01  | 0.37±0.01  | 0.24±0.01

Rectangular tensor with σ = 2
Method   | ARI        | NMI        | F1
GTSC     | 0.96±0.06  | 0.97±0.04  | 0.96±0.06
TSC      | 0.28±0.08  | 0.44±0.10  | 0.32±0.08
PARAFAC  | 0.10±0.04  | 0.24±0.05  | 0.15±0.04
MulDec   | 0.38±0.07  | 0.52±0.05  | 0.41±0.07
SC       | 0.08±0.01  | 0.19±0.02  | 0.14±0.01

The tw triples are chosen by first uniformly selecting a group g, then uniformly selecting three indices i, j, and k from group g, and finally assigning a weight of w_g. For the ta triples, the sampling procedure first selects an index i from group g_i with probability proportional to the weight of the group; in other words, indices in group g are chosen proportionally to w_g. Two indices j and k are then selected uniformly at random from groups g_j and g_k other than g_i. Finally, the weight in the tensor is assigned to be the average of the three group weights.
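The sampling procedure above can be sketched as follows; this is our reading of the text (function name and data layout are ours), not the authors' exact generator.

```python
import numpy as np

def planted_tensor(n_groups=20, tw=10000, ta=1000, sigma=4, seed=0):
    """Generate a dict of (i, j, k) -> weight triples with planted clusters,
    following the description in Section 3.1."""
    rng = np.random.default_rng(seed)
    # group sizes: truncated normal, mean 20, variance 5, at least 4 nodes
    sizes = np.maximum(4, rng.normal(20, np.sqrt(5), n_groups).round().astype(int))
    groups = np.split(np.arange(sizes.sum()), np.cumsum(sizes)[:-1])
    # group weights: w_i = (sigma*sqrt(2*pi))^{-1} exp(-(i-10.5)^2/(2 sigma^2))
    i = np.arange(1, n_groups + 1)
    w = np.exp(-((i - 10.5) ** 2) / (2 * sigma ** 2)) / (sigma * np.sqrt(2 * np.pi))
    triples = {}
    for _ in range(tw):                      # within-group triples
        g = rng.integers(n_groups)
        a, b, c = rng.choice(groups[g], 3)
        triples[(a, b, c)] = w[g]
    for _ in range(ta):                      # cross-group triples
        gi = rng.choice(n_groups, p=w / w.sum())
        gj, gk = rng.choice(np.delete(np.arange(n_groups), gi), 2)
        a = rng.choice(groups[gi]); b = rng.choice(groups[gj]); c = rng.choice(groups[gk])
        triples[(a, b, c)] = (w[gi] + w[gj] + w[gk]) / 3
    return triples, groups
```

The text is ambiguous about whether the three within-group indices are drawn with or without replacement; the sketch draws with replacement.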
For rectangular data, we follow a similar procedure where we distinguish between the indices for each mode of the tensor.

For our experiments, tw = 10,000 and ta = 1,000, and the variance σ that controls the group weights is 2 or 4. For each value of σ, we create 5 sample datasets. The value of σ affects the concentration of the weights and how certain groups of nodes interact with others. This skew reflects properties of the real-world networks we examine in the next section.

We compare our GTSC method with Tensor Spectral Clustering (TSC) [4], the tensor decomposition PARAFAC [1], Spectral Clustering (SC) via multilinear SVD [12], and Multilinear Decomposition (MulDec) [21]. Table 1 reports the performance of the five algorithms in recovering the planted clusters. In all cases, GTSC has the best performance. We note that the running time per trial is a few seconds for GTSC, TSC, and SC and nearly 30 minutes for PARAFAC and MulDec; the tensors have roughly 50,000 non-zeros. This poor scalability prohibits the latter two methods from being applied to the real-world tensors in the following section.

3.2 Case study in airline flight networks

We now turn to studying real-world tensor datasets. We first cluster an airline-airport multimodal network which consists of global air flight routes from 539 airlines and 2,939 airports.⁴ In this application, the entry T_{i,j,k} of the three-mode tensor T is 1 if airline i flies between airports j and k and 0 otherwise.

Figure 2: Visualization of the airline-airport data tensor. The x and y axes index airports and the z axis indexes airlines. A dot represents that an airline flies between those two airports. On the left, indices are sorted randomly. On the right, indices are sorted by the co-clusters found by our GTSC framework, which reveals structure in the tensor.

Figure 2 illustrates the connectivity of the tensor with a random ordering of the indices (left) and the ordering given by the popularity of co-clusters (right). We can see that after the co-clustering, there is clear structure in the data tensor.

One prominent cluster found by the method corresponds to large international airports in cities such as Beijing and New York City. This group accounts for only 8.5% of the total number of airports, but it is responsible for 59% of the total routes. Figure 2 illustrates this result: the airports with the highest indices are connected to almost every other airport. This cluster is analogous to the "stop word" group we will see in the n-gram experiments. Most other clusters are organized geographically. Our GTSC framework finds large clusters for Europe, the United States, China/Taiwan, Oceania/Southeast Asia, and Mexico/the Americas. Interestingly, Cancún International Airport is included with the United States cluster, likely due to large amounts of tourism.

⁴Data were collected from http://openflights.org/data.html#route.

3.3 Case study on n-grams

Next, we study data from n-grams (consecutive sequences of words in text). We construct a square mode-n tensor whose indices correspond to words; an entry in the tensor is the number of occurrences of the corresponding n-gram. We form tensors from both English and Chinese corpora for n = 3, 4.⁵ The non-zeros in the tensor consist of the frequencies of the one million most frequent n-grams.

English n-grams. We find several conclusions that hold for both tensor datasets. Two large groups in both datasets consist of stop words, i.e., frequently occurring connector words. In fact, 48% (3-gram) and 64% (4-gram) of the words in one cluster are prepositions (e.g., in, of, as, to) and link verbs (e.g., is, get, does). In another cluster, 64% (3-gram) and 57% (4-gram) of the words are pronouns (e.g., we, you, them) and link verbs.
These stop-word statistics match the structure of the English language, where link verbs can connect both prepositions and pronouns, whereas prepositions and pronouns are unlikely to appear in close vicinity. Other groups consist mostly of semantically related English words, e.g.,

{cheese, cream, sour, low-fat, frosting, nonfat, fat-free} and
{bag, plastic, garbage, grocery, trash, freezer}.

The clustering of the 4-gram tensor contains some groups that the 3-gram tensor fails to find, e.g.,

{german, chancellor, angela, merkel, gerhard, schroeder, helmut, kohl}.

In this case, Angela Merkel, Gerhard Schroeder, and Helmut Kohl have all been German chancellors, but it requires a 4-gram to make this connection strong. Likewise, some clusters only appear from clustering the 3-gram tensor. One such cluster is

{church, bishop, catholic, priest, greek, orthodox, methodist, roman, episcopal}.

In 3-grams, we may see phrases such as "catholic church bishop", but 4-grams containing these words likely also contain stop words, e.g., "bishop of the church". However, since stop words already form their own cluster, this connection is destroyed.
Chinese n-grams. We find that many of the conclusions from the English n-gram datasets also hold for the Chinese n-gram datasets. This includes groups of stop words and semantically related words. For example, there are two clusters consisting mostly of stop words (the 200 most frequently occurring words) from the 3-gram and 4-gram tensors. In the 4-gram data, one cluster of 31 words consists entirely of stop words and another cluster contains 36 total words, of which 23 are stop words. There are some words from the two groups that are not typically considered as stop words, e.g.,
社会 (society), 经济 (economy), 发展 (develop), 主义 (-ism), 国家 (nation), 政府 (government).
These words are also among the top 200 most common words according to the corpus.
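Cluster statistics like "23 of 36 words are stop words" reduce to a set-overlap computation. A sketch, where the stop-word list and cluster below are made-up inputs for illustration:

```python
def stopword_fraction(cluster, stopwords):
    """Fraction of a cluster's distinct words that appear in a stop-word list."""
    cluster = set(cluster)
    return len(cluster & set(stopwords)) / len(cluster)

# Hypothetical inputs: a tiny stop-word list and a small cluster of words.
stopwords = {"in", "of", "as", "to", "is", "get", "does", "we", "you", "them"}
cluster = ["in", "of", "to", "cheese"]
frac = stopword_fraction(cluster, stopwords)  # 3 of the 4 words are stop words
```

In our experiments the stop-word reference set is simply the top 200 most frequent words in the corpus.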
The presence of these words is a consequence of the dataset coming from scanned Chinese-language books and is a known issue with the Google Books corpus [22]. In this case, it is a feature, as we are illustrating the efficacy of our tensor clustering framework rather than making any linguistic claims.

4 Conclusion

In this paper we developed the General Tensor Spectral Co-clustering (GTSC) method for co-clustering the modes of nonnegative tensor data. Our method models higher-order data with a new stochastic process, the super-spacey random walk, which is a variant of a higher-order Markov chain. With the stationary distribution of this process, we can form a first-order Markov chain which captures properties of the higher-order data and then use tools from spectral graph partitioning to find co-clusters. In future work, we plan to create tensors that bridge information from multiple modes. For instance, clusters in the n-gram data depended on n, e.g., the names of various German chancellors only appeared as a 4-gram cluster. It would be useful to have a holistic tensor to jointly partition both 3- and 4-gram information.
Acknowledgements. TW and DFG are supported by NSF IIS-1422918 and DARPA SIMPLEX. ARB is supported by a Stanford Graduate Fellowship.

5English n-gram data were collected from http://www.ngrams.info/intro.asp and Chinese n-gram data were collected from https://books.google.com/ngrams.

References
[1] B. W. Bader, T. G. Kolda, et al. MATLAB Tensor Toolbox version 2.6. Available online, February 2015.
[2] B.-K. Bao, W. Min, K. Lu, and C. Xu. Social event detection with robust high-order co-clustering. In ICMR, pages 135-142, 2013.
[3] M. Benaïm. Vertex-reinforced random walks and a conjecture of Pemantle. Ann. Prob., 25(1):361-392, 1997.
[4] A. R. Benson, D. F. Gleich, and J. Leskovec. Tensor spectral clustering for partitioning higher-order network structures.
In SDM, pages 118-126, 2015.
[5] A. R. Benson, D. F. Gleich, and L.-H. Lim. The spacey random walk: a stochastic process for higher-order data. arXiv, cs.NA:1602.02102, 2016.
[6] D. Boley. Principal direction divisive partitioning. Data Mining and Knowledge Discovery, 2(4):325-344, 1998.
[7] F. Chung. Laplacians and the Cheeger inequality for directed graphs. Annals of Combinatorics, 9(1):1-19, 2005. doi:10.1007/s00026-005-0237-z.
[8] F. Chung. Four proofs for the Cheeger inequality and graph partition algorithms. In ICCM, 2007.
[9] F. R. K. Chung. Spectral Graph Theory. AMS, 1992.
[10] I. S. Dhillon. Co-clustering documents and words using bipartite spectral graph partitioning. In KDD, pages 269-274, 2001.
[11] B. Gao, T.-Y. Liu, X. Zheng, Q.-S. Cheng, and W.-Y. Ma. Consistent bipartite graph co-partitioning for star-structured high-order heterogeneous data co-clustering. In KDD, pages 41-50, 2005.
[12] D. Ghoshdastidar and A. Dukkipati. Spectral clustering using multilinear SVD: analysis, approximations and applications. In AAAI, pages 2610-2616, 2015.
[13] D. F. Gleich, L.-H. Lim, and Y. Yu. Multilinear PageRank. SIAM J. Matrix Anal. Appl., 36(4):1507-1541, 2015.
[14] M. Hein, S. Setzer, L. Jost, and S. S. Rangapuram. The total variation on hypergraphs: learning on hypergraphs revisited. In Advances in Neural Information Processing Systems, pages 2427-2435, 2013.
[15] H. Huang, C. Ding, D. Luo, and T. Li. Simultaneous tensor subspace selection and clustering: the equivalence of high order SVD and k-means clustering. In KDD, pages 327-335, 2008.
[16] S. Jegelka, S. Sra, and A. Banerjee. Approximation algorithms for tensor clustering. In Algorithmic Learning Theory, pages 368-383. Springer, 2009.
[17] M. Kivelä, A. Arenas, M. Barthelemy, J. P. Gleeson, Y. Moreno, and M. A. Porter.
Multilayer networks. Journal of Complex Networks, 2(3):203-271, 2014.
[18] M. Meilă and J. Shi. A random walks view of spectral segmentation. In AISTATS, 2001.
[19] M. Mihail. Conductance and convergence of Markov chains: a combinatorial treatment of expanders. In FOCS, pages 526-531, 1989.
[20] J. Ni, H. Tong, W. Fan, and X. Zhang. Flexible and robust multi-network clustering. In KDD, pages 835-844, 2015.
[21] E. E. Papalexakis and N. D. Sidiropoulos. Co-clustering as multilinear decomposition with sparse latent factors. In ICASSP, pages 2064-2067. IEEE, 2011.
[22] E. A. Pechenick, C. M. Danforth, and P. S. Dodds. Characterizing the Google Books corpus: strong limits to inferences of socio-cultural and linguistic evolution. PLoS ONE, 10(10):e0137041, 2015.
[23] S. Ragnarsson and C. F. Van Loan. Block tensors and symmetric embeddings. Linear Algebra Appl., 438(2):853-874, 2013.
[24] S. E. Schaeffer. Graph clustering. Computer Science Review, 1(1):27-64, 2007.
[25] W. J. Stewart. Introduction to the Numerical Solution of Markov Chains. Princeton Univ. Press, 1994.
[26] D. Wagner and F. Wagner. Between min cut and graph bisection. In MFCS, pages 744-750, 1993.
[27] D. Zhou and C. J. Burges. Spectral clustering and transductive learning with multiple views. In ICML, pages 1159-1166, 2007.