{"title": "PRUNE: Preserving Proximity and Global Ranking for Network Embedding", "book": "Advances in Neural Information Processing Systems", "page_first": 5257, "page_last": 5266, "abstract": "We investigate an unsupervised generative approach for network embedding. A multi-task Siamese neural network structure is formulated to connect embedding vectors and our objective to preserve the global node ranking and local proximity of nodes. We provide deeper analysis to connect the proposed proximity objective to link prediction and community detection in the network. We show our model can satisfy the following design properties: scalability, asymmetry, unity and simplicity. Experiment results not only verify the above design properties but also demonstrate the superior performance in learning-to-rank, classification, regression, and link prediction tasks.", "full_text": "PRUNE: Preserving Proximity and Global Ranking\n\nfor Network Embedding\n\nYi-An Lai \u2217\u2021\n\nNational Taiwan University\nb99202031@ntu.edu.tw\n\nChin-Chi Hsu \u2020\u2021\nAcademia Sinica\n\nchinchi@iis.sinica.edu.tw\n\nWen-Hao Chen \u2217\n\nNational Taiwan University\nb02902023@ntu.edu.tw\n\nMi-Yen Yeh \u2020\nAcademia Sinica\n\nmiyen@iis.sinica.edu.tw\n\nShou-De Lin \u2217\n\nNational Taiwan University\nsdlin@csie.ntu.edu.tw\n\nAbstract\n\nWe investigate an unsupervised generative approach for network embedding. A\nmulti-task Siamese neural network structure is formulated to connect embedding\nvectors and our objective to preserve the global node ranking and local proximity\nof nodes. We provide deeper analysis to connect the proposed proximity objective\nto link prediction and community detection in the network. 
We show our model can satisfy the following design properties: scalability, asymmetry, unity and simplicity. Experiment results not only verify the above design properties but also demonstrate the superior performance in learning-to-rank, classification, regression, and link prediction tasks.

1 Introduction

Network embedding aims at constructing a low-dimensional latent feature matrix from a sparse high-dimensional adjacency matrix in an unsupervised manner [1-3, 6, 15, 18-21, 23, 24, 26, 31]. Most previous works [1-3, 6, 15, 18-20, 23, 31] try to preserve k-order proximity while performing embedding. That is, given a pair of nodes (i, j), the similarity between their embedding vectors should to a certain extent reflect their k-hop distances (e.g. the number of distinct k-hop paths from node i to j, or the probability that node j is visited via a random walk from i). Proximity reflects local network topology, and can even preserve global network topology such as communities. Some other works directly formulate node embedding to fit community distributions by maximizing the modularity [21, 24].
Although some proximity-based embedding methods have experimentally visualized the community separation in a two-dimensional vector space [2, 3, 6, 18, 20, 23], and some have demonstrated an effective usage scenario in link prediction [6, 15, 19, 23], so far we have not seen a theoretical analysis connecting these three concepts. The first goal of this paper is to propose a proximity model that connects node embedding with link prediction and community detection. There has been some research in a similar direction: [24] proposes an embedding model preserving both proximity and community. However, its objective functions for proximity and community are designed separately, without showing the connection between them.
[26] models an embedding approach considering link prediction, but does not connect it to the preservation of network proximity.

*Department of Computer Science and Information Engineering
†Institute of Information Science
‡These authors contributed equally to this paper.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Besides connecting link prediction and proximity, we also argue that it is beneficial for an embedding model to preserve a network property not specifically addressed in existing research: global node importance ranking. For decades, unsupervised node ranking algorithms such as PageRank [16] and HITS [10] have shown their effectiveness in estimating global node ranks. Besides ranking websites for better search results, node rankings can be useful in other applications. For example, the Webspam Challenge competition 4 requires spam web pages to be ranked lower than non-spam ones; the WSDM 2016 Challenge 5 asks participants to rank papers without supervision data in a billion-sized citation network. Our experiments demonstrate that preserving the global ranking in node embedding can boost the performance not only of a learning-to-rank task, but also of classification and regression tasks trained on node embeddings as features.
In this paper, we propose Proximity and Ranking-preserving Unsupervised Network Embedding (PRUNE), an unsupervised Siamese neural network structure that learns node embeddings from not only community-aware proximity but also global node ranking (see Figure 1). To achieve the above goals, we rely on a generative solution. That is, taking the embedding vectors of the two adjacent nodes of a link as training input, the shared hidden layers of our model non-linearly map node embeddings to optimize a carefully designed objective function.
During training, the objective function, covering global node ranking and community-aware proximity, propagates gradients back to update the embedding vectors. Besides deriving an upper-bound-based objective function from PageRank to represent the global node ranking, we also provide a theoretical connection of the proposed proximity objective function to a general community detection solution. In sum, our model satisfies the following four design characteristics: (I) Scalability [1, 6, 15, 18-21, 23, 26, 31]. We show that for each training epoch, our model enjoys time and space complexity linear in the number of nodes or links. Furthermore, different from some previous works that rely on sampling non-existing links as negative training examples, our model lifts the need to sample negative examples, which not only saves training time but also relieves concerns about sampling bias. (II) Asymmetry [2, 3, 15, 19, 20, 31]. Our model considers link directions to learn the embeddings of either directed or undirected networks. (III) Unity [1, 2, 6, 15, 18, 19, 21, 23, 24, 26, 31]. We perform joint learning to satisfy two different objective goals in a single model. The experiments show that the proposed multi-task neural network structure outperforms a two-stage model. (IV) Simplicity. Empirical verification shows that our model achieves superior performance with only one hidden layer per task and a unified hyperparameter setting, freeing it from fine-tuning the hyperparameters. This property is especially important for an unsupervised learning task due to the lack of validation data for fine-tuning. The source code of the proposed model can be downloaded here 6.

2 Related work

Recently, there is a growing number of works proposing embedding models specifically for network property preservation.
Most of the prior methods extract latent embedding features by singular value decomposition or matrix factorization [1, 3, 8, 15, 19, 21, 22, 24, 28, 30]. Such methods typically define an N-by-N matrix A (N is the number of nodes) that reflects certain network properties, and then factorize A ≈ U^T V or A ≈ U^T U into two low-dimensional embedding matrices U and V.
There are also random-walk-based methods [6, 17, 18, 31] proposing an implicit reduction toward word embedding [14] by gathering random-walk sequences of sampled nodes throughout a network. These methods work well in practice but struggle to explain what network properties are kept in their objective functions [20]. Unsupervised deep autoencoders are also used to learn latent embedding features of A [2, 23], and in particular achieve non-linear mapping capability through activation functions. Finally, some research defines different objective functions, such as the Kullback-Leibler divergence [20] or the Huber loss [26], for network embedding. Please see Table 1 for detailed model comparisons.

4http://webspam.lip6.fr/wiki/pmwiki.php
5https://wsdmcupchallenge.azurewebsites.net/
6https://github.com/ntumslab/PRUNE

Table 1: Model Comparisons. (I) Scalability; (II) Asymmetry; (III) Unity.
Simplicity is not compared, due to the difficulty of comparing models with few sensitive and many insensitive hyperparameters.

Property          Models satisfying it
(I) Scalability   Graph Factorization [1], node2vec [6], HOPE [15], DeepWalk [18], Proximity Embedding [19], LINE [20], SocDim [21], SDNE [23], NRCL [26], APP [31], our PRUNE
(II) Asymmetry    DNGR [2], GraRep [3], HOPE [15], Proximity Embedding [19], LINE [20], APP [31], our PRUNE
(III) Unity       Graph Factorization [1], DNGR [2], node2vec [6], HOPE [15], DeepWalk [18], Proximity Embedding [19], SocDim [21], SDNE [23], M-NMF [24], NRCL [26], APP [31], our PRUNE

Other compared models include LANE [8], TriDNR [17], MMDW [22], TADW [28], and HSCA [30].

Figure 1: PRUNE overview. Each solid arrow represents a non-linear mapping function h between two neural layers.

3 Model

3.1 Problem definition and notations

We are given a directed homogeneous graph or network G = (V, E) as input, where V is the set of vertices or nodes and E is the set of directed edges or links. Let N = |V|, M = |E| be the number of nodes and links in the network. For each node i, we denote by P_i, S_i the sets of direct predecessors and successors of node i, respectively. Therefore, m_i = |P_i| and n_i = |S_i| are the in-degree and out-degree of node i. Matrix A denotes the corresponding adjacency matrix, where each entry a_ij ∈ [0, ∞) is the weight of link (i, j). For simplicity, here we discuss only binary link weights, a_ij ∈ {1, 0} and E = {(i, j) : a_ij = 1}, but solutions for non-negative link weights can be derived in the same manner.
Our goal is to build an unsupervised model that learns a K-dimensional embedding vector u_i ∈ R^K for each node i, such that u_i preserves global node ranking and local proximity information.

3.2 Model overview

The Siamese neural network structure of our model is illustrated in Figure 1. The Siamese architecture has been widely applied to multi-task learning, e.g. [27]. As Figure 1 illustrates, we define a pair of nodes (i, j) as a training instance. Since both i and j refer to the same type of object (i.e. nodes), it is natural to let them share the same hidden layers, which is what the Siamese architecture suggests. We start from the bottom part of Figure 1 to introduce our proximity function. Here the model is trained using each link (i, j) as a training instance. Given (i, j), our model first feeds the current embedding vectors u_i and u_j into the input layer. The values in u_i and u_j are updated by gradients propagated back from the output layers. To learn the mapping from the embedding vectors to the objective functions, we put one hidden layer in between as a bridge. We found empirically that a single hidden layer already yields competitive results, implying that a simple neural network is sufficient to encode graph properties into a vector space, which alleviates the burden of tuning hyperparameters in a neural network. Second, nodes i and j share the same hidden layers in our neural networks, realized by the Siamese architecture.
Each solid arrow in Figure 1 represents the following mapping function:

h(u) = φ(ωu + b)    (1)

where ω, b are the weight matrix and the bias vector, and φ is an activation function leading to non-linear mappings. Our goal is to encode the proximity information in the embedding space. Thus we define a D-dimensional vector z ∈ [0, ∞)^D that represents latent features of a node. In the next sections, we show that the proximity property can be modeled by the interaction between representations z_i and z_j. The mapping from embedding u to z is:

z = φ2(ω2 φ1(ω1 u + b1) + b2).    (2)

We use the same network construction to encode an additional global node ranking score π ≥ 0, which is used to compare the relative ranks between one node and another. Formally, π is mapped from embedding u using the following formula:

π = φ4(ω4 φ3(ω3 u + b3) + b4).    (3)

We impose non-negativity constraints on z and π for better theoretical properties by using non-negative activation functions (e.g. ReLU or softplus) for the outputs φ2 and φ4. The outputs of the other activation functions and all ω, b are not restricted to be non-negative. To incorporate global node ranking information into proximity preservation, we construct the multi-task neural network structure illustrated in Figure 1: the hidden layers for the different network properties share the same embedding space, so u is updated simultaneously by information from multiple objective goals. Unlike a supervised learning task, the model cannot be trained from labeled data.
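Concretely, the two towers defined by (1)-(3) can be sketched as follows. This is a minimal illustration with random placeholder weights, not the authors' released code; the layer sizes and the ELU/ReLU/softplus choices follow the setup reported in Section 4.1, while all variable names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)
K, H, D = 128, 128, 64          # embedding, hidden, and proximity dimensions

def elu(x):                      # hidden-layer activation (phi_1, phi_3)
    return np.where(x > 0, x, np.exp(np.minimum(x, 0.0)) - 1.0)

def relu(x):                     # phi_2: keeps z non-negative
    return np.maximum(x, 0.0)

def softplus(x):                 # phi_4: keeps the ranking score positive
    return np.log1p(np.exp(x))

# weight matrices (omega) and bias vectors (b) for the two towers
W1, b1 = rng.normal(scale=0.1, size=(H, K)), np.zeros(H)
W2, b2 = rng.normal(scale=0.1, size=(D, H)), np.zeros(D)
W3, b3 = rng.normal(scale=0.1, size=(H, K)), np.zeros(H)
W4, b4 = rng.normal(scale=0.1, size=(1, H)), np.zeros(1)

def proximity_representation(u):          # equation (2): u -> z
    return relu(W2 @ elu(W1 @ u + b1) + b2)

def ranking_score(u):                     # equation (3): u -> pi
    return softplus(W4 @ elu(W3 @ u + b3) + b4)

u_i = rng.normal(size=K)                  # embedding of some node i
z_i = proximity_representation(u_i)
pi_i = ranking_score(u_i)
```

Because both towers read the same u, gradients from either output layer reach the shared embedding, which is the Siamese multi-task behavior described above.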
Instead, we introduce the following objective function for weight tuning:

arg min_{π≥0, z≥0, W≥0}  Σ_{(i,j)∈E} ( z_i^T W z_j − max{0, log[M / (α n_i m_j)]} )^2 + λ Σ_{(i,j)∈E} m_j ( π_i / n_i − π_j / m_j )^2.    (4)

The first term aims at preserving the proximity and can be applied independently, as illustrated in Figure 1. The second term corresponds to the global node ranking task, which regularizes the relative scale among ranking scores. Here we introduce a shared matrix W = φ5(ω5) to learn the global linking correlations in the whole network; a non-negative-ranged activation function φ5 guarantees non-negative W. λ controls the relative importance of the two terms. We provide analysis of (4) in the next sections. Since the objective function (4) is differentiable, we can apply mini-batch stochastic gradient descent (SGD) to optimize every ω, b, and even u by propagating gradients top-down from the output layers.
The deterministic mapping in (2) could be misunderstood as implying that u and z capture the same embedding information. However, z specifically captures the proximity property of a network through performing link prediction, while u influences both proximity and global ranking. The reason to use z instead of u for link prediction is that we believe node ranking and link prediction are two naturally different tasks (though their information can be shared, since highly ranked nodes tend to have better connectivity to others); forcing one single embedding representation u to achieve both goals can lead to a compromised solution.
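To make (4) concrete, the following sketch (our own minimal illustration, not the released implementation; the function and variable names are ours) evaluates the objective for a batch of links, given tower outputs z and π, the shared matrix W, and the degree counts:

```python
import numpy as np

def prune_loss(z, pi, W, edges, n_out, m_in, M, alpha=5.0, lam=0.01):
    """Objective (4): proximity term plus lambda-weighted ranking term.

    z     : (N, D) proximity representations
    pi    : (N,)   ranking scores
    W     : (D, D) shared non-negative matrix
    edges : iterable of observed links (i, j)
    n_out : out-degrees n_i;  m_in : in-degrees m_j;  M : number of links
    """
    loss = 0.0
    for i, j in edges:
        # filtered shifted-PMI target, as in the first term of (4)
        target = max(0.0, np.log(M / (alpha * n_out[i] * m_in[j])))
        prox = (z[i] @ W @ z[j] - target) ** 2
        # PageRank-upper-bound term, weighted by the in-degree m_j
        rank = m_in[j] * (pi[i] / n_out[i] - pi[j] / m_in[j]) ** 2
        loss += prox + lam * rank
    return loss
```

In the actual model this quantity would be minimized by mini-batch SGD over ω, b, W, and the embeddings u; the sketch only shows how a batch contributes to the loss.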
Instead, z can be treated as some "distilled" information extracted from u specifically for link prediction, which can prevent our model from settling to a mediocre u that fails to satisfy both goals directly.

3.3 Proximity preservation as PMI matrix tri-factorization

The first term in (4) aims at preserving the proximity property of the input network. We focus on the first-order and second-order proximity, which are explicitly addressed in several proximity-based methods [3, 20, 23, 24]. The first-order proximity refers to whether a node pair (i, j) is connected in an unweighted graph. In an input network, links (i, j) ∈ E are observed as positive training examples a_ij = 1. Thus, their latent inner product z_i^T W z_j should be increased to reflect such a close linking relationship. Nonetheless, usually another set F of randomly chosen node pairs (i, k) is required as negative training examples. Since set F does not exist in the input network, one can sample α target nodes k (with probability proportional to in-degree m_k) to form negative examples (i, k). That is, given source node i, we emphasize the existence of link (i, j) by distinguishing whether the corresponding target node is observed ((i, j) ∈ E) or not ((i, k) ∈ F). We can construct a binary logistic regression model to distinguish E and F:

arg max_{z,W}  E_{(i,j)∈E} [ log σ(z_i^T W z_j) ] + α E_{(i,k)∈F} [ log( 1 − σ(z_i^T W z_k) ) ]    (5)

where E denotes an expected value and σ(x) = 1 / (1 + exp(−x)) is the sigmoid function. Inspired by the derivations in [12], we have the following conclusion:

Lemma 3.1. Let y_ij = z_i^T W z_j. Setting the first-order derivative of (5) over y_ij to zero yields the closed-form solution

y_ij = log( M / (α n_i m_j) ) = log( p_{s,t}(i, j) / (p_s(i) p_t(j)) ) − log α    (6)

where p_{s,t}(i, j) = 1/|E| = 1/M is the joint probability of link (positive example) (i, j) in set E, p_s(i) = n_i / M follows a distribution proportional to the out-degree n_i of source node i, and p_t(j) = m_j / M follows another distribution proportional to the in-degree m_j of target node j.

Proof. Please refer to our Supplementary Material Section 2.

Clearly, (6) is the pointwise mutual information (PMI) shifted by log α, which can be viewed as a link weight in terms of out-degree n_i and in-degree m_j. If we directly minimize the difference between the two sides of (6) rather than maximize (5), then we are free from sampling negative examples (i, k) to train the model. Following the suggestions in [12], we filter out negative (less informative) PMI values as shown in (4), which further improves performance.
The second-order proximity refers to the fact that the similarity of z_i and z_j is higher if nodes i, j have similar sets of direct predecessors and successors (that is, the similarity reflects 2-hop distance relationships). Now we present how to preserve the second-order proximity using tri-factorization-based link prediction [13, 32]. Let A_PMI, whose entry for (i, j) ∈ E is max{0, log( M / (α n_i m_j) )} and is missing otherwise, be the corresponding PMI matrix. Link prediction aims to predict the missing PMI values in A_PMI. Factorization methods suppose A_PMI has low rank D, and then learn the matrix tri-factorization Z^T W Z ≈ A_PMI using the non-missing entries. Matrix Z = [z_1 z_2 ... z_N] aligns latent representations with link distributions. Compared with the classical factorization Z^T V, such tri-factorization supports the asymmetric transitivity property of directed links.
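As a toy illustration of (6) and the filtering used in (4), the shifted-PMI targets of a hypothetical 3-node graph can be computed directly from the adjacency matrix (this is our own sketch, not the released code):

```python
import numpy as np

# Adjacency matrix of a toy directed graph: 0 -> 1, 0 -> 2, 1 -> 2.
A = np.array([[0, 1, 1],
              [0, 0, 1],
              [0, 0, 0]])

M = A.sum()                      # number of links, M = 3
n_out = A.sum(axis=1)            # out-degrees n_i
m_in = A.sum(axis=0)             # in-degrees m_j
alpha = 1.0                      # negative-sampling ratio in (5)-(6)

targets = {}
for i, j in zip(*np.nonzero(A)):
    pmi = np.log(M / (alpha * n_out[i] * m_in[j]))   # equation (6)
    targets[(i, j)] = max(0.0, pmi)                  # filter negative PMI
```

Here the link (0, 2) receives target 0 because its PMI, log(3/4), is negative and is filtered out, whereas (0, 1) and (1, 2) both receive log(3/2); these are the values z_i^T W z_j is trained to match in the first term of (4).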
Specifically, the existence of two directed links (i, j) (raising z_i^T W z_j) and (j, k) (raising z_j^T W z_k) increases the likelihood of (i, k) (z_i^T W z_k) via the representation propagation z_i → z_j → z_k, but not of (k, i), due to the asymmetric W. Then we have the following lemma:

Lemma 3.2. The matrix tri-factorization Z^T W Z ≈ A_PMI preserves the second-order proximity.

Proof. Please refer to our Supplementary Material Section 3.

Next, we discuss the connection between matrix tri-factorization and community. Different from the heuristic statements in [13, 32], we argue that the representation vector z_i captures a D-community distribution for node i (each dimension is proportional to the probability that node i belongs to a certain community), and the shared matrix W implies the interactions among these D communities.

Lemma 3.3. The matrix tri-factorization z_i^T W z_j can be regarded as the expectation of community interactions with respect to the distributions of link (i, j):

z_i^T W z_j ∝ E_{(i,j)}[W] = Σ_{c=1}^{D} Σ_{d=1}^{D} Pr(i ∈ C_c) Pr(j ∈ C_d) w_cd,    (7)

where each entry w_cd is the expected number of interactions from community c to d, and C_c denotes the set of nodes in community c.

Proof. Please refer to the Supplementary Material Section 4.

Based on the binary classification model (5), when a true link (i, j) is observed in the training data, the corresponding inner product z_i^T W z_j is increased, which is equivalent to raising the expectation E_{(i,j)}[W].
To summarize, the derivation from the logistic classification (5) to the PMI matrix tri-factorization (6) shows that the tri-factorization model preserves the first-order proximity. Then Lemma 3.2 proves the preservation of the second-order proximity.
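The identity in Lemma 3.3 can be checked numerically: whenever z_i and z_j are (proportional to) community membership distributions, the bilinear form z_i^T W z_j equals the double sum over all D^2 community pairs. A small sketch with toy values of our own choosing:

```python
import numpy as np

rng = np.random.default_rng(1)
D = 4                                    # number of communities

z_i = rng.random(D); z_i /= z_i.sum()    # Pr(i in C_c), c = 1..D
z_j = rng.random(D); z_j /= z_j.sum()    # Pr(j in C_d), d = 1..D
W = rng.random((D, D))                   # w_cd: expected interactions c -> d

# left-hand side of (7): the tri-factorization bilinear form
bilinear = z_i @ W @ z_j

# right-hand side of (7): expectation over all community pairs (c, d)
expectation = sum(z_i[c] * z_j[d] * W[c, d]
                  for c in range(D) for d in range(D))
```

The two quantities agree to machine precision, which is exactly the expansion used in the proof of Lemma 3.3.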
Besides, if a non-negative constraint is imposed, Lemma 3.3 shows that the tri-factorization model can be interpreted as capturing community interactions. That is, our proximity-preserving loss achieves first-order proximity, second-order proximity, and community preservation.
Given non-negative log( M / (n_i m_j) ), as in our setting in (4), we make another observation on community detection. (6) can be rewritten as the following equation:

1 − exp( −z_i^T W z_j ) = 1 − n_i m_j / M,    (8)

where the left-hand side is the probability P(X^(i,j) > 0) = 1 − P(X^(i,j) = 0) and the right-hand side is the modularity with a_ij = 1, α = 1. Following Lemma 3.3, we can then derive:

Lemma 3.4. The left-hand side of (8) is the probability P(X^(i,j) > 0), where 0 ≤ X^(i,j) ≤ D^2 represents the total number of interactions between all the community pairs (c, d), 1 ≤ c ≤ D, 1 ≤ d ≤ D, that affect the existence of link (i, j), following a Poisson distribution P(X^(i,j)) with mean z_i^T W z_j.

Proof. Please refer to the Supplementary Material Section 5.

In fact, either side of Equation (8) evaluates the likelihood of the occurrence of a link. For the left-hand side, as shown in reference [29] and our Supplementary Material Section 5, an existing link implies at least one community interaction (X > 0), whose probability is assumed to follow a Poisson distribution with mean equal to the tri-factorization value. The right-hand side is commonly regarded as the "modularity" [11], which measures the difference between links in the observed data and links from random generation. Modularity is commonly used as an evaluation metric for the quality of a community detection algorithm (see [21, 24]). A deeper investigation of Equation (8) is left for future work.

3.4 Global node ranking preservation as PageRank upper bound

Here we connect the second objective to PageRank. To be more precise, the second term in (4) (without parameter λ) comes from an upper bound of the PageRank assumption. PageRank [16] is arguably the most common unsupervised method to evaluate the rank of a node. It claims that the ranking score π_j of a node j is the probability of visiting j through random walks. π_j, ∀ j ∈ V, can be obtained by accumulating the ranking scores of the direct predecessors i, weighted by the reciprocal of the out-degree n_i. One can express PageRank as the minimization of the squared loss L = Σ_{j∈V} ( Σ_{i∈P_j} π_i / n_i − π_j )^2. Here the probability constraint Σ_{i∈V} π_i = 1 is not considered, since we care only about the relative rankings; the damping factor in PageRank is not considered either, for model simplicity. Unfortunately, it is infeasible to apply SGD to update L, since the summation over i ∈ P_j is inside the square, violating the standard SGD assumption L = Σ_{(i,j)∈E} L_ij where each sub-objective function L_ij is relevant to a single training link (i, j). Instead, we choose to minimize an upper bound.

Lemma 3.5. By the Cauchy-Schwarz inequality, we have the following upper bound:

Σ_{j∈V} ( Σ_{i∈P_j} π_i / n_i − π_j )^2 ≤ Σ_{(i,j)∈E} m_j ( π_i / n_i − π_j / m_j )^2.    (9)

Proof. Please refer to our Supplementary Material Section 6.

The proof of the approximation ratio of the upper bound (9) is left as future work. Nevertheless, as will be shown later, the experiments demonstrate the effectiveness of this upper bound. Intuitively, (9) minimizes the difference between π_i / n_i and π_j / m_j, weighted by the in-degree m_j. This can be explained by the following lemma:

Lemma 3.6.
The objective π_i / n_i = π_j / m_j on the right-hand side of (9) is a sufficient condition for the objective Σ_{i∈P_j} π_i / n_i = π_j on the left-hand side of (9).

Proof. Please refer to our Supplementary Material Section 7.

3.5 Discussion

We mentioned four major advantages of our model in the introduction; here we provide in-depth discussions of them. (I) Scalability. Since only the positive links are used for training, during SGD our model spends O(M Ω^2) time per epoch, where Ω is the maximum number of neurons in a layer of our model, usually in the hundreds. Our model also costs only O(N + M) space to store the input network, and the sparse PMI matrix consumes O(M) non-zero entries. In practice Ω^2 ≪ M, so our model is scalable. (II) Asymmetry. By observation of (4), replacing (i, j) with (j, i) leads to different results, since W and the PageRank upper bound are asymmetric. (III) Unity. All the objectives in our model are jointly optimized in a multi-task Siamese neural network. (IV) Simplicity. As the experiments show, our model performs well with single hidden layers and the same hyperparameter setting across all the datasets, which alleviates the difficult hyperparameter determination for unsupervised network embedding.

4 Experiments

4.1 Settings

Datasets. We benchmark our model on three real-world networks from different application domains:
(I) Hep-Ph 7. A paper citation network from 1993 to 2003, including 34,546 papers and 421,578 citation relationships. Following the same setup as [25], we use citations before 1999 for embedding generation, and then evaluate paper ranks using the number of citations after 2000.
(II) Webspam 8. A web page network used in the Webspam Challenges. There are 114,529 web pages and 1,836,441 hyperlinks.
Participants are challenged to build a model that ranks the 1,933 labeled non-spam web pages higher than the 122 labeled spam ones.
(III) FB Wall Post 9. A previous task [7] aims at ranking active users in a 63,731-user, 831,401-link wall-post network from the social media website Facebook, New Orleans, 2009. Nodes denote users, and a link implies that a user has posted at least one article on someone's wall. 14,862 users are marked active, that is, they continue to post articles in the three weeks after a certain date. The goal is to rank active users over inactive ones.
Competitors. We compare the performance of our model with DeepWalk [18], LINE [20], node2vec [6], SDNE [23] and NRCL [26]. DeepWalk, LINE and node2vec are popular models used in various applications. SDNE proposes another neural network structure to embed networks. NRCL is one of the state-of-the-art network embedding models, specially designed for link prediction. Note that NRCL encodes external node attributes into network embedding, but we discard this part since such information is not assumed available in our setup.
Model Setup. For all experiments, our model fixes the node embedding and hidden layers to be 128-dimensional and the proximity representation to be 64-dimensional. Exponential Linear Unit (ELU) [4] activation is adopted in hidden layers for faster learning, while output layers use softplus activation for the node ranking score and Rectified Linear Unit (ReLU) [5] activation for the proximity representation, to avoid negative-or-zero scores as well as negative representation values. We recommend and fix α = 5, λ = 0.01. All training uses a batch size of 1024 and the Adam [9] optimizer with learning rate 0.0001.

7http://snap.stanford.edu/data/cit-HepPh.html
8http://chato.cl/webspam/datasets/uk2007/
9http://socialnetworks.mpi-sws.org/data-wosn2009.html

Evaluation.
Similar to previous works, we evaluate our embeddings through supervised learning tasks. That is, we examine whether the proposed embedding yields better results for (1) learning-to-rank, (2) classification and regression, and (3) link prediction tasks.

4.2 Results

In the following paragraphs, we call our proposed model PRUNE; PRUNE without the global ranking part is named TriFac below.
Learning-to-rank. In this setting, we use the pairwise approach that formulates learning-to-rank as a binary classification problem, taking embeddings as node attributes. A linear Support Vector Machine with regularization C = 1.0 is used as our learning-to-rank classifier. We train on 80% and evaluate on 20% of each dataset. Since Webspam and FB Wall Post possess binary labels, we choose Area Under the ROC Curve (AUC) as the evaluation metric. Following the setting in [25], the Hep-Ph citation count is a real value and thus suits Spearman's rank correlation coefficient better.
The results in Table 2 show that PRUNE significantly outperforms the competitors. Note that PRUNE, which incorporates global node ranking via multi-task learning, has superior performance compared with TriFac, which considers only the proximity. This shows that the unsupervised global ranking we model is positively correlated with the rankings in these learning-to-rank tasks. The multi-task learning also enriches the underlying interactions between the two tasks and is the key to the better performance of PRUNE.

Table 2: Learning-to-rank performance (†: outperforms 2nd-best with p-value < 0.01).

Dataset        Evaluation   DeepWalk   LINE    node2vec   SDNE    NRCL    TriFac   PRUNE
Hep-Ph         Rank Corr.   0.327      0.494   0.485      0.353   0.554   0.430    0.621†
Webspam        AUC          0.839      0.843   0.821      0.821   0.800   0.818    0.853†
FB Wall Post   AUC          0.573      0.730   0.702      0.749   0.747   0.712    0.765†

Classification and Regression.
In this experiment, embedding outputs are directly used for binary node classification on Webspam and FB Wall Post and for node regression on Hep-Ph. We observe only 80% of the nodes during training and predict the labels of the remaining 20%. Random Forest and Support Vector Regression are used for classification and regression, respectively. Classification is evaluated by AUC and regression by the Root Mean Square Error (RMSE). Table 3 shows that PRUNE reaches the lowest RMSE on the regression task and the highest AUC on the two classification tasks among the embedding algorithms, while TriFac is competitive with the others. The results show that the global ranking we model contains useful information for capturing certain node properties.

Table 3: Classification and regression performance (†: outperforms 2nd-best with p-value < 0.01).

Dataset        Evaluation   DeepWalk   LINE     node2vec   SDNE     NRCL     TriFac   PRUNE
Hep-Ph         RMSE         12.079     12.307   11.909     12.429   12.451   11.967   11.720†
Webspam        AUC          0.620      0.597    0.622      0.578    0.605    0.576    0.637†
FB Wall Post   AUC          0.733      0.707    0.744      0.752    0.759    0.763    0.775†

Link Prediction. We randomly split the network edges into 80%-20% train-test subsets as positive examples and sample an equal number of unconnected node pairs as negative examples. Embeddings are learned on the training set and performance is evaluated on the test set. Logistic regression is adopted as the link prediction algorithm and models are evaluated by AUC. The results in Table 4 show that PRUNE significantly outperforms all counterparts, while TriFac is competitive with the others. These results, together with the previous two experiments, demonstrate the effectiveness of PRUNE for diverse network applications.
Robustness to Noisy Data. In real-world settings, usually only a partial network is observable, as links can be missing.
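The missing-edge setup can be sketched as follows; this is our reading of the protocol (drop a fraction of edges uniformly at random, then embed the remaining partial network), and the function names are ours:

```python
import random

def drop_edges(edges, drop_rate, seed=0):
    """Keep each edge independently with probability (1 - drop_rate),
    simulating a partially observed network."""
    rng = random.Random(seed)
    return [e for e in edges if rng.random() >= drop_rate]

# toy complete digraph on 10 nodes; embeddings would then be learned on
# `partial` and the learning-to-rank task rerun at each drop rate
edges = [(i, j) for i in range(10) for j in range(10) if i != j]
partial = drop_edges(edges, drop_rate=0.3)
```

Sweeping drop_rate from 0.0 to 0.9 reproduces the x-axis of the analysis described next.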
Perturbation analysis is therefore conducted to verify the robustness of the models by measuring learning-to-rank performance when different fractions of edges are missing. Figure 2 shows that PRUNE persistently outperforms the competitors across different fractions of missing edges. The results demonstrate its robustness to missing edges, which is crucial for evolving or costly-to-construct networks.

Table 4: Link prediction performance (†: outperforms 2nd-best with p-value < 0.01).

Dataset       DeepWalk  LINE   node2vec  SDNE   NRCL   TriFac  PRUNE
Hep-Ph        0.803     0.796  0.805     0.688  0.814  0.751   0.861†
Webspam       0.885     0.954  0.894     0.910  0.946  0.953   0.973†
FB Wall Post  0.828     0.781  0.853     0.731  0.855  0.858   0.878†

Figure 2: Perturbation analysis for learning-to-rank on Hep-Ph and FB Wall Post.

Discussions. The superiority of PRUNE can be summarized by the features of the models:
(I) We have an explicit objective to optimize. Random-walk-based models (i.e. DeepWalk, node2vec) lack such objectives; moreover, noise is introduced during the random walk procedure.
(II) We are the only model that considers global node ranking information.
(III) We preserve first- and second-order proximity and consider asymmetry (i.e. the direction of links). NRCL preserves only first-order proximity and does not consider asymmetry. SDNE does not consider asymmetry either. LINE does not handle first-order and second-order proximity jointly but instead treats them independently.

5 Conclusion

We propose a multi-task Siamese deep neural network to generate network embeddings that preserve global node ranking and community-aware proximity. We design a novel objective function for embedding training and provide the corresponding theoretical interpretation. 
The experiments show that preserving the properties we propose can indeed improve the performance of supervised learning tasks that use the embedding as features.

Acknowledgments

This study was supported in part by the Ministry of Science and Technology (MOST) of Taiwan, R.O.C., under Contracts 105-2628-E-001-002-MY2, 106-2628-E-006-005-MY3, 104-2628-E-002-015-MY3 and 106-2218-E-002-014-MY4, the Air Force Office of Scientific Research, Asian Office of Aerospace Research and Development (AOARD) under award number FA2386-17-1-4038, and Microsoft under Contract FY16-RES-THEME-021. All opinions, findings, conclusions, and recommendations in this paper are those of the authors and do not necessarily reflect the views of the funding agencies.

References

[1] Amr Ahmed, Nino Shervashidze, Shravan Narayanamurthy, Vanja Josifovski, and Alexander J. Smola. Distributed large-scale natural graph factorization. WWW '13.
[2] Shaosheng Cao, Wei Lu, and Qiongkai Xu. Deep neural networks for learning graph representations. AAAI '16.
[3] Shaosheng Cao, Wei Lu, and Qiongkai Xu. GraRep: Learning graph representations with global structural information. CIKM '15.
[4] Djork-Arné Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). CoRR, 2015.
[5] Xavier Glorot, Antoine Bordes, and Yoshua Bengio. Deep sparse rectifier neural networks. AISTATS '11.
[6] Aditya Grover and Jure Leskovec. node2vec: Scalable feature learning for networks. KDD '16.
[7] Julia Heidemann, Mathias Klier, and Florian Probst. Identifying key users in online social networks: A PageRank based approach. 
ICIS '10.
[8] Xiao Huang, Jundong Li, and Xia Hu. Label informed attributed network embedding. WSDM '17.
[9] Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. CoRR, 2014.
[10] Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. J. ACM, 1999.
[11] Elizabeth A. Leicht and Mark E. J. Newman. Community structure in directed networks. Physical Review Letters, 2008.
[12] Omer Levy and Yoav Goldberg. Neural word embedding as implicit matrix factorization. NIPS '14.
[13] Aditya Krishna Menon and Charles Elkan. Link prediction via matrix factorization. ECML PKDD '11.
[14] Tomas Mikolov, Ilya Sutskever, Kai Chen, Greg S. Corrado, and Jeff Dean. Distributed representations of words and phrases and their compositionality. NIPS '13.
[15] Mingdong Ou, Peng Cui, Jian Pei, Ziwei Zhang, and Wenwu Zhu. Asymmetric transitivity preserving graph embedding. KDD '16.
[16] Lawrence Page, Sergey Brin, Rajeev Motwani, and Terry Winograd. The PageRank citation ranking: Bringing order to the web. Technical report, Stanford InfoLab, 1999.
[17] Shirui Pan, Jia Wu, Xingquan Zhu, Chengqi Zhang, and Yang Wang. Tri-party deep network representation. IJCAI '16.
[18] Bryan Perozzi, Rami Al-Rfou, and Steven Skiena. DeepWalk: Online learning of social representations. KDD '14.
[19] Han Hee Song, Tae Won Cho, Vacha Dave, Yin Zhang, and Lili Qiu. Scalable proximity estimation and link prediction in online social networks. IMC '09.
[20] Jian Tang, Meng Qu, Mingzhe Wang, Ming Zhang, Jun Yan, and Qiaozhu Mei. LINE: Large-scale information network embedding. WWW '15.
[21] Lei Tang and Huan Liu. Relational learning via latent social dimensions. KDD '09.
[22] Cunchao Tu, Weicheng Zhang, Zhiyuan Liu, and Maosong Sun. Max-margin DeepWalk: Discriminative learning of network representation. IJCAI '16.
[23] Daixin Wang, Peng Cui, and Wenwu Zhu. 
Structural deep network embedding. KDD '16.
[24] Xiao Wang, Peng Cui, Jing Wang, Jian Pei, Wenwu Zhu, and Shiqiang Yang. Community preserving network embedding. AAAI '17.
[25] Yujing Wang, Yunhai Tong, and Ming Zeng. Ranking scientific articles by exploiting citations, authors, journals, and time information. AAAI '13.
[26] Xiaokai Wei, Linchuan Xu, Bokai Cao, and Philip S. Yu. Cross view link prediction by learning noise-resilient representation consensus. WWW '17.
[27] Zhizheng Wu, Cassia Valentini-Botinhao, Oliver Watts, and Simon King. Deep neural networks employing multi-task learning and stacked bottleneck features for speech synthesis. ICASSP '15.
[28] Cheng Yang, Zhiyuan Liu, Deli Zhao, Maosong Sun, and Edward Y. Chang. Network representation learning with rich text information. IJCAI '15.
[29] Jaewon Yang and Jure Leskovec. Overlapping community detection at scale: A nonnegative matrix factorization approach. WSDM '13.
[30] D. Zhang, J. Yin, X. Zhu, and C. Zhang. Homophily, structure, and content augmented network representation learning. ICDM '16.
[31] Chang Zhou, Yuqiong Liu, Xiaofei Liu, Zhongyi Liu, and Jun Gao. Scalable graph embedding for asymmetric proximity. AAAI '17.
[32] Shenghuo Zhu, Kai Yu, Yun Chi, and Yihong Gong. Combining content and link for classification using matrix factorization. SIGIR '07.