{"title": "A Linear Time Active Learning Algorithm for Link Classification", "book": "Advances in Neural Information Processing Systems", "page_first": 1610, "page_last": 1618, "abstract": "We present very efficient active learning algorithms for link classification in signed networks. Our algorithms are motivated by a stochastic model in which edge labels are obtained through perturbations of an initial sign assignment consistent with a two-clustering of the nodes. We provide a theoretical analysis within this model, showing that we can achieve an optimal (to within a constant factor) number of mistakes on any graph $G = (V,E)$ such that $|E|$ is at least order of $|V|^{3/2}$ by querying at most order of $|V|^{3/2}$ edge labels. More generally, we show an algorithm that achieves optimality to within a factor of order $k$ by querying at most order of $|V| + (|V|/k)^{3/2}$ edge labels. The running time of this algorithm is at most of order $|E| + |V|\\log|V|$.", "full_text": "A Linear Time Active Learning Algorithm for Link Classification*

Nicolò Cesa-Bianchi
Dipartimento di Informatica
Università degli Studi di Milano, Italy

Claudio Gentile
Dipartimento di Scienze Teoriche ed Applicate
Università dell'Insubria, Italy

Fabio Vitale
Dipartimento di Informatica
Università degli Studi di Milano, Italy

Giovanni Zappella
Dipartimento di Matematica
Università degli Studi di Milano, Italy

Abstract

We present very efficient active learning algorithms for link classification in signed networks. Our algorithms are motivated by a stochastic model in which edge labels are obtained through perturbations of an initial sign assignment consistent with a two-clustering of the nodes.
We provide a theoretical analysis within this model, showing that we can achieve an optimal (to within a constant factor) number of mistakes on any graph G = (V, E) such that |E| = Ω(|V|^{3/2}) by querying O(|V|^{3/2}) edge labels. More generally, we show an algorithm that achieves optimality to within a factor of O(k) by querying at most order of |V| + (|V|/k)^{3/2} edge labels. The running time of this algorithm is at most of order |E| + |V| log |V|.

1 Introduction

A rapidly emerging theme in the analysis of networked data is the study of signed networks. From a mathematical point of view, signed networks are graphs whose edges carry a sign representing the positive or negative nature of the relationship between the incident nodes. For example, in a protein network two proteins may interact in an excitatory or inhibitory fashion. The domain of social networks and e-commerce offers several examples of signed relationships: Slashdot users can tag other users as friends or foes, Epinions users can rate other users positively or negatively, Ebay users develop trust and distrust towards sellers in the network. More generally, two individuals that are related because they rate similar products in a recommendation website may agree or disagree in their ratings.

The availability of signed networks has stimulated the design of link classification algorithms, especially in the domain of social networks. Early studies of signed social networks date back to the Fifties: e.g., [8] and [1] model dislike and distrust relationships among individuals as (signed) weighted edges in a graph. The conceptual underpinning is provided by the theory of social balance, formulated as a way to understand the structure of conflicts in a network of individuals whose mutual relationships can be classified as friendship or hostility [9].
The advent of online social networks has revamped the interest in these theories, and spurred a significant amount of recent work —see, e.g., [7, 11, 14, 3, 5, 2], and references therein.

Many heuristics for link classification in social networks are based on a form of social balance summarized by the motto "the enemy of my enemy is my friend". This is equivalent to saying that the signs on the edges of a social graph tend to be consistent with some two-clustering of the nodes. By consistency we mean the following: the nodes of the graph can be partitioned into two sets (the two clusters) in such a way that edges connecting nodes from the same set are positive, and edges connecting nodes from different sets are negative.

*This work was supported in part by the PASCAL2 Network of Excellence under EC grant 216886 and by "Dote Ricerca", FSE, Regione Lombardia. This publication only reflects the authors' views.

Although two-clustering heuristics do not require strict consistency to work, this is admittedly a rather strong inductive bias. Despite that, social network theorists and practitioners found this to be a reasonable bias in many social contexts, and recent experiments with online social networks reported a good predictive power for algorithms based on the two-clustering assumption [11, 13, 14, 3].
Finally, this assumption is also fairly convenient from the viewpoint of algorithmic design.

In the case of undirected signed graphs G = (V, E), the best performing heuristics exploiting the two-clustering bias are based on spectral decompositions of the signed adjacency matrix. Noticeably, these heuristics run in time Ω(|V|^2), and often require a similar amount of memory storage even on sparse networks, which makes them impractical on large graphs.

In order to obtain scalable algorithms with formal performance guarantees, we focus on the active learning protocol, where training labels are obtained by querying a desired subset of edges. Since the allocation of queries can match the graph topology, a wide range of graph-theoretic techniques can be applied to the analysis of active learning algorithms. In the recent work [2], a simple stochastic model for generating edge labels by perturbing some unknown two-clustering of the graph nodes was introduced. For this model, the authors proved that querying the edges of a low-stretch spanning tree of the input graph G = (V, E) is sufficient to predict the remaining edge labels making a number of mistakes within a factor of order (log |V|)^2 log log |V| from the theoretical optimum. The overall running time is O(|E| ln |V|). This result leaves two main problems open: first, low-stretch trees are a powerful structure, but the algorithm to construct them is not easy to implement; second, the tree-based analysis of [2] does not generalize to query budgets larger than |V| − 1 (the edge set size of a spanning tree). In this paper we introduce a different active learning approach for link classification that can accommodate a large spectrum of query budgets. We show that on any graph with Ω(|V|^{3/2}) edges, a query budget of O(|V|^{3/2}) is sufficient to predict the remaining edge labels within a constant factor from the optimum.
More generally, we show that a budget of at most order of |V| + (|V|/k)^{3/2} queries is sufficient to make a number of mistakes within a factor of O(k) from the optimum with a running time of order |E| + (|V|/k) log(|V|/k). Hence, a query budget of Θ(|V|), of the same order as the algorithm based on low-stretch trees, achieves an optimality factor O(|V|^{1/3}) with a running time of just O(|E|).

At the end of the paper we also report on a preliminary set of experiments on medium-sized synthetic and real-world datasets, where a simplified algorithm suggested by our theoretical findings is compared against the best performing spectral heuristics based on the same inductive bias. Our algorithm seems to perform similarly or better than these heuristics.

2 Preliminaries and notation

We consider undirected and connected graphs G = (V, E) with unknown edge labeling Y_{i,j} ∈ {−1, +1} for each (i, j) ∈ E. Edge labels can collectively be represented by the associated signed adjacency matrix Y, where Y_{i,j} = 0 whenever (i, j) ∉ E. In the sequel, the edge-labeled graph G will be denoted by (G, Y).

We define a simple stochastic model for assigning binary labels Y to the edges of G. This is used as a basis and motivation for the design of our link classification strategies. As we mentioned in the introduction, a good trade-off between accuracy and efficiency in link classification is achieved by assuming that the labeling is well approximated by a two-clustering of the nodes. Hence, our stochastic labeling model assumes that edge labels are obtained by perturbing an underlying labeling which is initially consistent with an arbitrary (and unknown) two-clustering. More formally, given an undirected and connected graph G = (V, E), the labels Y_{i,j} ∈ {−1, +1}, for (i, j) ∈ E, are assigned as follows.
First, the nodes in V are arbitrarily partitioned into two sets, and labels Y_{i,j} are initially assigned consistently with this partition (within-cluster edges are positive and between-cluster edges are negative). Note that this consistency is equivalent to the following multiplicative rule: for any (i, j) ∈ E, the label Y_{i,j} is equal to the product of signs on the edges of any path connecting i to j in G. This is in turn equivalent to saying that any simple cycle within the graph contains an even number of negative edges. Then, given a nonnegative constant p < 1/2, labels are randomly flipped in such a way that P(Y_{i,j} is flipped) ≤ p for each (i, j) ∈ E. We call this a p-stochastic assignment. Note that this model allows for correlations between flipped labels.

A learning algorithm in the link classification setting receives a training set of signed edges and, out of this information, builds a prediction model for the labels of the remaining edges. It is quite easy to prove a lower bound on the number of mistakes that any learning algorithm makes in this model.

Fact 1. For any undirected graph G = (V, E), any training set E_0 ⊂ E of edges, and any learning algorithm A that is given the labels of the edges in E_0, the number M of mistakes made by A on the remaining E \ E_0 edges satisfies E M ≥ p |E \ E_0|, where the expectation is with respect to a p-stochastic assignment of the labels Y.

Proof. Let Y be the following randomized labeling: first, edge labels are set consistently with an arbitrary two-clustering of V. Then, a set of 2p|E| edges is selected uniformly at random and the labels of these edges are set randomly (i.e., flipped or not flipped with equal probability). Clearly, P(Y_{i,j} is flipped) = p for each (i, j) ∈ E. Hence this is a p-stochastic assignment of the labels.
Moreover, E \ E_0 contains in expectation 2p |E \ E_0| randomly labeled edges, on which A makes p |E \ E_0| mistakes in expectation.

In this paper we focus on active learning algorithms. An active learner for link classification first constructs a query set E_0 of edges, and then receives the labels of all edges in the query set. Based on this training information, the learner builds a prediction model for the labels of the remaining edges E \ E_0. We assume that the only labels ever revealed to the learner are those in the query set. In particular, no labels are revealed during the prediction phase. It is clear from Fact 1 that any active learning algorithm that queries the labels of at most a constant fraction of the total number of edges will make on average Ω(p|E|) mistakes.

We often write V_G and E_G to denote, respectively, the node set and the edge set of some underlying graph G. For any two nodes i, j ∈ V_G, Path(i, j) is any path in G having i and j as terminals, and |Path(i, j)| is its length (number of edges). The diameter D_G of a graph G is the maximum over pairs i, j ∈ V_G of the length of the shortest path between i and j. Given a tree T = (V_T, E_T) in G, and two nodes i, j ∈ V_T, we denote by d_T(i, j) the distance between i and j within T, i.e., the length of the (unique) path Path_T(i, j) connecting the two nodes in T. Moreover, π_T(i, j) denotes the parity of this path, i.e., the product of edge signs along it. When T is a rooted tree, we denote by Children_T(i) the set of children of i in T.
Finally, given two disjoint subtrees T′, T′′ ⊆ G with V_{T′} ∩ V_{T′′} = ∅, we let E_G(T′, T′′) = {(i, j) ∈ E_G : i ∈ V_{T′}, j ∈ V_{T′′}}.

3 Algorithms and their analysis

In this section, we introduce and analyze a family of active learning algorithms for link classification. The analysis is carried out under the p-stochastic assumption. As a warm up, we start off recalling the connection to the theory of low-stretch spanning trees (e.g., [4]), which turns out to be useful in the important special case when the active learner is afforded to query only |V| − 1 labels.

Let E_flip ⊂ E denote the (random) subset of edges whose labels have been flipped in a p-stochastic assignment, and consider the following class of active learning algorithms parameterized by an arbitrary spanning tree T = (V_T, E_T) of G. The algorithms in this class use E_0 = E_T as query set. The label of any test edge e′ = (i, j) ∉ E_T is predicted as the parity π_T(e′). Clearly enough, if a test edge e′ is predicted wrongly, then either e′ ∈ E_flip or Path_T(e′) contains at least one flipped edge. Hence, the number of mistakes M_T made by our active learner on the set of test edges E \ E_T can be deterministically bounded by

    M_T ≤ |E_flip| + Σ_{e′ ∈ E \ E_T} Σ_{e ∈ E} I{e ∈ Path_T(e′)} I{e ∈ E_flip}    (1)

where I{·} denotes the indicator of the Boolean predicate at argument. A quantity which can be related to M_T is the average stretch of a spanning tree T which, for our purposes, reduces to

    (1/|E|) [ |V| − 1 + Σ_{e′ ∈ E \ E_T} |Path_T(e′)| ] .

A stunning result of [4] shows that every connected, undirected and unweighted graph has a spanning tree with an average stretch of just O(log^2 |V| log log |V|). If our active learner uses a spanning tree with the same low stretch, then the following result holds.

Theorem 1 ([2]). Let (G, Y) = ((V, E), Y) be a labeled graph with p-stochastic assigned labels Y. If the active learner queries the edges of a spanning tree T = (V_T, E_T) with average stretch O(log^2 |V| log log |V|), then E M_T ≤ p|E| × O(log^2 |V| log log |V|).

We call the quantity multiplying p|E| in the upper bound the optimality factor of the algorithm. Recall that Fact 1 implies that this factor cannot be smaller than a constant when the query set size is a constant fraction of |E|.

Although low-stretch trees can be constructed in time O(|E| ln |V|), the algorithms are fairly complicated (we are not aware of available implementations), and the constants hidden in the asymptotics can be high. Another disadvantage is that we are forced to use a query set of small and fixed size |V| − 1. In what follows we introduce algorithms that overcome both limitations.

A key aspect in the analysis of prediction performance is the ability to select a query set so that each test edge creates a short circuit with a training path. This is quantified by Σ_{e ∈ E} I{e ∈ Path_T(e′)} in (1). We make this explicit as follows. Given a test edge (i, j) and a path Path(i, j) whose edges are all queried edges, we say that we are predicting label Y_{i,j} using path Path(i, j). Since (i, j) closes Path(i, j) into a circuit, in this case we also say that (i, j) is predicted using the circuit.

Fact 2. Let (G, Y) = ((V, E), Y) be a labeled graph with p-stochastic assigned labels Y. Given a query set E_0 ⊆ E, the number M of mistakes made when predicting test edges (i, j) ∈ E \ E_0 using training paths Path(i, j) whose length is uniformly bounded by ℓ satisfies E M ≤ ℓ p |E \ E_0|.

Proof. We have the chain of inequalities

    E M ≤ Σ_{(i,j) ∈ E \ E_0} (1 − (1 − p)^{|Path(i,j)|}) ≤ Σ_{(i,j) ∈ E \ E_0} (1 − (1 − p)^ℓ) ≤ Σ_{(i,j) ∈ E \ E_0} ℓ p ≤ ℓ p |E \ E_0| .

For instance, if the input graph G = (V, E) has diameter D_G and the queried edges are those of a breadth-first spanning tree, which can be generated in O(|E|) time, then the above fact holds with |E_0| = |V| − 1 and ℓ = 2 D_G. Comparing to Fact 1 shows that this simple breadth-first strategy is optimal up to constant factors whenever G has a constant diameter. This simple observation is especially relevant in the light of the typical graph topologies encountered in practice, whose diameters are often small. This argument is at the basis of our experimental comparison —see Section 4.

Yet, this mistake bound can be vacuous on graphs having a larger diameter. Hence, one may think of adding to the training spanning tree new edges so as to reduce the length of the circuits used for prediction, at the cost of increasing the size of the query set. A similar technique based on short circuits has been used in [2], the goal there being to solve the link classification problem in a harder adversarial environment.
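The spanning-tree strategy just described can be made concrete with a small simulation. The sketch below is ours, not the authors' code: it draws an independent-flip instance of the p-stochastic model (the model itself only requires P(flip) ≤ p), queries a breadth-first spanning tree, and predicts every edge by the parity of its tree path. With p = 0 the labeling is consistent with the two-clustering, so every prediction is correct.

```python
import random
from collections import deque

def p_stochastic_labels(edges, cluster, p, rng):
    # Consistent labeling (+1 within a cluster, -1 across), then each label
    # is flipped independently with probability p, so P(flip) <= p holds.
    sign = {}
    for (u, v) in edges:
        y = 1 if cluster[u] == cluster[v] else -1
        sign[(u, v)] = -y if rng.random() < p else y
    return sign

def bfs_tree_parities(adj, sign, root=0):
    # Query only the breadth-first spanning-tree edges; parity[i] is the
    # product of queried signs on the tree path from the root to node i.
    parity = {root: 1}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in parity:
                e = (u, v) if (u, v) in sign else (v, u)
                parity[v] = parity[u] * sign[e]
                queue.append(v)
    return parity

def predict(parity, u, v):
    # pi_T(u, v): the shared segment toward the root cancels out,
    # so the path parity is just parity[u] * parity[v].
    return parity[u] * parity[v]

rng = random.Random(0)
n = 50
edges = {(i, i + 1) for i in range(n - 1)} | {(0, n - 1)}  # cycle keeps G connected
while len(edges) < 200:                                    # plus random chords
    u, v = sorted(rng.sample(range(n), 2))
    edges.add((u, v))
adj = {i: [] for i in range(n)}
for (u, v) in edges:
    adj[u].append(v)
    adj[v].append(u)
cluster = [rng.randint(0, 1) for _ in range(n)]

sign = p_stochastic_labels(edges, cluster, p=0.0, rng=rng)  # no perturbation
parity = bfs_tree_parities(adj, sign)
mistakes = sum(predict(parity, u, v) != sign[(u, v)] for (u, v) in edges)
print(mistakes)  # 0 when p = 0: parity prediction recovers the consistent labeling
```

Raising p above 0 in this sketch lets one check empirically that the mistake count is governed by the tree-path lengths, in the spirit of Fact 2.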
The precise tradeoff between prediction accuracy (as measured by the expected number of mistakes) and fraction of queried edges is the main theoretical concern of this paper.

We now introduce an intermediate (and simpler) algorithm, called treeCutter, which improves on the optimality factor when the diameter D_G is not small. In particular, we demonstrate that treeCutter achieves a good upper bound on the number of mistakes on any graph such that |E| ≥ 3|V| + √|V|. This algorithm is especially effective when the input graph is dense, with an optimality factor between O(1) and O(√|V|). Moreover, the total time for predicting the test edges scales linearly with the number of such edges, i.e., treeCutter predicts edges in constant amortized time. Also, the space is linear in the size of the input graph.

The algorithm (pseudocode given in Figure 1) is parameterized by a positive integer k ranging from 2 to |V|. The actual setting of k depends on the graph topology and the desired fraction of query set edges, and plays a crucial role in determining the prediction performance. Setting k ≥ D_G makes treeCutter reduce to querying only the edges of a breadth-first spanning tree of G; otherwise it operates in a more involved way by splitting G into smaller node-disjoint subtrees.

In a preliminary step (Line 1 in Figure 1), treeCutter draws an arbitrary breadth-first spanning tree T = (V_T, E_T). Then subroutine extractTreelet(T, k) is used in a do-while loop to split T into vertex-disjoint subtrees T′ whose height is k (one of them might have a smaller height). extractTreelet(T, k) is a very simple procedure that performs a depth-first visit of the tree T given as argument. During this visit, each internal node may be visited several times (during backtracking steps). We assign each node i a tag h_T(i) representing the height of the subtree of T rooted at i.
h_T(i) can be recursively computed during the visit. After this assignment, if h_T(i) = k (or i is the root of T) we return the subtree T_i of T rooted at i. Then treeCutter removes (Line 6) T_i from T, along with all edges of E_T incident to nodes of T_i, and iterates until V_T gets empty. By construction, the diameter of the generated subtrees will not be larger than 2k. Let 𝒯 denote the set of these subtrees. For each T′ ∈ 𝒯, the algorithm queries all the labels of E_{T′}; each edge (i, j) ∈ E_G \ E_{T′} such that i, j ∈ V_{T′} is set to be a test edge, and label Y_{i,j} is predicted using Path_{T′}(i, j) (note that this coincides with Path_T(i, j), since T′ ⊆ T), that is, Ŷ_{i,j} = π_{T′}(i, j). Finally, for each pair of distinct subtrees T′, T′′ ∈ 𝒯 such that there exists a node of V_{T′} adjacent to a node of V_{T′′}, i.e., such that E_G(T′, T′′) is not empty, we query the label of an arbitrarily selected edge (i′, i′′) ∈ E_G(T′, T′′) (Lines 8 and 9 in Figure 1). Each edge (u, v) ∈ E_G(T′, T′′) whose label has not been previously queried is then part of the test set, and its label is predicted as Ŷ_{u,v} ← π_{T′}(u, i′) · Y_{i′,i′′} · π_{T′′}(i′′, v) (Line 11), that is, using the path obtained by concatenating Path_{T′}(u, i′), edge (i′, i′′), and Path_{T′′}(i′′, v).

treeCutter(k)    Parameter: k ≥ 2.  Initialization: 𝒯 ← ∅.
1.  Draw an arbitrary breadth-first spanning tree T of G
2.  Do
3.    T′ ← extractTreelet(T, k), and query all labels in E_{T′}
4.    𝒯 ← 𝒯 ∪ {T′}
5.    For each i, j ∈ V_{T′}, predict Ŷ_{i,j} ← π_{T′}(i, j)
6.    T ← T \ T′
7.  While (V_T ≠ ∅)
8.  For each T′, T′′ ∈ 𝒯 : T′ ≠ T′′
9.    If E_G(T′, T′′) ≠ ∅, query the label of an arbitrary edge (i′, i′′) ∈ E_G(T′, T′′)
10.   For each (u, v) ∈ E_G(T′, T′′) \ {(i′, i′′)}, with i′, u ∈ V_{T′} and v, i′′ ∈ V_{T′′}
11.     predict Ŷ_{u,v} ← π_{T′}(u, i′) · Y_{i′,i′′} · π_{T′′}(i′′, v)

Figure 1: treeCutter pseudocode.

extractTreelet(T, k)    Parameters: tree T, k ≥ 2.
1.  Perform a depth-first visit of T starting from the root.
2.  During the visit,
3.    For each i ∈ V_T visited for the (1 + |Children_T(i)|)-th time (i.e., the last visit of i):
4.      If i is a leaf, set h_T(i) ← 0
5.      Else set h_T(i) ← 1 + max{h_T(j) : j ∈ Children_T(i)}
6.      If h_T(i) = k or i is T's root, return the subtree rooted at i

Figure 2: extractTreelet pseudocode.

The following theorem¹ quantifies the number of mistakes made by treeCutter. The requirement on the graph density in the statement, i.e., |V| − 1 + |V|^2/(2k^2) + |V|/(2k) ≤ |E|/2, implies that the query set is not larger than the test set. This is a plausible assumption in active learning scenarios, and a way of adding meaning to the bounds.

¹ Due to space limitations, long proofs are presented in the supplementary material.

Theorem 2. For any integer k ≥ 2, the number M of mistakes made by treeCutter on any graph G = (V, E) with |E| ≥ 2|V| − 2 + |V|^2/k^2 + |V|/k satisfies E M ≤ min{4k + 1, 2D_G} p|E|, while the query set size is bounded by |V| − 1 + |V|^2/(2k^2) + |V|/(2k) ≤ |E|/2.

We now refine the simple argument leading to treeCutter, and present our active link classifier. The pseudocode of our refined algorithm, called starMaker, follows that of Figure 1 with the following differences: Line 1 is dropped (i.e., starMaker does not draw an initial spanning tree), and the call to extractTreelet in Line 3 is replaced by a call to extractStar. This new subroutine just selects the star T′ centered on the node of G having largest degree, and queries all labels of the edges in E_{T′}. The next result shows that this algorithm gets a constant optimality factor while using a query set of size O(|V|^{3/2}).

Theorem 3. The number M of mistakes made by starMaker on any given graph G = (V, E) with |E| ≥ 2|V| − 2 + 2|V|^{3/2} satisfies E M ≤ 5 p|E|, while the query set size is upper bounded by |V| − 1 + |V|^{3/2} ≤ |E|/2.

Finally, we combine starMaker with treeCutter so as to obtain an algorithm, called treeletStar, that can work with query sets smaller than |V| − 1 + |V|^{3/2} labels. treeletStar is parameterized by an integer k and follows Lines 1–6 of Figure 1, creating a set 𝒯 of trees through repeated calls to extractTreelet. Lines 7–11 are instead replaced by the following procedure: a graph G′ = (V_{G′}, E_{G′}) is created such that (1) each node in V_{G′} corresponds to a tree in 𝒯, and (2) there exists an edge in E_{G′} if and only if the two corresponding trees of 𝒯 are connected by at least one edge of E_G.
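Before moving on to the star-extraction step, the treelet-splitting rule of Figure 2 can be rendered as a short recursive sketch (ours, not the authors' code; the paper uses an iterative depth-first visit): h_T(i) is computed bottom-up, and a residual subtree is cut off as soon as its height reaches k.

```python
def extract_treelets(children, root, k):
    # Post-order DFS over a rooted tree: h(i) is the height of i's residual
    # subtree after earlier cuts; when h(i) reaches k (or i is the root) the
    # residual subtree rooted at i is detached as one treelet.
    treelets = []

    def visit(i):
        height, nodes = 0, [i]
        for j in children.get(i, []):
            result = visit(j)
            if result is not None:        # j was not already cut away
                h_j, n_j = result
                height = max(height, 1 + h_j)
                nodes += n_j
        if height == k or i == root:
            treelets.append(nodes)        # cut here: emit a treelet
            return None
        return (height, nodes)

    visit(root)
    return treelets

# A path 0-1-...-9 rooted at 0 is split into node-disjoint treelets of height <= 3:
chain = {i: [i + 1] for i in range(9)}
print(extract_treelets(chain, 0, 3))  # [[6, 7, 8, 9], [2, 3, 4, 5], [0, 1]]
```

As in the pseudocode, every treelet except possibly the one containing the root has height exactly k, so each has diameter at most 2k.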
Then, extractStar is used to generate a set 𝒮 of stars of vertices of G′, i.e., stars of trees of 𝒯. Finally, for each pair of distinct stars S′, S′′ ∈ 𝒮 connected by at least one edge in E_G, the label of an arbitrary edge in E_G(S′, S′′) is queried. The remaining edges are all predicted.

Theorem 4. For any integer k ≥ 2 and for any graph G = (V, E) with |E| ≥ 2|V| − 2 + 2((|V| − 1)/k + 1)^{3/2}, the number M of mistakes made by treeletStar(k) on G satisfies E M = O(min{k, D_G}) p|E|, while the query set size is bounded by |V| − 1 + ((|V| − 1)/k + 1)^{3/2} ≤ |E|/2.

Hence, even if D_G is large, setting k = |V|^{1/3} yields a O(|V|^{1/3}) optimality factor just by querying O(|V|) edges. On the other hand, a truly constant optimality factor is obtained by querying as few as O(|V|^{3/2}) edges (provided the graph has sufficiently many edges). As a direct consequence (and surprisingly enough), on graphs which are only moderately dense we need not observe too many edges in order to achieve a constant optimality factor. It is instructive to compare the bounds obtained by treeletStar to the ones we can achieve by using the cccc algorithm of [2], or the low-stretch spanning trees given in Theorem 1. Because cccc operates within a harder adversarial setting, it is easy to show that Theorem 9 in [2] extends to the p-stochastic assignment model by replacing Δ_2(Y) with p|E| therein.² The resulting optimality factor is of order ((1 − α)/α)^{3/2} √|V|, where α ∈ (0, 1] is the fraction of queried edges out of the total number of edges. A quick comparison to Theorem 4 reveals that treeletStar achieves a sharper mistake bound for any value of α. For instance, in order to obtain an optimality factor lower than √|V|, cccc has to query in the worst case a fraction of edges that goes to one as |V| → ∞. On top of this, our algorithms are faster and easier to implement —see Section 3.1.

Next, we compare to query sets produced by low-stretch spanning trees. A low-stretch spanning tree achieves a polylogarithmic optimality factor by querying |V| − 1 edge labels. The results in [4] show that we cannot hope to get a better optimality factor using a single low-stretch spanning tree combined with the analysis in (1). For a comparable amount Θ(|V|) of queried labels, Theorem 4 offers the larger optimality factor |V|^{1/3}. However, we can get a constant optimality factor by increasing the query set size to O(|V|^{3/2}). It is not clear how multiple low-stretch trees could be combined to get a similar scaling.

3.1 Complexity analysis and implementation

We now compute bounds on time and space requirements for our three algorithms. Recall the different lower bound conditions on the graph density that must hold to ensure that the query set size is not larger than the test set size. These were |E| ≥ 2|V| − 2 + |V|^2/k^2 + |V|/k for treeCutter(k) in Theorem 2, |E| ≥ 2|V| − 2 + 2|V|^{3/2} for starMaker in Theorem 3, and |E| ≥ 2|V| − 2 + 2((|V| − 1)/k + 1)^{3/2} for treeletStar(k) in Theorem 4.

² This theoretical comparison is admittedly unfair, as cccc has been designed to work in a harder setting than p-stochastic. Unfortunately, we are not aware of any other general active learning scheme for link classification to compare with.

Theorem 5. For any input graph G = (V, E) which is dense enough to ensure that the query set size is no larger than the test set size, the total time needed for predicting all test labels is:

    O(|E| + |V| log |V|)              for treeCutter(k) and for all k,
    O(|E|)                            for starMaker,
    O(|E| + (|V|/k) log(|V|/k))       for treeletStar(k) and for all k.

In particular, whenever k|E| = Ω(|V| log |V|), treeletStar(k) works in constant amortized time. For all three algorithms, the space required is always linear in the input graph size |E|.

4 Experiments

In this preliminary set of experiments we only tested the predictive performance of treeCutter(|V|). This corresponds to querying only the edges of the initial spanning tree T and predicting all remaining edges (i, j) via the parity of Path_T(i, j). The spanning tree T used by treeCutter is a shortest-path spanning tree generated by a breadth-first visit of the graph (assuming all edges have unit length). As the choice of the starting node in the visit is arbitrary, we picked the highest degree node in the graph. Finally, we run through the adjacency list of each node in random order, which we empirically observed to improve performance.

Our baseline is the heuristic ASymExp from [11] which, among the many spectral heuristics proposed there, turned out to perform best on all our datasets. With integer parameter z, ASymExp(z) predicts using a spectral transformation of the training sign matrix Y_train, whose only non-zero entries are the signs of the training edges. The label of edge (i, j) is predicted using (exp(Y_train(z)))_{i,j}. Here exp(Y_train(z)) = U_z exp(D_z) U_z^⊤, where U_z D_z U_z^⊤ is the spectral decomposition of Y_train containing only the z largest eigenvalues and their corresponding eigenvectors. Following [11], we ran ASymExp(z) with the values z = 1, 5, 10, 15. This heuristic uses the two-clustering bias as follows: expand exp(Y_train) in a series of powers Y_train^n. Then each (Y_train^n)_{i,j} is a sum of values of paths of length n between i and j. Each path has value 0 if it contains at least one test edge, otherwise its value equals the product of queried labels on the path edges.
Hence, the sign of exp(Y_train) is the sign of a linear combination of path values, each corresponding to a prediction consistent with the two-clustering bias —compare this to the multiplicative rule used by treeCutter. Note that ASymExp and the other spectral heuristics from [11] all have running times of order Ω(|V|^2).

We performed a first set of experiments on synthetic signed graphs created from a subset of the USPS digit recognition dataset. We randomly selected 500 examples labeled "1" and 500 examples labeled "7" (these two classes are not straightforward to tell apart). Then, we created a graph using a k-NN rule with k = 100. The edges were labeled as follows: all edges incident to nodes with the same USPS label were labeled +1; all edges incident to nodes with different USPS labels were labeled −1. Finally, we randomly pruned the positive edges so as to achieve an unbalance of about 20% between the two classes.³ Starting from this edge label assignment, which is consistent with the two-clustering associated with the USPS labels, we generated a p-stochastic label assignment by flipping the labels of a random subset of the edges. Specifically, we used the three following synthetic datasets:

DELTA0: no flippings (p = 0), 1,000 nodes and 9,138 edges;
DELTA100: 100 randomly chosen labels of DELTA0 are flipped;
DELTA250: 250 randomly chosen labels of DELTA0 are flipped.

We also used three real-world datasets:

MOVIELENS: a signed graph we created using Movielens ratings.⁴ We first normalized the ratings by subtracting from each user rating the average rating of that user. Then, we created a user-user matrix of cosine distance similarities.
This matrix was sparsified by zeroing each entry smaller than 0.1 and removing all self-loops. Finally, we took the sign of each non-zero entry. The resulting graph has 6,040 nodes and 824,818 edges (12.6% of which are negative).
SLASHDOT: The biggest strongly connected component of a snapshot of the Slashdot social network,5 similar to the one used in [11]. This graph has 26,996 nodes and 290,509 edges (24.7% of which are negative).
EPINIONS: The biggest strongly connected component of a snapshot of the Epinions signed network,6 similar to the one used in [13, 12]. This graph has 41,441 nodes and 565,900 edges (26.2% of which are negative).
Slashdot and Epinions are originally directed graphs. We removed the reciprocal edges with mismatching labels (which turned out to be only a few) and considered the remaining edges as undirected.

3 This is similar to the class imbalance of real-world signed networks; see below.
4 www.grouplens.org/system/files/ml-1m.zip.

Figure 3: F-measure against training set size for treeCutter(|V|) and ASymExp(z) with different values of z on both synthetic and real-world datasets. By construction, treeCutter never makes a mistake when the labeling is consistent with a two-clustering, so on DELTA0 treeCutter makes no mistakes whenever the training set contains at least one spanning tree. With the exception of EPINIONS, treeCutter outperforms ASymExp using a much smaller training set. We conjecture that ASymExp does not exploit the two-clustering bias as well as treeCutter, which is, on the other hand, less robust than ASymExp to violations of that bias (presumably the case for the EPINIONS labeling).

The following table summarizes the key statistics of each dataset: Neg.
is the fraction of negative edges, |V|/|E| is the fraction of edges queried by treeCutter(|V|), and Avgdeg is the average degree of the nodes of the network.

Dataset      |V|      |E|       Neg.     |V|/|E|   Avgdeg
DELTA0       1000     9138      21.9%    10.9%     18.2
DELTA100     1000     9138      22.7%    10.9%     18.2
DELTA250     1000     9138      23.5%    10.9%     18.2
SLASHDOT     26996    290509    24.7%    9.2%      21.6
EPINIONS     41441    565900    26.2%    7.3%      27.4
MOVIELENS    6040     824818    12.6%    0.7%      273.2

Our results are summarized in Figure 3, where we plot F-measure (preferable to accuracy due to the class imbalance) against the fraction of training (or query) set size. On all datasets but MOVIELENS, the training set size for ASymExp ranges across the values 5%, 10%, 25%, and 50%. Since MOVIELENS has a higher density, we reduced those fractions to 1%, 3%, 5%, and 10%. treeCutter(|V|) uses a single spanning tree, and thus we only have a single query set size value. All results are averaged over ten runs of the algorithms.
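The prediction rule used by treeCutter(|V|) throughout these experiments, querying the labels of the spanning-tree edges and labeling each test edge (i, j) with the product of the queried signs along PathT(i, j), can be sketched as follows. This is an illustrative sketch under our own representation choices (a child-to-parent map for the tree, frozenset keys for edge labels), not the authors' implementation.

```python
from collections import deque

def bfs_spanning_tree(adj, root):
    """Breadth-first spanning tree of the graph, as a child -> parent map."""
    parent = {root: None}
    queue = deque([root])
    while queue:
        u = queue.popleft()
        for v in adj[u]:
            if v not in parent:
                parent[v] = u
                queue.append(v)
    return parent

def predict_sign(i, j, parent, tree_sign):
    """Predict the label of test edge (i, j) as the product of the queried
    signs on the tree path between i and j.  Signs on the shared part of
    the two root paths appear twice and cancel, since (+1)^2 = (-1)^2 = 1.
    """
    def parity_to_root(u):
        p = 1
        while parent[u] is not None:
            p *= tree_sign[frozenset((u, parent[u]))]
            u = parent[u]
        return p
    return parity_to_root(i) * parity_to_root(j)
```

On a labeling consistent with a two-clustering, every tree-path parity reproduces the true edge sign, which is why treeCutter makes no mistakes on DELTA0 once a spanning tree has been queried.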
The randomness in ASymExp is due to the random draw of the training set. The randomness in treeCutter(|V|) is caused by the randomized breadth-first visit.

5 snap.stanford.edu/data/soc-sign-Slashdot081106.html.
6 snap.stanford.edu/data/soc-sign-epinions.html.

References

[1] Cartwright, D. and Harary, F. Structural balance: A generalization of Heider's theory. Psychological Review, 63(5):277-293, 1956.

[2] Cesa-Bianchi, N., Gentile, C., Vitale, F., and Zappella, G. A correlation clustering approach to link classification in signed networks. In Proceedings of the 25th Conference on Learning Theory (COLT 2012). To appear, 2012.

[3] Chiang, K., Natarajan, N., Tewari, A., and Dhillon, I. Exploiting longer cycles for link prediction in signed networks. In Proceedings of the 20th ACM Conference on Information and Knowledge Management (CIKM). ACM, 2011.

[4] Elkin, M., Emek, Y., Spielman, D.A., and Teng, S.-H. Lower-stretch spanning trees. SIAM Journal on Computing, 38(2):608-628, 2010.

[5] Facchetti, G., Iacono, G., and Altafini, C. Computing global structural balance in large-scale signed social networks. PNAS, 2011.

[6] Giotis, I.
and Guruswami, V. Correlation clustering with a fixed number of clusters. In Proceedings of the Seventeenth Annual ACM-SIAM Symposium on Discrete Algorithms, pp. 1167-1176. ACM, 2006.

[7] Guha, R., Kumar, R., Raghavan, P., and Tomkins, A. Propagation of trust and distrust. In Proceedings of the 13th International Conference on World Wide Web, pp. 403-412. ACM, 2004.

[8] Harary, F. On the notion of balance of a signed graph. Michigan Mathematical Journal, 2(2):143-146, 1953.

[9] Heider, F. Attitude and cognitive organization. Journal of Psychology, 21:107-122, 1946.

[10] Hou, Y.P. Bounds for the least Laplacian eigenvalue of a signed graph. Acta Mathematica Sinica, 21(4):955-960, 2005.

[11] Kunegis, J., Lommatzsch, A., and Bauckhage, C. The Slashdot Zoo: Mining a social network with negative edges. In Proceedings of the 18th International Conference on World Wide Web, pp. 741-750. ACM, 2009.

[12] Leskovec, J., Huttenlocher, D., and Kleinberg, J. Trust-aware bootstrapping of recommender systems. In Proceedings of the ECAI 2006 Workshop on Recommender Systems, pp. 29-33. ECAI, 2006.

[13] Leskovec, J., Huttenlocher, D., and Kleinberg, J. Signed networks in social media. In Proceedings of the 28th International Conference on Human Factors in Computing Systems, pp. 1361-1370. ACM, 2010.

[14] Leskovec, J., Huttenlocher, D., and Kleinberg, J. Predicting positive and negative links in online social networks. In Proceedings of the 19th International Conference on World Wide Web, pp. 641-650. ACM, 2010.

[15] Von Luxburg, U. A tutorial on spectral clustering.
Statistics and Computing, 17(4):395-416, 2007.