{"title": "Online Prediction of Switching Graph Labelings with Cluster Specialists", "book": "Advances in Neural Information Processing Systems", "page_first": 7004, "page_last": 7014, "abstract": "We address the problem of predicting the labeling of a graph in an online setting when the labeling is changing over time. We present an algorithm based on a specialist approach; we develop the machinery of cluster specialists which probabilistically exploits the cluster structure in the graph. Our algorithm has two variants, one of which surprisingly only requires O(log n) time on any trial t on an n-vertex graph, an exponential speed up over existing methods. We prove switching mistake-bound guarantees for both variants of our algorithm. Furthermore these mistake bounds smoothly vary with the magnitude of the change between successive labelings. We perform experiments on Chicago Divvy Bicycle Sharing data and show that our algorithms significantly outperform an existing algorithm (a kernelized Perceptron) as well as several natural benchmarks.", "full_text": "Online Prediction of Switching Graph Labelings with\n\nCluster Specialists\n\nMark Herbster\n\nDepartment of Computer Science\n\nUniversity College London\n\nLondon\n\nUnited Kingdom\n\nm.herbster@cs.ucl.ac.uk\n\nj.robinson@cs.ucl.ac.uk\n\nJames Robinson\n\nDepartment of Computer Science\n\nUniversity College London\n\nLondon\n\nUnited Kingdom\n\nAbstract\n\nWe address the problem of predicting the labeling of a graph in an online setting\nwhen the labeling is changing over time. We present an algorithm based on a\nspecialist [11] approach; we develop the machinery of cluster specialists which\nprobabilistically exploits the cluster structure in the graph. Our algorithm has\ntwo variants, one of which surprisingly only requires O(log n) time on any trial\nt on an n-vertex graph, an exponential speed up over existing methods. 
We prove switching mistake-bound guarantees for both variants of our algorithm. Furthermore, these mistake bounds vary smoothly with the magnitude of the change between successive labelings. We perform experiments on Chicago Divvy Bicycle Sharing data and show that our algorithms significantly outperform an existing algorithm (a kernelized Perceptron) as well as several natural benchmarks.

1 Introduction

We study the problem of predicting graph labelings that evolve over time. Consider the following game for predicting the labeling of a graph in the online setting. Nature presents a graph G; Nature queries a vertex i_1 ∈ V = {1, 2, . . . , n}; the learner predicts the label of the vertex, ŷ_1 ∈ {−1, 1}; Nature presents a label y_1; Nature queries a vertex i_2; the learner predicts ŷ_2; and so forth. The learner's goal is to minimize the total number of mistakes M = |{t : ŷ_t ≠ y_t}|. If Nature is strictly adversarial, the learner will incur a mistake on every trial, but if Nature is regular or simple, there is hope that the learner may incur only a few mistakes. Thus, a central goal of mistake-bounded online learning is to design algorithms whose total mistakes can be bounded relative to the complexity of Nature's labeling. This (non-switching) graph labeling problem has been studied extensively in the online learning literature [16, 15, 7, 34, 17]. In this paper we generalize the setting to allow the underlying labeling to change arbitrarily over time.
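The prediction game above can be made concrete with a short simulation harness. The following is a minimal sketch; the names `online_graph_game`, `predict`, and `observe` are our own illustration, not from the paper, and Nature's queries are drawn uniformly at random here although the protocol allows arbitrary queries.

```python
import random

def online_graph_game(n, trials, predict, observe):
    """Play the online labeling game: on each trial Nature queries a
    vertex, the learner predicts a label in {-1, +1}, Nature reveals
    the true label, and we count mistakes M = |{t : y_hat_t != y_t}|."""
    mistakes = 0
    for t in range(trials):
        i_t = random.randrange(n)   # Nature queries a vertex
        y_hat = predict(i_t)        # the learner predicts
        y_t = observe(i_t)          # Nature reveals the label
        if y_hat != y_t:
            mistakes += 1
    return mistakes
```

For instance, a learner that always predicts +1 makes no mistakes against an all-positive Nature, and a mistake on every trial against an all-negative one.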
The learner has no knowledge of when a change in labeling will occur and therefore must be able to adapt quickly to these changes.

Consider an example of services placed throughout a city, such as public bicycle sharing stations. As the population uses these services the state of each station, such as the number of available bikes, naturally evolves throughout the day, at times gradually and at others abruptly, and we might want to predict the state of any given station at any given time. Since the location of a given station as well as the state of nearby stations will be relevant to this learning problem, it is natural to use a graph-based approach. Another setting might be a graph of major road junctions (vertices) connected by roads (edges), in which one wants to predict whether or not a junction is congested at any given time. Traffic congestion is naturally non-stationary and also exhibits both gradual and abrupt changes to the structure of the labeling over time [24].

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

The structure of this paper is as follows. In Section 2 we discuss the background literature. In Section 3 we present the SWITCHING CLUSTER SPECIALISTS algorithm (SCS), a modification of the method of specialists [11] with the novel machinery of cluster specialists, a set of specialists that in a rough sense correspond to clusters in the graph. We consider two distinct sets of specialists, B_n and F_n, where B_n ⊂ F_n. With the smaller set of specialists the bound is only larger by a factor of log n. On the other hand, prediction is exponentially faster per trial, remarkably requiring only O(log n) time to predict. In Section 4 we provide experiments on Chicago Divvy Bicycle Sharing data. In Section 5 we provide some concluding remarks. All proofs are contained in the technical appendices.

1.1 Notation

We first present common notation.
Let G = (V, E) be an undirected, connected, n-vertex graph with vertex set V = {1, 2, . . . , n} and edge set E. Each vertex of this graph may be labeled with one of two states {−1, 1}, and thus a labeling of a graph may be denoted by a vector u ∈ {−1, 1}^n, where u_i denotes the label of vertex i. The underlying assumption is that we are predicting vertex labels from a sequence u_1, . . . , u_T ∈ {−1, 1}^n of graph labelings over T trials. The set K := {t ∈ {2, . . . , T} : u_t ≠ u_{t−1}} ∪ {1} contains the first trial of each of the |K| "segments" of the prediction problem. Each segment corresponds to a time period when the underlying labeling is unchanging. The cut-size of a labeling u on a graph G is defined as Φ_G(u) := |{(i, j) ∈ E : u_i ≠ u_j}|, i.e., the number of edges between vertices of disagreeing labels.

We let r_G(i, j) denote the resistance distance (effective resistance) between vertices i and j when the graph G is seen as a circuit where each edge has unit resistance (e.g., [26]). The effective resistance for an unweighted graph G can be written as

r_G(i, j) = 1 / min{ Σ_{(p,q)∈E} (u_p − u_q)^2 : u ∈ R^n, u_i − u_j = 1 }.

The resistance diameter of a graph is R_G := max_{i,j∈V} r_G(i, j). The resistance-weighted cut-size of a labeling u is Φ^r_G(u) := Σ_{(i,j)∈E : u_i≠u_j} r_G(i, j). Let Δ_n := {μ ∈ [0, 1]^n : Σ_{i=1}^{n} μ_i = 1} be the n-dimensional probability simplex. For μ ∈ Δ_n we define H(μ) := Σ_{i=1}^{n} μ_i log_2(1/μ_i) to be the entropy of μ. For μ, ω ∈ Δ_n we define d(μ, ω) := Σ_{i=1}^{n} μ_i log_2(μ_i/ω_i) to be the relative entropy between μ and ω. For a vector ω and a set of indices I let ω(I) := Σ_{i∈I} ω_i. For any positive integer N we define [N] := {1, 2, . . . , N}, and for any predicate, [PRED] := 1 if PRED is true and equals 0 otherwise.

2 Related Work

The problem of predicting the labeling of a graph in the batch setting was introduced as a foundational method for semi-supervised (transductive) learning. In this work, the graph was built using both the unlabeled and labeled instances. The seminal work by [3] used a metric on the instance space and then built a kNN or ε-ball graph. The partial labeling was then extended to the complete graph by solving a mincut-maxflow problem where opposing binary labels represented sources and sinks. In practice this method suffered from very unbalanced cuts. Significant practical and theoretical advances were made by replacing the mincut/maxflow model with methods based on minimising a quadratic form of the graph Laplacian. Influential early results include but are not limited to [39, 2, 38]. A limitation of the graph Laplacian-based techniques is that these batch methods, depending on their implementation, typically require Θ(n^2) to Θ(n^3) time to produce a single set of predictions.

Predicting the labeling of a graph in the online setting was introduced by [20]. The authors proved bounds for a Perceptron-like algorithm with a kernel based on the graph Laplacian. Since this work there have been a number of extensions and improvements in bounds, including but not limited to [16, 6, 15, 18, 17, 32]. Common to all of these papers is that a dominant term in their mistake bounds is the (resistance-weighted) cut-size.

From a simplified perspective, the methods for predicting the labeling of a graph (online) split into two approaches.
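The two quantities that drive the bounds discussed in this paper, the cut-size Φ_G(u) and the effective resistance r_G(i, j), can both be computed directly. For the latter, the standard identity r_G(i, j) = (e_i − e_j)^T L^+ (e_i − e_j), with L the graph Laplacian, gives a simple (if cubic-time) implementation. A sketch, with helper names of our own choosing:

```python
import numpy as np

def cut_size(edges, u):
    """Phi_G(u): the number of edges joining disagreeing labels."""
    return sum(1 for (i, j) in edges if u[i] != u[j])

def effective_resistance(n, edges, i, j):
    """r_G(i, j) via the Laplacian pseudoinverse:
    r_G(i, j) = (e_i - e_j)^T L^+ (e_i - e_j)."""
    L = np.zeros((n, n))
    for (p, q) in edges:
        L[p, p] += 1.0
        L[q, q] += 1.0
        L[p, q] -= 1.0
        L[q, p] -= 1.0
    L_pinv = np.linalg.pinv(L)
    e = np.zeros(n)
    e[i], e[j] = 1.0, -1.0
    return float(e @ L_pinv @ e)
```

On a three-vertex path the endpoints have resistance distance 2, while on a triangle any pair has 2/3, matching the usual series and parallel circuit rules.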
The first approach works directly with the original graph and is usually based on a graph Laplacian [20, 15, 17]; it provides bounds that utilize the additional connectivity of non-tree graphs, which are particularly strong when the graph contains uniformly-labeled clusters of small (resistance) diameter. The drawbacks of this approach are that the bounds are weaker on graphs with large diameter, and that computation times are slower.

The second approach is to approximate the original graph with an appropriately selected tree or "line" graph [16, 7, 6, 34]. This enables faster computation times, and bounds that are better on graphs with large diameters. These algorithms may be extended to non-tree graphs by first selecting a spanning tree uniformly at random [7] and then applying the algorithm to the sampled tree. This randomized approach induces expected mistake bounds that also exploit the cluster structure in the graph (see Section 2.2). Our algorithm takes this approach.

2.1 Switching Prediction

In this paper, rather than predicting a single labeling of a graph, we instead predict a (switching) sequence of labelings. Switching in the mistake- or regret-bound setting refers to the problem of predicting an online sequence when the "best comparator" is changing over time. In the simplest of switching models the set of comparators is structureless and we simply pay per switch. A prominent early result in this model is [21], which introduced the fixed-share update that plays a central role in our main algorithm. Other prominent results in the structureless model include but are not limited to [36, 4, 12, 28, 27, 5]. A stronger model is to instead prove a bound that holds for any arbitrary contiguous sequence of trials. Such a bound is called an adaptive-regret bound. This type of bound automatically implies a bound in the structureless switching model.
Adaptive-regret was introduced in [13];¹ other prominent results in this model include [1, 5, 9].

The structureless model may be generalized by introducing a divergence measure on the set of comparators. Thus, whereas in the structureless model we pay for the number of switches, in the structured model we instead pay in the sum of divergences between successive comparators. This model was introduced in [22]; prominent results include [25, 5].

In [12, 23, 13] meta-algorithms were introduced with regret bounds which convert any "black-box" online learning algorithm into an adaptive algorithm. Such methods could be used as an approach to predict switching graph labelings online; however, these meta-algorithms introduce a factor of O(log T) to the per-trial time complexity of the base online learning algorithm. In the online switching setting we will aim for our fastest algorithm to have O(log n) time complexity per trial.

In [18] the authors also consider switching graph label prediction. However, their results are not directly comparable to ours since they consider the combinatorially more challenging problem of repeated switching within a small set of labelings contained in a larger set. That set-up was a problem originally framed in the "experts" setting, posed as an open problem by [10] and solved in [4]. If we apply the bound in [18] to the case where there is no repeated switching within a smaller set, then their bound is uniformly and significantly weaker than the bounds in this paper, and the algorithm is quite slow, requiring Θ(n^3) time per trial in a typical implementation. Also contained in [18] is a baseline algorithm based on a kernel perceptron with a graph Laplacian kernel. The bound of that algorithm has the significant drawback that it scales with respect to the "worst" labeling in a sequence of labelings.
However, it is simple to implement and we use it as a benchmark in our experiments.

2.2 Random Spanning Trees and Linearization

Since we operate in the transductive setting where the entire unlabeled graph is presented to the learner beforehand, this affords the learner the ability to perform any reconfiguration of the graph as a preprocessing step. The bounds of most existing algorithms for predicting a labeling on a graph are usually expressed in terms of the cut-size of the graph under that labeling. A natural approach then is to use a spanning tree of the original graph, which can only reduce the cut-size of the labeling. The effective resistance between vertices i and j, denoted r_G(i, j), is equal to the probability that a spanning tree of G drawn uniformly at random (from the set of all spanning trees of G) includes (i, j) ∈ E as one of its n − 1 edges (e.g., [30]). As first observed by [6], by selecting a spanning tree uniformly at random from the set of all possible spanning trees, mistake bounds expressed in terms of the cut-size then become expected mistake bounds in terms of the effective-resistance-weighted cut-size of the graph. That is, if R is a random spanning tree of G then E[Φ_R(u)] = Φ^r_G(u). On the other hand, Φ^r_G(u) ≤ Φ_G(u). A random spanning tree can be sampled from a graph efficiently using a random walk or similar methods (see e.g., [37]).

¹However, see the analysis of WML in [29] for a precursory result.

To illustrate the power of this randomization, consider the simplified example of a graph with two cliques each of size n/2, where one clique is labeled uniformly with '+1' and the other '−1', with an additional arbitrary n/2 "cut" edges between the cliques. This dense graph exhibits two disjoint clusters and Φ_G(u) = n/2, while Φ^r_G(u) = Θ(1), since between any two vertices in the opposing cliques there are n/2 edge-disjoint paths of length ≤ 3 and thus the effective resistance between any pair of vertices is Θ(1/n). Since bounds usually scale linearly with (resistance-weighted) cut-size, the cut-size bound would be vacuous but the resistance-weighted cut-size bound would be small.

We will make use of this preprocessing step of sampling a uniform random spanning tree, as well as a linearization of this tree to produce a (spine) line graph, S. The linearization of G to S as a preprocessing step was first proposed by [16] and has since been applied in, e.g., [7, 31]. In order to construct S, a random spanning tree R is picked uniformly at random. A vertex of R is then chosen and the graph is fully traversed using a depth-first search, generating an ordered list V_L = (i_{l_1}, . . . , i_{l_{2m+1}}) of vertices in the order they were visited. Vertices in V may appear multiple times in V_L. A subsequence V_L′ ⊆ V_L is then chosen such that each vertex in V appears only once. The line graph S is then formed by connecting each vertex in V_L′ to its immediate neighbors in V_L′ with an edge. We denote the edge set of S by E_S and let Φ_t := Φ(u_t), where the cut is with respect to the linear embedding S. Surprisingly, as stated in the lemma below, the cut on this linearized graph is no more than twice the cut on the original graph.

Lemma 1 ([16]). Given a labeling u ∈ {−1, 1}^n on a graph G, for the mapping G → R → S, as above, we have Φ_S(u) ≤ 2Φ_R(u) ≤ 2Φ_G(u).

By combining the above observations we may reduce the problem of learning on a graph to that of learning on a line graph.
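The preprocessing pipeline, sampling a uniform random spanning tree and then linearizing it by a depth-first traversal that keeps first visits only, can be sketched as follows. We use the Aldous-Broder random-walk sampler as one instance of the "random walk or similar methods" alluded to above; the helper names are our own.

```python
import random
from collections import defaultdict

def aldous_broder_tree(adj, n):
    """Sample a uniform random spanning tree (Aldous-Broder): walk
    randomly; keep the edge by which each vertex is first reached."""
    v = random.randrange(n)
    visited = {v}
    tree = []
    while len(visited) < n:
        w = random.choice(adj[v])
        if w not in visited:
            visited.add(w)
            tree.append((v, w))
        v = w
    return tree

def linearize(tree_edges, n, root=0):
    """Depth-first traversal of the tree; the first-visit subsequence
    V_L' orders the vertices, and consecutive vertices are joined to
    form the spine S (returned as its edge list)."""
    adj = defaultdict(list)
    for (i, j) in tree_edges:
        adj[i].append(j)
        adj[j].append(i)
    order, seen, stack = [], set(), [root]
    while stack:
        v = stack.pop()
        if v in seen:
            continue
        seen.add(v)
        order.append(v)
        stack.extend(adj[v])
    return [(order[k], order[k + 1]) for k in range(len(order) - 1)]
```

The spine always has n − 1 edges, and by Lemma 1 the cut of any labeling on S is at most twice its cut on the original graph.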
In particular, if we have an algorithm with a mistake bound of the form M ≤ O(Φ_G(u)), this implies we may give an expected mistake bound of the form M ≤ O(Φ^r_G(u)) by first sampling a random spanning tree and then linearizing it as above. One caveat, however, depends on whether Nature is oblivious or adaptive. If Nature is oblivious we assume that the learner's predictions have no effect on the labels chosen by Nature (or equivalently all labelings are chosen beforehand). Conversely, if Nature is adaptive then Nature's labelings are assumed to be adversarially chosen in response to the learner's predictions. In this paper we will only state the deterministic mistake bounds in terms of cut-size, which hold for oblivious and adaptive adversaries, while the expected bounds in terms of resistance-weighted cut-sizes hold only for an oblivious adversary.

3 Switching Specialists

In this section we present a new method based on the idea of specialists [11] from the prediction with expert advice literature [29, 35, 8]. Although the achieved bounds are slightly worse than other methods for predicting a single labeling of a graph, the derived advantage is that it is possible to obtain "competitive" bounds with fast algorithms to predict a sequence of changing graph labelings. Our inductive bias is to predict well when a labeling has a small (resistance-weighted) cut-size. The complementary perspective implies that the labeling consists of a few uniformly labeled clusters. This suggests the idea of maintaining a collection of basis functions where each such function is specialized to predict a constant function on a given cluster of vertices. To accomplish this technically we adapt the method of specialists [11, 27]. A specialist is a prediction function ε from an input space to an extended output space with abstentions.
So for us the input space is just V = [n], the vertices of a graph; and the extended output space is {−1, 1, ∗}, where {−1, 1} corresponds to predicted labels of the vertices, but '∗' indicates that the specialist abstains from predicting. Thus a specialist specializes its prediction to part of the input space, and in our application the specialists correspond to a collection of clusters which cover the graph, each cluster uniformly predicting −1 or 1.

In Algorithm 1 we give our switching specialists method. The algorithm maintains a weight vector ω_t over the specialists in which the magnitudes may be interpreted as the current confidence we have in each of the specialists. The updates and their analyses are a combination of three standard methods: i) Halving loss updates, ii) specialists updates and iii) (delayed) fixed-share updates.

Algorithm 1: SWITCHING CLUSTER SPECIALISTS
    input: specialist set E
    parameter: α ∈ [0, 1]
    initialize: ω_1 ← (1/|E|)1, ω̇_0 ← (1/|E|)1, p ← 0, m ← 0
    for t = 1 to T do
        receive i_t ∈ V
        set A_t := {ε ∈ E : ε(i_t) ≠ ∗}
        foreach ε ∈ A_t do
            ω_{t,ε} ← (1 − α)^{m − p_ε} ω̇_{t−1,ε} + (1 − (1 − α)^{m − p_ε}) / |E|    // delayed share update (1)
        predict ŷ_t ← sign(Σ_{ε∈A_t} ω_{t,ε} ε(i_t))
        receive y_t ∈ {−1, 1}
        set Y_t := {ε ∈ E : ε(i_t) = y_t}
        if ŷ_t ≠ y_t then
            ω̇_{t,ε} ← 0 if ε ∈ A_t ∩ Ȳ_t ;  ω̇_{t−1,ε} if ε ∉ A_t ;  ω_{t,ε} ω_t(A_t)/ω_t(Y_t) if ε ∈ Y_t    // loss update (2)
            foreach ε ∈ A_t do p_ε ← m
            m ← m + 1
        else
            ω̇_t ← ω̇_{t−1}

The loss update (2) zeros the weight components of incorrectly predicting specialists, while the non-predicting specialists are not updated at all.
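Our reading of Algorithm 1 can be transcribed into code. The sketch below is naive: it touches every specialist on each trial rather than exploiting the caching that makes the delayed update fast, and it assumes, as the analysis does with a complete basis, that some active specialist always agrees with the revealed label. The function name `scs_run` is ours.

```python
def scs_run(specialists, queries, labels, alpha):
    """Sketch of SWITCHING CLUSTER SPECIALISTS. `specialists` is a list
    of functions V -> {-1, +1, None}, with None meaning 'abstain' (*).
    Returns the number of mistakes."""
    N = len(specialists)
    w_dot = [1.0 / N] * N   # weights as of the last loss update
    p = [0] * N             # mistake count when each specialist was last active
    m = 0                   # number of mistakes so far
    mistakes = 0
    for i_t, y_t in zip(queries, labels):
        active = [e for e in range(N) if specialists[e](i_t) is not None]
        w = list(w_dot)
        for e in active:    # delayed fixed-share update (1)
            decay = (1.0 - alpha) ** (m - p[e])
            w[e] = decay * w_dot[e] + (1.0 - decay) / N
        margin = sum(w[e] * specialists[e](i_t) for e in active)
        y_hat = 1 if margin >= 0 else -1
        if y_hat != y_t:
            correct = [e for e in active if specialists[e](i_t) == y_t]
            mass_active = sum(w[e] for e in active)
            mass_correct = sum(w[e] for e in correct)  # assumed > 0
            for e in active:   # loss update (2): zero wrong, boost right
                w_dot[e] = 0.0
            for e in correct:
                w_dot[e] = w[e] * mass_active / mass_correct
            for e in active:
                p[e] = m
            m += 1
            mistakes += 1
        # on a correct trial the cached weights are left untouched
    return mistakes
```

With two always-active specialists predicting +1 and −1 and a constant target, the algorithm makes at most one mistake before the Halving-style update concentrates all mass on the correct specialist.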
In (1) we give our delayed fixed-share style update. A standard fixed-share update may be written in the following form:

ω_{t,ε} = (1 − α) ω̇_{t−1,ε} + α/|E| .    (3)

Although (3) superficially appears different from (1), in fact these two updates are exactly the same in terms of the predictions generated by the algorithm. This is because (1) caches updates until the given specialist is again active. The computational purpose of this is that if the active specialists are, for example, logarithmic in size compared to the total specialist pool, we may then achieve an exponential speedup over (3), which in fact we will exploit.

In the following theorem we give our switching specialists bound. The dominant cost of switching on trial t to t + 1 is given by the non-symmetric J_E(μ_t, μ_{t+1}) := |{ε ∈ E : μ_{t,ε} = 0, μ_{t+1,ε} ≠ 0}|, i.e., we pay only for each new specialist introduced but we do not pay for removing specialists.

Theorem 2. For a given specialist set E, let M_E denote the number of mistakes made in predicting the online sequence (i_1, y_1), . . . , (i_T, y_T) by Algorithm 1. Then,

M_E ≤ (1/π_1) log |E| + Σ_{t=1}^{T} (1/π_t) log(1/(1 − α)) + Σ_{i=1}^{|K|−1} J_E(μ_{k_i}, μ_{k_{i+1}}) log(|E|/α) ,    (4)

for any sequence of consistent and well-formed comparators μ_1, . . . , μ_T ∈ Δ_{|E|}, where K := {k_1 = 1 < ··· < k_{|K|}} := {t ∈ [T] : μ_t ≠ μ_{t−1}} ∪ {1}, and π_t := μ_t(Y_t).

The bound in the above theorem depends crucially on the best sequence of consistent and well-formed comparators μ_1, . . . , μ_T. The consistency requirement implies that on every trial there is no active incorrect specialist assigned "mass" (μ_t(A_t \ Y_t) = 0). We may eliminate the consistency requirement by "softening" the loss update (2).
A comparator μ ∈ Δ_{|E|} is well-formed if for all v ∈ V there exists a unique ε ∈ E such that ε(v) ≠ ∗ and μ_ε > 0, and furthermore there exists a π ∈ (0, 1] such that for all ε ∈ E, μ_ε ∈ {0, π}, i.e., each specialist in the support of μ has the same mass π and these specialists disjointly cover the input space (V). At considerable complication to the form of the bound, the well-formedness requirement may be eliminated.

The above bound is "smooth" in that it scales with a gradual change in the comparator. In the next section we describe the novel specialist sets that we've tailored to graph-label prediction, so that a small change in comparator corresponds to a small change in a graph labeling.

3.1 Cluster Specialists

In order to construct the cluster specialists over a graph G = (V = [n], E), we first construct a line graph as described in Section 2.2. A cluster specialist is then defined by ε^{l,r}_y(·), which maps V → {−1, 1, ∗}, where ε^{l,r}_y(v) := y if l ≤ v ≤ r and ε^{l,r}_y(v) := ∗ otherwise. Hence cluster specialist ε^{l,r}_y corresponds to a function that predicts the label y if vertex v lies between vertices l and r, and abstains otherwise. Recall that by sampling a random spanning tree the expected cut-size of a labeling on the spine is no more than twice the resistance-weighted cut-size on G. Thus, given a labeled graph with a small resistance-weighted cut-size, with densely interconnected clusters and modest inter-cluster connections, a cut-bracketed linear segment on the spine will in expectation roughly correspond to one of the original dense clusters.
We will consider two basis sets of cluster specialists.

Basis F_n: We first introduce the complete basis set F_n := {ε^{l,r}_y : l, r ∈ [n], l ≤ r; y ∈ {−1, 1}}. We say that a set of specialists C_u ⊆ E ⊆ {−1, 1, ∗}^V from basis E covers a labeling u ∈ {−1, 1}^n if for all v ∈ V = [n] and ε ∈ C_u we have ε(v) ∈ {u_v, ∗}, and for each v ∈ V there exists ε ∈ C_u such that ε(v) = u_v. The basis E is complete if every labeling u ∈ {−1, 1}^n is covered by some C_u ⊆ E. The basis F_n is complete and in fact has the following approximation property: for any u ∈ {−1, 1}^n there exists a covering set C_u ⊆ F_n such that |C_u| = Φ_S(u) + 1. This follows directly as a line with k − 1 cuts is divided into k segments. We now illustrate the use of basis F_n to predict the labeling of a graph. For simplicity we consider the problem of predicting a single graph labeling without switching. As there is no switch we set α := 0, and thus if the graph is labeled with u ∈ {−1, 1}^n with cut-size Φ_S(u) then we will need Φ_S(u) + 1 specialists to predict the labeling. The comparators may then be post-hoc optimally determined so that μ = μ_1 = ··· = μ_T, and there will be Φ_S(u) + 1 components of μ, each with "weight" 1/(Φ_S(u) + 1); thus 1/π_1 = Φ_S(u) + 1, since there will be only one specialist (with non-zero weight) active per trial. Since the cardinality of F_n is n^2 + n, by substituting into (4) we have that the number of mistakes will be bounded by (Φ_S(u) + 1) log(n^2 + n). Note for a single graph labeling on a spine this bound is not much worse than the best known result [16, Theorem 4].
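The covering property of F_n is easy to realize constructively: scanning the spine left to right and emitting one specialist ε^{l,r}_y per maximal uniformly-labeled segment yields exactly Φ_S(u) + 1 specialists. A sketch, with vertices indexed from 0 and a function name of our own:

```python
def minimal_cover(u):
    """Return the minimal covering set from basis F_n for a labeling u
    of the spine, as triples (l, r, y) standing for specialists
    eps^{l,r}_y. Its size is Phi_S(u) + 1."""
    cover = []
    l = 0
    for v in range(1, len(u) + 1):
        if v == len(u) or u[v] != u[l]:
            cover.append((l, v - 1, u[l]))  # close the current segment
            l = v
    return cover
```

For example, the labeling (1, 1, −1, 1) has two cuts on the spine and is covered by exactly three specialists.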
In terms of computation time, however, it is significantly slower than the algorithm in [16], requiring Θ(n^2) time to predict on a typical trial since on average there are Θ(n^2) specialists active per trial.

Basis B_n: We now introduce the basis B_n, which has Θ(n) specialists and only requires O(log n) time per trial to predict, with only a small increase in the bound. The basis is defined recursively as

B_{p,q} := {ε^{p,q}_{−1}, ε^{p,q}_{1}}                                          if p = q,
B_{p,q} := {ε^{p,q}_{−1}, ε^{p,q}_{1}} ∪ B_{p,⌊(p+q)/2⌋} ∪ B_{⌊(p+q)/2⌋+1,q}    if p ≠ q,

and is analogous to a binary tree. We have the following approximation property for B_n := B_{1,n}.

Proposition 3. The basis B_n is complete. Furthermore, for any labeling u ∈ {−1, 1}^n there exists a covering set C_u ⊆ B_n such that |C_u| ≤ 2(Φ_S(u) + 1)⌈log_2(n/2)⌉ for n > 2.

From a computational perspective the binary tree structure ensures that there are only Θ(log n) specialists active per trial, leading to an exponential speed-up in prediction. A similar set of specialists was used for obtaining adaptive-regret bounds in [9, 23] and data-compression in [33].
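The recursion defining B_{p,q} is exactly the construction of a segment tree over [p, q], with two specialists (one per label) at each node. A sketch with our own helper names, illustrating both the Θ(n) total size and the Θ(log n) active specialists per vertex:

```python
def build_basis(p, q):
    """Construct B_{p,q}: specialists eps^{p,q}_{-1} and eps^{p,q}_{+1},
    plus, when p != q, the bases of the two half-intervals."""
    basis = [(p, q, -1), (p, q, 1)]
    if p != q:
        mid = (p + q) // 2
        basis += build_basis(p, mid) + build_basis(mid + 1, q)
    return basis

def active_at(basis, v):
    """The specialists that do not abstain on vertex v."""
    return [(l, r, y) for (l, r, y) in basis if l <= v <= r]
```

For n = 8 the basis contains 15 intervals (30 specialists), and every vertex lies in exactly log_2(8) + 1 = 4 of them, i.e., 8 active specialists.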
In those works, however, the "binary tree" structure is over the time dimension (trial sequence), whereas in this work the binary tree is over the space dimension (graph) and a fixed-share update is used to obtain adaptivity over the time dimension.²

In the corollary that follows we exploit the fact that by making the algorithm conservative we may reduce the usual log T term in the mistake bound induced by a fixed-share update to log log T. A conservative algorithm only updates the specialists' weights on trials on which a mistake is made. Furthermore, the bound given in the following corollary is smooth, as the cost per switch will be measured with a Hamming-like divergence H on the "cut" edges between successive labelings, defined as

H(u, u′) := Σ_{(i,j)∈E_S} [ ([u_i ≠ u_j] ∨ [u′_i ≠ u′_j]) ∧ ([u_i ≠ u′_i] ∨ [u_j ≠ u′_j]) ] .

²An interesting open problem is to try to find good bounds and time-complexity with sets of specialists over both the time and space dimensions.

Observe that H(u, u′) is smaller than twice the Hamming distance between u and u′, and is often significantly smaller. To achieve the bounds we will need the following proposition, which upper bounds the divergence J by H; a subtlety is that there are many distinct sets of specialists consistent with a given comparator. For example, consider a uniform labeling on S. One may "cover" this labeling with a single specialist or alternatively with n specialists, one covering each vertex. For the sake of simplicity in the bounds we will always choose the smallest set of covering specialists. Thus we introduce the following formal definitions of consistency and minimal-consistency.

Definition 4. A comparator μ ∈ Δ_{|E|} is consistent with the labeling u ∈ {−1, 1}^n if μ is well-formed and μ_ε > 0 implies that for all v ∈ V, ε(v) ∈ {u_v, ∗}.

Definition 5.
A comparator μ ∈ Δ_{|E|} is minimal-consistent with the labeling u ∈ {−1, 1}^n if it is consistent with u and the cardinality of its support set {ε : μ_ε > 0} is the minimum over all comparators consistent with u.

Proposition 6. For a linearized graph S, for comparators μ, μ′ ∈ Δ_{|F_n|} that are minimal-consistent with u and u′ respectively,

J_{F_n}(μ, μ′) ≤ min(2H(u, u′), Φ_S(u′) + 1) .

A proof is given in Appendix C. In the following corollary we summarize the results of the SCS algorithm using the basis sets F_n and B_n with an optimally-tuned switching parameter α.

Corollary 7. For a connected n-vertex graph G with randomly sampled spine S, the number of mistakes made in predicting the online sequence (i_1, y_1), . . . , (i_T, y_T) by the SCS algorithm with optimally-tuned α is upper bounded with basis F_n by

O( Φ_1 log n + Σ_{i=1}^{|K|−1} H(u_{k_i}, u_{k_{i+1}}) (log n + log |K| + log log T) )

and with basis B_n by

O( ( Φ_1 log n + Σ_{i=1}^{|K|−1} H(u_{k_i}, u_{k_{i+1}}) (log n + log |K| + log log T) ) log n )

for any sequence of labelings u_1, . . . , u_T ∈ {−1, 1}^n such that u_{t,i_t} = y_t for all t ∈ [T].

Thus the bounds are equivalent up to a factor of log n, although the computation times vary dramatically. See Appendix D for a technical proof of these results, and details on the selection of the switching parameter α.

On the lower bound side, tight upper and lower bounds were proven for graph label prediction when the graph was a tree in [6]. We now give a sketch of a simple argument for a lower bound on the number of mistakes made when predicting a switching sequence of labelings on S. We first describe how introducing and removing cuts can force mistakes in the simplest case. Given a single graph-labeling problem on an unlabeled line graph S, an adversary may force Θ(log n) mistakes with a resultant cut-size Φ(u) = 1.
In the switching case, if S is uniformly labelled (Φ(u) = 0) and up to two cuts are introduced, then the learner can be forced to make O(log n) mistakes. On the other hand, if we have a cut-size of Φ(u′) = 2, an adversary can, when a "switch" occurs, force a single mistake with the outcome that the cut-size Φ(u″) ∈ {0, 1, 2}.

Now for a switching sequence of graph labelings u_1, . . . , u_T, let Φ(u_t) ≪ n for all t. For a labeling u, S can be divided into Φ(u) + 1 segments of length n/(Φ(u) + 1). Each segment can be made independent of the others by fixing the boundary vertices between segments. We therefore have Φ(u) + 1 independent learning problems, and an adversary can force Θ(log(n/Φ(u))) mistakes for every two cuts introduced and 1 mistake for every 2 cuts removed.

While the bounds in Corollary 7 reflect the smoothness of the sequence of labelings, we pay O(log n + log |K| + log log T) for every cut removed and introduced for basis set F_n, with an additional logarithmic factor for basis B_n.
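The Hamming-like divergence H(u, u′) defined above counts only spine edges that are both near a cut (in u or u′) and actually affected by the switch, so it can be evaluated in one pass over E_S. A sketch, with a function name of our own:

```python
def switching_divergence(spine_edges, u, u_next):
    """H(u, u'): spine edges that are cut in u or u' AND have an
    endpoint whose label changed between u and u'."""
    return sum(
        1
        for (i, j) in spine_edges
        if (u[i] != u[j] or u_next[i] != u_next[j])
        and (u[i] != u_next[i] or u[j] != u_next[j])
    )
```

Shifting a single cut one edge along a four-vertex spine gives H = 2, and when all vertices of a uniformly labeled region flip together, H charges only the affected boundary edges rather than the full Hamming distance.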
There is therefore an interesting gap between these bounds and the sketched lower bound, not least the log log T term, which we conjecture should be possible to remove.

Table 1: Mean error ± std over 25 iterations on a 404-vertex graph for all algorithms and benchmarks, and for all ensemble sizes of SCS-F and SCS-B.

                        Ensemble Size
Algorithm               1           3           5           9           17          33          65
SCS-F                   1947 ± 49   1597 ± 32   1475 ± 30   1364 ± 28   1293 ± 26   1247 ± 21   1218 ± 19
SCS-B                   1438 ± 32   1198 ± 27   1127 ± 25   1079 ± 24   1050 ± 23   1032 ± 22   1021 ± 18
Kernel Perceptron       3326 ± 43   -           -           -           -           -           -
Local                   3411 ± 55   -           -           -           -           -           -
Global                  4240 ± 44   -           -           -           -           -           -
Temporal (Local)        2733 ± 42   -           -           -           -           -           -
Temporal (Global)       3989 ± 44   -           -           -           -           -           -

Note that we may avoid the issue of needing to optimally tune α using the following method, proposed by [14] and by [28]: we use a time-varying parameter and on trial t we set α_t = 1/(t + 1). We have the following guarantee for this method; see Appendix E for a proof.

Proposition 8. For a connected n-vertex graph G with randomly sampled spine S, the SCS algorithm with bases F_n and B_n in predicting the online sequence (i_1, y_1), . . . , (i_T, y_T), now with time-varying α set equal to 1/(t + 1) on trial t, achieves the same asymptotic mistake bounds as in Corollary 7 with an optimally-tuned α, under the assumption that Φ_S(u_1) ≤ Σ_{i=1}^{|K|−1} J_E(μ_{k_i}, μ_{k_{i+1}}).

4 Experiments

In this section we present results of experiments on real data. The City of Chicago currently contains 608 public bicycle stations for its "Divvy Bike" sharing system.
Current and historical data are available from the City of Chicago³, containing a variety of features for each station, including latitude, longitude, number of docks, number of operational docks, and number of docks occupied. The latest data on each station are published approximately every ten minutes.

We used a sample of 72 hours of data, consisting of three consecutive weekdays in April 2019. The first 24 hours of data were used for parameter selection, and the remaining 48 hours were used for evaluating performance. From each ten-minute snapshot we took the percentage of empty docks at each station. We created a binary labeling from this data by setting a threshold of 50%. Thus each bicycle station is a vertex in our graph, and the label of each vertex indicates whether that station is 'mostly full' or 'mostly empty'. Due to this thresholding the labels of some 'quieter' stations were observed not to switch, as the percentage of available docks rarely changed. These stations tended to be on the 'outskirts', and we therefore excluded them from our experiments, giving 404 vertices in our graph.

Using the geodesic distance between the stations' latitude and longitude positions, a connected graph was built as the union of a k-nearest-neighbor graph (k = 3) and a minimum spanning tree. For each instance of our algorithm the graph was then transformed in the manner described in Section 2.2, by first drawing a spanning tree uniformly at random and then linearizing it using depth-first search.

As natural benchmarks for this setting we considered the following four methods. 1.) For all vertices predict with the most frequently occurring label of the entire graph from the training data ("Global"). 2.) For each vertex predict with its most frequently occurring label from the training data ("Local"). 3.)
For all vertices at any given time, predict with the most frequently occurring label of the entire graph at that time in the training data ("Temporal-Global"). 4.) For each vertex at any given time, predict with that vertex's label observed at the same time in the training data ("Temporal-Local"). We also compare our algorithms against a kernel Perceptron proposed by [18] for predicting switching graph labelings (see Appendix F for details).

Following the experiments of [7], in which ensembles of random spanning trees were drawn and aggregated by an unweighted majority vote, we tested the effect on performance of using ensembles of instances of our algorithms, aggregated in the same fashion. We tested ensemble sizes in {1, 3, 5, 9, 17, 33, 65}, using odd numbers to avoid ties.

For every ten-minute snapshot (labeling) we queried 30 vertices uniformly at random (with replacement) in an online fashion, giving a sequence of 8640 trials over 48 hours.

³ https://data.cityofchicago.org/Transportation/Divvy-Bicycle-Stations-Historical/eq45-8inv

Figure 1: Left: Mean cumulative mistakes over 25 iterations for all algorithms and benchmarks over 48 hours (8640 trials) on a 404-vertex graph. A comparison of the mean performance of SCS with bases Fn and Bn (SCS-F and SCS-B respectively) using ensembles of size 1 and 65 is shown. Right: An example of two binary labelings taken from the morning and evening of the first 24 hours of data. An 'orange' label indicates that a station is < 50% full and a 'black' label indicates that it is ≥ 50% full.

The average performance over 25 iterations is shown in Figure 1. There are several surprising observations to be made from our results. Firstly, both SCS algorithms performed significantly better than all benchmarks and competing algorithms.
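The ensemble aggregation used in these experiments is plain unweighted majority voting over the members' ±1 predictions. A minimal sketch (our own illustration; the function names and the callable-per-instance interface are assumptions, not the paper's code):

```python
def majority_vote(votes):
    # Unweighted majority over ±1 predictions; odd ensemble sizes avoid ties.
    return 1 if sum(votes) > 0 else -1

def ensemble_predict(instances, vertex):
    # `instances` stands in for independent copies of the predictor, each run
    # on its own random spanning tree; any list of callables vertex -> ±1 works.
    return majority_vote([predict(vertex) for predict in instances])
```

With an odd number of members the vote is never tied, which is why the ensemble sizes tested above are all odd.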
Additionally, basis Bn outperformed basis Fn by quite a large margin, despite having the weaker bound and being exponentially faster. Finally, we observed a significant increase in the performance of both SCS algorithms as the ensemble size increased (see Figure 1 and Table 1); additional details on these experiments and results for all ensemble sizes are given in Appendix G. Interestingly, when tuning α we found basis Bn to be very robust, while Fn was very sensitive. This observation, combined with the logarithmic per-trial time complexity, suggests that SCS with Bn has promise to be a very practical algorithm.

5 Conclusion

Our primary result was an algorithm for predicting switching graph labelings with a per-trial prediction time of O(log n) and a mistake bound that smoothly tracks changes to the graph labeling over time. In the long version of this paper we plan to extend the analysis of the primary algorithm to the expected regret setting by relaxing our simplifying assumption of the well-formed comparator sequence that is minimal-consistent with the labeling sequence. From a technical perspective, the open problem we found most intriguing is to eliminate the log log T term from our bounds. The natural approach would be to replace the conservative fixed-share update with a variable-share update [21]; in our efforts, however, we encountered many technical problems with this approach. On the more practical and speculative side, we observe that the specialist sets Bn and Fn were chosen to "prove bounds". In practice we can use any hierarchical graph clustering algorithm to produce a complete specialist set, and furthermore multiple such clusterings may be pooled. Such a pooled set of subgraph "motifs" could then be used, for example, in a multi-task setting (see, for example, [27]).

References

[1] D. Adamskiy, W. M. Koolen, A. Chernov, and V. Vovk.
A closer look at adaptive regret. In Proceedings of the 23rd International Conference on Algorithmic Learning Theory, ALT '12, pages 290–304, 2012.

[2] M. Belkin and P. Niyogi. Semi-supervised learning on Riemannian manifolds. Machine Learning, 56(1-3):209–239, 2004.

[3] A. Blum and S. Chawla. Learning from labeled and unlabeled data using graph mincuts. In Proceedings of the Eighteenth International Conference on Machine Learning, ICML '01, pages 19–26, 2001.

[4] O. Bousquet and M. K. Warmuth. Tracking a small set of experts by mixing past posteriors. Journal of Machine Learning Research, 3(Nov):363–396, 2002.

[5] N. Cesa-Bianchi, P. Gaillard, G. Lugosi, and G. Stoltz. Mirror descent meets fixed share (and feels no regret). In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS '12, pages 980–988, 2012.

[6] N. Cesa-Bianchi, C. Gentile, and F. Vitale. Fast and optimal prediction on a labeled tree. In Proceedings of the 22nd Annual Conference on Learning Theory, pages 145–156. Omnipress, 2009.

[7] N. Cesa-Bianchi, C. Gentile, F. Vitale, and G. Zappella. Random spanning trees and the prediction of weighted graphs. Journal of Machine Learning Research, 14(1):1251–1284, 2013.

[8] N. Cesa-Bianchi and G. Lugosi. Prediction, Learning, and Games. Cambridge University Press, New York, NY, USA, 2006.

[9] A. Daniely, A. Gonen, and S. Shalev-Shwartz. Strongly adaptive online learning. In Proceedings of the 32nd International Conference on Machine Learning, ICML '15, pages 1405–1411, 2015.

[10] Y. Freund. Private communication, 2000. Also posted on http://www.learning-theory.org.

[11] Y. Freund, R. E. Schapire, Y. Singer, and M. K. Warmuth. Using and combining predictors that specialize.
In Proceedings of the Twenty-Ninth Annual ACM Symposium on Theory of Computing, STOC '97, pages 334–343, 1997.

[12] A. Gyorgy, T. Linder, and G. Lugosi. Efficient tracking of large classes of experts. IEEE Transactions on Information Theory, 58(11):6709–6725, Nov 2012.

[13] E. Hazan and C. Seshadhri. Adaptive algorithms for online decision problems. Electronic Colloquium on Computational Complexity (ECCC), 14(088), 2007.

[14] M. Herbster. Tracking the best expert II. Unpublished manuscript, 1997.

[15] M. Herbster and G. Lever. Predicting the labelling of a graph via minimum p-seminorm interpolation. In COLT 2009 - The 22nd Conference on Learning Theory, 2009.

[16] M. Herbster, G. Lever, and M. Pontil. Online prediction on large diameter graphs. In Proceedings of the 21st International Conference on Neural Information Processing Systems, NIPS '08, pages 649–656, 2008.

[17] M. Herbster, S. Pasteris, and S. Ghosh. Online prediction at the limit of zero temperature. In Proceedings of the 28th International Conference on Neural Information Processing Systems - Volume 2, NIPS '15, pages 2935–2943, 2015.

[18] M. Herbster, S. Pasteris, and M. Pontil. Predicting a switching sequence of graph labelings. Journal of Machine Learning Research, 16(1):2003–2022, 2015.

[19] M. Herbster and M. Pontil. Prediction on a graph with a perceptron. In Proceedings of the 19th International Conference on Neural Information Processing Systems, NIPS '06, pages 577–584, 2006.

[20] M. Herbster, M. Pontil, and L. Wainer. Online learning over graphs. In Proceedings of the 22nd International Conference on Machine Learning, ICML '05, pages 305–312, 2005.

[21] M. Herbster and M. Warmuth. Tracking the best expert. Machine Learning, 32(2):151–178, 1998.

[22] M. Herbster and M. K. Warmuth. Tracking the best linear predictor.
Journal of Machine Learning Research, 1:281–309, 2001.

[23] K. Jun, F. Orabona, S. Wright, and R. Willett. Improved strongly adaptive online learning using coin betting. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, volume 54 of Proceedings of Machine Learning Research, pages 943–951. PMLR, 2017.

[24] B. S. Kerner. Experimental features of self-organization in traffic flow. Physical Review Letters, 81:3797–3800, 1998.

[25] J. Kivinen, A. Smola, and R. Williamson. Online learning with kernels. IEEE Transactions on Signal Processing, 52(8):2165–2176, 2004.

[26] D. J. Klein and M. Randić. Resistance distance. Journal of Mathematical Chemistry, 12(1):81–95, 1993.

[27] W. M. Koolen, D. Adamskiy, and M. K. Warmuth. Putting Bayes to sleep. In Proceedings of the 25th International Conference on Neural Information Processing Systems - Volume 1, NIPS '12, pages 135–143, 2012.

[28] W. M. Koolen and S. de Rooij. Combining expert advice efficiently. In 21st Annual Conference on Learning Theory - COLT 2008, pages 275–286, 2008.

[29] N. Littlestone and M. K. Warmuth. The weighted majority algorithm. Information and Computation, 108(2):212–261, 1994.

[30] R. Lyons and Y. Peres. Probability on Trees and Networks. Cambridge University Press, New York, NY, USA, 1st edition, 2017.

[31] O. H. M. Padilla, J. Sharpnack, J. G. Scott, and R. J. Tibshirani. The DFS fused lasso: Linear-time denoising over general graphs. Journal of Machine Learning Research, 18(1):1–36, 2018.

[32] A. Rakhlin and K. Sridharan. Efficient online multiclass prediction on graphs via surrogate losses. In Proceedings of the 20th International Conference on Artificial Intelligence and Statistics, AISTATS 2017, pages 1403–1411, 2017.

[33] J. Veness, M. White, M. Bowling, and A. György. Partition tree weighting.
In Data Compression Conference, pages 321–330. IEEE, 2013.

[34] F. Vitale, N. Cesa-Bianchi, C. Gentile, and G. Zappella. See the tree through the lines: The Shazoo algorithm. In Advances in Neural Information Processing Systems 23, pages 1584–1592, 2011.

[35] V. Vovk. Aggregating strategies. In Proceedings of the Third Annual Workshop on Computational Learning Theory, COLT '90, pages 371–386, 1990.

[36] V. Vovk. Derandomizing stochastic prediction strategies. Machine Learning, 35(3):247–282, 1999.

[37] D. B. Wilson. Generating random spanning trees more quickly than the cover time. In Proceedings of the Twenty-Eighth Annual ACM Symposium on Theory of Computing, STOC '96, pages 296–303, 1996.

[38] D. Zhou, O. Bousquet, T. N. Lal, J. Weston, and B. Schölkopf. Learning with local and global consistency. In Proceedings of the 16th International Conference on Neural Information Processing Systems, NIPS '03, pages 321–328, 2003.

[39] X. Zhu, Z. Ghahramani, and J. D. Lafferty. Semi-supervised learning using Gaussian fields and harmonic functions. In Proceedings of the Twentieth International Conference on Machine Learning, ICML '03, pages 912–919, 2003.