{"title": "Large-Scale Multiclass Transduction", "book": "Advances in Neural Information Processing Systems", "page_first": 411, "page_last": 418, "abstract": "", "full_text": "Large-Scale Multiclass Transduction\n\nFraunhofer AIS.KD, 53754 Sankt Augustin, Thomas.Gaertner@ais.fraunhofer.de\n\nThomas G\u00a8artner\n\nQuoc V. Le, Simon Burton, Alex J. Smola, Vishy Vishwanathan\n\nStatistical Machine Learning Program, NICTA and ANU, Canberra, ACT\n{Quoc.Le, Simon.Burton, Alex.Smola, SVN.Vishwanathan}@nicta.com.au\n\nAbstract\n\nWe present a method for performing transductive inference on very large\ndatasets. Our algorithm is based on multiclass Gaussian processes and is\neffective whenever the multiplication of the kernel matrix or its inverse\nwith a vector can be computed suf\ufb01ciently fast. This holds, for instance,\nfor certain graph and string kernels. Transduction is achieved by varia-\ntional inference over the unlabeled data subject to a balancing constraint.\n\n1 Introduction\n\nWhile obtaining labeled data remains a time and labor consuming task, acquisition and\nstorage of unlabelled data is becoming increasingly cheap and easy. This development\nhas driven machine learning research into exploring algorithms that make extensive use of\nunlabelled data at training time in order to obtain better generalization performance.\n\nA common problem of many transductive approaches is that they scale badly with the\namount of unlabeled data, which prohibits the use of massive sets of unlabeled data. Our\nalgorithm shows improved scaling behavior, both for standard Gaussian Process classi\ufb01ca-\ntion and transduction. We perform classi\ufb01cation on a dataset consisting of a digraph with\n75, 888 vertices and 508, 960 edges. To the best of our knowledge it has so far not been\npossible to perform transduction on graphs of this size in reasonable time (with standard\nhardware). 
On standard data our method shows competitive or better performance.\n\nExisting transductive approaches for SVMs use nonlinear programming [2] or EM-style iterations for binary classification [4]. Moreover, on graphs various methods for unsupervised learning have been proposed [12, 11], all of which are mainly concerned with computing the kernel matrix on training and test set jointly. Other formulations impose that the label assignment on the test set be consistent with the assumption of confident classification [8]. Yet others impose that training and test set have similar marginal distributions [4].\n\nThe present paper uses all three properties. It is particularly efficient whenever Kα or K^{-1}α can be computed in linear time, where K ∈ R^{m×m} is the kernel matrix and α ∈ R^m.\n\n• We require consistency of training and test marginals. This avoids problems with overly large majority classes and small training sets.\n\n• Kernels (or their inverses) are computed on training and test set simultaneously. On graphs this can lead to considerable computational savings.\n\n• Self-consistency of the estimates is achieved by a variational approach. This allows us to make use of Gaussian process multiclass formulations.\n\n2 Multiclass Classification\n\nWe begin with a brief overview of Gaussian process multiclass classification [10], recast in terms of exponential families. Denote by X × Y with Y = {1..n} the domain of observations and labels. Moreover let X := {x_1, ..., x_m} and Y := {y_1, ..., y_m} be the set of observations and labels. It is our goal to estimate y|x via\n\np(y|x, θ) = exp(⟨φ(x, y), θ⟩ - g(θ|x)) where g(θ|x) = log Σ_{y∈Y} exp(⟨φ(x, y), θ⟩).   (1)\n\nHere φ(x, y) are the joint sufficient statistics of x and y, and g(θ|x) is the log-partition function which takes care of the normalization. 
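As a minimal numerical sketch of (1): given a finite vector of per-class scores s_y = ⟨φ(x, y), θ⟩, the conditional distribution is a softmax, and subtracting the maximum score before exponentiating keeps the log-partition function stable. The function and variable names below are ours, not the paper's.

```python
import numpy as np

def log_partition(scores):
    # g(theta | x) = log sum_y exp(<phi(x, y), theta>), computed stably
    # by shifting with the maximum score before exponentiating.
    m = scores.max()
    return m + np.log(np.exp(scores - m).sum())

def class_probs(scores):
    # p(y | x, theta) = exp(<phi(x, y), theta> - g(theta | x)), eq. (1)
    return np.exp(scores - log_partition(scores))

probs = class_probs(np.array([2.0, 0.5, -1.0]))
```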
We impose a normal prior on θ, leading to the following negative joint likelihood in θ and Y:\n\nP := -log p(θ, Y|X) = Σ_{i=1}^m [g(θ|x_i) - ⟨φ(x_i, y_i), θ⟩] + (1/2σ²) ‖θ‖² + const.   (2)\n\nFor transduction purposes p(θ, Y|X) will prove more useful than p(θ|Y, X). Note that a normal prior on θ with variance σ²1 implies a Gaussian process on the random variable t(x, y) := ⟨φ(x, y), θ⟩ with covariance kernel\n\nCov[t(x, y), t(x', y')] = σ² ⟨φ(x, y), φ(x', y')⟩ =: σ² k((x, y), (x', y')).   (3)\n\nParametric Optimization Problem In the following we assume isotropy among the class labels, that is ⟨φ(x, y), φ(x', y')⟩ = δ_{y,y'} ⟨φ(x), φ(x')⟩ (this is not a necessary requirement for the efficiency of our algorithm, but it greatly simplifies the presentation). This allows us to decompose θ into θ_1, ..., θ_n such that\n\n⟨φ(x, y), θ⟩ = ⟨φ(x), θ_y⟩ and ‖θ‖² = Σ_{y=1}^n ‖θ_y‖².   (4)\n\nApplying the representer theorem allows us to expand θ in terms of φ(x_i, y) as θ = Σ_{i=1}^m Σ_{y=1}^n α_{iy} φ(x_i, y). In conjunction with (4) we have\n\nθ_y = Σ_{i=1}^m α_{iy} φ(x_i) where α ∈ R^{m×n}.   (5)\n\nLet μ ∈ R^{m×n} with μ_{ij} = 1 if y_i = j and μ_{ij} = 0 otherwise, and K ∈ R^{m×m} with K_{ij} = ⟨φ(x_i), φ(x_j)⟩. Rewriting the joint log-likelihood (2) in terms of α and K yields\n\nΣ_{i=1}^m log Σ_{y=1}^n exp([Kα]_{iy}) - tr μ^T Kα + (1/2σ²) tr α^T Kα + const.   (6)\n\nEquivalently we could expand (2) in terms of t := Kα. 
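The negative joint log-likelihood (6) can be written down directly in a few lines. The sketch below assumes dense numpy arrays and is only meant to make the three terms of (6) concrete; the function and variable names are ours.

```python
import numpy as np

def neg_log_joint(alpha, K, mu, sigma2):
    """Objective (6): sum_i log sum_y exp([K alpha]_iy)
    - tr(mu^T K alpha) + tr(alpha^T K alpha) / (2 sigma^2)."""
    t = K @ alpha                                  # t = K alpha, shape (m, n)
    shift = t.max(axis=1, keepdims=True)           # stabilize the log-sum-exp
    lse = (shift[:, 0] + np.log(np.exp(t - shift).sum(axis=1))).sum()
    return lse - np.trace(mu.T @ t) + np.trace(alpha.T @ t) / (2.0 * sigma2)

rng = np.random.default_rng(0)
m, n = 5, 3
A = rng.normal(size=(m, m))
K = A @ A.T + np.eye(m)                            # a symmetric PSD kernel matrix
alpha = rng.normal(size=(m, n))
mu = np.eye(n)[rng.integers(0, n, size=m)]         # one-hot label matrix
val = neg_log_joint(alpha, K, mu, 1.0)
```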
This is commonly done in the Gaussian process literature and we will use both formulations, depending on the problem we need to solve: if Kα can be computed efficiently, as is the case with string kernels [9], we use the α-parameterization. Conversely, if K^{-1}α is cheap, as for example with graph kernels [7], we use the t-parameterization.\n\nDerivatives Second order methods such as Conjugate Gradient require the computation of derivatives of -log p(θ, Y|X) with respect to θ in terms of α or t. Using the shorthand π ∈ R^{m×n} with π_{ij} := p(y = j|x_i, θ) we have\n\n∂_α P = K(π - μ + σ^{-2}α) and ∂_t P = π - μ + σ^{-2} K^{-1} t.   (7)\n\nTo avoid spelling out tensors of fourth order for the second derivatives (since α ∈ R^{m×n}) we state the action of the latter as bilinear forms on vectors β, γ, u, v ∈ R^{m×n}. For convenience we use the 'Matlab' notation '.*' to denote element-wise multiplication of matrices:\n\n∂²_α P[β, γ] = tr (Kγ)^T (π .* (Kβ)) - tr (π .* (Kγ))^T (π .* (Kβ)) + σ^{-2} tr γ^T Kβ   (8a)\n∂²_t P[u, v] = tr u^T (π .* v) - tr (π .* u)^T (π .* v) + σ^{-2} tr u^T K^{-1} v.   (8b)\n\nLet L · n be the computational time required to compute Kα and K^{-1}t respectively. One may check that L = O(m) implies that each conjugate gradient (CG) descent step can be performed in O(m) time. 
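The first derivative in (7), ∂_α P = K(π - μ + σ^{-2}α), is easy to validate against a finite-difference approximation of the objective (6). The following self-contained sketch (our names, dense toy data) does exactly that.

```python
import numpy as np

def objective(alpha, K, mu, sigma2):
    # Negative joint log-likelihood (6), up to an additive constant.
    t = K @ alpha
    s = t.max(axis=1, keepdims=True)
    lse = (s[:, 0] + np.log(np.exp(t - s).sum(axis=1))).sum()
    return lse - np.trace(mu.T @ t) + np.trace(alpha.T @ t) / (2.0 * sigma2)

def grad_alpha(alpha, K, mu, sigma2):
    # Eq. (7): dP/dalpha = K (pi - mu + alpha / sigma^2), where
    # pi_ij = p(y = j | x_i, theta) is the row-wise softmax of K alpha.
    t = K @ alpha
    e = np.exp(t - t.max(axis=1, keepdims=True))
    pi = e / e.sum(axis=1, keepdims=True)
    return K @ (pi - mu + alpha / sigma2)

rng = np.random.default_rng(1)
m, n, eps = 4, 3, 1e-6
A = rng.normal(size=(m, m))
K = A @ A.T + np.eye(m)                       # symmetric kernel matrix
alpha = rng.normal(size=(m, n))
mu = np.eye(n)[rng.integers(0, n, size=m)]
g = grad_alpha(alpha, K, mu, 1.0)
# central finite difference in a single coordinate
E = np.zeros_like(alpha); E[2, 1] = eps
fd = (objective(alpha + E, K, mu, 1.0) - objective(alpha - E, K, mu, 1.0)) / (2 * eps)
```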
Combining this with rates of convergence for Newton-type or nonlinear CG solver strategies yields overall time costs in the order of O(m log m) to O(m²) worst case, a significant improvement over conventional O(m³) methods.\n\n3 Transductive Inference by Variational Methods\n\nAs we are interested in transduction, the labels Y (and analogously the data X) decompose as Y = Y_train ∪ Y_test. To directly estimate p(Y_test|X, Y_train) we would need to integrate out θ, which is usually intractable. Instead, we now aim at estimating the mode of p(θ|X, Y_train) by variational means. With the KL-divergence D and an arbitrary distribution q the well-known bound (see e.g. [5])\n\n-log p(θ|X, Y_train) ≤ -log p(θ|X, Y_train) + D(q(Y_test) ‖ p(Y_test|X, Y_train, θ))   (9)\n= -Σ_{Y_test} (log p(Y_test, θ|X, Y_train) - log q(Y_test)) q(Y_test)   (10)\n\nholds. This bound (10) can be minimized with respect to θ and q in an iterative fashion. The key trick is that while using a factorizing approximation for q we restrict the latter to distributions which satisfy balancing constraints. That is, we require them to yield marginals on the unlabeled data which are comparable with the labeled observations.\n\nDecomposing the Variational Bound To simplify (10) observe that\n\np(Y_test, θ|X, Y_train) = p(Y_train, Y_test, θ|X) / p(Y_train|X).   (11)\n\nIn other words, the first term in (10) equals (6) up to a constant independent of θ or Y_test. With q_{ij} := q(y_i = j) we define μ_{ij}(q) = q_{ij} for all i > m_train, and μ_{ij}(q) = 1 if y_i = j and 0 otherwise for all i ≤ m_train. In other words, we are taking the expectation in μ over all unobserved labels Y_test with respect to the distribution q(Y_test). 
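Constructing μ(q) - one-hot rows for the labeled points, the current q for the unlabeled ones - is mechanical; a small sketch (our names) for concreteness:

```python
import numpy as np

def mu_of_q(y_train, q_test, n):
    # mu(q): for i <= m_train, mu_ij = 1 iff y_i = j (observed labels);
    # for i > m_train, mu_ij = q_ij (expectation under q over Y_test).
    m_train = len(y_train)
    mu_train = np.zeros((m_train, n))
    mu_train[np.arange(m_train), y_train] = 1.0
    return np.vstack([mu_train, q_test])

mu = mu_of_q(np.array([0, 2, 1]), np.full((2, 3), 1.0 / 3.0), 3)
```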
We have\n\n-Σ_{Y_test} q(Y_test) log p(Y_test, θ|X, Y_train) = Σ_{i=1}^m log Σ_{j=1}^n exp([Kα]_{ij}) - tr μ(q)^T Kα + (1/2σ²) tr α^T Kα + const.   (12)\n\nFor fixed q the optimization over θ proceeds as in Section 2. Next we discuss q.\n\nOptimization over q The second term in (10) is the negative entropy of q. Since q factorizes we have\n\nΣ_{Y_test} q(Y_test) log q(Y_test) = Σ_{i=m_train+1}^m Σ_{j=1}^n q_{ij} log q_{ij}.   (13)\n\nIt is unreasonable to assume that q may be chosen freely from all factorizing distributions (the latter would lead to a straightforward EM algorithm for transductive inference): if we observe a certain distribution of labels on the training set, e.g., for binary classification we see 45% positive and 55% negative labels, then it is very unlikely that the label distribution on the test set deviates significantly. Hence we should make use of this information.\n\nIf m ≫ m_train, however, a naive application of the variational bound can lead to cases where q is concentrated on one class: the increase in likelihood for a resulting very simple classifier completely outweighs any balancing constraints implicit in the data. This is confirmed by experimental results. It is, incidentally, also the reason why SVM transduction optimization codes [4] impose a balancing constraint on the assignment of test labels. We impose the following conditions:\n\nr^-_j ≤ Σ_{i=m_train+1}^m q_{ij} ≤ r^+_j for all j ∈ Y and Σ_{j=1}^n q_{ij} = 1 for all i ∈ {m_train+1, ..., m}.\n\nHere the constraints r^-_j = p_emp(y = j) - ε and r^+_j = p_emp(y = j) + ε are chosen such as to correspond to confidence intervals given by finite sample size tail bounds. In other words, we set p_emp(y = j) = m_train^{-1} Σ_{i=1}^{m_train} 1{y_i = j} and choose ε such as to satisfy\n\nPr(|m_train^{-1} Σ_{i=1}^{m_train} ξ_i - m_test^{-1} Σ_{i=1}^{m_test} ξ'_i| > ε) ≤ δ   (14)\n\nfor iid {0, 1} random variables ξ_i and ξ'_i with mean p. This is a standard ghost-sample inequality. It follows directly from [3, Eq. (2.7)], after application of a union bound over the class labels, that ε ≤ sqrt(log(2n/δ) m / (2 m_train m_test)) suffices.\n\n4 Graphs, Strings and Vectors\n\nWe now discuss the two main applications where computational savings can be achieved: graphs and strings. In the case of graphs, the advantage arises from the fact that K^{-1} is sparse, whereas for texts we can use fast string kernels [9] to compute Kα in linear time.\n\nGraphs Denote by G(V, E) the graph given by vertices V and edges E, where each edge is a set of two vertices. Then W ∈ R^{|V|×|V|} denotes the adjacency matrix of the graph, where W_{ij} > 0 only if edge {i, j} ∈ E. We assume that the graph G, and thus also the adjacency matrix W, is sparse. Now denote by 1 the identity matrix and by D the diagonal matrix of vertex degrees, i.e., D_{ii} = Σ_j W_{ij}. Then the graph Laplacian and the normalized graph Laplacian of G are given by\n\nL := D - W and L~ := 1 - D^{-1/2} W D^{-1/2},   (15)\n\nrespectively. Many kernels K (or their inverses) on G are given by low-degree polynomials of the Laplacian or the adjacency matrix of G, such as the following:\n\nK = Σ_{i=1}^l c_i W^{2i}, K = Π_{i=1}^l (1 - c_i L~), or K^{-1} = L~ + ε1.   (16)\n\nIn all three cases we assumed c_i, ε ≥ 0 and l ∈ N. 
The first kernel arises from an l-step random walk, and the third case is typically referred to as the regularized graph Laplacian. In these cases Kα or K^{-1}t can be computed using L = l(|V| + |E|) operations. This means that if the average degree of the graph does not increase with the number of observations, L = O(m), as m = |V| for inference on graphs.\n\nFrom Graphs to Graphical Models Graphs are one of the examples where transduction actually improves computational cost: Assume that we are given the inverse kernel matrix K^{-1} on training and test set and we wish to perform induction only. In this case we need to compute the kernel matrix (or its inverse) restricted to the training set. Let K^{-1} = [A, B; B^T, C], where A corresponds to the training set. Then the upper left hand corner of K (representing the training set part only) is given by the inverse Schur complement (A - B C^{-1} B^T)^{-1}. Computing the latter is costly. Moreover, neither the Schur complement nor its inverse is typically sparse.\n\nHere we have a nice connection between graphical models and graph kernels. Assume that t is a normal random variable with conditional independence properties. In this case the inverse covariance matrix has nonzero entries only for variables with a direct dependency structure. This follows directly from an application of the Hammersley-Clifford theorem to Gaussian random variables [6]. In other words, if we are given a graphical model of normal random variables, their conditional independence structure is reflected by K^{-1}.\n\nIn the same way as marginalization in graphical models may induce dependencies, computing the kernel matrix on the training set only may lead to dense matrices, even when the inverse kernel on training and test data combined is sparse. 
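The fill-in effect is easy to see numerically: take a chain graph, whose regularized Laplacian K^{-1} is tridiagonal (sparse), and form the training-set kernel via the Schur complement; the resulting block is fully dense. A small sketch (our toy setup), with the blocks named as in the text:

```python
import numpy as np

m, m_train = 8, 4
# K^{-1} for a chain graph: regularized Laplacian, hence tridiagonal.
Kinv = 2.5 * np.eye(m) - np.eye(m, k=1) - np.eye(m, k=-1)
A = Kinv[:m_train, :m_train]          # training-training block
B = Kinv[:m_train, m_train:]          # training-test block
C = Kinv[m_train:, m_train:]          # test-test block
# Training-set part of K = inverse of the Schur complement A - B C^{-1} B^T.
K_train = np.linalg.inv(A - B @ np.linalg.inv(C) @ B.T)
```

Even though Kinv has only three nonzero diagonals, every entry of K_train is nonzero, which is exactly why working with the sparse K^{-1} on training and test set jointly can be cheaper.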
The bottom line is that there are cases where it is computationally cheaper to take both training and test set into account and optimize over a larger set of variables, rather than dealing with a smaller dense matrix.\n\nStrings: Efficient computation of string kernels using suffix trees was described in [9]. In particular, it was observed that expansions of the form Σ_{i=1}^m α_i k(x_i, x) can be evaluated in linear time in the length of x, provided some preprocessing for the coefficients α and observations x_i is performed. This preprocessing is independent of x and can be computed in O(Σ_i |x_i|) time. The efficient computation scheme covers all kernels of type\n\nk(x, x') = Σ_s w_s #_s(x) #_s(x')   (17)\n\nfor arbitrary w_s ≥ 0. Here, #_s(x) denotes the number of occurrences of s in x and the sum is carried out over all substrings of x. This means that the computation time for evaluating Kα is again O(Σ_i |x_i|), as we need to evaluate the kernel expansion for all x ∈ X. Since the average string length is independent of m, this yields an O(m) algorithm for Kα.\n\nVectors: If k(x, x') = φ(x)^T φ(x') and φ(x) ∈ R^d for d ≪ m, it is possible to carry out matrix-vector multiplications in O(md) time. This is useful for cases where we have a sparse matrix with a small number of low-rank updates (e.g. from low-rank dense fill-ins).\n\n5 Optimization\n\nOptimization in α and t: P is convex in α (and in t since t = Kα). 
This means that a combination of Conjugate Gradient and Newton-Raphson (NR) can be used for optimization:\n\n• Compute updates α ← α - η (∂²_α P)^{-1} ∂_α P via\n  - Solve the linear system approximately by Conjugate Gradient iterations.\n  - Find the optimal η by line search.\n• Repeat until the norm of the gradient is sufficiently small.\n\nKey is the fact that the arising linear system is only solved approximately, which can be done using very few CG iterations. Since each of them is O(m) for fast kernel-vector computations, the overall cost is a sub-quadratic function of m.\n\nOptimization in q is somewhat less straightforward: we need to find the optimal q in terms of KL-divergence subject to the marginal constraint. Denote by τ the part of Kα pertaining to test data, or more formally τ ∈ R^{m_test×n} with τ_{ij} = [Kα]_{i+m_train,j}. We have:\n\nminimize_q  tr q^T τ + Σ_{i,j} q_{ij} log q_{ij}\nsubject to  q^-_j ≤ Σ_i q_{ij} ≤ q^+_j and q_{ij} ≥ 0 for all j ∈ Y, and Σ_j q_{lj} = 1 for all l ∈ {1..m_test}.   (18)\n\nTable 1: Error rates on some benchmark datasets (mostly from UCI). The last column gives the error rates reported in [1].\n\nDATASET          #INST  #ATTR  IND. GP      TRANSD. GP   S3VM-MIP\ncancer             699      9   3.4%±4.1%    2.1%±4.7%    3.4%\ncancer (progn.)    569     30   6.1%±3.7%    6.0%±3.7%    3.3%\nheart (cleve.)     297     13  15.0%±5.6%   13.0%±6.3%   16.0%\nhousing            506     13   7.0%±1.0%    6.8%±0.9%   15.1%\nionosphere         351     34   8.6%±6.3%    6.1%±3.4%   10.6%\npima               769      8  19.6%±8.1%   17.6%±8.0%   22.2%\nsonar              208     60  10.5%±5.1%    8.6%±3.4%   21.9%\nglass              214     10  32.5%±7.1%   28.9%±7.5%   —\nwine               178     13   3.9%±0.7%    3.3%±0.6%   —\ntictactoe          958      9  19.4%±5.7%   15.6%±4.2%   —\ncmc               1473     10  20.5%±1.6%   17.3%±4.5%   —\nUSPS              9298    256   5.9%         4.8%        —(1)\n\nThis is a convex optimization problem. Using Lagrange multipliers one can show that q needs to satisfy q_{ij} = exp(-τ_{ij}) b_i c_j where b_i, c_j ≥ 0. Solving for Σ_j q_{ij} = 1 yields\n\nq_{ij} = exp(-τ_{ij}) c_j / Σ_{l=1}^n exp(-τ_{il}) c_l.\n\nThis means that instead of an optimization problem in m_test × n variables we now only need to optimize over n variables subject to 2n constraints.\n\nNote that the exact matching constraint, where q^+_j = q^-_j, amounts to a maximum likelihood problem for a shifted exponential family model where q_{ij} = exp(-τ_{ij} + γ_j - g_i(γ)) with g_i(γ) = log Σ_{l=1}^n exp(-τ_{il} + γ_l). It can be shown that the approximate matching problem is equivalent to a maximum a posteriori optimization problem using the norm dual to expectation constraints on q_{ij}. We are currently working on extending this setting.\n\nIn summary, the optimization now only depends on n variables. It can be solved by standard second order methods. As initialization we choose γ such that the per-class averages match the marginal constraint while ignoring the per-sample balance. After that, a small number of Newton steps suffices for optimization.\n\n6 Experiments\n\nUnfortunately, we are not aware of other multiclass transductive learning algorithms. To still be able to compare our approach to other transductive learning algorithms we performed experiments on some benchmark datasets. 
To investigate the performance of our algorithm in classifying vertices of a graph, we chose the WebKB dataset.\n\nBenchmark datasets Table 1 reports results on some benchmark datasets. To be able to compare the error rates of the transductive multiclass Gaussian process classifier proposed in this paper, we also report error rates from [2] and an inductive multiclass Gaussian process classifier. The reported error rates are for 10-fold crossvalidations. Parameters were chosen by crossvalidation inside the training folds.\n\nGraph Mining To illustrate the effectiveness of our approach on graphs we performed experiments on the well known WebKB dataset. This dataset consists of 8275 webpages classified into 7 classes. Each webpage contains textual content and/or links to other webpages. As we are using this dataset to evaluate our graph mining algorithm, we ignore the text on each webpage and consider the dataset as a labelled directed graph. To have the dataset as large as possible, we did not remove any webpages, as opposed to most other work.\n\n(1) In [2] only subsets of USPS were considered due to the size of this problem.\n\nTable 2: Results on WebKB for 'inverse' 10-fold crossvalidation\n\nDATASET       |V|    |E|   ERROR\nCornell        867   1793   10%\nTexas          827   1683    8%\nWashington    1205   2368   10%\nWisconsin     1263   3678   15%\nMisc          4113   4462   66%\nall           8275  14370   53%\nUniversities  4162   9591   12%\n\nTable 2 reports the results of our algorithm on different subsets of the WebKB data as well as on the full data. We use the co-linkage graph and report results for 'inverse' 10-fold stratified crossvalidations, i.e., we use 1 fold as training data and 9 folds as test data. Parameters are the same for all reported experiments and were found by experimenting with a few parameter sets on the 'Cornell' subset only. 
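The 'inverse' crossvalidation protocol (train on one stratified fold, test on the remaining nine) can be sketched as follows; this is our own illustrative helper, not code from the paper.

```python
import numpy as np

def inverse_cv_splits(labels, n_folds=10, seed=0):
    # Stratified folds: within each class, spread indices evenly over folds,
    # then train on ONE fold and test on the other n_folds - 1 folds.
    rng = np.random.default_rng(seed)
    fold = np.empty(len(labels), dtype=int)
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        fold[idx] = np.arange(len(idx)) % n_folds
    for k in range(n_folds):
        yield np.flatnonzero(fold == k), np.flatnonzero(fold != k)

labels = np.array([0] * 50 + [1] * 50)
splits = list(inverse_cv_splits(labels))
```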
It turned out that the class membership probabilities are not well-calibrated on this dataset. To overcome this, we predict on the test set as follows: for each class, the instances that are most likely to be in this class are picked (if they have not been picked for a class with lower index) such that the fraction of instances assigned to this class is the same on the training and test set. We will investigate the reason for this in future work.\n\nThe setting most similar to ours is probably the one described in [11]. Although a directed graph approach outperforms an undirected approach there, we resorted to kernels for undirected graphs, as those are computationally more attractive. We will investigate computationally attractive digraph kernels in future work and expect similar benefits as reported by [11]. Though we are using more training data than [11], we are also considering a more difficult learning problem (multiclass without removing various instances). To investigate the behaviour of our algorithm with less training data, we performed a 20-fold inverse crossvalidation on the 'Wisconsin' subset and observed an error rate of 17% there.\n\nTo further strengthen our results and show that the runtime performance of our algorithm is sufficient for classifying the vertices of massive graphs, we also performed initial experiments on the Epinions dataset collected by Matthew Richardson and Pedro Domingos. The dataset is a social network consisting of 75,888 people connected by 508,960 'trust' edges. Additionally the dataset comes with a list of 185 'top reviewers' for 25 topic areas. We tried to predict these but only got 12% of the top reviewers correct. 
As we are not aware of any predictive results on this task, we suppose this low accuracy is inherent to the task. However, the experiments show that the algorithm can be run on very large graph datasets.\n\n7 Discussion and Extensions\n\nWe presented an efficient method for performing transduction on multiclass estimation problems with Gaussian processes. It performs particularly well whenever the kernel matrix has special numerical properties which allow fast matrix-vector multiplication. That said, also on standard dense problems we observed very good improvements (typically a 10% reduction of the training error) over standard induction.\n\nStructured Labels and Conditional Random Fields are a clear area in which to extend the transductive setting. The key obstacle to overcome in this context is to find a suitable marginal distribution: with increasing structure of the labels, the confidence bounds per subclass decrease dramatically. A promising strategy is to use only partial marginals on maximal cliques and to enforce them directly, similarly to an unconditional Markov network.\n\nApplications to Document Analysis require efficient small-memory-footprint suffix tree implementations. We are currently working on this, which will allow GP classification to perform estimation on large document collections. We believe it will be possible to use out-of-core storage in conjunction with annotation to work on sequences of 10^8 characters.\n\nOther Marginal Constraints than matching marginals are worth exploring. In particular, constraints derived from exchangeable distributions such as those used by Latent Dirichlet Allocation are a promising area to consider. This may also lead to connections between GP classification and clustering.\n\nSparse O(m^1.3) Solvers for Graphs have recently been proposed by the theoretical computer science community. 
It is worthwhile exploring their use for inference on graphs.\n\nAcknowledgements The authors thank Matthew Richardson and Pedro Domingos for collecting the Epinions data, and Deepayan Chakrabarti and Christos Faloutsos for providing a preprocessed version. Parts of this work were carried out while TG was visiting NICTA. National ICT Australia is funded through the Australian Government's Backing Australia's Ability initiative, in part through the Australian Research Council. This work was supported by grants of the ARC and by the Pascal Network of Excellence.\n\nReferences\n\n[1] K. Bennett. Combining support vector and mathematical programming methods for classification. In Advances in Kernel Methods - Support Vector Learning, pages 307-326. MIT Press, 1998.\n[2] K. Bennett. Combining support vector and mathematical programming methods for induction. In B. Schölkopf, C. J. C. Burges, and A. J. Smola, editors, Advances in Kernel Methods - SV Learning, pages 307-326, Cambridge, MA, 1999. MIT Press.\n[3] W. Hoeffding. Probability inequalities for sums of bounded random variables. Journal of the American Statistical Association, 58:13-30, 1963.\n[4] T. Joachims. Learning to Classify Text Using Support Vector Machines: Methods, Theory, and Algorithms. The Kluwer International Series in Engineering and Computer Science. Kluwer Academic Publishers, Boston, May 2002. ISBN 0-7923-7679-X.\n[5] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183-233, 1999.\n[6] S. L. Lauritzen. Graphical Models. Oxford University Press, 1996.\n[7] A. J. Smola and I. R. Kondor. Kernels and regularization on graphs. In B. Schölkopf and M. K. Warmuth, editors, Proceedings of the Annual Conference on Computational Learning Theory, Lecture Notes in Computer Science. Springer, 2003.\n[8] V. Vapnik. Statistical Learning Theory. John Wiley and Sons, New York, 1998.\n[9] S. V. N. Vishwanathan and A. J. Smola. Fast kernels for string and tree matching. In K. Tsuda, B. Schölkopf, and J.-P. Vert, editors, Kernels and Bioinformatics, Cambridge, MA, 2004. MIT Press.\n[10] C. K. I. Williams and D. Barber. Bayesian classification with Gaussian processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(12):1342-1351, 1998.\n[11] D. Zhou, J. Huang, and B. Schölkopf. Learning from labeled and unlabeled data on a directed graph. In International Conference on Machine Learning, 2005.\n[12] X. Zhu, J. Lafferty, and Z. Ghahramani. Semi-supervised learning using Gaussian fields and harmonic functions. In International Conference on Machine Learning ICML'03, 2003.\n", "award": [], "sourceid": 2881, "authors": [{"given_name": "Thomas", "family_name": "G\u00e4rtner", "institution": null}, {"given_name": "Quoc", "family_name": "Le", "institution": null}, {"given_name": "Simon", "family_name": "Burton", "institution": null}, {"given_name": "Alex", "family_name": "Smola", "institution": null}, {"given_name": "Vishy", "family_name": "Vishwanathan", "institution": null}]}