{"title": "Link Prediction in Graphs with Autoregressive Features", "book": "Advances in Neural Information Processing Systems", "page_first": 2834, "page_last": 2842, "abstract": "In this paper, we consider the problem of link prediction in time-evolving graphs. We assume that certain graph features, such as the node degree, follow a vector autoregressive (VAR) model and we propose to use this information to improve the accuracy of prediction. Our strategy involves a joint optimization procedure over the space of adjacency matrices and VAR matrices which takes into account both sparsity and low rank properties of the matrices. Oracle inequalities are derived and illustrate the trade-offs in the choice of smoothing parameters when modeling the joint effect of sparsity and low rank property. The estimate is computed efficiently using proximal methods through a generalized forward-backward algorithm.", "full_text": "Link Prediction in Graphs with Autoregressive Features\n\nEmile Richard\nCMLA UMR CNRS 8536,\nENS Cachan, France\n\nSt\u00e9phane Ga\u00efffas\nCMAP - Ecole Polytechnique\n& LSTA - Universit\u00e9 Paris 6\n\nNicolas Vayatis\nCMLA UMR CNRS 8536,\nENS Cachan, France\n\nAbstract\n\nIn this paper, we consider the problem of link prediction in time-evolving graphs. We assume that certain graph features, such as the node degree, follow a vector autoregressive (VAR) model and we propose to use this information to improve the accuracy of prediction. Our strategy involves a joint optimization procedure over the space of adjacency matrices and VAR matrices which takes into account both sparsity and low rank properties of the matrices. Oracle inequalities are derived and illustrate the trade-offs in the choice of smoothing parameters when modeling the joint effect of sparsity and low rank property. 
The estimate is computed efficiently using proximal methods through a generalized forward-backward algorithm.\n\n1 Introduction\n\nForecasting the behavior of systems with multiple responses has been a challenging issue in many application contexts such as collaborative filtering, financial markets, or bioinformatics, where responses can be, respectively, movie ratings, stock prices, or activity of genes within a cell. Statistical modeling techniques have been widely investigated in the context of multivariate time series either in the multiple linear regression setup [4] or with autoregressive models [23]. More recently, kernel-based regularized methods have been developed for multitask learning [7, 2]. These approaches share the use of the correlation structure among input variables to enrich the prediction on every single output. Often, the correlation structure is assumed to be given or it is estimated separately. A discrete encoding of correlations between variables can be modeled as a graph so that learning the dependence structure amounts to performing graph inference through the discovery of uncovered edges on the graph. The latter problem is interesting per se and it is known as the problem of link prediction where it is assumed that only a part of the graph is actually observed [15, 9]. This situation occurs in various applications such as recommender systems, social networks, or proteomics, and the appropriate tools can be found among matrix completion techniques [21, 5, 1]. In the realistic setup of a time-evolving graph, matrix completion was also used and adapted to take into account the dynamics of the features of the graph [18]. 
In this paper, we study the prediction problem where the observation is a sequence of graph adjacency matrices (At)0\u2264t\u2264T and the goal is to predict AT+1. This type of problem arises in applications such as recommender systems where, given information on purchases made by some users, one would like to predict future purchases. In this context, users and products can be modeled as the nodes of a bipartite graph, while purchases or clicks are modeled as edges. In functional genomics and systems biology, estimating regulatory networks in gene expression can be performed by modeling the data as graphs, and fitting predictive models is a natural way to estimate evolving networks in these contexts. A large variety of methods for link prediction only consider predicting from a single static snapshot of the graph; this includes heuristics [15, 20], matrix factorization [13], diffusion [16], or probabilistic methods [22]. More recently, some works have investigated using sequences of observations of the graph to improve the prediction, such as using regression on features extracted from the graphs [18], matrix factorization [14], or continuous-time regression [25]. Our main assumption is that the network effect is a cause and a symptom at the same time, and therefore, the edges and the graph features should be estimated simultaneously. We propose a regularized approach to predict the uncovered links and the evolution of the graph features simultaneously. We provide oracle bounds under the assumption that the noise sequence has subgaussian tails and we prove that our procedure achieves a trade-off in the calibration of smoothing parameters which adjust with the sparsity and the rank of the unknown adjacency matrix. The rest of this paper is organized as follows. 
In Section 2, we describe the general setup of our work with the main assumptions and we formulate a regularized optimization problem which aims at jointly estimating the autoregression parameters and predicting the graph. In Section 3, we provide technical results with oracle inequalities and other theoretical guarantees on the joint estimation-prediction. Section 4 is devoted to the description of the numerical simulations which illustrate our approach. We also provide an efficient algorithm for solving the optimization problem and show empirical results. The proofs of the theoretical results are provided as supplementary material in a separate document.\n\n2 Estimation of low-rank graphs with autoregressive features\n\nOur approach is based on the assumption that features can explain most of the information contained in the graph, and that these features are evolving with time. We make the following assumptions about the sequence (At)t\u22650 of adjacency matrices of the graph sequence.\n\nLow-Rank. We assume that the matrices At have low rank. This reflects the presence of highly connected groups of nodes such as communities in social networks, or product categories and groups of loyal/fan users in marketplace data, and is sometimes motivated by the small number of factors that explain node interactions.\n\nAutoregressive linear features. We assume that we are given a linear map \u03c9 : Rn\u00d7n \u2192 Rd defined by\n\n\\[ \\omega(A) = \\big( \\langle \\Omega_1, A \\rangle, \\cdots, \\langle \\Omega_d, A \\rangle \\big)^\\top, \\tag{1} \\]\n\nwhere (\u2126i)1\u2264i\u2264d is a set of n \u00d7 n matrices. These matrices can be either deterministic or random in our theoretical analysis, but we take them deterministic for the sake of simplicity. 
The vector time series (\u03c9(At))t\u22650 has autoregressive dynamics, given by a VAR (Vector Auto-Regressive) model:\n\n\\[ \\omega(A_{t+1}) = W_0^\\top \\omega(A_t) + N_{t+1}, \\tag{2} \\]\n\nwhere W0 \u2208 Rd\u00d7d is an unknown sparse matrix and (Nt)t\u22650 is a sequence of noise vectors in Rd. An example of linear features is the degree (i.e. number of edges connected to each node, or the sum of their weights if the edges are weighted), which is a measure of popularity in social and commerce networks. Introducing\n\n\\[ X_{T-1} = (\\omega(A_0), \\ldots, \\omega(A_{T-1}))^\\top \\quad \\text{and} \\quad X_T = (\\omega(A_1), \\ldots, \\omega(A_T))^\\top, \\]\n\nwhich are both T \u00d7 d matrices, we can write this model in a matrix form:\n\n\\[ X_T = X_{T-1} W_0 + \\mathbf{N}_T, \\tag{3} \\]\n\nwhere $\\mathbf{N}_T = (N_1, \\ldots, N_T)^\\top$ is the T \u00d7 d matrix of stacked noise vectors.\n\nThis assumes that the noise is driven by time-series dynamics (a martingale increment), where the coordinates are independent (meaning that features are independently corrupted by noise), with a sub-gaussian tail and variance uniformly bounded by a constant \u03c3^2. In particular, no independence assumption between the Nt is required here.\n\nNotations. The notations \u2016\u00b7\u2016F, \u2016\u00b7\u2016p, \u2016\u00b7\u2016\u221e, \u2016\u00b7\u2016\u2217 and \u2016\u00b7\u2016op stand, respectively, for the Frobenius norm, entry-wise \u2113p norm, entry-wise \u2113\u221e norm, trace-norm (or nuclear norm, given by the sum of the singular values) and operator norm (the largest singular value). We denote by \u27e8A, B\u27e9 = tr(A\u22a4B) the Euclidean matrix product. A vector in Rd is always understood as a d \u00d7 1 matrix. We denote by \u2016A\u20160 the number of non-zero elements of A. The product A \u25e6 B between two matrices with matching dimensions stands for the Hadamard or entry-wise product between A and B. The matrix |A| contains the absolute values of entries of A. 
The matrix (M)+ is the componentwise positive part of the matrix M, and sign(M) is the sign matrix associated to M with the convention sign(0) = 0.\n\nIf A is an n \u00d7 n matrix with rank r, we write its SVD as\n\n\\[ A = U \\Sigma V^\\top = \\sum_{j=1}^r \\sigma_j u_j v_j^\\top, \\]\n\nwhere \u03a3 = diag(\u03c31, . . . , \u03c3r) is an r \u00d7 r diagonal matrix containing the non-zero singular values of A in decreasing order, and U = [u1, . . . , ur], V = [v1, . . . , vr] are n \u00d7 r matrices with columns given by the left and right singular vectors of A. The projection matrix onto the space spanned by the columns (resp. rows) of A is given by PU = UU\u22a4 (resp. PV = VV\u22a4). The operator PA : Rn\u00d7n \u2192 Rn\u00d7n given by PA(B) = PU B + B PV \u2212 PU B PV is the projector onto the linear space spanned by the matrices $u_k x^\\top$ and $y v_k^\\top$ for 1 \u2264 j, k \u2264 r and x, y \u2208 Rn. The projector onto the orthogonal space is given by $P_A^\\perp(B) = (I - P_U) B (I - P_V)$. We also use the notation a \u2228 b = max(a, b).\n\n2.1 Joint prediction-estimation through penalized optimization\n\nIn order to reflect the autoregressive dynamics of the features, we use a least-squares goodness-of-fit criterion that encourages the similarity between two feature vectors at successive time steps. In order to induce sparsity in the estimator of W0, we penalize this criterion using the \u21131 norm. This leads to the following penalized objective function:\n\n\\[ J_1(W) = \\frac{1}{T} \\| X_T - X_{T-1} W \\|_F^2 + \\kappa \\| W \\|_1, \\]\n\nwhere \u03ba > 0 is a smoothing parameter.\n\nNow, for the prediction of AT+1, we propose to minimize a least-squares criterion penalized by the combination of an \u21131 norm and a trace-norm. This mixture of norms induces sparsity and low rank in the adjacency matrix. 
Such a combination of \u21131 and trace-norm was already studied in [8] for the matrix regression model, and in [19] for the prediction of an adjacency matrix.\n\nThe objective function defined below exploits the fact that if W is close to W0, then the features of the next graph \u03c9(AT+1) should be close to W\u22a4\u03c9(AT). Therefore, we consider\n\n\\[ J_2(A, W) = \\frac{1}{d} \\| \\omega(A) - W^\\top \\omega(A_T) \\|_F^2 + \\tau \\| A \\|_* + \\gamma \\| A \\|_1, \\]\n\nwhere \u03c4, \u03b3 > 0 are smoothing parameters. The overall objective function is the sum of the two partial objectives J1 and J2, which is jointly convex with respect to A and W:\n\n\\[ \\mathcal{L}(A, W) \\doteq \\frac{1}{T} \\| X_T - X_{T-1} W \\|_F^2 + \\kappa \\| W \\|_1 + \\frac{1}{d} \\| \\omega(A) - W^\\top \\omega(A_T) \\|_2^2 + \\tau \\| A \\|_* + \\gamma \\| A \\|_1. \\tag{4} \\]\n\nIf we choose convex cones $\\mathcal{A} \\subset \\mathbb{R}^{n \\times n}$ and $\\mathcal{W} \\subset \\mathbb{R}^{d \\times d}$, our joint estimation-prediction procedure is defined by\n\n\\[ (\\hat{A}, \\hat{W}) \\in \\operatorname*{arg\\,min}_{(A, W) \\in \\mathcal{A} \\times \\mathcal{W}} \\mathcal{L}(A, W). \\tag{5} \\]\n\nIt is natural to take $\\mathcal{W} = \\mathbb{R}^{d \\times d}$ and $\\mathcal{A} = (\\mathbb{R}_+)^{n \\times n}$ since there is no prior information on the values of the feature matrix W0, while the entries of the matrix AT+1 must be nonnegative.\n\nIn the next section we propose oracle inequalities which prove that this procedure can estimate W0 and predict AT+1 at the same time.\n\n2.2 Main result\n\nThe central contribution of our work is to bound the prediction error with high probability under the following natural hypothesis on the noise process.\n\nAssumption 1. We assume that (Nt)t\u22650 satisfies E[Nt|Ft\u22121] = 0 for any t \u2265 1 and that there is \u03c3 > 0 such that for any \u03bb \u2208 R and j = 1, . . . , d and t \u2265 0:\n\n\\[ \\mathbb{E}[ e^{\\lambda (N_t)_j} \\mid \\mathcal{F}_{t-1} ] \\le e^{\\sigma^2 \\lambda^2 / 2}. \\]\n\nMoreover, we assume that for each t \u2265 0, the coordinates (Nt)1, . . . 
, (Nt)d are independent.\n\nThe main result can be summarized as follows. The prediction error and the estimation error can be simultaneously bounded by the sum of three terms that involve homogeneously (a) the sparsity, (b) the rank of the adjacency matrix AT+1, and (c) the sparsity of the VAR model matrix W0. The tight bounds we obtain are similar to the bounds of the Lasso and are upper bounded by:\n\n\\[ C_1 \\frac{\\log d}{T} \\| W_0 \\|_0 + C_2 \\frac{\\log n}{d} \\| A_{T+1} \\|_0 + C_3 \\frac{\\log n}{d} \\operatorname{rank} A_{T+1}. \\]\n\nThe positive constants C1, C2, C3 are proportional to the noise level \u03c3. The interplay between the rank and sparsity constraints on AT+1 is reflected in the observation that the values of C2 and C3 can be changed as long as their sum remains constant.\n\n3 Oracle inequalities\n\nIn this section we give oracle inequalities for the mixed prediction-estimation error which is given, for any A \u2208 Rn\u00d7n and W \u2208 Rd\u00d7d, by\n\n\\[ \\mathcal{E}(A, W)^2 \\doteq \\frac{1}{d} \\| (W - W_0)^\\top \\omega(A_T) - \\omega(A - A_{T+1}) \\|_2^2 + \\frac{1}{T} \\| X_{T-1} (W - W_0) \\|_F^2. \\tag{6} \\]\n\nIt is important to have in mind that an upper bound on $\\mathcal{E}$ implies upper bounds on each of its two components. It entails in particular an upper bound on the feature estimation error $\\| X_{T-1} (\\hat{W} - W_0) \\|_F$ that makes $\\| (\\hat{W} - W_0)^\\top \\omega(A_T) \\|_2$ smaller and consequently controls the prediction error over the graph edges through $\\| \\omega(\\hat{A} - A_{T+1}) \\|_2$.\n\nThe upper bounds on $\\mathcal{E}$ given below exhibit the dependence of the accuracy of estimation and prediction on the number of features d, the number of nodes n and the number T of observed graphs in the sequence.\n\nLet us recall NT = (N1, . . . 
, NT)\u22a4 and introduce the noise processes\n\n\\[ M = -\\frac{1}{d} \\sum_{j=1}^d (N_{T+1})_j \\Omega_j \\quad \\text{and} \\quad \\Xi = \\frac{1}{T} \\sum_{t=1}^T \\omega(A_{t-1}) N_t^\\top + \\frac{1}{d} \\omega(A_T) N_{T+1}^\\top, \\]\n\nwhich are, respectively, n \u00d7 n and d \u00d7 d random matrices. The source of randomness comes from the noise sequence (Nt)t\u22650, see Assumption 1. If these noise processes are controlled correctly, we can prove the following oracle inequalities for procedure (5). The next result is an oracle inequality of slow type (see for instance [3]), that holds in full generality.\n\nTheorem 1. Under Assumption 2, let (\u00c2, \u0174) be given by (5) and suppose that\n\n\\[ \\tau \\ge 2 \\alpha \\| M \\|_{\\mathrm{op}}, \\quad \\gamma \\ge 2 (1 - \\alpha) \\| M \\|_\\infty \\quad \\text{and} \\quad \\kappa \\ge 2 \\| \\Xi \\|_\\infty \\tag{7} \\]\n\nfor some \u03b1 \u2208 (0, 1). Then, we have\n\n\\[ \\mathcal{E}(\\hat{A}, \\hat{W})^2 \\le \\inf_{(A, W) \\in \\mathcal{A} \\times \\mathcal{W}} \\Big\\{ \\mathcal{E}(A, W)^2 + 2 \\tau \\| A \\|_* + 2 \\gamma \\| A \\|_1 + 2 \\kappa \\| W \\|_1 \\Big\\}. \\]\n\nFor the proof of oracle inequalities of fast type, the restricted eigenvalue (RE) condition introduced in [3] and [10, 11] is of importance. Restricted eigenvalue conditions are implied by, and in general weaker than, the so-called incoherence or RIP (Restricted Isometry Property, [6]) assumptions, which exclude, for instance, strong correlations between covariates in a linear regression model. This condition is acknowledged to be one of the weakest to derive fast rates for the Lasso (see [24] for a comparison of conditions).\n\nMatrix versions of these assumptions are introduced in [12]. Below is a version of the RE assumption that fits in our context. First, we need to introduce the two restriction cones.\n\nThe first cone is related to the \u2016W\u20161 term used in procedure (5). 
If W \u2208 Rd\u00d7d, we denote by \u0398W = sign(W) \u2208 {0, \u00b11}d\u00d7d the signed sparsity pattern of W and by $\\Theta_W^\\perp \\in \\{0, 1\\}^{d \\times d}$ the orthogonal sparsity pattern. For a fixed matrix W \u2208 Rd\u00d7d and c > 0, we introduce the cone\n\n\\[ \\mathcal{C}_1(W, c) \\doteq \\Big\\{ W' \\in \\mathcal{W} : \\| \\Theta_W^\\perp \\circ W' \\|_1 \\le c \\| \\Theta_W \\circ W' \\|_1 \\Big\\}. \\]\n\nThis cone contains the matrices W\u2032 that have their largest entries in the sparsity pattern of W.\n\nThe second cone is related to the mixture of the terms \u2016A\u2016\u2217 and \u2016A\u20161 in procedure (5). Before defining it, we need further notations and definitions.\n\nFor a fixed A \u2208 Rn\u00d7n and c, \u03b2 > 0, we introduce the cone\n\n\\[ \\mathcal{C}_2(A, c, \\beta) \\doteq \\Big\\{ A' \\in \\mathcal{A} : \\| P_A^\\perp(A') \\|_* + \\beta \\| \\Theta_A^\\perp \\circ A' \\|_1 \\le c \\Big( \\| P_A(A') \\|_* + \\beta \\| \\Theta_A \\circ A' \\|_1 \\Big) \\Big\\}. \\]\n\nThis cone consists of the matrices A\u2032 with large entries close to those of A and that are \u201calmost aligned\u201d with the row and column spaces of A. The parameter \u03b2 quantifies the interplay between these two notions.\n\nAssumption 2 (Restricted Eigenvalue (RE)). 
For $W \\in \\mathcal{W}$ and c > 0, we introduce\n\n\\[ \\mu_1(W, c) = \\inf \\Big\\{ \\mu > 0 : \\| \\Theta_W \\circ W' \\|_F \\le \\frac{\\mu}{\\sqrt{T}} \\| X_{T-1} W' \\|_F, \\ \\forall W' \\in \\mathcal{C}_1(W, c) \\Big\\}. \\]\n\nFor $A \\in \\mathcal{A}$ and c, \u03b2 > 0, we introduce\n\n\\[ \\mu_2(A, W, c, \\beta) = \\inf \\Big\\{ \\mu > 0 : \\| P_A(A') \\|_F \\vee \\| \\Theta_A \\circ A' \\|_F \\le \\frac{\\mu}{\\sqrt{d}} \\| {W'}^\\top \\omega(A_T) - \\omega(A') \\|_2, \\ \\forall W' \\in \\mathcal{C}_1(W, c), \\ \\forall A' \\in \\mathcal{C}_2(A, c, \\beta) \\Big\\}. \\tag{8} \\]\n\nThe RE assumption consists of assuming that the constants \u00b51 and \u00b52 are finite. Now we can state the following theorem that gives a fast oracle inequality for our procedure using RE.\n\nTheorem 2. Under Assumption 2, let (\u00c2, \u0174) be given by (5) and suppose that\n\n\\[ \\tau \\ge 3 \\alpha \\| M \\|_{\\mathrm{op}}, \\quad \\gamma \\ge 3 (1 - \\alpha) \\| M \\|_\\infty \\quad \\text{and} \\quad \\kappa \\ge 3 \\| \\Xi \\|_\\infty \\tag{9} \\]\n\nfor some \u03b1 \u2208 (0, 1). Then, we have\n\n\\[ \\mathcal{E}(\\hat{A}, \\hat{W})^2 \\le \\inf_{(A, W) \\in \\mathcal{A} \\times \\mathcal{W}} \\Big\\{ \\mathcal{E}(A, W)^2 + \\frac{25}{18} \\mu_2(A, W)^2 \\big( \\tau^2 \\operatorname{rank}(A) + \\gamma^2 \\| A \\|_0 \\big) + \\frac{25}{36} \\kappa^2 \\mu_1(W)^2 \\| W \\|_0 \\Big\\}, \\]\n\nwhere \u00b51(W) = \u00b51(W, 5) and \u00b52(A, W) = \u00b52(A, W, 5, \u03b3/\u03c4) (see Assumption 2).\n\nThe proofs of Theorems 1 and 2 use tools introduced in [12] and [3].\n\nNote that the residual term from this oracle inequality mixes the notions of sparsity of A and W via the terms rank(A), \u2016A\u20160 and \u2016W\u20160. 
It says that our mixed penalization procedure provides an optimal trade-off between fitting the data and complexity, measured by both sparsity and low rank. This is the first result of this nature to be found in the literature.\n\nIn Theorem 3 below, we obtain convergence rates for the procedure (5) by combining Theorem 2 with controls on the noise processes. We introduce\n\n\\[ v_{\\Omega, \\mathrm{op}}^2 = \\Big\\| \\frac{1}{d} \\sum_{j=1}^d \\Omega_j^\\top \\Omega_j \\Big\\|_{\\mathrm{op}} \\vee \\Big\\| \\frac{1}{d} \\sum_{j=1}^d \\Omega_j \\Omega_j^\\top \\Big\\|_{\\mathrm{op}}, \\qquad v_{\\Omega, \\infty}^2 = \\Big\\| \\frac{1}{d} \\sum_{j=1}^d \\Omega_j \\circ \\Omega_j \\Big\\|_\\infty, \\]\n\n\\[ \\sigma_\\omega^2 = \\max_{j = 1, \\ldots, d} \\sigma_{\\omega, j}^2, \\quad \\text{where} \\quad \\sigma_{\\omega, j}^2 = \\Big( \\frac{1}{T} \\sum_{t=1}^T \\omega_j(A_{t-1})^2 + \\omega_j(A_T)^2 \\Big), \\]\n\nwhich are the (observable) variance terms that naturally appear in the controls of the noise processes. We introduce also\n\n\\[ \\ell_T = 2 \\max_{j = 1, \\ldots, d} \\log \\log \\Big( \\sigma_{\\omega, j}^2 \\vee \\frac{1}{\\sigma_{\\omega, j}^2} \\vee e \\Big), \\]\n\nwhich is a small (observable) technical term that comes out of our analysis of the noise process \u039e. This term is a small price to pay for the fact that no independence assumption is required on the noise sequence (Nt)t\u22650, but only a martingale increment structure with sub-gaussian tails.\n\nTheorem 3. 
Consider the procedure (\u00c2, \u0174) given by (5) with smoothing parameters given by\n\n\\[ \\tau = 3 \\alpha \\sigma v_{\\Omega, \\mathrm{op}} \\sqrt{\\frac{2 (x + \\log(2n))}{d}}, \\qquad \\gamma = 3 (1 - \\alpha) \\sigma v_{\\Omega, \\infty} \\sqrt{\\frac{2 (x + 2 \\log n)}{d}}, \\]\n\n\\[ \\kappa = 6 \\sigma \\sigma_\\omega \\bigg( \\sqrt{\\frac{2 e (x + 2 \\log d + \\ell_T)}{T}} + \\frac{\\sqrt{2 e (x + 2 \\log d + \\ell_T)}}{d} \\bigg), \\]\n\nfor some \u03b1 \u2208 (0, 1) and a fixed confidence level x > 0. Then, we have\n\n\\[ \\mathcal{E}(\\hat{A}, \\hat{W})^2 \\le \\inf_{(A, W) \\in \\mathcal{A} \\times \\mathcal{W}} \\bigg\\{ \\mathcal{E}(A, W)^2 + C_1 \\| W \\|_0 (x + 2 \\log d + \\ell_T) \\Big( \\frac{1}{T} + \\frac{1}{d^2} \\Big) + C_2 \\| A \\|_0 \\frac{2 (x + 2 \\log n)}{d} + C_3 \\operatorname{rank}(A) \\frac{2 (x + \\log(2n))}{d} \\bigg\\}, \\]\n\nwhere\n\n\\[ C_1 = 100 e \\mu_1(W)^2 \\sigma^2 \\sigma_\\omega^2, \\quad C_2 = 25 \\mu_2(A, W)^2 (1 - \\alpha)^2 \\sigma^2 v_{\\Omega, \\infty}^2, \\quad C_3 = 25 \\mu_2(A, W)^2 \\alpha^2 \\sigma^2 v_{\\Omega, \\mathrm{op}}^2, \\]\n\nwith a probability larger than 1 \u2212 17e^{\u2212x}, where \u00b51 and \u00b52 are the same as in Theorem 2.\n\nThe proof of Theorem 3 follows directly from Theorem 2 and basic noise control results. In the next theorem, we propose more explicit upper bounds for both the individual estimation of W0 and the prediction of AT+1.\n\nTheorem 4. 
Under the same assumptions as in Theorem 3 and the same choice of smoothing parameters, for any x > 0 the following inequalities hold with probability larger than 1 \u2212 17e^{\u2212x}:\n\n\u2022 Feature prediction error:\n\n\\[ \\frac{1}{T} \\| X_T (\\hat{W} - W_0) \\|_F^2 \\le \\frac{25}{36} \\kappa^2 \\mu_1(W_0)^2 \\| W_0 \\|_0 + \\inf_{A \\in \\mathcal{A}} \\Big\\{ \\frac{1}{d} \\| \\omega(A) - \\omega(A_{T+1}) \\|_2^2 + \\frac{25}{18} \\mu_2(A, W_0)^2 \\big( \\tau^2 \\operatorname{rank}(A) + \\gamma^2 \\| A \\|_0 \\big) \\Big\\} \\tag{10} \\]\n\n\u2022 VAR parameter estimation error:\n\n\\[ \\| \\hat{W} - W_0 \\|_1 \\le 5 \\kappa \\mu_1(W_0)^2 \\| W_0 \\|_0 + 6 \\sqrt{\\| W_0 \\|_0} \\, \\mu_1(W_0) \\inf_{A \\in \\mathcal{A}} \\sqrt{ \\frac{1}{d} \\| \\omega(A) - \\omega(A_{T+1}) \\|_2^2 + \\frac{25}{18} \\mu_2(A, W_0)^2 \\big( \\tau^2 \\operatorname{rank}(A) + \\gamma^2 \\| A \\|_0 \\big) } \\tag{11} \\]\n\n\u2022 Link prediction error:\n\n\\[ \\| \\hat{A} - A_{T+1} \\|_* \\le 5 \\kappa \\mu_1(W_0)^2 \\| W_0 \\|_0 + \\mu_2(A_{T+1}, W_0) \\Big( 6 \\sqrt{\\operatorname{rank} A_{T+1}} + 5 \\frac{\\gamma}{\\tau} \\sqrt{\\| A_{T+1} \\|_0} \\Big) \\times \\inf_{A \\in \\mathcal{A}} \\sqrt{ \\frac{1}{d} \\| \\omega(A) - \\omega(A_{T+1}) \\|_2^2 + \\frac{25}{18} \\mu_2(A, W_0)^2 \\big( \\tau^2 \\operatorname{rank}(A) + \\gamma^2 \\| A \\|_0 \\big) }. \\tag{12} \\]\n\n4 Algorithms and Numerical Experiments\n\n4.1 Generalized forward-backward algorithm for minimizing L\n\nWe use the algorithm designed in [17] for minimizing our objective function. Note that this algorithm is preferable to the method introduced in [18] as it directly minimizes $\\mathcal{L}$ jointly in (A, W) rather than alternately minimizing in W and A.\n\nMoreover we use the novel joint penalty from [19] that is more suited for estimating graphs. 
The proximal operator for the trace norm is given by the shrinkage operation: if Z = U diag(\u03c31, \u00b7\u00b7\u00b7, \u03c3n)V\u22a4 is the singular value decomposition of Z, then\n\n\\[ \\operatorname{prox}_{\\tau \\| \\cdot \\|_*}(Z) = U \\operatorname{diag}((\\sigma_i - \\tau)_+)_i V^\\top. \\]\n\nSimilarly, the proximal operator for the \u21131-norm is the soft thresholding operator defined by using the entry-wise product of matrices denoted by \u25e6:\n\n\\[ \\operatorname{prox}_{\\gamma \\| \\cdot \\|_1}(Z) = \\operatorname{sgn}(Z) \\circ (|Z| - \\gamma)_+. \\]\n\nThe algorithm converges under very mild conditions when the step size \u03b8 is smaller than 2/L, where L is the operator norm of the joint quadratic loss:\n\n\\[ \\Phi : (A, W) \\mapsto \\frac{1}{T} \\| X_T - X_{T-1} W \\|_F^2 + \\frac{1}{d} \\| \\omega(A) - W^\\top \\omega(A_T) \\|_F^2. \\]\n\nAlgorithm 1 Generalized Forward-Backward to Minimize L\n\nInitialize A, Z1, Z2, W\nrepeat\n    Compute (GA, GW) = \u2207_{A,W} \u03a6(A, W).\n    Compute Z1 = prox_{2\u03b8\u03c4\u2016\u00b7\u2016\u2217}(2A \u2212 Z1 \u2212 \u03b8GA)\n    Compute Z2 = prox_{2\u03b8\u03b3\u2016\u00b7\u20161}(2A \u2212 Z2 \u2212 \u03b8GA)\n    Set A = (1/2)(Z1 + Z2)\n    Set W = prox_{\u03b8\u03ba\u2016\u00b7\u20161}(W \u2212 \u03b8GW)\nuntil convergence\nreturn (A, W) minimizing L\n\n4.2 A generative model for graphs having linearly autoregressive features\n\nLet V0 \u2208 Rn\u00d7r be a sparse matrix and $V_0^\\dagger$ its pseudo-inverse, such that $V_0^\\dagger V_0 = V_0^\\top V_0^{\\top\\dagger} = I_r$. Fix two sparse matrices W0 \u2208 Rr\u00d7r and U0 \u2208 Rn\u00d7r. Now define the sequence of matrices (At)t\u22650 for t = 1, 2, \u00b7\u00b7\u00b7 by\n\n\\[ U_t = U_{t-1} W_0 + N_t \\quad \\text{and} \\quad A_t = U_t V_0^\\top + M_t \\]\n\nfor i.i.d. sparse noise matrices Nt and Mt, which means that for any pair of indices (i, j), with high probability (Nt)i,j = 0 and (Mt)i,j = 0. 
We define the linear feature map $\\omega(A) = A V_0^{\\top\\dagger}$, and point out that\n\n1. The sequence $(\\omega(A_t)^\\top)_t = (U_t + M_t V_0^{\\top\\dagger})_t$ follows the linear autoregressive relation\n\n\\[ \\omega(A_t)^\\top = \\omega(A_{t-1})^\\top W_0 + N_t + M_t V_0^{\\top\\dagger}. \\]\n\n2. For any time index t, the matrix At is close to $U_t V_0^\\top$, which has rank at most r.\n\n3. The matrices At and Ut are both sparse by construction.\n\n4.3 Empirical evaluation\n\nWe tested the presented methods on synthetic data generated as in Section 4.2. In our experiments the noise matrices Mt and Nt were built by soft-thresholding i.i.d. noise N(0, \u03c3^2). We took as input T = 10 successive graph snapshots on graphs with n = 50 nodes and rank r = 5. We used d = 10 linear features, and finally the noise level was set to \u03c3 = 0.5. We compare our methods to standard baselines in link prediction. We use the area under the ROC curve as the measure of performance and report empirical results averaged over 50 runs with the corresponding confidence intervals in Figure 1. The competitor methods are the nearest neighbors (NN) and static sparse and low-rank estimation, that is, the link prediction algorithm suggested in [19]. The algorithm NN scores pairs of nodes with the number of common friends between them, which is given by A^2 when A is the cumulative graph adjacency matrix $\\tilde{A}_T = \\sum_{t=0}^T A_t$. The static sparse and low-rank estimation is obtained by minimizing the objective $\\| X - \\tilde{A}_T \\|_F^2 + \\tau \\| X \\|_* + \\gamma \\| X \\|_1$, and can be seen as the closest static version of our method. The two methods autoregressive low-rank and static low-rank are regularized using only the trace-norm (corresponding to forcing \u03b3 = 0) and are slightly inferior to their sparse and low-rank rivals. Since the matrix V0 defining the linear map \u03c9 is unknown, we consider the feature map \u03c9(A) = AV where $\\tilde{A}_T = U \\Sigma V^\\top$ is the SVD of $\\tilde{A}_T$. 
The parameters \u03c4 and \u03b3 are chosen by 10-fold cross validation for each of the methods separately.\n\nFigure 1: Left: performance of algorithms in terms of Area Under the ROC Curve, average and confidence intervals over 50 runs. Right: Phase transition diagram.\n\n4.4 Discussion\n\n1. Comparison with the baselines. This experiment sharply shows the benefit of using a temporal approach when one can handle the feature extraction task. The left-hand plot shows that if few snapshots are available (T \u2264 4 in these experiments), then static approaches are to be preferred, whereas feature autoregressive approaches outperform them as soon as a sufficient number T of graph snapshots is available (see phase transition). The decreasing performance of static algorithms can be explained by the fact that they use as input a mixture of graphs observed at different time steps. Knowing that at each time step the nodes have specific latent factors, despite the slow evolution of the factors, adding the resulting graphs confuses the factors.\n\n2. Phase transition. The right-hand figure is a phase transition diagram showing in which part of the rank and time domain the estimation is accurate, and illustrates the interplay between these two domain parameters.\n\n3. Choice of the feature map \u03c9. In the current work we used the projection onto the vector space of the top-r singular vectors of the cumulative adjacency matrix as the linear map \u03c9, and this choice has shown empirical superiority to other choices. The question of choosing the best measurement to summarize graph information as in compressed sensing seems to have both theoretical and application potential. 
Moreover, a deeper understanding of the connections of our problem with compressed sensing, for the construction and theoretical validation of the feature mapping, is an important point that needs several developments. One possible approach is based on multi-kernel learning, which should be considered in future work.\n\n4. Generalization of the method. In this paper we consider only an autoregressive process of order 1. For better prediction accuracy, one could consider more general models, such as vector ARMA models, and use model-selection techniques for the choice of the orders of the model. A general modelling based on state-space models could be developed as well. We presented a procedure for predicting graphs having linear autoregressive features. Our approach can easily be generalized to non-linear prediction through kernel-based methods.\n\nReferences\n\n[1] J. Abernethy, F. Bach, Th. Evgeniou, and J.-Ph. Vert. A new approach to collaborative filtering: operator estimation with spectral regularization. JMLR, 10:803\u2013826, 2009.\n\n[2] A. Argyriou, M. Pontil, Ch. Micchelli, and Y. Ying. A spectral regularization framework for multi-task structure learning. Proceedings of Neural Information Processing Systems (NIPS), 2007.\n\n[3] P. J. Bickel, Y. Ritov, and A. B. Tsybakov. Simultaneous analysis of Lasso and Dantzig selector. Annals of Statistics, 37, 2009.\n\n[4] L. Breiman and J. H. Friedman. Predicting multivariate responses in multiple linear regression. Journal of the Royal Statistical Society (JRSS): Series B (Statistical Methodology), 59:3\u201354, 1997.\n\n[5] E.J. Cand\u00e8s and T. Tao. 
The power of convex relaxation: Near-optimal matrix completion.\nIEEE Transactions on Information Theory, 56(5), 2009.\n\n[6] E. J. Cand\u00e8s and T. Tao. Decoding by linear programming. In Proceedings of the 46th Annual\nIEEE Symposium on Foundations of Computer Science (FOCS), 2005.\n\n[7] Th. Evgeniou, Ch. A. Micchelli, and M. Pontil. Learning multiple tasks with kernel methods.\nJournal of Machine Learning Research, 6:615\u2013637, 2005.\n\n[8] S. Ga\u00efffas and G. Lecu\u00e9. Sharp oracle inequalities for high-dimensional matrix prediction.\nIEEE Transactions on Information Theory, 57(10):6942\u20136957, 2011.\n\n[9] M. Kolar and E. P. Xing. On time varying undirected graphs. In Proceedings of the 14th\nInternational Conference on Artificial Intelligence and Statistics (AISTATS), 2011.\n\n[10] V. Koltchinskii. The Dantzig selector and sparsity oracle inequalities. Bernoulli, 15(3):799\u2013828, 2009.\n\n[11] V. Koltchinskii. Sparsity in penalized empirical risk minimization. Ann. Inst. Henri Poincar\u00e9\nProbab. Stat., 45(1):7\u201357, 2009.\n\n[12] V. Koltchinskii, K. Lounici, and A. Tsybakov. Nuclear norm penalization and optimal rates for\nnoisy matrix completion. Annals of Statistics, 2011.\n\n[13] Y. Koren. Factorization meets the neighborhood: a multifaceted collaborative filtering model.\nIn Proceedings of the 14th ACM SIGKDD International Conference on Knowledge Discovery\nand Data Mining, pages 426\u2013434. ACM, 2008.\n\n[14] Y. Koren. Collaborative filtering with temporal dynamics. Communications of the ACM,\n53(4):89\u201397, 2010.\n\n[15] D. Liben-Nowell and J. Kleinberg. The link-prediction problem for social networks. Journal\nof the American Society for Information Science and Technology, 58(7):1019\u20131031, 2007.\n\n[16] S. A. Myers and J. Leskovec. On the convexity of latent social network inference. In NIPS,\n2010.\n\n[17] H. Raguet, J. Fadili, and G. Peyr\u00e9. Generalized forward-backward splitting. 
arXiv preprint\narXiv:1108.4404, 2011.\n\n[18] E. Richard, N. Baskiotis, Th. Evgeniou, and N. Vayatis. Link discovery using graph feature\ntracking. Proceedings of Neural Information Processing Systems (NIPS), 2010.\n\n[19] E. Richard, P.-A. Savalle, and N. Vayatis. Estimation of simultaneously sparse and low-rank\nmatrices. In Proceedings of the 29th Annual International Conference on Machine Learning, 2012.\n\n[20] P. Sarkar, D. Chakrabarti, and A. W. Moore. Theoretical justification of popular link prediction\nheuristics. In International Conference on Learning Theory (COLT), pages 295\u2013307, 2010.\n\n[21] N. Srebro, J. D. M. Rennie, and T. S. Jaakkola. Maximum-margin matrix factorization. In\nLawrence K. Saul, Yair Weiss, and L\u00e9on Bottou, editors, Proceedings of Neural Information\nProcessing Systems 17, pages 1329\u20131336. MIT Press, Cambridge, MA, 2005.\n\n[22] B. Taskar, M. F. Wong, P. Abbeel, and D. Koller. Link prediction in relational data. In Neural\nInformation Processing Systems, volume 15, 2003.\n\n[23] R. S. Tsay. Analysis of Financial Time Series. Wiley-Interscience, 3rd edition, 2005.\n\n[24] S. A. van de Geer and P. B\u00fchlmann. On the conditions used to prove oracle results for the\nLasso. Electron. J. Stat., 3:1360\u20131392, 2009.\n\n[25] D. Q. Vu, A. Asuncion, D. Hunter, and P. Smyth. Continuous-time regression models for\nlongitudinal networks. In Advances in Neural Information Processing Systems. MIT Press,\n2011.\n", "award": [], "sourceid": 1291, "authors": [{"given_name": "Emile", "family_name": "Richard", "institution": null}, {"given_name": "Stephane", "family_name": "Gaiffas", "institution": null}, {"given_name": "Nicolas", "family_name": "Vayatis", "institution": null}]}