{"title": "Learnability of Influence in Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 3186, "page_last": 3194, "abstract": "We establish PAC learnability of influence functions for three common influence models, namely, the Linear Threshold (LT), Independent Cascade (IC) and Voter models, and present concrete sample complexity results in each case. Our results for the LT model are based on interesting connections with neural networks; those for the IC model are based on an interpretation of the influence function as an expectation over a random draw of a subgraph and use covering number arguments; and those for the Voter model are based on a reduction to linear regression. We show these results for the case in which the cascades are only partially observed and we do not see the time steps in which a node has been influenced. We also provide efficient polynomial time learning algorithms for a setting with full observation, i.e. where the cascades also contain the time steps in which nodes are influenced.", "full_text": "Learnability of Influence in Networks

Harikrishna Narasimhan, David C. Parkes, Yaron Singer
hnarasimhan@seas.harvard.edu, {parkes, yaron}@seas.harvard.edu
Harvard University, Cambridge, MA 02138

Abstract

We show PAC learnability of influence functions for three common influence models, namely, the Linear Threshold (LT), Independent Cascade (IC) and Voter models, and present concrete sample complexity results in each case. Our results for the LT model are based on interesting connections with neural networks; those for the IC model are based on an interpretation of the influence function as an expectation over a random draw of a subgraph and use covering number arguments; and those for the Voter model are based on a reduction to linear regression. 
We show these results for the case in which the cascades are only partially observed and we do not see the time steps in which a node has been influenced. We also provide efficient polynomial time learning algorithms for a setting with full observation, i.e. where the cascades also contain the time steps in which nodes are influenced.

1 Introduction

For several decades there has been much interest in understanding the manner in which ideas, language, and information cascades spread through society. With the advent of social networking technologies in recent years, digital traces of human interactions have become available, and the problem of predicting information cascades from these traces has gained enormous practical value. For example, this is critical in applications like viral marketing, where one needs to maximize awareness about a product by selecting a small set of influential users [1].

To this end, the spread of information in networks is modeled as an influence function, which maps a set of seed nodes who initiate the cascade to (a distribution on) the set of individuals who will be influenced as a result [2]. These models are parametrized by variables that are unknown and need to be estimated from data. There has been much work on estimating the parameters of influence models (or the structure of the underlying social graph) from observed cascades of influence spread, and on using the estimated parameters to predict influence for a given seed set [3, 4, 5, 6, 7, 8]. These parameter estimation techniques make use of local influence information at each node, and a recent line of work has been devoted to providing sample complexity guarantees for these local estimation techniques [9, 10, 11, 12, 13].

However, one cannot locally estimate the influence parameters when the cascades are not completely observed (e.g. when the cascades do not contain the time at which the nodes are influenced). Moreover, influence functions can be sensitive to errors in model parameters, and existing results do not tell us to what accuracy the individual parameters need to be estimated to obtain accurate influence predictions. If the primary goal in an application is to predict influence accurately, it is natural to ask for algorithms that have learnability guarantees on the influence function itself. A benchmark for studying such questions is the Probably Approximately Correct (PAC) learning framework [14]:

Are influence functions PAC learnable?

While many influence models have been popularized due to their approximation guarantees for influence maximization [2, 15, 16], learnability of influence is an equally fundamental property.

* Part of this work was done when HN was a PhD student at the Indian Institute of Science, Bangalore.

In this paper, we show PAC learnability for three well-studied influence models: the Linear Threshold, the Independent Cascade, and the Voter models. We primarily consider a setting where the cascades are partially observed, i.e. where only the nodes influenced, and not the time steps at which they were influenced, are observed. This is a setting where existing local estimation techniques cannot be applied to obtain parameter estimates. Additionally, for a fully observed setting where the time of influence is also observed, we show polynomial time learnability; our methods here are akin to using local estimation techniques, but come with guarantees on the global influence function.

Main results. Our learnability results are summarized below.

• Linear threshold (LT) model: Our result here is based on the interesting observation that LT influence functions can be seen as multi-layer neural network classifiers, and proceeds by bounding their VC-dimension. 
The method analyzed here picks a function with zero training error. While this can be computationally hard to implement under partial observation, we provide a polynomial time algorithm for the full observation case using local computations.
• Independent cascade (IC) model: Our result uses an interpretation of the influence function as an expectation over a random draw of a subgraph [2]; this allows us to show that the function is Lipschitz and to invoke covering number arguments. The algorithm analyzed for partial observation is based on global maximum likelihood estimation. Under full observation (and additional assumptions), we show polynomial time learnability using a local estimation technique.
• Voter model: Our result follows from a reduction of the learning problem to a linear regression problem; the resulting learning algorithm can be implemented in polynomial time in both the full and partial observation settings.

Related work. A problem related to ours is that of inferring the structure of the underlying social graph from cascades [6]. There has been a series of results on polynomial sample complexity guarantees for this problem under variants of the IC model [9, 12, 10, 11]. Most of these results make specific assumptions on the cascades/graph structure, and assume a full observation setting. In our problem, on the other hand, the structure of the social graph is assumed to be known, and the goal is to provably learn the underlying influence function. Our results do not depend on assumptions on the network structure, and primarily apply to the more challenging partial observation setting.

The work most closely related to ours is that of Du et al. [13], who show polynomial sample complexity results for learning influence in the LT and IC models (under partial observation). 
However, their approach uses approximations to influence functions and consequently requires a strong technical condition to hold, which is not necessarily satisfied in general. Our results for the LT and IC models are somewhat orthogonal: while the authors of [13] trade off assumptions on learnability to gain efficient algorithms that work well in practice, our goal is to show unconditional sample complexity bounds for learning influence. We do this at the expense of the efficiency of the learning algorithms in the partial observation setting. Moreover, the technical approach we take is substantially different.

There has also been work on learnability of families of discrete functions such as submodular [17] and coverage functions [18], under the PAC and the variant PMAC frameworks. These results assume availability of a training sample containing exact values of the target function on the given input sets. While IC influence functions can be seen as coverage functions [2], the previous results do not directly apply to the IC class since, in practice, the true (expected) value of an IC influence function on a seed set is never observed, and only a random realization is seen. In contrast, our learnability result for IC functions does not require the exact function values to be known. Moreover, the previous results require strict assumptions on the input distribution. Since we focus on learnability of specific function classes rather than large families of discrete functions, we are able to handle general seed distributions for the most part. Other results relevant to our work include learnability of linear influence games [19], where the techniques used bear some similarity to our analysis for the LT model.

2 Preliminaries

Influence models. We represent a social network as a finite graph G = (V, E), where the nodes V = {1, ..., n} represent a set of n individuals and the edges E ⊆ V² represent their social links. Let |E| = r. The graph is assumed to be directed unless otherwise specified. Each edge (u, v) ∈ E is associated with a weight w_uv ∈ R₊ that indicates the strength of influence of node v on node u. We consider a setting where each node in the network holds an opinion in {0, 1} and opinions disseminate in the network. This dissemination process begins with a small subset of nodes, called the seed, which have opinion 1 while the rest have opinion 0, and continues in discrete time steps. In every time step, a node may change its opinion from 0 to 1 based on the opinions of its neighbors, according to some local model of influence; if this happens, we say that the node is influenced. We will use N(u) to denote the set of neighbors of node u, and A_t to denote the set of nodes that are influenced at time step t. We consider three well-studied models:

• Linear threshold (LT) model: Each node u holds a threshold k_u ∈ R₊, and is influenced at time t if the total incoming weight from its neighbors that were influenced at the previous time step t−1 exceeds the threshold: ∑_{v ∈ N(u) ∩ A_{t−1}} w_uv ≥ k_u. Once influenced, node u can then influence its neighbors for one time step, and never changes its opinion back to 0.¹
• Independent cascade (IC) model: Restricting the edge weights w_uv to be in [0, 1], a node u is influenced at time t independently by each neighbor v who was influenced at time t−1, with probability w_uv. The node can then influence its neighbors for one time step, and never changes its opinion back to 0.
• Voter model: The graph is assumed to be undirected (with self-loops); at time step t, a node u adopts the opinion of its neighbor v with probability w_uv / ∑_{v' ∈ N(u) ∪ {u}} w_uv'. Unlike the LT and IC models, here a node may change its opinion from 1 to 0 or from 0 to 1 at every step.

We stress that a node is influenced at time t if it changes its opinion from 0 to 1 exactly at t. Also, in both the LT and IC models, no node gets influenced more than once, and hence an influence cascade can last for at most n time steps. For simplicity, we shall consider in all our definitions only cascades of length n. When revisiting the Voter model in Section 5, we will look at more general cascades.

Definition 1 (Influence function). Given an influence model, a (global) influence function F : 2^V → [0, 1]^n maps an initial set of nodes X ⊆ V seeded with opinion 1 to a vector of probabilities [F_1(X), ..., F_n(X)] ∈ [0, 1]^n, where the u-th coordinate indicates the probability of node u ∈ V being influenced during any time step of the corresponding influence cascade.

Note that for the LT model, the influence process is deterministic, and the influence function simply outputs a binary vector in {0, 1}^n. Let F_G denote the class of all influence functions under an influence model over G, obtained for different choices of parameters (edge weights/thresholds) in the model. We will be interested in learning the influence function for a given parametrization of this influence model. We shall assume that the initial set of nodes seeded with opinion 1 at the start of the influence process, or the seed set, is chosen i.i.d. according to a distribution μ over all subsets of nodes. We are given a training sample consisting of draws of initial seed sets from μ, along with observations of the nodes influenced in the corresponding influence process. 
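As a concrete illustration of the two progressive models above, the following is a minimal sketch (our own illustration, not the paper's code; the dict-based graph representation and function names are assumptions) that simulates LT and IC cascades:

```python
import random

def lt_cascade(nodes, in_nbrs, w, k, seed):
    """Linear Threshold cascade: u is influenced at step t when the total
    weight from in-neighbors influenced exactly at step t-1 reaches k[u]."""
    active, frontier = set(seed), set(seed)
    while frontier:
        newly = set()
        for u in nodes:
            if u not in active and \
               sum(w[(u, v)] for v in in_nbrs[u] if v in frontier) >= k[u]:
                newly.add(u)
        active |= newly
        frontier = newly
    return active  # all nodes that ever held opinion 1

def ic_cascade(nodes, in_nbrs, w, seed, rng=random):
    """Independent Cascade: each neighbor v influenced at step t-1 gets one
    independent chance, with probability w[(u, v)], to influence u at step t."""
    active, frontier = set(seed), set(seed)
    while frontier:
        newly = set()
        for u in nodes:
            if u in active:
                continue
            if any(v in frontier and rng.random() < w[(u, v)]
                   for v in in_nbrs[u]):
                newly.add(u)
        active |= newly
        frontier = newly
    return active
```

Since no node is influenced twice in either model, both loops terminate after at most n rounds, matching the cascade-length bound noted above.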
Our goal is then to learn from F_G an influence function that best captures the observed influence process.

Measuring loss. To measure the quality of the learned influence function, we define a loss function ℓ : 2^V × [0, 1]^n → R₊ that, for any subset of influenced nodes Y ⊆ V and predicted influence probabilities p ∈ [0, 1]^n, assigns a value ℓ(Y, p) measuring the discrepancy between Y and p. We define the error of a learned function F ∈ F_G for a given seed distribution μ and model parametrization as the expected loss incurred by F:

err_ℓ[F] = E_{X,Y}[ ℓ(Y, F(X)) ],

where the expectation is over a random draw of the seed set X from the distribution μ and over the corresponding subset of nodes Y influenced during the cascade.

We will be particularly interested in the difference between the error of an influence function F_S ∈ F_G learned from a training sample S and the minimum possible error achievable over all influence functions in F_G, namely err_ℓ[F_S] − inf_{F ∈ F_G} err_ℓ[F], and would like to learn influence functions for which this difference is guaranteed to be small (using only polynomially many training examples).

Full and partial observation. We primarily work in a setting in which we observe the nodes influenced in a cascade, but not the time steps at which they were influenced. In other words, we assume availability of a partially observed training sample S = {(X¹, Y¹), ..., (X^m, Y^m)}, where X^i denotes the seed set of cascade i and Y^i is the set of nodes influenced in that cascade. We will also consider a refined notion of full observation in which we are provided a training sample S = {(X¹, Y¹_{1:n}), ..., (X^m, Y^m_{1:n})}, where Y^i_{1:n} = {Y^i_1, ..., Y^i_n} and Y^i_t is the set of nodes in cascade i that were influenced precisely at time step t. Notice that here the complete set of nodes influenced in cascade i is given by ∪_{t=1}^n Y^i_t. This setting is of particular interest when discussing learnability in polynomial time. The structure of the social graph is always assumed to be known.

¹ In settings where the node thresholds are unknown, it is common to assume that they are chosen randomly by each node [2]. In our setup, the thresholds are parameters that need to be learned from cascades.

PAC learnability of influence functions. Let F_G be the class of all influence functions under an influence model over an n-node social network G = (V, E). We say F_G is probably approximately correct (PAC) learnable w.r.t. loss ℓ if there exists an algorithm such that the following holds for all ε, δ ∈ (0, 1), for all parametrizations of the model, and for all (or a subset of) distributions μ over seed sets: when the algorithm is given a partially observed training sample S = {(X¹, Y¹), ..., (X^m, Y^m)} with m ≥ poly(1/ε, 1/δ) examples, it outputs an influence function F_S ∈ F_G for which

P_S( err_ℓ[F_S] − inf_{F ∈ F_G} err_ℓ[F] ≥ ε ) ≤ δ,

where the probability is over the randomness in S. Moreover, F_G is efficiently PAC learnable under this setting if the running time of the algorithm in the above definition is polynomial in m and in the size of G. We say F_G is (efficiently) PAC learnable under full observation if the above definition holds with a fully observed training sample S = {(X¹, Y¹_{1:n}), ..., (X^m, Y^m_{1:n})}.

Sensitivity of influence functions to parameter errors. A common approach to predicting influence under full observation is to estimate the model parameters using local influence information at each node. 
However, an influence function can be highly sensitive to errors in the estimated parameters. For example, consider an IC model on a chain of n nodes where all edge parameters are 1; if the parameters have all been underestimated with a constant error of ε, the estimated probability of the last node being influenced is (1 − ε)^n, which is exponentially smaller than the true value of 1 for large n. Our results for full observation provide concrete sample complexity guarantees for learning influence functions using local estimation, to any desired accuracy; in particular, for the above example, our results prescribe that ε be driven below 1/n for accurate predictions (see Section 4 on the IC model). Of course, under partial observation, we do not see enough information to locally estimate the individual model parameters, and the influence function needs to be learned directly from the cascades.

3 The Linear Threshold model

We start with learnability in the Linear Threshold (LT) model. Given that the influence process is deterministic and the influence function outputs binary values, we use the 0-1 loss for evaluation; for any subset of nodes Y ⊆ V and predicted boolean vector q ∈ {0, 1}^n, this is the fraction of nodes on which the prediction is wrong: ℓ_{0-1}(Y, q) = (1/n) ∑_{u=1}^n 1(χ_u(Y) ≠ q_u), where χ_u(Y) = 1(u ∈ Y).

Theorem 1 (PAC learnability under the LT model). The class of influence functions under the LT model is PAC learnable w.r.t. ℓ_{0-1}, and the corresponding sample complexity is Õ(ε⁻¹(r + n)). Furthermore, in the full observation setting the influence functions can be learned in polynomial time.

The proof is in Appendix A; we give an outline here. 
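For concreteness, the 0-1 loss defined above can be computed directly from a set of influenced nodes and a predicted boolean vector; a small illustrative helper (our own, not from the paper; nodes are indexed from 0):

```python
def loss_01(Y, q):
    """Fraction of nodes u with 1(u in Y) != q[u], for nodes 0..n-1."""
    n = len(q)
    # True/False compare directly against membership indicators
    return sum((u in Y) != bool(q[u]) for u in range(n)) / n
```

For example, with n = 4, Y = {0, 2} and q = (1, 0, 1, 0), every node is predicted correctly and the loss is 0; predicting all zeros against Y = {0} mispredicts one node out of four, giving loss 0.25.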
Let F^w denote an LT influence function with parameters w ∈ R^{r+n} (edge weights and thresholds), and let us focus on the partial observation setting (only a node, and not its time of influence, is observed). Consider a simple algorithm that outputs an influence function with zero error on the training sample S = {(X¹, Y¹), ..., (X^m, Y^m)}, where the training error is

(1/m) ∑_{i=1}^m ℓ_{0-1}(Y^i, F^w(X^i)) = (1/(mn)) ∑_{i=1}^m ∑_{u=1}^n 1(χ_u(Y^i) ≠ F^w_u(X^i)).   (1)

Such a function always exists, as the training cascades are generated using the LT model. We will shortly look at computational issues in implementing this algorithm. We now explain our PAC learnability result for this algorithm. The main idea is to interpret LT influence functions as neural networks with linear threshold activations. The proof follows by bounding the VC-dimension of the class of all functions F^w_u for node u, and using standard arguments for showing learnability under finite VC-dimension [20]. We sketch the neural network (NN) construction in two steps (local influence as a two-layer NN, and global influence as a multilayer network; see Figure 1), where a crucial part is ensuring that no node gets influenced more than once during the influence process:

1. Local influence as a two-layer NN. Recall that the (local) influence at a node u for previously influenced nodes Z is given by 1(∑_{v ∈ N(u) ∩ Z} w_uv ≥ k_u). This can be modeled as a linear (binary) classifier, or equivalently as a two-layer NN with linear threshold activations. Here the input layer contains a unit for each node in the network and takes a binary value indicating whether the node is present in Z; the output layer contains a binary unit indicating whether u is influenced after one time step; the connections between the two layers correspond to the edges between u and the other nodes; and the threshold term on the output unit is the threshold parameter k_u. Thus the first step of the influence process can be modeled using a NN with two n-node layers (the input layer takes information about the seed set, and the binary output indicates which nodes got influenced).

Figure 1: Modeling a single time step t of the influence process F_{t,u} : 2^V → {0, 1} as a neural network (t ≥ 2): the portion in black computes whether or not node u is influenced in the current time step t, while that in red/blue enforces the constraint that u does not get influenced more than once during the influence process. Here ξ_{t,u} is 1 when a node has been influenced previously and 0 otherwise. The dotted red edges represent strong negative signals (a large negative weight) and the dotted blue edges represent strong positive signals. The initial input to each node u in the input layer is 1(u ∈ X), while that for the auxiliary nodes (in red) is 0.

2. From local to global: the multilayer network. The two-layer NN can be extended to multiple time steps by replicating the output layer once for each step. However, the resulting NN would allow a node to get influenced more than once during the influence process. To avoid this, we introduce an additional binary unit u' for each node u in a layer, which records whether node u was influenced in previous time steps. In particular, whenever node u is influenced in a layer, a strong positive signal is sent to activate u' in the next layer, which in turn sends out strong negative signals to ensure that u is never activated in subsequent layers²; we use additional connections to ensure that u' remains active thereafter. Note that a node u in layer t + 1 is 1 whenever u is influenced at time step t; let F^w_{t,u} : 2^V → {0, 1} denote this function computed at u for a given seed set. The LT influence function F^w_u (which for seed set X is 1 whenever u is influenced in any one of the n time steps) is then given by F^w_u(X) = ∑_{t=1}^n F^w_{t,u}(X). Clearly, F^w_u can be modeled as a NN with n + 1 layers.

A naive application of classic VC-dimension results for NNs [21] gives that the VC-dimension of the class of functions F_u is Õ(n(r + n)) (counting r + n parameters for each layer). Since the same parameters are repeated across layers, this can be tightened to Õ(r + n). The remaining proof involves standard uniform convergence arguments [20] and a union bound over all nodes.

² By a strong signal, we mean a large positive/negative connection weight that outweighs the signals from all other connections. Such connections can indeed be created when the weights are all bounded.

3.1 Efficient computation

Having shown PAC learnability, we turn to efficient implementation of the prescribed algorithm.

Partial observation. In the case where the training set does not specify the time at which each node was influenced, finding an influence function with zero training error is computationally hard in general (as this is similar to learning a recurrent neural network). In practice, however, we can leverage the neural network construction and solve the problem approximately, by replacing the linear threshold activation functions with sigmoidal activations and the 0-1 loss with a suitable continuous surrogate loss, and then applying the back-propagation based methods used for neural network learning.

Full observation. Here it turns out that the algorithm can be implemented in polynomial time using local computations. Given a fully observed sample S = {(X¹, Y¹_{1:n}), ..., (X^m, Y^m_{1:n})}, the loss of an influence function F for any (X, Y_{1:n}) is given by ℓ_{0-1}(∪_{t=1}^n Y_t, F(X)) and, as before, measures the fraction of mispredicted nodes. The prescribed algorithm then seeks to find parameters w for which the corresponding training error is 0. Given that the time of influence is observed, this problem can be decoupled into a set of linear programs (LPs) at each node; this is akin to locally estimating the parameters at each node. In particular, let w_u denote the parameters local to node u (incoming weights and threshold), and let f_u(Z; w_u) = 1(∑_{v ∈ N(u) ∩ Z} w_uv ≥ k_u) denote the local influence at u for a set Z of previously influenced nodes. Let α̂_{1,u}(w_u) = (1/m) ∑_{i=1}^m 1(χ_u(Y^i_1) ≠ f_u(X^i; w_u)), and α̂_{t,u}(w_u) = (1/m) ∑_{i=1}^m 1(χ_u(Y^i_t) ≠ f_u(Y^i_{t−1}; w_u)) for t ≥ 2, which, given the set of nodes Y^i_{t−1} influenced at time t − 1, measures the local prediction error at time t. Since the training sample was generated by an LT model, there always exist parameters such that α̂_{t,u}(w_u) = 0 for each t and u, which also implies that the overall training error is 0. Such a set of parameters can be obtained by formulating a suitable LP that can be solved in polynomial time. The details are in Appendix A.2.

4 The Independent Cascade model

We now address the question of learnability in the Independent Cascade (IC) model. 
Since the influence functions here have probabilistic outputs, the proof techniques we use differ from those of the previous section, and rely on arguments based on covering numbers. In this case, we use the squared loss, which for any Y ⊆ V and q ∈ [0, 1]^n is given by ℓ_sq(Y, q) = (1/n) ∑_{u=1}^n [χ_u(Y)(1 − q_u)² + (1 − χ_u(Y)) q_u²]. We make the mild assumption that the edge probabilities are bounded away from 0 and 1, i.e. w ∈ [λ, 1 − λ]^r for some λ ∈ (0, 0.5).

Theorem 2 (PAC learnability under the IC model). The class of influence functions under the IC model is PAC learnable w.r.t. ℓ_sq, and the sample complexity is m = Õ(ε⁻² n³ r). Furthermore, in the full observation setting, under additional assumptions (see Assumption 1), the influence functions can be learned in polynomial time with sample complexity Õ(ε⁻² n r³).

The proof is given in Appendix B. As noted earlier, an IC influence function can be sensitive to errors in the estimated parameters. Hence, before discussing our algorithms and analysis, we seek to understand the extent to which changes in the IC parameters can produce changes in the influence function, and in particular, to check whether the function is Lipschitz. For this, we use the closed-form interpretation of the IC function as an expectation of an indicator term over a randomly drawn subset of edges from the network (see [2]). More specifically, the IC cascade process can be seen as activating a subset of edges in the network; since each edge can be activated at most once, the active edges can be seen as having been chosen a priori using independent Bernoulli draws. Consider a random subgraph of active edges obtained by choosing each edge (u, v) ∈ E independently with probability w_uv. For a given subset of edges A ⊆ E and seed set X ⊆ V, let σ_u(A, X) be an indicator function that evaluates to 1 if u is reachable from a node in X via edges in A, and 0 otherwise. Then the IC influence function can be written as an expectation of σ over a random draw of the subgraph:

F^w_u(X) = ∑_{A ⊆ E} [ ∏_{(a,b) ∈ A} w_ab ] [ ∏_{(a,b) ∉ A} (1 − w_ab) ] σ_u(A, X).   (2)

While the above definition involves an exponential number of terms, it can be verified that the corresponding gradient is bounded, implying that the IC function is Lipschitz:³

Lemma 3. Fix X ⊆ V. For any w, w' ∈ R^r with ‖w − w'‖₁ ≤ ε, |F^w_u(X) − F^{w'}_u(X)| ≤ ε.

This result tells us how small the parameter errors need to be in order to obtain accurate influence predictions, and it is used crucially in our learnability results. Note that for the chain example in Section 2, it tells us that the errors need to be less than 1/n for meaningful influence predictions.

We are now ready to provide the PAC learning algorithm for the partial observation setting with sample S = {(X¹, Y¹), ..., (X^m, Y^m)}; we sketch the proof here. The full observation case is outlined in Section 4.1, where we make use of a different approach based on local estimation. Let F^w denote the IC influence function with parameters w. The algorithm we consider for partial observation resorts to a maximum likelihood (ML) estimation of the (global) IC function. Let χ_u(Y) = 1(u ∈ Y). Define the (global) log-likelihood for a cascade (X, Y) as

L(X, Y; w) = ∑_{u=1}^n [ χ_u(Y) ln(F^w_u(X)) + (1 − χ_u(Y)) ln(1 − F^w_u(X)) ].

The prescribed algorithm then solves the following optimization problem, and outputs an IC influence function F^w from the solution w obtained:

max_{w ∈ [λ, 1−λ]^r} ∑_{i=1}^m L(X^i, Y^i; w).   (3)

³ In practice, IC influence functions can be computed through suitable sampling approaches. Also, note that a function class can be PAC learnable even if the individual functions cannot be computed efficiently.

To provide learnability guarantees for the above ML based procedure, we construct a finite ε-cover over the space of IC influence functions, i.e. we show that the class can be approximated to a factor of ε (in the infinity-norm sense) by a finite set of IC influence functions. We first construct an ε-cover of size O((r/ε)^r) over the space of parameters [λ, 1 − λ]^r, and use Lipschitzness to translate this into an ε-cover of the same size over the IC class. Following this, standard uniform convergence arguments [20] can be used to derive a sample complexity guarantee on the expected likelihood with a logarithmic dependence on the cover size; this then implies the desired learnability result w.r.t. ℓ_sq:

Lemma 4 (Sample complexity guarantee on the log-likelihood objective). Fix ε, δ ∈ (0, 1) and m = Õ(ε⁻² n³ r). Let ŵ be the parameters obtained from ML estimation. Then with probability ≥ 1 − δ,

sup_{w ∈ [λ, 1−λ]^r} E[ (1/n) L(X, Y; w) ] − E[ (1/n) L(X, Y; ŵ) ] ≤ ε.

Compared to the results for the LT model, the sample complexity in Theorem 2 has a quadratic dependence on 1/ε. 
This is not surprising: unlike the LT model, where the optimal 0-1 error is zero, the optimal squared error here is non-zero in general; in fact, standard sample complexity lower bound results show that in similar settings one cannot obtain a tighter bound in terms of $1/\epsilon$ [20].

We also wish to note that the approach of Du et al. (2014) for learning influence under partial observation [13] uses the same interpretation of the IC influence function as in Eq. (2), but rather than learning the parameters of the model, they seek to learn the weights on the individual indicator functions. Since there are exponentially many indicator terms, they resort to constructing approximations to the influence function, for which a strong technical condition needs to be satisfied; however, this condition need not hold in most settings. In contrast, our result applies to general settings.

4.1 Efficient computation

Partial observation. The optimization problem in Eq. (3) that we need to solve for the partial observation case is non-convex in general. Of course, in practice, it can be solved approximately using gradient-based techniques, with sample-based gradient computations to deal with the exponential number of terms in the definition of $F^w$ in the objective (see Appendix B.5).

Full observation. On the other hand, when the training sample $S = \{(X^1, Y^1_{1:n}), \ldots, (X^m, Y^m_{1:n})\}$ contains fully observed cascades, we are able to show polynomial time learnability. For the LT model, we were assured of a set of parameters that would yield zero 0-1 error on the training sample, and hence the same procedure prescribed for partial observation could be implemented under full observation in polynomial time by reduction to local computations.
This is not the case with the IC model, where we resort to the common approach of learning influence by estimating the model parameters through a local maximum likelihood (ML) estimation technique. This method is similar to the maximum likelihood procedure used in [9] for solving a different problem: recovering the structure of an unknown network from cascades. For the purpose of showing learnability, we find it sufficient to apply this procedure to only the first time step of the cascade.

Our analysis first provides guarantees on the estimated parameters, and then uses the Lipschitz property in Lemma 3 to translate them into guarantees on the influence function. Since we now wish to give guarantees in the parameter space, we will require that there exists a unique set of parameters that explains the IC cascade process; for this, we will need stricter assumptions. We assume that all edges have a minimum influence strength, and that even when all neighbors of a node $u$ are influenced in a time step, there is a small probability of $u$ not being influenced in the next step; we also consider a specific seed distribution, where each node has a non-zero probability of (not) being a seed node.

Assumption 1. Let $w^*$ denote the parameters of the underlying IC model. Then there exist $\lambda, \gamma \in (0, 0.5)$ such that $w^*_{uv} \ge \lambda$ for all $(u, v) \in E$ and $\prod_{v \in N(u)} (1 - w^*_{uv}) \ge \gamma$ for all $u \in V$. Also, each node in $V$ is chosen independently in the initial seed set with probability $\kappa \in (0, 1)$.

We first define the local log-likelihood for a given seed set $X$ and set of nodes $Y_1$ influenced at $t = 1$:

$$L(X, Y_1; \beta) \;=\; \sum_{u \notin X} \left[ \chi_u(Y_1)\, \ln\!\Big(1 - \exp\Big({-}\!\!\sum_{v \in N(u) \cap X}\!\! \beta_{uv}\Big)\Big) \;-\; (1 - \chi_u(Y_1)) \!\!\sum_{v \in N(u) \cap X}\!\! \beta_{uv} \right],$$

where we have used the log-transformed parameters $\beta_{uv} = -\ln(1 - w_{uv})$, so that the objective is concave in $\beta$. The prescribed algorithm then solves the following maximization problem over all parameters that satisfy Assumption 1, and constructs an IC influence function from the resulting parameters:

$$\max_{\beta \in \mathbb{R}^r_+} \; \sum_{i=1}^{m} L(X^i, Y^i_1; \beta) \qquad \text{s.t.} \quad \beta_{uv} \ge \ln\Big(\tfrac{1}{1-\lambda}\Big) \;\;\forall (u, v) \in E, \qquad \sum_{v \in N(u)} \beta_{uv} \le \ln\Big(\tfrac{1}{\gamma}\Big) \;\;\forall u \in V.$$

This problem breaks down into smaller convex problems and can be solved efficiently (see [9]).

Proposition 5 (PAC learnability under IC model with full observation). Under full observation and Assumption 1, the class of IC influence functions is PAC learnable in polynomial time through local ML estimation.
The corresponding sample complexity is $\tilde{O}\big(nr^3\, (\kappa^2 (1-\kappa)^4 \lambda^2 \gamma^2 \epsilon^2)^{-1}\big)$.

The proof is provided in Appendix B.6 and proceeds through the following steps: (1) we use covering number arguments to show that the local log-likelihood for the estimated parameters is close to the optimal value; (2) we then show that under Assumption 1, the expected log-likelihood is strongly concave, which gives us that closeness to the true model in terms of the likelihood also implies closeness to the true parameters in the parameter space; (3) we finally use the Lipschitz property in Lemma 3 to translate this into guarantees on the global influence function.

Note that the sample complexity here has a worse dependence on the number of edges $r$ than in the partial observation case; this is due to the two-step approach of requiring guarantees on the individual parameters and then transferring them to the influence function. The better dependence on the number of nodes $n$ is a consequence of estimating parameters locally. It would be interesting to see whether tighter results can be obtained by using influence information from all time steps, and by making different assumptions on the model parameters (e.g. the correlation decay assumption in [9]).

5 The Voter model

Before closing, we sketch our learnability results for the Voter model, where unlike the previous models the graph is undirected (with self-loops). Here we are interested in learning influence for a fixed number of time steps $K$, as the cascades can be longer than $n$. With the squared loss again as the loss function, this problem almost immediately reduces to linear least squares regression.

Let $W \in [0, 1]^{n \times n}$ be the matrix of normalized edge weights, with $W_{uv} = w_{uv} / \sum_{v' \in N(u) \cup \{u\}} w_{uv'}$ if $(u, v) \in E$ and 0 otherwise.
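Concretely, the row normalization above makes $W$ a transition matrix, and multi-step influence probabilities are linear in the seed-set indicator, which is what enables the least squares reduction. A small numpy sketch (the 3-node graph, weights, and horizon below are illustrative assumptions):

```python
import numpy as np

# Symmetric edge weights with self-loops for a toy 3-node undirected graph.
A = np.array([[1.0, 1.0, 0.0],
              [1.0, 1.0, 1.0],
              [0.0, 1.0, 1.0]])
W = A / A.sum(axis=1, keepdims=True)  # W_uv = w_uv / sum_{v'} w_uv': one-step transition matrix
K = 3
WK = np.linalg.matrix_power(W, K)     # K-step transition matrix

def influence_prob(u, X, n=3):
    """F_u(X) = 1_u^T W^K 1_X = b^T 1_X with b = (W^K)^T 1_u: linear in 1_X."""
    one_X = np.zeros(n)
    one_X[list(X)] = 1.0
    b = WK[u, :]                      # row u of W^K, i.e. (W^K)^T 1_u
    return float(b @ one_X)
```

Since $W^K$ remains row-stochastic, seeding every node gives influence probability 1. Learning then amounts to fitting the $n$ coefficients $b$ for each node by least squares from observed (seed indicator, influenced-or-not) pairs.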
Note that $W$ can be seen as a one-step probability transition matrix. Then, for an initial seed set $X \subseteq V$, the probability of a node $u$ being influenced under this model after one time step can be verified to be $\mathbf{1}_u^\top W \mathbf{1}_X$, where $\mathbf{1}_X \in \{0, 1\}^n$ is a column vector containing 1 in the entries corresponding to nodes in $X$, and 0 everywhere else. Similarly, to calculate the probability of a node $u$ being influenced after $K$ time steps, one can use the $K$-step transition matrix: $F_u(X) = \mathbf{1}_u^\top W^K \mathbf{1}_X$. Now, setting $b = (W^K)^\top \mathbf{1}_u$, we have $F_u(X) = b^\top \mathbf{1}_X$, which is essentially a linear function parametrized by $n$ weights.

Thus learning influence in the Voter model (for a fixed cascade length) can be posed as $n$ independent linear regressions (one per node) with $n$ coefficients each. This can be solved in polynomial time even with partially observed data. We then have the following from standard results [20].

Theorem 6 (PAC learnability under Voter model). The class of influence functions under the Voter model is PAC learnable w.r.t. $\ell_{sq}$ in polynomial time, and the sample complexity is $\tilde{O}(\epsilon^{-2} n)$.

6 Conclusion

We have established PAC learnability of some of the most celebrated models of influence in social networks. Our results point towards interesting connections between learning theory and the literature on influence in networks. Beyond the practical implications of the ability to learn influence functions from cascades, the fact that the main models of influence are PAC learnable serves as further evidence of their potent modeling capabilities. It would be interesting to see whether our results extend to generalizations of the LT and IC models, and to investigate sample complexity lower bounds.

Acknowledgements.
Part of this work was carried out while HN was visiting Harvard as a part of a student visit under the Indo-US Joint Center for Advanced Research in Machine Learning, Game Theory & Optimization supported by the Indo-US Science & Technology Forum. HN thanks Kevin Murphy, Shivani Agarwal and Harish G. Ramaswamy for helpful discussions. YS and DP were supported by NSF grant CCF-1301976 and YS by CAREER CCF-1452961 and a Google Faculty Research Award.

References

[1] Pedro Domingos and Matthew Richardson. Mining the network value of customers. In KDD, 2001.

[2] David Kempe, Jon M. Kleinberg, and Éva Tardos. Maximizing the spread of influence through a social network. In KDD, 2003.

[3] Amit Goyal, Francesco Bonchi, and Laks V.S. Lakshmanan. Learning influence probabilities in social networks. In KDD, 2010.

[4] Manuel Gomez-Rodriguez, David Balduzzi, and Bernhard Schölkopf. Uncovering the temporal dynamics of diffusion networks. In ICML, 2011.

[5] Nan Du, Le Song, Alexander J. Smola, and Ming Yuan. Learning networks of heterogeneous influence. In NIPS, 2012.

[6] Manuel Gomez-Rodriguez, Jure Leskovec, and Andreas Krause. Inferring networks of diffusion and influence. ACM Transactions on Knowledge Discovery from Data, 5(4):21, 2012.

[7] Nan Du, Le Song, Manuel Gomez-Rodriguez, and Hongyuan Zha. Scalable influence estimation in continuous-time diffusion networks. In NIPS, 2013.

[8] Abir De, Sourangshu Bhattacharya, Parantapa Bhattacharya, Niloy Ganguly, and Soumen Chakrabarti. Learning a linear influence model from transient opinion dynamics. In CIKM, 2014.

[9] Praneeth Netrapalli and Sujay Sanghavi. Learning the graph of epidemic cascades. In SIGMETRICS, 2012.

[10] Hadi Daneshmand, Manuel Gomez-Rodriguez, Le Song, and Bernhard Schölkopf. Estimating diffusion network structures: Recovery conditions, sample complexity & soft-thresholding algorithm.
In ICML, 2014.

[11] Jean Pouget-Abadie and Thibaut Horel. Inferring graphs from cascades: A sparse recovery framework. In ICML, 2015.

[12] Bruno D. Abrahao, Flavio Chierichetti, Robert Kleinberg, and Alessandro Panconesi. Trace complexity of network inference. In KDD, 2013.

[13] Nan Du, Yingyu Liang, Maria-Florina Balcan, and Le Song. Influence function learning in information diffusion networks. In ICML, 2014.

[14] Leslie G. Valiant. A theory of the learnable. Communications of the ACM, 27(11):1134–1142, 1984.

[15] Elchanan Mossel and Sébastien Roch. On the submodularity of influence in social networks. In STOC, 2007.

[16] Eyal Even-Dar and Asaf Shapira. A note on maximizing the spread of influence in social networks. Information Processing Letters, 111(4):184–187, 2011.

[17] Maria-Florina Balcan and Nicholas J.A. Harvey. Learning submodular functions. In STOC, 2011.

[18] Vitaly Feldman and Pravesh Kothari. Learning coverage functions and private release of marginals. In COLT, 2014.

[19] Jean Honorio and Luis Ortiz. Learning the structure and parameters of large-population graphical games from behavioral data. Journal of Machine Learning Research, 16:1157–1210, 2015.

[20] Martin Anthony and Peter L. Bartlett. Neural Network Learning: Theoretical Foundations. Cambridge University Press, 1999.

[21] Peter L. Bartlett and Wolfgang Maass. Vapnik-Chervonenkis dimension of neural nets. Handbook of Brain Theory and Neural Networks, pages 1188–1192, 1995.

[22] Tong Zhang. Statistical behavior and consistency of classification methods based on convex risk minimization.
Annals of Statistics, 32(1):56–134, 2004.