{"title": "Learning Time-Varying Coverage Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 3374, "page_last": 3382, "abstract": "Coverage functions are an important class of discrete functions that capture laws of diminishing returns. In this paper, we propose a new problem of learning time-varying coverage functions which arise naturally from applications in social network analysis, machine learning, and algorithmic game theory. We develop a novel parametrization of the time-varying coverage function by illustrating the connections with counting processes. We present an efficient algorithm to learn the parameters by maximum likelihood estimation, and provide a rigorous theoretic analysis of its sample complexity. Empirical experiments from information diffusion in social network analysis demonstrate that with few assumptions about the underlying diffusion process, our method performs significantly better than existing approaches on both synthetic and real world data.", "full_text": "Learning Time-Varying Coverage Functions\n\nNan Du\u2020, Yingyu Liang\u2021, Maria-Florina Balcan(cid:5), Le Song\u2020\n\n\u2020College of Computing, Georgia Institute of Technology\n\u2021Department of Computer Science, Princeton University\n(cid:5)School of Computer Science, Carnegie Mellon University\ndunan@gatech.edu,yingyul@cs.princeton.edu\n\nninamf@cs.cmu.edu,lsong@cc.gatech.edu\n\nAbstract\n\nCoverage functions are an important class of discrete functions that capture the\nlaw of diminishing returns arising naturally from applications in social network\nanalysis, machine learning, and algorithmic game theory. In this paper, we pro-\npose a new problem of learning time-varying coverage functions, and develop a\nnovel parametrization of these functions using random features. 
Based on the connection between time-varying coverage functions and counting processes, we also propose an efficient parameter learning algorithm based on likelihood maximization, and provide a sample complexity analysis. We apply our algorithm to the influence function estimation problem in information diffusion in social networks, and show that, with few assumptions about the diffusion processes, our algorithm estimates influence significantly more accurately than existing approaches on both synthetic and real world data.

1 Introduction

Coverage functions are a special class of the more general submodular functions, which play an important role in combinatorial optimization with many interesting applications in social network analysis [1], machine learning [2], and economics and algorithmic game theory [3]. A particularly important example of a coverage function in practice is the influence function of users in information diffusion modeling [1]: news spreads across social networks by word-of-mouth, and a set of influential sources can collectively trigger a large number of follow-ups. Another example of coverage functions is the valuation functions of customers in economics and game theory [3]: customers are thought to have certain requirements, and the items being bundled and offered fulfill certain subsets of these demands.

Theoretically, it is usually assumed that users' influence or customers' valuations are known in advance as an oracle. In practice, however, these functions must be learned. For example, given past traces of information spreading in social networks, a social platform host would like to estimate how many follow-ups a set of users can trigger. Or, given past data of customer reactions to different bundles, a retailer would like to estimate how likely customers are to respond to new packages of goods.
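For intuition (a toy example of our own, not taken from the paper), a coverage function over a finite universe counts how much of that universe is covered by the union of per-item subsets, and the law of diminishing returns mentioned above is exactly submodularity: adding an item to a larger set never helps more than adding it to a smaller one. The universe and subsets below are invented for illustration.

```python
def coverage(item_sets, S):
    """f(S) = size of the union of the subsets covered by items in S."""
    covered = set()
    for s in S:
        covered |= item_sets[s]
    return len(covered)

# Toy universe: each item covers the set of users it can influence.
U_s = {"a": {1, 2, 3}, "b": {3, 4}, "c": {4, 5, 6}}

small = {"a"}           # a smaller seed set
large = {"a", "b"}      # a superset of it
x = "c"                 # a candidate item to add

gain_small = coverage(U_s, small | {x}) - coverage(U_s, small)
gain_large = coverage(U_s, large | {x}) - coverage(U_s, large)
# Diminishing returns (submodularity): the marginal gain of x shrinks
# as the base set grows.
assert gain_small >= gain_large
```

Here the marginal gain of item "c" drops from 3 (on top of {"a"}) to 2 (on top of {"a", "b"}), because user 4 is already covered by "b".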
Learning such combinatorial functions has attracted many recent research efforts from both the theoretical and practical sides (e.g., [4, 5, 6, 7, 8]), many of which show that coverage functions can be learned from just a polynomial number of samples.

However, prior work has largely ignored an important dynamic aspect of coverage functions. For instance, information spreading is a dynamic process in social networks, and the number of follow-ups of a fixed set of sources can increase as the observation time increases. A bundle of items or features offered to customers may trigger a sequence of customer actions over time. These real world problems inspire and motivate us to consider a novel time-varying coverage function, f(S, t), which is a coverage function of the set S when we fix a time t, and a continuous monotonic function of time t when we fix a set S. While learning time-varying combinatorial structures has been explored in the graphical model setting (e.g., [9, 10]), as far as we are aware, learning of time-varying coverage functions has not been addressed in the literature. Furthermore, we are interested in estimating the entire function of t, rather than just treating the time t as a discrete index and learning the function value at a small number of discrete points. From this perspective, our formulation is a generalization of the most recent work [8], with even fewer assumptions about the data used to learn the model.

Generally, we assume that the historical data are provided in pairs of a set and a collection of timestamps at which events caused by the set occur. Hence, such a collection of temporal events associated with a particular set Si can be modeled principally by a counting process Ni(t), t ≥ 0, which is a stochastic process taking nonnegative, integer values that are nondecreasing over time [11].
For instance, in the information diffusion setting of online social networks, given a set of earlier adopters of some new product, Ni(t) models the time sequence of all triggered events of the followers, where each jump in the process records the timing tij of an action. In the economics and game theory setting, the counting process Ni(t) records the number of actions a customer has taken over time given a particular bundled offer. This essentially raises an interesting question of how to estimate the time-varying coverage function from the angle of counting processes. We thus propose a novel formulation which builds a connection between the two by modeling the cumulative intensity function of a counting process as a time-varying coverage function. The key idea is to parametrize the intensity function as a weighted combination of random kernel functions. We then develop an efficient learning algorithm, TCOVERAGELEARNER, to estimate the parameters of the function using a maximum likelihood approach. We show that our algorithm can provably learn the time-varying coverage function using only a polynomial number of samples. Finally, we validate TCOVERAGELEARNER on both influence estimation and maximization problems using cascade data from information diffusion. We show that our method performs significantly better than alternatives with little prior knowledge about the dynamics of the actual underlying diffusion processes.

2 Time-Varying Coverage Function

We will first give a formal definition of the time-varying coverage function, and then explain its additional properties in detail.

Definition. Let U be a (potentially uncountable) domain. We endow U with some σ-algebra A and denote a probability distribution on U by P.
A coverage function is a combinatorial function over a finite set V of items, defined as

f(S) := Z · P(⋃_{s∈S} U_s),  for all S ∈ 2^V,  (1)

where U_s ⊂ U is the subset of the domain U covered by item s ∈ V, and Z is a normalization constant. For time-varying coverage functions, we let the size of the subset U_s grow monotonically over time, that is,

U_s(t) ⊆ U_s(τ),  for all t ≤ τ and s ∈ V,  (2)

which results in a combinatorial temporal function

f(S, t) = Z · P(⋃_{s∈S} U_s(t)),  for all S ∈ 2^V.  (3)

In this paper, we assume that f(S, t) is smooth and continuous, and that its first order derivative with respect to time, f′(S, t), is also smooth and continuous.

Representation. We now show that a time-varying coverage function, f(S, t), can be represented as an expectation over random functions based on multidimensional step basis functions. Since U_s(t) varies over time, we can associate each u ∈ U with a |V|-dimensional vector τ_u of change points. In particular, the s-th coordinate of τ_u records the time at which source node s covers u. Let τ be a random variable obtained by sampling u according to P and setting τ = τ_u. Note that given all τ_u we can compute f(S, t); now we claim that the distribution of τ is sufficient.

We first introduce some notation. Based on τ_u we define a |V|-dimensional step function r_u(t) : R₊ ↦ {0, 1}^|V|, where the s-th dimension of r_u(t) is 1 if u is covered by the set U_s(t) at time t, and 0 otherwise. To emphasize the dependence of the function r_u(t) on τ_u, we will also write r_u(t) as r_u(t|τ_u). We denote the indicator vector of a set S by χ_S ∈ {0, 1}^|V|, where the s-th dimension of χ_S is 1 if s ∈ S, and 0 otherwise. Then u ∈ U is covered by ⋃_{s∈S} U_s(t) at time t if χ_S⊤ r_u(t) ≥ 1.

Lemma 1. There exists a distribution Q(τ) over the vector of change points τ, such that the time-varying coverage function can be represented as

f(S, t) = Z · E_{τ∼Q(τ)}[φ(χ_S⊤ r(t|τ))]  (4)

where φ(x) := min{x, 1}, and r(t|τ) is a multidimensional step function parameterized by τ.

Proof. Let U_S := ⋃_{s∈S} U_s(t). By definition (3), we have the following integral representation

f(S, t) = Z · ∫_U I{u ∈ U_S} dP(u) = Z · ∫_U φ(χ_S⊤ r_u(t)) dP(u) = Z · E_{u∼P(u)}[φ(χ_S⊤ r_u(t))].

We can define the set of u having the same τ as U_τ := {u ∈ U | τ_u = τ} and define a distribution over τ as dQ(τ) := ∫_{U_τ} dP(u). Then the integral representation of f(S, t) can be rewritten as

f(S, t) = Z · E_{u∼P(u)}[φ(χ_S⊤ r_u(t))] = Z · E_{τ∼Q(τ)}[φ(χ_S⊤ r(t|τ))],

which proves the lemma.

3 Model for Observations

In general, we assume that the input data are provided in the form of pairs, (Si, Ni(t)), where Si is a set, and Ni(t) is a counting process in which each jump of Ni(t) records the timing of an event. We first give a brief overview of a counting process [11] and then motivate our model in detail.

Counting Process. Formally, a counting process {N(t), t ≥ 0} is any nonnegative, integer-valued stochastic process such that N(t′) ≤ N(t) whenever t′ ≤ t and N(0) = 0. The most common use of a counting process is to count the number of occurrences of temporal events happening along time, so the index set is usually taken to be the nonnegative real numbers R₊.
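As a concrete illustration (our own sketch, not part of the paper), the simplest counting process is the homogeneous Poisson process, whose cumulative intensity is Λ(t) = a·t for a constant intensity a. It can be simulated by accumulating i.i.d. exponential inter-event times; the resulting jump times play the role of the event timestamps tij in the input data described above.

```python
import random

def simulate_poisson_counting_process(intensity, horizon, seed=0):
    """Simulate a homogeneous Poisson counting process on [0, horizon].

    Inter-event times of a Poisson process with constant intensity a
    are i.i.d. Exponential(a); we return the list of jump times.
    """
    rng = random.Random(seed)
    jump_times = []
    t = rng.expovariate(intensity)
    while t < horizon:
        jump_times.append(t)
        t += rng.expovariate(intensity)
    return jump_times

def N(jump_times, t):
    """Evaluate the counting process: number of events up to time t."""
    return sum(1 for s in jump_times if s <= t)

jumps = simulate_poisson_counting_process(intensity=2.0, horizon=10.0)
# N(t) is nonnegative, integer-valued, nondecreasing, and N(0) = 0.
```

For large horizons, N(horizon)/horizon concentrates around the intensity a, which is the elementary instance of the compensator-plus-martingale decomposition discussed next.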
A counting process is a submartingale: E[N(t) | H_{t′}] ≥ N(t′) for all t > t′, where H_{t′} denotes the history up to time t′. By the Doob-Meyer theorem [11], N(t) has the unique decomposition

N(t) = Λ(t) + M(t)  (5)

where Λ(t) is a nondecreasing predictable process called the compensator (or cumulative intensity), and M(t) is a mean zero martingale. Since E[dM(t) | H_{t−}] = 0, where dM(t) is the increment of M(t) over a small time interval [t, t + dt), and H_{t−} is the history until just before time t,

E[dN(t) | H_{t−}] = dΛ(t) := a(t) dt  (6)

where a(t) is called the intensity of the counting process.

Model formulation. We assume that the cumulative intensity of the counting process is modeled by a time-varying coverage function, i.e., the observation pair (Si, Ni(t)) is generated by

Ni(t) = f(Si, t) + Mi(t)  (7)

in the time window [0, T] for some T > 0, and df(S, t) = a(S, t) dt. In other words, the time-varying coverage function controls the propensity of events occurring over time. Specifically, for a fixed set Si, as time t increases, the cumulative number of events observed grows accordingly because f(Si, t) is a continuous monotonic function over time; for a given time t, as the set Si changes to another set Sj, the amount of coverage over the domain U may change and hence can result in a different cumulative intensity. This abstract model can be mapped to real world applications. In the information diffusion context, for a fixed set of sources Si, as time t increases, the number of influenced nodes in the social network tends to increase; for a given time t, if we change the sources to Sj, the number of influenced nodes may be different depending on how influential the sources are.
In the economics and game theory context, for a fixed bundle of offers Si, as time t increases, it is more likely that the merchant will observe the customers' actions in response to the offers; even at the same time t, different bundles of offers, Si and Sj, may have very different abilities to drive the customers' actions.

Compared to a regression model yi = g(Si) + εi with i.i.d. input data (Si, yi), our model outputs a special random function over time, that is, a counting process Ni(t) with the noise being a zero mean martingale Mi(t). In contrast to functional regression models, our model exploits much more interesting structure of the problem. For instance, the random function representation in the last section can be used to parametrize the model. This special structure of the counting process allows us to estimate the parameters of our model efficiently using a maximum likelihood approach, and the martingale noise enables us to use exponential concentration inequalities in analyzing our algorithm.

4 Parametrization

Based on the following two mild assumptions, we will show how to parametrize the intensity function as a weighted combination of random kernel functions, learn the parameters by maximum likelihood estimation, and eventually derive a sample complexity.

(A1) a(S, t) is smooth and bounded on [0, T]: 0 < a_min ≤ a ≤ a_max < ∞, and ä := d²a/dt² is absolutely continuous with ∫ ä(t) dt < ∞.
(A2) There is a known distribution Q′(τ) and a constant C with Q′(τ)/C ≤ Q(τ) ≤ C Q′(τ).

Kernel Smoothing. To facilitate our finite dimensional parametrization, we first convolve the intensity function with K(t) = k(t/σ)/σ, where σ is the bandwidth parameter and k is a kernel function (such as the Gaussian RBF kernel k(t) = e^{−t²/2}/√(2π)) with

0 ≤ k(t) ≤ κ_max,  ∫ k(t) dt = 1,  ∫ t k(t) dt = 0,  and  σ_k² := ∫ t² k(t) dt < ∞.  (8)

The convolution results in a smoothed intensity aK(S, t) = K(t) ⋆ (df(S, t)/dt) = d(K(t) ⋆ Λ(S, t))/dt. By the property of convolution and exchanging derivative with integral, we have that

aK(S, t) = d(Z · E_{τ∼Q(τ)}[K(t) ⋆ φ(χ_S⊤ r(t|τ))])/dt    by definition of f(·)
         = Z · E_{τ∼Q(τ)}[d(K(t) ⋆ φ(χ_S⊤ r(t|τ)))/dt]    exchange derivative and integral
         = Z · E_{τ∼Q(τ)}[K(t) ⋆ δ(t − t(S, τ))]          by property of convolution and function φ(·)
         = Z · E_{τ∼Q(τ)}[K(t − t(S, τ))]                 by definition of δ(·)

where t(S, τ) is the time at which the function φ(χ_S⊤ r(t|τ)) jumps from 0 to 1. If we choose a small enough kernel bandwidth, aK only incurs a small bias from a. But the smoothed intensity still involves an infinite number of parameters, due to the unknown distribution Q(τ). To address this problem, we design the following random approximation with a finite number of parameters.

Random Function Approximation. The key idea is to sample a collection of W random change points τ from a known distribution Q′(τ), which can be different from Q(τ). If Q′(τ) is not very far from Q(τ), the random approximation will be close to aK, and thus close to a. More specifically, we will denote the space of weighted combinations of W random kernel functions by

A = { aK_w(S, t) = Σ_{i=1}^W w_i K(t − t(S, τ_i)) : w ≥ 0, Z/C ≤ ‖w‖₁ ≤ ZC },  {τ_i} i.i.d. ∼ Q′(τ).  (9)

Lemma 2.
If W = Õ(Z²/(εσ)²), then with probability ≥ 1 − δ, there exists an ã ∈ A such that

E_S E_t[(a(S, t) − ã(S, t))²] := E_{S∼P(S)} ∫_0^T (a(S, t) − ã(S, t))² dt / T = O(ε² + σ⁴).

The lemma then suggests setting the kernel bandwidth σ = O(√ε) to get O(ε²) approximation error.

5 Learning Algorithm

We develop a learning algorithm, referred to as TCOVERAGELEARNER, to estimate the parameters of aK_w(S, t) by maximizing the joint likelihood of all observed events, based on convex optimization techniques, as follows.

Maximum Likelihood Estimation. Instead of directly estimating the time-varying coverage function, which is the cumulative intensity function of the counting process, we estimate the intensity function a(S, t) = ∂Λ(S, t)/∂t. Given m i.i.d. counting processes, D^m := {(S1, N1(t)), ..., (Sm, Nm(t))}, up to observation time T, the log-likelihood of the dataset is [11]

ℓ(D^m | a) = Σ_{i=1}^m { ∫_0^T log a(Si, t) dNi(t) − ∫_0^T a(Si, t) dt }.  (10)

Maximizing the log-likelihood with respect to the intensity function a(S, t) then gives us the estimate â(S, t). The W-term random kernel function approximation reduces a function optimization problem to a finite dimensional optimization problem, while incurring only a small bias in the estimated function.

Algorithm 1 TCOVERAGELEARNER
  INPUT: {(Si, Ni(t))}, i = 1, ..., m;
  Sample W random features τ1, ..., τW from Q′(τ);
  Compute {t(Si, τw)}, {gi}, {k(tij)}, i ∈ {1, ..., m}, w = 1, ..., W, tij < T;
  Initialize w0 ∈ Ω = {w ≥ 0, ‖w‖₁ ≤ 1};
  Apply the projected quasi-Newton algorithm [12] to solve (11);
  OUTPUT: aK_w(S, t) = Σ_{i=1}^W w_i K(t − t(S, τ_i))

Convex Optimization. By plugging the parametrization aK_w(S, t) (9) into the log-likelihood (10), we formulate the optimization problem as:

min_w  Σ_{i=1}^m { w⊤gi − Σ_{tij} log(w⊤k(tij)) }   subject to w ≥ 0, ‖w‖₁ ≤ 1,  (11)

where we define
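The constrained problem (11) is solved in Algorithm 1 with a projected quasi-Newton method [12]. As a rough illustration of the same constrained MLE (our own sketch, with plain projected gradient descent substituted for quasi-Newton, and with hypothetical precomputed inputs g_i and K_i standing in for the paper's {gi} and {k(tij)}), the solver alternates a gradient step on the objective with a Euclidean projection onto Ω = {w ≥ 0, ‖w‖₁ ≤ 1}:

```python
import numpy as np

def project_l1_ball_nonneg(w):
    """Project onto {w >= 0, ||w||_1 <= 1}: clip, then simplex projection if needed."""
    w = np.maximum(w, 0.0)
    if w.sum() <= 1.0:
        return w
    # Euclidean projection onto the probability simplex (sorted-cumsum method).
    u = np.sort(w)[::-1]
    css = np.cumsum(u)
    rho = np.nonzero(u * np.arange(1, len(w) + 1) > (css - 1.0))[0][-1]
    theta = (css[rho] - 1.0) / (rho + 1.0)
    return np.maximum(w - theta, 0.0)

def neg_log_likelihood(w, g, kernels):
    """Objective of (11): sum_i [ w.g_i - sum_j log(w.k(t_ij)) ].

    g: list of length-W vectors; kernels: list of (num_events_i, W) matrices
    whose rows are kernel evaluations at the event times t_ij.
    """
    return sum(w @ g_i - np.sum(np.log(K_i @ w))
               for g_i, K_i in zip(g, kernels))

def fit(g, kernels, steps=500, lr=0.01):
    W = len(g[0])
    w = np.full(W, 1.0 / W)                 # feasible starting point in Omega
    for _ in range(steps):
        grad = np.zeros(W)
        for g_i, K_i in zip(g, kernels):
            grad += g_i - K_i.T @ (1.0 / (K_i @ w))
        w = project_l1_ball_nonneg(w - lr * grad)
        w = np.maximum(w, 1e-12)            # keep the log-arguments positive
    return w
```

Because the objective is convex in w and Ω is a convex set, this simple scheme converges to the same minimizer the quasi-Newton solver targets, only more slowly.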