{"title": "Uplift Modeling from Separate Labels", "book": "Advances in Neural Information Processing Systems", "page_first": 9927, "page_last": 9937, "abstract": "Uplift modeling is aimed at estimating the incremental impact of an action on an individual's behavior, which is useful in various application domains such as targeted marketing (advertisement campaigns) and personalized medicine (medical treatments). Conventional methods of uplift modeling require every instance to be jointly equipped with two types of labels: the taken action and its outcome. However, obtaining two labels for each instance at the same time is difficult or expensive in many real-world problems. In this paper, we propose a novel method of uplift modeling that is applicable to a more practical setting where only one type of labels is available for each instance. We show a mean squared error bound for the proposed estimator and demonstrate its effectiveness through experiments.", "full_text": "Uplift Modeling from Separate Labels\n\nIkko Yamane1,2 Florian Yger3,2\n\nJamal Atif3 Masashi Sugiyama2,1\n\n1 The University of Tokyo, CHIBA, JAPAN\n\n2 RIKEN Center for Advanced Intelligence Project (AIP), TOKYO, JAPAN\n\n3 LAMSADE, CNRS, Universit\u00e9 Paris-Dauphine, Universit\u00e9 PSL, PARIS, FRANCE\n\n{yamane@ms., sugi@}k.u-tokyo.ac.jp, {florian.yger@, jamal.atif@}dauphine.fr\n\nAbstract\n\nUplift modeling is aimed at estimating the incremental impact of an action on\nan individual\u2019s behavior, which is useful in various application domains such as\ntargeted marketing (advertisement campaigns) and personalized medicine (medical\ntreatments). Conventional methods of uplift modeling require every instance to\nbe jointly equipped with two types of labels: the taken action and its outcome.\nHowever, obtaining two labels for each instance at the same time is dif\ufb01cult or\nexpensive in many real-world problems. 
In this paper, we propose a novel method of uplift modeling that is applicable to a more practical setting where only one type of label is available for each instance. We show a mean squared error bound for the proposed estimator and demonstrate its effectiveness through experiments.\n\n1 Introduction\n\nIn many real-world problems, a central objective is to choose the right action to maximize the profit of interest. For example, in marketing, an advertising campaign is designed to encourage people to purchase a product [29]. A marketer can choose whether or not to deliver an advertisement to each individual, and the outcome is the number of purchases of the product. Another example is personalized medicine, where a treatment is chosen depending on each patient to maximize the medical effect and minimize the risk of adverse events or harmful side effects [1, 13]. In this case, giving or not giving a medical treatment to each individual are the possible actions to choose from, and the outcome is the rate of recovery or survival from the disease. Hereafter, we use the word treatment for taking an action, following the personalized medicine example.\nA/B testing [14] is a standard method for such tasks, where two groups of people, A and B, are randomly chosen. The outcomes are measured separately from the two groups after treating all the members of Group A but none of Group B. By comparing the outcomes between the two groups with a statistical test, one can examine whether the treatment affected the outcome positively or negatively. However, A/B testing only compares the two extreme options: treating everyone or no one. 
These two options can both be far from optimal when the treatment has a positive effect on some individuals but a negative effect on others.\nTo overcome this drawback of A/B testing, uplift modeling has been investigated recently [11, 28, 32]. Uplift modeling is the problem of estimating the individual uplift, the incremental profit brought by the treatment conditioned on the features of each individual. Uplift modeling enables us to design a refined decision rule for optimally determining whether to treat each individual or not, depending on his/her features. Such a treatment rule allows us to target only those who respond positively to the treatment and avoid treating negative responders.\nIn the standard uplift modeling setup, there are two types of labels [11, 28, 32]: one is whether the treatment has been given to the individual, and the other is its outcome. Existing uplift modeling methods require each individual to be jointly given these two labels for analyzing the association between outcomes and the treatment [11, 28, 32].\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nHowever, joint labels are expensive or hard (or even impossible) to obtain in many real-world problems. For example, when distributing an advertisement by email, we can easily record to whom the advertisement has been sent. However, for technical or privacy reasons, it is difficult to keep track of those people until we observe the outcomes on whether they buy the product or not. Alternatively, we can easily obtain information about purchasers of the product at the moment when the purchases are actually made. However, we cannot know whether those who are buying the product have been exposed to the advertisement or not. Thus, every individual always has one missing label. 
We term such samples separately labeled samples.\nIn this paper, we consider a more practical uplift modeling setup where no jointly labeled samples\nare available, but only separately labeled samples are given. Theoretically, we \ufb01rst show that the\nindividual uplift is identi\ufb01able when we have two sets of separately labeled samples collected under\ndifferent treatment policies. We then propose a novel method that directly estimates the individual\nuplift only from separately labeled samples. Finally, we demonstrate the effectiveness of the proposed\nmethod through experiments.\n\n2 Problem Setting\n\nThis paper focuses on estimation of the individual uplift u(x), often called individual treatment\neffect (ITE) in the causal inference literature [31], de\ufb01ned as u(x) := E[Y1 | x] \u2212 E[Y\u22121 | x],\nwhere E[ \u00b7 | \u00b7 ] denotes the conditional expectation, and x is a X -valued random variable (X \u2286 Rd)\nrepresenting features of an individual, and Y1, Y\u22121 are Y-valued potential outcome variables [31]\n(Y \u2286 R) representing outcomes that would be observed if the individual was treated and not treated,\nrespectively. Note that only one of either Y1 or Y\u22121 can be observed for each individual. We denote\nthe {1,\u22121}-valued random variable of the treatment assignment by t, where t = 1 means that the\nindividual has been treated and t = \u22121 not treated. We refer to the population for which we want to\nevaluate u(x) as the test population, and denote the density of the test population by p(Y1, Y\u22121, x, t).\nWe assume that t is unconfounded with either of Y1 and Y\u22121 conditioned on x, i.e. p(Y1 | x, t) =\np(Y1 | x) and p(Y\u22121 | x, t) = p(Y\u22121 | x). Unconfoundedness is an assumption commonly made in\nobservational studies [5, 33]. For notational convenience, we denote by y := Yt the outcome of the\ntreatment assignment t. 
Furthermore, we refer to any conditional density of t given x as a treatment policy.\nIn addition to the test population, we suppose that there are two training populations k = 1, 2, whose joint probability densities p_k(Y_1, Y_{-1}, x, t) satisfy\n\np_k(Y_{t_0} = y_0 | x = x_0) = p(Y_{t_0} = y_0 | x = x_0),    (1)\np_1(t = t_0 | x = x_0) ≠ p_2(t = t_0 | x = x_0),    (2)\n\nfor all possible realizations x_0 ∈ X, t_0 ∈ {-1, 1}, and y_0 ∈ Y. Intuitively, Eq. (1) means that the potential outcomes depend on x in the same way as those in the test population, and Eq. (2) states that the two policies give the treatment with different probabilities for every x = x_0.\nWe suppose that the following four training data sets, which we call separately labeled samples, are given:\n\n{(x_i^{(k)}, y_i^{(k)})}_{i=1}^{n_k} i.i.d.~ p_k(x, y)    (for k = 1, 2),\n{(x̃_i^{(k)}, t_i^{(k)})}_{i=1}^{ñ_k} i.i.d.~ p_k(x, t)    (for k = 1, 2),\n\nwhere n_k and ñ_k, k = 1, 2, are positive integers. Under Assumptions (1), (2), and the unconfoundedness, we have p_k(Y_{t_0} | x, t = t_0) = p(Y_{t_0} | x, t = t_0) = p(Y_{t_0} | x) for t_0 ∈ {-1, 1} and k ∈ {1, 2}. Note that we can safely denote p(y | x, t) := p_k(y | x, t). Moreover, we have E[Y_{t_0} | x] = E[y | x, t = t_0] for t_0 = 1, -1, and thus our goal boils down to the estimation of\n\nu(x) = E[y | x, t = 1] - E[y | x, t = -1]    (3)\n\nfrom the separately labeled samples, where the conditional expectation is taken over p(y | x, t).\nEstimation of the individual uplift is important for the following reasons.\nIt enables the estimation of the average uplift. The average uplift U(π) of a treatment policy π(t | x) is the average outcome of π, subtracted by that of the policy π_-, which constantly assigns the treatment as t = -1, i.e., π_-(t = τ | x) := 1[τ = -1], where 1[·] denotes the indicator function:\n\nU(π) := ∫∫ Σ_{t=-1,1} y p(y | x, t) π(t | x) p(x) dy dx - ∫∫ Σ_{t=-1,1} y p(y | x, t) π_-(t | x) p(x) dy dx\n      = ∫ u(x) π(t = 1 | x) p(x) dx.    (4)\n\nThis quantity can be estimated from samples of x once we obtain an estimate of u(x).\nIt provides the optimal treatment policy. The treatment policy given by π(t = 1 | x) = 1[0 ≤ u(x)] is the optimal treatment policy that maximizes the average uplift U(π) and, equivalently, the average outcome ∫∫ Σ_{t=-1,1} y p(y | x, t) π(t | x) p(x) dy dx (see Eq. (4)) [32].\nIt is the optimal ranking scoring function. From a practical viewpoint, it may be useful to prioritize individuals to be treated according to some ranking scores, especially when the treatment is costly and only a limited number of individuals can be treated due to a budget constraint. In fact, u(x) serves as the optimal ranking score for this purpose [36]. 
More specifically, we define a family of treatment policies {π_{f,α}}_{α∈R} associated with a scoring function f by π_{f,α}(t = 1 | x) = 1[α ≤ f(x)]. Then, under some technical condition, f = u maximizes the area under the uplift curve (AUUC) defined as\n\nAUUC(f) := ∫_0^1 U(π_{f,α}) dC_α\n         = ∫_0^1 ∫ u(x) 1[α ≤ f(x)] p(x) dx dC_α\n         = E[1[f(x) ≤ f(x')] u(x')],\n\nwhere C_α := Pr[f(x) < α], x, x' i.i.d.~ p(x), and E denotes the expectation with respect to these variables. AUUC is a standard performance measure for uplift modeling methods [11, 25, 28, 32]. For more details, see Appendix B in the supplementary material.\nRemark on the problem setting: Uplift modeling is often referred to as individual treatment effect estimation or heterogeneous treatment effect estimation and has been extensively studied, especially in the causal inference literature [5, 7, 9, 12, 16, 24, 31, 37]. In particular, recent research has investigated the problem under the setting of observational studies, i.e., inference using data obtained from uncontrolled experiments, because of its practical importance [33]. Here, experiments are said to be uncontrolled when some of the treatment variables are not controlled to have designed values. Given that the treatment policies are unknown, our problem setting is also that of observational studies, but it poses an additional challenge that stems from missing labels. 
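As a concrete illustration, the last expression for AUUC(f) can be estimated from a finite test set by averaging over all ordered pairs of points. The sketch below assumes NumPy and that per-point uplift values (or unbiased estimates of them) are available on the test set; the function name `auuc` is ours, not the authors'.

```python
import numpy as np

def auuc(scores, uplifts):
    """Empirical AUUC: the mean over pairs (i, j) of 1[f(x_i) <= f(x_j)] * u(x_j).

    scores  : (n,) ranking scores f(x_i) assigned by the uplift model
    uplifts : (n,) individual uplifts u(x_i), or unbiased estimates of them
    """
    s = np.asarray(scores, dtype=float)
    u = np.asarray(uplifts, dtype=float)
    # indicator[i, j] = 1 if f(x_i) <= f(x_j): x_j is ranked at least as high as x_i
    indicator = (s[:, None] <= s[None, :]).astype(float)
    # average 1[f(x_i) <= f(x_j)] * u(x_j) over all n^2 ordered pairs
    return float(indicator.mean(axis=0) @ u / u.size)
```

A model that ranks high-uplift individuals above low-uplift ones receives a larger value, consistent with f = u being the maximizer.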
What makes our problem feasible is that we have two kinds of data sets following different treatment policies.\nIt is also important to note that our setting generalizes the standard setting for observational studies, since the former is reduced to the latter when one of the treatment policies always assigns individuals to the treatment group and the other to the control group.\nOur problem is also closely related to individual treatment effect estimation via instrumental variables [2, 6, 10, 19].^1\n\n^1Among the related papers mentioned above, the most relevant one is Lewis and Syrgkanis [19], which is concurrent work with ours.\n\n3 Naive Estimators\nA naive approach is first estimating the conditional densities p_k(y | x) and p_k(t | x) from the training samples by some conditional density estimator [4, 34], and then solving the following linear system for p(y | x, t = 1) and p(y | x, t = -1):\n\np_k(y | x) = Σ_{t=-1,1} p(y | x, t) p_k(t | x)    (for k = 1, 2),    (5)\n\nwhere p_k(y | x) can be estimated from {(x_i^{(k)}, y_i^{(k)})}_{i=1}^{n_k} and p_k(t | x) from {(x̃_i^{(k)}, t_i^{(k)})}_{i=1}^{ñ_k}. After that, the conditional expectations of y over p(y | x, t = 1) and p(y | x, t = -1) are calculated by numerical integration, and finally their difference is calculated to obtain an estimate of u(x). However, this may not yield a good estimate due to the difficulty of conditional density estimation and the instability of numerical integration. This issue may be alleviated by working on the following linear system implied by Eq. (5) instead: E_k[y | x] = Σ_{t=-1,1} E[y | x, t] p_k(t | x), k = 1, 2, where E_k[y | x] and p_k(t | x) can be estimated from our samples. Solving this new system for E[y | x, t = 1] and E[y | x, t = -1] and taking their difference gives an estimate of u(x). 
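At a fixed point x, this second naive approach amounts to solving a 2-by-2 linear system. A minimal sketch in Python (the function name and calling convention are ours; in practice `mu1`, `mu2` would come from regressions on the outcome-labeled samples and `e1`, `e2` from classifiers fit on the treatment-labeled samples):

```python
import numpy as np

def naive_uplift(mu1, mu2, e1, e2):
    """Naive estimate of u(x) at one point x by solving the 2x2 system
        E_k[y|x] = E[y|x,t=1] * p_k(t=1|x) + E[y|x,t=-1] * p_k(t=-1|x),  k = 1, 2.

    mu1, mu2 : estimates of E_1[y|x] and E_2[y|x]
    e1, e2   : estimates of p_1(t=1|x) and p_2(t=1|x)
    """
    A = np.array([[e1, 1.0 - e1],
                  [e2, 1.0 - e2]])
    m_plus, m_minus = np.linalg.solve(A, np.array([mu1, mu2]))
    return m_plus - m_minus  # E[y|x,t=1] - E[y|x,t=-1]
```

Note that the system matrix becomes singular as e1 approaches e2, which is exactly the instability the paper attributes to this post-processing step.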
A method called two-stage least-squares for instrumental variable regression takes such an approach [10].\nThis second approach of estimating E_k[y | x] and p_k(t | x) avoids both conditional density estimation and numerical integration, but it still involves post-processing consisting of solving the linear system and taking the difference, which is a potential cause of performance deterioration.\n\n4 Proposed Method\n\nIn this section, we develop a method that can overcome the aforementioned problems by directly estimating the individual uplift.\n\n4.1 Direct Least-Square Estimation of the Individual Uplift\n\nFirst, we show an important lemma that directly relates the marginal distributions of the separately labeled samples to the individual uplift u(x).\nLemma 1. For every x such that p_1(x) ≠ p_2(x), u(x) can be expressed as\n\nu(x) = 2 × (E_{y~p_1(y|x)}[y] - E_{y~p_2(y|x)}[y]) / (E_{t~p_1(t|x)}[t] - E_{t~p_2(t|x)}[t]).    (6)\n\nFor a proof, refer to Appendix C in the supplementary material.\nUsing Eq. (6), we can re-interpret the naive methods described in Section 3 as estimating the conditional expectations on the right-hand side by separately performing regression on {(x_i^{(1)}, y_i^{(1)})}_{i=1}^{n_1}, {(x_i^{(2)}, y_i^{(2)})}_{i=1}^{n_2}, {(x̃_i^{(1)}, t_i^{(1)})}_{i=1}^{ñ_1}, and {(x̃_i^{(2)}, t_i^{(2)})}_{i=1}^{ñ_2}. This approach may result in unreliable performance when the denominator is close to zero, i.e., p_1(t | x) ≈ p_2(t | x).\nLemma 1 can be simplified by introducing auxiliary variables z and w, which are Z-valued and {-1, 1}-valued random variables whose conditional probability density and mass are defined by\n\np(z = z_0 | x) = (1/2) p_1(y = z_0 | x) + (1/2) p_2(y = -z_0 | x),\np(w = w_0 | x) = (1/2) p_1(t = w_0 | x) + (1/2) p_2(t = -w_0 | x),\n\nfor any z_0 ∈ Z and any w_0 ∈ {-1, 1}, where Z := {s_0 y_0 | y_0 ∈ Y, s_0 ∈ {1, -1}}.\nLemma 2. For every x such that p_1(x) ≠ p_2(x), u(x) can be expressed as\n\nu(x) = 2 × E[z | x] / E[w | x],\n\nwhere E[z | x] and E[w | x] are the conditional expectations of z given x over p(z | x) and of w given x over p(w | x), respectively.\nA proof can be found in Appendix D in the supplementary material.\nLet w_i^{(k)} := (-1)^{k-1} t_i^{(k)} and z_i^{(k)} := (-1)^{k-1} y_i^{(k)}. Assuming that p_1(x) = p_2(x) =: p(x), n_1 = n_2, and ñ_1 = ñ_2 for simplicity, {(x_i, z_i)}_{i=1}^n := {(x_i^{(k)}, z_i^{(k)})}_{k=1,2; i=1,...,n_k} and {(x̃_i, w_i)}_{i=1}^{ñ} := {(x̃_i^{(k)}, w_i^{(k)})}_{k=1,2; i=1,...,ñ_k} can be seen as samples drawn from p(x, z) := p(z | x)p(x) and p(x, w) := p(w | x)p(x), respectively, where n = n_1 + n_2 and ñ = ñ_1 + ñ_2. The more general cases where p_1(x) ≠ p_2(x), n_1 ≠ n_2, or ñ_1 ≠ ñ_2 are discussed in Appendix I in the supplementary material.\nTheorem 1. Assume that µ_w, µ_z ∈ L_2(p) and µ_w(x) ≠ 0 for every x such that p(x) > 0, where L_2(p) := {f : X → R | E_{x~p(x)}[f(x)^2] < ∞}. The individual uplift u(x) equals the solution to the following least-squares problem:\n\nu(x) = argmin_{f∈L_2(p)} E[(µ_w(x)f(x) - 2µ_z(x))^2],    (7)\n\nwhere E denotes the expectation over p(x), µ_w(x) := E[w | x], and µ_z(x) := E[z | x].\nTheorem 1 follows from Lemma 2. Note that Eq. (2), i.e., p_1(t | x) ≠ p_2(t | x), implies µ_w(x) ≠ 0.\nIn what follows, we develop a method that directly estimates u(x) by solving Eq. (7). A challenge here is that it is not straightforward to evaluate the objective functional since it involves the unknown functions µ_w and µ_z.\n\n4.2 Disentanglement of z and w\n\nOur idea is to transform the objective functional in Eq. (7) into another form in which µ_w(x) and µ_z(x) appear separately and linearly inside the expectation operator so that we can approximate them using our separately labeled samples.\nFor any function g ∈ L_2(p) and any x ∈ X, expanding the left-hand side of the inequality E[(µ_w(x)f(x) - 2µ_z(x) - g(x))^2] ≥ 0, we have\n\nE[(µ_w(x)f(x) - 2µ_z(x))^2] ≥ 2E[(µ_w(x)f(x) - 2µ_z(x))g(x)] - E[g(x)^2] =: J(f, g).    (8)\n\nThe equality is attained when g(x) = µ_w(x)f(x) - 2µ_z(x) for any fixed f. This means that the objective functional of Eq. (7) can be calculated by maximizing J(f, g) with respect to g. Hence,\n\nu(x) = argmin_{f∈L_2(p)} max_{g∈L_2(p)} J(f, g).    (9)\n\nFurthermore, µ_w and µ_z are included separately and linearly in J(f, g), which makes it possible to write it in terms of z and w as\n\nJ(f, g) = 2E[w f(x) g(x)] - 4E[z g(x)] - E[g(x)^2].    (10)\n\nUnlike the original objective functional in Eq. (7), J(f, g) can be easily estimated using sample averages by\n\nĴ(f, g) = (2/ñ) Σ_{i=1}^{ñ} w_i f(x̃_i) g(x̃_i) - (4/n) Σ_{i=1}^{n} z_i g(x_i) - (1/(2n)) Σ_{i=1}^{n} g(x_i)^2 - (1/(2ñ)) Σ_{i=1}^{ñ} g(x̃_i)^2.    (11)\n\nIn practice, we solve the following regularized empirical optimization problem:\n\nmin_{f∈F} max_{g∈G} Ĵ(f, g) + Ω(f, g),    (12)\n\nwhere F and G are models for f and g, respectively, and Ω(f, g) is some regularizer.\nAn advantage of the proposed framework is that it is model-independent: any models can be trained by optimizing the above objective.\nThe function g can be interpreted as a critic of f as follows. Minimizing Eq. (10) with respect to f is equivalent to minimizing E[g(x){µ_w(x)f(x) - 2µ_z(x)}]. 
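For concreteness, the sample-average objective Ĵ(f, g) of Eq. (11) can be computed as follows (a sketch assuming NumPy; the name `j_hat` and the callable interface for f and g are our choices, not the authors'):

```python
import numpy as np

def j_hat(f, g, x, z, x_tilde, w):
    """Empirical objective J-hat(f, g) of Eq. (11).

    f, g          : callables mapping an (m, d) feature array to an (m,) array
    (x, z)        : n outcome-labeled samples drawn from p(x, z)
    (x_tilde, w)  : n-tilde treatment-labeled samples drawn from p(x, w)
    """
    gx, gxt = g(x), g(x_tilde)
    return (2.0 * np.mean(w * f(x_tilde) * gxt)   # (2/n~) sum w_i f(x~_i) g(x~_i)
            - 4.0 * np.mean(z * gx)               # (4/n)  sum z_i g(x_i)
            - 0.5 * np.mean(gx ** 2)              # (1/2n) sum g(x_i)^2
            - 0.5 * np.mean(gxt ** 2))            # (1/2n~) sum g(x~_i)^2
```

Any differentiable models for f and g (e.g., neural networks) could be plugged into this objective and trained by alternating gradient steps on the min-max problem of Eq. (12).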
g(x) serves as a good critic of f(x) when it makes the cost g(x){µ_w(x)f(x) - 2µ_z(x)} larger at points x where f makes a larger error |µ_w(x)f(x) - 2µ_z(x)|. In particular, for any f, the objective above is maximized by g(x) = µ_w(x)f(x) - 2µ_z(x), and the maximum coincides with the least-squares objective in Eq. (7).\nSuppose that F and G are linear-in-parameter models: F = {f_α : x ↦ α⊤φ(x) | α ∈ R^{b_f}} and G = {g_β : x ↦ β⊤ψ(x) | β ∈ R^{b_g}}, where φ and ψ are b_f-dimensional and b_g-dimensional vectors of basis functions in L_2(p). Then, Ĵ(f_α, g_β) = 2α⊤Aβ - 4b⊤β - β⊤Cβ, where\n\nA := (1/ñ) Σ_{i=1}^{ñ} w_i φ(x̃_i) ψ(x̃_i)⊤,\nb := (1/n) Σ_{i=1}^{n} z_i ψ(x_i),\nC := (1/(2n)) Σ_{i=1}^{n} ψ(x_i) ψ(x_i)⊤ + (1/(2ñ)) Σ_{i=1}^{ñ} ψ(x̃_i) ψ(x̃_i)⊤.\n\nUsing ℓ_2-regularizers, Ω(f, g) = λ_f α⊤α - λ_g β⊤β with some positive constants λ_f and λ_g, the solution to the inner maximization problem can be obtained in the following analytical form:\n\nβ̂_α := argmax_β [Ĵ(f_α, g_β) + Ω(f_α, g_β)] = C̃^{-1}(A⊤α - 2b),\n\nwhere C̃ := C + λ_g I_{b_g} and I_{b_g} is the b_g-by-b_g identity matrix. Then, we can obtain the solution to Eq. (12) analytically as\n\nα̂ := argmin_α [Ĵ(f_α, g_{β̂_α}) + λ_f α⊤α] = 2(A C̃^{-1} A⊤ + λ_f I_{b_f})^{-1} A C̃^{-1} b.\n\nFinally, from Eq. (7), our estimate of u(x) is given as α̂⊤φ(x).\nRemark on model selection: Model selection for F and G is not straightforward since the test performance measure cannot be directly evaluated with (held-out) training data in our problem. Instead, we may evaluate the value of Ĵ(f̂, ĝ), where (f̂, ĝ) ∈ F × G is the optimal solution pair to min_{f∈F} max_{g∈G} Ĵ(f, g). However, it is still nontrivial to tell whether the objective value is small because the solution is good in terms of the outer minimization, or because it is poor in terms of the inner maximization. We leave this issue for future work.\n\n5 Theoretical Analysis\n\nA theoretically appealing property of the proposed method is that its objective consists of simple sample averages. This enables us to establish a generalization error bound in terms of the Rademacher complexity [15, 22].\nDenote ε_G(f) := sup_{g∈L_2(p)} J(f, g) - sup_{g∈G} J(f, g). Also, let R^N_q(H) denote the Rademacher complexity of a set of functions H over N random variables following the probability density q (refer to Appendix E for the definition). Proofs of the following theorems and corollary can be found in Appendix E, Appendix F, and Appendix G in the supplementary material.\nTheorem 2. Assume that n_1 = n_2, ñ_1 = ñ_2, p_1(x) = p_2(x), W := inf_{x∈X} |µ_w(x)| > 0, M_Z := sup_{z∈Z} |z| < ∞, M_F := sup_{f∈F, x∈X} |f(x)| < ∞, and M_G := sup_{g∈G, x∈X} |g(x)| < ∞. Then, the following holds with probability at least 1 - δ for every f ∈ F:\n\nE_{x~p(x)}[(f(x) - u(x))^2] ≤ (1/W^2) [ sup_{g∈G} Ĵ(f, g) + R^{n,ñ}_{F,G} + (M_z/√(2n) + M_w/√(2ñ)) √(log(2/δ)) + ε_G(f) ],\n\nwhere M_z := 4M_Z M_G + M_G^2/2, M_w := 2M_F M_G + M_G^2/2, and R^{n,ñ}_{F,G} := 2(M_F + 4M_Z) R^n_{p(x,z)}(G) + 2(2M_F + M_G) R^ñ_{p(x,w)}(F) + 2(M_F + M_G) R^ñ_{p(x,w)}(G).\nIn particular, the following bound holds for linear-in-parameter models.\nCorollary 1. Let F = {x ↦ α⊤φ(x) | ||α||_2 ≤ Λ_F} and G = {x ↦ β⊤ψ(x) | ||β||_2 ≤ Λ_G}. Assume that r_F := sup_{x∈X} ||φ(x)||_2 < ∞ and r_G := sup_{x∈X} ||ψ(x)||_2 < ∞, where ||·||_2 is the L_2-norm. Under the assumptions of Theorem 2, it holds with probability at least 1 - δ that for every f ∈ F,\n\nE_{x~p(x)}[(f(x) - u(x))^2] ≤ (1/W^2) [ sup_{g∈G} Ĵ(f, g) + (C_z √(log(2/δ)) + D_z)/√(2n) + (C_w √(log(2/δ)) + D_w)/√(2ñ) + ε_G(f) ],\n\nwhere C_z := r_G^2 Λ_G^2 + 4 r_G Λ_G M_Z, C_w := 2 r_F^2 Λ_F^2 + 2 r_F r_G Λ_F Λ_G + r_G^2 Λ_G^2, D_z := r_G^2 Λ_G^2/2 + 4 r_G Λ_G M_Z, and D_w := r_G^2 Λ_G^2/2 + 4 r_F r_G Λ_F Λ_G.\nTheorem 2 and Corollary 1 imply that minimizing sup_{g∈G} Ĵ(f, g), as the proposed method does, amounts to minimizing an upper bound of the mean squared error. 
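The analytic solution for the linear-in-parameter models of Section 4.2 can be sketched as follows (assuming NumPy; the function name `fit_uplift` and the basis-function interface are ours, and the regularization constants are illustrative defaults):

```python
import numpy as np

def fit_uplift(phi, psi, x, z, x_tilde, w, lam_f=1e-3, lam_g=1e-3):
    """Closed-form min-max solution for f(x) = alpha^T phi(x), g(x) = beta^T psi(x).

    phi, psi map an (m, d) feature array to an (m, b) design matrix.
    Returns alpha_hat; the uplift estimate at new points X is phi(X) @ alpha_hat.
    """
    w = np.asarray(w, dtype=float)
    z = np.asarray(z, dtype=float)
    Phi_t, Psi, Psi_t = phi(x_tilde), psi(x), psi(x_tilde)
    n, n_t = len(x), len(x_tilde)
    # A = (1/n~) sum w_i phi(x~_i) psi(x~_i)^T,  b = (1/n) sum z_i psi(x_i)
    A = (Phi_t * w[:, None]).T @ Psi_t / n_t
    b = Psi.T @ z / n
    # C = (1/2n) sum psi psi^T + (1/2n~) sum psi~ psi~^T, ridge-regularized
    C = Psi.T @ Psi / (2 * n) + Psi_t.T @ Psi_t / (2 * n_t)
    C_reg = C + lam_g * np.eye(C.shape[0])
    ACinv = A @ np.linalg.inv(C_reg)
    # alpha_hat = 2 (A C~^{-1} A^T + lam_f I)^{-1} A C~^{-1} b
    alpha = 2 * np.linalg.solve(ACinv @ A.T + lam_f * np.eye(A.shape[0]), ACinv @ b)
    return alpha
```

With constant bases phi(x) = psi(x) = 1 and negligible regularization, this reduces to 2 · mean(z) / mean(w), matching Lemma 2 in the population limit.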
In fact, for linear-in-parameter models, it can be shown that the mean squared error of the proposed estimator is upper bounded by O(1/√n + 1/√ñ) plus some model misspecification error with high probability, as follows.\nTheorem 3 (Informal). Let f̂ ∈ F be any approximate solution to inf_{f∈F} sup_{g∈G} Ĵ(f, g) with sufficient precision. Under the assumptions of Corollary 1, it holds with probability at least 1 - δ that\n\nE_{x~p(x)}[(f̂(x) - u(x))^2] ≤ O((1/√n + 1/√ñ) √(log(1/δ))) + (2ε^F_G + ε_F)/W^2,    (13)\n\nwhere ε^F_G := sup_{f∈F} ε_G(f) and ε_F := inf_{f∈F} sup_{g∈L_2(p)} J(f, g).\nA more formal version of Theorem 3 can be found in Appendix G.\n\n6 More General Loss Functions\n\nOur framework can be extended to more general loss functions:\n\ninf_{f∈L_2(p)} E[ℓ(µ_w(x)f(x), 2µ_z(x))],    (14)\n\nwhere ℓ : R × R → R is a loss function that is lower semi-continuous and convex with respect to both the first and the second arguments. Here, a function φ : R → R is lower semi-continuous if lim inf_{y→y_0} φ(y) = φ(y_0) for every y_0 ∈ R [30].^2 As with the squared loss, a major difficulty in solving this optimization problem is that the operand of the expectation depends nonlinearly on both µ_w(x) and µ_z(x) at the same time. 
Below, we show a way to transform the objective functional into a form that can be easily approximated using separately labeled samples.\nFrom the assumptions on ℓ, we have ℓ(y, y') = sup_{z∈R} [yz - ℓ*(z, y')], where ℓ*(·, y') is the convex conjugate of the function y ↦ ℓ(y, y'), defined for any y' ∈ R by z ↦ ℓ*(z, y') := sup_{y∈R} [yz - ℓ(y, y')] (see Rockafellar [30]). Hence,\n\nE[ℓ(µ_w(x)f(x), 2µ_z(x))] = sup_{g∈L_2(p)} E[µ_w(x)f(x)g(x) - ℓ*(g(x), 2µ_z(x))].\n\nSimilarly, we obtain E[ℓ*(g(x), 2µ_z(x))] = sup_{h∈L_2(p)} 2E[µ_z(x)h(x)] - E[ℓ**(g(x), h(x))], where ℓ**(y, ·) is the convex conjugate of the function y' ↦ ℓ*(y, y'), defined for any y, z' ∈ R by ℓ**(y, z') := sup_{y'∈R} [y'z' - ℓ*(y, y')]. Thus, Eq. (14) can be rewritten as\n\ninf_{f∈L_2(p)} sup_{g∈L_2(p)} inf_{h∈L_2(p)} K(f, g, h),\n\nwhere K(f, g, h) := E[µ_w(x)f(x)g(x)] - 2E[µ_z(x)h(x)] + E[ℓ**(g(x), h(x))]. Since µ_w and µ_z appear separately and linearly, K(f, g, h) can be approximated by sample averages using separately labeled samples.\n\n7 Experiments\n\nIn this section, we test the proposed method and compare it with baselines.\n\n7.1 Data Sets\n\nWe use the following data sets in our experiments.\nSynthetic data: Features x are drawn from the two-dimensional Gaussian distribution with mean zero and covariance 10I_2. We set p(y | x, t) to the following logistic models: p(y | x, t) = 1/(1 + exp(-y a_t⊤x)), where a_{-1} = (10, 10)⊤ and a_1 = (10, -10)⊤. 
We also use logistic models for p_k(t | x): p_1(t | x) = 1/(1 + exp(-t x_2)) and p_2(t | x) = 1/(1 + exp(-t{x_2 + b})), where x_2 denotes the second coordinate of x and b is varied over 25 equally spaced points in [0, 10]. We investigate how the performance changes as the difference between p_1(t | x) and p_2(t | x) varies.\nEmail data: This data set consists of data collected in an email advertisement campaign for encouraging customers to visit the website of a store [8, 27]. Outcomes are whether customers visited the website or not. We use 4 × 5000 and 2000 randomly sub-sampled data points for training and evaluation, respectively.\n\n^2 lim inf_{y→y_0} φ(y) := lim_{δ↘0} inf_{|y-y_0|≤δ} φ(y).\n\nJobs data: This data set consists of randomized experimental data obtained from a job training program called the National Supported Work Demonstration [17], available at http://users.nber.org/~rdehejia/data/nswdata2.html. There are 9 features, and outcomes are income levels after the training program. The sample sizes are 297 for the treatment group and 425 for the control group. We use 4 × 50 randomly sub-sampled data points for training and 100 for evaluation.\nCriteo data: This data set consists of banner advertisement log data collected by Criteo [18], available at http://www.cs.cornell.edu/~adith/Criteo/. The task is to select a product to be displayed in a given banner so that the click rate will be maximized. We only use records for banners with a single advertisement slot. Each display banner has 10 features, and each product has 35 features. We take the 12th feature of a product as the treatment variable merely because it is a well-balanced binary variable. The outcome is whether the displayed advertisement was clicked. We treat the data set as the population, although it is biased from the actual population since non-clicked impressions were randomly sub-sampled down to 10% to reduce the data set size. 
We made two subsets with different treatment policies by appropriately sub-sampling according to predefined treatment policies (see Appendix L in the supplementary material). We set p_k(t | x) as p_1(t | x) = 1/(1 + exp(-t 1⊤x)) and p_2(t | x) = 1/(1 + exp(t 1⊤x)), where 1 := (1, ..., 1)⊤.\n\n7.2 Experimental Settings\n\nWe conduct experiments under the following settings.\nMethods compared: We compare the proposed method with baselines that separately estimate the four conditional expectations in Eq. (6). In the case of binary outcomes, we use a logistic-regression-based method (denoted by FourLogistic) and a neural-network-based method trained with the soft-max cross-entropy loss (denoted by FourNNC). In the case of real-valued outcomes, we use a ridge-regression-based method (denoted by FourRidge) and a neural-network-based method trained with the squared loss (denoted by FourNNR). The neural networks are fully connected ones with two hidden layers, each with 10 hidden units. For the proposed method, we use linear-in-parameter models with Gaussian basis functions centered at randomly sub-sampled training data points (see Appendix K for more details).\nPerformance evaluation: We evaluate trained uplift models by the area under the uplift curve (AUUC) estimated on test samples with joint labels, as well as by uplift curves [26]. The uplift curve of an estimated individual uplift is the trajectory of the average uplift as individuals are gradually moved from the control group to the treated group in descending order of the ranking given by the estimated individual uplift. These quantities can be estimated when the data come from randomized experiments. The Criteo data are not randomized experimental data, unlike the other data sets, but accurately logged propensity scores are available for them. In this case, uplift curves and AUUCs can be estimated using inverse propensity scoring [3, 20]. 
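A minimal sketch of the inverse-propensity-scoring idea mentioned above (assuming NumPy; the function name is ours, and this is a generic illustration rather than the authors' exact evaluation code):

```python
import numpy as np

def ips_uplift_terms(y, t, e):
    """Per-sample inverse-propensity-scored uplift terms.

    For t_i = 1 the term y_i / e_i is an unbiased estimate of E[y|x,t=1];
    for t_i = -1 the term -y_i / (1 - e_i) estimates -E[y|x,t=-1].
    Averaging the terms over any subgroup estimates that subgroup's uplift,
    which is what an uplift curve plots cumulatively over a ranking.

    e : logged propensity scores p(t=1|x_i).
    """
    y, t, e = map(np.asarray, (y, t, e))
    return np.where(t == 1, y / e, -y / (1.0 - e))
```

Sorting these terms by a model's ranking scores and accumulating their mean traces out an IPS estimate of the uplift curve; integrating it gives the AUUC.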
We conduct 50 trials of each experiment with different random seeds.

7.3 Results

The results on the synthetic data are summarized in Figure 1. From the plots, we can see that all methods perform relatively well in terms of AUUC when the policies are distant from each other (i.e., b is large). However, the performance of the baseline methods immediately declines as the treatment policies get closer to each other (i.e., b becomes small).3 In contrast, the proposed method maintains its performance until b reaches a point around 2. Note that the two policies would be identical when b = 0, which makes it impossible for any method to identify the individual uplift from their samples, since the system in Eq. (5) degenerates. Figure 2 highlights the performance of each method in terms of the squared error. For FourNNC, test points with a small policy difference |p1(t = 1 | x) − p2(t = 1 | x)| (colored darker) tend to have very large estimation errors. On the other hand, the proposed method has relatively small errors even for such points. Figure 3 shows results on the real data sets. The proposed method and the baseline with logistic regressors both performed better than the baseline with neural networks on the Email data set (Figure 3a).

3 The instability of the performance of FourLogistic can be explained as follows. FourLogistic uses linear models, whose expressive power is limited. The resulting estimator has small variance but potentially large bias. Since different b induces different u(x), the bias depends on b. For this reason, the method works well for some b but poorly for others.

Figure 1: Results on the synthetic data. The plot shows the average AUUCs obtained by the proposed method and the baseline methods for different b.
p1(t | x) and p2(t | x) are closer to each other when b is smaller.

(a) Baseline (FourLogistic). (b) Baseline (FourNNC). (c) Proposed (MinMaxGau).

Figure 2: The plots show the squared errors of the estimated individual uplifts on the synthetic data with b = 1. Each point is darker-colored when |p1(t = 1 | x) − p2(t = 1 | x)| is smaller, and lighter-colored otherwise.

(a) The Email data. (b) The Jobs data. (c) The Criteo data.

Figure 3: Average uplifts as well as their standard errors on real-world data sets.

On the Jobs data set, the proposed method again performed better than the baseline methods with neural networks. For the Criteo data set, the proposed method outperformed the baseline methods (Figure 3c). Overall, we confirmed the superiority of the proposed method on both synthetic and real data sets.

8 Conclusion

We proposed a theoretically guaranteed and practically useful method for uplift modeling, or individual treatment effect estimation, in the presence of systematically missing labels. The proposed method showed promising results in our experiments on synthetic and real data sets. The proposed framework is model-independent: any model can be used to approximate the individual uplift, including models tailored to specific problems and complex models such as neural networks. On the other hand, model selection may be a challenging problem due to the min-max structure.
Addressing this issue would be an important research direction for further expanding the applicability and improving the performance of the proposed method.

Acknowledgments

We are grateful to Marthinus Christoffel du Plessis and Takeshi Teshima for their inspiring suggestions and for the meaningful discussions. We would like to thank the anonymous reviewers for their helpful comments. IY was supported by JSPS KAKENHI 16J07970. JA and FY would like to thank Adway for its support. MS was supported by the International Research Center for Neurointelligence (WPI-IRCN) at The University of Tokyo Institutes for Advanced Study.

References

[1] E. Abrahams and M. Silver. The Case for Personalized Medicine. Journal of Diabetes Science and Technology, 3(4):680–684, July 2009.

[2] S. Athey, J. Tibshirani, and S. Wager. Generalized Random Forests. arXiv:1610.01271 [econ, stat], October 2016.

[3] P. C. Austin. An introduction to propensity score methods for reducing the effects of confounding in observational studies. Multivariate Behavioral Research, 46(3):399–424, 2011.

[4] C. M. Bishop. Pattern Recognition and Machine Learning (Information Science and Statistics). Springer-Verlag New York, Inc., 2006.

[5] P. Gutierrez and J. Y. Gérardy.
Causal inference and uplift modelling: A review of the literature. In\nProceedings of The 3rd International Conference on Predictive Applications and APIs, volume 67 of\nProceedings of Machine Learning Research, pages 1\u201313. PMLR, 11\u201312 Oct 2017.\n\n[6] J. Hartford, G. Lewis, K. Leyton-Brown, and M. Taddy. Deep IV: A Flexible Approach for Counterfactual\n\nPrediction. In International Conference on Machine Learning, pages 1414\u20131423, July 2017.\n\n[7] J. L. Hill. Bayesian Nonparametric Modeling for Causal Inference. Journal of Computational and\n\nGraphical Statistics, 20(1):217\u2013240, January 2011.\n\n[8] K. Hillstrom. The minethatdata e-mail analytics and data mining challenge, 2008. https://blog.\n\nminethatdata.com/2008/03/minethatdata-e-mail-analytics-and-data.html.\n\n[9] K. Imai and M. Ratkovic. Estimating treatment effect heterogeneity in randomized program evaluation.\n\nThe Annals of Applied Statistics, 7(1):443\u2013470, March 2013.\n\n[10] G. W. Imbens. Instrumental Variables: An Econometrician\u2019s Perspective. Statistical Science, 29(3):\n\n323\u2013358, August 2014.\n\n[11] M. Jaskowski and S. Jaroszewicz. Uplift modeling for clinical trial data. In ICML Workshop on Clinical\n\nData Analysis, 2012.\n\n[12] F. D. Johansson, U. Shalit, and D. Sontag. Learning Representations for Counterfactual Inference. In\nProceedings of the 33rd International Conference on Machine Learning, volume 48 of Proceedings of\nMachine Learning Research, pages 3020\u20133029. JMLR, 2016.\n\n[13] S. H. Katsanis, G. Javitt, and K. Hudson. A Case Study of Personalized Medicine. Science, 320(5872):\n\n53\u201354, April 2008.\n\n[14] R. Kohavi, R. Longbotham, D. Sommer\ufb01eld, and R. M. Henne. Controlled experiments on the web: survey\n\nand practical guide. Data Mining and Knowledge Discovery, 18(1):140\u2013181, February 2009.\n\n[15] V. Koltchinskii. Rademacher penalties and structural risk minimization. 
IEEE Transactions on Information Theory, 47(5):1902–1914, July 2001.

[16] S. R. Künzel, J. S. Sekhon, P. J. Bickel, and B. Yu. Meta-learners for Estimating Heterogeneous Treatment Effects using Machine Learning. arXiv:1706.03461 [math, stat], June 2017.

[17] R. J. LaLonde. Evaluating the econometric evaluations of training programs with experimental data. The American Economic Review, pages 604–620, 1986.

[18] D. Lefortier, A. Swaminathan, X. Gu, T. Joachims, and M. de Rijke. Large-scale validation of counterfactual learning methods: A test-bed. In NIPS Workshop on "Inference and Learning of Hypothetical and Counterfactual Interventions in Complex Systems", 2016.

[19] G. Lewis and V. Syrgkanis. Adversarial Generalized Method of Moments. arXiv:1803.07164 [cs, econ, math, stat], March 2018.

[20] L. Li, W. Chu, J. Langford, and X. Wang. An unbiased offline evaluation of contextual bandit algorithms with generalized linear models. In Proceedings of the Workshop on On-line Trading of Exploration and Exploitation 2, volume 26 of Proceedings of Machine Learning Research, pages 19–36. PMLR, 02 Jul 2012.

[21] S. Liu, A. Takeda, T. Suzuki, and K. Fukumizu. Trimmed density ratio estimation. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 4518–4528. Curran Associates, Inc., 2017.

[22] M. Mohri, A. Rostamizadeh, and A. Talwalkar. Foundations of Machine Learning. MIT Press, 2012.

[23] X. Nguyen, M. J. Wainwright, and M. I. Jordan. Estimating divergence functionals and the likelihood ratio by convex risk minimization. IEEE Transactions on Information Theory, 56(11):5847–5861, Nov 2010.

[24] J. Pearl. Causality. Cambridge University Press, 2009.

[25] N. Radcliffe.
Using control groups to target on predicted lift: Building and assessing uplift model. Direct\n\nMarketing Analytics Journal, pages 14\u201321, 2007.\n\n[26] N. Radcliffe and P. Surry. Differential Response Analysis: Modeling True Responses by Isolating the\n\nEffect of a Single Action. Credit Scoring and Credit Control IV, 1999.\n\n[27] N. J. Radcliffe. Hillstrom\u2019s minethatdata email analytics challenge: An approach using uplift modelling.\n\nTechnical report, Stochastic Solutions Limited, 2008.\n\n[28] N. J. Radcliffe and P. D. Surry. Real-world uplift modelling with signi\ufb01cance-based uplift trees. White\n\nPaper TR-2011-1, Stochastic Solutions, 2011.\n\n[29] R. Renault. Chapter 4 - Advertising in Markets. In Simon P. Anderson, Joel Waldfogel, and David\nStr\u00f6mberg, editors, Handbook of Media Economics, volume 1 of Handbook of Media Economics, pages\n121\u2013204. North-Holland, January 2015.\n\n[30] R. T. Rockafellar. Convex Analysis. Princeton University Press, 1970.\n\n[31] D. B. Rubin. Causal inference using potential outcomes: Design, modeling, decisions. Journal of the\n\nAmerican Statistical Association, 100(469):322\u2013331, 2005.\n\n[32] P. Rzepakowski and S. Jaroszewicz. Decision trees for uplift modeling with single and multiple treatments.\n\nKnowledge and Information Systems, 32(2):303\u2013327, 2012.\n\n[33] U. Shalit, F. D. Johansson, and D. Sontag. Estimating individual treatment effect: generalization bounds\nand algorithms. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International\nConference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages\n3076\u20133085. PMLR, 06\u201311 Aug 2017.\n\n[34] M. Sugiyama, I. Takeuchi, T. Suzuki, T. Kanamori, H. Hachiya, and D. Okanohara. Conditional density\nestimation via least-squares density ratio estimation. In International Conference on Arti\ufb01cial Intelligence\nand Statistics, pages 781\u2013788, 2010.\n\n[35] M. Sugiyama, T. 
Suzuki, and T. Kanamori. Density Ratio Estimation in Machine Learning. Cambridge University Press, 2012.

[36] S. Tufféry. Data Mining and Statistics for Decision Making, volume 2. Wiley, Chichester, 2011.

[37] S. Wager and S. Athey. Estimation and Inference of Heterogeneous Treatment Effects using Random Forests. arXiv:1510.04342 [math, stat], October 2015.

[38] M. Yamada, T. Suzuki, T. Kanamori, H. Hachiya, and M. Sugiyama. Relative density-ratio estimation for robust distribution comparison. Neural Computation, 25(5):1324–1370, 2013.