{"title": "Using Embeddings to Correct for Unobserved Confounding in Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 13792, "page_last": 13802, "abstract": "We consider causal inference in the presence of unobserved confounding. We study the case where a proxy is available for the unobserved confounding in the form of a network connecting the units. For example, the link structure of a social network carries information about its members. We show how to effectively use the proxy to do causal inference. The main idea is to reduce the causal estimation problem to a semi-supervised prediction of both the treatments and outcomes. Networks admit high-quality embedding models that can be used for this semi-supervised prediction. We show that the method yields valid inferences under suitable (weak) conditions on the quality of the predictive model. We validate the method with experiments on a semi-synthetic social network dataset.", "full_text": "Using Embeddings to Correct for Unobserved\n\nConfounding in Networks\n\nVictor Veitch1, Yixin Wang1, and David M. Blei1,2\n\n1Department of Statistics, Columbia University\n\n2Department of Computer Science, Columbia University\n\nAbstract\n\nWe consider causal inference in the presence of unobserved confounding. We\nstudy the case where a proxy is available for the unobserved confounding in the\nform of a network connecting the units. For example, the link structure of a social\nnetwork carries information about its members. We show how to effectively use\nthe proxy to do causal inference. The main idea is to reduce the causal estimation\nproblem to a semi-supervised prediction of both the treatments and outcomes.\nNetworks admit high-quality embedding models that can be used for this semi-\nsupervised prediction. We show that the method yields valid inferences under\nsuitable (weak) conditions on the quality of the predictive model. 
We validate the method with experiments on a semi-synthetic social network dataset. Code at github.com/vveitch/causal-network-embeddings.

1 Introduction

We consider causal inference in the presence of unobserved confounding, i.e., where unobserved variables may affect both the treatment and the outcome. We study the case where there is an observed proxy for the unobserved confounders, but (i) the proxy has non-iid structure, and (ii) a well-specified generative model for the data is not available.

Example 1.1. We want to infer the efficacy of a drug based on observed outcomes of people who are connected in a social network. Each unit i is a person. The treatment variable t_i indicates whether they took the drug, a response variable y_i indicates their health outcome, and latent confounders z_i might affect the treatment or response. For example, z_i might be unobserved age or sex. We would like to compute the average treatment effect, controlling for these confounders. We assume the social network itself is associated with z, e.g., similar people are more likely to be friends. This means that the network itself may implicitly contain confounding information that is not explicitly collected.

In this example, inference of the causal effect would be straightforward if the confounder z were available. So, intuitively, we would like to infer substitutes for the latent z_i from the underlying social network structure. Once inferred, these estimates ẑ_i could be used as a substitute for z_i and we could estimate the causal effect [SM16].

For this strategy to work, however, we need a well-specified generative model (i.e., joint probability distribution) for z and the full network structure. But typically no such model is available. 
For example, generative models of networks with latent unit structure, such as stochastic block models [WW87; Air+08] or latent space models [Hof+02], miss properties of real-world networks [Dur06; New09; OR15]. Causal estimates based on substitutes inferred from misspecified models are inherently suspect.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Embedding methods offer an alternative to fully specified generative models. Informally, an embedding method assigns a real-valued embedding vector λ̂_i to each unit, with the aim that conditioning on the embedding should decouple the properties of the unit and the network structure. For example, λ̂_i might be chosen to explain the local network structure of user i.

The embeddings are learned by minimizing an objective function over the network, with no requirement that this objective correspond to any generative model. For pure predictive tasks, e.g., classification of vertices in a graph, embedding-based approaches are state of the art for many real-world datasets [e.g., Per+14; Cha+17; Ham+17; Vei+19a]. This suggests that network embeddings might be usefully adapted to the inference of causal effects.

The method we develop here stems from the following insight. Even if we knew the confounders {z_i} we would not actually use all the information they contain to infer the causal effect. Instead, if we use estimator ψ̂_n to estimate the effect ψ, then we only require the part of z_i that is actually used by the estimator ψ̂_n. 
For example, if ψ̂_n is an inverse probability weighted estimator [CH08] then we require only estimates of the propensity scores P(T_i = 1 | z_i) for each unit.

What this means is that if we can build a good predictive model for the treatment then we can plug the outputs into a causal effect estimate directly, without any need to learn the true z_i. The same idea applies generally by using a predictive model for both the treatment and outcome. Reducing the causal inference problem to a predictive problem is the crux of this paper. It allows us to replace the assumption of a well-specified model with the more palatable assumption that the black-box embedding method produces a strong predictor.

The contributions of this paper are:

• a procedure for estimating treatment effects using network embeddings;
• an extension of robust estimation results to (non-iid) network data, showing the method yields valid estimates under weak conditions;
• and, an empirical study of the method on social network data.

2 Related Work

Our results connect to a number of different areas.

Causal Inference in Networks. Causal inference in networks has attracted significant attention [e.g., SM16; Tch+17; Ogb+17; OV17; Ogb18]. Much of this work is aimed at inferring the causal effects of treatments applied using the network, e.g., social influence or contagion. A major challenge in this area is that homophily (the tendency of similar people to cluster in a network) is generally confounded with contagion (the influence people have on their neighbors) [ST11]. In this paper, we assume that each person's treatment and outcome are independent of the network once we know that person's latent attributes; i.e., we assume pure homophily. This is a reasonable assumption in some situations, but certainly not all. 
Our major motivation is simply that pure homophily is the simplest case, and is thus the natural proving ground for the use of black-box methods in causal network problems. It is an important future direction to extend the results developed here to the contagion case.

Shalizi and McFowland III [SM16] address the homophily/contagion issue with a two-stage estimation procedure. They first estimate latent confounders (node properties), then use these in a regression-based estimator in the second stage. Their main result is a proof that if the network was actually generated by either a stochastic block model or a latent space model then the estimation procedure is valid. Our main motivation here is to avoid such well-specified model assumptions. Their work is complementary to our approach: we impose a weaker assumption, but we only address homophily.

Causal Inference Using Proxy Confounders. Another line of connected research deals with causal inference with hidden confounding when there is an observed proxy for the confounder [KM99; Pea12; KP14; Mia+18; Lou+17]. This work assumes the data is generated independently and identically as (X_i, Z_i, T_i, Y_i) ~iid P for some data-generating distribution P. The variable Z_i causally affects T_i, Y_i, and X_i. The variable(s) X_i are interpreted as noisy versions of Z_i. The main question here is when the causal effect is (non-parametrically) identifiable. The typical flavor of the results is: if the proxy distribution satisfies certain conditions then the marginal distribution P(Z_i, T_i, Y_i) is identifiable, and thus so too is the causal effect (though weaker identification conditions are possible [Mia+18]). 
The main differences with the problem we address here are that the network surrogate has non-iid structure, we expect that the information content of the exact confounder can be recovered in the infinite-sample limit, and we do not demand recovery of the true data-generating distribution.

Double machine learning. Chernozhukov et al. [Che+17a] address robust estimation of causal effects in the i.i.d. setting. Mathematically, our main estimation result, theorem 5.1, is a fairly straightforward adaptation of their result. The important distinction is conceptual: we treat a different data-generating scenario.

Embedding methods. Veitch et al. [Vei+19b] use the strategy of reducing causal estimation to prediction to harness text embedding methods for causal inference with text data. In particular, that paper views the embeddings as a dimension-reduction strategy and asks how the dimension reduction can be achieved in a manner that preserves causal identification.

3 Setup

We first fix some notation and recall some necessary ideas about the statistical estimation of causal effects. We take each statistical unit to be a tuple O_i = (Y_i, T_i, Z_i), where Y_i is the response, T_i is the treatment, and Z_i are (possibly confounding) unobserved attributes of the units. We assume that the units are drawn independently and identically at random from some distribution P, i.e., O_i ~iid P. We study the case where there is a network connecting the units. We assume that the treatments and outcomes are independent of the network given the latent attributes {Z_i}. 
This condition is implied by the (ubiquitous) exchangeable network assumption [OR15; VR15; CD15], though our requirement is weaker than exchangeability.

The average treatment effect of a binary treatment is defined as

ψ = E[Y | do(T = 1)] − E[Y | do(T = 0)].

The use of Pearl's do notation indicates that the effect of interest is causal: what is the expected outcome if we intervene by assigning the treatment to a given unit? If Z_i contains all common influencers (a.k.a. confounders) of Y_i and T_i then the causal effect is identifiable as a parameter of the observational distribution:

ψ = E[ E[Y | Z, T = 1] − E[Y | Z, T = 0] ].   (3.1)

Before turning to the unobserved-Z case, we recall some ideas from the case where Z is observed. Let Q(t, z) = E[Y | t, z] be the conditional expected outcome, and Q̂_n be an estimator for this function. Following eq. (3.1), a natural choice of estimator ψ̂_n is:

ψ̂_n^Q = (1/n) Σ_i [ Q̂_n(1, z_i) − Q̂_n(0, z_i) ].

That is, ψ is estimated by a two-stage procedure: first, produce the estimate Q̂_n; second, plug Q̂_n into a pre-determined statistic to compute the estimate.

Of course, ψ̂_n^Q is not the only possible choice of estimator. In principle, it is possible to do better by incorporating estimates ĝ_n of the propensity scores g(z) = P(T = 1 | z). The augmented inverse probability of treatment weighted (A-IPTW) estimator ψ̂_n^A is an important example [Rob+00; Rob00]:

ψ̂_n^A = (1/n) Σ_i [ Q̂_n(1, z_i) − Q̂_n(0, z_i) ] + (1/n) Σ_i ( I[t_i = 1]/ĝ_n(z_i) − I[t_i = 0]/(1 − ĝ_n(z_i)) ) ( y_i − Q̂_n(t_i, z_i) ).   (3.2)

We call η(z) = (Q(0, z), Q(1, z), g(z)) the nuisance parameters. The main advantage of ψ̂_n^A is that it is robust to misestimation of the nuisance parameters [Rob+94; vR11; Che+17a]. 
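To make the estimator concrete, eq. (3.2) can be computed in a few lines once the nuisance estimates are in hand. This is a minimal sketch, not the paper's implementation: the arrays `q0`, `q1`, `g` stand for Q̂_n(0, z_i), Q̂_n(1, z_i), and ĝ_n(z_i), and the propensity clipping is an added practical safeguard rather than part of eq. (3.2).

```python
import numpy as np

def aiptw_ate(y, t, q0, q1, g, eps=0.01):
    """A-IPTW estimate of the ATE, following eq. (3.2).

    y, t: outcomes and binary treatments; q0, q1: outcome-model
    predictions Q-hat(0, z_i), Q-hat(1, z_i); g: propensity-score
    estimates g-hat(z_i). eps clips propensities away from 0 and 1
    (a common practical safeguard, not part of eq. (3.2))."""
    g = np.clip(g, eps, 1.0 - eps)
    qt = np.where(t == 1, q1, q0)            # Q-hat(t_i, z_i)
    plug_in = np.mean(q1 - q0)               # outcome-model term
    correction = np.mean((t / g - (1 - t) / (1 - g)) * (y - qt))
    return plug_in + correction
```

In a randomized setting with a correct outcome model (e.g., g = 0.5 everywhere and y = t exactly explained by the model), the correction term vanishes and the estimate reduces to the plug-in term.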
For example, it has the double robustness property: ψ̂_n^A is consistent if either ĝ_n or Q̂_n is consistent. If both are consistent, then ψ̂_n^A is the asymptotically most efficient possible estimator [Bic+00]. We will show below that the good theoretical properties of the suitably modified A-IPTW estimator persist for the embedding method even in the non-iid setting of this paper.

There is a remaining complication. In the general case, if the same data O_n is used to estimate the nuisance parameters η̂_n and to compute ψ̂_n(O_n; η̂_n) then the estimator is not guaranteed to maintain good asymptotic properties. This problem can be solved by splitting the data, using one part to estimate η̂_n and the other to compute the estimate [Che+17a]. We rely on this data-splitting approach.

4 Estimation

We now return to the setting where the {z_i} are unobserved, but a network proxy is available. Following the previous section, we want to hold out a subset of the units i ∈ I_0 and, for each of these units, produce estimates of the propensity score g(z_i) and the conditional expected outcome Q(t_i, z_i). Our starting point is (an immediate corollary of) [RR83, Thm. 3]:

Theorem 4.1. Suppose λ(z) is some function of the latent attributes such that at least one of the following is λ(Z)-measurable: (i) (Q(0, Z), Q(1, Z)), or (ii) g(Z). If adjusting for Z suffices to render the average treatment effect identifiable then adjusting for only λ(Z) also suffices. That is,

ψ = E[ E[Y | λ(Z), T = 1] − E[Y | λ(Z), T = 0] ].

The significance of this result is that adjusting for the confounding effect of the latent attributes does not actually require us to recover the latent attributes. 
Instead, it suffices to recover only the aspects λ(z_i) that are relevant for the prediction of the propensity score or conditional expected outcome.

The idea is that we may view network embedding methods as black-box tools for extracting information from the network that is relevant to solving prediction problems. We make use of embedding-based semi-supervised prediction models. What this means is that we assign an embedding λ_i ∈ R^p to each unit, and define a predictor Q̃(t_i, λ_i; γ_Q) mapping the embedding and treatment to a prediction for y_i, and a predictor g̃(λ_i; γ_g) mapping the embeddings to predictions for t_i. In this context, 'semi-supervised' means that when training the model we do not use the labels of units in I_0, but we do use all other data, including the proxy structure on units in I_0.

An example clarifies the general approach.

Example 4.2. We denote the network G_n. We assume a continuous-valued outcome. Consider the case where Q̃(0, ·; γ_Q), Q̃(1, ·; γ_Q), and logit g̃(·; γ_g) are all linear predictors. We train a model with a relational empirical risk minimization procedure [Vei+19a]. We set:

(λ̂_n, γ̂_n^Q, γ̂_n^g) = argmin_{λ, γ_Q, γ_g} E_{G_k = Sample(G_n, k)}[ L(G_k; λ, γ_Q, γ_g) ]

where Sample(G_n, k) is a randomized sampling algorithm that returns a random subgraph of size k from G_n (e.g., a random walk with k edges), and

L(G_k; λ, γ_Q, γ_g) = Σ_{i ∈ I∖I_0} (y_i − Q̃(t_i, λ_i; γ_Q))² + Σ_{i ∈ I∖I_0} CrossEntropy(t_i, g̃(λ_i; γ_g)) + Σ_{(i,j) ∈ I×I} CrossEntropy(1[(i, j) ∈ G_k], sigmoid(λ_iᵀ λ_j)).

Here, I is the full set of units, and 1[(i, j) ∈ G_k] indicates whether units i and j are linked. Note that the final term of the model is the one that explains the relational structure. 
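Under the linear-predictor assumption of Example 4.2, the loss can be sketched as follows. This is a hypothetical minimal sketch, not the relational ERM implementation of [Vei+19a]: `w0`, `w1`, `wg` stand in for the parameters γ_Q and γ_g, `labeled` indexes the units outside the held-out fold I_0, and `pairs`/`adj` are an assumed encoding of the sampled subgraph G_k.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def xent(label, p, eps=1e-7):
    # binary cross-entropy, with clipping for numerical safety
    p = np.clip(p, eps, 1.0 - eps)
    return -(label * np.log(p) + (1 - label) * np.log1p(-p))

def loss(lam, w0, w1, wg, y, t, labeled, pairs, adj):
    """Sketch of L(G_k; lambda, gamma_Q, gamma_g) from Example 4.2
    with linear predictors: Q~(t, lam_i) = lam_i . w_t and
    logit g~(lam_i) = lam_i . wg. `pairs` lists (i, j) pairs from the
    sampled subgraph, with adj[i, j] = 1 iff (i, j) is an edge."""
    q = np.where(t[labeled] == 1, lam[labeled] @ w1, lam[labeled] @ w0)
    outcome_term = np.sum((y[labeled] - q) ** 2)
    treatment_term = np.sum(xent(t[labeled], sigmoid(lam[labeled] @ wg)))
    # the relational term: edge indicator vs. inner product of embeddings
    edge_term = sum(xent(adj[i, j], sigmoid(lam[i] @ lam[j])) for i, j in pairs)
    return outcome_term + treatment_term + edge_term
```

The edge term ranges over all pairs in the sampled subgraph, including pairs that touch held-out units, which mirrors the point made above that the relational part of the loss uses the entire dataset.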
Intuitively, it says that the logit probability of an edge is the inner product of the embeddings of the endpoints of the edge. This loss term makes use of the entire dataset, including links that involve the held-out units. This is important to ensure that the embeddings for the held-out data 'match' the rest of the embeddings.

Estimation. With a trained model in hand, computing the estimate of the treatment effect is straightforward: simply plug the estimated values of the nuisance parameters into a standard estimator. For example, using the A-IPTW estimator eq. (3.2),

ψ̂_n^A(I_0) := (1/|I_0|) Σ_{i ∈ I_0} [ Q̃(1, λ̂_{n,i}; γ̂_n^Q) − Q̃(0, λ̂_{n,i}; γ̂_n^Q) ] + (1/|I_0|) Σ_{i ∈ I_0} ( I[t_i = 1]/g̃(λ̂_{n,i}; γ̂_n^g) − I[t_i = 0]/(1 − g̃(λ̂_{n,i}; γ̂_n^g)) ) ( y_i − Q̃(t_i, λ̂_{n,i}; γ̂_n^Q) ).   (4.1)

We also allow for a more sophisticated variant. We split the data into K folds I_0, ..., I_{K−1} and define our estimator as:

ψ̂_n^A = (1/K) Σ_j ψ̂_n^A(I_j).   (4.2)

This variant is more data efficient than just using a single fold. Finally, the same procedure applies to estimators other than the A-IPTW. We consider the effect of the choice of estimator in section 6.

5 Validity

When does the procedure outlined in the previous section yield valid inferences? We now present a theorem establishing sufficient conditions. The result is an adaptation of the "double machine learning" of Chernozhukov et al. [Che+17a; Che+17b] to the network setting. We first give the technical statement, and then discuss its significance and interpretation.

Fix notation as in the previous section. 
We also define γ̂_n^{Q,I_k^c} and γ̂_n^{g,I_k^c} to be the estimates for γ_Q and γ_g calculated using all but the kth data fold.

Assumption 1. The probability distribution P satisfies

Y = Q(T, Z) + ζ,   E[ζ | Z, T] = 0,
T = g(Z) + ν,   E[ν | Z] = 0.

Further, we require that T does not causally affect either Z or the network. The second part of the statement is necessary to rule out a linear-Gaussian edge case.

Assumption 2. There is some function λ mapping features Z into R^p such that λ satisfies the condition of theorem 4.1, and each of ||Q̃_n(0, λ̂_{n,i}; γ̂_n^{Q,I_k^c}) − Q(0, λ(Z_i))||_{P,2}, ||Q̃_n(1, λ̂_{n,i}; γ̂_n^{Q,I_k^c}) − Q(1, λ(Z_i))||_{P,2}, and ||g̃_n(λ̂_{n,i}; γ̂_n^{g,I_k^c}) − g(λ(Z_i))||_{P,2} goes to 0 as n → ∞. Additionally, λ must satisfy all of the following assumptions.

Assumption 3. The following moment conditions hold for some fixed ε, C, c, some q > 4, and all t ∈ {0, 1}:

||Q(t, λ(Z))||_{P,q} ≤ C,
||Y||_{P,q} ≤ C,
P(ε ≤ g(λ(Z)) ≤ 1 − ε) = 1,
P(E_P[ζ² | λ(Z)] ≤ C) = 1,
||ζ||_{P,2} ≥ c,
||ν||_{P,2} ≥ c.

Assumption 4. The estimators of the nuisance parameters satisfy the following accuracy requirements. There are some δ_{nK}, Δ_{nK} → 0 such that for all n ≥ 2K and d ∈ {0, 1} it holds with probability no less than 1 − Δ_{nK}:

||Q̃_n(d, λ̂_{n,i}; γ̂_n^{Q,I_k^c}) − Q(d, λ(Z_i))||_{P,2} · ||g̃_n(λ̂_{n,i}; γ̂_n^{g,I_k^c}) − g(λ(Z_i))||_{P,2} ≤ δ_{nK} · n^{−1/2}   (5.1)

And,

P(ε ≤ g̃_n(λ̂_{n,i}; γ̂_n^{g,I_k^c}) ≤ 1 − ε) = 1.   (5.2)

Assumption 5. 
We assume the dependence between the trained embeddings is not too strong: for any i, j and all bounded continuous functions f with mean 0,

E[ f(λ̂_{n,i}) · f(λ̂_{n,j}) ] = o(1/n).   (5.3)

Theorem 5.1. Denote the true ATE as ψ. Let ψ̂_n be the K-fold A-IPTW variant defined in eq. (4.2). Under Assumptions 1 to 5, ψ̂_n concentrates around ψ at the rate 1/√n and is approximately unbiased and normally distributed:

σ⁻¹ √n (ψ̂_n − ψ) →ᵈ N(0, 1),

where

σ² = E_P[ φ_0²(Y, T, λ(Z); θ_0, η(λ(Z))) ],

φ_0(Y, T, λ(Z); θ_0, η(λ(Z))) = (T/g(λ(Z))) {Y − Q(1, λ(Z))} − ((1 − T)/(1 − g(λ(Z)))) {Y − Q(0, λ(Z))} + {Q(1, λ(Z)) − Q(0, λ(Z))} − ψ.

Proof. The proof follows Chernozhukov et al. [Che+17b]. The main changes are technical modifications exploiting Assumption 5 to allow for the use of the full data in the embedding training. We defer the proof to the appendix.

Interpretation and Significance. Under suitable conditions, theorem 5.1 promises us that the treatment effect is identifiable and can be estimated at a fast rate. It is not surprising that there are some conditions under which this holds. The insight from theorem 5.1 lies with the particular assumptions that are required.

Assumptions 1 and 3 are standard conditions. Assumption 1 posits a causal model that (i) restricts the treatments and outcomes to a pure unit effect (i.e., it forbids contagion effects), and (ii) renders the causal effects identifiable when Z is observed. Assumption 3 collects technical conditions on the data-generating distribution, including the standard positivity condition. Possible violations of these conditions are important and must be considered carefully in practice. 
However, such considerations are standard, independent of the non-iid, no-generative-model setting that is our focus, so we do not comment further.

Our first deviation from the standard causal inference setup is Assumption 2. This is the identification condition when Z is not observed. It requires that the learned embeddings are able to extract whatever information is relevant to the prediction of the treatment and outcome. This assumption is the crux of the method.

A more standard assumption would directly posit the relationship between Z and the proxy network, e.g., by assuming a stochastic block model or latent space model. The practitioner is then required to assess whether the posited model is realistic. In practice, all generative models of networks fail to capture the structure of real-world networks. Instead, we ask the practitioner to judge the plausibility of the predictive embedding model. Such judgments are non-falsifiable, and must be based on experience with the methods and trials on semi-synthetic data. This is a difficult task, but the assumption is at least not violated a priori.

In practice, we do not expect the identification assumption to hold exactly. Instead, the hope is that applying the method will adjust for whatever confounding information is present in the network. This is useful even if there is confounding exogenous to the network. We study the behavior of the method in the presence of exogenous confounding in section 6.

The condition in Assumption 4 addresses the statistical quality of the nuisance-parameter estimation procedure. For an estimator to be useful, it must produce accurate estimates with a reasonable amount of data. It is intuitive that if accurately estimating the nuisance parameters requires an enormous amount of data, then so too will estimation of ψ. Eq. (5.1) shows that this is not so. 
It suffices, in principle, to estimate the nuisance parameters crudely, e.g., at an error rate of o(n^{−1/4}) each. This is important because the need to estimate the embeddings may rule out parametric-rate convergence of the nuisance parameters; theorem 5.1 shows this is not damning.

Assumption 5 is the price we pay for training the embeddings with the full data. If the pairwise dependence between the learned embeddings is very strong then the data-splitting procedure does not guarantee that the estimate is valid. However, the condition is weak and holds empirically. The condition can also be removed by a two-stage procedure where the embeddings are trained in an unsupervised manner and then used as a direct surrogate for the confounders. However, such approaches have relatively poor predictive performance [Yan+16; Vei+19a]. We compare to the two-stage approach in section 6.

6 Experiments

The main remaining questions are: Is the method able to adjust for confounding in practice? If so, is the joint training of embeddings and classifier important? And, what is the best choice of plug-in estimator for the second stage of the procedure? Additionally, what happens in the (realistic) case that the network does not carry all confounding information?

We investigate these questions with experiments on a semi-synthetic network dataset (code and pre-processed data at github.com/vveitch/causal-network-embeddings). We find that in realistic situations, the network adjustment improves the estimation of the average treatment effect. The estimate is closer to the truth than estimates from either a parametric baseline or a two-stage embedding procedure. Further, we find that network adjustment improves estimation quality even in the presence of confounding that is exogenous to the network. That is, the method still helps even when full identification is not possible. 
Finally, as predicted by theory, we find that the robust estimators are best when the theoretical assumptions hold. However, the simple conditional-outcome-only estimator has better performance in the presence of significant exogenous confounding.

Choice of estimator. We consider 4 options for the plug-in treatment effect estimator.

1. The conditional-expected-outcome-based estimator,

ψ̂_n^Q = (1/n) Σ_i [ Q̃_n(1, λ̂_{n,i}; γ̂_n) − Q̃_n(0, λ̂_{n,i}; γ̂_n) ],

which only makes use of the outcome model.

2. The inverse probability of treatment weighted estimator,

ψ̂_n^g = (1/n) Σ_i [ 1[t_i = 1]/g̃(λ̂_{n,i}; γ̂_n) − 1[t_i = 0]/(1 − g̃(λ̂_{n,i}; γ̂_n)) ] y_i,

which only makes use of the treatment model.

3. The augmented inverse probability of treatment weighted estimator ψ̂_n^A, defined in eq. (4.1).

4. A targeted minimum loss based estimator (TMLE) [vR11].

The latter two estimators both make full use of the nuisance parameter estimates. The TMLE also admits the asymptotic guarantees of theorem 5.1 (though we only state the theorem for the simpler A-IPTW estimator). The TMLE is a variant designed for better finite-sample performance.

Pokec. To study the properties of the procedure, we generate semi-synthetic data using a real-world social network. We use a subset of the Pokec social network. Pokec is the most popular online social network in Slovakia. For our purposes, the main advantages of Pokec are: the anonymized data are freely and openly available [TZ12; LK14],[2] and the data include significant attribute information for the users, which is necessary for our simulations. We pre-process the data to restrict to three districts (Žilina, Čadca, Námestovo), all within the same region (Žilinský). 
The pre-processed network has 79 thousand users connected by 1.3 million links.

Simulation. We make use of three user-level attributes in our simulations: the district they live in, the user's age, and their Pokec join date. These attributes were selected because they have low missingness and have some dependency with the network structure. We discretize age and join date to a 3-level categorical variable (to match district).

For the simulation, we take each of these attributes to be the hidden confounder. We will attempt to adjust for the confounding using the Pokec network. We take the probability of treatment to be wholly determined by the confounder z, with the three levels corresponding to g(z) ∈ {0.15, 0.5, 0.85}. The treatment and outcome for user i are simulated from their confounding attribute z_i as:

t_i = Bern(g(z_i)),   (6.1)
y_i = t_i + β (g(z_i) − 0.5) + ε_i,   ε_i ~ N(0, 1).   (6.2)

In each case, the true treatment effect is 1.0. The parameter β controls the amount of confounding.

Estimation. For each simulated dataset, we estimate the nuisance parameters using the procedure described in section 4 with K = 10 folds. We use a random-walk sampler with negative sampling, with the default relational ERM settings [Vei+19a]. We pre-train the embeddings using the unsupervised objective only, running until convergence.

Baselines. We consider three baselines. The first is the naive estimate that does not attempt to control for confounding, i.e., (1/m) Σ_{i: t_i = 1} y_i − (1/(n − m)) Σ_{i: t_i = 0} y_i, where m is the number of treated individuals. The second baseline is the two-stage procedure, where we first train the embeddings on the unsupervised objective, freeze them, and then use them as features for the same predictor maps. The final baseline is a parametric approach to controlling for the confounding. 
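For concreteness, the simulation in eqs. (6.1) and (6.2) and the naive difference-in-means baseline can be sketched as follows. This is a stand-alone illustrative sketch, not the paper's code; the "adjusted" estimate stratifies on the simulated confounder directly, which is only possible in simulation, where z is known.

```python
import numpy as np

rng = np.random.default_rng(0)
n, beta = 100_000, 10.0            # beta = 10.0 is the high-confounding setting

# 3-level confounder with propensities g(z) in {0.15, 0.5, 0.85}
z = rng.integers(0, 3, size=n)
g = np.array([0.15, 0.5, 0.85])[z]

t = rng.binomial(1, g)                           # eq. (6.1)
y = t + beta * (g - 0.5) + rng.normal(size=n)    # eq. (6.2); true ATE is 1.0

# naive baseline: difference of mean outcomes, biased upward by confounding
naive = y[t == 1].mean() - y[t == 0].mean()

# oracle adjustment: stratify on z (only possible here because z is simulated)
adjusted = sum(
    (z == k).mean() * (y[(t == 1) & (z == k)].mean() - y[(t == 0) & (z == k)].mean())
    for k in range(3)
)
```

With a large β the naive estimate is far from the ground truth of 1.0, while the stratified estimate recovers it; the method in this paper aims to approximate the latter when z is unobserved and only the network proxy is available.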
[2] snap.stanford.edu/data/soc-Pokec.html

Table 1: Adjusting using the network improves the ATE estimate in all cases. Further, the single-stage method is more accurate than the baselines. Table entries are estimated ATE with 10-fold std. Ground truth is 1.0. Low and high confounding correspond to β = 1.0 and 10.0.

| Estimator \ Conf. | age, Low | age, High | district, Low | district, High | join date, Low | join date, High |
| Unadjusted | 1.32 ± 0.02 | 4.34 ± 0.05 | 1.34 ± 0.03 | 4.51 ± 0.05 | 1.29 ± 0.03 | 4.03 ± 0.06 |
| Parametric | 1.30 ± 0.00 | 4.06 ± 0.01 | 1.21 ± 0.00 | 3.22 ± 0.01 | 1.26 ± 0.00 | 3.73 ± 0.01 |
| Two-stage | 1.33 ± 0.02 | 4.55 ± 0.05 | 1.34 ± 0.02 | 4.55 ± 0.05 | 1.30 ± 0.03 | 4.16 ± 0.06 |
| ψ̂_n^A | 1.24 ± 0.04 | 3.40 ± 0.04 | 1.09 ± 0.02 | 2.03 ± 0.07 | 1.21 ± 0.05 | 3.26 ± 0.09 |

Table 2: The conditional-outcome-only estimator is usually most accurate. Table entries are estimated ATE with 10-fold std. Ground truth is 1.0. Low and high confounding correspond to β = 1.0 and 10.0.

| Estimator \ Conf. | age, Low | age, High | district, Low | district, High | join date, Low | join date, High |
| ψ̂_n^Q | 1.05 ± 0.24 | 2.77 ± 0.35 | 1.03 ± 0.25 | 1.75 ± 0.20 | 1.17 ± 0.35 | 2.41 ± 0.45 |
| ψ̂_n^g | 1.27 ± 0.03 | 3.12 ± 0.06 | 1.10 ± 0.03 | 1.66 ± 0.07 | 1.29 ± 0.05 | 3.10 ± 0.07 |
| ψ̂_n^A | 1.24 ± 0.04 | 3.40 ± 0.04 | 1.09 ± 0.02 | 2.03 ± 0.07 | 1.21 ± 0.05 | 3.26 ± 0.09 |
| ψ̂_n^TMLE | 1.21 ± 0.03 | 3.26 ± 0.07 | 1.09 ± 0.04 | 2.02 ± 0.05 | 1.20 ± 0.05 | 3.13 ± 0.09 |

We fit a mixed-membership stochastic block model [GB13] to the data, with 128 communities (chosen to match the embedding dimension). 
We predict the outcome using a linear regression of the outcome on the community identities and the treatment. The estimated treatment effect is the coefficient of the treatment.

6.1 Results

Comparison to baselines. We report comparisons to the baselines in table 1. As expected, adjusting for the network improves estimation in every case. Further, the one-stage embedding procedure is more accurate than the baselines.

Choice of estimator. We report comparisons of downstream estimators in table 2. The conditional-outcome-only estimator usually yields the best estimates, improving substantially on either robust method. This is likely because the network does not carry all information about the confounding factors, violating one of our assumptions. We expect that district has the strongest dependence with the network, and we see the best performance for this attribute. Poor performance of robust estimators when assumptions are violated has been observed in other contexts [KS07].

Confounding exogenous to the network. In practice, the network may not carry information about all sources of confounding. For instance, in our simulation, the confounders may not be wholly predictable from the network structure. We study the effect of exogenous confounding by a second simulation where the confounder consists of a part that can be fully inferred from the network and a part that is wholly exogenous.

For the inferrable part, we use the estimated propensity scores {ĝ_i} from the district experiment above. By construction, the network carries all information about each ĝ_i. We define the (ground truth) propensity score for our new simulation as logit g_sim,i = (1 − p) logit ĝ_i + p ξ_i,

Figure 1: Adjusting for the network helps even when the no-exogenous-confounding assumption is violated. The robust TMLE estimator is the best estimator when no assumptions are violated. 
The simple conditional-\noutcome-only estimator (\u201cSimple\u201d) is better in the pres-\nence of moderate exogeneity. Plot shows estimates of\nATE from district simulation. Ground truth is 1.\n\n8\n\n0.00.20.40.60.81.0Exogeneity0.51.01.52.02.53.03.54.0ATE EstimateSimpleTMLEUnadjusted\fwith \u03bei\ncontrols the level of exogeneity. We simulate treatments and outcomes as in eq. (6.1).\n\niid\u223c N(0, 1). The second term, \u03bei, is the exogenous part of the confounding. The parameter p\n\nIn \ufb01g. 1 we plot the estimates at various levels of exogeneity. We observe that network adjustment\nhelps even when the no exogenous confounding assumption is violated. Further, we see that the\nrobust estimator has better performance when p = 0, i.e., when the assumptions of theorem 5.1\nare satis\ufb01ed. However, the conditional-outcome-only estimator is better with substantial exogenous\nconfounding.\n\n7 Discussion\n\nWe have seen how black-box embedding methods can be harnessed for causal inference in the\ncontext of networks. The important conceptual points of the development are: First, the method\neliminates the need to precisely specify the properties that in\ufb02uence both network formation and\nthat are confounding. In particular, we need not specify a parametric model for how the network is\nformed. And, second, identi\ufb01cation and estimation can be achieved even if the embedding method\nextracts the necessary information at only a slow rate. That is, absence of a parametric model is not a\ngrevious problem from a sample-complexity perspective. These are substantial strengths. However,\nthere also signi\ufb01cant limitations and opportunity for future work.\n\nAssumption 2 may be dif\ufb01cult to reason about in practice. It requires the practitioner to assess\nboth whether (1) the network carries suf\ufb01cient information for identi\ufb01cation, and (2) the embedding\nmethod is able to effectively extract this information. 
The \ufb01rst part is an assessment based on\napplication-speci\ufb01c domain knowledge. The second part is based on past performance of embedding\nmethodse.g., a method that reliably predicts political af\ufb01liation in datasets where af\ufb01liation is\nlabeled can be expected to effectively extract information relevant to political identity. This is an\nimprovement on the (impossible) requirement of \ufb01nding a well-speci\ufb01ed model describing how\nthe network was generated. While we do not expect that either condition is exactly satis\ufb01ed, the\nexogeneous-confounding experiments in section 6 suggest that applying the adjustment can still\nimprove estimation. An important direction for future work is to develop new methods for sensitivity\nanalysis\u2014applicable in this black-box setting\u2014and formal results about when Assumption 2 can be\nexpected to hold. As an example of the need for such results, partial adjustment for confounding is\nknown to hurt estimation in certain cases [e.g., bias ampli\ufb01cation Mid+16; Din+17].\n\nThe pure-homophily assumption is restrictive. Many of the most interesting causal questions on\nnetworks are explicitly about in\ufb02uence and contagion. We restricted to the homophily case here\nfor simplicity, but it does not appear to be fundamentally required. Extending the results to handle\ncontagion and in\ufb02uence is an important direction for future work.\n\nReferences\n\n[Air+08]\n\n[Bic+00]\n\n[Cha+17]\n\nE. Airoldi, D. Blei, S. Fienberg, and E. Xing. \u201cMixed membership stochastic blockmod-\nels\u201d. In: Journal of Machine Learning Research (2008).\nP. J. Bickel, C. A. J. Klaassen, Y. Ritov, and J. A. Wellner. \u201cEf\ufb01cient and adaptive\nestimation for semiparametric models\u201d. In: Sankhy: The Indian Journal of Statistics,\nSeries A (2000).\nB. P. Chamberlain, J. Clough, and M. P. Deisenroth. \u201cNeural embeddings of graphs in\nhyperbolic space\u201d. 
In: arXiv e-prints, arXiv:1705.10359 (2017).\n\n[Che+17a] V. Chernozhukov, D. Chetverikov, M. Demirer, E. Du\ufb02o, C. Hansen, W. Newey, and\nJ. Robins. \u201cDouble/debiased machine learning for treatment and structural parameters\u201d.\nIn: The Econometrics Journal (2017).\n\n[Che+17b] V. Chernozhukov, D. Chetverikov, M. Demirer, E. Du\ufb02o, C. Hansen, and W. Newey.\n\u201cDouble/debiased/neyman machine learning of treatment effects\u201d. In: American Eco-\nnomic Review 5 (2017).\nS. R. Cole and M. A. Hern\u00b4an. \u201cConstructing inverse probability weights for marginal\nstructural models.\u201d In: American Journal of Epidemiology (2008).\nH. Crane and W. Dempsey. \u201cA framework for statistical network modeling\u201d. In: arXiv\ne-prints, arXiv:1509.08185 (2015).\n\n[CH08]\n\n[CD15]\n\n9\n\n\f[Din+17]\n\n[Dur06]\n[GB13]\n\nP. Ding, T. Vanderweele, and J. M. Robins. \u201cInstrumental variables as bias ampli\ufb01ers\nwith general outcome and confounding\u201d. In: Biometrika 2 (2017).\nR. Durrett. Random Graph Dynamics. 2006.\nP. K. Gopalan and D. M. Blei. \u201cEf\ufb01cient discovery of overlapping communities in\nmassive networks\u201d. In: Proceedings of the National Academy of Sciences (2013).\n\n[Ham+17] W. Hamilton, Z. Ying, and J. Leskovec. \u201cInductive representation learning on large\n\ngraphs\u201d. In: Advances in neural information processing systems 30. 2017.\n\n[Ham+17] W. L. Hamilton, R. Ying, and J. Leskovec. \u201cRepresentation learning on graphs: methods\n\n[Hof+02]\n\n[KS07]\n\n[KM99]\n\n[KP14]\n\n[LK14]\n\n[Lou+17]\n\nand applications\u201d. In: arXiv e-prints, arXiv:1709.05584 (2017).\nP. Hoff, A. Raftery, and M. Handcock. \u201cLatent space approaches to social network\nanalysis\u201d. In: Journal of the American Statistical Association 460 (2002).\nJ. D. Y. Kang and J. L. Schafer. \u201cDemystifying double robustness: a comparison of\nalternative strategies for estimating a population mean from incomplete data\u201d. 
In: Statist.\nSci. 4 (2007).\nM. Kuroki and M. Miyakawa. \u201cIdenti\ufb01ability criteria for causal effects of joint interven-\ntions\u201d. In: Journal of the Japan Statistical Society 2 (1999).\nM. Kuroki and J. Pearl. \u201cMeasurement bias and effect restoration in causal inference\u201d.\nIn: Biometrika 2 (2014).\nJ. Leskovec and A. Krevl. SNAP Datasets: Stanford Large Network Dataset Collection.\nhttp://snap.stanford.edu/data. 2014.\nC. Louizos, U. Shalit, J. M. Mooij, D. Sontag, R. Zemel, and M. Welling. \u201cCausal\neffect inference with deep latent-variable models\u201d. In: Advances in neural information\nprocessing systems. 2017.\n\n[Mia+18] W. Miao, Z. Geng, and E. J. Tchetgen Tchetgen. \u201cIdentifying causal effects with proxy\n\n[Mid+16]\n\n[New09]\n[Ogb18]\n\n[OV17]\n\n[Ogb+17]\n\n[OR15]\n\n[Pea12]\n\n[Per+14]\n\n[Rob00]\n\n[Rob+94]\n\n[Rob+00]\n\n[RR83]\n\n[SM16]\n\nvariables of an unmeasured confounder\u201d. In: Biometrika 4 (2018).\nJ. A. Middleton, M. A. Scott, R. Diakow, and J. L. Hill. \u201cBias ampli\ufb01cation and bias\nunmasking\u201d. In: Political Analysis 3 (2016).\nM. Newman. Networks. An Introduction. 2009.\nE. L. Ogburn. \u201cChallenges to estimating contagion effects from observational data\u201d. In:\nComplex spreading phenomena in social systems: in\ufb02uence and contagion in real-world\nsocial networks. 2018.\nE. L. Ogburn and T. J. VanderWeele. \u201cVaccines, contagion, and social networks\u201d. In:\nAnn. Appl. Stat. 2 (2017).\nE. L. Ogburn, O. Sofrygin, I. Diaz, and M. J. van der Laan. \u201cCausal inference for social\nnetwork data\u201d. In: arXiv e-prints, arXiv:1705.08527 (2017).\nP. Orbanz and D. Roy. \u201cBayesian models of graphs, arrays and other exchangeable\nrandom structures\u201d. In: Pattern Analysis and Machine Intelligence, IEEE Transactions\non 2 (2015).\nJ. Pearl. \u201cOn measurement bias in causal inference\u201d. In: arXiv e-prints, arXiv:1203.3504\n(2012).\nB. Perozzi, R. Al-Rfou, and S. Skiena. 
\u201cDeepwalk: online learning of social represen-\ntations\u201d. In: Proc. 20th int. conference on knowledge discovery and data mining (kdd\n\u201914). 2014.\nJ. M. Robins. \u201cRobust estimation in sequentially ignorable missing data and causal\ninference models\u201d. In: ASA Proceedings of the Section on Bayesian Statistical Science\n(2000).\nJ. M. Robins, A. Rotnitzky, and L. P. Zhao. \u201cEstimation of regression coef\ufb01cients\nwhen some regressors are not always observed\u201d. In: Journal of the American Statistical\nAssociation 427 (1994).\nJ. M. Robins, A. Rotnitzky, and M. van der Laan. \u201cOn pro\ufb01le likelihood: comment\u201d. In:\nJournal of the American Statistical Association 450 (2000).\nP. R. Rosenbaum and D. B. Rubin. \u201cThe central role of the propensity score in observa-\ntional studies for causal effects\u201d. In: Biometrika 1 (1983).\nC. Shalizi and E. McFowland III. \u201cEstimating causal peer in\ufb02uence in homophilous\nsocial networks by inferring latent locations\u201d. In: arXiv e-prints, arXiv:1607.06565\n(2016).\n\n10\n\n\f[ST11]\n\n[TZ12]\n\n[Tch+17]\n\n[vR11]\n\n[VR15]\n\nC. R. Shalizi and A. C. Thomas. \u201cHomophily and contagion are generically confounded\nin observational social network studies\u201d. In: Sociol. Methods Res. 2 (2011).\nL. Takac and M. Zabovsky. \u201cData analysis in public social networks\u201d. In: International\nScienti\ufb01c Conference and International Workshop Present Day Trends of Innovations\n(2012).\nE. J. Tchetgen Tchetgen, I. Fulcher, and I. Shpitser. \u201cAuto-g-computation of causal\neffects on a network\u201d. In: arXiv e-prints, arXiv:1709.01577 (2017).\nM. van der Laan and S. Rose. Targeted Learning: Causal Inference for Observational\nand Experimental Data. 2011.\nV. Veitch and D. M. Roy. \u201cThe class of random graphs arising from exchangeable\nrandom measures\u201d. In: arXiv e-prints, arXiv:1512.03099 (2015).\n\n[Vei+19a] V. Veitch, M. Austern, W. Zhou, D. M. 
Blei, and P. Orbanz. \u201cEmpirical risk minimiza-\ntion and stochastic gradient descent for relational data\u201d. In: Proceedings of the 22nd\ninternational conference on arti\ufb01cial intelligence and statistics. 2019.\n\n[Vei+19b] V. Veitch, D. Sridhar, and D. M. Blei. \u201cUsing text embeddings for causal inference\u201d. In:\n\n[WW87]\n\n[Yan+16]\n\narXiv e-prints, arXiv:1905.12741 (2019).\nY. Wang and G. Wong. \u201cStochastic block models for directed graphs\u201d. In: Journal of\nthe American Statistical Association 397 (1987).\nZ. Yang, W. Cohen, and R. Salakhudinov. \u201cRevisiting semi-supervised learning with\ngraph embeddings\u201d. In: Proceedings of the 33rd international conference on machine\nlearning. 2016.\n\n11\n\n\f", "award": [], "sourceid": 7684, "authors": [{"given_name": "Victor", "family_name": "Veitch", "institution": "Columbia University"}, {"given_name": "Yixin", "family_name": "Wang", "institution": "Columbia University"}, {"given_name": "David", "family_name": "Blei", "institution": "Columbia University"}]}