{"title": "Variance Reduction in Bipartite Experiments through Correlation Clustering", "book": "Advances in Neural Information Processing Systems", "page_first": 13309, "page_last": 13319, "abstract": "Causal inference in randomized experiments typically assumes that the units of randomization and the units of analysis are one and the same. In some applications, however, these two roles are played by distinct entities linked by a bipartite graph. The key challenge in such bipartite settings is how to avoid interference bias, which would typically arise if we simply randomized the treatment at the level of analysis units. One effective way of minimizing interference bias in standard experiments is through cluster randomization, but this design has not been studied in the bipartite setting where conventional clustering schemes can lead to poorly powered experiments. This paper introduces a novel clustering objective and a corresponding algorithm that partitions a bipartite graph so as to maximize the statistical power of a bipartite experiment on that graph. Whereas previous work relied on balanced partitioning, our formulation suggests the use of a correlation clustering objective. 
We use a publicly-available graph of Amazon user-item reviews to validate our solution and illustrate how it substantially increases the statistical power in bipartite experiments.", "full_text": "Variance Reduction in Bipartite Experiments through\n\nCorrelation Clustering\n\nJean Pouget-Abadie\n\nGoogle Research\n\nNew York, NY 10011\njeanpa@google.com\n\nKevin Aydin\n\nGoogle Research\n\nMountain View, CA 94043\n\nkaydin@google.com\n\nWarren Schudy\nGoogle Research\n\nNew York, NY 10011\nwschudy@google.com\n\nKay Brodersen\n\nGoogle\n\nZ\u00fcrich, Switzerland\n\nkbrodersen@google.com\n\nAbstract\n\nVahab Mirrokni\nGoogle Research\n\nNew York, NY 10011\n\nmirrokni@google.com\n\nCausal inference in randomized experiments typically assumes that the units of\nrandomization and the units of analysis are one and the same. In some applica-\ntions, however, these two roles are played by distinct entities linked by a bipartite\ngraph. The key challenge in such bipartite settings is how to avoid interference\nbias, which would typically arise if we simply randomized the treatment at the\nlevel of analysis units. One effective way of minimizing interference bias in stan-\ndard experiments is through cluster randomization, but this design has not been\nstudied in the bipartite setting where conventional clustering schemes can lead to\npoorly powered experiments. This paper introduces a novel clustering objective\nand a corresponding algorithm that partitions a bipartite graph so as to maximize\nthe statistical power of a bipartite experiment on that graph. Whereas previous\nwork relied on balanced partitioning, our formulation suggests the use of a cor-\nrelation clustering objective. 
We use a publicly-available graph of Amazon user-\nitem reviews to validate our solution and illustrate how it substantially increases\nthe statistical power in bipartite experiments.\n\n1\n\nIntroduction\n\nWhether the setting is a medical trial or an A/B test, causal inference in randomized experiments\ntypically assumes that the units of randomization and the units of analysis are one and the same.\nOne possible exception to this rule is in the context of interference, a \ufb01eld of growing interest to\nstatisticians, where experimenters will occasionally consider the indirect effect of a unit\u2019s peers\nbeing treated. In some cases, it is infeasible or undesirable to assign treatment to the experimental\nunits we care about, and we can then draw a distinction between the units that are directly treated\nand the units whose outcomes we care about, a setting known as a bipartite experiment.\nBipartite experiments are critical in various settings, including pricing in online marketplaces (Frad-\nkin et al., 2018), recommender systems (Gilotte et al., 2018), social networks (Bakshy et al., 2014),\nand display ad auctions (Chawla et al., 2016). For example, consider an online retailer wishing to\ndetermine the impact of offering a discount on certain items through a randomized experiment. The\nretailer faces the choice of randomizing on users, items, or user-item pairs. Randomizing on users\nor user-item pairs could result in different users seeing different prices on the same item, which is\nundesirable. If the retailer decides to randomize on items, standard causal theory would suggest\nusing items as the unit of analysis. However, this ignores interference between substitute goods. 
33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fFor example, experimenters may be misled by the sudden spike in demand for an item randomly selected for a discount, when in fact the treatment (discount) would be neutral if applied to all of that item\u2019s substitutes. An effective way of overcoming these issues is to design and analyze the experiment in a bipartite way, with users as the units of analysis and items as the units of randomization. Another application arises in recommender systems, which create non-trivial interference mechanisms: boosting the recommendation of one item during a randomized experiment can negatively affect the likelihood of other candidates being recommended to a user. Experiments on such systems may benefit from being analyzed in a bipartite way. A third application is given by display ad auctions (Pouget-Abadie et al., 2018), where bidders compete for ad impressions in auctions, leading to complex interference mechanisms that can be elegantly dealt with using a bipartite experiment.\nDespite their importance, the existing literature on bipartite experiments is scarce. Zigler and Papadogeorgou (2018) introduce a formal framework for bipartite experiments and suggest a restriction of the potential outcomes space inspired by the literature on causal inference with interference (Hudgens and Halloran, 2008; Toulis and Kao, 2013; Tchetgen and VanderWeele, 2012). We consider a very similar setting to Zigler and Papadogeorgou (2018) by tying the analysis of bipartite randomized experiments to the dose-response literature, also known as the continuous-treatment literature, more commonly found in medical and public health settings (Galagate, 2016). 
Extending the idea of cluster-randomized designs to the bipartite setting is the core contribution of the present paper.\nIn a cluster-randomized design, the treatment is assigned to groups (clusters) of units rather than to individual units. Cluster-randomized designs are a popular choice for their simplicity and ability to minimize estimation bias (Eckles et al., 2017). Because cluster-based designs are easily implemented and straightforward to analyze, the literature has focused on the choice of clustering, relying either on domain-specific knowledge (e.g. school districts (Basse and Feller, 2018), villages (Shakya et al., 2017)) or graph algorithms (Ugander and Backstrom, 2013; Gui et al., 2015). A popular clustering technique applied to these settings is balanced graph partitioning (Delling et al., 2012; Stanton and Kliot, 2012; Ugander and Backstrom, 2013; Tsourakakis et al., 2014; Aydin et al., 2016). While some of these clustering techniques can be extended to the bipartite setting, we suggest an entirely new clustering objective\u2014and a corresponding algorithm\u2014that leverages the unique nature of bipartite experiments. In particular, we show how to model our clustering objective as a correlation clustering problem. In doing so, we come closest to the optimal design literature (Raudenbush, 1997; Pokhiko et al., 2019), which selects the experimental design that is optimal according to some statistical criterion.\nIn Section 2, we define the bipartite randomized experiments framework and suggest an analysis approach based on a linear treatment exposure assumption. In Section 3, we introduce a new clustering objective for running cluster-based bipartite experiments, and show that it is a specific instance of the well-studied correlation clustering problem. 
Finally, in Section 4, using a publicly-available Amazon user-item review dataset, we show that our suggested algorithm improves experimental power significantly over other more straightforward extensions of cluster randomized designs to the bipartite setting.\n\n2 Bipartite randomized experiments\n\n2.1 Definitions\n\nA bipartite randomized experiment is a randomized experiment in which we distinguish two types of units: diversion units and outcome units. Treatment and control are assigned at the level of diversion units, and outcomes are measured at the level of outcome units. In a traditional randomized experiment, diversion units and outcome units are one and the same; in a bipartite randomized experiment, they are distinct. An outcome unit does not receive any treatment directly. Instead, its observed outcome is determined by the assignment of the diversion units to treatment and control. This dependence is often represented by a bipartite graph\u2014hence the name\u2014linking diversion units to outcome units.\nUsing potential outcomes notation (Rubin, 2005), let Z \u2208 {0, 1}^M be the assignment vector of the M diversion units to treatment (Zj = 1) or control (Zj = 0) and Y : {0, 1}^M \u2192 R^N be the potential outcomes of the N outcome units. For each outcome unit i, we define its exposure set Ei \u2286 [1, M] as the subset of diversion units such that Yi depends only on ZEi = {Zj : j \u2208 Ei}. In other words, an outcome unit i\u2019s expected outcome is entirely determined by the assignment to treatment and control of its exposure set Ei. In the bipartite graph representation, Ei is the set of neighboring diversion units of outcome unit i.\n\n2\n\n\f[Figure: the four bipartite settings (SUTVA; clusters with SUTVA; no interference; our setting), each panel linking diversion units to outcome units]\n
Similarly, we define the influence set Ij of a diversion unit j as the neighboring outcome units of unit j in the bipartite graph.\nRecall the definition of the stable unit treatment value assumption (SUTVA) (Imbens and Rubin, 2015), also known as the individualistic treatment response (Manski, 2013), in the traditional setting:\n\nThe stable unit treatment value assumption (SUTVA) holds if (i) there is only one form of treatment and one form of control, and (ii) a unit\u2019s outcome is unaffected by the assignment of other units to treatment or control.\n\nWe distinguish four cases in the bipartite setting:\n\n1. All exposure sets Ei and all influence sets Ij are singletons: the mapping of outcome units to diversion units is one-to-one, known as a perfect matching in the graph algorithm literature. While the diversion and outcome units remain distinct physically, we can consider each matched pair as the units of interest for the purpose of the analysis, for which SUTVA holds, and the traditional causal framework applies straightforwardly.\n\n2. All exposure sets Ei are singletons: the mapping of outcome units to diversion units is many-to-one. There is only one form of treatment and control for each outcome unit. This corresponds to Fig. 1.a of Zigler and Papadogeorgou (2018), which they call the clusters with SUTVA setting. When considering the disconnected components of the bipartite graph as the causal units of interest, cluster-based analysis methods (Eckles et al., 2017) apply.\n\n3. All influence sets Ij are singletons: the mapping of diversion units to outcome units is one-to-many. Because all exposure sets are disjoint from one another, we say there is no interference between outcome units. When assigning the diversion units to treatment and control grouped by influence sets, SUTVA holds and the traditional causal framework applies.\n\n4. 
In the most general setting, the mapping of diversion units to outcome units is many-to-many, and no trivial grouping of diversion and outcome units exists.\n\nZigler and Papadogeorgou (2018) consider a particular setting of partial interference, where the bipartite graph can be broken up into multiple connected components. In this paper, we consider the most general setting for which no immediate reduction to the standard causal setting exists.\n\n2.2 The linear treatment exposure assumption\n\nIn contrast to traditional randomized experiments where SUTVA holds and each unit only has two potential outcomes, bipartite randomized experiments are more difficult to analyze due to the large number of potential outcomes per outcome unit: 2^|Ei| for outcome unit i \u2208 [1, N]. In particular, experimenters may wish to define estimands from potential outcomes that are very rarely observed. Consider, for example, the following natural extension of the average treatment effect estimand \u03c4 to the bipartite setting,\n\n\u03c4 = 1/N \u03a3_{i=1}^{N} [Yi(ZEi = 1) \u2212 Yi(ZEi = 0)]    (1)\n\nwhere 1 (resp. 0) denotes the all-ones (resp. all-zeros) assignment of the exposure set. If there exists an outcome unit i for which Pr(ZEi = 0) = 0 or Pr(ZEi = 1) = 0, then the average treatment effect \u03c4 is not identifiable without additional assumptions on the structure of potential outcomes.\n\n3\n\n\fA similar problem is encountered in the non-bipartite setting when interference is present. This causal literature has focused on restrictions of the space of potential outcomes in order to make precise inference possible. For example, the anonymous interactions assumption (Manski, 2013) states that a unit\u2019s outcome is unchanged for any permutation of treatment assignments of its direct neighbors in an interference graph. 
Another popular model (Hudgens and Halloran, 2008; Toulis and Kao, 2013; Pouget-Abadie et al., 2018) assumes each unit is exposed to a direct effect (depending on its treatment assignment) and an indirect effect (depending on the proportion of treated neighboring units in an interference graph). Zigler and Papadogeorgou (2018) apply this outcome model to the bipartite setting, decomposing outcomes into the sum of a direct effect\u2014the result of treating the diversion unit \u2018associated\u2019 with the outcome unit\u2014and an indirect effect, proportional to the number of treated units in its exposure set.\nWe consider a slight variation of their model, also studied in Toulis and Kao (2013) in the non-bipartite setting with interference, by assuming that an outcome unit\u2019s outcome is determined by a weighted proportion of the treated diversion units in its exposure set. In other words, let wij \u2208 R be the weight of the edge between diversion unit j and outcome unit i, such that under the linear treatment exposure model,\n\n\u2200i \u2208 [1, N], Yi(Z) = Yi(ZEi) = Yi(ei(Z)), where ei(Z) = \u03a3_{j=1}^{M} wij Zj\n\nWe call ei the treatment exposure of outcome unit i under assignment vector Z. The linear treatment exposure model reduces the dependence of potential outcomes on Z to a single scalar ei(Z) \u2208 R+, by assuming that each treated diversion unit contributes an additive treatment exposure to the outcome units in its influence set. 
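For concreteness, the treatment exposures under a given assignment amount to a single matrix-vector product. A minimal sketch (the toy weights and assignment below are our own illustration, with rows normalized to sum to 1 as in the normalized exposure assumption discussed later):

```python
import numpy as np

# Toy bipartite graph: 3 outcome units (rows), 4 diversion units (columns).
# w[i, j] is the weight of the edge between diversion unit j and outcome
# unit i, with each row normalized to sum to 1.
w = np.array([[0.5, 0.5, 0.0, 0.0],
              [0.0, 0.5, 0.5, 0.0],
              [0.0, 0.0, 0.5, 0.5]])

Z = np.array([1, 0, 1, 0])   # assignment of the 4 diversion units
e = w @ Z                    # e_i(Z) = sum_j w_ij * Z_j
# Here e == [0.5, 0.5, 0.5]: each outcome unit has half its exposure set treated.
```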
Under the linear treatment exposure assumption, the analysis of bipartite experiments is very similar to the dose-response analysis commonly found in health and educational studies (Hong and Raudenbush, 2005; Moodie and Stephens, 2012; Kluve et al., 2012), which extends the causal inference literature to continuous treatment values.\n\n2.3 Estimands and inference\n\nAnalogously to the dose-response literature, there are several estimands that may interest an experimenter analyzing a bipartite experiment under the linear treatment exposure assumption. Examples of such estimands are the average-exposure-response function \u00b5 : e \u21a6 E[Yi(e)] or the effect of increasing exposure by a set amount \u00b5(e0 + \u2202e) \u2212 \u00b5(e0). Some experimenters may even wish to make a normalized exposure assumption: \u2200i, \u03a3_j wij = 1, such that ei(Z) \u2208 [0, 1], \u2200i, Z. Under such an assumption, the extension of the commonly-used average treatment effect to the bipartite setting in Eq. 1 can be neatly rewritten\n\n\u03c4 = 1/N \u03a3_i [Yi(ei = 1) \u2212 Yi(ei = 0)] = \u00b5(1) \u2212 \u00b5(0).    (2)\n\nA powerful framework for inference on causal estimands is model-based imputation (Imbens and Rubin, 2015; Galagate, 2016): covariates Xi are collected for each outcome unit, and a parametric outcome model is assumed, Yi(Z) = f(ei(Z), Xi), for an appropriately chosen function f. Once an approximating function \u02c6f of f is learned from the data, the estimator \u02c6\u00b5(e) = 1/N \u03a3_i \u02c6f(e, Xi) can be used. For example, a simple causal-regression-based procedure to estimate the extended average treatment effect \u03c4 of Eq. 1 is:\n\nStep 1 Find \u02c6\u03b1, \u02c6\u03b2 such that \u03a3_{i=1}^{N} (Yi \u2212 \u03b1ei \u2212 \u03b2^T Xi)^2 is minimized.\n\nStep 2 Return \u02c6\u03c4 = 1/N \u03a3_{i=1}^{N} (\u02c6\u03b1 + \u02c6\u03b2^T Xi) \u2212 1/N \u03a3_{i=1}^{N} \u02c6\u03b2^T Xi = \u02c6\u03b1.\n\nThis method can be sensitive to model misspecification. In particular, treatment exposures are not identically distributed and thus realized treatment exposures may be correlated with potential outcomes. For a Bernoulli randomized design that assigns diversion units to treatment independently with probability p, the treatment exposure of outcome unit i has expectation EZ[ei(Z)] = p \u03a3_j wij and VarZ[ei(Z)] = p(1 \u2212 p) \u03a3_j w^2_ij, such that even under a normalized exposure assumption, the distribution of treatment exposures depends on the bipartite graph.\n\n4\n\n\fFrom the dose-response literature, Imai and Van Dyk (2004) and Hirano and Imbens (2004) suggest including generalized propensity scores as covariates in the regression or stratifying on these scores for provably-consistent inference. Zigler and Papadogeorgou (2018) rely on inverse-probability-of-treatment-weighted estimators, which are also provably consistent under a partial interference assumption. These methods may not necessarily account for the correlation between two outcome units\u2019 exposures if their exposure sets are not disjoint. Further work is needed beyond the scope of this paper. See the appendix for more details.\n\n3 Clustering for bipartite experiments\n\n3.1 A new clustering objective\n\nAs seen in the previous section, the analysis of bipartite randomized experiments is strongly tied to the treatment exposure distribution received by outcome units. 
This suggests that tuning the statistical properties of this distribution might allow us to substantially improve the statistical inferences supported by the experiment. This section proposes such an optimization.\nWhether the objective is to learn the average-exposure-response function or the average treatment effect estimand in Eq. 2, inferences should intuitively benefit from observing as wide a range of treatment exposures as possible across all outcome units, rather than a concentrated set of values around their expectation. One way to increase this range is to choose a design that maximizes the empirical variance of the treatment exposure vector. Cluster randomized designs, which assign treatment and control to groups of units, are a particularly popular class of designs because they do not rely too strongly on any single model-based estimator, and yet have been shown to reduce variance and interference-based bias under certain conditions (Ugander et al., 2013; Gui et al., 2015; Eckles et al., 2017; Saveski et al., 2017).\nWe propose to choose the clustering of diversion units that maximizes the empirical variance of the treatment exposure vector e(Z) = {ei(Z)}i, defined as 1/N (e \u2212 \u0113)^T (e \u2212 \u0113). Since this empirical variance is a random variable, which we would like maximized for as many values of the treatment assignment vector Z as possible, we consider maximizing its expectation across treatment assignments as a clustering objective for diversion units:\n\n\u2206 = EZ[1/N (e \u2212 \u0113)^T (e \u2212 \u0113)] = EZ[1/N \u03a3_{i=1}^{N} (ei(Z) \u2212 \u0113(Z))^2]    (3)\n\nAn easier objective could have been to maximize the sum of individual treatment exposure variances over all outcome units. 
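The contrast between the two objectives can be seen numerically. In this toy sketch (our own illustration, not from the paper), a single all-encompassing cluster reaches only two exposure vectors, while singleton clusters reach many:

```python
import itertools
import numpy as np

W = np.array([[0.5, 0.5, 0.0],
              [0.0, 0.5, 0.5]])   # 2 outcome units, 3 diversion units

def exposure_vectors(clusters):
    """All exposure vectors e(Z) = W Z reachable under cluster randomization,
    where every diversion unit in a cluster shares one treatment bit."""
    labels = sorted(set(clusters))
    vecs = set()
    for bits in itertools.product([0, 1], repeat=len(labels)):
        assign = dict(zip(labels, bits))
        Z = np.array([assign[c] for c in clusters], dtype=float)
        vecs.add(tuple(W @ Z))
    return vecs

# One big cluster: only e(Z=0) and e(Z=1) are ever observed.
assert len(exposure_vectors([0, 0, 0])) == 2
# Singleton clusters: many distinct exposure vectors become observable.
assert len(exposure_vectors([0, 1, 2])) == 7
```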
It is easy to see why this strategy fails in non-trivial bipartite graphs: assigning all diversion units to the same cluster maximizes the individual variances of each outcome unit\u2019s treatment exposure, but results in only two possible treatment exposure vectors, e(Z = 0) = 0 and e(Z = 1), which provides no useful basis for making causal claims.\nWe now provide some justifications for the empirical variance objective \u2206 in Eq. 3.\nProposition 1. Let M be the number of diversion units and p the probability of assigning a diversion unit to treatment: \u2206 \u2264 p(M \u2212 1)/M + p^2/M. If this upper-bound is met, then there exists an unbiased estimator of the average treatment effect.\n\nThus, in the unlikely scenario that the empirical variance maximization objective achieves the upper-bound in Proposition 1, we can obtain an unbiased estimate of the average treatment effect from a resulting cluster randomized assignment. By invoking Cram\u00e9r-Rao, we show that the variance maximization objective can be interpreted as an information-theoretic lower bound on the variance of certain estimators.\nProposition 2. Suppose an outcome unit i\u2019s response is linear in its treatment exposure ei and a set of covariates Xi \u2208 R^d: Yi(Z) = \u03b1ei(Z) + \u03b2^T Xi + \u03b5i, where \u03b1 \u2208 R, \u03b2 \u2208 R^d, \u03b5 \u223c N(0, \u03c3^2) for \u03c3^2 \u2208 R+, independent of Z. 
For any unbiased estimator \u02c6\u03c4 of the average treatment effect, VarZ,\u03b5[\u02c6\u03c4] \u2265 \u03c3^2/\u2206, and there is equality for the ordinary least squares estimator of \u03b1 as an unbiased estimator for the average treatment effect.\n\nIn other words, if outcomes are linear in the treatment exposure with Gaussian noise, our clustering objective minimizes a lower-bound on the variance of estimators of the average treatment effect.\n\n5\n\n\f3.2 A reduction to correlation clustering\n\nTheorem 1 establishes the reduction of our suggested empirical variance maximization objective in Eq. 3 to a well-known problem in graph theory.\nTheorem 1. For diversion unit j, consider the vector w\u00b7j = {wij} of length N, where wij is the weight of the edge between diversion unit j and outcome unit i. Construct a diversion-unit-only graph, such that for each diversion unit pair (j, k), its edge weight is Wjk = \u27e8w\u00b7j, w\u00b7k\u27e9 \u2212 1/N \u27e8w\u00b7j, 1\u27e9\u27e8w\u00b7k, 1\u27e9. Let W^+_jk = max(0, Wjk) and W^\u2212_jk = min(0, Wjk) be the positive and negative edges of the diversion-unit-only graph. For a clustering {C} of the diversion units, the variance-maximization objective can be rewritten as\n\n\u2206 = \u03b1 + \u03b2 (\u03a3_C \u03a3_{j,k\u2208C} W^+_jk \u2212 \u03a3_{C\u2260C\u2032} \u03a3_{j\u2208C\u2032,k\u2208C} W^\u2212_jk)    (4)\n\nwhere \u03b1 = p^2 \u2212 \u03b2 \u03a3_{j,k} W^\u2212_jk and \u03b2 = p(1 \u2212 p)/N are constants with respect to the clustering.\n\nWe wish to maximize \u2206 in Eq. 4 as a function of the clustering {C} of the diversion units. 
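As a sanity check (our own, on a toy graph), the clustering-dependent part of Theorem 1 can be verified numerically: the difference in \u2206 between any two clusterings equals \u03b2 times the difference in total within-cluster folded weight, so the constant \u03b1 plays no role in the optimization.

```python
import itertools
import numpy as np

def empirical_variance_objective(W, clusters, p=0.5):
    """Exact Delta = E_Z[(1/N) sum_i (e_i - ebar)^2] under cluster
    randomization, by enumerating every treatment assignment of the clusters."""
    labels = sorted(set(clusters))
    delta = 0.0
    for bits in itertools.product([0, 1], repeat=len(labels)):
        assign = dict(zip(labels, bits))
        Z = np.array([assign[c] for c in clusters], dtype=float)
        prob = np.prod([p if b else 1 - p for b in bits])
        delta += prob * (W @ Z).var()   # var() is (1/N) sum_i (e_i - ebar)^2
    return delta

def within_cluster_score(W, clusters, p=0.5):
    """beta times the total within-cluster folded weight: the
    clustering-dependent part of the Theorem 1 objective."""
    N, M = W.shape
    node_w = W.sum(axis=0)                       # <w_j, 1> per diversion unit
    W_fold = W.T @ W - np.outer(node_w, node_w) / N
    beta = p * (1 - p) / N
    within = sum(W_fold[j, k] for j in range(M) for k in range(M)
                 if clusters[j] == clusters[k])
    return beta * within

rng = np.random.default_rng(0)
W = rng.random((5, 6))            # 5 outcome units, 6 diversion units
c1 = [0, 0, 1, 1, 2, 2]           # two candidate clusterings of the 6 units
c2 = [0, 1, 0, 1, 0, 1]
lhs = empirical_variance_objective(W, c1) - empirical_variance_objective(W, c2)
rhs = within_cluster_score(W, c1) - within_cluster_score(W, c2)
assert abs(lhs - rhs) < 1e-9      # alpha cancels out in the difference
```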
Up to the additive and multiplicative constants \u03b1 and \u03b2, maximizing \u2206 is the maximizing-agreement formulation of the correlation clustering problem.\nIn contrast with other popular graph partitioning objectives (k-means, k-center, or balanced partitioning), correlation clustering does not specify the number of clusters explicitly. Rather, it seeks to maximize the total weight of positive edges within clusters minus the total weight of negative edges across clusters, without specifying a fixed number of clusters. In particular, this saves the experimenter the trouble of tuning the number of clusters explicitly, a common hyper-parameter optimization problem for cluster-based randomized experiments. Like many other constrained graph clustering problems, correlation clustering is NP-hard (Bansal et al., 2002; Charikar et al., 2003; Ailon et al., 2008) and even hard to approximate (Charikar et al., 2003; Demaine et al., 2006). Aiming to solve this clustering problem on large data sets, and inspired by previously studied heuristic algorithms for this problem (Elsner and Schudy, 2009), we propose and apply a scalable heuristic clustering algorithm for efficiently maximizing \u2206.\n\n3.3 Algorithm\n\nTo produce a clustering of diversion units, our algorithm proceeds in two steps. We first construct the diversion-unit-only graph with edge weights {Wjk} as given in Theorem 1, a step we call folding. In a second stage, we apply a scalable heuristic for our correlation clustering problem.\nThe diversion unit graph {Wjk} from Theorem 1 is a complete graph due to the subtracted term \u27e8w\u00b7j, 1\u27e9\u27e8w\u00b7k, 1\u27e9. 
To avoid \u0398(M^2) space usage, we represent this term implicitly by constructing a sparse graph with edge weights \u27e8w\u00b7j, w\u00b7k\u27e9 and node weights \u27e8w\u00b7j, 1\u27e9. The implicit edge weight Wjk equals the explicitly specified edge weight \u27e8w\u00b7j, w\u00b7k\u27e9 minus 1/N times the product of the weights of the two incident nodes, \u27e8w\u00b7j, 1\u27e9 and \u27e8w\u00b7k, 1\u27e9. For large and sparse graphs, with more than 1 million units, we can use sketching to minimize the O(M^2) number of necessary explicit edge computations. In the dataset we evaluate on, we used a weighted MinHash implementation (Ioffe, 2010).\nNext we discuss our scalable heuristic algorithm for correlation clustering. While correlation clustering is well studied in different settings (Bansal et al., 2002; Charikar et al., 2003; Ailon et al., 2008), some provable approximation algorithms only apply to special cases of the problem (Bansal et al., 2002; Ailon et al., 2008). Other logarithmic approximation algorithms for general graphs are based on solving a linear programming relaxation (Charikar et al., 2003; Demaine et al., 2006), which is hard to scale to larger data sets. Among scalable heuristic algorithms for this problem, local search algorithms have been shown to perform well in practice (Elsner and Schudy, 2009). 
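At toy scale, the folding step can be materialized directly. A sketch of ours of the implicit representation described above (dense here purely for clarity):

```python
import numpy as np

def fold(W):
    """Fold an N x M bipartite weight matrix into the diversion-unit-only
    graph of Theorem 1: F[j, k] = <w_j, w_k> - (1/N) <w_j, 1> <w_k, 1>.
    At scale, one would store only the sparse dot products as edge weights
    and <w_j, 1> as node weights, reconstructing F[j, k] on the fly."""
    N = W.shape[0]
    node_w = W.sum(axis=0)               # <w_j, 1> for each diversion unit j
    return W.T @ W - np.outer(node_w, node_w) / N

# 2 outcome units (rows), 3 diversion units (columns)
W = np.array([[1.0, 1.0, 0.0],
              [0.0, 1.0, 1.0]])
F = fold(W)
# F is symmetric, and its entries sum to sum_i s_i^2 - S^2/N,
# where s_i are the outcome units' row sums and S their total.
```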
Inspired by these results, we apply the following local search algorithm: start from singleton clusters, and apply the following local search operations until the amount of improvement of the objective is below a threshold\u2014hinting at the convergence of the local search procedure.\n\n\u2022 Move one node from one cluster to another cluster if the objective function improves, and\n\u2022 Merge two existing clusters if the objective function improves.\n\n6\n\n\fIn practice, we maintain three data structures: a map from each vertex to its respective cluster, a map from each cluster to the vertices in it, and a map from each cluster to the total weight of the nodes in it. Using the above data structures, for any given vertex, we can compute the best cluster to move it to in time linear in its degree. Likewise, for any given cluster, we can compute the best cluster to merge with in time linear in the number of edges incident to its vertices.\n\n4 Simulation study\n\n4.1 The Amazon user-item review graph\n\nIn cases where interference between certain units is present, bipartite randomized experiments can reduce bias and variance by choosing different units of analysis than the units randomly assigned to treatment. This is particularly clear in marketplace experiments, where competition from sellers often violates SUTVA. Consider, for example, the case of a randomized experiment to estimate the effect on purchases of discounting certain products in an online marketplace like Amazon. Because assigning treatment at random at the user-level may result in different users seeing different prices or incentives for the same item, randomizing on items may be preferable. Experimenters can choose to measure outcomes at the item-level as standard causal inference theory would suggest, but they can also choose to measure outcomes at the user-level as suggested by the bipartite randomized framework presented here. 
Doing so may help avoid complex interference mechanisms.\nTo evaluate our clustering algorithm, we choose a publicly-available bipartite graph\ndataset (McAuley et al., 2015; He and McAuley, 2016), where each edge in the dataset corresponds\nto a user review of a product on Amazon, totaling 83M reviews made by 121k users on 9.8M items\nin the graph. We discard any users having made fewer than 100 reviews from this bipartite graph,\nsuch that we can infer realistic treatment exposures that the remaining users might receive during a\nrandomized experiment on items. The resulting unweighted graph has 7.7M edges between 2.4M\nitems (randomization units) and 34k users (outcome units). When simulating a bipartite randomized\nexperiment on this graph, we will use items as the randomization units we seek to cluster and users\nas the outcome units.\n\n4.2 Clustering baselines\n\nClustering for causal inference has previously been studied in non-bipartite settings when interfer-\nence is present. Interference is often represented by a graph in which units are linked by an edge\nwhen their potential outcomes depend on each other\u2019s treatment assignment. To optimize the bias\nand variance of common estimators, the solution suggested in the interference literature is to run\ncluster randomized designs with clusters of approximately equal size that minimize the number of\nedges cut in the interference graph (Eckles et al., 2017; Pouget-Abadie et al., 2018). Finding these\nclusters is an NP-hard problem, known as balanced partitioning, for which several heuristics have\nbeen suggested (Karypis and Kumar, 1998b; Delling et al., 2012; Stanton and Kliot, 2012; Ugander\nand Backstrom, 2013; Tsourakakis et al., 2014; Aydin et al., 2016). 
Consequently, an important\nbaseline is to apply state-of-the-art balanced partitioning algorithms to our setting.\nTo produce clusters of diversion units, it is possible to run balanced partitioning on the original bi-\npartite graph by ignoring the distinction between diversion units and outcome units, and removing\nall outcome units from the resulting clusters in a second stage. Balance can be enforced on both\nitems and users, separately or together, of the bipartite graph, by setting the relevant node weights\nto 1 and all other node weights to 0. Another way to produce clusters of diversion units is to run\nbalanced partitioning directly on the folded diversion unit graph we construct in the \ufb01rst stage of our\nalgorithm, thus replacing the second phase of our algorithm. Each run of a balanced partitioning al-\ngorithm requires the experimenter to specify the desired number of clusters. Because our suggested\ncorrelation-clustering algorithm produced 86 clusters exactly, we run the balanced partitioning base-\nlines for different cluster cardinalities K \u2208 {50, 500, 2000, 10000} and pick the best one according\nto the mean-squared error. For their scalability, we chose to implement and compare the balanced\npartitioning algorithms suggested in (Karypis and Kumar, 1998a; Aydin et al., 2016), referred to as\nMETIS and LINE respectively.\nAs a set of weak baselines, we also include clusterings of cardinality K \u2208 {50, 500, 2000, 10000}\nwhere each diversion unit is placed in a cluster at random.\n\n7\n\n\f(a)\n\n(b)\n\nFigure 1: (a) Standard deviation of the observed treatment exposure vector, averaged across assign-\nments Z. \u201cbipartite\u201d indicates the algorithm was applied to the original bipartite user-item graph;\n\u201cfolded\u201d that it was applied to the folded graph from Theorem 1; \u201cuser-balanced\u201d (resp. \u201ditem-\nbalanced\u201d) that it enforced balance only on the user (resp. item) side of the bipartite graph. 
Correlation clustering does not have a cluster cardinality hyper-parameter; it produced 86 clusters. (b) Mean squared error (MSE) of the estimator τ̂ from Step 2 of Section 2.3 for estimating the average treatment effect, for varying levels of noise σ. The best performing cluster cardinality K ∈ {50, 500, 2000, 10000} was chosen for each baseline in this plot. We include the 95% confidence interval of our estimate of the MSE, obtained by sampling, for the correlation clustering algorithm, all other confidence intervals being of similar size.

4.3 Results

Because the most common A/B tests used to compare two versions of an app or an online service rarely assign treatment beyond a small subset of the general cohort, we simulate a randomized experiment on this cohort of users and items by randomly assigning 10% of item-clusters to receive a simulated ‘treatment’. We compute each user’s exposure e_i to treatment as the proportion of treated items they have reviewed historically, as determined by the bipartite graph we considered in Section 4.1.

In Figure 1.a, we report the standard deviation of the observed treatment exposure vector (the square root of our optimization objective), averaged across random assignments. The correlation clustering method clearly outperforms all other baselines despite hyper-parameter tuning of the number of clusters, achieving almost 70% of the highest possible value for this objective (√(.1 × .9) = .3). All error bars, estimated by bootstrap, were sufficiently small to not be reported.

Furthermore, in Figure 1.b, we evaluate experimental power for each clustering, after hyperparameter optimization for all baselines, by assuming a model of potential outcomes.
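To make the exposure computation and the Figure 1.a objective concrete, here is a minimal sketch on a toy bipartite graph. The edges, the two-cluster item assignment, and all function names below are illustrative assumptions for this sketch, not the paper's data or code:

```python
import random
from collections import defaultdict

# Toy stand-in for the user-item review graph: (item, user) edges.
edges = [(0, "u1"), (1, "u1"), (1, "u2"), (2, "u2"), (2, "u3"), (3, "u3")]
item_cluster = {0: 0, 1: 0, 2: 1, 3: 1}  # items 0-3 in two item-clusters
clusters = sorted(set(item_cluster.values()))

def user_exposures(treated_clusters):
    """Exposure e_i: fraction of a user's reviewed items that are treated."""
    n_reviews, n_treated = defaultdict(int), defaultdict(int)
    for item, user in edges:
        n_reviews[user] += 1
        if item_cluster[item] in treated_clusters:
            n_treated[user] += 1
    return {u: n_treated[u] / n_reviews[u] for u in n_reviews}

def avg_exposure_std(p=0.1, n_draws=2000, seed=0):
    """Std. dev. of the exposure vector, averaged over assignments Z that
    treat each item-cluster independently with probability p."""
    rng = random.Random(seed)
    total = 0.0
    for _ in range(n_draws):
        treated = {c for c in clusters if rng.random() < p}
        e = list(user_exposures(treated).values())
        mean = sum(e) / len(e)
        total += (sum((x - mean) ** 2 for x in e) / len(e)) ** 0.5
    return total / n_draws
```

A clustering that concentrates each user's reviews inside a single item-cluster pushes every exposure toward 0 or 1, which drives this averaged standard deviation toward its maximum of √(p(1 − p)); this is the quantity the clustering objective seeks to maximize, whereas balance constraints can work against it.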
Sampling outcomes directly proportional to the treatment exposure received and using a causal regression analysis to determine the average treatment effect proved too easy a task. Instead, we evaluate experimental power on a misspecified model: we suppose the following linear outcome model: Y_i ∼ √e_i + σε_i, where ε_i ∼ N(0, 1), and use τ̂ given in Step 2 of Section 2.3 as an estimator of the average treatment effect τ (= 1) given in Eq. 1. We report the mean-squared error (MSE) of our estimator τ̂, defined as E_{Z,ε}[(τ̂ − τ)²], for different levels of noise σ. In the shaded area, we report the error bars of our correlation clustering algorithm, estimated by bootstrap. All other error bars are of similar magnitude and are not included in the figure for the sake of clarity. In low to medium noise settings (σ ≤ 5), our suggested clustering achieves a significantly lower mean-squared error than the other baselines.

Acknowledgments

The authors would like to thank Daniel Sabanés Bové and Michele Borassi for their helpful advice. We would also like to thank the anonymous reviewers for their feedback and suggestions to improve the paper.

References

Nir Ailon, Moses Charikar, and Alantha Newman. 2008. Aggregating inconsistent information: Ranking and clustering. J.
ACM 55, 5 (2008), 23:1–23:27.

Kevin Aydin, MohammadHossein Bateni, and Vahab Mirrokni. 2016. Distributed balanced partitioning via linear embedding. In WSDM. ACM, 387–396.

Eytan Bakshy, Dean Eckles, and Michael S Bernstein. 2014. Designing and deploying online field experiments. In Proceedings of the 23rd International Conference on World Wide Web. ACM, 283–292.

Nikhil Bansal, Avrim Blum, and Shuchi Chawla. 2002. Correlation Clustering. In 43rd Symposium on Foundations of Computer Science (FOCS 2002), 16-19 November 2002, Vancouver, BC, Canada, Proceedings. 238. https://doi.org/10.1109/SFCS.2002.1181947

Guillaume Basse and Avi Feller. 2018. Analyzing two-stage experiments in the presence of interference. J. Amer. Statist. Assoc. 113, 521 (2018), 41–55.

Moses Charikar, Venkatesan Guruswami, and Anthony Wirth. 2003. Clustering with Qualitative Information. In 44th Symposium on Foundations of Computer Science (FOCS 2003), 11-14 October 2003, Cambridge, MA, USA, Proceedings. 524–533. https://doi.org/10.1109/SFCS.2003.1238225

Shuchi Chawla, Jason Hartline, and Denis Nekipelov. 2016. A/B testing of auctions. In Proceedings of the 2016 ACM Conference on Economics and Computation. ACM, 19–20.

Daniel Delling, Andrew V Goldberg, Ilya Razenshteyn, and Renato F Werneck. 2012. Exact combinatorial branch-and-bound for graph bisection. In 2012 Proceedings of the Fourteenth Workshop on Algorithm Engineering and Experiments (ALENEX). SIAM, 30–44.

Erik D. Demaine, Dotan Emanuel, Amos Fiat, and Nicole Immorlica. 2006. Correlation clustering in general weighted graphs. Theor. Comput. Sci. 361, 2-3 (2006), 172–187.

Dean Eckles, Brian Karrer, and Johan Ugander. 2017. Design and analysis of experiments in networks: Reducing bias from interference. Journal of Causal Inference 5, 1 (2017).

Micha Elsner and Warren Schudy. 2009.
Bounding and comparing methods for correlation clustering beyond ILP. In Proceedings of the Workshop on Integer Linear Programming for Natural Language Processing. Association for Computational Linguistics, 19–27.

Andrey Fradkin, Elena Grewal, and David Holtz. 2018. The determinants of online review informativeness: Evidence from field experiments on Airbnb. (2018).

Douglas Galagate. 2016. Causal inference with a continuous treatment and outcome: alternative estimators for parametric dose-response functions with applications. Ph.D. Dissertation.

Alexandre Gilotte, Clément Calauzènes, Thomas Nedelec, Alexandre Abraham, and Simon Dollé. 2018. Offline A/B testing for recommender systems. In Proceedings of the Eleventh ACM International Conference on Web Search and Data Mining. ACM, 198–206.

Huan Gui, Ya Xu, Anmol Bhasin, and Jiawei Han. 2015. Network A/B testing: From sampling to estimation. In Proceedings of the 24th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 399–409.

Ruining He and Julian McAuley. 2016. Ups and downs: Modeling the visual evolution of fashion trends with one-class collaborative filtering. In Proceedings of the 25th International Conference on World Wide Web. International World Wide Web Conferences Steering Committee, 507–517.

Keisuke Hirano and Guido W Imbens. 2004. The propensity score with continuous treatments. Applied Bayesian modeling and causal inference from incomplete-data perspectives 226164 (2004), 73–84.

Guanglei Hong and Stephen W Raudenbush. 2005. Effects of kindergarten retention policy on children's cognitive growth in reading and mathematics. Educational Evaluation and Policy Analysis 27, 3 (2005), 205–224.

Michael G Hudgens and M Elizabeth Halloran. 2008. Toward causal inference with interference. J. Amer. Statist. Assoc.
103, 482 (2008), 832–842.

Kosuke Imai and David A Van Dyk. 2004. Causal inference with general treatment regimes: Generalizing the propensity score. J. Amer. Statist. Assoc. 99, 467 (2004), 854–866.

Guido W Imbens and Donald B Rubin. 2015. Causal Inference in Statistics, Social, and Biomedical Sciences. Cambridge University Press.

Sergey Ioffe. 2010. Improved consistent sampling, weighted minhash and L1 sketching. In 2010 IEEE International Conference on Data Mining. IEEE, 246–255.

George Karypis and Vipin Kumar. 1998a. A fast and high quality multilevel scheme for partitioning irregular graphs. SIAM Journal on Scientific Computing 20, 1 (1998), 359–392.

George Karypis and Vipin Kumar. 1998b. Multilevel k-way partitioning scheme for irregular graphs. Journal of Parallel and Distributed Computing 48, 1 (1998), 96–129.

Jochen Kluve, Hilmar Schneider, Arne Uhlendorff, and Zhong Zhao. 2012. Evaluating continuous training programmes by using the generalized propensity score. Journal of the Royal Statistical Society: Series A (Statistics in Society) 175, 2 (2012), 587–617.

Charles F Manski. 2013. Identification of treatment response with social interactions. The Econometrics Journal 16, 1 (2013), S1–S23.

Julian McAuley, Christopher Targett, Qinfeng Shi, and Anton Van Den Hengel. 2015. Image-based recommendations on styles and substitutes. In Proceedings of the 38th International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, 43–52.

Erica EM Moodie and David A Stephens. 2012. Estimation of dose–response functions for longitudinal data using the generalised propensity score. Statistical Methods in Medical Research 21, 2 (2012), 149–166.

Victoria Pokhilko, Qiong Zhang, Lulu Kang, et al. 2019. D-optimal Design for Network A/B Testing. arXiv preprint arXiv:1902.00482 (2019).

Jean Pouget-Abadie, Vahab Mirrokni, David C.
Parkes, and Edoardo M. Airoldi. 2018. Optimizing Cluster-based Randomized Experiments Under Monotonicity. KDD (2018), 2090–2099.

Stephen W Raudenbush. 1997. Statistical analysis and optimal design for cluster randomized trials. Psychological Methods 2, 2 (1997), 173.

Donald B Rubin. 2005. Causal inference using potential outcomes: Design, modeling, decisions. J. Amer. Statist. Assoc. 100, 469 (2005), 322–331.

Martin Saveski, Jean Pouget-Abadie, Guillaume Saint-Jacques, Weitao Duan, Souvik Ghosh, Ya Xu, and Edoardo M Airoldi. 2017. Detecting network effects: Randomizing over randomized experiments. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1027–1035.

Holly B Shakya, Derek Stafford, D Alex Hughes, Thomas Keegan, Rennie Negron, Jai Broome, Mark McKnight, Liza Nicoll, Jennifer Nelson, Emma Iriarte, et al. 2017. Exploiting social influence to magnify population-level behaviour change in maternal and child health: study protocol for a randomised controlled trial of network targeting algorithms in rural Honduras. BMJ Open 7, 3 (2017), e012996.

Isabelle Stanton and Gabriel Kliot. 2012. Streaming graph partitioning for large distributed graphs. In Proceedings of the 18th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM, 1222–1230.

Eric J Tchetgen Tchetgen and Tyler J VanderWeele. 2012. On causal inference in the presence of interference. Statistical Methods in Medical Research 21, 1 (2012), 55–75.

Panos Toulis and Edward Kao. 2013. Estimation of causal peer influence effects. ICML (2013), 1489–1497.

Charalampos Tsourakakis, Christos Gkantsidis, Bozidar Radunovic, and Milan Vojnovic. 2014. Fennel: Streaming graph partitioning for massive scale graphs. In WSDM. ACM, 333–342.

Johan Ugander and Lars Backstrom. 2013. Balanced label propagation for partitioning massive graphs.
In Proceedings of the Sixth ACM International Conference on Web Search and Data Mining. ACM, 507–516.

Johan Ugander, Brian Karrer, Lars Backstrom, and Jon Kleinberg. 2013. Graph cluster randomization: Network exposure to multiple universes. KDD (2013), 329–337.

Corwin M Zigler and Georgia Papadogeorgou. 2018. Bipartite Causal Inference with Interference. arXiv preprint arXiv:1807.08660 (2018).