{"title": "Covariate-Powered Empirical Bayes Estimation", "book": "Advances in Neural Information Processing Systems", "page_first": 9620, "page_last": 9632, "abstract": "We study methods for simultaneous analysis of many noisy experiments in the presence\nof rich covariate information. The goal of the analyst is to optimally estimate the true effect underlying\neach experiment. Both the noisy experimental results and the auxiliary covariates are useful for this purpose,\nbut neither data source on its own captures all the information available to the analyst.\nIn this paper, we propose a flexible plug-in empirical Bayes estimator that synthesizes both sources of information and may leverage any black-box predictive model. We show that our approach is within a constant factor of minimax for a simple data-generating model.\nFurthermore, we establish robust convergence guarantees for our method that hold under considerable\ngenerality, and exhibit promising empirical performance on both real and simulated data.", "full_text": "Covariate-Powered Empirical Bayes Estimation\n\nNikolaos Ignatiadis\nStatistics Department\nStanford University\n\nignat@stanford.edu\n\nStefan Wager\n\nGraduate School of Business\n\nStanford University\n\nswager@stanford.edu\n\nAbstract\n\nWe study methods for simultaneous analysis of many noisy experiments in the\npresence of rich covariate information. The goal of the analyst is to optimally\nestimate the true effect underlying each experiment. Both the noisy experimental\nresults and the auxiliary covariates are useful for this purpose, but neither data\nsource on its own captures all the information available to the analyst. In this\npaper, we propose a \ufb02exible plug-in empirical Bayes estimator that synthesizes\nboth sources of information and may leverage any black-box predictive model.\nWe show that our approach is within a constant factor of minimax for a simple\ndata-generating model. 
Furthermore, we establish robust convergence guarantees for our method that hold under considerable generality, and exhibit promising empirical performance on both real and simulated data.\n\n1 Introduction\n\nIt is nowadays common for a geneticist to simultaneously study the association of thousands of different genes with a disease [Efron et al., 2001, Lönnstedt and Speed, 2002, Love et al., 2014], for a technology firm to have records from thousands of randomized experiments [McMahan et al., 2013], or for a social scientist to examine data from hundreds of different regions at once [Abadie and Kasy, 2018]. In all of these settings, we are fundamentally interested in learning something about each sample (i.e., gene, experimental intervention, etc.) on its own; however, the abundance of data on other samples can give us useful context with which to interpret our measurements about each individual sample [Efron, 2010, Robbins, 1964]. In this paper, we propose a method for simultaneous analysis of many noisy experiments, and show that it is able to exploit rich covariate information for improved power by leveraging existing machine learning tools geared towards a basic prediction task.\n\nAs a motivation for our statistical setting, suppose we have access to a dataset of movie reviews where each movie i = 1, ..., n has an average rating Zi over a limited number of viewers; we also have access to a number of covariates Xi about the movie (e.g., genre, length, cast, etc.). The task is to estimate the “true” rating µi of the movie, i.e., the average rating had the movie been reviewed by a large number of reviewers similar to the ones who already reviewed it. A first simple approach to estimating µi is to use its observed average rating as a point estimate, i.e., to set ˆµi = Zi. 
This approach is clearly valid for movies where we have enough data for sampling noise to dissipate: e.g., with over 50,000 reviews in the MovieLens 20M data [Harper and Konstan, 2016], we expect the 4.2/5 rating of Pulp Fiction to be quite stable. Conversely, for movies with fewer reviews, this strategy may be unstable: the rating 1.6/5 of Urban Justice is based on fewer than 20 reviews, and appears liable to change as we collect more data. A second alternative would be to rely only on covariates: we could learn to predict average ratings from covariates, m(x) = E[Zi | Xi = x], and then set ˆµi = ˆm(Xi). This may be more appropriate than using the observed mean rating for movies with very few reviews, but is limited in its accuracy if the covariates aren't expressive enough to perfectly capture µi.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFigure 1: Optimal empirical Bayes shrinkage. All three plots show µi and Zi drawn from (1) for various values of A/σ², with the covariate values Xi fixed and the regression curve m(·) shown in blue. The arrows depict how the oracle Bayes denoiser from (2) moves the point estimate ˆµi away from the raw observation Zi and towards m(Xi). a) When A/σ² = 0, the oracle estimator shrinks Zi all the way back to m(Xi). b) For A/σ² = 1, optimal shrinkage uses (Zi + m(Xi))/2 to estimate µi. c) When A/σ² is very large, it is preferable to discard m(Xi) and just use the information in Zi.\n\nWe develop an approach that reconciles (and optimally interpolates between) the two estimation strategies discussed above. 
The starting point for our discussion is the following generative model,\n\nXi ∼ P_X,   µi | Xi ∼ N(m(Xi), A),   Zi | µi ∼ N(µi, σ²),   (1)\n\naccording to which the true rating µi of each movie is partially explained by its covariates Xi, but also has an idiosyncratic and unpredictable component with a Gaussian distribution N(0, A). Recall that we observe Xi and Zi for each i = 1, ..., n, and want to estimate the vector of µi. Given this setting, if we knew both the idiosyncratic noise level A and m(x), the conditional mean of µi given Xi = x, then the mean-square-error-optimal estimate of µi could directly be read off of Bayes' rule, ˆµ∗_i = t∗_{m,A}(Xi, Zi), with\n\nt∗_{m,A}(x, z) := E_{m,A}[µi | Xi = x, Zi = z] = A/(σ² + A) · z + σ²/(σ² + A) · m(x).   (2)\n\nAs shown in Figure 1, the behavior of this shrinker depends largely on the ratio A/σ²: as this ratio gets large, the Bayes rule gets close to just setting ˆµi = Zi, whereas when the ratio is small, it shrinks everything towards predictions made using covariates.\n\nNow, in practice, m(·) and A are unlikely to be known a priori and, furthermore, we may not believe that the hierarchical structure (1) is a perfect description of the underlying data-generating process. The main contribution of this paper is an estimation strategy that addresses these challenges. First, we derive the minimax risk for estimating µi in model (1) in a setting where m(·) is unknown but we are willing to make various regularity assumptions (e.g., that m(·) is Lipschitz). 
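As a concrete aside, model (1) and the oracle rule (2) can be simulated in a few lines. The particular regression function m(·), the constants, and all variable names below are our own illustrative choices, not taken from the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical instance of the hierarchical model (1); m(.), A and sigma^2
# are arbitrary illustrative choices.
n, A, sigma2 = 10_000, 1.0, 4.0
m = lambda x: np.sin(2 * np.pi * x)

X = rng.uniform(size=n)               # X_i ~ P_X
mu = rng.normal(m(X), np.sqrt(A))     # mu_i | X_i ~ N(m(X_i), A)
Z = rng.normal(mu, np.sqrt(sigma2))   # Z_i | mu_i ~ N(mu_i, sigma^2)

# Oracle Bayes denoiser (2): a convex combination of Z_i and m(X_i),
# with weight A / (sigma^2 + A) on the raw observation.
def t_star(x, z):
    w = A / (sigma2 + A)
    return w * z + (1 - w) * m(x)

mse = lambda est: float(np.mean((est - mu) ** 2))
# The oracle risk A * sigma^2 / (A + sigma^2) = 0.8 here, versus roughly
# sigma^2 = 4.0 for Z alone and roughly A = 1.0 for m(X) alone.
print(mse(t_star(X, Z)), mse(Z), mse(m(X)))
```

Note how the single weight w = A/(σ² + A) is what interpolates between the two baseline strategies: w → 1 recovers ˆµi = Zi, while w → 0 recovers ˆµi = m(Xi).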
Second, we show that a feasible plug-in version of (2) with estimated ˆm(·) and ˆA attains this lower bound up to constants that do not depend on σ² or A. Finally, we consider the robustness of our approach to misspecification of the model (1), and establish an extension of the classic result of James and Stein [1961]: without any assumptions on the distribution of µi conditionally on Xi, we can show that our approach still improves over both simple baselines ˆµi = Zi and ˆµi = ˆm(Xi) in considerable generality (see Section 4 for precise statements). We also consider the behavior of our estimator in situations where the distribution of Zi conditionally on µi, Xi may not be Gaussian, and where the conditional variance σ²_i of Zi given µi, Xi may be different for different samples.\n\nOur approach builds on a long tradition of empirical Bayes estimation that seeks to establish frequentist guarantees for plug-in Bayesian estimators and related procedures in data-rich environments [Efron, 2010, Robbins, 1964]. Empirical Bayes estimation in the setting without covariates Xi is by now well understood [Brown and Greenshtein, 2009, Efron, 2011, Efron and Morris, 1973, Ignatiadis et al., 2019, Ignatiadis and Wager, 2019, James and Stein, 1961, Jiang and Zhang, 2009, Johnstone and Silverman, 2004, Muralidharan, 2010, Stephens, 2016, Weinstein et al., 2018].\n\nIn contrast, empirical Bayes analysis with covariates has been less comprehensively explored, and existing formal results are confined to special cases. 
Fay and Herriot [1979] introduced a model of the form (1) with a linear specification, m(x) = x⊤β, motivated by the problem of “small area estimation” that arises when studying small groups of people based on census data. Further properties of empirical Bayes estimators in the linear specification (including robustness to misspecification) were established by Green and Strawderman [1991] in the case where Xi ∈ R and m(x) = x, and by Cohen et al. [2013], Tan [2016] and Kou and Yang [2017] when m(x) = x⊤β. There has also been some work on empirical Bayes estimation with nonparametric specifications for m, e.g., Mukhopadhyay and Maiti [2004] and Opsomer et al. [2008]. In a genetics application, Stephan et al. [2015] parametrized m(x) as a random forest. Banerjee et al. [2018] utilize univariate side information to estimate sequences of µi that consist mostly of zeros. We also note recent work by Coey and Cunningham [2019], who considered experiment splitting as an alternative to empirical Bayes estimation. Our paper adds to this body of knowledge by providing the first characterization of the minimax-optimal error in the general model (1), by proposing a flexible estimator that attains this bound up to constants, and by studying the robustness of non-parametric empirical Bayes methods to model misspecification.\n\n2 Minimax rates for empirical Bayes estimation with covariates\n\nWe first develop minimax optimality theory for model (1), when m is known to lie in a class C of functions. To this end, we formalize the notion of regret in empirical Bayes estimation, following Robbins [1964]. Concretely, as before, we assume that we have access to n i.i.d. copies (Xi, Zi) from model (1); µi is not observed. Our task then is to construct a denoiser ˆt_n : X × R → R that we will use to estimate µ_{n+1} by ˆt_n(X_{n+1}, Z_{n+1}) for a future sample (X_{n+1}, Z_{n+1}). 
We benchmark this estimator against the unknown Bayes estimator t∗_{m,A}(X_{n+1}, Z_{n+1}) from (2) in terms of its regret (excess risk) L(ˆt_n; m, A), where\n\nL(t; m, A) := E_{m,A}[(t(X_{n+1}, Z_{n+1}) − µ_{n+1})²] − E_{m,A}[(t∗_{m,A}(X_{n+1}, Z_{n+1}) − µ_{n+1})²].   (3)\n\nWe characterize the difficulty of this task by exhibiting the minimax rates for the empirical Bayes excess risk incurred by not knowing m ∈ C (but knowing A), where C is a pre-specified class of functions:¹\n\nM^EB_n(C; A, σ²) := inf_{ˆt_n} sup_{m∈C} { E_{m,A}[L(ˆt_n; m, A)] }.   (4)\n\nOur key result, informally stated, is that the minimax excess risk M^EB_n can be characterized in terms of the minimax risk for estimating m(·) with respect to L²(P_X) in the regression problem in which we observe (Xi, Zi)_{1≤i≤n} with Zi | Xi ∼ N(m(Xi), A + σ²), i.e.,\n\nM^Reg_n(C; A + σ²) := inf_{ˆm_n} sup_{m∈C} E_{m,A}[ ∫ (ˆm_n(x) − m(x))² dP_X(x) ],   (5)\n\nsuch that, for many commonly used function classes C, we have²\n\nM^EB_n(C; A, σ²) ≍ σ⁴/(σ² + A)² · M^Reg_n(C; A + σ²).   (6)\n\nIn other words, when A/σ² is very large, we find that it is easy to match the performance of the Bayes rule (2), since it collapses to Zi. On the other hand, when A/σ² is small, matching the Bayes rule requires estimating m(·) well, and (6) precisely describes how the difficulty of estimating m(·) affects our problem of interest.\n\nPrevious work on minimax rates for the excess risk (3) has been sparse; some exceptions include Benhaddou and Pensky [2013], Li et al. 
[2005] and Penskaya [1995], who develop minimax bounds on (3) when µ ∼ G, Z | µ ∼ N(µ, σ²), i.e., in the setting without covariates but with potentially more general priors. Beyond the modulation through covariates, a crucial difference of our approach is that we pay attention to the behavior in terms of A and σ, instead of absorbing them into constants.\n\n¹We will propose procedures adaptive to unknown A in Section 3.\n²Throughout, we use the following notation for asymptotic rates: for two sequences a_n, b_n > 0, we say a_n ≲ b_n if lim sup_{n→∞} a_n/b_n ≤ c for a constant c that does not depend on A, σ, n. Similarly, we say a_n ≳ b_n if b_n ≲ a_n, and finally a_n ≍ b_n if both a_n ≳ b_n and a_n ≲ b_n.\n\nLower bound Here we provide a lemma for deriving lower bounds on the worst-case expected excess risk (4) through reduction to hypothesis testing. The result is applicable to any class C for which we can prove a lower bound on the minimax regression error using Le Cam's two-point method or Fano's method [Duchi, 2019, Györfi et al., 2006, Ibragimov and Hasminskii, 1981, Tsybakov, 2008]; we will provide concrete examples below.\nLemma 1. 
For each n, let V_n be a finite set and C_n = {m_{n,v} | v ∈ V_n} ⊂ C be a collection of functions indexed by V_n such that, for a sequence δ_n > 0,\n\n∫ (m_{n,v}(x) − m_{n,v′}(x))² dP_X(x) ≥ δ_n²   for all v ≠ v′ ∈ V_n, for all n.\n\nIf, furthermore, sup_{v,v′∈V_n} sup_x (m_{n,v}(x) − m_{n,v′}(x))² → 0 as n → ∞, then:\n\nM^EB_n(C; A, σ²) ≳ σ⁴/(σ² + A)² · δ_n² · inf_{ˆV_n} P[ˆV_n ≠ V_n].\n\nHere, inf_{ˆV_n} P[ˆV_n ≠ V_n] is to be interpreted as follows: V_n is drawn uniformly from V_n and, conditionally on V_n = v, we draw the pairs (Xi, Zi)_{1≤i≤n} from model (1) with regression function m_{n,v}(·). The infimum is taken over all estimators ˆV_n that are measurable with respect to (Xi, Zi)_{1≤i≤n}.\n\nThe lemma may be interpreted as follows: if, information-theoretically, we cannot determine which m_{n,v} ∈ C_n generated (Xi, Zi)_{1≤i≤n}, yet the m_{n,v} are well separated in L²(P_X) norm, then the minimax empirical Bayes regret (4) must be large. Proving lower bounds then amounts to constructing a suitable C_n.\n\nUpper bound Previously, we described the relationship of model (1) to nonparametric regression. However, there is a further connection: under (1), it also holds that Zi | Xi ∼ N(m(Xi), σ² + A). Thus m(·) may be estimated from the data by directly running a regression of Zi on Xi. Then, for known A, the natural way to approximate (2) in a data-driven manner is to use a plug-in estimator. Concretely, given an ˆm_n that achieves the minimax risk (5), we just plug it into the Bayes rule (2):\n\nˆt_n(x, z) := t∗_{ˆm_n,A}(x, z) = A/(σ² + A) · z + σ²/(σ² + A) · ˆm_n(x).   (7)\n\nThis plug-in estimator establishes the following upper bound on (4):\nTheorem 2. 
Under model (1), it holds that:\n\nM^EB_n(C; A, σ²) ≤ σ⁴/(σ² + A)² · M^Reg_n(C; A + σ²).\n\nIn deriving the lower bound of Lemma 1, the estimators considered may use the unknown A. For this reason, in the upper bound we also benchmark against estimators that know A; however, in Section 3 we demonstrate that knowledge of A is in fact not required to attain optimal rates. Next we provide two concrete examples of classes for which the lower and upper bounds match up to constants.\n\nThe linear class (Fay–Herriot shrinkage) As a first, simple example, we consider the model of Fay and Herriot [1979], in which X = R^d and C = Lin(R^d) = {m | m(x) = x⊤β, β ∈ R^d}.\nTheorem 3. Assume the Xi are i.i.d. ∼ N(0, Σ) for an unknown covariance matrix Σ ≻ 0, Σ ∈ R^{d×d}. Then there exists a constant C_Lin (which does not depend on the problem parameters) such that:\n\nlim_{n→∞} | log( M^EB_n(Lin(R^d); A, σ²) / [ σ⁴/(σ² + A)² · (σ² + A)d/n ] ) | ≤ C_Lin.\n\nThe Lipschitz class Next we let X = [0, 1]^d and, for L > 0, consider the Lipschitz class:\n\nC = Lip([0, 1]^d, L) := { m : [0, 1]^d → R | |m(x) − m(x′)| ≤ L‖x − x′‖₂ for all x, x′ ∈ [0, 1]^d }.\n\nTheorem 4. Assume the Xi are i.i.d. ∼ F^X, where F^X is a measure on [0, 1]^d with Lebesgue density f^X that satisfies η ≤ f^X(u) ≤ 1/η for all u ∈ [0, 1]^d, for some η > 0. 
Then there exists a constant C_Lip(d, η), which depends only on d and η, such that:\n\nlim_{n→∞} | log( M^EB_n(Lip([0, 1]^d, L); A, σ²) / [ σ⁴/(σ² + A)² · ( L^d(σ² + A)/n )^{2/(2+d)} ] ) | ≤ C_Lip(d, η).\n\n3 Feasible estimation via split-sample empirical Bayes\n\nThe minimax estimator in (7) that implements (2) in a data-driven way is not feasible, because A is unknown in practice. In principle, A + σ² (with σ² known) is just Var[Zi | Xi], hence deriving a plug-in estimator for A just takes us to the realm of variance estimation in regression problems. But variance estimation in the general setting we consider here is a notoriously difficult problem, with only partial solutions available for very specific settings [e.g., Janson et al., 2017, Reid et al., 2016]. Furthermore, even for one-dimensional smooth nonparametric regression, the minimax rates for variance estimation may be slower than parametric [Brown and Levine, 2007, Shen et al., 2019].\n\nFortunately, it turns out that we do not need to accurately estimate A in (1) in order for our approach to perform well. Rather, as shown below, if we naively read off an estimate of A derived via sample splitting as in (8), we still obtain strong guarantees. Concretely, we study the following algorithm:\n\n1. Form a partition of {1, . . . , n} into two folds I1 and I2.\n2. Use the observations in I1 to estimate the regression m(x) = E[Zi | Xi = x] by ˆm_{I1}(·).\n3. Use the observations in I2 to estimate A through the formula:\n\nˆA_{I2} = ( (1/|I2|) Σ_{i∈I2} (ˆm_{I1}(Xi) − Zi)² − σ² )_+.   (8)\n\n4. 
The estimated denoiser is then ˆt^EBCF_n(·, ·) = t∗_{ˆm_{I1}, ˆA_{I2}}(·, ·).\n\nWe prove the following guarantee for this estimator. In particular, it implies that if the minimax rate for regression (5) is slower than the parametric rate 1/n, and if |I1|/n converges to a non-trivial limit, then our algorithm attains the minimax rate even when A is unknown.\nTheorem 5. Consider a split of the data into two folds I1, I2, with n1 = |I1|, n2 = |I2|. Furthermore, assume that ˆm_{I1} satisfies E_{m,A}[ˆm_{I1}(X)⁴ | ˆm_{I1}] ≤ M almost surely for some M < ∞, where X is a fresh draw from P_X. Then the estimator ˆt^EBCF_n satisfies the following guarantee:\n\nE_{m,A}[L(ˆt^EBCF_n; m, A)] ≤ E_{m,A}[L(t∗_{ˆm_{n1},A}; m, A)] + O(1)/n2.\n\nWe emphasize that this result does not depend on ˆA from (8) being a particularly accurate estimate of A. Rather, what drives our result is the following fact: if (1) holds, but we use (2) with ˜m(·) ≠ m(·), then the choice of ˜A that minimizes the Bayes risk among all estimators of the form t∗_{˜m,˜A}(·, ·), ˜A ≥ 0, is not A, but rather (cf. the derivation in Proposition 15 of the Appendix)\n\nA_{˜m} := E_{m,A}[(˜m(X_{n+1}) − Z_{n+1})²] − σ² = A + E_{m,A}[(˜m(X_{n+1}) − m(X_{n+1}))²].\n\nIn other words, we are better off inflating the prior variance to account for the additional estimation error of ˜m(·); and this inflated prior variance is exactly what is captured in (8).\n\n4 Robustness to misspecification\n\nSo far, our results and estimator apply to Robbins' model [Robbins, 1964], in which (1) holds and we are interested in estimating a future µ_{n+1}. 
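A minimal sketch of the split-sample procedure from Section 3 (steps 1-4) might look as follows. The simulated data, the polynomial regression standing in for an arbitrary black-box learner, and all names are our own assumptions, not the paper's implementation:

```python
import numpy as np

rng = np.random.default_rng(1)

# Simulated data from model (1); m(.), A and sigma^2 are illustrative choices.
n, A_true, sigma2 = 4_000, 1.0, 4.0
m = lambda x: 2.0 * x - 1.0
X = rng.uniform(size=n)
mu = rng.normal(m(X), np.sqrt(A_true))
Z = rng.normal(mu, np.sqrt(sigma2))

# Step 1: partition {1, ..., n} into two folds I1 and I2.
I1, I2 = np.arange(n // 2), np.arange(n // 2, n)

# Step 2: estimate m on fold I1 (a degree-3 polynomial fit stands in for
# any regression learner).
coef = np.polyfit(X[I1], Z[I1], deg=3)
m_hat = lambda x: np.polyval(coef, x)

# Step 3: estimate A on fold I2 via (8): the positive part of the mean
# squared residual minus sigma^2.  This deliberately inflates A by the
# estimation error of m_hat, in the spirit of Proposition 15.
A_hat = max(float(np.mean((m_hat(X[I2]) - Z[I2]) ** 2)) - sigma2, 0.0)

# Step 4: plug-in denoiser t*_{m_hat, A_hat} applied on fold I2.
w = A_hat / (sigma2 + A_hat)
mu_hat = w * Z[I2] + (1 - w) * m_hat(X[I2])
```

Even if `A_hat` is a crude estimate, the resulting convex combination should already improve markedly on the raw observations Z on the held-out fold, which is the point of Theorem 5.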
However, it is also of considerable interest to understand the behavior of empirical Bayes estimation when the specification (1) does not hold. In this section, we consider properties of our estimator under the weaker assumption that we only have a generic data-generating distribution for (Xi, µi, Zi) of the form\n\n(Xi, µi) ∼ P_{(Xi,µi)},   E[Zi | µi, Xi] = µi,   Var[Zi | µi, Xi] = σ²,   (9)\n\nand we seek to estimate the unknown µ1, . . . , µn underlying the experiments we have data for. The distributions indexed by i are assumed to be independent, but need not be identical. This setting is sometimes referred to as the compound estimation problem [Brown and Greenshtein, 2009].\n\nWe proceed with a cross-fold estimator, which we call EBCF (empirical Bayes with cross-fitting), as follows: we split the data as above, but now also flip the roles of I1 and I2, so that we can make predictions ˆµi for all i = 1, ..., n as\n\nˆµ^EBCF_i = t∗_{ˆm_{I1}, ˆA_{I2}}(Xi, Zi) for i ∈ I2,  and  ˆµ^EBCF_i = t∗_{ˆm_{I2}, ˆA_{I1}}(Xi, Zi) for i ∈ I1.\n\nThis is a 2-fold cross-fitting scheme, which has been fruitful in causal inference [Chernozhukov et al., 2017, Nie and Wager, 2018, Schick, 1986] and multiple testing [Ignatiadis et al., 2016, Ignatiadis and Huber, 2018]. We also note that extensions to k-fold cross-fitting are immediate.\n\nSURE for empirical Bayes The key property that enables our approach to be robust outside of the strict model (1) is as follows. Let SURE(·) denote Stein's Unbiased Risk Estimate, a flexible risk estimator motivated by the study of estimators for µi in the Gaussian model Zi ∼ N(µi, σ²) [Stein, 1981]. 
Then, although our estimator was not originally motivated by SURE, one can algebraically verify that our estimator with the plug-in choice of ˆA in fact minimizes SURE among all comparable shrinkage estimators (the same holds with I1, I2 flipped):\n\nˆA_{I2} = ( (1/|I2|) Σ_{i∈I2} (ˆm_{I1}(Xi) − Zi)² − σ² )_+   ⟺   ˆA_{I2} = argmin_{A≥0} { SURE_{I2}(A) },\nwhere SURE_{I2}(A) := (1/|I2|) Σ_{i∈I2} ( σ² + σ⁴/(A + σ²)² · (Zi − ˆm_{I1}(Xi))² − 2σ⁴/(A + σ²) ).   (10)\n\nFurthermore, SURE has the following remarkable property in our setting: for any data-generating process as in (9) and any A ≥ 0 [see also Jiang et al., 2011, Kou and Yang, 2017, Xie et al., 2012],\n\nE[SURE_{I2}(A) | X_{1:n}, µ_{1:n}] = (1/|I2|) Σ_{i∈I2} E[ (µi − t∗_{ˆm_{I1},A}(Xi, Zi))² | X_{1:n}, µ_{1:n} ],   (11)\n\neven when the distribution of Zi conditionally on µi and Xi is not Gaussian. Putting (10) and (11) together, we find that we can argue via SURE that our estimator minimizes an unbiased risk estimate under the generic specification (9), despite the fact that our procedure was not directly motivated by SURE, and SURE itself was only designed for Gaussian estimation.\n\nGaussian data with equal variance and the James–Stein property To derive a first consequence of the above, let us first focus on a special case of (9) where Zi | (µi, Xi) ∼ N(µi, σ²). Then the EBCF estimate satisfies the James–Stein property of strictly dominating the direct estimator Zi [James and Stein, 1961]³. In other words, even if one's covariates Xi are uninformative, or one uses a really poor method for prediction, one still does no worse than just using ˆµi := Zi.\nTheorem 6 (James–Stein property). 
Under the assumptions above, and if |I1|, |I2| ≥ 5, the proposed estimator ˆµi uniformly dominates the (conditional) maximum likelihood estimator Zi; in other words, for all µ1, . . . , µn and X1, . . . , Xn, it holds that:\n\n(1/n) Σ_{i=1}^n E[(µi − ˆµ^EBCF_i)² | X_{1:n}, µ_{1:n}] < (1/n) Σ_{i=1}^n E[(µi − Zi)² | X_{1:n}, µ_{1:n}] = σ².\n\nNon-Gaussian data with equal variance Next we drop the Gaussianity assumption and consider the model (9) in full generality. We use the properties of SURE outlined above to establish the following:\nTheorem 7. Assume the pairs (Xi, Zi)_{1≤i≤n} are independent and satisfy (9). Furthermore, assume that there exist Γ, M < ∞ such that sup_i E[Z⁴_i | µi, Xi] ≤ Γ⁴, and that sup_i |µi| ≤ M and sup_x |ˆm_{I1}(x)| ≤ M almost surely. Then (the analogous claim also holds with I1, I2 flipped):\n\nsup_{A≥0} { (1/|I2|) Σ_{i∈I2} E[ (µi − ˆµ^EBCF_i)² − (µi − t∗_{ˆm_{I1},A}(Xi, Zi))² | X_{1:n}, µ_{1:n}, Z_{I1} ] } ≤ O( 1/√|I2| ).\n\nCorollary 8. Assume that |I1| = |I2| = n/2 and that the (Xi, µi, Zi) are i.i.d. and satisfy the assumptions of Theorem 7. Then the following holds, with (X, µ) a fresh draw from (9):\n\n(1/n) Σ_{i=1}^n E[(µi − ˆµ^EBCF_i)²] ≤ σ² E[(ˆm_{n/2}(X) − µ)²] / ( σ² + E[(ˆm_{n/2}(X) − µ)²] ) + O(1/√n).   (12)\n\n³Li and Hwang [1984] provide a similar result when ˆm(·) is a linear smoother.\n\nFigure 2: Root mean squared error (RMSE) for estimating µi in model (1). 
Results are shown as a function of n for the four estimators described in the main text. a) Here we let σ = 2, A = 0, corresponding to the case of nonparametric regression. b) We let σ = √A = 2.0, corresponding to intermediate shrinkage, and c) we let σ = 2, √A = 3. The standard errors of all RMSEs are at most 0.01.\n\nHere ˆm_{n/2}(·) is the fitted function based on n/2 samples (Xi, Zi). To interpret this result, we note that when ˆm(·) can accurately capture µi, i.e., ˆm(·) is a good estimate of m(·) and µi can be well explained using the available covariates Xi, the error in (12) essentially matches the error of the direct regression-based method ˆµi := ˆm_{n/2}(Xi). Conversely, when the error of ˆm(·) for estimating µi is large, we recover the error σ² of the simple estimator ˆµi := Zi. But in the interesting regime where the mean squared error of ˆm(·) for µi is comparable to σ², we can do a much better job by taking a convex combination of the regression prediction ˆm_{n/2}(Xi) and Zi, and the EBCF estimator automatically and robustly navigates this trade-off.\n\nNon-Gaussian data with unequal variance Finally, we note that we may even drop the assumption of equal variance and allow each unit its own (conditional) variance σ²_i in (9), rather than the same σ² for everyone. We may think of the Bayes estimator (2) as also being a function of σi, i.e., write it as t∗_{m,A}(x, z, σ). Then the EBCF estimator takes the following form: for i ∈ I2 we estimate µi by t∗_{ˆm_{I1}, ˆA_{I2}}(Xi, Zi, σi). 
We get ˆm_{I1} by regression, while for ˆA_{I2} we generalize (10):\n\nˆA_{I2} = argmin_{A≥0} { SURE_{I2}(A) },   SURE_{I2}(A) = (1/|I2|) Σ_{i∈I2} ( σ²_i + σ⁴_i/(A + σ²_i)² · (Zi − ˆm_{I1}(Xi))² − 2σ⁴_i/(A + σ²_i) ).\n\nThe result of Theorem 7 (see Appendix C.2) also holds in this case, and we demonstrate the claims in the empirical application on the MovieLens dataset below.\n\n5 Empirical results\n\nFor our empirical results we compare the following four estimation methods for µi: a) the unbiased estimator ˆµi := Zi; b) the out-of-fold⁴ regression prediction ˆµi := ˆm(Xi), where ˆm is the fit from boosted regression trees, as implemented in XGBoost [Chen and Guestrin, 2016], with the number of iterations chosen by 5-fold cross-validation and η = 0.1 (the weight with which new trees are added to the ensemble); c) the empirical Bayes estimator (2) without covariates, which shrinks Zi towards the grand average Σ_{i=1}^n Zi/n, with tuning parameters selected via SURE following Xie et al. [2012]; and d) the proposed EBCF (empirical Bayes with cross-fitting) method, with 5 folds used for cross-fitting and XGBoost as the regression learner (with cross-validation nested within cross-fitting).\n\nSynthetic data: We generate data from model (1) with P_X = U[0, 1]^15, where m(·) is the Friedman [1991] function m(x) = 10 sin(πx1x2) + 20(x3 − 1/2)² + 10x4 + 5x5, and the last 10 coordinates are noise. Furthermore, we let σ = 2.0 and vary A ∈ {0, 4, 9}, mimicking the three cases in Figure 1, and we also vary n. Results are averaged over 100 simulations and shown in Figure 2. We make the following observation: the unbiased estimator Zi and the SURE estimator that shrinks towards the grand mean have constant mean squared error, and their results do not improve with increasing n. 
The XGBoost predictor improves with increasing n, since m(·) is estimated more accurately; indeed, in panel a), if ˆm(·) were exactly equal to m(·), then the error would be 0. However, as seen in panels b, c), when A > 0, the mean squared error of XGBoost is lower bounded by A, even under perfect prediction of m(·). In contrast, EBCF always improves with n by leveraging the improved predictions of XGBoost, and outperforms all other estimators, even in the case A = 0, which corresponds to nonparametric regression.\n\n⁴By out-of-fold we mean that the regression prediction is the one used by the 5-fold EBCF described below.\n\nFigure 3: EB analysis of the MovieLens dataset for prediction of average movie rating. a) Mean-squared error (MSE) n⁻¹ Σ_{i=1}^n (ˆµi − ˜Zi)² (± 2 standard errors of the MSE) of four estimators on the MovieLens dataset (where ˜Zi is the average rating computed from the held-out data with 90% of users), for all movies as well as for the subset of movies classified as both Horror and Sci-Fi. b) LOESS smooth of the mean squared error across all movies against the rank of Ni, where Ni is the number of users that rated movie i in the training set. c) Deviations of EBCF (empirical Bayes with cross-fitting) and SURE (Stein's unbiased risk estimate) predictions from the unbiased estimator Zi as a function of Ni for all Horror & Sci-Fi movies. We also show the “true” errors ˜Zi − Zi.\n\nMovieLens data [Harper and Konstan, 2016]: Here we elaborate on the example from the introduction, which aims to predict the average movie rating given ratings from a finite number of users. The MovieLens dataset consists of approximately 20 million ratings in {0, 0.5, . . .
, 5} from 138,000 users applied to 27,000 movies. To demonstrate the applicability of our approach when model (1) does not necessarily hold, we randomly choose 10% of all users and attempt to estimate the movie ratings from them; this corresponds to having a much smaller dataset. We then summarize the i-th movie by Zi, the average rating over the Ni users (in the training dataset) that rated it. We further have covariates Xi ∈ R^20 that include Ni, the year the movie was released, as well as indicators for the 18 genres to which the movie may belong (action, comedy, etc.). We posit that Zi | µi, Xi ∼ (µi, σ^2/Ni) and want to estimate µi.⁵ As our pseudo ground truth for movie i we use Z̃i, the average movie rating among the remaining 90% of users, and we then report the error Σ_{i=1}^n (Z̃i − µ̂i)^2/n, where n is the total number of movies.⁶

The average error across all movies is shown in Figure 3a; here the XGBoost predictor performs worst, followed by the unbiased estimator Zi. Instead, the two EB approaches perform substantially better, with EBCF achieving the lowest error. The same is true when comparing only the 253 movies with genre tags for both Horror and Sci-Fi. In panel b), we show the relationship between the error (Z̃i − µ̂i)^2 and the rank of the per-movie number of reviews Ni, using a LOESS smoother [Cleveland and Devlin, 1988]. We observe that the three estimators that use Zi do a near-perfect job for large Ni and a worse job for smaller Ni. In particular, the error of Zi blows up at small Ni, and the error gains of EBCF occur precisely at low sample sizes. On the other hand, the XGBoost prediction has an error that is not reduced by larger Ni, but is competitive at small Ni. Panel c) shows µ̂i − Zi for the 253 predictions of EBCF and SURE for Horror/Sci-Fi movies as a function of the rank of Ni. For large Ni, again both EB estimators agree with the unbiased estimator. 
However, for small Ni, it appears that most Sci-Fi/Horror movies are worse than the average movie, and EB without covariates, which shrinks towards the grand mean, assigns them a higher rating. Conversely, EBCF automatically learns that these movies tend to get low ratings, and pulls the unbiased estimator Zi further down.

Communities and Crimes data from the UCI repository [Dua and Graff, 2017, Redmond and Baveja, 2002]: The dataset provides information about the number of crimes in multiple US communities

⁵We replace σ^2 by σ̂^2 = 0.94, the average of the sample standard deviations across all movies.
⁶We filter movies and keep only movies with at least 3 ratings in the training set and 11 in the validation set.
Table 1: EB analysis of the Communities and Crimes dataset. The table reports the mean-squared error (± 2 standard errors) of four different estimators for the non-violent crime rate. The columns correspond to down-sampling the dataset to a population of B = 200 or B = 500 for each community.

              B = 200           B = 500
              MSE (×10^6)       MSE (×10^6)
  Unbiased    223.9 (±16.8)      92.2 (±7.1)
  XGBoost     370.2 (±78.6)     398.0 (±81.8)
  SURE        184.2 (±18.9)      85.6 (±7.2)
  EBCF        152.0 (±22.2)      78.5 (±10.3)

as compiled by the FBI Uniform Crime Reporting program in 1995. Our task is to predict the non-violent crime rate pi of community i, defined as pi := (Crimes in community i)/(Population i), for each of n = 2118 communities.⁷ We observe a dataset in which the population of each community is down-sampled to B = 200 as

Ci ∼ Hypergeometric(B, Crimes in community i, Population i).

We seek to predict pi based on Ci and covariates Xi ∈ R^74, which include all unnormalized, numeric predictive covariates in the UCI data set description (after removing covariates with missing entries) and comprise features derived from Census and law enforcement data, such as the percentage of people that are employed and the percentage of police officers assigned to drug units. We note that the hypergeometric subsampling makes the estimation task harder and also provides us with pseudo ground truth pi; cf. Wager [2015] for further motivation of such down-sampling.

The problem may be cast into the setting of this paper by defining Zi := √(Ci/B). Then, by a variance-stabilizing argument, it follows that Zi ˙∼ (√pi, 1/(4·B)), and we may apply the same methods as in the preceding examples to estimate µi := √pi by µ̂i. 
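The variance-stabilizing approximation Zi ˙∼ (√pi, 1/(4·B)) can also be checked by simulation. The sketch below is our own illustration, not code from the paper: it uses a binomial draw in place of the hypergeometric one and a hypothetical rate p; for small p the delta-method variance (1 − p)/(4B) is close to the nominal 1/(4B).

```python
import numpy as np

rng = np.random.default_rng(0)
B, p = 200, 0.05                        # hypothetical population size and crime rate
C = rng.binomial(B, p, size=500_000)    # binomial stand-in for the hypergeometric draw
Z = np.sqrt(C / B)                      # variance-stabilized statistic Z_i = sqrt(C_i / B)

# Delta method: Var(Z) ~= (1 - p) / (4B), close to 1 / (4B) when p is small.
print(Z.mean())   # close to sqrt(p)
print(Z.var())    # close to 1 / (4 * B)
```

The square-root transform makes the sampling variance of Zi (approximately) known and independent of pi, which is what lets the SURE machinery from the earlier examples be reused without modification.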
After transforming the estimates back to the original scale through p̂i = µ̂i^2, we report the error Σ_{i=1}^n (pi − p̂i)^2/n, where n is the number of communities analyzed. Table 1 shows the results of this analysis, as well as the same analysis repeated for B = 500. EBCF shows promising performance compared to the other baselines for both B. As we decrease the amount of downsampling from B = 200 to B = 500, we see that methods that depend on Zi (unbiased, SURE, and EBCF) improve a lot, while XGBoost does not.

6 Discussion

Empirical Bayes is a powerful framework for pooling information across many experiments, and for improving the precision of our inference about each experiment on its own [Efron, 2010, Robbins, 1964]. Existing empirical Bayes methods, however, do not allow the analyst to leverage covariate information unless they assume a rigid parametric model as in Fay and Herriot [1979], or are willing to commit to a specific end-to-end estimation strategy as in, e.g., Opsomer et al. [2008]. In contrast, the approach proposed here allows the analyst to perform covariate-powered empirical Bayes estimation on the basis of any black-box predictive model, and has strong formal properties whether or not the model (1) used to motivate our procedure is well specified. Our approach may be extended in future work by considering generalizations of (1), such as covariate-based modulation of the prior variance, i.e., µi | Xi ∼ N(m(Xi), A(Xi)). 
The working assumption of a normal prior could also be replaced by heavy-tailed priors [Zhu, Ibrahim, and Love, 2018] or priors with a point mass at zero.

The prevalence of settings where we need to analyze results from many loosely related experiments seems only destined to grow, and we believe that empirical Bayes methods that allow for various forms of structured side information hold promise for fruitful application across several different areas.

Code availability and reproducibility

The proposed EBCF (empirical Bayes with cross-fitting) method has been implemented in EBayes.jl (https://github.com/nignatiadis/EBayes.jl), a package written in the Julia language [Bezanson et al., 2017]. Dependencies of EBayes.jl include MLJ.jl [Blaom et al., 2019], Optim.jl [Mogensen and Riseth, 2018] and Distributions.jl [Besançon et al., 2019]. We also provide a Github repository (https://github.com/nignatiadis/EBCrossFitPaper) with code to reproduce all empirical results in this paper, including a specification for downloading the dependencies and datasets.

⁷We filter out communities with a missing number of non-violent crimes.

Acknowledgments

The authors are grateful for enlightening conversations with Brad Efron, Guido Imbens, Panagiotis Lolas and Paris Syminelakis. This research was funded by a gift from Google.

References

Alberto Abadie and Maximilian Kasy. Choosing among regularized estimators in empirical economics: The risk of machine learning. Review of Economics and Statistics, (0), 2018.

Alekh Agarwal, Martin J Wainwright, Peter L Bartlett, and Pradeep K Ravikumar. Information-theoretic lower bounds on the oracle complexity of convex optimization. In Advances in Neural Information Processing Systems, pages 1–9, 2009.

Trambak Banerjee, Gourab Mukherjee, and Wenguang Sun. Adaptive sparse estimation with side information. arXiv preprint arXiv:1811.11930, 2018.

Alvin J Baranchik. 
Multiple regression and estimation of the mean of a multivariate normal distribution. Technical report, Stanford University, 1964.

Rida Benhaddou and Marianna Pensky. Adaptive nonparametric empirical Bayes estimation via wavelet series: The minimax study. Journal of Statistical Planning and Inference, 143(10):1672–1688, 2013.

Mathieu Besançon, David Anthoff, Alex Arslan, Simon Byrne, Dahua Lin, Theodore Papamarkou, and John Pearson. Distributions.jl: Definition and modeling of probability distributions in the JuliaStats ecosystem. arXiv preprint arXiv:1907.08611, 2019.

Jeff Bezanson, Alan Edelman, Stefan Karpinski, and Viral B Shah. Julia: A fresh approach to numerical computing. SIAM Review, 59(1):65–98, 2017.

Anthony Blaom, Franz Kiraly, Thibaut Lienart, and Sebastian Vollmer. alan-turing-institute/MLJ.jl: v0.5.3, November 2019. URL https://doi.org/10.5281/zenodo.3541506.

Lawrence D Brown. Admissible estimators, recurrent diffusions, and insoluble boundary value problems. The Annals of Mathematical Statistics, 42(3):855–903, 1971.

Lawrence D Brown and Eitan Greenshtein. Nonparametric empirical Bayes and compound decision approaches to estimation of a high-dimensional vector of normal means. The Annals of Statistics, pages 1685–1704, 2009.

Lawrence D Brown and Michael Levine. Variance estimation in nonparametric regression via the difference sequence method. The Annals of Statistics, 35(5):2219–2232, 2007.

Tianqi Chen and Carlos Guestrin. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 785–794. ACM, 2016.

Victor Chernozhukov, Denis Chetverikov, Mert Demirer, Esther Duflo, Christian Hansen, Whitney Newey, and James Robins. Double/debiased machine learning for treatment and structural parameters. 
The Econometrics Journal, 2017.

William S Cleveland and Susan J Devlin. Locally weighted regression: An approach to regression analysis by local fitting. Journal of the American Statistical Association, 83(403):596–610, 1988.

Dominic Coey and Tom Cunningham. Improving treatment effect estimators through experiment splitting. In The World Wide Web Conference, pages 285–295. ACM, 2019.

Noam Cohen, Eitan Greenshtein, and Ya'acov Ritov. Empirical Bayes in the presence of explanatory variables. Statistica Sinica, 23:333–357, 2013.

Dheeru Dua and Casey Graff. UCI machine learning repository, 2017. URL http://archive.ics.uci.edu/ml.

John Duchi. Lecture notes for Statistics 311/Electrical Engineering 377. https://stanford.edu/class/stats311/lecture-notes.pdf. Last visited on March 13, 2019.

B. Efron, R. Tibshirani, J.D. Storey, and V. Tusher. Empirical Bayes analysis of a microarray experiment. Journal of the American Statistical Association, 96(456):1151–1160, 2001.

Bradley Efron. Large-Scale Inference: Empirical Bayes Methods for Estimation, Testing, and Prediction. Cambridge University Press, 2010.

Bradley Efron. Tweedie's formula and selection bias. Journal of the American Statistical Association, 106(496):1602–1614, 2011.

Bradley Efron and Carl Morris. Stein's estimation rule and its competitors—an empirical Bayes approach. Journal of the American Statistical Association, 68(341):117–130, 1973.

Robert E Fay and Roger A Herriot. Estimates of income for small places: An application of James-Stein procedures to census data. Journal of the American Statistical Association, 74(366a):269–277, 1979.

Jerome H Friedman. Multivariate adaptive regression splines. The Annals of Statistics, 19(1):1–67, 1991.

Edwin J Green and William E Strawderman. A James-Stein type estimator for combining unbiased and possibly biased estimators. 
Journal of the American Statistical Association, 86(416):1001–1006, 1991.

László Györfi, Michael Kohler, Adam Krzyzak, and Harro Walk. A Distribution-Free Theory of Nonparametric Regression. Springer Science & Business Media, 2006.

F Maxwell Harper and Joseph A Konstan. The MovieLens datasets: History and context. ACM Transactions on Interactive Intelligent Systems (TiiS), 5(4):19, 2016.

Ildar Abdulovic Ibragimov and Rafail Zalmanovich Hasminskii. Statistical Estimation: Asymptotic Theory. Springer Verlag, 1981.

Nikolaos Ignatiadis and Wolfgang Huber. Covariate powered cross-weighted multiple testing. arXiv preprint arXiv:1701.05179, 2018.

Nikolaos Ignatiadis and Stefan Wager. Bias-aware confidence intervals for empirical Bayes analysis. arXiv preprint arXiv:1902.02774, 2019.

Nikolaos Ignatiadis, Bernd Klaus, Judith B Zaugg, and Wolfgang Huber. Data-driven hypothesis weighting increases detection power in genome-scale multiple testing. Nature Methods, 13(7):577, 2016.

Nikolaos Ignatiadis, Sujayam Saha, Dennis L Sun, and Omkar Muralidharan. Empirical Bayes mean estimation with nonparametric errors via order statistic regression. arXiv preprint arXiv:1911.05970, 2019.

Willard James and Charles Stein. Estimation with quadratic loss. In Proceedings of the Fourth Berkeley Symposium on Mathematical Statistics and Probability, volume 1, pages 361–379, 1961.

Lucas Janson, Rina Foygel Barber, and Emmanuel Candes. EigenPrism: Inference for high dimensional signal-to-noise ratios. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 79(4):1037–1065, 2017.

Jiming Jiang, Thuan Nguyen, and J Sunil Rao. Best predictive small area estimation. Journal of the American Statistical Association, 106(494):732–745, 2011.

Wenhua Jiang and Cun-Hui Zhang. General maximum likelihood empirical Bayes estimation of normal means. 
The Annals of Statistics, 37(4):1647–1684, 2009.

Iain M Johnstone and Bernard W Silverman. Needles and straw in haystacks: Empirical Bayes estimates of possibly sparse sequences. The Annals of Statistics, 32(4):1594–1649, 2004.

SC Kou and Justin J Yang. Optimal shrinkage estimation in heteroscedastic hierarchical linear models. In Big and Complex Data Analysis, pages 249–284. Springer, 2017.

Jianjun Li, Shanti S Gupta, and Friedrich Liese. Convergence rates of empirical Bayes estimation in exponential family. Journal of Statistical Planning and Inference, 131(1):101–115, 2005.

Ker-Chau Li. Asymptotic optimality of CL and generalized cross-validation in ridge regression with application to spline smoothing. The Annals of Statistics, 14(3):1101–1112, 1986.

Ker-Chau Li and Jiunn Tzon Hwang. The data-smoothing aspect of Stein estimates. The Annals of Statistics, 12(3):887–897, 1984.

Ingrid Lönnstedt and Terry Speed. Replicated microarray data. Statistica Sinica, pages 31–46, 2002.

Michael I Love, Wolfgang Huber, and Simon Anders. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12):550, 2014.

H Brendan McMahan, Gary Holt, David Sculley, et al. Ad click prediction: A view from the trenches. In Proceedings of the 19th ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 1222–1230. ACM, 2013.

Patrick Kofod Mogensen and Asbjørn Nilsen Riseth. Optim: A mathematical optimization package for Julia. Journal of Open Source Software, 3(24), 2018.

Pushpal Mukhopadhyay and Tapabrata Maiti. Two stage non-parametric approach for small area estimation. Proceedings of ASA Section on Survey Research Methods, 4058:4065, 2004.

Saurabh Mukhopadhyay and Brani Vidakovic. Efficiency of linear Bayes rules for a normal mean: skewed priors class. 
Journal of the Royal Statistical Society: Series D (The Statistician), 44(3):389–397, 1995.

Omkar Muralidharan. An empirical Bayes mixture method for effect size and false discovery rate estimation. The Annals of Applied Statistics, 4(1):422–438, 2010.

Xinkun Nie and Stefan Wager. Quasi-oracle estimation of heterogeneous treatment effects. arXiv preprint arXiv:1712.04912, 2018.

Jean D Opsomer, Gerda Claeskens, Maria Giovanna Ranalli, Goeran Kauermann, and FJ Breidt. Non-parametric small area estimation using penalized spline regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 70(1):265–286, 2008.

M Ya Penskaya. On the lower bounds for mean square error of empirical Bayes estimators. Journal of Mathematical Sciences, 75(2):1524–1535, 1995.

Michael Redmond and Alok Baveja. A data-driven software tool for enabling cooperative information sharing among police departments. European Journal of Operational Research, 141(3):660–678, 2002.

Stephen Reid, Robert Tibshirani, and Jerome Friedman. A study of error variance estimation in lasso regression. Statistica Sinica, pages 35–67, 2016.

Herbert Robbins. The empirical Bayes approach to statistical decision problems. Annals of Mathematical Statistics, 35:1–20, 1964.

Saharon Rosset and Ryan J Tibshirani. From fixed-X to random-X regression: Bias-variance decompositions, covariance penalties, and prediction error estimation. Journal of the American Statistical Association, pages 1–14, 2018.

Anton Schick. On asymptotically efficient estimation in semiparametric models. The Annals of Statistics, pages 1139–1151, 1986.

Yandi Shen, Chao Gao, Daniela Witten, and Fang Han. Optimal estimation of variance in nonparametric regression with random design. arXiv preprint arXiv:1902.10822, 2019.

Charles M Stein. Estimation of the mean of a multivariate normal distribution. 
The Annals of Statistics, pages 1135–1151, 1981.

Johannes Stephan, Oliver Stegle, and Andreas Beyer. A random forest approach to capture genetic effects in the presence of population structure. Nature Communications, 6:7432, 2015.

Matthew Stephens. False discovery rates: A new deal. Biostatistics, 18(2):275–294, 2016.

Zhiqiang Tan. Steinized empirical Bayes estimation for heteroscedastic data. Statistica Sinica, pages 1219–1248, 2016.

A.B. Tsybakov. Introduction to Nonparametric Estimation. Springer Series in Statistics. Springer New York, 2008. ISBN 9780387790527.

Stefan Wager. The efficiency of density deconvolution. arXiv preprint arXiv:1507.00832, 2015.

Asaf Weinstein, Zhuang Ma, Lawrence D Brown, and Cun-Hui Zhang. Group-linear empirical Bayes estimates for a heteroscedastic normal mean. Journal of the American Statistical Association, 113(522):698–710, 2018.

Xianchao Xie, SC Kou, and Lawrence D Brown. SURE estimates for a heteroscedastic hierarchical model. Journal of the American Statistical Association, 107(500):1465–1479, 2012.

Anqi Zhu, Joseph G Ibrahim, and Michael I Love. Heavy-tailed prior distributions for sequence count data: Removing the noise and preserving large differences. Bioinformatics, 2018.