{"title": "NeuralFDR: Learning Discovery Thresholds from Hypothesis Features", "book": "Advances in Neural Information Processing Systems", "page_first": 1541, "page_last": 1550, "abstract": "As datasets grow richer, an important challenge is to leverage the full features in the data to maximize the number of useful discoveries while controlling for false positives. We address this problem in the context of multiple hypotheses testing, where for each hypothesis, we observe a p-value along with a set of features specific to that hypothesis. For example, in genetic association studies, each hypothesis tests the correlation between a variant and the trait. We have a rich set of features for each variant (e.g. its location, conservation, epigenetics etc.) which could inform how likely the variant is to have a true association. However popular testing approaches, such as Benjamini-Hochberg's procedure (BH) and independent hypothesis weighting (IHW), either ignore these features or assume that the features are categorical. We propose a new algorithm, NeuralFDR, which automatically learns a discovery threshold as a function of all the hypothesis features. We parametrize the discovery threshold as a neural network, which enables flexible handling of multi-dimensional discrete and continuous features as well as efficient end-to-end optimization. We prove that NeuralFDR has strong false discovery rate (FDR) guarantees, and show that it makes substantially more discoveries in synthetic and real datasets. Moreover, we demonstrate that the learned discovery threshold is directly interpretable.", "full_text": "NeuralFDR: Learning Discovery Thresholds\n\nfrom Hypothesis Features\n\nFei Xia\u21e4, Martin J. 
Zhang\u21e4,\n\nJames Zou\u2020, David Tse\u2020\n\nStanford University\n\n{feixia,jinye,jamesz,dntse}@stanford.edu\n\nAbstract\n\nAs datasets grow richer, an important challenge is to leverage the full features\nin the data to maximize the number of useful discoveries while controlling for\nfalse positives. We address this problem in the context of multiple hypotheses\ntesting, where for each hypothesis, we observe a p-value along with a set of\nfeatures speci\ufb01c to that hypothesis. For example, in genetic association studies,\neach hypothesis tests the correlation between a variant and the trait. We have a\nrich set of features for each variant (e.g. its location, conservation, epigenetics etc.)\nwhich could inform how likely the variant is to have a true association. However\npopular empirically-validated testing approaches, such as Benjamini-Hochberg\u2019s\nprocedure (BH) and independent hypothesis weighting (IHW), either ignore these\nfeatures or assume that the features are categorical or uni-variate. We propose a\nnew algorithm, NeuralFDR, which automatically learns a discovery threshold as a\nfunction of all the hypothesis features. We parametrize the discovery threshold as\na neural network, which enables \ufb02exible handling of multi-dimensional discrete\nand continuous features as well as ef\ufb01cient end-to-end optimization. We prove\nthat NeuralFDR has strong false discovery rate (FDR) guarantees, and show that it\nmakes substantially more discoveries in synthetic and real datasets. Moreover, we\ndemonstrate that the learned discovery threshold is directly interpretable.\n\n1\n\nIntroduction\n\nIn modern data science, the analyst is often swarmed with a large number of hypotheses \u2014 e.g. is a\nmutation associated with a certain trait or is this ad effective for that section of the users. Deciding\nwhich hypothesis to statistically accept or reject is a ubiquitous task. 
In standard multiple hypothesis testing, each hypothesis is boiled down to one number, a p-value computed against some null distribution, with a smaller value indicating that the hypothesis is less likely to be null. We have powerful procedures to systematically reject hypotheses while controlling the false discovery rate (FDR). Note that here the convention is that a \u201cdiscovery\u201d corresponds to a \u201crejected\u201d null hypothesis.\n\nThese FDR procedures are widely used but they ignore additional information that is often available in modern applications. Each hypothesis, in addition to the p-value, could also contain a set of features pertinent to the objects being tested in the hypothesis. In the genetic association setting above, each hypothesis tests whether a mutation is correlated with the trait and we have a p-value for this. Moreover, we also have other features about both the mutation (e.g. its location, epigenetic status, conservation etc.) and the trait (e.g. if the trait is gene expression then we have features on the gene). Together these form a feature representation of the hypothesis. This feature vector is ignored by the standard multiple hypotheses testing procedures.\n\nIn this paper, we present a flexible method using neural networks to learn a nonlinear mapping from hypothesis features to a discovery threshold.
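For concreteness, the constant-threshold baseline that feature-aware methods generalize, the Benjamini-Hochberg (BH) step-up procedure, can be sketched in a few lines. This is an illustrative sketch, not code from the paper; the function name is ours.

```python
import numpy as np

def benjamini_hochberg(pvals, alpha=0.1):
    """Classical BH: sort the p-values, find the largest k with
    p_(k) <= alpha * k / n, and reject the k smallest hypotheses."""
    p = np.asarray(pvals, dtype=float)
    n = len(p)
    order = np.argsort(p)
    # Compare each sorted p-value against the BH line alpha * k / n.
    below = p[order] <= alpha * np.arange(1, n + 1) / n
    reject = np.zeros(n, dtype=bool)
    if below.any():
        k = np.max(np.nonzero(below)[0])  # largest index passing the line
        reject[order[: k + 1]] = True
    return reject
```

Note that the rule is one global threshold on p-values: hypothesis features play no role, which is exactly the limitation discussed in the text.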
Popular procedures for multiple hypotheses testing correspond to having one constant threshold for all the hypotheses (BH [3]), or a constant for each group of hypotheses (group BH [13], IHW [14, 15]). Our algorithm takes account of all the features to automatically learn different thresholds for different hypotheses. Our deep learning architecture enables efficient optimization and gracefully handles both continuous and discrete multi-dimensional hypothesis features. Our theoretical analysis shows that we can control the false discovery proportion (FDP) with high probability. We provide extensive simulations on synthetic and real datasets to demonstrate that our algorithm makes more discoveries while controlling FDR compared to state-of-the-art methods.\n\n*These authors contributed equally to this work and are listed in alphabetical order. \u2020These authors contributed equally.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nFigure 1: NeuralFDR: an end-to-end learning procedure. (Diagram: hypotheses H1-H4 with p-values p1-p4 and covariates x1-x4 pass through a learned discovery threshold t(x; theta), trained end to end; in the illustrated example the FDP is 1/3.)\n\nContribution. As shown in Fig. 1, we provide NeuralFDR, a practical end-to-end algorithm for the multiple hypotheses testing problem where the hypothesis features can be continuous and multi-dimensional. In contrast, the currently widely-used algorithms either ignore the hypothesis features (BH [3], Storey's BH [21]) or are designed for simple discrete features (group BH [13], IHW [15]). Our algorithm has several innovative features. We learn a multi-layer perceptron as the discovery threshold and use a mirroring technique to robustly estimate false discoveries.
We show that NeuralFDR controls false discovery with high probability for independent hypotheses and asymptotically under weak dependence [13, 21], and we demonstrate on both synthetic and real datasets that it controls FDR while making substantially more discoveries. Another advantage of our end-to-end approach is that the learned discovery thresholds are directly interpretable. We will illustrate in Sec. 4 how the threshold conveys biological insights.\n\nRelated works. Holm [12] investigated the use of p-value weights, where a larger weight suggests that the hypothesis is more likely to be an alternative. Benjamini and Hochberg [4] considered assigning different losses to different hypotheses according to their importance. Some more recent works are [9, 10, 13]. In these works, the features are assumed to have some specific form, either prespecified weights for each hypothesis or grouping information. The more general formulation considered in this paper was proposed quite recently [15, 16, 18, 19]. It assumes that for each hypothesis, we observe not only a p-value Pi but also a feature Xi lying in some generic space X. The feature is meant to capture some side information that might bear on the likelihood of a hypothesis to be significant, or on the power of Pi under the alternative, but the nature of this relationship is not fully known ahead of time and must be learned from the data.\n\nThe recent work most relevant to ours is IHW [15]. In IHW, the data is grouped into G groups based on the features and the decision threshold is a constant for each group. IHW is similar to NeuralFDR in that both methods optimize the parameters of the decision rule to increase the number of discoveries while using cross validation for asymptotic FDR control.
IHW has several limitations: first, binning the data into G groups can be difficult if the feature space X is multi-dimensional; second, the decision rule, restricted to be a constant for each group, is artificial for continuous features; and third, the asymptotic FDR control guarantee requires the number of groups to go to infinity, which can be unrealistic. In contrast, NeuralFDR uses a neural network to parametrize the decision rule, which is much more general and fits continuous features. As demonstrated in the empirical results, it works well with multi-dimensional features. In addition to asymptotic FDR control, NeuralFDR also has a high-probability false discovery proportion control guarantee with a finite number of hypotheses.\n\nSABHA [19] and AdaPT [16] are two recent FDR control frameworks that allow flexible methods to explore the data and compute feature-dependent decision rules. The focus there is the framework rather than the end-to-end algorithm, as compared to NeuralFDR. For the empirical experiments, SABHA estimates the null proportion using non-parametric methods while AdaPT estimates the distribution of the p-value and the features with a two-group Gamma GLM mixture model and spline regression. The multi-dimensional case is discussed without empirical validation. Hence both methods have a similar limitation to IHW in that they do not provide an empirically validated end-to-end approach for multi-dimensional features. This issue is addressed in [5], where the null proportion is modeled as a linear combination of some hand-crafted transformations of the features. NeuralFDR models this relation in a more flexible way.\n\n2 Preliminaries\n\nWe have n hypotheses and each hypothesis i is characterized by a tuple (Pi, Xi, Hi), where Pi in (0, 1) is the p-value, Xi in X is the hypothesis feature, and Hi in {0, 1} indicates if this hypothesis is null (Hi = 0) or alternative (Hi = 1).
The p-value Pi represents the probability of observing an equally or more extreme value of the test statistic when the hypothesis is null, and is calculated based on some data different from Xi. The alternative hypotheses (Hi = 1) are the true signals that we would like to discover. A smaller p-value presents stronger evidence for a hypothesis to be alternative. In practice, we observe Pi and Xi but do not know Hi. We define the null proportion pi_0(x) to be the probability that the hypothesis is null conditional on the feature Xi = x. The standard assumption is that under the null (Hi = 0), the p-value is uniformly distributed in (0, 1). Under the alternative (Hi = 1), we denote the p-value distribution by f1(p|x). In most applications, the p-values under the alternative are systematically smaller than those under the null. A detailed discussion of the assumptions can be found in Sec. 5.\n\nThe general goal of multiple hypotheses testing is to claim a maximum number of discoveries based on the observations {(Pi, Xi)}_{i=1}^n while controlling the false positives. The most popular quantities that conceptualize the false positives are the family-wise error rate (FWER) [8] and the false discovery rate (FDR) [3]. We specifically consider FDR in this paper. FDR is the expected proportion of false discoveries, and one closely related quantity, the false discovery proportion (FDP), is the actual proportion of false discoveries. We note that FDP is the actual realization of FDR. Formally,\n\nDefinition 1. (FDP and FDR) For any decision rule t, let D(t) and FD(t) be the number of discoveries and the number of false discoveries. The false discovery proportion FDP(t) and the false discovery rate FDR(t) are defined as FDP(t) := FD(t)/D(t) and FDR(t) := E[FDP(t)].\n\nIn this paper, we aim to maximize D(t) while controlling FDP(t) <= alpha with high probability.
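Definition 1 is straightforward to evaluate in simulation, where the labels Hi are known. A minimal sketch (illustrative names, not the paper's code; `thresholds` holds the rule t(Xi) already evaluated at each hypothesis' feature):

```python
import numpy as np

def fdp(pvals, labels, thresholds):
    """Realized FDP of the rule 'discover hypothesis i if Pi < t(Xi)'.
    labels[i] = Hi (0 = null, 1 = alternative), known only in simulation."""
    discovered = pvals < thresholds
    d = discovered.sum()
    if d == 0:
        return 0.0  # convention: FDP of an empty discovery set is 0
    false_d = np.sum(discovered & (labels == 0))
    return false_d / d
```

FDR is then the average of this quantity over repeated draws of the data, which is how the simulation results later in the paper are evaluated.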
This is a stronger statement than the usual one in the FDR control literature of controlling FDR at level alpha.\n\nMotivating example. Consider a genetic association study where the genotype and phenotype (e.g. height) are measured in a population. Hypothesis i corresponds to testing the correlation between variant i and the individual's height. The null hypothesis is that there is no correlation, and Pi is the probability of observing equally or more extreme values than the empirically observed correlation conditional on the hypothesis being null (Hi = 0). A small Pi indicates that the null is unlikely. Here Hi = 1 (or 0) corresponds to the variant truly being (or not being) associated with height. The features Xi could include the location, conservation, etc. of the variant. Note that Xi is not used to compute Pi, but it could contain information about how likely the hypothesis is to be an alternative. Careful readers may notice that the distribution of Pi given Xi is uniform between 0 and 1 under the null and f1(p|x) under the alternative, which depends on x. This implies that Pi and Xi are independent under the null and dependent under the alternative.\n\nTo illustrate why modeling the features could improve discovery power, suppose hypothetically that all the variants truly associated with height reside on a single chromosome j* and the feature is the chromosome index of each SNP (see Fig. 2 (a)). Standard multiple testing methods ignore this feature and assign the same discovery threshold to all the chromosomes. As there are many purely noisy chromosomes, the p-value threshold must be very small in order to control FDR. In contrast, a method that learns the threshold t(x) could learn to assign a higher threshold to chromosome j* and 0 to other chromosomes.
Since a higher threshold leads to more discoveries (and a lower one to fewer), this would effectively ignore much of the noise and make more discoveries under the same FDR.\n\n3 Algorithm Description\n\nFigure 2: (a) Hypothetical example where small p-values are enriched at chromosome j*. (b) The mirroring estimator. (c) The training and cross validation procedure (train t*(x; theta) by optimizing (3), rescale by gamma* on the CV fold, then apply gamma* t*(x; theta) to the test fold).\n\nSince a smaller p-value presents stronger evidence against the null hypothesis, we consider threshold decision rules without loss of generality. As the null proportion pi_0(x) and the alternative distribution f1(p|x) vary with x, the threshold should also depend on x. Therefore, we can write the rule as t(x) in general, which claims hypothesis i to be significant if Pi < t(Xi). Let I be the indicator function. For t(x), the number of discoveries D(t) and the number of false discoveries FD(t) can be expressed as D(t) = sum_{i=1}^n I{Pi < t(Xi)} and FD(t) = sum_{i=1}^n I{Pi < t(Xi), Hi = 0}. Since the labels Hi are not observed, FD(t) must be estimated from the data. We estimate it by counting p-values in the mirror of the discovery region: for a decision rule t(x), define its mirroring region CM(t) := {(p, x) : p > 1 - t(x)}. The mirroring estimator of FD(t) is defined as FD_hat(t) = sum_i I{(Pi, Xi) in CM(t)}.\n\nThe mirroring estimator overestimates the number of false discoveries in expectation:\n\nLemma 1. (Positive bias of the mirroring estimator)\n\nE[FD_hat(t)] - E[FD(t)] = sum_{i=1}^n P[(Pi, Xi) in CM(t), Hi = 1] >= 0. (2)\n\nRemark 1. In practice, t(x) is always very small and f1(p|x) approaches 0 very fast as p -> 1. Then for any hypothesis with (Pi, Xi) in CM(t), Pi is very close to 1 and hence P(Hi = 1) is very small. In other words, the bias in (2) is much smaller than E[FD(t)], so the estimator is accurate. In addition, FD_hat(t) and FD(t) are both sums of n terms. Under mild conditions, they concentrate well around their means.
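Numerically, the mirroring estimator is a one-line count. A minimal sketch (illustrative names; `thresholds` is the vector of t(Xi) values per hypothesis, as above):

```python
import numpy as np

def num_discoveries(pvals, thresholds):
    """D(t): count of hypotheses with Pi < t(Xi)."""
    return int(np.sum(pvals < thresholds))

def mirror_fd_estimate(pvals, thresholds):
    """FD_hat(t): count of p-values in the mirroring region p > 1 - t(x).
    Under the uniform null, this region catches nulls at the same rate as
    the discovery region p < t(x), so the count estimates FD(t)."""
    return int(np.sum(pvals > 1.0 - thresholds))
```

The estimated FDP is then `mirror_fd_estimate(...) / num_discoveries(...)`, which is the quantity the algorithm constrains during training.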
Thus we should expect that FD_hat(t) approximates FD(t) well most of the time. We make this precise in Sec. 5 in the form of a high-probability FDP control statement.\n\nThird, we use cross validation to address the overfitting problem introduced by optimization. To be more specific, we divide the data into M folds. For fold j, the decision rule tj(x; theta), before being applied to fold j, is trained and cross validated on the rest of the data. The cross validation is done by rescaling the learned threshold tj(x) by a factor gamma_j so that the corresponding mirror estimate FDP_hat on the CV set is alpha. This will not introduce much additional overfitting since we are only searching over a scalar gamma. The discoveries in all M folds are merged as the final result. We note here that distinct folds correspond to subsets of hypotheses rather than samples used to compute the corresponding p-values. This procedure is shown in Fig. 2 (c). The details of the procedure as well as the FDP control property are also presented in Sec. 5.\n\nAlgorithm 1 NeuralFDR\n1: Randomly divide the data {(Pi, Xi)}_{i=1}^n into M folds.\n2: for fold j = 1, ..., M do\n3: Let the testing data be fold j, the CV data be some fold j' != j, and the training data be the rest.\n4: Train tj(x; theta) on the training data by optimizing\n\nmaximize_theta D(tj(theta)) s.t. FDP_hat(tj(theta)) <= alpha. (3)\n\n5: Rescale tj*(x; theta) by gamma_j* so that the estimated FDP on the CV data satisfies FDP_hat(gamma_j* tj*(theta)) = alpha.\n6: Apply gamma_j* tj*(theta) to the data in fold j (the testing data).\n7: Report the discoveries in all M folds.\n\nThe proposed method NeuralFDR is summarized as Alg. 1. There are two techniques that enable robust training of the neural network.
First, to have non-vanishing gradients, the indicator functions in (3) are substituted by sigmoid functions with the intensity parameters automatically chosen based on the dataset. Second, the training process of the neural network may be unstable if we use random initialization. Hence, we use an initialization method called the k-cluster initialization: 1) use k-means clustering to divide the data into k clusters based on the features; 2) compute the optimal threshold for each cluster based on the optimal group threshold condition ((7) in Sec. 5); 3) initialize the neural network by training it to fit a smoothed version of the computed thresholds. See Supp. Sec. 2 for more implementation details.\n\n4 Empirical Results\n\nWe evaluate our method using both simulated data and two real-world datasets3. The implementation details are in Supp. Sec. 2. We compare NeuralFDR with three other methods: the BH procedure (BH) [3], Storey's BH procedure (SBH) with threshold lambda = 0.4 [21], and Independent Hypothesis Weighting (IHW) with the number of bins and folds set to their defaults [15]. BH and SBH are the two most popular methods that do not use the hypothesis features, and IHW is the state-of-the-art method that utilizes hypothesis features. For IHW, in the multi-dimensional feature case, k-means is used to group the hypotheses. In all experiments, k is set to 20 and the group index is provided to IHW as the hypothesis feature. Other than the FDR control experiment, we set the nominal FDR level alpha = 0.1.\n\n3We released the software at https://github.com/fxia22/NeuralFDR\n\nFigure 3: FDP for (a) DataIHW and (b) 1DGM.
The dashed line indicates 45 degrees, which is optimal.\n\nTable 1: Simulated data: # of discoveries and gain over BH at FDR = 0.1.\n\nMethod | DataIHW | DataIHW(WD) | 1D GM | 1D slope | 2D GM | 2D slope | 5D GM\nBH | 2259 | 6674 | 8266 | 11794 | 9917 | 8473 | 9917\nSBH | 2651 (+17.3%) | 7844 (+17.5%) | 9227 (+11.62%) | 13593 (+15.3%) | 11334 (+14.2%) | 9539 (+12.58%) | 11334 (+14.28%)\nIHW | 5074 (+124.6%) | 10382 (+55.6%) | 11172 (+35.2%) | 12658 (+7.3%) | 12175 (+22.7%) | 8758 (+3.36%) | 11408 (+15.0%)\nNeuralFDR | 6222 (+175.4%) | 12153 (+82.1%) | 14899 (+80.2%) | 15781 (+33.8%) | 18844 (+90.0%) | 10318 (+21.7%) | 18364 (+85.1%)\n\nSimulated data. We first consider DataIHW, the simulated data in the IHW paper (Supp. 7.2.2 of [15]). Then, we use our own data, generated to have two feature structures commonly seen in practice: bumps and slopes. For the bumps, the alternative proportion pi_1(x) is generated from a Gaussian mixture (GM) so as to have a few peaks where alternative hypotheses are abundant. For the slopes, pi_1(x) is generated to depend linearly on the features. After generating pi_1(x), the p-values are generated following a beta mixture under the alternative and Unif(0, 1) under the null. We generated the data for both 1D and 2D cases, namely 1DGM, 2DGM, 1Dslope, 2Dslope. For example, Fig. 4 (a) shows the alternative proportion of 2Dslope. In addition, for the high-dimensional feature scenario, we generated a 5D dataset, 5DGM, which contains the same alternative proportion as 2DGM with 3 additional non-informative directions.\n\nWe first examine the FDR control property using DataIHW and 1DGM. Knowing the ground truth, we plot the FDP (actual FDR) over different values of the nominal FDR alpha in Fig. 3. For perfect FDR control, the curve should lie along the 45-degree dashed line. As we can see, all the methods control FDR.
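The calibration check behind this kind of plot amounts to sweeping the nominal level and recording the realized FDP on data with known ground truth. A self-contained sketch using BH as the decision rule (the simulation parameters here are ours, chosen only for illustration):

```python
import numpy as np

def bh_reject(pvals, alpha):
    """BH rejection set at nominal level alpha."""
    n = len(pvals)
    order = np.argsort(pvals)
    below = pvals[order] <= alpha * np.arange(1, n + 1) / n
    reject = np.zeros(n, dtype=bool)
    if below.any():
        reject[order[: np.max(np.nonzero(below)[0]) + 1]] = True
    return reject

def calibration_curve(pvals, labels, alphas):
    """Realized FDP of BH at each nominal level alpha."""
    curve = []
    for a in alphas:
        rej = bh_reject(pvals, a)
        d = rej.sum()
        curve.append(0.0 if d == 0 else float(np.sum(rej & (labels == 0)) / d))
    return curve

# Toy hypotheses: uniform null p-values, alternative p-values skewed toward 0.
rng = np.random.default_rng(0)
labels = (rng.random(2000) < 0.2).astype(int)      # ~20% alternatives
pvals = np.where(labels == 1,
                 rng.beta(0.3, 4.0, size=2000),    # small under the alternative
                 rng.random(2000))                 # Unif(0,1) under the null
curve = calibration_curve(pvals, labels, [0.05, 0.1, 0.2])
```

Plotting `curve` against the nominal levels and comparing to the 45-degree line reproduces the style of check described above.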
NeuralFDR controls FDR accurately while IHW tends to make overly conservative decisions. Second, we visualize the thresholds learned by NeuralFDR and IHW. As mentioned in Sec. 3, to make more discoveries, the learned threshold should roughly have the same shape as pi_1(x). The learned thresholds of NeuralFDR and IHW for 2Dslope are shown in Fig. 4 (b,c). As we can see, NeuralFDR recovers the slope structure well while IHW fails to assign the highest threshold to the bottom right block. IHW is forced to be piecewise constant while NeuralFDR can learn a smooth threshold, better recovering the structure of pi_1(x). In general, methods that partition the hypotheses into discrete groups would not scale to higher-dimensional features. In Appendix 1, we show that NeuralFDR is also able to recover the correct threshold for the Gaussian signal. Finally, we report the total numbers of discoveries in Tab. 1.\n\nIn addition, we ran an experiment with dependent p-values with the same dependency structure as Sec. 3.2 in [15]. We call this dataset DataIHW(WD). The numbers of discoveries are shown in Tab. 1. NeuralFDR has an actual FDP of 9.7% while making more discoveries than SBH and IHW. This empirically shows that NeuralFDR also works for weakly dependent data.\n\nAll numbers are averaged over 10 runs of the same simulation setting. We can see that NeuralFDR outperforms IHW in all simulated datasets. Moreover, it outperforms IHW by a large margin in multi-dimensional feature settings.\n\nFigure 4: (a-c) Results for 2Dslope: (a) the actual alternative proportion; (b) NeuralFDR's learned threshold; (c) IHW's learned threshold. (d-f) Each dot corresponds to one hypothesis and the red curve shows the threshold learned by NeuralFDR: (d) for log count for the Airway data; (e) for log distance for the GTEx data; (f) for expression level for the GTEx data.\n\nTable 2: Real data: # of discoveries at FDR = 0.1.\n\nMethod | Airway | GTEx-dist | GTEx-exp | GTEx-PhastCons | GTEx-2D | GTEx-3D\nBH | 4079 | 29348 | 29348 | 29348 | 29348 | 29348\nSBH | 4038 (-1.0%) | 29758 (+1.4%) | 29758 (+1.4%) | 29758 (+1.4%) | 29758 (+1.4%) | 29758 (+1.4%)\nIHW | 4873 (+19.5%) | 35771 (+21.9%) | 32195 (+9.7%) | 30241 (+3.0%) | 35705 (+21.7%) | 35598 (+21.3%)\nNeuralFDR | 6031 (+47.9%) | 36127 (+23.1%) | 32214 (+9.8%) | 30525 (+4.0%) | 37095 (+26.4%) | 37195 (+26.7%)\n\nAirway RNA-Seq data.
The Airway data [11] is an RNA-Seq dataset that contains n = 33469 genes; the aim is to identify glucocorticoid responsive (GC) genes that modulate cytokine function in airway smooth muscle cells. The p-values are obtained by a standard two-group differential analysis using DESeq2 [20]. We consider the log count for each gene as the hypothesis feature. As shown in the first column of Tab. 2, NeuralFDR makes 800 more discoveries than IHW. The threshold learned by NeuralFDR is shown in Fig. 4 (d). It increases monotonically with the log count, capturing the positive dependency relation. The learned structure is interpretable: low-count genes tend to have higher variances, which usually dominate the systematic difference between the two conditions; on the contrary, it is easier for high-count genes to show a strong signal for differential expression [15, 20].\n\nGTEx data. A major component of the GTEx [6] study is to quantify expression quantitative trait loci (eQTLs) in human tissues. In such an eQTL analysis, each pair of a single nucleotide polymorphism (SNP) and a nearby gene forms one hypothesis. Its p-value is computed under the null hypothesis that the SNP's genotype is not correlated with the gene expression. We obtained all the GTEx p-values from chromosome 1 in a brain tissue (interior caudate), corresponding to 10,623,893 SNP-gene combinations. In the original GTEx eQTL study, no features were considered in the FDR analysis, corresponding to running the standard BH or SBH on the p-values. However, many biological features affect whether a SNP is likely to be a true eQTL; i.e. these features could vary the alternative proportion pi_1(x), and accounting for them could increase the power to discover true eQTLs while guaranteeing that the FDR remains the same.
For each hypothesis, we generated three features: 1) the distance (GTEx-dist) between the SNP and the gene (measured in log base-pairs); 2) the average expression (GTEx-exp) of the gene across individuals (measured in log rpkm); 3) the evolutionary conservation measured by the standard PhastCons scores (GTEx-PhastCons).\n\nThe numbers of discoveries are shown in Tab. 2. For GTEx-2D, GTEx-dist and GTEx-exp are used. For NeuralFDR, the number of discoveries increases as we add more features, indicating that it works well with multi-dimensional features. For IHW, however, the number of discoveries decreases as more features are incorporated. This is because when the feature dimension becomes higher, each bin in IHW covers a larger space, decreasing the resolution of the piecewise constant function and preventing it from capturing the informative part of the feature.\n\nThe discovery thresholds learned by NeuralFDR are directly interpretable and match prior biological knowledge. Fig. 4 (e) shows that the threshold is higher when the SNP is closer to the gene. This allows more discoveries to be made among nearby SNPs, which is desirable since we know that most eQTLs tend to be in cis (i.e. nearby) rather than in trans (far away) from the target gene [6]. Fig. 4 (f) shows that the NeuralFDR threshold decreases as the gene expression becomes large. This also confirms known biology: highly expressed genes tend to be housekeeping genes, which are less variable across individuals and hence have fewer eQTLs [6]. Therefore it is desirable that NeuralFDR learns to place less emphasis on these genes. We also show in Supp. Sec. 1 that NeuralFDR learns to give a higher threshold to more conserved variants, which also matches biology.\n\n5 Theoretical Guarantees\n\nWe assume the tuples {(Pi, Xi, Hi)}_{i=1}^n are i.i.d.
samples from an empirical Bayes model:\n\nXi ~ mu(x), [Hi | Xi = x] ~ Bern(1 - pi_0(x)), [Pi | Hi = 0, Xi = x] ~ Unif(0, 1), [Pi | Hi = 1, Xi = x] ~ f1(p|x). (4)\n\nThe features Xi are drawn i.i.d. from some unknown distribution mu(x). Conditional on the feature Xi = x, hypothesis i is null with probability pi_0(x) and is alternative otherwise. The conditional distributions of the p-values are Unif(0, 1) under the null and f1(p|x) under the alternative.\n\nFDR control via cross validation. The cross validation procedure is described as follows. The data is divided randomly into M folds of equal size m = n/M. For fold j, let the testing set Dte(j) be fold j itself, the cross validation set Dcv(j) be any other fold, and the training set Dtr(j) be the remainder. The sizes of the three are m, m, and (M - 2)m respectively. For fold j, suppose at most L decision rules are calculated based on the training set, namely tj1, ..., tjL. Evaluated on the cross validation set, let the l*-th rule be the rule with the most discoveries among the rules satisfying 1) its mirroring estimate FDP_hat(tjl) <= alpha; 2) D(tjl)/m > c0, for some small constant c0 > 0. Then tjl* is selected to apply to the testing set (fold j). Finally, discoveries from all folds are combined.\n\nThe FDP control follows a standard cross validation argument. Intuitively, the FDPs of the rules {tjl}_{l=1}^L are estimated based on Dcv(j), a dataset independent of the training set. Hence there is no overfitting, and the overestimation property of the mirroring estimator, as in Lemma 1, remains statistically valid, leading to a conservative decision that controls FDP. This is formally stated below.\n\nTheorem 1. (FDP control) Let M be the number of folds and let L be the maximum number of decision rule candidates evaluated on the cross validation set. Then with probability at least 1 - delta, the overall FDP is less than (1 + epsilon) alpha, where epsilon = O(sqrt((M / (alpha n)) log(M L / delta))).\n\nRemark 2. There are two subtle points. First, L cannot be too large. Otherwise Dcv(j) may eventually be overfitted by being used too many times for FDP estimation. Second, the FDP estimates may be unstable if the probability of discovery E[D(tjl)/m] approaches 0. Indeed, the mirroring method estimates FDP by FDP_hat(tjl) = FD_hat(tjl)/D(tjl), where both FD_hat(tjl) and D(tjl) are i.i.d. sums of n Bernoulli random variables with means roughly alpha E[D(tjl)/m] and E[D(tjl)/m]. When their means are small, the concentration property will fail. So we need E[D(tjl)/m] to be bounded away from zero. Nevertheless, this is required in theory but may not be needed in practice.\n\nRemark 3. (Asymptotic FDR control under weak dependence) Besides the i.i.d. case, NeuralFDR can also be extended to control FDR asymptotically under weak dependence [13, 21]. Generalizing the concept in [13] from discrete groups to continuous features X, the data are under weak dependence if the CDFs of (Pi, Xi) for both the null and the alternative proportions converge almost surely to their true values respectively. The linkage disequilibrium (LD) in GWAS and the correlated genes in RNA-Seq can be addressed by such a dependence structure. In this case, if the learned threshold is c-Lipschitz continuous for some constant c, NeuralFDR will control FDR asymptotically. The Lipschitz continuity can be achieved, for example, by weight clipping [2], i.e. clamping the weights to a bounded set after each gradient update when training the neural network. See Supp. 3 for details.\n\nOptimal decision rule with infinite hypotheses. When n = \u221e, we can recover the joint density fPX(p, x). Based on that, the explicit form of the optimal decision rule can be obtained if we are willing to further assume that f1(p|x) is monotonically non-increasing w.r.t. p. This rule is used for the k-cluster initialization of NeuralFDR as mentioned in Sec. 3. Now suppose we know fPX(p, x).
Then \u00b5(x) and fP|X(p|x) can also be determined. Furthermore, as\n1\u21e10(x) (fP|X(p|x) \u21e10(x)), once we specify \u21e10(x), the entire model is speci\ufb01ed.\nf1(p|x) =\nLet S(fP X) be the set of null proportions \u21e10(x) that produces the model consistent with fP X.\nBecause f1(p|x) 0, we have 8p, x,\u21e1 0(x) \uf8ff fP|X(p|x). This can be further simpli\ufb01ed as\n\u21e10(x) \uf8ff fP|X(1|x) by recalling that fP|X(p|x) is monotonically decreasing w.r.t. p. Then we know\n(5)\n\nS(fP X) = {\u21e10(x) : 8x,\u21e1 0(x) \uf8ff fP|X(1|x)}.\n\n1\n\nGiven fP X(p, x), the model is not fully identi\ufb01able. Hence we should look for a rule t that\nmaximizes the power while controlling FDP for all elements in S(fP X). For (P1, X1, H1) \u21e0\n(fP X,\u21e1 0, f1) following (4), the probability of discovery and the probability of false discovery are\nPD(t, fP X) = P(P1 \uf8ff t(X1)), PF D(t, fP X,\u21e1 0) = P(P1 \uf8ff t(X1), H1 = 0). Then the FDP\nis F DP (t, fP X,\u21e1 0) = PF D(t,fP X,\u21e10)\n. In this limiting case, all quantities are deterministic and\nFDP coincides with FDR. Given that the FDP is controlled, maximizing the power is equivalent to\nmaximizing the probability of discovery. Then we have the following minimax problem:\n\nPD(t,fP X)\n\nmax\n\nt\n\nmin\n\n\u21e102S(fP X)\n\nPD(t, fP X)\n\ns.t.\n\nmax\n\n\u21e102S(fP X)\n\nF DP (t, fP X,\u21e1 0) \uf8ff \u21b5,\n\n(6)\n\nwhere S(fP X) is the set of possible null proportions consistent with fP X, as de\ufb01ned in (5).\nTheorem 2. Fixing fP X and let \u21e1\u21e40(x) = fP|X(1|x). If f1(p|x) is monotonically non-increasing\nw.r.t. p, the solution to problem (6), t\u21e4(x), satis\ufb01es\n\n1.\n\nfP X(1, x)\n\nfP X(t\u21e4(x), x)\n\n= const, almost surely w.r.t. \u00b5(x)\n\n2. F DR(t\u21e4, fP X,\u21e1 \u21e40) = \u21b5.\n\n(7)\n\nRemark 4. 
To compute the optimal rule $t^*$ from the conditions (7), consider any $t$ that satisfies (7.1). According to (7.1), once we specify the value of $t(x)$ at any single location $x$, say $t(0)$, the entire function is determined. Also, $FDP(t, f_{PX}, \pi_0^*)$ is monotonically non-decreasing w.r.t. $t(0)$. This suggests the following strategy: start with $t(0) = 0$ and keep increasing $t(0)$ until the corresponding FDP equals $\alpha$, which yields the optimal threshold $t^*$. Similar conditions are also mentioned in [15, 16].

6 Discussion

We proposed NeuralFDR, an end-to-end algorithm that learns the discovery threshold from hypothesis features. We showed that the algorithm controls FDR and makes more discoveries on synthetic and real datasets with multi-dimensional features. While the results are promising, there are also a few challenges. First, we noticed that NeuralFDR performs better when both the number of hypotheses and the alternative proportion are large. Indeed, in order to have large gradients for the optimization, we need many elements near the decision boundary $t(x)$ and the mirroring boundary $1 - t(x)$. It is important to improve the performance of NeuralFDR on small datasets with small alternative proportions. Second, we found that a 10-layer MLP modeled the decision threshold well, while shallower networks performed more poorly. A better understanding of which network architectures optimally capture the signal in the data is also an important question.

References

[1] Ery Arias-Castro, Shiyun Chen, et al. Distribution-free multiple testing. Electronic Journal of Statistics, 11(1):1983–2001, 2017.

[2] Martin Arjovsky, Soumith Chintala, and Léon Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.

[3] Yoav Benjamini and Yosef Hochberg. Controlling the false discovery rate: a practical and powerful approach to multiple testing. Journal of the Royal Statistical Society.
Series B (Methodological), pages 289–300, 1995.

[4] Yoav Benjamini and Yosef Hochberg. Multiple hypotheses testing with weights. Scandinavian Journal of Statistics, 24(3):407–418, 1997.

[5] Simina M Boca and Jeffrey T Leek. A regression framework for the proportion of true null hypotheses. bioRxiv, page 035675, 2015.

[6] GTEx Consortium et al. The Genotype-Tissue Expression (GTEx) pilot analysis: multitissue gene regulation in humans. Science, 348(6235):648–660, 2015.

[7] John Duchi, Elad Hazan, and Yoram Singer. Adaptive subgradient methods for online learning and stochastic optimization. Journal of Machine Learning Research, 12(Jul):2121–2159, 2011.

[8] Olive Jean Dunn. Multiple comparisons among means. Journal of the American Statistical Association, 56(293):52–64, 1961.

[9] Bradley Efron. Simultaneous inference: when should hypothesis testing problems be combined? The Annals of Applied Statistics, pages 197–223, 2008.

[10] Christopher R Genovese, Kathryn Roeder, and Larry Wasserman. False discovery control with p-value weighting. Biometrika, pages 509–524, 2006.

[11] Blanca E Himes, Xiaofeng Jiang, Peter Wagner, Ruoxi Hu, Qiyu Wang, Barbara Klanderman, Reid M Whitaker, Qingling Duan, Jessica Lasky-Su, Christina Nikolos, et al. RNA-seq transcriptome profiling identifies CRISPLD2 as a glucocorticoid responsive gene that modulates cytokine function in airway smooth muscle cells. PLoS ONE, 9(6):e99625, 2014.

[12] Sture Holm. A simple sequentially rejective multiple test procedure. Scandinavian Journal of Statistics, pages 65–70, 1979.

[13] James X Hu, Hongyu Zhao, and Harrison H Zhou. False discovery rate control with groups. Journal of the American Statistical Association, 105(491):1215–1227, 2010.

[14] Nikolaos Ignatiadis and Wolfgang Huber. Covariate-powered weighted multiple testing with false discovery rate control.
arXiv preprint arXiv:1701.05179, 2017.

[15] Nikolaos Ignatiadis, Bernd Klaus, Judith B Zaugg, and Wolfgang Huber. Data-driven hypothesis weighting increases detection power in genome-scale multiple testing. Nature Methods, 13(7):577–580, 2016.

[16] Lihua Lei and William Fithian. AdaPT: an interactive procedure for multiple testing with side information. arXiv preprint arXiv:1609.06035, 2016.

[17] Lihua Lei and William Fithian. Power of ordered hypothesis testing. In International Conference on Machine Learning, pages 2924–2932, 2016.

[18] Lihua Lei, Aaditya Ramdas, and William Fithian. STAR: a general interactive framework for FDR control under structural constraints. arXiv preprint arXiv:1710.02776, 2017.

[19] Ang Li and Rina Foygel Barber. Multiple testing with the structure adaptive Benjamini-Hochberg algorithm. arXiv preprint arXiv:1606.07926, 2016.

[20] Michael I Love, Wolfgang Huber, and Simon Anders. Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2. Genome Biology, 15(12):550, 2014.

[21] John D Storey, Jonathan E Taylor, and David Siegmund. Strong control, conservative point estimation and simultaneous conservative consistency of false discovery rates: a unified approach. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 66(1):187–205, 2004.