{"title": "Learning to Pivot with Adversarial Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 981, "page_last": 990, "abstract": "Several techniques for domain adaptation have been proposed to account for differences in the distribution of the data used for training and testing. The majority of this work focuses on a binary domain label. Similar problems occur in a scientific context where there may be a continuous family of plausible data generation processes associated to the presence of systematic uncertainties. Robust inference is possible if it is based on a pivot -- a quantity whose distribution does not depend on the unknown values of the nuisance parameters that parametrize this family of data generation processes. In this work, we introduce and derive theoretical results for a training procedure based on adversarial networks for enforcing the pivotal property (or, equivalently, fairness with respect to continuous attributes) on a predictive model. The method includes a hyperparameter to control the trade-off between accuracy and robustness. We demonstrate the effectiveness of this approach with a toy example and examples from particle physics.", "full_text": "Learning to Pivot with Adversarial Networks\n\nGilles Louppe\n\nNew York University\ng.louppe@nyu.edu\n\nMichael Kagan\n\nSLAC National Accelerator Laboratory\n\nmakagan@slac.stanford.edu\n\nKyle Cranmer\n\nNew York University\n\nkyle.cranmer@nyu.edu\n\nAbstract\n\nSeveral techniques for domain adaptation have been proposed to account for\ndifferences in the distribution of the data used for training and testing. The majority\nof this work focuses on a binary domain label. Similar problems occur in a scienti\ufb01c\ncontext where there may be a continuous family of plausible data generation\nprocesses associated to the presence of systematic uncertainties. 
Robust inference\nis possible if it is based on a pivot \u2013 a quantity whose distribution does not depend\non the unknown values of the nuisance parameters that parametrize this family\nof data generation processes. In this work, we introduce and derive theoretical\nresults for a training procedure based on adversarial networks for enforcing the\npivotal property (or, equivalently, fairness with respect to continuous attributes) on\na predictive model. The method includes a hyperparameter to control the trade-\noff between accuracy and robustness. We demonstrate the effectiveness of this\napproach with a toy example and examples from particle physics.\n\n1\n\nIntroduction\n\nMachine learning techniques have been used to enhance a number of scienti\ufb01c disciplines, and they\nhave the potential to transform even more of the scienti\ufb01c process. One of the challenges of applying\nmachine learning to scienti\ufb01c problems is the need to incorporate systematic uncertainties, which\naffect both the robustness of inference and the metrics used to evaluate a particular analysis strategy.\nIn this work, we focus on supervised learning techniques where systematic uncertainties can be\nassociated to a data generation process that is not uniquely speci\ufb01ed. In other words, the lack of\nsystematic uncertainties corresponds to the (rare) case that the process that generates training data is\nunique, fully speci\ufb01ed, and an accurate representative of the real world data. By contrast, a common\nsituation when systematic uncertainty is present is when the training data are not representative\nof the real data. Several techniques for domain adaptation have been developed to create models\nthat are more robust to this binary type of uncertainty. A more generic situation is that there are\nseveral plausible data generation processes, speci\ufb01ed as a family parametrized by continuous nuisance\nparameters, as is typically found in scienti\ufb01c domains. 
In this broader context, statisticians have long been working on robust inference techniques based on the concept of a pivot – a quantity whose distribution is invariant with the nuisance parameters (see e.g., Degroot and Schervish, 1975).\nAssuming a probability model p(X, Y, Z), where X are the data, Y are the target labels, and Z are the nuisance parameters, we consider the problem of learning a predictive model f(X) for Y conditional on the observed values of X that is robust to uncertainty in the unknown value of Z. We introduce a flexible learning procedure based on adversarial networks (Goodfellow et al., 2014) for enforcing that f(X) is a pivot with respect to Z. We derive theoretical results proving that the procedure converges towards a model that is both optimal and statistically independent of the nuisance parameters (if that model exists) or for which one can tune a trade-off between accuracy and robustness (e.g., as driven by a higher-level objective). In particular, and to the best of our knowledge, our contribution is the first solution for imposing pivotal constraints on a predictive model, working regardless of the\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n[Figure 1 diagram: a classifier f with parameters θf maps X to f(X; θf), with loss Lf(θf); an adversary r with parameters θr takes f(X; θf) as input and outputs the parameters γ1(f(X; θf); θr), γ2(f(X; θf); θr), . . . of a density P(γ1, γ2, . . .) modeling pθr(Z|f(X; θf)), with loss Lr(θf, θr).]\n\nFigure 1: Architecture for the adversarial training of a binary classifier f against a nuisance parameter Z. The adversary r models the distribution p(z|f(X; θf) = s) of the nuisance parameter as observed only through the output f(X; θf) of the classifier.
By maximizing the antagonistic objective Lr(θf, θr), the classifier f forces p(z|f(X; θf) = s) towards the prior p(z), which happens when f(X; θf) is independent of the nuisance parameter Z and therefore pivotal.\n\ntype of the nuisance parameter (discrete or continuous) or of its prior. Finally, we demonstrate the effectiveness of the approach with a toy example and examples from particle physics.\n\n2 Problem statement\nWe begin with a family of data generation processes p(X, Y, Z), where X ∈ X are the data, Y ∈ Y are the target labels, and Z ∈ Z are the nuisance parameters that can be continuous or categorical. Let us assume that prior to incorporating the effect of uncertainty in Z, our goal is to learn a regression function f : X → S with parameters θf (e.g., a neural network-based probabilistic classifier) that minimizes a loss Lf(θf) (e.g., the cross-entropy). In classification, values s ∈ S = R^|Y| correspond to the classifier scores used for mapping hard predictions y ∈ Y, while S = Y for regression.\nWe augment our initial objective so that inference based on f(X; θf) will be robust to the value z ∈ Z of the nuisance parameter Z – which remains unknown at test time. A formal way of enforcing robustness is to require that the distribution of f(X; θf) conditional on Z (and possibly Y) be invariant with the nuisance parameter Z. Thus, we wish to find a function f such that\n\np(f(X; θf) = s|z) = p(f(X; θf) = s|z′)\n\n(1)\n\nfor all z, z′ ∈ Z and all values s ∈ S of f(X; θf). In words, we are looking for a predictive function f which is a pivotal quantity with respect to the nuisance parameters. This implies that f(X; θf) and Z are independent random variables.\nAs stated in Eqn. 1, the pivotal quantity criterion is imposed with respect to p(X|Z) where Y is marginalized out.
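As an illustrative aside (not part of the original paper), the pivotal criterion of Eqn. 1 can be checked empirically by comparing the distribution of f(X) across nuisance values. The numpy sketch below uses a hypothetical generation model and two hypothetical score functions, `f_naive` and `f_pivot`, where only the second feature shifts with z; all names are ours.

```python
import numpy as np

rng = np.random.default_rng(0)

def score_distribution(f, z, n=100_000, bins=20):
    # Empirical distribution of f(X) for data generated at a fixed nuisance value z.
    # Hypothetical generation model: the second feature shifts with z, the first does not.
    x = rng.normal(loc=[0.0, z], scale=1.0, size=(n, 2))
    hist, _ = np.histogram(f(x), bins=bins, range=(0.0, 1.0))
    return hist / n

def total_variation(f, z0=-1.0, z1=1.0):
    # TV distance between score distributions at two nuisance values;
    # it is near zero for a pivot (Eqn. 1) and large otherwise.
    return 0.5 * float(np.abs(score_distribution(f, z0) - score_distribution(f, z1)).sum())

f_naive = lambda x: 1.0 / (1.0 + np.exp(-x[:, 1]))  # depends on the z-shifted feature
f_pivot = lambda x: 1.0 / (1.0 + np.exp(-x[:, 0]))  # ignores it: a pivot by construction

print(total_variation(f_naive))  # large: f(X) and Z are dependent
print(total_variation(f_pivot))  # small: only sampling noise remains
```

The adversarial procedure of Sec. 3 can be seen as driving this discrepancy towards zero during training, rather than merely measuring it afterwards.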
In some situations however (see e.g., Sec. 5.2), class-conditional independence of f(X; θf) on the nuisance Z is preferred, which can then be stated as requiring\n\np(f(X; θf) = s|z, y) = p(f(X; θf) = s|z′, y)\n\n(2)\n\nfor one or several specified values y ∈ Y.\n\n3 Method\n\nJoint training of adversarial networks was first proposed by (Goodfellow et al., 2014) as a way to build a generative model capable of producing samples from random noise z. More specifically, the authors pit a generative model g : R^n → R^p against an adversarial classifier d : R^p → [0, 1] whose antagonistic objective is to recognize real data X from generated data g(Z). Both models g and d are trained simultaneously, in such a way that g learns to produce samples that are difficult to identify by d, while d incrementally adapts to changes in g. At the equilibrium, g models a distribution whose samples can be identified by d only by chance. That is, assuming enough capacity in d and g, the distribution of g(Z) eventually converges towards the real distribution of X.\n\nAlgorithm 1 Adversarial training of a classifier f against an adversary r.\nInputs: training data {xi, yi, zi}_{i=1}^{N}; Outputs: θ̂f, θ̂r.\n1: for t = 1 to T do\n2:   for k = 1 to K do\n3:     Sample minibatch {xm, zm, sm = f(xm; θf)}_{m=1}^{M} of size M;\n4:     With θf fixed, update r by ascending its stochastic gradient ∇θr E(θf, θr) := ∇θr Σ_{m=1}^{M} log pθr(zm|sm);\n5:   end for\n6:   Sample minibatch {xm, ym, zm, sm = f(xm; θf)}_{m=1}^{M} of size M;\n7:   With θr fixed, update f by descending its stochastic gradient ∇θf E(θf, θr) := ∇θf Σ_{m=1}^{M} [− log pθf(ym|xm) + log pθr(zm|sm)], where pθf(ym|xm) denotes 1(ym = 0)(1 − sm) + 1(ym = 1)sm;\n8: end for\n\nIn this work, we repurpose adversarial networks as a means to constrain the predictive model f in order to satisfy Eqn. 1. As illustrated in Fig. 1, we pit f against an adversarial model r := pθr(z|f(X; θf) = s) with parameters θr and associated loss Lr(θf, θr). This model takes as input realizations s of f(X; θf) and produces as output a function modeling the posterior probability density pθr(z|f(X; θf) = s). Intuitively, if p(f(X; θf) = s|z) varies with z, then the corresponding correlation can be captured by r. By contrast, if p(f(X; θf) = s|z) is invariant with z, as we require, then r should perform poorly and be close to random guessing. Training f such that it additionally minimizes the performance of r therefore acts as a regularization towards Eqn. 1.\nIf Z takes discrete values, then pθr can be represented as a probabilistic classifier R → R^|Z| whose jth output (for j = 1, . . . , |Z|) is the estimated probability mass pθr(zj|f(X; θf) = s). Similarly, if Z takes continuous values, then we can model the posterior probability density p(z|f(X; θf) = s) with a sufficiently flexible parametric family of distributions P(γ1, γ2, . . .), where the parameters γj depend on f(X; θf) and θr. The adversary r may take any form, i.e. it does not need to be a neural network, as long as it exposes a differentiable function pθr(z|f(X; θf) = s) of sufficient capacity to represent the true distribution. Fig. 1 illustrates a concrete example where pθr(z|f(X; θf) = s) is a mixture of gaussians, as modeled with a mixture density network (Bishop, 1994). The jth output corresponds to the estimated value of the corresponding parameter γj of that distribution (e.g., the mean, variance and mixing coefficients of its components).
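For concreteness, once the adversary has produced the parameters γj for a given score s, the density pθr(z|f(X; θf) = s) is just a gaussian mixture that can be evaluated at any z. The sketch below (illustrative only; the parameter values are hypothetical, and two components are used instead of the five of Sec. 5.1) evaluates such a mixture and checks that it normalizes.

```python
import numpy as np

def mixture_pdf(z, means, stds, weights):
    # p(z | f(X) = s) for a 1D mixture of gaussians whose parameters
    # (the gammas of Fig. 1) would be produced by the adversary for a given s.
    z = np.asarray(z, dtype=float)[..., None]
    comp = np.exp(-0.5 * ((z - means) / stds) ** 2) / (stds * np.sqrt(2.0 * np.pi))
    return comp @ weights

# Hypothetical parameter values for one score s (two components for brevity):
means, stds, weights = np.array([-1.0, 1.0]), np.array([0.5, 0.5]), np.array([0.3, 0.7])

grid = np.linspace(-10.0, 10.0, 20_001)
mass = float((mixture_pdf(grid, means, stds, weights) * (grid[1] - grid[0])).sum())
nll = float(-np.log(mixture_pdf(0.0, means, stds, weights)))  # adversary's loss -log p(z|s) at z = 0
```

The negative log of this density, averaged over a minibatch, is exactly the quantity the adversary descends in line 4 of Algorithm 1 (with opposite sign).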
The estimated probability density pθr(z|f(X; θf) = s) can then be evaluated for any z ∈ Z and any score s ∈ S.\nAs with generative adversarial networks, we propose to train f and r simultaneously, which we carry out by considering the value function\n\nE(θf, θr) = Lf(θf) − Lr(θf, θr)\n\n(3)\n\nthat we optimize by finding the minimax solution\n\nθ̂f, θ̂r = arg min_{θf} max_{θr} E(θf, θr).\n\n(4)\n\nWithout loss of generality, the adversarial training procedure to obtain (θ̂f, θ̂r) is formally presented in Algorithm 1 in the case of a binary classifier f : R^p → [0, 1] modeling p(Y = 1|X). For reasons further explained in Sec. 4, Lf and Lr are respectively set to the expected value of the negative log-likelihood of Y|X under f and of Z|f(X; θf) under r:\n\nLf(θf) = E_{x∼X} E_{y∼Y|x}[− log pθf(y|x)],\n\n(5)\n\nLr(θf, θr) = E_{s∼f(X;θf)} E_{z∼Z|s}[− log pθr(z|s)].\n\n(6)\n\nThe optimization algorithm consists of alternating stochastic gradient steps for solving Eqn. 4. Finally, in the case of a class-conditional pivot, the settings are the same, except that the adversarial term Lr(θf, θr) is restricted to Y = y.\n\n4 Theoretical results\nIn this section, we show that in the setting of Algorithm 1, where Lf and Lr are respectively set to the expected value of the negative log-likelihood of Y|X under f and of Z|f(X; θf) under r, the minimax solution of Eqn.
4 corresponds to a classifier f which is a pivotal quantity.\nIn this setting, the nuisance parameter Z is considered as a random variable with prior p(Z), and our goal is to find a function f(·; θf) such that f(X; θf) and Z are independent random variables. Importantly, classification of Y with respect to X is considered in the context where Z is marginalized out, which means that the classifier minimizing Lf is optimal with respect to Y|X, but not necessarily with respect to Y|X, Z. Results hold for a nuisance parameter Z taking either categorical or continuous values. By abuse of notation, H(Z) denotes the differential entropy in the latter case. Finally, the proposition below is derived in a non-parametric setting, by assuming that both f and r have enough capacity.\n\nProposition 1. If there exists a minimax solution (θ̂f, θ̂r) for Eqn. 4 such that E(θ̂f, θ̂r) = H(Y|X) − H(Z), then f(·; θ̂f) is both an optimal classifier and a pivotal quantity.\n\nProof. For fixed θf, the adversary r is optimal at\n\nθ̂r = arg max_{θr} E(θf, θr) = arg min_{θr} Lr(θf, θr),\n\n(7)\n\nin which case pθ̂r(z|f(X; θf) = s) = p(z|f(X; θf) = s) for all z and all s, and Lr reduces to the expected entropy E_{s∼f(X;θf)}[H(Z|f(X; θf) = s)] of the conditional distribution of the nuisance parameters. This expectation corresponds to the conditional entropy of the random variables Z and f(X; θf) and can be written as H(Z|f(X; θf)). Accordingly, the value function E can be restated as a function depending on θf only:\n\nE′(θf) = Lf(θf) − H(Z|f(X; θf)).\n\n(8)\n\nIn particular, we have the lower bound\n\nH(Y|X) − H(Z) ≤ Lf(θf) − H(Z|f(X; θf))\n\n(9)\n\nwhere the equality holds at θ̂f = arg min_{θf} E′(θf) when:\n\n• θ̂f minimizes the negative log-likelihood of Y|X under f, which happens when θ̂f are the parameters of an optimal classifier. In this case, Lf reduces to its minimum value H(Y|X).\n• θ̂f maximizes the conditional entropy H(Z|f(X; θf)), since H(Z|f(X; θf)) ≤ H(Z) from the properties of entropy. Note that this latter inequality holds for both the discrete and the differential definitions of entropy.\n\nBy assumption, the lower bound is active, thus we have H(Z|f(X; θf)) = H(Z) because of the second condition, which happens exactly when Z and f(X; θf) are independent variables. In other words, the optimal classifier f(·; θ̂f) is also a pivotal quantity.\n\nProposition 1 suggests that if at each step of Algorithm 1 the adversary r is allowed to reach its optimum given f (e.g., by setting K sufficiently high) and if f is updated to improve Lf(θf) − H(Z|f(X; θf)) with sufficiently small steps, then f should converge to a classifier that is both optimal and pivotal, provided such a classifier exists. Therefore, the adversarial term Lr can be regarded as a way to select, among the class of all optimal classifiers, a function f that is also pivotal.\nDespite the former theoretical characterization of the minimax solution of Eqn.
4, let us note that formal guarantees of convergence towards that solution by Algorithm 1, in the case where a finite number K of steps is taken for r, remain to be proven.\nIn practice, the assumption of existence of an optimal and pivotal classifier may not hold because the nuisance parameter directly shapes the decision boundary. In this case, the lower bound\n\nH(Y|X) − H(Z) < Lf(θf) − H(Z|f(X; θf))\n\n(10)\n\nis strict: f can either be an optimal classifier or a pivotal quantity, but not both simultaneously. In this situation, it is natural to rewrite the value function E as\n\nEλ(θf, θr) = Lf(θf) − λLr(θf, θr),\n\n(11)\n\nFigure 2: Toy example. (Left) Conditional probability densities of the decision scores at Z = −σ, 0, σ without adversarial training. The resulting densities are dependent on the continuous parameter Z, indicating that f is not pivotal. (Middle left) The associated decision surface, highlighting the fact that samples are easier to classify for values of Z above σ, hence explaining the dependency. (Middle right) Conditional probability densities of the decision scores at Z = −σ, 0, σ when f is built with adversarial training. The resulting densities are now almost identical to each other, indicating only a small dependency on Z. (Right) The associated decision surface, illustrating how adversarial training bends the decision function vertically to erase the dependency on Z.\n\nwhere λ ≥ 0 is a hyper-parameter controlling the trade-off between the performance of f and its independence with respect to the nuisance parameter. Setting λ to a large value preferentially enforces f to be pivotal, while setting λ close to 0 rather constrains f to be optimal. When the
When the\nlower bound is strict, let us note however that there may exist distinct but equally good solutions \u03b8f , \u03b8r\nminimizing Eqn. 11. In this zero-sum game, an increase in accuracy would exactly be compensated\nby a decrease in pivotality and vice-versa. How to best navigate this Pareto frontier to maximize a\nhigher-level objective remains a question open for future works.\nInterestingly, let us \ufb01nally emphasize that our results hold using only the (1D) output s of f (\u00b7; \u03b8f ) as\ninput to the adversary. We could similarly enforce an intermediate representation of the data to be\npivotal, e.g. as in (Ganin and Lempitsky, 2014), but this is not necessary.\n\n5 Experiments\n\nIn this section, we empirically demonstrate the effectiveness of the approach with a toy example\nand examples from particle physics. Notably, there are no other other approaches to compare to in\nthe case of continuous nuisance parameters, as further explained in Sec. 6. In the case of binary\nparameters, we do not expect results to be much different from previous works. The source code to\nreproduce the experiments is available online 1.\n\n5.1 A toy example with a continous nuisance parameter\n\nAs a guiding toy example, let us consider the binary classi\ufb01cation of 2D data drawn from multivariate\ngaussians with equal priors, such that\n\n(cid:18)\n(cid:18)\n\nx \u223c N\n\nx|Z = z \u223c N\n\n(cid:20) 1\n(cid:20)1\n\n\u22120.5\n\n0\n\n(0, 0),\n\n(1, 1 + z),\n\n(cid:21)(cid:19)\n\n\u22120.5\n1\n\n(cid:21)(cid:19)\n\n0\n1\n\nwhen Y = 0,\n\nwhen Y = 1.\n\n(12)\n\n(13)\n\nThe continuous nuisance parameter Z here represents our uncertainty about the location of the mean\nof the second gaussian. 
Our goal is to build a classifier f(·; θf) for predicting Y given X, but such that the probability distribution of f(X; θf) is invariant with respect to the nuisance parameter Z.\nAssuming a gaussian prior z ∼ N(0, 1), we generate data {xi, yi, zi}_{i=1}^{N}, from which we train a neural network f minimizing Lf(θf) without considering its adversary r. The network architecture comprises 2 dense hidden layers of 20 nodes respectively with tanh and ReLU activations, followed by a dense output layer with a single node with a sigmoid activation. As shown in Fig. 2, the resulting classifier is not pivotal, as the conditional probability densities of its decision scores f(X; θf) show large discrepancies between values z of the nuisance parameters.\n\n¹https://github.com/glouppe/paper-learning-to-pivot\n\nFigure 3: Toy example. Training curves for Lf(θf), Lr(θf, θr) and Lf(θf) − λLr(θf, θr). Initialized with a pre-trained classifier f, adversarial training was performed for 200 iterations, with mini-batches of size M = 128, K = 500 and λ = 50.\n\nFigure 4: Physics example. Approximate median significance as a function of the decision threshold on the output of f. At λ = 10, trading accuracy for independence to pileup results in a net benefit in terms of statistical significance.
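The three quantities tracked in Fig. 3 can be written down directly. The sketch below (illustrative only; the helper names are ours, and a single-gaussian adversary stands in for the mixture density network) evaluates Lf, Lr and the combined objective Lf − λLr on a random batch.

```python
import numpy as np

def classifier_nll(y, s):
    # L_f: mean negative log-likelihood of Y|X for classifier scores s = f(x)
    return float(-np.mean(y * np.log(s) + (1 - y) * np.log(1 - s)))

def adversary_nll(z, mu, sigma):
    # L_r: mean negative log-likelihood of Z|f(X) under a gaussian adversary model
    return float(np.mean(0.5 * np.log(2.0 * np.pi * sigma ** 2)
                         + (z - mu) ** 2 / (2.0 * sigma ** 2)))

rng = np.random.default_rng(1)
y = rng.integers(0, 2, size=1000)                       # placeholder labels
s = np.clip(rng.uniform(size=1000), 1e-6, 1 - 1e-6)     # placeholder scores
z = rng.normal(size=1000)                               # placeholder nuisance values

lam = 50.0                                              # the lambda used in Fig. 3
L_f = classifier_nll(y, s)
L_r = adversary_nll(z, mu=0.0, sigma=1.0)
objective = L_f - lam * L_r                             # the quantity of Eqn. 11
```

In Algorithm 1, r ascends the gradient of −L_r while f descends the gradient of L_f − λL_r, so these two scalars are exactly what the training curves of Fig. 3 monitor.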
While not shown here, a classifier trained only from data generated at the nominal value Z = 0 would also not be pivotal.\nLet us now consider the joint training of f against an adversary r implemented as a mixture density network modeling Z|f(X; θf) as a mixture of five gaussians. The network architecture of r comprises 2 dense hidden layers of 20 nodes with ReLU activations, followed by an output layer of 15 nodes corresponding to the means, standard deviations and mixture coefficients of the gaussians. Output nodes for the means come with linear activations, output nodes for the standard deviations with exponential activations to ensure positivity, while output nodes for the mixture coefficients implement the softmax function to ensure positivity and normalization. When running Algorithm 1 initialized with the classifier f obtained previously, adversarial training effectively reshapes the decision function so that it becomes almost independent of the nuisance parameter, as shown in Fig. 2. The conditional probability densities of the decision scores f(X; θf) are now very similar to each other, indicating only a residual dependency on the nuisance, as theoretically expected. The dynamics of adversarial training are illustrated in Fig. 3, where the losses Lf, Lr and Lf − λLr are evaluated after each iteration. In the first iterations, we observe that the global objective Lf − λLr is minimized by making the classifier less accurate (hence the corresponding increase of Lf) but more pivotal (hence the associated increase of Lr and the total net benefit).
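The output-layer activations of the adversary described above can be made concrete. The sketch below (illustrative only; the function name and raw outputs are hypothetical) maps the 15 raw network outputs to valid mixture parameters, with exp enforcing positive standard deviations and softmax enforcing normalized mixture coefficients.

```python
import numpy as np

def mdn_params(raw):
    # Map the 15 raw outputs of the adversary r to mixture parameters:
    # 5 means (linear), 5 standard deviations (exp, positivity),
    # 5 mixture coefficients (softmax, positivity and normalization).
    means = raw[:5]
    stds = np.exp(raw[5:10])
    e = np.exp(raw[10:] - raw[10:].max())   # numerically stabilized softmax
    weights = e / e.sum()
    return means, stds, weights

raw = np.linspace(-1.0, 1.0, 15)            # placeholder network outputs
means, stds, weights = mdn_params(raw)
```

Whatever values the hidden layers produce, the resulting (means, stds, weights) always parametrize a proper five-component gaussian mixture, which is what makes the adversary's log-likelihood well defined at every training step.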
As learning progresses, minimizing E requires making predictions that are more accurate, hence decreasing Lf, or that are even less dependent on Z, hence shaping pθr towards the prior p(Z). Indeed, Lf eventually starts decreasing, while remaining bounded from below by min_{θf} Lf(θf), as approximated by the dashed line in the first plot. Similarly, Lr tends towards the differential entropy H(Z) of the prior (where H(Z) = log(σ√(2πe)) ≈ 1.419 in the case of a standard normal), as shown by the dashed line in the second plot. Finally, let us note that the ideal situation of a classifier that is both optimal and pivotal is unreachable for this problem, as shown in the third plot by the offset between Lf − λLr and the dashed line approximating H(Y|X) − λH(Z).\n\n5.2 High energy physics examples\n\nBinary Case Experiments at high energy colliders like the LHC (Evans and Bryant, 2008) are searching for evidence of new particles beyond those described by the Standard Model (SM) of particle physics. A wide array of theories predict the existence of new massive particles that would decay to known particles in the SM such as the W boson. The W boson is unstable and can decay to two quarks, each of which produce collimated sprays of particles known as jets. If the exotic particle is heavy, then the W boson will be moving very fast, and relativistic effects will cause the two jets from its decay to merge into a single ‘W-jet’. These W-jets have a rich internal substructure.
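The asymptote of Lr quoted above follows from the closed form of the differential entropy of a gaussian; a one-line check (illustrative; the helper name is ours):

```python
import numpy as np

def normal_entropy(sigma):
    # Differential entropy H(Z) = log(sigma * sqrt(2*pi*e)) of Z ~ N(mu, sigma^2)
    return float(np.log(sigma * np.sqrt(2.0 * np.pi * np.e)))

print(normal_entropy(1.0))  # ~1.419 for a standard normal, the dashed line in Fig. 3
```

Since H(Z|f(X; θf)) ≤ H(Z), this value is the best an optimal adversary can do once f is pivotal, which is why Lr saturates there.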
However, jets are also produced ubiquitously at high energy colliders through more mundane processes in the SM, which leads to a challenging classification problem that is beset with a number of sources of systematic uncertainty. The classification challenge used here is common in jet substructure studies (see e.g. CMS Collaboration, 2014; ATLAS Collaboration, 2014, 2015): we aim to distinguish normal jets, produced copiously at the LHC (Y = 0), from W-jets (Y = 1) potentially coming from an exotic process. We reuse the datasets used in (Baldi et al., 2016a).\nChallenging in its own right, this classification problem is made all the more difficult by the presence of pileup, i.e. multiple proton-proton interactions occurring simultaneously with the primary interaction. These pileup interactions produce additional particles that can contribute significant energies to jets, unrelated to the underlying discriminating information. The number of pileup interactions can vary with the running conditions of the collider, and we want the classifier to be robust to these conditions. Taking some liberty, we consider an extreme case with a categorical nuisance parameter, where Z = 0 corresponds to events without pileup and Z = 1 corresponds to events with pileup, for which there are on average 50 independent pileup interactions overlaid.\nWe do not expect that we will be able to find a function f that simultaneously minimizes the classification loss Lf and is pivotal. Thus, we need to optimize the hyper-parameter λ of Eqn. 11 with respect to a higher-level objective.
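A natural higher-level objective in such a counting experiment is the median discovery significance for expected signal and background yields. The sketch below uses a common simplified form of the approximate median significance (our simplification: the reference formula includes an additional small regularization term added to the background, omitted here).

```python
import numpy as np

def ams(s, b):
    # Simplified approximate median significance for expected signal s and
    # background b after a cut: sqrt(2*((s + b)*ln(1 + s/b) - s)).
    return float(np.sqrt(2.0 * ((s + b) * np.log(1.0 + s / b) - s)))

# For illustration: expected yields before and after a hypothetical threshold on f
print(ams(100.0, 1000.0))  # no cut: all signal and background kept
print(ams(50.0, 50.0))     # a cut keeping half the signal but rejecting 95% of background
```

Scanning this quantity over decision thresholds on f, for each λ, is how a curve like Fig. 4 is obtained.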
In this case, the natural higher-level context is a hypothesis test of a null hypothesis with no Y = 1 events against an alternate hypothesis that is a mixture of Y = 0 and Y = 1 events. In the absence of systematic uncertainties, optimizing Lf simultaneously optimizes the power of a classical hypothesis test in the Neyman-Pearson sense. When we include systematic uncertainties, we need to balance the classification performance against the robustness to uncertainty in Z. Since we are still performing a hypothesis test against the null, we only wish to impose the pivotal property on Y = 0 events. To this end, we use as a higher-level objective the Approximate Median Significance (AMS), which is a natural generalization of the power of a hypothesis test when systematic uncertainties are taken into account (see Eqn. 20 of Adam-Bourdarios et al. (2014)).\nFor several values of λ, we train a classifier using Algorithm 1, but consider the adversarial term Lr conditioned on Y = 0 only, as outlined in Sec. 2. The architecture of f comprises 3 hidden layers of 64 nodes respectively with tanh, ReLU and ReLU activations, and is terminated by a single final output node with a sigmoid activation. The architecture of r is the same, but uses only ReLU activations in its hidden nodes. As in the previous example, adversarial training is initialized with a pre-trained f. Experiments are performed on a subset of 150000 samples for training, while AMS is evaluated on an independent test set of 5000000 samples. Both training and testing samples are weighted such that the null hypothesis corresponds to 1000 Y = 0 events and the alternate hypothesis includes an additional 100 Y = 1 events prior to any thresholding on f. This allows us to probe the efficacy of the method proposed here in a representative background-dominated high energy physics environment. Results reported below are averages over 5 runs.\nAs Fig.
4 illustrates, without adversarial training (at λ = 0|Z = 0 when building a classifier at the nominal value Z = 0 only, or at λ = 0 when building a classifier on data sampled from p(X, Y, Z)), the AMS peaks at 7. By contrast, as the pivotal constraint is made stronger (for λ > 0), the AMS peak moves higher, with a maximum value around 7.8 for λ = 10. Trading classification accuracy for robustness to pileup thereby results in a net benefit in terms of the power of the hypothesis test. Setting λ too high however (e.g. λ = 500) results in a decrease of the maximum AMS, by focusing the capacity of f too strongly on independence with Z, at the expense of accuracy. In effect, optimizing λ yields a principled and effective approach to control the trade-off between accuracy and robustness that ultimately maximizes the power of the enveloping hypothesis test.\n\nContinuous Case Recently, an independent group has used our approach to learn jet classifiers that are independent of the jet mass (Shimmin et al., 2017), which is a continuous attribute. The results of their studies show that the adversarial training strategy works very well for real-world problems with continuous attributes, thus enhancing the sensitivity of searches for new physics at the LHC.\n\n6 Related work\n\nLearning to pivot can be related to the problem of domain adaptation (Blitzer et al., 2006; Pan et al., 2011; Gopalan et al., 2011; Gong et al., 2013; Baktashmotlagh et al., 2013; Ajakan et al., 2014; Ganin and Lempitsky, 2014), where the goal is often stated as trying to learn a domain-invariant representation of the data.
Likewise, our method also relates to the problem of enforcing fairness in classification (Kamishima et al., 2012; Zemel et al., 2013; Feldman et al., 2015; Edwards and Storkey, 2015; Zafar et al., 2015; Louizos et al., 2015), which is stated as learning a classifier that is independent of some chosen attribute such as gender, color or age. For both families of methods, the problem can equivalently be stated as learning a classifier which is a pivotal quantity with respect to either the domain or the selected feature. As an example, unsupervised domain adaptation with labeled data from a source domain and unlabeled data from a target domain can be recast as learning a predictive model f (i.e., trained to minimize Lf evaluated on labeled source data only) that is also a pivot with respect to the domain Z (i.e., trained to maximize Lr evaluated on both source and target data). In this context, (Ganin and Lempitsky, 2014; Edwards and Storkey, 2015) are certainly among the closest to our work, in which domain invariance and fairness are enforced through an adversarial minimax setup composed of a classifier and an adversarial discriminator. Following this line of work, our method can be regarded as a unified generalization that also supports a continuously parametrized family of domains or as enforcing fairness over continuous attributes.\nMost related work is based on the strong and limiting assumption that Z is a binary random variable (e.g., Z = 0 for the source domain, and Z = 1 for the target domain). In particular, (Pan et al., 2011; Gong et al., 2013; Baktashmotlagh et al., 2013; Zemel et al., 2013; Ganin and Lempitsky, 2014; Ajakan et al., 2014; Edwards and Storkey, 2015; Louizos et al., 2015) are all based on the minimization of some form of divergence between the two distributions of f(X)|Z = 0 and f(X)|Z = 1.
For this reason, these works cannot directly be generalized to non-binary or continuous nuisance parameters, from both a practical and a theoretical point of view. Notably, Kamishima et al. (2012) enforce fairness through a prejudice regularization term based on empirical estimates of p(f(X)|Z). While this approach is in principle sufficient for handling non-binary nuisance parameters Z, it requires accurate empirical estimates of p(f(X)|Z = z) for all values z, which quickly becomes impractical as the cardinality of Z increases. By contrast, our approach models the conditional dependence through an adversarial network, which allows for generalization without necessarily requiring an exponentially growing number of training examples.

A common approach to account for systematic uncertainties in a scientific context (e.g., in high energy physics) is to take as fixed a classifier f built from training data for a nominal value z0 of the nuisance parameter, and then propagate uncertainty by estimating p(f(x)|z) with a parametrized calibration procedure. This classifier, however, is clearly not optimal for z ≠ z0. To overcome this issue, the classifier f is sometimes built instead on a mixture of training data generated from several plausible values z0, z1, . . . of the nuisance parameter. While this certainly improves classification performance with respect to the marginal model p(X, Y), there is no reason to expect the resulting classifier to be pivotal, as shown previously in Sec. 5.1. As an alternative, parametrized classifiers (Cranmer et al., 2015; Baldi et al., 2016b) directly take (nuisance) parameters as additional input variables, hence ultimately providing the most statistically powerful approach for incorporating the effect of systematics on the underlying classification task. In practice, parametrized classifiers are also computationally expensive to build and evaluate.
In particular, calibrating their decision function, i.e., approximating p(f(x, z)|y, z) as a continuous function of z, remains an open challenge. By contrast, constraining f to be pivotal yields a classifier that can be directly used in a wider range of applications, since the dependence on the nuisance parameter Z has already been eliminated.

7 Conclusions

In this work, we proposed a flexible learning procedure for building a predictive model that is independent of continuous or categorical nuisance parameters by jointly training two neural networks in an adversarial fashion. From a theoretical perspective, we motivated the proposed algorithm by showing that the minimax value of its value function corresponds to a predictive model that is both optimal and pivotal (if such a model exists), or for which one can tune the trade-off between power and robustness. From an empirical point of view, we confirmed the effectiveness of our method on a toy example and a particle physics example.

In terms of applications, our solution can be used in any situation where the training data may not be representative of the real data the predictive model will be applied to in practice. In the scientific context, the presence of systematic uncertainty can be incorporated by considering a family of data generation processes, and it would be worth revisiting those scientific problems that utilize machine learning in light of this technique. The approach also extends to cases where independence of the predictive model with respect to observed random variables is desired, as in fairness for classification.

Acknowledgements

We would like to thank the authors of (Baldi et al., 2016a) for sharing the data used in their studies. KC and GL are both supported through NSF ACI-1450310; additionally, KC is supported through PHY-1505463 and PHY-1205376.
MK is supported by the US Department of Energy (DOE) under grant DE-AC02-76SF00515 and by the SLAC Panofsky Fellowship.

References

Adam-Bourdarios, C., Cowan, G., Germain, C., Guyon, I., Kégl, B., and Rousseau, D. (2014). The Higgs boson machine learning challenge. In NIPS 2014 Workshop on High-energy Physics and Machine Learning, volume 42, page 37.

Ajakan, H., Germain, P., Larochelle, H., Laviolette, F., and Marchand, M. (2014). Domain-adversarial neural networks. arXiv preprint arXiv:1412.4446.

ATLAS Collaboration (2014). Performance of Boosted W Boson Identification with the ATLAS Detector. Technical Report ATL-PHYS-PUB-2014-004, CERN, Geneva.

ATLAS Collaboration (2015). Identification of boosted, hadronically-decaying W and Z bosons in √s = 13 TeV Monte Carlo Simulations for ATLAS. Technical Report ATL-PHYS-PUB-2015-033, CERN, Geneva.

Baktashmotlagh, M., Harandi, M., Lovell, B., and Salzmann, M. (2013). Unsupervised domain adaptation by domain invariant projection. In Proceedings of the IEEE International Conference on Computer Vision, pages 769–776.

Baldi, P., Bauer, K., Eng, C., Sadowski, P., and Whiteson, D. (2016a). Jet substructure classification in high-energy physics with deep neural networks. Physical Review D, 93(9):094034.

Baldi, P., Cranmer, K., Faucett, T., Sadowski, P., and Whiteson, D. (2016b). Parameterized neural networks for high-energy physics. Eur. Phys. J., C76(5):235.

Bishop, C. M. (1994). Mixture density networks.

Blitzer, J., McDonald, R., and Pereira, F. (2006). Domain adaptation with structural correspondence learning. In Proceedings of the 2006 Conference on Empirical Methods in Natural Language Processing, pages 120–128. Association for Computational Linguistics.

CMS Collaboration (2014). Identification techniques for highly boosted W bosons that decay into hadrons. JHEP, 12:017.

Cranmer, K., Pavez, J., and Louppe, G. (2015).
Approximating likelihood ratios with calibrated discriminative classifiers. arXiv preprint arXiv:1506.02169.

DeGroot, M. H. and Schervish, M. J. (1975). Probability and Statistics. 1st edition.

Edwards, H. and Storkey, A. J. (2015). Censoring representations with an adversary. arXiv preprint arXiv:1511.05897.

Evans, L. and Bryant, P. (2008). LHC Machine. JINST, 3:S08001.

Feldman, M., Friedler, S. A., Moeller, J., Scheidegger, C., and Venkatasubramanian, S. (2015). Certifying and removing disparate impact. In Proceedings of the 21st ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, pages 259–268. ACM.

Ganin, Y. and Lempitsky, V. (2014). Unsupervised Domain Adaptation by Backpropagation. arXiv preprint arXiv:1409.7495.

Gong, B., Grauman, K., and Sha, F. (2013). Connecting the dots with landmarks: Discriminatively learning domain-invariant features for unsupervised domain adaptation. In Proceedings of the 30th International Conference on Machine Learning, pages 222–230.

Goodfellow, I., Pouget-Abadie, J., Mirza, M., Xu, B., Warde-Farley, D., Ozair, S., Courville, A., and Bengio, Y. (2014). Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680.

Gopalan, R., Li, R., and Chellappa, R. (2011). Domain adaptation for object recognition: An unsupervised approach. In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 999–1006. IEEE.

Kamishima, T., Akaho, S., Asoh, H., and Sakuma, J. (2012). Fairness-aware classifier with prejudice remover regularizer. Machine Learning and Knowledge Discovery in Databases, pages 35–50.

Louizos, C., Swersky, K., Li, Y., Welling, M., and Zemel, R. (2015). The variational fair autoencoder. arXiv preprint arXiv:1511.00830.

Pan, S. J., Tsang, I. W., Kwok, J. T., and Yang, Q. (2011). Domain adaptation via transfer component analysis.
IEEE Transactions on Neural Networks, 22(2):199–210.

Shimmin, C., Sadowski, P., Baldi, P., Weik, E., Whiteson, D., Goul, E., and Søgaard, A. (2017). Decorrelated Jet Substructure Tagging using Adversarial Neural Networks.

Zafar, M. B., Valera, I., Rodriguez, M. G., and Gummadi, K. P. (2015). Fairness constraints: A mechanism for fair classification. arXiv preprint arXiv:1507.05259.

Zemel, R. S., Wu, Y., Swersky, K., Pitassi, T., and Dwork, C. (2013). Learning fair representations. ICML (3), 28:325–333.