{"title": "Disentangling Influence: Using disentangled representations to audit model predictions", "book": "Advances in Neural Information Processing Systems", "page_first": 4496, "page_last": 4506, "abstract": "Motivated by the need to audit complex and black box models, there has been extensive research on quantifying how data features influence model predictions. Feature influence can be direct (a direct influence on model outcomes) and indirect (model outcomes are influenced via proxy features). Feature influence can also be expressed in aggregate over the training or test data or locally with respect to a single point. Current research has typically focused on one of each of these dimensions. In this paper, we develop disentangled influence audits, a procedure to audit the indirect influence of features. Specifically, we show that disentangled representations provide a mechanism to identify proxy features in the dataset, while allowing an explicit computation of feature influence on either individual outcomes or aggregate-level outcomes. We show through both theory and experiments that disentangled influence audits can both detect proxy features and show, for each individual or in aggregate, which of these proxy features affects the classifier being audited the most. In this respect, our method is more powerful than existing methods for ascertaining feature influence.", "full_text": "Disentangling In\ufb02uence: Using disentangled\nrepresentations to audit model predictions\u2217\n\nCharles T. Marx\nHaverford College\n\ncmarx@haverford.edu\n\nRichard Lanas Phillips\n\nCornell University\n\nrlp246@cornell.edu\n\nSorelle A. 
Friedler\nHaverford College\n\nsorelle@cs.haverford.edu\n\nCarlos Scheidegger\nUniversity of Arizona\n\ncscheid@cs.arizona.edu\n\nSuresh Venkatasubramanian\n\nUniversity of Utah\n\nsuresh@cs.utah.edu\n\nAbstract\n\nMotivated by the need to audit complex and black box models, there has been\nextensive research on quantifying how data features in\ufb02uence model predictions.\nFeature in\ufb02uence can be direct (a direct in\ufb02uence on model outcomes) and indirect\n(model outcomes are in\ufb02uenced via proxy features). Feature in\ufb02uence can also\nbe expressed in aggregate over the training or test data or locally with respect to\na single point. Current research has typically focused on one of each of these\ndimensions. In this paper, we develop disentangled in\ufb02uence audits, a procedure\nto audit the indirect in\ufb02uence of features. Speci\ufb01cally, we show that disentangled\nrepresentations provide a mechanism to identify proxy features in the dataset, while\nallowing an explicit computation of feature in\ufb02uence on either individual outcomes\nor aggregate-level outcomes. We show through both theory and experiments that\ndisentangled in\ufb02uence audits can both detect proxy features and show, for each\nindividual or in aggregate, which of these proxy features affects the classi\ufb01er\nbeing audited the most. In this respect, our method is more powerful than existing\nmethods for ascertaining feature in\ufb02uence.\n\n1\n\nIntroduction\n\nAs machine learning models have become increasingly complex, there has been a growing sub\ufb01eld of\nwork on interpreting and explaining the predictions of these models [24, 11]. In order to assess the\nimportance of particular features to aggregated or individual model predictions, a variety of feature\nin\ufb02uence techniques have been developed.\nThe goal of explaining model predictions is of particular importance in the context of fairness in\nmachine learning. 
In human-centered prediction tasks such as recidivism prediction and consumer \ufb01nance, understanding how protected attributes such as gender or race affect a prediction can improve transparency and aligns with the principles of contestable design [14]. While direct in\ufb02uence methods [7, 12, 20, 25] focus on determining how a feature is used directly by the model to determine an outcome, it is also possible for the model to access protected information through proxy variables \u2013 variables which are closely related to the protected attribute. Indirect feature in\ufb02uence techniques [1, 2, 15] report that a feature is important if that feature or a proxy has an in\ufb02uence on the model outcomes.\n\n\u2217This research was funded in part by the NSF under grants DMR-1709351, IIS-1633387, IIS-1633724, and IIS-1815238, by the DARPA SD2 program, and the Arnold and Mabel Beckman Foundation. The Titan Xp used for this research was donated by the NVIDIA Corporation.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fFeature in\ufb02uence methods can focus on the in\ufb02uence of a feature taken over all instances in the training or test set [7, 2], or on the local feature in\ufb02uence on a single individual item of the training or test set [25, 20] (both of which are different from the in\ufb02uence of a speci\ufb01c training instance on a model\u2019s parameters [16]). Both the global perspective given by considering the impact of a feature on all training and/or test instances as well as the local, individual perspective can be useful to investigate the fairness of a machine learning model. 
Consider, for example, the question of fairness\nin an automated hiring decision: determining the indirect in\ufb02uence of gender on all test outcomes\ncould help us understand whether the system had disparate impacts overall, while an individual-level\nfeature audit could help determine if a speci\ufb01c person\u2019s predictions were due in part to their gender.2\n\nOur Work.\nIn this paper we present a general technique to perform both global and individual-level\nindirect in\ufb02uence audits. Our technique is modular \u2013 it solves the indirect in\ufb02uence problem by\nreduction to a direct in\ufb02uence problem, allowing us to bene\ufb01t from existing techniques.\nOur key insight is that disentangled representations can be used to do indirect in\ufb02uence computation.\nThe idea of a disentangled representation is to learn independent factors of variation that re\ufb02ect the\nnatural symmetries of a data set. This approach has been very successful in generating representations\nin deep learning that can be manipulated while creating realistic inputs [3, 4, 9, 17, 26]. Disentan-\nglement has recently been shown to be a useful property for learning and evaluating fair machine\nlearning models [6, 18]. Related methods use competitive learning to ensure a representation is free\nof protected information while preserving other information [8, 21].\nIn our context, the idea is to disentangle the in\ufb02uence of the feature whose (indirect) in\ufb02uence we\nwant to compute. By doing this, we obtain a representation in which we can manipulate the feature\ndirectly to estimate its in\ufb02uence. Our approach has a number of advantages. We can connect indirect\nin\ufb02uence in the native representation to direct in\ufb02uence in the disentangled representation. Our\nmethod creates a disentangled model: a wrapper to the original model with the disentangled features\nas inputs. 
This implies that it works for (almost) any model for which direct in\ufb02uence methods work, and also allows us to use any future developed direct in\ufb02uence method.\nSpeci\ufb01cally, our disentangled in\ufb02uence audits approach provides the following contributions:\n\n1. Theoretical and experimental justi\ufb01cation that the disentangled model and associated disentangled in\ufb02uence audits we create provide an accurate indirect in\ufb02uence audit of complex, and potentially black box, models.\n\n2. Quality measures, based on the error of the disentanglement and the error of the reconstruction of the original input, that can be associated with the audit results.\n\n3. An indirect in\ufb02uence method that can work in association with both global and individual-level feature in\ufb02uence mechanisms. Our disentangled in\ufb02uence audits can additionally audit continuous features and image data: types of audits that were not possible with previous indirect audit methods (without additional preprocessing).\n\n2 Our Methodology\n\n2.1 Theoretical background\n\nLet P and X denote sets of attributes with associated domains P and X . P represents features of interest: these could be protected attributes of the data or any other features whose in\ufb02uence we wish to determine. For convenience we will assume that P consists of the values taken by a single feature \u2013 our exposition and techniques work more generally. X represents other attributes of the data that may or may not be in\ufb02uenced by features in P . An instance is thus a point (p, x) \u2208 P \u00d7 X . Let Y denote the space of labels for a learning task (Y = {+1,\u22121} for binary classi\ufb01cation or R for regression).\n\nDisentangled representation. 
Our goal is to \ufb01nd an alternate representation of an instance (p, x). Speci\ufb01cally, we would like to construct x\u2032 \u2208 X\u2032 that represents all factors of variation that are independent of P , as well as an invertible mapping f such that f (p, x) = (p, x\u2032). We will refer to the original domain as D = P \u00d7 X and the associated new domain as D\u2032 = P \u00d7 X\u2032.\n\n2While unrelated to feature in\ufb02uence, the idea of recourse [28] also emphasizes the importance of individual-level explanations of an outcome or how to change it.\n\n2\n\n\fDirect and indirect in\ufb02uence. Given a model M : D \u2192 Y , a direct in\ufb02uence measure quanti\ufb01es the degree to which any particular feature in\ufb02uences the outcome of M on a speci\ufb01c input. In our experiments, we use the SHAP values proposed by [20] that are inspired by the Shapley values in game theory, but our framework applies to an arbitrary direct in\ufb02uence function \u03c6. For a model M and input x, the in\ufb02uence of feature p is de\ufb01ned by SHAP as [20, Eq. 8]\n\n\u03c6p(M, x) = \u03a3z\u2286x (|z|! (n \u2212 |z| \u2212 1)! / n!) [Mx(z) \u2212 Mx(z \\ p)]\n\nwhere |z| denotes the number of nonzero entries in z, z \u2286 x is a vector whose nonzero entries are a subset of the nonzero entries in x, z \\ p denotes z with the feature p set to zero, and n is the number of features. Finally, Mx(z) = E[M (z) | zS], the conditional expected value of the model subject to \ufb01xing all the nonzero entries of z (S is the set of nonzero entries in z). Indirect in\ufb02uence attempts to capture how a feature might in\ufb02uence the outcome of a model even if it is not explicitly represented in the data, i.e., its in\ufb02uence is via proxy features. 
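For a handful of features, the SHAP sum above can be evaluated by brute force. The sketch below is illustrative only (it is not the paper's implementation, which uses the shap package): it replaces the conditional expectation E[M(z) | zS] with background-mean imputation, a common independence simplification, and is exponential in the number of features.

```python
import itertools
import math
import numpy as np

def shapley_direct_influence(model, x, background):
    """Brute-force Shapley values for a model taking a 1-D feature vector.

    Features outside a coalition are imputed with background-data means,
    an independence simplification of the conditional expectation.
    """
    n = len(x)
    base = background.mean(axis=0)
    phi = np.zeros(n)
    for p in range(n):
        others = [i for i in range(n) if i != p]
        for r in range(len(others) + 1):
            for S in itertools.combinations(others, r):
                # weight |S|!(n - |S| - 1)!/n! from the SHAP formula
                weight = (math.factorial(len(S)) * math.factorial(n - len(S) - 1)
                          / math.factorial(n))
                with_p = base.copy()
                with_p[list(S) + [p]] = x[list(S) + [p]]
                without_p = base.copy()
                without_p[list(S)] = x[list(S)]
                phi[p] += weight * (model(with_p) - model(without_p))
    return phi
```

For a linear model such as M(x, y) = x + y, this recovers the familiar closed form: each feature's value minus its background mean, with the values summing to M(x) minus the model's output at the background mean.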
The above direct in\ufb02uence measure cannot capture these effects because the information encoded in protected feature p might be retained in other features even if p is removed. We propose the following de\ufb01nition of indirect in\ufb02uence via a reduction to direct in\ufb02uence:\nDe\ufb01nition 1 (Indirect In\ufb02uence). The indirect in\ufb02uence of a feature p on the prediction of a model M (p, x) is the direct in\ufb02uence \u03c6p of p on the prediction of the disentangled model (M \u25e6 f\u22121)(p, x\u2032):\n\nIIp(M, (p, x)) = \u03c6p(M \u25e6 f\u22121, (p, x\u2032))   (1)\n\nEquation (1) states that the indirect in\ufb02uence of p is the direct in\ufb02uence of p when considering M as acting on the disentangled representation instead of the original features. Whereas direct in\ufb02uence measures the sensitivity of M to changes in p independent of all other features, indirect in\ufb02uence also considers how p in\ufb02uences the prediction of M through proxy features for p. Hence, indirect in\ufb02uence is inherently speci\ufb01c to a data distribution and should be interpreted with respect to the joint distribution for (p, x) observed during training.3 Here, a proxy for p consists of a set of features S and a function g that predicts p: i.e., such that g(xS) \u2248 p. Note that if there are no features that can predict p, then the indirect and direct in\ufb02uence of p are the same (because the only proxy for p is itself).\n\nDealing with errors. In practice, it might not be possible to perfectly learn the invertible mapping f from (p, x) to (p, x\u2032). In particular, assume that our decoder function is some g \u2260 f\u22121. 
While we do not provide an explicit formula for the dependence of the in\ufb02uence function parameters, we note that it is a linear function of the predictions, and so we can begin to understand the errors in the in\ufb02uence estimates by looking at the behavior of the predictor with respect to p.\nModel output can be written as \u02c6y = (M \u25e6 g)(p, x\u2032). Recalling that g(p, x\u2032) = (p, \u02c6x), the partial derivative of \u02c6y with respect to p can be written as\n\n\u2202\u02c6y/\u2202p = \u2202(M \u25e6 g)/\u2202p = (\u2202M/\u2202\u02c6x)(\u2202\u02c6x/\u2202x\u2032)(\u2202x\u2032/\u2202p) + (\u2202M/\u2202\u02c6x)(\u2202\u02c6x/\u2202p).\n\nConsider the term \u2202x\u2032/\u2202p. If the disentangled representation is perfect, then this term is zero (because x\u2032 is unaffected by p), and therefore we get \u2202\u02c6y/\u2202p = (\u2202M/\u2202\u02c6x)(\u2202\u02c6x/\u2202p), which is as we would expect. If the reconstruction is perfect (but not necessarily the disentangling), then the term \u2202\u02c6x/\u2202x\u2032 is 1. What remains is the partial derivative of M with respect to the latent encoding (x\u2032, p).\n\n2.2 Implementation\n\nOur overall process requires two separate pieces: 1) a method to create disentangled representations, and 2) a method to audit direct features. In most experiments in this paper, we use methods related to adversarial autoencoders [22] to generate disentangled representations, and Shapley values from the shap technique for auditing direct features [20] (as described above in Section 2.1).\nWe train a disentangled representation to estimate (p, x\u2032) for each feature of interest p. This allows us to compute representations with only two factors in a supervised manner, avoiding many of the practical issues affecting methods for learning disentangled representations [19]. 
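The decomposition of \u2202\u02c6y/\u2202p discussed above can be checked numerically on a toy example. Everything below (the leakage parameter EPS, the linear encoder/decoder, the model M) is a hypothetical closed-form stand-in chosen so each partial derivative is known exactly, not the paper's trained networks.

```python
import numpy as np

EPS = 0.1  # leakage of p into the code x': imperfect disentanglement

def f(p, x):
    # hypothetical encoder; ideally x' would not depend on p, but here
    # dx'/dp = -(2 - EPS) instead of 0
    return x - (2.0 - EPS) * p

def g(p, x_prime):
    # hypothetical decoder assuming the ideal relation x = x' + 2p,
    # so reconstruction is off by EPS * p
    return x_prime + 2.0 * p

def M(x_hat):
    # toy model to audit: linear in the reconstructed feature
    return 3.0 * x_hat

def y_hat(p, x):
    # disentangled model output (M ∘ g)(p, f(p, x))
    return M(g(p, f(p, x)))

# finite-difference dŷ/dp at a fixed original instance (p0, x0)
p0, x0, h = 0.4, 1.7, 1e-6
numeric = (y_hat(p0 + h, x0) - y_hat(p0 - h, x0)) / (2 * h)

# the closed-form pieces of the decomposition in the text
dM_dxhat = 3.0
dxhat_dxprime = 1.0        # g is linear in x'
dxprime_dp = -(2.0 - EPS)  # nonzero because the disentanglement leaks
dxhat_dp = 2.0             # direct dependence of the decoder on p
analytic = dM_dxhat * (dxhat_dxprime * dxprime_dp + dxhat_dp)
```

Here the nonzero \u2202x\u2032/\u2202p term nearly cancels the decoder's direct dependence on p, illustrating how leakage in the learned representation biases the resulting influence estimate.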
A key limitation of this approach is that while easier to train, it potentially requires one to train many disentangled representations. This means the technique may be most useful in domains such as fairness where we care speci\ufb01cally about the impact of one or a small collection of distinguished features that may or may not be directly used as inputs to the model.\n\n3See the related perspective of disentangled representations as recovering symmetries in underlying world states, which directly inspired our approach [13].\n\n3\n\n\fDisentangled representations via adversarial autoencoders We create disentangled representations by training three separate neural networks, which we denote f, g, and h (see Figure 1). Networks f and g are autoencoders: the image of f has lower dimensionality than the domain of f, and the training process seeks for g \u25e6 f to be an approximate identity, through gradient descent on the reconstruction error ||(g \u25e6 f )(x) \u2212 x||. Unlike regular autoencoders, g is also given direct access to the protected attribute. Adversarial autoencoders [22], in addition, use an ancillary network h that attempts to recover the protected attribute from the image of f, without access to p itself. (Note the slight abuse of notation here: h is assumed not to have access to p, while g does have access to it.) During the training of f and g, we seek to reduce ||(g \u25e6 f )(x) \u2212 x||, but also to increase the error of the discriminator h \u25e6 f. The optimization process of h tries to recover the protected attribute from the code generated by f. 
(h and f are the adversaries.) When the process converges to an equilibrium, the code generated by f will contain no information about p that is useful to h, but g \u25e6 f still reconstructs the original data correctly: f disentangles p from the other features.\nThe loss functions used to codify this process are LEnc = MSE(x, \u02c6x) \u2212 \u03b2 MSE(p, \u02c6p), LDec = MSE(x, \u02c6x), and LDisc = MSE(p, \u02c6p), where MSE is the mean squared error and \u03b2 is a hyperparameter determining the importance of disentanglement relative to reconstruction. When p is a binary feature, LEnc and LDisc are adjusted to use binary cross entropy loss between p and \u02c6p.\n\nFigure 1: System diagram when auditing the indirect in\ufb02uence of feature p on the outcomes of model M for instance (p, x). The autoencoder g \u25e6 f learns the instance (p, x) as a function of independent factors (p, x\u2032). The independence of p and x\u2032 is enforced by the adversary h. The disentangled representation (p, x\u2032) is the input for the disentangled model M\u2032 = M \u25e6 g which is audited using the direct in\ufb02uence algorithm I.\n\nDisentangled feature audits Concretely, our method works as follows, where the variable names match the diagram in Figure 1:\n\nDISENTANGLED-INFLUENCE-AUDIT(X, M )\n1 for p in FEATURES(X)\n2     (f , g, h) = DISENTANGLED-REPRESENTATION(X, p) // (h is not used)\n3     M\u2032 = M \u25e6 g\n4     X\u2032 = {f (x) for x in X}\n5     SHAPp = DIRECT-INFLUENCE(X\u2032, p, M\u2032)\n6 return {SHAPp for p in FEATURES(X)}\n\nWhile we use shap in our implementations, our framework applies to other direct in\ufb02uence functions as well. We note here one important difference in the interpretation of disentangled in\ufb02uence values when contrasted with regular Shapley values. 
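A minimal runnable sketch of the DISENTANGLED-INFLUENCE-AUDIT loop, with toy stand-ins for every learned component: a closed-form encoder/decoder for a known proxy relation instead of the adversarial networks, and a finite-difference influence measure instead of shap. All names are illustrative.

```python
import numpy as np

def direct_influence(model, z, i, h=1e-6):
    """Finite-difference stand-in for a direct influence measure such as shap."""
    z_plus, z_minus = z.copy(), z.copy()
    z_plus[i] += h
    z_minus[i] -= h
    return (model(z_plus) - model(z_minus)) / (2 * h)

# Toy instance (p, q) where q = 2p is a perfect proxy for p,
# and the audited model reads only the proxy.
M = lambda z: z[1]

# Hand-crafted disentanglement for feature p (index 0): the code
# x' = q - 2p carries no information about p, and g inverts f.
f = lambda z: np.array([z[0], z[1] - 2.0 * z[0]])
g = lambda z: np.array([z[0], z[1] + 2.0 * z[0]])

M_disentangled = lambda z: M(g(z))   # the disentangled model M' = M ∘ g

z = np.array([0.3, 0.6])             # instance satisfying q = 2p
direct = direct_influence(M, z, 0)                     # M never reads p
indirect = direct_influence(M_disentangled, f(z), 0)   # influence via proxy q
```

The direct audit reports zero influence for p, while auditing the disentangled model exposes the influence p exerts through its proxy.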
Because the in\ufb02uence of each feature is determined on a different disentangled model, the scores we get are not directly interpretable as a partition of the model\u2019s prediction. For example, consider a dataset in which feature p1 is responsible for 50% of the direct in\ufb02uence, while feature p2 is a perfect proxy for p1, but shows 0% in\ufb02uence under a direct audit. Relative judgments of feature importance remain sensible.\n\n4\n\n\fFigure 2: Synthetic x + y data direct shap (left) and indirect (right) feature in\ufb02uences using a handcrafted (top row) or learned disentangled representation (bottom row). In each plot, a point represents the in\ufb02uence of a feature on a single prediction. Points far from the center indicate high importance, so a row with many points far from the center indicates a feature which is deemed important. As expected, the direct in\ufb02uence method shap reports that the only important features are x and y, but our methods capture that x and y are perfect proxies for 2x, 2y, x2, y2 so these features have equal indirect in\ufb02uence.\n\n3 Experiments\n\nIn this section, we will assess the extent to which disentangled in\ufb02uence audits are able to identify sources of indirect in\ufb02uence to a model and quantify its error. All data and code4 for the described method and below experiments are available in the Supplementary Materials.\n\n3.1 Synthetic x + y Regression Data\n\nIn order to evaluate whether the indirect in\ufb02uence calculated by the disentangled in\ufb02uence audits correctly captures all in\ufb02uence of individual-level features on an outcome, we will consider in\ufb02uence on a simple synthetic x + y dataset. 
It includes 5,000 instances of two variables x and y drawn independently from a uniform distribution over [0, 1] that are added to determine the label x + y. It also includes proxy variables 2x, x2, 2y, and y2. A random noise variable c, with proxies 2c and c2, is also included; c is drawn independently of x and y uniformly from [0, 1]. The model we are auditing is a handcrafted model that contains no hidden layers and has \ufb01xed weights of 1 corresponding to x and y and weights of 0 for all other features (i.e., it directly computes x + y). We use shap as the direct in\ufb02uence delegate method [20].5\nIn order to examine the impact of the quality of the disentangled representation on the results, we considered both a handcrafted disentangled representation and a learned one. For the former, nine unique models were handcrafted to disentangle each of the nine features perfectly (see Supplementary Materials for details). The learned disentangled representation is created according to the adversarial autoencoder methodology described in more detail in the previous section.\nThe results for the handcrafted disentangled representation (top of Figure 2) are as expected: features x and y are the only ones with direct in\ufb02uence, all x or y based features have the same amount of indirect in\ufb02uence, and the c-based features have zero in\ufb02uence. Using the learned disentangled representation introduces the potential for error: the resulting in\ufb02uences (bottom of Figure 2) show more variation between features, but the same general trends as in the handcrafted test case.\n\n4Code is available at: https://github.com/charliemarx/disentangling-influence\n5This method is available via pip install shap. 
See also: https://github.com/slundberg/shap\n\n5\n\n\fAdditionally, note that since shap gives in\ufb02uence results per individual instance, we can also see that (for both models) instances with larger (or, respectively, smaller) 2x or 2y values give larger (respectively, smaller) results for the label x + y, i.e., have larger absolute in\ufb02uences on the outcomes.\n\n3.1.1 Error Analyses\n\nThere are two main sources of error for disentangled in\ufb02uence audits: error in the reconstruction of the original input x and error in the disentanglement of p from x\u2032 such that the discriminator is able to accurately predict some \u02c6p close to p. We will measure the former error in two ways. First, we will consider the reconstruction error, which we de\ufb01ne as x \u2212 \u02c6x. Second, we consider the prediction error, which is M (x) \u2212 M (\u02c6x), a measure of the impact of the reconstruction error on the model to be audited. Reconstruction and prediction errors close to 0 indicate that the disentangled model M\u2032 is similar to the model M being audited. Common measures for disentanglement such as the mutual information gap (MIG) do not apply well to our method since we disentangle the features one at a time, as opposed to simultaneously [5]. We measure the disentanglement error as (1/n) \u03a3_{i=1}^{n} (p_i \u2212 \u02c6p_i)\u00b2 / var(p), where var(p) is the variance of p. A disentanglement error of below 1 indicates that information about that feature may have been revealed, i.e., that there may be indirect in\ufb02uence that is not accounted for in the resulting in\ufb02uence score. 
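These three measures take only a few lines to compute; the function below is an illustrative sketch (the names and vectorized form are ours, not taken from the released code).

```python
import numpy as np

def audit_quality(model, x, x_hat, p, p_hat):
    """Quality measures for one feature's disentangled influence audit.

    x, x_hat: original and reconstructed feature matrices (instances x features)
    p, p_hat: true protected feature and the discriminator's estimate of it
    """
    reconstruction_error = x - x_hat            # per-instance, per-feature
    prediction_error = model(x) - model(x_hat)  # effect on the audited model
    # values below 1 suggest the discriminator still partially recovers p
    disentanglement_error = np.mean((p - p_hat) ** 2) / np.var(p)
    return reconstruction_error, prediction_error, disentanglement_error
```

A discriminator that can do no better than predicting the mean of p scores exactly 1, the optimum for this measure, while perfect reconstruction drives the first two errors to 0.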
In addition to the usefulness of these error measures during training time, they also provide information that helps us to assess the quality of the indirect in\ufb02uence audit, including at the level of the error for an individual instance.\n\nFigure 3: Errors on the synthetic x + y data for the reconstruction error (left) when taken across in\ufb02uence audits for each feature, prediction error (middle), and disentanglement error (right). Optimal is reconstruction error and prediction error of 0 for all features (indicating no errors in autoencoding), and disentanglement error of 1 for all features (indicating p and x\u2032 are independent).\n\nThese in\ufb02uence experiments on the x + y dataset demonstrate the importance of a good disentangled representation to the quality of the resulting indirect in\ufb02uence measures, since the handcrafted zero-error disentangled representation clearly results in more accurate in\ufb02uence results. Each of the error types described above are given for the learned disentangled representation in Figure 3. While most features have reconstruction and prediction errors close to 0 and disentanglement errors close to 1, a few features also have some far outlying instances. For example, we can see that the c, 2c, and c2 variables have high prediction error on some instances, and this is re\ufb02ected in the incorrect indirect in\ufb02uence that they\u2019re found to have on the learned representation for some instances.\n\n3.2 dSprites Image Classi\ufb01cation\n\nThe second synthetic dataset is the dSprites dataset commonly used in the disentangled representations literature to disentangle independent factors that are sources of variation [23]. The dataset consists of 737,280 images (64\u00d764 pixels) of a white shape (a square, ellipse, or heart) on a black background. The independent latent factors are x position, y position, orientation, scale, and shape. 
The images were downsampled to 16 \u00d7 16 resolution and the half of the data in which the shapes are largest were used. The binary classi\ufb01cation task is to predict whether the shape is a heart. A good disentangled representation should be able to separate the shape from the other latent factors.\n\nFigure 4: dSprites data indirect latent factor in\ufb02uences on a model predicting shape.\n\n6\n\n\fFigure 5: The mean squared reconstruction error (left), absolute prediction error (middle), and absolute disentanglement error (right) of the latent factors in the dSprites data under an indirect in\ufb02uence audit. Optimal is reconstruction error and prediction error of 0, and disentanglement error of 1. We see that the quality of the disentangled representation varies for the dSprites data.\n\nIn this experiment we seek to quantify the indirect in\ufb02uence of each latent factor on a model trained to predict the shape from an image. Since shape is the label and the latent factors are independent, we expect the feature shape to have more indirect in\ufb02uence on the model than any other latent factor. Note that a direct in\ufb02uence audit is impossible since the latent factors are not themselves features of the data. Model and disentangled representation training information can be found in the Supplementary Material.\nThe indirect in\ufb02uence audit, shown in Figure 4, correctly identi\ufb01es shape as the most important latent factor, and also correctly shows the other four factors as having essentially zero indirect in\ufb02uence. However, the audit struggles to capture the extent of the indirect in\ufb02uence of shape since the resulting shap values are small.\nThe associated error measures for the dSprites in\ufb02uence audit are shown in Figure 5. 
We report the reconstruction error as the mean squared error between x and \u02c6x for each latent factor. The prediction error is the difference between M (x) and M (\u02c6x), the model\u2019s estimates of the probability that the shape is a heart. While the reconstruction errors are relatively low (less than 0.05 for all but y position) the prediction error and disentanglement errors are high. A high prediction error indicates that the model is sensitive to the errors in reconstruction and the indirect in\ufb02uence results may be unstable, which may explain the low shap values for shape in the indirect in\ufb02uence audit.\n\n3.3 Adult Income Data\n\nFigure 6: Ten selected features for Adult dataset. Direct (left) and indirect (right) in\ufb02uence are shown. For all features, see Supplemental Material. Low values indicate a one-hot encoded feature is false. Features with many points far from the center (shown here using width of a cluster) are identi\ufb01ed as being of high importance. These results indicate that features sex=Male, relationship=Husband and workclass=Private may be used by the model via proxy variables since they have higher indirect in\ufb02uence than direct in\ufb02uence.\n\n7\n\n\fFinally, we will consider a real-world dataset containing Adult Income data that is commonly used as a test case in the fairness-aware machine learning community. The Adult dataset includes 14 features describing type of work, demographic information, and capital gains information for individuals from the 1994 U.S. census [27]. The classi\ufb01cation task is predicting whether an individual makes more or less than $50,000 per year. Preprocessing, model, and disentangled representation training information are included in the Supplementary Material.\nDirect and indirect in\ufb02uence audits on the Adult dataset are given in Figure 6 and in the Supplementary Material. 
While many of the resulting in\ufb02uence scores are the same in both the direct and indirect cases, the disentangled in\ufb02uence audit \ufb01nds substantially more in\ufb02uence based on sex than the direct in\ufb02uence audit; this is not surprising given the large in\ufb02uence that sex is known to have on U.S. income. Other important features in a fairness context, such as nationality, are also shown to have indirect in\ufb02uences that are not apparent on a direct in\ufb02uence audit. The error results (Figure 7 and the Supplementary Material) indicate that while the error is low across all three types of errors for many features, the disentanglement errors are higher (further from 1) for some rare-valued features. This means that despite the indirect in\ufb02uence that the audit did \ufb01nd, there may be additional indirect in\ufb02uence it did not pick up for those features.\n\nFigure 7: The reconstruction error (left), prediction error (middle), and disentanglement error (right) of selected Adult Income features under an indirect in\ufb02uence audit. Optimal is a reconstruction error and prediction error of 0, and a disentanglement error of 1 for all features. 
See the Supplementary Material for the complete \ufb01gure.\n\n3.4 Comparison to Other Methods\n\nFigure 8: Comparison on the synthetic x + y data of the disentangled in\ufb02uence audits using the handcrafted (left) or learned (middle) disentangled representation with the BBA approach of [2] (right). According to our de\ufb01nition of indirect in\ufb02uence and using shap, the features x, y, 2x, 2y, x2, y2 should have the same in\ufb02uence and c, 2c, c2 should have no in\ufb02uence.\n\nHere, we compare the disentangled in\ufb02uence audits results to results on the same datasets and models by the indirect in\ufb02uence technique introduced in [2], which we will refer to as BBA (black-box auditing).6 However, this is not a direct comparison, since BBA is not able to determine feature in\ufb02uence for individual instances, only in\ufb02uence for a feature taken over all instances. In order to compare to our results, we will thus take the mean over all instances of the absolute value of the per-feature disentangled in\ufb02uence. BBA was designed to audit classi\ufb01ers, so in order to compare to the results of disentangled in\ufb02uence audits we will consider the obscured data they generate as input into our regression models and then report the average change in mean squared error for the case of the synthetic x + y data. (BBA cannot handle dSprites image data as input.)\n\n6This method is available via pip install BlackBoxAuditing. See also: https://github.com/algofairness/BlackBoxAuditing\n\n8\n\n\fA comparison of the disentangled in\ufb02uence and BBA results on the synthetic x + y data, shown in Figure 8, shows that all three variants of indirect in\ufb02uence are able to determine that the c, 2c, c2 variables have comparatively low in\ufb02uence on the model. 
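The aggregation used for this comparison, the mean over instances of the absolute per-feature influence, is a one-liner; the influence matrix below is made-up illustrative data, not results from the paper.

```python
import numpy as np

# hypothetical per-instance influence matrix: rows = instances, cols = features
per_instance = np.array([[ 0.3, -0.2, 0.0],
                         [-0.3,  0.2, 0.0],
                         [ 0.1, -0.4, 0.0]])

# taking the absolute value before averaging matters: a plain mean lets
# positive and negative per-instance influences cancel, hiding features
# whose effect flips in sign across instances
global_influence = np.abs(per_instance).mean(axis=0)
plain_mean = per_instance.mean(axis=0)
```

Here the first feature's plain mean is far smaller than its mean absolute influence, even though it strongly affects individual predictions.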
The disentangled influence with a handcrafted disentangled representation shows the correct indirect influence of each feature, while the learned disentangled representation influence is somewhat noisier, and the BBA results suffer from relying on the mean squared error (i.e., the amount of influence changes based on the feature's value).

Figure 9 shows the mean absolute disentangled influence per feature on the x-axis and the BBA influence results on the y-axis. The features with large disentangled influence and low BBA score are marital.status=Married-civ-spouse and relationship=Husband. BBA can only detect influence present in pairwise dimensions, not more complex high-dimensional correlations; perhaps this is why marital status is found to have such a large influence by the disentangled influence audit and not by BBA. The feature with a large BBA score is age, and the reconstruction error on the disentangled influence audit for that feature indicates that the audit may not have picked up the full influence of that feature. Overall, the disentangled influence audit technique is clearly better able to find features with possible indirect influence on this dataset and model: most of the BBA influences are clustered near zero, while the disentangled influence values provide more variation and potential for insight.

Figure 9: Comparison on the Adult data of the disentangled influence audits versus the BBA indirect influence approach of [2]. Our disentangled feature audit identifies more plausible, potentially influential features than BBA. See text for details.

SHAP vs. LIME In Section 3.3, we use SHAP audits as our direct and indirect audit sources.
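As a concrete illustration of the Shapley value formulation that SHAP is built on, the exact (exponential-time) computation can be sketched in a few lines. This is a hedged sketch: the function and its inputs are our own illustration, not the shap library's API, which uses far more efficient approximations.

```python
from itertools import permutations
from math import factorial

def shapley_values(f, x, baseline):
    """Exact Shapley values of model f at point x relative to a baseline:
    each feature's marginal contribution, averaged over all orderings."""
    n = len(x)
    phi = [0.0] * n
    for order in permutations(range(n)):
        z = list(baseline)
        prev = f(z)
        for i in order:
            z[i] = x[i]           # reveal feature i in this ordering
            cur = f(z)
            phi[i] += cur - prev  # marginal contribution of feature i
            prev = cur
    return [p / factorial(n) for p in phi]

# For an additive model, Shapley values recover each feature's direct
# contribution: f(z) = z0 + z1 at x = [3, 4] with a zero baseline.
vals = shapley_values(lambda z: z[0] + z[1], [3.0, 4.0], [0.0, 0.0])
```

Averaging marginal contributions over all feature orderings is what gives Shapley values their consistency guarantees; SHAP approximates this average rather than enumerating the n! orderings.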
As Lundberg and Lee [20] argue, the SHAP values are, in essence, a variation of the LIME method [25], one that provides weights for samples and features that are consistent with the Shapley value formulation from game theory. As a result, the audits for LIME are not fundamentally different from those for SHAP; we provide them in the Supplementary Material.

4 Discussion and Conclusion

In this paper, we introduce the idea of disentangling influence: using ideas from disentangled representations to allow for indirect influence audits. We show via theory and experiments that this method works across a variety of problems and data types, including classification and regression as well as numerical, categorical, and image data. The methodology allows us to turn any direct influence measure developed in the future into an indirect influence measure. In addition to the strengths of the technique demonstrated here, disentangled influence audits have the added potential to allow for multidimensional indirect influence audits that would, e.g., allow a fairness audit on both race and gender to be performed (without using a single combined race and gender feature [10]). We hope this opens the door for more nuanced fairness audits.

References

[1] J. Adebayo and L. Kagal. Iterative orthogonal feature projection for diagnosing bias in black-box models. arXiv preprint arXiv:1611.04967, 2016.

[2] P. Adler, C. Falk, S. A. Friedler, T. Nix, G. Rybeck, C. Scheidegger, B. Smith, and S. Venkatasubramanian. Auditing black-box models for indirect influence. Knowledge and Information Systems, 54(1):95–122, 2018.

[3] A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy. Deep variational information bottleneck. International Conference on Learning Representations, 2016.

[4] Y. Bengio, A. Courville, and P. Vincent. Representation learning: A review and new perspectives.
IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.

[5] T. Q. Chen, X. Li, R. B. Grosse, and D. K. Duvenaud. Isolating sources of disentanglement in variational autoencoders. In Advances in Neural Information Processing Systems, pages 2610–2620, 2018.

[6] E. Creager, D. Madras, J.-H. Jacobsen, M. A. Weis, K. Swersky, T. Pitassi, and R. Zemel. Flexibly fair representation learning by disentanglement. arXiv preprint arXiv:1906.02589, 2019.

[7] A. Datta, S. Sen, and Y. Zick. Algorithmic transparency via quantitative input influence: Theory and experiments with learning systems. In Proceedings of the 37th IEEE Symposium on Security and Privacy, 2016.

[8] H. Edwards and A. Storkey. Censoring representations with an adversary. In Proceedings of the 33rd International Conference on Machine Learning, 2016.

[9] B. Esmaeili, H. Wu, S. Jain, A. Bozkurt, N. Siddharth, B. Paige, D. H. Brooks, J. Dy, and J.-W. van de Meent. Structured disentangled representations. In K. Chaudhuri and M. Sugiyama, editors, Proceedings of Machine Learning Research, volume 89, pages 2525–2534. PMLR, 16–18 Apr 2019.

[10] S. A. Friedler, C. Scheidegger, S. Venkatasubramanian, S. Choudhary, E. P. Hamilton, and D. Roth. A comparative study of fairness-enhancing interventions in machine learning. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 329–338. ACM, 2019.

[11] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, and D. Pedreschi. A survey of methods for explaining black box models. ACM Computing Surveys (CSUR), 51(5):93, 2018.

[12] A. Henelius, K. Puolamäki, H. Boström, L. Asker, and P. Papapetrou. A peek into the black box: exploring classifiers by randomization. Data Mining and Knowledge Discovery, 28:1503–1529, 2014.

[13] I. Higgins, D. Amos, D. Pfau, S. Racaniere, L. Matthey, D. Rezende, and A. Lerchner.
Towards a definition of disentangled representations. arXiv preprint arXiv:1812.02230, 2018.

[14] T. Hirsch, K. Merced, S. Narayanan, Z. E. Imel, and D. C. Atkins. Designing contestability: Interaction design, machine learning, and mental health. In Proceedings of the 2017 Conference on Designing Interactive Systems, pages 95–99. ACM, 2017.

[15] B. Kim, M. Wattenberg, J. Gilmer, C. Cai, J. Wexler, F. Viegas, and R. Sayres. Interpretability beyond feature attribution: Quantitative testing with concept activation vectors (TCAV). arXiv preprint arXiv:1711.11279, 2017.

[16] P. W. Koh and P. Liang. Understanding black-box predictions via influence functions. In Proceedings of the 34th International Conference on Machine Learning, pages 1885–1894. JMLR.org, 2017.

[17] A. Kumar, P. Sattigeri, and A. Balakrishnan. Variational inference of disentangled latent concepts from unlabeled observations. International Conference on Learning Representations, 2017.

[18] F. Locatello, G. Abbati, T. Rainforth, S. Bauer, B. Schölkopf, and O. Bachem. On the fairness of disentangled representations. arXiv preprint arXiv:1905.13662, 2019.

[19] F. Locatello, S. Bauer, M. Lucic, S. Gelly, B. Schölkopf, and O. Bachem. Challenging common assumptions in the unsupervised learning of disentangled representations. arXiv preprint arXiv:1811.12359, 2018.

[20] S. M. Lundberg and S.-I. Lee. A unified approach to interpreting model predictions. In Advances in Neural Information Processing Systems, pages 4765–4774, 2017.

[21] D. Madras, E. Creager, T. Pitassi, and R. Zemel. Learning adversarially fair and transferable representations. In Proceedings of the 35th International Conference on Machine Learning, 2018.

[22] A. Makhzani, J. Shlens, N. Jaitly, I. Goodfellow, and B. Frey. Adversarial autoencoders. arXiv preprint arXiv:1511.05644, 2015.

[23] L. Matthey, I. Higgins, D.
Hassabis, and A. Lerchner. dSprites: Disentanglement testing sprites dataset. https://github.com/deepmind/dsprites-dataset/, 2017.

[24] C. Molnar. Interpretable Machine Learning: A Guide for Making Black Box Models Explainable. Leanpub, 2018.

[25] M. T. Ribeiro, S. Singh, and C. Guestrin. "Why Should I Trust You?": Explaining the Predictions of Any Classifier. In Proc. ACM KDD, 2016.

[26] M. Tschannen, O. Bachem, and M. Lucic. Recent advances in autoencoder-based representation learning. arXiv preprint arXiv:1812.05069, 2018.

[27] UCI Machine Learning Repository, University of California, Irvine. Adult income dataset. https://archive.ics.uci.edu/ml/datasets/adult.

[28] B. Ustun, A. Spangher, and Y. Liu. Actionable recourse in linear classification. In Proceedings of the Conference on Fairness, Accountability, and Transparency, pages 10–19. ACM, 2019.