{"title": "Controllable Invariance through Adversarial Feature Learning", "book": "Advances in Neural Information Processing Systems", "page_first": 585, "page_last": 596, "abstract": "Learning meaningful representations that maintain the content necessary for a particular task while filtering away detrimental variations is a problem of great interest in machine learning. In this paper, we tackle the problem of learning representations invariant to a specific factor or trait of data. The representation learning process is formulated as an adversarial minimax game. We analyze the optimal equilibrium of such a game and find that it amounts to maximizing the uncertainty of inferring the detrimental factor given the representation while maximizing the certainty of making task-specific predictions. On three benchmark tasks, namely fair and bias-free classification, language-independent generation, and lighting-independent image classification, we show that the proposed framework induces an invariant representation, and leads to better generalization evidenced by the improved performance.", "full_text": "Controllable Invariance\n\nthrough Adversarial Feature Learning\n\nQizhe Xie, Zihang Dai, Yulun Du, Eduard Hovy, Graham Neubig\n\n{qizhex, dzihang, yulund, hovy, gneubig}@cs.cmu.edu\n\nLanguage Technologies Institute\n\nCarnegie Mellon University\n\nAbstract\n\nLearning meaningful representations that maintain the content necessary for a\nparticular task while \ufb01ltering away detrimental variations is a problem of great\ninterest in machine learning. In this paper, we tackle the problem of learning\nrepresentations invariant to a speci\ufb01c factor or trait of data. The representation\nlearning process is formulated as an adversarial minimax game. 
We analyze the\noptimal equilibrium of such a game and \ufb01nd that it amounts to maximizing the\nuncertainty of inferring the detrimental factor given the representation while maxi-\nmizing the certainty of making task-speci\ufb01c predictions. On three benchmark tasks,\nnamely fair and bias-free classi\ufb01cation, language-independent generation, and\nlighting-independent image classi\ufb01cation, we show that the proposed framework\ninduces an invariant representation, and leads to better generalization evidenced by\nthe improved performance.\n\n1\n\nIntroduction\n\nHow to produce a data representation that maintains meaningful variations of data while eliminating\nnoisy signals is a consistent theme of machine learning research. In the last few years, the dominant\nparadigm for \ufb01nding such a representation has shifted from manual feature engineering based on\nspeci\ufb01c domain knowledge to representation learning that is fully data-driven, and often powered by\ndeep neural networks [Bengio et al., 2013]. Being universal function approximators [Gybenko, 1989],\ndeep neural networks can easily uncover the complicated variations in data [Zhang et al., 2017],\nleading to powerful representations. However, how to systematically incorporate a desired invariance\ninto the learned representation in a controllable way remains an open problem.\nA possible avenue towards the solution is to devise a dedicated neural architecture that by construction\nhas the desired invariance property. As a typical example, the parameter sharing scheme and pooling\nmechanism in modern deep convolutional neural networks (CNN) [LeCun et al., 1998] take advantage\nof the spatial structure of image processing problems, allowing them to induce more generic feature\nrepresentations than fully connected networks. 
Since the invariance we care about can vary greatly\nacross tasks, this approach requires us to design a new architecture each time a new invariance\ndesideratum shows up, which is time-consuming and inflexible.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nWhen our belief of invariance is specific to some attribute of the input data, an alternative approach is\nto build a probabilistic model with a random variable corresponding to the attribute, and explicitly\nreason about the invariance. For instance, the variational fair auto-encoder (VFAE) [Louizos et al.,\n2016] employs the maximum mean discrepancy (MMD) to eliminate the negative influence of specific\n\u201cnuisance variables\u201d, such as removing the lighting conditions of images to predict the person\u2019s\nidentity. Similarly, under the setting of domain adaptation, standard binary adversarial cost [Ganin\nand Lempitsky, 2015, Ganin et al., 2016] and central moment discrepancy (CMD) [Zellinger et al.,\n2017] have been utilized to learn features that are domain invariant. However, all these invariance-inducing\ncriteria suffer from a similar drawback: they are defined to measure the divergence\nbetween a pair of distributions. Consequently, they can only express the invariance belief w.r.t. a\npair of values of the random variable at a time. When the attribute is a multinomial variable that\ntakes more than two values, a combinatorial number of pairs (specifically, O(n^2)) have to be added to\nexpress the belief that the representation should be invariant to the attribute. The problem is even\nmore dramatic when the attribute represents a structure that has exponentially many possible values\n(e.g. 
the parse tree of a sentence) or when the attribute is simply a continuous variable.\nMotivated by the aforementioned drawbacks and dif\ufb01culties, in this work, we consider the problem of\nlearning a feature representation with the desired invariance. We aim at creating a uni\ufb01ed framework\nthat is (1) generic enough such that it can be easily plugged into different models, and (2) more\n\ufb02exible to express an invariance belief in quantities beyond discrete variables with limited value\nchoices. Speci\ufb01cally, inspired by the recent advancement of adversarial learning [Goodfellow et al.,\n2014], we formulate the representation learning as a minimax game among three players: an encoder\nwhich maps the observed data deterministically into a feature space, a discriminator which looks\nat the representation and tries to identify a speci\ufb01c type of variation we hope to eliminate from the\nfeature, and a predictor which makes use of the invariant representation to make predictions as in\ntypical discriminative models. We provide theoretical analysis of the equilibrium condition of the\nminimax game, and give an intuitive interpretation. On three benchmark tasks from different domains,\nwe show that the proposed approach not only improves upon vanilla discriminative approaches that do\nnot encourage invariance, but also outperforms existing approaches that enforce invariant features.\n\n2 Adversarial Invariant Feature Learning\n\nIn this section, we formulate our problem and then present the proposed framework of learning\ninvariant features.\n\n(a) y and s are marginally independent\n\n(b) y and s are not marginally independent\n\nFigure 1: Dependencies between x, s, y, where x is the observation and y is the target to be predicted.\ns is the attribute to which the prediction should be invariant.\n\nGiven observation/input x, we are interested in the task of predicting the target y based on the value\nof x using a discriminative approach. 
In addition, we have access to some intrinsic attribute s of x as\nwell as a prior belief that the prediction result should be invariant to s.\nThere are two possible dependency scenarios of x, s and y here: (1) s and y can be marginally\nindependent. For example, in image classification, lighting conditions s and identities of persons y\nare independent. The data generation process is s \u223c p(s), y \u223c p(y), x \u223c p(x | s, y). (2) In some\ncases, s and y are not marginally independent. For example, in fair classification, s denotes\nsensitive factors such as age and gender, while y can be the savings, credit rating or health condition of a person.\ns and y are related due to the inherent bias within the data. Using a latent variable z to model the\ndependency between s and y, the data generation process is z \u223c p(z), s \u223c p(s | z), y \u223c p(y |\nz), x \u223c p(x | s, y). We show the corresponding dependency graphs in Figure 1.\nUnlike vanilla discriminative models that output the conditional distribution p(y | x), we model\np(y | x, s) to make predictions invariant to s. Our intuition is that, due to the explaining-away effect,\ny and s are not independent when conditioned on x although they can be marginally independent.\nConsequently, p(y | x, s) is a more accurate estimate of y than p(y | x). Intuitively, this can\ninform and guide the model to remove information about undesired variations. For example, if we\nwant to learn a representation of image x that is invariant to the lighting condition s, the model\ncan learn to \u201cbrighten\u201d the input if it knows the original picture is dark, and vice versa. Also, in\nmulti-lingual machine translation, a word with the same surface form may have different meanings in\ndifferent languages. 
For instance, \u201cgift\u201d means \u201cpresent\u201d in English but means \u201cpoison\u201d in German.\nHence knowing the language of a source sentence helps to infer the meaning of the sentence and\nto translate it.\nAs the input x can have highly complicated structure, we employ a dedicated model or algorithm to\nextract an expressive representation h from x. Thus, when we extract the representation h from x,\nwe want the representation h to preserve variations that are necessary to predict y while eliminating\ninformation of s. To achieve the aforementioned goal, we employ a deterministic encoder E to\nobtain the representation by encoding x and s into h, namely, h = E(x, s). It should be noted\nhere that we are using s as an additional input. Given the obtained representation h, the target y is\npredicted by a predictor M, which effectively models the distribution q_M(y | h). By construction,\ninstead of modeling p(y | x) directly, the discriminative model we formulate captures the conditional\ndistribution p(y | x, s) with additional information coming from s.\nOf course, feeding s into the encoder by no means guarantees that the induced feature h will be invariant to s.\nThus, in order to enforce the desired invariance and eliminate variations of factor s from h, we set up\nan adversarial game by introducing a discriminator D which inspects the representation h and ensures\nthat it is invariant to s. Concretely, the discriminator D is trained to predict s based on the encoded\nrepresentation h, which effectively maximizes the likelihood q_D(s | h). Simultaneously, the encoder\nfights to minimize the same likelihood of inferring the correct s by the discriminator. 
Intuitively, the\ndiscriminator and the encoder form an adversarial game where the discriminator tries to detect an\nattribute of the data while the encoder learns to conceal it.\nNote that under our framework, in theory, s can be any type of data as long as it represents an attribute\nof x. For example, s can be a real value scalar/vector, which may take many possible values, or a\ncomplex sub-structure such as the parse tree of a natural language sentence. But in this paper, we\nfocus mainly on instances where s is a discrete label with multiple choices. We plan to extend our\nframework to deal with continuous s and structured s in the future.\nFormally, E, M and D jointly play the following minimax game:\n\n$$\\min_{E,M} \\max_{D} J(E, M, D)$$\n\nwhere\n\n$$J(E, M, D) = \\mathbb{E}_{x,s,y \\sim p(x,s,y)} \\left[ \\gamma \\log q_D(s \\mid h = E(x, s)) - \\log q_M(y \\mid h = E(x, s)) \\right] \\tag{1}$$\n\nwhere \u03b3 is a hyper-parameter to adjust the strength of the invariant constraint, and p(x, s, y) is the\ntrue underlying distribution that the empirical observations are drawn from.\nNote that the problem of domain adaption can be seen as a special case of our problem, where s is a\nBernoulli variable representing the domain and the model only has access to the target y when s =\n\u201csource domain\u201d during training.\n\n3 Theoretical Analysis\n\nIn this section, we theoretically analyze, given enough capacity and training time, whether such a\nminimax game will converge to an equilibrium where variations of y are preserved and variations of\ns are removed. The theoretical analysis is done in a non-parametric limit, i.e., we assume a model\nwith infinite capacity. 
In addition, we discuss the equilibria of the minimax game when s is\nindependent of or dependent on y.\nSince both the discriminator and the predictor only use h, which is transformed deterministically from\nx and s, we can substitute x with h and define a joint distribution $\\tilde{p}(h, s, y)$ of h, s and y as follows\n\n$$\\tilde{p}(h, s, y) = \\int_x \\tilde{p}(x, s, h, y)\\,dx = \\int_x p(x, s, y)\\, p_E(h \\mid x, s)\\,dx = \\int_x p(x, s, y)\\, \\delta(E(x, s) = h)\\,dx$$\n\nHere, we have used the fact that the encoder is a deterministic transformation and thus the distribution\n$p_E(h \\mid x, s)$ is merely a delta function denoted by $\\delta(\\cdot)$. Intuitively, h absorbs the randomness in x\nand has an implicit distribution of its own. Also, note that the joint distribution $\\tilde{p}(h, s, y)$ depends on\nthe transformation defined by the encoder.\nThus, we can equivalently rewrite objective (1) as\n\n$$J(E, M, D) = \\mathbb{E}_{h,s,y \\sim \\tilde{p}(h,s,y)} \\left[ \\gamma \\log q_D(s \\mid h) - \\log q_M(y \\mid h) \\right] \\tag{2}$$\n\nTo analyze the equilibrium condition of the new objective (2), we first deduce the optimal discriminator\nD and the optimal predictor M for a given encoder E and then prove the global optimality of the\nminimax game.\nClaim 1. Given a fixed encoder E, the optimal discriminator outputs $q_D^*(s \\mid h) = \\tilde{p}(s \\mid h)$ and the\noptimal predictor corresponds to $q_M^*(y \\mid h) = \\tilde{p}(y \\mid h)$.\n\nProof. The proof uses the fact that the objective is functionally convex w.r.t. each distribution, and\nby taking the variations we can obtain the stationary point for $q_D$ and $q_M$ as a function of $\\tilde{p}$. The\ndetailed proof is included in the supplementary material A.\n\nNote that the optimal $q_D^*(s \\mid h)$ and $q_M^*(y \\mid h)$ given in Claim 1 are both functions of the encoder\nE. Thus, by plugging $q_D^*$ and $q_M^*$ into the original minimax objective (2), it can be simplified as a\nminimization problem only w.r.t. 
the encoder E with the following form:\n\n$$\\min_E J(E) = \\min_E \\mathbb{E}_{h,s,y \\sim \\tilde{p}(h,s,y)} \\left[ \\gamma \\log \\tilde{p}(s \\mid h) - \\log \\tilde{p}(y \\mid h) \\right] = \\min_E \\; -\\gamma H(\\tilde{p}(s \\mid h)) + H(\\tilde{p}(y \\mid h)) \\tag{3}$$\n\nwhere $H(\\tilde{p}(s \\mid h))$ is the conditional entropy of the distribution $\\tilde{p}(s \\mid h)$.\n\nEquilibrium Analysis As we can see, the objective (3) consists of two conditional entropies with\ndifferent signs. Optimizing the first term amounts to maximizing the uncertainty of inferring s based\non h, which is essentially filtering out any information of s from the representation. On the contrary,\noptimizing the second term leads to increasing the certainty of predicting y based on h. Implicitly,\nthe objective defines the equilibrium of the minimax game.\n\u2022 Win-win equilibrium: Firstly, for cases where the attribute s is entirely irrelevant to the prediction\ntask (corresponding to the dependency graph shown in Figure 1a), the two terms can reach the\noptimum at the same time, leading to a win-win equilibrium. For example, with the lighting\ncondition of an image removed, we can still/better classify the identity of the people in that image.\nWith enough model capacity, the optimal equilibrium solution would be the same regardless of the\nvalue of \u03b3.\n\n\u2022 Competing equilibrium: However, there are cases where these two optimization objectives are\ncompeting. For example, in fair classification, sensitive factors such as gender and age may help\nthe overall prediction accuracy due to inherent biases within the data. In other words, knowing\ns may help in predicting y since s and y are not marginally independent (corresponding to the\ndependency graph shown in Figure 1b). 
Learning a fair/invariant representation then harms\nprediction accuracy. In this case, the optimality of these two entropies cannot be achieved simultaneously,\nand \u03b3 defines the relative strengths of the two objectives in the final equilibrium.\n\n4 Parametric Instantiation of the Proposed Framework\n\n4.1 Models\n\nTo show the general applicability of our framework, we experiment on three different tasks including\nsentence generation, image classification and fair classification. Due to the different natures of data\nof x and y, here we present the specific model instantiations we use.\n\nSentence Generation We use multi-lingual machine translation as the testbed for sentence generation.\nConcretely, we have translation pairs between several source languages and a target language. x\nis the source sentence to be translated and s is a scalar denoting which source language x belongs to.\ny is the translated sentence for the target language.\nRecall that s is used as an input of E to obtain a language-invariant representation. To make full\nuse of s, we employ separate encoders Enc_s for sentences in each language s. In other words,\nh = E(x, s) = Enc_s(x) where each Enc_s is a different encoder. The representation of a sentence is\ncaptured by the hidden states of an LSTM encoder [Hochreiter and Schmidhuber, 1997] at each time\nstep.\nWe employ a single LSTM predictor for different encoders. As often used in language generation,\nthe probability q_M output by the predictor is parametrized by an autoregressive process, i.e.,\n\n$$q_M(y_{1:T} \\mid h) = \\prod_{t=1}^{T} q_M(y_t \\mid y_{< t}, h)$$\n\nwhere we use an LSTM with attention model [Bahdanau et al., 2015] to compute $q_M(y_t \\mid y_{< t}, h)$.\nThe discriminator is also parameterized as an LSTM which gives it enough capacity to deal with\ninput of multiple timesteps. 
qD(s | h) is instantiated with the multinomial distribution computed by a\nsoftmax layer on the last hidden state of the discriminator LSTM.\n\nClassi\ufb01cation For our classi\ufb01cation experiments, the input is either a picture or a feature vector.\nAll of the three players in the minimax game are constructed by feedforward neural networks. We\nfeed s to the encoder as an embedding vector.\n\n4.2 Optimization\n\nThere are two possible approaches to optimize our framework in an adversarial setting. The \ufb01rst one\nis similar to the alternating approach used in Generative Adversarial Nets (GANs) [Goodfellow et al.,\n2014]. We can alternately train the two adversarial components while freezing the third one. This\napproach has more control in balancing the encoder and the discriminator, which effectively avoids\nsaturation. Another method is to train all three components together with a gradient reversal layer\n[Ganin and Lempitsky, 2015]. In particular, the encoder admits gradients from both the discriminator\nand the predictor, with the gradient from the discriminator negated to push the encoder in the opposite\ndirection desired by the discriminator. Chen et al. [2016b] found the second approach easier to\noptimize since the discriminator and the encoder are fully in sync being optimized altogether. Hence\nwe adopt the latter approach. In all of our experiments, we use Adam [Kingma and Ba, 2014] with a\nlearning rate of 0.001.\n\n5 Experiments\n\nIn this section, we perform empirical experiments to evaluate the effectiveness of proposed framework.\nWe \ufb01rst introduce the tasks and corresponding datasets we consider. 
Then, we present the quantitative\nresults showing the superior performance of our proposed framework, and discuss some qualitative\nanalysis which verifies the learned representations have the desired invariance property.\n\n5.1 Datasets\n\nOur experiments include three tasks in different domains: (1) fair classification, in which predictions\nshould be unaffected by nuisance factors; (2) language-independent generation which is conducted\non the multi-lingual machine translation problem; (3) lighting-independent image classification.\n\nFair Classification For fair classification, we use three datasets to predict the savings, credit ratings\nand health conditions of individuals with variables such as gender or age specified as \u201cnuisance\nvariables\u201d that we would like to not consider in our decisions [Zemel et al., 2013, Louizos et al.,\n2016]. The German dataset [Frank et al., 2010] is a small dataset with 1,000 samples describing\nwhether a person has a good credit rating. The sensitive nuisance variable to be factored out is gender.\nThe Adult income dataset [Frank et al., 2010] has 45,222 data points and the objective is to predict\nwhether a person has savings of over 50,000 dollars with the sensitive factor being age. The task of\nthe health dataset (footnote 1) is to predict whether a person will spend any days in the hospital in the following\nyear. The sensitive variable is also the age and the dataset contains 147,473 entries. We follow the\nsame 5-fold train/validation/test splits and feature preprocessing used in [Zemel et al., 2013, Louizos\net al., 2016].\nBoth the encoder and the predictor are parameterized by single-layer neural networks. A three-layer\nneural network with batch normalization [Ioffe and Szegedy, 2015] is employed for the discriminator.\nWe use a batch size of 16 and the number of hidden units is set to 64. 
\u03b3 is set to 1 in our experiments.\n\nFootnote 1: www.heritagehealthprize.com\n\nMulti-lingual Machine Translation For the multi-lingual machine translation task we use French\nto English (fr-en) and German to English (de-en) pairs from the IWSLT 2015 dataset [Cettolo et al., 2012].\nThere are 198,435 pairs of fr-en sentences and 188,661 pairs of de-en sentences in the training set. In\nthe test set, there are 4,632 pairs of fr-en sentences and 7,054 pairs of de-en sentences. We evaluate\nBLEU scores [Papineni et al., 2002] using the standard Moses multi-bleu.perl script. Here, s\nindicates the language of the source sentence.\nWe use OpenNMT [Klein et al., 2017] in our multi-lingual MT experiments (footnote 2). The encoder is a\ntwo-layer bidirectional LSTM with 256 units for each direction. The discriminator is a one-layer\nsingle-directional LSTM with 256 units. The predictor is a two-layer LSTM with 512 units and\nattention mechanism [Bahdanau et al., 2015]. We follow Johnson et al. [2016] and use Byte Pair\nEncoding (BPE) subword units [Sennrich et al., 2016] as the cross-lingual input. Every model is run\nfor 20 epochs. \u03b3 is set to 8 and the batch size is set to 64.\n\nImage Classification We use the Extended Yale B dataset [Georghiades et al., 2001] for our image\nclassification task. It comprises face images of 38 people under 5 different lighting conditions: upper\nright, lower right, lower left, upper left, or the front. The variable s to be purged is the lighting\ncondition. The label y is the identity of the person. We follow the train/test split of Li et al. [2014] and Louizos et al. [2016],\nand no validation set is used: 38 \u00d7 5 = 190 samples are used for training and all other\n1,096 data points are used for testing.\nWe use a one-layer neural network for the encoder and a one-layer neural network for prediction. \u03b3 is\nset to 2. The discriminator is a two-layer neural network with batch normalization. 
The batch size is\nset to 16 and the hidden size is set to 100.\n\n5.2 Results\n\nFair Classification The results on three fairness tasks are shown in Figure 2. We compare our model\nwith two prior works on learning fair representations: Learning Fair Representations (LFR) [Zemel\net al., 2013] and Variational Fair Autoencoder (VFAE) [Louizos et al., 2016]. Results of VAE and\ndirectly using x as the representation are also shown.\nWe first study how much information about s is retained in the learned representation h by using a\nlogistic regression to predict factor s. In the top row, we see that s cannot be recognized from the\nrepresentations learned by the three models that target fair representations. The accuracy of classifying\ns is similar to the trivial baseline predicting the majority label shown by the black line.\nThe performance on predicting label y is shown in the second row. We see that LFR and VFAE\nsuffer on Adult and German datasets after removing information of s. In comparison, our model\u2019s\nperformance does not suffer even when making fair predictions. Specifically, on German, our\nmodel\u2019s accuracy is 0.744 compared to 0.727 and 0.723 achieved by VFAE and LFR. On Adult, our\nmodel\u2019s accuracy is 0.844 while VFAE and LFR have accuracies of 0.813 and 0.823 respectively.\nOn the health dataset, all models\u2019 performances are barely better than the majority baseline. The\nunsatisfactory performances of all models may be due to the extreme imbalance of the dataset, in\nwhich 85% of the data has the same label.\nWe also investigate how fair representations would alleviate biases of machine learning models. We\nmeasure the unbiasedness by evaluating models\u2019 performances on identifying minority groups. For\ninstance, suppose the task is to predict savings with the nuisance factor being age, with savings\nabove a threshold of $50,000 being adequate, otherwise being insufficient. 
If people of advanced\nage generally have fewer savings, then a biased model would tend to predict insufficient savings for\nthose with an advanced age. In contrast, an unbiased model can better factor out age information and\nrecognize people that do not fit into these stereotypes.\nConcretely, for groups pooled by each possible value of y, we look for the minority s in each of these\ngroups and define the minority s as the biased category for the group. We then calculate the\naccuracy on each biased category and report the average performance for all categories. We do not\ncompute the instance-level average performance since one category may hold the dominant amount\nof data among all categories.\n\nFootnote 2: Our MT code is available at https://github.com/qizhex/Controllable-Invariance\n\n[Figure 2 panels: bar charts for the Adult, German and Health datasets, accuracy axis from 0.4 to 0.9. Rows (a) and (b) compare x, LFR, VAE, VFAE and Ours against a majority line (0.67, 0.8, 0.58 for predicting s; 0.75, 0.71, 0.84 for predicting y); row (c) compares x and Ours on overall accuracy and on biased categories.]\n\n(a) Accuracy on predicting s. The closer the result is to the majority line, the better the model is in eliminating\nthe effect of nuisance variables.\n\n(b) Accuracy on predicting y. High accuracy in predicting y is desirable.\n\n(c) Overall performance and performance on biased categories. Fair representations lead to high accuracy on\nbiased categories.\nFigure 2: Fair classification results on different representations. x denotes directly using the observation\nx as the representation. The black lines in the first and the second row show the performance\nof predicting the majority label. \u201cBiased categories\u201d in the third row are explained in the fourth\nparagraph of Section 5.2.\n\nAs shown in the third row of Figure 2, on German and Adult, we achieve higher accuracy on the\nbiased categories, even though our overall accuracy is similar to or lower than the baseline which\ndoes not employ fairness constraints. 
Specifically, on Adult, our performance on the biased categories\nis 0.788 while the baseline\u2019s accuracy is 0.748. On German, our accuracy on biased categories is\n0.676 while the baseline achieves 0.648. The results show that our model is able to learn a more\nunbiased representation.\n\nTable 1: Results on multi-lingual machine translation.\nModel | test (fr-en) | test (de-en)\nBilingual Enc-Dec [Bahdanau et al., 2015] | 35.2 | 27.3\nMulti-lingual Enc-Dec [Johnson et al., 2016] | 35.5 | 27.7\nOur model | 36.1 | 28.1\nOur model w.o. discriminator | 35.3 | 27.6\nOur model w.o. separate encoders | 35.4 | 27.7\n\nTable 2: Results on Extended Yale B dataset. A better representation has lower accuracy of classifying\nfactor s and higher accuracy of classifying label y.\nMethod | Accuracy of classifying s | Accuracy of classifying y\nLogistic regression | 0.96 | 0.78\nNN + MMD [Li et al., 2014] | - | 0.82\nVFAE [Louizos et al., 2016] | 0.57 | 0.85\nOurs | 0.57 | 0.89\n\n(a) Using the original image x as the representation\n\n(b) Representation learned by our model\n\nFigure 3: t-SNE visualizations of images in the Extended Yale B. The original pictures are clustered\nby the lighting conditions, while the representation learned by our model is clustered by identities of\nindividuals.\n\nMulti-lingual Machine Translation The results of systems on multi-lingual machine translation\nare shown in Table 1. We compare our model with attention based encoder-decoder trained on\nbilingual data [Bahdanau et al., 2015] and multi-lingual data [Johnson et al., 2016]. The encoder-decoder\ntrained on multi-lingual data employs a single encoder for both source languages. 
Firstly,\nboth multi-lingual systems outperform the bilingual encoder-decoder even though multi-lingual\nsystems use a similar number of parameters to translate two languages, which shows that learning\nan invariant representation leads to better generalization in this case. The better generalization may be\ndue to transferring statistical strength between data in two languages.\nComparing two multi-lingual systems, our model outperforms the baseline multi-lingual system on\nboth languages, where the improvement on French-to-English is 0.6 BLEU score. We also verify the\ndesign decisions in our framework by ablation studies. 
Firstly, without the discriminator, the model\u2019s\nperformance is worse than the standard multi-lingual system, which rules out the possibility that the\ngain of our model comes from more parameters of separating encoders. Secondly, when we do not\nemploy separate encoders, the model\u2019s performance deteriorates and it is more dif\ufb01cult to learn a\ncross-lingual representation, which\n\u2022 veri\ufb01es the theoretical advantage of modeling p(y | x, s) instead of p(y | x) as mentioned in\nSection 2. Intuitively, German and French have different grammars and vocabulary, so it is hard to\nobtain a uni\ufb01ed semantic representation by performing the same operations.\n\u2022 means that the encoder needs to have enough capacity to reach the equilibrium in the minimax\ngame. We also observe that the discriminator needs enough capacity to provide faithful gradients\ntowards the equilibrium. Speci\ufb01cally, instantiating the discriminator with feedforward neural\nnetwork w./w.o. attention mechanism [Bahdanau et al., 2015] does not work in our experiments.\n\nImage Classi\ufb01cation We report the results in Table 2 with two baselines [Li et al., 2014, Louizos\net al., 2016] that use MMD regularizations to remove lighting conditions. The advantage of factoring\nout lighting conditions is shown by the improved accuracy 89% for classifying identities, while the\nbest baseline achieves an accuracy of 85%.\nIn terms of removing s, our framework can \ufb01lter the lighting conditions since the accuracy of\nclassifying s drops from 0.96 to 0.57, as shown in Table 2. We also visualize the learned representation\nby t-SNE [Maaten and Hinton, 2008] in comparison to the visualization of original pictures in Figure\n3. We see that, without removing lighting conditions, the images are clustered based on the lighting\nconditions. 
After removing information about lighting conditions, images are clustered according to the\nidentity of each person.\n\n6 Related Work\n\nAs a specific case of our problem where s takes two values, domain adaptation has attracted a large\namount of research interest. Domain adaptation aims to learn domain-invariant representations that\nare transferable to other domains. For example, in image classification, adversarial training has\nbeen shown to be able to learn an invariant representation across domains [Ganin and Lempitsky, 2015,\nGanin et al., 2016, Bousmalis et al., 2016, Tzeng et al., 2017] and enables classifiers trained on the\nsource domain to be applicable to the target domain. Moment discrepancy regularizations can also\neffectively remove domain-specific information [Zellinger et al., 2017, Bousmalis et al., 2016] for\nthe same purpose. By learning language-invariant representations, classifiers trained on the source\nlanguage can be applied to the target language [Chen et al., 2016b, Xu and Yang, 2017].\nWorks targeting the development of fair, bias-free classifiers also aim to learn representations invariant\nto \u201cnuisance variables\u201d that could induce bias and hence make the predictions fair, as data-driven\nmodels trained using historical data easily inherit the bias exhibited in the data. Zemel et al. [2013]\npropose to regularize the \u2113_1 distance between representation distributions for data with different\nnuisance variables to enforce fairness. The Variational Fair Autoencoder [Louizos et al., 2016] targets\nthe problem with a Variational Autoencoder [Kingma and Welling, 2014, Rezende et al., 2014]\napproach with maximum mean discrepancy regularization.\nOur work is also related to learning disentangled representations, where the aim is to separate different\ninfluencing factors of the input data into different parts of the representation. 
Ideally, each part of\nthe learned representation can be marginally independent of the others. An early work by Tenenbaum\nand Freeman [1997] proposes a bilinear model to learn a representation with the style and content\ndisentangled. From an information-theoretic perspective, Chen et al. [2016a] augment standard generative\nadversarial networks with an inference network, whose objective is to infer part of the latent code\nthat leads to the generated sample. This way, the information carried by the chosen part of the latent\ncode can be retained in the generated sample, leading to a disentangled representation.\nAs we have discussed in Section 1, these methods bear the same drawback that the cost used to\nregularize the representation is pairwise, which does not scale well when the attribute can take a large number of values. Louppe et al. [2016] propose an adversarial training framework to\nlearn representations independent of a categorical or continuous variable. A basic assumption in their\ntheoretical analysis is that the attribute is irrelevant to the prediction, which limits its capabilities in\nanalyzing fair classification.\n\n7 Conclusion\n\nIn sum, we propose a generic framework to learn representations invariant to a specified factor or trait.\nWe cast the representation learning problem as an adversarial game among an encoder, a discriminator,\nand a predictor. We theoretically analyze the optimal equilibrium of the minimax game and evaluate\nthe performance of our framework on three tasks from different domains empirically. We show that\nan invariant representation is learned, resulting in better generalization and improvements on the\nthree tasks.\n\nAcknowledgement\n\nWe thank Shi Feng, Di Wang and Zhilin Yang for insightful discussions. This research was supported\nin part by DARPA grant FA8750-12-2-0342 funded under the DEFT program.\n\nReferences\nDzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 
Neural machine translation by jointly\nlearning to align and translate. ICLR, 2015.\n\nYoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new\nperspectives. IEEE transactions on pattern analysis and machine intelligence, 35(8):1798\u20131828,\n2013.\n\nKonstantinos Bousmalis, George Trigeorgis, Nathan Silberman, Dilip Krishnan, and Dumitru Erhan.\nDomain separation networks. In NIPS, 2016.\n\nMauro Cettolo, Christian Girardi, and Marcello Federico. Wit3: Web inventory of transcribed and\ntranslated talks. In Proceedings of the 16th Conference of the European Association for Machine\nTranslation (EAMT), volume 261, page 268, 2012.\n\nXi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. Infogan:\nInterpretable representation learning by information maximizing generative adversarial nets. In\nNIPS, 2016a.\n\nXilun Chen, Yu Sun, Ben Athiwaratkun, Claire Cardie, and Kilian Weinberger. Adversarial deep\naveraging networks for cross-lingual sentiment classification. arXiv preprint arXiv:1606.01614,\n2016b.\n\nAndrew Frank, Arthur Asuncion, et al. UCI machine learning repository, 2010. URL http://archive.ics.uci.edu/ml.\n\nYaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. ICML,\n2015.\n\nYaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, Fran\u00e7ois\nLaviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks.\nJournal of Machine Learning Research, 17(59):1\u201335, 2016.\n\nAthinodoros S. Georghiades, Peter N. Belhumeur, and David J. Kriegman. From few to many:\nIllumination cone models for face recognition under variable lighting and pose. IEEE transactions\non pattern analysis and machine intelligence, 23(6):643\u2013660, 2001.\n\nIan Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair,\nAaron Courville, and Yoshua Bengio. 
Generative adversarial nets. In NIPS, 2014.\n\nG. Cybenko. Approximation by superpositions of a sigmoidal function. Mathematics of Control,\nSignals and Systems, 2(4):303\u2013314, 1989.\n\nSepp Hochreiter and J\u00fcrgen Schmidhuber. Long short-term memory. Neural computation, 1997.\n\nSergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by\nreducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.\n\nMelvin Johnson, Mike Schuster, Quoc V Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil\nThorat, Fernanda Vi\u00e9gas, Martin Wattenberg, Greg Corrado, et al. Google\u2019s multilingual neural\nmachine translation system: Enabling zero-shot translation. arXiv preprint arXiv:1611.04558,\n2016.\n\nDiederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint\narXiv:1412.6980, 2014.\n\nDiederik P Kingma and Max Welling. Auto-encoding variational bayes. ICLR, 2014.\n\nG. Klein, Y. Kim, Y. Deng, J. Senellart, and A. M. Rush. OpenNMT: Open-Source Toolkit for Neural\nMachine Translation. ArXiv e-prints, 2017.\n\nYann LeCun, L\u00e9on Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to\ndocument recognition. Proceedings of the IEEE, 86(11):2278\u20132324, 1998.\n\nYujia Li, Kevin Swersky, and Richard Zemel. Learning unbiased features. arXiv preprint\narXiv:1412.5244, 2014.\n\nChristos Louizos, Kevin Swersky, Yujia Li, Max Welling, and Richard Zemel. The variational fair\nautoencoder. ICLR, 2016.\n\nGilles Louppe, Michael Kagan, and Kyle Cranmer. Learning to pivot with adversarial networks.\narXiv preprint arXiv:1611.01046, 2016.\n\nLaurens van der Maaten and Geoffrey Hinton. Visualizing data using t-sne. JMLR, 2008.\n\nKishore Papineni, Salim Roukos, Todd Ward, and Wei-Jing Zhu. Bleu: a method for automatic\nevaluation of machine translation. In ACL, 2002.\n\nDanilo Jimenez Rezende, Shakir Mohamed, and Daan Wierstra. 
Stochastic backpropagation and\napproximate inference in deep generative models. ICML, 2014.\n\nRico Sennrich, Barry Haddow, and Alexandra Birch. Neural machine translation of rare words with\nsubword units. ACL, 2016.\n\nJoshua B Tenenbaum and William T Freeman. Separating style and content. NIPS, 1997.\n\nEric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain\nadaptation. arXiv preprint arXiv:1702.05464, 2017.\n\nRuochen Xu and Yiming Yang. Cross-lingual distillation for text classification. ACL, 2017.\n\nWerner Zellinger, Thomas Grubinger, Edwin Lughofer, Thomas Natschl\u00e4ger, and Susanne Saminger-Platz.\nCentral moment discrepancy (CMD) for domain-invariant representation learning. ICLR,\n2017.\n\nRichard S Zemel, Yu Wu, Kevin Swersky, Toniann Pitassi, and Cynthia Dwork. Learning fair\nrepresentations. ICML, 2013.\n\nChiyuan Zhang, Samy Bengio, Moritz Hardt, Benjamin Recht, and Oriol Vinyals. Understanding\ndeep learning requires rethinking generalization. ICLR, 2017.\n\nA Supplementary Material: Proofs\n\nThe proof for Claim 1:\nClaim. Given a fixed encoder E, the optimal discriminator outputs $q_D^*(s \\mid h) = \\tilde{q}(s \\mid h)$. The\noptimal predictor corresponds to $q_M^*(y \\mid h) = \\tilde{q}(y \\mid h)$.\n\nProof. We first prove the optimal solution of the discriminator. With a fixed encoder, we have the\nfollowing optimization problem\n\n$$\\min_{q_D} \\; -J(E, M, D) \\quad \\text{s.t.} \\quad \\sum_s q_D(s \\mid h) = 1, \\; \\forall h$$\n\nThen $L = -J(E, M, D) - \\sum_h \\lambda(h) \\left( \\sum_s q_D(s \\mid h) - 1 \\right)$ is the Lagrangian of the above\noptimization problem, where $\\lambda(h)$ are the dual variables introduced for the equality constraints.\nThe optimal D satisfies the following equation\n\n$$\\begin{aligned} 0 &= \\frac{\\partial L}{\\partial q_D^*(s \\mid h)} \\\\ \\iff 0 &= -\\frac{\\partial J}{\\partial q_D^*(s \\mid h)} - \\lambda(h) \\\\ \\iff \\lambda(h) &= -\\frac{\\sum_y \\tilde{q}(h, s, y)}{q_D^*(s \\mid h)} \\\\ \\iff \\lambda(h) \\, q_D^*(s \\mid h) &= -\\tilde{q}(s, h) \\end{aligned} \\tag{4}$$\n\nSumming w.r.t. s on both sides of the last line of Eqn. (4) and using the fact that $\\sum_s q_D^*(s \\mid h) = 1$,\nwe get\n\n$$\\lambda(h) = -\\tilde{q}(h) \\tag{5}$$\n\nSubstituting Eqn. (5) back into Eqn. (4), we can prove the optimal discriminator is\n\n$$q_D^*(s \\mid h) = \\tilde{q}(s \\mid h)$$\n\nSimilarly, taking the derivative w.r.t. $q_M(y \\mid h)$ and setting it to 0, we can prove $q_M^*(y \\mid h) = \\tilde{q}(y \\mid h)$.\n", "award": [], "sourceid": 406, "authors": [{"given_name": "Qizhe", "family_name": "Xie", "institution": "Carnegie Mellon University"}, {"given_name": "Zihang", "family_name": "Dai", "institution": "Carnegie Mellon University"}, {"given_name": "Yulun", "family_name": "Du", "institution": "Carnegie Mellon University"}, {"given_name": "Eduard", "family_name": "Hovy", "institution": "CMU"}, {"given_name": "Graham", "family_name": "Neubig", "institution": "Carnegie Mellon University"}]}