{"title": "Unsupervised Adversarial Invariance", "book": "Advances in Neural Information Processing Systems", "page_first": 5092, "page_last": 5102, "abstract": "Data representations that contain all the information about target variables but are invariant to nuisance factors benefit supervised learning algorithms by preventing them from learning associations between these factors and the targets, thus reducing overfitting. We present a novel unsupervised invariance induction framework for neural networks that learns a split representation of data through competitive training between the prediction task and a reconstruction task coupled with disentanglement, without needing any labeled information about nuisance factors or domain knowledge. We describe an adversarial instantiation of this framework and provide analysis of its working. Our unsupervised model outperforms state-of-the-art methods, which are supervised, at inducing invariance to inherent nuisance factors, effectively using synthetic data augmentation to learn invariance, and domain adaptation. Our method can be applied to any prediction task, eg., binary/multi-class classification or regression, without loss of generality.", "full_text": "Unsupervised Adversarial Invariance\n\nAyush Jaiswal, Yue Wu, Wael AbdAlmageed, Premkumar Natarajan\n\nUSC Information Sciences Institute\n\nMarina del Rey, CA, USA\n\n{ajaiswal, yue_wu, wamageed, pnataraj}@isi.edu\n\nAbstract\n\nData representations that contain all the information about target variables but\nare invariant to nuisance factors bene\ufb01t supervised learning algorithms by pre-\nventing them from learning associations between these factors and the targets,\nthus reducing over\ufb01tting. 
We present a novel unsupervised invariance induction framework for neural networks that learns a split representation of data through competitive training between the prediction task and a reconstruction task coupled with disentanglement, without needing any labeled information about nuisance factors or domain knowledge. We describe an adversarial instantiation of this framework and provide analysis of its working. Our unsupervised model outperforms state-of-the-art methods, which are supervised, at inducing invariance to inherent nuisance factors, effectively using synthetic data augmentation to learn invariance, and domain adaptation. Our method can be applied to any prediction task, e.g., binary/multi-class classification or regression, without loss of generality.

1 Introduction

Supervised learning, arguably the most popular branch of machine learning, involves estimating a mapping from data samples (x) to target variables (y). A common formulation of this task is the estimation of the conditional probability p(y|x) from data through learning associations between y and underlying factors of variation of x. However, data often contains nuisance factors (z) that are irrelevant to the prediction of y from x, and estimation of p(y|x) in such cases leads to overfitting when the model incorrectly learns to associate some z with y. Thus, when applied to new data containing unseen variations of z, trained models perform poorly. 
For example, a nuisance factor in the case of face recognition in images is the lighting condition the photograph was captured in, and a recognition model that associates lighting with subject identity is expected to perform poorly. Developing machine learning methods that are invariant to nuisance factors has been a long-standing problem in machine learning, studied under various names such as "feature selection", "robustness through data augmentation" and "invariance induction".

While deep neural networks (DNNs) have outperformed traditional methods at highly sophisticated and challenging supervised learning tasks, providing better estimates of p(y|x), they are prone to the same problem of incorrectly learning associations between z and y. An architectural solution to this problem is the development of neural network units that capture specific forms of information, and thus are inherently invariant to certain nuisance factors [3, 19]. For example, convolutional operations coupled with pooling strategies capture shift-invariant spatial information while recurrent operations robustly capture high-level trends in sequential data. However, this approach requires significant effort for engineering custom network modules and layers to achieve invariance to specific nuisance factors, making it inflexible [19]. A different but popularly adopted solution to the problem of nuisance factors is the use of data augmentation where synthetic versions of real data samples are generated, during training, with specific forms of variation [3]. For example, rotation, translation and additive noise are typical methods of augmentation used in computer vision, especially for classification and detection tasks. 

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

However, models trained naïvely on the augmented dataset become robust to limited forms of nuisance by learning to associate every seen variation of such factors to the target variables. Consequently, such models perform poorly when applied to data exhibiting unseen nuisance variations, such as face images at previously unseen pose angles.

A related but more systematic solution to this problem is the approach of invariance induction, guiding neural networks through specialized training mechanisms to discard known nuisance factors from the learned latent representation of data that is used for prediction. Models trained in this fashion become robust by exclusion rather than inclusion and are, therefore, expected to perform well even on data containing variations of specific nuisance factors that were not seen during training. For example, a face recognition model trained explicitly to not associate lighting conditions with the identity of the person is expected to be more robust to lighting conditions than a similar model trained naïvely on images of subjects under certain different lighting conditions [19]. This research area has, therefore, garnered tremendous interest recently [6, 13, 14, 19]. However, a shortcoming of this approach is the requirement of domain knowledge of possible nuisance factors and their variations, which is often hard to find [3]. Additionally, this solution to invariance applies only to cases where annotated data is available for each nuisance factor, such as labeled information about the lighting condition of each image in the face recognition example, which is often not the case.

We present a novel unsupervised framework for invariance induction that overcomes the drawbacks of previous methods. 
Our framework promotes invariance through separating the underlying factors of variation of x into two latent embeddings: e1, which contains all the information required for predicting y, and e2, which contains other information irrelevant to the prediction task. While e1 is used for predicting y, a noisy version of e1, denoted as ẽ1, and e2 are used to reconstruct x. This creates a competitive scenario where the reconstruction module tries to pull information into e2 (because ẽ1 is unreliable) while the prediction module tries to pull information into e1. The training objective is augmented with a disentanglement term that ensures that e1 and e2 do not contain redundant information. In our adversarial instantiation of this generalized framework, disentanglement is achieved between e1 and e2 in a novel way through two adversarial disentanglers – one that aims to predict e2 from e1 and another that does the inverse. The parameters of the combined model are learned through adversarial training between (a) the encoder, the predictor and the decoder, and (b) the disentanglers. The framework makes no assumptions about the data, so it can be applied to any prediction task without loss of generality, be it binary/multi-class classification or regression. Unlike existing methods, the proposed method does not require annotation of nuisance factors or specialized domain knowledge. We provide results on three tasks involving a diverse collection of datasets – (1) invariance to inherent nuisance factors, (2) effective use of synthetic data augmentation for learning invariance and (3) domain adaptation. 
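The data flow outlined above can be illustrated in a few lines of numpy. This is a toy sketch: the layer shapes, dropout rate and random initialization are our own choices for illustration, not the paper's implementation.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy dimensions (ours, for illustration only).
d_x, d_e, n_classes, batch = 16, 8, 3, 4

# Hypothetical single-layer stand-ins for the encoder, predictor and decoder.
W_enc = rng.normal(size=(d_x, 2 * d_e))     # Enc: x -> e = [e1 e2]
W_pred = rng.normal(size=(d_e, n_classes))  # Pred: e1 -> y
W_dec = rng.normal(size=(2 * d_e, d_x))     # Dec: [noisy e1, e2] -> x'

x = rng.normal(size=(batch, d_x))
e = x @ W_enc
e1, e2 = e[:, :d_e], e[:, d_e:]

# The noisy version of e1 is an unreliable input for reconstruction,
# which pushes the decoder to pull information into e2 instead.
keep = rng.random(e1.shape) > 0.5  # dropout-style multiplicative noise (rate is ours)
e1_noisy = e1 * keep

y_logits = e1 @ W_pred                                    # prediction uses clean e1
x_recon = np.concatenate([e1_noisy, e2], axis=1) @ W_dec  # reconstruction uses noisy e1 and e2
print(y_logits.shape, x_recon.shape)  # (4, 3) (4, 16)
```

The disentanglement term that keeps e1 and e2 from carrying redundant information is formalized in Section 3.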
Our unsupervised framework outperforms existing approaches for invariance induction, which are supervised, on all of them.

2 Related Work

Methods for preventing supervised learning algorithms from learning false associations between target variables and nuisance factors have been studied from various perspectives including "feature selection" [16], "robustness through data augmentation" [10, 11] and "invariance induction" [3, 14, 19]. Feature selection has typically been employed when data is available as a set of conceptual features, some of which are irrelevant to the prediction tasks. Our approach can be interpreted as an implicit feature selection mechanism for neural networks, which can work on both raw data (such as images) and feature-sets (e.g., frequency features computed from raw text). Popular feature selection methods [16] incorporate information-theoretic measures or use supervised methods to score features with their importance for the prediction task and prune the low-scoring ones. Our framework performs this task implicitly on latent features that the model learns by itself from the provided data.

Deep neural networks (DNNs) have outperformed traditional methods at several supervised learning tasks. However, they have a large number of parameters that need to be estimated from data, which makes them especially vulnerable to learning relationships between target variables and nuisance factors and, thus, overfitting. The most popular approach to expand the data size and prevent overfitting in deep learning has been synthetic data augmentation [3, 5, 9–11], where multiple copies of data samples are created by altering variations of certain known nuisance factors. DNNs trained with data augmentation have been shown to generalize better and be more robust compared to those trained without in many domains including vision, speech and natural language. 
This approach works on the principle of inclusion. More specifically, the model learns to associate multiple seen variations of those nuisance factors to each target value. In contrast, our method encourages exclusion of information about nuisance factors from latent features used for predicting the target, thus creating more robust features. Furthermore, combining our method with data augmentation further helps our framework remove information about nuisance factors used to synthesize additional data, without the need to explicitly quantify or annotate the generated variations. This is especially helpful in cases where augmentation is performed using sophisticated analytical or composite techniques.

Several supervised methods for invariance induction and invariant feature learning have been developed recently, such as Controllable Adversarial Invariance (CAI) [19], Variational Fair Autoencoder (VFAE) [14], and a maximum mean discrepancy based model (NN+MMD) [13]. These methods use annotated information about variations of specific nuisance factors to force their exclusion from the learned latent representation. They have also been applied to learn "fair" representations based on domain knowledge, such as making predictions about the savings of a person invariant to age, where making the prediction task invariant to such factors is of higher priority than the prediction performance itself [19]. Our method induces invariance to nuisance factors with respect to a supervised task in an unsupervised way. 
However, it is not guaranteed to work in "fairness" settings because it does not incorporate any external knowledge about factors to induce invariance to.

Disentangled representation learning is closely related to our work since disentanglement is one of the pillars of invariance induction in our framework as the model learns two embeddings (for any given data sample) that are expected to be uncorrelated to each other. Our method shares some properties with multi-task learning (MTL) [17] in the sense that the model is trained with multiple objectives. However, a fundamental difference between our framework and MTL is that the latter promotes a shared representation across tasks whereas the only information shared loosely between the tasks of predicting y and reconstructing x in our framework is a noisy version of e1 to help reconstruct x when combined with a separate encoding e2, where e1 itself is used directly to predict y.

3 Unsupervised Adversarial Invariance

In this section, we describe a generalized framework for unsupervised induction of invariance to nuisance factors by disentangling information required for predicting y from other unrelated information contained in x through the incorporation of data reconstruction as a competing task for the primary prediction task and a disentanglement term in the training objective. This is achieved by learning a split representation of data as e = [e1 e2], such that information essential for the prediction task is pulled into e1 while all other information about x migrates to e2. 
We present an adversarial instantiation of this framework, which we call Unsupervised Adversarial Invariance.

3.1 Unsupervised Invariance Induction

Data samples (x) can be abstractly represented as a set of underlying factors of variation F = {fi}. This can be as simple as a collection of numbers denoting the position of a point in space or as complicated as information pertaining to various facial attributes that combine non-trivially to form the image of someone's face. Understanding and modeling the interactions between factors of variation of data is an open problem. However, supervised learning of the mapping of x to target (y) involves a relatively simpler (yet challenging) problem of finding those factors of variation (Fy) that contain all the information required for predicting y and discarding all the others (F̄y). Thus, Fy and F̄y form a partition of F, where we are more interested in the former than the latter. Since y is independent of F̄y, i.e., y ⊥ F̄y, we get p(y|x) = p(y|Fy). Estimating p(y|x) as q(y|Fy) from data is beneficial because the nuisance factors, which comprise F̄y, are never presented to the estimator, thus avoiding inaccurate learning of associations between nuisance factors and y. This forms the basis for "feature selection", a research area that has been well-studied.

We incorporate the idea of splitting F into Fy and F̄y in our framework in a more relaxed sense as learning a disentangled latent representation of x in the form of e = [e1 e2], where e1 aims to capture all the information in Fy and e2 that in F̄y. Once trained, the model can be used to infer e1 from x followed by y from e1. More formally, our general framework for unsupervised invariance induction comprises four core modules: (1) an encoder Enc that embeds x into e = [e1 e2], (2) a predictor Pred that infers y from e1, (3) a noisy-transformer ψ that converts e1 into its noisy version ẽ1, and (4) a decoder Dec that reconstructs x from ẽ1 and e2. Additionally, the training objective contains a loss-term that enforces disentanglement between Enc(x)1 = e1 and Enc(x)2 = e2. Figure 1a shows our generalized framework. The training objective for this system can be written as Equation 1:

L = α Lpred(y, Pred(e1)) + β Ldec(x, Dec(ψ(e1), e2)) + γ Ldis(e1, e2)
  = α Lpred(y, Pred(Enc(x)1)) + β Ldec(x, Dec(ψ(Enc(x)1), Enc(x)2)) + γ Ldis(Enc(x))    (1)

Figure 1: (a) Unsupervised Invariance Induction Framework and (b) Adversarial Model Design

The predictor and the decoder are designed to enter into a competition, where Pred tries to pull information relevant to y into e1 while Dec tries to extract all the information about x into e2. This is made possible by ψ, which makes ẽ1 an unreliable source of information for reconstructing x. Moreover, a version of this framework without ψ can converge to a degenerate solution where e1 contains all the information about x and e2 contains nothing (noise), because absence of ψ allows e1 to be readily available to Dec. The competitive pulling of information into e1 and e2 induces information separation such that e1 tends to contain more information relevant for predicting y and e2 more information irrelevant to the prediction task. However, this competition is not sufficient to completely partition information of x into e1 and e2. Without the disentanglement term (Ldis) in the objective, e1 and e2 can contain redundant information such that e2 has information relevant to y and, more importantly, e1 contains nuisance factors. The disentanglement term in the training objective encourages the desired clean partition. 
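To make the structure of Equation 1 concrete, the numpy sketch below evaluates such a weighted composite objective on random stand-in tensors. The dimensions, the specific choices of cross-entropy for Lpred and MSE for Ldec, and the simple correlation-based placeholder for Ldis are our illustrative assumptions; the paper's adversarial instantiation of the disentanglement term is described in Section 3.2.

```python
import numpy as np

rng = np.random.default_rng(1)
batch, d_e, d_x, n_classes = 4, 8, 16, 3
alpha, beta, gamma = 100.0, 0.1, 1.0  # the weights Section 4 reports as working well

# Stand-ins for module outputs; in the real model these come from Enc, Pred and Dec.
y_true = np.eye(n_classes)[rng.integers(0, n_classes, size=batch)]
y_prob = rng.dirichlet(np.ones(n_classes), size=batch)  # Pred(e1)
x = rng.normal(size=(batch, d_x))
x_recon = rng.normal(size=(batch, d_x))                 # Dec(psi(e1), e2)
e1 = rng.normal(size=(batch, d_e))                      # Enc(x)1
e2 = rng.normal(size=(batch, d_e))                      # Enc(x)2

L_pred = -np.mean(np.sum(y_true * np.log(y_prob + 1e-9), axis=1))  # cross-entropy
L_dec = np.mean((x - x_recon) ** 2)                                # reconstruction MSE
# One simple non-adversarial choice of L_dis: penalize cross-correlation
# between e1 and e2 (placeholder; the adversarial version replaces this).
L_dis = np.mean((e1.T @ e2 / batch) ** 2)

L = alpha * L_pred + beta * L_dec + gamma * L_dis
```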
Thus, essential factors required for predicting y concentrate into e1 and all other factors migrate to e2.

3.2 Adversarial Model Design and Optimization

While there are numerous ways to implement the proposed unsupervised invariance induction framework, we adopt an adversarial model design, introducing a novel approach to disentanglement in the process. Enc, Pred and Dec are modeled as neural networks. ψ can be modeled as a parametric noisy-channel, where the parameters of ψ can also be learned during training. However, we model ψ as dropout [18] (multiplicative Bernoulli noise) because it provides an easy and straightforward method for noisy-transformation of e1 into ẽ1 without complicating the training process.

We augment these core modules with two adversarial disentanglers Dis1 and Dis2. While Dis1 aims to predict e2 from e1, Dis2 aims to do the inverse. Hence, their objectives are in direct opposition to the desired disentanglement, forming the basis for adversarial minimax optimization. Thus, Enc, Pred and Dec can be thought of as a composite model (M1) that is pit against another composite model (M2) containing Dis1 and Dis2. Figure 1b shows our complete model design with M1 represented by the color blue and M2 with orange. The model is trained end-to-end through backpropagation by playing the minimax game described in Equation 2:

min_{Enc, Pred, Dec} max_{Dis1, Dis2} J(Enc, Pred, Dec, Dis1, Dis2); where:

J(Enc, Pred, Dec, Dis1, Dis2)
= α Lpred(y, Pred(e1)) + β Ldec(x, Dec(ψ(e1), e2)) + γ L̃dis(e1, e2)
= α Lpred(y, Pred(Enc(x)1)) + β Ldec(x, Dec(ψ(Enc(x)1), Enc(x)2))
  + γ ( L̃dis1(Enc(x)2, Dis1(Enc(x)1)) + L̃dis2(Enc(x)1, Dis2(Enc(x)2)) )    (2)

We use mean squared error for the disentanglement losses L̃dis1 and L̃dis2. 
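A minimal numpy sketch of the two mean-squared-error disentanglement losses in Equation 2, together with the scheduled 1:k player updates used to optimize the minimax game. The tensors are random stand-ins and the disentanglers are placeholders rather than trained networks.

```python
import numpy as np

rng = np.random.default_rng(2)
batch, d_e = 4, 8

# Stand-ins for the two encoder halves and the disentanglers' predictions.
e1 = rng.normal(size=(batch, d_e))         # Enc(x)1
e2 = rng.normal(size=(batch, d_e))         # Enc(x)2
dis1_pred = rng.normal(size=(batch, d_e))  # Dis1(e1): an attempt to predict e2
dis2_pred = rng.normal(size=(batch, d_e))  # Dis2(e2): an attempt to predict e1

def mse(target, pred):
    return np.mean((target - pred) ** 2)

L_dis1 = mse(e2, dis1_pred)  # driven down by M2 (Dis1), resisted by M1 (Enc)
L_dis2 = mse(e1, dis2_pred)  # driven down by M2 (Dis2), resisted by M1 (Enc)

# Scheduled updates: freeze one composite player while updating the other,
# at an M1:M2 frequency ratio of 1:k (k = 5 in the paper's experiments).
k = 5
schedule = ["M1" if step % (k + 1) == 0 else "M2" for step in range(12)]
print(schedule.count("M1"), schedule.count("M2"))  # 2 10
```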
We optimize the proposed adversarial model using a scheduled update scheme where we freeze the weights of a composite player model (M1 or M2) when we update the weights of the other. M2 should ideally be trained to convergence before updating M1 in each training epoch to backpropagate accurate and stable disentanglement-inducing gradients to Enc. However, this is not scalable in practice. We update M1 and M2 in the frequency ratio of 1 : k. We found k = 5 to perform well in our experiments.

Metric                                   NN+MMD [13]   VFAE [14]   CAI [19]   Ours
Accuracy of predicting y from e1 (Ay)    0.82          0.85        0.89       0.95
Accuracy of predicting z from e1 (Az)    -             0.57        0.57       0.24

Table 1: Results on Extended Yale-B dataset

4 Analysis

Competition between prediction and reconstruction. The prediction and reconstruction tasks in our framework are designed to compete with each other. Thus, η = α/β influences which task has higher priority in the overall objective. We analyze the effect of η on the behavior of our framework at optimality, considering perfect disentanglement of e1 and e2. There are two asymptotic scenarios with respect to η: (1) η → ∞ and (2) η → 0. In case (1), our framework reduces to a predictor model, where the reconstruction task is completely disregarded. Only the branch x ⇢ e1 ⇢ y remains functional. Consequently, e1 contains all f ∈ F′ at optimality, where Fy ⊆ F′ ⊆ F. In contrast, case (2) reduces the framework to an autoencoder, where the prediction task is completely disregarded, and only the branch x ⇢ e2 ⇢ x′ remains functional because the other input to Dec, ψ(e1), is noisy. Thus, e2 contains all f ∈ F and e1 contains nothing at optimality, under perfect disentanglement. In transition from case (1) to case (2), by keeping α fixed and increasing β, the reconstruction loss starts contributing more to the overall objective, thus inducing more competition between the two tasks. As η is gradually decreased, f ∈ (F′ ∖ Fy) ⊆ F̄y migrate from e1 to e2 because f ∈ F̄y are irrelevant to the prediction task but can improve reconstruction by being more readily available to Dec through e2 instead of ψ(e1). 
After a point, further decreasing η is, however, detrimental to the prediction task as the reconstruction task starts dominating the overall objective and pulling f ∈ Fy from e1 to e2.

Equilibrium analysis of adversarial instantiation. The disentanglement and prediction objectives in our adversarial model design can simultaneously reach an optimum where e1 contains Fy and e2 contains F̄y. Hence, the minimax objective in our method has a win-win equilibrium.

Selecting loss weights. Using the above analyses, any γ that successfully disentangles e1 and e2 should be sufficient. On the other hand, α and β can be selected by starting with α ≫ β and gradually increasing β as long as the performance of the prediction task improves. We found α = 100, β = 0.1 and γ = 1 to work well for all datasets on which we evaluated the proposed model.

5 Experimental Evaluation

We provide experimental results on three tasks relevant to invariant feature learning for improved prediction of target variables: (1) invariance to inherent nuisance factors, (2) effective use of synthetic data augmentation for learning invariance, and (3) domain adaptation through learning invariance to "domain" information. We evaluate the performance of our model and prior works on two metrics – accuracy of predicting y from e1 (Ay) and accuracy of predicting z from e1 (Az). 
The goal of the model is to achieve a high Ay and an Az close to random chance.

5.1 Invariance to inherent nuisance factors

We provide results of our framework at the task of learning invariance to inherent nuisance factors on two datasets – Extended Yale-B [7] and Chairs [2].

Extended Yale-B. This dataset contains face-images of 38 subjects under various lighting conditions. The target y is the subject identity whereas the inherent nuisance factor z is the lighting condition. We compare our framework to existing state-of-the-art supervised invariance induction methods, CAI [19], VFAE [14], and NN+MMD [13]. We use the prior works' version of the dataset, which has lighting conditions classified into five groups – front, upper-left, upper-right, lower-left and lower-right, with the same split as 38 × 5 = 190 samples used for training and the rest used for testing [13, 14, 19]. We use the same architecture for the predictor and the encoder as CAI (as presented in [19]), i.e., single-layer neural networks, except that our encoder produces two encodings instead of one. We also model the decoder and the disentanglers as single-layer neural networks. Table 1 summarizes the results. The proposed unsupervised method outperforms existing state-of-the-art (supervised) invariance induction methods on both the Ay and Az metrics, providing a significant boost on Ay and complete removal of lighting information from e1 reflected by Az. 

Figure 2: Extended Yale-B – t-SNE visualization of (a) raw data, (b) e2 labeled by lighting condition, (c) e1 labeled by lighting condition, and (d) e1 labeled by subject-ID (numerical markers, not colors).

Figure 3: Reconstruction from e1 and e2 for (a) Extended Yale B and (b) Chairs. Columns in each block reflect (left to right): real, reconstruction from e1 and that from e2.

Furthermore, the accuracy of predicting z from e2 is 0.89, which validates its automatic migration to e2. Figure 2 shows t-SNE [15] visualization of raw data and embeddings e1 and e2 for our model. While raw data is clustered by lighting conditions z, e1 exhibits clustering by y with no grouping based on z, and e2 exhibits near-perfect clustering by z. Figure 3a shows reconstructions from e1 and e2. Dedicated decoder networks were trained (with weights of Enc frozen) to generate these visualizations. As evident, e1 captures identity-related information but not lighting while e2 captures the inverse.

Figure 4: MNIST-ROT – t-SNE visualization of (a) raw data and (b) e1

Figure 5: t-SNE visualization of MNIST-ROT e1 embedding for the proposed Unsupervised Adversarial Invariance model (a) & (c), and baseline model B0 (b) & (d). Models trained on Θ = {0, ±22.5, ±45}. Visualization generated for θ ∈ {±55}.

Metric   CAI    Ours
Ay       0.68   0.74
Az       0.69   0.34

Table 2: Results on Chairs. High Ay and low Az are desired.

Chairs. This dataset consists of 1393 different chair types rendered at 31 yaw angles and two pitch angles using a computer aided design model. We treat the chair identity as the target y and the yaw angle θ as z. We split the data into training and testing sets by picking alternate yaw angles. Therefore, there is no overlap of θ between the two sets. We compare the performance of our model to CAI. In order to train the CAI model, we group θ into four categories – front, left, right and back, and provide it this information as a one-hot encoded vector. We model the encoder and the predictor as two-layer neural networks for both CAI and our model. We also model the decoder as a two-layer network and the disentanglers as single-layer networks. Table 2 summarizes the results, showing that our model outperforms CAI on both Ay and Az. 
Moreover, the accuracy of predicting θ from e2 is 0.73, which shows that this information migrates to e2. Figure 3b shows results of reconstructing x from e1 and e2 generated in the same way as for Extended Yale-B above. The figure shows that e1 contains identity information but nothing about θ while e2 contains θ with limited identity information.

5.2 Effective use of synthetic data augmentation for learning invariance

Data is often not available for all possible variations of nuisance factors. A popular approach to learn models robust to such expected yet unobserved or infrequently seen (during training) variations is data augmentation through synthetic generation using methods ranging from simple operations [10] like rotation and translation to Generative Adversarial Networks [1, 8] for synthesis of more realistic variations. The prediction model is then trained on the expanded dataset. The resulting model, thus, becomes robust to specific forms of variations of certain nuisance factors that it has seen during training. 

Metric   Angle   CAI     Ours    B0      B1
Ay       Θ       0.958   0.977   0.974   0.972
         ±55°    0.826   0.856   0.826   0.829
         ±65°    0.662   0.696   0.674   0.682
Az       -       0.384   0.338   0.586   0.409

Table 3: Results on MNIST-ROT. Θ = {0, ±22.5°, ±45°} was used for training. High Ay and low Az are desired.

k    CAI     Ours    B0      B1
-2   0.816   0.880   0.872   0.870
2    0.933   0.958   0.942   0.940
3    0.795   0.874   0.847   0.853
4    0.519   0.606   0.534   0.550

Table 4: MNIST-DIL – Accuracy of predicting y (Ay). k = -2 represents erosion with kernel-size of 2.

Figure 6: MNIST-ROT – reconstruction from e1 and e2. Columns in each block reflect (left to right): real, reconstruction from e1 and that from e2.

Invariance induction, on the other hand, aims to completely prevent prediction models from using information about nuisance factors. Data augmentation methods can be more effectively used for improving the prediction of y by using the expanded dataset for inducing invariance by exclusion rather than inclusion. We use two variants of the MNIST [12] dataset of handwritten digits to (1) show the advantage of unsupervised invariance induction at this task over its supervised variant through comparison with CAI, and (2) perform ablation experiments for our model to justify our framework design. We use the same two-layer architectures for the encoder and the predictor in both our model and CAI, except that our encoder generates two encodings instead of one. We model the decoder as a three-layer neural network and the disentanglers as single-layer neural networks. We train two baseline versions of our model for our ablation experiments – B0, composed of Enc and Pred, i.e., a single feed-forward network x ⇢ h ⇢ y, and B1, which is the same as the composite model M1, i.e., the proposed model trained non-adversarially without the disentanglers. B0 is used to validate the phenomenon that invariance by exclusion is a better approach than robustness through inclusion whereas B1 helps evaluate the importance of disentanglement in our framework.

MNIST-ROT. We create this variant of the MNIST dataset by randomly rotating each image by an angle θ ∈ {−45°, −22.5°, 0°, 22.5°, 45°} about the Y-axis. We denote this set of angles as Θ. The angle information is used as a one-hot encoding while training the CAI model. We evaluate all the models on the same metrics Ay and Az we previously used. We additionally test all the models on θ ∉ Θ to gauge the performance of these models on unseen variations of the rotation nuisance factor. Table 3 summarizes the results, showing that our unsupervised adversarial model not only performs better than the baseline ablation versions but also outperforms CAI, which uses supervised information about the rotation angle. The difference in Ay is especially notable for the cases where θ ∉ Θ. Results on Az show that our model discards more information about θ than CAI even though CAI uses θ information during training. The information about θ migrates to e2, indicated by the accuracy of predicting it from e2 being 0.77. Figure 4 shows t-SNE visualization of raw MNIST-ROT images and e1 learned by our model. While raw data tends to cluster by the rotation angle, e1 shows near-perfect grouping based on the digit-class. We further visualize the e1 embedding learned by the proposed model and the baseline B0, which models the classifier x ⇢ h ⇢ y, to investigate the effectiveness of invariance induction by exclusion versus inclusion, respectively. Both the models were trained on digits rotated by θ ∈ Θ and t-SNE visualizations were generated for θ ∈ {±55°}. Figure 5 shows the results. As evident, e1 learned by the proposed model shows no clustering by the rotation angle, while that learned by B0 does, with encodings of some digit classes forming multiple clusters corresponding to rotation angles. Figure 6 shows results of reconstructing x from e1 and e2 generated in the same way as Extended Yale-B above. The figures show that reconstructions from e1 reflect the digit class but contain no information about θ, while those from e2 exhibit the inverse.

MNIST-DIL. We create this variant of MNIST by eroding or dilating MNIST digits using various kernel-sizes (k). 
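Erosion and dilation are standard morphological operations; the paper does not specify an implementation, but a minimal pure-numpy sketch with a square kernel behaves as follows (the `morph` helper and the toy "digit" are ours).

```python
import numpy as np

def morph(img, k, op):
    """Binary dilation (op=np.max) or erosion (op=np.min) with a k x k kernel."""
    pad = k // 2
    # Pad so border pixels are unaffected by the out-of-image region.
    padded = np.pad(img, pad, constant_values=(0 if op is np.max else 1))
    out = np.zeros_like(img)
    h, w = img.shape
    for i in range(h):
        for j in range(w):
            out[i, j] = op(padded[i:i + k, j:j + k])
    return out

# A thick vertical stroke as a stand-in for a digit.
digit = np.zeros((8, 8), dtype=int)
digit[1:7, 3:5] = 1

dilated = morph(digit, 2, np.max)  # thickens the stroke
eroded = morph(digit, 2, np.min)   # thins the stroke
assert dilated.sum() > digit.sum() > eroded.sum()
```

In the notation of Table 4, k = -2 would correspond to applying the erosion direction with a kernel-size of 2.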
We use models trained on MNIST-ROT to report evaluation results on this dataset, to show the advantage of unsupervised invariance induction in cases where certain z are not annotated in the training data. Thus, information about these z cannot be used to train supervised invariance induction models. We also provide ablation results on this dataset using the same baselines B0 and B1. Table 4 summarizes the results of this experiment. The results show significantly better performance of our model compared to CAI and the baselines. More notably, CAI performs significantly worse than our baseline models, indicating that the supervised approach of invariance induction can worsen performance with respect to nuisance factors not accounted for during training.

Source – Target          DANN [6]   VFAE [14]   Ours
books – dvd              0.784      0.799       0.820
books – electronics      0.733      0.792       0.764
books – kitchen          0.779      0.816       0.791
dvd – books              0.723      0.755       0.798
dvd – electronics        0.754      0.786       0.790
dvd – kitchen            0.783      0.822       0.826
electronics – books      0.713      0.727       0.734
electronics – dvd        0.738      0.765       0.740
electronics – kitchen    0.854      0.850       0.890
kitchen – books          0.709      0.720       0.724
kitchen – dvd            0.740      0.733       0.745
kitchen – electronics    0.843      0.838       0.859

Table 5: Results on the Amazon Reviews dataset – accuracy of predicting y from e1 (Ay)

5.3 Domain Adaptation

Domain adaptation has been treated as an invariance induction task in recent literature [6, 14], where the goal is to make the prediction task invariant to the "domain" information. We evaluate the performance of our model at domain adaptation on the Amazon Reviews dataset [4] using the same preprocessing as [14]. The dataset contains text reviews on products in four domains – "books", "dvd", "electronics", and "kitchen".
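For concreteness, the style of feature these experiments run on can be mimicked with standard Python. The `ngram_counts` helper below is our own toy illustration and not the actual preprocessing of [14]:

```python
from collections import Counter


def ngram_counts(text):
    """Toy unigram + bigram count features for one review.

    Stand-in illustration only; the experiments reuse the exact
    preprocessing of the VFAE paper [14] rather than this helper.
    """
    tokens = text.lower().split()
    counts = Counter(tokens)                # unigram counts
    counts.update(zip(tokens, tokens[1:]))  # bigram counts (token pairs)
    return counts


feats = ngram_counts("great book great story")
# feats["great"] == 2; feats[("great", "book")] == 1
```

Each review thus becomes a sparse count vector, and the "domain" (product category) acts as the nuisance factor z to which the sentiment prediction should be invariant.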
Each review is represented as a feature vector of unigram and bigram counts. The target y is the sentiment of the review – either positive or negative. We use the same experimental setup as [6, 14], where the model is trained on one domain and tested on another, thus creating 12 source-target combinations. We design the architectures of the encoder and the decoder in our model to be similar to those of VFAE, as presented in [14]. Table 5 shows the results of the proposed unsupervised adversarial model and of the supervised state-of-the-art methods VFAE and Domain Adversarial Neural Network (DANN) [6]. The results of the prior works are quoted directly from [14]. The results show that our model outperforms both VFAE and DANN on nine of the twelve tasks. Thus, our model can also be used effectively for domain adaptation.

6 Conclusion and Future Work

In this paper, we have presented a novel unsupervised framework for invariance induction in neural networks. Our method models invariance as an information separation task achieved by competitive training between a predictor and a decoder coupled with disentanglement. We described an adversarial instantiation of this framework and provided an analysis of its working. Experimental evaluation shows that our unsupervised adversarial invariance induction model outperforms state-of-the-art methods, which are supervised, at learning invariance to inherent nuisance factors, at effectively using synthetic data augmentation for learning invariance, and at domain adaptation. Furthermore, the fact that our framework requires no annotations for variations of nuisance factors, or even knowledge of such factors, shows the conceptual advantage of our approach over previous methods.
Since our model does not make any assumptions about the data, it can be applied to any supervised learning task, e.g., binary/multi-class classification or regression, without loss of generality.

The proposed approach is not designed to learn "fair representations" of data, e.g., making predictions about the savings of a person invariant to age, when such bias exists in the data and making the prediction task invariant to such biasing factors has higher priority than prediction performance [19]. In future work, we will augment our model with the capability to additionally use supervised information (when available) about known nuisance factors for learning invariance to them, which will, consequently, help our model learn fair representations.

Acknowledgements

This work is based on research sponsored by the Defense Advanced Research Projects Agency under agreement number FA8750-16-2-0204. The U.S. Government is authorized to reproduce and distribute reprints for governmental purposes notwithstanding any copyright notation thereon. The views and conclusions contained herein are those of the authors and should not be interpreted as necessarily representing the official policies or endorsements, either expressed or implied, of the Defense Advanced Research Projects Agency or the U.S. Government.

References

[1] Antreas Antoniou, Amos Storkey, and Harrison Edwards. Data augmentation generative adversarial networks. arXiv preprint arXiv:1711.04340, 2017.

[2] Mathieu Aubry, Daniel Maturana, Alexei Efros, Bryan C. Russell, and Josef Sivic. Seeing 3D chairs: Exemplar part-based 2D-3D alignment using a large dataset of CAD models. In Proceedings of the IEEE Computer Society Conference on Computer Vision and Pattern Recognition, June 2014.

[3] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation learning: A review and new perspectives. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35(8):1798–1828, 2013.

[4] Minmin Chen, Zhixiang Xu, Kilian Q. Weinberger, and Fei Sha. Marginalized denoising autoencoders for domain adaptation. In Proceedings of the 29th International Conference on Machine Learning, ICML'12, pages 1627–1634, USA, 2012. Omnipress.

[5] Iacopo Masi et al. Learning pose-aware models for pose-invariant face recognition in the wild. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2018.

[6] Yaroslav Ganin, Evgeniya Ustinova, Hana Ajakan, Pascal Germain, Hugo Larochelle, François Laviolette, Mario Marchand, and Victor Lempitsky. Domain-adversarial training of neural networks. The Journal of Machine Learning Research, 17(1):2096–2030, 2016.

[7] A. S. Georghiades, P. N. Belhumeur, and D. J. Kriegman. From few to many: Illumination cone models for face recognition under variable lighting and pose. IEEE Transactions on Pattern Analysis and Machine Intelligence, 23(6):643–660, June 2001.

[8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Advances in Neural Information Processing Systems, pages 2672–2680, 2014.

[9] Ayush Jaiswal, Dong Guo, Cauligi S. Raghavendra, and Paul Thompson. Large-scale unsupervised deep representation learning for brain structure. arXiv preprint arXiv:1805.01049, 2018.

[10] Tom Ko, Vijayaditya Peddinti, Daniel Povey, and Sanjeev Khudanpur. Audio augmentation for speech recognition. In Sixteenth Annual Conference of the International Speech Communication Association, 2015.

[11] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[12] Yann LeCun, Léon Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[13] Yujia Li, Kevin Swersky, and Richard Zemel. Learning unbiased features. arXiv preprint arXiv:1412.5244, 2014.

[14] Christos Louizos, Kevin Swersky, Yujia Li, Max Welling, and Richard Zemel. The variational fair autoencoder. In Proceedings of the International Conference on Learning Representations, 2016.

[15] Laurens van der Maaten and Geoffrey Hinton. Visualizing data using t-SNE. Journal of Machine Learning Research, 9(Nov):2579–2605, 2008.

[16] Jianyu Miao and Lingfeng Niu. A survey on feature selection. Procedia Computer Science, 91:919–926, 2016. Promoting Business Analytics and Quantitative Management of Technology: 4th International Conference on Information Technology and Quantitative Management (ITQM 2016).

[17] Sebastian Ruder. An overview of multi-task learning in deep neural networks. arXiv preprint arXiv:1706.05098, 2017.

[18] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[19] Qizhe Xie, Zihang Dai, Yulun Du, Eduard Hovy, and Graham Neubig. Controllable invariance through adversarial feature learning. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 585–596. Curran Associates, Inc., 2017.