{"title": "Regularization Learning Networks: Deep Learning for Tabular Datasets", "book": "Advances in Neural Information Processing Systems", "page_first": 1379, "page_last": 1389, "abstract": "Despite their impressive performance, Deep Neural Networks (DNNs) typically underperform Gradient Boosting Trees (GBTs) on many tabular-dataset learning tasks. We propose that applying a different regularization coefficient to each weight might boost the performance of DNNs by allowing them to make more use of the more relevant inputs. However, this will lead to an intractable number of hyperparameters. Here, we introduce Regularization Learning Networks (RLNs), which overcome this challenge by introducing an efficient hyperparameter tuning scheme which minimizes a new Counterfactual Loss. Our results show that RLNs significantly improve DNNs on tabular datasets, and achieve comparable results to GBTs, with the best performance achieved with an ensemble that combines GBTs and RLNs. RLNs produce extremely sparse networks, eliminating up to 99.8% of the network edges and 82% of the input features, thus providing more interpretable models and revealing the importance that the network assigns to different inputs. RLNs could efficiently learn a single network in datasets that comprise both tabular and unstructured data, such as in the setting of medical imaging accompanied by electronic health records. An open source implementation of RLN can be found at https://github.com/irashavitt/regularization_learning_networks.", "full_text": "Regularization Learning Networks: Deep Learning for Tabular Datasets\n\nIra Shavitt\nWeizmann Institute of Science\nirashavitt@gmail.com\n\nEran Segal\nWeizmann Institute of Science\neran.segal@weizmann.ac.il\n\nAbstract\n\nDespite their impressive performance, Deep Neural Networks (DNNs) typically underperform Gradient Boosting Trees (GBTs) on many tabular-dataset learning tasks. 
We propose that applying a different regularization coefficient to each weight might boost the performance of DNNs by allowing them to make more use of the more relevant inputs. However, this will lead to an intractable number of hyperparameters. Here, we introduce Regularization Learning Networks (RLNs), which overcome this challenge by introducing an efficient hyperparameter tuning scheme which minimizes a new Counterfactual Loss. Our results show that RLNs significantly improve DNNs on tabular datasets, and achieve comparable results to GBTs, with the best performance achieved with an ensemble that combines GBTs and RLNs. RLNs produce extremely sparse networks, eliminating up to 99.8% of the network edges and 82% of the input features, thus providing more interpretable models and revealing the importance that the network assigns to different inputs. RLNs could efficiently learn a single network in datasets that comprise both tabular and unstructured data, such as in the setting of medical imaging accompanied by electronic health records. An open source implementation of RLN can be found at https://github.com/irashavitt/regularization_learning_networks.\n\n1 Introduction\n\nDespite their impressive achievements on various prediction tasks on datasets with distributed representation [14, 4, 5] such as images [19], speech [9], and text [18], there are many tasks in which Deep Neural Networks (DNNs) underperform compared to other models such as Gradient Boosting Trees (GBTs). This is evident in various Kaggle [1, 2] and KDD Cup [7, 16, 27] competitions, which are typically won by GBT-based approaches, and specifically by the XGBoost [8] implementation, either run alone or in combination with several different types of models.\n\nThe datasets in which neural networks are inferior to GBTs typically have different statistical properties. 
Consider the task of image recognition as compared to the task of predicting the life expectancy of patients based on electronic health records. One key difference is that in image classification, many pixels need to change in order for the image to depict a different object [25].¹ In contrast, the relative contribution of the input features in the electronic health records example can vary greatly: changing a single input such as the age of the patient can profoundly impact the life expectancy of the patient, while changes in other input features, such as the time that passed since the last test was taken, may have smaller effects.\n\n¹ This does not contradict the existence of adversarial examples [12], which are able to fool DNNs by changing a small number of input features, but do not actually depict a different object, and generally are not able to fool humans.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\nWe hypothesized that this potentially large variability in the relative importance of different input features may partly explain the lower performance of DNNs on such tabular datasets [11]. One way to overcome this limitation could be to assign a different regularization coefficient to every weight, which might allow the network to accommodate the non-distributed representation and the variability in relative importance found in tabular datasets.\n\nThis will require tuning a large number of hyperparameters. The default approach to hyperparameter tuning is derivative-free optimization of the validation loss, i.e., the loss on a subset of the training set which is not used to fit the model. This approach becomes computationally intractable very quickly.\n\nHere, we present a new hyperparameter tuning technique, in which we optimize the regularization coefficients using a newly introduced loss function, which we term the Counterfactual Loss, or L_CF. 
We term the networks that apply this technique Regularization Learning Networks (RLNs). In RLNs, the regularization coefficients are optimized together with learning the network weight parameters. We show that RLNs significantly and substantially outperform DNNs with other regularization schemes, and achieve comparable results to GBTs. When used in an ensemble with GBTs, RLNs achieve state-of-the-art results on several prediction tasks on a tabular dataset with varying relative importance for different features.\n\n2 Related work\n\nApplying different regularization coefficients to different parts of the network is a common practice. The idea of applying a different regularization coefficient to every weight was introduced in [23], but it was only applied to images with a toy model to demonstrate the ability to optimize many hyperparameters.\n\nOur work is also related to the rich literature on hyperparameter optimization [29]. These works mainly focus on derivative-free optimization [30, 6, 17]. Derivative-based hyperparameter optimization is introduced in [3] for linear models and in [23] for neural networks. In these works, the hyperparameters are optimized using the gradients of the validation loss. Practically, this means that every optimization step of the hyperparameters requires training the whole network and backpropagating the loss to the hyperparameters. [21] showed a more efficient derivative-based way for hyperparameter optimization, which still required a substantial number of additional parameters. [22] introduces an optimization technique similar to the one introduced in this paper; however, the technique in [22] requires a validation set, optimizes only a single regularization coefficient for each layer, and handles at most 10-20 hyperparameters in any network. 
In comparison, training RLNs does not require a validation set, and they assign a different regularization coefficient to every weight, resulting in up to millions of hyperparameters that are optimized efficiently. Additionally, RLNs optimize the coefficients in log space and add a projection after every update to counter the vanishing of the coefficients. Most importantly, the efficient optimization of the hyperparameters in [22] was applied to images and not to datasets with non-distributed representations like tabular datasets.\n\nDNNs have been successfully applied to tabular datasets like electronic health records in [26, 24]. The use of RLN is complementary to these works, and might improve their results and allow the use of deeper networks on smaller datasets.\n\nTo the best of our knowledge, our work is the first to illustrate the statistical difference between distributed and non-distributed representations, to hypothesize that the addition of hyperparameters could enable neural networks to achieve good results on datasets with non-distributed representations such as tabular datasets, and to efficiently train such networks on real-world problems to significantly and substantially outperform networks with other regularization schemes.\n\n3 Regularization Learning\n\nGenerally, when using regularization, we minimize L̃(Z, W, λ) = L(Z, W) + exp(λ) · Σ_{i=1}^n ‖w_i‖, where Z = {(x_m, y_m)}_{m=1}^M are the training samples, L is the loss function, W = {w_i}_{i=1}^n are the weights of the model, ‖·‖ is some norm, and λ is the regularization coefficient,² a hyperparameter of the network. Hyperparameters of the network, like λ, are usually obtained using cross-validation, which is the application of derivative-free optimization on L_CV(Z_t, Z_v, λ) with respect to λ, where L_CV(Z_t, Z_v, λ) = L(Z_v, argmin_W L̃(Z_t, W, λ)) and (Z_t, Z_v) is some partition of Z into train and validation sets, respectively.\n\nIf a different regularization coefficient is assigned to each weight in the network, our learning loss becomes L†(Z, W, Λ) = L(Z, W) + Σ_{i=1}^n exp(λ_i) · ‖w_i‖, where Λ = {λ_i}_{i=1}^n are the regularization coefficients. Using L† will require n hyperparameters, one for every network parameter, which makes tuning with cross-validation intractable, even for very small networks. We would like to keep using L† to update the weights, but to find a more efficient way to tune Λ. One way to do so is through SGD, but it is unclear which loss to minimize: L doesn't have a derivative with respect to Λ, while L† has trivial optimal values, argmin_Λ L†(Z, W, Λ) = {−∞}_{i=1}^n. L_CV has a non-trivial dependency on Λ, but it is very hard to evaluate ∂L_CV/∂Λ.\n\nWe introduce a new loss function, called the Counterfactual Loss L_CF, which has a non-trivial dependency on Λ and can be evaluated efficiently. 
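As a concrete (toy) illustration of the difference between the shared-coefficient loss L̃ and the per-weight loss L†, the following sketch uses a small linear model with an L1 norm; the data, names, and model here are illustrative, not the paper's implementation:

```python
import numpy as np

def empirical_loss(Z, W):
    """L(Z, W): mean squared error of a toy linear model y ~ X @ W."""
    X, y = Z
    return np.mean((X @ W - y) ** 2)

def shared_loss(Z, W, lam):
    """L-tilde(Z, W, lambda): one shared coefficient, exp(lam) * sum_i |w_i|."""
    return empirical_loss(Z, W) + np.exp(lam) * np.sum(np.abs(W))

def per_weight_loss(Z, W, Lam):
    """L-dagger(Z, W, Lambda): a separate coefficient lambda_i for every w_i."""
    return empirical_loss(Z, W) + np.sum(np.exp(Lam) * np.abs(W))

rng = np.random.default_rng(0)
Z = (rng.normal(size=(8, 3)), rng.normal(size=8))
W = rng.normal(size=3)
lam = -2.0
# With all lambda_i equal, the per-weight loss reduces to the shared one:
assert np.isclose(shared_loss(Z, W, lam), per_weight_loss(Z, W, np.full(3, lam)))
```

The point of the sketch is only that L† strictly generalizes L̃: it coincides with it when all λ_i are equal, at the cost of n hyperparameters instead of one.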
For every time-step t during the training, let W_t and Λ_t be the weights and regularization coefficients of the network, respectively, and let w_{t,i} ∈ W_t and λ_{t,i} ∈ Λ_t be the weight and the regularization coefficient of the same edge i in the network. When optimizing using SGD, the value of this weight in the next time-step will be w_{t+1,i} = w_{t,i} − η · ∂L†(Z_t, W_t, Λ_t)/∂w_{t,i}, where η is the learning rate, and Z_t is the training batch at time t.³ We can split the gradient into two parts:\n\nw_{t+1,i} = w_{t,i} − η · (g_{t,i} + r_{t,i})   (1)\n\ng_{t,i} = ∂L(Z_t, W_t)/∂w_{t,i}   (2)\n\nr_{t,i} = ∂/∂w_{t,i} ( Σ_{j=1}^n exp(λ_{t,j}) · ‖w_{t,j}‖ ) = exp(λ_{t,i}) · ∂‖w_{t,i}‖/∂w_{t,i}   (3)\n\nWe call g_{t,i} the gradient of the empirical loss L and r_{t,i} the gradient of the regularization term. All but one of the addends of r_{t,i} vanish since ∂/∂w_{t,i} (exp(λ_{t,j}) · ‖w_{t,j}‖) = 0 for every j ≠ i. Denote by W_{t+1} = {w_{t+1,i}}_{i=1}^n the weights in the next time-step, which depend on Z_t, W_t, Λ_t, and η, as shown in Equation 1, and define the Counterfactual Loss to be\n\nL_CF(Z_t, Z_{t+1}, W_t, Λ_t, η) = L(Z_{t+1}, W_{t+1})   (4)\n\nL_CF is the empirical loss L, where the weights have already been updated using SGD over the regularized loss L†. We call this the Counterfactual Loss since we are asking a counterfactual question: what would the loss of the network have been had we updated the weights with respect to L†? We will use L_CF to optimize the regularization coefficients using SGD while simultaneously learning the weights of the network using L†. We call this technique Regularization Learning, and networks that employ it Regularization Learning Networks (RLNs).\n\nTheorem 1. 
The gradient of the Counterfactual Loss with respect to the regularization coefficient is\n\n∂L_CF/∂λ_{t,i} = −η · g_{t+1,i} · r_{t,i}\n\nProof. L_CF only depends on λ_{t,i} through w_{t+1,i}, allowing us to use the chain rule ∂L_CF/∂λ_{t,i} = ∂L_CF/∂w_{t+1,i} · ∂w_{t+1,i}/∂λ_{t,i}. The first multiplier is the gradient g_{t+1,i}. Regarding the second multiplier, from Equation 1 we see that only r_{t,i} depends on λ_{t,i}. Combining with Equation 3 leaves us with:\n\n∂w_{t+1,i}/∂λ_{t,i} = ∂/∂λ_{t,i} (w_{t,i} − η · (g_{t,i} + r_{t,i})) = −η · ∂r_{t,i}/∂λ_{t,i} = −η · ∂/∂λ_{t,i} (exp(λ_{t,i}) · ∂‖w_{t,i}‖/∂w_{t,i}) = −η · exp(λ_{t,i}) · ∂‖w_{t,i}‖/∂w_{t,i} = −η · r_{t,i}\n\n² The notation for the regularization term is typically λ · Σ_{i=1}^n ‖w_i‖. We use the notation exp(λ) · Σ_{i=1}^n ‖w_i‖ to force the coefficients to be positive, to accelerate their optimization and to simplify the calculations shown.\n\n³ We assume vanilla SGD is used in this analysis for brevity, but the analysis holds for any derivative-based optimization method.\n\nTheorem 1 gives us the update rule λ_{t+1,i} = λ_{t,i} − ν · ∂L_CF/∂λ_{t,i} = λ_{t,i} + ν · η · g_{t+1,i} · r_{t,i}, where ν is the learning rate of the regularization coefficients.\n\nIntuitively, the gradient of the Counterfactual Loss has the opposite sign to the product of g_{t+1,i} and r_{t,i}. Comparing this result with Equation 1, this means that when g_{t+1,i} and r_{t,i} agree in sign, the regularization helps reduce the loss, and we can strengthen it by increasing λ_{t,i}. 
When they disagree, this means that the regularization hurts the performance of the network, and we should relax it for this weight.\n\nFigure 1: The input features, sorted by their R² correlation to the label. We display the microbiome dataset, with the covariates marked, in comparison to the MNIST dataset [20].\n\nThe size of the Counterfactual gradient is proportional to the product of the sizes of g_{t+1,i} and r_{t,i}. When g_{t+1,i} is small, w_{t+1,i} does not affect the loss L much, and when r_{t,i} is small, λ_{t,i} does not affect w_{t+1,i} much. In both cases, λ_{t,i} has a small effect on L_CF. Only when both r_{t,i} is large (meaning that λ_{t,i} affects w_{t+1}) and g_{t+1,i} is large (meaning that w_{t+1} affects L) does λ_{t,i} have a large effect on L_CF, and we get a large gradient ∂L_CF/∂λ_{t,i}.\n\nAt the limit of many training iterations, λ_{t,i} tends to decrease continuously. We try to give some insight into these dynamics in the supplementary material. To address this issue, we project the regularization coefficients onto a simplex after updating them:\n\nλ̃_{t+1,i} = λ_{t,i} + ν · η · g_{t+1,i} · r_{t,i}   (5)\n\nλ_{t+1,i} = λ̃_{t+1,i} + (θ − (1/n) · Σ_{j=1}^n λ̃_{t+1,j})   (6)\n\nwhere θ is the normalization factor of the regularization coefficients, a hyperparameter of the network tuned using cross-validation. 
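Equations 1, 5, and 6 can be sketched as a single update step. The following is a minimal illustration for a linear model with an L1 norm, assuming vanilla SGD and mean-squared error; it is not the authors' implementation (see their repository for that), and all names are ours:

```python
import numpy as np

def rln_step(Zt, Zt1, W, Lam, eta, nu, theta):
    """One regularization-learning step for a toy linear model with an L1 norm.
    Eq. 1: SGD on the regularized loss L-dagger; Eq. 5: counterfactual-gradient
    step on the coefficients (Theorem 1); Eq. 6: projection keeping their mean
    at theta."""
    def grad_L(Z, W):                     # gradient of the empirical loss (MSE)
        X, y = Z
        return 2 * X.T @ (X @ W - y) / len(y)
    g_t = grad_L(Zt, W)                   # Eq. 2
    r_t = np.exp(Lam) * np.sign(W)        # Eq. 3: exp(lambda_i) * d|w_i|/dw_i
    W_next = W - eta * (g_t + r_t)        # Eq. 1
    g_next = grad_L(Zt1, W_next)          # gradient on the next batch
    Lam_tilde = Lam + nu * eta * g_next * r_t          # Eq. 5
    Lam_next = Lam_tilde + (theta - Lam_tilde.mean())  # Eq. 6
    return W_next, Lam_next

rng = np.random.default_rng(0)
Zt = (rng.normal(size=(8, 3)), rng.normal(size=8))
Zt1 = (rng.normal(size=(8, 3)), rng.normal(size=8))
W, Lam = rng.normal(size=3), np.full(3, -6.6)
W, Lam = rln_step(Zt, Zt1, W, Lam, eta=0.01, nu=0.1, theta=-6.6)
assert np.isclose(Lam.mean(), -6.6)  # the projection restores mean theta
```

Note how Equation 6 makes the coefficients zero-sum around θ: after every update their mean is exactly θ, so relaxing one coefficient must be paid for by strengthening others.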
This results in a zero-sum game behavior in the regularization, where a relaxation in one edge allows us to strengthen the regularization in other parts of the network. This could lead the network to assign a modular regularization profile, where uninformative connections are heavily regularized and informative connections get a very relaxed regularization, which might boost performance on datasets with non-distributed representation such as tabular datasets. The full algorithm is described in the supplementary material.\n\nFigure 2: Prediction of traits using microbiome data and covariates, given as the overall explained variance (R²).\n\n4 Experiments\n\nWe demonstrate the performance of our method on the problem of predicting human traits from gut microbiome data and basic covariates (age, gender, BMI). The human gut microbiome is the collection of microorganisms found in the human gut and is composed of trillions of cells, including bacteria, eukaryotes, and viruses. In recent years, there have been major advances in our understanding of the microbiome and its connection to human health. Microbiome composition is determined by DNA sequencing of human stool samples, which results in short (75-100 base pairs) DNA reads. By mapping these short reads to databases of known bacterial species, we can deduce both the source species and gene from which each short read originated. Thus, upon mapping a collection of different samples, we obtain a matrix of estimated relative species abundances for each person and a matrix of the estimated relative gene abundances for each person. Since these features have varying relative importance (Figure 1), we expected GBTs to outperform DNNs on these tasks.\n\nWe sampled 2,574 healthy participants for which we measured, in addition to the gut microbiome, a collection of different traits, including important disease risk factors such as cholesterol levels and BMI. 
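The relative-abundance features described above are simply per-sample normalized read counts; a minimal sketch with an illustrative count matrix (the numbers are made up, not the study's data):

```python
import numpy as np

# Rows are samples (people), columns are bacterial species; entries are counts
# of short reads mapped to each species (illustrative numbers only).
read_counts = np.array([[120, 30, 0, 50],
                        [10, 200, 40, 0],
                        [0, 0, 90, 10]], dtype=float)

# Relative abundances: each row is normalized to sum to 1.
rel_abundance = read_counts / read_counts.sum(axis=1, keepdims=True)
assert np.allclose(rel_abundance.sum(axis=1), 1.0)
```

Because the features are relative, a handful of dominant species can carry most of the signal, which is exactly the varying-relative-importance regime discussed above.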
Finding associations between these disease risk factors and the microbiome composition is of great scientific interest, and can raise novel hypotheses about the role of the microbiome in disease. We tested 4 types of models: RLN, GBT, DNN, and Linear Models (LM). The full list of hyperparameters, the settings used for training the models and the ensembles, as well as the description of all the input features and the measured traits, can be found in the supplementary material.\n\n5 Results\n\nWhen running each model separately, GBTs achieve the best results on all of the tested traits, but the difference is only significant on 3 of them (Figure 2). DNNs achieve the worst results, with 15% ± 1% less explained variance than GBTs on average. RLNs significantly and substantially improve this by a factor of 2.57 ± 0.05, and achieve only 2% ± 2% less explained variance than GBTs on average.\n\nFigure 3: For each model type and trait, we took the 10 best performing models, based on their validation performance, calculated the average variance of the predicted test samples, and plotted it against the improvement in R² obtained when training ensembles of these models. Note that models that have a high variance in their prediction benefit more from the use of ensembles. As expected, DNNs gain the most from ensembling.\n\nFigure 4: Ensembles of different predictors.\n\nFigure 5: Results of various ensembles that are each composed of different types of models.\n\nTrait | RLN + GBT | LM + GBT | GBT | RLN | Max\nAge | 31.9% ± 0.2% | 30.5% ± 0.5% | 30.9% ± 0.1% | 29.1% ± 0.2% | 31.9%\nHbA1c | 30.5% ± 0.2% | 30.2% ± 0.3% | 30.5% ± 0.04% | 28.4% ± 0.1% | 30.5%\nHDL cholesterol | 28.8% ± 0.2% | 27.7% ± 0.2% | 27.2% ± 0.04% | 27.9% ± 0.1% | 28.8%\nMedian glucose | 26.2% ± 0.1% | 26.1% ± 0.1% | 25.2% ± 0.04% | 25.5% ± 0.1% | 26.2%\nMax glucose | 25.2% ± 0.3% | 25.0% ± 0.1% | 24.6% ± 0.03% | 23.7% ± 0.4% | 25.2%\nCRP | 24.0% ± 0.3% | 23.7% ± 0.2% | 22.4% ± 0.1% | 22.8% ± 0.4% | 24.0%\nGender | 17.9% ± 0.4% | 16.9% ± 0.6% | 18.7% ± 0.03% | 11.9% ± 0.4% | 18.7%\nBMI | 17.6% ± 0.1% | 17.2% ± 0.2% | 16.9% ± 0.04% | 16.0% ± 0.1% | 17.6%\nCholesterol | 7.8% ± 0.3% | 7.6% ± 0.3% | 7.8% ± 0.1% | 5.8% ± 0.2% | 7.8%\n\nTable 1: Explained variance (R²) of various ensembles with different types of models. Only the 4 ensembles that achieved the best results are shown. The best result for each trait is highlighted, and underlined if it significantly outperforms all other ensembles.\n\nConstructing an ensemble of models is a powerful technique for improving performance, especially for models which have high variance, like neural networks in our task. As seen in Figure 3, the average variance of predictions of the top 10 models of RLN and DNN is 1.3% ± 0.6% and 14% ± 3% respectively, while the variance of predictions of the top 10 models of LM and GBT is only 0.13% ± 0.05% and 0.26% ± 0.02%, respectively. As expected, the high variance of RLN and DNN models allows ensembles of these models to improve the performance over a single model by 1.5% ± 0.7% and 4% ± 1% respectively, while LM and GBT only improve by 0.2% ± 0.3% and 0.3% ± 0.4%, respectively. Despite the improvement, DNN ensembles still achieve the worst results on all of the traits except for Gender and achieve results 9% ± 1% lower than GBT ensembles (Figure 4). 
In comparison, this improvement allows RLN ensembles to outperform GBT ensembles on HDL cholesterol, Median glucose, and CRP, and to obtain results 8% ± 1% higher than DNN ensembles and only 1.4% ± 0.1% lower than GBT ensembles.\n\nUsing ensembles of different types of models could be even more effective, because their errors are likely to be even more uncorrelated than those of ensembles of a single type of model. Indeed, as shown in Figure 5, the best performance is obtained with an ensemble of RLN and GBT, which achieves the best results on all traits except Gender, and significantly outperforms all other ensembles on Age, BMI, and HDL cholesterol (Table 1).\n\n6 Analysis\n\nWe next sought to examine the effect that our new type of regularization has on the learned networks. Strikingly, we found that RLNs are extremely sparse, even compared to L1-regularized networks. To demonstrate this, we took the hyperparameter setting that achieved the best results on the HbA1c task for the DNN and RLN models and trained a single network on the entire dataset. Both models achieved their best hyperparameter setting when using L1 regularization. Remarkably, 82% of the 
Despite this extreme sparsity, the non zero\nweights are not particularly small and have a similar distribution as the weights of the DNN (Figure\n6b).\n\n(a)\n\n(b)\n\nWe suspect\nthat\nthe combination\nof a sparse net-\nwork with large\nweights\nallows\nRLNs to achieve\ntheir\nimproved\nperformance,\nas our dataset\nincludes features\nwith varying rel-\native importance.\nTo show this, we\nre-optimized the\nhyperparameters\nof the DNN and\nmodels\nRLN\nafter removing the covariates from the datasets. The covariates are very important features (Figure\n1), and removing them would reduce the variability in relative importance. As can be seen in Figure\n7a, even without the covariates, the RLN and GBT ensembles still achieve the best results on 5 out\nof the 9 traits. However, this improvement is less signi\ufb01cant than when adding the covariates, where\nRLN and GBT ensembles achieve the best results on 8 out of the 9 traits. RLNs still signi\ufb01cantly\noutperform DNNs, achieving explained variance higher by 2% \u00b1 1%, but this is signi\ufb01cantly smaller\nthan the 9% \u00b1 2% improvement obtained when adding the covariates (Figure 7b). We speculate that\nthis is because RLNs particularly shine when features have very different relative importances.\n\nFigure 6: a) Each line represents an input feature in a model. The values of\neach line are the absolute values of its outgoing weights, sorted from greatest to\nsmallest. Noticeably, only 12% of the input features have any non-zero outgoing\nedge in the RLN model. b) The cumulative distribution of non-zero outgoing\nweights for the input features for different models. Remarkably, the distribution\nof non-zero weights is quite similar for the two models.\n\nTo understand what causes this interesting structure, we next explored how the weights in RLNs\nchange during training. During training, each edge performs a traversal in the w, \u03bb space. 
We expect that when λ decreases and the regularization is relaxed, the absolute value of w should increase, and vice versa. In Figure 8, we can see that 99.9% of the edges of the first layer finish the training with a zero value. There are still 434 non-zero edges in the first layer due to the large size of the network. This is not unique to the first layer; in fact, 99.8% of the weights of the entire network have a zero value by the end of the training. The edges of the first layer that end up with a non-zero weight decrease rapidly at the beginning of the training because of the regularization, but during the first 10-20 epochs, the network quickly learns better regularization coefficients for its edges. The regularization coefficients are normalized after every update, hence by applying stronger regularization on some edges, the network is allowed to have a more relaxed regularization on other edges and consequently a larger weight. By epoch 20, the edges of the first layer that end up with a non-zero weight have an average regularization coefficient of −9.4, which is significantly smaller than their initial value θ = −6.6. These low values impose effectively no regularization, and their weights are updated primarily to minimize the empirical loss component of the loss function, L.\n\nFinally, we reasoned that since RLNs assign non-zero weights to a relatively small number of inputs, they may be used to provide insights into the inputs that the model found most important for generating its predictions, using Garson's algorithm [10]. 
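Garson's algorithm, in its common single-hidden-layer form, attributes to each input an importance proportional to the absolute weight mass it routes toward the output. The sketch below shows that common form; it is not necessarily the exact variant used in the paper, and the weights are illustrative:

```python
import numpy as np

def garson_importance(W_in, W_out):
    """Garson's algorithm for a network with one hidden layer.
    W_in:  (n_inputs, n_hidden) input-to-hidden weights.
    W_out: (n_hidden,) hidden-to-output weights.
    Returns per-input importances that sum to 1."""
    A = np.abs(W_in)
    # Share of hidden unit j attributed to input i, scaled by |v_j|:
    C = A / A.sum(axis=0, keepdims=True) * np.abs(W_out)
    Q = C.sum(axis=1)          # accumulate contributions over hidden units
    return Q / Q.sum()         # normalize to a distribution over inputs

rng = np.random.default_rng(0)
imp = garson_importance(rng.normal(size=(5, 4)), rng.normal(size=4))
assert np.isclose(imp.sum(), 1.0) and np.all(imp >= 0)
```

Because the importances form a distribution over inputs, their entropy and pairwise divergences can be compared across models, as done below.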
There has been important progress in recent years in sample-aware model interpretability techniques for DNNs [28, 31], but tools to produce sample-agnostic model interpretations are lacking [15].⁴ Model interpretability is particularly important in our problem for obtaining insights into which bacterial species contribute to predicting each trait.\n\n⁴ The sparsity of RLNs could be beneficial for sample-aware model interpretability techniques such as [28, 31]. This was not examined in this paper.\n\nFigure 7: a) Training our models without adding the covariates. b) The relative improvement RLN achieves compared to DNN for different input features.\n\nFigure 8: On the left axis, shown is the traversal of the edges of the first layer that finished the training with a non-zero weight in the (w, λ) space. Each colored line represents an edge, and its color represents its regularization, with yellow lines having strong regularization. On the right axis, the black line plots the percent of zero-weight edges in the first layer during training.\n\nEvaluating feature importance is difficult, especially in domains in which little is known, such as the gut microbiome. One possibility is to examine the information it supplies. In Figure 9a we show the feature importance achieved through this technique using RLNs and DNNs. While the importance in DNNs is almost constant and does not give any meaningful information about the specific importance of the features, the importance in RLNs is much more meaningful, with an entropy of 4.6 bits for the RLN importance, compared to more than twice that, 9.5 bits, for the DNN importance.\n\nFigure 9: a) The input features, sorted by their importance, in the DNN and RLN models. 
b) The Jensen-Shannon divergence between the feature importance of different instantiations of a model.\n\nAnother possibility is to evaluate its consistency across different instantiations of the model. We expect that a good feature importance technique will give similar importance distributions regardless of instantiation. We trained 10 instantiations for each model and phenotype and evaluated their feature importance distributions, for which we calculated the Jensen-Shannon divergence. In Figure 9b we see that RLNs have divergence values 48% ± 1% and 54% ± 2% lower than DNNs and LMs, respectively. This is an indication that Garson's algorithm results in meaningful feature importances in RLNs. We list the 5 most important bacterial species for different traits in the supplementary material.\n\n7 Conclusion\n\nIn this paper, we explore the learning of datasets with non-distributed representation, such as tabular datasets. We hypothesize that modular regularization could boost the performance of DNNs on such tabular datasets. We introduce the Counterfactual Loss, L_CF, and Regularization Learning Networks (RLNs), which use the Counterfactual Loss to tune their regularization hyperparameters efficiently during learning, together with the learning of the weights of the network.\n\nWe test our method on the task of predicting human traits from covariates and microbiome data and show that RLNs significantly and substantially improve the performance over classical DNNs, achieving explained variance increased by a factor of 2.75 ± 0.05 and comparable results with GBTs. 
The use of ensembles further improves the performance of RLNs; ensembles of RLN and GBT achieve the best results on all but one of the traits, and significantly outperform any other ensemble not incorporating RLNs on 3 of the traits.\n\nWe further explore RLN structure and dynamics and show that RLNs learn extremely sparse networks, eliminating 99.8% of the network edges and 82% of the input features. In our setting, this was achieved in the first 10-20 epochs of training, in which the network learns its regularization. Because of the modularity of the regularization, the remaining edges are virtually not regularized at all, achieving a weight distribution similar to that of a DNN. The modular structure of the network is especially beneficial for datasets with high variability in the relative importance of the input features, where RLNs particularly shine compared to DNNs. The sparse structure of RLNs lends itself naturally to model interpretability, which gives meaningful insights into the relation between features and labels, and may itself serve as a feature selection technique that can have many uses on its own [13].\n\nBesides improving performance on tabular datasets, another important application of RLNs could be learning tasks with multiple data sources, one that includes features with high variability in the relative importance, and one which does not. To illustrate this point, consider the problem of detecting pathologies from medical imaging. DNNs achieve impressive results on this task [32], but in real life, the imaging is usually accompanied by a great deal of tabular metadata in the form of the electronic health records of the patient. We would like to use both data sources for prediction, but different models achieve the best results on each part of the data. Currently, there is no simple way to jointly train and combine the models. 
Having a DNN architecture such as RLN that performs well on tabular data will thus allow us to jointly train a single network on both datasets natively, and may improve the overall performance.

Acknowledgments

We would like to thank Ron Sender, Eran Kotler, Smadar Shilo, Nitzan Artzi, Daniel Greenfeld, Gal Yona, Tomer Levy, Dror Kaufmann, Aviv Netanyahu, Hagai Rossman, Yochai Edlitz, Amir Globerson and Uri Shalit for useful discussions.

References

[1] David Beam and Mark Schramm. Rossmann Store Sales. 2015.

[2] Kamil Belkhayat Abou Omar, Gino Bruner, Yuyi Wang, and Roger Wattenhofer. XGBoost and LGBM for Porto Seguro's Kaggle challenge: A comparison. Semester Project, 2018.

[3] Yoshua Bengio. Gradient-Based Optimization of Hyperparameters. pages 1-18, 1999.

[4] Yoshua Bengio, Aaron Courville, and Pascal Vincent. Representation Learning: A Review and New Perspectives.

[5] Yoshua Bengio and Yann LeCun. Scaling Learning Algorithms towards AI. 2007.

[6] James Bergstra, Rémi Bardenet, Yoshua Bengio, and Balázs Kégl. Algorithms for Hyper-Parameter Optimization. Advances in Neural Information Processing Systems (NIPS), pages 2546-2554, 2011.

[7] Hengxing Cai, Runxing Zhong, Chaohe Wang, Kejie Zhou, Hongyun Lee, Renxin Zhong, Yao Zhou, Da Li, Nan Jiang, Xu Cheng, and Jiawei Shen. KDD CUP 2018 Travel Time Prediction.

[8] Tianqi Chen and Carlos Guestrin. XGBoost: A Scalable Tree Boosting System.

[9] Chung-Cheng Chiu, Tara N. Sainath, Yonghui Wu, Rohit Prabhavalkar, Patrick Nguyen, Zhifeng Chen, Anjuli Kannan, Ron J. Weiss, Kanishka Rao, Ekaterina Gonina, Navdeep Jaitly, Bo Li, Jan Chorowski, and Michiel Bacchiani. State-of-the-Art Speech Recognition with Sequence-to-Sequence Models.

[10] G. D. Garson. Interpreting neural network connection weights. AI Expert, 6(4):47-51, April 1991.

[11] Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.

[12] Ian J. Goodfellow, Jonathon Shlens, and Christian Szegedy. Explaining and Harnessing Adversarial Examples.

[13] Bryce Goodman and Seth Flaxman. European Union regulations on algorithmic decision-making and a "right to explanation".

[14] G. E. Hinton, J. L. McClelland, and D. E. Rumelhart. Distributed representations.

[15] Sara Hooker, Dumitru Erhan, Pieter-Jan Kindermans, and Been Kim. Evaluating Feature Importance Estimates.

[16] Yide Huang. Highway Tollgates Traffic Flow Prediction. Task 1: Travel Time Prediction.

[17] Frank Hutter, Holger H. Hoos, and Kevin Leyton-Brown. Sequential Model-Based Optimization for General Algorithm Configuration. Lecture Notes in Computer Science, pages 507-523, 2011.

[18] Melvin Johnson, Mike Schuster, Quoc V. Le, Maxim Krikun, Yonghui Wu, Zhifeng Chen, Nikhil Thorat, Fernanda Viégas, Martin Wattenberg, Greg Corrado, Macduff Hughes, and Jeffrey Dean. Google's Multilingual Neural Machine Translation System: Enabling Zero-Shot Translation. 2016.

[19] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E. Hinton. ImageNet Classification with Deep Convolutional Neural Networks.

[20] Yann LeCun. The MNIST database of handwritten digits. http://yann.lecun.com/exdb/mnist/.

[21] Jonathan Lorraine and David Duvenaud. Stochastic Hyperparameter Optimization through Hypernetworks. 2018.

[22] Jelena Luketina, Mathias Berglund, Klaus Greff, and Tapani Raiko. Scalable Gradient-Based Tuning of Continuous Regularization Hyperparameters.

[23] Dougal Maclaurin, David Duvenaud, and Ryan P. Adams. Gradient-based Hyperparameter Optimization through Reversible Learning.

[24] Riccardo Miotto, Li Li, Brian A. Kidd, and Joel T. Dudley. Deep Patient: An Unsupervised Representation to Predict the Future of Patients from the Electronic Health Records. Nature Publishing Group, 2016.

[25] Nicolas Papernot, Patrick McDaniel, Somesh Jha, Matt Fredrikson, Z. Berkay Celik, and Ananthram Swami. The Limitations of Deep Learning in Adversarial Settings.

[26] Alvin Rajkomar, Eyal Oren, Kai Chen, Andrew M. Dai, Nissan Hajaj, Michaela Hardt, Peter J. Liu, Xiaobing Liu, Jake Marcus, Mimi Sun, Patrik Sundberg, Hector Yee, Kun Zhang, Yi Zhang, Gerardo Flores, Gavin E. Duggan, Jamie Irvine, Quoc Le, Kurt Litsch, Alexander Mossin, Justin Tansuwan, De Wang, James Wexler, Jimbo Wilson, Dana Ludwig, Samuel L. Volchenboum, Katherine Chou, Michael Pearson, Srinivasan Madabushi, Nigam H. Shah, Atul J. Butte, Michael D. Howell, Claire Cui, Greg S. Corrado, and Jeffrey Dean. Scalable and accurate deep learning with electronic health records. npj Digital Medicine, 1, 2018.

[27] Vlad Sandulescu and Mihai Chiru. Predicting the future relevance of research institutions - The winning solution of the KDD Cup 2016.

[28] Avanti Shrikumar, Peyton Greenside, and Anna Y. Shcherbina. Not Just a Black Box: Learning Important Features Through Propagating Activation Differences.

[29] Leslie N. Smith. A disciplined approach to neural network hyper-parameters: Part 1 - learning rate, batch size, momentum, and weight decay.

[30] Jasper Snoek, Hugo Larochelle, and Ryan P. Adams. Practical Bayesian Optimization of Machine Learning Algorithms. pages 1-12, 2012.

[31] Mukund Sundararajan, Ankur Taly, and Qiqi Yan. Gradients of Counterfactuals.

[32] Kenji Suzuki. Overview of deep learning in medical imaging. Radiological Physics and Technology, 10.