{"title": "Distributed Weight Consolidation: A Brain Segmentation Case Study", "book": "Advances in Neural Information Processing Systems", "page_first": 4093, "page_last": 4103, "abstract": "Collecting the large datasets needed to train deep neural networks can be very difficult, particularly for the many applications for which sharing and pooling data is complicated by practical, ethical, or legal concerns. However, it may be the case that derivative datasets or predictive models developed within individual sites can be shared and combined with fewer restrictions. Training on distributed data and combining the resulting networks is often viewed as continual learning, but these methods require networks to be trained sequentially. In this paper, we introduce distributed weight consolidation (DWC), a continual learning method to consolidate the weights of separate neural networks, each trained on an independent dataset. We evaluated DWC with a brain segmentation case study, where we consolidated dilated convolutional neural networks trained on independent structural magnetic resonance imaging (sMRI) datasets from different sites. We found that DWC led to increased performance on test sets from the different sites, while maintaining generalization performance for a very large and completely independent multi-site dataset, compared to an ensemble baseline.", "full_text": "Distributed Weight Consolidation:\nA Brain Segmentation Case Study\n\nPatrick McClure\n\nNational Institute of Mental Health\n\npatrick.mcclure@nih.gov\n\nCharles Y. Zheng\n\nNational Institute of Mental Health\n\ncharles.zheng@nih.gov\n\nJakub R. Kaczmarzyk\n\nMassachusetts Institute of Technology\n\njakubk@mit.edu\n\nMassachusetts Institute of Technology\n\nSatrajit S. Ghosh\n\nsatra@mit.edu\n\nPeter Bandettini\n\nNational Institute of Mental Health\n\nbandettini@nih.gov\n\nJohn A. 
Lee\n\nNational Institute of Mental Health\n\njohn.rodgers-lee@nih.gov\n\nDylan Nielson\n\nNational Institute of Mental Health\n\ndylann.nielson@nih.gov\n\nFrancisco Pereira\n\nNational Institute of Mental Health\nfrancisco.pereira@nih.gov\n\nAbstract\n\nCollecting the large datasets needed to train deep neural networks can be very\ndif\ufb01cult, particularly for the many applications for which sharing and pooling data\nis complicated by practical, ethical, or legal concerns. However, it may be the case\nthat derivative datasets or predictive models developed within individual sites can\nbe shared and combined with fewer restrictions. Training on distributed data and\ncombining the resulting networks is often viewed as continual learning, but these\nmethods require networks to be trained sequentially. In this paper, we introduce\ndistributed weight consolidation (DWC), a continual learning method to consolidate\nthe weights of separate neural networks, each trained on an independent dataset.\nWe evaluated DWC with a brain segmentation case study, where we consolidated\ndilated convolutional neural networks trained on independent structural magnetic\nresonance imaging (sMRI) datasets from different sites. We found that DWC led\nto increased performance on test sets from the different sites, while maintaining\ngeneralization performance for a very large and completely independent multi-site\ndataset, compared to an ensemble baseline.\n\n1\n\nIntroduction\n\nDeep learning methods require large datasets to perform well. Collecting such datasets can be very\ndif\ufb01cult, particularly for the many applications for which sharing and pooling data is complicated\nby practical, ethical, or legal concerns. One prominent application is human subjects research, in\nwhich researchers may be prevented from sharing data due to privacy concerns or other ethical\nconsiderations. 
These concerns can significantly limit the purposes for which the collected data can\nbe used, even within a particular collection site. If the datasets are collected in a clinical setting,\nthey may be subject to many additional constraints. However, it may be the case that derivative\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.\n\n\fdatasets or predictive models developed within individual sites can be shared and combined with\nfewer restrictions.\nIn the neuroimaging literature, several platforms have been introduced for combining models trained\non different datasets, such as ENIGMA ([29], for meta-analyses) and COINSTAC ([23], for distributed\ntraining of models). Both platforms support combining separately trained models by averaging the\nlearned parameters. This works for convex methods (e.g. linear regression), but does not generally\nwork for non-convex methods (e.g. deep neural networks, DNNs). The authors of [23] also discussed\nsynchronous stochastic gradient descent training with server-client communication; this assumes\nthat all of the training data is simultaneously available. Also, for large models such as DNNs, the\nbandwidth required could be problematic, given the need to transmit gradients at every update.\nLearning from non-centralized datasets using DNNs is often viewed as continual learning, a sequential\nprocess where a given predictive model is updated to perform well on new datasets, while retaining\nthe ability to predict on those previously used for training [32, 4, 21, 16, 18]. Continual learning is\nparticularly applicable to problems with shifting input distributions, where the data collected in the\npast may not represent data collected now or in the future. This is true for neuroimaging, since the\nstatistics of MRIs may change due to scanner upgrades, new reconstruction algorithms, different\nsequences, etc. 
The scenario we envisage is a more complex situation where multiple continual\nlearning processes may take place non-sequentially. For instance, a given organization produces a\nstarting DNN, which different, independent sites will then use with their own data. The sites will\nthen contribute back updated DNNs, which the organization will use to improve the main DNN being\nshared, with the goal of continuing the sharing and consolidation cycle.\nOur application is segmentation of structural magnetic resonance imaging (sMRI) volumes. These\nsegmentations are often generated using the Freesurfer package [8], a process that can take close to a\nday for each subject. The computational resources for doing this at a scale of hundreds to thousands\nof subjects are beyond the capabilities of most sites. We use deep neural networks to predict the\nFreesurfer segmentation directly from the structural volumes, as done previously by other groups\n[26, 27, 7, 6]. We train several of those networks – each using data from a different site – and\nthen consolidate their weights. We show that this results in a model with improved generalization\nperformance on test data from these sites, as well as a very large, completely independent multi-site\ndataset.\n\n2 Data and Methods\n\n2.1 Datasets\n\nWe use several sMRI datasets collected at different sites. We train networks using 956 sMRI volumes\ncollected by the Human Connectome Project (HCP) [30], 1,136 sMRI volumes collected by the\nNathan Kline Institute (NKI) [22], 183 sMRI volumes collected by the Buckner Laboratory [2], and\n120 sMRI volumes from the Washington University 120 (WU120) dataset [24]. 
In order to provide an\nindependent estimate of how well a given network generalizes to any new site, we also test networks\non a completely held-out dataset consisting of 893 sMRI volumes collected across several institutions\nby the ABIDE project [5].\n\n2.2 Architecture\n\nSeveral deep neural network architectures have been proposed for brain segmentation, such as U-net\n[26], QuickNAT [27], HighResNet [18] and MeshNet [7, 6]. We chose MeshNet because of its\nrelatively simple structure, its lower number of learned parameters, and its competitive performance.\nMeshNet uses dilated convolutional layers [31] due to the 3D structural nature of sMRI data. The\noutput of these discrete volumetric dilated convolutional layers can be expressed as:\n\n(w_f ∗_l h)_{i,j,k} = Σ_{ĩ=−a..a} Σ_{j̃=−b..b} Σ_{k̃=−c..c} w_{f,ĩ,j̃,k̃} h_{i−lĩ, j−lj̃, k−lk̃} = (w_f ∗_l h)_v = Σ_{t∈W_abc} w_{f,t} h_{v−lt}, (1)\n\nLayer | Filter | Pad | Dilation (l) | Function\n1 | 96×3³ | 1 | 1 | ReLU\n2 | 96×3³ | 1 | 1 | ReLU\n3 | 96×3³ | 1 | 1 | ReLU\n4 | 96×3³ | 2 | 2 | ReLU\n5 | 96×3³ | 4 | 4 | ReLU\n6 | 96×3³ | 8 | 8 | ReLU\n7 | 96×3³ | 1 | 1 | ReLU\n8 | 50×1³ | 0 | 1 | Softmax\n\nTable 1: The MeshNet-like dilated convolutional neural network architecture for brain segmentation.\n\nwhere h is the input to the layer, a, b, and c are the bounds for the i, j, and k axes of the filter with\nweights w_f, and (i, j, k) is the voxel, v, at which the convolution is computed. The set of indices for the\nelements of w_f can be defined as W_abc = {−a, ..., a}×{−b, ..., b}×{−c, ..., c}. The dilation factor\nl allows the convolution kernel to operate on every l-th voxel, since adjacent voxels are expected to\nbe highly correlated. 
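As a concrete illustration of Eq. 1, the following minimal NumPy sketch (our own illustration, not the authors' code; the function name `dilated_conv3d` is ours, and out-of-bounds taps are treated as zero, an assumption the equation itself leaves to the per-layer padding in Table 1) computes one dilated filter response:

```python
import numpy as np

def dilated_conv3d(h, w, l):
    """Naive volumetric dilated convolution of Eq. 1 for a single filter.

    h: input volume, shape (I, J, K)
    w: filter, shape (2a+1, 2b+1, 2c+1)
    l: dilation factor -- the kernel taps every l-th voxel
    """
    a, b, c = (s // 2 for s in w.shape)
    I, J, K = h.shape
    out = np.zeros_like(h, dtype=float)
    for i in range(I):
        for j in range(J):
            for k in range(K):
                acc = 0.0
                # triple sum over the filter index set W_abc
                for ti in range(-a, a + 1):
                    for tj in range(-b, b + 1):
                        for tk in range(-c, c + 1):
                            ii, jj, kk = i - l * ti, j - l * tj, k - l * tk
                            if 0 <= ii < I and 0 <= jj < J and 0 <= kk < K:
                                acc += w[ti + a, tj + b, tk + c] * h[ii, jj, kk]
                out[i, j, k] = acc
    return out
```

With an identity kernel (a single center tap), the output equals the input regardless of the dilation factor, which is a quick sanity check of the indexing.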
The dilation factor, number of filters, and other details of the MeshNet-like\narchitecture that we used for all experiments are shown in Table 1.\n\n2.3 Bayesian Inference in Neural Networks\n\n2.3.1 Maximum a Posteriori Estimate\n\nWhen training a neural network, the weights of the network, w, are learned by optimizing\nargmax_w p(D|w), where D = {(x1, y1), ..., (xN, yN)} and (xn, yn) is the nth input-output example, per maximum likelihood estimation (MLE). However, this often overfits, so we used a prior\non the network weights, p(w), to obtain a maximum a posteriori (MAP) estimate, by maximizing:\n\nL_MAP(w) = Σ_{n=1..N} log p(yn|xn, w) + log p(w). (2)\n\n2.3.2 Approximate Bayesian Inference\n\nIn Bayesian inference for neural networks, a distribution of possible weights is learned instead of just\na MAP point estimate. Using Bayes' rule, p(w|D) = p(D|w)p(w)/p(D), where p(w) is the prior\nover weights. However, directly computing the posterior, p(w|D), is often intractable, particularly\nfor DNNs. As a result, an approximate inference method must be used.\nOne of the most popular approximate inference methods for neural networks is variational inference,\nsince it scales well to large DNNs. In variational inference, the posterior distribution p(w|D) is\napproximated by a learned variational distribution of weights qθ(w), with learnable parameters θ.\nThis approximation is enforced by minimizing the Kullback-Leibler divergence (KL) between qθ(w)\nand the true posterior, p(w|D), KL[qθ(w)||p(w|D)]. 
This is equivalent to maximizing the variational\nlower bound [11, 10, 3, 14, 9, 20, 19], also known as the evidence lower bound (ELBO),\n\nL_ELBO(θ) = L_D(θ) − L_KL(θ), (3)\n\nwhere L_D(θ) is the expected log-likelihood of the training data,\n\nL_D(θ) = Σ_{n=1..N} E_{qθ(w)}[log p(yn|xn, w)], (4)\n\nand L_KL(θ) is the KL divergence between the variational distribution of weights and the prior,\n\nL_KL(θ) = KL[qθ(w)||p(w)]. (5)\n\nMaximizing L_D seeks to learn a qθ(w) that explains the training data, while minimizing L_KL (i.e.\nkeeping qθ(w) close to p(w)) prevents learning a qθ(w) that overfits to the training data.\n\n2.3.3 Stochastic Variational Bayes\n\nOptimizing Eq. 3 for deep neural networks is usually impractical, due to both: (1) being\na full-batch approach and (2) integrating over qθ(w). (1) is often dealt with by using stochastic\nmini-batch optimization [25] and (2) is often approximated using Monte Carlo sampling. [14] applied\nthese methods to variational inference in deep neural networks. They used the \"reparameterization\ntrick\" [15], which formulates qθ(w) as a deterministic differentiable function w = f(θ, ε) where\nε ∼ N(0, I), to calculate an unbiased estimate of ∇θ L_D for a mini-batch, {(x1, y1), ..., (xM, yM)},\nand one weight noise sample, εm, for each mini-batch example:\n\nL_ELBO(θ) ≈ L_D^SGVB(θ) − L_KL(θ), (6)\n\nwhere\n\nL_D(θ) ≈ L_D^SGVB(θ) = (N/M) Σ_{m=1..M} log p(ym|xm, f(θ, εm)). (7)\n\n2.3.4 Variational Continual Learning\n\nIn Bayesian neural networks, p(w) is often set to a multivariate Gaussian with diagonal covariance,\nN(0, σ²_prior I). (A variational distribution of the same form is called a fully factorized Gaussian\n(FFG).) However, instead of using a naïve prior, the parameters of a previously trained DNN can be\nused. 
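A minimal NumPy sketch of the two ingredients just described (the function names are ours, not the paper's code): the reparameterization trick w = f(θ, ε) = μ + σε, and the closed-form KL between two fully factorized Gaussians that plays the role of L_KL:

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_weights(mu, sigma):
    # Reparameterization trick [15]: w = f(theta, eps) = mu + sigma * eps,
    # eps ~ N(0, I), so the sample is differentiable in (mu, sigma).
    eps = rng.standard_normal(np.shape(mu))
    return mu + sigma * eps

def kl_ffg(mu_q, sig_q, mu_p, sig_p):
    # KL[q || p] for two fully factorized Gaussians, summed over weights:
    # sum of log(sig_p/sig_q) + (sig_q^2 + (mu_q - mu_p)^2) / (2 sig_p^2) - 1/2
    return float(np.sum(np.log(sig_p / sig_q)
                        + (sig_q**2 + (mu_q - mu_p)**2) / (2 * sig_p**2)
                        - 0.5))
```

When q equals p the KL term vanishes, and it grows as the variational means drift from the prior means, which is exactly the regularization behavior described for L_KL above.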
Several methods, such as elastic weight consolidation [16] and synaptic intelligence [32], have\nexplored this approach. Recently, these methods have been reinterpreted from a Bayesian perspective\n[21, 17]. In variational continual learning (VCL) [21] and Bayesian incremental learning [17], the\nDNNs trained on previously obtained data, D1-D(T−1), are used to regularize the training of a new\nneural network trained on DT per:\n\np(w|D1:T) = p(D1:T|w)p(w) / p(D1:T) = p(D1:T−1|w)p(DT|w)p(w) / (p(D1:T−1)p(DT)) = p(w|D1:T−1)p(DT|w) / p(DT), (8)\n\nwhere p(w|D1:T−1) is the network resulting from training on a sequence of datasets D1-D(T−1).\nFor DNNs, computing p(w|D1:T) directly can be intractable, so variational inference is iteratively\nused to learn an approximation, q^τ_θ(w), by minimizing KL[q^τ_θ(w)||p(w|D1:τ)] for each sequential\ndataset Dτ, with τ ranging over integers from 1 to T.\nThe sequential nature of this approach is a limitation in our setting. In many cases it is not feasible\nfor one site to wait for another site to complete training, which can take days, in order to begin their\nown training.\n\n2.4 Distributed Weight Consolidation\n\nThe main motivation of our method – distributed weight consolidation (DWC) – is to make it possible\nto train neural networks on different, distributed datasets, independently, and consolidate their weights\ninto a single network.\n\n2.4.1 Bayesian Continual Learning for Distributed Data\n\nIn DWC, we seek to consolidate several distributed DNNs trained on S separate, distributed datasets,\nDT = {D^1_T, ..., D^S_T}, so that the resulting DNN can then be used to inform the training of a DNN on\nDT+1. The training on each dataset starts from an existing network p(w|D1:T−1).\nAssuming that the S datasets are independent allows Eq. 
8 to be rewritten as:\n\np(w|D1:T) = p(w|D1:T−1) ∏_{s=1..S} p(D^s_T|w) / ∏_{s=1..S} p(D^s_T). (9)\n\nHowever, training one of the S networks using VCL produces an approximation for p(w|D1:T−1, D^s_T).\nEq. 9 can be written in terms of the learned distributions, since p(w|D1:T−1, D^s_T) = p(w|D1:T−1)p(D^s_T|w)/p(D^s_T) per Eq. 8:\n\np(w|D1:T) = [1 / p(w|D1:T−1)^(S−1)] ∏_{s=1..S} p(w|D1:T−1, D^s_T). (10)\n\np(w|D1:T−1) and each p(w|D1:T−1, D^s_T) can be learned and then used to compute p(w|D1:T). This\ndistribution can then be used to learn p(w|D1:T+1) per Eq. 8.\n\n2.4.2 Variational Approximation\n\nIn DNNs, however, directly calculating these probability distributions can be intractable, so variational inference is used to learn an approximation, q^{T,s}_θ(w), for p(w|D1:T−1, D^s_T) by minimizing\nKL[q^{T,s}_θ(w)||p(w|D1:T−1, D^s_T)]. This results in approximating Eq. 10 using:\n\np(w|D1:T) ≈ [1 / q^{T−1}_θ(w)^(S−1)] ∏_{s=1..S} q^{T,s}_θ(w). (11)\n\n2.4.3 Dilated Convolutions with Fully Factorized Gaussian Filters\n\nAlthough more complicated variational families have recently been explored in DNNs, the relatively\nsimple FFG variational distribution can do as well as, or better than, more complex methods for\ncontinual learning [17]. In this paper, we use dilated convolutions with FFG filters. This assumes\nthat each of the F filters is independent (i.e. p(w) = ∏_{f=1..F} p(w_f)), that each weight within a filter is\nalso independent (i.e. p(w_f) = ∏_{t∈W_abc} p(w_{f,t})), and that each weight is Gaussian (i.e. w_{f,t} ∼\nN(μ_{f,t}, σ²_{f,t})) with learnable parameters μ_{f,t} and σ_{f,t}. 
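For fully factorized Gaussians, the product-over-ratio form of Eq. 11 has a closed form (derived below as Eq. 17): the precisions and precision-weighted means of the S site posteriors add, and S−1 copies of the shared prior's are subtracted. A hedged NumPy sketch (our own function name and argument layout, not the authors' implementation):

```python
import numpy as np

def consolidate_ffg(mu0, var0, site_mus, site_vars):
    """Consolidate S site posteriors N(mu_s, var_s), all trained from the
    shared prior/previous network N(mu0, var0), into one Gaussian.
    Arrays broadcast elementwise over all network weights."""
    S = len(site_mus)
    # consolidated precision: sum of site precisions minus (S-1) prior precisions
    prec = sum(1.0 / v for v in site_vars) - (S - 1) / var0
    assert np.all(prec > 0), "needs sum of site precisions > (S-1)/var0"
    # consolidated mean: precision-weighted means, minus the repeated prior term
    mu = (sum(m / v for m, v in zip(site_mus, site_vars))
          - (S - 1) * mu0 / var0) / prec
    return mu, 1.0 / prec
```

With S = 1 the prior term drops out and the single site posterior is returned unchanged, matching the intuition that there is nothing to consolidate.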
However, as discussed in [14, 20], randomly\nsampling each weight for each mini-batch example can be computationally expensive, so the fact\nthat the sum of independent Gaussian variables is also Gaussian is used to move the noise from the\nweights to the convolution operation. For dilated convolutions, this is described by\n\n(w_f ∗_l h)_v ∼ N(μ*_{f,v}, (σ*_{f,v})²), (12)\n\nwhere\n\nμ*_{f,v} = Σ_{t∈W_abc} μ_{f,t} h_{v−lt} (13)\n\nand\n\n(σ*_{f,v})² = Σ_{t∈W_abc} σ²_{f,t} h²_{v−lt}. (14)\n\nEq. 12 can be rewritten using the Gaussian \"reparameterization trick\":\n\n(w_f ∗_l h)_v = μ*_{f,v} + σ*_{f,v} ε_{f,v}, where ε_{f,v} ∼ N(0, 1). (15)\n\n2.4.4 Consolidating an Ensemble of Fully Factorized Gaussian Networks\n\nEq. 11 can be used to consolidate an ensemble of distributed networks in order to allow for\ntraining on new datasets. Eq. 11 can be directly calculated if q^{T−1}_θ(w_{f,t}) = N(μ^0_{f,t}, (σ^0_{f,t})²)\nand q^{T,s}_θ(w_{f,t}) = N(μ^s_{f,t}, (σ^s_{f,t})²) are known, resulting in p(w|D1:T) also being an FFG per\n\np(w_{f,t}|D1:T) ∝ e^{(S−1)(w_{f,t}−μ^0_{f,t})²/(2(σ^0_{f,t})²)} ∏_{s=1..S} e^{−(w_{f,t}−μ^s_{f,t})²/(2(σ^s_{f,t})²)} (16)\n\nand\n\np(w_{f,t}|D1:T) ≈ N( [Σ_{s=1..S} μ^s_{f,t}/(σ^s_{f,t})² − (S−1)μ^0_{f,t}/(σ^0_{f,t})²] / [Σ_{s=1..S} 1/(σ^s_{f,t})² − (S−1)/(σ^0_{f,t})²], 1 / [Σ_{s=1..S} 1/(σ^s_{f,t})² − (S−1)/(σ^0_{f,t})²] ). (17)\n\nEq. 17 follows from Eq. 16 by completing the square inside the exponent and matching the parameters\nto the multivariate Gaussian density; it is defined when Σ_{s=1..S} 1/(σ^s_{f,t})² − (S−1)/(σ^0_{f,t})² > 0. To ensure\nthis, we constrained (σ^0_{f,t})² ≥ (σ^s_{f,t})². This should be the case if the loss is optimized, since L_D\nshould pull (σ^s_{f,t})² to 0 and L_KL pulls (σ^s_{f,t})² towards (σ^0_{f,t})². p(w_{f,t}|D1:T) can then be used as a\nprior for training another variational DNN.\n\n3 Experiments\n\n3.1 Experimental Setup\n\n3.1.1 Data Preprocessing\n\nThe only pre-processing that we performed was conforming the input sMRIs with Freesurfer's\nmri_convert, which resized all of the volumes used to 256x256x256 1 mm voxels. We computed\n50-class Freesurfer [8] segmentations, as in [6], for all subjects in each of the datasets described\nearlier. These were used as the labels for prediction. A 90-10 training-test split was used for the HCP,\nNKI, Buckner, and WU120 datasets. During training and testing, input volumes were individually\nz-scored across voxels. We split each input volume into 512 non-overlapping 32x32x32 sub-volumes,\nas in [7, 6].\n\n3.1.2 Training Procedure\n\nAll networks were trained with Adam [13] and an initial learning rate of 0.001. The MAP networks\nwere trained until convergence. The subsequent networks were trained until the training loss started\nto oscillate around a stable loss value. These networks trained much faster than the MAP networks,\nsince they were initialized with previously trained networks. Specifically, we found that using VCL\n\n[Figure 1 panels: DWC errors, DWC, Ground Truth, MAP, MAP errors]\n\nFigure 1: The axial and sagittal segmentations produced by DWC and the HNBW_MAP baseline on\nan HCP subject. The subject was selected by matching the subject-specific Dice with the average Dice\nacross HCP. 
Segmentation errors for all classes are shown in red in the respective plot.\n\n[Figure 2 panels: DWC errors, DWC, Ground Truth, MAP, MAP errors]\n\nFigure 2: The axial and sagittal segmentations produced by DWC and the HNBW_MAP baseline on\nan NKI subject. The subject was selected by matching the subject-specific Dice with the average Dice\nacross NKI. Segmentation errors for all classes are shown in red in the respective plot.\n\n[Figure 3 panels: DWC errors, DWC, Ground Truth, MAP, MAP errors]\n\nFigure 3: The axial and sagittal segmentations produced by DWC and the HNBW_MAP baseline on\na Buckner subject. The subject was selected by matching the subject-specific Dice with the average\nDice across Buckner. Segmentation errors for all classes are shown in red in the respective plot.\n\n[Figure 4 panels: DWC errors, DWC, Ground Truth, MAP, MAP errors]\n\nFigure 4: The axial and sagittal segmentations produced by DWC and the HNBW_MAP baseline on\na WU120 subject. The subject was selected by matching the subject-specific Dice with the average\nDice across WU120. Segmentation errors for all classes are shown in red in the respective plot.\n\nled to ∼3x, ∼2x, and ∼4x convergence speedups for HCP to NKI, HCP to Buckner and HCP to\nWU120, respectively. The batch size was set to 10. Weight normalization [28] was used for the\nweight means for all networks and the weight standard deviations were initialized to 0.001 as in [19]\nfor the variational network trained on HCP. 
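Weight normalization, as used above for the weight means, reparameterizes each filter as w = g·v/||v||, decoupling its learned norm from its learned direction; a minimal sketch (the function name is ours, not the paper's code):

```python
import numpy as np

def weight_norm(v, g):
    # Weight normalization [28]: w = g * v / ||v||,
    # where g is a learned scalar norm and v a learned direction vector/filter.
    return g * v / np.linalg.norm(v)
```

Whatever direction v takes during optimization, the resulting weight vector always has norm exactly g, which is what makes the parameterization well conditioned.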
For MAP networks and the variational network trained\non HCP, p(w) = N(0, 1).\n\n3.1.3 Performance Metric\n\nTo measure the quality of the produced segmentations, we calculated the Dice coefficient, which is\ndefined by\n\nDice_c = 2|ŷ_c · y_c| / (||ŷ_c||² + ||y_c||²) = 2TP_c / (2TP_c + FN_c + FP_c), (18)\n\nwhere ŷ_c is the binary segmentation for class c produced by a network, y_c is the ground truth\nproduced by Freesurfer, TP_c is the number of true positives for class c, FN_c is the number of false negatives for\nclass c, and FP_c is the number of false positives for class c. We calculate the Dice coefficient for each class\nc and average across classes to compute the overall performance of a network.\n\n3.1.4 Baselines\n\nWe trained MAP networks on the HCP (H_MAP), NKI (N_MAP), Buckner (B_MAP) and WU120\n(W_MAP) datasets. We averaged the output probabilities of the H_MAP, N_MAP, B_MAP, and W_MAP\nnetworks to create an ensemble baseline. We also trained a MAP model on the aggregated HCP, NKI,\nBuckner, and WU120 training data (HNBW_MAP) to estimate the performance ceiling of having the\ntraining data from all sites available together.\n\n3.1.5 Variational Continual Learning\n\nWe trained an initial FFG variational network on HCP (H) using H_MAP to initialize the network. We\nthen used VCL with HCP as the prior for distributed training of the FFG variational networks\non the NKI (H → N), Buckner (H → B) and WU120 (H → W) datasets. Additionally, we trained\nnetworks using VCL to transfer from HCP to NKI to Buckner to WU120 (H → N → B → W)\nand from HCP to WU120 to Buckner to NKI (H → W → B → N). 
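The per-class Dice of Eq. 18 above is straightforward to compute from binary masks; a small sketch (our own helper functions, assuming NumPy arrays, not the authors' evaluation code):

```python
import numpy as np

def dice(pred, truth):
    """Dice coefficient of Eq. 18 for one class, from binary masks:
    2*TP / (2*TP + FN + FP)."""
    pred = np.asarray(pred, dtype=bool)
    truth = np.asarray(truth, dtype=bool)
    tp = np.sum(pred & truth)   # voxels labeled c by both network and Freesurfer
    fp = np.sum(pred & ~truth)  # voxels the network labeled c incorrectly
    fn = np.sum(~pred & truth)  # voxels labeled c by Freesurfer but missed
    return 2 * tp / (2 * tp + fn + fp)

def mean_dice(pred_labels, truth_labels, n_classes):
    # Average Dice across classes: the overall per-network metric of Table 2.
    return float(np.mean([dice(pred_labels == c, truth_labels == c)
                          for c in range(n_classes)]))
```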
These options test training on\nNKI, Buckner, and WU120 in decreasing and increasing order of dataset size, since dataset order\nmay matter and may be difficult to control in practice.\n\n3.1.6 Distributed Weight Consolidation\n\nFor DWC, our goal was to take distributed networks trained using VCL with an initial network\nas a prior, consolidate them per Eq. 17, and then use this consolidated model as a prior for fine-tuning on the original dataset. We used DWC to consolidate H → N, H → B, and H → W into\nH → N + B + W per Eq. 17. VCL [21] performance was found to be improved by using coresets\n[1, 12], a small sample of data from the different training sets. However, if examples cannot be\ncollected from the different datasets, as may be the case when examples from the separate datasets\ncannot be shared, coresets are not applicable. For this reason, we used H → N + B + W as a prior\nfor fine-tuning (FT) by training the network on H (H → N + B + W → H) and giving L_D the\nweight of one example volume.\n\n3.2 Experimental Results\n\nIn Table 2 we show the average Dice scores across classes and sMRI volumes for the differently\ntrained networks. The weighted average Dice scores were computed across H, N, B, and W by\nweighting each of the Dice scores according to the number of volumes in each test set. For the\nvariational networks, 10 MC samples were used during test time to approximate the expected network\noutput. The weighted average Dice scores of DWC were better than the scores of the ensemble, the\nbaseline method for combining models across sites (p = 1.66e-15, per a two-tailed paired t-test\nacross volumes). 
The ABIDE Dice scores of DWC were not significantly different from the scores of\nthe ensemble (p = 0.733, per a two-tailed paired t-test across volumes), showing that DWC does not\nreduce generalization performance for a very large and completely independent multi-site dataset.\n\nNetwork | H | N | B | W | Avg. | A\nH_MAP | 82.25 | 65.88 | 67.94 | 70.88 | 72.92 | 55.25\nN_MAP | 71.20 | 72.19 | 70.73 | 73.06 | 71.66 | 66.67\nB_MAP | 65.69 | 50.17 | 82.02 | 68.87 | 59.25 | 50.23\nW_MAP | 70.18 | 66.27 | 72.20 | 76.38 | 68.76 | 62.83\nH → N | 75.40 | 73.24 | 71.77 | 73.17 | 74.03 | 64.62\nH → B | 73.85 | 56.79 | 79.49 | 68.53 | 65.78 | 49.27\nH → W | 77.07 | 67.63 | 76.15 | 77.26 | 72.51 | 62.31\nH → N + B + W (DWC w/o FT) | 77.42 | 71.46 | 79.70 | 79.82 | 74.86 | 63.30\nH → N + B + W → H (DWC) | 78.04 | 78.15 | 75.79 | 79.50 | 77.99 | 70.79\nEnsemble | 78.28 | 73.52 | 78.02 | 77.37 | 75.95 | 65.56\nH → N → B → W | 79.13 | 72.32 | 80.02 | 78.84 | 75.94 | 66.27\nH → W → B → N | 80.34 | 73.64 | 77.46 | 78.10 | 76.82 | 66.21\nHNBW_MAP | 81.38 | 77.99 | 80.64 | 79.54 | 79.62 | 70.76\n\nTable 2: The average Dice scores across test volumes for the trained networks on HCP (H), NKI (N),\nBuckner (B), and WU120 (W), along with the weighted average Dice scores across H, N, B, and W\nand the average Dice scores across volumes on the independent ABIDE (A) dataset.\n\nTraining on different datasets sequentially using VCL was very sensitive to dataset order, as seen by\nthe difference in Dice scores when training on NKI, Buckner, and WU120 in order of decreasing\nand increasing dataset size (H → N → B → W and H → W → B → N, respectively). The\nperformance of DWC was within the range of VCL performance. 
The weighted average and ABIDE\nDice scores of DWC were better than the H → N → B → W Dice scores, but not better than the\nH → W → B → N Dice scores.\nIn Figures 1, 2, 3, and 4, we show selected example segmentations for DWC and HNBW_MAP,\nfor volumes that have Dice scores similar to the average Dice score across the respective dataset.\nVisually, the DWC segmentation appears very similar to the ground truth. The errors made appear to\noccur mainly at region boundaries. Additionally, the DWC errors appear to be similar to the errors\nmade by HNBW_MAP.\n\n4 Discussion\n\nThere are many problems for which accumulating data into one accessible dataset for training can\nbe difficult or impossible, such as for clinical data. It may, however, be feasible to share models\nderived from such data. A method often proposed for dealing with these independent datasets is\ncontinual learning, which trains on each of these datasets sequentially [4]. Several recent continual\nlearning methods use previously trained networks as priors for networks trained on the next dataset\n[32, 21, 16], albeit with the requirement that training happens sequentially. We developed DWC\nby modifying these methods to allow for training networks on several new datasets in a distributed\nway. Using DWC, we consolidated the weights of the distributed neural networks to perform brain\nsegmentation on data from different sites. The resulting weight distributions can then be used as a\nprior distribution for further training, either for the original site or for novel sites. Compared to an\nensemble made from models trained on different sites, DWC increased performance on the held-out\ntest sets from the sites used in training and led to similar ABIDE performance. This demonstrates the\nfeasibility of DWC for combining the knowledge learned by networks trained on different datasets,\nwithout either training on the sites sequentially or ensembling many trained models. 
One important\ndirection for future research is scaling DWC up to allow for consolidating many more separate,\ndistributed networks and repeating this training and consolidation cycle several times. Another area\nof research is to investigate the use of alternative families of variational distributions within the\nframework of DWC. Our method has the potential to be applied to many other applications where\nit is necessary to train specialized networks for speci\ufb01c sites, informed by data from other sites,\nand where constraints on data sharing necessitate a distributed learning approach, such as disease\ndiagnosis with clinical data.\n\n9\n\n\fAcknowledgments\n\nThis work was supported by the National Institute of Mental Health Intramural Research Program\n(ZIC-MH002968, ZIC-MH002960). JK\u2019s and SG\u2019s work was supported by NIH R01 EB020740.\n\nReferences\n[1] Olivier Bachem, Mario Lucic, and Andreas Krause. Coresets for nonparametric estimation-the case of\n\nDP-means. In International Conference on Machine Learning, pages 209\u2013217, 2015.\n\n[2] Bharat B Biswal, Maarten Mennes, Xi-Nian Zuo, Suril Gohel, Clare Kelly, Steve M Smith, Christian F\nBeckmann, Jonathan S Adelstein, Randy L Buckner, Stan Colcombe, et al. Toward discovery science of\nhuman brain function. Proceedings of the National Academy of Sciences, 107(10):4734\u20134739, 2010.\n\n[3] Charles Blundell, Julien Cornebise, Koray Kavukcuoglu, and Daan Wierstra. Weight uncertainty in neural\n\nnetworks. In International Conference on Machine Learning, pages 1613\u20131622, 2015.\n\n[4] Ken Chang, Niranjan Balachandar, Carson Lam, Darvin Yi, James Brown, Andrew Beers, Bruce Rosen,\nDaniel L Rubin, and Jayashree Kalpathy-Cramer. Distributed deep learning networks among institutions\nfor medical imaging. 
Journal of the American Medical Informatics Association.\n\n[5] Adriana Di Martino, Chao-Gan Yan, Qingyang Li, Erin Denio, Francisco X Castellanos, Kaat Alaerts,\nJeffrey S Anderson, Michal Assaf, Susan Y Bookheimer, Mirella Dapretto, et al. The autism brain imaging\ndata exchange: Towards a large-scale evaluation of the intrinsic brain architecture in autism. Molecular\nPsychiatry, 19(6):659, 2014.\n\n[6] Alex Fedorov, Eswar Damaraju, Vince Calhoun, and Sergey Plis. Almost instant brain atlas segmentation\nfor large-scale studies. arXiv preprint arXiv:1711.00457, 2017.\n\n[7] Alex Fedorov, Jeremy Johnson, Eswar Damaraju, Alexei Ozerin, Vince Calhoun, and Sergey Plis. End-to-end learning of brain tissue segmentation from imperfect labeling. In International Joint Conference on\nNeural Networks, pages 3785–3792. IEEE, 2017.\n\n[8] Bruce Fischl. Freesurfer. Neuroimage, 62(2):774–781, 2012.\n\n[9] Yarin Gal and Zoubin Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty\nin deep learning. In International Conference on Machine Learning, pages 1050–1059, 2016.\n\n[10] Alex Graves. Practical variational inference for neural networks. In Advances in Neural Information\nProcessing Systems, pages 2348–2356, 2011.\n\n[11] Geoffrey E Hinton and Drew Van Camp. Keeping the neural networks simple by minimizing the description\nlength of the weights. In Proceedings of the sixth annual conference on Computational learning theory,\npages 5–13. ACM, 1993.\n\n[12] Jonathan Huggins, Trevor Campbell, and Tamara Broderick. Coresets for scalable Bayesian logistic\nregression. In Advances in Neural Information Processing Systems, pages 4080–4088, 2016.\n\n[13] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International\nConference on Learning Representations, 2015.\n\n[14] Diederik P Kingma, Tim Salimans, and Max Welling. Variational dropout and the local reparameterization\ntrick. 
In Advances in Neural Information Processing Systems, pages 2575–2583, 2015.

[15] Diederik P Kingma and Max Welling. Auto-encoding variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

[16] James Kirkpatrick, Razvan Pascanu, Neil Rabinowitz, Joel Veness, Guillaume Desjardins, Andrei A Rusu, Kieran Milan, John Quan, Tiago Ramalho, Agnieszka Grabska-Barwinska, et al. Overcoming catastrophic forgetting in neural networks. Proceedings of the National Academy of Sciences, 114(13):3521–3526, 2017.

[17] Max Kochurov, Timur Garipov, Dmitry Podoprikhin, Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Bayesian incremental learning for deep neural networks. ICLR Workshop, 2018.

[18] Wenqi Li, Guotai Wang, Lucas Fidon, Sebastien Ourselin, M Jorge Cardoso, and Tom Vercauteren. On the compactness, efficiency, and representation of 3D convolutional networks: Brain parcellation as a pretext task. In International Conference on Information Processing in Medical Imaging, pages 348–360. Springer, 2017.

[19] Christos Louizos and Max Welling. Multiplicative normalizing flows for variational Bayesian neural networks. In International Conference on Machine Learning, pages 2218–2227, 2017.

[20] Dmitry Molchanov, Arsenii Ashukha, and Dmitry Vetrov. Variational dropout sparsifies deep neural networks. In International Conference on Machine Learning, pages 2498–2507, 2017.

[21] Cuong V Nguyen, Yingzhen Li, Thang D Bui, and Richard E Turner. Variational continual learning. In International Conference on Learning Representations, 2018.

[22] Kate Brody Nooner, Stanley Colcombe, Russell Tobe, Maarten Mennes, Melissa Benedict, Alexis Moreno, Laura Panek, Shaquanna Brown, Stephen Zavitz, Qingyang Li, et al. The NKI-Rockland sample: A model for accelerating the pace of discovery science in psychiatry.
Frontiers in Neuroscience, 6:152, 2012.

[23] Sergey M Plis, Anand D Sarwate, Dylan Wood, Christopher Dieringer, Drew Landis, Cory Reed, Sandeep R Panta, Jessica A Turner, Jody M Shoemaker, Kim W Carter, et al. COINSTAC: A privacy enabled model and prototype for leveraging and processing decentralized brain imaging data. Frontiers in Neuroscience, 10:365, 2016.

[24] Jonathan D Power, Mark Plitt, Prantik Kundu, Peter A Bandettini, and Alex Martin. Temporal interpolation alters motion in fMRI scans: Magnitudes and consequences for artifact detection. PLoS ONE, 12(9):e0182939, 2017.

[25] Herbert Robbins and Sutton Monro. A stochastic approximation method. The Annals of Mathematical Statistics, pages 400–407, 1951.

[26] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.

[27] Abhijit Guha Roy, Sailesh Conjeti, Nassir Navab, and Christian Wachinger. QuickNAT: Segmenting MRI neuroanatomy in 20 seconds. arXiv preprint arXiv:1801.04161, 2018.

[28] Tim Salimans and Diederik P Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In Advances in Neural Information Processing Systems, pages 901–909, 2016.

[29] Paul M Thompson, Jason L Stein, Sarah E Medland, Derrek P Hibar, Alejandro Arias Vasquez, Miguel E Renteria, Roberto Toro, Neda Jahanshad, Gunter Schumann, Barbara Franke, et al. The ENIGMA consortium: Large-scale collaborative analyses of neuroimaging and genetic data. Brain Imaging and Behavior, 8(2):153–182, 2014.

[30] David C Van Essen, Stephen M Smith, Deanna M Barch, Timothy EJ Behrens, Essa Yacoub, Kamil Ugurbil, Wu-Minn HCP Consortium, et al. The WU-Minn human connectome project: An overview. NeuroImage, 80:62–79, 2013.

[31] Fisher Yu and Vladlen Koltun.
Multi-scale context aggregation by dilated convolutions. In International Conference on Learning Representations, 2015.

[32] Friedemann Zenke, Ben Poole, and Surya Ganguli. Continual learning through synaptic intelligence. In International Conference on Machine Learning, pages 3987–3995, 2017.