{"title": "Learning Neural Representations of Human Cognition across Many fMRI Studies", "book": "Advances in Neural Information Processing Systems", "page_first": 5883, "page_last": 5893, "abstract": "Cognitive neuroscience is enjoying a rapid increase in extensive public brain-imaging datasets. This opens the door to large-scale statistical models. Finding a unified perspective for all available data calls for scalable and automated solutions to an old challenge: how to aggregate heterogeneous information on brain function into a universal cognitive system that relates mental operations/cognitive processes/psychological tasks to brain networks? We cast this challenge in a machine-learning approach to predict conditions from statistical brain maps across different studies. For this, we leverage multi-task learning and multi-scale dimension reduction to learn low-dimensional representations of brain images that carry cognitive information and can be robustly associated with psychological stimuli. Our multi-dataset classification model achieves the best prediction performance on several large reference datasets, compared to models without cognitive-aware low-dimensional representations; it brings a substantial performance boost to the analysis of small datasets, and can be introspected to identify universal template cognitive concepts.", "full_text": "Learning Neural Representations of Human Cognition across Many fMRI Studies

Arthur Mensch∗, Inria, arthur.mensch@m4x.org
Julien Mairal†, Inria, julien.mairal@inria.fr
Bertrand Thirion∗, Inria, bertrand.thirion@inria.fr
Danilo Bzdok, Department of Psychiatry, RWTH, danilo.bzdok@rwth-aachen.de
Gaël Varoquaux∗, Inria, gael.varoquaux@inria.fr

Abstract

Cognitive neuroscience is enjoying a rapid increase in extensive public brain-imaging datasets. This opens the door to large-scale statistical models.
Finding a unified perspective for all available data calls for scalable and automated solutions to an old challenge: how to aggregate heterogeneous information on brain function into a universal cognitive system that relates mental operations/cognitive processes/psychological tasks to brain networks? We cast this challenge in a machine-learning approach to predict conditions from statistical brain maps across different studies. For this, we leverage multi-task learning and multi-scale dimension reduction to learn low-dimensional representations of brain images that carry cognitive information and can be robustly associated with psychological stimuli. Our multi-dataset classification model achieves the best prediction performance on several large reference datasets, compared to models without cognitive-aware low-dimensional representations; it brings a substantial performance boost to the analysis of small datasets, and can be introspected to identify universal template cognitive concepts.

Due to the advent of functional brain-imaging technologies, cognitive neuroscience is accumulating quantitative maps of neural activity responses to specific tasks or stimuli. A rapidly increasing number of neuroimaging studies are publicly shared (e.g., the human connectome project, HCP [1]), opening the door to applying large-scale statistical approaches [2]. Yet, it remains a major challenge to formally extract structured knowledge from heterogeneous neuroscience repositories. As stressed in [3], aggregating knowledge across cognitive neuroscience experiments is intrinsically difficult due to the diverse nature of the hypotheses and conclusions of the investigators.
Cognitive neuroscience experiments aim at isolating brain effects underlying specific psychological processes: they yield statistical maps of brain activity that measure the neural responses to carefully designed stimuli. Unfortunately, neither regional brain responses nor experimental stimuli can be considered to be atomic: a given experimental stimulus recruits a spatially distributed set of brain regions [4], while each brain region is observed to react to diverse stimuli. Taking advantage of the resulting data richness to build formal models describing psychological processes requires describing each cognitive conclusion on a common basis for brain response and experimental study design. Uncovering atomic basis functions that capture the neural building blocks underlying cognitive processes is therefore a primary goal of neuroscience [5], for which we propose a new data-driven approach.

∗Inria, CEA, Université Paris-Saclay, 91191 Gif sur Yvette, France
†Univ. Grenoble Alpes, Inria, CNRS, Grenoble INP, LJK, 38000 Grenoble, France

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

Several statistical approaches have been proposed to tackle the problem of knowledge aggregation in functional imaging. A first set of approaches relies on coordinate-based meta-analysis to define robust neural correlates of cognitive processes: these are extracted from the descriptions of experiments — based on categories defined by text mining [6] or by experts [7] — and correlated with the brain coordinates related to these experiments. Although quantitative meta-analysis techniques provide useful summaries of the existing literature, they are hindered by label noise in the experiment descriptions and by weak information on brain activation, as the maps are reduced to a few coordinates [8].
A second, more recent set of approaches directly models brain maps across studies, either focusing on studies of similar cognitive processes [9], or tackling the entire scope of cognition [10, 11]. Decoding, i.e. predicting the cognitive process from brain activity, across many different studies touching different cognitive questions is a key goal for cognitive neuroimaging, as it provides a principled answer to reverse inference [12]. However, a major roadblock to scaling this approach is the necessity to label cognitive tasks across studies in a rich but consistent way, e.g., by building an ontology [13].

We follow a more automated approach and cast dataset accumulation as a multi-task learning problem: our model is trained to decode simultaneously different datasets, using a shared architecture. Machine-learning techniques can indeed learn universal representations of inputs that give good performance in multiple supervised problems [14, 15]. They have been successful, especially with the development of deep neural networks [see, e.g., 16], in sharing representations and transferring knowledge from one dataset prediction model to another (e.g., in computer vision [17] and audio processing [18]). A popular approach is to simultaneously learn to represent the inputs of the different datasets in a low-dimensional space and to predict the outputs from the low-dimensional representatives. Using very deep model architectures in functional MRI is currently thwarted by the signal-to-noise ratio of the available recordings and the relatively small size of datasets [19] compared to computer vision and text corpora. Yet, we show that multi-dataset representation learning is a fertile ground for identifying cognitive systems with predictive power for mental operations.

Contribution.
We introduce a new model architecture dedicated to multi-dataset classification, which performs two successive linear dimension reductions of the input statistical brain images and predicts psychological conditions from a learned low-dimensional representation of these images, linked to cognitive processes. In contrast to previous ontology-based approaches, imposing a structure across different cognitive experiments is not needed in our model: the representation of brain images is learned using the raw set of experimental conditions for each dataset. To our knowledge, this work is the first to propose knowledge aggregation and transfer learning between functional MRI studies with such a modest level of supervision. We demonstrate the performance of our model on several openly accessible and rich reference datasets in the brain-imaging domain. The different aspects of its architecture bring a substantial increase in out-of-sample accuracy compared to models that forgo learning a cognitive-aware low-dimensional representation of brain maps. Our model remains simple enough to be interpretable: it can be collapsed into a collection of classification maps, while the space of low-dimensional representatives can be explored to uncover a set of meaningful latent components.

1 Model: multi-dataset classification of brain statistical images

Our general goal is to extract and integrate biological knowledge across many brain-imaging studies within the same statistical learning framework. We first outline how analyzing large repositories of fMRI experiments can be cast as a classification problem. Here, success in capturing brain-behavior relationships is measured by out-of-sample prediction accuracy. The proposed model (Figure 1) solves a range of these classification problems in a single statistical estimation procedure and imposes a shared latent structure across the single-dataset classification parameters.
These shared model parameters may be viewed as a chain of two dimension reductions. The first reduction layer leverages knowledge about brain spatial regularities; it is learned from resting-state data and designed to capture neural activity patterns at different coarseness levels. The second reduction layer projects data on directions generally relevant for cognitive-state prediction. The combination of both reductions yields low-dimensional representatives that are less affected by noise and subject variance than the high-dimensional samples: classification is expected to have better out-of-sample prediction performance.

Figure 1: Model architecture: three-layer multi-dataset classification. The first layer (orange) is learned from data acquired outside of cognitive experiments and captures a spatially coherent signal at multiple scales; the second layer (blue) embeds these representations in a space common to all datasets, from which the conditions are predicted (pink) by multinomial models.

1.1 Problem setting: predicting conditions from brain activity in multiple studies

We first introduce our notation and terminology, and formalize a general prediction problem applicable to any task fMRI dataset. In a single fMRI study, each subject performs different experiments in the scanner. During such an experiment, the subjects are presented with a set of sensory stimuli (i.e., conditions) that aim at recruiting a target set of cognitive processes. We fit a first-level general linear model for every record to obtain z-score maps that quantify the importance of each condition in explaining each voxel. Formally, the n statistical maps (xi)i∈[n] of a given study form a sequence in Rp, where p is the number of voxels in the brain. Each such observation is labelled by a condition ci in [1, k] whose effect xi captures.
A single study typically features one or a few (if experiments are repeated) statistical maps per condition and per subject, and may present up to k = 30 conditions. Across the studies, the observed brain maps can be modeled as generated from an unknown joint distribution of brain activity and associated cognitive conditions ((xi, ci))i∈[n], where variability across trials and subjects acts as confounding noise. In this context, we wish to learn a decoding model that predicts the condition c from brain activity x measured on new subjects or new studies. Inspired by recent work [10, 20, 21], we frame the condition prediction problem as the estimation of a multinomial classification model. Our models estimate a probability vector of x being labeled by each condition in C. This vector is modeled as a function of (W, b) in Rp×k × Rk that takes the softmax form. For all j in [1, k], its j-th coordinate is defined as

p(x, W, b)j ≜ P[c = j | x, W, b] = exp(W(j)⊤x + bj) / ∑l∈C exp(W(l)⊤x + bl).   (1)

Fitting the model weights is done by minimizing the cross-entropy between (p(xi))i and the true labels ([ci = j]j∈[k])i with respect to (W, b), with or without imposing parameter regularization. In this model, an input image is classified between all conditions presented in the whole study. It is possible to restrict this classification to the set of conditions used in a given experiment — the empirical results of this study can be reproduced in this setting.

The challenge of model parameter estimation. A major inconvenience of the vanilla multinomial model lies in the ratio between the limited number of samples provided by a typical fMRI dataset and the overwhelming number of model weights to be estimated. Fitting the model amounts to estimating k discriminative brain maps, i.e.
millions of parameters (4M for the 23 conditions of HCP), whereas most brain-imaging studies yield less than a hundred observations and therefore only a few thousand samples. This makes it hard to reasonably approximate the population parameters for successful generalization, especially because the variance between subjects is high compared to the variance between conditions. The obstacle is usually tackled in one of two major ways in brain imaging: 1) we can impose sparsity or a-priori structure over the model weights; alternatively, 2) we can reduce the dimension of the input data by performing spatial clustering or univariate feature selection by ANOVA. However, we note that, on the one hand, regularization strategies frequently incur costly computational budgets if one wants to obtain interpretable weights [22], and they introduce artificial bias. On the other hand, existing techniques developed in fMRI analysis for dimension reduction can lead to distorted signal and accuracy losses [23]. Most importantly, previous statistical approaches are not tuned to identifying conditions from task fMRI data. We therefore propose to use a dimension reduction that is estimated from data and tuned to capture the common hidden aspects shared by statistical maps across studies — we aggregate several classification models that share parameters.

1.2 Learning shared representations across studies for decoding

We now consider several fMRI studies. (xi)i∈[n] is the union of all statistical maps from all datasets. We write D for the set of all studies, Cd for the set of the kd conditions of study d, k ≜ ∑d kd for the total number of conditions, and Sd for the subset of [n] that indexes the samples of study d. For each study d, we estimate the parameters (Wd, bd) of the classification problem defined above.
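To make the estimation problem concrete, the single-study multinomial model of eq. (1) can be sketched in a few lines of NumPy. This is a hypothetical toy illustration written for this text; the function names and shapes are ours, not those of the released code:

```python
import numpy as np

def softmax_proba(X, W, b):
    """Condition probabilities of eq. (1): row-wise softmax of X @ W + b.

    X: (n, p) statistical maps; W: (p, k) weights; b: (k,) intercepts.
    """
    logits = X @ W + b
    logits -= logits.max(axis=1, keepdims=True)  # for numerical stability
    exp = np.exp(logits)
    return exp / exp.sum(axis=1, keepdims=True)

def cross_entropy(X, c, W, b):
    """Cross-entropy between predicted probabilities and true conditions c."""
    proba = softmax_proba(X, W, b)
    return -np.log(proba[np.arange(len(c)), c]).mean()
```

Fitting (W, b) then amounts to minimizing `cross_entropy` over the study's samples with any gradient-based solver, with or without regularization.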
Adapting the multi-task learning framework of [14], we constrain the weights (Wd)d to share a common latent structure: namely, we fix a latent dimension l ≤ p and enforce that, for all datasets d,

Wd = WeW′d,   (2)

where the matrix We in Rp×l is shared across datasets, and the (W′d)d are dataset-specific classification matrices from an l-dimensional input space. Intuitively, We should be a "consensus" projection matrix that projects every sample xi from every dataset onto a lower-dimensional representation We⊤xi in Rl that is easy to label correctly.

The latent dimension l may be chosen larger than k. In this case, regularization is necessary to ensure that the factorization (2) is indeed useful, i.e., that the multi-dataset classification problem does not reduce to separate multinomial regressions on each dataset. To regularize our model, we apply Dropout [24] to the projected data representation. Namely, during successive training iterations, we set a random fraction r of the reduced data features to 0. This prevents the co-adaptation of the matrices We and (W′d)d and ensures that every direction of We is useful for classifying every dataset.
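The factorization (2), with Dropout applied to the latent representation, maps naturally onto a small PyTorch module: one shared linear embedding (We) followed by one read-out head (W′d, bd) per dataset. The sketch below is an illustration under our own naming and sizing assumptions, not the released implementation:

```python
import torch
import torch.nn as nn

class FactoredClassifier(nn.Module):
    """Shared latent embedding W_e plus per-dataset read-out heads (W'_d, b_d)."""

    def __init__(self, in_dim, latent_dim, conditions_per_dataset,
                 dropout_rate=0.75):
        super().__init__()
        # Shared projection W_e: maps inputs to the l-dimensional latent space.
        self.embedding = nn.Linear(in_dim, latent_dim, bias=False)
        # Dropout on the latent code prevents co-adaptation of W_e and (W'_d)_d.
        self.dropout = nn.Dropout(dropout_rate)
        # One multinomial head per dataset, each with its own condition count.
        self.heads = nn.ModuleList(
            [nn.Linear(latent_dim, k_d) for k_d in conditions_per_dataset])

    def forward(self, x, dataset):
        latent = self.dropout(self.embedding(x))
        return self.heads[dataset](latent)  # logits; softmax lives in the loss

# Toy sizes: two datasets with 23 and 30 conditions respectively.
model = FactoredClassifier(in_dim=128, latent_dim=100,
                           conditions_per_dataset=[23, 30])
logits = model(torch.randn(8, 128), dataset=0)
```

Training each mini-batch with `nn.CrossEntropyLoss` on the head matching its dataset recovers an objective of the form (3), the expectation over the Dropout masks being handled implicitly by `nn.Dropout`.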
Formally, Dropout amounts to sampling binary diagonal matrices M in Rl×l during training, with Bernoulli-distributed coefficients; for all datasets d, W′d is estimated through the task of classifying the Dropout-corrupted reduced data (MWe⊤xi)i∈Sd, M∼M.

In practice, the matrices We and (W′d)d are learned by jointly minimizing the following expected risk, where the objective is the sum of the single-study cross-entropies, averaged over the Dropout noise:

min{We, (W′d)d} ∑d∈D 1/|Sd| ∑i∈Sd ∑j∈Cd EM[ −δj=ci log pd[xi, WeMW′d, bd]j ].   (3)

Imposing a common structure on the classification matrices (Wd)d is natural, as the classes to be distinguished do share some common neural organization — brain maps have a correlated spatial structure, while the psychological conditions of the different datasets may trigger shared cognitive primitives underlying human cognition [21, 20]. With our design, we aim at learning a matrix We that captures these common aspects and thus benefits the generalization performance of all the classifiers. As We is estimated from data, brain maps from one study are enriched by the maps from all the other studies, even if the conditions to be classified are not shared among studies. In so doing, our modeling approach allows transfer learning among all the classification tasks.

Unfortunately, estimators provided by solving (3) may have limited generalization performance, as n remains relatively small (∼ 20,000) compared to the number of parameters. We address this issue by performing an initial dimension reduction that captures the spatial structure of brain maps.

1.3 Initial dimension reduction using localized rest-fMRI activity patterns

The projection expressed by We ignores the signal structure of statistical brain maps.
Acknowledging this structure in commonly acquired brain measurements should allow us to reduce the dimensionality of the data with little signal loss, with possibly the additional benefit of a denoising effect. Several recent studies [25] in the brain-imaging domain suggest using fMRI data acquired in experiment-free studies for such dimension reduction. For this reason, we introduce a first reduction of dimension that is estimated not from statistical maps, but from resting-state data. Formally, we enforce We = WgW′e, where g > l (g ∼ 300), Wg ∈ Rp×g and W′e ∈ Rg×l. Intuitively, the multiplication by the matrix Wg should summarize the spatial distribution of brain maps, while multiplying by W′e, which is estimated by solving (3), should find low-dimensional representations able to capture cognitive features. W′e is now of reasonable size (g × l ∼ 15,000): solving (3) should estimate parameters with better generalization performance. Defining an appropriate matrix Wg is the purpose of the next paragraphs.

Resting-state decomposition. The initial dimension reduction determines the relative contribution of statistical brain maps over what is commonly interpreted by neuroscience investigators as functional networks. We discover such macroscopical brain networks by performing a sparse matrix factorization over the massive resting-state dataset provided in the HCP900 release [1]: such a decomposition technique, described e.g. in [26, 27], efficiently provides (i.e., in the order of a few hours) a given number of sparse spatial maps that decompose the resting-state signal with good reconstruction performance. That is, it finds a sparse and positive matrix D in Rp×g and loadings A in Rg×m such that the m resting-state brain images Xrs in Rp×m are well approximated by DA.
D is thus a set of slightly overlapping networks — each voxel belongs to at most two networks. To maximally preserve Euclidean distance when performing the reduction, we perform an orthogonal projection, which amounts to setting Wg ≜ D(D⊤D)−1. Replacing in (3), we obtain the reduced expected-risk minimization problem, where the input dimension is now the number g of dictionary components:

min{W′e∈Rg×l, (W′d)d} ∑d∈D 1/|Sd| ∑i∈Sd ∑j∈Cd EM[ −δj=ci log pd[Wg⊤xi, W′eMW′d, bd]j ].   (4)

Multiscale projection. Selecting the "best" number of brain networks q is an ill-posed problem [28]: the size of functional networks that will prove relevant for condition classification is unknown to the investigator. To address this issue, we propose to reduce high-resolution data (xi)i in a multi-scale fashion: we initially extract 3 sparse spatial dictionaries (Dj)j∈[3] with 16, 64 and 512 components respectively. Then, we project statistical maps onto each of the dictionaries and concatenate the loadings, in a process analogous to projecting on an overcomplete dictionary in computer vision [e.g., 29]. This amounts to defining the matrix Wg as the concatenation

Wg ≜ [D1(D1⊤D1)−1  D2(D2⊤D2)−1  D3(D3⊤D3)−1] ∈ Rp×(16+64+512).   (5)

With this definition, the reduced data (Wg⊤xi)i carry information about the network activations at different scales. As such, it makes the classification maps learned by the model more regular than when using a single-scale dictionary, and indeed yields more interpretable classification maps. However, it brings only a small improvement in terms of predictive accuracy, compared to using a single dictionary of 512 components.
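The orthogonal multi-scale projection of eq. (5) is straightforward to express with NumPy. The snippet below is a small synthetic illustration: random positive dictionaries stand in for the sparse resting-state dictionaries, and the sizes are toy values rather than the paper's 16/64/512 components at full brain resolution:

```python
import numpy as np

def make_projector(dictionaries):
    """Concatenate the least-squares projectors D_j (D_j^T D_j)^-1 of eq. (5)."""
    return np.concatenate(
        [D @ np.linalg.inv(D.T @ D) for D in dictionaries], axis=1)

rng = np.random.default_rng(0)
p = 200  # toy voxel count
dictionaries = [np.abs(rng.normal(size=(p, g))) for g in (4, 8, 16)]
W_g = make_projector(dictionaries)        # shape (p, 4 + 8 + 16)
loadings = W_g.T @ rng.normal(size=p)     # multi-scale representation of a map
```

For a single dictionary D, Wg⊤x = (D⊤D)−1 D⊤x is exactly the least-squares loading of x on D, so projecting on the concatenation stacks the loadings of x at every scale.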
We further discuss the multi-scale decomposition in Appendix A.2.

1.4 Training with stochastic gradient descent

As illustrated in Figure 1, our model may be interpreted as a three-layer neural network with linear activations and several read-out heads, each corresponding to a specific dataset. The model can be trained using stochastic gradient descent, following a previously employed alternated training scheme [18]: we cycle through the datasets d ∈ D and select, at each iteration, a mini-batch of samples (xi)i∈B, where B ⊂ Sd has the same size for all datasets. We perform a gradient step — the weights W′d, bd and W′e are updated, while the others are left unchanged. The optimizer thus sees the same number of samples for each dataset, and the expected stochastic gradient is the gradient of (4), so that the empirical risk decreases in expectation and we find a critical point of (4) asymptotically. We use the Adam solver [30] as a flavor of stochastic gradient descent, as it allows faster convergence.

Computational cost. Training the model on the projected data (Wg⊤xi)i takes 10 minutes on a conventional single-CPU machine with an Intel Xeon at 3.21 GHz. The initial step of computing the dictionaries (D1, D2, D3) from all HCP900 resting-state records (4 TB of data) takes 5 hours using [27], while transforming the data from all the studies with the Wg projection takes around 1 hour. Adding a new dataset with 30 subjects to our model and performing the joint training takes no more than 20 minutes. This is much less than the cost of fitting a first-level GLM on this dataset (∼ 1 h per subject).

2 Experiments

We characterize the behavior and performance of our model on several large, publicly available brain-imaging datasets. First, to validate the relevance of all the elements of our model, we perform an ablation study.
It proves that the multi-scale spatial dimension reduction and the use of multi-dataset classification substantially improve classification performance, and suggests that the proposed model captures an interesting new latent structure of brain images. We further illustrate the effect of transfer learning by systematically varying the number of subjects in a single dataset: we show how multi-dataset learning helps mitigate the decrease in accuracy due to smaller train size — a result of much use for analyzing cognitive experiments on small cohorts. Finally, we illustrate the interpretability of our model and show how the latent "cognitive space" can be explored to uncover template brain maps associated with related conditions in different datasets.

2.1 Datasets and tools

Datasets. Our experimental study features 5 publicly available task fMRI studies. We use all resting-state records from the HCP900 release [1] to compute the sparse dictionaries that are used in the first dimension reduction materialized by Wg. We succinctly describe the conditions of each dataset — we refer the reader to the original publications for further details.

• HCP: gambling, working memory, motor, language, social and relational tasks. 800 subjects.
• Archi [31]: localizer protocol, motor, social and relational tasks. 79 subjects.
• Brainomics [32]: localizer protocol. 98 subjects.
• Camcan [33]: audio-video task, with frequency variation. 606 subjects.
• LA5c consortium [34]: task-switching, balloon analog risk taking, stop-signal and spatial working memory capacity tasks — high-level tasks. 200 subjects.

The last four datasets are target datasets, on which we measure out-of-sample prediction performance. The larger HCP dataset serves as a knowledge-transferring dataset, which should boost this performance when considered in the multi-dataset model.
We register the task time-series in the reference MNI space before fitting a general linear model (GLM) and computing the maps (standardized by z-scoring) associated with each base condition — no manual design of contrasts is involved. More details on the pipeline used for z-map extraction are provided in Appendix A.1.

Tools. We use pytorch1 to define and train the proposed models, nilearn [35] to handle brain datasets, along with scikit-learn [36] to design the experimental pipelines. Sparse brain decompositions were computed from the whole HCP900 resting-state data. The code for reproducing the experiments is available at http://github.com/arthurmensch/cogspaces. Our model involves a few non-critical hyperparameters: we use batches of size 256, set the latent dimension l = 100 and use a Dropout rate r = 0.75 in the latent cognitive space — this value performs slightly better than r = 0.5. We use a multi-scale dictionary with 16, 64 and 512 components, as it yields the best quantitative and qualitative results.2 Finally, test accuracy is measured on half of the subjects of each dataset, which are removed from the training sets beforehand. Benchmarks are repeated 20 times with random split folds to estimate the variance in performance.

2.2 Dimension reduction and transfer improve test accuracy

For the four benchmark studies, the proposed model brings between +1.3% and +13.4% extra test accuracy compared to simple multinomial classification. To further quantify which aspects of the model improve performance, we perform an ablation study: we measure the prediction accuracy of six models, from the simplest to the most complete model described in Section 1. The first three experiments study the effect of the initial dimension reduction and of regularization.3
The last three experiments measure the performance of the proposed factored model and the effect of multi-dataset classification.

1 http://pytorch.org/
2 Note that using only the 512-component dictionary yields comparable predictive accuracy. Quantitatively, the multi-scale approach is beneficial when using dictionaries with fewer components (e.g., 16, 64, 128) — see Appendix A.2 for a quantitative validation of the multi-scale approach.
3 For these models, the ℓ2 and Dropout regularization parameters are estimated by nested cross-validation.

Figure 2: Ablation results. Each dimension reduction of the model has a relevant contribution. Dropout regularization is very effective when applied to the cognitive latent space. Learning this latent space allows transferring knowledge between datasets.

Figure 3: Learning curves in the single-dataset and multi-dataset settings. Estimating the latent cognitive space from multiple datasets is very useful for studying small cohorts.

1. Baseline ℓ2-penalized multinomial classification, where we predict c from x ∈ Rp directly.
2. Multinomial classification after projection on a dictionary, i.e. predicting c from Wg⊤x.
3. Same as experiment 2, using Dropout noise on the projected data Wg⊤x.
4. Factored model in the single-study case: solving (4) with the target study only.
5. Factored model in a two-study case: using the target study alongside HCP.
6. Factored model in the multi-study case: using the target study alongside all other studies.

The results are summarized in Figure 2. On average, both dimension reductions introduced by Wg and W′e are beneficial to generalization performance.
Using many datasets for prediction brings a further increase in performance, providing evidence of transfer learning between datasets.

In detail, the comparison between experiments 1, 2 and 3 confirms that projecting brain images onto functional networks of interest is a good strategy to capture cognitive information [20, 25]. Note that, in addition to improving the statistical properties of the estimators, the projection drastically reduces the computational complexity of training our full model. Experiments 2 and 3 measure the impact of the regularization method without learning a further latent projection. Using Dropout on the input space performs consistently better than ℓ2 regularization (+1% to +5%); this can be explained in view of [37], which interprets input-Dropout as an ℓ2 regularization on the natural model parametrization. Experiment 4 shows that Dropout regularization becomes much more powerful when learning a second dimension reduction, i.e. when solving problem (4). Even when using a single study for learning, we observe a significant improvement (+3% to +7%) in performance on three out of four datasets. Learning a latent-space projection together with Dropout-based data augmentation in this space is thus a much better regularization strategy than a simple ℓ2 or input-Dropout regularization. Finally, the comparison between experiments 4, 5 and 6 exhibits the expected transfer effect. On three out of four target studies, learning the projection matrix W′e using several datasets leads to an accuracy gain of +1.1% to +1.6%, consistent across folds. The more datasets are used, the higher the accuracy gain — note already that this gain increases with smaller train sizes. Jointly classifying images from several datasets thus brings extra information to the cognitive model, which allows finding better representative brain maps for the target study.
In particular, we conjecture that the large number of subjects in HCP helps modeling inter-subject noise. On the other hand, we observe a negative transfer effect on LA5c, as the tasks of this dataset share few cognitive aspects with the tasks of the other datasets. This encourages us to use richer dataset repositories for further improvement.

Figure 4: Classification maps from our model are more specific to higher-level functions: they focus more on the FFA for faces, and on the left intraparietal sulci for calculations.

Figure 5: The latent space of our model can be explored to unveil template brain statistical maps that correspond to bags of conditions related across color-coded datasets.

2.3 Transfer learning is very effective on small datasets

To further demonstrate the benefits of the multi-dataset model, we vary the size of the target datasets (Archi, Brainomics and CamCan) and compare the performance of the single-study model with the model that aggregates the Archi, Brainomics, CamCan and HCP studies. Figure 3 shows that the effect of transfer learning increases as we reduce the training size of the target dataset. This suggests that the learned data embedding WgW′e does capture some universal cognitive information and can be learned from different data sources. As a consequence, aggregating a larger study helps mitigate the small number of training samples in the target dataset.
With only 5 subjects, the gain in accuracy due to transfer is +13% on Archi, +8% on Brainomics, and +6% on CamCan. Multi-study learning should thus prove very useful to classify conditions in studies with ten or so subjects, which are still very common in neuroimaging.

2.4 Introspecting classification maps

At prediction time, our multi-dataset model can be collapsed into one multinomial model per dataset. Each dataset d is then classified using the matrix WgW′eW′d. As in the linear models classically used for decoding, the model weights for each condition can be represented as a brain map. Figure 4 shows the maps associated with digit computation and face viewing, for the Archi dataset. Models 2, 4 and 5 from the ablation study are compared. Although it is hard to assess the intrinsic quality of the maps, we can see that the introduction of the second projection layer and the multi-study problem formulation (here, appending the HCP dataset) yields maps with more weight on the high-level functional regions known to be specific to the task: for face viewing, the FFA stands out more compared to the primary visual cortices; for calculations, the weights of the intraparietal sulci become left-lateralized, as has been reported for symbolic number processing [38].

2.5 Exploring the latent space

Within our model, classification is performed on the same l-dimensional space E for all datasets, which is learned during training. To further show that this space captures cognitive information, we extract from E template brain images associated with general cognitive concepts. 
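This extraction can be sketched in a few lines: cluster the latent representations with k-means, back-project the centroids to voxel space, and push them forward through a classification head. The weight matrices below are random stand-ins, and the back-projection via pseudo-inverse is our illustrative choice; the paper instead goes backward through its sparse dictionaries D1, D2 and D3.

```python
import numpy as np
from sklearn.cluster import KMeans

rng = np.random.default_rng(0)
p, k, l, n_conditions = 1000, 64, 32, 10

# Random stand-ins for the trained projections (not the real parameters)
W_g = rng.standard_normal((p, k))
W_e = rng.standard_normal((k, l))
W_d = rng.standard_normal((l, n_conditions))   # head of one study d

# Latent representations of a collection of brain maps
X = rng.standard_normal((500, p))
Z = X @ W_g @ W_e                               # projections into E

# Representative vectors of E: centroids of a k-means clustering
# (5 clusters here for brevity; the paper uses 50)
centroids = KMeans(n_clusters=5, n_init=10, random_state=0).fit(Z).cluster_centers_

# Backward: template brain images t_j, here via the pseudo-inverse of the
# embedding (an illustrative substitute for the dictionary-based inversion)
embedding = W_g @ W_e                           # (p, l)
templates = centroids @ np.linalg.pinv(embedding)   # (5, p)

# Forward: condition probabilities associated with each template
logits = templates @ embedding @ W_d
probs = np.exp(logits - logits.max(axis=1, keepdims=True))
probs /= probs.sum(axis=1, keepdims=True)
print(templates.shape, probs.shape)             # (5, 1000), (5, 10)
```

Each row of `probs` is the word-cloud material of Figure 5: a distribution over the conditions of study d that indicates which cognitive functions the template captures.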
Fitting our model on the Archi, Brainomics, CamCan and HCP studies, we extract representative vectors of E with a k-means clustering over the projected data and consider the centroids (yj)j of 50 clusters. Each centroid yj can be associated with a brain image tj ∈ Rp that lies in the span of D1, D2 and D3. In doing so, we go backward through the model and obtain a representative of yj with well-delineated spatial regions. Going forward, we compute the classification probability vectors W′d⊤ yj = W′d⊤ W′e⊤ Wg⊤ tj for each study d. Together, these probability vectors give an indication of the cognitive functions that tj captures. Figure 5 represents six template images, associated with their probability vectors, shown as word clouds. We clearly obtain interpretable pairs of brain image/cognitive concepts. These pairs capture, across datasets, clusters of experiment conditions with similar brain representations.

3 Discussion

We compare our model to a previously proposed formulation for brain image classification. We show how our model differs from convex multi-task learning, and stress the importance of Dropout.

Task fMRI classification. Our model is related to a previous semi-supervised classification model [20] that also performs multinomial classification of conditions in a low-dimensional space: the dimension reduction they propose is the equivalent of our projection Wg. Our approach differs in two aspects. First, we replace the initial semi-supervised dimension reduction with an unsupervised analysis of resting-state data, using a much more tractable approach that we have shown to preserve cognitive signals. 
Second, we introduce the additional cognitive-aware projection W′e, learned on multiple studies. It substantially improves out-of-sample prediction performance, especially on small datasets, and above all allows us to uncover a cognitive-aware latent space, as we have shown in our experiments.

Convex multi-task learning. Due to the Dropout regularization and the fact that l is allowed to be larger than k, our formulation differs from the classical approach [39] to the multi-task problem, which would estimate Θ = W′e[W′1, . . . , W′d]d ∈ Rg×k by solving a convex empirical risk minimization problem with a trace-norm penalization that encourages Θ to be low-rank. We tested this formulation, which does not perform better than the explicit factorization formulation with Dropout regularization. Trace-norm regularized regression has the further drawback of being slower to train, as it typically operates with full gradients, e.g., using FISTA [40]. In contrast, the non-convex explicit factorization model is easily amenable to large-scale stochastic optimization; hence our focus.

Importance of Dropout. The use of Dropout regularization is crucial in our model. Without Dropout, in the single-study case with l > k, solving the factored problem (4) yields a solution worse in terms of empirical risk than solving the simple multinomial problem on (Wg⊤ xi)i, which finds a global minimizer of (4). Yet, Figure 2 shows that the model enriched with a latent space (red) achieves better test accuracy than the simple model (orange), thanks to the Dropout noise applied to the latent-space representation of the input data. Dropout is thus a promising novel way of regularizing fMRI models.

4 Conclusion

We proposed and characterized a novel cognitive neuroimaging modeling scheme that blends latent factor discovery and transfer learning. 
It can be applied to many different cognitive studies jointly, without requiring explicit correspondences between the cognitive tasks. The model helps identify the fundamental building blocks underlying the diversity of cognitive processes that the human mind can realize. It produces a basis of cognitive processes whose generalization power is validated quantitatively, and extracts representations of brain activity that ground the transfer of knowledge from existing fMRI repositories to newly acquired task data. The captured cognitive representations will improve as we provide the model with a growing number of studies and cognitive conditions.

5 Acknowledgments

This project has received funding from the European Union's Horizon 2020 Framework Programme for Research and Innovation under grant agreement No 720270 (Human Brain Project SGA1). Julien Mairal was supported by the ERC grant SOLARIS (No 714381) and a grant from ANR (MACARON project ANR-14-CE23-0003-01). We thank Olivier Grisel for his most helpful insights.

References

[1] David Van Essen, Kamil Ugurbil, et al. The Human Connectome Project: A data acquisition perspective. NeuroImage, 62(4):2222–2231, 2012.

[2] Russell A. Poldrack, Chris I. Baker, Joke Durnez, Krzysztof J. Gorgolewski, Paul M. Matthews, Marcus R. Munafò, Thomas E. Nichols, Jean-Baptiste Poline, Edward Vul, and Tal Yarkoni. Scanning the horizon: Towards transparent and reproducible neuroimaging research. Nature Reviews Neuroscience, 18(2):115–126, 2017.

[3] Allen Newell. You can't play 20 questions with nature and win: Projective comments on the papers of this symposium. 1973.

[4] John D. Medaglia, Mary-Ellen Lynall, and Danielle S. Bassett. Cognitive Network Neuroscience. Journal of Cognitive Neuroscience, 27(8):1471–1491, 2015.

[5] Lisa Feldman Barrett. The future of psychology: Connecting mind to brain. 
Perspectives on Psychological Science, 4(4):326–339, 2009.

[6] Tal Yarkoni, Russell A. Poldrack, Thomas E. Nichols, David C. Van Essen, and Tor D. Wager. Large-scale automated synthesis of human functional neuroimaging data. Nature Methods, 8(8):665–670, 2011.

[7] Angela R. Laird, Jack J. Lancaster, and Peter T. Fox. BrainMap. Neuroinformatics, 3(1):65–77, 2005.

[8] Gholamreza Salimi-Khorshidi, Stephen M. Smith, John R. Keltner, Tor D. Wager, and Thomas E. Nichols. Meta-analysis of neuroimaging data: A comparison of image-based and coordinate-based pooling of studies. NeuroImage, 45(3):810–823, 2009.

[9] Tor D. Wager, Lauren Y. Atlas, Martin A. Lindquist, Mathieu Roy, Choong-Wan Woo, and Ethan Kross. An fMRI-Based Neurologic Signature of Physical Pain. New England Journal of Medicine, 368(15):1388–1397, 2013.

[10] Yannick Schwartz, Bertrand Thirion, and Gaël Varoquaux. Mapping paradigm ontologies to and from the brain. In Advances in Neural Information Processing Systems, pages 1673–1681, 2013.

[11] Oluwasanmi Koyejo and Russell A. Poldrack. Decoding cognitive processes from functional MRI. In NIPS Workshop on Machine Learning for Interpretable Neuroimaging, pages 5–10, 2013.

[12] Russell A. Poldrack, Yaroslav O. Halchenko, and Stephen José Hanson. Decoding the large-scale structure of brain function by classifying mental states across individuals. Psychological Science, 20(11):1364–1372, 2009.

[13] Jessica A. Turner and Angela R. Laird. The cognitive paradigm ontology: Design and application. Neuroinformatics, 10(1):57–66, 2012.

[14] Rie Kubota Ando and Tong Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 6(Nov):1817–1853, 2005.

[15] Ya Xue, Xuejun Liao, Lawrence Carin, and Balaji Krishnapuram. Multi-task learning for classification with Dirichlet process priors. 
Journal of Machine Learning Research, 8(Jan):35–63, 2007.

[16] Yann LeCun, Yoshua Bengio, and Geoffrey Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

[17] Jeff Donahue, Yangqing Jia, Oriol Vinyals, Judy Hoffman, Ning Zhang, Eric Tzeng, and Trevor Darrell. DeCAF: A Deep Convolutional Activation Feature for Generic Visual Recognition. In International Conference on Machine Learning, volume 32, pages 647–655, 2014.

[18] Ronan Collobert and Jason Weston. A unified architecture for natural language processing: Deep neural networks with multitask learning. In International Conference on Machine Learning, pages 160–167, 2008.

[19] Danilo Bzdok and B. T. Thomas Yeo. Inference in the age of big data: Future perspectives on neuroscience. NeuroImage, 155(Supplement C):549–564, 2017.

[20] Danilo Bzdok, Michael Eickenberg, Olivier Grisel, Bertrand Thirion, and Gaël Varoquaux. Semi-supervised factored logistic regression for high-dimensional neuroimaging data. In Advances in Neural Information Processing Systems, pages 3348–3356, 2015.

[21] Timothy Rubin, Oluwasanmi O. Koyejo, Michael N. Jones, and Tal Yarkoni. Generalized Correspondence-LDA Models (GC-LDA) for Identifying Functional Regions in the Brain. In Advances in Neural Information Processing Systems, pages 1118–1126, 2016.

[22] Alexandre Gramfort, Bertrand Thirion, and Gaël Varoquaux. Identifying Predictive Regions from fMRI with TV-L1 Prior. In International Workshop on Pattern Recognition in Neuroimaging, pages 17–20, 2013.

[23] Bertrand Thirion, Gaël Varoquaux, Elvis Dohmatob, and Jean-Baptiste Poline. Which fMRI clustering gives good brain parcellations? Frontiers in Neuroscience, 8:167, 2014.

[24] Nitish Srivastava, Geoffrey E. Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting. 
Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[25] Thomas Blumensath, Saad Jbabdi, Matthew F. Glasser, David C. Van Essen, Kamil Ugurbil, Timothy E. J. Behrens, and Stephen M. Smith. Spatially constrained hierarchical parcellation of the brain with resting-state fMRI. NeuroImage, 76:313–324, 2013.

[26] Arthur Mensch, Julien Mairal, Bertrand Thirion, and Gaël Varoquaux. Dictionary learning for massive matrix factorization. In International Conference on Machine Learning, pages 1737–1746, 2016.

[27] Arthur Mensch, Julien Mairal, Bertrand Thirion, and Gaël Varoquaux. Stochastic Subsampling for Factorizing Huge Matrices. IEEE Transactions on Signal Processing, 99(to appear), 2017.

[28] Simon B. Eickhoff, Bertrand Thirion, Gaël Varoquaux, and Danilo Bzdok. Connectivity-based parcellation: Critique and implications. Human Brain Mapping, 36(12):4771–4792, 2015.

[29] Stéphane G. Mallat and Zhifeng Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397–3415, 1993.

[30] Diederik P. Kingma and Jimmy Ba. Adam: A Method for Stochastic Optimization. In International Conference for Learning Representations, 2015.

[31] Philippe Pinel, Bertrand Thirion, Sébastien Meriaux, Antoinette Jobert, Julien Serres, Denis Le Bihan, Jean-Baptiste Poline, and Stanislas Dehaene. Fast reproducible identification and large-scale databasing of individual functional cognitive networks. BMC Neuroscience, 8(1):91, 2007.

[32] Dimitri Papadopoulos Orfanos, Vincent Michel, Yannick Schwartz, Philippe Pinel, Antonio Moreno, Denis Le Bihan, and Vincent Frouin. The Brainomics/Localizer database. NeuroImage, 144:309–314, 2017.

[33] Meredith A. Shafto, Lorraine K. Tyler, Marie Dixon, Jason R. Taylor, James B. Rowe, Rhodri Cusack, Andrew J. Calder, William D. Marslen-Wilson, John Duncan, Tim Dalgleish, Richard N. 
Henson, Carol Brayne, and Fiona E. Matthews. The Cambridge Centre for Ageing and Neuroscience (Cam-CAN) study protocol: A cross-sectional, lifespan, multidisciplinary examination of healthy cognitive ageing. BMC Neurology, 14:204, 2014.

[34] Russell A. Poldrack, Eliza Congdon, William Triplett, Krzysztof J. Gorgolewski, Katherine H. Karlsgodt, Jeanette A. Mumford, Fred W. Sabb, Nelson B. Freimer, Edythe D. London, Tyrone D. Cannon, et al. A phenome-wide examination of neural and cognitive function. Scientific Data, 3:160110, 2016.

[35] Alexandre Abraham, Fabian Pedregosa, Michael Eickenberg, Philippe Gervais, Andreas Mueller, Jean Kossaifi, Alexandre Gramfort, Bertrand Thirion, and Gaël Varoquaux. Machine learning for neuroimaging with scikit-learn. Frontiers in Neuroinformatics, 8:14, 2014.

[36] Fabian Pedregosa, Gaël Varoquaux, Alexandre Gramfort, Vincent Michel, Bertrand Thirion, Olivier Grisel, Mathieu Blondel, Peter Prettenhofer, Ron Weiss, Vincent Dubourg, Jake Vanderplas, Alexandre Passos, David Cournapeau, Matthieu Brucher, Matthieu Perrot, and Édouard Duchesnay. Scikit-learn: Machine learning in Python. Journal of Machine Learning Research, 12:2825–2830, 2011.

[37] Stefan Wager, Sida Wang, and Percy Liang. Dropout Training as Adaptive Regularization. In Advances in Neural Information Processing Systems, pages 351–359, 2013.

[38] Stephanie Bugden, Gavin R. Price, D. Adam McLean, and Daniel Ansari. The role of the left intraparietal sulcus in the relationship between symbolic number processing and children's arithmetic competence. Developmental Cognitive Neuroscience, 2(4):448–457, 2012.

[39] Nathan Srebro, Jason Rennie, and Tommi S. Jaakkola. Maximum-margin matrix factorization. In Advances in Neural Information Processing Systems, pages 1329–1336, 2004.

[40] Amir Beck and Marc Teboulle. 
A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183–202, 2009.