{"title": "Hierarchical Modular Optimization of Convolutional Networks Achieves Representations Similar to Macaque IT and Human Ventral Stream", "book": "Advances in Neural Information Processing Systems", "page_first": 3093, "page_last": 3101, "abstract": "Humans recognize visually-presented objects rapidly and accurately. To understand this ability, we seek to construct models of the ventral stream, the series of cortical areas thought to subserve object recognition. One tool to assess the quality of a model of the ventral stream is the Representation Dissimilarity Matrix (RDM), which uses a set of visual stimuli and measures the distances produced in either the brain (i.e. fMRI voxel responses, neural firing rates) or in models (features). Previous work has shown that all known models of the ventral stream fail to capture the RDM pattern observed in either IT cortex, the highest ventral area, or in the human ventral stream. In this work, we construct models of the ventral stream using a novel optimization procedure for category-level object recognition problems, and produce RDMs resembling both macaque IT and human ventral stream. The model, while novel in the optimization procedure, further develops a long-standing functional hypothesis that the ventral visual stream is a hierarchically arranged series of processing stages optimized for visual object recognition.", "full_text": "Hierarchical Modular Optimization of Convolutional\n\nNetworks Achieves Representations Similar to\n\nMacaque IT and Human Ventral Stream\n\nDaniel Yamins\u2217\n\nHa Hong\u2217\n\nMcGovern Institute of Brain Research\nMassachusetts Institute of Technology\n\nMcGovern Institute of Brain Research\nMassachusetts Institute of Technology\n\nCambridge, MA 02139\nyamins@mit.edu\n\nCambridge, MA 02139\nhahong@mit.edu\n\nCharles Cadieu\n\nJames J. 
Dicarlo\n\nMcGovern Institute of Brain Research\nMassachusetts Institute of Technology\n\nMcGovern Institute of Brain Research\nMassachusetts Institute of Technology\n\nCambridge, MA 02139\nhahong@mit.edu\n\nCambridge, MA 02139\ndicarlo@mit.edu\n\nAbstract\n\nHumans recognize visually-presented objects rapidly and accurately. To under-\nstand this ability, we seek to construct models of the ventral stream, the series of\ncortical areas thought to subserve object recognition. One tool to assess the qual-\nity of a model of the ventral stream is the Representational Dissimilarity Matrix\n(RDM), which uses a set of visual stimuli and measures the distances produced in\neither the brain (i.e. fMRI voxel responses, neural \ufb01ring rates) or in models (fea-\ntures). Previous work has shown that all known models of the ventral stream fail\nto capture the RDM pattern observed in either IT cortex, the highest ventral area,\nor in the human ventral stream. In this work, we construct models of the ventral\nstream using a novel optimization procedure for category-level object recognition\nproblems, and produce RDMs resembling both macaque IT and human ventral\nstream. The model, while novel in the optimization procedure, further develops\na long-standing functional hypothesis that the ventral visual stream is a hierarchi-\ncally arranged series of processing stages optimized for visual object recognition.\n\n1\n\nIntroduction\n\nHumans recognize visually-presented objects rapidly and accurately even under image distortions\nand variations that make this a computationally challenging problem [27]. There is substantial\nevidence that the human brain solves this invariant object recognition challenge via a hierarchical\ncortical neuronal network called the ventral visual stream [13, 17], which has highly homologous\nareas in non-human primates [19, 9]. 
A core, long-standing hypothesis is that the visual input\ncaptured by the retina is rapidly processed through the ventral stream into an effective, \u201cinvariant\u201d\nrepresentation of object shape and identity [11, 9, 8]. This hypothesis has been bolstered by recent\ndevelopments in neuroscience which have shown that abstract category-level visual information is\naccessible in IT (inferotemporal) cortex, the highest ventral cortical area, but much less effectively\naccessible in lower areas such as V1, V2 or V4 [23]. This observation has been confirmed both\nat the individual neural level, where single-unit responses can be decoded using linear classifiers\nto yield category predictions [14, 23], and at the population code level, where response vector\ncorrelation matrices evidence clear semantic structure [19].\n\n\u2217web.mit.edu/ yamins; \u2217 These authors contributed equally to this work.\n\nFigure 1: A) Heterogeneous hierarchical convolutional neural networks are composed of basic operations\nthat are simple and neurally plausible, including linear reweighting (filtering), thresholding,\npooling and normalization. These simple elements are convolutional and are stacked hierarchically\nto construct non-linear computations of increasingly greater power, ranging through low (L1),\nmedium (L2), and high (L3) complexity structures. B) Several of these elements are combined\nto produce mixtures capturing heterogeneous neural populations. Each processing stage across the\nheterogeneous networks (A1, A2, ...) can be considered analogous to a neural visual area.\n\nDeveloping encoding models, models that map the stimulus to the neural response, of visual area IT\nwould likely help us to understand object recognition in humans. Encoding models of lower-level\nvisual responses (RGC, LGN, V1, V2) have been relatively successful [21, 4] (but cf. [26]). 
In\nhigher-visual areas, particularly IT, theoretical work has described a compelling framework which\nwe ascribe to in this work [29]. However, to this point it has not been possible to produce effective\nencoding models of IT. This explanatory gap, between model responses and IT responses, is present\nat both the level of the individual neuron responses and at the population code level. Of particular\ninterest for our analysis in this paper, current models of IT, such as HMAX, have been shown to\nfail to achieve the speci\ufb01c categorical structures present in neural populations [18]. In other related\nwork, descriptions of higher areas (V4, IT) responses have been made for very narrow classes of\narti\ufb01cial stimuli and do not de\ufb01ne responses to arbitrary natural images [6, 3].\nIn a step toward bridging this explanatory gap, we describe advances in constructing models that\ncapture the categorical structures present in IT neural populations and fMRI measurements of hu-\nmans. We take a top-down functional approach focused on building invariant object representations,\noptimizing biologically-plausible computational architectures for high performance on a challenging\nobject recognition screening task. We then show that these models capture key response properties\nof IT, both at the level of individual neuronal responses as well as the neuronal population code \u2013\neven for entirely new objects and categories never seen in model selection.\n\n2 Methods\n\n2.1 Heterogenous Hierarchical Convolutional Models\n\nInspired by previous neuronal modeling work [7, 6], we constructed a model class based on three\nbasic principles: (i) single layers composed of neurally-plausible basic operations, including \ufb01l-\ntering, nonlinear activation, pooling and normalization (ii) using hierarchical stacking to construct\nmore complex operations, and (iii) convolutional weight sharing (\ufb01g. 1A). 
This general type of\nmodel has been successful in describing a variety of phenomenology throughout the ventral stream\n[30]. In addition, we allow combinations of multiple hierarchical components each with different\nparameters (such as pooling size, number of filters, etc.), representing different types of units with\ndifferent response properties [5] and refer to this concept as (iv) heterogeneity (fig. 1B).\nWe will now formally define the class of heterogeneous hierarchical convolutional neural networks,\n$\\mathcal{N}$. First consider a simple neural network function defined by\n\n$$N_{\\Theta} = \\mathrm{Pool}_{\\theta_p}(\\mathrm{Normalize}_{\\theta_N}(\\mathrm{Threshold}_{\\theta_T}(\\mathrm{Filter}_{\\theta_F}(\\mathrm{Input})))) \\qquad (1)$$\n\nwhere the pooling, normalization, thresholding and filterbank convolution operations are as described\nin [28]. The parameters $\\Theta = (\\theta_p, \\theta_N, \\theta_T, \\theta_F)$ control the structure of the constituent operations.\nEach model stage therefore actually represents a large family of possible operations, specified\nby a set of parameters controlling e.g. fan-in, activation thresholds, pooling exponents, spatial interaction\nradii, and template structure. Like [28], we use randomly chosen filterbank templates in\nall models, but additionally allow the mean and variance of the filterbank to vary as parameters. 
To\nproduce deep feedforward networks, single layers are stacked:\n\n$$P^{\\ell-1}_{\\Theta_{P,\\ell-1}} \\xrightarrow{\\mathrm{Filter}} F^{\\ell}_{\\Theta_{F,\\ell}} \\xrightarrow{\\mathrm{Threshold}} T^{\\ell}_{\\Theta_{T,\\ell}} \\xrightarrow{\\mathrm{Normalize}} N^{\\ell}_{\\Theta_{N,\\ell}} \\xrightarrow{\\mathrm{Pool}} P^{\\ell}_{\\Theta_{P,\\ell}} \\qquad (2)$$\n\nWe denote such a stacking operation as $N(\\Theta_1, \\ldots, \\Theta_k)$, where the $\\Theta_\\ell$ are parameters chosen\nseparately for each layer, and will refer to networks of this form as \u201csingle-stack\u201d networks.\nLet the set of all depth-$k$ single-stack networks be denoted $\\mathcal{N}_k$. Given a sequence of such\nsingle-stack networks $N(\\Theta_{i1}, \\Theta_{i2}, \\ldots, \\Theta_{in_i})$ (possibly of different depths), the combination\n$N \\equiv \\oplus_{i=1}^{k} N(\\Theta_{i1}, \\Theta_{i2}, \\ldots, \\Theta_{in_i})$ is formed by aligning the output layers of these models along the\nspatial convolutional dimension. These networks $N$ can, of course, also be stacked, just like their single-stack\nconstituents, to form more complicated, deeper heterogeneous hierarchies. By definition, the\nclass $\\mathcal{N}$ consists of all the iterative chainings and combinations of such networks.\n\n2.2 High-Throughput Screening via Hierarchical Modular Optimization\n\nOur goal is to find models within $\\mathcal{N}$ that are effective at modeling neural responses to a wide variety\nof images. To do this, our basic strategy is to perform high-throughput optimization on a screening\ntask [28]. By choosing a screening task that is sufficiently representative of the aspects that make\nthe object recognition problem challenging, we should be able to find network architectures that are\ngenerally applicable. For our screening set, we created a set of 4500 synthetic images, 125 for each\nof 36 three-dimensional mesh models of everyday objects, placed\non naturalistic backgrounds. 
The screening task we evaluated was 36-way object recognition. We\ntrained Maximum Correlation Classifiers (MCC) with 3-fold cross-validated 50%/50% train/test\nsplits, using testing classification percent-correct as the screening objective function.\nBecause $\\mathcal{N}$ is a very large space, determining which parameter settings produce visual\nrepresentations that perform well on the screening set is a challenge.\nWe addressed this by applying a novel method we call Hierarchical Modular Optimization (HMO).\nThe intuitive idea of the HMO procedure is that a good multi-stack heterogeneous network\nwill be found by creating mixtures of single-stack components each of which specializes in\na portion of an overall problem. To achieve this, we implemented a version of adaptive hyperparameter\nboosting, in which rounds of optimization are interleaved with boosting and hierarchical\nstacking.\nSpecifically, suppose that $N \\in \\mathcal{N}$ and $S$ is a screening stimulus set. Let $E$ be the binary-valued\nclassification correctness indicator, assigning to each stimulus image $s$ 1 or 0 according to whether\nthe screening task prediction was right or wrong. Let $\\mathrm{score}(N, S) = \\sum_{s \\in S} E(N(s))$. To efficiently\nfind $N$ that maximizes $\\mathrm{score}(N, S)$, the HMO procedure follows these steps:\n1. Optimization: Optimize the score function within the class of single-stack networks, obtaining\nan optimization trajectory of networks in $\\mathcal{N}$ (fig. 2A, left). The optimization procedure that we use\nis the Hyperparameter Tree Parzen Estimator, as described in [1]. This procedure is effective in large\nparameter spaces that include discrete and continuous parameters.\n2. Boosting: Consider the set of networks explored during step 1 as a set of weak learners, and\napply a standard boosting algorithm (Adaboost) to identify some number of networks $N_{11}, \\ldots, N_{1 l_1}$\nwhose error patterns are complementary (fig. 2A, right).\n3. Combination: Form the multi-stack network $N_1 = \\oplus_i N_{1i}$ and evaluate $E(N_1(s))$ for all $s \\in S$.\n4. Error-based Reweighting: Repeat step 1, but reweight the scoring to give the $j$-th stimulus $s_j$\nweight 0 if $N_1$ is correct on $s_j$, and 1 otherwise. That is, the performance function to be optimized for\n$N$ is now $\\sum_{s \\in S} (1 - E(N_1(s))) \\cdot E(N(s))$. Repeat step 2 on the results of the optimization trajectory\nobtained to get models $N_{21}, \\ldots, N_{2 l_2}$, and repeat step 3. Steps 1, 2, 3 are repeated $K$ times.\n\nFigure 2: A) Hierarchical Modular Optimization is a mechanism for efficiently optimizing\nneural networks for object recognition performance. The intuitive idea of HMO is that a good multi-stack\nheterogeneous network will be found by creating mixtures of single-stack components each of\nwhich specializes in a portion of an overall problem. The process first identifies complementary performance\ngradients in the space of single-stack (non-heterogeneous) convolutional neural networks\nby using a version of adaptive boosting interleaved with hyperparameter optimization. The components\nidentified in this process are then composed nonlinearly using a second convolutional layer\nto produce a combined output model. B) Top: the 36-way confusion matrices associated with two\ncomplementary components identified in the HMO process. Bottom Left: The two optimization\ntrajectories from which the single-stack models were drawn that produced the confusion matrices in\nthe top panels. The optimization criterion for the second round (red dots) was defined relative to the\nerrors of the first round (blue dots). Bottom Right: The confusion matrix of the heterogeneous model\nproduced by combining the round 1 and round 2 networks.\n\nAfter $K$ repeats, we will have obtained a multi-stack network $N = \\oplus_{i \\le K,\\, j \\le l_i} N_{ij}$. 
The process can\nthen simply be terminated, or repeated with the output of $N$ as the input to another stacked network.\nIn the latter case, the next layer is chosen using the same model class $\\mathcal{N}$ to draw from, and using the\nsame adaptive hyperparameter boosting procedure.\nThe meta-parameters of the HMO procedure include the numbers of components $l_1, l_2, \\ldots$ to be selected\nat each boosting round, the number of times $K$ that the interleaved boosting and optimization\nis repeated, and the number of times $M$ this procedure is stacked. To constrain this space we fix the\nmeta-parameters $l_1 = l_2 = \\cdots = 10$, $K = 3$, and $M \\le 2$. With the fixed screening set described\nabove, and these meta-parameter settings, we generated a network $N_{\\mathrm{HMO}}$. We will refer back to\nthis model throughout the rest of the paper. $N_{\\mathrm{HMO}}$ produces 1250-dimensional feature vectors\nfor any input stimulus; we will denote $N_{\\mathrm{HMO}}(s)$ as the resulting feature vector for stimulus $s$ and\n$N_{\\mathrm{HMO}}(s)_k$ as its $k$-th component in 1250-dimensional space.\n\n2.3 Predicting IT Neural Responses\n\nUtilizing the $N_{\\mathrm{HMO}}$ network, we construct models of IT in one of two ways: 1) we estimate a GLM\npredicting individual neural responses or 2) we estimate linear classifiers of object categories\nto produce a candidate IT neural space.\nTo construct models of individual neural responses we estimate a linear mapping from a non-linear\nspace produced by a model. This procedure is a standard GLM of individual neural responses.\nBecause IT responses are highly non-linear functions of the input image, successful models must\ncapture the non-linearity of the IT response. The $N_{\\mathrm{HMO}}$ network produces a highly non-linear\ntransformation of the input image and we compare the efficacy of this non-linearity against those\nproduced by other models. Specifically, for a neuron $n_i$, we estimate a vector $w_i$ to minimize the\nregression error from $N_{\\mathrm{HMO}}$ features to $n_i$\u2019s responses, over a training set of stimuli. We evaluate\ngoodness of fit by measuring the regression $r^2$ values between the neural response and the GLM\npredictions on held-out images, averaged over several train/test splits. Taken over the set of predicted\nneurons $n_1, n_2, \\ldots, n_k$, the collection of regression weight vectors $w_i$ comprises a matrix $W$ that can\nbe thought of as a final linear top level that forms part of the model of IT. This method evidently\nrequires the presence of low-level neural data on which to train.\nWe also produce a candidate IT neural space by estimating linear classifiers on an object recognition\ntask. As we might expect different subregions of IT cortex to have different selectivities for object\ncategories (for example face, body, and place patches [15, 10]), the output of the linear classifiers\nwill also respond preferentially to different object categories. We may be able to leverage some\nunderstanding of what a subregion\u2019s task specialization might be to produce the weighting matrix\n$W$. Specifically, we estimate a linear mapping $W$ to be the weights of a set of linear classifiers\ntrained from the $N_{\\mathrm{HMO}}$ features on a specific set of object recognition tasks. 
We can then evaluate\nthis mapping on a novel set of images and compare to measured IT or human ventral stream data.\nThis method may have traction even when individual neural response data are not available.\n\n2.4 Representational Dissimilarity Matrices\n\nImplicit in this discussion is the idea of comparing two different representations (in this case, the\nmodel\u2019s predicted population versus the real neural population) on a fixed stimulus set. The Representational\nDissimilarity Matrix (RDM) is a convenient tool for this comparison [19]. Formally,\ngiven a stimulus set $S = s_1, \\ldots, s_k$ and vectors of neural population responses $R = \\vec{r}_1, \\ldots, \\vec{r}_k$ in\nwhich $r_{ij}$ is the response of the $j$-th neuron to the $i$-th stimulus, define\n\n$$\\mathrm{RDM}(R)_{ij} = 1 - \\frac{\\mathrm{cov}(\\vec{r}_i, \\vec{r}_j)}{\\sqrt{\\mathrm{var}(\\vec{r}_i) \\cdot \\mathrm{var}(\\vec{r}_j)}}.$$\n\nThe RDM characterizes the layout of the stimuli in high-dimensional neural population space.\nFollowing [19], we measured similarity between population representations as the Spearman rank\ncorrelation between the RDMs for two populations, in which both RDMs are treated as vectors in\n$k(k - 1)/2$-dimensional space. Two populations can have similar RDMs on a given stimulus set,\neven if details of the neural responses are different.\n\n3 Results\n\nTo test the $N_{\\mathrm{HMO}}$ model, we took two routes, corresponding to the two methods for prediction\ndescribed above. First (sec. 3.1), we obtained our own neural data on a testing set of our own\ndesign and tested the $N_{\\mathrm{HMO}}$ model\u2019s ability to predict individual-level neural responses using the\nlinear regression methodology described above. This approach allowed us to directly test the $N_{\\mathrm{HMO}}$\nmodel\u2019s power in a setting where we had access to low-level neural information. Second (sec. 3.2), we\nalso compared to neural data collected by a different group, but only released at a very coarse level\nof detail \u2013 the RDMs of their measured population. 
This constraint required us to additionally posit\na task blend, and to make the comparison at the population RDM level.\n\n3.1 The Neural Representation Benchmark Image Set\n\nWe analyzed neural data collected on the Neural Representation Benchmark (NRB) dataset, which\nwas originally developed to compare monkey neural and human behavioral responses [23, 2]. The\nNRB dataset consists of 5760 images of 64 distinct objects. The objects come from eight \u201cbasic\u201d categories\n(animals, boats, cars, chairs, faces, fruits, planes, tables), with eight exemplars per category\n(e.g., BMW, Z3, Ford, etc. for cars) (see fig. 3B, bottom left), with objects varying in position, size, and\n3D pose, and placed on a variety of uncorrelated natural backgrounds. These parameters were varied\nconcomitantly, picked randomly from uniform ranges at three levels of object identity-preserving\nvariation (low, medium, and high). The NRB set was designed to test explicitly the transformations\nof pose, position and size that are at the crux of the invariant object recognition problem. None of the\nobjects, categories or backgrounds used in the HMO screening set appeared in the NRB set; moreover,\nthe NRB image set was created with different image and lighting parameters, with different\nrendering software.\n\nFigure 3: A) 8-way categorization performances. Comparison was made between several existing\nmodels from the literature (cyan bars), the HMO model features, and data from V4 and IT neural\npopulations. Performances are normalized relative to human behavioral data collected from Amazon\nMechanical Turk experiments. High levels of variation strongly separate the HMO model and the\nhigh-level IT neural features from the other representations. B) Top: Actual neural response (black)\nvs. prediction (red) for a single sample IT unit. This neuron shows high functional selectivity for\nfaces, which is effectively mirrored by the predicted unit. Bottom Left: Sample Neural Representation\nBenchmark images. C) Comparison of Representational Dissimilarity Matrices (RDMs)\nfor the NRB dataset. D) As populations increase in complexity and abstraction power, their representations become\nprogressively more like that of IT, as category structure that was blurred out at lower levels by variability\nbecomes abstracted at the higher levels. The HMO model shows similarity to IT both on the\nblock diagonal structure associated with categorization performance, and on the off-diagonal\ncomparisons that characterize the neural representation more precisely.\n\nNeural data was obtained via large-scale parallel array electrophysiology recordings in the visual\ncortex of awake behaving macaques. Testing set images were presented foveally (central 10 deg)\nwith a Rapid Serial Visual Presentation (RSVP) paradigm, involving passively viewing animals\nshown random stimulus sequences with durations comparable to those in natural primate fixations\n(e.g. 200 ms). Electrode arrays were surgically implanted in V4 and IT, and recordings took place\ndaily over a period of several months. A total of 296 multi-unit responses were recorded from\ntwo animals. For each testing stimulus and neuron, final neuron output responses were obtained\nby averaging data from between 25 and 50 repeated trials. 
With this dataset, we addressed two\nquestions: how well the HMO model was able to perform on the categorization tasks supported by\nthe dataset, and how well the HMO model predicted the neural data.\nPerformance was assessed for three types of tasks: 8-way basic category classification,\n8-way car object identification, and 8-way face object identification. We computed the model\u2019s\npredicted outputs in response to each of the testing images, and then tested simple, cross-validated\nlinear classifiers based on these features. As performance controls, we also computed features on\nthe test images for a number of models from the literature, including a V1-like model [27], a V2-like\nmodel [12], and an HMAX variant [25]. We also compared to a simple computer vision model,\nSIFT [22], as well as the basic pixel control. Performances were also measured for neural output\nfeatures, building on previous results showing that V4 neurons performed less well than IT neurons\nat higher variation levels [23], and confirming that the testing tasks meaningfully engaged higher-level\nvision processing. Figure 3A compares overall performances, showing that the HMO-selected\nmodel is able to achieve human-level performance at all levels of variation. Critically, the HMO\nmodel performs well not just in low-variation settings in which simple lower-level models can do\nso, but is able to achieve near-human performance (within 1 std of the average human) even when\nfaced with large amounts of variation which caused the other models to perform near chance. 
Since\nthe testing set contains entirely different objects in non-overlapping basic categories, with none of\nthe same backgrounds, this suggests that the nonlinearity identified in the HMO screening phase is\nable to achieve significant generalization across image domains.\nGiven that the model evidenced high transferable performance, we next determined the ability of the\nmodel to explain low-level neuronal responses using regression. The HMO model is able to predict\napproximately 48% of the explainable variance in the neural data, more than twice as much as any\ncompeting model (fig. 3B). Using the same transformation matrices $W$ obtained from the regression\nfitting, we also computed RDMs, which show significant similarity to those of IT populations, nearly\ncomparable to the split-half similarity range of the IT population itself (fig. 3C). A key comparison\nbetween models and data shows that as populations ascend the ventral hierarchy and increase in\ncomplexity, they become progressively closer to IT, with category structure that was blurred out at\nlower levels by variation becoming properly abstracted away at the higher levels (fig. 3D).\n\n3.2 The Monkeys & Man Image Set\n\nKriegeskorte et al. analyzed neural recordings made in an anterior patch of macaque IT on a small\nnumber of widely varying naturalistic images of everyday objects, and additionally obtained fMRI\nrecordings from the analogous region of human visual cortex [19]. These objects included human\nand animal faces and body parts, as well as a variety of natural and man-made inanimate objects.\nThree striking findings of this work were that (i) the population code (as measured by RDMs) of the\nmacaque neurons strongly mirrors the structure present in the human fMRI data, (ii) this structure\nappears to be dominated by the separation of animate vs. inanimate object classes (fig. 
4B, lower\nright) and (iii) that none of a variety of computational models produced RDMs with this structure.\nIndividual unit neural response data from these experiments is not publicly available. However, we\nwere able to obtain a set of approximately 1000 additional training images with roughly similar\ncategorical distinctions to that of the original target images, including distributions of human and\nanimal faces and body parts, and a variety of other non-animal objects [16]. We posited that the\npopulation code structure present in the anterior region of IT recorded in the original experiment\nis guided by functional goals similar to the task distinctions supported by this dataset. To test\nthis, we computed linear classi\ufb01ers from NHM O features for all the binary distinctions possible in\nthe training set (e.g. \u201chuman/non-human\u201d, \u201canimate/inanimate\u201d, \u201chand/non-hand\u201d, \u201cbird/non-bird\u201d,\n&c). The linear weighting matrix W derived from these linear classi\ufb01ers was then used to produce\nan RDM matrix which could be compared to that measured originally. In fact, the HMO-based\npopulation RDM strongly qualitatively matches that of the monkey IT RDM and, to a signi\ufb01cant\nbut lesser extent, that of the human IT RDM (\ufb01g. 4B). This \ufb01t is signi\ufb01cantly better than that of all\nmodels evaluated by Kriegeskorte, and approaches the human/monkey \ufb01t value itself (\ufb01g. 4A).\n\n4 Discussion\n\nHigh consistency with neural data at individual neuronal response and population code levels across\nseveral diverse datasets suggests that the HMO model is a good candidate model of the higher\nventral stream processing. 
The fact that the model was optimized only for performance, and not\ndirectly for consistency with neural responses, highlights the power of functionally-driven computational\napproaches in understanding cortical processing. These results further develop a long-standing\nfunctional hypothesis about the ventral visual stream, and show that more rigorous versions of its\narchitecture and functional constraints can be leveraged using modern computational tools to expose\nthe transformation of visual information in the ventral stream.\n\nFigure 4: A) Comparison of model representations to Monkey IT (solid bars) and Human ventral\nstream (hatched bars). The HMO model followed by a simple task-blend based linear reweighting\n(red bars) quantitatively approximates the human/monkey fit value (black bar), and captures both\nmonkey and human ventral stream structure more effectively than any of the large number of models\nshown in [18], or any of the additional comparison models we evaluated here (cyan bars). B)\nRepresentational Dissimilarity Matrices show a clear qualitative similarity between monkey IT and\nhuman IT on the one hand [19] and between these and the HMO model representation on the other.\n\nThe picture that emerges is of a general-purpose object recognition architecture \u2013 approximated by\nthe $N_{\\mathrm{HMO}}$ network \u2013 situated just posterior to a set of several downstream regions that can be\nthought of as specialized linear projections \u2013 the matrices $W$ \u2013 from the more general upstream\nregion. 
These linear projections can, at least in some cases, be characterized effectively as the sig-\nnature of interpretable functional tasks in which the system is thought to have gained expertise.\nThis two-step arrangement makes sense if there is a core set of object recognition primitives that\nare comparatively dif\ufb01cult to discover, but which, once found, underlie many recognition tasks.\nThe particular recognition tasks that the system learns to solve can all draw from this upstream\n\u201cnon-linear reservoir\u201d, creating downstream specialists that trade off generality for the ability to\nmake more ef\ufb01cient judgements on new visual data relative to the particular problems on which\nthey specialize. This hypothesis makes testable predictions about how monkey and human visual\nsystems should both respond to certain real-time training interventions (e.g.\nthe effects of \u201cnur-\nture\u201d), while being circumscribed within a range of possible behaviors allowed by the (presumably)\nharder-to-change upstream network (e.g. the constraints of \u201cnature\u201d). It also suggests that it will\nbe important to explore recent high-performing computer vision systems, e.g. [20], to determine\nwhether these algorithms provide further insight into ventral stream mechanisms. Our results show\nthat behaviorally-driven computational approaches have an important role in understanding the de-\ntails of cortical processing[24]. This is a fruitful direction of future investigation for such models to\nengage with additional neural and behavior experiments.\n\nReferences\n\n[1] J. Bergstra, D. Yamins, and D.D. Cox. Making a Science of Model Search, 2012.\n[2] C. Cadieu, H. Hong, D. Yamins, N. Pinto, N. Majaj, and J.J. DiCarlo. The neural representation bench-\nmark and its evaluation on brain and machine. In International Conference on Learning Representations,\nMay 2013.\n\n[3] C. Cadieu, M. Kouh, A. Pasupathy, C. E. Connor, M. Riesenhuber, and T. Poggio. 
A model of V4 shape selectivity and invariance. J Neurophysiol, 98(3):1733\u201350, 2007.\n[4] M. Carandini, J. B. Demb, V. Mante, D. J. Tolhurst, Y. Dan, B. A. Olshausen, J. L. Gallant, and N. C. Rust. Do we know what the early visual system does? J Neurosci, 25(46):10577\u201397, 2005.\n\nFigure 4 plot labels: panels A) and B); HMO, Pixels, HMAX, V1-like, Monkey/Human, Monkey IT (Kriegeskorte, 2008), Human (Kriegeskorte, 2008); axis: Spearman Rank Correlation.\n\n[5] M. Churchland and K. Shenoy. Temporal complexity and heterogeneity of single-neuron activity in premotor and motor cortex. Journal of Neurophysiology, 97(6):4235\u20134257, 2007.\n[6] C. E. Connor, S. L. Brincat, and A. Pasupathy. Transformation of shape information in the ventral pathway. Curr Opin Neurobiol, 17(2):140\u20137, 2007.\n[7] S. V. David, B. Y. Hayden, and J. L. Gallant. Spectral receptive field properties explain shape selectivity in area V4. J Neurophysiol, 96(6):3492\u2013505, 2006.\n[8] R. Desimone, T. D. Albright, C. G. Gross, and C. Bruce. Stimulus-selective properties of inferior temporal neurons in the macaque. J Neurosci, 4(8):2051\u201362, 1984.\n[9] J. J. DiCarlo, D. Zoccolan, and N. C. Rust. How does the brain solve visual object recognition? Neuron, 73(3):415\u201334, 2012.\n[10] P.E. Downing, Y. Jiang, M. Shuman, and N. Kanwisher. A cortical area selective for visual processing of the human body. Science, 293:2470\u20132473, 2001.\n[11] D.J. Felleman and D.C. Van Essen. Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex, 1:1\u201347, 1991.\n[12] J. Freeman and E. Simoncelli. Metamers of the ventral stream. Nature Neuroscience, 14(9):1195\u20131201, 2011.\n[13] K. Grill-Spector, Z. Kourtzi, and N. Kanwisher. The lateral occipital complex and its role in object recognition. Vision research, 41(10-11):1409\u20131422, 2001.\n[14] C. P. Hung, G. Kreiman, T. Poggio, and J. J. DiCarlo. 
Fast readout of object identity from macaque inferior temporal cortex. Science, 310(5749):863\u20136, 2005.\n[15] N. Kanwisher, J. McDermott, and M. M. Chun. The fusiform face area: a module in human extrastriate cortex specialized for face perception. J Neurosci, 17(11):4302\u201311, 1997.\n[16] R. Kiani, H. Esteky, K. Mirpour, and K. Tanaka. Object category structure in response patterns of neuronal population in monkey inferior temporal cortex. J Neurophysiol, 97(6):4296\u2013309, 2007.\n[17] Z. Kourtzi and N. Kanwisher. Representation of perceived object shape by the human lateral occipital complex. Science, 293(5534):1506\u20131509, 2001.\n[18] N. Kriegeskorte. Relating population-code representations between man, monkey, and computational models. Frontiers in Neuroscience, 3(3):363, 2009.\n[19] N. Kriegeskorte, M. Mur, D. A. Ruff, R. Kiani, J. Bodurka, H. Esteky, K. Tanaka, and P. A. Bandettini. Matching categorical object representations in inferior temporal cortex of man and monkey. Neuron, 60(6):1126\u201341, 2008.\n[20] A. Krizhevsky, I. Sutskever, and G. Hinton. ImageNet classification with deep convolutional neural networks. Advances in Neural Information Processing Systems, 2012.\n[21] P. Lennie and J. A. Movshon. Coding of color and form in the geniculostriate visual pathway (invited review). J Opt Soc Am A Opt Image Sci Vis, 22(10):2013\u201333, 2005.\n[22] D. G. Lowe. Distinctive image features from scale-invariant keypoints. IJCV, 2004.\n[23] N. Majaj, H. Najib, E. Solomon, and J.J. DiCarlo. A unified neuronal population code fully explains human object recognition. In Computational and Systems Neuroscience (COSYNE), 2012.\n[24] David Marr, Tomaso Poggio, and Shimon Ullman. Vision. A Computational Investigation Into the Human Representation and Processing of Visual Information. MIT Press, July 2010.\n[25] J. Mutch and D. G. Lowe. 
Object class recognition and localization using sparse features with limited receptive fields. IJCV, 2008.\n[26] Bruno A Olshausen and David J Field. How close are we to understanding V1? Neural computation, 17(8):1665\u20131699, 2005.\n[27] N. Pinto, D. D. Cox, and J. J. DiCarlo. Why is Real-World Visual Object Recognition Hard? PLoS Comput Biol, 2008.\n[28] N. Pinto, D. Doukhan, J. J. DiCarlo, and D. D. Cox. A High-Throughput Screening Approach to Discovering Good Forms of Biologically Inspired Visual Representation. PLoS Computational Biology, 5(11), 2009.\n[29] T. Serre, A. Oliva, and T. Poggio. A feedforward architecture accounts for rapid categorization. Proc Natl Acad Sci U S A, 104(15):6424\u20139, 2007.\n[30] Thomas Serre, Gabriel Kreiman, Minjoon Kouh, Charles Cadieu, Ulf Knoblich, and Tomaso Poggio. A quantitative theory of immediate visual recognition. In Prog. Brain Res., volume 165, pages 33\u201356. Elsevier, 2007.\n", "award": [], "sourceid": 1417, "authors": [{"given_name": "Daniel", "family_name": "Yamins", "institution": "MIT"}, {"given_name": "Ha", "family_name": "Hong", "institution": "MIT"}, {"given_name": "Charles", "family_name": "Cadieu", "institution": "MIT"}, {"given_name": "James", "family_name": "DiCarlo", "institution": "MIT"}]}