{"title": "Toward Goal-Driven Neural Network Models for the Rodent Whisker-Trigeminal System", "book": "Advances in Neural Information Processing Systems", "page_first": 2555, "page_last": 2565, "abstract": "In large part, rodents \u201csee\u201d the world through their whiskers, a powerful tactile sense enabled by a series of brain areas that form the whisker-trigeminal system. Raw sensory data arrives in the form of mechanical input to the exquisitely sensitive, actively-controllable whisker array, and is processed through a sequence of neural circuits, eventually arriving in cortical regions that communicate with decision making and memory areas. Although a long history of experimental studies has characterized many aspects of these processing stages, the computational operations of the whisker-trigeminal system remain largely unknown. In the present work, we take a goal-driven deep neural network (DNN) approach to modeling these computations. First, we construct a biophysically-realistic model of the rat whisker array. We then generate a large dataset of whisker sweeps across a wide variety of 3D objects in highly-varying poses, angles, and speeds. Next, we train DNNs from several distinct architectural families to solve a shape recognition task in this dataset. Each architectural family represents a structurally-distinct hypothesis for processing in the whisker-trigeminal system, corresponding to different ways in which spatial and temporal information can be integrated. We find that most networks perform poorly on the challenging shape recognition task, but that specific architectures from several families can achieve reasonable performance levels. Finally, we show that Representational Dissimilarity Matrices (RDMs), a tool for comparing population codes between neural systems, can separate these higher performing networks with data of a type that could plausibly be collected in a neurophysiological or imaging experiment. 
Our results are a proof-of-concept that DNN models of the whisker-trigeminal system are potentially within reach.", "full_text": "Toward Goal-Driven Neural Network Models for the\n\nRodent Whisker-Trigeminal System\n\nChengxu Zhuang\n\nDepartment of Psychology\n\nStanford University\nStanford, CA 94305\n\nchengxuz@stanford.edu\n\nJonas Kubilius\n\nDepartment of Brain and Cognitive Sciences\n\nMassachusetts Institute of Technology\n\nCambridge, MA 02139\n\nBrain and Cognition, KU Leuven, Belgium\n\nqbilius@mit.edu\n\nDaniel Yamins\n\nDepartments of Psychology and Computer Science\n\nStanford Neurosciences Institute\n\nStanford University\nStanford, CA 94305\n\nyamins@stanford.edu\n\nMitra Hartmann\n\nDepartments of Biomedical Engineering\nand Mechanical Engineering\n\nNorthwestern University\n\nEvanston, IL 60208\n\nhartmann@northwestern.edu\n\nAbstract\n\nIn large part, rodents \u201csee\u201d the world through their whiskers, a powerful tactile\nsense enabled by a series of brain areas that form the whisker-trigeminal system.\nRaw sensory data arrives in the form of mechanical input to the exquisitely sensitive,\nactively-controllable whisker array, and is processed through a sequence of neural\ncircuits, eventually arriving in cortical regions that communicate with decision-\nmaking and memory areas. Although a long history of experimental studies has\ncharacterized many aspects of these processing stages, the computational operations\nof the whisker-trigeminal system remain largely unknown. In the present work,\nwe take a goal-driven deep neural network (DNN) approach to modeling these\ncomputations. First, we construct a biophysically-realistic model of the rat whisker\narray. We then generate a large dataset of whisker sweeps across a wide variety\nof 3D objects in highly-varying poses, angles, and speeds. Next, we train DNNs\nfrom several distinct architectural families to solve a shape recognition task in\nthis dataset. 
Each architectural family represents a structurally-distinct hypothesis\nfor processing in the whisker-trigeminal system, corresponding to different ways\nin which spatial and temporal information can be integrated. We \ufb01nd that most\nnetworks perform poorly on the challenging shape recognition task, but that speci\ufb01c\narchitectures from several families can achieve reasonable performance levels.\nFinally, we show that Representational Dissimilarity Matrices (RDMs), a tool for\ncomparing population codes between neural systems, can separate these higher-\nperforming networks with data of a type that could plausibly be collected in a\nneurophysiological or imaging experiment. Our results are a proof-of-concept that\nDNN models of the whisker-trigeminal system are potentially within reach.\n\n1 Introduction\n\nThe sensory systems of brains do remarkable work in extracting behaviorally useful information from\nnoisy and complex raw sense data. Vision systems process intensities from retinal photoreceptor\narrays, auditory systems interpret the amplitudes and frequencies of hair-cell displacements, and\nsomatosensory systems integrate data from direct physical interactions [28]. Although these systems\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\fFigure 1: Goal-Driven Approach to Modeling Barrel Cortex: a. Rodents have highly sensitive whisker\n(vibrissal) arrays that provide input data about the environment. Mechanical signals from the vibrissae are\nrelayed by primary sensory neurons of the trigeminal ganglion to the trigeminal nuclei, the origin of multiple\nparallel pathways to S1 and S2. (Figure modi\ufb01ed from [8].) This system is a prime target for modeling because\nit is likely to be richly representational, but its computational underpinnings are largely unknown. 
Our long-term\napproach to modeling the whisker-trigeminal system is goal-driven: using an arti\ufb01cial whisker-array input\ndevice built using extensive biophysical measurements (b.), we seek to optimize neural networks of various\narchitectures (c.) to solve ethologically-relevant shape recognition tasks (d.), and then measure the extent to\nwhich these networks predict \ufb01ne-grained response patterns in real neural recordings.\n\ndiffer radically in their input modalities, total number of neurons, and speci\ufb01c neuronal microcircuits,\nthey share two fundamental characteristics. First, they are hierarchical sensory cascades, albeit\nwith extensive feedback, consisting of sequential processing stages that together produce a complex\ntransformation of the input data. Second, they operate in inherently highly-structured spatiotemporal\ndomains, and are generally organized in maps that re\ufb02ect this structure [11].\nExtensive experimental work in the rodent whisker-trigeminal system has provided insights into how\nthese principles help rodents use their whiskers (also known as vibrissae) to tactually explore objects\nin their environment. Similar to hierarchical processing in the visual system (e.g., from V1 to V2, V4\nand IT [11, 12]), processing in the somatosensory system is also known to be hierarchical[27, 17, 18].\nFor example, in the whisker trigeminal system, information from the whiskers is relayed from primary\nsensory neurons in the trigeminal ganglion to multiple trigeminal nuclei; these nuclei are the origin\nof several parallel pathways conveying information to the thalamus [36, 24] and then to primary and\nsecondary somatosensory cortex (S1 and S2) [4]. 
However, although the rodent somatosensory system\nhas been the subject of extensive experimental efforts[2, 26, 20, 32], there have been comparatively\nfew attempts at computational modeling of this important sensory system.\nRecent work has shown that deep neural networks (DNNs), whose architectures inherently contain\nhierarchy and spatial structure, can be effective models of neural processing in vision[34, 21] and\naudition[19]. Motivated by these successes, in this work we illustrate initial steps toward using\nDNNs to model rodent somatosensory systems. Our driving hypothesis is that the vibrissal-trigeminal\nsystem is optimized to use whisker-based sensor data to solve somatosensory shape-recognition\ntasks in complex, variable real-world environments. The underlying idea of this approach is thus to\nuse goal-driven modeling (Fig 1), in which the DNN parameters \u2014 both discrete and continuous\n\u2014 are optimized for performance on a challenging ethologically-relevant task[35]. Insofar as shape\nrecognition is a strong constraint on network parameters, optimized neural networks resulting from\nsuch a task may be an effective model of real trigeminal-system neural response patterns.\nThis idea is conceptually straightforward, but implementing it involves surmounting several chal-\nlenges. Unlike vision or audition, where signals from the retina or cochlea can for many purposes\nbe approximated by a simple structure (namely, a uniform data array representing light or sound\nintensities and frequencies), the equivalent mapping from stimulus (e.g. object in a scene) to sensor\ninput in the whisker system is much less direct. Thus, a biophysically-realistic embodied model of\nthe whisker array is a critical \ufb01rst component of any model of the vibrissal system. Once the sensor\narray is available, a second key problem is building a neural network that can accept whisker data\ninput and use it to solve relevant tasks. 
Aside from the question of the neural network design itself,\n\n\fFigure 2: Dynamic Three-Dimensional Whisker Model: a. Each whisker element is composed of a set of\ncuboid links. The follicle cuboid has a \ufb01xed location, and is attached to movable cuboids making up the rest of\nthe whisker. Motion is constrained by linear and torsional springs between each pair of cuboids. The number of\ncuboid links and spring equilibrium displacements are chosen to match known whisker length and curvature [31],\nwhile damping and spring stiffness parameters are chosen to ensure mechanically plausible whisker motion\ntrajectories. b. We constructed a 31-whisker array, arranged in a rough 5x7 grid (with 4 missing elements) on an\nellipsoid representing the rodent\u2019s mystacial pad. Whisker number and placement was matched to the known\nanatomy of the rat [31]. c. During dataset construction, the array is brought into contact with each object at three\nvertical heights, and four 90\u25e6-separated angles, for a total of 12 sweeps. The object\u2019s size, initial orientation\nangle, as well as sweep speed, vary randomly between each group of 12 sweeps. Forces and torques are recorded\nat the three cuboids closest to the follicle, for a total of 18 measurements per whisker at each timepoint. d. Basic\nvalidation of performance of binary linear classi\ufb01er trained on raw sensor output to distinguish between two\nshapes (in this case, a duck versus a teddy bear). The classi\ufb01er was trained/tested on several equal-sized datasets\nin which variation on one or more latent variable axes has been suppressed. \u201cNone\u201d indicates that all variations\nare present. 
Dotted line represents chance performance (50%).\nknowing what the \u201crelevant tasks\u201d are for training a rodent whisker system, in a way that is suf\ufb01ciently\nconcrete to be practically actionable, is a signi\ufb01cant unknown, given the very limited amount of\nethologically-relevant behavioral data on rodent sensory capacities[32, 22, 25, 1, 9]. Collecting neural\ndata of suf\ufb01cient coverage and resolution to quantitatively evaluate one or more task-optimized neural\nnetwork models represents a third major challenge. In this work, we show initial steps toward the\n\ufb01rst two of these problems (sensor modeling and neural network design/training).\n2 Modeling the Whisker Array Sensor\nIn order to provide our neural networks with inputs similar to those of the rodent vibrissal system, we\nconstructed a physically-realistic three-dimensional (3D) model of the rodent vibrissal array (Fig. 2).\nTo help ensure biological realism, we used an anatomical model of the rat head and whisker array that\nquanti\ufb01es whisker number, length, and intrinsic curvature as well as relative position and orientation\non the rat\u2019s face [31]. We wanted the mechanics of each whisker to be reasonably accurate, but at\nthe same time, also needed simulations to be fast enough to generate a large training dataset. We\ntherefore used Bullet [33], an open-source real-time physics engine used in many video games.\nStatics. Individual whiskers were each modeled as chains of \u201ccuboid\u201d links with a square cross-\nsection and length of 2mm. The number of links in each whisker was chosen to ensure that the total\nwhisker length matched that of the corresponding real whisker (Fig. 2 a). The \ufb01rst (most proximal)\nlink of each simulated whisker corresponded to the follicle at the whisker base, where the whisker\ninserts into the rodent\u2019s face. Each whisker follicle was \ufb01xed to a single location in 3D space. 
The\nlinks of the whisker are given \ufb01rst-order linear and rotational damping factors to ensure that unforced\nmotions dissipate over time. To simplify the model, the damping factors were assumed to be the same\nacross all links of a given whisker, but different from whisker to whisker. Each pair of links within\na whisker was connected with linear and torsional \ufb01rst-order springs; these springs both have two\nparameters (equilibrium displacement and stiffness). The equilibrium displacements of each spring\nwere chosen to ensure that the whisker\u2019s overall static shape matched the measured curvature for the\ncorresponding real whisker. Although we did not speci\ufb01cally seek to match the detailed biophysics\nof the whisker mechanics (e.g. the fact that the stiffness of the whisker increases with the 4th power\nof its radius), we assumed that the stiffness of the springs spanning a given length was linearly\ncorrelated to the distance between the starting position of the spring and the base, roughly capturing\nthe fact that the whisker is thicker and stiffer at the bottom [13].\nThe full simulated whisker array consisted of 31 simulated whiskers, ranging in length from 8mm\nto 60mm (Fig. 2b). The \ufb01xed locations of the follicles of the simulated whiskers were placed on\na curved ellipsoid surface modeling the rat\u2019s mystacial pad (cheek), with the relative locations of\n\n\fthe follicles on this surface obtained from the morphological model [31], forming roughly a 5 \u00d7 7\ngrid-like pattern with four vacant positions.\nDynamics. 
Whisker dynamics are generated by collisions with moving three-dimensional rigid\nbodies, also modeled as Bullet physics objects. The motion of a simulated whisker in reaction to\nexternal forces from a collision is constrained only by the \ufb01xed spatial location of the follicle, and\nby the damped dynamics of the springs at each node of the whisker. However, although the spring\nequilibrium displacements are determined by static measurements as described above, the damping\nfactors and spring stiffnesses cannot be fully determined from these data. If we had detailed dynamic\ntrajectories for all whiskers during realistic motions (e.g. [29]), we would have used this data to\ndetermine these parameters, but such data are not yet available.\nIn the absence of empirical trajectories, we used a heuristic method to determine damping and\nstiffness parameters, maximizing the \u201cmechanical plausibility\u201d of whisker behavior. Speci\ufb01cally, we\nconstructed a battery of scenarios in which forces were applied to each whisker for a \ufb01xed duration.\nThese scenarios included pushing the whisker tip towards its base (axial loading), as well as pushing\nthe whisker parallel or perpendicular to its intrinsic curvature (transverse loading in or out of the plane\nof intrinsic curvature). For each scenario and each potential setting of the unknown parameters, we\nsimulated the whisker\u2019s recovery after the force was removed, measuring the maximum displacement\nbetween the whisker base and tip caused by the force prior to recovery (d), the total time to recovery\n(T ), the average arc length travelled by each cuboid during recovery (S), and the average translational\nspeed of each cuboid during recovery (v). We used metaparameter optimization [3] to automatically\nidentify stiffness and damping parameters that simultaneously minimized the time and complexity of\nthe recovery trajectory, while also allowing the whisker to be \ufb02exible. 
Speci\ufb01cally, we minimized the\nloss function 0.025S + d + 20T \u2212 2v, where the coef\ufb01cients were set to make terms of comparable\nmagnitude. The optimization was performed for every whisker independently, as whisker length and\ncurvature interact nonlinearly with its recovery dynamics.\n\n3 A Large-Scale Whisker Sweep Dataset\nUsing the whisker array, we generated a dataset of whisker responses to a variety of objects.\nSweep Con\ufb01guration. The dataset consists of a series of simulated sweeps, mimicking one action in\nwhich the rat runs its whiskers past an object while holding its whiskers \ufb01xed (no active whisking).\nDuring each sweep, a single 3D object moves through the whisker array from front to back (rostral to\ncaudal) at a constant speed. Each sweep lasts a total of one second, and data is sampled at 110Hz.\nSweep scenarios vary both in terms of the identity of the object presented, as well as the position,\nangle, scale (de\ufb01ned as the length of longest axis), and speed at which it is presented. To simulate\nobserved rat whisking behavior in which animals often sample an object at several vertical locations\n(head pitches) [14], sweeps are performed at three different heights along the vertical axis and at each\nof four positions around the object (0\u25e6, 90\u25e6, 180\u25e6, and 270\u25e6 around the vertical axis), for a total of 12\nsweeps per object/latent variable setting (Fig. 2c).\nLatent variable settings are sampled randomly and independently on each group of sweeps, with\nobject rotation sampled uniformly within the space of all 3D rotations, object scale sampled uniformly\nbetween 25-135mm, and sweep speed sampled randomly between 77-154mm/s. 
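As an illustrative sketch of the latent-variable sampling just described (Python/NumPy; not the authors' code, and the function and variable names here are hypothetical), a uniformly random 3D rotation can be drawn as a random unit quaternion via Shoemake's construction, with scale and speed drawn from the stated ranges:

```python
import numpy as np

def sample_sweep_latents(rng):
    """Sample one sweep group's latent variables as described in the text:
    rotation uniform over all 3D rotations, scale 25-135 mm, speed 77-154 mm/s."""
    # Shoemake's method: this construction yields a unit quaternion whose
    # corresponding rotation is uniformly distributed over SO(3).
    u1, u2, u3 = rng.uniform(size=3)
    quat = np.array([
        np.sqrt(1.0 - u1) * np.sin(2.0 * np.pi * u2),
        np.sqrt(1.0 - u1) * np.cos(2.0 * np.pi * u2),
        np.sqrt(u1) * np.sin(2.0 * np.pi * u3),
        np.sqrt(u1) * np.cos(2.0 * np.pi * u3),
    ])
    scale_mm = rng.uniform(25.0, 135.0)    # length of the object's longest axis
    speed_mm_s = rng.uniform(77.0, 154.0)  # sweep speed
    return quat, scale_mm, speed_mm_s

rng = np.random.default_rng(0)
quat, scale, speed = sample_sweep_latents(rng)
```

Any method producing Haar-uniform rotations would serve equally well; the quaternion construction is just one standard choice.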
Once these variables\nare chosen, the object is placed at a position that is chosen uniformly in a 20 \u00d7 8 \u00d7 20mm3 volume\ncentered in front of the whisker array at the chosen vertical height, and is moved along the ray toward\nthe center of the whisker array at the chosen speed. The position of the object may be adjusted to avoid\ncollisions with the \ufb01xed whisker base ellipsoid during the sweep. See supplementary information for\ndetails.\nThe data collected during a sweep includes, for each whisker, the forces and torques from all springs\nconnecting to the three cuboids most proximate to the base of the whisker. This choice re\ufb02ects the idea\nthat mechanoreceptors are distributed along the entire length of the follicle at the whisker base [10].\nThe collected data comprises a matrix of shape 110 \u00d7 31 \u00d7 3 \u00d7 2 \u00d7 3, with dimensions respectively\ncorresponding to: the 110 time samples; the 31 spatially distinct whiskers; the 3 recorded cuboids;\nthe forces and torques from each cuboid; and the three directional components of force/torque.\nObject Set. The objects used in each sweep are chosen from a subset of the ShapeNet [6] dataset,\nwhich contains over 50,000 3D objects, each with a distinct geometry, belonging to 55 categories.\nBecause the 55 ShapeNet categories are at a variety of levels of within-category semantic similarity,\nwe re\ufb01ned the original 55 categories into a taxonomy of 117 (sub)categories that we felt had a more\n\n\fFigure 3: Families of DNN Architectures tested: a. \u201cSpatiotemporal\u201d models include spatiotemporal\nintegration at all stages. Convolution is performed on both spatial and temporal data dimensions, followed\nby one or several fully connected layers. b. \u201cTemporal-Spatial\u201d networks in which temporal integration is\nperformed separately before spatial integration. 
Temporal integration consists of one-dimensional convolution\nover the temporal dimension, separately for each whisker. In spatial integration stages, outputs from each\nwhisker are registered to their natural two-dimensional (2D) spatial grid and spatial convolution performed. c. In\n\u201cSpatial-Temporal\u201d networks, spatial convolution is performed \ufb01rst, replicated with shared weights across time\npoints; this is then followed by temporal convolution. d. Recurrent networks do not explicitly contain separate\nunits to handle different discrete timepoints, relying instead on the states of the units to encode memory traces.\nThese networks can have local recurrence (e.g. simple addition or more complicated motifs like LSTMs or\nGRUs), as well as long-range skip and feedback connections.\n\nuniform amount of within-category shape similarity. The distribution of number of ShapeNet objects\nis highly non-uniform across categories, so we randomly subsampled objects from large categories.\nThis procedure ensured that all categories contained approximately the same number of objects. Our\n\ufb01nal object set included 9,981 objects in 117 categories, ranging between 41 and 91 object exemplars\nper category (mean=85.3, median=91, std=10.2, see supplementary material for more details). To\ncreate the \ufb01nal dataset, for every object, 26 independent samples of rotation, scaling, and speed were\ndrawn and the corresponding group of 12 sweeps created. Out of these 26 sweep groups, 24 were\nadded to a training subset, while the remainder were reserved for testing.\nBasic Sensor Validation. To con\ufb01rm that the whisker array was minimally functional before\nproceeding to more complex models, we produced smaller versions of our dataset in which sweeps\nwere sampled densely for two objects (a bear and a duck). We also produced multiple easier versions\nof this dataset in which variation along one or several latent variables was suppressed. 
We then\ntrained binary support vector machine (SVM) classi\ufb01ers to report object identity in these datasets,\nusing only the raw sensor data as input, and testing classi\ufb01cation accuracy on held-out sweeps (Fig.\n2d). We found that with scale and object rotation variability suppressed (but with speed and position\nvariability retained), the sensor was able to nearly perfectly identify the objects. However, with all\nsources of variability present, the SVM was just above chance in its performance, and some combinations\nof variability were more challenging for the sensor than others (details can be found in supplementary\ninformation). Thus, we concluded that our virtual whisker array was basically functional, but that\nunprocessed sensor data cannot be used to directly read out object shape in anything but the most\nhighly controlled circumstances. As in the case of vision, it is exactly this circumstance that calls for\na deep cascade of sensory processing stages.\n4 Computational Architectures\nWe trained deep neural networks (DNNs) in a variety of different architectural families (Fig. 3). These\narchitectural families represent qualitatively different classes of hypotheses about the computations\nperformed by the stages of processing in the vibrissal-trigeminal system. The fundamental questions\nexplored by these hypotheses are how and where temporal and spatial information are integrated.\nWithin each architectural family, the differences between speci\ufb01c parameter settings represent nuanced\nre\ufb01nements of the larger hypothesis of that family. Parameter speci\ufb01cs include how many layers\nof each type are in the network, how many units are allocated to each layer, what kernel sizes are\nused at each layer, and so on. 
Biologically, these parameters may correspond to the number of brain\nregions (areas) involved, how many neurons these regions have relative to each other, and neurons\u2019\nlocal spatiotemporal receptive \ufb01eld sizes [35].\nSimultaneous Spatiotemporal Integration. In this family (Fig. 3a), networks consisted\nof convolution layers followed by one or more fully connected layers. Convolution is performed\n\n\fsimultaneously on both temporal and spatial dimensions of the input (and their corresponding\ndownstream dimensions). In other words, temporally-proximal responses from spatially-proximal\nwhiskers are combined together simultaneously, so that neurons in each successive layer have larger\nreceptive \ufb01elds in both spatial and temporal dimensions at once. We evaluated both 2D convolution,\nin which the spatial dimension is indexed linearly across the list of whiskers (\ufb01rst by vertical columns\nand then by lateral row on the 5 \u00d7 7 grid), as well as 3D convolution in which the two dimensions of\nthe 5\u00d7 7 spatial grid are explicitly represented. Data from the three vertical sweeps of the same object\nwere then combined to produce the \ufb01nal output, culminating in a standard softmax cross-entropy.\nSeparate Spatial and Temporal Integration. In these families, networks begin by integrating tem-\nporal and spatial information separately (Fig. 3b-c). One subclass of these networks is \u201cTemporal-\nSpatial\u201d (Fig. 3b), which \ufb01rst integrates temporal information for each individual whisker separately\nand then combines the information from different whiskers in higher layers. 
Temporal processing\nis implemented as 1-dimensional convolution over the temporal dimension. After several layers of\ntemporal-only processing (the number of which is a parameter), the outputs at each whisker are then\nreshaped into vectors and combined into a 5 \u00d7 7 whisker grid. Spatial convolutions are then applied\nfor several layers. Finally, as with the spatiotemporal network described above, features from three\nsweeps are concatenated into a single fully connected layer which outputs softmax logits.\nConversely, \u201cSpatial-Temporal\u201d networks (Fig. 3c) \ufb01rst use 2D convolution to integrate across\nwhiskers for some number of layers, with shared parameters between the copies of the network\nfor each timepoint. The temporal sequence of outputs is then combined, and several layers of 1D\nconvolution are then applied in the temporal domain. Both Temporal-Spatial and Spatial-Temporal\nnetworks can be viewed as subclasses of 3D simultaneous spatiotemporal integration in which\ninitial and \ufb01nal portions of the network have kernel size 1 in the relevant dimensions. These two\nnetwork families can thus be thought of as two different strategies for allocating parameters between\ndimensions, i.e. different possible biological circuit structures.\nRecurrent Neural Networks with Skip and Feedback Connections. This family of networks (Fig.\n3d) does not allocate units or parameters explicitly for the temporal dimension, and instead requires\ntemporal processing to occur via the temporal update evolution of the system. These networks\nare built around a core feedforward 2D spatial convolution structure, with the addition of (i) local\nrecurrent connections, (ii) long-range feedforward skips between non-neighboring layers, and (iii)\nlong-range feedback connections. 
The most basic update rule for the dynamic trajectory of such a\nnetwork through (discrete) time is: H^i_{t+1} = F_i(\u2295_{j \u2260 i} R^j_t) + \u03c4_i H^i_t and R^i_t = A_i[H^i_t], where R^i_t\nand H^i_t are the output and hidden state of layer i at time t respectively, \u03c4_i are decay constants, \u2295\nrepresents concatenation across the channel dimension with appropriate resizing to align dimensions,\nF_i is the standard neural network update function (e.g. 2-D convolution), and A_i is the activation function\nat layer i. The learned parameters of this type of network include the parameters of F_i,\nwhich comprise both the feedforward and feedback weights from connections coming in to layer\ni, as well as the decay constants \u03c4_i. More sophisticated dynamics can be incorporated by replacing\nthe simple additive rule above with a local recurrent structure such as Long Short-Term Memory\n(LSTM) [15] or Gated Recurrent Networks (GRUs) [7].\n5 Results\nModel Performance: Our strategy in identifying potential models of the whisker-trigeminal system\nis to explore many speci\ufb01c architectures within each architecture family, evaluating each speci\ufb01c\narchitecture both in terms of its ability to solve the shape recognition task in our training dataset, and\nits ef\ufb01ciency (number of parameters and number of overall units). Because we evaluate networks on\nheld-out validation data, it is not inherently unfair to compare results from networks with different numbers\nof parameters, but for simplicity we generally evaluated models with similar numbers of parameters:\nexceptions are noted where they occur. As we evaluated many individual structures within each\nfamily, a list of the speci\ufb01c models and parameters is given in the supplementary materials.\n\n\fFigure 4: Performance results. a. Each bar in this \ufb01gure represents one model. The positive y-axis is\nperformance measured in percent correct (top1=dark bar, chance=0.85%, top5=light bar, chance=4.2%). The\nnegative y-axis indicates the number of units in networks, in millions of units. Small italic numbers indicate\nnumber of model parameters, in millions. Model architecture family is indicated by color. \"ncmf\" means\nn convolution and m fully connected layers. Detailed de\ufb01nition of individual model labels can be found in\nsupplementary material. b. Confusion Matrix for the highest-performing model (in the Temporal-Spatial family).\nThe objects are regrouped using methods described in supplementary material.\n\nOur results (Fig. 4) can be summarized with the following conclusions:\n\u2022 Many speci\ufb01c network choices within all families do a poor job at the task, achieving just-above-chance performance.\n\u2022 However, within each family, certain speci\ufb01c choices of parameters lead to much better network\nperformance. Overall, the best performance was obtained for the Temporal-Spatial model, with\n15.2% top-1 and 44.8% top-5 accuracy. Visualizing a confusion matrix for this network (Fig. 4b)\nand other high-performing networks indicates that the errors they make are generally reasonable.\n\u2022 Training the \ufb01lters was extremely important for performance; no architecture with random \ufb01lters\nperformed above chance levels.\n\u2022 Architecture depth was an important factor in performance. Architectures with fewer than four\nlayers achieved substantially lower performance than somewhat deeper ones.\n\u2022 Number of model parameters was a somewhat important factor in performance within an archi-\ntectural family, but only to a point, and not between architectural families. 
The Temporal-Spatial architecture was able to outperform other classes while using significantly fewer parameters.
• Recurrent networks with long-range feedback were able to perform nearly as well as the Temporal-Spatial model with equivalent numbers of parameters, while using far fewer units. These long-range feedbacks appeared critical to performance, with purely local recurrent architectures (including LSTM and GRU) achieving significantly worse results.

Model Discrimination: The above results indicate that we identified several high-performing networks in quite distinct architecture families. In other words, the strong performance constraint allows us to identify several specific candidate model networks for the biological system, reducing a much larger set of mostly non-performing neural networks to a "shortlist". The key biologically relevant follow-up question is then: how should we distinguish between the elements of the shortlist? That is, what reliable signatures of the differences between these architectures could be extracted from data obtainable in experiments that use today's neurophysiological tools?
To address this question, we used Representational Dissimilarity Matrix (RDM) analysis [23]. For a set of stimuli S, RDMs are |S| × |S| correlation-distance matrices taken over the feature dimensions of a representation, i.e. matrices with ij-th entry RDM[i, j] = 1 − corr(F[i], F[j]) for stimuli i, j and corresponding feature outputs F[i], F[j]. The RDM characterizes the geometry of the stimulus representation in a way that is independent of the individual feature dimensions. RDMs can thus be quantitatively compared between different feature representations of the same data.
This procedure has been useful in establishing connections between deep neural networks and the ventral visual stream, where it has been shown that the RDMs of features from different layers of neural networks trained to solve categorization tasks match the RDMs computed from visual brain areas at different positions along the ventral visual hierarchy [5, 34, 21]. RDMs are readily computable from samples of neural response pattern data, and are in general comparatively robust to variability due to experimental randomness (e.g. electrode/voxel sampling). RDMs for real neural populations from the rodent whisker-trigeminal system could be obtained through a conceptually simple electrophysiological recording experiment similar in spirit to those performed in macaque [34].
We obtained RDMs for several of our high-performing models, computing RDMs separately for each model layer (Fig. 5a) and averaging feature vectors over different sweeps of the same object before computing the correlations. This procedure led to 9981 × 9981 matrices (there were 9,981 distinct objects in our dataset). We then computed distances between each layer of each model in RDM space, as in (e.g.) [21]. To determine whether differences in this space between models and/or layers were significant, we computed RDMs for multiple instances of each model trained from different initial conditions, and compared the between-model to the within-model distances. We found that while the top layers of models partially converged (likely because they were all trained on the same task), intermediate layers diverged substantially between models, by amounts larger than either the initial-condition-induced variability within a model layer or the distance between nearby layers of the same model (Fig. 5b). This observation is important from an experimental design point of view because it shows that different model architectures differ substantially on a well-validated metric that may be experimentally feasible to measure.

Figure 5: Using RDMs to Discriminate Between High-Performing Models. a. Representational Dissimilarity Matrices (RDMs) for selected layers of a high-performing network from Fig. 4a, showing early, intermediate and late model layers. Model feature vectors are averaged over classes in the dataset prior to RDM computation, and RDMs are shown using the same ordering as in Fig. 4b. b. Two-dimensional MDS embedding of RDMs for the feedback RNN (green squares) and Temporal-Spatial (red circles) models. Points correspond to layers; lines are drawn between adjacent layers, with darker color indicating earlier layers. Multiple lines are models trained from different initial conditions, allowing a within-model noise estimate.

6 Conclusion

We have introduced a model of the rodent whisker array informed by biophysical data, and used it to generate a large high-variability synthetic sweep dataset. While the raw sensor data is sufficiently powerful to separate objects at low amounts of variability, at higher variation levels deeper nonlinear neural networks are required to extract object identity. We found further that while many particular network architectures, especially shallow ones, fail to solve the shape recognition task, reasonable performance levels can be obtained for specific architectures within each distinct network structural family tested. We then showed that a population-level measurement that is in principle experimentally obtainable can distinguish between these higher-performing networks.
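The RDM construction and the RDM-space comparison used above can be sketched in a few lines of NumPy. The formula RDM[i, j] = 1 − corr(F[i], F[j]) is from the text; taking Pearson correlation distance over the RDMs' upper triangles as the between-representation distance is one common convention and an assumption here, not necessarily the exact metric of [21]:

```python
import numpy as np

def compute_rdm(F):
    """RDM for a (stimuli x features) response matrix F:
    RDM[i, j] = 1 - corr(F[i], F[j]).
    np.corrcoef treats rows as variables, i.e. correlates stimuli
    across the feature dimensions."""
    return 1.0 - np.corrcoef(F)

def rdm_distance(F_a, F_b):
    """Distance between two representations of the same stimuli:
    correlation distance between the upper triangles of their RDMs
    (one common choice; an assumption, not the paper's exact metric)."""
    iu = np.triu_indices(F_a.shape[0], k=1)
    ra, rb = compute_rdm(F_a)[iu], compute_rdm(F_b)[iu]
    return 1.0 - np.corrcoef(ra, rb)[0, 1]
```

Because the RDM depends only on pairwise stimulus similarities, `rdm_distance` can compare a model layer (features = units) against a recorded population (features = neurons or voxels) without any unit-to-neuron mapping.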
To summarize, we have shown that a goal-driven DNN approach to modeling the whisker-trigeminal system is feasible. Code for all results, including the whisker model and neural networks, is publicly available at https://github.com/neuroailab/whisker_model.
We emphasize that the present work is a proof-of-concept rather than a model of the real nervous system. A number of critical issues must be overcome before our true goal, a full integration of computational modeling with experimental data, becomes possible. First, although our sensor model was biophysically informed, it does not include active whisking, and the mechanical signals at the whisker bases are approximate [29, 16].
An equally important problem is that the goal we set for our network, i.e. shape discrimination between 117 human-recognizable object classes, is not directly ethologically relevant to rodents. The primary reason for this task choice was practical: ShapeNet is a readily available and high-variability source of 3D objects. If we had instead used a small, manually constructed set of highly simplified objects that we hoped were more "rat-relevant", it is likely that our task would have been too simple to constrain neural networks at the scale of the real whisker-trigeminal system. Extrapolating from modeling of the visual system, training a deep net on 1000 image categories yields a feature basis that can readily distinguish between previously-unobserved categories [34, 5, 30]. Similarly, we suggest that the large and variable object set used here may provide a meaningful constraint on network structure, as the specific object geometries may be less important than having a wide spectrum of such geometries.
However, a key next priority is systematically building an appropriately large and variable set of objects, textures, or other class boundaries that more realistically model the tasks a rodent faces. The specific results obtained (e.g. which families are better than others, and the exact structure of learned representations) are likely to change significantly when these improvements are made.
In concert with these improvements, we plan to collect neural data in several areas within the whisker-trigeminal system, enabling us to make direct comparisons between model outputs and neural responses with metrics such as the RDM. There are few existing experimentally validated signatures of the computations in the whisker-trigeminal system. Ideally, we will validate one or a small number of the specific model architectures described above by identifying a detailed mapping of model internal layers to brain-area-specific response patterns. A core experimental issue is the magnitude of real experimental noise in trigeminal-system RDMs. We will need to show that this noise does not swamp inter-model distances (as shown in Fig. 5b), enabling us to reliably identify which model(s) are better predictors of the neural data. Though real neural RDM noise cannot yet be estimated, the inter-model RDM distances that we can already compute will be useful for informing experimental design decisions (e.g. trial count, stimulus set size, etc.).
In the longer term, we expect to use detailed encoding models of the whisker-trigeminal system as a platform for investigating issues of representation learning and sensory-based decision making in the rodent.
A particularly attractive option is to go beyond fixed class discrimination problems and situate a synthetic whisker system on a mobile animal in a navigational environment, where it will be faced with a variety of actively-controlled discrete and continuous estimation problems. In this context, we hope to replace our currently supervised loss function with a more naturalistic reinforcement-learning-based goal. By doing this work in a rich sensory domain in rodents, we seek to leverage the sophisticated neuroscience tools available in these systems to go beyond what might be possible in other model systems.

7 Acknowledgements

This project was sponsored in part by a hardware donation from the NVIDIA Corporation, a James S. McDonnell Foundation Award (No. 220020469) and an NSF Robust Intelligence grant (No. 1703161) to DLKY, the European Union's Horizon 2020 research and innovation programme (No. 705498) to JK, and NSF awards (IOS-0846088 and IOS-1558068) to MJZH.

References

[1] Ehsan Arabzadeh, Erik Zorzin, and Mathew E. Diamond. Neuronal encoding of texture in the whisker sensory pathway. PLoS Biology, 3(1), 2005.

[2] Michael Armstrong-James, Kevin Fox, and Ashis Das-Gupta. Flow of excitation within rat barrel cortex on striking a single vibrissa. Journal of Neurophysiology, 68(4):1345–1358, 1992.

[3] James Bergstra, Dan Yamins, and David D. Cox. Hyperopt: A Python library for optimizing the hyperparameters of machine learning algorithms. In Proceedings of the 12th Python in Science Conference, pages 13–20, 2013.

[4] Laurens W. J. Bosman, Arthur R. Houweling, Cullen B. Owens, Nouk Tanke, Olesya T. Shevchouk, Negah Rahmati, Wouter H. T. Teunissen, Chiheng Ju, Wei Gong, Sebastiaan K. E. Koekkoek, et al. Anatomical pathways involved in generating and sensing rhythmic whisker movements.
Frontiers in Integrative Neuroscience, 5:53, 2011.

[5] Charles F. Cadieu, Ha Hong, Daniel L. K. Yamins, Nicolas Pinto, Diego Ardila, Ethan A. Solomon, Najib J. Majaj, and James J. DiCarlo. Deep neural networks rival the representation of primate IT cortex for core visual object recognition. PLoS Computational Biology, 10(12):e1003963, 2014.

[6] Angel X. Chang, Thomas Funkhouser, Leonidas Guibas, Pat Hanrahan, Qixing Huang, Zimo Li, Silvio Savarese, Manolis Savva, Shuran Song, Hao Su, Jianxiong Xiao, Li Yi, and Fisher Yu. ShapeNet: An information-rich 3D model repository. arXiv preprint, 2015.

[7] Kyunghyun Cho, Bart Van Merriënboer, Dzmitry Bahdanau, and Yoshua Bengio. On the properties of neural machine translation: Encoder-decoder approaches. arXiv preprint arXiv:1409.1259, 2014.

[8] Martin Deschenes and Nadia Urbain. Vibrissal afferents from trigeminus to cortices. Scholarpedia, 4(5):7454, 2009.

[9] Mathew E. Diamond, Moritz von Heimendahl, Per Magne Knutsen, David Kleinfeld, and Ehud Ahissar. 'Where' and 'what' in the whisker sensorimotor system. Nature Reviews Neuroscience, 9(8):601–612, 2008.

[10] Satomi Ebara, Kenzo Kumamoto, Tadao Matsuura, Joseph E. Mazurkiewicz, and Frank L. Rice. Similarities and differences in the innervation of mystacial vibrissal follicle–sinus complexes in the rat and cat: a confocal microscopic study. Journal of Comparative Neurology, 449(2):103–119, 2002.

[11] Daniel J. Felleman and David C. Van Essen. Distributed hierarchical processing in the primate cerebral cortex. Cerebral Cortex, 1(1):1–47, 1991.

[12] Melvyn A. Goodale and A. David Milner. Separate visual pathways for perception and action. Trends in Neurosciences, 15(1):20–25, 1992.

[13] M. Hartmann. Vibrissa mechanical properties. Scholarpedia, 10(5):6636, 2015.

[14] Jennifer A. Hobbs, R. Blythe Towal, and Mitra J. Z. Hartmann.
Spatiotemporal patterns of contact across the rat vibrissal array during exploratory behavior. Frontiers in Behavioral Neuroscience, 9, 2015.

[15] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[16] Lucie A. Huet and Mitra J. Z. Hartmann. Simulations of a vibrissa slipping along a straight edge and an analysis of frictional effects during whisking. IEEE Transactions on Haptics, 9(2):158–169, 2016.

[17] Koji Inui, Xiaohong Wang, Yohei Tamura, Yoshiki Kaneoke, and Ryusuke Kakigi. Serial processing in the human somatosensory system. Cerebral Cortex, 14(8):851–857, 2004.

[18] Yoshiaki Iwamura. Hierarchical somatosensory processing. Current Opinion in Neurobiology, 8(4):522–528, 1998.

[19] A. Kell*, D. Yamins*, S. Norman-Haignere, and J. McDermott. Functional organization of auditory cortex revealed by neural networks optimized for auditory tasks. In Society for Neuroscience, 2015.

[20] Jason N. D. Kerr, Christiaan P. J. De Kock, David S. Greenberg, Randy M. Bruno, Bert Sakmann, and Fritjof Helmchen. Spatial organization of neuronal population responses in layer 2/3 of rat barrel cortex. Journal of Neuroscience, 27(48):13316–13328, 2007.

[21] Seyed-Mahdi Khaligh-Razavi and Nikolaus Kriegeskorte. Deep supervised, but not unsupervised, models may explain IT cortical representation. PLoS Computational Biology, 10(11):e1003915, 2014.

[22] Per Magne Knutsen, Maciej Pietr, and Ehud Ahissar. Haptic object localization in the vibrissal system: behavior and performance. Journal of Neuroscience, 26(33):8451–8464, 2006.

[23] Nikolaus Kriegeskorte, Marieke Mur, and Peter A. Bandettini. Representational similarity analysis - connecting the branches of systems neuroscience.
Frontiers in Systems Neuroscience, 2:4, 2008.

[24] Jeffrey D. Moore, Nicole Mercer Lindsay, Martin Deschênes, and David Kleinfeld. Vibrissa self-motion and touch are reliably encoded along the same somatosensory pathway from brainstem through thalamus. PLoS Biology, 13(9):e1002253, 2015.

[25] Daniel H. O'Connor, Simon P. Peron, Daniel Huber, and Karel Svoboda. Neural activity in barrel cortex underlying vibrissa-based object localization in mice. Neuron, 67(6):1048–1061, 2010.

[26] Carl C. H. Petersen, Amiram Grinvald, and Bert Sakmann. Spatiotemporal dynamics of sensory responses in layer 2/3 of rat barrel cortex measured in vivo by voltage-sensitive dye imaging combined with whole-cell voltage recordings and neuron reconstructions. Journal of Neuroscience, 23(4):1298–1309, 2003.

[27] T. P. Pons, P. E. Garraghty, David P. Friedman, and Mortimer Mishkin. Physiological evidence for serial processing in somatosensory cortex. Science, 237(4813):417–420, 1987.

[28] Dale Purves, George J. Augustine, David Fitzpatrick, Lawrence C. Katz, Anthony-Samuel LaMantia, James O. McNamara, and S. Mark Williams. Neuroscience, 3rd edition. Sunderland, MA: Sinauer Associates, 2001.

[29] Brian W. Quist, Vlad Seghete, Lucie A. Huet, Todd D. Murphey, and Mitra J. Z. Hartmann. Modeling forces and moments at the base of a rat vibrissa during noncontact whisking and whisking against an object. Journal of Neuroscience, 34(30):9828–9844, 2014.

[30] Ali S. Razavian, Hossein Azizpour, Josephine Sullivan, and Stefan Carlsson. CNN features off-the-shelf: an astounding baseline for recognition. In Computer Vision and Pattern Recognition Workshops (CVPRW), 2014 IEEE Conference on, pages 512–519. IEEE, 2014.

[31] R. Blythe Towal, Brian W. Quist, Venkatesh Gopal, Joseph H. Solomon, and Mitra J. Z. Hartmann. The morphology of the rat vibrissal array: A model for quantifying spatiotemporal patterns of whisker-object contact.
PLoS Computational Biology, 7(4), 2011.

[32] Moritz Von Heimendahl, Pavel M. Itskov, Ehsan Arabzadeh, and Mathew E. Diamond. Neuronal activity in rat barrel cortex underlying texture discrimination. PLoS Biology, 5(11):e305, 2007.

[33] Wikipedia. Bullet (software) — Wikipedia, the free encyclopedia, 2016. [Online; accessed 19-October-2016].

[34] Daniel L. K. Yamins, Ha Hong, Charles F. Cadieu, Ethan A. Solomon, Darren Seibert, and James J. DiCarlo. Performance-optimized hierarchical models predict neural responses in higher visual cortex. Proceedings of the National Academy of Sciences, 111(23):8619–8624, 2014.

[35] Daniel L. K. Yamins and James J. DiCarlo. Using goal-driven deep learning models to understand sensory cortex. Nature Neuroscience, 19(3):356–365, 2016.

[36] Chunxiu Yu, Dori Derdikman, Sebastian Haidarliu, and Ehud Ahissar. Parallel thalamic pathways for whisking and touch signals in the rat. PLoS Biology, 4(5):e124, 2006.