{"title": "Extracting regions of interest from biological images with convolutional sparse block coding", "book": "Advances in Neural Information Processing Systems", "page_first": 1745, "page_last": 1753, "abstract": "Biological tissue is often composed of cells with similar morphologies replicated throughout large volumes and many biological applications rely on the accurate identification of these cells and their locations from image data. Here we develop a generative model that captures the regularities present in images composed of repeating elements of a few different types. Formally, the model can be described as convolutional sparse block coding. For inference we use a variant of convolutional matching pursuit adapted to block-based representations. We extend the K-SVD learning algorithm to subspaces by retaining several principal vectors from the SVD decomposition instead of just one. Good models with little cross-talk between subspaces can be obtained by learning the blocks incrementally. We perform extensive experiments on simulated images and the inference algorithm consistently recovers a large proportion of the cells with a small number of false positives. We fit the convolutional model to noisy GCaMP6 two-photon images of spiking neurons and to Nissl-stained slices of cortical tissue and show that it recovers cell body locations without supervision. 
The flexibility of the block-based representation is reflected in the variability of the recovered cell shapes.", "full_text": "Extracting regions of interest from biological images with convolutional sparse block coding\n\nMarius Pachitariu1, Adam Packer2, Noah Pettit2, Henry Dalgleish2, Michael Hausser2 and Maneesh Sahani1\n1Gatsby Unit, UCL, UK {marius, maneesh}@gatsby.ucl.ac.uk\n2The Wolfson Institute for Biomedical Research, UCL, UK {a.packer, noah.pettit.10, henry.dalgleish.09, m.hausser}@ucl.ac.uk\n\nAbstract\n\nBiological tissue is often composed of cells with similar morphologies replicated throughout large volumes and many biological applications rely on the accurate identification of these cells and their locations from image data. Here we develop a generative model that captures the regularities present in images composed of repeating elements of a few different types. Formally, the model can be described as convolutional sparse block coding. For inference we use a variant of convolutional matching pursuit adapted to block-based representations. We extend the K-SVD learning algorithm to subspaces by retaining several principal vectors from the SVD decomposition instead of just one. Good models with little cross-talk between subspaces can be obtained by learning the blocks incrementally. We perform extensive experiments on simulated images and the inference algorithm consistently recovers a large proportion of the cells with a small number of false positives. We fit the convolutional model to noisy GCaMP6 two-photon images of spiking neurons and to Nissl-stained slices of cortical tissue and show that it recovers cell body locations without supervision. 
The flexibility of the block-based representation is reflected in the variability of the recovered cell shapes.\n\n1 Introduction\n\nFor evolutionary reasons, biological tissue at all spatial scales is composed of repeating patterns. This is because successful biological motifs are reused and multiplied by evolutionary pressures. At a small spatial scale, eukaryotic cells contain only a few types of major organelles, like mitochondria and vacuoles, and several dozen minor organelles, like vesicles and ribosomes. Each of the organelles is replicated a large number of times within each cell and has a distinctive visual appearance. At the scale of whole cells, most tissue types, like muscle and epithelium, are composed primarily of single cell types. Among the most diverse biological tissues is brain gray matter, which contains different types of neurons and glia, often spatially overlapping. Repetition is also encouraged at large spatial scales: striated muscles are made of similar axially-aligned fibers called sarcomeres, and human cortical surfaces are highly folded inside the skull, producing repeating surface patterns called gyri and sulci.\nMuch biological data at all spatial scales comes in the form of two- or three-dimensional images. Non-invasive techniques like magnetic resonance imaging allow visualization of details on the order of one millimeter. Cells in tissue can be seen with light microscopy and cellular organelles can be seen with the electron microscope. Given the stereotypical nature of biological motifs, these images often appear as collections of similar elements over a noisy background, as shown in figure 1(a). We developed a generative image model that automatically discovers the repeating motifs and segments biological images into the most common elements that form them. 
We apply the model to two-dimensional images composed of several hundred cells of possibly different types, such as images of cortical tissue expressing fluorescent GCaMP6, a calcium indicator, taken with a two-photon microscope in vivo. We also apply the model to Nissl-stained cortical tissue imaged in slice.\n\nFigure 1: a. Mean image of a two-photon recording of calcium-based fluorescence. b. Same image as in (a) after local subtractive and divisive normalization.\n\nEach experimental exposure can contain hundreds of cells and many exposures are usually taken over a single experimental session. Our main aim is to automate the cell detection stage, because tracing cell contours by hand can be a laborious and inexact process, especially given the multitude of confounds usually present in these images. One confound clearly visible in figure 1(a) is the large variation in contrast and luminance over a single image. A second confound, also visible in figure 1(a), is that many cells tend to cluster together and press their boundaries against each other. Assigning pixels to the correct cell can be difficult. A third confound is that calcium, the marker that the fluorescent images report, is present in the entire neuropil (in the dendrites and axons of the cells). Activation of calcium in the neuropil makes a noisy background for the estimation of cell somata. Given such large confounds, a properly-formulated image model is needed to resolve the ambiguities as well as the human eye can resolve them.\n\n1.1 Background on automated extraction of cell somata\nHistological examination of biological tissue with light microscopy is an important application for techniques of cell identification and segmentation. Most algorithms for identifying cell somata from such images are based on hand-crafted filtering and thresholding techniques. 
For example, [1] proposes a pipeline of as many as fourteen separate steps, each of which is meant to deal with some particular dimension of variability in the images. Our approach is to instead propose a fully generative model of the biological tissue which encapsulates our beliefs about the stereotypical structure of such images. Inference in the model inverts the generative model — or in other words deconvolves the image — and thereby replaces the filtering and thresholding techniques usually employed. Learning the parameters of the generative model replaces the hand-crafting of the filters and thresholds.\nFor one image type we use here, fluorescent images of neuronal tissue, the approach of [2] is closer in spirit to our methodology of model design and inference. The authors propose an independent components analysis (ICA) model of the movies which expresses their belief that all the pixels belonging to a cell should brighten together, but only rarely. The model effectively uses the temporal correlations between pixels to segment each image, much like [3], but the pipeline of [3] is manual and not model-designed like that of [2]. Both of these studies are different from our approach, because we aim to recover cell bodies from single images alone. The method of [2] applies well to small fields of view and large coherent fluorescence fluctuations in single cells, but fails when applied to our data with large fields of view containing hundreds of small neurons. The failure is due to long-range spatial correlations between many thousands of pixels which overcome the noisy correlations between the few dozen pixels belonging to each cell. Consequently, the independent components extracted by the algorithm of [2] have large spatial domains, as can be seen in supplemental figure 1. 
Our approach is robust to large non-local correlations because we analyze the mean image alone (the algorithm of [2] is available online at http://www.snl.salk.edu/~emukamel/). One advantage is that the resulting model can be applied not just to data from functional imaging experiments but to data from any imaging technique.\n1.2 Background on convolutional image models\nOur proposed image model is a novel extension of a family of recent algorithms based on sparse coding that are commonly used in object recognition experiments [4], [5], [6], [7], [8]. A starting point for our model was the convolutional matching pursuit (MP) implementation of [5] (but see [6] for more details). The authors show that convolutional MP learns a diverse set of basis functions from natural images. Most of these basis functions are edges, but some have a globular appearance and others represent curved edges and corners. Their implied generative model of an image is to pick out a few basis functions at random and place them at random locations. While this is a poor generative model for natural images, it is much better suited to biological images, which are composed of many repeating and seemingly randomly distributed elements of a few different types.\nOne disadvantage of convolutional MP as described by [6] is that it uses fixed templates for each dictionary element. Although it seems like the cells in figure 1(b) might be well described by a single ring shape, there are size and shape variations which could be better captured by more flexible templates. In general, we expect the repeating elements in a biological image to have similar appearances to a first approximation, but patterned variability is unavoidable. A better model of the image of a single cell might be to assume it was generated by combining a few different prototypes with different coefficients, effectively interpolating between the prototypes. 
We group the prototypes related to a single object into blocks and every image is formed by activating a small number of such blocks. We call this model sparse block coding. Note that the blocking principle is common in natural image modelling, where Gabor filters in quadrature are combined with different coefficients to produce edges of different spatial phases. Independent subspace analysis (ISA [7]) also entails distributing basis functions into non-overlapping blocks. However, in our formulation the blocks are either activated or not, while ISA assumes a continuous distribution on the activations of each block. This property of sparse block coding makes it valuable for making hard assignments of inferred cell locations, rather than giving a continuous coefficient for each location.\nCloser to our formulation, [8] have used a similar sparse block coding model on natural movie patches and added a temporal smoothness prior on the activation probabilities of blocks in consecutive movie frames. The expensive variational iterative techniques used by [8] for inference and learning in small image patches are computationally infeasible for the convolutional model of large images we present here. Instead, we use a convolutional block pursuit technique which is an extension of standard matching pursuit and has similarly low computational complexity even for arbitrarily large blocks and arbitrarily large images.\n2 Model\n2.1 Convolutional sparse block coding\nFollowing [8], we distinguish between identity and attribute variables in the generative model of each object in an image. An object can be a cell, a cell fragment or any other spatially-localized object. Identity variables h^k_xy, where (x, y) is the location of the object and k the type of object, are Bernoulli-distributed with very small prior probabilities. 
Each of the objects also has several continuous-valued attribute variables x^kl_xy, with l indexing the attribute. In the generative model these attributes are given a broad uniform probability and specify the coefficients with which a set of basis functions A^kl are combined at spatial location (x, y) before being linearly combined with objects generated at other locations. The full description of the generative process is best captured in terms of two-dimensional convolutions by the following set of equations\n\nh^k_xy ∼ Bernoulli(p)\nx^kl_xy ∼ N(0, σ_x^2)\ny ∼ Σ_{k,l} A^kl ∗ (x^kl ∘ h^k) + N(0, σ_y),\n\nwhere σ_y is the (small) noise variance for the image, σ_x is the (large) prior variance for the coefficients, p is a small activation probability specific to each object type, h^k and x^kl represent the full two-dimensional maps of the binary and continuous coefficients respectively, "∘" represents the elementwise or Hadamard product and "∗" denotes two-dimensional convolution where the result is taken to have the same dimensions as the input image (in other words, the convolution uses "zero-padding"). The joint log-likelihood (or negative energy) can now be derived easily\n\nL(x, h, A) = −‖y − Σ_{k,l} A^kl ∗ (x^kl ∘ h^k)‖^2 / (2σ_y^2) − Σ_{k,l,x,y} (x^kl_xy)^2 / (2σ_x^2) + Σ_{k,x,y} (h^k_xy log(p) + (1 − h^k_xy) log(1 − p)) + constants.   (1)\n\nIn practice, we used σ_x = ∞ as we found that it gave similar results to finite values of σ_x. This model can be fit by alternately optimizing the cost function in equation 1 over the unobserved variables x and h and the parameters A. The prior bias parameter p will not be optimized over but instead will be adjusted so as to guarantee a mean number of elements per image. We also set ‖A^kl‖ = 1 without loss of generality, since the absolute values of x can scale to compensate.\n2.2 Inference by convolutional block pursuit\nGiven a set of basis functions A^kl and an image y, we would like to infer the most likely locations of objects of each type in an image. This inference is generally NP-hard but good solutions can nonetheless be obtained with greedy methods like matching pursuit (MP). In standard matching pursuit, a sequential process is followed where at each step a basis function A^kl is chosen which, if activated, increases most the log-likelihood of equation 1. In our model, at each step we activate a full block k which includes multiple templates A^kl. Due to the quadratic nature of equation 1, for a proposal h^k_xy = 1 we can easily compute the MAP estimate for each x^k_xy given the current residual image y_res = y − Σ_{k,l} A^kl ∗ (x^kl ∘ h^k). Here we understand x^k_xy as a vector concatenating x^kl_xy for all l. The MAP estimate for x^k_xy is\n\nx̂^k_xy = ((A^k)^T A^k)^{−1} v^k_xy, where v^k_xy(l) = (Ā^kl ∗ y_res)_xy,\n\nwhere Ā^kl is the basis function A^kl rotated by 180 degrees and the matrix A^k contains as columns the vectorized basis functions A^kl. The corresponding increase in likelihood in equation 1 is\n\nδL^k_xy = (v^k_xy)^T x̂^k_xy / (2σ_y^2) − log((1 − p)/p).\n\nInference stops when the activation penalty log((1 − p)/p) from the prior overcomes the data term for all possible objects k at all possible locations (x, y).\nA simple trick common to all matching pursuit algorithms [9], [6] allows us to save computation when sequentially calculating v^kl_xy = (Ā^kl ∗ y_res)_xy by keeping track of v and updating it after each new coefficient is turned on:\n\nv_new = v − G_(····),(k·xy) x̂^k_xy,\n\nwhere G is the grand Gram matrix of all basis functions A^kl at all positions (x, y), and the indexing means that every dot runs over all possible values of that index. Because the basis functions are much smaller in length and width than the entire image, most entries in the Gram matrix are actually 0. In practice, we do not keep track of these and instead keep track only of G_(k'l'x'y'),(klxy) for |x − x'| < d and |y − y'| < d, where d is the width and length of the basis function. We also keep track during inference of x̂ and δL^k_xy and only need to update these quantities at positions (x, y) around the extracted object. These caching techniques make the complexity of the inference scale linearly with the number of objects in each image, regardless of image or object size.\nThus, our algorithm benefits from the computational efficiency of matching pursuit. One additional computation lies in determining the inverse of (A^k)^T A^k for each k. This cost is negligible, since each block contains a small number of attributes and we only need to do the inversions once per iteration. Every iteration of block pursuit requires updating v, x̂ and δL^k_xy locally around the extracted block, which is several times more expensive than the corresponding update in simple matching pursuit. 
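The scoring step can be made concrete with a small numerical sketch. The following NumPy code is our own illustration, not the authors' implementation: the function name, array shapes and the dense, uncached computation are hypothetical. It computes the correlation maps v^k, the MAP coefficients x̂^k = ((A^k)^T A^k)^{-1} v^k and the likelihood gain δL^k at every location for one block:

```python
import numpy as np
from scipy.signal import correlate2d

def block_map_update(y_res, A, sigma_y, p):
    """Score one block at every location of the residual image.

    y_res : (H, W) residual image.
    A     : (L, d, d) array holding the L basis functions of one block.
    Returns the MAP coefficient maps and the likelihood gain per location.
    """
    L = A.shape[0]
    # v[l] = correlation of A^{kl} with the residual, i.e. convolution
    # with the 180-degree rotated filter, one map per attribute l.
    v = np.stack([correlate2d(y_res, A[l], mode="same") for l in range(L)])
    # ((A^k)^T A^k)^{-1} from the vectorized basis functions (columns).
    Amat = A.reshape(L, -1).T                    # (d*d, L)
    gram_inv = np.linalg.inv(Amat.T @ Amat)      # (L, L)
    # MAP coefficients x_hat = ((A^k)^T A^k)^{-1} v at every pixel at once.
    x_hat = np.einsum("lm,mxy->lxy", gram_inv, v)
    # Likelihood gain: data term minus the prior activation penalty.
    delta_L = (v * x_hat).sum(axis=0) / (2 * sigma_y**2) - np.log((1 - p) / p)
    return x_hat, delta_L
```

Activating the block with the largest gain, subtracting its reconstruction from y_res and repeating until the gain is everywhere negative gives the greedy loop; the Gram-matrix caching described in the text replaces this full recomputation of v.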
However, this cost is also negligible compared to the cost of finding the best block at each iteration: the single most intensive operation during inference is the loop through all the elements in all the convolutional maps to find the block which most increases the likelihood if activated. All the other update operations are local around the extracted block, and thus negligible. In practice for the datasets we use (for example, 18 images of 256 by 256 pixels each), a model can be learned in minutes on a modern CPU and inference on a single large image takes under one second.\n\n2.3 Learning with block K-SVD\nGiven the inferred active blocks and their coefficients, we would like to adapt the parameters of the basis functions A^kl so as to maximize the cost function in equation 1. This can most easily be accomplished by gradient descent (GD). Unfortunately, for general dictionary learning setups gradient descent can produce suboptimal solutions, where a proportion of the basis functions fail to learn meaningful structure [10]. Similarly, for our block-based representations we found that gradient descent often mixed together subspaces that should have been separated (see figure 2(c)). We therefore estimate the subspaces in each A^k incrementally: we run a few iterations of learning with a single subspace in each A^k and then, every few iterations, increase the number of subspaces estimated for A^k. This incremental approach always resulted in demixed subspaces like those in figure 2(a). Note also that the standard approach in MP-based models is to extract a fixed number of coefficients per image, but in our database of biological images there are large variations in the number of cells present in each image, so we needed the inference method to be flexible enough to accommodate varying numbers of objects. 
To control the total number of active\ncoef\ufb01cients, we adjusted during learning the prior activation probability p whenever the average\nnumber of active elements was too small or too large compared to our target mean activation rate.\nAlthough incremental gradient descent worked well, it tended to be slow in practice. A popular\nlearning algorithm that was proposed to accelerate patch-based dictionary learning is K-SVD [10].\nIn every iteration of K-SVD, coef\ufb01cients are extracted for all the image patches in the training\nset. Then the algorithm modi\ufb01es each basis function sequentially to exactly minimize the squared\nreconstruction cost. The convolutional MP implementation of [6] indeed uses K-SVD for learning\nand we here show how K-SVD can be adapted to block-based representations.\nAt every iteration of K-SVD, given a set of active basis functions per image obtained with an infer-\nence method, the objective is to minimize the reconstruction cost with respect to the basis functions\nand coef\ufb01cients simultaneously [10]. We consider each basis function Akl sequentially, extract all\nimage patches {yi}i where that basis function is active and assume all coef\ufb01cients for the other basis\nfunctions are \ufb01xed. In the convolutional setting, these patches are extracted from locations in the\nimages where each basis function is active [6]. We add back the contribution of basis function Akl\nto each patch in {yi}i and now make the observation that to minimize the reconstruction error with\na single basis function \u02c6Akl we must \ufb01nd the direction in pixel space where most of the variance in\n{yi}i lies. This can be done with an SVD decomposition followed by retaining the \ufb01rst principal\nvector \u02c6Akl. 
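For a single basis function this update keeps the leading principal vector of the collected patches; keeping the first K vectors gives the block version described next. A minimal NumPy sketch (the function name and vectorized-patch interface are our own; patch extraction, spatial centering and the overlap bias discussed later are omitted):

```python
import numpy as np

def block_ksvd_update(patches, K):
    """Re-estimate one block from the patches where it is active.

    patches : (N, d*d) residual patches with the block's own
              contribution added back in.
    K       : subspace dimension (K = 1 recovers standard K-SVD).
    Returns the K principal directions and the new residual patches.
    """
    # The first K right singular vectors span the K-dimensional
    # subspace capturing the most variance in the patches.
    _, _, Vt = np.linalg.svd(patches, full_matrices=False)
    A_new = Vt[:K].T                  # (d*d, K), orthonormal columns
    # New residuals: y_i - A (A^T y_i) for every patch.
    residual = patches - patches @ A_new @ A_new.T
    return A_new, residual
```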
The new reconstructions for each patch y_i are y_i − Â^kl(Â^kl)^T y_i and with this new residual we move on to the next basis function to be reestimated.\nBy analogy, in block K-SVD we are given a set of active blocks per image, each block consisting of K basis functions. We consider each block A^k sequentially, extract all image patches {y_i}_i where that block is active and assume all coefficients for the other blocks are fixed. We add back the contribution of block A^k to each patch in {y_i}_i and like before perform an SVD decomposition of these residuals. However, we are now looking for a K-dimensional subspace where most of the variance in {y_i}_i lies and this is exactly achieved by considering the first K principal vectors returned by SVD. The reconstructions for each patch are y_i − Â^k(Â^k)^T y_i where Â^k are the first K principal vectors. On a more technical note, after each iteration of K-SVD we centered the parameters spatially so that the center of mass of the first direction of variability in each block was aligned to the center of its window; otherwise the basis functions did not center by themselves.\nAlthough K-SVD was an order of magnitude faster than GD and converged in practice, we noted that in the convolutional setting K-SVD is biased. This is because at the step of re-estimating a block A^k from a set of patches {y_i}_i, some of these patches may be spatially overlapping in the full image. Therefore, the subspaces in A^k are driven to explain the residual at some pixels multiple times. One way around the problem would be to enforce non-overlapping windows during inference, but in our images many cell pairs touch and would in fact require overlapping windows. Instead, we decided to fine-tune the parameters returned by block K-SVD with a few iterations of gradient descent, which worked well in practice and in simulations recovered good model parameters with little further computational effort.\n\nFigure 2: a. Typical recovered parameters with incremental gradient descent learning on GCaMP6 fluorescent images. Each column is a block and is sorted in the order of variance from the SVD decomposition. Left columns capture the structure of cell somata, while right columns represent dendrite fragments. b. Like (a) but with incremental block K-SVD. Similar subspaces are recovered with ten times fewer iterations. c. and d. Typical failure modes of learning with non-incremental gradient descent and block K-SVD, respectively. The subspaces from (a) appear mixed together. e. Subspaces obtained from Nissl-stained slices of cortex.\n\n3 Results\n3.1 Qualitative results on fluorescent images of neurons\nThe main applications of our work are to Nissl-stained slices and to fields of neurons and neuropil imaged with a two-photon microscope (figure 1(a)). The neurons were densely labeled with the fluorescent calcium indicator GCaMP6 in a small area of the mouse somatosensory (barrel) cortex. While the mice were either anesthetized or awake, their whiskers were stimulated, which activated corresponding barrel cortex neurons, leading to an influx of calcium into the cells and consequently an increase in fluorescence which was reported by the two-photon microscope. Although cell somata receive a large influx of calcium, dendrites and axons can also be seen. Individual images of the fluorescence can be very noisy purely due to the low number of photons released over each exposure. Better spatial accuracy can be obtained at the expense of temporal accuracy or at the expense of a smaller field of view. In practice, cell locations can be identified based on the mean images recorded over the duration of an entire experiment, in our case 1000 or 5000 frames. 
Using 18 images like the one in figure 1(b) we learned a full model with two types of objects, each with three subspaces. One of the object types, the left column in figure 2(a), was clearly a model of single neurons. The right column of figure 2(a) represented small pieces of dendrite that were also highly fluorescent. Note how within a block each of the two objects includes dimensions of variability that capture anisotropies in the shape of the cell or dendritic fragments. Figure 3(a) shows in alternating odd rows patches from the training set identified by the algorithm to contain cells, and the respective reconstructions in the even rows. Note that while most cells are ring-shaped, some appear filled and some appear to be larger, and the model's flexibility is sufficient to capture these variations. Figure 2(c) shows a typical failure for gradient-based learning that motivated us to use incremental block learning. The two subspaces recovered in figure 2(a) are mixed in figure 2(c) and the likelihood from equation 1 is correspondingly lower.\n3.2 Simulated data\nWe ran extensive experiments on simulated data to assess the algorithm's ability to learn and infer cell locations. There are two possible failure modes: the inference algorithm might not be accurate enough or the learning algorithm might not recover good parameters. We address each of these failure modes separately. We wanted to have simulated data as similar as possible to the real data, so we first fitted a model to the GCaMP6 data. We then took the learned model and generated a new dataset from it using the same number of objects of each type and similar amounts of Gaussian noise as the real images. To generate diverse shapes of cells, we fit a K-dimensional multivariate Gaussian to the posteriors of each block on the real data and generated coefficients from this model for the simulated images. Supplemental figure 6 shows a simulated image and it can be seen to resemble images in the training set. Note that we are not modelling some of the structured variability in the noise, for example the blood vessels and dendrites visible in figure 1(b). This structured variability is the likely reason why the model performs better on simulated than on real images.\n\nFigure 3: a. Patches from the GCaMP6 training images (odd rows) and their reconstructions (even rows) with the subspaces shown in figure 2(b). b. One area from a Nissl-stained image together with a human segmentation (open circles) and the model segmentation (stars). Larger zoom versions are available in the supplementary material.\n\n3.2.1 Inference quality of convolutional block pursuit\nWe kept the ground truth for the simulated dataset and investigated how well we can recover cell locations when we know perfectly what the simulation parameters were. There is one free parameter in our model that we cannot learn automatically: the average number of extracted objects per image. We varied this parameter and report ROC curves for true positives and false positives as we vary the number of extracted coefficients. Sometimes we observed that cells were identified not exactly at the correct location but one or a few pixels away. Such small deviations are acceptable in practice, so we considered inferred cells as correctly identified if they were within four pixels of the correct location (cells were 8-16 pixels in diameter). We enforced that a true cell could only be identified once. If the algorithm made two predictions within ±4 pixels of a true cell, only the first of these was considered a true positive. Figure 4(a) reports the typical performance of convolutional block pursuit. 
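The matching criterion used for these ROC curves can be sketched as follows (our own hypothetical helper, not the authors' evaluation code): each prediction, taken in extraction order, counts as a true positive if it falls within the tolerance of a still-unmatched true cell, so every true cell is matched at most once.

```python
import numpy as np

def score_detections(pred, truth, tol=4.0):
    """Greedy matching of predicted to true cell centres.

    pred, truth : sequences of (x, y) centres; pred is in extraction order.
    Returns (true positives, false positives) at tolerance `tol` pixels.
    """
    truth = np.asarray(truth, dtype=float)
    used = np.zeros(len(truth), dtype=bool)
    tp = fp = 0
    for point in np.asarray(pred, dtype=float):
        d = np.linalg.norm(truth - point, axis=1)
        d[used] = np.inf                 # each true cell matched only once
        j = int(np.argmin(d))
        if d[j] <= tol:
            used[j] = True
            tp += 1
        else:
            fp += 1
    return tp, fp
```

Sweeping the number of extracted objects and recording the (tp, fp) pairs traces out ROC curves like those in figure 4.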
We also investigated the quality of inference without considering the full structure of the subspaces in each object. Using a single subspace per object is equivalent to matching pursuit; this achieved significantly worse performance and saturated at a smaller number of true positives because the model could not recognize some of the variations in cell shape.\n3.2.2 Learning quality of K-SVD + gradient descent\nWe next tested how well the algorithm recovers the generative parameters. We assume that the model knows how many object types there are and how many attributes each object type has. To compare the various learning strategies we could in principle just evaluate the joint log-likelihood of equation 1. However, the differences, although consistent, were relatively small and hard to interpret. More relevant to us is the ROC performance in correctly recovering cell locations. Block K-SVD consistently recovers good parameters but does not perform quite as well as the true parameters because of its bias (figure 4(b)). However, refinement with GD consistently recovers the best parameters, which approach the performance of the true generative parameters. We also asked how well the model recovers the parameters when the true number of objects per image is unknown, by running several experiments with different mean numbers of objects per image. The performance of the learned subspaces is reported in figure 4(c). Although the correct number of elements per image was 600, learning with as few as 200 or as many as 1400 objects resulted in equally well-performing models. If performance on simulated data is at all indicative of behavior on real data, we conclude that our algorithm is not sensitive to the only free parameter in the model.\n\nFigure 4: ROC curves show the model's behavior on simulated data (a-c) and on manually-segmented GCaMP6 images (d) and Nissl-stained images (e). 
a. Inference with block pursuit with all three subspaces per object (B3P) as well as block pursuit with only the first or first two principal subspaces (B1P and B2P). We also show for comparison the performance of B3P with model parameters identified by learning. Notice the small number of false negatives when a large proportion of the cells are identified. The cells not identified were too dim to pick out even with a large number of false positives, hence the quick saturation of the ROC curve. b. Ten runs of block K-SVD followed by gradient descent. Refining with GD improved performance. c. Not knowing the average number of elements per image does not make a difference on simulated data.\n3.3 Comparison with human segmentation on biological images\nWe compare the segmentation of the model with manual segmentations on one example each of the GCaMP6 and Nissl-stained images (figures 4(d) and 4(e)). The human segmenters were instructed to locate cells in approximately the order of confidence, thus producing an ordering similar to the ordering returned by the algorithm. As we retain more cells from that ordering we can build ROC curves showing the agreement of the humans with each other, and of the model's segmentation with the humans'. We found that using multiple templates per block helped the model agree more with the human segmentations. In the case of the Nissl stain, block coding with four templates identified fifty more cells than matching pursuit. Although the model generally performs below inter-human agreement, the gap is sufficiently small to warrant practical use. In addition, a post-hoc analysis suggests that many of the model's false positives are in fact cells that were not selected in the manual segmentations. Examples of these false positives can be seen both in figure 3(b) and in figures in the supplementary material. 
As we anticipated in the introduction, a standard method based on thresholded and localized correlation maps reached only 25 true positives at 50 false positives and is not shown in figure 4(d).
4 Conclusions
We have presented an image model that can be used to automatically and effectively infer the locations and shapes of cells from biological image data. This application of generative image models is to our knowledge novel and should allow many types of biological studies to be automated. Our contribution to the image modelling literature is to extend the sparse block coding model presented in [8] to the convolutional setting, where each block is allowed to be present at any location in an image. We also derived convolutional block pursuit, a greedy inference algorithm which scales gracefully to images of large dimensions with many possible object types in the generative model. For learning the model, we extended the K-SVD learning algorithm to the block-based and convolutional representation. We identified a bias in convolutional K-SVD and used gradient descent to fine-tune the model parameters towards good local optima.
On simulated data, convolutional block pursuit recovers cell locations with good accuracy, and the learning rule recovers the parameters of the generative model well and consistently. The block pursuit algorithm recovers significantly more cells than simple matching pursuit.
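The greedy inference procedure summarized above can be sketched as follows. This is a minimal illustration, not the paper's implementation: it assumes each block is stored as K orthonormal templates spanning one object subspace, so that the summed squared cross-correlations give the energy of the residual's projection onto that subspace at each location.

```python
import numpy as np
from scipy.signal import correlate2d

def block_pursuit(image, blocks, n_objects):
    """Greedy convolutional block pursuit (sketch).

    blocks: list of arrays of shape (K, h, w), each holding K orthonormal
    templates spanning one object subspace (hypothetical layout).
    Returns (block index, row, col, coefficients) for each detection.
    """
    residual = image.astype(float).copy()
    detections = []
    for _ in range(n_objects):
        best = None
        for b, templates in enumerate(blocks):
            # Cross-correlate each template with the residual; with
            # orthonormal templates, the summed squared responses equal
            # the projection energy onto the block's subspace.
            resp = np.stack([correlate2d(residual, t, mode='valid')
                             for t in templates])
            energy = (resp ** 2).sum(axis=0)
            i, j = np.unravel_index(energy.argmax(), energy.shape)
            if best is None or energy[i, j] > best[0]:
                best = (energy[i, j], b, i, j, resp[:, i, j])
        _, b, i, j, coef = best
        h, w = blocks[b].shape[1:]
        # Subtract the selected projection from the residual and record it.
        recon = np.tensordot(coef, blocks[b], axes=(0, 0))
        residual[i:i + h, j:j + w] -= recon
        detections.append((b, i, j, coef))
    return detections
```

Each iteration explains away one object, so the cost grows linearly with the number of detected objects and with the number of blocks, consistent with the graceful scaling claimed above.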
On data from calcium imaging experiments and Nissl-stained tissue, the model succeeds in recovering cell locations and learns good models of the variability among different cell shapes.

References
[1] M Oberlaender, VJ Dercksen, R Egger, M Gensel, B Sakmann, and HC Hege. Automated three-dimensional detection and counting of neuron somata. J Neuroscience Methods, 180:147–160, 2009.
[2] EA Mukamel, A Nimmerjahn, and MJ Schnitzer. Automated analysis of cellular signals from large-scale calcium imaging data. Neuron, 63:747–760, 2009.
[3] I Ozden, HM Lee, MR Sullivan, and SSH Wang. Identification and clustering of event patterns from in vivo multiphoton optical recordings of neuronal ensembles. J Neurophysiol, 100:495–503, 2008.
[4] K Kavukcuoglu, P Sermanet, YL Boureau, K Gregor, M Mathieu, and Y LeCun. Learning convolutional feature hierarchies for visual recognition. Advances in Neural Information Processing, 2010.
[5] K Gregor, A Szlam, and Y LeCun. Structured sparse coding via lateral inhibition. Advances in Neural Information Processing, 2011.
[6] A Szlam, K Kavukcuoglu, and Y LeCun. Convolutional matching pursuit and dictionary training. arXiv, page 1010.0422v1, 2010.
[7] A Hyvarinen, J Hurri, and PO Hoyer. Natural Image Statistics. Springer, 2009.
[8] P Berkes, RE Turner, and M Sahani.
A structured model of video produces primary visual cortical organisation. PLoS Computational Biology, 5, 2009.
[9] SG Mallat and Z Zhang. Matching pursuits with time-frequency dictionaries. IEEE Transactions on Signal Processing, 41(12):3397–3415, 1993.
[10] M Aharon, M Elad, and A Bruckstein. K-SVD: An algorithm for designing overcomplete dictionaries for sparse representation. IEEE Transactions on Signal Processing, 54(11):4311–4322, 2006.