{"title": "Parallel Multi-Dimensional LSTM, With Application to Fast Biomedical Volumetric Image Segmentation", "book": "Advances in Neural Information Processing Systems", "page_first": 2998, "page_last": 3006, "abstract": "Convolutional Neural Networks (CNNs) can be shifted across 2D images or 3D videos to segment them. They have a fixed input size and typically perceive only small local contexts of the pixels to be classified as foreground or background. In contrast, Multi-Dimensional Recurrent NNs (MD-RNNs) can perceive the entire spatio-temporal context of each pixel in a few sweeps through all pixels, especially when the RNN is a Long Short-Term Memory (LSTM). Despite these theoretical advantages, however, unlike CNNs, previous MD-LSTM variants were hard to parallelise on GPUs. Here we re-arrange the traditional cuboid order of computations in MD-LSTM in pyramidal fashion. The resulting PyraMiD-LSTM is easy to parallelise, especially for 3D data such as stacks of brain slice images. PyraMiD-LSTM achieved best known pixel-wise brain image segmentation results on MRBrainS13 (and competitive results on EM-ISBI12).", "full_text": "Parallel Multi-Dimensional LSTM, With Application
to Fast Biomedical Volumetric Image Segmentation

Marijn F. Stollenga*123, Wonmin Byeon*1245, Marcus Liwicki4, and Juergen Schmidhuber123

*Shared first authors; both contributed equally to this work. Corresponding authors:
marijn@idsia.ch, wonmin.byeon@dfki.de

1Istituto Dalle Molle di Studi sull'Intelligenza Artificiale (The Swiss AI Lab IDSIA)
2Scuola universitaria professionale della Svizzera italiana (SUPSI), Switzerland
3Università della Svizzera italiana (USI), Switzerland
4University of Kaiserslautern, Germany
5German Research Center for Artificial Intelligence (DFKI), Germany

Abstract

Convolutional Neural Networks (CNNs) can be shifted across 2D images or 3D
videos to segment them. 
They have a fixed input size and typically perceive only
small local contexts of the pixels to be classified as foreground or background. In
contrast, Multi-Dimensional Recurrent NNs (MD-RNNs) can perceive the entire
spatio-temporal context of each pixel in a few sweeps through all pixels, especially
when the RNN is a Long Short-Term Memory (LSTM). Despite these theoretical
advantages, however, unlike CNNs, previous MD-LSTM variants were hard to
parallelise on GPUs. Here we re-arrange the traditional cuboid order of computations
in MD-LSTM in pyramidal fashion. The resulting PyraMiD-LSTM is easy to
parallelise, especially for 3D data such as stacks of brain slice images. PyraMiD-LSTM
achieved best known pixel-wise brain image segmentation results on MRBrainS13
(and competitive results on EM-ISBI12).

1 Introduction

Long Short-Term Memory (LSTM) networks [1, 2] are recurrent neural networks (RNNs) initially
designed for sequence processing. They achieved state-of-the-art results on challenging tasks such
as handwriting recognition [3], large vocabulary speech recognition [4] and machine translation [5].
Their architecture contains gates to store and read out information from linear units called error
carousels that retain information over long time intervals, which is hard for traditional RNNs.
Multi-Dimensional LSTM networks (MD-LSTM [6]) connect hidden LSTM units in grid-like
fashion1. Two-dimensional MD-LSTM is applicable to image segmentation [6, 7, 8], where each pixel
is assigned to a class such as background or foreground. Each LSTM unit sees a pixel and receives
input from preceding LSTM units, thus recursively gathering information about all other pixels in
the image.
There are many biomedical 3D volumetric data sources, such as computed tomography (CT), magnetic resonance (MR), and electron microscopy (EM). 
Most previous approaches process each 2D
slice separately, using image segmentation algorithms such as snakes [9], random forests [10] and
Convolutional Neural Networks [11]. 3D-LSTM, however, can process the full context of each pixel
in such a volume through 8 sweeps over all pixels by 8 different LSTMs, each sweep in the general
direction of one of the 8 directed volume diagonals.

1For example, in two dimensions this yields 4 directions; up, down, left and right.

Due to the sequential nature of RNNs, however, MD-LSTM parallelisation was difficult, especially
for volumetric data. The novel Pyramidal Multi-Dimensional LSTM (PyraMiD-LSTM) networks
introduced in this paper use a rather different topology and update strategy. They are easier to
parallelise, need fewer computations overall, and scale well on GPU architectures.
PyraMiD-LSTM is applied to two challenging tasks involving segmentation of biological volumetric
images. Competitive results are achieved on EM-ISBI12 [12]; best known results are achieved on
MRBrainS13 [13].

2 Method

We will first describe standard one-dimensional LSTM [2]. Then we introduce the MD-LSTM and
topology changes to construct the PyraMiD-LSTM, which is formally described and discussed.
A one-dimensional LSTM unit consists of an input gate (i), forget gate2 (f), output gate (o), and
memory cell (c), which control what should be remembered or forgotten over potentially long periods
of time. The input x and all gates and activations are real-valued vectors: x, i, f, c̃, c, o, h ∈ R^T,
where T is the length of the input. The gates and activations at discrete time t (t=1,2,...) 
are computed
as follows:

i_t = σ(x_t · θ_xi + h_t-1 · θ_hi + θ_ibias),    (1)
f_t = σ(x_t · θ_xf + h_t-1 · θ_hf + θ_fbias),    (2)
c̃_t = tanh(x_t · θ_xc̃ + h_t-1 · θ_hc̃ + θ_c̃bias),    (3)
c_t = c̃_t ⊙ i_t + c_t-1 ⊙ f_t,    (4)
o_t = σ(x_t · θ_xo + h_t-1 · θ_ho + θ_obias),    (5)
h_t = o_t ⊙ tanh(c_t),    (6)

where (·) is a (matrix) multiplication, (⊙) an element-wise multiplication, and θ denotes the weights.
c̃ is the input to the 'cell' c, which is gated by the input gate, and h is the output. The non-linear
functions σ and tanh are applied element-wise, where σ(x) = 1/(1 + e^−x). Equations (1, 2) determine
gate activations, Equation (3) cell inputs, Equation (4) the new cell states (here 'memories' are stored
or forgotten), and Equation (5) output gate activations, which appear in Equation (6), the final output.

2.1 Pyramidal Connection Topology

Figure 1: The standard MD-LSTM topology (a) evaluates the context of each pixel recursively from neighbouring
pixel contexts along the axes, that is, pixels on a simplex can be processed in parallel. Turning this order by 45°
(b) causes the simplex to become a plane (a column vector in the 2D case here). The resulting gaps are filled by
adding extra connections, to process more than 2 elements of the context (c).

The multi-dimensional LSTM (MD-LSTM; [6]) aligns LSTM units in a grid and connects them over
the axes. Multiple grids are needed to process information from all directions. A 2D-LSTM adds the
pixel-wise outputs of 4 LSTMs: one scanning the image pixel by pixel from north-west to south-east,
one from north-east to south-west, one from south-west to north-east, and one from south-east to
north-west. 
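As a concrete illustration, the one-dimensional LSTM step in Equations (1)-(6) can be sketched in a few lines of NumPy; the names, shapes, and toy dimensions below are illustrative choices, not the authors' implementation.

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def lstm_step(x_t, h_prev, c_prev, theta):
    """One step of Equations (1)-(6). theta maps names such as
    'xi' (input-to-input-gate weights) to matrices or bias vectors."""
    i = sigmoid(x_t @ theta['xi'] + h_prev @ theta['hi'] + theta['i_bias'])        # (1) input gate
    f = sigmoid(x_t @ theta['xf'] + h_prev @ theta['hf'] + theta['f_bias'])        # (2) forget gate
    c_tilde = np.tanh(x_t @ theta['xc'] + h_prev @ theta['hc'] + theta['c_bias'])  # (3) cell input
    c = c_tilde * i + c_prev * f                                                   # (4) cell state
    o = sigmoid(x_t @ theta['xo'] + h_prev @ theta['ho'] + theta['o_bias'])        # (5) output gate
    h = o * np.tanh(c)                                                             # (6) output
    return h, c

# Toy run: 5 time steps, 3 inputs, 4 hidden units.
rng = np.random.default_rng(0)
n_in, n_hid = 3, 4
theta = {k: rng.uniform(-0.1, 0.1, ((n_in if k[0] == 'x' else n_hid), n_hid))
         for k in ('xi', 'hi', 'xf', 'hf', 'xc', 'hc', 'xo', 'ho')}
theta.update({k: np.zeros(n_hid) for k in ('i_bias', 'f_bias', 'c_bias', 'o_bias')})
h = c = np.zeros(n_hid)
for x_t in rng.normal(size=(5, n_in)):
    h, c = lstm_step(x_t, h, c, theta)
```

Note how Equation (6) bounds the output: since o ∈ (0, 1) and |tanh(c)| < 1, every component of h stays strictly inside (−1, 1).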
Figure 1–a shows one of these directions.

2Although the forget gate output is inverted and actually 'remembers' when it is on, and forgets when it is
off, the traditional nomenclature is kept.

However, a small change in connections can greatly facilitate parallelisation: if the connections
are rotated by 45°, all inputs to all units come from either left, right, up, or down (left in the case of
Figure 1–b), and all elements of a row in the grid can be computed independently. However,
this introduces context gaps as in Figure 1–b. By adding an extra input, these gaps are filled as in
Figure 1–c. Extending this approach to 3 dimensions results in a Pyramidal Connection Topology,
meaning the context of a pixel is formed by a pyramid in each direction.

Figure 2: On the left we see the context scanned so far by one of the 8 LSTMs of a 3D-LSTM: a cube. In
general, given d dimensions, 2^d LSTMs are needed. On the right we see the context scanned so far by one of
the 6 LSTMs of a 3D-PyraMiD-LSTM: a pyramid. In general, 2 × d LSTMs are needed.

One of the striking differences between PyraMiD-LSTM and MD-LSTM is the shape of the scanned
contexts. Each LSTM of an MD-LSTM scans rectangle-like contexts in 2D or cuboids in 3D. Each
LSTM of a PyraMiD-LSTM scans triangles in 2D and pyramids in 3D (see Figure 2). An MD-LSTM
needs 8 LSTMs to scan a volume, while a PyraMiD-LSTM needs only 6, since it takes 8 cubes or 6
pyramids to fill a volume. Given dimension d, the number of LSTMs grows as 2^d for an MD-LSTM
(exponentially) and 2 × d for a PyraMiD-LSTM (linearly).
A similar connection strategy has been previously used to speed up non-Euclidean distance computations on surfaces [14]. 
There are however important differences:

• We can exploit efficient GPU-based CUDA convolution operations, but in a way unlike what
is done in CNNs, as will be explained below.
• As a result of these operations, input filters that are bigger than the necessary 3 × 3 filters
arise naturally, creating overlapping contexts. Such redundancy turns out to be beneficial
and is used in our experiments.
• We apply several layers of complex processing with multi-channelled outputs and several
state-variables for each pixel, instead of having a single value per pixel as in distance
computations.
• Our application is focused on volumetric data.

2.2 PyraMiD-LSTM

Figure 3: PyraMiD-LSTM network architecture. Randomly rotated and flipped inputs are sampled from random
locations, then fed to six C-LSTMs over three axes. The outputs from all C-LSTMs are combined and sent to the
fully-connected layer. tanh is used as a squashing function in the hidden layer. Several PyraMiD-LSTM layers
can be applied. The last layer is fully-connected and uses a softmax function to compute probabilities for each
class for each pixel.

Here we explain the PyraMiD-LSTM network architecture for 3D volumes (see Figure 3). The
workhorses are six convolutional LSTM (C-LSTM) layers, one for each direction, which together create the
full context of each pixel. Note that each of these C-LSTMs is an entire LSTM RNN, processing the
entire volume in one direction. The directions D are formally defined over the three axes (x, y, z):
D = {(·,·, 1), (·,·,−1), (·, 1,·), (·,−1,·), (1,·,·), (−1,·,·)}. They essentially choose which axis is
the time direction; i.e. 
with (·,·, 1) the positive direction of the z-axis represents the time.
Each C-LSTM performs computations in a plane moving in the defined direction. The input is
x ∈ R^{W×H×D×C}, where W is the width, H the height, D the depth, and C the number of channels
of the input, or hidden units in the case of second- and higher layers. Similarly, we define the volumes
f^d, i^d, o^d, c̃^d, c^d, h^d, h ∈ R^{W×H×D×O}, where d ∈ D is a direction and O is the number of hidden
units per pixel. Since each direction needs a separate volume, we denote volumes with (·)^d.
The time index t selects a slice in direction d. For instance, for direction d = (·,·, 1), v^d_t refers to the
plane x, y, z, c for x = 1..W, y = 1..H, c = 1..C, and z = t. For a negative direction d = (·,·,−1),
the plane is the same but moves in the opposite direction: z = D − t + 1. A special case is the first plane in
each direction, which does not have a previous plane, hence we omit the corresponding computation.

C-LSTM equations:

i^d_t = σ(x^d_t ∗ θ^d_xi + h^d_t-1 ∗ θ^d_hi + θ^d_ibias),    (7)
f^d_t = σ(x^d_t ∗ θ^d_xf + h^d_t-1 ∗ θ^d_hf + θ^d_fbias),    (8)
c̃^d_t = tanh(x^d_t ∗ θ^d_xc̃ + h^d_t-1 ∗ θ^d_hc̃ + θ^d_c̃bias),    (9)
c^d_t = c̃^d_t ⊙ i^d_t + c^d_t-1 ⊙ f^d_t,    (10)
o^d_t = σ(x^d_t ∗ θ^d_xo + h^d_t-1 ∗ θ^d_ho + θ^d_obias),    (11)
h^d_t = o^d_t ⊙ tanh(c^d_t),    (12)
h = Σ_{d∈D} h^d,    (13)

where (∗) is a convolution3, and h is the output of the layer. All biases are the same for all LSTM
units (i.e., no positional biases are used). 
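To make the plane-by-plane computation concrete, here is a minimal single-channel, single-hidden-unit NumPy sketch of one C-LSTM sweeping a volume along the positive z-axis, following Equations (7)-(12); biases are omitted for brevity and all names are illustrative, not the authors' implementation. A full PyraMiD-LSTM would run six such sweeps (one per direction) and sum their outputs per Equation (13).

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def conv_same(plane, kernel):
    """'Same'-size 2D cross-correlation (the deep-learning 'convolution')
    of a single-channel plane with a small kernel, zero-padded borders."""
    kh, kw = kernel.shape
    ph, pw = kh // 2, kw // 2
    padded = np.pad(plane, ((ph, ph), (pw, pw)))
    out = np.zeros_like(plane)
    for dy in range(kh):
        for dx in range(kw):
            out += kernel[dy, dx] * padded[dy:dy + plane.shape[0], dx:dx + plane.shape[1]]
    return out

def clstm_sweep(volume, theta):
    """One C-LSTM scanning a (D, H, W) volume plane by plane along +z.
    theta maps gate names ('xi', 'hi', ...) to small 2D kernels."""
    D, H, W = volume.shape
    h = np.zeros((H, W))          # hidden state of the previous plane
    c = np.zeros((H, W))          # cell state of the previous plane
    outputs = np.zeros_like(volume)
    for t in range(D):
        x = volume[t]
        i = sigmoid(conv_same(x, theta['xi']) + conv_same(h, theta['hi']))        # (7)
        f = sigmoid(conv_same(x, theta['xf']) + conv_same(h, theta['hf']))        # (8)
        c_tilde = np.tanh(conv_same(x, theta['xc']) + conv_same(h, theta['hc']))  # (9)
        c = c_tilde * i + c * f                                                   # (10)
        o = sigmoid(conv_same(x, theta['xo']) + conv_same(h, theta['ho']))        # (11)
        h = o * np.tanh(c)                                                        # (12)
        outputs[t] = h
    return outputs

# Toy sweep over a 4-plane volume with random 3x3 kernels.
rng = np.random.default_rng(1)
theta = {k: rng.uniform(-0.1, 0.1, (3, 3))
         for k in ('xi', 'hi', 'xf', 'hf', 'xc', 'hc', 'xo', 'ho')}
volume = rng.normal(size=(4, 8, 8))
out = clstm_sweep(volume, theta)
```

Because all pixels of a plane are updated by the same convolutions, each plane is one batched operation; this is exactly the parallelism the pyramidal topology exposes.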
The outputs h^d for all directions are summed together.
Fully-Connected Layer: The output of our PyraMiD-LSTM layer is connected to a pixel-wise
fully-connected layer, whose output is squashed by the hyperbolic tangent (tanh) function. This step
is used to increase the number of channels for the next layer. The final classification is done using a
pixel-wise softmax function: y(x, y, z, c) = e^−h(x,y,z,c) / Σ_c′ e^−h(x,y,z,c′), giving pixel-wise probabilities for each
class.

3 Experiments

We evaluate our approach on two 3D biomedical image segmentation datasets: electron microscopy
(EM) and MR Brain images.

EM dataset The EM dataset [12] is provided by the ISBI 2012 workshop on Segmentation of
Neuronal Structures in EM Stacks [15]. Two stacks each consist of 30 slices of 512 × 512 pixels obtained
from a 2 × 2 × 1.5 µm³ microcube with a resolution of 4 × 4 × 50 nm³/pixel and binary labels.
One stack is used for training, the other for testing. Target data consists of binary labels (membrane
and non-membrane).

MR Brain dataset The MR Brain images are provided by the ISBI 2015 workshop on Neonatal
and Adult MR Brain Image Segmentation (ISBI NEATBrainS15) [13]. The dataset consists of twenty
fully annotated high-field (3T) multi-sequence scans: 3D T1-weighted scan (T1), T1-weighted inversion
recovery scan (IR), and fluid-attenuated inversion recovery scan (FLAIR). The dataset is divided into
a training set with five volumes and a test set with fifteen volumes. All scans are bias-corrected and
aligned. Each volume includes 48 slices with 240 × 240 pixels (3 mm slice thickness). 
The slices are manually segmented through nine labels: cortical gray matter, basal ganglia, white matter, white
matter lesions, cerebrospinal fluid in the extracerebral space, ventricles, cerebellum, brainstem, and
background. Following the ISBI NEATBrainS15 workshop procedure, all labels are grouped into four
classes and background: 1) cortical gray matter and basal ganglia (GM), 2) white matter and white
matter lesions (WM), 3) cerebrospinal fluid and ventricles (CSF), and 4) cerebellum and brainstem.
Class 4) is ignored for the final evaluation as required.

3In 3D volumes, convolutions are performed in 2D; in general, an n-D volume requires (n−1)-D convolutions.
All convolutions have stride 1, and their filter sizes should be at least 3 × 3 in each dimension to create the full
context.

Sub-volumes and Augmentation The full dataset requires more than the 12 GB of memory
provided by our GPU, hence we train and test on sub-volumes. We randomly pick a position in the
full data and extract a smaller cube (see the details in Bootstrapping). This cube is possibly rotated at
a random angle over some axis and can be flipped over any axis. For EM images, we rotate over the
z-axis and flip sub-volumes with 50% chance along the x, y, and z axes. For MR brain images, rotation
is disabled; only flipping along the x direction is considered, since brains are (mostly) symmetric in
this direction.
During test time, rotations and flipping are disabled and the results of all sub-volumes are stitched
together using a Gaussian kernel, providing the final result.

Pre-processing We normalise each input slice towards a mean of zero and variance of one, since
the imaging methods sometimes yield large variability in contrast and brightness. 
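The per-slice normalisation just described can be sketched as follows; `normalise_slices` is a hypothetical helper, and the `eps` guard against constant slices is our addition, not part of the paper.

```python
import numpy as np

def normalise_slices(volume, eps=1e-8):
    """Normalise each z-slice of a (D, H, W) volume to zero mean and unit
    variance, as in the pre-processing step above. eps (our addition)
    guards against division by zero on constant slices."""
    out = np.empty_like(volume, dtype=np.float64)
    for z in range(volume.shape[0]):
        s = volume[z].astype(np.float64)
        out[z] = (s - s.mean()) / (s.std() + eps)
    return out

# Example: a toy 2-slice volume.
vol = np.arange(24, dtype=np.float64).reshape(2, 3, 4)
norm = normalise_slices(vol)
```

Normalising per slice, rather than per volume, is what compensates for slice-to-slice contrast and brightness drift.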
We do not apply
the complex pre-processing common in biomedical image segmentation [10].
We apply simple pre-processing on the three datatypes of the MR Brain dataset, since they contain
large brightness changes under the same label (even within one slice; see Figure 5). From all slices
we subtract the Gaussian smoothed images (filter size: 31 × 31, σ = 5.0), then a Contrast-Limited
Adaptive Histogram Equalisation (CLAHE) [16] is applied to enhance the local contrast (tile size:
16 × 16, contrast limit: 2.0). An example of the images after pre-processing is shown in Figure 5.
The original and pre-processed images are all used, except the original IR images (Figure 5b), which
have high variability.

Training We apply RMS-prop [17] with momentum. We define a ρ←− b to mean a_n = ρ·a_n + (1 − ρ)·b_n,
where a, b ∈ R^N. The following equations hold for every epoch:

E = (y* − y)²,    (14)
MSE ρMSE←−− ∇²_θ E,    (15)
G = ∇_θ E / √(MSE + ε),    (16)
M ρM←−− G,    (17)
θ = θ − λlr·M,    (18)

where y* is the target, y is the output from the networks, E is the squared loss, MSE a running
average of the variance of the gradient, ∇² is the element-wise squared gradient, G the normalised
gradient, M the smoothed gradient, and θ the weights. The squared loss was chosen as it produced
better results than using the log-likelihood as an error function. This algorithm normalises the gradient
of each weight, such that even weights with small gradients get updated. This also helps to deal with
vanishing gradients [18].
We use a decaying learning rate: λlr = 10^−6 + 10^−2 · (1/2)^(epoch/100), which starts at λlr ≈ 10^−2 and halves
every 100 epochs, asymptotically approaching λlr = 10^−6. 
Other hyper-parameters used are ε = 10^−5,
ρMSE = 0.9, and ρM = 0.9.

Bootstrapping To speed up training, we run three learning procedures with increasing sub-volume
sizes: first, 3000 epochs with size 64 × 64 × 8, then 2000 epochs with size 128 × 128 × 15. Finally,
for the EM dataset, we train 1000 epochs with size 256 × 256 × 20, and for the MR Brain dataset
1000 epochs with size 240 × 240 × 25. After each epoch, the learning rate λlr is reset.

Table 1: Performance comparison on EM images. Some of the competing methods reported on the ISBI 2012
website are not yet published. Comparison details can be found at http://brainiac2.mit.edu/isbi_challenge/leaders-board.

Group | Rand Err. | Warping Err. (×10^−3) | Pixel Err.
Human | 0.002 | 0.0053 | 0.001
Simple Thresholding | 0.450 | 17.14 | 0.225
IDSIA [11] | 0.050 | 0.420 | 0.061
DIVE | 0.048 | 0.374 | 0.058
PyraMiD-LSTM | 0.047 | 0.462 | 0.062
IDSIA-SCI | 0.0189 | 0.617 | 0.103
DIVE-SCI | 0.0178 | 0.307 | 0.058

Experimental Setup All experiments are performed on a desktop computer with an NVIDIA GTX
TITAN X 12 GB GPU. Due to the pyramidal topology, all major computations can be done using
convolutions with NVIDIA's cuDNN library [19], which has a reported 20× speedup over an optimised
implementation on a modern 16-core CPU. On the MR brain dataset, training took around three days,
and testing per volume took around 2 minutes.
We use exactly the same hyper-parameters and architecture for both datasets. Our networks contain
three PyraMiD-LSTM layers. The first PyraMiD-LSTM layer has 16 hidden units followed by a
fully-connected layer with 25 hidden units. In the next PyraMiD-LSTM layer, 32 hidden units are
connected to a fully-connected layer with 45 hidden units. 
In the last PyraMiD-LSTM layer, 64
hidden units are connected to the fully-connected output layer, whose size equals the number of
classes.
The convolutional filter size for all PyraMiD-LSTM layers is set to 7 × 7. The total number of weights
is 10,751,549, and all weights are initialised according to a uniform distribution: U(−0.1, 0.1).

3.1 Neuronal Membrane Segmentation

Figure 4: Segmentation results on the EM dataset (slice 26): (a) input, (b) PyraMiD-LSTM output.

Membrane segmentation is evaluated through an online system provided by the ISBI 2012 organisers.
The measures used are the Rand error, warping error and pixel error [15]. Comparisons to other
methods are reported in Table 1. The teams IDSIA and DIVE provide membrane probability maps
for each pixel. The IDSIA team uses a state-of-the-art deep convolutional network [11]; the method
of DIVE was not provided.
These maps are adapted by the post-processing technique of team SCI [20], which directly
optimises the Rand error (DIVE-SCI (top-1) and IDSIA-SCI (top-2)); this error is the most important in this
particular segmentation task.

Table 2: Performance comparison on MR brain images. DC: DICE overlap (%), MD: modified Hausdorff distance (mm), AVD: absolute volume difference (%).

Team | GM (DC / MD / AVD) | WM (DC / MD / AVD) | CSF (DC / MD / AVD) | Rank
BIGR2 | 84.65 / 1.88 / 6.14 | 88.42 / 2.36 / 6.02 | 78.31 / 3.19 / 22.8 | 6
KSOM GHMF | 84.12 / 1.92 / 5.44 | 87.96 / 2.49 / 6.59 | 82.10 / 2.71 / 12.8 | 5
MNAB2 | 84.50 / 1.69 / 7.10 | 88.04 / 2.12 / 7.73 | 82.30 / 2.27 / 8.73 | 4
ISI-Neonatology | 85.77 / 1.62 / 6.62 | 88.66 / 2.06 / 6.96 | 81.08 / 2.66 / 9.77 | 3
UNC-IDEA | 84.36 / 1.62 / 7.04 | 88.69 / 2.06 / 6.46 | 82.81 / 2.35 / 10.5 | 2
PyraMiD-LSTM | 84.82 / 1.69 / 6.77 | 88.33 / 2.07 / 7.05 | 83.72 / 2.14 / 7.10 | 1

Without post-processing, PyraMiD-LSTM networks outperform other methods in Rand error, and
are competitive in warping and pixel errors. Of course, performance could be further improved by
applying post-processing techniques. 
Figure 4 shows an example segmentation result.

3.2 MR Brain Segmentation

The results are compared using the DICE overlap (DC), the modified Hausdorff distance (MD), and
the absolute volume difference (AVD) [13]. MR brain image segmentation results are evaluated by the
ISBI NEATBrainS15 organisers [13], who provide an extensive comparison to other approaches at
http://mrbrains13.isi.uu.nl/results.php. Table 2 compares our results to those of
the top five teams. The organisers compute nine measures in total and rank all teams for each of them
separately. These ranks are then summed per team, determining the final ranking (ties are broken
using the standard deviation). PyraMiD-LSTM leads the final ranking with a new state-of-the-art
result and outperforms other methods for CSF in all metrics.
We also tried regularisation through dropout [21]. Following earlier work [22], the dropout operator
is applied only to non-recurrent connections (50% dropout on fully-connected layers and/or 20% on
the input layer). However, this did not improve performance.

4 Conclusion

Since 2011, GPU-trained max-pooling CNNs have dominated classification contests [23, 24, 25] and
segmentation contests [11]. MD-LSTM, however, may pose a serious challenge to such CNNs, at
least for segmentation tasks. Unlike CNNs, MD-LSTM has an elegant recursive way of taking each
pixel's entire spatio-temporal context into account, in both images and videos. Previous MD-LSTM
implementations, however, could not exploit the parallelism of modern GPU hardware. This has
changed through our work presented here. 
Although our novel highly parallel PyraMiD-LSTM
has already achieved state-of-the-art segmentation results in challenging benchmarks, we feel we
have only scratched the surface of what will become possible with such PyraMiD-LSTMs and other
MD-RNNs.

5 Acknowledgements

We would like to thank Klaus Greff and Alessandro Giusti for their valuable discussions, and
Jan Koutnik and Dan Ciresan for their useful feedback. We also thank the ISBI NEATBrainS15
organisers [13] and the ISBI 2012 organisers, in particular Adriënne Mendrik and Ignacio Arganda-Carreras.
Lastly, we thank NVIDIA for generously providing us with hardware to perform our research.
This research was funded by the NASCENCE EU project (EU/FP7-ICT-317662).

(a) T1

(b) IR

(c) FLAIR

(d) T1 (pre-processed)

(e) IR (pre-processed)

(f) FLAIR (pre-processed)

(g) segmentation result from PyraMiD-LSTM

Figure 5: Slice 19 of test image 1. (a)-(c) are examples of the three scan methods used in the MR brain
dataset, and (d)-(f) show the corresponding images after our pre-processing procedure (see Pre-processing in
Section 3). Input (b) is omitted due to strong artefacts in the data; the other datatypes are all used as input to the
PyraMiD-LSTM. The segmentation result is shown in (g).

References

[1] S. Hochreiter and J. Schmidhuber. "Long Short-Term Memory". In: Neural Computation 9.8
(1997). Based on TR FKI-207-95, TUM (1995), pp. 1735–1780.

[2] F. A. Gers, J. Schmidhuber, and F. Cummins. "Learning to Forget: Continual Prediction with
LSTM". In: ICANN. 1999.

[3] A. Graves, M. Liwicki, S. Fernandez, R. Bertolami, H. Bunke, and J. Schmidhuber. "A Novel
Connectionist System for Improved Unconstrained Handwriting Recognition". In: PAMI 31.5
(2009).

[4] H. Sak, A. Senior, and F. Beaufays. "Long Short-Term Memory Recurrent Neural Network
Architectures for Large Scale Acoustic Modeling". 
In: Proc. Interspeech. 2014.

[5] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to Sequence Learning with Neural Networks.
Tech. rep. NIPS. 2014.

[6] A. Graves, S. Fernández, and J. Schmidhuber. "Multi-dimensional Recurrent Neural Networks".
In: ICANN. 2007.

[7] A. Graves and J. Schmidhuber. "Offline Handwriting Recognition with Multidimensional
Recurrent Neural Networks". In: NIPS. 2009.

[8] W. Byeon, T. M. Breuel, F. Raue, and M. Liwicki. "Scene Labeling With LSTM Recurrent
Neural Networks". In: CVPR. 2015.

[9] M. Kass, A. Witkin, and D. Terzopoulos. "Snakes: Active contour models". In: International
Journal of Computer Vision (1988).

[10] L. Wang, Y. Gao, F. Shi, G. Li, J. H. Gilmore, W. Lin, and D. Shen. "LINKS: Learning-based
multi-source IntegratioN frameworK for Segmentation of infant brain images". In: NeuroImage
(2015).

[11] D. C. Ciresan, A. Giusti, L. M. Gambardella, and J. Schmidhuber. "Deep Neural Networks
Segment Neuronal Membranes in Electron Microscopy Images". In: NIPS. 2012.

[12] A. Cardona, S. Saalfeld, S. Preibisch, B. Schmid, A. Cheng, J. Pulokas, P. Tomancak, and
V. Hartenstein. "An integrated micro- and macroarchitectural analysis of the Drosophila brain
by computer-assisted serial section electron microscopy". In: PLoS Biology 8.10 (2010),
e1000502.

[13] A. M. Mendrik, K. L. Vincken, H. J. Kuijf, G. J. Biessels, and M. A. Viergever (organizers).
MRBrainS Challenge: Online Evaluation Framework for Brain Image Segmentation in 3T MRI
Scans, http://mrbrains13.isi.uu.nl. 2015.

[14] O. Weber, Y. S. Devir, A. M. Bronstein, M. M. Bronstein, and R. Kimmel. "Parallel algorithms
for approximation of distance maps on parametric surfaces". In: ACM Transactions on Graphics
(2008).

[15] Segmentation of Neuronal Structures in EM Stacks Challenge. 
IEEE International Symposium
on Biomedical Imaging (ISBI), http://tinyurl.com/d2fgh7g. 2012.

[16] S. M. Pizer, E. P. Amburn, J. D. Austin, R. Cromartie, A. Geselowitz, T. Greer, B. T. H.
Romeny, and J. B. Zimmerman. "Adaptive Histogram Equalization and Its Variations". In:
Comput. Vision Graph. Image Process. (1987).

[17] T. Tieleman and G. Hinton. "Lecture 6.5-rmsprop: Divide the gradient by a running average of
its recent magnitude". In: COURSERA: Neural Networks for Machine Learning 4 (2012).

[18] S. Hochreiter. Untersuchungen zu dynamischen neuronalen Netzen. Diploma thesis, Institut für
Informatik, Lehrstuhl Prof. Brauer, Technische Universität München. Advisor: J. Schmidhuber.
1991.

[19] S. Chetlur, C. Woolley, P. Vandermersch, J. Cohen, J. Tran, B. Catanzaro, and E. Shelhamer.
"cuDNN: Efficient Primitives for Deep Learning". In: CoRR abs/1410.0759 (2014).

[20] T. Liu, C. Jones, M. Seyedhosseini, and T. Tasdizen. "A modular hierarchical approach to 3D
electron microscopy image segmentation". In: Journal of Neuroscience Methods (2014).

[21] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. "Dropout: A
Simple Way to Prevent Neural Networks from Overfitting". In: Journal of Machine Learning
Research (2014).

[22] V. Pham, T. Bluche, C. Kermorvant, and J. Louradour. "Dropout improves recurrent neural
networks for handwriting recognition". In: ICFHR. 2014.

[23] D. C. Ciresan, U. Meier, J. Masci, and J. Schmidhuber. "A Committee of Neural Networks for
Traffic Sign Classification". In: IJCNN. 2011.

[24] A. Krizhevsky, I. Sutskever, and G. E. Hinton. "ImageNet Classification with Deep Convolutional
Neural Networks". In: NIPS. 2012.

[25] M. D. Zeiler and R. Fergus. Visualizing and Understanding Convolutional Networks. Tech. rep.
arXiv:1311.2901 [cs.CV]. NYU, 2013.
", "award": [], "sourceid": 1690, "authors": [{"given_name": "Marijn", "family_name": "Stollenga", "institution": "IDSIA"}, {"given_name": "Wonmin", "family_name": "Byeon", "institution": "IDSIA"}, {"given_name": "Marcus", "family_name": "Liwicki", "institution": "TU Kaiserslautern"}, {"given_name": "J\u00fcrgen", "family_name": "Schmidhuber", "institution": null}]}