{"title": "Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 545, "page_last": 552, "abstract": "Offline handwriting recognition---the transcription of images of handwritten text---is a challenging task, in that it combines computer vision with sequence learning. In most systems the two elements are handled separately, with sophisticated preprocessing techniques used to extract the image features and sequential models such as HMMs used to provide the transcriptions. By combining two recent innovations in neural networks---multidimensional recurrent neural networks and connectionist temporal classification---this paper introduces a globally trained offline handwriting recogniser that takes raw pixel data as input. Unlike competing systems, it does not require any alphabet-specific preprocessing, and can therefore be used unchanged for any language. Evidence of its generality and power is provided by data from a recent international Arabic recognition competition, where it outperformed all entries (91.4% accuracy compared to 87.2% for the competition winner) despite the fact that neither author understands a word of Arabic.", "full_text": "Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks\n\nAlex Graves\nTU Munich, Germany\ngraves@in.tum.de\n\nJürgen Schmidhuber\nIDSIA, Switzerland and TU Munich, Germany\njuergen@idsia.ch\n\nAbstract\n\nOffline handwriting recognition—the automatic transcription of images of handwritten text—is a challenging task that combines computer vision with sequence learning. In most systems the two elements are handled separately, with sophisticated preprocessing techniques used to extract the image features and sequential models such as HMMs used to provide the transcriptions. 
By combining two recent innovations in neural networks—multidimensional recurrent neural networks and connectionist temporal classification—this paper introduces a globally trained offline handwriting recogniser that takes raw pixel data as input. Unlike competing systems, it does not require any alphabet-specific preprocessing, and can therefore be used unchanged for any language. Evidence of its generality and power is provided by data from a recent international Arabic recognition competition, where it outperformed all entries (91.4% accuracy compared to 87.2% for the competition winner) despite the fact that neither author understands a word of Arabic.\n\n1 Introduction\n\nOffline handwriting recognition is generally observed to be harder than online handwriting recognition [14]. In the online case, features can be extracted from both the pen trajectory and the resulting image, whereas in the offline case only the image is available. Nonetheless, the standard recognition process is essentially the same: a sequence of features is extracted from the data, then matched to a sequence of labels (usually characters or sub-character strokes) using either a hidden Markov model (HMM) [9] or an HMM-neural network hybrid [10].\nThe main drawback of this approach is that the input features must meet the stringent independence assumptions imposed by HMMs (these assumptions are somewhat relaxed in the case of hybrid systems, but long-range input dependencies are still problematic). In practice this means the features must be redesigned for every alphabet and, to a lesser extent, for every language. For example, it would be impossible to use the same system to recognise both English and Arabic.\nFollowing our recent success in transcribing raw online handwriting data with recurrent networks [6], we wanted to build an offline recognition system that would work on raw pixels. 
As well as being alphabet-independent, such a system would have the advantage of being globally trainable, with the image features optimised along with the classifier.\nThe online case was relatively straightforward, since the input data formed a 1D sequence that could be fed directly to a recurrent network. The long short-term memory (LSTM) network architecture [8, 3] was chosen for its ability to access long-range context, and the connectionist temporal classification [5] output layer allowed the network to transcribe the data with no prior segmentation.\n\nFigure 1: Two dimensional MDRNN. The thick lines show connections to the current point (i, j). The connections within the hidden layer plane are recurrent. The dashed lines show the scanning strips along which previous points were visited, starting at the top left corner.\n\nThe offline case, however, is more challenging, since the input is no longer one-dimensional. A naive approach would be to present the images to the network one vertical line at a time, thereby transforming them into 1D sequences. However, such a system would be unable to handle distortions along the vertical axis; for example, the same image shifted up by one pixel would appear completely different. A more flexible solution is offered by multidimensional recurrent neural networks (MDRNNs) [7]. MDRNNs, which are a special case of directed acyclic graph networks [1], generalise standard RNNs by providing recurrent connections along all spatio-temporal dimensions present in the data. These connections make MDRNNs robust to local distortions along any combination of input dimensions (e.g. image rotations and shears, which mix vertical and horizontal displacements) and allow them to model multidimensional context in a flexible way. 
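As an illustrative sketch of the scanning scheme shown in Figure 1 (not the authors' code; the function names are our own), a 2D layer visits the image in strips so that, when a point (i, j) is processed, the points one step back along each dimension have already been visited:

```python
def scan_order(height, width):
    """Yield grid points strip by strip, starting at the top left corner."""
    for i in range(height):
        for j in range(width):
            yield (i, j)

def predecessors(i, j):
    """The previously visited points whose activations feed point (i, j)."""
    preds = []
    if i > 0:
        preds.append((i - 1, j))  # one step back along the vertical dimension
    if j > 0:
        preds.append((i, j - 1))  # one step back along the horizontal dimension
    return preds
```

The three remaining scan directions (from the other three corners) follow by reversing one or both loops.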
We use multidimensional LSTM because it is able to access long-range context.\nThe problem remains, though, of how to transform two-dimensional images into one-dimensional label sequences. Our solution is to pass the data through a hierarchy of MDRNN layers, with blocks of activations gathered together after each level. The heights of the blocks are chosen to incrementally collapse the 2D images onto 1D sequences, which can then be labelled by the output layer. Such hierarchical structures are common in computer vision [15], because they allow complex features to be built up in stages. In particular, our multilayered structure is similar to that used by convolution networks [11], although it should be noted that because convolution networks are not recurrent, they cannot be used for cursive handwriting recognition without presegmented inputs.\nThe method is described in detail in Section 2, experimental results are given in Section 3, and conclusions and directions for future work are given in Section 4.\n\n2 Method\n\nThe three components of our recognition system are: (1) multidimensional recurrent neural networks, and multidimensional LSTM in particular; (2) the connectionist temporal classification output layer; and (3) the hierarchical structure. In what follows we describe each component in turn, then show how they fit together to form a complete system. For a more detailed description of (1) and (2) we refer the reader to [4].\n\n2.1 Multidimensional Recurrent Neural Networks\n\nThe basic idea of multidimensional recurrent neural networks (MDRNNs) [7] is to replace the single recurrent connection found in standard recurrent networks with as many connections as there are spatio-temporal dimensions in the data. 
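The block-gathering step of component (3), the hierarchical structure, can be sketched as follows. This is a sketch under assumed array shapes, not the system's actual code: activations of shape (H, W, F) are collapsed by concatenating the features inside each non-overlapping bh × bw block, shrinking the image towards a 1D sequence.

```python
import numpy as np

def gather_blocks(acts, bh, bw):
    """Gather non-overlapping bh x bw blocks of an (H, W, F) activation
    array into single feature vectors of shape (H/bh, W/bw, bh*bw*F)."""
    H, W, F = acts.shape
    assert H % bh == 0 and W % bw == 0, "block size must divide the image"
    out = acts.reshape(H // bh, bh, W // bw, bw, F)
    out = out.transpose(0, 2, 1, 3, 4)  # group each block's cells together
    return out.reshape(H // bh, W // bw, bh * bw * F)
```

Choosing a block height equal to the remaining image height at the final level leaves a 1 × W' map, i.e. a 1D sequence that the output layer can label.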
These connections allow the network to create a flexible internal representation of surrounding context, which is robust to localised distortions.\nAn MDRNN hidden layer scans through the input in 1D strips, storing its activations in a buffer. The strips are ordered in such a way that at every point the layer has already visited the points one step back along every dimension. The hidden activations at these previous points are fed to the current point through recurrent connections, along with the input. The 2D case is illustrated in Fig. 1.\nOne such layer is sufficient to give the network access to all context against the direction of scanning from the current point (e.g. to the top and left of (i, j) in Fig. 1). However, we usually want surrounding context in all directions. The same problem exists in 1D networks, where it is often useful to have information about the future as well as the past. The canonical 1D solution is bidirectional recurrent networks [16], where two separate hidden layers scan through the input forwards and backwards. The generalisation of bidirectional networks to n dimensions requires $2^n$ hidden layers, starting in every corner of the n-dimensional hypercube and scanning in opposite directions. For example, a 2D network has four layers, one starting in the top left and scanning down and right, one starting in the bottom left and scanning up and right, etc. All the hidden layers are connected to a single output layer, which therefore receives information about all surrounding context.\nThe error gradient of an MDRNN can be calculated with an n-dimensional extension of backpropagation through time. As in the 1D case, the data is processed in the reverse order of the forward pass, with each hidden layer receiving both the output derivatives and its own n 'future' derivatives at every timestep.\nLet $a^p_j$ and $b^p_j$ be respectively the input and activation of unit $j$ at point $p = (p_1, \ldots, p_n)$ in an $n$-dimensional input sequence $x$ with dimensions $(D_1, \ldots, D_n)$. Let $p^-_d = (p_1, \ldots, p_d - 1, \ldots, p_n)$ and $p^+_d = (p_1, \ldots, p_d + 1, \ldots, p_n)$. Let $w_{ij}$ and $w^d_{ij}$ be respectively the weight of the feedforward connection from unit $i$ to unit $j$ and the recurrent connection from $i$ to $j$ along dimension $d$. Let $\theta_h$ be the activation function of hidden unit $h$, and for some unit $j$ and some differentiable objective function $O$ let $\delta^p_j = \frac{\partial O}{\partial a^p_j}$. Then the forward and backward equations for an $n$-dimensional MDRNN with $I$ input units, $K$ output units, and $H$ hidden summation units are as follows:\n\nForward Pass\n\n$$a^p_h = \sum_{i=1}^{I} x^p_i w_{ih} + \sum_{\substack{d=1:\\ p_d > 0}}^{n} \sum_{\hat{h}=1}^{H} b^{p^-_d}_{\hat{h}} w^d_{\hat{h}h}, \qquad b^p_h = \theta_h(a^p_h)$$\n\nBackward Pass\n\n$$\delta^p_h = \theta'_h(a^p_h) \left( \sum_{k=1}^{K} \delta^p_k w_{hk} + \sum_{\substack{d=1:\\ p_d < D_d - 1}}^{n} \sum_{\hat{h}=1}^{H} \delta^{p^+_d}_{\hat{h}} w^d_{h\hat{h}} \right)$$
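The forward pass can be sketched for the 2D case with a plain tanh hidden layer standing in for LSTM; this is a minimal illustration under assumed weight shapes, not the system's implementation.

```python
import numpy as np

def mdrnn_forward(x, W_in, W_rec):
    """Forward pass of one 2D MDRNN hidden layer with tanh units.

    x     : (H, W, I) input image
    W_in  : (I, Hn) feedforward weights
    W_rec : two (Hn, Hn) recurrent weight matrices, one per dimension
    Returns hidden activations b of shape (H, W, Hn).
    """
    H, W, _ = x.shape
    Hn = W_in.shape[1]
    b = np.zeros((H, W, Hn))
    for i in range(H):        # scan strip by strip from the top left, so
        for j in range(W):    # both predecessor points are already computed
            a = x[i, j] @ W_in
            if i > 0:
                a = a + b[i - 1, j] @ W_rec[0]  # context along dimension 1
            if j > 0:
                a = a + b[i, j - 1] @ W_rec[1]  # context along dimension 2
            b[i, j] = np.tanh(a)
    return b
```

The other three scan directions of the full bidirectional network use the same code with the loops reversed, and the backward pass mirrors this loop in the opposite order.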