{"title": "Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 545, "page_last": 552, "abstract": "Offline handwriting recognition---the transcription of images of handwritten text---is an interesting task, in that it combines computer vision with sequence learning. In most systems the two elements are handled separately, with sophisticated preprocessing techniques used to extract the image features and sequential models such as HMMs used to provide the transcriptions. By combining two recent innovations in neural networks---multidimensional recurrent neural networks and connectionist temporal classification---this paper introduces a globally trained offline handwriting recogniser that takes raw pixel data as input. Unlike competing systems, it does not require any alphabet specific preprocessing, and can therefore be used unchanged for any language. Evidence of its generality and power is provided by data from a recent international Arabic recognition competition, where it outperformed all entries (91.4% accuracy compared to 87.2% for the competition winner) despite the fact that neither author understands a word of Arabic.", "full_text": "Of\ufb02ine Handwriting Recognition with Multidimensional Recurrent Neural Networks\n\nAlex Graves\n\nTU Munich, Germany\ngraves@in.tum.de\n\nJ\u00fcrgen Schmidhuber\n\nIDSIA, Switzerland and TU Munich, Germany\n\njuergen@idsia.ch\n\nAbstract\n\nOf\ufb02ine handwriting recognition\u2014the automatic transcription of images of handwritten\ntext\u2014is a challenging task that combines computer vision with sequence\nlearning. In most systems the two elements are handled separately, with sophisticated\npreprocessing techniques used to extract the image features and sequential\nmodels such as HMMs used to provide the transcriptions. 
By combining two re-\ncent innovations in neural networks\u2014multidimensional recurrent neural networks\nand connectionist temporal classi\ufb01cation\u2014this paper introduces a globally trained\nof\ufb02ine handwriting recogniser that takes raw pixel data as input. Unlike competing\nsystems, it does not require any alphabet speci\ufb01c preprocessing, and can therefore\nbe used unchanged for any language. Evidence of its generality and power is pro-\nvided by data from a recent international Arabic recognition competition, where it\noutperformed all entries (91.4% accuracy compared to 87.2% for the competition\nwinner) despite the fact that neither author understands a word of Arabic.\n\n1 Introduction\n\nOf\ufb02ine handwriting recognition is generally observed to be harder than online handwriting recogni-\ntion [14]. In the online case, features can be extracted from both the pen trajectory and the resulting\nimage, whereas in the of\ufb02ine case only the image is available. Nonetheless, the standard recognition\nprocess is essentially the same: a sequence of features are extracted from the data, then matched to a\nsequence of labels (usually characters or sub-character strokes) using either a hidden Markov model\n(HMM) [9] or an HMM-neural network hybrid [10].\nThe main drawback of this approach is that the input features must meet the stringent independence\nassumptions imposed by HMMs (these assumptions are somewhat relaxed in the case of hybrid\nsystems, but long-range input dependencies are still problematic). In practice this means the features\nmust be redesigned for every alphabet, and, to a lesser extent, for every language. For example it\nwould be impossible to use the same system to recognise both English and Arabic.\nFollowing our recent success in transcribing raw online handwriting data with recurrent net-\nworks [6], we wanted to build an of\ufb02ine recognition system that would work on raw pixels. 
As well\nas being alphabet-independent, such a system would have the advantage of being globally trainable,\nwith the image features optimised along with the classi\ufb01er.\nThe online case was relatively straightforward, since the input data formed a 1D sequence that could\nbe fed directly to a recurrent network. The long short-term memory (LSTM) network architecture\n[8, 3] was chosen for its ability to access long-range context, and the connectionist temporal\nclassi\ufb01cation [5] output layer allowed the network to transcribe the data with no prior segmentation.\n\nFigure 1: Two dimensional MDRNN. The thick lines show connections to the current point (i, j).\nThe connections within the hidden layer plane are recurrent. The dashed lines show the scanning\nstrips along which previous points were visited, starting at the top left corner.\n\nThe of\ufb02ine case, however, is more challenging, since the input is no longer one-dimensional. A\nnaive approach would be to present the images to the network one vertical line at a time, thereby\ntransforming them into 1D sequences. However such a system would be unable to handle distortions\nalong the vertical axis; for example the same image shifted up by one pixel would appear\ncompletely different. A more \ufb02exible solution is offered by multidimensional recurrent neural networks\n(MDRNNs) [7]. MDRNNs, which are a special case of directed acyclic graph networks [1],\ngeneralise standard RNNs by providing recurrent connections along all spatio-temporal dimensions\npresent in the data. These connections make MDRNNs robust to local distortions along any combination\nof input dimensions (e.g. image rotations and shears, which mix vertical and horizontal\ndisplacements) and allow them to model multidimensional context in a \ufb02exible way. 
We use multi-\ndimensional LSTM because it is able to access long-range context.\nThe problem remains, though, of how to transform two-dimensional images into one-dimensional\nlabel sequences. Our solution is to pass the data through a hierarchy of MDRNN layers, with\nblocks of activations gathered together after each level. The heights of the blocks are chosen to\nincrementally collapse the 2D images onto 1D sequences, which can then be labelled by the output\nlayer. Such hierarchical structures are common in computer vision [15], because they allow complex\nfeatures to be built up in stages. In particular our multilayered structure is similar to that used by\nconvolution networks [11], although it should be noted that because convolution networks are not\nrecurrent, they cannot be used for cursive handwriting recognition without presegmented inputs.\nThe method is described in detail in Section 2, experimental results are given in Section 3, and\nconclusions and directions for future work are given in Section 4.\n\n2 Method\n\nThe three components of our recognition system are: (1) multidimensional recurrent neural net-\nworks, and multidimensional LSTM in particular; (2) the connectionist temporal classi\ufb01cation out-\nput layer; and (3) the hierarchical structure. In what follows we describe each component in turn,\nthen show how they \ufb01t together to form a complete system. For a more detailed description of (1)\nand (2) we refer the reader to [4]\n\n2.1 Multidimensional Recurrent Neural Networks\n\nThe basic idea of multidimensional recurrent neural networks (MDRNNs) [7] is to replace the single\nrecurrent connection found in standard recurrent networks with as many connections as there are\nspatio-temporal dimensions in the data. 
These connections allow the network to create a \ufb02exible\ninternal representation of surrounding context, which is robust to localised distortions.\nAn MDRNN hidden layer scans through the input in 1D strips, storing its activations in a buffer. The\nstrips are ordered in such a way that at every point the layer has already visited the points one step\nback along every dimension. The hidden activations at these previous points are fed to the current\npoint through recurrent connections, along with the input. The 2D case is illustrated in Fig. 1.\nOne such layer is suf\ufb01cient to give the network access to all context against the direction of scanning\nfrom the current point (e.g. to the top and left of (i, j) in Fig. 1). However we usually want\nsurrounding context in all directions. The same problem exists in 1D networks, where it is often\nuseful to have information about the future as well as the past. The canonical 1D solution is\nbidirectional recurrent networks [16], where two separate hidden layers scan through the input forwards\nand backwards. The generalisation of bidirectional networks to n dimensions requires $2^n$ hidden\nlayers, starting in every corner of the n dimensional hypercube and scanning in opposite directions.\nFor example, a 2D network has four layers, one starting in the top left and scanning down and right,\none starting in the bottom left and scanning up and right, etc. All the hidden layers are connected to\na single output layer, which therefore receives information about all surrounding context.\nThe error gradient of an MDRNN can be calculated with an n-dimensional extension of backpropagation\nthrough time. As in the 1D case, the data is processed in the reverse order of the forward\npass, with each hidden layer receiving both the output derivatives and its own n \u2018future\u2019 derivatives\nat every timestep.\nLet $a^p_j$ and $b^p_j$ be respectively the input and activation of unit j at point $p = (p_1, \\ldots, p_n)$ in an\nn-dimensional input sequence x with dimensions $(D_1, \\ldots, D_n)$. Let $p^-_d = (p_1, \\ldots, p_d - 1, \\ldots, p_n)$\nand $p^+_d = (p_1, \\ldots, p_d + 1, \\ldots, p_n)$. Let $w_{ij}$ and $w^d_{ij}$ be respectively the weight of the feedforward\nconnection from unit i to unit j and the recurrent connection from i to j along dimension d. Let $\\theta_h$\nbe the activation function of hidden unit h, and for some unit j and some differentiable objective\nfunction O let $\\delta^p_j = \\partial O / \\partial a^p_j$. Then the forward and backward equations for an n-dimensional MDRNN\nwith I input units, K output units, and H hidden summation units are as follows:\n\nForward Pass\n\n$$a^p_h = \\sum_{i=1}^{I} x^p_i w_{ih} + \\sum_{d=1:\\, p_d>0}^{n} \\sum_{\\hat{h}=1}^{H} b^{p^-_d}_{\\hat{h}} w^d_{\\hat{h}h}$$\n\n$$b^p_h = \\theta_h(a^p_h)$$\n\nBackward Pass\n\n$$\\delta^p_h = \\theta'_h(a^p_h) \\left( \\sum_{k=1}^{K} \\delta^p_k w_{hk} + \\sum_{d=1:\\, p_d<D_d-1}^{n} \\sum_{\\hat{h}=1}^{H} \\delta^{p^+_d}_{\\hat{h}} w^d_{h\\hat{h}} \\right)$$\n\n2.1.1 Multidimensional LSTM\n\nLong Short-Term Memory (LSTM) [8, 3] is an RNN architecture designed for data with long-range\ninterdependencies. An LSTM layer consists of recurrently connected \u2018memory cells\u2019, whose activations\nare controlled by three multiplicative gate units: the input gate, forget gate and output gate. The\ngates allow the cells to store and retrieve information over time, giving them access to long-range\ncontext.\nThe standard formulation of LSTM is explicitly one-dimensional, since each cell contains a single\nrecurrent connection, whose activation is controlled by a single forget gate. However we can extend\nthis to n dimensions by using instead n recurrent connections (one for each of the cell\u2019s previous\nstates along every dimension) with n forget gates.\nConsider an MDLSTM memory cell in a hidden layer of H cells, connected to I input units and K\noutput units. 
The subscripts c, $\\iota$, $\\phi$ and $\\omega$ refer to the cell, input gate, forget gate and output gate\nrespectively. $b^p_h$ is the output of cell h in the hidden layer at point p in the input sequence, and $s^p_c$ is\nthe state of cell c at p. $f_1$ is the activation function of the gates, and $f_2$ and $f_3$ are respectively the\ncell input and output activation functions. The suf\ufb01x $\\phi, d$ denotes the forget gate corresponding to\nrecurrent connection d. The input gate $\\iota$ is connected to previous cell c along all dimensions with\nthe same weight ($w_{c\\iota}$) whereas the forget gates are connected to cell c with a separate weight $w_{c(\\phi,d)}$\nfor each dimension d. Then the forward and backward equations are as follows:\n\nForward Pass\n\nInput Gate: $b^p_\\iota = f_1\\left( \\sum_{i=1}^{I} x^p_i w_{i\\iota} + \\sum_{d=1:\\, p_d>0}^{n} \\left( w_{c\\iota} s^{p^-_d}_c + \\sum_{h=1}^{H} b^{p^-_d}_h w^d_{h\\iota} \\right) \\right)$\n\nForget Gate: $b^p_{\\phi,d} = f_1\\left( \\sum_{i=1}^{I} x^p_i w_{i(\\phi,d)} + \\sum_{d'=1:\\, p_{d'}>0}^{n} \\sum_{h=1}^{H} b^{p^-_{d'}}_h w^{d'}_{h(\\phi,d)} + \\begin{cases} w_{c(\\phi,d)} s^{p^-_d}_c & \\text{if } p_d > 0 \\\\ 0 & \\text{otherwise} \\end{cases} \\right)$\n\nCell: $a^p_c = \\sum_{i=1}^{I} x^p_i w_{ic} + \\sum_{d=1:\\, p_d>0}^{n} \\sum_{h=1}^{H} b^{p^-_d}_h w^d_{hc}$\n\nState: $s^p_c = b^p_\\iota f_2(a^p_c) + \\sum_{d=1:\\, p_d>0}^{n} s^{p^-_d}_c b^p_{\\phi,d}$\n\nOutput Gate: $b^p_\\omega = f_1\\left( \\sum_{i=1}^{I} x^p_i w_{i\\omega} + \\sum_{d=1:\\, p_d>0}^{n} \\sum_{h=1}^{H} b^{p^-_d}_h w^d_{h\\omega} + w_{c\\omega} s^p_c \\right)$\n\nCell Output: $b^p_c = b^p_\\omega f_3(s^p_c)$\n\nBackward Pass\n\nCell Output: $\\epsilon^p_c \\overset{\\text{def}}{=} \\frac{\\partial O}{\\partial b^p_c} = \\sum_{k=1}^{K} \\delta^p_k w_{ck} + \\sum_{d=1:\\, p_d<D_d-1}^{n} \\sum_{h=1}^{H} \\delta^{p^+_d}_h w^d_{ch}$\n\nOutput Gate: $\\delta^p_\\omega = f'_1(a^p_\\omega) \\epsilon^p_c f_3(s^p_c)$\n\nState: $\\epsilon^p_s \\overset{\\text{def}}{=} \\frac{\\partial O}{\\partial s^p_c} = b^p_\\omega f'_3(s^p_c) \\epsilon^p_c + \\delta^p_\\omega w_{c\\omega} + \\sum_{d=1:\\, p_d<D_d-1}^{n} \\left( \\epsilon^{p^+_d}_s b^{p^+_d}_{\\phi,d} + \\delta^{p^+_d}_\\iota w_{c\\iota} + \\delta^{p^+_d}_{\\phi,d} w_{c(\\phi,d)} \\right)$\n\nCell: $\\delta^p_c = b^p_\\iota f'_2(a^p_c) \\epsilon^p_s$\n\nForget Gate: $\\delta^p_{\\phi,d} = \\begin{cases} f'_1(a^p_{\\phi,d}) s^{p^-_d}_c \\epsilon^p_s & \\text{if } p_d > 0 \\\\ 0 & \\text{otherwise} \\end{cases}$\n\nInput Gate: $\\delta^p_\\iota = f'_1(a^p_\\iota) f_2(a^p_c) \\epsilon^p_s$\n\n2.2 Connectionist Temporal Classi\ufb01cation\n\nConnectionist temporal classi\ufb01cation (CTC) [5] is an output layer designed for sequence labelling\nwith RNNs. Unlike other neural network output layers it does not require pre-segmented training\ndata, or postprocessing to transform its outputs into transcriptions. Instead, it trains the network to\ndirectly estimate the conditional probabilities of the possible labellings given the input sequences.\nA CTC output layer contains one more unit than there are elements in the alphabet L of labels for the\ntask. The output activations are normalised at each timestep with the softmax activation function [2].\nThe \ufb01rst |L| outputs estimate the probabilities of observing the corresponding labels at that time, and\nthe extra output estimates the probability of observing a \u2018blank\u2019, or no label. The combined output\nsequence estimates the joint probability of all possible alignments of the input sequence with all\nsequences of labels and blanks. The probability of a particular labelling can then be estimated by\nsumming over the probabilities of all the alignments that correspond to it.\nMore precisely, for a length T input sequence x, the CTC outputs de\ufb01ne a probability distribution\nover the set $L'^T$ of length T sequences over the alphabet $L' = L \\cup \\{\\text{blank}\\}$. To distinguish them\nfrom labellings, we refer to the elements of $L'^T$ as paths. 
Since the probabilities of the labels at\neach timestep are conditionally independent given x, the conditional probability of a path $\\pi \\in L'^T$\nis given by $p(\\pi|x) = \\prod_{t=1}^{T} y^t_{\\pi_t}$, where $y^t_k$ is the activation of output unit k at time t.\nPaths are mapped onto labellings $l \\in L^{\\leq T}$ by an operator B that removes \ufb01rst the repeated labels,\nthen the blanks. So for example, both B(a, \u2212, a, b, \u2212) and B(\u2212, a, a, \u2212, \u2212, a, b, b) yield the labelling\n(a, a, b). Since the paths are mutually exclusive, the conditional probability of some labelling $l \\in L^{\\leq T}$\nis the sum of the probabilities of all paths corresponding to it: $p(l|x) = \\sum_{\\pi \\in B^{-1}(l)} p(\\pi|x)$.\nAlthough a naive calculation of this sum is infeasible, it can be ef\ufb01ciently evaluated with a dynamic\nprogramming algorithm, similar to the forward-backward algorithm for HMMs.\nTo allow for blanks in the output paths, for each labelling $l \\in L^{\\leq T}$ consider a modi\ufb01ed labelling\n$l' \\in L'^{\\leq T}$, with blanks added to the beginning and the end and inserted between every pair of labels.\nThe length $|l'|$ of $l'$ is therefore $2|l| + 1$.\nFor a labelling l, de\ufb01ne the forward variable $\\alpha_t(s)$ as the summed probability of all path beginnings\nreaching index s of $l'$ at time t, and the backward variable $\\beta_t(s)$ as the summed probability of all\npath endings that would complete the labelling l if the path beginning had reached s at time t. Both\nthe forward and backward variables are calculated recursively [5]. The label sequence probability\nis given by the sum of the products of the forward and backward variables at any timestep, i.e.\n\n$$p(l|x) = \\sum_{s=1}^{|l'|} \\alpha_t(s) \\beta_t(s).$$\n\nLet S be a training set, consisting of pairs of input and target sequences (x, z), where $|z| \\leq |x|$.\nThen the objective function O for CTC is the negative log probability of the network correctly\nlabelling all of S: $O = -\\sum_{(x,z) \\in S} \\ln p(z|x)$. The network can be trained with gradient descent by\n\ufb01rst differentiating O with respect to the outputs, then using backpropagation through time to \ufb01nd\nthe derivatives with respect to the weights.\nNote that the same label (or blank) may be repeated several times for a single labelling l. We de\ufb01ne\nthe set of positions where label k occurs as $lab(l, k) = \\{s : l'_s = k\\}$, which may be empty.\nSetting l = z and differentiating O with respect to the network outputs, we obtain:\n\n$$\\frac{\\partial O}{\\partial a^t_k} = -\\frac{\\partial \\ln p(z|x)}{\\partial a^t_k} = y^t_k - \\frac{1}{p(z|x)} \\sum_{s \\in lab(z,k)} \\alpha_t(s) \\beta_t(s),$$\n\nwhere $a^t_k$ and $y^t_k$ are respectively the input and output of CTC unit k at time t for some $(x, z) \\in S$.\nOnce the network is trained, we can label some unknown input sequence x by choosing the labelling\n$l^*$ with the highest conditional probability, i.e. $l^* = \\arg\\max_l p(l|x)$. In cases where a dictionary\nis used, the labelling can be constrained to yield only sequences of complete words by using the\nCTC token passing algorithm [6]. For the experiments in this paper, the labellings were further\nconstrained to give single word sequences only, and the ten most probable words were recorded.\n\n2.3 Network Hierarchy\n\nMany computer vision systems use a hierarchical approach to feature extraction, with the features\nat each level used as input to the next level [15]. This allows complex visual properties to be built\nup in stages. Typically, such systems use subsampling, with the feature resolution decreased at each\nstage. 
They also generally have more features at the higher levels. The basic idea is to progress from\na small number of simple local features to a large number of complex global features.\nWe created a hierarchical structure by repeatedly composing MDLSTM layers with feedforward\nlayers. The basic procedure is as follows: (1) the image is divided into small pixel blocks, each of\nwhich is presented as a single input to the \ufb01rst set of MDLSTM layers (e.g. a 4x3 block is reduced\nto a length 12 vector). If the image does not divide exactly into blocks, it is padded with zeros.\n(2) the four MDLSTM layers scan through the pixel blocks in all directions. (3) the activations of\nthe MDLSTM layers are collected into blocks. (4) these blocks are given as input to a feedforward\nlayer. Note that all the layers have a 2D array of activations: e.g. a 10 unit feedforward layer with\ninput from a 5x5 array of MDLSTM blocks has a total of 250 activations.\nThe above process is repeated as many times as required, with the activations of the feedforward\nlayer taking the place of the original image. The purpose of the blocks is twofold: to collect local\ncontextual information, and to reduce the area of the activation arrays. In particular, we want to\nreduce the vertical dimension, since the CTC output layer requires a 1D sequence as input. Note\nthat the blocks themselves do not reduce the overall amount of data; that is done by the layers that\nprocess them, which are therefore analogous to the subsampling steps in other approaches (although\nwith trainable weights rather than a \ufb01xed subsampling function).\nFor most tasks we \ufb01nd that a hierarchy of three MDLSTM/feedforward stages gives the best results.\nWe use the standard \u2018inverted pyramid\u2019 structure, with small layers at the bottom and large layers at\nthe top. 
As well as allowing for more features at higher levels, this leads to ef\ufb01cient networks, since\nmost of the weights are concentrated in the upper layers, which have a smaller input area.\nIn general we cannot assume that the input images are of \ufb01xed size. Therefore it is dif\ufb01cult to choose\nblock heights that ensure that the \ufb01nal activation array will always be one-dimensional, as required\nby CTC. A simple solution is to collapse the \ufb01nal array by summing over all the inputs in each\nvertical line, i.e. the input at time t to CTC unit k is given by $a^t_k = \\sum_x a^{(x,t)}_k$, where $a^{(x,y)}_k$ is the\nuncollapsed input to unit k at point (x, y) in the \ufb01nal array.\n\nFigure 2: The complete recognition system. First the input image is collected into boxes 3 pixels\nwide and 4 pixels high which are then scanned by four MDLSTM layers. The activations of the cells\nin each layer are displayed separately, and the arrows in the corners indicate the scanning direction.\nNext the MDLSTM activations are gathered into 4 x 3 boxes and fed to a feedforward layer of tanh\nsummation units. This process is repeated two more times, until the \ufb01nal MDLSTM activations are\ncollapsed to a 1D sequence and transcribed by the CTC layer. In this case all characters are correctly\nlabelled except the second last one, and the correct town name is chosen from the dictionary.\n\n3 Experiments\n\nTo see how our method compared to the state of the art, we applied it to data from the ICDAR\n2007 Arabic handwriting recognition competition [12]. Although we were too late to enter the\ncompetition itself, the organisers kindly agreed to evaluate our system according to the competition\ncriteria. We did not receive the test data at any point, and all evaluations were carried out by them.\nThe goal of the competition was to identify the postcodes of Tunisian town and village names. 
The\nnames are presented individually, so it is an isolated word recognition task. However we would\nlike to point out that our system is equally applicable to unconstrained handwriting, and has been\nsuccessfully applied to complete lines of English text.\n\n3.1 Data\n\nThe competition was based on the IFN/ENIT database of handwritten Arabic words [13]. The\npublicly available data consists of 32,492 images of handwritten Tunisian town names, of which\nwe used 30,000 for training, and 2,492 for validation. The images were extracted from arti\ufb01cial\nforms \ufb01lled in by over 400 Tunisian people. The forms were designed to simulate writing on a\nletter, and contained no lines or boxes to constrain the writing style.\n\nTable 1: Results on the ICDAR 2007 Arabic handwriting recognition contest. All scores are\npercentages of correctly identi\ufb01ed postcodes. The systems are ordered by the \u2018top 1\u2019 results on test\nset \u2018f\u2019. The best score in each column is shown in bold.\n\nSYSTEM      SET f top 1   SET f top 5   SET f top 10  SET s top 1   SET s top 5   SET s top 10\nCACI-3      14.28         29.88         37.91         10.68         21.74         30.20\nCACI-2      15.79         21.34         22.33         14.24         19.39         20.53\nCEDAR       59.01         78.76         83.70         41.32         61.98         69.87\nMITRE       61.70         81.61         85.69         49.91         70.50         76.48\nUOB-ENST-1  79.10         87.69         90.21         64.97         78.39         82.20\nPARIS V     80.18         91.09         92.98         64.38         78.12         82.13\nICRA        81.47         90.07         92.15         72.22         82.84         86.27\nUOB-ENST-2  81.65         90.81         92.35         69.61         83.79         85.89\nUOB-ENST-4  81.81         88.71         90.40         70.57         79.85         83.34\nUOB-ENST-3  81.93         91.20         92.76         69.93         84.11         87.03\nSIEMENS-1   82.77         92.37         93.92         68.09         81.70         85.19\nMIE         83.34         91.67         93.48         68.40         80.93         83.73\nSIEMENS-2   87.22         94.05         95.42         73.94         85.44         88.18\nOurs        91.43         96.12         96.75         78.83         88.00         91.05\n\nEach image was supplied with a ground truth transcription for the individual characters (see footnote 1). There were\n120 distinct characters in total. 
A list of 937 town names and postcodes was provided. Many of the\ntown names had transcription variants, giving a total of 1,518 entries in the complete dictionary.\nThe test data (which is not published) was divided into sets \u2018f\u2019 and \u2018s\u2019. The main competition results\nwere based on set \u2018f\u2019. Set \u2018s\u2019 contains data collected in the United Arab Emirates using the same\nforms; its purpose was to test the robustness of the recognisers to regional writing variations. The\nsystems were allowed to choose up to 10 postcodes for each image, in order of preference. The test\nset performance using the top 1, top 5, and top 10 answers was recorded by the organisers.\n\n3.2 Network Parameters\n\nThe structure shown in Figure 2 was used, with each layer fully connected to the next layer in the\nhierarchy, all MDLSTM layers connected to themselves, and all units connected to a bias weight.\nThere were 159,369 weights in total. This may sound like a lot, but as mentioned in Section 2.3, the\n\u2018inverted pyramid\u2019 structure greatly reduces the actual number of weight operations. In effect the\nhigher up networks (where the vast majority of the weights are concentrated) are processing much\nsmaller images than those lower down. The squashing function for the gates was the logistic sigmoid\n$f_1(x) = 1/(1 + e^{-x})$, while tanh was used for $f_2$ and $f_3$. Each pass through the training set took\nabout an hour on a desktop computer, and the network converged after 85 passes.\nThe complete system was trained with online gradient descent, using a learning rate of $10^{-4}$ and\na momentum of 0.9. The character error rate was evaluated on the validation set after every pass\nthrough the training set, and training was stopped after 50 evaluations with no improvement. 
The\nweights giving the lowest error rate on the validation set were passed to the competition organisers\nfor assessment on the test sets.\n\n3.3 Results\n\nTable 1 clearly shows that our system outperformed all entries in the 2007 ICDAR Arabic recognition\ncontest. The other systems, most of which are based on hidden Markov models, are identi\ufb01ed\nby the names of the groups that submitted them (see [12] for more information).\n\nFootnote 1: At \ufb01rst we forgot that Arabic reads right to left and presented the transcriptions backwards. The system\nperformed surprisingly well, with a character error rate of 17.8%, compared to 10.7% for the correct targets.\n\n4 Conclusions and Future Work\n\nWe have combined multidimensional LSTM with connectionist temporal classi\ufb01cation and a hierarchical\nlayer structure to create a powerful of\ufb02ine handwriting recogniser. The system is very general,\nand has been successfully applied to English as well as Arabic. Indeed, since the dimensionality of\nthe networks can be changed to match that of the data, it could in principle be used for almost any\nsupervised sequence labelling task.\n\nAcknowledgements\n\nWe would like to thank Haikal El Abed for giving us access to the ICDAR competition data, and\nfor persisting in the face of technical despair to install and evaluate our software. This work was\nsupported by the excellence cluster \u201cCognition for Technical Systems\u201d (CoTeSys) from the German\nResearch Foundation (DFG).\n\nReferences\n[1] P. Baldi and G. Pollastri. The principled design of large-scale recursive neural network architectures\u2013DAG-RNNs and the protein structure prediction problem. J. Mach. Learn. Res., 4:575\u2013602, 2003.\n[2] J. S. Bridle. Probabilistic interpretation of feedforward classi\ufb01cation network outputs, with relationships to statistical pattern recognition. In F. 
Fogelman-Souli\u00e9 and J. H\u00e9rault, editors, Neurocomputing: Algorithms, Architectures and Applications, pages 227\u2013236. Springer-Verlag, 1990.\n[3] F. Gers, N. Schraudolph, and J. Schmidhuber. Learning precise timing with LSTM recurrent networks. Journal of Machine Learning Research, 3:115\u2013143, 2002.\n[4] A. Graves. Supervised Sequence Labelling with Recurrent Neural Networks. PhD thesis.\n[5] A. Graves, S. Fern\u00e1ndez, F. Gomez, and J. Schmidhuber. Connectionist temporal classi\ufb01cation: Labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the International Conference on Machine Learning, ICML 2006, Pittsburgh, USA, 2006.\n[6] A. Graves, S. Fern\u00e1ndez, M. Liwicki, H. Bunke, and J. Schmidhuber. Unconstrained online handwriting recognition with recurrent neural networks. In J. Platt, D. Koller, Y. Singer, and S. Roweis, editors, Advances in Neural Information Processing Systems 20. MIT Press, Cambridge, MA, 2008.\n[7] A. Graves, S. Fern\u00e1ndez, and J. Schmidhuber. Multidimensional recurrent neural networks. In Proceedings of the 2007 International Conference on Arti\ufb01cial Neural Networks, Porto, Portugal, September 2007.\n[8] S. Hochreiter and J. Schmidhuber. Long Short-Term Memory. Neural Computation, 9(8):1735\u20131780, 1997.\n[9] J. Hu, S. G. Lim, and M. K. Brown. Writer independent on-line handwriting recognition using an HMM approach. Pattern Recognition, 33:133\u2013147, 2000.\n[10] S. Jaeger, S. Manke, J. Reichert, and A. Waibel. On-line handwriting recognition: the NPen++ recognizer. International Journal on Document Analysis and Recognition, 3:169\u2013180, 2001.\n[11] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278\u20132324, November 1998.\n[12] V. M\u00e4rgner and H. El Abed. Arabic handwriting recognition competition. 
In ICDAR \u201907: Proceedings of the Ninth International Conference on Document Analysis and Recognition (ICDAR 2007) Vol 2, pages 1274\u20131278, Washington, DC, USA, 2007. IEEE Computer Society.\n[13] M. Pechwitz, S. S. Maddouri, V. M\u00e4rgner, N. Ellouze, and H. Amiri. IFN/ENIT-database of handwritten arabic words. In 7th Colloque International Francophone sur l\u2019Ecrit et le Document (CIFED 2002), Hammamet, Tunis, 2002.\n[14] R. Plamondon and S. N. Srihari. On-line and off-line handwriting recognition: a comprehensive survey. IEEE Transactions on Pattern Analysis and Machine Intelligence, 2000.\n[15] M. Riesenhuber and T. Poggio. Hierarchical models of object recognition in cortex. Nature Neuroscience, 2(11):1019\u20131025, 1999.\n[16] M. Schuster and K. K. Paliwal. Bidirectional recurrent neural networks. IEEE Transactions on Signal Processing, 45:2673\u20132681, November 1997.\n", "award": [], "sourceid": 431, "authors": [{"given_name": "Alex", "family_name": "Graves", "institution": null}, {"given_name": "J\u00fcrgen", "family_name": "Schmidhuber", "institution": null}]}