{"title": "Joint Line Segmentation and Transcription for End-to-End Handwritten Paragraph Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 838, "page_last": 846, "abstract": "Offline handwriting recognition systems require cropped text line images for both training and recognition. On the one hand, the annotation of position and transcript at line level is costly to obtain. On the other hand, automatic line segmentation algorithms are prone to errors, compromising the subsequent recognition.  In this paper, we propose a modification of the popular and efficient Multi-Dimensional Long Short-Term Memory Recurrent Neural Networks (MDLSTM-RNNs) to enable end-to-end processing of handwritten paragraphs. More particularly, we replace the collapse layer transforming the two-dimensional representation into a sequence of predictions by a recurrent version which can select one line at a time.  In the proposed model, a neural network performs a kind of implicit line segmentation by computing attention weights on the image representation. The experiments on paragraphs of Rimes and IAM databases yield results that are competitive with those of networks trained at line level, and constitute a significant step towards end-to-end transcription of full documents.", "full_text": "Joint Line Segmentation and Transcription for\nEnd-to-End Handwritten Paragraph Recognition\n\nTh\u00e9odore Bluche\n\nA2iA SAS\n\n39 rue de la Bienfaisance\n\n75008 Paris\ntb@a2ia.com\n\nAbstract\n\nOf\ufb02ine handwriting recognition systems require cropped text line images for both\ntraining and recognition. On the one hand, the annotation of position and transcript\nat line level is costly to obtain. On the other hand, automatic line segmentation\nalgorithms are prone to errors, compromising the subsequent recognition. In this\npaper, we propose a modi\ufb01cation of the popular and ef\ufb01cient Multi-Dimensional\nLong Short-Term Memory Recurrent Neural Networks (MDLSTM-RNNs) to\nenable end-to-end processing of handwritten paragraphs. More particularly, we\nreplace the collapse layer transforming the two-dimensional representation into\na sequence of predictions by a recurrent version which can select one line at a\ntime. In the proposed model, a neural network performs a kind of implicit line\nsegmentation by computing attention weights on the image representation. The\nexperiments on paragraphs of Rimes and IAM databases yield results that are\ncompetitive with those of networks trained at line level, and constitute a signi\ufb01cant\nstep towards end-to-end transcription of full documents.\n\n1\n\nIntroduction\n\nOf\ufb02ine handwriting recognition consists in recognizing a sequence of characters in an image of\nhandwritten text. Unlike printed texts, images of handwriting are dif\ufb01cult to segment into characters.\nEarly methods tried to compute segmentation hypotheses for characters, for example by performing a\nheuristic over-segmentation, followed by a scoring of groups of segments (e.g. in [4]). In the nineties,\nthis kind of approach was progressively replaced by segmentation-free methods, where a whole\nword image is fed to a system providing a sequence of scores. A lexicon constrains a decoding step,\nallowing to retrieve the character sequence. Some examples are the sliding window approach [25], in\nwhich features are extracted from vertical frames of the line image, or space-displacement neural\nnetworks [4]. 
In the last decade, word segmentation was abandoned in favor of complete text line recognition with statistical language models [10].

Nowadays, the state-of-the-art handwriting recognition systems are Multi-Dimensional Long Short-Term Memory Recurrent Neural Networks (MDLSTM-RNNs [18]), which consider the whole image, alternating MDLSTM layers and convolutional layers. The transformation of the 2D structure into a sequence is computed by a simple collapse layer summing the activations along the vertical axis. Connectionist Temporal Classification (CTC [17]) makes it possible to train the network to both align and recognize sequences of characters. These models have become very popular and won the recent evaluations of handwriting recognition [9, 34, 37].

However, current models still need segmented text lines, and full document processing pipelines should include automatic line segmentation algorithms. Although the segmentation of documents into lines is assumed in most descriptions of handwriting recognition systems, several papers or surveys state that it is a crucial step for handwriting text recognition systems [8, 28]. The need for line segmentation to train the recognition system has also motivated several efforts to map a paragraph-level or page-level transcript to line positions in the image (e.g. recently [7, 16]).

Handwriting recognition systems evolved from character segmentation to word segmentation, and nowadays to complete line processing. The performance has always improved by making fewer segmentation hypotheses. In this paper, we pursue this trend. We propose a model for multi-line recognition based on the popular MDLSTM-RNNs, augmented with an attention mechanism inspired by recent models for machine translation [3], image caption generation [38], or speech recognition [11, 12]. In the proposed model, the "collapse" layer is modified with an attention network, providing weights to modulate the importance given to different positions in the input. By iteratively applying this layer to a paragraph image, the network can transcribe each text line in turn, enabling a purely segmentation-free recognition of full paragraphs.

We carried out experiments on two public datasets of handwritten paragraphs: Rimes and IAM. We report results that are competitive with state-of-the-art systems, which use the ground-truth line segmentation. The remainder of this paper is organized as follows. Section 2 presents methods related to the one presented here, in terms of the tackled problem and modeling choices. In Section 3, we introduce the baseline model: MDLSTM-RNNs. Section 4 presents the proposed modification and the details of the system. Experimental results are reported in Section 5, followed by a short discussion in Section 6, in which we explain how the system could be improved, and present the challenge of generalizing it to complete documents.

2 Related Work

Our work is clearly related to MDLSTM-RNNs [18], which we improve by replacing the simple collapse layer by a more elaborate mechanism, itself made of MDLSTM layers. The model we propose iteratively performs an implicit line segmentation at the level of intermediate representations. Classical text line segmentation algorithms are mostly based on image processing techniques and heuristics.
However, some methods were devised using statistical models and machine learning techniques such as hidden Markov models [8], conditional random fields [21], or neural networks [24, 31, 32]. In our model, the line segmentation is performed implicitly and integrated in the neural network. The intermediate features are shared by the transcription and the segmentation models, and they are jointly trained to minimize the transcription error.

Recently, many "attention-based" models were proposed to iteratively select, in an encoded signal, the relevant parts for making the next prediction. This paradigm, already suggested by Fukushima in 1987 [15], was successfully applied to various problems such as machine translation [3], image caption generation [38], speech recognition [11, 12], or the recognition of cropped words in scene text [27]. Attention mechanisms were also part of systems that can generate or recognize small pieces of handwriting (e.g. a few digits with DRAW [20] or RAM [2], or short online handwritten sequences [19]). Our system is designed to handle long sequences and multiple lines.

In the field of computer vision, and particularly object detection and recognition, many neural architectures were proposed to both locate and recognize objects, such as OverFeat [35] or spatial transformer networks (STN [22]). In a sense, our model is quite related to the DenseCap model for image captioning [23], itself similar to STNs. However, we do not aim at explicitly predicting line positions, and STNs do not cope well with a large number of small objects.

We recently proposed an attention-based model to transcribe full paragraphs of handwritten text, which predicts each character in turn [6]. Outputting one token at a time turns out to be prohibitive in terms of memory and time consumption for full paragraphs, which typically contain hundreds of characters. In the system proposed here, the encoded image is not summarized as a single vector at each timestep, but as a sequence of vectors representing a full text line. This represents a huge speedup, and a return to the original MDLSTM-RNN architecture, in which the collapse layer is augmented with an MDLSTM attention network similar to the one presented in [6].

3 Handwriting Recognition with MDLSTM and CTC

MDLSTM-RNNs [18] were first introduced in the context of handwriting recognition. The Multi-Dimensional Long Short-Term Memory layers scan the input in the four possible directions. The LSTM cell inner state and output are computed from the states and outputs of the previous positions in the considered horizontal and vertical directions. Each MDLSTM layer is followed by a convolutional layer. At the top of this network, there is one feature map for each character. These maps are collapsed into a sequence of prediction vectors, normalized with a softmax activation. The whole architecture is depicted in Figure 1.

Figure 1: MDLSTM-RNN architecture for handwriting recognition. LSTM layers in four scanning directions are followed by convolutions. The feature maps of the top layer are summed in the vertical dimension, and character predictions are obtained after a softmax normalization.
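To make the four-direction scanning pattern concrete, the following numpy sketch shows one scan of a feature map plus the flip-scan-flip scheme that yields the four directions. It is an illustration only: the parameter names (Wi, Wfv, Wfh, Wo, Wc and their biases) are hypothetical, the gating follows the common MDLSTM formulation with one forget gate per dimension, and implementation details that vary between systems (peephole terms, optimized GPU kernels) are omitted.

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def mdlstm_scan(x, p, n):
    """Scan a (H, W, C) feature map from the top-left corner.
    p is a dict of weight matrices of shape (n, C + 2n) and biases of shape (n,)."""
    H, W, C = x.shape
    h = np.zeros((H + 1, W + 1, n))  # hidden states, zero-padded borders
    c = np.zeros((H + 1, W + 1, n))  # cell states
    for i in range(1, H + 1):
        for j in range(1, W + 1):
            # input pixel concatenated with the hidden states above and to the left
            z = np.concatenate([x[i - 1, j - 1], h[i - 1, j], h[i, j - 1]])
            gi  = sigmoid(p["Wi"]  @ z + p["bi"])   # input gate
            gfv = sigmoid(p["Wfv"] @ z + p["bfv"])  # forget gate, vertical axis
            gfh = sigmoid(p["Wfh"] @ z + p["bfh"])  # forget gate, horizontal axis
            go  = sigmoid(p["Wo"]  @ z + p["bo"])   # output gate
            g   = np.tanh(p["Wc"]  @ z + p["bc"])   # candidate cell value
            c[i, j] = gi * g + gfv * c[i - 1, j] + gfh * c[i, j - 1]
            h[i, j] = go * np.tanh(c[i, j])
    return h[1:, 1:]

def mdlstm_layer(x, params_per_direction, n):
    """Four scanning directions: flip the input so each corner becomes the
    top-left one, scan, flip back, and concatenate along the feature axis."""
    flips = [(1, 1), (1, -1), (-1, 1), (-1, -1)]
    outs = []
    for (fv, fh), p in zip(flips, params_per_direction):
        out = mdlstm_scan(x[::fv, ::fh], p, n)
        outs.append(out[::fv, ::fh])
    return np.concatenate(outs, axis=-1)
```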
The Connectionist Temporal Classification (CTC [17]) algorithm, which considers all possible labellings of the sequence, may be applied to train the network to recognize text lines.

The 2D-to-1D conversion happens in the collapse layer, which computes a simple aggregation of the feature maps into vector sequences, i.e. maps of height 1. This is achieved by a simple sum across the vertical dimension:

    z_i = \sum_{j=1}^{H} a_{ij}    (1)

where z_i is the i-th output vector and a_{ij} is the input feature vector at coordinates (i, j). All the information in the vertical dimension is reduced to a single vector, regardless of its position in the feature maps, preventing the recognition of multiple lines within this framework.

4 An Iterative Weighted Collapse for End-to-End Handwriting Recognition

In this paper, we replace the sum of Eqn. 1 by a weighted sum, in order to focus on a specific part of the input. The weighted collapse is defined as follows:

    z_i^{(t)} = \sum_{j=1}^{H} \omega_{ij}^{(t)} a_{ij}    (2)

where \omega_{ij}^{(t)} are scalar weights between 0 and 1, computed at every timestep t for each position (i, j). The weights are provided by a recurrent neural network, illustrated in Figure 2, enabling the recognition of one text line at each timestep.

Figure 2: Proposed modification of the collapse layer. While the standard collapse (top) computes a simple sum, the weighted collapse (bottom) includes a neural network to predict the weights of a weighted sum.

This collapse, weighted with a neural network, may be interpreted as the "attention" module of an attention-based neural network similar to those of [3, 38]. This mechanism is differentiable and can be trained with backpropagation. The complete architecture may be described as follows.

An encoder extracts feature maps from the input image I:

    a = (a_{ij})_{(i,j) \in [1,W] \times [1,H]} = \mathrm{Encoder}(I)    (3)

where (i, j) are coordinates in the feature maps. In this work, the Encoder module is an MDLSTM network with the same architecture as the model presented in Section 3.

A weighted collapse provides a view of the encoded image at each timestep, in the form of a weighted sum of feature vector sequences. The attention network computes a score for the feature vectors at every position:

    \alpha_{ij}^{(t)} = \mathrm{Attention}(a, \omega^{(t-1)})    (4)

We refer to \omega^{(t)} = \{\omega_{ij}^{(t)}\}_{1 \le i \le W, 1 \le j \le H} as the attention map at time t, whose computation depends not only on the encoded image, but also on the previous attention map. A softmax normalization is applied to each column:

    \omega_{ij}^{(t)} = e^{\alpha_{ij}^{(t)}} / \sum_{j'} e^{\alpha_{ij'}^{(t)}}    (5)

In this work, the Attention module is an MDLSTM network.
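As an illustration, the two collapse variants (Eqns. 1, 2 and 5) fit in a few lines of numpy. This is a minimal sketch: the attention scores alpha are taken as given, whereas in our model they are produced by the MDLSTM attention network of Eqn. 4.

```python
import numpy as np

def standard_collapse(a):
    """Eqn. 1: sum a (W, H, C) feature map over the vertical axis -> (W, C)."""
    return a.sum(axis=1)

def weighted_collapse(a, alpha):
    """Eqns. 5 and 2: column-wise softmax of the attention scores,
    then a weighted sum over the vertical axis.

    a:     (W, H, C) encoder feature maps
    alpha: (W, H) unnormalized attention scores for one timestep t
    """
    alpha = alpha - alpha.max(axis=1, keepdims=True)  # stability; softmax-invariant
    omega = np.exp(alpha)
    omega /= omega.sum(axis=1, keepdims=True)         # one softmax per column i
    return (omega[:, :, None] * a).sum(axis=1)        # (W, C) "line" features
```

Note that with uniform weights omega = 1/H the weighted collapse reduces to the standard collapse up to a constant factor, which is why the modification is a strict generalization of the original layer.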
The attention module is applied several times to the features from the encoder. The output of the attention module at iteration t, computed with Eqn. 2, is a sequence of feature vectors z^{(t)}, intended to represent a text line. Therefore, we may see this module as a soft line segmentation neural network. The advantages over neural networks trained for line segmentation [13, 24, 31, 32] are that (i) it works on the same features as those used for the transcription (multi-task encoder), and (ii) it is trained to maximize the transcription accuracy (i.e. it is more closely related to the goal of handwriting recognition systems, and easily interpretable).

A decoder predicts a character sequence from the feature vectors:

    y = \mathrm{Decoder}(z)    (6)

where z is the concatenation of z^{(1)}, z^{(2)}, ..., z^{(T)}. Alternatively, the decoder may be applied to the z^{(t)} sub-sequences to get outputs y^{(t)}, and y is then the concatenation of y^{(1)}, y^{(2)}, ..., y^{(T)}.

In the standard MDLSTM architecture of Section 3, the decoder is a simple softmax. However, a Bidirectional LSTM (BLSTM) decoder could be applied to the collapsed representations. This is particularly interesting in the proposed model, as the BLSTM would potentially process the whole paragraph, allowing a modeling of dependencies across text lines.

This model can be trained with CTC. If the line breaks are known in the transcript, CTC could be applied to the segments corresponding to each line prediction. Otherwise, one can directly apply CTC to the whole paragraph. In this work, we opted for the latter strategy, with a BLSTM decoder applied to the concatenation of all collapsing steps.

5 Experiments

5.1 Experimental Setup

We carried out the experiments on two public databases. The IAM database [29] is made of handwritten English texts copied from the LOB corpus. There are 747 documents (6,482 lines) in the training set, 116 documents (976 lines) in the validation set and 336 documents (2,915 lines) in the test set. The Rimes database [1] contains handwritten letters in French. The data consist of a training set of 1,500 paragraphs (11,333 lines) and a test set of 100 paragraphs (778 lines). We held out the last 100 paragraphs of the training set as a validation set.

The networks have the following architecture. The encoder first computes a 2x2 tiling of the input, and alternates MDLSTM layers of 4, 20 and 100 units with 2x4 convolutions of 12 and 32 filters, without overlap. The last layer is a linear layer with 80 outputs for IAM and 102 for Rimes. The attention network is an MDLSTM network with 2x16 units in each direction, followed by a linear layer with one output and a softmax on columns (Eqn. 5). The decoder is a BLSTM network with 256 units. Dropout is applied after each LSTM layer [33]. The networks are trained with RMSProp [36] with a base learning rate of 0.001 and mini-batches of 8 examples, to minimize the CTC loss over entire paragraphs. The measure of performance is the Character (or Word) Error Rate (CER%, WER%), i.e. the edit distance between the recognition output and the ground truth, normalized by the number of ground-truth characters (or words).
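For reference, this metric is a standard Levenshtein distance normalized by the reference length. A minimal sketch (hyp and ref are hypothetical names for the recognition output and the ground truth):

```python
def edit_distance(hyp, ref):
    """Levenshtein distance between two sequences, with a rolling array."""
    prev = list(range(len(ref) + 1))
    for i, h in enumerate(hyp, start=1):
        curr = [i]
        for j, r in enumerate(ref, start=1):
            curr.append(min(prev[j] + 1,               # deletion from hyp
                            curr[j - 1] + 1,           # insertion into hyp
                            prev[j - 1] + (h != r)))   # substitution (0 if match)
        prev = curr
    return prev[-1]

def cer(hyp: str, ref: str) -> float:
    """Character Error Rate (%), as defined in Section 5.1."""
    return 100.0 * edit_distance(hyp, ref) / len(ref)

def wer(hyp: str, ref: str) -> float:
    """Word Error Rate (%): the same distance computed on word tokens."""
    return 100.0 * edit_distance(hyp.split(), ref.split()) / len(ref.split())
```

For example, cer("helo world", "hello world") returns 100 * 1 / 11, about 9.1%.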
5.2 Impact of the Decoder

In our model, the weighted collapse is followed by a BLSTM decoder. In this experiment, we compare the baseline system (standard collapse followed by a softmax) with the proposed model. In order to dissociate the impact of the weighted collapse from that of the BLSTM decoder, we also trained an intermediate architecture with a BLSTM layer after the standard collapse, but still limited to text lines.

Table 1: Character Error Rates (%) of CTC-trained RNNs on 150 dpi images. The Standard models are trained on segmented lines. The Attention model is trained on paragraphs.

    Collapse    Decoder            IAM    Rimes
    Standard    Softmax            8.4    4.9
    Standard    BLSTM + Softmax    7.5    4.8
    Attention   BLSTM + Softmax    6.8    2.5

The character error rates (CER%) on the validation sets are reported in Table 1 for 150 dpi images. We observe that the proposed model outperforms the baseline by a large margin (a relative 20% improvement on IAM, 50% on Rimes), and that the gain may be attributed to both the BLSTM decoder and the attention mechanism.

5.3 Impact of Line Segmentation

Our model performs an implicit line segmentation to transcribe paragraphs. The baseline considered in the previous section is in a sense cheating, because it was evaluated on the ground-truth line segmentation. In this experiment, we add to the comparison the baseline models evaluated in a realistic scenario, where they are applied to the result of an automatic line segmentation algorithm.

Table 2: Character Error Rates (%) of CTC-trained RNNs on ground-truth lines and on automatic segmentations of paragraphs (Projection, Shredding and Energy algorithms), at different resolutions. The last column contains the error rate of the attention-based model presented in this work, without an explicit line segmentation.

    Database   Resolution   GroundTruth   Projection   Shredding   Energy   This work
    IAM        150 dpi      8.4           15.5         9.3         10.2     6.8
    IAM        300 dpi      6.6           13.8         7.5         7.9      4.9
    Rimes      150 dpi      4.8           6.3          5.9         8.2      2.8
    Rimes      300 dpi      3.6           5.0          4.5         6.6      2.5

In Table 2, we report the CERs obtained with the ground-truth line positions, with three different segmentation algorithms, and with our end-to-end system, on the validation sets of both databases and at different input resolutions. We see that applying the baseline networks to automatic segmentations increases the error rates, by an absolute 1% in the best case. We also observe that the models are better with higher resolutions.

Our model yields better performance than methods based on an explicit, automatic line segmentation, and comparable or better results than with the ground-truth segmentation, even with a resolution divided by two. Two factors may explain why our model yields better results than line recognition from the ground-truth segmentation. First, the ground-truth line positions are bounding boxes that may include parts of adjacent lines and irrelevant data, whereas the attention model focuses on smaller areas. But the main reason is probably that the proposed model includes a BLSTM operating on the whole paragraph, which may capture linguistic dependencies across text lines.

In Figure 3, we display a visualisation of the implicit line segmentation computed by the network. Each color corresponds to one step of the iterative weighted collapse. On the images, the color represents the weights given by the attention network (the transparency encodes their intensity). The texts below are the predicted transcriptions, and chunks are colored according to the corresponding timestep of the attention mechanism.

Figure 3: Transcription of full paragraphs of text and implicit line segmentation learnt by the network on IAM (left) and Rimes (right). Best viewed in color.

5.4 Comparison to Published Results

In this section, we also compute the word error rates (WER%) and evaluate our models on the test sets, to compare the proposed approach to existing systems.
For IAM, we applied a 3-gram language model with a lexicon of 50,000 words, trained on the LOB, Brown and Wellington corpora.¹ This language model has a perplexity of 298 and an out-of-vocabulary rate of 4.3% on the validation set (329 and 3.7% on the test set).

The results are presented in Table 3 for different input resolutions. When comparing the error rates, it is important to note that all systems in the literature used an explicit (ground-truth) line segmentation and a language model. [14, 26, 30] used a hybrid character/word language model to tackle the issue of out-of-vocabulary words. Moreover, all systems except [30, 33] carefully pre-processed the line images (e.g. corrected the slant or skew, normalized the height, ...), whereas we only normalized the pixel values to zero mean and unit variance. Finally, [5] is a combination of four systems.

Table 3: Final results on the Rimes and IAM databases.

                                               Rimes            IAM
                                               WER%    CER%     WER%    CER%
    150 dpi   no language model                13.6    3.2      29.5    10.1
              with language model              -       -        16.6    6.5
    300 dpi   no language model                12.6    2.9      24.6    7.9
              with language model              -       -        16.4    5.5
    Bluche, 2015 [5]                           11.2    3.5      10.9    4.4
    Doetsch et al., 2014 [14]                  12.9    4.3      12.2    4.7
    Kozielski et al., 2013 [26]                13.7    4.6      13.3    5.1
    Pham et al., 2014 [33]                     12.3    3.3      13.6    5.1
    Messina & Kermorvant, 2014 [30]            13.3    -        19.1    -

¹ The parts of the LOB corpus used in the validation and evaluation sets were removed.

On Rimes, the system applied to 150 dpi images already outperforms the state of the art in CER%, while being competitive in terms of WER%. The system for 300 dpi images is comparable to the best single system [33] in WER%, with a significantly better CER%.

On IAM, the language model turned out to be quite important, probably because there is more variability in the language.² On 150 dpi images, the results are not too far from the state-of-the-art results. The WER% does not improve much on 300 dpi images, but we get a lower CER%. When analysing the errors, we noticed that there is a lot of punctuation in IAM, which was often missed by the attention mechanism. This may happen because punctuation marks are significantly smaller than characters: with the attention-based collapse and the weighted sum, they are more easily missed than with the standard collapse, which gives the same weight to all vertical positions.

² A simple language model yields a perplexity of 18 on Rimes [5].

6 Discussion

Table 4: Comparison of decoding times of different methods: using ground-truth line information, with explicit segmentation, with the attention-based method of [6], and with the system presented in this paper.

    Method                                        Processing time (s)
    GroundTruth (crop + reco)                     0.21 ± 0.07
    Shredding (segment + crop + reco)             0.78 ± 0.26
    Scan, Attend and Read [6] (reco)              21.2 ± 5.6
    This work (reco)                              0.62 ± 0.14

The proposed model can transcribe complete paragraphs without segmentation, and is orders of magnitude faster than the model of [6] (cf. Table 4). However, the mechanism cannot handle arbitrary reading orders. Rather, it implements a sort of implicit line segmentation. In the current implementation, the iterative collapse runs for a fixed number of timesteps. Yet, the model can handle a variable number of text lines, and, interestingly, the focus is put on interlines in the additional steps. A more elegant solution would include the prediction of a binary variable indicating when to stop reading.
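To summarize the decoding procedure with a fixed number of steps, here is a minimal sketch of the forward pass. The encoder, attention and decoder callables are hypothetical stand-ins for the modules of Section 4; only the loop structure and the shapes are intended to be accurate.

```python
import numpy as np

def transcribe_paragraph(image, encoder, attention, decoder, n_steps):
    """End-to-end forward pass of the proposed model (illustrative interfaces).

    encoder(image)           -> (W, H, C) feature maps           (Eqn. 3)
    attention(a, prev_omega) -> (W, H) unnormalized scores       (Eqn. 4)
    decoder(z)               -> per-frame character posteriors   (Eqn. 6)
    n_steps is fixed in the current implementation; when the paragraph has
    fewer lines, the extra steps tend to focus on interlines.
    """
    a = encoder(image)                              # (W, H, C)
    W, H, _ = a.shape
    omega = np.full((W, H), 1.0 / H)                # uniform initial attention map
    lines = []
    for t in range(n_steps):
        alpha = attention(a, omega)                 # (W, H) scores
        alpha = alpha - alpha.max(axis=1, keepdims=True)
        omega = np.exp(alpha)
        omega /= omega.sum(axis=1, keepdims=True)   # column-wise softmax, Eqn. 5
        z_t = (omega[:, :, None] * a).sum(axis=1)   # weighted collapse, Eqn. 2
        lines.append(z_t)
    z = np.concatenate(lines, axis=0)               # (n_steps * W, C): one line per step
    return decoder(z)                               # CTC-decodable posteriors
```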
Our method was applied to paragraph images, so a document layout analysis is required to detect those paragraphs before applying the model. Naturally, the next step should be the transcription of complex documents without an explicit or assumed paragraph extraction. The limitation to paragraphs is inherent to this system. Indeed, the weighted collapse always outputs sequences corresponding to the whole width of the encoded image, which, in paragraphs, may correspond to text lines. In order to switch to full documents, several issues arise. On the one hand, the size of the lines is determined by the size of the text block; thus a method should be devised to select only a smaller part of the feature maps, representing only the considered text line. This is not possible in the presented framework. A potential solution could come from spatial transformer networks [22], which perform a differentiable crop. On the other hand, training will in practice become more difficult, not only because of the complexity of the task, but also because the reading order of text blocks in complex documents cannot be exactly inferred in many cases (even defining arbitrary rules may be tricky).

7 Conclusion

We have presented a model to transcribe full paragraphs of handwritten text without an explicit line segmentation. Contrary to classical methods relying on a two-step process (segment, then recognize), our system directly considers the paragraph image, without elaborate pre-processing, and outputs the complete transcription. We proposed a simple modification of the collapse layer in the standard MDLSTM architecture to iteratively focus on single text lines. This implicit line segmentation is learnt with backpropagation, along with the rest of the network, to minimize the CTC error at the paragraph level. We reported error rates comparable to the state of the art on two public databases. After the switch from explicit to implicit character, then word, segmentation in handwriting recognition, we showed that line segmentation can also be learnt inside the transcription model. The next step towards end-to-end handwriting recognition is now at the full page level.

References

[1] E. Augustin, M. Carré, E. Grosicki, J.-M. Brodin, E. Geoffrois, and F. Preteux. RIMES evaluation campaign for handwritten mail processing. In Proceedings of the Workshop on Frontiers in Handwriting Recognition, number 1, 2006.

[2] Jimmy Ba, Volodymyr Mnih, and Koray Kavukcuoglu. Multiple object recognition with visual attention. arXiv preprint arXiv:1412.7755, 2014.

[3] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly learning to align and translate. arXiv preprint arXiv:1409.0473, 2014.

[4] Yoshua Bengio, Yann LeCun, Craig Nohl, and Chris Burges. LeRec: A NN/HMM hybrid for on-line handwriting recognition. Neural Computation, 7(6):1289–1303, 1995.

[5] Théodore Bluche. Deep Neural Networks for Large Vocabulary Handwritten Text Recognition. PhD thesis, Université Paris Sud - Paris XI, May 2015.

[6] Théodore Bluche, Jérôme Louradour, and Ronaldo Messina.
Scan, Attend and Read: End-to-End Handwritten Paragraph Recognition with MDLSTM Attention. arXiv preprint arXiv:1604.03286, 2016.

[7] Théodore Bluche, Bastien Moysset, and Christopher Kermorvant. Automatic line segmentation and ground-truth alignment of handwritten documents. In International Conference on Frontiers in Handwriting Recognition (ICFHR), 2014.

[8] Vicente Bosch, Alejandro Hector Toselli, and Enrique Vidal. Statistical text line analysis in handwritten documents. In Frontiers in Handwriting Recognition (ICFHR), 2012 International Conference on, pages 201–206. IEEE, 2012.

[9] Sylvie Brunessaux, Patrick Giroux, Bruno Grilhères, Mathieu Manta, Maylis Bodin, Khalid Choukri, Olivier Galibert, and Juliette Kahn. The Maurdor Project: Improving Automatic Processing of Digital Documents. In Document Analysis Systems (DAS), 2014 11th IAPR International Workshop on, pages 349–354. IEEE, 2014.

[10] Horst Bunke, Samy Bengio, and Alessandro Vinciarelli. Offline recognition of unconstrained handwritten texts using HMMs and statistical language models. Pattern Analysis and Machine Intelligence, IEEE Transactions on, 26(6):709–720, 2004.

[11] William Chan, Navdeep Jaitly, Quoc V. Le, and Oriol Vinyals. Listen, attend and spell. arXiv preprint arXiv:1508.01211, 2015.

[12] Jan K. Chorowski, Dzmitry Bahdanau, Dmitriy Serdyuk, Kyunghyun Cho, and Yoshua Bengio. Attention-based models for speech recognition. In Advances in Neural Information Processing Systems, pages 577–585, 2015.

[13] Manolis Delakis and Christophe Garcia. Text detection with convolutional neural networks. In VISAPP (2), pages 290–294, 2008.

[14] Patrick Doetsch, Michal Kozielski, and Hermann Ney. Fast and robust training of recurrent neural networks for offline handwriting recognition. In International Conference on Frontiers in Handwriting Recognition (ICFHR), 2014.

[15] Kunihiko Fukushima. Neural network model for selective attention in visual pattern recognition and associative recall. Applied Optics, 26(23):4985–4992, 1987.

[16] Basilis Gatos, Georgios Louloudis, Tim Causer, Kris Grint, Veronica Romero, Joan-Andreu Sánchez, Alejandro Hector Toselli, and Enrique Vidal. Ground-truth production in the transcriptorium project. In Document Analysis Systems (DAS), 2014 11th IAPR International Workshop on, pages 237–241. IEEE, 2014.

[17] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In International Conference on Machine Learning, pages 369–376, 2006.

[18] A. Graves and J. Schmidhuber. Offline Handwriting Recognition with Multidimensional Recurrent Neural Networks. In Advances in Neural Information Processing Systems, pages 545–552, 2008.

[19] Alex Graves. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850, 2013.

[20] Karol Gregor, Ivo Danihelka, Alex Graves, and Daan Wierstra. DRAW: A recurrent neural network for image generation. arXiv preprint arXiv:1502.04623, 2015.

[21] David Hebert, Thierry Paquet, and Stephane Nicolas. Continuous CRF with multi-scale quantization feature functions: application to structure extraction in old newspapers. In Document Analysis and Recognition (ICDAR), 2011 International Conference on, pages 493–497. IEEE, 2011.

[22] Max Jaderberg, Karen Simonyan, Andrew Zisserman, et al. Spatial transformer networks.
In Advances in Neural Information Processing Systems, pages 2008–2016, 2015.

[23] Justin Johnson, Andrej Karpathy, and Li Fei-Fei. DenseCap: Fully convolutional localization networks for dense captioning. arXiv preprint arXiv:1511.07571, 2015.

[24] Keechul Jung. Neural network-based text location in color images. Pattern Recognition Letters, 22(14):1503–1515, 2001.

[25] Alfred Kaltenmeier, Torsten Caesar, Joachim M. Gloger, and Eberhard Mandler. Sophisticated topology of hidden Markov models for cursive script recognition. In Document Analysis and Recognition, 1993, Proceedings of the Second International Conference on, pages 139–142. IEEE, 1993.

[26] Michal Kozielski, Patrick Doetsch, Hermann Ney, et al. Improvements in RWTH's System for Off-Line Handwriting Recognition. In Document Analysis and Recognition (ICDAR), 2013 12th International Conference on, pages 935–939. IEEE, 2013.

[27] Chen-Yu Lee and Simon Osindero. Recursive recurrent nets with attention modeling for OCR in the wild. arXiv preprint arXiv:1603.03101, 2016.

[28] Laurence Likforman-Sulem, Abderrazak Zahour, and Bruno Taconet. Text line segmentation of historical documents: a survey. International Journal of Document Analysis and Recognition (IJDAR), 9(2-4):123–138, 2007.

[29] U.-V. Marti and Horst Bunke. The IAM-database: an English sentence database for offline handwriting recognition. International Journal on Document Analysis and Recognition, 5(1):39–46, 2002.

[30] R. Messina and C. Kermorvant. Surgenerative Finite State Transducer n-gram for Out-Of-Vocabulary Word Recognition. In 11th IAPR Workshop on Document Analysis Systems (DAS2014), pages 212–216, 2014.

[31] Bastien Moysset, Pierre Adam, Christian Wolf, and Jérôme Louradour. Space displacement localization neural networks to locate origin points of handwritten text lines in historical documents. In International Workshop on Historical Document Imaging and Processing (HIP), 2015.

[32] Bastien Moysset, Christopher Kermorvant, Christian Wolf, and Jérôme Louradour. Paragraph text segmentation into lines with recurrent neural networks. In International Conference on Document Analysis and Recognition (ICDAR), 2015.

[33] Vu Pham, Théodore Bluche, Christopher Kermorvant, and Jérôme Louradour. Dropout improves recurrent neural networks for handwriting recognition. In 14th International Conference on Frontiers in Handwriting Recognition (ICFHR2014), pages 285–290, 2014.

[34] Joan Andreu Sánchez, Verónica Romero, Alejandro Toselli, and Enrique Vidal. ICFHR 2014 HTRtS: Handwritten Text Recognition on tranScriptorium Datasets. In International Conference on Frontiers in Handwriting Recognition (ICFHR), 2014.

[35] Pierre Sermanet, David Eigen, Xiang Zhang, Michaël Mathieu, Rob Fergus, and Yann LeCun. OverFeat: Integrated recognition, localization and detection using convolutional networks. arXiv preprint arXiv:1312.6229, 2013.

[36] Tijmen Tieleman and Geoffrey Hinton. Lecture 6.5-rmsprop: Divide the gradient by a running average of its recent magnitude. COURSERA: Neural Networks for Machine Learning, 4, 2012.

[37] A. Tong, M. Przybocki, V. Maergner, and H. El Abed. NIST 2013 Open Handwriting Recognition and Translation (OpenHaRT13) Evaluation.
In 11th IAPR Workshop on Document Analysis Systems (DAS2014), 2014.

[38] Kelvin Xu, Jimmy Ba, Ryan Kiros, Aaron Courville, Ruslan Salakhutdinov, Richard Zemel, and Yoshua Bengio. Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044, 2015.", "award": [], "sourceid": 523, "authors": [{"given_name": "Theodore", "family_name": "Bluche", "institution": "A2iA"}]}