{"title": "Hessian-free Optimization for Learning Deep Multidimensional Recurrent Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 883, "page_last": 891, "abstract": "Multidimensional recurrent neural networks (MDRNNs) have shown a remarkable performance in the area of speech and handwriting recognition. The performance of an MDRNN is improved by further increasing its depth, and the difficulty of learning the deeper network is overcome by using Hessian-free (HF) optimization. Given that connectionist temporal classification (CTC) is utilized as an objective of learning an MDRNN for sequence labeling, the non-convexity of CTC poses a problem when applying HF to the network. As a solution, a convex approximation of CTC is formulated and its relationship with the EM algorithm and the Fisher information matrix is discussed. An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in an improved performance for sequence labeling.", "full_text": "Hessian-free Optimization for Learning\n\nDeep Multidimensional Recurrent Neural Networks\n\nMinhyung Cho\n\nChandra Shekhar Dhir\n\nJaehyung Lee\n\n{mhyung.cho,shekhardhir}@gmail.com\n\nApplied Research Korea, Gracenote Inc.\n\njaehyung.lee@kaist.ac.kr\n\nAbstract\n\nMultidimensional recurrent neural networks (MDRNNs) have shown a remark-\nable performance in the area of speech and handwriting recognition. The perfor-\nmance of an MDRNN is improved by further increasing its depth, and the dif-\n\ufb01culty of learning the deeper network is overcome by using Hessian-free (HF)\noptimization. Given that connectionist temporal classi\ufb01cation (CTC) is utilized as\nan objective of learning an MDRNN for sequence labeling, the non-convexity of\nCTC poses a problem when applying HF to the network. As a solution, a convex\napproximation of CTC is formulated and its relationship with the EM algorithm\nand the Fisher information matrix is discussed. 
An MDRNN up to a depth of 15 layers is successfully trained using HF, resulting in improved performance for sequence labeling.

1 Introduction

Multidimensional recurrent neural networks (MDRNNs) constitute an efficient architecture for building a multidimensional context into recurrent neural networks [1]. End-to-end training of MDRNNs in conjunction with connectionist temporal classification (CTC) has been shown to achieve state-of-the-art performance in on/off-line handwriting and speech recognition [2, 3, 4]. Previous approaches, however, only demonstrated MDRNNs having a depth of up to five layers, which is limited compared with the recent progress in feedforward networks [5]. The effectiveness of MDRNNs deeper than five layers has thus far been unknown.

Training a deep architecture has always been a challenging topic in machine learning. A notable breakthrough was achieved when deep feedforward neural networks were initialized using layer-wise pre-training [6]. Recently, approaches have been proposed in which supervision is added to intermediate layers to train deep networks [5, 7]. To the best of our knowledge, no such pre-training or bootstrapping method has been developed for MDRNNs.

Alternatively, Hessian-free (HF) optimization is an appealing approach to training deep neural networks because of its ability to overcome the pathological curvature of the objective function [8]. Furthermore, it can be applied to any connectionist model provided that its objective function is differentiable. The recent success of HF for deep feedforward and recurrent neural networks [8, 9] supports its application to MDRNNs.

In this paper, we claim that an MDRNN can benefit from a deeper architecture, and that the application of second-order optimization such as HF allows its successful learning. First, we offer details of the development of HF optimization for MDRNNs. 
Then, to apply HF optimization to sequence labeling tasks, we address the problem of the non-convexity of CTC and formulate a convex approximation. In addition, its relationship with the EM algorithm and the Fisher information matrix is discussed. Experimental results for offline handwriting and phoneme recognition show that an MDRNN with HF optimization performs better as the depth of the network increases, up to 15 layers.

2 Multidimensional recurrent neural networks

MDRNNs constitute a generalization of RNNs to process multidimensional data by replacing the single recurrent connection with as many connections as the dimensions of the data [1]. The network can access contextual information from 2^N directions, allowing a collective decision to be made based on rich context information. To enhance its ability to exploit context information, long short-term memory (LSTM) [10] cells are usually utilized as hidden units. In addition, stacking MDRNNs to construct deeper networks further improves the performance as the depth increases, achieving state-of-the-art performance in phoneme recognition [4]. For sequence labeling, CTC is applied as the loss function of the MDRNN. An important advantage of using CTC is that no pre-segmented sequences are required; the entire transcription of the input sample is sufficient.

2.1 Learning MDRNNs

A d-dimensional MDRNN with M inputs and K outputs is regarded as a mapping from an input sequence x ∈ R^{M×T_1×···×T_d} to an output sequence a ∈ (R^K)^T of length T, where the input data for the M input neurons are given by the vectorization of the d-dimensional data, and T_1, ..., T_d is the length of the sequence in each dimension. All learnable weights and biases are concatenated to obtain a parameter vector θ ∈ R^N. 
In the learning phase with fixed training data, the MDRNN is formalized as a mapping N : R^N → (R^K)^T from the parameters θ to the output sequence a, i.e., a = N(θ). The scalar loss function is defined over the output sequence as L : (R^K)^T → R. Learning an MDRNN is viewed as an optimization of the objective L(N(θ)) = L ∘ N(θ) with respect to θ.

The Jacobian J_F of a function F : R^m → R^n is the n × m matrix in which each element is a partial derivative of an element of the output with respect to an element of the input. The Hessian H_F of a scalar function F : R^m → R is the m × m matrix of second-order partial derivatives of the output with respect to its inputs. Throughout this paper, a vector sequence is denoted by boldface a, the vector at time t in a is denoted by a^t, and the k-th element of a^t is denoted by a^t_k.

3 Hessian-free optimization for MDRNNs

The application of HF optimization to an MDRNN is straightforward if the matching loss function [11] for its output layer is adopted. However, this is not the case for CTC, which is necessarily adopted for sequence labeling. Before developing an appropriate approximation to CTC that is compatible with HF optimization, we discuss two considerations related to the approximation. The first is obtaining a quadratic approximation of the loss function, and the second is the efficient calculation of the matrix-vector product used at each iteration of the conjugate gradient (CG) method.

HF optimization minimizes an objective by constructing a local quadratic approximation of the objective function and minimizing the approximate function instead of the original one. 
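To make this concrete, the local quadratic model just described can be minimized with the conjugate gradient method using only curvature-vector products, so the curvature matrix is never formed explicitly. Below is a minimal NumPy sketch of this inner step; the function name and interface are our own illustration, not code from the paper.

```python
import numpy as np

def conjugate_gradient(Gv, g, max_iter=50, tol=1e-10):
    """Minimize q(d) = g.d + 0.5*d'Gd by solving G d = -g with CG.
    G is accessed only through the matrix-vector product Gv(v)."""
    d = np.zeros_like(g)
    r = -g - Gv(d)             # residual of the system G d = -g
    p = r.copy()
    rs = r @ r
    for _ in range(max_iter):
        Gp = Gv(p)
        alpha = rs / (p @ Gp)  # exact step length along p
        d += alpha * p
        r -= alpha * Gp
        rs_new = r @ r
        if rs_new < tol:       # residual small enough: stop early
            break
        p = r + (rs_new / rs) * p
        rs = rs_new
    return d
```

In HF, `Gv` would be the generalized Gauss-Newton product described in section 3.2 and `g` the gradient of the loss at the current parameters.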
The loss function L(θ) needs to be approximated at each point θ_n of the n-th iteration:

Q_n(θ) = L(θ_n) + ∇_θL|_{θ_n}^T δ_n + (1/2) δ_n^T G δ_n,   (1)

where δ_n = θ − θ_n is the search direction, i.e., the parameters of the optimization, and G is a local approximation to the curvature of L(θ) at θ_n, which is typically obtained by the generalized Gauss-Newton (GGN) matrix as an approximation of the Hessian.

HF optimization uses the CG method in a subroutine to minimize the quadratic objective above, thereby utilizing the complete curvature information while remaining computationally efficient. CG requires the computation of Gv for an arbitrary vector v, but not the explicit evaluation of G. For neural networks, an efficient way to compute Gv was proposed in [11], extending the study in [12]. In section 3.2, we provide the details of the efficient computation of Gv for MDRNNs.

3.1 Quadratic approximation of the loss function

The Hessian matrix H_{L∘N} of the objective L(N(θ)) is written as

H_{L∘N} = J_N^T H_L J_N + Σ_{i=1}^{KT} [J_L]_i H_{[N]_i},   (2)

where J_N ∈ R^{KT×N}, H_L ∈ R^{KT×KT}, and [q]_i denotes the i-th component of the vector q.

An indefinite Hessian matrix is problematic for second-order optimization, because it defines an unbounded local quadratic approximation [13]. For nonlinear systems, the Hessian is not necessarily positive semidefinite, and thus the GGN matrix is used as an approximation of the Hessian [11, 8]. The GGN matrix is obtained by ignoring the second term in Eq. 
(2), as given by

G_{L∘N} = J_N^T H_L J_N.   (3)

A sufficient condition for the GGN approximation to be exact is that the network makes a perfect prediction for every given sample, that is, J_L = 0, or that [N]_i stays in the linear region for all i, that is, H_{[N]_i} = 0.

G_{L∘N} has rank at most KT and is positive semidefinite provided that H_L is. Thus, L is chosen to be a convex function so that H_L is positive semidefinite. In principle, it is best to define L and N such that L performs as much of the computation as possible, with the positive semidefiniteness of H_L as a minimum requirement [13]. In practice, a nonlinear output layer together with its matching loss function [11], such as the softmax function with cross-entropy loss, is widely used.

3.2 Computation of the matrix-vector product for MDRNNs

The product of an arbitrary vector v by the GGN matrix, Gv = J_N^T H_L J_N v, amounts to the sequential multiplication of v by three matrices. First, the product J_N v is a Jacobian times a vector and is therefore equal to the directional derivative of N(θ) along the direction of v. Thus, J_N v can be written using a differential operator as J_N v = R_v(N(θ)) [12], and the properties of this operator can be exploited for efficient computation. Because an MDRNN is a composition of differentiable components, R_v(N(θ)) can be computed throughout the whole network by repeatedly applying the sum, product, and chain rules starting from the input layer. The detailed derivation of the R operator for LSTM, normally used as the hidden unit in MDRNNs, is provided in appendix A.

Next, the multiplication of J_N v by H_L can be performed by direct computation. 
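The three-factor structure of Gv can be illustrated on a toy network. In the sketch below, J_N v is obtained as a finite-difference directional derivative (a numerical stand-in for the exact R-operator of [12]), and J_N^T is applied through a dense Jacobian purely for clarity; in an actual MDRNN this last step is a single backward pass that propagates u instead of the output error. All names here are our own illustration.

```python
import numpy as np

def ggn_vector_product(f, hess_loss, theta, v, eps=1e-6):
    """Compute Gv = J^T H_L J v for a network f: parameters -> outputs.
    Jv is a finite-difference directional derivative (stand-in for the
    exact R-operator); J^T u uses a dense Jacobian for illustration."""
    # 1) Jv: directional derivative of f at theta along v
    Jv = (f(theta + eps * v) - f(theta - eps * v)) / (2 * eps)
    # 2) u = H_L(Jv): multiply by the loss Hessian at the network output
    u = hess_loss(f(theta)) @ Jv
    # 3) J^T u: in practice this is one backprop pass with u as the "error"
    m = len(theta)
    J = np.zeros((len(u), m))
    for j in range(m):
        e = np.zeros(m); e[j] = eps
        J[:, j] = (f(theta + e) - f(theta - e)) / (2 * eps)
    return J.T @ u
```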
The dimension of H_L could at first appear problematic, since the dimension of the output vector used by the loss function L can be as high as KT, in particular if CTC is adopted as the objective for the MDRNN. If the loss function can be expressed as a sum of individual loss functions whose domains are restricted in time, the computation can be reduced significantly. For example, with the commonly used cross-entropy loss function, the KT × KT matrix H_L can be transformed into a block diagonal matrix with T blocks of K × K Hessian matrices. Let H_{L,t} be the t-th block in H_L. Then, the GGN matrix can be written as

G_{L∘N} = Σ_t J_{N_t}^T H_{L,t} J_{N_t},   (4)

where J_{N_t} is the Jacobian of the network at time t.

Finally, the multiplication of the vector u = H_L J_N v by the matrix J_N^T is calculated using the backpropagation through time algorithm, by propagating u instead of the error at the output layer.

4 Convex approximation of CTC for application to HF optimization

Connectionist temporal classification (CTC) [14] provides an objective function for learning an MDRNN for sequence labeling. In this section, we derive a convex approximation of CTC inspired by the GGN approximation, according to the following steps. First, the non-convex part of the original objective is separated out by reformulating the softmax part. Next, the remaining convex part is approximated without altering its Hessian, making it well matched to the non-convex part. Finally, the convex approximation is obtained by reuniting the convex and non-convex parts.

4.1 Connectionist temporal classification

CTC is formulated as the mapping from an output sequence of the recurrent network, a ∈ (R^K)^T, to a scalar loss. 
The output activations at time t are normalized using the softmax function

y^t_k = exp(a^t_k) / Σ_{k'} exp(a^t_{k'}),   (5)

where y^t_k is the probability of label k given a at time t.

The conditional probability of a path π is calculated as the product of the label probabilities at each timestep:

p(π|a) = Π_{t=1}^T y^t_{π_t},   (6)

where π_t is the label observed at time t along the path π. A path π of length T is mapped to a label sequence of length M ≤ T by an operator B, which removes the repeated labels and then the blanks. Several mutually exclusive paths can map to the same label sequence. Let S be the set containing every possible sequence mapped by B, that is, S = {s | s = B(π) for some π} is the image of B, and let |S| denote the cardinality of this set.

The conditional probability of a label sequence l is given by

p(l|a) = Σ_{π∈B^{-1}(l)} p(π|a),   (7)

which is the sum of the probabilities of all the paths mapped to the label sequence l by B.

The cross-entropy loss assigns a negative log probability to the correct answer. Given a target sequence z, the loss function of CTC for the sample is written as

L(a) = −log p(z|a).   (8)

From the description above, CTC is composed of a sum of products of softmax components. The function −log(y^t_k), corresponding to the softmax with cross-entropy loss, is convex [11]. Therefore, y^t_k is log-concave. Whereas log-concavity is closed under multiplication, the sum of log-concave functions is not log-concave in general [15]. As a result, the CTC objective is not convex in general, because it contains the sum of softmax components in Eq. (7).

4.2 Reformulation of the CTC objective function

We reformulate the CTC objective in Eq. (8) to separate out the terms that are responsible for the non-convexity of the function. 
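Before proceeding with the reformulation, the definitions in Eqs. (5)-(8) can be checked on toy inputs by brute-force enumeration of paths. This is exponential in T and only meant to make the operator B and the sum in Eq. (7) concrete; the standard CTC forward-backward recursion [14] is what one would use in practice. The helper names are ours.

```python
import itertools
import math

def collapse(path, blank=0):
    """The CTC operator B: remove repeated labels, then blanks."""
    no_repeats = [k for k, _ in itertools.groupby(path)]
    return tuple(k for k in no_repeats if k != blank)

def ctc_loss_bruteforce(y, z, blank=0):
    """-log p(z|a), Eq. (8), by summing Eq. (6) over all paths pi
    with B(pi) = z, as in Eq. (7). y[t][k] is the softmax output
    of Eq. (5) at time t for label k."""
    T, K = len(y), len(y[0])
    p = 0.0
    for path in itertools.product(range(K), repeat=T):
        if collapse(path, blank) == tuple(z):
            p += math.prod(y[t][path[t]] for t in range(T))
    return -math.log(p)
```

For example, with T = 2, K = 2, uniform outputs y = [[0.5, 0.5], [0.5, 0.5]] and target z = (1,), the paths (0,1), (1,0), and (1,1) all collapse to (1), giving p(z|a) = 0.75.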
By this reformulation, the softmax function is defined over the categorical label sequences.

By substituting Eq. (5) into Eq. (6), it follows that

p(π|a) = exp(b_π) / Σ_{π'∈all} exp(b_{π'}),   (9)

where b_π = Σ_t a^t_{π_t}. By substituting Eq. (9) into Eq. (7) and setting l = z, p(z|a) can be rewritten as

p(z|a) = Σ_{π∈B^{-1}(z)} exp(b_π) / Σ_{π∈all} exp(b_π) = exp(f_z) / Σ_{z'∈S} exp(f_{z'}),   (10)

where S is the set of every possible label sequence and f_z = log(Σ_{π∈B^{-1}(z)} exp(b_π)) is a log-sum-exp function (f(x_1, ..., x_n) = log(e^{x_1} + ··· + e^{x_n}), defined on R^n), so that exp(f_z) is proportional to the probability of observing the label sequence z among all the other label sequences.

With the reformulation above, the CTC objective can be regarded as the cross-entropy loss with a softmax output defined over all possible label sequences. Because the cross-entropy loss function matches the softmax output layer [11], the CTC objective is convex except for the part that computes f_z for each label sequence. At this point, an obvious candidate for the convex approximation of CTC is the GGN matrix separating the convex and non-convex parts.

Let the non-convex part be N_c and the convex part be L_c. The mapping N_c : (R^K)^T → R^{|S|} is defined by

N_c(a) = F = [f_{z_1}, ..., f_{z_{|S|}}]^T,   (11)

where f_z is given above, and |S| is the number of all the possible label sequences. 
For given F as above, the mapping L_c : R^{|S|} → R is defined by

L_c(F) = −log( exp(f_z) / Σ_{z'∈S} exp(f_{z'}) ) = −f_z + log( Σ_{z'∈S} exp(f_{z'}) ),   (12)

where z is the label sequence corresponding to a. The final reformulation of the loss function of CTC is given by

L(a) = L_c ∘ N_c(a).   (13)

4.3 Convex approximation of the CTC loss function

The GGN approximation of Eq. (13) immediately gives a convex approximation of the Hessian for CTC as G_{L_c∘N_c} = J_{N_c}^T H_{L_c} J_{N_c}. Although H_{L_c} has the form of a diagonal matrix plus a rank-1 matrix, i.e., diag(Y) − YY^T, the dimension of H_{L_c} is |S| × |S|, where |S| becomes exponentially large as the length of the sequence increases. This makes the practical calculation of H_{L_c} difficult.

On the other hand, removing the linear term −f_z from L_c(F) in Eq. (12) does not alter its Hessian. The resulting function is L_p(F) = log(Σ_{z'∈S} exp(f_{z'})). The GGN matrices of L = L_c ∘ N_c and M = L_p ∘ N_c are the same, i.e., G_{L_c∘N_c} = G_{L_p∘N_c}. Therefore, their Hessian matrices are approximations of each other. The condition under which the two Hessian matrices, H_L and H_M, converge to the same matrix is discussed below.

Interestingly, M has the compact form M(a) = L_p ∘ N_c(a) = Σ_t log Σ_k exp(a^t_k), where a^t_k is output unit k at time t. Its Hessian H_M can be computed directly, resulting in a block diagonal matrix. Each block is restricted in time, and the t-th block is given by

H_{M,t} = diag(Y^t) − Y^t (Y^t)^T,   (14)

where Y^t = [y^t_1, ..., y^t_K]^T and y^t_k is given in Eq. (5). Because the Hessian of each block is positive semidefinite, H_M is positive semidefinite. 
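The block in Eq. (14) is exactly the Hessian of the per-timestep log-sum-exp term of M, and its positive semidefiniteness can be verified numerically. A short check (all function names are our own):

```python
import numpy as np

def softmax(a):
    """Eq. (5): softmax over the K output activations at one timestep."""
    e = np.exp(a - a.max())
    return e / e.sum()

def block_hessian(a):
    """Eq. (14): H_{M,t} = diag(Y^t) - Y^t (Y^t)^T with Y^t = softmax(a^t).
    This is the Hessian of log sum_k exp(a_k) evaluated at a."""
    y = softmax(a)
    return np.diag(y) - np.outer(y, y)

def numerical_hessian(f, a, eps=1e-4):
    """Central-difference Hessian of a scalar function f, for checking."""
    n = len(a)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            ei = np.zeros(n); ei[i] = eps
            ej = np.zeros(n); ej[j] = eps
            H[i, j] = (f(a + ei + ej) - f(a + ei - ej)
                       - f(a - ei + ej) + f(a - ei - ej)) / (4 * eps ** 2)
    return H
```

Comparing `block_hessian(a)` with the numerical Hessian of log Σ_k exp(a_k) confirms Eq. (14), and the eigenvalues of diag(Y) − YY^T are nonnegative (the all-ones vector spans its null space).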
A convex approximation of the Hessian of an MDRNN using the CTC objective can be obtained by substituting H_M for H_L in Eq. (3). Note that the resulting matrix is block diagonal, so Eq. (4) can be utilized for efficient computation.

Our derivation can be summarized as follows:

1. H_L = H_{L_c∘N_c} is not positive semidefinite.
2. G_{L_c∘N_c} = G_{L_p∘N_c} is positive semidefinite, but not computationally tractable.
3. H_{L_p∘N_c} is positive semidefinite and computationally tractable.

4.4 Sufficient condition for the proposed approximation to be exact

From Eq. (2), the condition H_{L_c∘N_c} = H_{L_p∘N_c} holds if and only if Σ_{i=1}^{KT} [J_{L_c}]_i H_{[N_c]_i} = Σ_{i=1}^{KT} [J_{L_p}]_i H_{[N_c]_i}. Since J_{L_c} ≠ J_{L_p} in general, we consider only the case of H_{[N_c]_i} = 0 for all i, which corresponds to the case where N_c is a linear mapping.

[N_c]_i contains a log-sum-exp function mapping from paths to a label sequence. Let l be the label sequence corresponding to [N_c]_i; then [N_c]_i = f_l(..., b_π, ...) for π ∈ B^{-1}(l). If the probability of one path π' is sufficiently large to ignore all the other paths, that is, exp(b_{π'}) ≫ exp(b_π) for π ∈ B^{-1}(l)\{π'}, it follows that f_l(..., b_{π'}, ...) = b_{π'}. This is a linear mapping, which results in H_{[N_c]_i} = 0.

In conclusion, the condition H_{L_c∘N_c} = H_{L_p∘N_c} holds if, for each label sequence l, there exists one dominant path π ∈ B^{-1}(l) such that f_l(..., b_π, ...) = b_π.

4.5 Derivation of the proposed approximation from the Fisher information matrix

The identity of the GGN and the Fisher information matrix [16] has been shown for networks using the softmax with cross-entropy loss [17, 18]. Thus, it follows that the GGN matrix of Eq. (13) is identical to the Fisher information matrix. 
Now, we show that the proposed matrix in Eq. (14) is derived from the Fisher information matrix under the condition given in section 4.4. The Fisher information matrix of an MDRNN using CTC is written as

F = E_x[ J_N^T E_{l∼p(l|a)}[ (∂ log p(l|a)/∂a)^T (∂ log p(l|a)/∂a) ] J_N ],   (15)

where a = a(x, θ) is the KT-dimensional output of the network N. CTC assumes the output probabilities at each timestep to be independent of those at other timesteps [1], and therefore its Fisher information matrix is given as a sum over timesteps. It follows that

F = E_x[ Σ_t J_{N_t}^T E_{l∼p(l|a)}[ (∂ log p(l|a)/∂a^t)^T (∂ log p(l|a)/∂a^t) ] J_{N_t} ].   (16)

Under the condition in section 4.4, the Fisher information matrix is given by

F = E_x[ Σ_t J_{N_t}^T (diag(Y^t) − Y^t (Y^t)^T) J_{N_t} ],   (17)

which has the same form as Eqs. (4) and (14) combined. See appendix B for the detailed derivation.

4.6 EM interpretation of the proposed approximation

The goal of the expectation-maximization (EM) algorithm is to find the maximum likelihood solution for models having latent variables [19]. Given an input sequence x and its corresponding target label sequence z, the log likelihood of z is given by log p(z|x, θ) = log Σ_{π∈B^{-1}(z)} p(π|x, θ), where θ represents the model parameters. For each observation x, we have a corresponding latent variable q, which is a 1-of-k binary vector, where k is the number of all the paths mapped to z. The log likelihood can be written in terms of q as log p(z, q|x, θ) = Σ_{π∈B^{-1}(z)} q_{π|x,z} log p(π|x, θ). The EM algorithm starts with an initial parameter θ̂ and repeats the following two steps until convergence.

Expectation step: calculate γ_{π|x,z} = p(π|x, θ̂) / Σ_{π∈B^{-1}(z)} p(π|x, θ̂).

Maximization step: update θ̂ = argmax_θ Q(θ), where Q(θ) = Σ_{π∈B^{-1}(z)} γ_{π|x,z} log p(π|x, θ).

In the context of CTC and RNNs, p(π|x, θ) is given as p(π|a(x, θ)) as in Eq. (6), where a(x, θ) is the KT-dimensional output of the neural network. Taking the second-order derivative of −log p(π|a) with respect to a^t gives diag(Y^t) − Y^t (Y^t)^T, with Y^t as in Eq. (14). Because this term is independent of π, and Σ_{π∈B^{-1}(z)} γ_{π|x,z} = 1, the Hessian of −Q with respect to a^t is given by

H_{Q,t} = diag(Y^t) − Y^t (Y^t)^T,   (18)

which is the same as the convex approximation in Eq. (14).

5 Experiments

In this section, we present experimental results for two different sequence labeling tasks: offline handwriting recognition and phoneme recognition. The performance of Hessian-free optimization for MDRNNs with the proposed matrix is compared with that of stochastic gradient descent (SGD) optimization under the same settings.

5.1 Database and preprocessing

The IFN/ENIT database [20] is a database of handwritten Arabic words, which consists of 32,492 images. The entire dataset has five subsets (a, b, c, d, e). The 25,955 images corresponding to the subsets (b-e) were used for training. The validation set consisted of 3,269 images corresponding to the first half of the sorted list in alphabetical order (ae07 001.tif - ai54 028.tif) in set a. 
The remaining 3,268 images in set a were used for testing. The intensity of the pixels was centered and scaled using the mean and standard deviation calculated from the training set.

The TIMIT corpus [21] is a benchmark database for evaluating speech recognition performance. The standard training, validation, and core test sets were used, containing 3,696, 400, and 192 sentences, respectively. A mel spectrum with 26 coefficients was used as the feature vector, with a pre-emphasis filter, a 25 ms window size, and a 10 ms shift size. Each input feature was centered and scaled using the mean and standard deviation of the training set.

5.2 Experimental setup

For handwriting recognition, the basic architecture was adopted from that proposed in [3]. Deeper networks were constructed by replacing the top layer with more layers. The number of LSTM cells in the augmented layers was chosen such that the total number of weights in the different networks was similar. The detailed architectures are described in Table 1, together with the results.

For phoneme recognition, the deep bidirectional LSTM and CTC of [4] was adopted as the basic architecture. In addition, the memory cell block [10], in which the cells share the gates, was applied for efficient information sharing. Each LSTM block was constrained to have 10 memory cells.

In our experiments, using a large bias value for the input/output gates was beneficial for training deep MDRNNs. A possible explanation is that the activation of neurons decays exponentially through the input/output gates during propagation; setting large bias values for these gates may therefore facilitate the transmission of information through many layers at the beginning of learning. For this reason, the biases of the input and output gates were initialized to 2, whereas those of the forget gates and memory cells were initialized to 0. 
All the other weight parameters of the MDRNN were initialized randomly from a uniform distribution over [−0.1, 0.1].

The label error rate was used as the metric for performance evaluation, together with the average CTC loss of Eq. (8). It is defined via the edit distance, which counts the total number of insertions, deletions, and substitutions required to match two given sequences. The final performance, shown in Tables 1 and 2, was evaluated using the weight parameters that gave the best label error rate on the validation set. To map output probabilities to a label sequence, best path decoding [1] was used for handwriting recognition, and beam search decoding [4, 22] with a beam width of 100 was used for phoneme recognition. For phoneme recognition, 61 phoneme labels were used during training and decoding and then mapped to 39 classes for calculating the phoneme error rate (PER) [4, 23].

For phoneme recognition, the regularization method suggested in [24] was used. We applied Gaussian weight noise of standard deviation σ ∈ {0.03, 0.04, 0.05} together with L2 regularization of strength 0.001. The network was first trained without noise and then reinitialized to the weights that gave the lowest CTC loss on the validation set, after which it was retrained with Gaussian weight noise [4]. Table 2 presents the best result over the different values of σ.

5.2.1 Parameters

For HF optimization, we followed the basic setup described in [8], but with different parameters. Tikhonov damping was used together with the Levenberg-Marquardt heuristic. The damping parameter λ was initialized to 0.1 and adjusted according to the reduction ratio ρ (multiplied by 0.9 if ρ > 0.75, divided by 0.9 if ρ < 0.25, and left unchanged otherwise). The initial search direction for each run of CG was set to the CG direction found by the previous HF iteration, decayed by 0.7. 
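The λ-adjustment rule above is a simple three-way branch on the reduction ratio ρ (the actual decrease of the loss divided by the decrease predicted by the quadratic model); a direct transcription:

```python
def update_damping(lmbda, rho):
    """Levenberg-Marquardt heuristic for the Tikhonov damping parameter,
    as described in the text: rho is the reduction ratio."""
    if rho > 0.75:       # quadratic model is trustworthy: damp less
        return lmbda * 0.9
    if rho < 0.25:       # model overestimates progress: damp more
        return lmbda / 0.9
    return lmbda         # otherwise keep the damping unchanged
```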
To ensure that CG followed a descent direction, we continued to perform a minimum of 5 and a maximum of 30 additional CG iterations after it found the first descent direction. We terminated CG at iteration i, before reaching the maximum number of iterations, if the condition (φ(x_i) − φ(x_{i−5}))/φ(x_i) < 0.005 was satisfied, where φ is the quadratic objective of CG without offset. The training data were divided into 100 and 50 mini-batches for the handwriting and phoneme recognition experiments, respectively, and used for both the gradient and the matrix-vector product calculations. Learning was stopped if either of the two criteria (label error rate and CTC loss) did not improve for 20 epochs in handwriting recognition or 10 epochs in phoneme recognition.

For SGD optimization, the learning rate ε was chosen from {10^-4, 10^-5, 10^-6} and the momentum µ from {0.9, 0.95, 0.99}. For handwriting recognition, the best performance obtained over all possible combinations of parameters is presented in Table 1. For phoneme recognition, the best parameters out of the nine candidates for each network were selected after training without weight noise, based on the CTC loss. Additionally, the backpropagated error in the LSTM layers was clipped to the range [−1, 1] for stable learning [25]. Learning was stopped after 1,000 epochs, and the final performance was evaluated using the weight parameters that gave the best label error rate on the validation set. Note that, to guarantee convergence, we selected a conservative criterion compared with previous studies, in which the networks converged after 85 epochs for handwriting recognition [3] and after 55-150 epochs for phoneme recognition [4].

5.3 Results

Table 1 presents the label error rate on the test set for handwriting recognition. In all cases, the networks trained using HF optimization outperformed those trained using SGD. 
The advantage of using HF is more pronounced as the depth increases. The improvement resulting from the deeper architecture can be seen in the error rate, which drops from 6.1% to 4.5% as the depth increases from 3 to 13.

Table 2 shows the phoneme error rate (PER) on the core test set for phoneme recognition. Improved performance with depth can be observed for both optimization methods. The best PER for HF optimization is 18.54% at 15 layers, and that for SGD is 18.46% at 10 layers. These are comparable to the results reported in [4]: a PER of 18.6% from a network with 3 layers having 3.8 million weights, and a PER of 18.4% from a network with 5 layers having 6.8 million weights. The benefit of a deeper network is clear in terms of the number of weight parameters, although this is not intended as a definitive performance comparison because of the different preprocessing. The advantage of HF optimization is not prominent in the experiments on the TIMIT database. One explanation is that the networks tend to overfit the relatively small number of training samples, which removes the advantage of using advanced optimization techniques.

Table 1: Experimental results for Arabic offline handwriting recognition. The label error rate is presented for the different network depths. A^B denotes a stack of B layers having A hidden LSTM cells in each layer. "Epochs" is the number of epochs required by the network using HF optimization to fulfill the stopping criteria. 
ε is the learning rate and µ is the momentum.

NETWORKS     DEPTH  WEIGHTS  HF (%)  EPOCHS  SGD (%)  {ε, µ}
2-10-50        3    159,369   6.10     77      9.57   {10^-4, 0.9}
2-10-21^3      5    157,681   5.85     90      9.19   {10^-5, 0.99}
2-10-14^6      8    154,209   4.98    140      9.67   {10^-4, 0.95}
2-10-12^8     10    154,153   4.95    109      9.25   {10^-4, 0.95}
2-10-10^11    13    150,169   4.50     84     10.63   {10^-4, 0.9}
2-10-9^13     15    145,417   5.69     84     12.29   {10^-5, 0.99}

Table 2: Experimental results for phoneme recognition using the TIMIT corpus. PER is presented for the different MDRNN architectures (depth × blocks × cells/block). σ is the standard deviation of the Gaussian weight noise. The remaining parameters are the same as in Table 1.

NETWORKS       WEIGHTS  HF (%)  EPOCHS  {σ}     SGD (%)  {ε, µ, σ}
3 × 20 × 10    771,542  20.14     22   {0.03}   20.96   {10^-5, 0.99, 0.05}
5 × 15 × 10    795,752  19.18     30   {0.05}   20.82   {10^-4, 0.9, 0.04}
8 × 11 × 10    720,826  19.09     29   {0.05}   19.68   {10^-4, 0.9, 0.04}
10 × 10 × 10   755,822  18.79     60   {0.04}   18.46   {10^-5, 0.95, 0.04}
13 × 9 × 10    806,588  18.59     93   {0.05}   18.49   {10^-5, 0.95, 0.04}
15 × 8 × 10    741,230  18.54     50   {0.04}   19.09   {10^-5, 0.95, 0.03}
3 × 250 × 1†   3.8M       -       -      -      18.6    {10^-4, 0.9, 0.075}
5 × 250 × 1†   6.8M       -       -      -      18.4    {10^-4, 0.9, 0.075}

† The results were reported by Graves in 2013 [4].

6 Conclusion

Hessian-free optimization was presented as an approach for the successful learning of deep MDRNNs in conjunction with CTC. To apply HF optimization to CTC, a convex approximation of its objective function was explored. In the experiments, performance improved as the depth of the network increased, for both HF and SGD. 
HF optimization showed a significantly better performance than SGD for handwriting recognition, and a comparable performance for speech recognition.

References
[1] Alex Graves. Supervised sequence labelling with recurrent neural networks, volume 385. Springer, 2012.
[2] Alex Graves, Marcus Liwicki, Horst Bunke, Jürgen Schmidhuber, and Santiago Fernández. Unconstrained on-line handwriting recognition with recurrent neural networks. In Advances in Neural Information Processing Systems, pages 577-584, 2008.
[3] Alex Graves and Jürgen Schmidhuber. Offline handwriting recognition with multidimensional recurrent neural networks. In Advances in Neural Information Processing Systems, pages 545-552, 2009.
[4] Alex Graves, Abdel-rahman Mohamed, and Geoffrey Hinton. Speech recognition with deep recurrent neural networks. In Proceedings of ICASSP, pages 6645-6649. IEEE, 2013.
[5] Adriana Romero, Nicolas Ballas, Samira Ebrahimi Kahou, Antoine Chassang, Carlo Gatta, and Yoshua Bengio. FitNets: Hints for thin deep nets. CoRR, abs/1412.6550, 2014. URL http://arxiv.org/abs/1412.6550.
[6] Geoffrey E Hinton and Ruslan R Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504-507, 2006.
[7] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1-9, 2015.
[8] James Martens. Deep learning via Hessian-free optimization. In Proceedings of the 27th International Conference on Machine Learning, pages 735-742, 2010.
[9] James Martens and Ilya Sutskever.
Learning recurrent neural networks with Hessian-free optimization. In Proceedings of the 28th International Conference on Machine Learning, pages 1033-1040, 2011.
[10] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735-1780, 1997.
[11] Nicol N Schraudolph. Fast curvature matrix-vector products for second-order gradient descent. Neural Computation, 14(7):1723-1738, 2002.
[12] Barak A Pearlmutter. Fast exact multiplication by the Hessian. Neural Computation, 6(1):147-160, 1994.
[13] James Martens and Ilya Sutskever. Training deep and recurrent networks with Hessian-free optimization. In Neural Networks: Tricks of the Trade, pages 479-535. Springer, 2012.
[14] Alex Graves, Santiago Fernández, Faustino Gomez, and Jürgen Schmidhuber. Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks. In Proceedings of the 23rd International Conference on Machine Learning, pages 369-376, 2006.
[15] Stephen Boyd and Lieven Vandenberghe, editors. Convex Optimization. Cambridge University Press, 2004.
[16] Shun-Ichi Amari. Natural gradient works efficiently in learning. Neural Computation, 10(2):251-276, 1998.
[17] Razvan Pascanu and Yoshua Bengio. Revisiting natural gradient for deep networks. In International Conference on Learning Representations, 2014.
[18] Hyeyoung Park, S-I Amari, and Kenji Fukumizu. Adaptive natural gradient learning algorithms for various stochastic models. Neural Networks, 13(7):755-764, 2000.
[19] Christopher M. Bishop, editor. Pattern Recognition and Machine Learning. Springer, 2007.
[20] Mario Pechwitz, S Snoussi Maddouri, Volker Märgner, Noureddine Ellouze, and Hamid Amiri. IFN/ENIT-database of handwritten Arabic words. In Proceedings of CIFED, pages 129-136, 2002.
[21] DARPA-ISTO.
The DARPA TIMIT acoustic-phonetic continuous speech corpus (TIMIT). Speech disc CD1-1.1 edition, 1990.
[22] Alex Graves. Sequence transduction with recurrent neural networks. In ICML Representation Learning Workshop, 2012.
[23] Kai-Fu Lee and Hsiao-Wuen Hon. Speaker-independent phone recognition using hidden Markov models. IEEE Transactions on Acoustics, Speech and Signal Processing, 37(11):1641-1648, 1989.
[24] Alex Graves. Practical variational inference for neural networks. In Advances in Neural Information Processing Systems, pages 2348-2356, 2011.
[25] Alex Graves. RNNLIB: A recurrent neural network library for sequence learning problems, 2008.
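The label error rate in Table 1 and the PER in Table 2 follow the usual CTC evaluation convention [14]: the total edit distance between predicted and target label sequences, normalized by the total target length. A minimal sketch of this metric, for illustration only (the function names are ours, not from the paper):

```python
def edit_distance(pred, target):
    """Levenshtein distance between two label sequences."""
    m, n = len(pred), len(target)
    d = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(m + 1):
        d[i][0] = i  # deleting all of pred
    for j in range(n + 1):
        d[0][j] = j  # inserting all of target
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            cost = 0 if pred[i - 1] == target[j - 1] else 1
            d[i][j] = min(d[i - 1][j] + 1,         # deletion
                          d[i][j - 1] + 1,         # insertion
                          d[i - 1][j - 1] + cost)  # substitution
    return d[m][n]

def label_error_rate(predictions, targets):
    """Total edit distance over total target length, as a percentage."""
    total_ed = sum(edit_distance(p, t) for p, t in zip(predictions, targets))
    total_len = sum(len(t) for t in targets)
    return 100.0 * total_ed / total_len
```

For phoneme recognition the label sequences are phoneme strings, so the same quantity is the PER; for handwriting recognition they are character strings.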