{"title": "Conditional Neural Fields", "book": "Advances in Neural Information Processing Systems", "page_first": 1419, "page_last": 1427, "abstract": "Conditional random fields (CRF) are quite successful on sequence labeling tasks such as natural language processing and biological sequence analysis. CRF models use linear potential functions to represent the relationship between input features and outputs. However, in many real-world applications such as protein structure prediction and handwriting recognition, the relationship between input features and outputs is highly complex and nonlinear, which cannot be accurately modeled by a linear function. To model the nonlinear relationship between input features and outputs we propose Conditional Neural Fields (CNF), a new conditional probabilistic graphical model for sequence labeling. Our CNF model extends CRF by adding one (or possibly several) middle layer between input features and outputs. The middle layer consists of a number of hidden parameterized gates, each acting as a local neural network node or feature extractor to capture the nonlinear relationship between input features and outputs. Therefore, conceptually this CNF model is much more expressive than the linear CRF model. To better control the complexity of the CNF model, we also present a hyperparameter optimization procedure within the evidence framework. Experiments on two widely-used benchmarks indicate that this CNF model performs significantly better than a number of popular methods. In particular, our CNF model is the best among about ten machine learning methods for protein secondary tructure prediction and also among a few of the best methods for handwriting recognition.", "full_text": "Conditional Neural Fields\n\nJian Peng\n\nLiefeng Bo\n\nToyota Technological Institute at Chicago\n\nToyota Technological Institute at Chicago\n\n6045 S. Kenwood Ave.\n\nChicago, IL 60637\n\njpengwhu@gmail.com\n\n6045 S. 
Kenwood Ave.\n\nChicago, IL 60637\n\nliefengbo@gmail.com\n\nJinbo Xu\n\nToyota Technological Institute at Chicago\n\n6045 S. Kenwood Ave.\n\nChicago, IL 60637\n\njinboxu@gmail.com\n\nAbstract\n\nConditional random \ufb01elds (CRF) are widely used for sequence labeling such as\nnatural language processing and biological sequence analysis. Most CRF models\nuse a linear potential function to represent the relationship between input features\nand output. However, in many real-world applications such as protein structure\nprediction and handwriting recognition, the relationship between input features\nand output is highly complex and nonlinear, which cannot be accurately modeled\nby a linear function. To model the nonlinear relationship between input and output\nwe propose a new conditional probabilistic graphical model, Conditional Neural\nFields (CNF), for sequence labeling. CNF extends CRF by adding one (or possi-\nbly more) middle layer between input and output. The middle layer consists of a\nnumber of gate functions, each acting as a local neuron or feature extractor to cap-\nture the nonlinear relationship between input and output. Therefore, conceptually\nCNF is much more expressive than CRF. Experiments on two widely-used bench-\nmarks indicate that CNF performs signi\ufb01cantly better than a number of popular\nmethods. In particular, CNF is the best among approximately 10 machine learning\nmethods for protein secondary structure prediction and also among a few of the\nbest methods for handwriting recognition.\n\n1 Introduction\n\nSequence labeling is a ubiquitous problem arising in many areas, including natural language pro-\ncessing [1], bioinformatics [2, 3, 4] and computer vision [5]. Given an input/observation sequence,\nthe goal of sequence labeling is to infer the state sequence (also called output sequence), where a\nstate may be some type of labeling or segmentation. 
For example, in protein secondary structure prediction, the observation is a protein sequence consisting of a collection of residues, and the output is a sequence of secondary structure types. The hidden Markov model (HMM) [6] is one of the most popular methods for sequence labeling. HMM is a generative learning model since it generates the output from a joint distribution over input and output. In the past decade, several discriminative learning models such as conditional random fields (CRF) have emerged as the mainstream methods for sequence labeling. The conditional random field, introduced by Lafferty et al. [7], is an undirected graphical model that defines the conditional probability of the output given the input. CRF is also a special case of the log-linear model since its potential function is defined as a linear combination of features. Another approach to sequence labeling is max margin structured learning, such as max margin Markov networks (MMMN) [8] and SVM-struct [9]. These models generalize large margin and kernel methods to structured learning.\n\nIn this work, we present a new probabilistic graphical model, called conditional neural fields (CNF), for sequence labeling. CNF combines the advantages of both CRF and neural networks. First, CNF preserves globally consistent prediction, i.e. it exploits the structural correlation between outputs, and retains the strength of CRF as a rigorous probabilistic model. Within the probabilistic framework, posterior probabilities can be derived to evaluate confidence in predictions. This property is particularly valuable in applications that require multiple cascaded predictors. Second, CNF automatically learns an implicit nonlinear representation of features and thus can capture more complicated relationships between input and output. Finally, CNF is much more efficient than kernel-based methods such as MMMN and SVM-struct. 
The learning and inference procedures in CNF adopt efficient dynamic programming algorithms, which makes CNF applicable to large-scale tasks.\n\n2 Conditional Random Fields\n\nAssume the input and output sequences are X and Y, respectively, with Y = {y1, y2, ..., yN} ∈ Σ^N, where Σ is the alphabet of all possible output states and |Σ| = M.\n\nCRF uses two types of features given a pair of input and output sequences. The first type of feature describes the dependency between neighboring output labels,\n\nf_{y,y′}(Y, X, t) = δ[yt = y] δ[yt−1 = y′]   (1)\n\nwhere δ[yt = y] is an indicator function: it is equal to 1 if and only if the state at position t is y.\n\nThe second type of feature describes the dependency between the label at one position and the observations around this position,\n\nf_y(Y, X, t) = f(X, t) δ[yt = y]   (2)\n\nwhere f(X, t) is the local observation or feature vector at position t.\n\nIn a linear chain CRF model [7], the conditional probability of the output sequence Y given the input sequence X is the normalized product of the exponentials of potential functions on all edges and vertices in the chain,\n\nP(Y|X) = (1/Z(X)) exp( Σ_{t=1..N} (ψ(Y, X, t) + φ(Y, X, t)) )   (3)\n\nwhere\n\nφ(Y, X, t) = Σ_y w_y^T f_y(Y, X, t)   (4)\n\nis the potential function defined on the vertex at the t-th position, which measures the compatibility between the local observations around the t-th position and the output label yt; and\n\nψ(Y, X, t) = Σ_{y,y′} u_{y,y′} f_{y,y′}(Y, X, t)   (5)\n\nis the potential function defined on the edge connecting the two neighboring labels yt−1 and yt. This potential measures the compatibility between two neighboring output labels.\n\nAlthough CRF is a very powerful model for sequence labeling, it does not work very well on tasks in which the input features and output labels have a complex relationship. 
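To make this limitation concrete, here is a tiny illustrative sketch (not from the paper; the data and gate weights are hand-picked for the illustration) of an XOR-style feature-label relation: no linear scoring function fits it, while a single layer of logistic gates makes it linearly separable.

```python
import numpy as np
from itertools import product

# XOR-style relation between two binary input features and a label
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
y = np.array([0, 1, 1, 0])

def h(z):
    # logistic gate function, as used later by CNF
    return 1.0 / (1.0 + np.exp(-z))

# best accuracy of any linear rule w1*x1 + w2*x2 + b > 0 over a coarse grid
best_linear = 0.0
for w1, w2, b in product(np.linspace(-2, 2, 9), repeat=3):
    pred = (X @ [w1, w2] + b > 0).astype(int)
    best_linear = max(best_linear, float((pred == y).mean()))

# two hand-picked gates (roughly "x1 OR x2" and "x1 AND x2") fix this:
s = X @ [1.0, 1.0]
G = np.c_[h(10 * (s - 0.5)), h(10 * (s - 1.5))]
pred_gate = (G @ [1.0, -1.0] > 0.5).astype(int)  # matches y exactly
```

The linear rule tops out at 75% on these four points, while the gated representation recovers the labels exactly; this is the kind of nonlinearity the hidden layer below is meant to capture.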
For example, in computer vision and bioinformatics, many problems require modeling a complex/nonlinear relationship between input and output [10, 11]. To model such a relationship, CRF has to explicitly enumerate all possible combinations of input features and output labels. Nevertheless, even assisted with domain knowledge, it is not always possible for CRF to capture all the important nonlinear relationships by explicit enumeration.\n\n3 Conditional Neural Fields\n\nHere we propose a new probabilistic graphical model, conditional neural fields (CNF), for sequence labeling. Figure 1 shows the structural difference between CNF and CRF. CNF not only parametrizes the conditional probability in a log-linear-like formulation, but is also able to implicitly model a complex/nonlinear relationship between input features and output labels. In a linear chain CNF, the edge potential function is similar to that of a linear chain CRF. That is, the edge function describes only the interdependency between neighboring output labels. However, the potential function of CNF at each vertex is different from that of CRF. It is defined as follows:\n\nφ(Y, X, t) = Σ_y Σ_{g=1..K} w_{y,g} h(θ_g^T f(X, t)) δ[yt = y]   (6)\n\nwhere h is a gate function. In this work, we use the logistic function as the gate function. The major difference between CRF and CNF is the definition of the potential function at each vertex. In CRF, the local potential function (see Equation (4)) is defined as a linear combination of features. In CNF, there is an extra hidden layer between the input and output, which consists of K gate functions (see Figure 1 and Equation (6)). The K gate functions extract a K-dimensional implicit nonlinear representation of the input features. Therefore, CNF can be viewed as a CRF with its inputs being K homogeneous hidden feature-extractors at each position. 
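The vertex potential (6), together with the linear-chain distribution (3)-(5), can be sketched in a few lines of numpy; the toy shapes and variable names below are ours, not the paper's, and the normalizer Z(X) is computed by the standard forward recursion.

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
N, D, M, K = 4, 6, 3, 5          # toy sizes: length, feature dim, |labels|, gates
F = rng.normal(size=(N, D))      # hypothetical local feature vectors f(X, t)
theta = rng.normal(size=(K, D))  # gate parameters theta_g
w = rng.normal(size=(M, K))      # output-layer weights w_{y,g}
u = rng.normal(size=(M, M))      # edge weights u[y', y]

def vertex_potentials(F, theta, w):
    # phi(y, X, t) = sum_g w[y, g] * h(theta_g^T f(X, t)), h = logistic gate
    H = 1.0 / (1.0 + np.exp(-(F @ theta.T)))  # (N, K) gate activations
    return H @ w.T                            # (N, M)

def logsumexp(v, axis):
    m = np.max(v, axis=axis, keepdims=True)
    return np.squeeze(m + np.log(np.sum(np.exp(v - m), axis=axis, keepdims=True)), axis=axis)

def log_partition(phi, u):
    # forward recursion: alpha_t(y) = phi[t, y] + lse_{y'}(alpha_{t-1}(y') + u[y', y])
    alpha = phi[0].copy()
    for t in range(1, phi.shape[0]):
        alpha = phi[t] + logsumexp(alpha[:, None] + u, axis=0)
    return float(logsumexp(alpha, axis=0))

def log_score(Y, phi, u):
    # unnormalized log-score: sum of vertex and edge potentials for one labeling Y
    s = phi[0, Y[0]]
    for t in range(1, len(Y)):
        s += phi[t, Y[t]] + u[Y[t - 1], Y[t]]
    return s

phi = vertex_potentials(F, theta, w)
logZ = log_partition(phi, u)
# brute-force sanity check: P(Y|X) = exp(score - logZ) sums to 1 over all M^N labelings
total = sum(np.exp(log_score(Y, phi, u) - logZ) for Y in product(range(M), repeat=N))
```

Replacing `vertex_potentials` with a plain linear map `F @ W.T` recovers the CRF special case, which is exactly the structural relationship described in the text.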
Similar to CRF, CNF can also be defined on a general graph structure or a high-order Markov chain. This paper mainly focuses on a linear chain CNF model for sequence labeling.\n\nFigure 1: Structures of CRF and CNF. (The original figure draws both models over a local input window x_{i-2}, ..., x_{i+2} with output labels y_{i-2}, ..., y_{i+2} connected by edge weights u; CNF additionally inserts a gate layer θ_1, ..., θ_K, with weights w_{yi,1}, ..., w_{yi,K}, between the input window and each output label.)\n\nCNF can also be viewed as a natural combination of neural networks and log-linear models. In the hidden layer, a set of neurons extracts implicit features from the input. The log-linear model in the output layer then utilizes these implicit features as its input. The parameters of the hidden neurons and the log-linear model can be jointly optimized. After learning the parameters, we can first compute all the hidden neuron values from the input and then use an inference algorithm to predict the output. Any inference algorithm used by CRF, such as Viterbi [7], can be used by CNF. Assume that the dimension of the feature vector at each vertex is D. The computational complexity for the K neurons is O(NKD). Supposing Viterbi is used as the inference algorithm, the total computational complexity of CNF inference is O(NMK + NKD). Empirically the number of hidden neurons K is small, so the CNF inference procedure may have lower computational complexity than CRF. In our experiments, CNF shows superior predictive performance over two baseline methods: neural networks and CRF.\n\n4 Parameter Optimization\n\nSimilar to CRF, we can use the maximum likelihood method to train the model parameters such that the log-likelihood is maximized. 
For CNF, the log-likelihood is as follows:\n\nlog P(Y|X) = Σ_{t=1..N} (ψ(Y, X, t) + φ(Y, X, t)) − log Z(X)   (7)\n\nSince CNF contains a hidden layer of gate functions h, the log-likelihood function is no longer convex. Therefore, it is very likely that we can only obtain a locally optimal solution for the parameters. Although both the output and hidden layers contain model parameters, all the parameters can be learned together by gradient-based optimization. We use LBFGS [12] as the optimization routine to search for the optimal model parameters because 1) LBFGS is very efficient and robust; and 2) LBFGS provides us with an approximation of the inverse Hessian for hyperparameter learning [13], which will be described in the next section. The gradient of the log-likelihood with respect to the parameters is given by\n\n∂log P/∂u_{y,y′} = Σ_{t=1..N} δ[yt = y] δ[yt−1 = y′] − E_{P(Ỹ|X,w,u,θ)}[ Σ_{t=1..N} δ[ỹt = y] δ[ỹt−1 = y′] ]   (8)\n\n∂log P/∂w_{y,g} = Σ_{t=1..N} δ[yt = y] h(θ_g^T f(X, t)) − E_{P(Ỹ|X,w,u,θ)}[ Σ_{t=1..N} δ[ỹt = y] h(θ_g^T f(X, t)) ]   (9)\n\n∂log P/∂θ_g = Σ_{t=1..N} w_{yt,g} ∂h(θ_g^T f(X, t))/∂θ_g − E_{P(Ỹ|X,w,u,θ)}[ Σ_{t=1..N} w_{ỹt,g} ∂h(θ_g^T f(X, t))/∂θ_g ]   (10)\n\nwhere δ is the indicator function.\n\nJust like CRF, we can calculate the expectations in these gradients efficiently using the forward-backward algorithm. Assume that the dimension of the feature vector at each vertex is D. Since the K gate functions can be computed in advance, the computational complexity of the gradient computation is O(NKD + NM^2K) for a single input-output pair of length N. 
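The factor ∂h/∂θ_g appearing in Equation (10) has the closed form h(z)(1 − h(z)) f(X, t) for the logistic gate with z = θ_g^T f(X, t). A quick finite-difference check of this identity (illustrative only; the vectors are random):

```python
import numpy as np

def h(z):
    # logistic gate function
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(1)
theta = rng.normal(size=5)   # gate parameters theta_g (toy dimension)
f = rng.normal(size=5)       # local feature vector f(X, t)

# analytic gradient of the gate output w.r.t. theta: h(z) * (1 - h(z)) * f
z = theta @ f
grad = h(z) * (1.0 - h(z)) * f

# central finite-difference approximation of the same gradient
eps = 1e-6
num = np.array([(h((theta + eps * e) @ f) - h((theta - eps * e) @ f)) / (2 * eps)
                for e in np.eye(5)])
```

The two gradients agree to numerical precision, which is the derivative that gets plugged into (10) before taking expectations with forward-backward.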
If K is smaller than D, it is very possible that the gradient computation in CNF is faster than in CRF, where the complexity of the gradient computation is O(NM^2D). In our experiments, K is usually much smaller than D. For example, in protein secondary structure prediction, K = 30 and D = 260; in handwriting recognition, K = 40 and D = 128. As a result, although the optimization problem is non-convex, the training time of CNF is acceptable. Our experiments show that the training time of CNF is about 2 or 3 times that of CRF.\n\n5 Regularization and Hyperparameter Optimization\n\nBecause a hidden layer is added to CNF to introduce more expressive power than CRF, it is crucial to control the model complexity of CNF to avoid overfitting. Similar to CRF, we can enforce regularization on the model parameters. We assume that the parameters have a Gaussian prior and constrain the inverse covariance matrix (of the Gaussian distribution) by a small number of hyperparameters. To simplify the problem, we divide the model parameter vector λ into three groups w, u and θ (see Figure 1) and assume that the parameters in different groups are independent of each other. Furthermore, we assume the parameters in each group share the same Gaussian prior with a diagonal covariance matrix. Let α = [αw, αu, αθ]^T denote the vector of the three regularization hyperparameters for these three groups of parameters. While grid search provides a practical way to determine a single hyperparameter at low resolution, we need a more sophisticated method to determine three hyperparameters simultaneously. In this section, we discuss hyperparameter learning in the evidence framework.\n\n5.1 Laplace's Approximation\n\nThe evidence framework [14] assumes that the posterior of α is sharply peaked around the maximum αmax. 
Since no prior knowledge of α is available, the prior of each αi, i ∈ {w, u, θ}, P(αi), is chosen to be flat, i.e. constant on a log scale. Thus, the value of α maximizing the posterior P(α|Y, X) can be found by maximizing\n\nP(Y|X, α) = ∫ P(Y|X, λ) P(λ|α) dλ   (11)\n\nBy Laplace's approximation [14], this integral is approximated around the MAP estimate of the weights. We have\n\nlog P(Y|X, α) = log P(Y|X, λMAP) + log P(λMAP|α) − (1/2) log det(A) + const   (12)\n\nwhere A is the Hessian of log P(Y|X, λMAP) + log P(λMAP|α) with respect to λ.\n\nIn order to maximize the approximation, we take the derivative of the right-hand side of Equation (12) with respect to α. The optimal α can be derived by the following update formula:\n\nαi^new = (Wi − αi^old Tr(A^-1)) / (λMAP^T λMAP)   (13)\n\nwhere Wi is the number of parameters in group i ∈ {w, u, θ}.\n\n5.2 Approximation of the Trace of the Inverse Hessian\n\nWhen there is a large number of model parameters, accurate computation of Tr(A^-1) is very expensive. All model parameters are coupled together by the normalization factor, so the diagonal approximation of the Hessian and the outer-product approximation are not appropriate. In this work, we approximate the inverse Hessian using information available in the parameter optimization procedure. The LBFGS algorithm optimizes the parameters iteratively, so we can approximate the inverse Hessian at λMAP using the update information generated in the past several iterations. This approach is also employed in [15, 14]. From the LBFGS update formula [13], we can compute the approximation of the trace of the inverse Hessian very efficiently. 
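The update (13) itself is a one-liner; the sketch below reads it per parameter group and treats each group's share of Tr(A^-1) as given, both of which are our assumptions (the paper approximates the trace from the LBFGS history rather than receiving it directly).

```python
import numpy as np

def update_alpha(alpha_old, lam_groups, trace_Ainv):
    # One evidence-framework update, Eq. (13):
    #   alpha_i_new = (W_i - alpha_i_old * Tr(A^-1)) / (lambda_MAP^T lambda_MAP)
    # applied per parameter group i in {w, u, theta}.
    alpha_new = {}
    for i, lam in lam_groups.items():
        W_i = lam.size  # number of parameters in group i
        alpha_new[i] = (W_i - alpha_old[i] * trace_Ainv[i]) / float(lam @ lam)
    return alpha_new

# toy usage: a single group with two MAP parameters
alpha = update_alpha({"w": 1.0}, {"w": np.array([1.0, 2.0])}, {"w": 0.5})
```

Iterating this update against re-optimized λMAP is what Section 5.3 below calls the two-step procedure.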
The computational complexity of this approximation is only O(m^3 + nm^2), while the exact computation has complexity O(n^3), where n is the number of parameters and m is the size of the history budget used by LBFGS. Since m is usually much smaller than n, the computational complexity is effectively O(nm^2). See Theorem 2.2 in [13] for a more detailed account of this approximation method.\n\n5.3 Hyperparameter Update\n\nThe hyperparameter α is iteratively updated by a two-step procedure. In the first step we fix the hyperparameter α and optimize the model parameters by maximizing the log-likelihood in Equation (7) using LBFGS. In the second step, we fix the model parameters and then update α using Equation (13). This two-step procedure is carried out iteratively until the norm of α changes by less than a threshold. Figure 2 shows the learning curve of the hyperparameter on a protein secondary structure prediction benchmark. In our experiments, the update usually converges in fewer than 15 iterations. We also found that this method achieves almost the same test performance as the grid search approach on two public benchmarks.\n\nFigure 2: Learning curve of hyperparameter α. (The original plot shows accuracy on the y-axis, over roughly 79.2-80.6%, against update iterations 1-10 on the x-axis.)\n\n6 Related Work\n\nMost existing methods for sequence labeling are built under the framework of graphical models such as HMM and CRF. Since these approaches are incapable of capturing highly complex relationships between observations and labels, many structured models have been proposed for nonlinear modeling of the label-observation dependency. For example, kernelized max margin Markov networks [8], SVM-struct [9] and kernel CRF [16] use nonlinear kernels to model the complex relationship between observations and labels. 
Although these kernelized models are convex, it is still too expensive to train and test them when the observations are of very high dimension. Furthermore, the number of resultant support vectors for these kernel methods is also very large. By contrast, CNF has computational complexity comparable to CRF. Although CNF is non-convex and usually only a local minimum can be obtained, CNF still achieves very good performance in real-world applications. Very recently, the probabilistic neural language model [17] and the recurrent temporal restricted Boltzmann machine [18] have been proposed for natural language and time series modeling. These two methods model sequential data using a directed graph structure, so they are essentially generative models. By contrast, our CNF is a discriminative model, which is mainly used for discriminative prediction on sequence data. The hierarchical recurrent neural networks [19, 20] can be viewed as a hybrid of HMM and neural networks (HMM/NN), building on a directed linear chain. Similarly, CNF can be viewed as a hybrid of CRF and neural networks, which has a global normalization factor and alleviates the label-bias problem.\n\n7 Experiments\n\n7.1 Protein Secondary Structure Prediction\n\nProtein secondary structure (SS) prediction is a fundamental problem in computational biology as well as a typical problem used to evaluate sequence labeling methods. Given a protein sequence consisting of a collection of residues, the problem of protein SS prediction is to predict the secondary structure type at each residue. A variety of methods have been described in the literature for protein SS prediction.\n\nGiven a protein sequence, we first run PSI-BLAST [21] to generate a sequence profile and then use this profile as input to predict the SS. A sequence profile is a position-specific scoring matrix X with n × 20 elements, where n is the number of residues in a protein. 
Formally, X = [x1, x2, x3, ..., xn], where each xi is a vector of 20 position-specific scores, one for each of the 20 naturally occurring amino acids. The output we want to predict is Y = [y1, y2, ..., yn], where yi ∈ {H, E, C} represents the secondary structure type at the i-th residue.\n\nWe evaluate all the SS prediction methods using the CB513 benchmark [22], which consists of 513 non-homologous proteins. The true secondary structure of each protein is calculated using DSSP [23], which assigns one of eight possible secondary structure states. We then convert these 8 states into three SS types as follows: H and G to H (helix), B and E to E (sheet), and all other states to C (coil). Q3, the three-state accuracy averaged over all positions, is used as the evaluation measure. To obtain good performance, we also linearly transform X into values in [0, 1] as suggested by Kim et al. [24]:\n\nS(x) = 0 if x < −5;  0.1x + 0.5 if −5 ≤ x ≤ 5;  1 if x > 5.\n\nTo determine the number of gate functions for CNF, we enumerate this number over the set {10, 20, 30, 40, 60, 100}. We also enumerate the window size for CNF over the set {7, 9, 11, 13, 15, 17} and find that the best evidence is achieved with window size 13 and K = 30. Two baseline methods are used for comparison: conditional random fields and neural networks. All the parameters of these methods are carefully tuned. The best window sizes for neural networks and CRF are 15 and 13, respectively. We also compared our methods with other popular secondary structure prediction programs. CRF, neural networks, Semi-Markov HMM [25], SVMpsi [24], PSIPRED [2] and CNF use the sequence profile generated by PSI-BLAST as described above. SVMpro [26] uses the position-specific frequency as its input feature. 
YASSPP [27] and SPINE [28] also use other residue-specific features in addition to the sequence profile.\n\nTable 1 lists the overall performance of a variety of methods on the CB513 data set. As shown in this table, there are two types of gains in accuracy. First, by using one hidden layer to model the nonlinear relationship between input and output, CNF achieves a very significant gain over linear chain CRF. This also confirms that a strong nonlinear relationship exists between sequence profile and secondary structure type. Second, by modeling the interdependency between neighboring residues, CNF also obtains much better prediction accuracy than neural networks. We also tested the hybrid HMM/NN on this dataset; its accuracy is about three percent less than that of CNF.\n\nTable 1: Performance of various methods for protein secondary structure prediction on the CB513 dataset. Semi-Markov HMM is a segmental semi-Markov model for sequence labeling. SVMpro and SVMpsi are jury methods with SVMs (Gaussian kernel) as the basic classifiers. YASSPP uses SVMs with a specifically designed profile kernel function. PSIPRED is a two-stage double-hidden-layer neural network. SPINE is a voting system with multiple coupled neural networks. YASSPP, PSIPRED and SPINE also use other features besides the PSSM scores. A * symbol indicates that the method was tested by 10-fold cross-validation on CB513, while the others were tested by 7-fold cross-validation.\n\nMethod | Q3 (%)\nConditional Random Fields | 72.9\nSVM-struct (linear kernel) | 73.1\nNeural Networks (one hidden layer) | 72\nNeural Networks (two hidden layers) | 74\nSemi-Markov HMM | 72.8\nSVMpro | 73.5\nSVMpsi | 76.6\nPSIPRED | 76\nYASSPP | 77.8\nSPINE* | 76.8\nConditional Neural Fields | 80.1 ±0.3\nConditional Neural Fields* | 80.5 ±0.3\n\n
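The 8-to-3 state conversion and the Q3 measure used above can be sketched as follows (the function names are ours, not the paper's):

```python
import numpy as np

# DSSP 8-state labels collapsed to 3 states as described in the text:
# H, G -> H (helix); B, E -> E (sheet); all other states -> C (coil)
TO_THREE = {"H": "H", "G": "H", "B": "E", "E": "E"}

def collapse(ss8):
    return "".join(TO_THREE.get(s, "C") for s in ss8)

def q3(pred, true):
    # Q3: fraction of residues whose 3-state label is predicted correctly
    pred = np.asarray(list(pred))
    true = np.asarray(list(true))
    return float(np.mean(pred == true))
```

For instance, the DSSP string HGEBTSIC collapses to HHEECCCC, and a prediction that gets 3 of 4 residues right scores Q3 = 0.75.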
By seamlessly integrating neural networks and CRF, CNF outperforms all the other state-of-the-art prediction methods on this dataset. We also tried the Max-Margin Markov Network [8] and SVM-struct1 with an RBF kernel on this dataset. However, because the dataset is large and the feature space is of high dimension, it is impossible for these kernel-based methods to finish training within a reasonable amount of time; both of them failed to converge within 120 hours. The running time of CNF learning and inference is about twice that of CRF.\n\n7.2 Handwriting Recognition\n\nHandwriting recognition (OCR) is another widely-used benchmark for sequence labeling algorithms. We use the subset of the OCR dataset chosen by Taskar et al. [8], which contains 6876 sequences. In this dataset, each word consists of a sequence of characters, and each character is represented by an image with 16 × 8 binary pixels. We use only the vector of pixel values as input features and do not use any higher-level features. Formally, the input X = [x1, x2, x3, ..., xn] is a sequence of 128-dimensional binary vectors. The output we want to predict is a sequence of labels, where each label yi for image xi is one of the 26 classes {a, b, c, ..., z}. The accuracy is defined as the average accuracy over all characters.\n\nThe number of gate functions for CNF is selected from the set {10, 20, 30, 40, 60, 100}, and we find that the best evidence is achieved when K = 40. Window sizes for all methods are fixed to 1. All the methods are tested using 10-fold cross-validation, and their performance is shown in Table 2. As shown in this table, CNF achieves superior performance over the log-linear methods, SVM, CRF and neural networks. 
CNF is also comparable with two slightly different max margin Markov network models.\n\n8 Discussion\n\nWe present a probabilistic graphical model, conditional neural fields (CNF), for sequence labeling tasks that require an accurate account of the nonlinear relationship between input and output. CNF is a very natural integration of conditional graphical models and neural networks and thus inherits advantages from both. On one hand, through its neural network component, CNF can model the nonlinear relationship between input and output. On the other hand, through its graphical representation, CNF can model the interdependency between output labels. While CNF is more sophisticated and expressive than CRF, the computational complexity of learning and inference is not necessarily higher. Our experimental results on large-scale datasets indicate that CNF can be trained and tested almost as efficiently as CRF and much faster than kernel-based methods. Although the CNF objective is not convex, CNF can still be trained using a quasi-Newton method to obtain a locally optimal solution, which usually works very well in real-world applications.\n\n1 http://svmlight.joachims.org/svm_struct.html\n\nTable 2: Performance of various methods on handwriting recognition. The results of logistic regression, SVM and max margin Markov networks are taken from [8]. Both CNF and the neural networks use 40 neurons in the hidden layer. The CRF accuracy we obtained (78.9%) is a bit better than the 76% reported in [8].\n\nMethod | Accuracy (%)\nLogistic Regression | 71\nSVM (linear) | 71\nSVM (quadratic) | 80\nSVM (cubic) | 81\nSVM-struct | 80\nConditional Random Fields | 78.9\nNeural Networks | 79.8\nMMMN (linear) | 80\nMMMN (quadratic) | 87\nMMMN (cubic) | 87\nConditional Neural Fields | 86.9 ±0.4\n\nIn two real-world applications, CNF significantly outperforms two baseline methods, CRF and neural networks. 
On protein secondary structure prediction, CNF achieves the best performance among all the methods we tested. On handwriting recognition, CNF also compares favorably with the best method, the max-margin Markov network. We are currently generalizing our CNF model to a second-order Markov chain and a more general graph structure, and are also studying whether interposing more than one hidden layer between input and output will improve the predictive power of CNF.\n\nAcknowledgements\n\nWe thank Nathan Srebro and David McAllester for insightful discussions.\n\nReferences\n\n[1] Fei Sha and Fernando Pereira. Shallow parsing with conditional random fields. In Proceedings of Human Language Technology-NAACL 2003.\n\n[2] D. T. Jones. Protein secondary structure prediction based on position-specific scoring matrices. Journal of Molecular Biology, 292(2):195–202, September 1999.\n\n[3] Feng Zhao, Shuaicheng Li, Beckett W. Sterner, and Jinbo Xu. Discriminative learning for protein conformation sampling. Proteins, 73(1):228–240, October 2008.\n\n[4] Feng Zhao, Jian Peng, Joe Debartolo, Karl F. Freed, Tobin R. Sosnick, and Jinbo Xu. A probabilistic graphical model for ab initio folding. In RECOMB '09: Proceedings of the 13th Annual International Conference on Research in Computational Molecular Biology, pages 59–73, Berlin, Heidelberg, 2009. Springer-Verlag.\n\n[5] Sy Bor Wang, Ariadna Quattoni, Louis-Philippe Morency, and David Demirdjian. Hidden conditional random fields for gesture recognition. In CVPR 2006.\n\n[6] Lawrence R. Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. In Proceedings of the IEEE, 1989.\n\n[7] John D. Lafferty, Andrew McCallum, and Fernando C. N. Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In ICML 2001.\n\n[8] Ben Taskar, Carlos Guestrin, and Daphne Koller. 
Max-margin markov networks.\n\nIn NIPS\n\n2003.\n\n8\n\n\f[9] Ioannis Tsochantaridis, Thomas Hofmann, Thorsten Joachims, and Yasemin Altun. Support\n\nvector machine learning for interdependent and structured output spaces. In ICML 2004.\n\n[10] Nam Nguyen and Yunsong Guo. Comparisons of sequence labeling algorithms and extensions.\n\nIn ICML 2007.\n\n[11] Yan Liu, Jaime Carbonell, Judith Klein-Seetharaman, and Vanathi Gopalakrishnan. Compari-\nson of probabilistic combination methods for protein secondary structure prediction. Bioinfor-\nmatics, 20(17), November 2004.\n\n[12] D. C. Liu and J. Nocedal. On the limited memory bfgs method for large scale optimization.\n\nMathematical Programming, 45(3), 1989.\n\n[13] Richard H. Byrd, Jorge Nocedal, and Robert B. Schnabel. Representations of quasi-newton\nmatrices and their use in limited memory methods. Mathematical Programming, 63(2), 1994.\n\n[14] David J. C. Mackay. A practical bayesian framework for backpropagation networks. Neural\n\nComputation, 4:448\u2013472, 1992.\n\n[15] Christopher M. Bishop. Neural Networks for Pattern Recognition. Oxford University Press,\n\nNovember 1995.\n\n[16] John Lafferty, Xiaojin Zhu, and Yan Liu. Kernel conditional random \ufb01elds: representation and\n\nclique selection. In ICML 2004.\n\n[17] Yoshua Bengio, R\u00b4ejean Ducharme, Pascal Vincent, and Christian Janvin. A neural probabilistic\n\nlanguage model. Journal of Machine Learning Research, 3:1137\u20131155, 2003.\n\n[18] Ilya Sutskever, Geoffrey E Hinton, and Graham Taylor. The recurrent temporal restricted\nboltzmann machine. In D. Koller, D. Schuurmans, Y. Bengio, and L. Bottou, editors, NIPS\n2009.\n\n[19] Barbara Hammer. Recurrent networks for structured data - a unifying approach and its proper-\n\nties. Cognitive Systems Research, 2002.\n\n[20] Alex Graves and Juergen Schmidhuber. Of\ufb02ine handwriting recognition with multidimensional\nrecurrent neural networks. In D. Koller, D. Schuurmans, Y. Bengio, and L. 
Bottou, editors, NIPS 2009.\n\n[21] S. F. Altschul, T. L. Madden, A. A. Schäffer, J. Zhang, Z. Zhang, W. Miller, and D. J. Lipman. Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Research, 25, September 1997.\n\n[22] James A. Cuff and Geoffrey J. Barton. Evaluation and improvement of multiple sequence methods for protein secondary structure prediction. Proteins: Structure, Function, and Genetics, 34, 1999.\n\n[23] Wolfgang Kabsch and Christian Sander. Dictionary of protein secondary structure: Pattern recognition of hydrogen-bonded and geometrical features. Biopolymers, 22(12):2577–2637, December 1983.\n\n[24] H. Kim and H. Park. Protein secondary structure prediction based on an improved support vector machines approach. Protein Engineering, 16(8), August 2003.\n\n[25] Wei Chu, Zoubin Ghahramani, and David L. Wild. A graphical model for protein secondary structure prediction. In ICML 2004.\n\n[26] Sujun Hua and Zhirong Sun. A novel method of protein secondary structure prediction with high segment overlap measure: Support vector machine approach. Journal of Molecular Biology, 308, 2001.\n\n[27] George Karypis. YASSPP: Better kernels and coding schemes lead to improvements in protein secondary structure prediction. Proteins: Structure, Function, and Bioinformatics, 64(3):575–586, 2006.\n\n[28] O. Dor and Y. Zhou. Achieving 80% ten-fold cross-validated accuracy for secondary structure prediction by large-scale training. Proteins: Structure, Function, and Bioinformatics, 66, March 2007.\n", "award": [], "sourceid": 935, "authors": [{"given_name": "Jian", "family_name": "Peng", "institution": null}, {"given_name": "Liefeng", "family_name": "Bo", "institution": null}, {"given_name": "Jinbo", "family_name": "Xu", "institution": null}]}