{"title": "Training a 3-Node Neural Network is NP-Complete", "book": "Advances in Neural Information Processing Systems", "page_first": 494, "page_last": 501, "abstract": null, "full_text": "TRAINING A 3-NODE NEURAL NETWORK IS NP-COMPLETE \n\nAvrim Blum* \nMIT Lab. for Computer Science \nCambridge, Mass. 02139 USA \n\nRonald L. Rivest\u2020 \nMIT Lab. for Computer Science \nCambridge, Mass. 02139 USA \n\nABSTRACT \n\nWe consider a 2-layer, 3-node, n-input neural network whose nodes \ncompute linear threshold functions of their inputs. We show that it \nis NP-complete to decide whether there exist weights and thresholds \nfor the three nodes of this network so that it will produce output \nconsistent with a given set of training examples. We extend the result \nto other simple networks. This result suggests that those looking for \nperfect training algorithms cannot escape inherent computational \ndifficulties just by considering only simple or very regular networks. \nIt also suggests the importance, given a training problem, of finding \nan appropriate network and input encoding for that problem. It is \nleft as an open problem to extend our result to nodes with non-linear \nfunctions such as sigmoids. \n\nINTRODUCTION \n\nOne reason for the recent surge in interest in neural networks is the \ndevelopment of the \"back-propagation\" algorithm for training neural networks. The \nability to train large multi-layer neural networks is essential for utilizing neural \nnetworks in practice, and the back-propagation algorithm promises just that. \nIn practice, however, the back-propagation algorithm runs very slowly, and the \nquestion naturally arises as to whether there are necessarily intrinsic \ncomputational difficulties associated with training neural networks, or whether better \ntraining algorithms might exist. This paper provides additional support for the \nposition that training neural networks is intrinsically difficult. 
\n\nA common method of demonstrating a problem to be intrinsically hard is to \nshow the problem to be \"NP-complete\". The theory of NP-complete problems \nis well-understood (Garey and Johnson, 1979), and many infamous problems, \nsuch as the traveling salesman problem, are now known to be NP-complete. \nWhile NP-completeness does not render a problem totally unapproachable in \npractice, it usually implies that only small instances of the problem can be solved \nexactly, and that large instances can at best only be solved approximately, even \nwith large amounts of computer time. \n\n*Supported by an NSF graduate fellowship. \n\u2020This paper was prepared with support from NSF grant DCR-8607494, ARO Grant \nDAAL03-86-K-0171, and the Siemens Corporation. \n\nThe work in this paper is inspired by Judd (Judd, 1987) who shows the following \nproblem to be NP-complete: \n\n\"Given a neural network and a set of training examples, does there \nexist a set of edge weights for the network so that the network \nproduces the correct output for all the training examples?\" \n\nJudd also shows that the problem remains NP-complete even if the network is only \nrequired to produce the correct output for two-thirds of the training examples, \nwhich implies that even approximately training a neural network is intrinsically \ndifficult in the worst case. Judd produces a class of networks and training \nexamples for those networks such that any training algorithm will perform poorly \non some networks and training examples in that class. The results, however, \ndo not specify any particular \"hard network\", that is, any single network hard \nfor all algorithms. Also, the networks produced have a number of hidden nodes \nthat grows with the number of inputs, as well as a quite irregular connection \npattern. 
\n\nWe extend his result by showing that it is NP-complete to train a specific very \nsimple network, having only two hidden nodes and a regular interconnection \npattern. We also present classes of regular 2-layer networks such that for all \nnetworks in these classes, the training problem is hard in the worst case (in \nthat there exist some hard sets of training examples). The NP-completeness \nproof also yields results showing that finding approximation algorithms that \nmake only one-sided error, or that approximate closely the minimum number \nof hidden-layer nodes needed for a network to be powerful enough to correctly \nclassify the training data, is probably hard, in that these problems can be related \nto other difficult (but not known to be NP-complete) approximation problems. \n\nOur results, like Judd's, are described in terms of \"batch\"-style learning \nalgorithms that are given all the training examples at once. It is worth noting that \ntraining with an \"incremental\" algorithm, such as back-propagation, that sees the \nexamples one at a time is at least as hard. Thus the NP-completeness result \ngiven here also implies that incremental training algorithms are likely to run \nslowly. \n\nOur results state that given a network of the classes considered, for any training \nalgorithm there will be some types of training problems such that the algorithm \nwill perform poorly as the problem size increases. The results leave open the \npossibility that given a training problem that is hard for some network, there \nmight exist a different network and encoding of the input that make training \neasy. \n\nFigure 1: The three node neural network. \n\nTHE NEURAL NETWORK TRAINING PROBLEM \n\nThe multilayer network that we consider has n binary inputs and three nodes: \nN_1, N_2, N_3. All the inputs are connected to nodes N_1 and N_2. \n
The outputs \nof hidden nodes N_1 and N_2 are connected to output node N_3, which gives the \noutput of the network. \n\nEach node N_i computes a linear threshold function f_i on its inputs. If N_i has \ninput x = (x_1, ..., x_m), then for some constants a_0, ..., a_m, \n\nf_i(x) = +1 if a_1 x_1 + a_2 x_2 + ... + a_m x_m > a_0, and f_i(x) = -1 otherwise. \n\nThe a_j's (j \u2265 1) are typically viewed as weights on the incoming edges and a_0 \nas the threshold. \n\nThe training algorithm for the network is given a set of training examples. Each \nis either a positive example (an input for which the desired network output is +1) \nor a negative example (an input for which the desired output is -1). Consider \nthe following problem. Note that we have stated it as a decision (\"yes\" or \"no\") \nproblem, but that the search problem (finding the weights) is at least equally \nhard. \n\nTRAINING A 3-NODE NEURAL NETWORK: \n\nGiven: A set of O(n) training examples on n inputs. \nQuestion: Do there exist linear threshold functions f_1, f_2, f_3 for nodes N_1, N_2, N_3 \nsuch that the network of figure 1 produces outputs consistent with the training \nset? \n\nTheorem: Training a 3-node neural network is NP-complete. \n\nWe also show (proofs omitted here due to space requirements) NP-completeness \nresults for the following networks: \n\n1. The 3-node network described above, even if any or all of the weights for \none hidden node are required to equal the corresponding weights of the \nother, so possibly only the thresholds differ, and even if any or all of the \nweights are forced to be from {+1, -1}. \n\n2. Any k-hidden node, for k bounded by some polynomial in n (e.g., k = n^2), \ntwo-layer fully-connected network with linear threshold function nodes \nwhere the output node is required to compute the AND function of its \ninputs. \n\n3. The 2-layer, 3-node n-input network with an XOR output node, if ternary \nfeatures are allowed. 
\n\nIn addition we show (proof omitted here) that any set of positive and negative \ntraining examples classifiable by the 3-node network with XOR output node (for \nwhich training is NP-complete) can be correctly classified by a perceptron with \nO(n^2) inputs which consist of the original n inputs and all products of pairs of \nthe original n inputs (for which training can be done in polynomial time using \nlinear programming techniques). \n\nTHE GEOMETRIC POINT OF VIEW \n\nA training example can be thought of as a point in n-dimensional space, labeled \n'+' or '-' depending on whether it is a positive or negative example. The points \nare vertices of the n-dimensional hypercube. The zeros of the functions f_1 and \nf_2 for the hidden nodes can be thought of as (n-1)-dimensional hyperplanes \nin this space. The planes P_1 and P_2 corresponding to the functions f_1 and \nf_2 divide the space into four quadrants according to the four possible pairs of \noutputs for nodes N_1 and N_2. If the planes are parallel, then one or two of the \nquadrants is degenerate (non-existent). Since the output node receives as input \nonly the outputs of the hidden nodes N_1 and N_2, it can only distinguish between \npoints in different quadrants. The output node is also restricted to be a linear \nfunction. It may not, for example, output \"+1\" when its inputs are (+1, +1) \nand (-1, -1), and output \"-1\" when its inputs are (+1, -1) and (-1, +1). \n\nSo, we may reduce our question to the following: given O(n) points in {0, 1}^n, \neach point labeled '+' or '-', does there exist either \n\n1. a single plane that separates the '+' points from the '-' points, or \n2. two planes that partition the points so that either one quadrant contains \nall and only '+' points or one quadrant contains all and only '-' points? 
\n\nWe first look at the restricted question of whether there exist two planes that \npartition the points such that one quadrant contains all and only the '+' points. \nThis corresponds to having an \"AND\" function at the output node. We will call \nthis problem: \"2-Linear Confinement of Positive Boolean Examples\". Once we \nhave shown this to be NP-complete, we will extend the proof to the full problem \nby adding examples that disallow the other possibilities at the output node. \nMegiddo (Megiddo, 1986) has shown that for O(n) arbitrary '+' and '-' points \nin n-dimensional Euclidean space, the problem of whether there exist two \nhyperplanes that separate them is NP-complete. His proof breaks down, however, \nwhen one restricts the coordinate values to {0, 1} as we do here. Our proof \nturns out to be of a quite different style. \n\nSET SPLITTING \n\nThe following problem was proven to be NP-complete by Lovasz (Garey and \nJohnson 1979). \n\nSET-SPLITTING: \nGiven: A finite set S and a collection C of subsets c_i of S. \nQuestion: Do there exist disjoint sets S_1, S_2 such that S_1 \u222a S_2 = S and, \nfor all i, c_i \u2284 S_1 and c_i \u2284 S_2? \n\nThe Set-Splitting problem is also known as 2-non-Monotone Colorability. Our \nuse of this problem is inspired by its use by Kearns, Li, Pitt, and Valiant to \nshow that learning k-term DNF is NP-complete (Kearns et al. 1987) and the \nstyle of the reduction is similar. \n\nTHE REDUCTION \n\nSuppose we are given an instance of the Set-Splitting problem: a set \nS = {s_1, ..., s_n} and a collection C = {c_j} of subsets c_j of S. \nCreate the following signed points on the n-dimensional hypercube {0, 1}^n: \n\n\u2022 Let the origin 0^n be labeled '+'. \n\u2022 For each s_i, put a point labeled '-' at the neighbor to the origin that has \na 1 in the ith bit, that is, at (0...010...0) with the single 1 in position i. \nCall this point p_i. \n\nFigure 2: An example. 
\n\n\u2022 For each c_j = {s_{j_1}, ..., s_{j_{k_j}}}, put a point labeled '+' at the location whose \nbits are 1 at exactly the positions j_1, j_2, ..., j_{k_j}, that is, at p_{j_1} + ... + p_{j_{k_j}}. \n\nFor example, let S = {s_1, s_2, s_3}, C = {c_1, c_2}, c_1 = {s_1, s_2}, c_2 = {s_2, s_3}. \nSo, we create '-' points at: (0 0 1), (0 1 0), (1 0 0) \nand '+' points at: (0 0 0), (1 1 0), (0 1 1) in this reduction (see figure 2). \n\nClaim: The given instance of the Set-Splitting problem has a solution if and only if the \nconstructed instance of the 2-Linear Confinement of Positive Boolean Examples \nproblem has a solution. \n\nProof: (\u21d2) \nGiven S_1 from the solution to the Set-Splitting instance, create the plane P_1: \na_1 x_1 + ... + a_n x_n = -1/2, where a_i = -1 if s_i \u2208 S_1, and a_i = n if s_i \u2209 S_1. Let \nthe vectors a = (a_1, ..., a_n), x = (x_1, ..., x_n). \nThis plane separates from the origin exactly the '-' points corresponding to \ns_i \u2208 S_1 and no '+' points. Notice that for each s_i \u2208 S_1, a \u00b7 p_i = -1, and that \nfor each s_i \u2209 S_1, a \u00b7 p_i = n. For each '+' point p, a \u00b7 p > -1/2 since either p is \nthe origin or else p has a 1 in a bit i such that s_i \u2209 S_1; that bit contributes n, \nwhich outweighs the at most n - 1 other bits contributing -1 each. \n\nSimilarly, create the plane P_2 from S_2. \n\nFigure 3: The gadget. \n\n(\u21d0) \nLet S_1 be the set of points separated from the origin by P_1 and S_2 be those \npoints separated by P_2. Place any points separated by both planes in either \nS_1 or S_2 arbitrarily. Sets S_1 and S_2 cover S since all '-' points are separated \nfrom the origin by at least one of the planes. Consider some c_j = {s_{j_1}, ..., s_{j_{k_j}}} \nand the corresponding '-' points p_{j_1}, ..., p_{j_{k_j}}. If, say, c_j \u2282 S_1, then P_1 must \nseparate all the p_{j_i} from the origin. Therefore, P_1 must separate p_{j_1} + ... + p_{j_{k_j}} \nfrom the origin (the origin lies on the positive side, so the threshold of P_1 is \nnegative, and a sum of values below a negative threshold is also below it). \nSince that point is the '+' point corresponding to c_j, the '+' \npoints are not all confined to one quadrant, contradicting our assumptions. \n
So, \nno c_j can be contained in S_1. Similarly, no c_j can be contained in S_2. \u25a0 \n\nWe now add a \"gadget\" consisting of 6 new points to handle the other \npossibilities at the output node. The gadget forces that the only way in which two \nplanes could linearly separate the '+' points from the '-' points would be to \nconfine the '+' points to one quadrant. We add three new dimensions, x_{n+1}, \nx_{n+2}, and x_{n+3}, and put '+' points in locations: \n\n(0...0 101), (0...0 011) \n\nand '-' points in locations: \n\n(0...0 100), (0...0 010), (0...0 001), (0...0 111). \n\n(See figure 3.) \nThe '+' points of this cube can be separated from the '-' points by appropriate \nsettings of the weights of planes P_1 and P_2 corresponding to the three new \ndimensions. Given planes P_1': a_1 x_1 + ... + a_n x_n = -1/2 and P_2': b_1 x_1 + ... + \nb_n x_n = -1/2 which solve a 2-Linear Confinement of Positive Boolean Examples \ninstance in n dimensions, expand the solution to handle the gadget by setting \n\nP_1 to a_1 x_1 + ... + a_n x_n + x_{n+1} + x_{n+2} - x_{n+3} = -1/2 \nP_2 to b_1 x_1 + ... + b_n x_n - x_{n+1} - x_{n+2} + x_{n+3} = -1/2 \n\n(P_1 separates '-' point (0...0 001) from the '+' points and P_2 separates the \nother three '-' points from the '+' points). Also, notice that there is no way \nin which just one plane can separate the '+' points from the '-' points in the \ncube and also no way for two planes to confine all the negative points in one \nquadrant. Thus we have proved the theorem. \n\nCONCLUSIONS \n\nTraining a 3-node neural network whose nodes compute linear threshold \nfunctions is NP-complete. \n\nAn open problem is whether the NP-completeness result can be extended to \nneural networks that use sigmoid functions. \n
We believe that it can because the \nuse of sigmoid functions does not seem to alter significantly the expressive power \nof a neural network. Note that Judd (Judd 1987), for the networks he considers, \nshows NP-completeness for a wide variety of node functions including sigmoids. \n\nReferences \n\nJames A. Anderson and Edward Rosenfeld, editors. Neurocomputing: Foundations \nof Research. MIT Press, 1988. \n\nM. Garey and D. Johnson. Computers and Intractability: A Guide to the \nTheory of NP-Completeness. W. H. Freeman, San Francisco, 1979. \n\nJ. Stephen Judd. Learning in networks is hard. In Proceedings of the First \nInternational Conference on Neural Networks, pages 685-692, I.E.E.E., \nSan Diego, California, June 1987. \n\nJ. Stephen Judd. Neural Network Design and the Complexity of Learning. \nPhD thesis, Computer and Information Science dept., University of \nMassachusetts, Amherst, Mass. U.S.A., 1988. \n\nMichael Kearns, Ming Li, Leonard Pitt, and Leslie Valiant. On the learnability \nof boolean formulae. In Proceedings of the Nineteenth Annual ACM \nSymposium on Theory of Computing, pages 285-295, New York, New York, \nMay 1987. \n\nNimrod Megiddo. On The Complexity of Polyhedral Separability. Technical \nReport RJ 5252, IBM Almaden Research Center, August 1986. \n\nMarvin Minsky and Seymour Papert. Perceptrons: An Introduction to \nComputational Geometry. The MIT Press, 1969. \n\nDavid E. Rumelhart and James L. McClelland, editors. Parallel Distributed \nProcessing (Volume I: Foundations). MIT Press, 1986. \n", "award": [], "sourceid": 125, "authors": [{"given_name": "Avrim", "family_name": "Blum", "institution": null}, {"given_name": "Ronald", "family_name": "Rivest", "institution": null}]}