{"title": "On the Infeasibility of Training Neural Networks with Small Squared Errors", "book": "Advances in Neural Information Processing Systems", "page_first": 371, "page_last": 377, "abstract": null, "full_text": "On the infeasibility of training neural \nnetworks with small squared errors \n\nVan H. Vu \n\nDepartment of Mathematics, Yale University \n\nvuha@math.yale.edu \n\nAbstract \n\nWe demonstrate that the problem of training neural networks with \nsmall (average) squared error is computationally intractable. Con(cid:173)\nsider a data set of M points (Xi, Yi), i = 1,2, ... , M, where Xi are \ninput vectors from Rd, Yi are real outputs (Yi E R). For a net-\nwork 10 in some class F of neural networks, (11M) L~l (fO(Xi)(cid:173)\nYi)2)1/2 - inlfEF(l/ M) \"2:f!1 (f(Xi) - YJ2)1/2 is the (avarage) rel(cid:173)\native error occurs when one tries to fit the data set by 10. We will \nprove for several classes F of neural networks that achieving a rela(cid:173)\ntive error smaller than some fixed positive threshold (independent \nfrom the size of the data set) is NP-hard. \n\n1 \n\nIntroduction \n\nGiven a data set (Xi, Yi), i = 1,2, ... , M. Xi are input vectors from Rd , Yi are real \noutputs (Yi E R). We call the points (Xi, Yi) data points. The training problem \nfor neural networks is to find a network from some class (usually with fixed number \nof nodes and layers), which fits the data set with small error. In the following we \ndescribe the problem with more details. \nLet F be a class (set) of neural networks, and a be a metric norm in RM. To \neach 1 E F, associate an error vector Ef = (1/(Xd - Yil)f;l (EF depends on the \ndata set, of course, though we prefer this notation to avoid difficulty of having too \nmany subindices). The norm of Ej in a shows how well the network 1 fits the data \nregarding to this particular norm. 
Furthermore, let e_{\alpha,F} denote the smallest error achieved by a network in F, namely:

e_{\alpha,F} = \min_{f \in F} ||E_f||_\alpha

In this context, the training problem we consider here is to find f \in F such that ||E_f||_\alpha - e_{\alpha,F} \le \epsilon_F, where \epsilon_F is a positive number given in advance which does not depend on the size M of the data set. We will call \epsilon_F the relative error. The norm \alpha is chosen by the nature of the training process; the most common norms are:

l_\infty norm: ||v||_\infty = \max_i |v_i| (interpolation problem)
l_2 norm: ||v||_2 = ((1/M) \sum_{i=1}^M v_i^2)^{1/2}, where v = (v_i)_{i=1}^M (least squares error problem).

The quantity ||E_f||_2 is usually referred to as the empirical error of the training process. The first goal of this paper is to show that achieving small empirical error is NP-hard. From now on, we work with the l_2 norm, if not otherwise specified.

A question of great importance is: given the data set, F and \epsilon_F in advance, could one find an efficient algorithm to solve the training problem formulated above? By efficiency we mean an algorithm terminating in polynomial time (polynomial in the size of the input). This question is closely related to the problem of learning neural networks in polynomial time (see [3]). The input of the algorithm is the data set; by its size we mean the number of bits required to write down all (X_i, Y_i).

Question 1. Given F, \epsilon_F and a data set, could one find an efficient algorithm which produces a function f \in F such that ||E_f||_2 < e_{2,F} + \epsilon_F?

Question 1 is very difficult to answer in general. In this paper we will investigate the following important sub-question:

Question 2. Can one achieve arbitrarily small relative error using polynomial algorithms?

Our purpose is to give a negative answer to Question 2. This question was posed by L. Jones in his seminar at Yale (1996).
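The two norms above, applied to the error vector E_f, can be illustrated in a few lines of code (a minimal sketch; the toy network and data set are made up for the example):

```python
import math

def step(t):
    # step(x) = 1 if x is positive and zero otherwise
    return 1.0 if t > 0 else 0.0

def empirical_errors(f, data):
    """Return (||E_f||_inf, ||E_f||_2) for E_f = (|f(X_i) - Y_i|)_{i=1}^M."""
    e = [abs(f(x) - y) for x, y in data]
    M = len(e)
    l_inf = max(e)                              # interpolation problem
    l_2 = math.sqrt(sum(v * v for v in e) / M)  # least squares error problem
    return l_inf, l_2

# Toy 1-dimensional data set and a slightly misplaced step network:
data = [(-1.0, 0.0), (0.05, 1.0), (2.0, 1.0), (-0.2, 0.0)]
f = lambda x: step(x - 0.1)
print(empirical_errors(f, data))  # -> (1.0, 0.5); the point at x = 0.05 is misfit
```

Note how a single misfit point keeps the l_infty error at 1 while the normalized l_2 error shrinks with M, which is why the two norms lead to genuinely different training problems.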
The crucial point here is that we are dealing with the l_2 norm, which is very important from the statistical point of view. Our investigation is also inspired by former works [2], [6], [7], etc., which show negative results in the l_\infty norm case.

Definition. A positive number \epsilon is a threshold of a class F of neural networks if the training problem by networks from F with relative error less than \epsilon is NP-hard (i.e., computationally infeasible).

In order to provide a negative answer to Question 2, we are going to show the existence of thresholds (independent of the size of the data set) for the following classes of networks:

\u2022 F_n = {f | f(x) = (1/n) \sum_{i=1}^n step(a_i x - b_i)}
\u2022 F'_n = {f | f(x) = \sum_{i=1}^n c_i step(a_i x - b_i)}
\u2022 G_n = {g | g(x) = \sum_{i=1}^n c_i \phi_i(a_i x - b_i)}

where n is a positive integer, step(x) = 1 if x is positive and zero otherwise, a_i and x are vectors from R^d, the b_i are real numbers, and the c_i are positive numbers. It is clear that the class F'_n contains F_n; the reason why we distinguish these two cases is that the proof for F_n is relatively easy to present, while containing the most important ideas. In the third class, the functions \phi_i are sigmoid functions which satisfy certain Lipschitz conditions (for more details see [9]).

Main Theorem
(i) The classes F_1, F_2, F'_2 and G_2 have absolute constant (positive) thresholds.
(ii) For every class F_{n+2}, n > 0, there is a threshold of the form c n^{-3/2} d^{-1/2}.
(iii) For every F'_{n+2}, n > 0, there is a threshold of the form c n^{-3/2} d^{-3/2}.
(iv) For every class G_{n+2}, n > 0, there is a threshold of the form c n^{-5/2} d^{-1/2}.
In the last three statements, c is an absolute positive constant.

Here is the key argument of the proof. Assume that there is an algorithm A which solves the training problem in some class (say F_n) with relative error \epsilon.
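To make the classes above concrete, networks from F_n and F'_n can be evaluated as follows (a sketch with helper names of our own choosing; the paper defines the classes only mathematically):

```python
def step(t):
    # step(x) = 1 if x is positive and zero otherwise
    return 1.0 if t > 0 else 0.0

def dot(a, x):
    return sum(ai * xi for ai, xi in zip(a, x))

def eval_Fn(A, b, x):
    """A network in F_n: f(x) = (1/n) sum_i step(a_i . x - b_i)."""
    n = len(A)
    return sum(step(dot(A[i], x) - b[i]) for i in range(n)) / n

def eval_Fn_prime(A, b, c, x):
    """A network in F'_n: f(x) = sum_i c_i step(a_i . x - b_i), with c_i > 0."""
    return sum(c[i] * step(dot(A[i], x) - b[i]) for i in range(len(A)))

# Taking every c_i = 1/n in F'_n recovers F_n, which is why F_n is contained in F'_n:
A, b, x = [[1.0, -1.0], [0.5, 2.0]], [0.0, 1.0], [1.0, 1.0]
assert eval_Fn(A, b, x) == eval_Fn_prime(A, b, [0.5, 0.5], x) == 0.5
```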
From some (properly chosen) NP-hard problem, we will construct a data set so that, if \epsilon is sufficiently small, then the solution found by A (given the constructed data set as input) in F_n implies a solution for the original NP-hard problem. This gives a lower bound on \epsilon, if we assume that the algorithm A is polynomial. In all proofs the leading parameter is d (the dimension of the data inputs), so by polynomial we mean a polynomial with d as variable. All the input (data) sets constructed will have polynomial size in d.

The paper is organized as follows. In the next Section, we discuss earlier results concerning the l_\infty norm. In Section 3, we display the NP-hard results we will use in the reduction. In Section 4, we prove the Main Theorem for the class F_2 and mention the method to handle the more general cases. We conclude with some remarks and open questions in Section 5.

To end this Section, let us mention one important corollary. The Main Theorem implies that learning F_n, F'_n and G_n (with respect to the l_2 norm) is hard. For more about the connection between the complexity of training and learning problems, we refer to [3], [5].

Notation: Throughout the paper, U_d denotes the unit hypercube in R^d. For any number x, x_d denotes the vector (x, x, ..., x) of length d. In particular, 0_d denotes the origin of R^d. For any half space H, \bar{H} is the complement of H. For any set A, |A| is the number of elements in A. A function y(d) is said to have order of magnitude \Theta(F(d)) if there are positive constants c < C such that c < y(d)/F(d) < C for all d.

2 Previous works in the l_\infty case

The case \alpha = l_\infty (interpolation problem) was considered by several authors for many different classes of (usually) 2-layer networks (see [6], [2], [7], [8]). Most of the authors investigate the case when there is a perfect fit, i.e., e_{\infty,F} = 0.
In [2], the authors proved that training 2-layer networks containing 3 step function nodes with zero relative error is NP-hard. Their proof can be extended to networks with more inner nodes and various logistic output nodes. This generalized a former result of Megiddo [8] on data sets with rational inputs. Combining the techniques used in [2] with analytic arguments, Lee Jones [6] showed that the training problem with relative error 1/10 by networks with two monotone Lipschitzian sigmoid inner nodes and a linear output node is also NP-hard (NP-complete under certain circumstances). This implies a threshold (in the sense of our definition) of (1/10)M^{-1/2} for the class examined. However, this threshold is rather weak, since it is decreasing in M. This result was also extended to the n inner nodes case [6].

It is also interesting to compare our results with Judd's. In [7] he considered the following problem: \"Given a network and a set of training examples (a data set), does there exist a set of weights so that the network gives correct output for all training examples?\" He proved that this problem is NP-hard even if the network is only required to produce the correct output for two-thirds of the training examples. In fact, it was shown that there is a class of networks and data sets such that any algorithm will perform poorly on some networks and data sets in the class. However, from this result one cannot tell whether there is a network which is \"hard to train\" for all algorithms. Moreover, the number of nodes in the networks grows with the size of the data set. Therefore, in some sense, the result is not independent of the size of the data set.

In our proofs we will exploit many techniques provided in these former works. The crucial one is the reduction used by A. Blum and R. Rivest, which involves the NP-hardness of the Hypergraph 2-Coloring problem.


3 Some NP-hard problems

Definition. Let B be a CNF formula, where each clause has at most k literals. Let max(B) be the maximum number of clauses which can be satisfied by a truth assignment. The APP MAX k-SAT problem is to find a truth assignment which satisfies (1 - \epsilon) max(B) clauses.

The following theorem says that this approximation problem is NP-hard for some small \epsilon.

Theorem 3.1.1 Fix k \ge 2. There is \epsilon_1 > 0 such that finding a truth assignment which satisfies at least (1 - \epsilon_1) max(B) clauses is NP-hard.

The problem is still hard when every literal in B appears in only a few clauses, and every clause contains only a few literals. Let B_3(5) denote the class of CNFs with at most 3 literals in a clause and every literal appearing in at most 5 clauses (see [1]).

Theorem 3.1.2 There is \epsilon_2 > 0 such that finding a truth assignment which satisfies at least (1 - \epsilon_2) max(B) clauses in a formula B \in B_3(5) is NP-hard.

The optimal thresholds in these theorems can be computed, due to recent results in Theoretical Computer Science. Because of space limitation, we do not go into this matter.

Let H = (V, E) be a hypergraph on the vertex set V, where E is the set of edges (a collection of subsets of V). Elements of V are called vertices. The degree of a vertex is the number of edges containing the vertex. We may assume that each edge contains at least two vertices. Color the vertices with the colors Blue and Red. An edge is colorful if it contains vertices of both colors; otherwise we call it monochromatic. Let c(H) be the maximum number of colorful edges one can achieve by a coloring. By a probabilistic argument, it is easy to show that c(H) is at least |E|/2 (in a random coloring, an edge is colorful with probability at least 1/2).
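The probabilistic lower bound c(H) >= |E|/2 can be checked with a quick simulation (a sketch; the encoding of a hypergraph as a list of Python sets is our own choice):

```python
import random

def colorful_count(edges, coloring):
    """Count edges that contain vertices of both colors."""
    return sum(1 for e in edges if len({coloring[v] for v in e}) == 2)

def average_colorful(edges, vertices, trials=2000, seed=0):
    """Average number of colorful edges over uniformly random Red/Blue colorings.
    An edge of size k >= 2 is colorful with probability 1 - 2^(1-k) >= 1/2,
    so the average is at least |E|/2; the maximum c(H) can only be larger."""
    rng = random.Random(seed)
    total = 0
    for _ in range(trials):
        coloring = {v: rng.choice(("Red", "Blue")) for v in vertices}
        total += colorful_count(edges, coloring)
    return total / trials

edges = [{1, 2}, {2, 3, 4}, {1, 3, 4, 5}]
avg = average_colorful(edges, vertices=range(1, 6))
assert avg >= len(edges) / 2   # empirical check of the |E|/2 bound
```

The assertion mirrors the first-moment argument in the text: since the average over colorings is at least |E|/2, some coloring achieves at least that many colorful edges.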
Using Theorem 3.1.2, we can prove the following theorem (for the proof see [9]).

Theorem 3.1.3 There is a constant \epsilon_3 > 0 such that finding a coloring with at least (1 - \epsilon_3)c(H) colorful edges is NP-hard. This statement holds even in the case when all but one of the degrees in H are at most 10.

4 Proof for F_2

We follow the reduction used in [2]. Consider a hypergraph H(V, E) as described in Theorem 3.1.3. Let V = {1, 2, ..., d+1}, where, with the possible exception of the vertex d+1, all other vertices have degree at most 10. Every edge will have at least 2 and at most 4 vertices, so the number of edges is at least (d+1)/4.

Let p_i be the ith unit vector in R^{d+1}, p_i = (0, 0, ..., 0, 1, 0, ..., 0). Furthermore, let x_C = \sum_{i \in C} p_i for every edge C \in E. Let S be a coloring with the maximum number of colorful edges. In this coloring, denote by A_1 the set of colorful edges and by A_2 the set of monochromatic edges. Clearly |A_1| = c(H).

Our data set will be the following (the inputs are from R^{d+1} instead of R^d, but this makes no difference):

{(p_i, 1/2) | 1 \le i \le d} \cup {(x_C, 1) | C \in A_1} \cup {(x_C, 1/2) | C \in A_2} \cup {(p_{d+1}, 1/2)_t, (0_{d+1}, 1)_t}

where (p_{d+1}, 1/2)_t and (0_{d+1}, 1)_t mean that (p_{d+1}, 1/2) and (0_{d+1}, 1) are repeated t times in the data set, respectively. Similarly to [2], consider the two vectors a and b in R^{d+1} where

a = (a_1, ..., a_{d+1}), a_i = -1 if i is Red and a_i = d + 1 otherwise
b = (b_1, ..., b_{d+1}), b_i = -1 if i is Blue and b_i = d + 1 otherwise

It is not difficult to verify that the function f_a = (1/2)(step(ax + 1/2) + step(bx + 1/2)) fits the data perfectly, thus e_{F_2} = ||E_{f_a}|| = 0.

Suppose f = (1/2)(step(cx - \gamma) + step(dx - \delta)) satisfies

M ||E_f||^2 = \sum_{i=1}^M (f(X_i) - Y_i)^2 < M \epsilon^2

Since (f(X_i) - Y_i)^2 \ge 1/4 whenever f(X_i) \ne Y_i, the previous inequality implies:

p_0 = |{i | f(X_i) \ne Y_i}| < 4 M \epsilon^2 = p

The ratio p_0/M is called the misclassification ratio, and we will show that this ratio cannot be arbitrarily small.
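The verification that f_a sends colorful edges to 1, monochromatic edges to 1/2, the origin to 1 and p_{d+1} to 1/2 can be sketched directly (helper names and the small example coloring are ours, not the paper's):

```python
def step(t):
    # step(x) = 1 if x is positive and zero otherwise
    return 1.0 if t > 0 else 0.0

def f_ab(a, b, x):
    """f_a(x) = (1/2)(step(a.x + 1/2) + step(b.x + 1/2))."""
    s = lambda w: step(sum(wi * xi for wi, xi in zip(w, x)) + 0.5)
    return 0.5 * (s(a) + s(b))

# A made-up coloring of V = {1, ..., d+1} with d + 1 = 6:
d1 = 6
coloring = {1: "Red", 2: "Blue", 3: "Red", 4: "Blue", 5: "Red", 6: "Red"}
a = [-1.0 if coloring[i] == "Red" else float(d1) for i in range(1, d1 + 1)]
b = [-1.0 if coloring[i] == "Blue" else float(d1) for i in range(1, d1 + 1)]

def x_of(C):
    # x_C = sum of the unit vectors p_i over i in C
    return [1.0 if i in C else 0.0 for i in range(1, d1 + 1)]

assert f_ab(a, b, x_of({1, 2})) == 1.0   # colorful edge -> output 1
assert f_ab(a, b, x_of({1, 3})) == 0.5   # monochromatic edge -> output 1/2
assert f_ab(a, b, [0.0] * d1) == 1.0     # the point (0_{d+1}, 1)
assert f_ab(a, b, x_of({6})) == 0.5      # the point (p_{d+1}, 1/2)
```

The key mechanism is visible in the weights: a monochromatic (say all-Red) edge contributes only -1's to a.x, so the first step gate stays off, while any colorful edge picks up a +(d+1) term in both a.x and b.x, turning both gates on.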
In order to avoid unnecessary ceiling and floor symbols, we assume the upper bound p is an integer. We choose t = p, so that we can also assume that (0_{d+1}, 1) and (p_{d+1}, 1/2) are well classified. Let H_1 (H_2) be the half space consisting of the points x with cx - \gamma > 0 (dx - \delta > 0). Note that 0_{d+1} \in H_1 \cap H_2 and p_{d+1} \in \bar{H}_1 \cup \bar{H}_2. Now let P_1 denote the set of i such that p_i \notin H_1, and P_2 the set of i such that p_i \in H_1 \cap H_2. Clearly, if j \in P_2, then f(p_j) \ne Y_j; hence |P_2| \le p. Let Q = {C \in E | C \cap P_2 \ne \emptyset}. Note that for each j \in P_2, the degree of j is at most 10; thus |Q| \le 10 |P_2| \le 10 p.

Let A'_1 = {C | f(x_C) = 1}. Since fewer than p points are misclassified, |A'_1 \Delta A_1| < p. Color V by the following rule: (1) if p_i \in P_1, then i is Red; (2) if p_i \in P_2, color i arbitrarily, either Red or Blue; (3) if p_i \notin P_1 \cup P_2, then i is Blue.

Now we can finish the proof by the following two claims:

Claim 1: Every edge in A'_1 \ Q is colorful. It is left to the readers to verify this simple statement.

Claim 2: |A'_1 \ Q| is close to |A_1|. Notice that:

|A_1 \ (A'_1 \ Q)| \le |A_1 \Delta A'_1| + |Q| \le p + 10p = 11p

Observe that the size of the data set is M = d + 2t + |E|, so |E| + d \ge M - 2t = M - 2p. Moreover, |E| \ge (d+1)/4, so |E| \ge (1/5)(M - 2p). On the other hand, |A_1| \ge (1/2)|E|; altogether we obtain |A_1| \ge (1/10)(M - 2p), which yields:

11p / |A_1| \le 110p / (M - 2p) = 440\epsilon^2 / (1 - 8\epsilon^2) = k(\epsilon)

(recall that p = 4M\epsilon^2), so the coloring defined above produces at least (1 - k(\epsilon)) c(H) colorful edges. Choose \epsilon = \epsilon_4 such that k(\epsilon_4) \le \epsilon_3 (see Theorem 3.1.3). Then \epsilon_4 will be a threshold for the class F_2. This completes the proof. Q.E.D.

Due to space limitations, we omit the proofs for the other classes and refer to [9]. However, let us at least describe (roughly) the general method used to handle these cases. The method consists of the following steps:

\u2022 Extend the data set in the previous proof by a set of (special) points.

\u2022 Set the multiplicities of the special points sufficiently high, so that these points must be well classified.


\u2022 If we choose the special points properly, the fact that these points are well classified will determine (roughly) the behavior of all but 2 nodes. In general, we will show that all but 2 nodes have little influence on the outputs of the non-special data points.

\u2022 The problem then basically reduces to the case of two nodes. By modifying the previous proof, we can achieve the desired thresholds.

5 Remarks and open problems

\u2022 Readers may argue about the existence of the (somewhat less natural) data points of high multiplicities. We can avoid using these data points by a combinatorial trick described in [9].

\u2022 The proof in Section 4 could be carried out using Theorem 3.1.2. However, we prefer using the hypergraph coloring terminology (Theorem 3.1.3), which is more convenient and standard. Moreover, Theorem 3.1.3 is interesting in itself, and has not been listed among the well-known \"approximation is hard\" theorems.

\u2022 It remains an open question to determine the right order of magnitude of the thresholds for all the classes we considered (see Section 1). For technical reasons, the thresholds in the Main Theorem for more than two nodes involve the dimension d. We conjecture that there are dimension-free thresholds.

Acknowledgement. We wish to thank A. Blum, A. Barron and L. Lovász for many useful ideas and discussions.

References

[1] S. Arora and C. Lund, Hardness of approximation, book chapter, preprint.
[2] A. Blum and R. Rivest, Training a 3-node neural network is NP-hard, Neural Networks, Vol. 5, pp. 117-127, 1992.
[3] A. Blumer, A. Ehrenfeucht, D. Haussler and M. Warmuth, Learnability and the Vapnik-Chervonenkis dimension, Journal of the Association for Computing Machinery, Vol. 36, No. 4, pp. 929-965, 1989.
[4] M. Garey and D.
Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness, W. H. Freeman, San Francisco, 1979.
[5] D. Haussler, Generalizing the PAC model for neural net and other learning applications (Tech. Rep. UCSC-CRL-89-30), University of California, Santa Cruz, CA, 1989.
[6] L. Jones, The computational intractability of training sigmoidal neural networks, preprint.
[7] J. Judd, Neural Network Design and the Complexity of Learning, MIT Press, 1990.
[8] N. Megiddo, On the complexity of polyhedral separability (Tech. Rep. RJ 5252), IBM Almaden Research Center, San Jose, CA.
[9] V. H. Vu, On the infeasibility of training neural networks with small squared error, manuscript.
", "award": [], "sourceid": 1441, "authors": [{"given_name": "Van H. Vu", "family_name": "", "institution": null}]}