{"title": "Agnostic PAC-Learning of Functions on Analog Neural Nets", "book": "Advances in Neural Information Processing Systems", "page_first": 311, "page_last": 318, "abstract": null, "full_text": "Agnostic PAC-Learning of Functions on Analog Neural Nets

(Extended Abstract)

Wolfgang Maass
Institute for Theoretical Computer Science
Technische Universitaet Graz
Klosterwiesgasse 32/2
A-8010 Graz, Austria
e-mail: maass@igi.tu-graz.ac.at

Abstract:

There exist a number of negative results ([J], [BR], [KV]) about learning on neural nets in Valiant's model [V] for probably approximately correct learning (\"PAC-learning\"). These negative results are based on an asymptotic analysis where one lets the number of nodes in the neural net go to infinity. Hence this analysis is less adequate for the investigation of learning on a small fixed neural net with relatively few analog inputs (e.g. the principal components of some sensory data). The latter type of learning problem gives rise to a different kind of asymptotic question: Can the true error of the neural net be brought arbitrarily close to that of a neural net with \"optimal\" weights through sufficiently long training? In this paper we employ some new arguments in order to give a positive answer to this question in Haussler's rather realistic refinement of Valiant's model for PAC-learning ([H], [KSS]). In this more realistic model no a-priori assumptions are required about the \"learning target\", noise is permitted in the training data, and the inputs and outputs are not restricted to boolean values. As a special case our result implies one of the first positive results about learning on multi-layer neural nets in Valiant's original PAC-learning model. At the end of this paper we will describe an efficient parallel implementation of this new learning algorithm.
We consider multi-layer high order feedforward neural nets N with arbitrary piecewise polynomial activation functions. Each node g of fan-in m > 0 in N is called a computation node. It is labelled by some polynomial Q^g(y_1, ..., y_m) and some piecewise polynomial activation function gamma^g : R -> R. We assume that gamma^g consists of finitely many polynomial pieces and that its definition involves only rational parameters. The computation node g computes the function (y_1, ..., y_m) |-> gamma^g(Q^g(y_1, ..., y_m)) from R^m into R. The nodes of fan-in 0 in N (\"input nodes\") are labelled by variables x_1, ..., x_k. The nodes g of fan-out 0 in N (\"output nodes\") are labelled by 1, ..., l. We assume that the range B of their activation functions gamma^g is bounded. Any parameters that occur in the definitions of the gamma^g are referred to as architectural parameters of N.

The coefficients of all the polynomials Q^g are called the programmable parameters (or weights) of N. Let w be the number of programmable parameters of N. For any assignment alpha in R^w to the programmable parameters of N the network computes a function from R^k into R^l which we will denote by N^alpha.

We write Q_n for the set of rational numbers that can be written as quotients of integers with bit-length <= n. For z = (z_1, ..., z_l) in R^l we write ||z||_1 for sum_{i=1}^l |z_i|.

Let F : R^k -> R^l be some arbitrary function, which we will view as a \"prediction rule\". For any given instance (x, y) in R^k x R^l we measure the error of F by ||F(x) - y||_1. For any distribution A over some subset of R^k x R^l we measure the true error of F with regard to A by E_{(x,y) in A}[||F(x) - y||_1], i.e. the expected value of the error of F with respect to distribution A.

Theorem 1: Let N be some arbitrary high order feedforward neural net with piecewise polynomial activation functions.
Let w be the number of programmable parameters of N (we assume that w = O(1)). Then one can construct from N some first order feedforward neural net N-hat with piecewise linear activation functions and the quadratic activation function gamma(x) = x^2, which has the following property:

There exists a polynomial m(1/epsilon, 1/delta) and a learning algorithm LEARN such that for any given epsilon, delta in (0,1) and s, n in N and any distribution A over Q_n^k x (Q_n intersect B)^l the following holds:

For any sample zeta = ((x_i, y_i))_{i=1,...,m} of m >= m(1/epsilon, 1/delta) points that are independently drawn according to A, the algorithm LEARN computes in polynomially in m, s, n computation steps an assignment alpha-hat of rational numbers to the programmable parameters of N-hat such that with probability >= 1 - delta:

E_{(x,y) in A}[||N-hat^{alpha-hat}(x) - y||_1] <= inf_{alpha in Q_s^w} E_{(x,y) in A}[||N^alpha(x) - y||_1] + epsilon,

or in other words: the true error of N-hat^{alpha-hat} with regard to A is within epsilon of the least possible true error that can be achieved by any N^alpha with alpha in Q_s^w.

Remarks

a) One can easily see (see [M 93b] for details) that Theorem 1 provides a positive learning result in Haussler's extension of Valiant's model for PAC-learning ([H], [KSS]). The \"touchstone class\" (see [KSS]) is defined as the class of functions f : R^k -> R^l that are computable on N with programmable parameters from Q_s. This fact is of some general interest, since so far only very few positive results are known for any learning problem in this rather realistic (but quite demanding) learning model.

b) Consider the special case where the distribution A over Q_n^k x (Q_n intersect B)^l is of the form

A_{D,alpha_T}(x, y) = D(x) if y = N^{alpha_T}(x), and 0 otherwise,

for some arbitrary distribution D over the domain Q_n^k and some arbitrary alpha_T in Q_s^w. Then the term

inf_{alpha in Q_s^w} E_{(x,y) in A}[||N^alpha(x) - y||_1]

is equal to 0.
Hence the preceding theorem states that with learning algorithm LEARN the \"learning network\" N-hat can \"learn\" with arbitrarily small true error any target function N^{alpha_T} that is computable on N with rational \"weights\" alpha_T. Thus by choosing N sufficiently large, one can guarantee that the associated \"learning network\" N-hat can learn any target function that might arise in the context of a specific learning problem.

In addition the theorem also applies to the more realistic situation where the learner receives examples (x, y) of the form (x, N^{alpha_T}(x) + noise), or even if there exists no \"target function\" N^{alpha_T} that would \"explain\" the actual distribution A of examples (x, y) (\"agnostic learning\").

The proof of Theorem 1 is mathematically quite involved, and we can give here only an outline. It consists of three steps:

(1) Construction of the auxiliary neural net N-hat.

(2) Reducing the optimization of weights in N-hat for a given distribution A to a finite nonlinear optimization problem.

(3) Reducing the resulting finite nonlinear optimization problem to a family of finite linear optimization problems.

Details to step (1): If the activation functions gamma^g in N are piecewise linear and all computation nodes in N have fan-out <= 1 (this occurs for example if N has just one hidden layer and only one output) then one can set N-hat := N. If the gamma^g are piecewise linear but not all computation nodes in N have fan-out <= 1, one defines N-hat as the tree of the same depth as N, where subcircuits of computation nodes with fan-out m > 1 are duplicated m times. The activation functions remain unchanged in this case.

If the activation functions gamma^g are piecewise polynomial but not piecewise linear, one has to apply a rather complex construction which is described in detail in the journal version of [M 93a].
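For the piecewise linear case, the duplication of shared subcircuits can be sketched in a few lines. The graph encoding below (a map from each node to the list of its immediate predecessors) and the node names are hypothetical conveniences of this sketch; the construction itself only fixes that each subcircuit rooted at a node of fan-out m > 1 is duplicated m times, which leaves the depth unchanged:

```python
def unfold(preds, node):
    """Unfold the DAG below `node` into a tree: every computation node of
    fan-out > 1 is duplicated once per use, as in the construction of N-hat.
    `preds` maps each node to the list of its immediate predecessors
    (input nodes map to the empty list and stay as shared leaves)."""
    if not preds[node]:
        return node  # input node: keep as a leaf
    return (node, tuple(unfold(preds, p) for p in preds[node]))

def size(tree):
    """Number of computation nodes in the unfolded tree (input leaves not counted)."""
    if isinstance(tree, str):
        return 0
    _, children = tree
    return 1 + sum(size(t) for t in children)

# A 'diamond': gate h feeds both g1 and g2, which feed the output gate.
preds = {'x': [], 'h': ['x'], 'g1': ['h'], 'g2': ['h'], 'out': ['g1', 'g2']}
tree = unfold(preds, 'out')
```

For this diamond-shaped net the shared gate h (with the subcircuit below it) appears twice in the unfolded tree, so the tree has 5 computation nodes instead of the 4 in the original graph, while its depth stays 3.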
In any case N-hat has the property that all functions that are computable on N can also be computed on N-hat, the depth of N-hat is bounded by a constant, and the size of N-hat is bounded by a polynomial in the size of N (provided that the depth and order of N, as well as the number and degrees of the polynomial pieces of the gamma^g, are bounded by a constant).

Details to step (2): Since the VC-dimension of a neural net is only defined for neural nets with boolean output, one has to consider here instead the pseudo-dimension of the function class F that is defined by N-hat.

Definition (see Haussler [H]): Let X be some arbitrary domain, and let F be an arbitrary class of functions from X into R. Then the pseudo-dimension of F is defined by

dim_P(F) := max{|S| : S subset of X and there exists h : S -> R such that for all b in {0,1}^S there exists f in F with: for all x in S (f(x) >= h(x) <=> b(x) = 1)}.

Note that in the special case where F is a concept class (i.e. all f in F are 0-1 valued) the pseudo-dimension dim_P(F) coincides with the VC-dimension of F. The pseudo-dimension of the function class associated with network architectures N with piecewise polynomial activation functions can be bounded with the help of Milnor's Theorem [Mi] in the same way as the VC-dimension for the case of boolean network output (see [GJ]):

Theorem 2: Consider arbitrary network architectures N of order v with k input nodes, l output nodes, and w programmable parameters. Assume that each gate in N employs as activation function some piecewise polynomial (or piecewise rational) function of degree <= d with at most q pieces. For some arbitrary p in {1, 2, ...} we define F := {f : R^{k+l} -> R : there exists alpha in R^w such that for all x in R^k and all y in R^l, f(x, y) = ||N^alpha(x) - y||_p}. Then one has dim_P(F) = O(w^2 log q) if v, d, l = O(1).

With the help of the pseudo-dimension one can carry out the desired reduction of the optimization of weights in N-hat (with regard to an arbitrary given distribution A of examples (x, y)) to a finite optimization problem. Fix some interval [b_1, b_2] subset of R such that B is contained in [b_1, b_2], b_1 < b_2, and such that the ranges of the activation functions of the output gates of N-hat are contained in [b_1, b_2]. We define b := l * (b_2 - b_1), and F := {f : R^k x [b_1, b_2]^l -> [0, b] : there exists alpha in R^w such that for all x in R^k and all y in [b_1, b_2]^l, f(x, y) = ||N-hat^alpha(x) - y||_1}. Assume now that parameters epsilon, delta in (0,1) with epsilon <= b and s, n in N have been fixed. For convenience we assume that s is sufficiently large so that all architectural parameters in N-hat are from Q_s (we assume that all architectural parameters in N-hat are rational). We define

m(1/epsilon, 1/delta) := (257 * b^2 / epsilon^2) * (2 * dim_P(F) * ln(33*e*b/epsilon) + ln(8/delta)).

By Corollary 2 of Theorem 7 in Haussler [H] one has for m >= m(1/epsilon, 1/delta), K := sqrt(257)/8 in (2,3), and any distribution A over Q_n^k x (Q_n intersect [b_1, b_2])^l

(1)  Pr_{zeta in A^m}[there exists f in F : |(1/m) sum_{(x,y) in zeta} f(x, y) - E_{(x,y) in A}[f(x, y)]| > epsilon/K] < delta,

where E_{(x,y) in A}[f(x, y)] is the expectation of f(x, y) with regard to distribution A.

We design an algorithm LEARN that computes for any m in N, any sample zeta = ((x_i, y_i))_{i in {1,...,m}} in (Q_n^k x (Q_n intersect [b_1, b_2])^l)^m, and any given s in N, in polynomially in m, s, n computation steps, an assignment alpha-hat of rational numbers to the parameters in N-hat such that the function h-hat that is computed by N-hat^{alpha-hat} satisfies

(2)  (1/m) sum_{i=1}^m ||h-hat(x_i) - y_i||_1 <= (1 - 2/K) * epsilon + inf_{alpha in Q_s^w} (1/m) sum_{i=1}^m ||N^alpha(x_i) - y_i||_1.

This suffices for the proof of Theorem 1, since (1) and (2) together imply that, for any distribution A over Q_n^k x (Q_n intersect [b_1, b_2])^l and any m >= m(1/epsilon, 1/delta), with probability >= 1 - delta (with respect to the random drawing of zeta in A^m) the algorithm LEARN outputs for inputs zeta and s an assignment alpha-hat of rational numbers to the parameters in N-hat such that

E_{(x,y) in A}[||N-hat^{alpha-hat}(x) - y||_1] <= epsilon + inf_{alpha in Q_s^w} E_{(x,y) in A}[||N^alpha(x) - y||_1].

Details to step (3): The computation of weights alpha-hat that satisfy (2) is nontrivial, since this amounts to solving a nonlinear optimization problem. This holds even if each activation function in N-hat is piecewise linear, because weights from successive layers are multiplied with each other.

We employ a method from [M 93a] that allows us to replace the nonlinear conditions on the programmable parameters alpha of N-hat by linear conditions for a transformed set c, beta of parameters. We simulate N-hat^alpha by another network architecture N[c]^beta (which one may view as a \"normal form\" for N-hat^alpha) that uses the same graph (V, E) as N-hat, but different activation functions and different values beta for its programmable parameters. The activation functions of N[c] depend on |V| new architectural parameters c in R^{|V|}, which we call scaling parameters in the following. Whereas the architectural parameters of a network architecture are usually kept fixed, we will be forced to change the scaling parameters of N-hat along with its programmable parameters beta. Although this new network architecture has the disadvantage that it requires |V| additional parameters c, it has the advantage that we can choose in N[c] all weights on edges between computation nodes to be from {-1, 0, 1}.
Hence we can treat them as constants with at most 3 possible values in the system of inequalities that describes computations of N[c]. Thereby we can achieve that all variables that appear in the inequalities that describe computations of N[c] for fixed network inputs (the variables for the weights of gates on level 1, the variables for the biases of gates on all levels, and the new variables for the scaling parameters c) appear only linearly in those inequalities.

We briefly indicate the construction of N[c] in the case where each activation function gamma in N-hat is piecewise linear. For any c > 0 we consider the associated piecewise linear activation function gamma_c with

for all x in R: gamma_c(c * x) = c * gamma(x).

Assume that alpha-bar is some arbitrary given assignment to the programmable parameters in N-hat. We transform N-hat^{alpha-bar} through a recursive process into a \"normal form\" N[c]^beta in which all weights on edges between computation nodes are from {-1, 0, 1}, such that for all x in R^k: N-hat^{alpha-bar}(x) = N[c]^beta(x).

Assume that an output gate g_out of N-hat^{alpha-bar} receives as input sum_{i=1}^q a_i y_i + a_0, where a_1, ..., a_q, a_0 are the weights and the bias of g_out (under the assignment alpha-bar) and y_1, ..., y_q are the (real valued) outputs of the immediate predecessors g_1, ..., g_q of g_out. For each i in {1, ..., q} with a_i != 0 such that g_i is not an input node, we replace the activation function gamma_i of g_i by gamma_{|a_i|}, and we multiply the weights and the bias of gate g_i with |a_i|. Finally we replace the weight a_i of gate g_out by sgn(a_i), where sgn(a_i) := 1 if a_i > 0 and sgn(a_i) := -1 if a_i < 0. This operation has the effect that the multiplication with |a_i| is carried out before the gate g_i (rather than after g_i, as done in N-hat^{alpha-bar}), but that the considered output gate g_out still receives the same input as before. If a_i = 0 we want to \"freeze\" that weight at 0.
This can be done by deleting g_i and all gates below g_i from N-hat.

The analogous operations are recursively carried out for the predecessors g_i of g_out (note however that the weights of g_i are no longer the original ones from N-hat^{alpha-bar}, since they have been changed in the preceding step). We exploit here the assumption that each gate in N-hat has fan-out <= 1.

Let beta consist of the new weights on edges adjacent to input nodes and of the resulting biases of all gates in N-hat. Let c consist of the resulting scaling parameters at the gates of N-hat. Then we have for all x in R^k: N-hat^{alpha-bar}(x) = N[c]^beta(x). Furthermore all scaling parameters in c are positive.

At the end of this proof we will also need the fact that the previously described parameter transformation can be inverted, i.e. one can compute from c, beta an equivalent weight assignment alpha for N-hat (with the original activation functions gamma).

We now describe how the algorithm LEARN computes, for any given sample zeta = ((x_i, y_i))_{i=1,...,m} in (Q_n^k x (Q_n intersect [b_1, b_2])^l)^m and any given s in N, with the help of linear programming a new assignment c-hat, beta-hat to the parameters in N[c] such that the function h-hat that is computed by N[c-hat]^{beta-hat} satisfies (2). For that purpose we describe the computations of N-hat for the fixed inputs x_i from the sample zeta by polynomially in m many systems L_1, ..., L_{p(m)} that each consist of O(m) linear inequalities with the transformed parameters c, beta as variables. Each system L_j reflects one possibility for employing specific linear pieces of the activation functions in N-hat for the specific network inputs x_1, ..., x_m, and for employing different combinations of weights from {-1, 0, 1} for edges between computation nodes.

One can show that it suffices to consider only polynomially in m many systems of inequalities L_j by exploiting that all inequalities are linear, and that the input space for N-hat has bounded dimension k.
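For intuition, the weight-pushing recursion above can be traced on a one-hidden-layer net with the ReLU activation relu(z) = max(0, z). This special case is an assumption of this sketch, not the setting of the text: ReLU is positively homogeneous (c * relu(z) = relu(c * z) for c > 0), so here all scaled variants gamma_c coincide with gamma and the scaling parameters become invisible, whereas for a general piecewise linear gamma they must be stored explicitly as the parameters c:

```python
def relu(z):
    return z if z > 0.0 else 0.0

def net(x, w, b, a, a0):
    """One hidden layer: sum_i a[i] * relu(w[i]*x + b[i]) + a0."""
    return sum(ai * relu(wi * x + bi) for ai, wi, bi in zip(a, w, b)) + a0

def normal_form(w, b, a):
    """Push each hidden-to-output weight a_i through its hidden gate:
    a_i * relu(y) = sgn(a_i) * relu(|a_i| * y).  Afterwards every weight on an
    edge between computation nodes lies in {-1, 1}, and gates with a_i = 0
    are deleted (the weight is 'frozen' at 0)."""
    w2, b2, s = [], [], []
    for ai, wi, bi in zip(a, w, b):
        if ai == 0.0:
            continue                    # freeze the weight at 0: drop the gate
        s.append(1.0 if ai > 0.0 else -1.0)
        w2.append(abs(ai) * wi)         # level-1 weight scaled by |a_i|
        b2.append(abs(ai) * bi)         # bias scaled by |a_i|
    return w2, b2, s

w, b, a, a0 = [1.5, -2.0, 0.4], [0.3, -1.0, 2.0], [2.0, -0.5, 0.0], 0.7
w2, b2, s = normal_form(w, b, a)
# The transformed net computes exactly the same function.
for x in [-2.0, -0.3, 0.0, 0.8, 3.1]:
    assert abs(net(x, w, b, a, a0) - net(x, w2, b2, s, a0)) < 1e-9
```

With fan-out <= 1 the same push can be iterated layer by layer from the outputs down, which is exactly the recursion described above.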
We now expand each of the systems L_j (which has only O(1) variables) into a linear programming problem LP_j with O(m) variables. We add to L_j, for each of the l output nodes nu of N-hat, 2m new variables u_i^nu, v_i^nu for i = 1, ..., m, and the 4m inequalities

t_j^nu(x_i) <= (y_i)_nu + u_i^nu - v_i^nu,   t_j^nu(x_i) >= (y_i)_nu + u_i^nu - v_i^nu,   u_i^nu >= 0,   v_i^nu >= 0,

where ((x_i, y_i))_{i=1,...,m} is the fixed sample zeta and (y_i)_nu is that coordinate of y_i which corresponds to the output node nu of N-hat. In these inequalities the symbol t_j^nu(x_i) denotes the term (which is by construction linear in the variables c, beta) that represents the output of gate nu for network input x_i in this system L_j. One should note that these terms t_j^nu(x_i) will in general be different for different j, since different linear pieces of the activation functions at preceding gates may be used in the computation of N-hat for the same network input x_i. We expand the system L_j of linear inequalities to a linear programming problem LP_j in canonical form by adding the optimization requirement

minimize sum_{i=1}^m sum_{nu output node} (u_i^nu + v_i^nu).

The algorithm LEARN employs an efficient algorithm for linear programming (e.g. the ellipsoid algorithm, see [PS]) in order to compute in altogether polynomially in m, s and n many steps an optimal solution for each of the linear programming problems LP_1, ..., LP_{p(m)}. We write h_j for the function from R^k into R^l that is computed by N[c]^beta for the optimal solution c, beta of LP_j. The algorithm LEARN computes (1/m) sum_{i=1}^m ||h_j(x_i) - y_i||_1 for j = 1, ..., p(m). Let j-hat be that index for which this expression has a minimal value, and let c-hat, beta-hat be the associated optimal solution of LP_{j-hat} (i.e. N[c-hat]^{beta-hat} computes h_{j-hat}).
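The pair of inequalities and the objective above are the standard linear-programming encoding of empirical L1-error minimization: at an optimum, u_i^nu + v_i^nu = |t_j^nu(x_i) - (y_i)_nu|. A minimal self-contained sketch, with two simplifications that are assumptions of this sketch and not of the text: the predictor is a single affine term t(x) = a*x + b (standing in for the terms t_j, which are linear in c, beta), and scipy's linprog stands in for the ellipsoid algorithm:

```python
import numpy as np
from scipy.optimize import linprog

def l1_fit(xs, ys):
    """Fit t(x) = a*x + b minimizing sum_i |t(x_i) - y_i| by linear programming.

    Variables: [a, b, u_1..u_m, v_1..v_m].  As in the text, the two
    inequalities collapse to the equality t(x_i) = y_i + u_i - v_i with
    u_i, v_i >= 0, so minimizing sum_i (u_i + v_i) minimizes the L1 error.
    """
    m = len(xs)
    c = np.concatenate([[0.0, 0.0], np.ones(2 * m)])   # minimize sum(u) + sum(v)
    # Equality constraints: a*x_i + b - u_i + v_i = y_i
    A_eq = np.zeros((m, 2 + 2 * m))
    A_eq[:, 0] = xs
    A_eq[:, 1] = 1.0
    A_eq[np.arange(m), 2 + np.arange(m)] = -1.0        # -u_i
    A_eq[np.arange(m), 2 + m + np.arange(m)] = 1.0     # +v_i
    bounds = [(None, None), (None, None)] + [(0, None)] * (2 * m)
    res = linprog(c, A_eq=A_eq, b_eq=ys, bounds=bounds)
    return res.x[0], res.x[1], res.fun   # res.fun is the minimal L1 error

# Points lying exactly on y = 2x + 1: the optimal empirical L1 error is 0.
a, b, err = l1_fit(np.array([0.0, 1.0, 2.0, 3.0]),
                   np.array([1.0, 3.0, 5.0, 7.0]))
```

Since the sample points lie exactly on a line, the LP attains error 0 and recovers a = 2, b = 1; on noisy data the same program returns a best L1 fit.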
LEARN employs the previously mentioned backwards transformation from c-hat, beta-hat into values alpha-hat for the programmable parameters of N-hat such that for all x in R^k: N-hat^{alpha-hat}(x) = N[c-hat]^{beta-hat}(x). These values alpha-hat are given as output of the algorithm LEARN.

We refer to [M 93b] for the verification that this weight assignment alpha-hat has the property that is claimed in Theorem 1. We also refer to [M 93b] for the proof in the more general case where the activation functions of N are piecewise polynomial.

Remark: The algorithm LEARN can be speeded up substantially on a parallel machine. Furthermore, if the individual processors of the parallel machine are allowed to use random bits, hardly any global control is required for this parallel computation. We use polynomially in m many processors. Each processor picks at random one of the systems L_j of linear inequalities and solves the corresponding linear programming problem LP_j. Then the parallel machine compares in a \"competitive phase\" the costs sum_{i=1}^m ||h_j(x_i) - y_i||_1 of the solutions h_j that have been computed by the individual processors. It outputs the weights alpha-hat for N-hat that correspond to the best of these solutions h_j. If one views the number w of weights in N-hat no longer as a constant, one sees that the number of processors that are needed is simply exponential in w, but that the parallel computation time is polynomial in m and w.

Acknowledgements

I would like to thank Peter Auer, Phil Long and Hal White for their helpful comments.

References

[BR] A. Blum, R. L. Rivest, \"Training a 3-node neural network is NP-complete\", Proc. of the 1988 Workshop on Computational Learning Theory, Morgan Kaufmann (San Mateo, 1988), 9-18

[GJ] P. Goldberg, M. Jerrum, \"Bounding the Vapnik-Chervonenkis dimension of concept classes parameterized by real numbers\", Proc. of the 6th Annual ACM Conference on Computational Learning Theory, 361-369

[H] D. Haussler, \"Decision theoretic generalizations of the PAC model for neural nets and other learning applications\", Information and Computation, vol. 100, 1992, 78-150

[J] J. S. Judd, \"Neural Network Design and the Complexity of Learning\", MIT Press (Cambridge, 1990)

[KV] M. Kearns, L. Valiant, \"Cryptographic limitations on learning boolean formulae and finite automata\", Proc. of the 21st ACM Symposium on Theory of Computing, 1989, 433-444

[KSS] M. J. Kearns, R. E. Schapire, L. M. Sellie, \"Toward efficient agnostic learning\", Proc. of the 5th ACM Workshop on Computational Learning Theory, 1992, 341-352

[M 93a] W. Maass, \"Bounds for the computational power and learning complexity of analog neural nets\" (extended abstract), Proc. of the 25th ACM Symposium on Theory of Computing, 1993, 335-344; journal version submitted for publication

[M 93b] W. Maass, \"Agnostic PAC-learning of functions on analog neural nets\" (journal version), to appear in Neural Computation

[Mi] J. Milnor, \"On the Betti numbers of real varieties\", Proc. of the American Math. Soc., vol. 15, 1964, 275-280

[PS] C. H. Papadimitriou, K. Steiglitz, \"Combinatorial Optimization: Algorithms and Complexity\", Prentice Hall (Englewood Cliffs, 1982)

[V] L. G. Valiant, \"A theory of the learnable\", Comm. of the ACM, vol. 27, 1984, 1134-1142", "award": [], "sourceid": 837, "authors": [{"given_name": "Wolfgang", "family_name": "Maass", "institution": null}]}