0 and $q \ge 0$. A sequence

$$c = (c_1, \ldots, c_{n+q}) \in \mathbb{R}^{n+q}$$

is said to be n-recursive if there exist real numbers $r_1, \ldots, r_n$ so that

$$c_{n+j} = \sum_{i=1}^{n} c_{n+j-i}\, r_i, \qquad j = 1, \ldots, q.$$

(In particular, every sequence of length n is n-recursive, but the interesting cases are those in which $q \ne 0$, and in fact $q \gg n$.) Given such an n-recursive sequence $c$, we may consider its associated perceptron classifier. This is the map

$$\phi_c : \mathbb{R}^{n+q} \to \{-1, 1\} : (x_1, \ldots, x_{n+q}) \mapsto \mathrm{sign}\left(\sum_{i=1}^{n+q} c_i x_i\right)$$

where the sign function is understood to be defined by $\mathrm{sign}(z) = -1$ if $z \le 0$ and $\mathrm{sign}(z) = 1$ otherwise. (Changing the definition at zero to be +1 would not change the results to be presented in any way.) We now introduce, for each two fixed n, q as above, a class of functions:

$$\mathcal{F}_{n,q} := \{\phi_c \mid c \in \mathbb{R}^{n+q} \text{ is n-recursive}\}.$$

This is understood as a function class with respect to the input space $X = \mathbb{R}^{n+q}$, and we are interested in estimating $\mathrm{vc}(\mathcal{F}_{n,q})$.
Our main result will be as follows (all logs in base 2):

Theorem 1
$$\max\left\{n,\; n\left\lfloor \log\left(\left\lfloor 1 + \frac{q-1}{n} \right\rfloor\right) \right\rfloor\right\} \;\le\; \mathrm{vc}(\mathcal{F}_{n,q}) \;\le\; \min\{n + q,\; 18n + 4n\log(q+1)\}$$

Note that, in particular, when $q > \max\{2 + n^2, 32\}$, one has the tight estimates

$$\frac{n}{2}\log q \;\le\; \mathrm{vc}(\mathcal{F}_{n,q}) \;\le\; 8n\log q.$$

The organization of the rest of the paper is as follows. In Section 3 we state an abstract result on VC-dimension, which is then used in Section 4 to prove Theorem 1. Finally, Section 6 deals with bounds on the sample complexity needed for identification of linear dynamical systems, that is to say, the real-valued functions obtained when not taking "signs" when defining the maps $\phi_c$.

3 An Abstract Result on VC Dimension

Assume that we are given two sets $X$ and $\Lambda$, to be called in this context the set of inputs and the set of parameter values respectively. Suppose that we are also given a function

$$F : \Lambda \times X \to \{-1, 1\}.$$
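
As a concrete instance of such a function, the recurrent perceptron classifier defined earlier can be written in this parameter/input form. The sketch below is illustrative only: it assumes that the parameter packs the first n entries of the sequence together with the recursion coefficients, and the names `extend_recursive` and `F` are hypothetical rather than taken from the paper.

```python
from typing import List, Sequence, Tuple


def sign(z: float) -> int:
    """Sign convention of the text: -1 if z <= 0, +1 otherwise."""
    return -1 if z <= 0 else 1


def extend_recursive(c_init: Sequence[float], r: Sequence[float], q: int) -> List[float]:
    """Extend (c_1, ..., c_n) to an n-recursive sequence of length n + q via
    c_{n+j} = sum_{i=1}^{n} c_{n+j-i} * r_i for j = 1, ..., q."""
    n = len(c_init)
    if len(r) != n or q < 0:
        raise ValueError("need len(r) == len(c_init) and q >= 0")
    c = list(c_init)
    for _ in range(q):
        # For the entry being appended, c[-i] is c_{n+j-i} and r[i-1] is r_i.
        c.append(sum(c[-i] * r[i - 1] for i in range(1, n + 1)))
    return c


def F(lam: Tuple[Sequence[float], Sequence[float]], x: Sequence[float]) -> int:
    """Assumed parametrization: lam = (c_init, r), x in R^{n+q}.
    Returns sign(sum_i c_i x_i) for the n-recursive sequence generated by lam."""
    c_init, r = lam
    q = len(x) - len(c_init)
    c = extend_recursive(c_init, r, q)
    return sign(sum(ci * xi for ci, xi in zip(c, x)))
```

For example, with n = 2, initial entries (1, 1) and coefficients (1, 1), the recursion generates the Fibonacci numbers, and the classifier evaluates the sign of a weighted sum against that sequence.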

Associated to this data is the class of functions

$$\mathcal{F} := \{F(\lambda, \cdot) : X \to \{-1, 1\} \mid \lambda \in \Lambda\}$$

obtained by considering F as a function of the inputs alone, one such function for each possible parameter value $\lambda$. Note that, given the same data one could, dually, study the class

$$\mathcal{F}^* := \{F(\cdot, \xi) : \Lambda \to \{-1, 1\} \mid \xi \in X\}$$

which obtains by fixing the elements of $X$ and thinking of the parameters as inputs. It is well-known (and in any case, a consequence of the more general result to be presented below) that $\mathrm{vc}(\mathcal{F}) \ge \lfloor \log(\mathrm{vc}(\mathcal{F}^*)) \rfloor$, which provides a lower bound on $\mathrm{vc}(\mathcal{F})$ in terms of the "dual VC dimension." A sharper estimate is possible when $\Lambda$ can be written as a product of n sets

$$\Lambda = \Lambda_1 \times \Lambda_2 \times \cdots \times \Lambda_n \qquad (1)$$

and that is the topic which we develop next.
We assume from now on that a decomposition of the form in Equation (1) is given, and will define a variation of the dual VC dimension by asking that only certain dichotomies on $\Lambda$ be obtained from $\mathcal{F}^*$. We define these dichotomies only on "rectangular" subsets of $\Lambda$, that is, sets of the form

$$L = L_1 \times \cdots \times L_n \subseteq \Lambda$$

with each $L_i \subseteq \Lambda_i$ a nonempty subset. Given any index $1 \le \kappa \le n$, by a $\kappa$-axis dichotomy on such a subset L we mean any function $\delta : L \to \{-1, 1\}$ which depends only on the $\kappa$th coordinate, that is, there is some function $\phi : L_\kappa \to \{-1, 1\}$ so that $\delta(\lambda_1, \ldots, \lambda_n) = \phi(\lambda_\kappa)$ for all $(\lambda_1, \ldots, \lambda_n) \in L$; an axis dichotomy is a map that is a $\kappa$-axis dichotomy for some $\kappa$. A rectangular set L will be said to be axis-shattered if every axis dichotomy is the restriction to L of some function of the form $F(\cdot, \xi) : \Lambda \to \{-1, 1\}$, for some $\xi \in X$.

Theorem 2 If $L = L_1 \times \cdots \times L_n \subseteq \Lambda$ can be axis-shattered and each set $L_i$ has cardinality $r_i$, then $\mathrm{vc}(\mathcal{F}) \ge \lfloor \log(r_1) \rfloor + \cdots + \lfloor \log(r_n) \rfloor$.
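
The definitions of axis dichotomy and axis-shattering, together with the bound of Theorem 2, can be checked by brute force on small finite examples. The following sketch is only an illustration under stated assumptions: the factor sets are finite with distinct hashable elements, the inputs are searched over a finite candidate list, and the function names are hypothetical.

```python
from itertools import product
from typing import Callable, Hashable, Iterable, Sequence


def theorem2_lower_bound(cardinalities: Iterable[int]) -> int:
    """Sum of floor(log2(r_i)) over the cardinalities r_i of the sets L_i."""
    total = 0
    for r in cardinalities:
        if r < 1:
            raise ValueError("each L_i must be nonempty")
        total += r.bit_length() - 1  # exact floor(log2(r)) for positive ints
    return total


def is_axis_shattered(
    F: Callable[[tuple, Hashable], int],
    factors: Sequence[Sequence[Hashable]],
    candidates: Iterable[Hashable],
) -> bool:
    """Check whether every axis dichotomy on L = L_1 x ... x L_n is the
    restriction of F(., xi) for some xi in the finite list `candidates`."""
    points = list(product(*factors))
    # Label patterns on L that are realised by at least one candidate input.
    realised = {tuple(F(lam, xi) for lam in points) for xi in candidates}
    for kappa, L_k in enumerate(factors):
        # Every phi: L_kappa -> {-1, 1} induces one kappa-axis dichotomy.
        for labels in product((-1, 1), repeat=len(L_k)):
            phi = dict(zip(L_k, labels))
            pattern = tuple(phi[lam[kappa]] for lam in points)
            if pattern not in realised:
                return False
    return True
```

A toy usage: take two factor sets {0, 1}, and let each input be a pair (coordinate index, lookup table) with F reading the table at the chosen coordinate. Supplying all such inputs axis-shatters the product set, and the bound evaluates to 1 + 1 = 2.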

(In the special case n=1 one recovers the classical result $\mathrm{vc}(\mathcal{F}) \ge \lfloor \log(\mathrm{vc}(\mathcal{F}^*)) \rfloor$.)
The proof of Theorem 2 is omitted due to space limitations.

4 Proof of Main Result

We recall the following result; it was proved, using Milnor-Warren bounds on the number of connected components of semi-algebraic sets, by Goldberg and Jerrum:

Fact 4.1 (Goldberg and Jerrum, 1995) Assume given a function $F : \Lambda \times X \to \{-1, 1\}$ and the associated class of functions $\mathcal{F} := \{F(\lambda, \cdot) : X \to \{-1, 1\} \mid \lambda \in \Lambda\}$. Suppose that $\Lambda = \mathbb{R}^k$ and $X = \mathbb{R}^n$, and that the function F can be defined in terms of a Boolean formula involving at most s polynomial inequalities in k + n variables, each polynomial being of degree at most d. Then, $\mathrm{vc}(\mathcal{F}) \le 2k \log(8eds)$. $\Box$

Using the above Fact and bounds for the standard "perceptron" model, it is not difficult to prove the following Lemma.

Lemma 4.2 $\mathrm{vc}(\mathcal{F}_{n,q}) \le \min\{n + q,\; 18n + 4n\log(q+1)\}$

Next, we consider the lower bound of Theorem 1.

Lemma 4.3 $\mathrm{vc}(\mathcal{F}_{n,q}) \ge \max\left\{n,\; n\left\lfloor \log\left(\left\lfloor 1 + \frac{q-1}{n} \right\rfloor\right) \right\rfloor\right\}$

Proof As $\mathcal{F}_{n,q}$ contains the class of functions

0, the consistency problem for $\mathcal{F}_{n,q}$ can be solved in time polynomial in q and s in the unit cost model, and time polynomial in q, s, and L in the logarithmic cost model.
Since $\mathrm{vc}(\mathcal{F}_{n,q}) = O(n + n\log(q+1))$, it follows from here that the class $\mathcal{F}_{n,q}$ is learnable in time polynomial in q (and L in the log model). Due to space limitations, we must omit the proof; it is based on the application of recent results regarding computational complexity aspects of the first-order theory of real-closed fields.

6 Pseudo-Dimension Bounds

In this section, we obtain results on the learnability of linear systems dynamics, that is, the class of functions obtained if one does not take the sign when defining recurrent perceptrons. The connection between VC dimension and sample complexity is only meaningful for classes of Boolean functions; in order to obtain learnability results applicable to real-valued functions one needs metric entropy estimates for certain spaces of functions. These can be in turn bounded through the estimation of Pollard's pseudo-dimension. We next briefly sketch the general framework for learning due to Haussler (based on previous work by Vapnik, Chervonenkis, and Pollard) and then compute a pseudo-dimension estimate for the class of interest.

The basic ingredients are two complete separable metric spaces $X$ and $Y$ (called respectively the sets of inputs and outputs), a class $\mathcal{F}$ of functions $f : X \to Y$ (called the decision rule or hypothesis space), and a function $\ell : Y \times Y \to [0, r] \subset \mathbb{R}$ (called the loss or cost function). The function $\ell$ is such that the class of functions $(x, y) \mapsto \ell(f(x), y)$ is "permissible" in the sense of Haussler and Pollard.
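
To make these ingredients concrete, the sketch below pairs a real-valued hypothesis (a weighted sum with no sign taken) with one possible bounded loss. The clipped absolute error used here is purely an illustrative assumption, as are the function names; the text does not prescribe a particular loss.

```python
from typing import Sequence


def linear_output(c: Sequence[float], x: Sequence[float]) -> float:
    """Real-valued hypothesis f(x) = sum_i c_i x_i (no sign applied)."""
    if len(c) != len(x):
        raise ValueError("c and x must have the same length")
    return sum(ci * xi for ci, xi in zip(c, x))


def clipped_abs_loss(prediction: float, target: float, r: float = 1.0) -> float:
    """Hypothetical loss with values in [0, r]: absolute error clipped at r."""
    if r <= 0:
        raise ValueError("r must be positive")
    return min(abs(prediction - target), r)
```

Composing the two, `clipped_abs_loss(linear_output(c, x), y)` gives one instance of the map from an input-output pair (x, y) to the loss incurred by a hypothesis on that pair.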
Now, one may introduce, for each $f \in \mathcal{F}$, the function

$$A_{f,\ell} : X \times Y \times \mathbb{R} \to \{-1, 1\} : (x, y, t) \mapsto \mathrm{sign}(\ell(f(x), y) - t)$$

as well as the class $A_{\mathcal{F},\ell}$ consisting of all such $A_{f,\ell}$. The pseudo-dimension of $\mathcal{F}$ with respect to the loss function $\ell$, denoted by $\mathrm{PD}[\mathcal{F}, \ell]$, is defined as:

$$\mathrm{PD}[\mathcal{F}, \ell] := \mathrm{vc}(A_{\mathcal{F},\ell}).$$

Due to space limitations, the relationship between the pseudo-dimension and the sample complexity of the class $\mathcal{F}$ will not be discussed here; the reader is referred to the references (Haussler, 1992; Maass, 1994) for details.

For our application we define, for any two nonnegative integers n, q, the class

:F~,q := {\u00a2