{"title": "Some Theoretical Results Concerning the Convergence of Compositions of Regularized Linear Functions", "book": "Advances in Neural Information Processing Systems", "page_first": 370, "page_last": 378, "abstract": null, "full_text": "Some Theoretical Results Concerning the \n\nConvergence of Compositions of Regularized \n\nLinear Functions \n\nTong Zhang \n\nMathematical Sciences Department \nIBM T.1. Watson Research Center \n\nYorktown Heights, NY 10598 \n\ntzhang@watson.ibm.com \n\nAbstract \n\nRecently, sample complexity bounds have been derived for problems in(cid:173)\nvolving linear functions such as neural networks and support vector ma(cid:173)\nchines. In this paper, we extend some theoretical results in this area by \nderiving dimensional independent covering number bounds for regular(cid:173)\nized linear functions under certain regularization conditions. We show \nthat such bounds lead to a class of new methods for training linear clas(cid:173)\nsifiers with similar theoretical advantages of the support vector machine. \nFurthermore, we also present a theoretical analysis for these new meth(cid:173)\nods from the asymptotic statistical point of view. This technique provides \nbetter description for large sample behaviors of these algorithms. \n\n1 Introduction \n\nIn this paper, we are interested in the generalization performance of linear classifiers ob(cid:173)\ntained from certain algorithms. From computational learning theory point of view, such \nperformance measurements, or sample complexity bounds, can be described by a quanti(cid:173)\nty called covering number [11, 15, 17], which measures the size of a parametric function \nfamily. For two-class classification problem, the covering number can be bounded by a \ncombinatorial quantity called VC-dimension [12, 17]. Following this work, researchers \nhave found other combinatorial quantities (dimensions) useful for bounding the covering \nnumbers. 
Consequently, the concept of VC-dimension has been generalized to deal with more general problems, for example in [15, 11]. \n\nRecently, Vapnik introduced the support vector machine [16], which has been successfully applied to many real problems. This method achieves good generalization by restricting the 2-norm of the weights of a separating hyperplane. A similar technique has been investigated by Bartlett [3], where the author studied the performance of neural networks when the 1-norm of the weights is bounded. The same idea has also been applied in [13] to explain the effectiveness of the boosting algorithm. In this paper, we extend their results and emphasize the importance of dimension independence. Specifically, we consider the following form of regularization method (with an emphasis on classification problems), which has been widely studied for regression problems both in statistics and in numerical mathematics: \n\ninf_w E_{x,y} L(w, x, y) = inf_w E_{x,y} f(w^T x y) + λ g(w),   (1) \n\nwhere E_{x,y} is the expectation over a distribution of (x, y), and y ∈ {-1, 1} is the binary label of the data vector x. To apply this formulation for the purpose of training linear classifiers, we can choose f as a decreasing function such that f(·) ≥ 0, and choose g(w) ≥ 0 as a function that penalizes large w (lim_{‖w‖→∞} g(w) → ∞). λ is an appropriately chosen positive parameter that balances the two terms. \n\nThe paper is organized as follows. In Section 2, we briefly review the concept of covering numbers as well as the main results related to analyzing the performance of learning algorithms. In Section 3, we introduce the regularization idea. Our main goal is to construct regularization conditions so that dimension-independent bounds on covering numbers can be obtained. 
Section 4 extends results from the previous section to nonlinear compositions of linear functions. In Section 5, we give an asymptotic formula for the generalization performance of a learning algorithm, which is then used to analyze an instance of the SVM. Due to space limitations, we only present the main results and discuss their implications. The detailed derivations can be found in [18]. \n\n2 Covering numbers \n\nWe formulate the learning problem as finding a parameter from random observations to minimize risk: given a loss function L(α, x) and n observations X_1^n = {x_1, ..., x_n} independently drawn from a fixed but unknown distribution D, we want to find α that minimizes the expected loss over x (the risk): \n\nR(α) = E_x L(α, x) = ∫ L(α, x) dP(x).   (2) \n\nThe most natural method for solving (2) using a limited number of observations is the empirical risk minimization (ERM) method (cf. [15, 16]). We simply choose a parameter α that minimizes the observed risk: \n\nR(α, X_1^n) = (1/n) Σ_{i=1}^n L(α, x_i).   (3) \n\nWe denote the parameter obtained in this way by α_erm(X_1^n). The convergence behavior of this method can be analyzed from the VC theoretical point of view, which relies on the uniform convergence of the empirical risk (the uniform law of large numbers): sup_α |R(α, X_1^n) - R(α)|. Such a bound can be obtained from quantities that measure the size of a Glivenko-Cantelli class. For a finite number of indices, the family size can be measured simply by its cardinality. For general function families, a well known quantity measuring the degree of uniform convergence is the covering number, which can be dated back to Kolmogorov [8, 9]. The idea is to discretize the parameter space (in a way that can depend on the data X_1^n) into N values α_1, ..., α_N so that each L(α, ·) can be approximated by L(α_i, ·) for some i. We shall only describe a simplified version relevant for our purposes. 
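As a concrete illustration of the training scheme obtained by combining the regularized objective (1) with empirical risk minimization (3), the following is a minimal sketch. It assumes, purely for illustration, the smooth decreasing loss f(z) = log(1 + exp(-z)) and the regularizer g(w) = ‖w‖²; the paper itself only requires f decreasing and nonnegative with g(w) → ∞ as ‖w‖ → ∞.

```python
import numpy as np

def train_regularized_linear(X, y, lam=0.1, lr=0.5, steps=500):
    """Gradient descent on (1/n) sum_i f(y_i w.x_i) + lam * ||w||^2,
    the empirical analogue of (1), with f(z) = log(1 + exp(-z))."""
    n, d = X.shape
    w = np.zeros(d)
    for _ in range(steps):
        m = (X @ w) * y                      # margins y_i w^T x_i
        df = -1.0 / (1.0 + np.exp(m))        # f'(m_i)
        grad = (X * (df * y)[:, None]).mean(axis=0) + 2.0 * lam * w
        w -= lr * grad
    return w

# toy data separable through the origin by the direction (1, 1)
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 2))
y = np.where(X @ np.array([1.0, 1.0]) > 0, 1.0, -1.0)
w = train_regularized_linear(X, y)
err = np.mean(np.sign(X @ w) != y)       # training error of the linear classifier
```

The regularization parameter `lam` plays the role of λ in (1): it keeps ‖w‖ bounded, which is exactly the condition under which the covering number bounds of Section 3 apply.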
\n\nDefinition 2.1 Let B be a metric space with metric ρ. Given p ≥ 1, observations X_1^n = [x_1, ..., x_n], and vectors f(α, X_1^n) = [f(α, x_1), ..., f(α, x_n)] ∈ B^n parameterized by α, the covering number in p-norm, denoted by N_p(f, ε, X_1^n), is the minimum size m of a collection of vectors v_1, ..., v_m ∈ B^n such that ∀α ∃v_i: ‖ρ(f(α, X_1^n), v_i)‖_p ≤ n^{1/p} ε. We also denote N_p(f, ε, n) = max_{X_1^n} N_p(f, ε, X_1^n). \n\nNote that from the definition and Jensen's inequality, we have N_p ≤ N_q for p ≤ q. We will always assume the metric on R to be |x_1 - x_2| if not explicitly specified otherwise. The following theorem is due to Pollard [11]: \n\nTheorem 2.1 ([11]) ∀n, ε > 0 and distribution D, \n\nP(sup_α |R(α, X_1^n) - R(α)| > ε) ≤ 8 E[N_1(L, ε/8, X_1^n)] exp(-nε² / (128 M²)), \n\nwhere M = sup_{α,x} L(α, x) - inf_{α,x} L(α, x), and X_1^n = {x_1, ..., x_n} are independently drawn from D. \n\nThe constants in the above theorem can be improved for certain problems; see [4, 6, 15, 16] for related results. However, they yield very similar bounds. The result most relevant for this paper is a lemma in [3], where the 1-norm covering number is replaced by the ∞-norm covering number. The latter can be bounded by a scale-sensitive combinatorial dimension [1], which in turn can be bounded from the 1-norm covering number if this covering number does not depend on n. These results can replace Theorem 2.1 to yield better estimates under certain circumstances. 
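Definition 2.1 can be made concrete numerically: for a finite grid of parameters, a greedy procedure gives an upper bound on N_p(f, ε, X_1^n). The sketch below is an illustration under assumed inputs (a small grid of weight vectors and the linear loss L(w, x) = x^T w); greedy selection only upper-bounds the minimal cover size.

```python
import numpy as np

def greedy_cover_size(F, p, eps):
    """Greedy upper bound on the p-norm covering number of the rows of F.
    Row F[k] = [f(alpha_k, x_1), ..., f(alpha_k, x_n)]; a center v covers a
    row u when ||u - v||_p <= n^(1/p) * eps, as in Definition 2.1."""
    n = F.shape[1]
    radius = n ** (1.0 / p) * eps
    uncovered = list(range(F.shape[0]))
    centers = 0
    while uncovered:
        c = uncovered[0]                 # pick any uncovered row as a center
        dists = np.linalg.norm(F[uncovered] - F[c], ord=p, axis=1)
        uncovered = [k for k, dd in zip(uncovered, dists) if dd > radius]
        centers += 1
    return centers

# linear loss L(w, x) = x^T w over a 5x5x5 grid of weight vectors
rng = np.random.default_rng(1)
X = rng.uniform(-1, 1, size=(50, 3))                          # n = 50 observations
W = np.stack(np.meshgrid(*[np.linspace(-1, 1, 5)] * 3)).reshape(3, -1).T
F = W @ X.T                                                   # rows: losses per parameter
N2 = greedy_cover_size(F, p=2, eps=0.5)                       # upper bound on N_2(L, 0.5, X)
```

Note the n^{1/p} ε scaling from Definition 2.1: it makes the 2-norm cover radius grow with the sample size, so that per-coordinate accuracy ε is what is actually required.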
\n\nSince Bartlett's lemma in [3] is only for binary loss functions, we give a generalization so that it is comparable to Theorem 2.1: \n\nTheorem 2.2 Let f_1 and f_2 be two functions R → [0, 1] such that |y_1 - y_2| ≤ γ implies f_1(y_1) ≤ f_h(y_2) ≤ f_2(y_1), where f_h : R → [0, 1] is a reference separating function. Then \n\nP[sup_α (E_x f_1(L(α, x)) - Ê_{X_1^n} f_2(L(α, x))) > ε] ≤ 4 E[N_∞(L, γ, X_1^n)] exp(-nε²/32). \n\nNote that in the extreme case that some choice of α achieves perfect generalization, E_x f_2(L(α, x)) = 0, and assuming that our choices of α(X_1^n) always satisfy the condition Ê_{X_1^n} f_2(L(α, x)) = 0, better bounds can be obtained by using a refined version of the Chernoff bound. \n\n3 Covering number bounds for linear systems \n\nIn this section, we present a few new bounds on covering numbers for the following form of real valued loss functions: \n\nL(w, x) = x^T w = Σ_{i=1}^d x_i w_i.   (4) \n\nAs we shall see later, these bounds are relevant to the convergence properties of (1). Note that since N_1 ≤ N_2, in order to apply Theorem 2.1 it is sufficient to estimate N_2(L, ε, n) for ε > 0. It is clear that N_2(L, ε, n) is not finite if no restrictions on x and w are imposed. Therefore in the following, we will assume that each ‖x_i‖_p is bounded, and study conditions on ‖w‖_q so that log N(f, ε, n) is independent of, or only weakly dependent on, d. \n\nOur first result generalizes a theorem of Bartlett [3]. The original result is for p = ∞ and q = 1, and the related technique has also appeared in [10, 13]. The proof uses a lemma attributed to Maurey (cf. [2, 7]). \n\nTheorem 3.1 If ‖x_i‖_p ≤ b and ‖w‖_q ≤ a, where 1/p + 1/q = 1 and 2 ≤ p ≤ ∞, then \n\nlog_2 N_2(L, ε, n) ≤ ⌈a²b²/ε²⌉ log_2(2d + 1). \n\nThe above bound on the covering number depends logarithmically on d, which is already quite weak (as compared to the linear dependence on d in the standard situation). 
However, the bound in Theorem 3.1 is not tight for p < ∞. For example, the following theorem improves the above bound for p = 2. Our technique of proof relies on the SVD decomposition [5] for matrices, and improves a similar result in [14] by a logarithmic factor. \n\nThe next theorem shows that if 1/p + 1/q > 1, then the 2-norm covering number is also independent of the dimension. \n\nTheorem 3.3 Let L(w, x) = x^T w. If ‖x_i‖_p ≤ b and ‖w‖_q ≤ a, where 1 ≤ q ≤ 2 and δ = 1/p + 1/q - 1 > 0, then log_2 N_2(L, ε, n) can be bounded independently of d. \n\nOne consequence of this theorem is a potentially refined explanation of the boosting algorithm. In [13], the boosting algorithm has been analyzed by using a technique related to results in [3], which essentially rely on Theorem 3.1 with p = ∞. Unfortunately, the bound contains a logarithmic dependence on d (in the most general case), which does not seem to fully explain the fact that in many cases the performance of the boosting algorithm keeps improving as d increases. However, this seemingly mysterious behavior might be better understood from Theorem 3.3 under the assumption that the data is more restricted than simply being ∞-norm bounded. For example, when the contribution of the wrong predictions is bounded by a constant (or grows very slowly as d increases), we can regard its p-th norm as bounded for some p < ∞. In this case, Theorem 3.3 implies dimension-independent generalization. \n\nIf we want to apply Theorem 2.2, then it is necessary to obtain bounds for infinity-norm covering numbers. The following theorem gives such bounds by using a result from online learning. \n\nTheorem 3.4 If ‖x_i‖_p ≤ b and ‖w‖_q ≤ a, where 2 ≤ p < ∞ and 1/p + 1/q = 1, then ∀ε > 0, log_2 N_∞(L, ε, n) can be bounded independently of d. \n\nIn the case of p = ∞, an entropy condition can be used to obtain dimension-independent covering number bounds. \n\nDefinition 3.1 Let μ = [μ_i] be a vector with positive entries such that ‖μ‖_1 = 1 (in this case, we call μ a distribution vector). Let x = [x_i] ≠ 0 be a vector of the same length; then we define the weighted relative entropy of x with respect to μ as: \n\nentro_μ(x) = Σ_i |x_i| ln(|x_i| / (μ_i ‖x‖_1)). \n\nTheorem 3.5 Given a distribution vector μ, if ‖x_i‖_∞ ≤ b and ‖w‖_1 ≤ a and entro_μ(w) ≤ c, where we assume that w has non-negative entries, then ∀ε > 0, \n\nlog_2 N_∞(L, ε, n) ≤ (36 b² (a² + ac) / ε²) log_2[2⌈4ab/ε + 2⌉n + 1]. \n\nTheorems in this section can be combined with Theorem 4.1 to form more complex covering number bounds for nonlinear compositions of linear functions. \n\n4 Nonlinear extensions \n\nConsider the following system: \n\nL([α, w], x) = f(g(α, x) + w^T h(α, x)),   (5) \n\nwhere x is the observation and [α, w] is the parameter. We assume that f is a nonlinear function with bounded total variation. \n\nDefinition 4.1 A function f : R → R is said to satisfy the Lipschitz condition with parameter γ if ∀x, y: |f(x) - f(y)| ≤ γ|x - y|. \n\nDefinition 4.2 The total variation of a function f : R → R is defined as \n\nTV(f) = sup_{x_0 ≤ x_1 ≤ ... ≤ x_k} Σ_i |f(x_i) - f(x_{i-1})|. \n\nTheorem 4.1 Assume that f has total variation TV(f) and satisfies the Lipschitz condition with parameter γ, and let L be as in (5). Then ∀ε_1, ε_2 > 0 and n > 2(d + 1): \n\nlog_2 N_∞(L, ε_1 + ε_2, n) ≤ (d + 1) log_2[d e n max(⌊TV(f)/(2ε_1)⌋ + 1, 1)] + log_2 N_∞([g, h], ε_2/γ, n), \n\nwhere the metric of [g, h] is defined as |g_1 - g_2| + c‖h_1 - h_2‖_p (1/p + 1/q = 1). \n\nExample 4.1 Consider classification by a hyperplane: L(w, x) = I(w^T x < 0), where I is the set indicator function. Let L'(w, x) = l_0(w^T x) be another loss function, where \n\nl_0(z) = 1 for z < 0; l_0(z) = 1 - z for z ∈ [0, 1]; l_0(z) = 0 for z > 1. \n\nInstead of using ERM to estimate the parameter that minimizes the risk of L, consider the scheme of minimizing the empirical risk associated with L', under the assumption that ‖x‖_2 ≤ b and the constraint that ‖w‖_2 ≤ a. 
Denote the estimated parameter by w_n. It follows from the covering number bounds and Theorem 2.1 that with probability at least 1 - η: \n\nE_x l_0(w_n^T x) ≤ inf_{‖w‖_2 ≤ a} E_x l_0(w^T x) + O((n^{1/2} ab ln(nab + 2) + ln(1/η)) / n). \n\nIf we apply a slight generalization of Theorem 2.2 and the covering number bound of Theorem 3.4, then with probability at least 1 - η: \n\nE_x I(w_n^T x ≤ 0) ≤ Ê_{X_1^n} I(w_n^T x ≤ 2γ) + O((1/n)((a²b²/γ²) ln(ab/γ + 2) + ln n + ln(1/η))) \n\nfor all γ ∈ (0, 1]. \n\nBounds given in this paper can be applied to show that under appropriate regularization conditions and assumptions on the data, methods based on (1) lead to generalization performance of the form O(1/√n), where the O symbol (which is independent of d) indicates that the hidden constant may include a polynomial dependence on log(n). It is also important to note that in certain cases λ does not appear in the constant of O (or has only a small influence on the convergence), as demonstrated by the example in the next section. \n\n5 Asymptotic analysis \n\nThe convergence results in the previous sections are in the form of VC style convergence in probability, which has a combinatorial flavor. However, for problems with differentiable function families involving vector parameters, it is often convenient to derive precise asymptotic results using the differential structure. \n\nAssume that the parameter α ∈ R^m in (2) is a vector and L is a smooth function. Let α* denote the optimal parameter, ∇_α denote the derivative with respect to α, and ψ(α, x) denote ∇_α L(α, x). Assume that \n\nV = ∫ ∇_α ψ(α*, x) dP(x), \nU = ∫ ψ(α*, x) ψ(α*, x)^T dP(x). \n\nThen under certain regularity conditions, the asymptotic expected generalization error is given by \n\nE R(α_erm) = R(α*) + (1/(2n)) tr(V^{-1} U).   (6) \n\nMore generally, for any evaluation function h(α) such that ∇h(α*) = 0: \n\nE h(α_erm) ≈ h(α*) + (1/(2n)) tr(V^{-1} ∇²h · V^{-1} U),   (7) \n\nwhere ∇²h is the Hessian matrix of h at α*. Note that this approach assumes that the optimal solution is unique. These results are exact asymptotically and provide better bounds than those from the standard PAC analysis. \n\nExample 5.1 We would like to study a form of the support vector machine. Consider \n\nL(α, x) = f(α^T x) + (λ/2)‖α‖_2², \n\nwhere f(z) = 1 - z for z < 1 and f(z) = 0 for z > 1. \n\nBecause of the discontinuity in the derivative of f, the asymptotic formula may not hold. However, if we make an assumption on the smoothness of the distribution of x, then the expectation of the derivative over x can still be smooth. In this case, the smoothness of f itself is not crucial. Furthermore, in a separate report, we shall illustrate that similar small sample bounds without any assumption on the smoothness of the distribution can be obtained by using techniques related to asymptotic analysis. \n\nConsider the optimal parameter α* and let S = {x : α*^T x ≤ 1}. Note that λα* = E_{x∈S} x, and U = E_{x∈S} (x - E_{x∈S} x)(x - E_{x∈S} x)^T. Assume that ∃γ > 0 s.t. P(α*^T x ≤ γ) = 0; then V = λI + B, where B is a positive semi-definite matrix. It follows that \n\ntr(V^{-1} U) ≤ tr(U)/λ ≤ E_{x∈S} ‖x‖_2² ‖α*‖_2² / E_{x∈S} α*^T x ≤ sup ‖x‖_2² ‖α*‖_2² / γ. \n\nNow, consider α_erm obtained from observations X_1^n = [x_1, ..., x_n] by minimizing the empirical risk associated with the loss function L(α, x); then \n\nE_x L(α_erm, x) ≤ inf_α E_x L(α, x) + (1/(2γn)) sup ‖x‖_2² ‖α*‖_2² \n\nasymptotically. As λ → 0, this scheme becomes the optimal separating hyperplane [16]. This asymptotic bound is better than typical PAC bounds with a fixed λ. 
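The asymptotic expectation formula of Section 5 can be sanity-checked numerically in a toy case where it is exact: for the quadratic loss L(α, x) = ‖α - x‖²/2, the ERM solution is the sample mean, V = I, U = Cov(x), and the expected excess risk is exactly tr(V^{-1}U)/(2n). This is a hedged illustration of the formula, not the SVM setting of Example 5.1.

```python
import numpy as np

rng = np.random.default_rng(2)
d, n, trials = 3, 50, 4000
cov = np.diag([1.0, 2.0, 0.5])          # U = Cov(x); V = I for the quadratic loss

# For L(alpha, x) = 0.5 * ||alpha - x||^2 with mean-zero x, alpha* = 0,
# alpha_erm is the sample mean, and R(alpha) - R(alpha*) = 0.5 * ||alpha||^2.
excess = []
for _ in range(trials):
    x = rng.multivariate_normal(np.zeros(d), cov, size=n)
    a_erm = x.mean(axis=0)
    excess.append(0.5 * np.sum(a_erm ** 2))

mc = np.mean(excess)                     # Monte Carlo estimate of E[R(a_erm) - R(a*)]
pred = np.trace(cov) / (2 * n)           # (1/(2n)) tr(V^{-1} U)
```

The Monte Carlo average `mc` should match the prediction `pred` to within sampling error, illustrating how the trace formula captures the O(1/n) excess risk with an explicit constant rather than a worst-case PAC constant.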
Note that although the bound obtained in the above example is very similar to the mistake bound for the perceptron online update algorithm, in practice we may obtain much better estimates from (6) by plugging in the empirical data. \n\nReferences \n\n[1] N. Alon, S. Ben-David, N. Cesa-Bianchi, and D. Haussler. Scale-sensitive dimensions, uniform convergence, and learnability. Journal of the ACM, 44(4):615-631, 1997. \n\n[2] A.R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39(3):930-945, 1993. \n\n[3] P.L. Bartlett. The sample complexity of pattern classification with neural networks: the size of the weights is more important than the size of the network. IEEE Transactions on Information Theory, 44(2):525-536, 1998. \n\n[4] R.M. Dudley. A course on empirical processes, volume 1097 of Lecture Notes in Mathematics. 1984. \n\n[5] G.H. Golub and C.F. Van Loan. Matrix computations. Johns Hopkins University Press, Baltimore, MD, third edition, 1996. \n\n[6] D. Haussler. Generalizing the PAC model: sample size bounds from metric dimension-based uniform convergence results. In Proc. 30th IEEE Symposium on Foundations of Computer Science, pages 40-45, 1989. \n\n[7] Lee K. Jones. A simple lemma on greedy approximation in Hilbert space and convergence rates for projection pursuit regression and neural network training. Ann. Statist., 20(1):608-613, 1992. \n\n[8] A.N. Kolmogorov. Asymptotic characteristics of some completely bounded metric spaces. Dokl. Akad. Nauk. SSSR, 108:585-589, 1956. \n\n[9] A.N. Kolmogorov and V.M. Tihomirov. ε-entropy and ε-capacity of sets in functional spaces. Amer. Math. Soc. Transl., 17(2):277-364, 1961. \n\n[10] Wee Sun Lee, P.L. Bartlett, and R.C. Williamson. Efficient agnostic learning of neural networks with bounded fan-in. 
IEEE Transactions on Information Theory, 42(6):2118-2132, 1996. \n\n[11] D. Pollard. Convergence of stochastic processes. Springer-Verlag, New York, 1984. \n\n[12] N. Sauer. On the density of families of sets. Journal of Combinatorial Theory (Series A), 13:145-147, 1972. \n\n[13] Robert E. Schapire, Yoav Freund, Peter Bartlett, and Wee Sun Lee. Boosting the margin: a new explanation for the effectiveness of voting methods. Ann. Statist., 26(5):1651-1686, 1998. \n\n[14] J. Shawe-Taylor, P.L. Bartlett, R.C. Williamson, and M. Anthony. Structural risk minimization over data-dependent hierarchies. IEEE Trans. Inf. Theory, 44(5):1926-1940, 1998. \n\n[15] V.N. Vapnik. Estimation of dependences based on empirical data. Springer-Verlag, New York, 1982. Translated from the Russian by Samuel Kotz. \n\n[16] V.N. Vapnik. The nature of statistical learning theory. Springer-Verlag, New York, 1995. \n\n[17] V.N. Vapnik and A.Y. Chervonenkis. On the uniform convergence of relative frequencies of events to their probabilities. Theory of Probability and Applications, 16:264-280, 1971. \n\n[18] Tong Zhang. Analysis of regularized linear functions for classification problems. Technical Report RC-21572, IBM, 1999. ", "award": [], "sourceid": 1689, "authors": [{"given_name": "Tong", "family_name": "Zhang", "institution": null}]}