{"title": "Weighted Sums of Random Kitchen Sinks: Replacing minimization with randomization in learning", "book": "Advances in Neural Information Processing Systems", "page_first": 1313, "page_last": 1320, "abstract": "Randomized neural networks are immortalized in this AI Koan: In the days when Sussman was a novice, Minsky once came to him as he sat hacking at the PDP-6. ``What are you doing?'' asked Minsky. ``I am training a randomly wired neural net to play tic-tac-toe,'' Sussman replied. ``Why is the net wired randomly?'' asked Minsky. Sussman replied, ``I do not want it to have any preconceptions of how to play.'' Minsky then shut his eyes. ``Why do you close your eyes?'' Sussman asked his teacher. ``So that the room will be empty,'' replied Minsky. At that moment, Sussman was enlightened. We analyze shallow random networks with the help of concentration of measure inequalities. Specifically, we consider architectures that compute a weighted sum of their inputs after passing them through a bank of arbitrary randomized nonlinearities. We identify conditions under which these networks exhibit good classification performance, and bound their test error in terms of the size of the dataset and the number of random nonlinearities.", "full_text": "Weighted Sums of Random Kitchen Sinks: Replacing\n\nminimization with randomization in learning\n\nPaper #858\n\nAbstract\n\nRandomized neural networks are immortalized in this AI Koan:\n\nIn the days when Sussman was a novice, Minsky once came to him as he sat\n\nhacking at the PDP-6.\n\n\u201cWhat are you doing?\u201d asked Minsky. \u201cI am training a randomly wired\nneural net to play tic-tac-toe,\u201d Sussman replied. \u201cWhy is the net wired ran-\ndomly?\u201d asked Minsky. Sussman replied, \u201cI do not want it to have any precon-\nceptions of how to play.\u201d\n\nMinsky then shut his eyes. \u201cWhy do you close your eyes?\u201d Sussman asked\nhis teacher. 
"So that the room will be empty," replied Minsky. At that moment, Sussman was enlightened.

We analyze shallow random networks with the help of concentration of measure inequalities. Specifically, we consider architectures that compute a weighted sum of their inputs after passing them through a bank of arbitrary randomized nonlinearities. We identify conditions under which these networks exhibit good classification performance, and bound their test error in terms of the size of the dataset and the number of random nonlinearities.

1 Introduction

In the earliest days of artificial intelligence, the bottom-most layer of neural networks consisted of randomly connected "associator units" that computed random binary functions of their inputs [1]. These randomized shallow networks have largely been superseded by optimally, or nearly optimally, tuned shallow architectures such as weighted sums of positive definite kernels (as in Support Vector Machines), or weighted sums of weak classifiers (as in Adaboost). But recently, architectures that randomly transform their inputs have been resurfacing in the machine learning community [2, 3, 4, 5], largely motivated by the fact that randomization is computationally cheaper than optimization. With the help of concentration of measure inequalities on function spaces, we show that training a shallow architecture by randomly choosing the nonlinearities in the first layer results in a classifier that is not much worse than one constructed by optimally tuning the nonlinearities. 
The main technical contributions of the paper are an approximation error bound (Lemma 1), and a synthesis of known techniques from learning theory to analyze random shallow networks.

Consider the problem of fitting a function f : X → R to a training data set of m input-output pairs {x_i, y_i}_{i=1...m}, drawn iid from some unknown distribution P(x, y), with x_i ∈ X and y_i ∈ ±1. The fitting problem consists of finding an f that minimizes the empirical risk

\[ R_{\mathrm{emp}}[f] \equiv \frac{1}{m} \sum_{i=1}^{m} c(f(x_i), y_i). \quad (1) \]

The loss c(y, y′) penalizes the deviation between the prediction f(x) and the label y. Popular choices for c are the hinge loss, max(0, 1 − yy′), used in the Support Vector Machine [6], the exponential loss, e^{−yy′}, used in Adaboost [7, 8], and the quadratic loss, (y − y′)², used in matching pursuit [9] and regularized least squares classification [10].

Similarly to kernel machines and Adaboost, we will consider functions of the form f(x) = Σ_{i=1}^∞ α(w_i)φ(x; w_i) or f(x) = ∫ α(w)φ(x; w) dw, where feature functions φ : X × Ω → R, parameterized by some vector w ∈ Ω, are weighted by a function α : Ω → R. In kernel machines, the feature functions φ are the eigenfunctions of a positive definite kernel k, and in Adaboost they are typically decision trees or stumps. Adaboost [8, 7] and matching pursuit [11, 9] find approximate empirical risk minimizers over this class of functions by greedily minimizing over a finite number of scalar weights α and parameter vectors w jointly:

\[ \min_{\substack{w_1, \dots, w_K \in \Omega \\ \alpha \in \mathcal{A}}} \; R_{\mathrm{emp}}\!\left[ \sum_{k=1}^{K} \phi(x; w_k)\,\alpha_k \right]. \quad (2) \]

But it is also possible to randomize over w and minimize over α. 
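For concreteness, the three losses named above and the empirical risk of Equation (1) can be written in a few lines of Python. This is a minimal illustration only; the function names are ours, not the paper's:

```python
import numpy as np

# c(y_hat, y) for a prediction y_hat and a label y in {-1, +1}.
def hinge_loss(y_hat, y):
    # max(0, 1 - y*y'), used in the Support Vector Machine.
    return np.maximum(0.0, 1.0 - y * y_hat)

def exponential_loss(y_hat, y):
    # e^{-y*y'}, used in Adaboost.
    return np.exp(-y * y_hat)

def quadratic_loss(y_hat, y):
    # (y - y')^2, used in matching pursuit and regularized least squares.
    return (y - y_hat) ** 2

def empirical_risk(f, X, y, loss):
    # R_emp[f] = (1/m) * sum_i c(f(x_i), y_i), as in Equation (1).
    return np.mean(loss(f(X), y))
```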
Rather than jointly optimizing over α and w, the following algorithm first draws the parameters of the nonlinearities randomly from a pre-specified distribution p. Then with w fixed, it fits the weights α optimally via a simple convex optimization:

Algorithm 1 The Weighted Sum of Random Kitchen Sinks fitting procedure.
Input: A dataset {x_i, y_i}_{i=1...m} of m points, a bounded feature function |φ(x; w)| ≤ 1, an integer K, a scalar C, and a probability distribution p(w) on the parameters of φ.
Output: A function f̂(x) = Σ_{k=1}^K φ(x; w_k) α_k.
  Draw w_1, ..., w_K iid from p.
  Featurize the input: z_i ← [φ(x_i; w_1), ..., φ(x_i; w_K)]^⊤.
  With w fixed, solve the empirical risk minimization problem

\[ \min_{\alpha \in \mathbb{R}^K} \; \frac{1}{m} \sum_{i=1}^{m} c\!\left( \alpha^\top z_i,\, y_i \right) \quad (3) \]
\[ \text{s.t.} \quad \|\alpha\|_\infty \le C/K. \quad (4) \]

In practice, we let C be large enough that the constraint (4) remains inactive. When c is the quadratic loss, the minimization (3) is simple linear least squares, and when c is the hinge loss, it amounts to fitting a linear SVM to a dataset of m K-dimensional feature vectors.

Randomly setting the nonlinearities is appealing for several reasons. First, the fitting procedure is simple: Algorithm 1 can be implemented in a few lines of MATLAB code even for complex feature functions φ, whereas fitting nonlinearities with Adaboost requires much more care. This flexibility allows practitioners to experiment with a wide variety of nonlinear feature functions without first having to devise fitting procedures for them. Second, the algorithm is fast: experiments show between one and three orders of magnitude speedup over Adaboost. On the down side, one might expect to have to tune the sampling distribution p for each dataset. 
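As an illustration only (not the paper's code, which is MATLAB), Algorithm 1 with decision-stump features and the quadratic loss reduces to a few lines of NumPy. The stump parameterization and the `lstsq` solver are our illustrative choices, and the constraint (4) is left inactive as the text suggests:

```python
import numpy as np

def fit_random_kitchen_sinks(X, y, K, rng=np.random.default_rng(0)):
    """Sketch of Algorithm 1 with decision-stump features and quadratic loss.
    The norm constraint (4) is omitted, as the paper does in practice."""
    m, d = X.shape
    # Draw w_1..w_K iid from p: a coordinate index (uniform) and a threshold (normal),
    # mirroring the sampling distribution used in the paper's experiments.
    dims = rng.integers(0, d, size=K)
    thresholds = rng.normal(size=K)
    # Featurize: z_i = [phi(x_i; w_1), ..., phi(x_i; w_K)],
    # with phi(x; w) = sign(x_{w_d} - w_t).
    Z = np.sign(X[:, dims] - thresholds)
    # With w fixed, minimize the empirical quadratic risk over alpha:
    # plain linear least squares.
    alpha, *_ = np.linalg.lstsq(Z, y, rcond=None)
    def predict(X_new):
        return np.sign(np.sign(X_new[:, dims] - thresholds) @ alpha)
    return predict
```

A quick usage check: on data whose label is the sign of one coordinate, a few hundred random stumps fit by least squares recover the labels well.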
But in practice, we find that to obtain accuracies that are competitive with Adaboost, the same sampling distribution can be used for all the datasets we considered if the coordinates of the data are first zero-meaned and rescaled to unit variance.

Formally, we show that Algorithm 1 returns a function that has low true risk. The true risk of a function f is

\[ R[f] \equiv \mathbb{E}_{(x,y) \sim P}\, c(f(x), y), \quad (5) \]

and measures the expected loss of f on as-yet-unseen test points, assuming these test points are generated from the same distribution that generated the training data. The following theorem states that with very high probability, Algorithm 1 returns a function whose true risk is near the lowest true risk attainable by functions in the class F_p defined below:

Theorem 1 (Main result). Let p be a distribution on Ω, and let φ satisfy sup_{x,w} |φ(x; w)| ≤ 1. Define the set

\[ \mathcal{F}_p \equiv \left\{ f(x) = \int_\Omega \alpha(w)\,\phi(x; w)\, dw \;\middle|\; |\alpha(w)| \le C p(w) \right\}. \quad (6) \]

Suppose c(y, y′) = c(yy′), with c(yy′) L-Lipschitz. Then for any δ > 0, if the training data {x_i, y_i}_{i=1...m} are drawn iid from some distribution P, Algorithm 1 returns a function f̂ that satisfies

\[ R[\hat{f}] - \min_{f \in \mathcal{F}_p} R[f] \le O\!\left( \left( \frac{1}{\sqrt{m}} + \frac{1}{\sqrt{K}} \right) LC \sqrt{\log \frac{1}{\delta}} \right) \quad (7) \]

with probability at least 1 − 2δ over the training dataset and the choice of the parameters w_1, ..., w_K.

Note that the dependence on δ in the bound is logarithmic, so even small δ's do not cause the bound to blow up. The set F_p is a rich class of functions. It consists of functions whose weights α(w) decay more rapidly than the given sampling distribution p. 
For example, when φ(x; w) are sinusoids with frequency w, F_p is the set of all functions whose Fourier transforms decay faster than C p(w).

We prove the theorem in the next section, and demonstrate the algorithm on some sample datasets in Section 4. The proof of the theorem provides explicit values for the constants in the big O notation.

2 Proof of the Main Theorem

Algorithm 1 returns a function that lies in the random set

\[ \hat{\mathcal{F}}_w \equiv \left\{ f(x) = \sum_{k=1}^{K} \alpha_k \phi(x; w_k) \;\middle|\; |\alpha_k| \le \frac{C}{K} \right\}. \quad (8) \]

The bound in the main theorem can be decomposed in a standard way into two bounds:

1. An approximation error bound that shows that the lowest true risk attainable by a function in F̂_w is not much larger than the lowest true risk attainable in F_p (Lemma 2).

2. An estimation error bound that shows that the true risk of every function in F̂_w is close to its empirical risk (Lemma 3).

The following Lemma is helpful in bounding the approximation error:

Lemma 1. Let μ be a measure on X, and f* a function in F_p. If w_1, ..., w_K are drawn iid from p, then for any δ > 0, with probability at least 1 − δ over w_1, ..., w_K, there exists a function f̂ ∈ F̂_w so that

\[ \sqrt{ \int_{\mathcal{X}} \left( \hat{f}(x) - f^*(x) \right)^2 d\mu(x) } \;\le\; \frac{C}{\sqrt{K}} \left( 1 + \sqrt{2 \log \frac{1}{\delta}} \right). \quad (9) \]

The proof relies on Lemma 4 of the Appendix, which states that the average of bounded vectors in a Hilbert space concentrates towards its expectation in the Hilbert norm exponentially fast.

Proof. Since f* ∈ F_p, we can write f*(x) = ∫_Ω α(w)φ(x; w) dw. 
Construct the functions f_k = β_k φ(·; w_k), k = 1 ... K, with β_k ≡ α(ω_k)/p(ω_k), so that E f_k = f*. Let f̂(x) = Σ_{k=1}^K (β_k/K) φ(x; ω_k) be the sample average of these functions. Then f̂ ∈ F̂_w because |β_k/K| ≤ C/K. Also, under the inner product ⟨f, g⟩ = ∫ f(x)g(x) dμ(x), ‖β_k φ(·; w_k)‖ ≤ C. The Lemma follows by applying Lemma 4 to f_1, ..., f_K under this inner product.

Lemma 2 (Bound on the approximation error). Suppose c(y, y′) is L-Lipschitz in its first argument. Let f* be a fixed function in F_p. If w_1, ..., w_K are drawn iid from p, then for any δ > 0, with probability at least 1 − δ over w_1, ..., w_K, there exists a function f̂ ∈ F̂_w that satisfies

\[ R[\hat{f}] \le R[f^*] + \frac{LC}{\sqrt{K}} \left( 1 + \sqrt{2 \log \frac{1}{\delta}} \right). \quad (10) \]

Proof. For any two functions f and g, the Lipschitz condition on c followed by the concavity of square root gives

\[ R[f] - R[g] = \mathbb{E}\left[ c(f(x), y) - c(g(x), y) \right] \le \mathbb{E}\left| c(f(x), y) - c(g(x), y) \right| \quad (11) \]
\[ \le L\, \mathbb{E}|f(x) - g(x)| \le L \sqrt{ \mathbb{E}\left( f(x) - g(x) \right)^2 }. \quad (12) \]

The lemma then follows from Lemma 1.

Next, we rely on a standard result from statistical learning theory to show that for a given choice of w_1, ..., w_K the empirical risk of every function in F̂_w is close to its true risk.

Lemma 3 (Bound on the estimation error). Suppose c(y, y′) = c(yy′), with c(yy′) L-Lipschitz. Let w_1, ..., w_K be fixed. If {x_i, y_i}_{i=1...m} are drawn iid from a fixed distribution, for any δ > 0, with probability at least 1 − δ over the dataset, we have

\[ \forall f \in \hat{\mathcal{F}}_w \quad |R[f] - R_{\mathrm{emp}}[f]| \le \frac{1}{\sqrt{m}} \left( 4LC + 2|c(0)| + LC \sqrt{ \tfrac{1}{2} \log \tfrac{1}{\delta} } \right). \quad (13) \]

Proof sketch. By Hölder, the functions in F̂_w are bounded above by C. 
The Rademacher complexity of F̂_w can be shown to be bounded above by C/√m (see the Appendix). The theorem follows by results from [12] which are summarized in Theorem 2 of the Appendix.

Proof of Theorem 1. Let f* be a minimizer of R over F_p, f̂ a minimizer of R_emp over F̂_w (the output of the algorithm), and f̂* a minimizer of R over F̂_w. Then

\[ R[\hat{f}] - R[f^*] = R[\hat{f}] - R[\hat{f}^*] + R[\hat{f}^*] - R[f^*] \quad (14) \]
\[ \le |R[\hat{f}] - R[\hat{f}^*]| + R[\hat{f}^*] - R[f^*]. \quad (15) \]

The first term on the right side is an estimation error: By Lemma 3, with probability at least 1 − δ, |R[f̂*] − R_emp[f̂*]| ≤ ε_est and simultaneously, |R[f̂] − R_emp[f̂]| ≤ ε_est, where ε_est is the right side of the bound in Lemma 3. By the optimality of f̂, R_emp[f̂] ≤ R_emp[f̂*]. Combining these facts gives that with probability at least 1 − δ,

\[ |R[\hat{f}] - R[\hat{f}^*]| \le 2\epsilon_{\mathrm{est}} = \frac{2}{\sqrt{m}} \left( 4LC + 2|c(0)| + LC \sqrt{ \tfrac{1}{2} \log \tfrac{1}{\delta} } \right). \]

The second term in Equation (15) is the approximation error, and by Lemma 2, with probability at least 1 − δ, it is bounded above by ε_app = (LC/√K)(1 + √(2 log(1/δ))).

By the union bound, with probability at least 1 − 2δ, the right side of Equation (15) is bounded above by 2ε_est + ε_app.

3 Related Work

Greedy algorithms for fitting networks of the form (2) have been analyzed, for example, in [7, 11, 9]. Zhang analyzed greedy algorithms and a randomized algorithm similar to Algorithm 1 for fitting sparse Gaussian processes to data, a more narrow setting than we consider here. 
He obtained bounds on the expected error for this sparse approximation problem by viewing these methods as stochastic gradient descent.

Approximation error bounds such as those of Maurey [11, Lemma 1], Girosi [13], and Gnecco and Sanguineti [14] rely on random sampling to guarantee the existence of good parameters w_1, ..., w_K, but they require access to the representation of f* to actually produce these parameters. These approximation bounds cannot be used to guarantee the performance of Algorithm 1 because Algorithm 1 is oblivious of the data when it generates the parameters. Lemma 2 differs from these bounds in that it relies on f* only to generate the weights α_1, ..., α_K, but it remains oblivious to f* when generating the parameters by sampling them from p instead. Furthermore, because F̂_w is smaller than the classes considered by [11, 14], the approximation error rate in Lemma 1 matches those of existing approximation error bounds.

Figure 1: Comparisons between Random Kitchen Sinks and Adaboosted decision stumps on adult (first row), activity (second row), and KDDCUP99 (third row). The first column plots the test error of each classifier as a function of K. The accuracy of Random Kitchen Sinks catches up to that of Adaboost as K grows. The second column plots the total training and testing time as a function of K. For a given K, Random Kitchen Sinks is between two and three orders of magnitude faster than Adaboost. The third column combines the previous two columns. It plots testing+training time required to achieve a desired error rate. 
For a given error rate, Random Kitchen Sinks is between one and three orders of magnitude faster than Adaboost.

4 Experiments

Since others have already empirically demonstrated the benefits of random featurization [2, 3, 4, 5], we only present a few illustrations in this section.

We compared Random Kitchen Sinks with Adaboost on three classification problems: The adult dataset has roughly 32,000 training instances. Each categorical variable was replaced by a binary indicator variable over the categories, resulting in 123 dimensions per instance. The test set consists of 15,000 instances. KDDCUP99 is a network intrusion detection problem with roughly 5,000,000 127-dimensional training instances, subsampled to 50,000 instances. The test set consists of 150,000 instances. activity is a human activity recognition dataset with 20,000 223-dimensional instances, of which about 200 are irrelevant for classification. The test set consists of 50,000 instances. The datasets were preprocessed by zero-meaning and rescaling each dimension to unit variance. The feature functions in these experiments were decision stumps φ(x; w) = sign(x_{w_d} − w_t), which simply determine whether the w_d-th dimension of x is smaller or greater than the threshold w_t. The sampling distribution p for Random Kitchen Sinks drew the threshold parameter w_t from a normal distribution and the coordinate w_d from a uniform distribution over the coordinates. For some experiments, we could afford to run Random Kitchen Sinks for larger K than Adaboost, and these runs are included in the plots. We used the quadratic loss, but find no substantial differences in quality under the hinge loss (though there is degradation in speed by a factor of 2-10). We used MATLAB-optimized versions of Adaboost and Random Kitchen Sinks, and report wall clock time in seconds.

Figure 1 compares the results on these datasets. 
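The preprocessing step just described (zero-meaning and unit-variance rescaling of each coordinate) can be sketched as follows. Estimating the statistics on the training set is our assumed convention; the paper does not spell this detail out:

```python
import numpy as np

def standardize(X_train, X_test):
    # Zero-mean each coordinate and rescale to unit variance, using
    # statistics estimated on the training set only (our assumption).
    mu = X_train.mean(axis=0)
    sigma = X_train.std(axis=0)
    sigma[sigma == 0] = 1.0  # guard against constant coordinates
    return (X_train - mu) / sigma, (X_test - mu) / sigma
```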
Adaboost expends considerable effort in choosing the decision stumps and achieves good test accuracy with a few of them. Random Kitchen Sinks requires more nonlinearities to achieve similar accuracies. But because it is faster than Adaboost, it can produce classifiers that are just as accurate as Adaboost's with more nonlinearities in less total time. In these experiments, Random Kitchen Sinks is almost as accurate as Adaboost but faster by one to three orders of magnitude.

Figure 2: The L∞ norm of α returned by RKS for 500 different runs of RKS with various settings of K on adult. ‖α‖∞ decays with K, which justifies dropping the constraint (4) in practice.

We defer the details of the following experiments to a technical report: As an alternative to Adaboost, we have experimented with conjugate-gradient-descent-based fitting procedures for (2), and find again that randomly generating the nonlinearities produces equally accurate classifiers using many more nonlinearities but in much less time. 
We obtain similar results as those of Figure 1 with the random features of [4], and with random sigmoidal ridge functions φ(x; w) = σ(w′x).

To simplify the implementation of Random Kitchen Sinks, we ignore the constraint (4) in practice. The scalar C controls the size of F̂_w and F_p, and to eliminate the constraint, we implicitly set C to a large value so that the constraint is never tight. However, for the results of this paper to hold, C cannot grow faster than K. Figure 2 shows that the L∞ norm of the unconstrained optimum of (3) for the adult dataset does decay linearly with K, so that there exists a C that does not grow with K for which the constraint is never tight, thereby justifying dropping the constraint.

5 Discussion and Conclusions

Various hardness of approximation lower bounds for fixed basis functions exist (see, for example, [11]). The guarantee in Lemma 1 avoids running afoul of these lower bounds because it does not seek to approximate every function in F_p simultaneously, but rather only the true risk minimizer, with high probability.

It may be surprising that Theorem 1 holds even when the feature functions φ are nearly orthogonal. The result works because the importance sampling constraint |α(w)| ≤ C p(w) ensures that a feature function does not receive a large weight if it is unlikely to be sampled by p. When the feature functions are highly linearly dependent, better bounds can be obtained because any f(x) = ∫ α(w)φ(x; w) dw can be rewritten as f(x) = ∫ α′(w)φ(x; w) dw with |α′|/p ≤ |α|/p, improving the importance ratio C. 
This intuition can be formalized via the Rademacher complexity of φ, a result which we leave for future work.

One may wonder whether Algorithm 1 has good theoretical guarantees on F_p because F_p is too small a class of functions. Indeed, when φ are the Fourier bases, |α|/p ≤ C implies ∫_Ω |α(w)| dw ≤ C, so every function in F_p has an absolutely integrable Fourier transform. Thus F_p is smaller than the set considered by Jones [9] for greedy matching pursuit, and for which he obtained an approximation rate of O(1/√K). The most reliable way to show that F_p is rich enough for practical applications is to conduct experiments with real data. The experiments show that F_p indeed contains good predictors.

The convergence rate for Adaboost [7] is exponentially fast in K, which at first appears to be much faster than 1/√K. However, the base of the exponent is the minimum weighted margin encountered by the algorithm through all iterations, a quantity that is difficult to bound a priori. This makes a direct comparison of the bounds difficult, though we have tried to provide empirical comparisons.

A Exponentially Fast Concentration of Averages towards the Mean in a Hilbert Space

Lemma 4. Let X = {x_1, ..., x_K} be iid random variables in a ball H of radius M centered around the origin in a Hilbert space. Denote their average by X̄ = (1/K) Σ_{k=1}^K x_k. Then for any δ > 0, with probability at least 1 − δ,

\[ \left\| \bar{X} - \mathbb{E}\bar{X} \right\| \le \frac{M}{\sqrt{K}} \left( 1 + \sqrt{2 \log \frac{1}{\delta}} \right). \quad (16) \]

Proof. 
We use McDiarmid's inequality to show that the scalar function f(X) = ‖X̄ − E X̄‖ is concentrated about its mean, which shrinks as O(1/√K).

The function f is stable under perturbation of its i-th argument. Let X̃ = {x_1, ..., x̃_i, ..., x_K} be a copy of X with the i-th element replaced by an arbitrary element of H. Applying the triangle inequality twice gives

\[ |f(X) - f(\tilde{X})| = \left| \|\bar{X} - \mathbb{E}\bar{X}\| - \|\bar{\tilde{X}} - \mathbb{E}\bar{X}\| \right| \le \|\bar{X} - \bar{\tilde{X}}\| = \frac{\|x_i - \tilde{x}_i\|}{K} \le \frac{2M}{K}. \quad (17) \]

To bound the expectation of f, use the familiar identity about the variance of the average of iid random variables

\[ \mathbb{E}\left\| \bar{X} - \mathbb{E}\bar{X} \right\|^2 = \frac{1}{K} \left( \mathbb{E}\|x\|^2 - \|\mathbb{E}x\|^2 \right), \quad (18) \]

in conjunction with Jensen's inequality and the fact that ‖x‖ ≤ M to get

\[ \mathbb{E} f(X) \le \sqrt{ \mathbb{E} f^2(X) } = \sqrt{ \mathbb{E}\left\| \bar{X} - \mathbb{E}\bar{X} \right\|^2 } \le \frac{M}{\sqrt{K}}. \quad (19) \]

This bound for the expectation of f and McDiarmid's inequality give

\[ \Pr_X\left[ f(X) - \frac{M}{\sqrt{K}} \ge \epsilon \right] \le \Pr_X\left[ f(X) - \mathbb{E}f(X) \ge \epsilon \right] \le \exp\left( -\frac{K\epsilon^2}{2M^2} \right). \quad (20) \]

To get the final result, set δ to the right hand side, solve for ε, and rearrange.

B Generalization bounds that use Rademacher complexity

One measure of the size of a class F of functions is its Rademacher complexity:

\[ R_m[\mathcal{F}] \equiv \mathbb{E}_{\substack{x_1, \dots, x_m \\ \sigma_1, \dots, \sigma_m}} \left[ \sup_{f \in \mathcal{F}} \frac{1}{m} \sum_{i=1}^{m} \sigma_i f(x_i) \right], \]

The variables σ_1, ..., σ_m are iid Bernoulli random variables that take on 
the value -1 or +1 with equal probability and are independent of x_1, ..., x_m.

The Rademacher complexity of F̂_w can be bounded as follows. Define S ≡ {α ∈ R^K | ‖α‖∞ ≤ C/K}:

\[ R_m[\hat{\mathcal{F}}_w] = \mathbb{E}_{\sigma,X} \sup_{\alpha \in S} \left| \frac{1}{m} \sum_{i=1}^{m} \sigma_i \sum_{k=1}^{K} \alpha_k \phi(x_i; \omega_k) \right| = \mathbb{E}_{\sigma,X} \sup_{\alpha \in S} \left| \sum_{k=1}^{K} \alpha_k \frac{1}{m} \sum_{i=1}^{m} \sigma_i \phi(x_i; \omega_k) \right| \quad (21) \]
\[ \le \mathbb{E}_{\sigma,X}\, \frac{C}{K} \sum_{k=1}^{K} \left| \frac{1}{m} \sum_{i=1}^{m} \sigma_i \phi(x_i; \omega_k) \right| \le \mathbb{E}_X\, \frac{C}{K} \sum_{k=1}^{K} \sqrt{ \mathbb{E}_\sigma \left( \frac{1}{m} \sum_{i=1}^{m} \sigma_i \phi(x_i; \omega_k) \right)^2 } \quad (22) \]
\[ = \mathbb{E}_X\, \frac{C}{K} \sum_{k=1}^{K} \sqrt{ \frac{1}{m^2} \sum_{i=1}^{m} \phi^2(x_i; \omega_k) } \le \frac{C}{K} \sum_{k=1}^{K} \sqrt{\frac{1}{m}} = \frac{C}{\sqrt{m}}, \quad (23) \]

Then with probability at least 1\u2212 \u03b4 with respect\nto training samples {xi, yi}m drawn from a probabilisty distribution P on X \u00d7 {\u22121, +1}, every\nfunction in F satis\ufb01es\n\n(cid:114) 1\n\n2m\n\nR[f] \u2264 Remp[f] + 4LRm[F] +\n\n2|c(0)|\u221a\n\nm\n\n+ LC\n\nlog 1\n\u03b4 .\n\n(24)\n\nReferences\n[1] H. D. Block. The perceptron: a model for brain functioning. Review of modern physics,\n\n34:123\u2013135, January 1962.\n\n[2] Y. Amit and D. Geman. Shape quantization and recognition with randomized trees. Neural\n\nComputation, 9(7):1545\u20131588, 1997.\n\n[3] F. Moosmann, B. Triggs, and F. Jurie. Randomized clustering forests for building fast and\nIn Advances in Neural Information Processing Systems\n\ndiscriminative visual vocabularies.\n(NIPS), 2006.\n\n[4] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In Advances in\n\nNeural Information Processing Systems (NIPS), 2007.\n\n[5] W. Maass and H. Markram. On the computational power of circuits of spiking neurons. Journal\n\nof Computer and System Sciences, 69:593\u2013616, December 2004.\n\n[6] E. Osuna, R. Freund, and F. Girosi. Training support vector machines: an application to face\n\ndetection. In Computer Vision and Pattern Recognition (CVPR), 1997.\n\n[7] R. E. Schapire. The boosting approach to machine learning: An overview. In D. D. Denison,\nM. H. Hansen, C. Holmes, B. Mallick, and B. Yu, editors, Nonlinear Estimation and Classi\ufb01-\ncation. Springer, 2003.\n\n[8] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: a statistical view of\n\nboosting. Technical report, Dept. of Statistics, Stanford University, 1998.\n\n[9] L. K. Jones. A simple lemma on greedy approximation in Hilbert space and convergence\nrates for projection pursuit regression and neural network training. The Annals of Statistics,\n20(1):608\u2013613, March 1992.\n\n[10] R. Rifkin, G. Yeo, and T. Poggio. Regularized least squares classi\ufb01cation. 
Advances in Learning Theory: Methods, Model and Applications, NATO Science Series III: Computer and Systems Sciences, 190, 2003.

[11] A. R. Barron. Universal approximation bounds for superpositions of a sigmoidal function. IEEE Transactions on Information Theory, 39:930–945, May 1993.

[12] P. L. Bartlett and S. Mendelson. Rademacher and Gaussian complexities: Risk bounds and structural results. Journal of Machine Learning Research (JMLR), 3:463–482, 2002.

[13] F. Girosi. Approximation error bounds that use VC-bounds. In International Conference on Neural Networks, pages 295–302, 1995.

[14] G. Gnecco and M. Sanguineti. Approximation error bounds via Rademacher's complexity. Applied Mathematical Sciences, 2(4):153–176, 2008.
", "award": [], "sourceid": 885, "authors": [{"given_name": "Ali", "family_name": "Rahimi", "institution": null}, {"given_name": "Benjamin", "family_name": "Recht", "institution": null}]}