Optimal Rates for Random Fourier Features

Advances in Neural Information Processing Systems, pp. 1144–1152

Bharath K. Sriperumbudur*
Department of Statistics
Pennsylvania State University
University Park, PA 16802, USA
bks18@psu.edu

Zoltán Szabó*
Gatsby Unit, CSML, UCL
Sainsbury Wellcome Centre, 25 Howland Street
London W1T 4JG, UK
zoltan.szabo@gatsby.ucl.ac.uk

Abstract

Kernel methods represent one of the most powerful tools in machine learning for tackling problems expressed in terms of function values and derivatives, due to their capability to represent and model complex relations. While these methods show good versatility, they are computationally intensive and scale poorly to large data, as they require operations on Gram matrices. To mitigate this serious computational limitation, randomized constructions have recently been proposed in the literature, which allow the application of fast linear algorithms. Random Fourier features (RFF) are among the most popular and widely applied constructions: they provide an easily computable, low-dimensional feature representation for shift-invariant kernels. Despite the popularity of RFFs, very little is understood theoretically about their approximation quality. In this paper, we provide a detailed finite-sample theoretical analysis of the approximation quality of RFFs by (i) establishing optimal (in terms of the RFF dimension and growing set size) performance guarantees in uniform norm, and (ii) presenting guarantees in $L^r$ ($1 \le r < \infty$) norms. We also propose an RFF approximation to derivatives of a kernel, with a theoretical study of its approximation quality.

1 Introduction

Kernel methods [17] have enjoyed tremendous success in solving several fundamental problems of machine learning, ranging from classification, regression, feature extraction and dependency estimation to causal discovery, Bayesian inference and hypothesis testing.
Such success owes to their capability to represent and model complex relations by mapping points into high (possibly infinite) dimensional feature spaces. At the heart of all these techniques is the kernel trick, which allows one to implicitly compute inner products between these high-dimensional feature maps $\lambda$ via a kernel function $k$: $k(x,y) = \langle \lambda(x), \lambda(y)\rangle$. However, this flexibility and richness of kernels has a price: by resorting to implicit computations, these methods operate on the Gram matrix of the data, which raises serious computational challenges when dealing with large-scale data. In order to resolve this bottleneck, numerous solutions have been proposed, such as low-rank matrix approximations [25, 6, 1], explicit feature maps designed for additive kernels [23, 11], hashing [19, 9], and random Fourier features (RFF) [13] constructed for shift-invariant kernels, the focus of the current paper.

RFFs implement an extremely simple, yet efficient idea: instead of relying on the implicit feature map $\lambda$ associated with the kernel, by appealing to Bochner's theorem [24], which states that any bounded, continuous, shift-invariant kernel is the Fourier transform of a probability measure, [13] proposed an explicit low-dimensional random Fourier feature map $\phi$, obtained by empirically approximating the Fourier integral so that $k(x,y) \approx \langle \phi(x), \phi(y)\rangle$. The advantage of this explicit low-dimensional feature representation is that the kernel machine can be efficiently solved in the primal form through fast linear solvers, thereby enabling the handling of large-scale data. Through numerical experiments, it has also been demonstrated that kernel algorithms constructed using the approximate kernel do not suffer from significant performance degradation [13].

*Contributed equally.
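As an illustration of the primal speedup described above, the following sketch fits ridge regression on random Fourier features for the Gaussian kernel $k(x,y) = \exp(-\|x-y\|_2^2/2)$. All sizes, the bandwidth and the regularizer below are illustrative choices for this toy example, not values prescribed by the paper.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy regression data (n samples, d features); illustrative sizes.
n, d, m = 500, 5, 200
X = rng.normal(size=(n, d))
y = np.sin(X[:, 0]) + 0.1 * rng.normal(size=n)

# Random Fourier features for the Gaussian kernel
# k(x, y) = exp(-||x - y||^2 / 2): sample omega_1, ..., omega_m ~ N(0, I_d).
W = rng.normal(size=(m, d))
Z = np.hstack([np.cos(X @ W.T), np.sin(X @ W.T)]) / np.sqrt(m)  # rows = phi(x_i)

# Ridge regression solved in the primal on the 2m-dimensional features:
# an (2m x 2m) linear system instead of the (n x n) Gram-matrix system.
lam = 1e-3
theta = np.linalg.solve(Z.T @ Z + lam * np.eye(2 * m), Z.T @ y)
y_hat = Z @ theta
print("training RMSE:", np.sqrt(np.mean((y_hat - y) ** 2)))
```

Solving the primal system costs $O(nm^2 + m^3)$, versus $O(n^3)$ for exact kernel ridge regression on the $n \times n$ Gram matrix, which is the computational point of the RFF construction.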
Another advantage of the RFF approach is that, unlike the low-rank matrix approximation approach [25, 6], which also speeds up kernel machines, it approximates the entire kernel function and not just the kernel matrix. This property is particularly useful when dealing with out-of-sample data and also in online learning applications. The RFF technique has found wide applicability in several areas, such as fast function-to-function regression [12], differential privacy [2] and causal discovery [10].

Despite the success of the RFF method, surprisingly little is known about its performance guarantees. To the best of our knowledge, the only works in the machine learning literature providing theoretical insight into the accuracy of kernel approximation via RFF are [13, 22]:¹ they show that $A_m := \sup\{|k(x,y) - \langle \phi(x), \phi(y)\rangle_{\mathbb{R}^{2m}}| : x,y \in S\} = O_p(\sqrt{\log(m)/m})$ for any compact set $S \subset \mathbb{R}^d$, where $m$ is the number of random Fourier features. However, since the approximation proposed by the RFF method involves empirically approximating the Fourier integral, the RFF estimator can be thought of as an empirical characteristic function (ECF). In the probability literature, the systematic study of ECFs was initiated by [7] and followed up by [5, 4, 27]. While [7] shows the almost sure (a.s.) convergence of $A_m$ to zero, [5, Theorems 1 and 2] and [27, Theorems 6.2 and 6.3] show that the optimal rate is $m^{-1/2}$.
In addition, [7] shows that almost sure convergence cannot be attained over the entire space (i.e., $\mathbb{R}^d$) if the characteristic function decays to zero at infinity. Due to this, [5, 27] study the convergence behavior of $A_m$ when the diameter of $S$ grows with $m$, and show that almost sure convergence of $A_m$ is guaranteed as long as the diameter of $S$ is $e^{o(m)}$. Unfortunately, all these results (to the best of our knowledge) are asymptotic in nature, and the only known finite-sample guarantee, by [13, 22], is non-optimal. In this paper (see Section 3), we present a finite-sample probabilistic bound for $A_m$ that holds for any $m$ and provides the optimal rate of $m^{-1/2}$ for any compact set $S$, along with guaranteeing the almost sure convergence of $A_m$ as long as the diameter of $S$ is $e^{o(m)}$. Since convergence in uniform norm might sometimes be too strong a requirement and may not be suitable for attaining correct rates in the generalization bounds associated with learning algorithms involving RFF,² we also study the behavior of $k(x,y) - \langle \phi(x), \phi(y)\rangle_{\mathbb{R}^{2m}}$ in $L^r$-norm ($1 \le r < \infty$) and obtain an optimal rate of $m^{-1/2}$. The RFF approach to approximating a translation-invariant kernel can be seen as a special case of the problem of approximating a function in the barycenter of a family (say $F$) of functions, which was considered in [14]. However, the approximation guarantees in [14, Theorem 3.2] do not directly apply to RFF, as the assumptions on $F$ are not satisfied by the cosine functions used to approximate the kernel in the RFF approach.
While a careful modification of the proof of [14, Theorem 3.2] could yield an $m^{-1/2}$ rate of approximation for any compact set $S$, this result would still be sub-optimal, exhibiting a linear dependence on $|S|$ similar to the theorems in [13, 22], in contrast to the optimal logarithmic dependence on $|S|$ that is guaranteed by our results.

Traditionally, kernel-based algorithms involve computing the value of the kernel. Recently, kernel algorithms involving the derivatives of the kernel (i.e., where the Gram matrix consists of derivatives of the kernel computed at training samples) have been used to address numerous machine learning tasks, e.g., semi-supervised or Hermite learning with gradient information [28, 18], nonlinear variable selection [15, 16], (multi-task) gradient learning [26] and fitting of distributions in an infinite-dimensional exponential family [20]. Given the importance of these derivative-based kernel algorithms, similarly to [13], in Section 4 we propose a finite-dimensional random feature map approximation to kernel derivatives, which can be used to speed up the above-mentioned derivative-based kernel algorithms. We present a finite-sample bound that quantifies the quality of approximation in uniform and $L^r$-norms and show the rate of convergence to be $m^{-1/2}$ in both cases.

A summary of our contributions is as follows. We

1. provide the first detailed finite-sample performance analysis of RFFs for approximating kernels and their derivatives;

2. prove uniform and $L^r$ convergence on fixed compact sets with optimal rate in terms of the RFF dimension ($m$);

3. give sufficient conditions for the growth rate of compact sets while preserving a.s.
convergence uniformly and in $L^r$; specializing our result, we match the best attainable asymptotic growth rate.

¹[22] derived tighter constants compared to [13] and also considered different RFF implementations.
²For example, in applications like kernel ridge regression based on RFF, it is more appropriate to consider the approximation guarantee in the $L^2$ norm than in the uniform norm.

Various notations and definitions that are used throughout the paper are provided in Section 2, along with a brief review of the RFF approximation proposed by [13]. The missing proofs of the results in Sections 3 and 4 are provided in the supplementary material.

2 Notations & preliminaries

In this section, we introduce notation that is used throughout the paper and then present preliminaries on kernel approximation through random feature maps as introduced by [13].

Definitions & Notation: For a topological space $\mathcal{X}$, $C(\mathcal{X})$ (resp. $C_b(\mathcal{X})$) denotes the space of all continuous (resp. bounded continuous) functions on $\mathcal{X}$. For $f \in C_b(\mathcal{X})$, $\|f\|_{\mathcal{X}} := \sup_{x\in\mathcal{X}} |f(x)|$ is the supremum norm of $f$. $M_b(\mathcal{X})$ and $M_+^1(\mathcal{X})$ denote the sets of all finite Borel and probability measures on $\mathcal{X}$, respectively. For $\mu \in M_b(\mathcal{X})$, $L^r(\mathcal{X},\mu)$ denotes the Banach space of $r$-power ($r \ge 1$) $\mu$-integrable functions. For $\mathcal{X} \subseteq \mathbb{R}^d$, we write $L^r(\mathcal{X})$ for $L^r(\mathcal{X},\mu)$ if $\mu$ is the Lebesgue measure on $\mathcal{X}$. For $f \in L^r(\mathcal{X},\mu)$, $\|f\|_{L^r(\mathcal{X},\mu)} := \big(\int_{\mathcal{X}} |f|^r\,d\mu\big)^{1/r}$ denotes the $L^r$-norm of $f$ for $1 \le r < \infty$, and we write it as $\|\cdot\|_{L^r(\mathcal{X})}$ if $\mathcal{X} \subseteq \mathbb{R}^d$ and $\mu$ is the Lebesgue measure. For any $f \in L^1(\mathcal{X},\mathbb{P})$ where $\mathbb{P} \in M_+^1(\mathcal{X})$, we define $\mathbb{P}f := \int_{\mathcal{X}} f(x)\,d\mathbb{P}(x)$ and $\mathbb{P}_m f := \frac{1}{m}\sum_{i=1}^m f(X_i)$, where $(X_i)_{i=1}^m \overset{i.i.d.}{\sim} \mathbb{P}$, $\mathbb{P}_m := \frac{1}{m}\sum_{i=1}^m \delta_{X_i}$ is the empirical measure and $\delta_x$ is a Dirac measure supported on $x \in \mathcal{X}$. $\mathrm{supp}(\mathbb{P})$ denotes the support of $\mathbb{P}$.
$\mathbb{P}^m := \otimes_m \mathbb{P}$ denotes the $m$-fold product measure.

For $v := (v_1,\ldots,v_d) \in \mathbb{R}^d$, $\|v\|_2 := \sqrt{\sum_{i=1}^d v_i^2}$. The diameter of $A \subseteq \mathcal{Y}$, where $(\mathcal{Y},\rho)$ is a metric space, is defined as $|A|_\rho := \sup\{\rho(x,y) : x,y \in A\}$. If $\mathcal{Y} = \mathbb{R}^d$ with $\rho = \|\cdot\|_2$, we denote the diameter of $A$ as $|A|$; $|A| < \infty$ if $A$ is compact. The volume of $A \subseteq \mathbb{R}^d$ is defined as $\mathrm{vol}(A) := \int_A 1\,dx$. For $A \subseteq \mathbb{R}^d$, we define $A_\Delta := A - A = \{x - y : x,y \in A\}$. $\mathrm{conv}(A)$ is the convex hull of $A$. For a function $g$ defined on an open set $B \subseteq \mathbb{R}^d \times \mathbb{R}^d$,
$$\partial^{p,q} g(x,y) := \frac{\partial^{|p|+|q|} g(x,y)}{\partial x_1^{p_1} \cdots \partial x_d^{p_d}\,\partial y_1^{q_1} \cdots \partial y_d^{q_d}}, \qquad (x,y) \in B,$$
where $p,q \in \mathbb{N}^d$ are multi-indices, $|p| = \sum_{j=1}^d p_j$ and $\mathbb{N} := \{0,1,2,\ldots\}$. Define $v^p := \prod_{j=1}^d v_j^{p_j}$. For positive sequences $(a_n)_{n\in\mathbb{N}}$ and $(b_n)_{n\in\mathbb{N}}$, $a_n = o(b_n)$ if $\lim_{n\to\infty} a_n/b_n = 0$. $X_n = O_p(r_n)$ (resp. $O_{a.s.}(r_n)$) denotes that $X_n/r_n$ is bounded in probability (resp. almost surely). $\Gamma(t) = \int_0^\infty x^{t-1} e^{-x}\,dx$ is the Gamma function; $\Gamma(\tfrac12) = \sqrt{\pi}$ and $\Gamma(t+1) = t\,\Gamma(t)$.

Random feature maps: Let $k : \mathbb{R}^d \times \mathbb{R}^d \to \mathbb{R}$ be a bounded, continuous, positive definite, translation-invariant kernel, i.e., there exists a positive definite function $\psi \in C_b(\mathbb{R}^d)$ such that $k(x,y) = \psi(x-y)$, $x,y \in \mathbb{R}^d$. By Bochner's theorem [24, Theorem 6.6], $\psi$ can be represented as the Fourier transform of a finite non-negative Borel measure $\Lambda$ on $\mathbb{R}^d$, i.e.,
$$k(x,y) = \psi(x-y) = \int_{\mathbb{R}^d} e^{\sqrt{-1}\,\omega^T(x-y)}\,d\Lambda(\omega) \overset{(\star)}{=} \int_{\mathbb{R}^d} \cos\big(\omega^T(x-y)\big)\,d\Lambda(\omega), \qquad (1)$$
where $(\star)$ follows from the fact that $\psi$ is real-valued and symmetric. Since $\Lambda(\mathbb{R}^d) = \psi(0)$, $k(x,y) = \psi(0)\int e^{\sqrt{-1}\,\omega^T(x-y)}\,d\mathbb{P}(\omega)$, where $\mathbb{P} := \frac{\Lambda}{\psi(0)} \in M_+^1(\mathbb{R}^d)$. Therefore, w.l.o.g., we assume throughout the paper that $\psi(0) = 1$, so that $\Lambda \in M_+^1(\mathbb{R}^d)$. Based on (1), [13] proposed an approximation to $k$ by replacing $\Lambda$ with its empirical measure $\Lambda_m$, constructed from $(\omega_i)_{i=1}^m \overset{i.i.d.}{\sim} \Lambda$, so that the resultant approximation can be written as the Euclidean inner product of finite-dimensional random feature maps, i.e.,
$$\hat{k}(x,y) = \frac{1}{m}\sum_{i=1}^m \cos\big(\omega_i^T(x-y)\big) \overset{(*)}{=} \langle \phi(x), \phi(y)\rangle_{\mathbb{R}^{2m}}, \qquad (2)$$
where $\phi(x) = \frac{1}{\sqrt{m}}\big(\cos(\omega_1^T x),\ldots,\cos(\omega_m^T x),\sin(\omega_1^T x),\ldots,\sin(\omega_m^T x)\big)$ and $(*)$ holds by the basic trigonometric identity $\cos(a-b) = \cos a \cos b + \sin a \sin b$. This elegant approximation to $k$ is particularly useful in speeding up kernel-based algorithms, as the finite-dimensional random feature map $\phi$ can be used to solve these algorithms in the primal, thereby offering better computational complexity (than solving them in the dual) while at the same time not lacking in performance. Apart from these practical advantages, [13, Claim 1] (and similarly, [22, Prop. 1]) provides a theoretical guarantee that $\|\hat{k} - k\|_{S\times S} \to 0$ as $m \to \infty$ for any compact set $S \subset \mathbb{R}^d$.
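For concreteness, the feature map $\phi$ of (2) can be sketched numerically for the Gaussian kernel $k(x,y) = \exp(-\|x-y\|_2^2/2)$, whose spectral measure $\Lambda$ is the standard Gaussian $N(0, I_d)$; the dimensions below are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)
d, m = 3, 2000

# Gaussian kernel k(x, y) = psi(x - y) = exp(-||x - y||^2 / 2); its spectral
# measure Lambda is N(0, I_d), a probability measure since psi(0) = 1.
omegas = rng.normal(size=(m, d))         # (omega_i)_{i=1}^m iid from Lambda

def phi(x):
    # phi(x) = m^{-1/2} (cos(w_1^T x), ..., cos(w_m^T x),
    #                    sin(w_1^T x), ..., sin(w_m^T x))
    t = omegas @ x
    return np.concatenate([np.cos(t), np.sin(t)]) / np.sqrt(m)

x, y = rng.normal(size=d), rng.normal(size=d)
k_true = np.exp(-np.linalg.norm(x - y) ** 2 / 2)
k_hat = phi(x) @ phi(y)                  # = (1/m) sum_i cos(omega_i^T (x - y))
print(abs(k_hat - k_true))               # fluctuation of order m^{-1/2}
```

The inner product of the two $2m$-dimensional features equals the empirical average $\frac{1}{m}\sum_i \cos(\omega_i^T(x-y))$, exactly the identity $(*)$ in (2).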
Formally, [13, Claim 1] showed that, for any $\epsilon > 0$ (note that (3) is slightly different from, but more precise than, the statement of Claim 1 in [13]),
$$\Lambda^m\Big(\big\{(\omega_i)_{i=1}^m : \|\hat{k} - k\|_{S\times S} \ge \epsilon\big\}\Big) \le C_d \big(|S|\sigma\epsilon^{-1}\big)^{\frac{2d}{d+2}}\, e^{-\frac{m\epsilon^2}{4(d+2)}}, \qquad (3)$$
where $\sigma^2 := \int \|\omega\|_2^2\,d\Lambda(\omega)$ and $C_d := 2^{\frac{6d+2}{d+2}}\big(\big(\tfrac{2}{d}\big)^{\frac{d}{d+2}} + \big(\tfrac{d}{2}\big)^{\frac{2}{d+2}}\big) \le 2^7$ when $d \ge 2$. The condition $\sigma^2 < \infty$ implies that $\psi$ (and therefore $k$) is twice differentiable. From (3) it is clear that the probability has polynomial tails if $\epsilon < |S|\sigma$ (i.e., small $\epsilon$) and Gaussian tails if $\epsilon \ge |S|\sigma$ (i.e., large $\epsilon$), and (3) can be equivalently written as
$$\Lambda^m\Big(\big\{(\omega_i)_{i=1}^m : \|\hat{k} - k\|_{S\times S} \ge C|S|\sigma\sqrt{m^{-1}\log m}\big\}\Big) \le C_d\, C^{-\frac{2d}{d+2}}\, m^{\alpha}\,(\log m)^{-\frac{d}{d+2}}, \qquad (4)$$
where $\alpha := \frac{4d - C^2|S|^2\sigma^2}{4(d+2)}$. For $|S|$ sufficiently large (i.e., $\alpha < 0$), it follows from (4) that
$$\|\hat{k} - k\|_{S\times S} = O_p\big(|S|\sqrt{m^{-1}\log m}\big). \qquad (5)$$
While (5) shows that $\hat{k}$ is a consistent estimator of $k$ in the topology of compact convergence (i.e., $\hat{k}$ converges to $k$ uniformly over compact sets), the rate of convergence $\sqrt{(\log m)/m}$ is not optimal. In addition, the order of dependence on $|S|$ is not optimal. While a faster (in fact, an optimal) rate of convergence is desired, since better rates in (5) can lead to better convergence rates for the excess error of the kernel machine constructed using $\hat{k}$, the order of dependence on $|S|$ is also important, as it determines the number of RFF features (i.e., $m$) needed to achieve a given approximation accuracy.
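A small simulation makes the decay of the sup-error with $m$ concrete; the set $S = [-1,1]^2$ (so $S_\Delta = [-2,2]^2$, approximated by a finite grid), the kernel and all sizes are illustrative choices. Under an $m^{-1/2}$-type rate, increasing $m$ by a factor of 100 should shrink the error by roughly a factor of 10 (a $\sqrt{\log m}$ factor would not be visible at this scale).

```python
import numpy as np

rng = np.random.default_rng(0)
d = 2
# Grid of difference vectors z = x - y covering S_Delta = [-2, 2]^2
# for the compact set S = [-1, 1]^2.
g = np.linspace(-2.0, 2.0, 41)
Z = np.array(np.meshgrid(g, g)).reshape(2, -1).T
psi = np.exp(-np.sum(Z ** 2, axis=1) / 2)       # Gaussian psi(z), Lambda = N(0, I)

def sup_error(m, reps=20):
    # Average (over reps draws of the features) of the sup over the grid of
    # |psi_hat(z) - psi(z)|, where psi_hat is the empirical average in (2).
    errs = []
    for _ in range(reps):
        W = rng.normal(size=(m, d))             # omega_i ~ Lambda
        psi_hat = np.cos(Z @ W.T).mean(axis=1)
        errs.append(np.max(np.abs(psi_hat - psi)))
    return float(np.mean(errs))

e_small, e_large = sup_error(50), sup_error(5000)
print(e_small, e_large, e_small / e_large)      # ratio roughly sqrt(5000/50) = 10
```

The sup over a finite grid only lower-bounds the sup over $S_\Delta$, but for a smooth $\psi$ and a dense grid the two are close, which is enough for this qualitative check.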
In fact, the order of dependence on $|S|$ controls the rate at which $|S|$ can grow as a function of $m$ as $m \to \infty$ (see Remark 1(ii) for a detailed discussion of the significance of growing $|S|$). In the following section, we present an analogue of (4) (see Theorem 1) that provides optimal rates and has the correct dependence on $|S|$.

3 Main results: approximation of k

As discussed in Sections 1 and 2, while the random feature map approximation of $k$ introduced by [13] has many practical advantages, it does not seem to be theoretically well understood. The existing theoretical results on the quality of approximation do not provide a complete picture owing to their non-optimality. In this section, we first present our main result (see Theorem 1), which improves upon (4) and provides a rate of $m^{-1/2}$ with logarithmic dependence on $|S|$. We then discuss the consequences of Theorem 1, along with its optimality, in Remark 1. Next, in Corollary 2 and Theorem 3, we discuss the $L^r$-convergence ($1 \le r < \infty$) of $\hat{k}$ to $k$ over compact subsets of $\mathbb{R}^d$.

Theorem 1. Suppose $k(x,y) = \psi(x-y)$, $x,y \in \mathbb{R}^d$, where $\psi \in C_b(\mathbb{R}^d)$ is positive definite and $\sigma^2 := \int \|\omega\|_2^2\,d\Lambda(\omega) < \infty$. Then for any $\tau > 0$ and any non-empty compact set $S \subset \mathbb{R}^d$,
$$\Lambda^m\left(\left\{(\omega_i)_{i=1}^m : \|\hat{k} - k\|_{S\times S} \ge \frac{h(d,|S|,\sigma) + \sqrt{2\tau}}{\sqrt{m}}\right\}\right) \le e^{-\tau},$$
where $h(d,|S|,\sigma) := 32\sqrt{2d\log(2|S|+1)} + 32\sqrt{2d\log(\sigma+1)} + 16\sqrt{2d}\,[\log(2|S|+1)]^{-1}$.

Proof (sketch). Note that $\|\hat{k} - k\|_{S\times S} = \sup_{x,y\in S} |\hat{k}(x,y) - k(x,y)| = \sup_{g\in\mathcal{G}} |\Lambda_m g - \Lambda g|$, where $\mathcal{G} := \{g_{x,y}(\omega) = \cos(\omega^T(x-y)) : x,y \in S\}$, which means the object of interest is the supremum of an empirical process indexed by $\mathcal{G}$.
Instead of bounding $\sup_{g\in\mathcal{G}} |\Lambda_m g - \Lambda g|$ by applying Hoeffding's inequality on a cover of $\mathcal{G}$ and then taking a union bound, as carried out in [13, 22], we use the refined technique of applying concentration via McDiarmid's inequality, followed by symmetrization, and bound the Rademacher average by the Dudley entropy bound. The result is obtained by carefully bounding the $L^2(\Lambda_m)$-covering number of $\mathcal{G}$. The details are provided in Section B.1 of the supplementary material.

Remark 1. (i) Theorem 1 shows that $\hat{k}$ is a consistent estimator of $k$ in the topology of compact convergence as $m \to \infty$, with the rate of a.s. convergence being $\sqrt{m^{-1}\log|S|}$ (almost sure convergence is guaranteed by the first Borel–Cantelli lemma). In comparison to (4), it is clear that Theorem 1 provides improved rates with better constants and logarithmic dependence on $|S|$ instead of a linear dependence. The logarithmic dependence on $|S|$ ensures that we need $m = O(\epsilon^{-2}\log|S|)$ random features instead of $O(\epsilon^{-2}|S|^2\log(|S|/\epsilon))$, i.e., significantly fewer features to achieve the same approximation accuracy $\epsilon$.

(ii) Growing diameter: While Theorem 1 provides almost sure convergence uniformly over compact sets, one might wonder whether it is possible to achieve uniform convergence over $\mathbb{R}^d$. [7, Section 2] showed that such a result is possible if $\Lambda$ is a discrete measure, but not if $\Lambda$ is absolutely continuous w.r.t. the Lebesgue measure (i.e., if $\Lambda$ has a density). Since uniform convergence of $\hat{k}$ to $k$ over $\mathbb{R}^d$ is not possible for many interesting $k$ (e.g., the Gaussian kernel), it is of interest to study the convergence on $S$ whose diameter grows with $m$. Therefore, as mentioned in Section 2, the order of dependence of the rates on $|S|$ is critical.
Suppose $|S_m| \to \infty$ as $m \to \infty$ (we write $|S_m|$ instead of $|S|$ to show the explicit dependence on $m$). Then Theorem 1 shows that $\hat{k}$ is a consistent estimator of $k$ in the topology of compact convergence if $m^{-1}\log|S_m| \to 0$ as $m \to \infty$ (i.e., $|S_m| = e^{o(m)}$), in contrast to the result in (4), which requires $|S_m| = o(\sqrt{m/\log m})$. In other words, Theorem 1 ensures consistency even when $|S_m|$ grows exponentially in $m$, whereas (4) ensures consistency only if $|S_m|$ does not grow faster than $\sqrt{m/\log m}$.

(iii) Optimality: Note that $\psi$ is the characteristic function of $\Lambda \in M_+^1(\mathbb{R}^d)$, since $\psi$ is the Fourier transform of $\Lambda$ (by Bochner's theorem). Therefore, the object of interest, $\|\hat{k} - k\|_{S\times S} = \|\hat{\psi} - \psi\|_{S_\Delta}$, is the uniform norm of the difference between $\psi$ and the empirical characteristic function $\hat{\psi} = \frac{1}{m}\sum_{i=1}^m \cos(\langle\omega_i,\cdot\rangle)$, when both are restricted to a compact set $S_\Delta \subset \mathbb{R}^d$. The question of the convergence behavior of $\|\hat{\psi} - \psi\|_{S_\Delta}$ is not new and has been studied in great detail in the probability and statistics literature (e.g., see [7, 27] for $d = 1$ and [4, 5] for $d > 1$), where the characteristic function is not just a real-valued symmetric function (like $\psi$) but is Hermitian. [27, Theorems 6.2 and 6.3] show that the optimal rate of convergence of $\|\hat{\psi} - \psi\|_{S_\Delta}$ is $m^{-1/2}$ when $d = 1$, which matches our result in Theorem 1. Theorems 1 and 2 in [5] also show that the logarithmic dependence on $|S_m|$ is asymptotically optimal. In particular, [5, Theorem 1] matches the growing diameter result in Remark 1(ii), while [5, Theorem 2] shows that if $\Lambda$ is absolutely continuous w.r.t.
the Lebesgue measure and if $\limsup_{m\to\infty} m^{-1}\log|S_m| > 0$, then there exists a positive $\varepsilon$ such that $\limsup_{m\to\infty} \Lambda^m\big(\|\hat{\psi} - \psi\|_{S_{m,\Delta}} \ge \varepsilon\big) > 0$. This means the rate $|S_m| = e^{o(m)}$ is not only the best possible in general for almost sure convergence; if a faster sequence $|S_m|$ is considered, then even stochastic convergence cannot be retained for any characteristic function vanishing at infinity along at least one path. While these previous results match that of Theorem 1 (and its consequences), we would like to highlight the fact that all these previous results are asymptotic in nature, whereas Theorem 1 provides a finite-sample probabilistic inequality that holds for any $m$. We are not aware of any such finite-sample result except for the one in [13, 22]. □

Using Theorem 1, one can obtain a probabilistic inequality for the $L^r$-norm of $\hat{k} - k$ over any compact set $S \subset \mathbb{R}^d$, as given by the following result.

Corollary 2. Suppose $k$ satisfies the assumptions in Theorem 1. Then for any $1 \le r < \infty$, $\tau > 0$ and non-empty compact set $S \subset \mathbb{R}^d$,
$$\Lambda^m\left(\left\{(\omega_i)_{i=1}^m : \|\hat{k} - k\|_{L^r(S)} \ge \left(\frac{\pi^{d/2}|S|^d}{2^d\,\Gamma(\frac{d}{2}+1)}\right)^{2/r} \frac{h(d,|S|,\sigma) + \sqrt{2\tau}}{\sqrt{m}}\right\}\right) \le e^{-\tau},$$
where $\|\hat{k} - k\|_{L^r(S)} := \|\hat{k} - k\|_{L^r(S\times S)} = \big(\int_S\int_S |\hat{k}(x,y) - k(x,y)|^r\,dx\,dy\big)^{1/r}$.

Proof.
Note that
$$\|\hat{k} - k\|_{L^r(S)} \le \|\hat{k} - k\|_{S\times S}\,\mathrm{vol}^{2/r}(S).$$
The result follows by combining Theorem 1 with the fact that $\mathrm{vol}(S) \le \mathrm{vol}(A)$, where $A := \big\{x \in \mathbb{R}^d : \|x\|_2 \le \frac{|S|}{2}\big\}$ and $\mathrm{vol}(A) = \frac{\pi^{d/2}|S|^d}{2^d\,\Gamma(\frac{d}{2}+1)}$ (which follows from [8, Corollary 2.55]).

Corollary 2 shows that $\|\hat{k} - k\|_{L^r(S)} = O_{a.s.}\big(m^{-1/2}|S|^{2d/r}\sqrt{\log|S|}\big)$, and therefore if $|S_m| \to \infty$ as $m \to \infty$, then consistency of $\hat{k}$ in the $L^r(S_m)$-norm is achieved as long as $m^{-1/2}|S_m|^{2d/r}\sqrt{\log|S_m|} \to 0$ as $m \to \infty$. This means that, in comparison to the uniform norm in Theorem 1, where $|S_m|$ can grow exponentially in $m^{\delta}$ ($\delta < 1$), $|S_m|$ cannot grow faster than $m^{\frac{r}{4d}-\theta}$ ($\theta > 0$) to achieve consistency in the $L^r$-norm.

Instead of using Theorem 1 to obtain a bound on $\|\hat{k} - k\|_{L^r(S)}$ (this bound may be weak, as $\|\hat{k} - k\|_{L^r(S)} \le \|\hat{k} - k\|_{S\times S}\,\mathrm{vol}^{2/r}(S)$ for any $1 \le r < \infty$), a better bound (for $2 \le r < \infty$) can be obtained by directly bounding $\|\hat{k} - k\|_{L^r(S)}$, as shown in the following result.

Theorem 3. Suppose $k(x,y) = \psi(x-y)$, $x,y \in \mathbb{R}^d$, where $\psi \in C_b(\mathbb{R}^d)$ is positive definite. Then for any $1 < r < \infty$, $\tau > 0$ and non-empty compact set $S \subset \mathbb{R}^d$,
$$\Lambda^m\left(\left\{(\omega_i)_{i=1}^m : \|\hat{k} - k\|_{L^r(S)} \ge \left(\frac{\pi^{d/2}|S|^d}{2^d\,\Gamma(\frac{d}{2}+1)}\right)^{2/r}\left(\frac{C'_r}{m^{1-\max\{\frac12,\frac1r\}}} + \frac{\sqrt{2\tau}}{\sqrt{m}}\right)\right\}\right) \le e^{-\tau},$$
where $C'_r$ is the Khintchine constant, given by $C'_r = 1$ for $r \in (1,2]$ and $C'_r = \sqrt{2}\,\big[\Gamma\big(\tfrac{r+1}{2}\big)/\sqrt{\pi}\big]^{1/r}$ for $r \in [2,\infty)$.

Proof (sketch).
As in Theorem 1, we show that $\|k - \hat{k}\|_{L^r(S)}$ satisfies the bounded difference property and hence, by McDiarmid's inequality, concentrates around its expectation $\mathbb{E}\|k - \hat{k}\|_{L^r(S)}$. By symmetrization, we then show that $\mathbb{E}\|k - \hat{k}\|_{L^r(S)}$ is upper bounded in terms of $\mathbb{E}_\varepsilon\|\sum_{i=1}^m \varepsilon_i \cos(\langle\omega_i, \cdot - \cdot\rangle)\|_{L^r(S)}$, where $\varepsilon := (\varepsilon_i)_{i=1}^m$ are Rademacher random variables. The result follows by exploiting the fact that $L^r(S)$ is a Banach space of type $\min\{r,2\}$. The details are provided in Section B.2 of the supplementary material.

Remark 2. Theorem 3 shows an improved dependence on $|S|$, without the extra $\sqrt{\log|S|}$ factor appearing in Corollary 2, and therefore provides a better rate for $2 \le r < \infty$ when the diameter of $S$ grows, i.e., $\|\hat{k} - k\|_{L^r(S_m)} \overset{a.s.}{\to} 0$ if $|S_m| = o(m^{\frac{r}{4d}})$ as $m \to \infty$. However, for $1 < r < 2$, Theorem 3 provides a slower rate than Corollary 2, so it is then appropriate to use the bound in Corollary 2. While one might wonder why we only considered the convergence of $\|\hat{k} - k\|_{L^r(S)}$ and not $\|\hat{k} - k\|_{L^r(\mathbb{R}^d)}$, it is important to note that the latter is not well-defined, because $\hat{k} \notin L^r(\mathbb{R}^d)$ even if $k \in L^r(\mathbb{R}^d)$. □

4 Approximation of kernel derivatives

In the previous section, we focused on the approximation of the kernel function: we presented uniform and $L^r$ convergence guarantees on compact sets for the random Fourier feature approximation, and discussed how fast the diameter of these sets can grow while preserving uniform and $L^r$ convergence almost surely. In this section, we propose an approximation to derivatives of the kernel and analyze the uniform and $L^r$ convergence behavior of the proposed approximation. As motivated
As motivated\nin Section 1, the question of approximating the derivatives of the kernel through \ufb01nite dimensional\nrandom feature map is also important as it enables to speed up several interesting machine learning\ntasks that involve the derivatives of the kernel [28, 18, 15, 16, 26, 20], see for example the recent\nin\ufb01nite dimensional exponential family \ufb01tting technique [21], which implements this idea.\nTo this end, we consider k as in (1) and de\ufb01ne ha := cos( \u03c0a\n2 + \u00b7), a \u2208 N (in other words\nh0 = cos, h1 = \u2212 sin, h2 = \u2212 cos, h3 = sin and ha = ha mod 4). For p, q \u2208 Nd, assuming\nR |\u03c9\n\np+q| d\u039b(\u03c9) < \u221e, it follows from the dominated convergence theorem that\n\u2202p,qk(x, y) =ZRd\n=ZRd\n\nT (x \u2212 y)(cid:1) d\u039b(\u03c9)\nT y) + h3+|p|(\u03c9\n\np(\u2212\u03c9)qh|p+q|(cid:0)\u03c9\np+q(cid:2)h|p|(\u03c9\n\nT y)(cid:3) d\u039b(\u03c9),\n\nso that \u2202p,qk(x, y) can be approximated by replacing \u039b with \u039bm, resulting in\n\nT x)h3+|q|(\u03c9\n\nT x)h|q|(\u03c9\n\n\u03c9\n\n\u03c9\n\n\\\u2202p,qk(x, y) := sp,q(x, y) =\n\n1\nm\n\nm\n\nXj=1\n\n\u03c9\n\np\n\nj (\u2212\u03c9j)qh|p+q|(cid:0)\u03c9\n\nT\n\nj (x \u2212 y)(cid:1) = h\u03c6p(x), \u03c6q(y)iR2m , (6)\n\n6\n\n\fp\n\np\n\np\n\np\n\nj=1\n\nmu), \u03c9\n\n1 h|p|(\u03c9T\n\nmh|p|(\u03c9T\n\n1 u),\u00b7\u00b7\u00b7 , \u03c9\n\n1 u),\u00b7\u00b7\u00b7 , \u03c9\n\n1 h3+|p|(\u03c9T\n\nmh3+|p|(\u03c9T\n\nmu)(cid:1)\ni.i.d.\u223c \u039b. Now the goal is to understand the behavior of ksp,q \u2212 \u2202p,qkkS\u00d7S and\n\nwhere \u03c6p(u) := 1\u221am(cid:0)\u03c9\nand (\u03c9j)m\nksp,q \u2212 \u2202p,qkkLr(S) for r \u2208 [1,\u221e), i.e., obtain analogues of Theorems 1 and 3.\nAs in the proof sketch of Theorem 1, while ksp,q\u2212\u2202p,qkkS\u00d7S can be analyzed as the suprema of an\nempirical process indexed by a suitable function class (say G), some technical issues arise because\nG is not uniformly bounded. 
This means that McDiarmid's or Talagrand's inequality cannot be applied to achieve concentration, and bounding the Rademacher average by the Dudley entropy bound may not be reasonable. While these issues can be tackled by resorting to more technical and refined methods, in this paper we generalize Theorem 1 to derivatives (see Theorem 4, which is proved in Section B.1 of the supplement) under the restrictive assumption that $\mathrm{supp}(\Lambda)$ is bounded (note that many popular kernels, including the Gaussian, do not satisfy this assumption). We also present another result (see Theorem 5), obtained by generalizing the proof technique³ of [13] to unbounded functions, in which the boundedness assumption on $\mathrm{supp}(\Lambda)$ is relaxed, but at the expense of a worse rate (compared to Theorem 4).

Theorem 4. Let $p,q \in \mathbb{N}^d$, $T_{p,q} := \sup_{\omega\in\mathrm{supp}(\Lambda)} |\omega^{p+q}|$ and $C_{p,q} := \mathbb{E}_{\omega\sim\Lambda}\big[|\omega^{p+q}|\,\|\omega\|_2^2\big]$, and assume that $C_{2p,2q} < \infty$. Suppose $\mathrm{supp}(\Lambda)$ is bounded if $p \ne 0$ and $q \ne 0$. Then for any $\tau > 0$ and non-empty compact set $S \subset \mathbb{R}^d$,
$$\Lambda^m\left(\left\{(\omega_i)_{i=1}^m : \|\partial^{p,q}k - s_{p,q}\|_{S\times S} \ge \frac{H(d,p,q,|S|) + T_{p,q}\sqrt{2\tau}}{\sqrt{m}}\right\}\right) \le e^{-\tau},$$
where
$$H(d,p,q,|S|) = 32\sqrt{2d\,T_{2p,2q}}\left[\sqrt{U(p,q,|S|)} + \frac{1}{2\sqrt{U(p,q,|S|)}} + \sqrt{\log\big(\sqrt{C_{2p,2q}}+1\big)}\right], \qquad U(p,q,|S|) = \log\big(2|S|\,T_{2p,2q}^{-1/2} + 1\big).$$

Remark 3. (i) Note that Theorem 4 reduces to Theorem 1 if $p = q = 0$, in which case $T_{p,q} = T_{2p,2q} = 1$. If $p \ne 0$ or $q \ne 0$, then the boundedness of $\mathrm{supp}(\Lambda)$ implies that $T_{p,q} < \infty$ and $T_{2p,2q} < \infty$.

(ii) Growth of $|S_m|$: By the same reasoning as in Remark 1(ii) and Corollary 2, it follows that $\|\partial^{p,q}k - s_{p,q}\|_{S_m\times S_m} \overset{a.s.}{\to} 0$ if $|S_m| = e^{o(m)}$, and $\|\partial^{p,q}k - s_{p,q}\|_{L^r(S_m)} \overset{a.s.}{\to} 0$ if $m^{-1/2}|S_m|^{2d/r}\sqrt{\log|S_m|} \to 0$ (for $1 \le r < \infty$) as $m \to \infty$.
An exact analogue of Theorem 3 can be obtained (but with different constants) under the assumption that supp($\Lambda$) is bounded, and it can be shown that for $r \in [2, \infty)$, $\|\partial^{p,q} k - s^{p,q}\|_{L^r(S_m)} \overset{\text{a.s.}}{\longrightarrow} 0$ if $|S_m| = o(m^{\frac{r}{4d}})$. ■

The following result relaxes the boundedness of supp($\Lambda$) by imposing certain moment conditions on $\Lambda$, but at the expense of a worse rate. The proof relies on applying the Bernstein inequality to the elements of a net (which exists by the compactness of $S$), combined with a union bound, and on extending the approximation error from the anchors by a probabilistic Lipschitz argument.

Theorem 5. Let $p, q \in \mathbb{N}^d$, $\psi$ be continuously differentiable, $z \mapsto \nabla_z[\partial^{p,q} k(z)]$ be continuous, $S \subset \mathbb{R}^d$ be any non-empty compact set, $D_{p,q,S} := \sup_{z \in \mathrm{conv}(S_\Delta)} \|\nabla_z[\partial^{p,q} k(z)]\|_2$ and $E_{p,q} := \mathbb{E}_{\omega \sim \Lambda}\big[|\omega^{p+q}|\, \|\omega\|_2\big]$. Assume that $E_{p,q} < \infty$. Suppose there exist $L > 0$, $\sigma > 0$ such that
\u03c32LM\u22122\n\n2\n\n(\u2200M \u2265 2,\u2200z \u2208 S\u2206),\n\n(7)\n\n1\n\nj=1 (cid:2)cos(\u03c9\n\n3We also correct some technical issues in the proof of [13, Claim 1], where (i) a shift-invariant argument was\nT\nj y + bj) =\nj (x + y) + 2bj)(cid:3), (ii) the convexity of S was not imposed leading to\n\u2207[k(\u2206) \u2212\n(cid:13)2 was not taken into account, thus the upper bound on the expectation of the squared Lipschitz constant\n\napplied to the non-shift invariant kernel estimator \u02c6k(x, y) = 1\nm Pm\npossibly unde\ufb01ned Lipschitz constant (L) and (iii) the randomness of \u2206\u2217 = arg max\u2206\u2208S\u2206 (cid:13)\n(cid:13)\n\u02c6k(\u2206)](cid:13)\n(E[L2]) does not hold.\n\nT\nj (x \u2212 y)) + cos(\u03c9\n\nT\nj x + bj) cos(\u03c9\n\nj=1 2 cos(\u03c9\n\nm Pm\n\nT\n\n7\n\n\fd+1 + d\n\nd+1 .4 Then\n\n1\n\nwhere f (z; \u03c9) = \u2202p,qk(z) \u2212 \u03c9\n\n\u039bm ({(\u03c9i)m\n\u2264 2d\u22121e\n\ni=1 : k\u2202p,qk \u2212 sp,qkS\u00d7S \u2265 \u01eb}) \u2264\n\np(\u2212\u03c9)qh|p+q|(cid:0)\u03c9T z(cid:1). De\ufb01ne Fd := d\u2212 d\n(cid:21) d\nd+1 (cid:20)|S|(Dp,q,S + Ep,q)\nRemark 4. (i) The compactness of S implies that of S\u2206. Hence, by the continuity of z 7\u2192\n2 and E\u03c9\u223c\u039b(cid:2)|f (z; \u03c9)|2(cid:3) \u2264 \u03c32\n\u2207z [\u2202p,qk(z)], one gets Dp,q,S < \u221e. (7) holds if |f (z; \u03c9)| \u2264 L\n(\u2200z \u2208 S\u2206). 
If supp($\Lambda$) is bounded, then the boundedness of $f$ is guaranteed (see Section B.4 in the supplement).

(ii) In the special case when $p = q = 0$, our requirement boils down to the continuous differentiability of $\psi$, $E_{0,0} = \mathbb{E}_{\omega \sim \Lambda} \|\omega\|_2 < \infty$, and (7).

(iii) Note that (8) is similar to (3), and therefore, based on the discussion in Section 2, one has $\|\partial^{p,q} k - s^{p,q}\|_{S \times S} = O_{a.s.}\big(|S| \sqrt{m^{-1} \log m}\big)$. But the advantage of Theorem 5 over [13, Claim 1] and [22, Prop. 1] is that it can handle unbounded functions. In comparison to Theorem 4, we obtain worse rates, and it would be of interest to improve the rates of Theorem 5 while still handling unbounded functions. ■

5 Discussion

In this paper, we presented the first detailed theoretical analysis of the approximation quality of random Fourier features (RFF), which were proposed by [13] in the context of improving the computational complexity of kernel machines. While [13, 22] provided a probabilistic bound on the uniform approximation (over compact subsets of $\mathbb{R}^d$) of a kernel by random features, the result is not optimal. We improved this result by providing a finite-sample bound with an optimal rate of convergence, and also analyzed the quality of approximation in $L^r$-norm ($1 \leq r < \infty$).
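The uniform guarantees above can also be observed empirically. The sketch below is our own illustration (not from the paper): for $p = q = (1)$ and the Gaussian spectral measure $\Lambda = N(0,1)$, whose support is unbounded and thus falls under Theorem 5 rather than Theorem 4, it measures $\sup_{x,y \in S} |\partial^{p,q} k - s^{p,q}|$ over a grid on the compact set $S = [-2, 2]$ and shows the roughly $m^{-1/2}$ decay. The helper name `sup_error` is ours.

```python
import numpy as np

rng = np.random.default_rng(1)
grid = np.linspace(-2.0, 2.0, 81)        # grid on the compact set S = [-2, 2], d = 1
delta = grid[:, None] - grid[None, :]    # all differences x - y, shape (81, 81)
# d^2 k/(dx dy) for the Gaussian kernel k(x, y) = exp(-(x - y)^2 / 2)
exact = (1.0 - delta ** 2) * np.exp(-delta ** 2 / 2.0)

def sup_error(m):
    """Sup over the grid of |d^{1,1}k - s^{1,1}| with m random features."""
    omega = rng.standard_normal(m)       # Lambda = N(0, 1): unbounded support
    # The two blocks of phi_1 contribute omega^2 cos*cos and omega^2 sin*sin terms,
    # whose sum telescopes to (1/m) sum_j omega_j^2 cos(omega_j (x - y)) = s^{1,1}(x, y).
    C = omega * np.cos(np.outer(grid, omega)) / np.sqrt(m)
    Sn = omega * np.sin(np.outer(grid, omega)) / np.sqrt(m)
    s = C @ C.T + Sn @ Sn.T
    return float(np.max(np.abs(s - exact)))

errs = [sup_error(m) for m in (100, 1_000, 10_000)]
print(errs)   # sup error shrinks roughly like m^{-1/2}
```

Since the derivative estimator is itself a finite-dimensional inner product, the whole grid evaluation reduces to two matrix products, which is the computational advantage the feature map is meant to deliver.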
We also proposed an RFF approximation for derivatives of a kernel and provided theoretical guarantees on the quality of approximation in uniform and $L^r$-norms over compact subsets of $\mathbb{R}^d$.

While all the results in this paper (and also in the literature) dealt with the approximation quality of RFF over only compact subsets of $\mathbb{R}^d$, it is of interest to understand its behavior over the entire $\mathbb{R}^d$. However, as discussed in Remark 1(ii) and in the paragraph following Theorem 3, RFF cannot approximate the kernel uniformly or in $L^r$-norm over $\mathbb{R}^d$. By truncating the Taylor series expansion of the exponential function, [3] proposed a non-random finite-dimensional representation to approximate the Gaussian kernel which also enjoys the computational advantages of RFF. However, this representation also does not approximate the Gaussian kernel uniformly over $\mathbb{R}^d$. Therefore, the question remains whether it is possible to approximate a kernel uniformly or in $L^r$-norm over $\mathbb{R}^d$ while still retaining the computational advantages associated with RFF.

Acknowledgments

Z. Szabó wishes to thank the Gatsby Charitable Foundation for its generous support.

References

[1] A. E. Alaoui and M. Mahoney. Fast randomized kernel ridge regression with statistical guarantees. In NIPS, 2015.

[2] K. Chaudhuri, C. Monteleoni, and A. D. Sarwate. Differentially private empirical risk minimization. Journal of Machine Learning Research, 12:1069–1109, 2011.

[3] A. Cotter, J. Keshet, and N. Srebro. Explicit approximations of the Gaussian kernel. Technical report, 2011. http://arxiv.org/pdf/1109.4603.pdf.

[4] S. Csörgő. Multivariate empirical characteristic functions. Zeitschrift für Wahrscheinlichkeitstheorie und Verwandte Gebiete, 55:203–229, 1981.

[5] S. Csörgő and V. Totik.
On how long interval is the empirical characteristic function uniformly consistent? Acta Scientiarum Mathematicarum, 45:141–149, 1983.

⁴$F_d$ is monotonically decreasing in $d$; $F_1 = 2$.

[6] P. Drineas and M. W. Mahoney. On the Nyström method for approximating a Gram matrix for improved kernel-based learning. Journal of Machine Learning Research, 6:2153–2175, 2005.

[7] A. Feuerverger and R. A. Mureika. The empirical characteristic function and its applications. Annals of Statistics, 5(1):88–98, 1977.

[8] G. B. Folland. Real Analysis: Modern Techniques and Their Applications. Wiley-Interscience, 1999.

[9] B. Kulis and K. Grauman. Kernelized locality-sensitive hashing. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34:1092–1104, 2012.

[10] D. Lopez-Paz, K. Muandet, B. Schölkopf, and I. Tolstikhin. Towards a learning theory of cause-effect inference. JMLR W&CP – ICML, pages 1452–1461, 2015.

[11] S. Maji, A. C. Berg, and J. Malik. Efficient classification for additive kernel SVMs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 35:66–77, 2013.

[12] J. Oliva, W. Neiswanger, B. Póczos, E. Xing, and J. Schneider. Fast function to function regression. JMLR W&CP – AISTATS, pages 717–725, 2015.

[13] A. Rahimi and B. Recht. Random features for large-scale kernel machines. In NIPS, pages 1177–1184, 2007.

[14] A. Rahimi and B. Recht. Uniform approximation of functions with random bases. In Allerton, pages 555–561, 2008.

[15] L. Rosasco, M. Santoro, S. Mosci, A. Verri, and S. Villa. A regularization approach to nonlinear variable selection. JMLR W&CP – AISTATS, 9:653–660, 2010.

[16] L. Rosasco, S. Villa, S. Mosci, M. Santoro, and A. Verri. Nonparametric sparsity and regularization. Journal of Machine Learning Research, 14:1665–1714, 2013.

[17] B. Schölkopf and A. J. Smola.
Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond. MIT Press, 2002.

[18] L. Shi, X. Guo, and D.-X. Zhou. Hermite learning with gradient data. Journal of Computational and Applied Mathematics, 233:3046–3059, 2010.

[19] Q. Shi, J. Petterson, G. Dror, J. Langford, A. Smola, A. Strehl, and V. Vishwanathan. Hash kernels. AISTATS, 5:496–503, 2009.

[20] B. K. Sriperumbudur, K. Fukumizu, A. Gretton, A. Hyvärinen, and R. Kumar. Density estimation in infinite dimensional exponential families. Technical report, 2014. http://arxiv.org/pdf/1312.3516.pdf.

[21] H. Strathmann, D. Sejdinovic, S. Livingstone, Z. Szabó, and A. Gretton. Gradient-free Hamiltonian Monte Carlo with efficient kernel exponential families. In NIPS, 2015.

[22] D. J. Sutherland and J. Schneider. On the error of random Fourier features. In UAI, pages 862–871, 2015.

[23] A. Vedaldi and A. Zisserman. Efficient additive kernels via explicit feature maps. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34:480–492, 2012.

[24] H. Wendland. Scattered Data Approximation. Cambridge University Press, 2005.

[25] C. K. I. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. In NIPS, pages 682–688, 2001.

[26] Y. Ying, Q. Wu, and C. Campbell. Learning the coordinate gradients. Advances in Computational Mathematics, 37:355–378, 2012.

[27] J. E. Yukich. Some limit theorems for the empirical process indexed by functions. Probability Theory and Related Fields, 74:71–90, 1987.

[28] D.-X. Zhou. Derivative reproducing properties for kernel methods in learning theory.
Journal of Computational and Applied Mathematics, 220:456–463, 2008.