{"title": "Learning Continuous Distributions: Simulations With Field Theoretic Priors", "book": "Advances in Neural Information Processing Systems", "page_first": 287, "page_last": 293, "abstract": null, "full_text": "Learning continuous distributions: \n\nSimulations with field theoretic priors \n\nlIya Nemenman1,2 and William Bialek2 \n\n1 Department of Physics, Princeton University, Princeton, New Jersey 08544 \n2NEC Research Institute, 4 Independence Way, Princeton, New Jersey 08540 \n\nnemenman@research.nj.nec.com, bialek@research.nj.nec.com \n\nAbstract \n\nLearning of a smooth but nonparametric probability density can be reg(cid:173)\nularized using methods of Quantum Field Theory. We implement a field \ntheoretic prior numerically, test its efficacy, and show that the free pa(cid:173)\nrameter of the theory (,smoothness scale') can be determined self con(cid:173)\nsistently by the data; this forms an infinite dimensional generalization of \nthe MDL principle. Finally, we study the implications of one's choice \nof the prior and the parameterization and conclude that the smoothness \nscale determination makes density estimation very weakly sensitive to \nthe choice of the prior, and that even wrong choices can be advantageous \nfor small data sets. \n\nOne of the central problems in learning is to balance 'goodness of fit' criteria against the \ncomplexity of models. An important development in the Bayesian approach was thus the \nrealization that there does not need to be any extra penalty for model complexity: if we \ncompute the total probability that data are generated by a model, there is a factor from the \nvolume in parameter space-the 'Occam factor' -that discriminates against models with \nmore parameters [1, 2]. This works remarkably welJ for systems with a finite number of \nparameters and creates a complexity 'razor' (after 'Occam's razor') that is almost equiv(cid:173)\nalent to the celebrated Minimal Description Length (MDL) principle [3]. 
In addition, if the a priori distributions involved are strictly Gaussian, the ideas have also been proven to apply to some infinite-dimensional (nonparametric) problems [4]. It is not clear, however, what happens if we leave the finite dimensional setting to consider nonparametric problems which are not Gaussian, such as the estimation of a smooth probability density. A possible route to progress on the nonparametric problem was opened by noticing [5] that a Bayesian prior for density estimation is equivalent to a quantum field theory (QFT). In particular, there are field theoretic methods for computing the infinite dimensional analog of the Occam factor, at least asymptotically for large numbers of examples. These observations have led to a number of papers [6, 7, 8, 9] exploring alternative formulations and their implications for the speed of learning. Here we return to the original formulation of Ref. [5] and use numerical methods to address some of the questions left open by the analytic work [10]: What is the result of balancing the infinite dimensional Occam factor against the goodness of fit? Is the QFT inference optimal in using all of the information relevant for learning [11]? What happens if our learning problem is strongly atypical of the prior distribution? \nFollowing Ref. [5], if N i.i.d. samples {x_i}, i = 1 ... N, are observed, then the probability that a particular density Q(x) gave rise to these data is given by \n\nP[Q(x)|{x_i}] = P[Q(x)] ∏_{i=1}^N Q(x_i) / ∫ [dQ(x)] P[Q(x)] ∏_{i=1}^N Q(x_i) ,   (1) \n\nwhere P[Q(x)] encodes our a priori expectations of Q. Specifying this prior on a space of functions defines a QFT, and the optimal least square estimator is then \n\nQ_est(x|{x_i}) = ⟨Q(x) Q(x_1) Q(x_2) ... Q(x_N)⟩^(0) / ⟨Q(x_1) Q(x_2) ... Q(x_N)⟩^(0) ,   (2) \n\nwhere ⟨...⟩^(0) means averaging with respect to the prior. 
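To make Eqs. (1, 2) concrete, here is a toy Monte Carlo illustration of the prior-weighted estimator: candidate densities are drawn from a prior, weighted by their likelihood on the data, and averaged. A Dirichlet prior over histogram bins is used as a stand-in for the field-theoretic P[Q(x)] (this substitution, and all names and parameter values, are our own illustrative assumptions):

```python
import numpy as np

def bayes_density_estimate(samples, n_bins=20, n_prior_draws=5000, seed=0):
    """Monte Carlo version of Eq. (2): average candidate densities drawn
    from a prior, weighted by their likelihood prod_i Q(x_i) on the data.
    A Dirichlet prior over histogram heights stands in for P[Q(x)]."""
    rng = np.random.default_rng(seed)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    width = 1.0 / n_bins
    # Each row of Q is a candidate density on [0, 1], integrating to one.
    Q = rng.dirichlet(np.full(n_bins, 2.0), size=n_prior_draws) / width
    # Log-likelihood of the observed samples under each candidate.
    bins = np.clip(np.digitize(samples, edges) - 1, 0, n_bins - 1)
    logL = np.log(Q[:, bins]).sum(axis=1)
    w = np.exp(logL - logL.max())
    w /= w.sum()
    # Posterior-weighted average density, the analog of Eq. (2).
    return edges, w @ Q
```

Because each candidate density is normalized, the weighted average is automatically normalized as well; the field theory below replaces the crude histogram parameterization with a smoothness prior.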
Since Q(x) ≥ 0, it is convenient to define an unconstrained field φ(x), Q(x) = (1/ℓ_0) exp[-φ(x)]. Other definitions are also possible [6], but we think that most of our results do not depend on this choice. \n\nThe next step is to select a prior that regularizes the infinite number of degrees of freedom and allows learning. We want the prior P[φ] to make sense as a continuous theory, independent of discretization of x on small scales. We also require that when we estimate the distribution Q(x) the answer must be everywhere finite. These conditions imply that our field theory must be convergent at small length scales. For x in one dimension, a minimal choice is \n\nP[φ(x)] = (1/Z) exp[ -(ℓ^{2η-1}/2) ∫ dx (∂^η φ / ∂x^η)^2 ] δ[ (1/ℓ_0) ∫ dx e^{-φ(x)} - 1 ] ,   (3) \n\nwhere η > 1/2, Z is the normalization constant, and the δ-function enforces normalization of Q. We refer to ℓ and η as the smoothness scale and the exponent, respectively. \nIn [5] this theory was solved for large N and η = 1: \n\n⟨∏_{i=1}^N Q(x_i)⟩^(0) ≈ exp[ -S_eff(φ_cl; {x_i}; ℓ) ] ,   (4) \n\nS_eff = (ℓ/2) ∫ dx (∂_x φ_cl)^2 + ∑_{i=1}^N φ_cl(x_i) + (1/2) √(N/ℓ) ∫ dx √(Q_cl(x)) ,   (5) \n\nℓ ∂_x^2 φ_cl(x) + N Q_cl(x) = ∑_{i=1}^N δ(x - x_i) ,   (6) \n\nwhere φ_cl is the 'classical' (maximum likelihood, saddle point) solution, and Q_cl = (1/ℓ_0) exp[-φ_cl]. In the effective action [Eq. (5)], it is the square root term that arises from integrating over fluctuations around the classical solution (Occam factors). It was shown that Eq. (4) is nonsingular even at finite N, that the mean value of φ_cl converges to the negative logarithm of the target distribution P(x), [-log ℓ_0 P(x)], very quickly, and that the variance of the fluctuations ψ(x) = φ(x) - [-log ℓ_0 P(x)] falls off as ~ 1/√(ℓ N P(x)). Finally, it was speculated that if the actual ℓ is unknown one may average over it and hope that, much as in Bayesian model selection [2], the competition between the data and the fluctuations will select the optimal smoothness scale ℓ*. 
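A typical density from the prior, Eq. (3), can be drawn by truncating the Fourier series of φ, with mode amplitudes distributed per Eq. (8) below; this is the same construction used in the simulations later in the paper. The sketch assumes L = ℓ_0 = 1 and imposes the δ-functional normalization by rescaling on the grid (the function name and default values are ours):

```python
import numpy as np

def sample_density_from_prior(eta=1.0, ell=0.2, k_cut=200, n_grid=1024, seed=0):
    """Draw Q(x) = exp(-phi(x)) / Z on [0, 1) with phi a truncated Fourier
    series; mode-k amplitudes are a priori normal with standard deviation
    sigma_k = 2^{1/2} ell^{-(eta-1/2)} (1/(2 pi k))^eta  [cf. Eq. (8), L = 1].
    The delta-functional normalization of Eq. (3) is imposed by rescaling."""
    rng = np.random.default_rng(seed)
    x = np.arange(n_grid) / n_grid
    k = np.arange(1, k_cut + 1)
    sigma = np.sqrt(2.0) * ell ** -(eta - 0.5) * (1.0 / (2 * np.pi * k)) ** eta
    a = rng.normal(0.0, sigma)            # cosine amplitudes
    b = rng.normal(0.0, sigma)            # sine amplitudes
    phase = 2 * np.pi * np.outer(k, x)
    phi = a @ np.cos(phase) + b @ np.sin(phase)
    Q = np.exp(-phi)
    Q /= Q.mean()                         # enforce  int Q dx = 1  on the grid
    return x, Q
```

Larger η makes σ_k decay faster with k, so sampled densities are smoother; η ≤ 1/2 would make the high-k sum divergent, which is the UV divergence the text refers to.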
\nAt first glance the theory seems to look almost exactly like a Gaussian Process [4]. This impression is produced by the Gaussian form of the smoothness penalty in Eq. (3), and by the fluctuation determinant that plays against the goodness of fit in the smoothness scale (model) selection. However, both similarities are incomplete. The Gaussian penalty in the prior is amended by the normalization constraint, which gives rise to the exponential term in Eq. (6), and violates many familiar results that hold for Gaussian Processes, the representer theorem [12] being just one of them. In the semi-classical limit of large N, Gaussianity is restored approximately, but the classical solution is extremely nontrivial, and the fluctuation determinant is only the leading term of the Occam's razor, not the complete razor as it is for a Gaussian Process. In addition, it has no data dependence and is thus remarkably different from the usual determinants arising in the literature. \n\nThe algorithm to implement the discussed density estimation procedure numerically is rather simple. First, to make the problem well posed [10, 11] we confine x to a box 0 ≤ x ≤ L with periodic boundary conditions. The boundary value problem, Eq. (6), is then solved by a standard 'relaxation' (or Newton) method of iterative improvements to a guessed solution [13] (the target precision is always 10^{-5}). The independent variable x ∈ [0, 1] is discretized in equal steps [10^4 for Figs. (1.a-2.b), and 10^5 for Figs. (3.a, 3.b)]. We use an equally spaced grid to ensure stability of the method, while small step sizes are needed since the scale for variation of φ_cl(x) is [5] \n\nδx ~ √(ℓ / (N P(x))) ,   (7) \n\nwhich can be rather small for large N or small ℓ. 
\n\nSince the theory is short scale insensitive, we can generate random probability densities chosen from the prior by replacing φ with its Fourier series and truncating the latter at some sufficiently high wavenumber k_c [k_c = 1000 for Figs. (1.a-2.b), and 5000 for Figs. (3.a, 3.b)]. Then Eq. (3) enforces the amplitude of the k'th mode to be distributed a priori normally with the standard deviation \n\nσ_k = (2^{1/2} / ℓ^{η-1/2}) (L / 2πk)^η .   (8) \n\nCoded in such a way, the simulations are extremely computationally intensive. Therefore, Monte Carlo averagings given here are only over 500 runs, fluctuation determinants are calculated according to Eq. (5), not using numerical path integration, and Q_cl = (1/ℓ_0) exp[-φ_cl] is always used as an approximation to Q_est. \nAs an example of the algorithm's performance, Fig. (1.a) shows one particular learning run for η = 1 and ℓ = 0.2. We see that singularities and overfitting are absent even for N as low as 10. Moreover, the approach of Q_cl(x) to the actual distribution P(x) is remarkably fast: for N = 10, they are similar; for N = 1000, very close; for N = 100000, one needs to look carefully to see the difference between the two. \n\nTo quantify this similarity of distributions, we compute the Kullback-Leibler divergence D_KL(P||Q_est) between the true distribution P(x) and its estimate Q_est(x), and then average over the realizations of the data points and the true distribution. As discussed in [11], this learning curve Λ(N) measures the (average) excess cost incurred in coding the (N+1)'st data point because of the finiteness of the data sample, and thus can be called the \"universal learning curve\". If the inference algorithm uses all of the information contained in the data that is relevant for learning (\"predictive information\" [11]), then [5, 9, 11, 10] \n\nΛ(N) ~ (L/ℓ)^{1/2η} N^{1/2η - 1} . 
\n\n(9) \n\nWe test this prediction against the learning curves in the actual simulations. For η = 1 and ℓ = 0.4, 0.2, 0.05, these are shown on Fig. (1.b). One sees that the exponents are extremely close to the expected 1/2, and the ratios of the prefactors are within the errors from the predicted scaling ~ 1/√ℓ. All of this means that the proposed algorithm for finding densities not only works, but is at most a constant factor away from being optimal in using the predictive information of the sample set. \n\nNext we investigate how one's choice of the prior influences learning. We first stress that there is no such thing as a wrong prior. If one admits a possibility of it being wrong, then \n\nFigure 1: (a) Q_cl found for different N at ℓ = 0.2. (b) Λ as a function of N and ℓ. The best fits are: for ℓ = 0.4, Λ = (0.54 ± 0.07) N^{-0.483±0.014}; for ℓ = 0.2, Λ = (0.83 ± 0.08) N^{-0.493±0.09}; for ℓ = 0.05, Λ = (1.64 ± 0.16) N^{-0.507±0.09}. \n\nit does not encode all of the a priori knowledge! It does make sense, however, to ask what happens if the distribution we are trying to learn is an extreme outlier in the prior P[φ]. One way to generate such an example is to choose a typical function from a different prior P_1[φ], and this is what we mean by 'learning with a wrong prior.' If the prior is wrong in this sense, and learning is described by Eqs. 
(2-6), then we still expect the asymptotic behavior, Eq. (9), to hold; only the prefactors of Λ should change, and those must increase since there is an obvious advantage in having the right prior; we illustrate this in Figs. (2.a, 2.b). \nFor Fig. (2.a), both P_1[φ] and P[φ] are given by Eq. (3), but P_1 has the 'actual' smoothness scale ℓ_a = 0.4, 0.05, and for P the 'learning' smoothness scale is ℓ = 0.2 (we show the case ℓ_a = ℓ = 0.2 again as a reference). The Λ ~ 1/√N behavior is seen unmistakably. The prefactors are a bit larger (unfortunately, insignificantly) than the corresponding ones from Fig. (1.b), so we may expect that the 'right' ℓ, indeed, provides better learning (see later for a detailed discussion). \nFurther, Fig. (2.b) illustrates learning when not only ℓ, but also η is 'wrong' in the sense defined above. We illustrate this for η_a = 2, 0.8, 0.6, 0 (remember that only η_a > 0.5 removes UV divergences). Again, the inverse square root decay of Λ should be observed, and this is evident for η_a = 2. The η_a = 0.8, 0.6, 0 cases are different: even for N as high as 10^5 the estimate of the distribution is far from the target, thus the asymptotic regime is not reached. This is a crucial observation for our subsequent analysis of the smoothness scale determination from the data. Remarkably, Λ (both averaged and in the single runs shown) is monotonic, so even in the cases of qualitatively less smooth distributions there still is no overfitting. On the other hand, Λ is well above the asymptote for η_a = 2 and small N, which means that initially too many details are expected and wrongfully introduced into the estimate, but then they are almost immediately (N ~ 300) eliminated by the data. \nFollowing the argument suggested in [5], we now view P[φ], Eq. (3), as being a part of some wider model that involves a prior over ℓ. The details of the prior are irrelevant, however, if S_eff(ℓ), Eq. 
(5), has a minimum that becomes more prominent as N grows. We explicitly note that this mechanism is not tuning of the prior's parameters, but Bayesian inference at work: ℓ* emerges in a competition between the smoothness, the data, and the Occam terms to make S_eff smaller, and thus the total probability of the data larger. In its \n\nFigure 2: (a) Λ as a function of N and ℓ_a. Best fits are: for ℓ_a = 0.4, Λ = (0.56 ± 0.08) N^{-0.477±0.015}; for ℓ_a = 0.05, Λ = (1.90 ± 0.16) N^{-0.502±0.008}. Learning is always with ℓ = 0.2. (b) Λ as a function of N, η_a and ℓ_a. Best fits: for η_a = 2, ℓ_a = 0.1, Λ = (0.40 ± 0.05) N^{-0.493±0.013}; for η_a = 0.8, ℓ_a = 0.1, Λ = (1.06 ± 0.08) N^{-0.355±0.008}. ℓ = 0.2 for all graphs, but the one with η_a = 0, for which ℓ = 0.1. \n\nturn, larger probability means shorter total code length. \nThe data term, on average, is equal to N D_KL(P||Q_cl), and, for very regular P(x) (an implicit assumption in [5]), it is small. Thus only the kinetic and the Occam terms matter, and ℓ* ~ N^{1/3} [5]. For less regular distributions P(x), this is not true [cf. Fig. (2.b)]. For η = 1, Q_cl(x) approximates large-scale features of P(x) very well, but details at scales smaller than ~ √(ℓ L / N) are averaged out. If P(x) is taken from the prior, Eq. 
(3), with \nsome 1/a, then these details fall off with the wave number k as ,..., k-'T/a. Thus the data term \nis,..., N1.5-'T/a f'T/a- O.5 and is not necessarily small. For 1/a < 1.5 this dominates the kinetic \nterm and competes with the fluctuations to set \n\n1/a < 1.5. \n\nf* ,..., N('T/a- 1)/'T/a, \n\n(10) \nThere are two remarkable things about Eq. (10). First, for 1/a = 1, f* stabilizes at some \nconstant value, which we expect to be equal to fa. Second, even for 1/ f:. 1/a, Eqs. (9, 10) \nensure that A scales as ,..., N 1/ 2'T/a -1 , which is at worst a constant factor away from the best \nscaling, Eq. (9), achievable with the 'right' prior, 1/ = 1/a. So, by allowing f* to vary with \nN we can correctly capture the structure of models that are qualitatively different from our \nexpectations (1/ f:. 1/a) and produce estimates of Q that are extremely robust to the choice \nof the prior. To our knowledge, this feature has not been noted before in a reference to a \nnonparametric problem. \n\nWe present simulations relevant to these predictions in Figs. (3.a, 3.b). Unlike on the pre(cid:173)\nvious Figures, the results are not averaged due to extreme computational costs, so all our \nfurther claims have to be taken cautiously. On the other hand, selecting f* in single runs \nhas some practical advantages: we are able to ensure the best possible learning for any \nrealization of the data. Fig. (3.a) shows single learning runs for various 1/a and fa . In ad(cid:173)\ndition, to keep the Figure readable, we do not show runs with 1/a = 0.6, 0.7, 1.2, 1.5,3, \nand 1/a -+ 00, which is a finitely parameterizable distribution. All of these display a good \nagreement with the predicted scalings: Eq. (10) for 1/a < 1.5, and f* ,..., N 1/3 otherwise. \nNext we calculate the KL divergence between the target and the estimate at f = f*; the \naverage of this divergence over the samples and the prior is the learning curve [cf. Eq. (9)]. 
\nFor η_a = 0.8, 2 we plot the divergencies on Fig. (3.b) side by side with their fixed ℓ = 0.2 \n\nFigure 3: (a) Comparison of learning speed for the same data sets with different a priori assumptions [runs with (η_a, ℓ_a) = (1, 0.2), (0.8, 0.1), (2, 0.1), and η_a = 1 with variable ℓ_a of mean 0.12]. (b) Smoothness scale selection by the data [η_a = 0.8 and 2 at ℓ_a = 0.1, comparing ℓ = ℓ* to fixed ℓ = 0.2]. The lines that go off the axis for small N symbolize that S_eff monotonically decreases as ℓ → ∞. \n\nanalogues. Again, the predictions clearly are fulfilled. Note that for η_a ≠ η there is a qualitative advantage in using the data induced smoothness scale. \n\nThe last four Figures have illustrated some aspects of learning with 'wrong' priors. However, all of our results may be considered as belonging to the 'wrong prior' class. Indeed, the actual probability distributions we used were not nonparametric continuous functions with smoothness constraints, but were composed of k_c Fourier modes, and thus had 2 k_c parameters. For finite parameterization, asymptotic properties of learning usually do not depend on the priors (cf. [3, 11]), and priorless theories can be considered [14]. In such theories it would take well over 2 k_c samples to even start to close down on the actual value of the parameters, and yet a lot more to get accurate results. However, using the wrong continuous parameterization [φ(x)] we were able to obtain good fits for as low as 1000 samples [cf. Fig. (1.a)] with the help of the prior, Eq. (3). 
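The trade-off between the k_c-parameter description and the continuous field can be reduced to a one-line MDL-style criterion, using the complexity estimates quoted in the discussion below (predictive information ~ N^{1/2η} for the QFT model versus ~ k_c log N for the parametric one); setting all prefactors to one is our simplifying assumption:

```python
def prefer_qft(N, k_c, eta=1.0):
    """Occam-style choice: the QFT model's predictive information grows as
    ~ N^{1/(2 eta)}, the k_c-parameter model's as ~ k_c log N, so for
    k_c > N^{1/(2 eta)} the (possibly 'wrong') continuous formulation
    gives the shorter description. Prefactors are set to one."""
    return k_c > N ** (1.0 / (2.0 * eta))
```

For example, with η = 1 and N = 10^4 samples, any parametric model with more than 100 parameters is already more complex, in this sense, than the nonparametric field.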
Moreover, learning happened continuously and monotonically, without the huge chaotic jumps of overfitting that necessarily accompany any brute force parameter estimation method at low N. So, for some cases, a seemingly more complex model is actually easier to learn! \n\nThus our claim: when data are scarce and the parameters are abundant, one gains even by using the regularizing powers of wrong priors. The priors select some large scale features that are the most important to learn first and fill in the details as more data become available (see [11] on the relation of this to the Structural Risk Minimization theory). If the global features are dominant (arguably, this is generic), one actually wins in the learning speed [cf. Figs. (1.b, 2.a, 3.b)]. If, however, small scale details are as important, then one at least is guaranteed to avoid overfitting [cf. Fig. (2.b)]. \n\nOne can summarize this in an Occam-like fashion [11]: if two models provide equally good fits to data, the simpler one should always be used. In particular, the predictive information, which quantifies complexity [11], and of which Λ is the derivative, is ~ N^{1/2η} in a QFT model, and ~ k_c log N in the parametric case. So, for k_c > N^{1/2η}, one should prefer a 'wrong' QFT formulation to the correct finite parameter model. These results are very much in the spirit of our whole program: not only is the value of ℓ* selected that simplifies the description of the data, but the continuous parameterization itself serves the same purpose. This is an unexpectedly neat generalization of the MDL principle [3] to nonparametric cases. \n\nSummary: The field theoretic approach to density estimation not only regularizes the learning process but also allows the self-consistent selection of smoothness criteria through an infinite dimensional version of the Occam factors. 
We have shown numerically that this works, even more clearly than was conjectured: for η_a < 1.5, the learning curve truly becomes a property of the data, and not of the Bayesian prior! If we can extend these results to other η_a and combine this work with the reparameterization invariant formulation of [7, 8], this should give a complete theory of Bayesian learning for one dimensional distributions, and this theory has no arbitrary parameters. In addition, if this theory properly treats the limit η_a → ∞, we should be able to see how the well-studied finite dimensional Occam factors and the MDL principle arise from a more general nonparametric formulation. \n\nReferences \n\n[1] D. MacKay, Neural Comp. 4, 415-448 (1992). \n[2] V. Balasubramanian, Neural Comp. 9, 349-368 (1997), http://xxx.lanl.gov/abs/adap-org/9601001. \n[3] J. Rissanen. Stochastic Complexity in Statistical Inquiry. World Scientific, Singapore (1989). \n[4] D. MacKay, NIPS, Tutorial Lecture Notes (1997), ftp://wol.ra.phy.cam.ac.uk/pub/mackay/gp.ps.gz. \n[5] W. Bialek, C. Callan, and S. Strong, Phys. Rev. Lett. 77, 4693-4697 (1996), http://xxx.lanl.gov/abs/cond-mat/9607180. \n[6] T. Holy, Phys. Rev. Lett. 79, 3545-3548 (1997), http://xxx.lanl.gov/abs/physics/9706015. \n[7] V. Periwal, Phys. Rev. Lett. 78, 4671-4674 (1997), http://xxx.lanl.gov/abs/hep-th/9703135. \n[8] V. Periwal, Nucl. Phys. B 554 [FS], 719-730 (1999), http://xxx.lanl.gov/abs/adap-org/9801001. \n[9] T. Aida, Phys. Rev. Lett. 83, 3554-3557 (1999), http://xxx.lanl.gov/abs/cond-mat/9911474. \n[10] A more detailed version of our current analysis may be found in: I. Nemenman, Ph.D. Thesis, Princeton (2000), http://xxx.lanl.gov/abs/physics/0009032. \n[11] W. Bialek, I. Nemenman, N. Tishby. Preprint http://xxx.lanl.gov/abs/physics/0007070. \n[12] G. Wahba. In B. 
Schölkopf, C. J. C. Burges, and A. J. Smola, eds., Advances in Kernel Methods: Support Vector Learning, pp. 69-88. MIT Press, Cambridge, MA (1999), ftp://ftp.stat.wisc.edu/pub/wahba/nips97rr.ps. \n[13] W. Press et al. Numerical Recipes in C. Cambridge UP, Cambridge (1988). \n[14] V. Vapnik. Statistical Learning Theory. John Wiley & Sons, New York (1998). \n", "award": [], "sourceid": 1800, "authors": [{"given_name": "Ilya", "family_name": "Nemenman", "institution": null}, {"given_name": "William", "family_name": "Bialek", "institution": null}]}