{"title": "Dichotomize and Generalize: PAC-Bayesian Binary Activated Deep Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 6872, "page_last": 6882, "abstract": "We present a comprehensive study of multilayer neural networks with binary activation, relying on the PAC-Bayesian theory. Our contributions are twofold: (i) we develop an end-to-end framework to train a binary activated deep neural network, (ii) we provide nonvacuous PAC-Bayesian generalization bounds for binary activated deep neural networks. Our results are obtained by minimizing the expected loss of an architecture-dependent aggregation of binary activated deep neural networks. Our analysis inherently overcomes the fact that binary activation function is non-differentiable. The performance of our approach is assessed on a thorough numerical experiment protocol on real-life datasets.", "full_text": "Dichotomize and Generalize: PAC-Bayesian Binary\n\nActivated Deep Neural Networks\n\nGa\u00ebl Letarte\nUniversit\u00e9 Laval\n\nCanada\n\ngael.letarte.1@ulaval.ca\n\nBenjamin Guedj\n\nInria and University College London\n\nFrance and United Kingdom\nbenjamin.guedj@inria.fr\n\nPascal Germain\n\nInria\nFrance\n\npascal.germain@inria.fr\n\nFran\u00e7ois Laviolette\n\nUniversit\u00e9 Laval\n\nCanada\n\nfrancois.laviolette@ift.ulaval.ca\n\nAbstract\n\nWe present a comprehensive study of multilayer neural networks with binary\nactivation, relying on the PAC-Bayesian theory. Our contributions are twofold:\n(i) we develop an end-to-end framework to train a binary activated deep neural\nnetwork, (ii) we provide nonvacuous PAC-Bayesian generalization bounds for\nbinary activated deep neural networks. Our results are obtained by minimizing the\nexpected loss of an architecture-dependent aggregation of binary activated deep\nneural networks. Our analysis inherently overcomes the fact that binary activation\nfunction is non-differentiable. 
The performance of our approach is assessed on a thorough numerical experiment protocol on real-life datasets.

1 Introduction

The remarkable practical successes of deep learning make the need for better theoretical understanding all the more pressing. The PAC-Bayesian theory has recently emerged as a fruitful framework to analyze the generalization abilities of deep neural networks. Inspired by the precursor work of Langford and Caruana [2001], nonvacuous risk bounds for multilayer architectures have been obtained by Dziugaite and Roy [2017], Zhou et al. [2019]. Although informative, these results do not explicitly take into account the network architecture (number of layers, neurons per layer, type of activation function). A notable exception is the work of Neyshabur et al. [2018], which provides a PAC-Bayesian analysis relying on the network architecture and the choice of the ReLU activation function. The latter bound arguably gives insights on the generalization mechanism of neural networks (namely in terms of the spectral norms of the learned weight matrices), but its validity holds only under margin assumptions, and it is likely to be numerically vacuous.
We focus our study on deep neural networks with a sign activation function. We call such networks binary activated multilayer (BAM) networks. This specialization leads to nonvacuous generalization bounds which hold under the sole assumption that training samples are iid. We provide a PAC-Bayesian bound holding on the generalization error of a continuous aggregation of BAM networks. This leads to an original approach to train BAM networks, named PBGNet. The building block of PBGNet arises from the specialization of PAC-Bayesian bounds to linear classifiers [Germain et al., 2009], which we adapt to deep neural networks. The term binary neural networks has been coined by Bengio [2009], and further studied in Hubara et al. [2016, 2017], Soudry et al.
[2014]: it refers to neural networks for which both the activation functions and the weights are binarized (in contrast with BAM networks). These architectures are motivated by the desire to reduce the computation and memory footprints of neural networks.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Our theory-driven approach is validated on real-life datasets, showing competitive accuracy with tanh-activated multilayer networks, and providing nonvacuous generalization bounds.
Organisation of the paper. We formalize our framework and notation in Section 2, along with a presentation of the PAC-Bayes framework and its specialization to linear classifiers. Section 3 illustrates the key ideas we develop in the present paper on the simple case of a two-layer neural network. This is then generalized to deep neural networks in Section 4. We present our main theoretical result in Section 5: a PAC-Bayesian generalization bound for binary activated deep neural networks, and the associated learning algorithm. Section 6 presents the numerical experiment protocol and results. The paper closes with avenues for future work in Section 7.

2 Framework and notation

We stand in the supervised binary classification setting: given a real input vector^1 x ∈ R^{d_0}, one wants to predict a label y ∈ {−1, 1}. Let us consider a neural network of L fully connected layers with a (binary) sign activation function: sgn(a) = 1 if a > 0 and sgn(a) = −1 otherwise.^2 We let d_k denote the number of neurons of the kth layer, for k ∈ {1, . . . , L}; d_0 is the input data point dimension, and D := Σ_{k=1}^{L} d_{k−1} d_k is the total number of parameters. The output of the (deterministic) BAM network on an input data point x ∈ R^{d_0} is given by

f_θ(x) = sgn( W_L sgn( W_{L−1} sgn( . . .
sgn( W_1 x ) ) ) ) ,   (1)

where W_k ∈ R^{d_k × d_{k−1}} denotes the weight matrices. The network is thus parametrized by θ = vec({W_k}_{k=1}^{L}) ∈ R^D. The ith line of matrix W_k will be denoted w_k^i. For binary classification, the BAM network final layer W_L ∈ R^{1 × d_{L−1}} has one line (d_L = 1), that is a vector w_L ∈ R^{d_{L−1}}, and f_θ : R^{d_0} → {−1, 1}.

2.1 Elements from the PAC-Bayesian theory

The Probably Approximately Correct (PAC) framework [introduced by Valiant, 1984] holds under the frequentist assumption that data is sampled in an iid fashion from a data distribution D over the input-output space. The learning algorithm observes a finite training sample S = {(x_i, y_i)}_{i=1}^{n} ∼ D^n and outputs a predictor f : R^{d_0} → [−1, 1]. Given a loss function ℓ : [−1, 1]² → [0, 1], we define L_D(f) as the generalization loss on the data generating distribution D, and L̂_S(f) as the empirical error on the training set, given by

L_D(f) = E_{(x,y)∼D} ℓ(f(x), y) ,   and   L̂_S(f) = (1/n) Σ_{i=1}^{n} ℓ(f(x_i), y_i) .

PAC-Bayes considers the expected loss of an aggregation of predictors: considering a distribution Q (called the posterior) over a family of predictors F, one obtains PAC upper bounds on E_{f∼Q} L_D(f). Our work focuses on the linear loss ℓ(y', y) := ½(1 − yy'), for which the aggregated loss is equivalent to the loss of the predictor F_Q(x) := E_{f∼Q} f(x), performing a Q-aggregation of all predictors in F. In other words, we may upper bound with an arbitrarily high probability the generalization loss L_D(F_Q) = E_{f∼Q} L_D(f), by its empirical counterpart L̂_S(F_Q) = E_{f∼Q} L̂_S(f) and a complexity term, the
Kullback-Leibler divergence between Q and a reference measure P (called the prior distribution) chosen independently of the training set S, given by KL(Q‖P) := ∫ ln( Q(θ)/P(θ) ) Q(dθ).
Since the seminal works of Shawe-Taylor and Williamson [1997], McAllester [1999, 2003] and Catoni [2003, 2004, 2007], the celebrated PAC-Bayesian theorem has been declined in many forms [see Guedj, 2019, for a survey]. The following Theorems 1 and 2 will be useful in the sequel.

1. Bold uppercase letters denote matrices, bold lowercase letters denote vectors.
2. We consider the activation function as an element-wise operator when applied to vectors or matrices.

Theorem 1 (Seeger [2002], Maurer [2004]). Given a prior P on F, with probability at least 1 − δ over S ∼ D^n, for all Q on F:

kl( L̂_S(F_Q) ‖ L_D(F_Q) ) ≤ ( KL(Q‖P) + ln(2√n/δ) ) / n ,   (2)

where kl(q‖p) := q ln(q/p) + (1 − q) ln( (1−q)/(1−p) ) is the Kullback-Leibler divergence between Bernoulli distributions with probability of success p and q, respectively.

Theorem 2 (Catoni [2007]). Given P on F and C > 0, with probability at least 1 − δ over S ∼ D^n, for all Q on F:

L_D(F_Q) ≤ ( 1/(1 − e^{−C}) ) ( 1 − exp( −C L̂_S(F_Q) − ( KL(Q‖P) + ln(1/δ) ) / n ) ) .   (3)

From Theorems 1 and 2, we obtain PAC-Bayesian bounds on the linear loss of the Q-aggregated predictor F_Q. Given our binary classification setting, it is natural to predict a label by taking the sign of F_Q(·).
Thus, one may also be interested in the zero-one loss ℓ_{01}(y', y) := 1[sgn(y') ≠ y]; the bounds obtained from Theorems 1 and 2 can be turned into bounds on the zero-one loss with an extra multiplicative factor of 2, using the elementary inequality ℓ_{01}(F_Q(x), y) ≤ 2 ℓ(F_Q(x), y).

2.2 Elementary building block: PAC-Bayesian learning of linear classifiers

The PAC-Bayesian specialization to linear classifiers has been proposed by Langford and Shawe-Taylor [2002], and used for providing tight generalization bounds and a model selection criterion [further studied by Ambroladze et al., 2006, Langford, 2005, Parrado-Hernández et al., 2012]. This paved the way to the PAC-Bayesian bound minimization algorithm of Germain et al. [2009], which learns a linear classifier f_w(x) := sgn(w · x), with w ∈ R^d. The strategy is to consider a Gaussian posterior Q_w := N(w, I_d) and a Gaussian prior P_{w_0} := N(w_0, I_d) over the space of all linear predictors F_d := {f_v | v ∈ R^d} (where I_d denotes the d × d identity matrix). The posterior is used to define a linear predictor f_w, and the prior may have been learned on previously seen data; a common uninformative prior being the null vector w_0 = 0. With such a parametrization, KL(Q_w‖P_{w_0}) = ½‖w − w_0‖². Moreover, the Q_w-aggregated output can be written in terms of the Gauss error function erf(·). In Germain et al. [2009], the erf function is introduced as a loss function to be optimized. Here we interpret it as the predictor output, to be in phase with our neural network approach. Likewise, we study the linear loss of an aggregated predictor instead of the Gibbs risk of a stochastic classifier.
We obtain (explicit calculations are provided in Appendix A.1 for completeness)

F_w(x) := E_{v∼Q_w} f_v(x) = erf( (w · x) / (√2 ‖x‖) ) ,  with erf(x) := (2/√π) ∫_0^x e^{−t²} dt .   (4)

Given a training set S ∼ D^n, Germain et al. [2009] propose to minimize a PAC-Bayes upper bound on L_D(F_w) by gradient descent on w. This approach is appealing as the bounds are valid uniformly for all Q_w (see Equations 2 and 3). In other words, the algorithm provides both a learned predictor and a generalization guarantee that is rigorously valid (under the iid assumption) even when the optimization procedure did not find the global minimum of the cost function (either because it converges to a local minimum, or early stopping is used). Germain et al. [2009] investigate the optimization of several versions of Theorems 1 and 2. The minimization of Theorem 1 generally leads to tighter bound values, but empirical studies show lower accuracy, as the procedure conservatively prevents overfitting. The best empirical results are obtained by minimizing Theorem 2 for a fixed hyperparameter C, selected by cross-validation. Minimizing Equation (3) amounts to minimizing

C n L̂_S(F_w) + KL(Q_w‖P_{w_0}) = C Σ_{i=1}^{n} ( ½ + ½ erf( −y_i (w · x_i) / (√2 ‖x_i‖) ) ) + ½ ‖w − w_0‖² .   (5)

In their discussion, Germain et al. [2009] observe that the objective in Equation (5) is similar to the one optimized by the soft-margin Support Vector Machines [Cortes and Vapnik, 1995], by roughly interpreting the hinge loss max(0, 1 − yy') as a convex surrogate of the probit loss erf(−yy'). Likewise, Langford and Shawe-Taylor [2002] present this parameterization of the PAC-Bayes theorem as a margin bound.
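For illustration, Equation (4) and the objective of Equation (5) are simple to evaluate numerically; the following sketch (our own minimal Python with hypothetical helper names, not the authors' code) makes the computation explicit:

```python
import math

def linear_vote_output(w, x):
    # Q_w-aggregated prediction of Eq. (4): F_w(x) = erf(w.x / (sqrt(2) * ||x||))
    dot = sum(wi * xi for wi, xi in zip(w, x))
    norm = math.sqrt(sum(xi * xi for xi in x))
    return math.erf(dot / (math.sqrt(2) * norm))

def pbgd_objective(w, w0, data, C):
    # Objective of Eq. (5): the linear loss of F_w on each example,
    # 1/2 + 1/2 erf(-y_i w.x_i / (sqrt(2)||x_i||)), scaled by C,
    # plus the divergence term KL(Q_w || P_{w0}) = 1/2 ||w - w0||^2.
    loss = 0.0
    for x, y in data:
        dot = sum(wi * xi for wi, xi in zip(w, x))
        norm = math.sqrt(sum(xi * xi for xi in x))
        loss += 0.5 + 0.5 * math.erf(-y * dot / (math.sqrt(2) * norm))
    kl = 0.5 * sum((a - b) ** 2 for a, b in zip(w, w0))
    return C * loss + kl
```

Note the scale invariance: F_w(x) depends on x only through its direction, and the hyperparameter C trades the empirical loss term against the ½‖w − w_0‖² divergence term.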
In the following, we develop an original approach to neural networks based on a slightly different observation: the predictor output given by Equation (4) is reminiscent of the tanh activation used in classical neural networks (see Figure 3 in the appendix for a visual comparison). Therefore, just as the linear perceptron is viewed as the building block of modern multilayer neural networks, the PAC-Bayesian specialization to linear classifiers is the cornerstone of our theoretical and algorithmic framework for BAM networks.

3 The simple case of a one hidden layer network

Let us first consider a network with one hidden layer of size d_1. Hence, this network is parameterized by weights θ = vec({W_1, w_2}), with W_1 ∈ R^{d_1 × d_0} and w_2 ∈ R^{d_1}. Given an input x ∈ R^{d_0}, the output of the network is

f_θ(x) = sgn( w_2 · sgn(W_1 x) ) .   (6)

Following Section 2, we consider an isotropic Gaussian posterior distribution centered in θ, denoted Q_θ = N(θ, I_D), over the family of all networks F_D = {f_θ̃ | θ̃ ∈ R^D}. Thus, the prediction of the Q_θ-aggregate predictor is given by F_θ(x) = E_{θ̃∼Q_θ} f_θ̃(x). Note that Dziugaite and Roy [2017], Langford and Caruana [2001] also consider Gaussian distributions over neural network parameters. However, as their analysis is not specific to a particular activation function (experiments are performed with typical activation functions such as sigmoid or ReLU), the prediction relies on sampling the parameters according to the posterior. An originality of our approach is that, by studying the sign activation function, we can calculate the exact form of F_θ(x), as detailed below.

3.1 Deterministic network

Prediction.
To compute the value of F_θ(x), we first need to decompose the probability of each θ̃ = vec({V_1, v_2}) ∼ Q_θ as Q_θ(θ̃) = Q_1(V_1) Q_2(v_2), with Q_1 = N(W_1, I_{d_0 d_1}) and Q_2 = N(w_2, I_{d_1}). Then,

F_θ(x) = ∫_{R^{d_1×d_0}} Q_1(V_1) ∫_{R^{d_1}} Q_2(v_2) sgn( v_2 · sgn(V_1 x) ) dv_2 dV_1
       = ∫_{R^{d_1×d_0}} Q_1(V_1) erf( (w_2 · sgn(V_1 x)) / (√2 ‖sgn(V_1 x)‖) ) dV_1   (7)
       = Σ_{s∈{−1,1}^{d_1}} erf( (w_2 · s) / √(2 d_1) ) ∫_{R^{d_1×d_0}} 1[s = sgn(V_1 x)] Q_1(V_1) dV_1   (8)
       = Σ_{s∈{−1,1}^{d_1}} erf( (w_2 · s) / √(2 d_1) ) Ψ_s(x, W_1) ,   (9)

where, from Q_1(V_1) = Π_{i=1}^{d_1} Q_1^i(v_1^i) with Q_1^i := N(w_1^i, I_{d_0}), we obtain

Ψ_s(x, W_1) := Π_{i=1}^{d_1} ∫_{R^{d_0}} 1[s_i x · v_1^i > 0] Q_1^i(v_1^i) dv_1^i = Π_{i=1}^{d_1} [ ½ + (s_i/2) erf( (w_1^i · x) / (√2 ‖x‖) ) ] ,   (10)

where the bracketed factor is denoted ψ_{s_i}(x, w_1^i).

Line (7) states that the output neuron is a linear predictor over the hidden layer's activation values s = sgn(V_1 x); based on Equation (4), the integral on v_2 becomes erf( w_2 · s / (√2 ‖s‖) ). As a function of s, the latter expression is piecewise constant. Thus, Line (8) discretizes the integral on V_1 as a sum over the 2^{d_1} different values of s = (s_i)_{i=1}^{d_1}, s_i ∈ {−1, 1}. Note that ‖s‖² = d_1.
Finally, one can compute the exact output of F_θ(x), provided one accepts to compute a sum combinatorial in the number of hidden neurons (Equation 9). We show in the forthcoming Section 3.2 that it is possible to circumvent this computational burden and approximate F_θ(x) by a sampling procedure.
Derivatives.
Following contemporary approaches in deep neural networks [Goodfellow et al., 2016], we minimize the empirical loss L̂_S(F_θ) by stochastic gradient descent (SGD). This requires computing the partial derivative of the cost function with respect to the parameters θ:

∂L̂_S(F_θ)/∂θ = (1/n) Σ_{i=1}^{n} ∂ℓ(F_θ(x_i), y_i)/∂θ = (1/n) Σ_{i=1}^{n} ( ∂F_θ(x_i)/∂θ ) ℓ'(F_θ(x_i), y_i) ,   (11)

with the derivative of the linear loss ℓ'(F_θ(x_i), y_i) = −½ y_i.

Figure 1: Illustration of the proposed method for a one hidden layer network of size d_1 = 3, interpreted as a majority vote over 8 binary representations s ∈ {−1, 1}³. For each s, a plot shows the values of F_{w_2}(s) Ψ_s(x, W_1). The sum of these values gives the deterministic network output F_θ(x) (see Eq. 9). We also plot the BAM network output f_θ(x) for the same parameters θ (see Eq. 6).

The partial derivatives of the prediction function (Equation 9) with respect to the hidden layer parameters w_1^k ∈ {w_1^1, . . . , w_1^{d_1}} and the output neuron parameters w_2 are

∂F_θ(x)/∂w_1^k = x/(2^{3/2} ‖x‖) erf'( (w_1^k · x)/(√2‖x‖) ) Σ_{s∈{−1,1}^{d_1}} s_k erf( (w_2 · s)/√(2d_1) ) [ Ψ_s(x, W_1) / ψ_{s_k}(x, w_1^k) ] ,   (12)

∂F_θ(x)/∂w_2 = 1/√(2d_1) Σ_{s∈{−1,1}^{d_1}} s erf'( (w_2 · s)/√(2d_1) ) Ψ_s(x, W_1) ,  with erf'(x) := (2/√π) e^{−x²} .   (13)

Note that this is an exact computation.
A salient fact is that even though we work on non-differentiable BAM networks, we get a structure trainable by (stochastic) gradient descent by aggregating networks.
Majority vote of learned representations. Note that Ψ_s (Equation 10) defines a distribution on s. Indeed, Σ_s Ψ_s(x, W_1) = 1, since ψ_{s_i}(x, w_1^i) + ψ_{−s_i}(x, w_1^i) = 1 for every i. Thus, by Equation (9) we can interpret F_θ as a majority vote predictor, which performs a convex combination of the linear predictor outputs F_{w_2}(s) := erf( (w_2 · s)/√(2d_1) ). The vote aggregates the predictions on the 2^{d_1} possible binary representations. Thus, the algorithm does not learn the representations per se, but rather the weights Ψ_s(x, W_1) associated to every s given an input x, as illustrated by Figure 1.

3.2 Stochastic approximation

Since Ψ_s (Equation 10) defines a distribution, we can interpret the function value as the probability of mapping input x to the hidden representation s given the parameters W_1. Using a different formalism, we could write Pr(s | x, W_1) = Ψ_s(x, W_1). This viewpoint suggests a sampling scheme to approximate both the predictor output (Equation 9) and the partial derivatives (Equations 12 and 13), which can be framed as a variant of the REINFORCE algorithm [Williams, 1992] (see the discussion below): we avoid computing the 2^{d_1} terms by resorting to a Monte Carlo approximation of the sum. Given an input x and a sampling size T, the procedure goes as follows.
Prediction. We generate T random binary vectors Z := {s_t}_{t=1}^{T} according to the Ψ_s(x, W_1)-distribution. This can be done by uniformly sampling z_t^i ∈ [0, 1] and setting s_t^i = sgn( ψ_1(x, w_1^i) − z_t^i ). A stochastic approximation of F_θ(x) is then given by F̂_θ(Z) := (1/T) Σ_{t=1}^{T} erf( (w_2 · s_t)/√(2d_1) ).
Derivatives.
Note that for a given sample {s_t}_{t=1}^{T}, the approximate derivatives with respect to w_2 (Equation 15 below) can be computed numerically by the automatic differentiation mechanism of deep learning frameworks while evaluating F̂_θ(Z) [e.g., Paszke et al., 2017]. However, we need Equation (14) below to approximate the gradient with respect to W_1, because ∂F̂_θ(Z)/∂w_1^k = 0:

∂F_θ(x)/∂w_1^k ≈ x/(T 2^{3/2} ‖x‖) erf'( (w_1^k · x)/(√2‖x‖) ) Σ_{t=1}^{T} s_t^k erf( (w_2 · s_t)/√(2d_1) ) / ψ_{s_t^k}(x, w_1^k) ;   (14)

∂F_θ(x)/∂w_2 ≈ 1/(T √(2d_1)) Σ_{t=1}^{T} s_t erf'( (w_2 · s_t)/√(2d_1) ) = ∂F̂_θ(Z)/∂w_2 .   (15)

Figure 2: Illustration of the BAM to tree architecture map on a three-layer network.

Similar approaches to stochastic networks. Random activation functions are commonly used in generative neural networks, and tools have been developed to train these by gradient descent (see Goodfellow et al. [2016, Section 20.9] for a review). Contrary to these approaches, our analysis differs as the stochastic operations are introduced to estimate a deterministic objective. That being said, Equation (14) can be interpreted as a variant of the REINFORCE algorithm [Williams, 1992] to apply the back-propagation method along with discrete activation functions.
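To illustrate the exact computation of Equation (9) next to its Monte Carlo approximation, here is a minimal Python sketch (our own illustration with hypothetical function names, not the PBGNet implementation):

```python
import itertools
import math
import random

def psi(s_i, w_i, x):
    # psi_{s_i}(x, w^i_1) = 1/2 + (s_i/2) * erf(w^i_1 . x / (sqrt(2)||x||)), Eq. (10)
    dot = sum(a * b for a, b in zip(w_i, x))
    norm = math.sqrt(sum(a * a for a in x))
    return 0.5 + 0.5 * s_i * math.erf(dot / (math.sqrt(2) * norm))

def F_exact(W1, w2, x):
    # Eq. (9): exact output, summing over all 2^{d1} hidden sign patterns s
    d1 = len(W1)
    total = 0.0
    for s in itertools.product((-1, 1), repeat=d1):
        Psi = 1.0
        for i in range(d1):
            Psi *= psi(s[i], W1[i], x)  # Psi_s = prod_i psi_{s_i}, Eq. (10)
        out = sum(a * b for a, b in zip(w2, s)) / math.sqrt(2 * d1)
        total += math.erf(out) * Psi
    return total

def F_sampled(W1, w2, x, T):
    # Monte Carlo estimate of Eq. (9): draw s^t_i = +1 with prob. psi_{+1}(x, w^i_1)
    d1 = len(W1)
    acc = 0.0
    for _ in range(T):
        s = [1 if random.random() < psi(1, W1[i], x) else -1 for i in range(d1)]
        acc += math.erf(sum(a * b for a, b in zip(w2, s)) / math.sqrt(2 * d1))
    return acc / T
```

The exact sum becomes intractable as d_1 grows, while the sampled version costs O(T d_1) per input, which is the trade-off governed by the sampling size T discussed in Section 5.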
Interestingly, the formulation we obtain through our PAC-Bayes objective is similar to a commonly used REINFORCE variant [e.g., Bengio et al., 2013, Yin and Zhou, 2019], where the activation function is given by a Bernoulli variable with probability of success σ(a), where a is the neuron input and σ is the sigmoid function. The latter can be interpreted as a surrogate of our ψ_{s_i}(x, w_1^i).

4 Generalization to multilayer networks

In the following, we extend the strategy introduced in Section 3 to BAM architectures with an arbitrary number of layers L ∈ N* (Equation 1). An apparently straightforward approach to achieve this generalization would have been to consider a Gaussian posterior distribution N(θ, I_D) over the BAM family {f_θ̃ | θ̃ ∈ R^D}. However, doing so leads to a deterministic network relying on undesirable sums of Π_{k=1}^{L} 2^{d_k} elements (see Appendix A.2 for details). Instead, we define a mapping f_θ ↦ g_{ζ(θ)} which transforms the BAM network into a computation tree, as illustrated by Figure 2.
BAM to tree architecture map. Given a BAM network f_θ of L layers with sizes d_0, d_1, . . . , d_L (reminder: d_L = 1), we obtain a computation tree by decoupling the neurons (i.e., the computation graph nodes): the tree leaves contain Π_{k=1}^{L} d_k copies of each of the d_0 BAM input neurons, and the tree root node corresponds to the single BAM output neuron. Each input-output path of the original BAM network becomes a path of length L from one leaf to the tree root. Each tree edge has its own parameter (a real-valued scalar); the total number of edges is D† := Σ_{k=0}^{L−1} d†_k, with d†_k := Π_{i=k}^{L} d_i. We define a set of tree parameters η recursively according to the tree structure.
From level k to k+1, the tree has d†_k edges. That is, each node at level k+1 has its own parameters subtree η_{k+1} := {η_i^k}_{i=0}^{d_k}, where each η_i^k is either a weight vector containing the input edge parameters (by convention, the weight vector η_0^k) or a parameter set (thus, the remaining η_i^k are themselves parameter subtrees).
Hence, the deepest elements of the recursive parameters set η are weight vectors η_1 ∈ R^{d_0}. Let us now define the output tree g_η(x) := g_L(x, η) on an input x ∈ R^{d_0} as a recursive function:

g_1(x, {w}) = sgn( w · x ) ,
g_{k+1}(x, {w, η_1^k, . . . , η_{d_k}^k}) = sgn( w · ( g_k(x, η_1^k), . . . , g_k(x, η_{d_k}^k) ) )   for k = 1, . . . , L−1 .

BAM to tree parameters map. Given BAM parameters θ, we denote θ_{1:k} := vec({W_i}_{i=1}^{k}). The mapping from θ into the corresponding (recursive) tree parameters set is ζ(θ) = {w_L, ζ_1(θ_{1:L−1}), . . . , ζ_{d_{L−1}}(θ_{1:L−1})}, such that ζ_i(θ_{1:k}) = {w_k^i, ζ_1(θ_{1:k−1}), . . . , ζ_{d_{k−1}}(θ_{1:k−1})}, and ζ_i(θ_{1:1}) = {w_1^i}. Note that the parameters tree obtained by the transformation ζ(θ) is highly redundant, as each weight vector w_k^i (the ith line of the W_k matrix from θ) is replicated d†_{k+1} times. This construction is such that f_θ(x) = g_{ζ(θ)}(x) for all x ∈ R^{d_0}.
Deterministic network.
With a slight abuse of notation, we let η̃ ∼ Q_η := N(η, I_{D†}) denote a parameter tree of the same structure as η, where every weight is sampled iid from a normal distribution. We denote G_θ(x) := E_{η̃∼Q_{ζ(θ)}} g_η̃(x), and we compute the output value of this predictor recursively. In the following, we denote G^{(j)}_{θ_{1:k+1}}(x) the function returning the jth neuron value of the layer k+1. Hence, the output of this network is G_θ(x) = G^{(1)}_{θ_{1:L}}(x). As such,

G^{(j)}_{θ_{1:1}}(x) = ∫_{R^{d_0}} Q_{w_1^j}(v) sgn(v · x) dv = erf( (w_1^j · x)/(√2 ‖x‖) ) ,

G^{(j)}_{θ_{1:k+1}}(x) = Σ_{s∈{−1,1}^{d_k}} erf( (w_{k+1}^j · s)/√(2 d_k) ) Ψ_s^k(x, θ) ,  with Ψ_s^k(x, θ) = Π_{i=1}^{d_k} ( ½ + (s_i/2) G^{(i)}_{θ_{1:k}}(x) ) .   (16)

The complete mathematical calculations leading to the above results are provided in Appendix A.3. The computation tree structure and the parameter mapping ζ(θ) are crucial to obtain the recursive expression of Equation (16). However, note that this abstract mathematical structure is never manipulated explicitly. Instead, it allows computing each hidden layer vector (G^{(j)}_{θ_{1:k}}(x))_{j=1}^{d_k} sequentially; a summation of 2^{d_k} terms is required for each layer k = 1, . . . , L−1.
Stochastic approximation. Following the sampling procedure trick of Section 3.2 for the one hidden layer network, we propose to perform a stochastic approximation of the network prediction output, by a Monte Carlo sampling for each layer. Likewise, we recover exact and approximate derivatives in a layer-by-layer scheme.
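To make the recursion of Equation (16) concrete, here is a minimal Python sketch (an illustration under our notation, not the PBGNet implementation; `layer_output` and `G_theta` are hypothetical names). It propagates the aggregated activations layer by layer, with one 2^{d_k} sum per layer:

```python
import itertools
import math

def layer_output(G_prev, W):
    # One step of Eq. (16): map the aggregated activations of layer k,
    # G_prev = (G^{(i)}_{theta_{1:k}}(x))_i, to those of layer k+1,
    # summing over the 2^{d_k} sign patterns s.
    d_k = len(G_prev)
    out = []
    for w_j in W:  # one weight vector per neuron j of layer k+1
        val = 0.0
        for s in itertools.product((-1, 1), repeat=d_k):
            Psi = 1.0
            for s_i, g_i in zip(s, G_prev):
                Psi *= 0.5 + 0.5 * s_i * g_i  # Psi^k_s(x, theta)
            dot = sum(a * b for a, b in zip(w_j, s))
            val += math.erf(dot / math.sqrt(2 * d_k)) * Psi
        out.append(val)
    return out

def G_theta(weights, x):
    # weights[k] holds the weight vectors of layer k+1 (rows of W_{k+1}).
    # First layer: G^{(j)}_{theta_{1:1}}(x) = erf(w^j_1 . x / (sqrt(2)||x||)).
    norm = math.sqrt(sum(a * a for a in x))
    G = [math.erf(sum(a * b for a, b in zip(w, x)) / (math.sqrt(2) * norm))
         for w in weights[0]]
    for W in weights[1:]:
        G = layer_output(G, W)
    return G[0]
```

The Monte Carlo variant would simply replace each exact 2^{d_k} sum by T sampled sign patterns, as in Section 3.2.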
The related equations are given in Appendix A.4.

5 PBGNet: PAC-Bayesian SGD learning of binary activated networks

We design an algorithm to learn the parameters θ ∈ R^D of the predictor G_θ by minimizing a PAC-Bayesian upper bound on the generalization loss L_D(G_θ). We name our algorithm PBGNet (PAC-Bayesian Binary Gradient Network), as it is a generalization of the PBGD (PAC-Bayesian Gradient Descent) learning algorithm for linear classifiers [Germain et al., 2009] to deep binary activated neural networks.
Kullback-Leibler regularization. The computation of a PAC-Bayesian bound value relies on two key elements: the empirical loss on the training set and the Kullback-Leibler divergence between the prior and the posterior. Sections 3 and 4 present exact computation and approximation schemes for the empirical loss L̂_S(G_θ) (which is equal to L̂_S(F_θ) when L = 2). Equation (17) introduces the KL divergence associated to the parameter maps of Section 4. We use the shortcut notation K(θ, μ) to refer to the divergence between two multivariate Gaussians of D† dimensions, corresponding to learned parameters θ = vec({W_k}_{k=1}^{L}) and prior parameters μ = vec({U_k}_{k=1}^{L}):

K(θ, μ) := KL( Q_{ζ(θ)} ‖ P_{ζ(μ)} ) = ½ ( ‖w_L − u_L‖² + Σ_{k=1}^{L−1} d†_{k+1} ‖W_k − U_k‖²_F ) ,   (17)

where the factors d†_{k+1} = Π_{i=k+1}^{L} d_i are due to the redundancy introduced by the transformation ζ(·). This has the effect of penalizing more the weights on the first layers. It might have a considerable influence on the bound value for very deep networks.
On the other hand, we observe that this is consistent with the fine-tuning practice performed when training deep neural networks for a transfer learning task: prior parameters are learned on a first dataset, and the posterior weights are learned by adjusting the last layer weights on a second dataset [see Bengio, 2009, Yosinski et al., 2014].
Bound minimization. PBGNet minimizes the bound of Theorem 1 (rephrased as Equation 18). However, this is done indirectly, by minimizing a variation of Theorem 2 also used in a deep learning context by Zhou et al. [2019] (Equation 19). Theorem 3 links both results (proof in Appendix A.5).
Theorem 3. Given prior parameters μ ∈ R^D, with probability at least 1 − δ over S ∼ D^n, we have for all θ on R^D:

L_D(G_θ) ≤ sup_{0≤p≤1} { p : kl( L̂_S(G_θ) ‖ p ) ≤ (1/n) [ K(θ, μ) + ln(2√n/δ) ] }   (18)
         = inf_{C>0} { ( 1/(1 − e^{−C}) ) ( 1 − exp( −C L̂_S(G_θ) − (1/n) [ K(θ, μ) + ln(2√n/δ) ] ) ) } .   (19)

We use stochastic gradient descent (SGD) as the optimization procedure to minimize Equation (19) with respect to θ and C. It optimizes the same trade-off as in Equation (5), but choosing the C value which minimizes the bound.^3 The originality of our SGD approach is that not only do we induce gradient randomness by selecting mini-batches among the training set S, we also approximate the loss gradient by sampling T elements for the combinatorial sum at each layer. Our experiments show that, for some learning problems, reducing the sample size of the Monte Carlo approximation can be beneficial to the stochastic gradient descent. Thus the sample size value T has an influence on the cost function space exploration during the training procedure (see Figure 7 in the appendix).
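For concreteness, the two bound expressions of Theorem 3 can be evaluated as follows (a minimal sketch, not the experimental code; `seeger_bound` inverts the kl divergence of Equation (18) by bisection, `catoni_bound` evaluates the right-hand side of Equation (19) for a fixed C, and the function names are ours):

```python
import math

def kl_bern(q, p):
    # kl(q||p) between Bernoulli distributions, as in Theorem 1
    eps = 1e-12
    q = min(max(q, eps), 1 - eps)
    p = min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def seeger_bound(emp_loss, kl_div, n, delta):
    # Eq. (18): sup { p : kl(emp_loss || p) <= (K + ln(2 sqrt(n)/delta)) / n },
    # computed by bisection on p in [emp_loss, 1]
    rhs = (kl_div + math.log(2 * math.sqrt(n) / delta)) / n
    lo, hi = emp_loss, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if kl_bern(emp_loss, mid) <= rhs:
            lo = mid
        else:
            hi = mid
    return hi

def catoni_bound(emp_loss, kl_div, n, delta, C):
    # Eq. (19) evaluated for a fixed C > 0
    rhs = (kl_div + math.log(2 * math.sqrt(n) / delta)) / n
    return (1 - math.exp(-C * emp_loss - rhs)) / (1 - math.exp(-C))
```

Per the equality of Equations (18) and (19), minimizing `catoni_bound` over C recovers the supremum of Equation (18), so SGD can work with the differentiable form while the reported bound value stays tight.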
Hence, we consider T as a PBGNet hyperparameter.

6 Numerical experiments

Experiments were conducted on six binary classification datasets, described in Appendix B.
Learning algorithms. In order to get insights on the trade-offs promoted by the PAC-Bayes bound minimization, we compared PBGNet to variants focusing on empirical loss minimization. We train the models using multiple network architectures (depth and layer size) and hyperparameter choices. The objective is to evaluate the efficiency of our PAC-Bayesian framework both as a learning algorithm design tool and as a model selection criterion. For all methods, the network parameters are trained using the Adam optimizer [Kingma and Ba, 2015]. Early stopping is used to interrupt the training when the cost function value has not improved for 20 consecutive epochs. Network architectures explored range from 1 to 3 hidden layers (L) and a hidden size h ∈ {10, 50, 100} (d_k = h for 1 ≤ k < L). Unless otherwise specified, the same randomly initialized parameters are used as a prior in the bound and as a starting point for SGD optimization [as in Dziugaite and Roy, 2017]. Also, for all models except MLP, we select the binary activation sampling size T in a range going from 10 to 10000. More details about the experimental setting are given in Appendix B.
MLP. We compare to a standard network with tanh activation, as this activation resembles the erf function of PBGNet. We optimize the linear loss as the cost function and use 20% of the training data as validation for hyperparameter selection. A weight decay parameter ρ is selected between 0 and 10^{−4}. Using weight decay corresponds to adding an L2 regularizer (ρ/2)‖θ‖² to the cost function, but contrary to the regularizer of Equation (17) promoted by PBGNet, this regularization is uniform across all layers.

PBGNetℓ.
This variant minimizes the empirical loss L̂(G_θ), with an L2 regularization term (ρ/2)‖θ‖². The corresponding weight decay ρ, as well as the other hyperparameters, are selected using a validation set, exactly as the MLP does. The bound expression is not involved in the learning process and is computed on the model selected by the validation set technique.

PBGNetℓ-bnd. Again, the empirical loss L̂(G_θ) with an L2 regularization term (ρ/2)‖θ‖² is minimized. However, only the weight decay hyperparameter ρ is selected on the validation set; the other ones are selected by the bound. This method is motivated by an empirical observation: our PAC-Bayesian bound is a great model selection tool for most hyperparameters, except for the weight decay term.

³We also note that our training objective can be seen as a generalized Bayesian inference one [Knoblauch et al., 2019], where the trade-off between the loss and the KL divergence is given by the PAC-Bayes bound.

Table 1: Experiment results for the considered models on the binary classification datasets: error rates on the train and test sets (E_S and E_T), and generalization bounds on the linear loss $\mathcal{L}_{\mathcal{D}}$ (Bnd). The PAC-Bayesian bounds hold with probability 0.95. Bound values for PBGNetℓ are trivial, except for Adult with a bound value of 0.606, and are thus not reported.
A visual representation of this table is presented in the appendix (Figure 5).

Dataset  |     MLP     |   PBGNetℓ   |    PBGNetℓ-bnd    |      PBGNet       |     PBGNetpre
         | E_S   E_T   | E_S   E_T   | E_S   E_T   Bnd   | E_S   E_T   Bnd   | E_S   E_T   Bnd
ads      | 0.021 0.035 | 0.018 0.030 | 0.028 0.047 0.763 | 0.131 0.168 0.205 | 0.033 0.033 0.060
adult    | 0.137 0.152 | 0.133 0.149 | 0.147 0.155 0.281 | 0.154 0.163 0.214 | 0.149 0.154 0.164
mnist17  | 0.002 0.004 | 0.003 0.004 | 0.004 0.006 0.096 | 0.005 0.007 0.040 | 0.004 0.004 0.010
mnist49  | 0.004 0.013 | 0.003 0.018 | 0.029 0.035 0.311 | 0.035 0.040 0.139 | 0.016 0.017 0.028
mnist56  | 0.004 0.013 | 0.003 0.011 | 0.022 0.024 0.172 | 0.022 0.025 0.090 | 0.009 0.009 0.018
mnistLH  | 0.006 0.018 | 0.004 0.019 | 0.046 0.051 0.311 | 0.049 0.052 0.160 | 0.026 0.027 0.033

PBGNet. As described in Section 5, the generalization bound is directly optimized as the cost function during the learning procedure and is used solely for hyperparameter selection: no validation set is needed and all training data S are exploited for learning.

PBGNetpre. We also explore the possibility of using a part of the training data as a pre-training step. To do so, we split the training set into two halves. First, we minimize the empirical loss for a fixed number of 20 epochs on the first 50% of the training set. Then, we use the learned parameters as initialization and prior for PBGNet and learn on the second 50% of the training set.

Analysis. Results are summarized in Table 1, which highlights the strengths and weaknesses of the models. Both MLP and PBGNetℓ obtain competitive error scores but lack generalization guarantees. By introducing the bound value in the model selection process, even with the linear loss as the cost function, PBGNetℓ-bnd yields non-vacuous generalization bound values, although with an increase in error scores. Using the bound expression as the cost function in PBGNet improves bound values while keeping similar performances.
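The bound values (Bnd) reported in Table 1 instantiate Theorem 3, whose two forms can be checked against each other numerically: inverting the kl divergence by bisection (Equation 18) and minimizing the Catoni-style expression over C (Equation 19) return the same value. A minimal sketch, with illustrative values for n, the empirical loss, the KL term and δ (not taken from Table 1):

```python
import math

def kl_bernoulli(q, p):
    """kl(q || p) between Bernoulli distributions, clamped away from {0, 1}."""
    eps = 1e-12
    q = min(max(q, eps), 1 - eps)
    p = min(max(p, eps), 1 - eps)
    return q * math.log(q / p) + (1 - q) * math.log((1 - q) / (1 - p))

def bound_sup(emp_loss, complexity):
    """Form (18): sup { p : kl(emp_loss || p) <= complexity }, by bisection."""
    lo, hi = emp_loss, 1.0
    for _ in range(100):
        mid = (lo + hi) / 2
        if kl_bernoulli(emp_loss, mid) <= complexity:
            lo = mid
        else:
            hi = mid
    return lo

def bound_inf(emp_loss, complexity, grid=100000):
    """Form (19): inf over C > 0 of the Catoni-style bound, on a grid over (0, 20]."""
    best = 1.0
    for i in range(1, grid + 1):
        c = 20.0 * i / grid
        val = (1 - math.exp(-c * emp_loss - complexity)) / (1 - math.exp(-c))
        best = min(best, val)
    return best

# Illustrative values: n examples, empirical linear loss, KL(posterior, prior), delta.
n, emp_loss, kl_div, delta = 10000, 0.05, 200.0, 0.05
complexity = (kl_div + math.log(2 * math.sqrt(n) / delta)) / n
print(bound_sup(emp_loss, complexity))
print(bound_inf(emp_loss, complexity))
```

Both forms agree up to numerical precision, so minimizing (19) by SGD over θ and C indeed tightens the kl bound of (18).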
The Ads dataset is a remarkable exception, where the small amount of training examples seems to radically constrain the network in the learning process, as it hinders the KL divergence growth in the bound expression. With an informative prior obtained from pre-training, PBGNetpre is able to recover competitive error scores while offering tight generalization guarantees. All selected hyperparameters are presented in the appendix (Table 4).

A notable observation is the impact of exploiting the bound for model selection on the train-test error gap. Indeed, PBGNetℓ-bnd, PBGNet and PBGNetpre display test errors closer to their train errors, as compared to MLP and PBGNetℓ. This behavior is more noticeable as the dataset size grows, and suggests potential robustness to overfitting when the bound is involved in the learning process.

7 Conclusion and perspectives

We made theoretical and algorithmic contributions towards a better understanding of the generalization abilities of binary activated multilayer networks, using PAC-Bayes. Note that the computational complexity of a learning epoch of PBGNet is higher than the cost induced in binary neural networks [Bengio, 2009, Hubara et al., 2016, 2017, Soudry et al., 2014]. Indeed, we focus on optimizing the generalization guarantee more than on computational complexity, although we also propose a sampling scheme that considerably reduces the learning time required by our method, achieving a nontrivial trade-off.

We intend to investigate how we could leverage the bound to learn suitable priors for PBGNet or, equivalently, to find (from the bound point of view) the best network architecture. We also plan to extend our analysis to multiclass and multilabel prediction, and to convolutional networks.
We believe that this line of work is part of a necessary effort to give rise to a better understanding of the behavior of deep neural networks.

Acknowledgments

We would like to thank Mario Marchand for the insight leading to Theorem 3, Gabriel Dubé and Jean-Samuel Leboeuf for their input on the theoretical aspects, Frédérik Paradis for his help with the implementation, and Robert Gower for his insightful comments. This work was supported in part by the French Project APRIORI ANR-18-CE23-0015, in part by NSERC and in part by Intact Financial Corporation. We gratefully acknowledge the support of NVIDIA Corporation with the donation of Titan Xp GPUs used for this research.

References

Amiran Ambroladze, Emilio Parrado-Hernández, and John Shawe-Taylor. Tighter PAC-Bayes bounds. In NIPS, 2006.

Yoshua Bengio. Learning deep architectures for AI. Foundations and Trends in Machine Learning, 2(1):1–127, 2009.

Yoshua Bengio, Nicholas Léonard, and Aaron C. Courville. Estimating or propagating gradients through stochastic neurons for conditional computation. CoRR, abs/1308.3432, 2013.

Olivier Catoni. A PAC-Bayesian approach to adaptive classification. Preprint 840, 2003.

Olivier Catoni. Statistical learning theory and stochastic optimization: École d'Été de Probabilités de Saint-Flour XXXI-2001. Springer, 2004.

Olivier Catoni. PAC-Bayesian supervised classification: the thermodynamics of statistical learning, volume 56. Institute of Mathematical Statistics, 2007.

Corinna Cortes and Vladimir Vapnik. Support-vector networks. Machine Learning, 20(3), 1995.

Gintare Karolina Dziugaite and Daniel M. Roy. Computing nonvacuous generalization bounds for deep (stochastic) neural networks with many more parameters than training data. In UAI. AUAI Press, 2017.

Pascal Germain, Alexandre Lacasse, François Laviolette, and Mario Marchand.
PAC-Bayesian learning of linear classifiers. In ICML, pages 353–360. ACM, 2009.

Ian Goodfellow, Yoshua Bengio, and Aaron Courville. Deep Learning. MIT Press, 2016. http://www.deeplearningbook.org.

Benjamin Guedj. A primer on PAC-Bayesian learning. arXiv preprint arXiv:1901.05353, 2019.

Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Binarized neural networks. In NIPS, pages 4107–4115, 2016.

Itay Hubara, Matthieu Courbariaux, Daniel Soudry, Ran El-Yaniv, and Yoshua Bengio. Quantized neural networks: Training neural networks with low precision weights and activations. JMLR, 18(1):6869–6898, 2017.

Diederik P. Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In ICLR, 2015.

Jeremias Knoblauch, Jack Jewson, and Theodoros Damoulas. Generalized variational inference, 2019.

Alexandre Lacasse. Bornes PAC-Bayes et algorithmes d'apprentissage. PhD thesis, Université Laval, 2010. URL http://www.theses.ulaval.ca/2010/27635/.

John Langford. Tutorial on practical prediction theory for classification. JMLR, 6, 2005.

John Langford and Rich Caruana. (Not) Bounding the True Error. In NIPS, pages 809–816. MIT Press, 2001.

John Langford and John Shawe-Taylor. PAC-Bayes & margins. In NIPS, 2002.

Andreas Maurer. A note on the PAC-Bayesian theorem. CoRR, cs.LG/0411099, 2004.

David McAllester. Some PAC-Bayesian theorems. Machine Learning, 37(3), 1999.

David McAllester. PAC-Bayesian stochastic model selection. Machine Learning, 51(1), 2003.

Behnam Neyshabur, Srinadh Bhojanapalli, and Nathan Srebro. A PAC-Bayesian approach to spectrally-normalized margin bounds for neural networks. In ICLR, 2018.

Frédérik Paradis. Poutyne: A Keras-like framework for PyTorch, 2018. https://poutyne.org.

Emilio Parrado-Hernández, Amiran Ambroladze, John Shawe-Taylor, and Shiliang Sun. PAC-Bayes bounds with data dependent priors.
JMLR, 13, 2012.

Adam Paszke, Sam Gross, Soumith Chintala, Gregory Chanan, Edward Yang, Zachary DeVito, Zeming Lin, Alban Desmaison, Luca Antiga, and Adam Lerer. Automatic differentiation in PyTorch. In NIPS Autodiff Workshop, 2017.

Matthias Seeger. PAC-Bayesian generalization bounds for Gaussian processes. JMLR, 3, 2002.

John Shawe-Taylor and Robert C. Williamson. A PAC analysis of a Bayesian estimator. In COLT, 1997.

Daniel Soudry, Itay Hubara, and Ron Meir. Expectation backpropagation: Parameter-free training of multilayer neural networks with continuous or discrete weights. In NIPS, pages 963–971, 2014.

Leslie G. Valiant. A theory of the learnable. In Proceedings of the Sixteenth Annual ACM Symposium on Theory of Computing, pages 436–445. ACM, 1984.

Ronald J. Williams. Simple statistical gradient-following algorithms for connectionist reinforcement learning. Machine Learning, 8(3):229–256, May 1992.

Mingzhang Yin and Mingyuan Zhou. ARM: augment-reinforce-merge gradient for stochastic binary networks. In ICLR (Poster), 2019.

Jason Yosinski, Jeff Clune, Yoshua Bengio, and Hod Lipson. How transferable are features in deep neural networks? In NIPS, pages 3320–3328, 2014.

Wenda Zhou, Victor Veitch, Morgane Austern, Ryan P. Adams, and Peter Orbanz. Non-vacuous generalization bounds at the ImageNet scale: a PAC-Bayesian compression approach. In ICLR, 2019.