{"title": "The Description Length of Deep Learning models", "book": "Advances in Neural Information Processing Systems", "page_first": 2216, "page_last": 2226, "abstract": "Deep learning models often have more parameters than observations, and still perform well. This is sometimes described as a paradox. In this work, we show experimentally that despite their huge number of parameters, deep neural networks can compress the data losslessly even when taking the cost of encoding the parameters into account. Such a compression viewpoint originally motivated the use of variational methods in neural networks. However, we show that these variational methods provide surprisingly poor compression bounds, despite being explicitly built to minimize such bounds. This might explain the relatively poor practical performance of variational methods in deep learning. Better encoding methods, imported from the Minimum Description Length (MDL) toolbox, yield much better compression values on deep networks.", "full_text": "The Description Length of Deep Learning Models\n\nL\u00e9onard Blier\n\n\u00c9cole Normale Sup\u00e9rieure\n\nParis, France\n\nleonard.blier@normalesup.org\n\nFacebook Arti\ufb01cial Intelligence Research\n\nYann Ollivier\n\nParis, France\nyol@fb.com\n\nAbstract\n\nSolomonoff\u2019s general theory of inference (Solomonoff, 1964) and the Minimum\nDescription Length principle (Gr\u00fcnwald, 2007; Rissanen, 2007) formalize Oc-\ncam\u2019s razor, and hold that a good model of data is a model that is good at losslessly\ncompressing the data, including the cost of describing the model itself. Deep neu-\nral networks might seem to go against this principle given the large number of\nparameters to be encoded.\nWe demonstrate experimentally the ability of deep neural networks to compress\nthe training data even when accounting for parameter encoding. 
The compression viewpoint originally motivated the use of variational methods in neural networks (Hinton and Van Camp, 1993; Schmidhuber, 1997). However, we found that these variational methods provide surprisingly poor compression bounds, despite being explicitly built to minimize such bounds. This might explain the relatively poor practical performance of variational methods in deep learning. On the other hand, simple incremental encoding methods yield excellent compression values on deep networks, vindicating Solomonoff's approach.

1 Introduction

Deep learning has achieved remarkable results in many different areas (LeCun et al., 2015). Still, the ability of deep models not to overfit despite their large number of parameters is not well understood. To quantify the complexity of these models in light of their generalization ability, several metrics beyond parameter-counting have been measured, such as the number of degrees of freedom of models (Gao and Jojic, 2016), or their intrinsic dimension (Li et al., 2018). These works concluded that deep learning models are significantly simpler than their numbers of parameters might suggest.

In information theory and Minimum Description Length (MDL), learning a good model of the data is recast as using the model to losslessly transmit the data in as few bits as possible. More complex models will compress the data more, but the model must be transmitted as well. The overall codelength can be understood as a combination of quality-of-fit of the model (compressed data length), together with the cost of encoding (transmitting) the model itself. For neural networks, the MDL viewpoint goes back as far as (Hinton and Van Camp, 1993), which used a variational technique to estimate the joint compressed length of data and parameters in a neural network model.

Compression is strongly related to generalization and practical performance.
Standard sample complexity bounds (VC-dimension, PAC-Bayes...) are related to the compressed length of the data in a model, and any compression scheme leads to generalization bounds (Blum and Langford, 2003). Specifically for deep learning, (Arora et al., 2018) showed that compression leads to generalization bounds (see also (Dziugaite and Roy, 2017)). Several other deep learning methods have been inspired by information theory and the compression viewpoint. In unsupervised learning, autoencoders and especially variational autoencoders (Kingma and Welling, 2013) are compression methods of the data (Ollivier, 2014). In supervised learning, the information bottleneck method studies

32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montréal, Canada.

Figure 1: Fake labels cannot be compressed. Measuring codelength while training a deep model on MNIST with true and fake labels. The model is an MLP with 3 hidden layers of size 200, with ReLU units. With ordinary SGD training, the model is able to overfit random labels. The plot shows the effect of using variational learning instead, and reports the variational objective (encoding cost of the training data, see Section 3.3), on true and fake labels. We also isolated the contribution from parameter encoding in the total loss (KL term in (3.2)).
With true labels, the encoding cost is below the uniform encoding, and half of the description length is information contained in the weights. With fake labels, on the contrary, the encoding cost converges to a uniform random model, with no information contained in the weights: there is no mutual information between inputs and outputs.

how the hidden representations in a neural network compress the inputs while preserving the mutual information between inputs and outputs (Tishby and Zaslavsky, 2015; Shwartz-Ziv and Tishby, 2017; Achille and Soatto, 2017).

MDL is based on Occam's razor, and on Chaitin's hypothesis that "comprehension is compression" (Chaitin, 2007): any regularity in the data can be exploited both to compress it and to make predictions. This is ultimately rooted in Solomonoff's general theory of inference (Solomonoff, 1964) (see also, e.g., (Hutter, 2007; Schmidhuber, 1997)), whose principle is to favor models that correspond to the "shortest program" to produce the training data, based on its Kolmogorov complexity (Li and Vitányi, 2008). If no structure is present in the data, no compression to a shorter program is possible.

The problem of overfitting fake labels is a nice illustration: convolutional neural networks commonly used for image classification are able to reach 100% accuracy on random labels on the train set (Zhang et al., 2017). However, measuring the associated compression bound (Fig. 1) immediately reveals that these models do not compress fake labels (and indeed, theoretically, they cannot, see Appendix A), that no information is present in the model parameters, and that no learning has occurred.

In this work we explicitly measure how much current deep models actually compress data. (We introduce no new architectures or learning procedures.) As seen above, this may clarify several issues around generalization and measures of model complexity.
Our contributions are:

• We show that the traditional method to estimate MDL codelengths in deep learning, variational inference (Hinton and Van Camp, 1993), yields surprisingly inefficient codelengths for deep models, despite explicitly minimizing this criterion. This might explain why variational inference as a regularization method often does not reach optimal test performance.

• We introduce new practical ways to compute tight compression bounds in deep learning models, based on the MDL toolbox (Grünwald, 2007; Rissanen, 2007). We show that prequential coding on top of standard learning yields much better codelengths than variational inference, correlating better with test set performance. Thus, despite their many parameters, deep learning models do compress the data well, even when accounting for the cost of describing the model.

2 Probabilistic Models, Compression, and Information Theory

Imagine that Alice wants to efficiently transmit some information to Bob. Alice has a dataset D = {(x1, y1), ..., (xn, yn)} where x1, ..., xn are some inputs and y1, ..., yn some labels. We do not assume that these data come from a "true" probability distribution. Bob also has the data x1, ..., xn, but he does not have the labels. This describes a supervised learning situation in which the inputs x may be publicly available, and a prediction of the labels y is needed. How can deep learning models help with data encoding? One key problem is that Bob does not necessarily know the precise, trained model that Alice is using.
So some explicit or implicit transmission of the model itself is required. We study, in turn, various methods to encode the labels y, with or without a deep learning model. Encoding the labels knowing the inputs is equivalent to estimating their mutual information (Section 2.4); this is distinct from the problem of practical network compression (Section 3.2) or from using neural networks for lossy data compression. Our running example will be image classification on the MNIST (LeCun et al., 1998) and CIFAR10 (Krizhevsky, 2009) datasets.

2.1 Definitions and notation

Let X be the input space and Y the output (label) space. In this work, we only consider classification tasks, so Y = {1, ..., K}. The dataset is D := {(x1, y1), ..., (xn, yn)}. Denote xk:l := (xk, xk+1, ..., xl−1, xl). We define a model for the supervised learning problem as a conditional probability distribution p(y|x), namely, a function such that for each x ∈ X, Σ_{y∈Y} p(y|x) = 1. A model class, or architecture, is a set of models depending on some parameter θ: M = {pθ, θ ∈ Θ}. The Kullback–Leibler divergence between two distributions is KL(µ‖ν) = E_{X∼µ}[log2 (µ(X)/ν(X))].

2.2 Models and codelengths

We recall a basic result of compression theory (Shannon, 1948).

Proposition 1 (Shannon–Huffman code). Suppose that Alice and Bob have agreed in advance on a model p, and both know the inputs x1:n. Then there exists a code to transmit the labels y1:n losslessly with codelength (up to at most one bit on the whole sequence)

Lp(y1:n|x1:n) = − Σ_{i=1}^n log2 p(yi|xi)    (2.1)

This bound is known to be optimal if the data are independent and coming from the model p (MacKay, 2003). The one additional bit in the Shannon–Huffman code is incurred only once for the whole dataset (MacKay, 2003). With large datasets this is negligible.
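As a concrete illustration (mine, not from the paper), the codelength (2.1) can be computed directly from the probabilities a model assigns to the true labels; a minimal sketch:

```python
import math

def codelength_bits(true_label_probs):
    """Shannon-Huffman codelength (2.1): the sum over samples of
    -log2 p(y_i | x_i), where true_label_probs[i] is the probability
    the model assigns to the correct label of sample i."""
    return sum(-math.log2(p) for p in true_label_probs)

# A model assigning probability 1/10 to every true label (the uniform
# model over K = 10 classes) pays log2(10) ~ 3.32 bits per label:
mnist_uniform = codelength_bits([0.1] * 60000)  # ~199 kbits (cf. Section 2.3)
```

Confident correct predictions (p(yi|xi) close to 1) cost almost nothing, while confident mistakes are very expensive.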
Thus, from now on we will systematically omit the +1 as well as admit non-integer codelengths (Grünwald, 2007). We will use the terms codelength and compression bound interchangeably.

This bound is exactly the categorical cross-entropy loss evaluated on the model p. Hence, trying to minimize the description length of the outputs over the parameters of a model class is equivalent to minimizing the usual classification loss.

Here we do not consider the practical implementation of compression algorithms: we only care about the theoretical bit length of their associated encodings. We are interested in measuring the amount of information contained in the data, the mutual information between input and output, and how it is captured by the model. Thus, we will directly work with codelength functions.

An obvious limitation of the bound (2.1) is that Alice and Bob both have to know the model p in advance. This is problematic if the model must be learned from the data.

2.3 Uniform encoding

The uniform distribution punif(y|x) = 1/K over the K classes does not require any learning from the data, thus no additional information has to be transmitted. Using punif(y|x) in (2.1) yields a codelength

Lunif(y1:n|x1:n) = n log2 K    (2.2)

Table 1: Compression bounds via deep learning. Compression bounds given by different codes on two datasets, MNIST and CIFAR10. The codelength is the number of bits necessary to send the labels to someone who already has the inputs. This codelength includes the description length of the model. The compression ratio for a given code is the ratio between its codelength and the codelength of the uniform code. The test accuracy of a model is the accuracy of its predictions on the test set. For 2-part and network compression codes, we report results from (Han et al., 2015a) and (Xu et al., 2017), and for the intrinsic dimension code, results from (Li et al., 2018).
The values in the table for these codelengths and compression ratios are lower bounds, only taking into account the codelength of the weights, and not the codelength of the data encoded with the model (the final loss is not always available in these publications). For variational and prequential codes, we selected the model and hyperparameters providing the best compression bound.

                   MNIST                           CIFAR10
CODE               CODELENGTH  COMP.    TEST       CODELENGTH  COMP.    TEST
                   (kbits)     RATIO    ACC.       (kbits)     RATIO    ACC.
UNIFORM            199         1.       10%        166         1.       10%
FLOAT32 2-PART     > 8.6 Mb    > 45.    98.4%      > 428 Mb    > 2500.  92.9%
NETWORK COMPR.     > 400       > 2.     98.4%      > 14 Mb     > 83.    93.3%
INTRINSIC DIM.     > 9.28      > 0.05   90%        > 92.8      > 0.56   70%
VARIATIONAL        22.2        0.11     98.2%      89.0        0.54     66.5%
PREQUENTIAL        4.10        0.02     99.5%      45.3        0.27     93.3%

This uniform encoding will be a sanity check against which to compare the other encodings in this text. For MNIST, the uniform encoding cost is 60000 × log2 10 = 199 kbits. For CIFAR, the uniform encoding cost is 50000 × log2 10 = 166 kbits.

2.4 Mutual information between inputs and outputs

Intuitively, the only way to beat a trivial encoding of the outputs is to use the mutual information (in a loose sense) between the inputs and outputs.

This can be formalized as follows. Assume that the inputs and outputs follow a "true" joint distribution q(x, y). Then any transmission method with codelength L satisfies (MacKay, 2003)

E_q[L(y|x)] ≥ H(y|x)    (2.3)

Therefore, the gain (per data point) between the codelength L and the trivial codelength H(y) is

H(y) − E_q[L(y|x)] ≤ H(y) − H(y|x) = I(y; x)    (2.4)

the mutual information between inputs and outputs (MacKay, 2003). Thus, the gain of any codelength compared to the uniform code is limited by the amount of mutual information between input and output.
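To make the bound concrete, here is a toy computation (my own example) where y is a deterministic function of x, so the whole entropy of the labels is mutual information and can in principle be saved relative to the uniform code:

```python
import math

def entropy_bits(dist):
    # Shannon entropy H, in bits, of a discrete distribution
    return -sum(p * math.log2(p) for p in dist if p > 0)

# Toy joint distribution q(x, y): two equiprobable inputs, and y = x.
H_y = entropy_bits([0.5, 0.5])   # H(y) = 1 bit
H_y_given_x = 0.0                # y is a deterministic function of x
mutual_info = H_y - H_y_given_x  # maximal per-sample gain I(y; x) = 1 bit
```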
(This bound is reached with the true model q(y|x).) Any successful compression of the labels is, at the same time, a direct estimation of the mutual information between input and output. The latter is the central quantity in the Information Bottleneck approach to deep learning models (Shwartz-Ziv and Tishby, 2017).

Note that this still makes sense without assuming a true underlying probabilistic model, by replacing the mutual information H(y) − H(y|x) with the "absolute" mutual information K(y) − K(y|x) based on Kolmogorov complexity K (Li and Vitányi, 2008).

3 Compression Bounds via Deep Learning

Various compression methods from the MDL toolbox can be used on deep learning models. (Note that a given model can be stored or encoded in several ways, some of which may have large codelengths. A good model in the MDL sense is one that admits at least one good encoding.)

3.1 Two-Part Encodings

Alice and Bob can first agree on a model class (such as "neural networks with two layers and 1,000 neurons per layer"). However, Bob does not have access to the labels, so Bob cannot train the parameters of the model. Therefore, if Alice wants to use such a parametric model, the parameters themselves have to be transmitted. Such codings, in which Alice first transmits the parameters of a model and then encodes the data using these parameters, have been called two-part codes (Grünwald, 2007).

Definition 1 (Two-part codes). Assume that Alice and Bob have first agreed on a model class (pθ)θ∈Θ. Let Lparam(θ) be any encoding scheme for parameters θ ∈ Θ. Let θ* be any parameter.
The corresponding two-part codelength is

L^{2-part}_{θ*}(y1:n|x1:n) := Lparam(θ*) + L_{pθ*}(y1:n|x1:n) = Lparam(θ*) − Σ_{i=1}^n log2 pθ*(yi|xi)    (3.1)

An obvious possible code Lparam for θ is the standard float32 binary encoding of θ, for which Lparam(θ) = 32 dim(θ). In deep learning, two-part codes are widely inefficient and much worse than the uniform encoding (Graves, 2011). For a model with 1 million parameters, the two-part code with float32 binary encoding will amount to 32 Mbits, or 200 times the uniform encoding on CIFAR10.

3.2 Network Compression

The practical encoding of trained models is a well-developed research topic, e.g., for use on small devices such as cell phones. Such encodings can be seen as two-part codes using a clever code for θ instead of encoding every parameter on 32 bits. Possible strategies include training a student layer to approximate a well-trained network (Ba and Caruana, 2014; Romero et al., 2015), or pipelines involving retraining, pruning, and quantization of the model weights (Han et al., 2015a,b; Simonyan and Zisserman, 2014; Louizos et al., 2017; See et al., 2016; Ullrich et al., 2017). Still, the resulting codelengths (for compressing the labels given the data) are way above the uniform compression bound for image classification (Table 1).

Another scheme for network compression, less used in practice but very informative, is to sample a random low-dimensional affine subspace in parameter space and to optimize in this subspace (Li et al., 2018). The number of parameters is thus reduced to the dimension of the subspace and we can use the associated two-part encoding. (The random subspace can be transmitted via a pseudorandom seed.)
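The 32 Mbits figure is easy to verify; a quick check (the 1-million-parameter network is the illustrative example above, and the CIFAR10 uniform cost comes from Section 2.3):

```python
import math

def two_part_codelength_bits(n_params, data_bits, bits_per_param=32):
    """Two-part codelength (3.1) with the naive float32 parameter code:
    L_param(theta) = 32 * dim(theta), plus the data codelength under p_theta."""
    return bits_per_param * n_params + data_bits

# Even with a perfect fit (data_bits = 0), a 1M-parameter network costs
# 32 Mbits to transmit, roughly 200 times the uniform code on CIFAR10:
params_only = two_part_codelength_bits(10**6, data_bits=0)
cifar_uniform = 50000 * math.log2(10)   # ~166 kbits
ratio = params_only / cifar_uniform
```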
Our methodology to derive compression bounds from (Li et al., 2018) is detailed in Appendix B.

3.3 Variational and Bayesian Codes

Another strategy for encoding weights with a limited precision is to represent these weights by random variables: the uncertainty on θ represents the precision with which θ is transmitted. The variational code turns this into an explicit encoding scheme, thanks to the bits-back argument (Honkela and Valpola, 2004). Initially a way to compute codelength bounds with neural networks (Hinton and Van Camp, 1993), this is now often seen as a regularization technique (Blundell et al., 2015). This method yields the following codelength.

Definition 2 (Variational code). Assume that Alice and Bob have agreed on a model class (pθ)θ∈Θ and a prior α over Θ. Then for any distribution β over Θ, there exists an encoding with codelength

L^var_β(y1:n|x1:n) = KL(β‖α) + E_{θ∼β}[L_{pθ}(y1:n|x1:n)] = KL(β‖α) − E_{θ∼β}[Σ_{i=1}^n log2 pθ(yi|xi)]    (3.2)

This can be minimized over β, by choosing a parametric model class (βφ)φ∈Φ and minimizing (3.2) over φ. A common model class for β is the set of multivariate Gaussian distributions {N(µ, Σ), µ ∈ R^d, Σ diagonal}, and µ and Σ can be optimized with a stochastic gradient descent algorithm (Graves, 2011; Kucukelbir et al., 2017). Σ can be interpreted as the precision with which the parameters are encoded.

The variational bound L^var_β is an upper bound for the Bayesian description length bound of the Bayesian model pθ with parameter θ and prior α. Considering the Bayesian distribution of y,

pBayes(y1:n|x1:n) = ∫_{θ∈Θ} pθ(y1:n|x1:n) α(θ) dθ,    (3.3)

Proposition 1 provides an associated code via (2.1) with model pBayes: LBayes(y1:n|x1:n) = − log2 pBayes(y1:n|x1:n). Then, for any β we have (Graves, 2011)

L^var_β(y1:n|x1:n) ≥ LBayes(y1:n|x1:n)    (3.4)

with equality if and only if β is equal to the Bayesian posterior pBayes(θ|x1:n, y1:n). Variational methods can be used as approximate Bayesian inference for intractable Bayesian posteriors.

We computed practical compression bounds with variational methods on MNIST and CIFAR10. Neural networks that give the best variational compression bounds appear to be smaller than networks trained the usual way. We tested various fully connected networks and convolutional networks (Appendix C): the models that gave the best variational compression bounds were small LeNet-like networks. To test the link between compression and test accuracy, in Table 1 we report the best model based on compression, not test accuracy. This results in a drop of test accuracy with respect to other settings.

On MNIST, this provides a codelength of the labels (knowing the inputs) of 24.1 kbits, i.e., a compression ratio of 0.12. The corresponding model achieved 95.5% accuracy on the test set. On CIFAR, we obtained a codelength of 89.0 kbits, i.e., a compression ratio of 0.54. The corresponding model achieved 61.6% classification accuracy on the test set.

We can make two observations. First, choosing the model class which minimizes the variational codelength selects smaller deep learning models than cross-validation would. Second, the model with the best variational codelength has low classification accuracy on the test set on MNIST and CIFAR, compared to models trained in a non-variational way.
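For concreteness, the variational objective (3.2) can be sketched for the diagonal Gaussian posterior class mentioned above, with the expectation estimated by Monte Carlo. This is a minimal illustration, not the paper's implementation; the function names and interface are mine, and in practice µ and σ are optimized by stochastic gradient descent (Graves, 2011):

```python
import math
import random

def gaussian_kl_bits(mu, sigma, prior_sigma=1.0):
    """KL(beta || alpha) in bits, for a diagonal Gaussian posterior
    beta = N(mu, diag(sigma^2)) and an isotropic prior N(0, prior_sigma^2)."""
    nats = sum(math.log(prior_sigma / s) + (s * s + m * m) / (2 * prior_sigma ** 2) - 0.5
               for m, s in zip(mu, sigma))
    return nats / math.log(2)

def variational_codelength_bits(mu, sigma, data_bits_of_theta, n_mc=100):
    """Monte Carlo estimate of (3.2): the KL term plus the expected data
    codelength E_{theta ~ beta}[L_{p_theta}(y_1:n | x_1:n)], where
    data_bits_of_theta(theta) returns the cross-entropy in bits under p_theta."""
    kl = gaussian_kl_bits(mu, sigma)
    thetas = [[random.gauss(m, s) for m, s in zip(mu, sigma)] for _ in range(n_mc)]
    return kl + sum(data_bits_of_theta(t) for t in thetas) / n_mc
```

When the posterior equals the prior, the KL term vanishes and no information about the weights is transmitted; narrowing σ around good weights lowers the data term but is paid for in the KL term.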
This aligns with a common criticism of Bayesian methods as too conservative for model selection compared with cross-validation (Rissanen et al., 1992; Foster and George, 1994; Barron and Yang, 1999; Grünwald, 2007).

3.4 Prequential or Online Code

The next coding procedure shows that deep neural models which generalize well also compress well. The prequential (or online) code is a way to encode both the model and the labels without directly encoding the weights, based on the prequential approach to statistics (Dawid, 1984), by using prediction strategies. Intuitively, a model with default values is used to encode the first few data; then the model is trained on these few encoded data; this partially trained model is used to encode the next data; then the model is retrained on all data encoded so far; and so on.

Precisely, we call p a prediction strategy for predicting the labels in Y knowing the inputs in X if for all k, p(yk+1|x1:k+1, y1:k) is a conditional model; namely, any strategy for predicting the (k+1)-th label after already having seen k input-output pairs. In particular, such a model may learn from the first k data samples. Any prediction strategy p defines a model on the whole dataset:

ppreq(y1:n|x1:n) = p(y1|x1) · p(y2|x1:2, y1) · ... · p(yn|x1:n, y1:n−1)    (3.5)

Let (pθ)θ∈Θ be a deep learning model. We assume that we have a learning algorithm which computes, from any number of data samples (x1:k, y1:k), a trained parameter vector θ̂(x1:k, y1:k). Then the data is encoded in an incremental way: at each step k, θ̂(x1:k, y1:k) is used to predict yk+1.

In practice, the learning procedure θ̂ may only reset and retrain the network at certain timesteps. We choose timesteps 1 = t0 < t1 < ... < tS = n, and we encode the data by blocks, always using the model learned from the already transmitted data (Algorithm 2 in Appendix D).
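The blockwise procedure can be illustrated with a toy stand-in for the learning algorithm; here, a Laplace-smoothed class-frequency model (which ignores the inputs x) replaces the neural network. This sketch is mine and is not the paper's Algorithm 2:

```python
import math

def prequential_codelength_bits(labels, n_classes, timesteps):
    """Toy prequential code: labels is the list y_1..y_n, timesteps is
    [t_1, ..., t_S] with t_S = n. The first t_1 labels are sent with the
    uniform code; each later block is encoded with a class-frequency
    model 'trained' on all previously transmitted labels."""
    total = timesteps[0] * math.log2(n_classes)      # uniform code for block 0
    for start, end in zip(timesteps, timesteps[1:]):
        counts = [1] * n_classes                     # Laplace smoothing
        for y in labels[:start]:                     # train on data seen so far
            counts[y] += 1
        norm = sum(counts)
        for y in labels[start:end]:                  # encode the next block
            total += -math.log2(counts[y] / norm)
    return total
```

With a real network, the frequency counts are replaced by retraining θ̂ on the transmitted samples, and the per-label cost by −log2 pθ̂(y|x).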
A uniform encoding is used for the first few points. (Even though the encoding procedure is called "online", this does not mean that only the most recent sample is used to update the parameter θ̂: the optimization procedure θ̂ can be any predefined technique using all the previous samples (x1:k, y1:k), only requiring that the algorithm has an explicit stopping criterion.) This yields the following description length:

Definition 3 (Prequential code). Given a model pθ, a learning algorithm θ̂(x1:k, y1:k), and retraining timesteps 1 = t0 < t1 < ... < tS = n, the prequential codelength is

Lpreq(y1:n|x1:n) = t1 log2 K + Σ_{s=0}^{S−1} − log2 p_{θ̂_{t_s}}(y_{t_s+1 : t_{s+1}} | x_{t_s+1 : t_{s+1}})    (3.6)

where for each s, θ̂_{t_s} = θ̂(x_{1:t_s}, y_{1:t_s}) is the parameter learned on data samples 1 to t_s.

The model parameters are never encoded explicitly in this method. The difference between the prequential codelength Lpreq(y1:n|x1:n) and the log-loss Σ_{t=1}^n − log2 p_{θ̂_{t_S}}(yt|xt) of the final trained model can be interpreted as the amount of information that the trained parameters contain about the data: the former is the data codelength if Bob does not know the parameters, while the latter is the codelength of the same data knowing the parameters.

Prequential codes depend on the performance of the underlying training algorithm, and take advantage of the model's generalization ability from the previous data to the next. In particular, the model training should yield good generalization performance from data [1; ts] to data [ts + 1; ts+1].

In practice, optimization procedures for neural networks may be stochastic (initial values, dropout, data augmentation...), and Alice and Bob need to make all the same random actions in order to get the same final model.
A possibility is to agree on a random seed ω (or pseudorandom numbers) beforehand, so that the random optimization procedure θ̂(x1:ts, y1:ts) is deterministic given ω. Hyperparameters may also be transmitted first (the cost of sending a few numbers is small).

Prequential coding with deep models provides excellent compression bounds. On MNIST, we computed the description length of the labels with different networks (Appendix D). The best compression bound was given by a convolutional network of depth 8. It achieved a description length of 4.10 kbits, i.e., a compression ratio of 0.021, with 99.5% test set accuracy (Table 1). This codelength is 6 times smaller than the variational codelength.

On CIFAR, we tested a simple multilayer perceptron, a shallow network, a small convolutional network, and a VGG convolutional network (Simonyan and Zisserman, 2014), first without data augmentation or batch normalization (VGGa) (Ioffe and Szegedy, 2015), then with both of them (VGGb) (Appendix D). The results are in Figure 2. The best compression bound was obtained with VGGb, achieving a codelength of 45.3 kbits, i.e., a compression ratio of 0.27, and 93% test set accuracy (Table 1). This codelength is half the variational codelength. The difference between VGGa and VGGb also shows the impact of the training procedure on codelengths for a given architecture.

Model Switching. A weakness of prequential codes is the catch-up phenomenon (Van Erven et al., 2012). Large architectures might overfit during the first steps of the prequential encoding, when the model is trained with few data samples. Thus the encoding cost of the first packs of data might be worse than with the uniform code. Even after the encoding cost on current labels becomes lower, the cumulated codelength may need a long time to "catch up" on its initial lag. This can be observed in practice with neural networks: in Fig.
2, the VGGb model needs 5,000 samples on CIFAR to reach a cumulative compression ratio < 1, even though the encoding cost per label drops below uniform after just 1,000 samples. This is efficiently solved by switching (Van Erven et al., 2012) between models (see Appendix E). Switching further improves the practical compression bounds, even when just switching between copies of the same model with different SGD stopping times (Fig. 3, Table 2).

4 Discussion

Too Many Parameters in Deep Learning Models? From an information theory perspective, the goal of a model is to extract as much mutual information between the labels and inputs as possible; equivalently (Section 2.4), to compress the labels. This cannot be achieved with 2-part codes or practical network compression. With the variational code, the models do compress the data, but with a worse prediction performance: one could conclude that deep learning models that achieve the best prediction performance cannot compress the data.

Figure 2: Prequential code results on CIFAR. Results of prequential encoding on CIFAR with 5 different models: a small multilayer perceptron (MLP), a shallow network, a small convolutional network (tinyCNN), a VGG-like network without data augmentation and batch normalization (VGGa), and the same VGG-like architecture with data augmentation and batch normalization (VGGb) (see Appendix D). Performance is reported during online training, as a function of the number of samples seen so far. Top left: codelength per sample (log loss) on a pack of data [tk; tk+1) given data [1; tk). Bottom left: test accuracy on a pack of data [tk; tk+1) given data [1; tk), as a function of tk. Top right: difference between the prequential cumulated codelength on data [1; tk] and the uniform encoding.
Bottom right: compression ratio of the prequential code on data [1; tk].

Thanks to the prequential code, we have seen that deep learning models, even with a large number of parameters, compress the data well: from an information theory point of view, the number of parameters is not an obstacle to compression. This is consistent with Chaitin's hypothesis that "comprehension is compression", contrary to previous observations with the variational code.

Prequential Code and Generalization. The prequential encoding shows that a model that generalizes well for every dataset size will compress well. The efficiency of the prequential code is directly due to the generalization ability of the model at each time.

Theoretically, three of the codes (two-part, Bayesian, and prequential based on a maximum likelihood or MAP estimator) are known to be asymptotically equivalent under strong assumptions (d-dimensional identifiable model, data coming from the model, suitable Bayesian prior, and technical assumptions ensuring the effective dimension of the trained model is not lower than d): in that case, these three methods yield a codelength L(y1:n|x1:n) = nH(Y|X) + (d/2) log2 n + O(1) (Grünwald, 2007). This corresponds to the BIC criterion for model selection. Hence there was no obvious reason for the prequential code to be an order of magnitude better than the others.

However, deep learning models do not usually satisfy any of these hypotheses. Moreover, our prequential codes are not based on the maximum likelihood estimator at each step, but on standard deep learning methods (so training is regularized at least by dropout and early stopping).

Inefficiency of Variational Methods for Deep Networks. The objective of variational methods is equivalent to minimizing a description length.
Thus, on our image classification tasks, variational methods do not have good results even for their own objective, compared to prequential codes. This makes their relatively poor results at test time less surprising.

Understanding this observed inefficiency of variational methods is an open problem. As stated in (3.4), the variational codelength is an upper bound for the Bayesian codelength. More precisely,

L^var_β(y1:n|x1:n) = LBayes(y1:n|x1:n) + KL(pBayes(θ|x1:n, y1:n)‖β)    (4.1)

with notation as above, and with pBayes(θ|x1:n, y1:n) the Bayesian posterior on θ given the data. Empirically, on MNIST and CIFAR, we observe that Lpreq(y1:n|x1:n) ≪ L^var_β(y1:n|x1:n).

Several phenomena could contribute to this gap. First, the optimization of the parameters φ of the approximate Bayesian posterior might be imperfect. Second, even the optimal distribution β* in the variational class might not approximate the posterior pBayes(θ|x1:n, y1:n) well, leading to a large KL term in (4.1); this would be a problem with the choice of variational posterior class β. On the other hand we do not expect the choice of Bayesian prior to be a key factor: we tested Gaussian priors with various variances as well as a conjugate Gaussian prior, with similar results. Moreover, Gaussian initializations and L2 weight decay (acting like a Gaussian prior) are common in deep learning. Finally, the (intractable) Bayesian codelength based on the exact posterior might itself be larger than the prequential codelength.
This would be a problem of underfitting with parametric Bayesian inference, perhaps related to the catch-up phenomenon or to the known conservatism of Bayesian model selection (end of Section 3.3).

5 Conclusion

Deep learning models can represent the data together with the model in fewer bits than a naive encoding, despite their many parameters. However, we were surprised to observe that variational inference, though explicitly designed to minimize such codelengths, provides very poor codelength values compared to a simple incremental coding scheme. Understanding this limitation of variational inference is a topic for future research.

Acknowledgments

First, we would like to thank the reviewers for their careful reading and their questions and comments. We would also like to thank Corentin Tallec for his technical help, and David Lopez-Paz, Moustapha Cissé, Gaétan Marceau Caron and Jonathan Laurent for their helpful comments and advice.

References

A. Achille and S. Soatto. On the Emergence of Invariance and Disentangling in Deep Representations. arXiv preprint arXiv:1706.01350, 2017. URL http://arxiv.org/abs/1706.01350.

S. Arora, R. Ge, B. Neyshabur, and Y. Zhang. Stronger generalization bounds for deep nets via a compression approach. arXiv preprint arXiv:1802.05296, 2018.

L. J. Ba and R. Caruana. Do Deep Nets Really Need to be Deep? In Advances in Neural Information Processing Systems, pages 2654–2662, 2014.

A. Barron and Y. Yang. Information-theoretic determination of minimax rates of convergence. The Annals of Statistics, 27(5):1564–1599, 1999.

A. Blum and J. Langford. PAC-MDL Bounds. In B. Schölkopf and M. K. Warmuth, editors, Learning Theory and Kernel Machines, pages 344–357, Berlin, Heidelberg, 2003. Springer Berlin Heidelberg. ISBN 978-3-540-45167-9.

C. Blundell, J. Cornebise, K. Kavukcuoglu, and D. Wierstra. Weight Uncertainty in Neural Networks. In International Conference on Machine Learning, pages 1613–1622, 2015.

G. J. Chaitin. On the intelligibility of the universe and the notions of simplicity, complexity and irreducibility. In Thinking about Gödel and Turing: Essays on Complexity, 1970–2007. World Scientific, 2007.

A. P. Dawid. Present Position and Potential Developments: Some Personal Views: Statistical Theory: The Prequential Approach. Journal of the Royal Statistical Society. Series A (General), 147(2):278, 1984.

G. K. Dziugaite and D. M. Roy. Computing Nonvacuous Generalization Bounds for Deep (Stochastic) Neural Networks with Many More Parameters than Training Data. In Proceedings of the Thirty-Third Conference on Uncertainty in Artificial Intelligence, Sydney, 2017.

D. P. Foster and E. I. George. The Risk Inflation Criterion for Multiple Regression. The Annals of Statistics, 22(4):1947–1975, 1994.

T. Gao and V. Jojic. Degrees of Freedom in Deep Neural Networks. arXiv preprint arXiv:1603.09260, 2016.

A. Graves. Practical Variational Inference for Neural Networks. In Neural Information Processing Systems, 2011.

P. D. Grünwald. The Minimum Description Length Principle. MIT Press, 2007.

S. Han, H. Mao, and W. J. Dally. Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding. arXiv preprint arXiv:1510.00149, 2015a.

S. Han, J. Pool, J. Tran, and W. J. Dally. Learning both Weights and Connections for Efficient Neural Networks. In Advances in Neural Information Processing Systems, 2015b.

G. E. Hinton and D. Van Camp. Keeping Neural Networks Simple by Minimizing the Description Length of the Weights. In Proceedings of the Sixth Annual Conference on Computational Learning Theory. ACM, 1993.

A. Honkela and H. Valpola. Variational Learning and Bits-Back Coding: An Information-Theoretic View to Bayesian Learning. IEEE Transactions on Neural Networks, 15(4), 2004.

M. Hutter. On Universal Prediction and Bayesian Confirmation. Theoretical Computer Science, 384(1), 2007.

S. Ioffe and C. Szegedy. Batch Normalization: Accelerating Deep Network Training by Reducing Internal Covariate Shift. In International Conference on Machine Learning, pages 448–456, 2015.

D. P. Kingma and M. Welling. Auto-Encoding Variational Bayes. arXiv preprint arXiv:1312.6114, 2013.

A. Krizhevsky. Learning Multiple Layers of Features from Tiny Images. 2009.

A. Kucukelbir, D. Tran, R. Ranganath, A. Gelman, and D. M. Blei. Automatic Differentiation Variational Inference. Journal of Machine Learning Research, 18:1–45, 2017.

Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-Based Learning Applied to Document Recognition. Proceedings of the IEEE, 86(11), 1998.

Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.

C. Li, H. Farkhoor, R. Liu, and J. Yosinski. Measuring the Intrinsic Dimension of Objective Landscapes. arXiv preprint arXiv:1804.08838, 2018.

M. Li and P. Vitányi. An Introduction to Kolmogorov Complexity. Springer, 2008.

C. Louizos, K. Ullrich, and M. Welling. Bayesian compression for deep learning. In Advances in Neural Information Processing Systems, pages 3290–3300, 2017.

D. J. C. MacKay. Information Theory, Inference, and Learning Algorithms. Cambridge University Press, 2003.

Y. Ollivier. Auto-encoders: reconstruction versus compression. arXiv preprint arXiv:1403.7752, 2014. URL http://arxiv.org/abs/1403.7752.

J. Rissanen. Information and Complexity in Statistical Modeling. Springer Science & Business Media, 2007.

J. Rissanen, T. Speed, and B. Yu. Density estimation by stochastic complexity. IEEE Transactions on Information Theory, 38(2):315–323, 1992.

A. Romero, N. Ballas, S. E. Kahou, A. Chassang, C. Gatta, and Y. Bengio. FitNets: Hints for Thin Deep Nets. In Proceedings of the International Conference on Learning Representations, 2015.

J. Schmidhuber. Discovering Neural Nets with Low Kolmogorov Complexity and High Generalization Capability. Neural Networks, 10(5):857–873, 1997.

A. See, M.-T. Luong, and C. D. Manning. Compression of Neural Machine Translation Models via Pruning. arXiv preprint arXiv:1606.09274, 2016.

C. Shannon. A mathematical theory of communication. The Bell System Technical Journal, 27, 1948.

R. Shwartz-Ziv and N. Tishby. Opening the Black Box of Deep Neural Networks via Information. arXiv preprint arXiv:1703.00810, 2017.

K. Simonyan and A. Zisserman. Very Deep Convolutional Networks for Large-Scale Image Recognition. arXiv preprint arXiv:1409.1556, 2014.

R. Solomonoff. A formal theory of inductive inference. Information and Control, 1964.

C. Tallec and L. Blier. Pyvarinf: Variational Inference for PyTorch, 2018. URL https://github.com/ctallec/pyvarinf.

N. Tishby and N. Zaslavsky. Deep Learning and the Information Bottleneck Principle. In Information Theory Workshop, pages 1–5. IEEE, 2015.

K. Ullrich, E. Meeds, and M. Welling. Soft Weight-Sharing for Neural Network Compression. arXiv preprint arXiv:1702.04008, 2017.

T. Van Erven, P. Grünwald, and S. De Rooij. Catching Up Faster by Switching Sooner: A predictive approach to adaptive estimation with an application to the AIC-BIC Dilemma. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 74(3):361–417, 2012.

T.-B. Xu, P. Yang, X.-Y. Zhang, and C.-L. Liu. Margin-Aware Binarized Weight Networks for Image Classification. In International Conference on Image and Graphics, pages 590–601. Springer, Cham, 2017.

S. Zagoruyko. 92.45% on CIFAR-10 in Torch, 2015. URL http://torch.ch/blog/2015/07/30/cifar.html.

C. Zhang, S. Bengio, M. Hardt, B. Recht, and O. Vinyals. Understanding deep learning requires rethinking generalization. In Proceedings of the International Conference on Learning Representations, 2017.