{"title": "Scalable Model Selection for Belief Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 4609, "page_last": 4619, "abstract": "We propose a scalable algorithm for model selection in sigmoid belief networks (SBNs), based on the factorized asymptotic Bayesian (FAB) framework. We derive the corresponding generalized factorized information criterion (gFIC) for the SBN, which is proven to be statistically consistent with the marginal log-likelihood. To capture the dependencies within hidden variables in SBNs, a recognition network is employed to model the variational distribution. The resulting algorithm, which we call FABIA, can simultaneously execute both model selection and inference by maximizing the lower bound of gFIC. On both synthetic and real data, our experiments suggest that FABIA, when compared to state-of-the-art algorithms for learning SBNs, $(i)$ produces a more concise model, thus enabling faster testing; $(ii)$ improves predictive performance; $(iii)$ accelerates convergence; and $(iv)$ prevents overfitting.", "full_text": "Scalable Model Selection for Belief Networks\n\nZhao Song\u2020, Yusuke Muraoka\u2217, Ryohei Fujimaki\u2217, Lawrence Carin\u2020\n\n\u2020Department of ECE, Duke University\n\nDurham, NC 27708, USA\n\n{zhao.song, lcarin}@duke.edu\n\n\u2217NEC Data Science Research Laboratories\n\nCupertino, CA 95014, USA\n\n{ymuraoka, rfujimaki}@nec-labs.com\n\nAbstract\n\nWe propose a scalable algorithm for model selection in sigmoid belief networks\n(SBNs), based on the factorized asymptotic Bayesian (FAB) framework. We derive\nthe corresponding generalized factorized information criterion (gFIC) for the SBN,\nwhich is proven to be statistically consistent with the marginal log-likelihood. To\ncapture the dependencies within hidden variables in SBNs, a recognition network\nis employed to model the variational distribution. 
The resulting algorithm, which we call FABIA, can simultaneously execute both model selection and inference by maximizing the lower bound of gFIC. On both synthetic and real data, our experiments suggest that FABIA, when compared to state-of-the-art algorithms for learning SBNs, (i) produces a more concise model, thus enabling faster testing; (ii) improves predictive performance; (iii) accelerates convergence; and (iv) prevents overfitting.

1 Introduction

The past decade has witnessed a dramatic increase in the popularity of deep learning [20], stemming from its state-of-the-art performance across many domains, including computer vision [19], reinforcement learning [27], and speech recognition [15]. However, one important issue in deep learning is that its performance is largely determined by the underlying model: a larger and deeper network tends to possess more representational power, but at the cost of being more prone to overfitting [32] and requiring more computation. The latter issue presents a challenge for deployment to devices with constrained resources [2]. Inevitably, an appropriate model-selection method is required to achieve good performance. Model selection is here the task of selecting the number of layers and the number of nodes in each layer.

Despite the rapid advancement in the performance of deep models, little work has been done to address the problem of model selection. As a basic approach, cross-validation selects a model according to a validation score. However, this is not scalable, as its complexity is exponential with respect to the number of layers in the network: $O(J_{MAX}^{L_{MAX}})$, where $J_{MAX}$ and $L_{MAX}$ represent the maximum allowed number of nodes in each layer and the maximum number of layers, respectively.
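To make the exponential blow-up concrete, here is a quick count of the architecture grid that cross-validation would have to score; the values of $J_{MAX}$ and $L_{MAX}$ are illustrative, not taken from the paper:

```python
# Cross-validation scores one model per candidate architecture: every choice
# of width in 1..J_MAX for each of L_MAX layers, i.e. O(J_MAX ** L_MAX) runs.
J_MAX, L_MAX = 50, 3  # illustrative bounds, not values used in the paper

n_candidates = J_MAX ** L_MAX
print(n_candidates)  # 125000 candidate architectures at depth L_MAX alone
```

Even for these modest bounds, an exhaustive search is already impractical, which motivates the gradient-based shrinkage approach developed below.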
In Alvarez and Salzmann [2], a constrained optimization approach was proposed to infer the number of nodes in convolutional neural networks (CNNs); the key idea is to incorporate a sparse group Lasso penalty term to shrink all edges flowing into a node. Based on the shrinkage mechanism of the truncated gamma-negative binomial process, Zhou et al. [36] showed that the number of nodes in Poisson gamma belief networks (PGBNs) can be learned. Furthermore, we empirically observe that the shrinkage priors employed in Gan et al. [11], Henao et al. [14], and Song et al. [31] can potentially perform model selection in certain tasks, even though this was not explicitly discussed in those works. One common problem for these approaches, however, is that the hyperparameters need to be tuned in order to achieve good performance, which may be time-consuming for some applications involving deep networks.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

The factorized asymptotic Bayesian (FAB) approach has recently been shown to be a scalable model-selection framework for latent variable models. Originally proposed for mixture models [9], it was later extended to the hidden Markov model (HMM) [8], latent feature model (LFM) [12], and relational model [22]. By maximizing the approximate marginal log-likelihood, FAB introduces an $\ell_0$ regularization term on latent variables, which can automatically estimate the model structure by eliminating irrelevant latent features through an expectation-maximization [7] (EM)-like alternating optimization, with low computational cost.

We develop here a scalable model-selection algorithm within the FAB framework to infer the size of SBNs [28], a popular component of deep models, e.g., deep belief networks (DBNs) [16] and deep Poisson factor analysis (DPFA) [10]; we assume here that the depth of the SBN is fixed.
Since the mean-field assumption used in FAB does not hold in SBNs, we employ a recognition network [18, 29, 25, 26] to represent the variational distribution. As our method combines the advantages of the FAB inference and auto-encoding variational Bayesian (VB) frameworks, we term it FABIA. To handle large datasets, we also derive a scalable version of FABIA with mini-batches. As opposed to previous works, which predefine the SBN size [28, 30, 25, 5, 11, 6, 31, 26], FABIA determines it automatically.

It should be noted that model selection in SBNs is more challenging than in CNNs and feedforward neural networks (FNNs). As shown in Figure 1, simply imposing a sparsity prior or a group sparsity prior as employed in CNNs [2] and SBNs [11, 14, 31] does not necessarily shrink a node in an SBN, since such approaches cannot guarantee to shrink all edges connected to a node.

FABIA possesses the following distinguishing features: (i) a theoretical guarantee that its objective function, the generalized factorized information criterion (gFIC), is statistically consistent with the model's marginal log-likelihood; and (ii) prevention of overfitting in large networks when the amount of training data is not sufficiently large, thanks to an intrinsic shrinkage mechanism. We also detail that FABIA has important connections with previous work on model regularization, such as Dropout [32], DropConnect [35], shrinkage priors [11, 36, 14, 31], and automatic relevance determination (ARD) [34].

Figure 1: Requirement for removal of nodes in (Left) SBN and (Right) FNN (dashed circles denote nodes that can be removed). Note that a node in the SBN can be removed only if all of its connected edges shrink.
For FNN, shrinkage of all incoming edges eliminates a node.

2 Background

An SBN is a directed graphical model for which the distribution of each layer is determined by the preceding layer via the sigmoid function, defined as $\sigma(x) \triangleq 1/[1 + \exp(-x)]$. Let $h^{(l)}$ denote the $l$th hidden layer with $J_l$ units, and $v$ represent the visible layer with $M$ units. The generative model of the SBN, with $L$ hidden layers, is represented as

$$p(h^{(L)}|b) = \prod_{i=1}^{J_L} [\sigma(b_i)]^{h_i^{(L)}} [\sigma(-b_i)]^{1-h_i^{(L)}}, \qquad p(h^{(l)}|h^{(l+1)}) = \prod_{i=1}^{J_l} [\sigma(\psi_i^{(l)})]^{h_i^{(l)}} [\sigma(-\psi_i^{(l)})]^{1-h_i^{(l)}},$$

where $l = 1, \dots, L-1$, $\psi_i^{(l)} = W_{i\cdot}^{(l)} h^{(l+1)} + c_i^{(l)}$, and $b$ corresponds to prior parameters; the notation $W_{i\cdot}$ means the $i$th row of a matrix. For the link function of the visible layer, i.e., $p(v|h^{(1)})$, we use the sigmoid function for binary data and the multinomial function for count data, as in Mnih and Gregor [25] and Carlson et al. [6].

One difficulty of learning SBNs is the evaluation of the expectation with respect to the posterior distribution of hidden variables [31]. In Mnih and Gregor [25], a recognition network under the variational auto-encoding (VAE) framework [18] was proposed to approximate this intractable expectation. Compared with the Gibbs sampler employed in Gan et al. [11], Carlson et al. [6], and Song et al. [31], the recognition network enables fast sampling of hidden variables in blocks. The variational parameters in the recognition network can be learned via stochastic gradient descent (SGD), as shown in the neural variational inference and learning (NVIL) algorithm [25], for which multiple variance reduction techniques have been proposed to obtain better gradient estimates.
Note that all previous work on learning SBNs assumes that a model with a fixed number of nodes in each layer has been provided.

To select a model for an SBN, we follow the FAB framework [9], which infers the structure of a latent variable model by Bayesian inference. Let $\theta = \{W, b, c\}$ denote the model parameters and $\mathcal{M}$ be the model, with the goal in the FAB framework being to obtain the following maximum-likelihood (ML) estimate:

$$\widehat{\mathcal{M}}_{ML} = \arg\max_{\mathcal{M}} \sum_{n=1}^{N} \ln p(v_n|\mathcal{M}) = \arg\max_{\mathcal{M}} \sum_{n=1}^{N} \ln \int \sum_{h_n} p(v_n, h_n|\theta) \, p(\theta|\mathcal{M}) \, d\theta \quad (1)$$

As a key feature of the FAB framework, the $\ell_0$ penalty term on $h_n$ induced by approximating (1) can remove irrelevant latent variables from the model ("shrinkage mechanism"). In practice, we can start from a large model and gradually reduce its size through this "shrinkage mechanism" until convergence.

Although a larger model has more representational capacity, a smaller model with similar predictive performance is preferred in practice, given a computational budget. A smaller model also enables faster testing, a desirable property in many machine learning tasks. Furthermore, a smaller model implies more robustness to overfitting, a common danger in deeper and larger models with insufficient training data.

Since the integration in (1) is in general intractable, Laplace's method [23] is employed in FAB inference for approximation. Consequently, gFIC can be derived as a surrogate function of the marginal log-likelihood. By maximizing the variational lower bound of gFIC, one obtains estimates of both parameters and the underlying model size. Note that while FAB inference uses the mean-field approximation for the variational distribution [9, 8, 22, 21], the same does not hold for SBNs, due to the correlation within hidden variables given the data.
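Concretely, the layer-wise generative model of Section 2 amounts to ancestral sampling through sigmoid links, top layer first. A minimal NumPy sketch (layer sizes and weights are illustrative, not from the paper):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def sample_sbn(W_list, b, c_list, rng):
    """Ancestral sampling through an SBN: draw the top layer h^(L) from the
    prior sigma(b), then each lower layer from sigma(W h + c), layer by layer;
    the final draw is the visible layer under a sigmoid (binary-data) link."""
    h = (rng.random(b.shape) < sigmoid(b)).astype(float)   # top layer h^(L)
    for W, c in zip(W_list, c_list):                       # l = L-1, ..., 1, v
        h = (rng.random(c.shape) < sigmoid(W @ h + c)).astype(float)
    return h

rng = np.random.default_rng(0)
J2, J1, M = 5, 10, 30                       # illustrative sizes (cf. Section 5.1)
W_list = [rng.normal(size=(J1, J2)), rng.normal(size=(M, J1))]
c_list = [np.zeros(J1), np.zeros(M)]
v = sample_sbn(W_list, np.zeros(J2), c_list, rng)   # one binary visible sample
```

The posterior over the hidden layers given $v$ does not factorize, which is why the mean-field distribution used in earlier FAB work is a poor fit here.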
In contrast, the recognition network has been designed to approximate the posterior distribution of hidden variables with more fidelity [18, 29, 25]. Therefore, it can be a better candidate for the variational distribution in our task.

3 The FABIA Algorithm

3.1 gFIC for SBN

Following the FAB inference approach, we first lower bound the marginal log-likelihood in (1) via a variational distribution $q(h|\phi)$ as¹

$$\ln \int \sum_{h_n} p(v_n, h_n|\theta) \, p(\theta|\mathcal{M}) \, d\theta \ \geq\ \sum_{h_n} q(h_n|\phi) \ln \left[ \frac{\int p(v_n, h_n|\theta) \, p(\theta|\mathcal{M}) \, d\theta}{q(h_n|\phi)} \right].$$

By applying Laplace's method [23], we obtain

$$\ln p(v, h|\mathcal{M}) = \sum_{n=1}^{N} \ln p(v_n, h_n|\hat{\theta}) + \ln p(\hat{\theta}|\mathcal{M}) + \frac{D_\theta}{2} \ln\left(\frac{2\pi}{N}\right) - \frac{1}{2} \sum_{m=1}^{M} \ln|\Psi_m| + O(1) \quad (2)$$

where $D_\theta$ refers to the dimension of $\theta$, $\hat{\theta}$ represents the ML estimate of $\theta$, and $\Psi_m$ represents the negative Hessian of the log-likelihood with respect to $W_{m\cdot}$.

Since $\ln|\Psi_m|$ in (2) cannot be represented in analytical form, we must approximate it first, for the purpose of efficient optimization of the marginal log-likelihood. Following the gFIC [13] approach, we propose performing model selection in SBNs by introducing the shrinkage mechanism from this approximation. We start by providing the following assumptions, which are useful in the proof of our main theoretical result in Theorem 1.

Assumption 1.
The matrix $\sum_{n=1}^{N} \eta_n h_n h_n^T$ has full rank with probability 1 as $N \to \infty$, where $\eta_n \in (0, 1)$.

¹For derivation clarity, we assume only one hidden layer and drop the bias term in the SBN.

Note that this full-rank assumption implies that the SBN can preserve information in the large-sample limit, based on the degeneration analysis of gFIC [13].

Assumption 2. $h_{n,j}, \forall j$, is generated from a Bernoulli distribution as $h_{n,j} \sim \mathrm{Ber}(\tau_j)$, where $\tau_j > 0$.

Theorem 1. As $N \to \infty$, $\ln|\Psi_m|$ can be represented with the following equality:

$$\ln|\Psi_m| = \sum_{j} \left( \ln \sum_{n} h_{n,j} - \ln N \right) + O(1) \quad (3)$$

Proof. We first compute the negative Hessian as

$$\Psi_m = -\frac{1}{N} \frac{\partial}{\partial W_{m\cdot}^T \, \partial W_{m\cdot}} \sum_{n} \ln p(v_n, h_n|\theta) = \frac{1}{N} \sum_{n} \sigma(W_{m\cdot} h_n) \, \sigma(-W_{m\cdot} h_n) \, h_n h_n^T.$$

From Assumption 1, $\Psi_m$ has full rank, since $\sigma(x) \in (0, 1), \forall x \in \mathbb{R}$. Furthermore, the determinant of $\Psi_m$ is bounded, since $\Psi_{m,ij} \in (0, 1), \forall i, j$. Next, we define the following diagonal matrix

$$\Lambda \triangleq \mathrm{diag}\left[ \frac{\sum_n h_{n,1}}{N}, \dots, \frac{\sum_n h_{n,J}}{N} \right].$$

From Assumption 2, $\lim_{N\to\infty} Pr[\sum_n h_{n,j} = 0] = 0, \forall j$. Therefore, $\Lambda$ is full-rank and its determinant is bounded, when $N \to \infty$. Subsequently, we can decompose

$$\Psi_m = \Lambda F \quad (4)$$

where $F$ also has full rank and bounded determinant. Finally, applying the log-determinant operator to the right side of (4) leads to our conclusion.

To obtain the gFIC for the SBN, we first follow the previous FAB approaches [9, 12, 22] to assume the log-prior of $\theta$ to be constant with respect to $N$, i.e., $\lim_{N\to\infty} \frac{\ln p(\theta|\mathcal{M})}{N} = 0$.
We then apply Theorem 1 to (2) and have

$$\mathrm{gFIC}_{SBN} = \max_{q} \ \mathbb{E}_q \left[ \sum_{n=1}^{N} \ln p(v_n, h_n|\hat{\theta}) - \frac{M}{2} \sum_{j} \ln \left( \sum_{n} h_{n,j} \right) \right] + \frac{MJ - D_\theta}{2} \ln N + H(q) \quad (5)$$

where $H(q)$ is the entropy of the variational distribution $q(h)$.

As a key quantity in (5), $\frac{M}{2} \sum_j \ln(\sum_n h_{n,j})$ can be viewed as a regularizer over the model to execute model selection. This term directly operates on hidden nodes to perform shrinkage, which distinguishes our approach from previous work [11, 14, 31], where sparsity priors are assigned over edges. As illustrated in Figure 1, these earlier approaches do not necessarily shrink hidden nodes, as setting up a prior or a penalty term to shrink all edges connected to a node is very challenging in SBNs. Furthermore, the introduction of this quantity does not bring any cost of tuning parameters with cross-validation. In contrast, the Lagrange parameter in Alvarez and Salzmann [2] and the hyperparameters for priors in Gan et al. [11], Henao et al. [14], Zhou et al. [36], and Song et al. [31] all need to be properly set, which may be time-consuming in certain applications involving deep and large networks.

Under the same regularity conditions as Hayashi and Fujimaki [12], $\mathrm{gFIC}_{SBN}$ is statistically consistent with the marginal log-likelihood, an important property of the FAB framework.

Corollary 1. As $N \to \infty$, $\ln p(v|\mathcal{M}) = \mathrm{gFIC}_{SBN} + O(1)$.

Proof.
The conclusion holds as a direct extension of the consistency results in Hayashi and Fujimaki [12].

3.2 Optimization of gFIC

The $\mathrm{gFIC}_{SBN}$ in (5) cannot be directly optimized, because (i) the ML estimator $\hat{\theta}$ is in general not available, and (ii) evaluation of the expectation over hidden variables is computationally expensive. Instead, the proposed FABIA algorithm optimizes the lower bound

$$\mathrm{gFIC}_{SBN} \ \geq\ -\frac{M}{2} \sum_{j} \ln \left( \sum_{n} \mathbb{E}_q(h_{n,j}) \right) + \sum_{n=1}^{N} \mathbb{E}_q\left[ \ln p(v_n, h_n|\theta) \right] + H(q) \quad (6)$$

where we use the following facts to get the lower bound: (i) $p(v_n, h_n|\hat{\theta}) \geq p(v_n, h_n|\theta), \forall \theta$; (ii) the concavity of the logarithm function; (iii) $D_\theta \leq MJ$; and (iv) the maximum over all possible variational distributions $q$ in (5).

This leaves the choice of the form of the variational distribution. We could use the mean-field approximation as in previous FAB approaches [9, 8, 12, 13, 22, 21]. However, this approximation fails to capture the dependencies between hidden variables in the SBN, as discussed in Song et al. [31]. Instead, we follow the recent auto-encoding VB approach [18, 29, 25, 26] to model the variational distribution with a recognition network, which maps $v_n$ to $q(h_n|v_n, \phi)$. Specifically, $q(h_n|v_n, \phi) = \prod_{j=1}^{J} \mathrm{Ber}[\sigma(\phi_{j\cdot} v_n)]$, where $\phi \in \mathbb{R}^{J \times M}$ parameterizes the recognition network. Not only does using a recognition network allow us to more accurately model the variational distribution, it also enables faster sampling of hidden variables.

The optimization of the lower bound in (6) can be executed via SGD; we use the Adam algorithm [17] as our optimizer. To reduce gradient variance, we employ the NVIL algorithm to estimate gradients in both the generative and recognition networks.
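To make the objective concrete, the lower bound in (6) with a Bernoulli recognition network can be sketched as a one-sample NumPy estimate. Sizes, weight scales, and names here are made up for illustration; this is not the authors' Theano implementation:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def gfic_lower_bound(V, W, c, phi, rng):
    """One-sample estimate of the bound in (6): expected log-likelihood plus
    the entropy of q, minus the (M/2) * sum_j ln(sum_n E_q[h_nj]) shrinkage
    term. A sketch under illustrative assumptions, not a reference code."""
    N, M = V.shape
    q = sigmoid(V @ phi.T)                                  # q(h_nj=1|v_n), (N, J)
    H = -np.sum(q * np.log(q) + (1 - q) * np.log(1 - q))    # entropy of q
    shrink = 0.5 * M * np.sum(np.log(q.sum(axis=0)))        # node-wise penalty
    h = (rng.random(q.shape) < q).astype(float)             # one sample h ~ q
    p_v = sigmoid(h @ W.T + c)                              # p(v_n = 1 | h_n)
    loglik = np.sum(V * np.log(p_v) + (1 - V) * np.log(1 - p_v))
    return loglik + H - shrink

rng = np.random.default_rng(1)
N, M, J = 8, 30, 25                                         # illustrative sizes
V = (rng.random((N, M)) < 0.5).astype(float)                # toy binary data
W = 0.1 * rng.normal(size=(M, J))                           # generative weights
c = np.zeros(M)
phi = 0.1 * rng.normal(size=(J, M))                         # recognition weights
lb = gfic_lower_bound(V, W, c, phi, rng)
```

Note that only the penalty term involves the node totals $\sum_n \mathbb{E}_q(h_{n,j})$, which is why, as discussed next, the generative-model gradients coincide with NVIL while the recognition-network gradients acquire the shrinkage effect.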
We also note that other methods, such as the importance-sampled objectives method [5, 26, 24], can be used; such an extension is left for future work. Since $\frac{M}{2} \sum_j \ln[\sum_n \mathbb{E}_q(h_{n,j})]$ in (6) depends only on $q$, the gradients of the generative model in our FABIA algorithm and in NVIL should be the same. However, gradients of the recognition network in FABIA are regularized to shrink the model, which is lacking in the standard VAE framework.

We note that FABIA is a flexible framework, as its shrinkage term can be combined with any gradient-based variational auto-encoding method to perform model selection, where only minimal changes to the gradients of the recognition network of the original methods are necessary.

A node $j$ at level $l$ will be removed from the model if it satisfies $\frac{1}{N} \sum_{n=1}^{N} \mathbb{E}_q(h_{n,j}^{(l)}) \leq \epsilon^{(l)}$, where $\epsilon^{(l)}$ is a threshold parameter to control the model size. This criterion has an intuitive interpretation: a node should be removed if the proportion of its samples equaling 1 is small. When the expectation is not exact, such as in the top layers, we use samples drawn from the recognition network to approximate it.

3.3 Minibatch gFIC

To handle large datasets, we adapt the $\mathrm{gFIC}_{SBN}$ developed in (5) to use minibatches (which is also appropriate for online learning). Suppose that each mini-batch contains $N_{mini}$ data points, and currently we have seen $T$ mini-batches; an unbiased estimator for (5) (up to constant terms) is then

$$\widehat{\mathrm{gFIC}}_{SBN} = \max_{q} \ \mathbb{E}_q \left[ T \sum_{i=1}^{N_{mini}} \ln \frac{p(v_{i+N_T}, h_{i+N_T}|\hat{\theta})}{q(h_{i+N_T}|\phi)} - \frac{M}{2} \sum_{j} \ln \left( \sum_{i=1}^{N_{mini}} h_{i+N_T, j} \right) \right] + \frac{MJ - D_\theta}{2} \ln N_{T+1} \quad (7)$$

where $N_T = (T-1) N_{mini}$.
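The shrinkage/likelihood balance in (7) can be illustrated numerically. All magnitudes below are back-of-the-envelope assumptions, chosen only to show how the fixed-size penalty loses weight against the $T$-scaled likelihood term:

```python
import numpy as np

# Relative weight of the node-wise shrinkage term in (7) versus the T-scaled
# likelihood term as more mini-batches arrive. Illustrative magnitudes only.
N_mini, J, M = 100, 25, 30
n_active = 0.5 * N_mini                      # assumed per-node activation count
shrink = 0.5 * M * J * np.log(n_active)      # (M/2) * sum_j ln(sum_i h_ij)
ratios = [shrink / (T * N_mini) for T in (1, 10, 100)]
print(ratios)                                # monotonically decreasing in T
```

This matches the qualitative behavior discussed next: shrinkage dominates early in training and gradually fades as $T$ grows.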
Derivation details are provided in the Supplemental Materials.

An interesting observation in (7) is that $\widehat{\mathrm{gFIC}}_{SBN}$ can automatically adjust shrinkage over time: at the beginning of the optimization, i.e., when $T$ is small, the shrinkage term $\frac{M}{2} \sum_j \ln(\sum_{i=1}^{N_{mini}} h_{i+N_T, j})$ is more dominant in (7). As $T$ becomes larger, the model is more stable and shrinkage gradually disappears. This phenomenon is also observed in our experiments in Section 5.

3.4 Computational complexity

The NVIL algorithm has complexity $O(MJN_{train})$ for computing gradients in both the generative model and the recognition network. FABIA needs an extra model-selection step, also with complexity $O(MJN_{train})$ per step. As the number of training iterations increases, the additional cost of performing model selection is offset by the reduction of time when computing gradients, as observed in Figure 3. At test time, the complexity is $O(MJN_{test}K)$ per step, with $K$ being the number of samples taken to compute the variational lower bound. Therefore, shrinkage of nodes can linearly reduce the testing time.

4 Related Work

Dropout. As a standard approach to regularize deep models, Dropout [32] randomly removes a certain number of hidden units during training. Note that FABIA shares this important characteristic by directly operating on nodes, instead of edges, to regularize the model, which has a more direct connection with model selection. One important difference is that in each training iteration, Dropout updates only a subset of the model; in contrast, FABIA updates every parameter in the model, which enables faster convergence.

Shrinkage prior. The shrinkage sparsity-inducing approach aims to shrink edges in a model, by employing either shrinkage priors [11, 14, 36, 31] or a random mask [35] on the weight matrix. In FABIA, the penalty term derived in the gFIC of (5) also has the shrinkage property, but the shrinkage effect is instead imposed on the nodes.
Furthermore, shrinkage priors are usually approached from the Bayesian framework, where Markov chain Monte Carlo (MCMC) is often needed for inference. In contrast, FABIA integrates the shrinkage mechanism from gFIC into the auto-encoding VB approach and thus is scalable to large deep models.

Group sparsity. Application of group sparsity can be viewed as an extension of the shrinkage prior, with the key idea being to enforce sparsity on entire rows (columns) of the weight matrix [2]. This corresponds to the ARD prior [34], where each row (column) has an individual hyperparameter. In FNNs and CNNs, this is equivalent to node shrinkage in FABIA for SBNs. The structure of SBNs precludes a direct application of the group sparsity approach for model selection, but there exists an interesting opportunity for future work to extend FABIA to FNNs and CNNs.

Nonparametric prior. In Adams et al. [1], a cascading Indian buffet process (IBP) based approach was proposed to infer the structure of the Gaussian belief network with continuous hidden units, for which the inference was performed via MCMC. By employing the nonparametric properties of the IBP prior, this approach can adjust the model size with observations. Due to the high computational cost of MCMC, however, it may not be scalable to large problems.

5 Experiments

We test the proposed FABIA algorithm on synthetic data, as well as real image and count data. For comparison, we use the NVIL algorithm [25] as a baseline method, which does not have a model-selection procedure. Both FABIA and NVIL are implemented in Theano [4] and tested on a machine with a 3.0 GHz CPU and 64 GB RAM. The learning rate in Adam is set to 0.001 and we follow the default settings of the other parameters in all of our experiments. We set the threshold parameter $\epsilon^{(l)}$ to 0.001, $\forall l$, unless otherwise stated.
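With this threshold, the node-removal rule of Section 3.2 can be sketched in a few lines (array names and the toy values are illustrative):

```python
import numpy as np

def prune_nodes(Eq_h, eps=0.001):
    """Keep node j only if its mean activation (1/N) * sum_n E_q[h_nj]
    exceeds the threshold eps; otherwise remove it, per the criterion in
    Section 3.2. Illustrative sketch, not the paper's implementation."""
    keep = Eq_h.mean(axis=0) > eps
    return np.flatnonzero(keep)

Eq_h = np.array([[0.9, 0.0002, 0.4],
                 [0.8, 0.0001, 0.6]])   # N=2 data points, J=3 nodes
print(prune_nodes(Eq_h))                # node 1 falls below eps -> [0 2]
```

In deeper layers, where the expectation is not available exactly, the same rule would be applied to activation frequencies estimated from recognition-network samples.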
We also tested Dropout but did not notice any clear improvement. The purpose of these experiments is to show that FABIA can automatically learn the model size, and achieve better or competitive performance with a more compact model.

5.1 Synthetic Dataset

The synthetic data are generated from a one-layer SBN and a two-layer SBN, with $M = 30$ visible units in both cases. We simulate 1250 data points, and then follow an 80/20% split to obtain the training and test sets. For the one-layer case, we employ a true model with 5 nodes and initialize FABIA and NVIL with 25 nodes. For the two-layer case, the true network has the structure 10-5², and we initialize FABIA and NVIL with a network of 25-15. We compare the inferred SBN structure and test log-likelihood for FABIA, the NVIL algorithm initialized with the same model size as FABIA (denoted "NVIL"), and the NVIL algorithm initialized with the true model size (denoted "NVIL (True)"). One hundred independent random trials are conducted to report statistics.

Figure 2(a) shows the mean and standard deviation of the number of nodes inferred by FABIA, as a function of iteration number. In both the one- and two-layer cases, the mean of the inferred model size is very close to the ground truth.
In Figure 2(b), we compare the convergence in terms of the test log-likelihood for different algorithms: FABIA has almost the same convergence speed as NVIL with the true model, both of which have remarkable gaps over the NVIL variant initialized with the same model size as FABIA.

²We list the number of nodes in the deeper layer first in all of our experiments.

Figure 2: (a) Inferred number of nodes from FABIA in (Left) one- and (Right) two-layer cases; (b) Test log-likelihood for different methods in (Left) one- and (Right) two-layer cases.

5.2 Image Modeling

We use the publicly available MNIST dataset, which contains 60,000 training and 10,000 test images of size 28 × 28. Our performance metric is the variational lower bound of the test log-likelihood. The mini-batches for FABIA and NVIL are set to 100. For this dataset we compared FABIA with the VB approach in Gan et al. [11] and Rec-MCEM in Song et al. [31]. The VB approach in Gan et al. [11] can potentially shrink nodes, due to the three parameter beta-normal (TPBN) prior [3]. We claim a node $h_j^{(l)}$ can be removed from the model if its adjacent weight matrices satisfy $\sum_k [W_{k,j}^{(l)}]^2 / J^{(l-1)} < 10^{-8}$ and $\sum_k [W_{j,k}^{(l+1)}]^2 / J^{(l+1)} < 10^{-8}$. We run the code provided at https://github.com/zhegan27/dsbn_aistats2015 and use default parameter settings to report the VB results. We also implemented the Rec-MCEM approach but only observed shrinkage of edges, not nodes.

Table 1: Model size, test variational lower bound (VLB) (in nats), and test time (in seconds) on the MNIST dataset.
Note that FABIA and VB start from the same model size as NVIL and Rec-MCEM.

Method      Size           VLB        Time
VB          81             -117.04    8.94
Rec-MCEM    200            -116.70    8.52
NVIL        200            -115.63    8.47
FABIA       107            -114.96    6.88
VB          200-11         -113.69    22.37
Rec-MCEM    200-200        -106.54    12.25
NVIL        200-200        -105.62    12.34
FABIA       135-93         -104.92    9.18
NVIL        200-200-200    -101.99    15.66
FABIA       136-77-72      -101.14    10.97

Table 1 shows the variational lower bound of the test log-likelihood, model size, and test time for the different algorithms. FABIA achieves the highest test log-likelihood in all cases and converges to smaller models, compared to NVIL. FABIA also benefits from its more compact model, attaining the smallest test time. Furthermore, we observe that VB always over-shrinks nodes in the top layer, which might be related to the settings of hyperparameters. Unlike VB, FABIA avoids the difficult task of tuning hyperparameters to balance predictive performance and model size. We also notice that the deeper layer in the two-layer model did not shrink in VB, as our experiments suggest that all nodes in the deeper layer still have connections with nodes in adjacent layers.

Figure 3 shows the variational lower bound of the test log-likelihood and the number of nodes in FABIA, as a function of CPU time, for different initial model sizes. Additional plots as a function of the number of iterations are provided in the Supplemental Materials, and are similar to Figure 3. We note that FABIA initially has a similar log-likelihood but gradually outperforms NVIL, which can be explained by the fact that FABIA initially needs additional time to perform the shrinkage step but later converges to a smaller and better model. This gap becomes more obvious when we increase the number of hidden units from 200 to 500. The deteriorating performance of NVIL is most likely due to overfitting. In contrast, FABIA is robust to the change of the initial model size.

Figure 3: Test log-likelihood and the number of nodes in FABIA, as a function of CPU time on the MNIST dataset, for an SBN with initial size (a) 200-200-200 and (b) 500-500-500.

5.3 Topic Modeling

The two benchmarks we use for topic modeling are Reuters Corpus Volume I (RCV1) and Wikipedia, as in Gan et al. [10] and Henao et al. [14]. RCV1 contains 794,414 training and 10,000 test documents, with a vocabulary size of 10,000. Wikipedia is composed of 9,986,051 training documents, 1,000 test documents, and 7,702 words. The performance metric we use is the predictive perplexity on the test set, which cannot be directly evaluated. Instead, we follow the approach of an 80/20% split on the test set, with details provided in Gan et al. [10].

We compare FABIA against DPFA [10], deep Poisson factor modeling (DPFM) [14], MCEM [31], Over-RSM [33], and NVIL. For both FABIA and NVIL, we use a mini-batch of 200 documents. The results for other methods are cited from the corresponding references. We test DPFA and DPFM with the publicly available code provided by the authors; however, no shrinkage of nodes is observed in our experiments.

Table 2 shows the perplexities of the different algorithms on the RCV1 and Wikipedia datasets, respectively.
Both FABIA and NVIL outperform the other methods by marked margins.

Interestingly, we note that FABIA does not shrink any nodes in the first layer, which is likely due to the fact that these two datasets have a large number of visible units, so a sufficiently large first hidden layer is necessary. This requirement of a large first hidden layer to properly model the data may also explain why NVIL does not overfit on these datasets as much as it does on MNIST; the training sets of these datasets being sufficiently large is another possible explanation. We also computed test time but did not observe any clear improvement of FABIA over NVIL, which may be explained by the fact that most of the computation is spent on the first layer in these two benchmarks.

In Figure 4, we vary the number of hidden units in the first layer and fix the number of nodes in the other layers to 400. We use early stopping for NVIL to prevent it from overfitting with larger networks. For the networks with 100 and 400 nodes in the first layer, FABIA and NVIL have roughly the same perplexities. Once the number of nodes is increased to 1000, FABIA starts to outperform NVIL with remarkable gaps, which implies that FABIA can handle the overfitting problem, as a consequence of its shrinkage mechanism for model selection.

Figure 4: Test perplexities as a function of the number of nodes in the first layer, in the two-layer case.
We also observed that setting a larger value of the corresponding first-layer parameter in the 2000-unit case can stabilize FABIA's performance; we choose this value by cross-validation. The results for three layers are similar and are included in the Supplemental Materials.

Figure 4: Test perplexities as a function of the number of nodes in the first layer, in the two-layer case.

Table 2: Test perplexities and model size on the benchmarks. FABIA starts from a model initialized with 400 hidden units in each layer.

Method    | RCV1: Size   | Perplexity | Wikipedia: Size | Perplexity
Over-RSM  | 128          | 1060       | -               | -
MCEM      | 128          | 1023       | -               | -
DPFA-SBN  | 1024-512-256 | 964        | 1024-512-256    | 770
DPFA-RBM  | 128-64-32    | 920        | 128-64-32       | 942
DPFM      | 128-64       | 908        | 128-64          | 783
NVIL      | 400-400      | 857        | 400-400         | 735
FABIA     | 400-156      | 856        | 400-151         | 730

6 Conclusion and Future Work

We develop an automatic method to select the number of hidden units in SBNs. The proposed gFIC criterion is proven to be statistically consistent with the model's marginal log-likelihood. By maximizing gFIC, the FABIA algorithm can simultaneously execute model selection and inference tasks. Furthermore, we show that FABIA is a flexible framework that can be combined with auto-encoding VB approaches.
Our experiments on various datasets suggest that FABIA can effectively select a more compact model and achieve better held-out performance. In future work, we will extend FABIA to importance-sampling-based VAEs [5, 26, 24]. We also aim to explicitly select the number of layers in SBNs, and to tackle other popular deep models, such as CNNs and FNNs. Finally, investigating the effect of FABIA's shrinkage mechanism on the gradient noise is another interesting direction.

Acknowledgements

The authors would like to thank Ricardo Henao for helpful discussions, and the anonymous reviewers for their insightful comments and suggestions. Part of this work was done during the internship of the first author at NEC Laboratories America, Cupertino, CA. This research was supported in part by ARO, DARPA, DOE, NGA, ONR, NSF, and the NEC Fellowship.

References

[1] Adams, R., Wallach, H., and Ghahramani, Z. (2010). Learning the structure of deep sparse graphical models. In International Conference on Artificial Intelligence and Statistics, pages 1-8.

[2] Alvarez, J. M. and Salzmann, M. (2016). Learning the number of neurons in deep networks. In Advances in Neural Information Processing Systems, pages 2270-2278.

[3] Armagan, A., Clyde, M., and Dunson, D. B. (2011). Generalized beta mixtures of Gaussians. In Advances in Neural Information Processing Systems, pages 523-531.

[4] Bergstra, J., Breuleux, O., Bastien, F., Lamblin, P., Pascanu, R., Desjardins, G., Turian, J., Warde-Farley, D., and Bengio, Y. (2010). Theano: A CPU and GPU math compiler in Python. In Proc. 9th Python in Science Conf., pages 1-7.

[5] Bornschein, J. and Bengio, Y. (2015). Reweighted wake-sleep. In International Conference on Learning Representations.

[6] Carlson, D., Hsieh, Y.-P., Collins, E., Carin, L., and Cevher, V. (2016). Stochastic spectral descent for discrete graphical models. IEEE J. Sel.
Topics Signal Process., 10(2):296-311.

[7] Dempster, A. P., Laird, N. M., and Rubin, D. B. (1977). Maximum likelihood from incomplete data via the EM algorithm. J. Roy. Statist. Soc. Ser. B, pages 1-38.

[8] Fujimaki, R. and Hayashi, K. (2012). Factorized asymptotic Bayesian hidden Markov models. In International Conference on Machine Learning, pages 799-806.

[9] Fujimaki, R. and Morinaga, S. (2012). Factorized asymptotic Bayesian inference for mixture modeling. In International Conference on Artificial Intelligence and Statistics, pages 400-408.

[10] Gan, Z., Chen, C., Henao, R., Carlson, D., and Carin, L. (2015a). Scalable deep Poisson factor analysis for topic modeling. In International Conference on Machine Learning, pages 1823-1832.

[11] Gan, Z., Henao, R., Carlson, D., and Carin, L. (2015b). Learning deep sigmoid belief networks with data augmentation. In International Conference on Artificial Intelligence and Statistics, pages 268-276.

[12] Hayashi, K. and Fujimaki, R. (2013). Factorized asymptotic Bayesian inference for latent feature models. In Advances in Neural Information Processing Systems, pages 1214-1222.

[13] Hayashi, K., Maeda, S.-i., and Fujimaki, R. (2015). Rebuilding factorized information criterion: Asymptotically accurate marginal likelihood. In International Conference on Machine Learning, pages 1358-1366.

[14] Henao, R., Gan, Z., Lu, J., and Carin, L. (2015). Deep Poisson factor modeling. In Advances in Neural Information Processing Systems, pages 2800-2808.

[15] Hinton, G., Deng, L., Yu, D., Dahl, G., Mohamed, A.-r., Jaitly, N., Senior, A., Vanhoucke, V., Nguyen, P., Kingsbury, B., et al. (2012). Deep neural networks for acoustic modeling in speech recognition: The shared views of four research groups. IEEE Signal Process. Mag., 29(6):82-97.

[16] Hinton, G., Osindero, S., and Teh, Y.-W. (2006).
A fast learning algorithm for deep belief nets. Neural Computation, 18(7):1527-1554.

[17] Kingma, D. P. and Ba, J. L. (2015). Adam: A method for stochastic optimization. In International Conference on Learning Representations.

[18] Kingma, D. P. and Welling, M. (2014). Auto-encoding variational Bayes. In International Conference on Learning Representations.

[19] Krizhevsky, A., Sutskever, I., and Hinton, G. E. (2012). ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097-1105.

[20] LeCun, Y., Bengio, Y., and Hinton, G. (2015). Deep learning. Nature, 521(7553):436-444.

[21] Liu, C., Feng, L., and Fujimaki, R. (2016). Streaming model selection via online factorized asymptotic Bayesian inference. In International Conference on Data Mining, pages 271-280.

[22] Liu, C., Feng, L., Fujimaki, R., and Muraoka, Y. (2015). Scalable model selection for large-scale factorial relational models. In International Conference on Machine Learning, pages 1227-1235.

[23] MacKay, D. J. (2003). Information Theory, Inference and Learning Algorithms. Cambridge University Press.

[24] Maddison, C. J., Mnih, A., and Teh, Y. W. (2017). The concrete distribution: A continuous relaxation of discrete random variables. In International Conference on Learning Representations.

[25] Mnih, A. and Gregor, K. (2014). Neural variational inference and learning in belief networks. In International Conference on Machine Learning, pages 1791-1799.

[26] Mnih, A. and Rezende, D. (2016). Variational inference for Monte Carlo objectives. In International Conference on Machine Learning, pages 2188-2196.

[27] Mnih, V., Kavukcuoglu, K., Silver, D., Rusu, A. A., Veness, J., Bellemare, M. G., Graves, A., Riedmiller, M., Fidjeland, A.
K., Ostrovski, G., Petersen, S., Beattie, C., Sadik, A., Antonoglou, I., King, H., Kumaran, D., Wierstra, D., Legg, S., and Hassabis, D. (2015). Human-level control through deep reinforcement learning. Nature, 518(7540):529-533.

[28] Neal, R. M. (1992). Connectionist learning of belief networks. Artificial Intelligence, 56(1):71-113.

[29] Rezende, D. J., Mohamed, S., and Wierstra, D. (2014). Stochastic backpropagation and approximate inference in deep generative models. In International Conference on Machine Learning, pages 1278-1286.

[30] Saul, L. K., Jaakkola, T., and Jordan, M. I. (1996). Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4:61-76.

[31] Song, Z., Henao, R., Carlson, D., and Carin, L. (2016). Learning sigmoid belief networks via Monte Carlo expectation maximization. In International Conference on Artificial Intelligence and Statistics, pages 1347-1355.

[32] Srivastava, N., Hinton, G. E., Krizhevsky, A., Sutskever, I., and Salakhutdinov, R. (2014). Dropout: a simple way to prevent neural networks from overfitting. Journal of Machine Learning Research, 15(1):1929-1958.

[33] Srivastava, N., Salakhutdinov, R. R., and Hinton, G. E. (2013). Modeling documents with deep Boltzmann machines. In Uncertainty in Artificial Intelligence, pages 616-624.

[34] Tipping, M. E. (2001). Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1:211-244.

[35] Wan, L., Zeiler, M., Zhang, S., LeCun, Y., and Fergus, R. (2013). Regularization of neural networks using dropconnect. In International Conference on Machine Learning, pages 1058-1066.

[36] Zhou, M., Cong, Y., and Chen, B. (2015). The Poisson Gamma belief network.
In Advances in Neural Information Processing Systems, pages 3043-3051.