Rectified Factor Networks

Advances in Neural Information Processing Systems, pages 1855-1863

Djork-Arné Clevert, Andreas Mayr, Thomas Unterthiner and Sepp Hochreiter
Institute of Bioinformatics, Johannes Kepler University, Linz, Austria
{okko,mayr,unterthiner,hochreit}@bioinf.jku.at

Abstract

We propose rectified factor networks (RFNs) to efficiently construct very sparse, non-linear, high-dimensional representations of the input.
RFN models identify rare and small events in the input, have a low interference between code units, have a small reconstruction error, and explain the data covariance structure. RFN learning is a generalized alternating minimization algorithm derived from the posterior regularization method, which enforces non-negative and normalized posterior means. We prove convergence and correctness of the RFN learning algorithm. On benchmarks, RFNs are compared to other unsupervised methods like autoencoders, RBMs, factor analysis, ICA, and PCA. In contrast to previous sparse coding methods, RFNs yield sparser codes, capture the data's covariance structure more precisely, and have a significantly smaller reconstruction error. We test RFNs as a pretraining technique for deep networks on different vision datasets, where RFNs were superior to RBMs and autoencoders. On gene expression data from two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that revealed highly relevant new biological insights which were so far missed by other unsupervised methods. An RFN package for GPU/CPU is available at http://www.bioinf.jku.at/software/rfn.

1 Introduction

The success of deep learning is to a large part based on advanced and efficient input representations [1, 2, 3, 4]. These representations are sparse and hierarchical. Sparse representations of the input are in general obtained by rectified linear units (ReLUs) [5, 6] and dropout [7]. The key advantage of sparse representations is that dependencies between coding units are easy to model and to interpret. Most importantly, distinct concepts are much less likely to interfere in sparse representations. Using sparse representations, similarities of samples often break down to co-occurrences of features in these samples.
In bioinformatics, sparse codes excelled in biclustering of gene expression data [8] and in finding DNA sharing patterns between humans and Neanderthals [9].

Representations learned by ReLUs are not only sparse but also non-negative. Non-negative representations do not code the degree of absence of events or objects in the input. As the vast majority of events is supposed to be absent, coding for their degree of absence would introduce a high level of random fluctuations. We also aim for non-linear input representations so that models can be stacked to construct hierarchical representations. Finally, the representations are supposed to have a large number of coding units to allow coding of rare and small events in the input. Rare events are only observed in few samples, like seldom side effects in drug design, rare genotypes in genetics, or small customer groups in e-commerce. Small events affect only few input components, like pathways with few genes in biology, few relevant mutations in oncology, or a pattern of few products in e-commerce. In summary, our goal is to construct input representations that (1) are sparse, (2) are non-negative, (3) are non-linear, (4) use many code units, and (5) model structures in the input data (see next paragraph).

Current unsupervised deep learning approaches like autoencoders or restricted Boltzmann machines (RBMs) encode all peculiarities in the data (including noise). Generative models can be designed to model specific structures in the data, but their codes cannot be enforced to be sparse and non-negative. The input representation of a generative model is its posterior's mean, median, or mode, which depends on the data. Therefore, sparseness and non-negativity cannot be guaranteed independent of the data.
For example, generative models with rectified priors, like rectified factor analysis, have zero posterior probability for negative values, therefore their means are positive and not sparse [10, 11]. Sparse priors like the Laplacian and Jeffrey's prior do not guarantee sparse posteriors (see experiments in Tab. 1). To address the data dependence of the code, we employ the posterior regularization method [12]. This method separates model characteristics from data dependent characteristics that are enforced by constraints on the model's posterior.

We aim at representations that are feasible for many code units and massive datasets, therefore the computational complexity of generating a code is essential in our approach. For non-Gaussian priors, the computation of the posterior mean of a new input requires either to numerically solve an integral or to iteratively update variational parameters [13]. In contrast, for Gaussian priors the posterior mean is the product between the input and a matrix that is independent of the input. Still, the posterior regularization method leads to a quadratic (in the number of coding units) constrained optimization problem in each E-step (see Eq. (3) below). To speed up computation, we do not solve the quadratic problem but perform a gradient step. To allow for stochastic gradients and fast GPU implementations, the M-step is a gradient step, too. These E-step and M-step modifications of the posterior regularization method result in a generalized alternating minimization (GAM) algorithm [12].
We will show that the GAM algorithm used for RFN learning (i) converges and (ii) is correct. Correctness means that the RFN codes are non-negative, sparse, have a low reconstruction error, and explain the covariance structure of the data.

2 Rectified Factor Network

Our goal is to construct representations of the input that (1) are sparse, (2) are non-negative, (3) are non-linear, (4) use many code units, and (5) model structures in the input. Structures in the input are identified by a generative model, where the model assumptions determine which input structures to explain by the model. We want to model the covariance structure of the input, therefore we choose maximum likelihood factor analysis as model. The constraints on the input representation are enforced by the posterior regularization method [12]. Non-negative constraints lead to sparse and non-linear codes, while normalization constraints scale the signal part of each hidden (code) unit. Normalizing constraints prevent generative models from explaining away rare and small signals by noise. Explaining away becomes a serious problem for models with many coding units since their capacities are not utilized. Normalizing ensures that all hidden units are used, but at the cost of also coding random and spurious signals. Spurious and true signals must be separated in a subsequent step, either by supervised techniques, by evaluating coding units via additional data, or by domain experts.

A generative model with hidden units h and data v is defined by its prior p(h) and its likelihood p(v | h). The full model distribution p(h, v) = p(v | h) p(h) can be expressed by the model's posterior p(h | v) and its evidence (marginal likelihood) p(v): p(h, v) = p(h | v) p(v). The representation of input v is the posterior's mean, median, or mode.
The posterior regularization method introduces a variational distribution Q(h | v) ∈ Q from a family Q, which approximates the posterior p(h | v). We choose Q to constrain the posterior means to be non-negative and normalized. The full model distribution p(h, v) contains all model assumptions and, thereby, defines which structures of the data are modeled. Q(h | v) contains data dependent constraints on the posterior, therefore on the code.

For data {v} = {v_1, ..., v_n}, the posterior regularization method maximizes the objective F [12]:

    F = (1/n) Σ_{i=1}^n log p(v_i) − (1/n) Σ_{i=1}^n D_KL(Q(h_i | v_i) ‖ p(h_i | v_i))                              (1)
      = (1/n) Σ_{i=1}^n ∫ Q(h_i | v_i) log p(v_i | h_i) dh_i − (1/n) Σ_{i=1}^n D_KL(Q(h_i | v_i) ‖ p(h_i)) ,

where D_KL is the Kullback-Leibler distance. Maximizing F achieves two goals simultaneously: (1) extracting desired structures and information from the data as imposed by the generative model and (2) ensuring desired code properties via Q ∈ Q.

The factor analysis model v = W h + ε extracts the covariance structure of the data. The prior h ∼ N(0, I) of the hidden units (factors) h ∈ R^l and the noise ε ∼ N(0, Ψ) of the visible units (observations) v ∈ R^m are independent. The model parameters are the weight (loading) matrix W ∈ R^{m×l} and the noise covariance matrix Ψ ∈ R^{m×m}. We assume diagonal Ψ to explain correlations between input components by the hidden units and not by correlated noise. The factor analysis model is depicted in Fig. 1. Given the mean-centered data {v} = {v_1, ..., v_n}, the posterior p(h_i | v_i) is Gaussian with mean vector (μ_p)_i and covariance matrix Σ_p:

    (μ_p)_i = (I + W^T Ψ^{-1} W)^{-1} W^T Ψ^{-1} v_i ,
    Σ_p = (I + W^T Ψ^{-1} W)^{-1} .                                                                                 (2)

Figure 1: Factor analysis model: hidden units (factors) h, visible units v, weight matrix W, noise ε.

A rectified factor network (RFN) consists of a single or stacked factor analysis model(s) with constraints on the posterior. To incorporate the posterior constraints into the factor analysis model, we use the posterior regularization method that maximizes the objective F given in Eq. (1) [12]. Like the expectation-maximization (EM) algorithm, the posterior regularization method alternates between an E-step and an M-step. Minimizing the first D_KL of Eq. (1) with respect to Q leads to a constrained optimization problem. For Gaussian distributions, the solution with (μ_p)_i and Σ_p from Eq. (2) is Q(h_i | v_i) ∼ N(μ_i, Σ) with Σ = Σ_p and the quadratic problem:

    min_{μ_i} (1/n) Σ_{i=1}^n (μ_i − (μ_p)_i)^T Σ_p^{-1} (μ_i − (μ_p)_i)  s.t.  ∀i: μ_i ≥ 0 ,  ∀j: (1/n) Σ_{i=1}^n μ_{ij}^2 = 1 ,   (3)

where "≥" is component-wise. This is a constrained non-convex quadratic optimization problem in the number of hidden units, which is too complex to be solved in each EM iteration. Therefore, we perform a step of the gradient projection algorithm [14, 15], which first performs a gradient step and then projects the result onto the feasible set. We start with a step of the projected Newton method, then we try the gradient projection algorithm, and thereafter the scaled gradient projection algorithm with reduced matrix [16] (see also [15]). If these methods fail to decrease the objective in Eq. (3), we use the generalized reduced method [17].
It solves each equality constraint for one variable and inserts it into the objective while ensuring convex constraints. Alternatively, we use Rosen's gradient projection method [18] or its improvement [19]. These methods guarantee a decrease of the E-step objective.

Since the projection P of Eq. (6) is very fast, the projected Newton and projected gradient updates are very fast, too. A projected Newton step requires O(nl) steps (see Eq. (7) and P defined in Theorem 1), a projected gradient step requires O(min{nlm, nl^2}) steps, and a scaled gradient projection step requires O(nl^3) steps. The RFN complexity per iteration is O(n(m^2 + l^2)) (see Alg. 1). In contrast, a quadratic program solver typically requires O(n^4 l^4) steps to find the minimum for the nl variables (the means of the hidden units for all samples) [20]. We exemplify these values on our benchmark datasets MNIST (n = 50k, l = 1024, m = 784) and CIFAR (n = 50k, l = 2048, m = 1024). The speedup of projected Newton or projected gradient over a quadratic solver is O(n^4 l^4)/O(n l^2) = O(n^3 l^2), which gives speedup ratios of 1.3 · 10^20 for MNIST and 5.2 · 10^20 for CIFAR. These speedup ratios show that efficient E-step updates are essential for RFN learning. Furthermore, on our computers, RAM restrictions limited quadratic program solvers to problems with nl ≤ 20k. Running times of RFNs with the Newton step and a quadratic program solver are given in the supplementary Section 15.

The M-step decreases the expected reconstruction error

    E = −(1/n) Σ_{i=1}^n ∫_{R^l} Q(h_i | v_i) log p(v_i | h_i) dh_i                                                 (4)
      = (1/2) ( m log(2π) + log|Ψ| + Tr(Ψ^{-1} C) − 2 Tr(Ψ^{-1} W U^T) + Tr(W^T Ψ^{-1} W S) )

from Eq. (1) with respect to the model parameters W and Ψ. Definitions of C, U and S are given in Alg. 1.
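The closed form in Eq. (4) follows from taking the Gaussian expectation of log p(v_i | h_i) under Q(h_i | v_i) = N(μ_i, Σ). A small numpy sketch can confirm the trace identity; all matrices below are random stand-ins for illustration, not trained model parameters:

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, l = 200, 8, 5                      # samples, visible units, hidden units

W = rng.normal(size=(m, l))              # loading matrix (random stand-in)
Psi = np.diag(rng.uniform(0.5, 2.0, m))  # diagonal noise covariance
V = rng.normal(size=(n, m))              # data v_i as rows
M = rng.normal(size=(n, l))              # posterior means mu_i as rows
Sigma = 0.3 * np.eye(l)                  # shared posterior covariance

C = V.T @ V / n                          # C = 1/n sum v_i v_i^T
U = V.T @ M / n                          # U = 1/n sum v_i mu_i^T
S = M.T @ M / n + Sigma                  # S = 1/n sum mu_i mu_i^T + Sigma

Psi_inv = np.linalg.inv(Psi)

# closed form of Eq. (4)
E_trace = 0.5 * (m * np.log(2 * np.pi) + np.log(np.linalg.det(Psi))
                 + np.trace(Psi_inv @ C)
                 - 2 * np.trace(Psi_inv @ W @ U.T)
                 + np.trace(W.T @ Psi_inv @ W @ S))

# the same expectation written sample-wise:
# E_Q[(v - Wh)^T Psi^{-1} (v - Wh)] = (v - W mu)^T Psi^{-1} (v - W mu)
#                                     + Tr(W^T Psi^{-1} W Sigma)
R = V - M @ W.T
quad = np.einsum('ij,jk,ik->', R, Psi_inv, R) / n
E_direct = 0.5 * (m * np.log(2 * np.pi) + np.log(np.linalg.det(Psi))
                  + quad + np.trace(W.T @ Psi_inv @ W @ Sigma))

assert np.isclose(E_trace, E_direct)
```

The sample-wise form makes explicit that Eq. (4) is the expected squared reconstruction error under Ψ^{-1} plus a term Tr(W^T Ψ^{-1} W Σ) contributed by the posterior covariance.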
The M-step performs a gradient step in the Newton direction, since we want to allow stochastic gradients, fast GPU implementation, and dropout regularization. The Newton step is derived in the supplementary, which gives further details, too. Also in the E-step, RFN learning performs a gradient step using projected Newton or gradient projection methods. These projection methods require the Euclidean projection P of the posterior means {(μ_p)_i} onto the non-convex feasible set:

    min_{μ_i} (1/n) Σ_{i=1}^n (μ_i − (μ_p)_i)^T (μ_i − (μ_p)_i)  s.t.  μ_i ≥ 0 ,  (1/n) Σ_{i=1}^n μ_{ij}^2 = 1 .    (5)

Algorithm 1 Rectified Factor Network.
 1: C = (1/n) Σ_{i=1}^n v_i v_i^T
 2: while STOP=false do
 3:   ——E-step1——
 4:   for all 1 ≤ i ≤ n do
 5:     (μ_p)_i = (I + W^T Ψ^{-1} W)^{-1} W^T Ψ^{-1} v_i
 6:   end for
 7:   Σ = Σ_p = (I + W^T Ψ^{-1} W)^{-1}
 8:   ——Constrain Posterior——
 9:   (1) projected Newton, (2) projected gradient, (3) scaled gradient projection, (4) generalized reduced method, (5) Rosen's gradient projection
10:   ——E-step2——
11:   U = (1/n) Σ_{i=1}^n v_i μ_i^T
12:   S = (1/n) Σ_{i=1}^n μ_i μ_i^T + Σ
13:   ——M-step——
14:   E = C − U W^T − W U^T + W S W^T
15:   W = W + η (U S^{-1} − W)
16:   for all 1 ≤ k ≤ m do
17:     Ψ_kk = Ψ_kk + η (E_kk − Ψ_kk)
18:   end for
19:   if stopping criterion is met: STOP=true
20: end while

Complexity: objective F: O(min{nlm, nl^2} + l^3); E-step1: O(min{m^2(m + l), l^2(m + l)} + nlm); projected Newton: O(nl); projected gradient: O(min{nlm, nl^2}); scaled gradient projection: O(nl^3); E-step2: O(nl(m + l)); M-step: O(ml(m + l)); overall complexity with projected Newton / gradient for (l + m) < n: O(n(m^2 + l^2)).

The following Theorem 1 gives the Euclidean projection P as the solution to Eq. (5).

Theorem 1 (Euclidean Projection). If at least one (μ_p)_{ij}, 1 ≤ i ≤ n, is positive for hidden unit j, then the solution to the optimization problem Eq. (5) is

    μ_{ij} = [P((μ_p)_i)]_j = μ̂_{ij} / sqrt( (1/n) Σ_{i=1}^n μ̂_{ij}^2 ) ,  with  μ̂_{ij} = 0 for (μ_p)_{ij} ≤ 0  and  μ̂_{ij} = (μ_p)_{ij} for (μ_p)_{ij} > 0 .   (6)

If all (μ_p)_{ij}, 1 ≤ i ≤ n, are non-positive for hidden unit j, then the optimization problem Eq. (5) has the solution μ_{ij} = sqrt(n) for i = argmax_î (μ_p)_{îj} and μ_{ij} = 0 otherwise.

Proof. See supplementary material.

Using the projection P defined in Eq. (6), the E-step updates for the posterior means μ_i are:

    d = P( μ_i^old + λ H^{-1} Σ_p^{-1} ((μ_p)_i − μ_i^old) ) ,   μ_i^new = P( μ_i^old + γ (d − μ_i^old) ) ,          (7)

where we set H^{-1} = Σ_p for the projected Newton method (thus H^{-1} Σ_p^{-1} = I), and H^{-1} = I for the projected gradient method. For the scaled gradient projection algorithm with reduced matrix, the ε-active set for i consists of all j with μ_{ij} ≤ ε. The reduced matrix H is the Hessian Σ_p^{-1} with ε-active columns and rows j fixed to unit vectors e_j. The resulting algorithm is a posterior regularization method with gradient based E- and M-steps, leading to a generalized alternating minimization (GAM) algorithm [21]. The RFN learning algorithm is given in Alg. 1.
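Theorem 1 and Eq. (7) translate directly into code. The numpy sketch below (the random model and the step sizes λ, γ are illustrative assumptions) implements the projection P per hidden unit and one projected Newton update, for which H^{-1} Σ_p^{-1} = I:

```python
import numpy as np

def project(Mu_p):
    """Euclidean projection P of Theorem 1 / Eq. (6).
    Mu_p has shape (n, l); row i holds (mu_p)_i. The feasible set of
    Eq. (5) requires mu_ij >= 0 and (1/n) sum_i mu_ij^2 = 1 per unit j."""
    n, l = Mu_p.shape
    Mu_hat = np.maximum(Mu_p, 0.0)              # rectification
    Mu = np.zeros_like(Mu_hat)
    for j in range(l):
        rms = np.sqrt(np.mean(Mu_hat[:, j] ** 2))
        if rms > 0.0:
            Mu[:, j] = Mu_hat[:, j] / rms       # normalization (Eq. 6)
        else:                                   # all (mu_p)_ij non-positive
            Mu[np.argmax(Mu_p[:, j]), j] = np.sqrt(n)
    return Mu

# one projected Newton E-step (Eq. 7); with H^{-1} = Sigma_p the update
# direction simplifies because H^{-1} Sigma_p^{-1} = I.
rng = np.random.default_rng(1)
n, m, l = 100, 20, 10
W = rng.normal(size=(m, l))
Psi_inv = np.eye(m)                             # Psi = I for simplicity
Sigma_p = np.linalg.inv(np.eye(l) + W.T @ Psi_inv @ W)
V = rng.normal(size=(n, m))
Mu_p = V @ Psi_inv @ W @ Sigma_p                # Eq. (2), row-wise

lam = gamma = 0.5                               # illustrative step sizes
Mu_old = project(Mu_p)                          # feasible starting point
D = project(Mu_old + lam * (Mu_p - Mu_old))
Mu_new = project(Mu_old + gamma * (D - Mu_old))

assert np.all(Mu_new >= 0.0)                         # non-negative code
assert np.allclose((Mu_new ** 2).mean(axis=0), 1.0)  # normalized units
```

The two assertions check exactly the constraints of Eq. (5): the code is non-negative and every hidden unit has unit mean square over the batch.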
Dropout regularization can be included before E-step2 by randomly setting code units μ_{ij} to zero with a predefined dropout rate (note that the convergence results will then no longer hold).

3 Convergence and Correctness of RFN Learning

Convergence of RFN Learning. Theorem 2 states that Alg. 1 converges to a maximum of F.

Theorem 2 (RFN Convergence). The rectified factor network (RFN) learning algorithm given in Alg. 1 is a "generalized alternating minimization" (GAM) algorithm and converges to a solution that maximizes the objective F.

Proof. We present a sketch of the proof, which is given in detail in the supplement. For convergence, we show that Alg. 1 is a GAM algorithm, which converges according to Proposition 5 in [21]. Alg. 1 ensures a decrease of the M-step objective, which is convex in W and Ψ^{-1}. The update with η = 1 leads to the minimum of the objective. Convexity of the objective guarantees a decrease in the M-step for 0 < η ≤ 1 if not in a minimum. Alg. 1 ensures a decrease of the E-step objective by using gradient projection methods. All other requirements for GAM convergence are also fulfilled.

Proposition 5 in [21] is based on Zangwill's generalized convergence theorem, thus updates of the RFN algorithm are viewed as point-to-set mappings [22]. Therefore, the numerical precision, the choice of the methods in the E-step, and GPU implementations are covered by the proof.

Correctness of RFN Learning. The goal of the RFN algorithm is to explain the data and its covariance structure. The expected approximation error E is defined in line 14 of Alg. 1. Theorem 3 states that the RFN algorithm is correct, that is, it explains the data (low reconstruction error) and captures the covariance structure as well as possible.

Theorem 3 (RFN Correctness). The fixed point W of Alg.
1 minimizes Tr(Ψ) given μ_i and Σ by ridge regression with

    Tr(Ψ) = (1/n) Σ_{i=1}^n ‖ε_i‖_2^2 + ‖W Σ^{1/2}‖_F^2 ,                                                           (8)

where ε_i = v_i − W μ_i. The model explains the data covariance matrix by

    C = Ψ + W S W^T                                                                                                 (9)

up to an error which is quadratic in Ψ for Ψ ≪ W W^T. The reconstruction error (1/n) Σ_{i=1}^n ‖ε_i‖_2^2 is quadratic in Ψ for Ψ ≪ W W^T.

Proof. The fixed point equation for the W update is ΔW = U S^{-1} − W = 0 ⇒ W = U S^{-1}. Using the definitions of U and S, we have W = ((1/n) Σ_{i=1}^n v_i μ_i^T)((1/n) Σ_{i=1}^n μ_i μ_i^T + Σ)^{-1}. W is the ridge regression solution of

    min_W (1/n) Σ_{i=1}^n ‖v_i − W μ_i‖_2^2 + ‖W Σ^{1/2}‖_F^2 .                                                     (10)

For the fixed point of Ψ, the update rule gives

    diag(Ψ) = diag( (1/n) Σ_{i=1}^n ε_i ε_i^T + W Σ W^T )                                                           (11)

(after multiplying out all ε_i ε_i^T, the right hand side equals the diagonal of E = C − U W^T − W U^T + W S W^T from line 14 of Alg. 1). Taking the trace of Eq. (11) yields Eq. (8), so W minimizes Tr(Ψ) given μ_i and Σ. Multiplying the Woodbury identity for (W W^T + Ψ)^{-1} from left and right by Ψ gives

    W Σ W^T = Ψ − Ψ (W W^T + Ψ)^{-1} Ψ .                                                                            (12)

Inserting this into Eq. (11) and taking the trace gives

    Tr( (1/n) Σ_{i=1}^n ε_i ε_i^T ) = Tr( Ψ (W W^T + Ψ)^{-1} Ψ ) ≤ Tr( (W W^T + Ψ)^{-1} ) Tr(Ψ)^2 .                 (13)

Therefore, for Ψ ≪ W W^T the reconstruction error is quadratic in Ψ. W U^T = W S W^T = U W^T follows from the fixed point equation U = W S. Using this and Eq. (12), the matrix of reconstruction errors satisfies

    (1/n) Σ_{i=1}^n ε_i ε_i^T − Ψ (W W^T + Ψ)^{-1} Ψ = C − Ψ − W S W^T .                                            (14)

Using the trace norm (nuclear norm or Ky-Fan n-norm) on matrices, Eq. (13) states that the left hand side of Eq. (14) is quadratic in Ψ for Ψ ≪ W W^T. The trace norm of a positive semi-definite matrix is its trace and bounds the Frobenius norm [23]. Thus, for Ψ ≪ W W^T, the covariance is approximated up to a quadratic error in Ψ according to Eq. (9). The diagonal is exactly modeled.

Since the minimization of the expected reconstruction error Tr(Ψ) is based on μ_i, the quality of the reconstruction depends on the correlation between μ_i and v_i. We ensure maximal information on v_i in μ_i by the I-projection (the minimal Kullback-Leibler distance) of the posterior onto the family of rectified and normalized Gaussian distributions.

4 Experiments

RFNs vs. Other Unsupervised Methods. We assess the performance of rectified factor networks (RFNs) as unsupervised methods for data representation.
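As a quick sanity check, the fixed-point property from Theorem 3 can be verified numerically: with μ_i and Σ held fixed, W = U S^{-1} should minimize the ridge objective of Eq. (10). A minimal numpy sketch with random stand-in data (not the benchmark data):

```python
import numpy as np

rng = np.random.default_rng(2)
n, m, l = 300, 12, 6
V = rng.normal(size=(n, m))            # data rows v_i (random stand-ins)
M = rng.normal(size=(n, l))            # posterior means mu_i, held fixed
Sigma = 0.2 * np.eye(l)                # shared posterior covariance

U = V.T @ M / n                        # U = 1/n sum v_i mu_i^T
S = M.T @ M / n + Sigma                # S = 1/n sum mu_i mu_i^T + Sigma
W = U @ np.linalg.inv(S)               # fixed point of the W update

def ridge_objective(W):
    """Objective of Eq. (10): 1/n sum ||v_i - W mu_i||^2 + ||W Sigma^{1/2}||_F^2."""
    R = V - M @ W.T
    return (R ** 2).sum() / n + np.trace(W @ Sigma @ W.T)

# the fixed point should beat any perturbation of itself, since the
# objective is strictly convex in W (S is positive definite here)
base = ridge_objective(W)
for _ in range(10):
    assert base < ridge_objective(W + 0.01 * rng.normal(size=W.shape))
```

Setting the gradient of Eq. (10) to zero gives −2U + 2WS = 0, i.e. exactly the fixed point W = U S^{-1}, which is what the perturbation test confirms.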
We compare (1) RFN: rectified factor networks, (2) RFNn: RFNs without normalization, (3) DAE: denoising autoencoders with ReLUs, (4) RBM: restricted Boltzmann machines with Gaussian visible units, (5) FAsp: factor analysis with Jeffrey's prior (p(z) ∝ 1/z) on the hidden units, which is sparser than a Laplace prior, (6) FAlap: factor analysis with a Laplace prior on the hidden units, (7) ICA: independent component analysis by FastICA [24], (8) SFA: sparse factor analysis with a Laplace prior on the parameters, (9) FA: standard factor analysis, and (10) PCA: principal component analysis. The number of components is fixed to 50, 100, and 150 for each method. We generated nine different benchmark datasets (D1 to D9), where each dataset consists of 100 instances. Each instance has 100 samples and 100 features, resulting in a 100×100 matrix. Into these matrices, biclusters are implanted [8]. A bicluster is a pattern of particular features which is found in particular samples, like a pathway activated in some samples. An optimal representation will only code the biclusters that are present in a sample. The datasets have different noise levels and different bicluster sizes. Large biclusters have 20-30 samples and 20-30 features, while small biclusters have 3-8 samples and 3-8 features. The pattern's signal strength in a particular sample was randomly chosen according to the Gaussian N(1, 1). Finally, zero-mean Gaussian background noise with standard deviation 1, 5, or 10 was added to each matrix.
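A sketch of how one such 100×100 instance could be generated (the exact procedure follows [8] and the supplement; the helper `make_instance` and its parameters are illustrative assumptions):

```python
import numpy as np

def make_instance(sigma=1.0, n_large=10, n_small=10, size=100, seed=0):
    """Implant large (20-30 x 20-30) and small (3-8 x 3-8) biclusters into
    a size x size matrix and add zero-mean Gaussian background noise with
    standard deviation sigma. Illustrative sketch, not the exact generator."""
    rng = np.random.default_rng(seed)
    X = np.zeros((size, size))
    for n_bic, lo, hi in ((n_large, 20, 31), (n_small, 3, 9)):
        for _ in range(n_bic):
            rows = rng.choice(size, rng.integers(lo, hi), replace=False)
            cols = rng.choice(size, rng.integers(lo, hi), replace=False)
            # signal strength per sample drawn from N(1, 1)
            strength = rng.normal(1.0, 1.0, size=len(rows))
            X[np.ix_(rows, cols)] += strength[:, None]
    return X + rng.normal(0.0, sigma, size=(size, size))

X = make_instance(sigma=1.0, n_large=10, n_small=10)   # a D1-style instance
assert X.shape == (100, 100)
```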
The datasets are characterized by Dx=(σ, n1, n2) with background noise σ, number of large biclusters n1, and number of small biclusters n2: D1=(1,10,10), D2=(5,10,10), D3=(10,10,10), D4=(1,15,5), D5=(5,15,5), D6=(10,15,5), D7=(1,5,15), D8=(5,5,15), D9=(10,5,15).

We evaluated the methods according to (1) the sparseness of the components, (2) the input reconstruction error from the code, and (3) the covariance reconstruction error for generative models. For RFNs, sparseness is the percentage of the components that are exactly 0, while for the other methods it is the percentage of components with an absolute value smaller than 0.01. The reconstruction error is the sum of the squared errors across samples. The covariance reconstruction error is the Frobenius norm of the difference between the model and the data covariance. See the supplement for more details on the data and for information on hyperparameter selection for the different methods. Tab. 1 gives averaged results for models with 50 (undercomplete), 100 (complete), and 150 (overcomplete) coding units. Results are the mean of 900 instances consisting of 100 instances for each dataset D1 to D9. In the supplement, we separately tabulate the results for D1 to D9 and confirm them with different noise levels. FAlap did not yield sparse codes since the variational parameter did not push the absolute representations below the threshold of 0.01. The variational approximation to the Laplacian is a Gaussian [13].

Table 1: Comparison of RFN with other unsupervised methods, where the upper part contains methods that yielded sparse codes. Criteria: sparseness of the code (SP), reconstruction error (ER), difference between data and model covariance (CO). The panels give the results for models with 50, 100 and 150 coding units. Results are the mean of 900 instances, 100 instances for each dataset D1 to D9 (maximal value: 999). RFNs had the sparsest code, the lowest reconstruction error, and the lowest covariance approximation error of all methods that yielded sparse representations (SP>10%).

         undercomplete (50 units)   complete (100 units)      overcomplete (150 units)
         SP     ER      CO          SP     ER      CO         SP     ER      CO
RFN      75±0   249±3   108±3       81±1   68±9    26±6       85±1   17±6    7±6
RFNn     74±0   295±4   140±4       79±0   185±5   59±3       80±0   142±4   35±2
DAE      66±0   251±3   —           69±0   147±2   —          71±0   130±2   —
RBM      15±1   310±4   —           7±1    287±4   —          5±0    286±4   —
FAsp     40±1   999±63  999±99      63±0   999±65  999±99     80±0   999±65  999±99
FAlap    4±0    239±6   341±19      6±0    46±4    985±45     4±0    46±4    976±53
ICA      2±0    174±2   —           3±1    0±0     —          3±1    0±0     —
SFA      1±0    218±5   94±3        1±0    16±1    114±5      1±0    16±1    285±7
FA       1±0    218±4   90±3        1±0    16±1    83±4       1±0    16±1    263±6
PCA      0±0    174±2   —           2±0    0±0     —          2±0    0±0     —

Figure 2: Randomly selected filters trained on image datasets using an RFN with 1024 hidden units: (a) MNIST digits, (b) MNIST digits with random image background, (c) MNIST digits with random noise background, (d) convex and concave shapes, (e) tall and wide rectangles, (f) rectangular images on background images, (g) CIFAR-10 images (best viewed in color), (h) NORB images. RFNs learned stroke, local and global blob detectors. RFNs are robust to background noise (b,c,f).
RFNs had the sparsest code, the lowest reconstruction error, and the lowest covariance approximation error of all methods yielding sparse representations (SP>10%).

RFN Pretraining for Deep Nets. We assess the performance of rectified factor networks (RFNs) when used for pretraining of deep networks. Stacked RFNs are obtained by first training a single-layer RFN and then passing the resulting representation on as input for training the next RFN. The deep network architectures use an RFN pretrained first layer (RFN-1) or stacks of 3 RFNs giving a 3-hidden-layer network. The classification performance of deep networks with RFN pretrained layers was compared to (i) support vector machines, (ii) deep networks pretrained by stacking denoising autoencoders (SDAE), (iii) stacking regular autoencoders (SAE), (iv) restricted Boltzmann machines (RBM), and (v) stacking restricted Boltzmann machines (DBN).

The benchmark datasets and results are taken from previous publications [25, 26, 27, 28] and contain: (i) MNIST (original MNIST), (ii) basic (a smaller subset of MNIST for training), (iii) bg-rand (MNIST with random noise background), (iv) bg-img (MNIST with random image background), (v) rect (tall or wide rectangles), (vi) rect-img (tall or wide rectangular images with random background images), (vii) convex (convex or concave shapes), (viii) CIFAR-10 (60k color images in 10 classes), and (ix) NORB (29,160 stereo image pairs of 5 categories). For each dataset, the sizes of its training, validation and test sets are given in the second column of Tab. 2. As preprocessing, we only performed median centering. Model selection is based on the validation set [26]. The RFN hyperparameters are (i) the number of units per layer from {1024, 2048, 4096} and (ii) the dropout rate from {0.0, 0.25, 0.5, 0.75}. The learning rate was fixed to η = 0.01 (default value). For supervised
For supervised fine-tuning with stochastic gradient descent, we selected the learning rate from {0.1, 0.01, 0.001}, the masking noise from {0.0, 0.25}, and the number of layers from {1, 3}. Fine-tuning was stopped based on the validation set, see [26]. Fig. 2 shows learned filters. Test error rates and the 95% confidence intervals (computed according to [26]) for deep network pretraining by RFNs and other methods are given in Tab. 2. Best results and those with overlapping confidence intervals are given in bold. RFNs were only once significantly worse than the best method but still the second best. In six out of the nine experiments RFNs performed best, where in four cases they were significantly the best.

Table 2: Results of deep networks pretrained by RFNs and other models (taken from [25, 26, 27, 28]). The test error rate is reported together with the 95% confidence interval. The best performing method is given in bold, as well as those for which confidence intervals overlap. The first column gives the dataset, the second the sizes of the training, validation, and test sets; the number in parentheses after the RFN result gives the number of hidden layers of the selected deep network. In only one case RFN pretraining was significantly worse than the best method but still the second best. In six out of the nine experiments RFN pretraining performed best, where in four cases it was significantly the best.

Dataset    Size (train-valid-test)   SVM          RBM          DBN          SAE          SDAE         RFN (layers)
MNIST      50k-10k-10k               1.40±0.23    1.21±0.21    1.24±0.22    1.40±0.23    1.28±0.22    1.27±0.22 (1)
basic      10k-2k-50k                3.03±0.15    3.94±0.17    3.11±0.15    3.46±0.16    2.84±0.15    2.66±0.14 (1)
bg-rand    10k-2k-50k                14.58±0.31   9.80±0.26    6.73±0.22    11.28±0.28   10.30±0.27   7.94±0.24 (3)
bg-img     10k-2k-50k                22.61±0.37   16.15±0.32   16.31±0.32   23.00±0.37   16.68±0.33   15.66±0.32 (1)
rect       1k-0.2k-50k               2.15±0.13    4.71±0.19    2.60±0.14    2.41±0.13    1.99±0.12    0.63±0.06 (1)
rect-img   10k-2k-50k                24.04±0.37   23.69±0.37   22.50±0.37   24.05±0.37   21.59±0.36   20.77±0.36 (1)
convex     10k-2k-50k                19.13±0.34   19.92±0.35   18.63±0.34   18.41±0.34   19.06±0.34   16.41±0.32 (1)
NORB       19k-5k-24k                11.6±0.40    8.31±0.35    -            10.10±0.38   9.50±0.37    7.00±0.32 (1)
CIFAR      40k-10k-10k               62.7±0.95    40.39±0.96   43.38±0.97   43.25±0.97   -            41.29±0.95 (1)

Figure 3: Examples of small and rare events identified by RFN in two drug design studies, which were missed by previous methods. Panels A and B: the first row gives the coding unit, while the other rows display expression values of genes for controls (red), active drugs (green), and inactive drugs (black). Drugs (green) in panel A strongly downregulate the expression of tubulin genes, which hints at a genotoxic effect by the formation of micronuclei (C). The micronuclei were confirmed by microscopic analysis (D). Drugs (green) in panel B show a transcriptional effect on genes with a negative feedback to the MAPK signaling pathway (E) and therefore are potential cancer drugs.
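The model selection protocol above, an exhaustive grid over the pretraining hyperparameters (units per layer, dropout rate) and the fine-tuning hyperparameters (learning rate, masking noise, number of layers) scored on the validation set, can be sketched as follows. Here `validation_error` is a hypothetical placeholder for a full pretrain, fine-tune, and evaluate run; it is not part of the released RFN package.

```python
from itertools import product

# Hyperparameter grids quoted in the text.
PRETRAIN_GRID = {"n_units": [1024, 2048, 4096],
                 "dropout": [0.0, 0.25, 0.5, 0.75]}
FINETUNE_GRID = {"learning_rate": [0.1, 0.01, 0.001],
                 "masking_noise": [0.0, 0.25],
                 "n_layers": [1, 3]}

def select_model(validation_error):
    # Exhaustive search over the combined grid; the configuration with the
    # lowest validation error is selected, following the protocol of [26].
    grid = {**PRETRAIN_GRID, **FINETUNE_GRID}
    best_cfg, best_err = None, float("inf")
    for values in product(*grid.values()):
        cfg = dict(zip(grid.keys(), values))
        err = validation_error(cfg)
        if err < best_err:
            best_cfg, best_err = cfg, err
    return best_cfg, best_err

# Toy scoring function standing in for actual training runs.
cfg, err = select_model(lambda c: abs(c["learning_rate"] - 0.01) + c["dropout"])
```

In practice each grid point is an expensive training run, but the selection rule itself is exactly this argmin over the validation error.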
Supplementary Section 14 shows results of RFN pretraining for convolutional networks, where RFN pretraining decreased the test error rates to 7.63% for CIFAR-10 and to 29.75% for CIFAR-100.

RFNs in Drug Discovery. Using RFNs we analyzed gene expression datasets of two projects in the lead optimization phase of a big pharmaceutical company [29]. The first project aimed at finding novel antipsychotics that target PDE10A. The second project was an oncology study that focused on compounds inhibiting the FGF receptor. In both projects, the expression data was summarized by FARMS [30] and standardized. RFNs were trained with 500 hidden units, no masking noise, and a learning rate of η = 0.01. The identified transcriptional modules are shown in Fig. 3. Panels A and B illustrate that RFNs found rare and small events in the input. In panel A, only a few drugs are genotoxic (rare event), downregulating the expression of a small number of tubulin genes (small event). The genotoxic effect stems from the formation of micronuclei (panels C and D), since the mitotic spindle apparatus is impaired. In panel B, RFN again identified a rare and small event: a transcriptional module that has a negative feedback to the MAPK signaling pathway. The rare events are unexpectedly inactive drugs (black dots), which do not inhibit the FGF receptor. Neither finding was detected by other unsupervised methods, while both were highly relevant and supported decision-making in the two projects [29].

5 Conclusion

We have introduced rectified factor networks (RFNs) for constructing very sparse and non-linear input representations with many coding units in a generative framework. Like factor analysis, RFN learning explains the data variance by its model parameters. The RFN learning algorithm is a posterior regularization method which enforces non-negative and normalized posterior means.
We have shown that RFN learning is a generalized alternating minimization method which can be proved to converge and to be correct. RFNs had the sparsest code, the lowest reconstruction error, and the lowest covariance approximation error of all methods that yielded sparse representations (SP>10%). RFNs were also shown to improve performance when used for pretraining of deep networks. In two pharmaceutical drug discovery studies, RFNs detected small and rare gene modules that were so far missed by other unsupervised methods. These gene modules were highly relevant and supported the decision-making in both studies. RFNs are geared to large datasets, sparse coding, and many representational units; therefore they have high potential as unsupervised deep learning techniques.

Acknowledgment. The Tesla K40 used for this research was donated by the NVIDIA Corporation.

References

[1] G. E. Hinton and R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, 2006.
[2] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. In B. Schölkopf, J. C. Platt, and T. Hoffman, editors, NIPS, pages 153–160. MIT Press, 2007.
[3] J. Schmidhuber. Deep learning in neural networks: An overview. Neural Networks, 61:85–117, 2015.
[4] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521(7553):436–444, 2015.
[5] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In ICML, pages 807–814. Omnipress, 2010. ISBN 978-1-60558-907-7.
[6] X. Glorot, A. Bordes, and Y. Bengio. Deep sparse rectifier neural networks. In AISTATS, volume 15, pages 315–323, 2011.
[7] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov. Dropout: A simple way to prevent neural networks from overfitting.
Journal of Machine Learning Research, 15:1929–1958, 2014.
[8] S. Hochreiter, U. Bodenhofer, et al. FABIA: factor analysis for bicluster acquisition. Bioinformatics, 26(12):1520–1527, 2010.
[9] S. Hochreiter. HapFABIA: Identification of very short segments of identity by descent characterized by rare variants in large sequencing data. Nucleic Acids Res., 41(22):e202, 2013.
[10] B. J. Frey and G. E. Hinton. Variational learning in nonlinear Gaussian belief networks. Neural Computation, 11(1):193–214, 1999.
[11] M. Harva and A. Kaban. Variational learning for rectified factor analysis. Signal Processing, 87(3):509–527, 2007.
[12] K. Ganchev, J. Graca, J. Gillenwater, and B. Taskar. Posterior regularization for structured latent variable models. Journal of Machine Learning Research, 11:2001–2049, 2010.
[13] J. Palmer, D. Wipf, K. Kreutz-Delgado, and B. Rao. Variational EM algorithms for non-Gaussian latent variable models. In NIPS, volume 18, pages 1059–1066, 2006.
[14] D. P. Bertsekas. On the Goldstein-Levitin-Polyak gradient projection method. IEEE Trans. Automat. Control, 21:174–184, 1976.
[15] C. T. Kelley. Iterative Methods for Optimization. Society for Industrial and Applied Mathematics (SIAM), Philadelphia, 1999.
[16] D. P. Bertsekas. Projected Newton methods for optimization problems with simple constraints. SIAM J. Control Optim., 20:221–246, 1982.
[17] J. Abadie and J. Carpentier. Optimization, chapter Generalization of the Wolfe Reduced Gradient Method to the Case of Nonlinear Constraints. Academic Press, 1969.
[18] J. B. Rosen. The gradient projection method for nonlinear programming. Part II. Nonlinear constraints. Journal of the Society for Industrial and Applied Mathematics, 9(4):514–532, 1961.
[19] E. J. Haug and J. S. Arora. Applied Optimal Design. J. Wiley & Sons, New York, 1979.
[20] A. Ben-Tal and A. Nemirovski.
Interior Point Polynomial Time Methods for Linear Programming, Conic Quadratic Programming, and Semidefinite Programming, chapter 6, pages 377–442. Society for Industrial and Applied Mathematics, 2001.
[21] A. Gunawardana and W. Byrne. Convergence theorems for generalized alternating minimization procedures. Journal of Machine Learning Research, 6:2049–2073, 2005.
[22] W. I. Zangwill. Nonlinear Programming: A Unified Approach. Prentice Hall, Englewood Cliffs, N.J., 1969.
[23] N. Srebro. Learning with Matrix Factorizations. PhD thesis, Department of Electrical Engineering and Computer Science, Massachusetts Institute of Technology, 2004.
[24] A. Hyvärinen and E. Oja. A fast fixed-point algorithm for independent component analysis. Neural Comput., 9(7):1483–1492, 1999.
[25] Y. LeCun, F.-J. Huang, and L. Bottou. Learning methods for generic object recognition with invariance to pose and lighting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR). IEEE Press, 2004.
[26] P. Vincent, H. Larochelle, et al. Stacked denoising autoencoders: Learning useful representations in a deep network with a local denoising criterion. JMLR, 11:3371–3408, 2010.
[27] H. Larochelle, D. Erhan, et al. An empirical evaluation of deep architectures on problems with many factors of variation. In ICML, pages 473–480, 2007.
[28] A. Krizhevsky. Learning multiple layers of features from tiny images. Master's thesis, Department of Computer Science, University of Toronto, 2009.
[29] B. Verbist, G. Klambauer, et al. Using transcriptomics to guide lead optimization in drug discovery projects: Lessons learned from the QSTAR project. Drug Discovery Today, 20(5):505–513, 2015.
[30] S. Hochreiter, D.-A. Clevert, and K. Obermayer. A new summarization method for Affymetrix probe level data.
Bioinformatics, 22(8):943–949, 2006.