{"title": "Deep Learning with Kernel Regularization for Visual Recognition", "book": "Advances in Neural Information Processing Systems", "page_first": 1889, "page_last": 1896, "abstract": "In this paper we focus on training deep neural networks for visual recognition tasks. One challenge is the lack of an informative regularization on the network parameters that imposes meaningful control on the computed function. We propose a training strategy that takes advantage of kernel methods, where an existing kernel function represents useful prior knowledge about the learning task of interest. We derive an efficient algorithm using stochastic gradient descent, and demonstrate very positive results in a wide range of visual recognition tasks.", "full_text": "Deep Learning with Kernel Regularization for Visual Recognition

Kai Yu    Wei Xu    Yihong Gong

NEC Laboratories America, Cupertino, CA 95014, USA
{kyu, wx, ygong}@sv.nec-labs.com

Abstract

In this paper we aim to train deep neural networks for rapid visual recognition. The task is highly challenging, largely due to the lack of a meaningful regularizer on the functions realized by the networks. We propose a novel regularization method that takes advantage of kernel methods, where an oracle kernel function represents prior knowledge about the recognition task of interest. We derive an efficient algorithm using stochastic gradient descent, and demonstrate encouraging results on a wide range of recognition tasks, in terms of both accuracy and speed.

1 Introduction

Visual recognition remains a challenging task for machines. This difficulty stems from the large pattern variations under which a recognition system must operate. 
The task is extremely easy for a human, largely due to the expressive deep architecture employed by human visual cortex systems. Deep neural networks (DNNs) are argued to have a greater capacity than shallow models to recognize a larger variety of visual patterns, and are considered more biologically plausible.

However, training deep architectures is difficult because the large number of parameters to be tuned necessitates an enormous amount of labeled training data that is often unavailable. Several authors have recently proposed training methods that use unlabeled data. These methods perform a greedy layer-wise pre-training using unlabeled data, followed by a supervised fine-tuning [9, 4, 15]. Even though this strategy notably improves performance, to date the best recognition accuracy reported by deep models on popular benchmarks such as Caltech101 still lags largely behind the results of shallow models.

Besides using unlabeled data, in this paper we tackle the problem by leveraging additional prior knowledge. In the last few decades, researchers have developed successful kernel-based systems for a wide range of visual recognition tasks. Those sensibly-designed kernel functions provide an extremely valuable source of prior knowledge, which we believe should be exploited in deep learning. In this paper, we propose an informative kernel-based regularizer, which makes it possible to train DNNs with prior knowledge about the recognition task.

Computationally, we propose to solve the learning problem using stochastic gradient descent (SGD), as it is the de facto method for neural network training. To this end we transform the kernel regularizer into a loss function represented as a sum of costs over individual examples. 
This results in a simple multi-task architecture where a number of extra nodes at the output layer are added to fit a set of auxiliary functions automatically constructed from the kernel function.

We apply the described method to train convolutional neural networks (CNNs) for a wide range of visual recognition tasks, including handwritten digit recognition, gender classification, ethnic origin recognition, and object recognition. Overall our approach exhibits excellent accuracy and speed on all of these tasks. Our results show that incorporating prior knowledge can boost the performance of CNNs by a large margin when the training set is small or the learning problem is difficult.

2 DNNs with Kernel Regularization

In our setting, the learning model, a deep neural network (DNN), aims to learn a predictive function f : X → R that achieves a low expected discrepancy E[ℓ(y, f(x))] over the distribution p(x, y). In the simplest case Y = {−1, 1} and ℓ(·,·) is a differentiable hinge loss. Given a set of labeled examples [(xᵢ, yᵢ)]_{i=1}^n, learning proceeds by minimizing the regularized loss

L(β, θ) = Σ_{i=1}^n ℓ(yᵢ, β₁ᵀφᵢ + β₀) + λ‖β₁‖²   (1)

where φᵢ = φ(xᵢ; θ) maps xᵢ to q-dimensional hidden units via a nonlinear deep architecture with parameters θ, including the connection weights and biases of all the intermediate layers; β = {β₁, β₀}, where β₁ includes all the parameters of the transformation from the last hidden layer to the output layer and β₀ is a bias term; λ > 0; and ‖a‖² = tr(aᵀa) is the usual weight decay regularization. 
Applying the well-known representer theorem, we derive the equivalent kernel system¹

L(α, β₀, θ) = Σ_{i=1}^n ℓ(yᵢ, Σ_{j=1}^n α_j K_{i,j} + β₀) + λ Σ_{i,j=1}^n αᵢα_j K_{i,j}   (2)

where the kernel is computed by

K_{i,j} = ⟨φ(xᵢ; θ), φ(x_j; θ)⟩ = φᵢᵀφ_j

We assume the network is provided with some prior knowledge, in the form of an m × m kernel matrix Σ, computed on the n labeled training data, plus possibly m − n additional unlabeled data if m > n. We exploit this prior knowledge by imposing a kernel regularization on K(θ) = [K_{i,j}]_{i,j=1}^m, such that the learning problem seeks

Problem 2.1.

min_{β,θ} L(β, θ) + γΩ(θ)   (3)

where γ > 0 and Ω(θ) is defined by

Ω(θ) = tr[K(θ)⁻¹Σ] + log det[K(θ)]   (4)

This is a case of semi-supervised learning if m > n. Though Ω is non-convex w.r.t. K, it has a unique minimum at K = Σ if Σ ≻ 0, suggesting that minimizing Ω(θ) encourages K to approach Σ. The regularization can also be explained from an information-theoretic perspective. Let p(f|K) and p(f|Σ) be the two Gaussian distributions N(0, K) and N(0, Σ).² Ω(θ) is related to the KL-divergence D_KL[p(f|Σ)‖p(f|K)], so minimizing Ω(θ) forces the two distributions to be close. We note that the regularization does not require Σ to be positive definite; it can be positive semidefinite.

3 Kernel Regularization via Stochastic Gradient Descent

The learning problem in Eq. (3) can be solved by gradient-based methods. 
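As a concrete illustration of Eq. (4), the following NumPy snippet (a minimal sketch, not the authors' code; the matrices are random stand-ins) computes Ω for positive definite K and Σ and checks numerically that Ω is minimized exactly at K = Σ:

```python
import numpy as np

def kernel_regularizer(K, Sigma):
    """Omega(K) = tr(K^{-1} Sigma) + log det K, as in Eq. (4)."""
    return np.trace(np.linalg.solve(K, Sigma)) + np.linalg.slogdet(K)[1]

rng = np.random.default_rng(0)
A = rng.standard_normal((5, 5))
Sigma = A @ A.T + 5 * np.eye(5)        # a positive definite "oracle" kernel
at_min = kernel_regularizer(Sigma, Sigma)   # value at K = Sigma: m + log det Sigma

B = rng.standard_normal((5, 5))
K_other = B @ B.T + 5 * np.eye(5)      # any other positive definite K
assert at_min < kernel_regularizer(K_other, Sigma)   # K = Sigma is the minimizer
```

At K = Σ the value is m + log det Σ, so the perturbed kernel always scores strictly worse, which is exactly the sense in which Ω(θ) pulls K(θ) toward the oracle kernel.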
In this paper we emphasize large-scale optimization using stochastic gradient descent (SGD), because the method is fast when the total data size m is large, and stochastic backpropagation, a typical instance of SGD, has been the de facto method for training neural networks on large-scale learning tasks.

SGD addresses problems where the optimization cost is a sum of local costs over individual training examples. A standard batch gradient descent updates the model parameters using the true gradient summed over the whole training set, while SGD approximates the true gradient by the gradient of a single random training example. The parameters of the model are therefore updated after each training example. For large data sets, SGD is often much faster than batch gradient descent.

However, because the regularization term defined by Eq. (4) is not a cost function that can be expressed as a sum (or an average) over data examples, SGD is not directly applicable. Our idea is to transform the problem into an equivalent formulation that can be optimized stochastically.

¹In this paper we slightly abuse notation, i.e., we use L to denote different loss functions. Their meanings, however, are uniquely identified by their input parameters.
²From a Gaussian process point of view, a kernel function defines the prior distribution of a function f, such that the marginal distribution of the function values f on any finite set of inputs is a multivariate Gaussian.

3.1 Shrinkage on the Kernel Matrix

We consider a large-scale problem where the data size m may grow over time, while the size q of the last hidden layer of the DNN is fixed. 
Therefore the computed kernel K can be rank deficient. To ensure that the trace term in Ω(θ) is well-defined and that the log-determinant term is bounded from below, we replace K in Ω(θ) with K + δI, where δ > 0 is a small shrinkage parameter and I is an identity matrix. The log-determinant then acts on a much smaller q × q matrix³

log det(K + δI) = log det(ΦᵀΦ + δI) + const   (5)

where Φ = [φ₁, . . . , φ_m]ᵀ and const = (m − q) · log δ. Omitting all the irrelevant constants, we then turn the kernel regularization into

Ω(θ) = tr[(ΦΦᵀ + δI)⁻¹Σ] + log det(ΦᵀΦ + δI)   (6)

The kernel shrinkage not only remedies the ill-posedness, but also yields other conveniences in our later development.

3.2 Transformation of the Log-determinant Term

Noticing that ΦᵀΦ = Σ_{i=1}^m φᵢφᵢᵀ is a sum of quantities over data examples, we move it outside of the log-determinant for the convenience of SGD.

Theorem 3.1. Consider min_θ {L(θ) = h(θ) + g(a)}, where g(·) is concave and a ≡ a(θ) is a function of θ. If a local minimum w.r.t. θ exists, then the problem is equivalent to

min_{θ,ψ} {L(θ, ψ) = h(θ) + a(θ)ᵀψ − g•(ψ)}

where g•(ψ) is the conjugate function of g(a), i.e., g•(ψ) = min_a {ψᵀa − g(a)}.⁴

Proof. For a concave function g(a), the conjugate function of its conjugate function is itself, i.e., g(a) = min_ψ {aᵀψ − g•(ψ)}. Since g•(ψ) is concave, aᵀψ − g•(ψ) is convex w.r.t. ψ and has the unique minimum g(a). Therefore minimizing L(θ, ψ) w.r.t. θ and ψ is equivalent to minimizing L(θ) w.r.t. 
θ.

Since the log-determinant is concave over q × q positive definite matrices A, the conjugate function of log det(A) is log det(Ψ) + q. We can use the above theorem to transform any loss function containing log det(A) into another loss, which is an upper bound and involves A only through a linear term. Therefore the log-determinant in Eq. (6) is turned into the variational representation

log det(ΦᵀΦ + δI) = min_{Ψ∈S⁺_q} [ Σ_{i=1}^m φᵢᵀΨφᵢ + δ·tr(Ψ) − log det(Ψ) + const ]

where Ψ ∈ S⁺_q is a q × q positive definite matrix, and const = −q. The upper bound is a convex function of the auxiliary variable Ψ and, more importantly, it amounts to a sum of local quantities contributed by each of the m data examples.

³Hereafter in this paper, with a slight abuse of notation, we use "const" in equations to summarize the terms irrelevant to the variables of interest.
⁴If g(a) is convex, its conjugate function is g°(ψ) = max_a {ψᵀa − g(a)}.

3.3 Transformation of the Trace Term

We assume that the kernel matrix Σ is presented in a decomposed form Σ = UUᵀ, with U = [u₁, . . . , u_m]ᵀ, uᵢ ∈ Rᵖ, and p ≤ m. We have found that the trace term can be cast as a variational problem by introducing a q × p auxiliary variable matrix η.

Proposition 3.1. The trace term in Eq. (6) is equivalent to the convex variational representation

tr[(ΦΦᵀ + δI)⁻¹Σ] = min_{η∈R^{q×p}} [ Σ_{i=1}^m ‖(1/√δ)uᵢ − ηᵀφᵢ‖² + δ‖η‖²_F ]

Proof. We first obtain the analytical solution η* = (1/√δ)(ΦᵀΦ + δI)⁻¹ΦᵀU, at which the variational representation reaches its unique minimum. Plugging it back into the objective, we have

tr[ (1/δ)UᵀU − (2/√δ)UᵀΦη* + η*ᵀΦᵀΦη* + δ·η*ᵀη* ]
= (1/δ) tr[ UᵀU − UᵀΦ(ΦᵀΦ + δI)⁻¹ΦᵀU ]
= tr[ (ΦΦᵀ + δI)⁻¹UUᵀ ]

where the last step follows from the Woodbury matrix identity.

Again, the upper bound is a convex function of η, and consists of a sum of local costs over data examples.

3.4 An Equivalent Learning Framework

Combining the previous results, we obtain a convex upper bound for the kernel regularization Eq. (6), which amounts to a sum of costs over examples plus some regularization:

Ω(θ) ≤ L(η, Ψ, θ) = Σ_{i=1}^m [ ‖(1/√δ)uᵢ − ηᵀφᵢ‖² + φᵢᵀΨφᵢ ] + δ‖η‖²_F + δ·tr(Ψ) − log det(Ψ)

where we omit all the terms irrelevant to η, Ψ, and θ. L(η, Ψ, θ) is convex w.r.t. η and Ψ, and has the unique minimum Ω(θ); hence we can replace Ω(θ) by the upper bound and formulate an equivalent learning problem

min_{β,η,Ψ,θ} L(β, η, Ψ, θ) = L(β, θ) + γL(η, Ψ, θ)   (7)

Clearly this new optimization can be solved by SGD.

When applying the SGD method, each step based on one example needs to compute the inverse of Ψ. This can be computationally unaffordable when the dimensionality is large (e.g., q > 1000); remember that the efficiency of SGD depends on each stochastic update being lightweight. 
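To make the equivalence tangible, here is a small numerical check (an illustrative NumPy sketch with random stand-in data, not the authors' code) that plugging the closed-form minimizers η* (Proposition 3.1) and Ψ* = (ΦᵀΦ + δI)⁻¹ (Section 3.2) into the variational bound recovers Ω(θ) of Eq. (6) exactly:

```python
import numpy as np

rng = np.random.default_rng(1)
m, q, p, delta = 30, 4, 3, 0.5
Phi = rng.standard_normal((m, q))   # rows are the hidden features phi_i
U = rng.standard_normal((m, p))     # oracle kernel Sigma = U U^T
Sigma = U @ U.T

# Omega(theta) as in Eq. (6)
Ks = Phi @ Phi.T + delta * np.eye(m)
Omega = (np.trace(np.linalg.solve(Ks, Sigma))
         + np.linalg.slogdet(Phi.T @ Phi + delta * np.eye(q))[1])

# Closed-form minimizers of the two variational representations
A = Phi.T @ Phi + delta * np.eye(q)
eta = np.linalg.solve(A, Phi.T @ U) / np.sqrt(delta)   # eta*
Psi = np.linalg.inv(A)                                 # Psi*

# Variational bound, including the constant -q from the log-det transform
bound = (np.sum((U / np.sqrt(delta) - Phi @ eta) ** 2)     # sum_i ||u_i/sqrt(d) - eta^T phi_i||^2
         + delta * np.sum(eta ** 2)                        # delta ||eta||_F^2
         + np.einsum('ij,jk,ik->', Phi, Psi, Phi)          # sum_i phi_i^T Psi phi_i
         + delta * np.trace(Psi)
         - np.linalg.slogdet(Psi)[1] - q)
assert np.isclose(Omega, bound)   # the bound is tight at (eta*, Psi*)
```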
Our next result shows that we can dramatically reduce this complexity from O(q³) to O(q).

Proposition 3.2. Eq. (6) is equivalent to the convex variational problem

Ω(θ) = min_{η,ψ} [ Σ_{i=1}^m ( ‖(1/√δ)uᵢ − ηᵀφᵢ‖² + ψᵀφᵢ² ) + δ‖η‖²_F + δ·ψᵀe − Σ_{k=1}^q log ψ_k ]   (8)

where φᵢ² denotes the element-wise square of φᵢ, ψ = [ψ₁, . . . , ψ_q]ᵀ, and e = [1, . . . , 1]ᵀ.

Proof. There is an ambiguity of solutions up to rotations: if {β*, Φ*, η*, Ψ*} is an optimal solution set, the transformation β₁* ← Rβ₁*, φᵢ* ← Rφᵢ*, η* ← Rη*, and Ψ* ← RΨ*Rᵀ attains the same optimality whenever RᵀR = I. Since there always exists an R that diagonalizes Ψ*, we can pre-restrict Ψ to be a diagonal positive definite matrix Ψ = diag[ψ₁, . . . , ψ_q], which does not change our problem and gives rise to Eq. (8).

We note that the variational form is convex w.r.t. the auxiliary variables η and ψ. Therefore we can formulate the whole learning problem as

Problem 3.1.

min_{β,η,ψ,θ} L(β, η, ψ, θ) = (1/n) L₁(β, θ) + (γ/(mn)) L₂(η, θ) + (γ/(mn)) L₃(ψ, θ)   (9)

where L₁(β, θ) is defined by Eq. (1), and

L₂(η, θ) = Σ_{i=1}^m ‖(1/√δ)uᵢ − ηᵀφᵢ‖² + δ‖η‖²_F

L₃(ψ, θ) = Σ_{i=1}^m ψᵀφᵢ² + δ·ψᵀe − Σ_{k=1}^q log ψ_k

To ensure that the estimator of β and θ is consistent, the effect of the regularization should vanish as n → ∞. We therefore intentionally normalize L₂(η, θ) and L₃(ψ, θ) by 1/m. 
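The per-example structure of L₂ and L₃ is what makes stochastic updates possible. The NumPy sketch below (illustrative, not the authors' code; splitting the η- and ψ-regularizers evenly over the m examples is our own convention) forms one example's local cost and verifies the analytic gradients an SGD step would use against finite differences:

```python
import numpy as np

rng = np.random.default_rng(2)
q, p, m, delta = 4, 3, 10, 0.5
phi = rng.standard_normal(q)                 # hidden features of one example
u = rng.standard_normal(p)                   # the corresponding row of U
eta = rng.standard_normal((q, p))
psi = 1.0 + np.abs(rng.standard_normal(q))   # keep psi_k > 0

def local_cost(eta, psi):
    # One example's share of L2 + L3; regularizers split evenly over m.
    r = u / np.sqrt(delta) - eta.T @ phi
    return (r @ r + delta * np.sum(eta ** 2) / m
            + psi @ phi ** 2 + (delta * np.sum(psi) - np.sum(np.log(psi))) / m)

# Analytic gradients w.r.t. eta and psi
g_eta = (-2.0 * np.outer(phi, u / np.sqrt(delta) - eta.T @ phi)
         + 2.0 * delta * eta / m)
g_psi = phi ** 2 + (delta - 1.0 / psi) / m

# Finite-difference checks of one coordinate of each gradient
eps = 1e-6
E = np.zeros_like(eta); E[0, 0] = eps
fd = (local_cost(eta + E, psi) - local_cost(eta - E, psi)) / (2 * eps)
assert np.isclose(fd, g_eta[0, 0], atol=1e-5)
Ep = np.zeros_like(psi); Ep[1] = eps
fdp = (local_cost(eta, psi + Ep) - local_cost(eta, psi - Ep)) / (2 * eps)
assert np.isclose(fdp, g_psi[1], atol=1e-5)
```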
The overall loss function is averaged over the n labeled examples and consists of three parts: the main classification task L₁(β, θ), an auxiliary least-squares regression problem L₂(η, θ), and an additional regularization term L₃(ψ, θ), which can also be interpreted as a least-squares problem. Since each of the loss functions amounts to a summation of local costs caused by individual data examples, the whole learning problem can be conveniently implemented by SGD, as described in Algorithm 1.

In practice, the kernel matrix Σ = UUᵀ that represents domain knowledge can be obtained in three different ways: (i) in the easiest case, U is directly available as a set of hand-crafted features computed from the input data, which corresponds to a linear kernel function; (ii) U can be the result of unsupervised learning (e.g., self-taught learning [14] based on sparse coding) applied to a large set of unlabeled data; (iii) if a nonlinear kernel function is available, U can be obtained by applying incomplete Cholesky decomposition to the m × m kernel matrix Σ. In the third case, when m is so large that the matrix decomposition cannot be computed in main memory, we apply the Nyström method [19]: we first randomly sample m₁ examples, with p < m₁ < m, such that the computed kernel matrix Σ₁ can be decomposed in memory. 
Let V DVᵀ be the p-rank eigenvalue decomposition of Σ₁; then the p-rank decomposition of Σ can be approximated by Σ ≈ UUᵀ with U = Σ_{:,1} V D^{−1/2}, where Σ_{:,1} is the m × m₁ kernel matrix between all m examples and the subset of size m₁.

Algorithm 1 Stochastic Gradient Descent
repeat
  Generate a number a from the uniform distribution on [0, 1]
  if a < n/(m+n) then
    Randomly pick a sample i ∈ {1, · · · , n} for L₁, and update the parameters by
      [β, θ] ← [β, θ] − ε · ∂L₁(xᵢ, β, θ)/∂[β, θ]
  else
    Randomly pick a sample i ∈ {1, · · · , m} for L₂ and L₃, and update the parameters by
      [η, ψ, θ] ← [η, ψ, θ] − (ε/m) · ∂[L₂(xᵢ, η, θ) + L₃(xᵢ, ψ, θ)]/∂[η, ψ, θ]
  end if
until convergence

4 Visual Recognition by Deep Learning with Kernel Regularization

In the following, we apply the proposed strategy to train a class of deep models, convolutional neural networks (CNNs) [11], for a range of visual recognition tasks, including digit recognition on the MNIST dataset, gender and ethnicity classification on the FRGC face dataset, and object recognition on the Caltech101 dataset. In each of these tasks, we choose a kernel function that has been reported to have state-of-the-art or otherwise good performance in the literature. 
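The Nyström construction of U described at the end of the previous section can be sketched as follows (a NumPy sketch under our own naming, not the authors' code). With a linear kernel and p equal to the kernel's rank, the approximation Σ ≈ UUᵀ becomes exact, which the final check exploits:

```python
import numpy as np

def nystrom_factor(kernel_fn, X, m1, p, rng):
    """Low-rank factor U with Sigma ~= U U^T, via the Nystrom method [19].

    kernel_fn(A, B) returns the kernel matrix between rows of A and rows of B.
    """
    idx = rng.choice(len(X), size=m1, replace=False)
    Sigma1 = kernel_fn(X[idx], X[idx])        # m1 x m1 sub-kernel
    d, V = np.linalg.eigh(Sigma1)
    d, V = d[-p:], V[:, -p:]                  # top-p eigenpairs (V D V^T)
    Sigma_cols = kernel_fn(X, X[idx])         # m x m1 cross-kernel Sigma_{:,1}
    return Sigma_cols @ V / np.sqrt(d)        # Sigma_{:,1} V D^{-1/2}

# Sanity check with a linear kernel of rank 3, where p = 3 makes it exact
rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
lin = lambda A, B: A @ B.T
U = nystrom_factor(lin, X, m1=20, p=3, rng=rng)
assert np.allclose(U @ U.T, X @ X.T, atol=1e-6)
```

Only the m₁ × m₁ sub-kernel is ever decomposed, so the factorization stays in memory even when the full m × m kernel would not.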
We will examine whether a kernel regularizer can improve the recognition accuracy of the deep models, and how the result compares with a support vector machine (SVM) using exactly the same kernel.

Table 1: Percentage error rates of handwritten digit recognition on MNIST

Training Size          100    600    1000   3000   60000
SVM (RBF)              22.73  8.53   6.58   3.91   1.41
SVM (RBF, Nyström)     24.73  9.15   6.92   5.51   5.16
SVM (Graph)            5.21   3.74   3.46   3.01   2.23
SVM (Graph, Cholesky)  7.17   6.47   5.75   4.28   2.87
CNN                    19.40  6.40   5.50   2.75   0.82
kCNN (RBF)             14.49  3.85   3.40   1.88   0.73
kCNN (Graph)           4.28   2.36   2.05   1.75   0.64
CNN (Pretrain) [15]    −      3.21   −      −      0.64
EmbedO CNN [18]        11.73  3.42   3.34   2.28   −
EmbedI5 CNN [18]       7.75   3.82   2.73   1.83   −
EmbedA1 CNN [18]       7.87   3.82   2.76   2.07   −

Throughout all the experiments, "kCNN" denotes CNNs regularized by nonlinear kernels, processed by either the Cholesky or the Nyström approximation, with parameters p = 600, m₁ = 5000, and m the size of each whole data set. The obtained uᵢ are normalized to unit length. λ and δ are fixed at 1. The remaining two hyperparameters are the learning rate ε ∈ {10⁻³, 10⁻⁴, 10⁻⁵} and the kernel regularization weight γ ∈ {10², 10³, 10⁴, 10⁵}. Their values are set once for each of the 4 recognition tasks based on 5-fold cross-validation using 500 labeled examples.

4.1 Handwritten Digit Recognition on the MNIST Dataset

The data contains a training set of 60000 examples and a test set of 10000 examples. The CNN employs 50 filters of size 7 × 7 on 34 × 34 input images, followed by down-sampling by 1/2, then 128 filters of size 5 × 5, followed by down-sampling by 1/2, and then 200 filters of size 5 × 5, giving rise to 200-dimensional features that are fed to the output layer. 
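As a quick sanity check on the architecture just described (illustrative arithmetic only, assuming valid convolutions and non-overlapping down-sampling), the spatial sizes chain as 34 → 28 → 14 → 10 → 5 → 1, so the last layer indeed produces 200 maps of size 1 × 1, i.e., a 200-dimensional feature vector:

```python
# Spatial-size bookkeeping for the MNIST CNN described in the text
def conv(size, k):
    """Output width of a 'valid' convolution with a k x k filter."""
    return size - k + 1

s = 34
s = conv(s, 7) // 2   # 50 filters 7x7, down-sample by 1/2 -> 14
s = conv(s, 5) // 2   # 128 filters 5x5, down-sample by 1/2 -> 5
s = conv(s, 5)        # 200 filters 5x5 -> 1
assert s == 1         # 200 maps of size 1x1 = 200-dim feature vector
```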
Two nonlinear kernels are used: (1) an RBF kernel, and (2) a graph kernel on a 10-nearest-neighbor graph [6]. We perform a 600-dimensional Cholesky decomposition on the whole 70000 × 70000 graph kernel, which is feasible because the matrix is very sparse. In addition to using the whole training set, we train the models on 100, 600, 1000, and 3000 random examples from the training set and evaluate the classifiers on the whole test set, repeating each setting 5 times independently. The results are given in Tab. 1. kCNNs effectively improve over CNNs by leveraging the prior knowledge, and also outperform SVMs that use the same kernels. The results are competitive with the state-of-the-art results of [15], and of [18], which uses a different architecture.

4.2 Gender and Ethnicity Recognition on the FRGC Dataset

The FRGC 2.0 dataset [13] contains 14714 face images of 568 individuals under various lighting conditions and backgrounds. Besides person identities, each image is annotated with gender and ethnicity, which we put into 3 classes: "white", "asian", and "other". We fix 114 persons' 3014 images (randomly chosen) as the test set, and randomly select 5%, 10%, 20%, 50%, and "All" of the remaining 454 individuals' 11700 images for training. For each training size, we randomize the training data 5 times and report the average error rates.

In this experiment, CNNs operate on images represented by R/G/B planes plus horizontal and vertical gradient maps of gray intensities. The 5 input planes of size 140 × 140 are processed by 16 convolution filters of size 16 × 16, followed by max pooling within each disjoint 5 × 5 neighborhood. The obtained 16 feature maps of size 25 × 25 are connected to the next layer by 256 filters of size 6 × 6, with 50% random sparse connections, followed by max pooling within each 5 × 5 neighborhood. The resulting 256 × 4 × 4 features are fed to the output layer. 
The nonlinear kernel used in this experiment is the RBF kernel computed directly on images, which has demonstrated state-of-the-art accuracy for gender recognition [3]. The results shown in Tab. 2 and Tab. 3 demonstrate that kCNNs significantly boost the recognition accuracy of CNNs for both gender and ethnicity recognition. The difference is most prominent when small training sets are used.

Table 2: Percentage error rates of gender recognition on FRGC

Training Size        5%    10%   20%   50%   All
SVM (RBF)            16.7  13.4  11.3  9.1   8.6
SVM (RBF, Nyström)   20.2  14.3  11.6  9.1   8.8
CNN                  61.5  17.2  8.4   6.6   5.9
kCNN                 17.1  7.2   5.8   5.0   4.4

Table 3: Percentage error rates of ethnicity recognition on FRGC

Training Size        5%    10%   20%   50%   All
SVM (RBF)            22.9  16.9  14.1  11.3  10.2
SVM (RBF, Nyström)   24.7  20.6  15.8  11.9  11.1
CNN                  30.0  13.9  10.0  8.2   6.3
kCNN                 15.6  8.7   7.3   6.2   5.8

4.3 Object Recognition on the Caltech101 Dataset

Caltech101 [7] contains 9144 images from 101 object categories and a background category. It is considered one of the most diverse object databases available today, and is probably the most popular benchmark for object recognition. We follow the common setting of training on 15 and 30 images per class and testing on the rest. Following [10], we limit the number of test images to 30 per class. The recognition accuracy was normalized by class sizes and evaluated over 5 random data splits. The CNN has the same architecture as the one used in the FRGC experiment. The nonlinear kernel is the spatial pyramid matching (SPM) kernel developed in [10].

Tab. 4 shows our results together with those reported in [12, 15] using deep hierarchical architectures. This task is much more challenging for CNNs than the previous three tasks, because in each category the data size is very small while the visual patterns are highly diverse. Thanks to the regularization by the SPM kernel, kCNN dramatically improves the accuracy of CNN, and outperforms SVM using the same kernel. This is perhaps the best performance by (trainable and hand-crafted) deep hierarchical models on the Caltech101 dataset. Some filters trained with and without kernel regularization are visualized in Fig. 1, which helps to understand the difference made by kCNN.

5 Related Work, Discussion, and Conclusion

Recent work on deep visual recognition models includes [17, 12, 15]. In [17] and [12] the first layer consists of hard-wired Gabor filters; a large number of patches are then sampled from the second layer and used as the basis of a representation, which in turn trains a discriminative classifier.

Deep models are powerful in representing complex functions but very difficult to train. Hinton and his coworkers proposed training deep belief networks by layer-wise unsupervised pre-training, followed by supervised fine-tuning [9]. The strategy was subsequently studied for other deep models such as CNNs [15] and autoassociators [4], and for document coding [16]. In recent work [18], the authors proposed training a deep model jointly with an unsupervised embedding task, which also led to improved results. Though it uses unlabeled data too, our work differs from previous work in its emphasis on leveraging prior knowledge, which suggests that it can be combined with those approaches, including neighborhood component analysis [8], to further enhance deep learning. This work is also related to transfer learning [2], which used auxiliary learning tasks to learn a linear feature mapping, and, more directly, to our previous work [1], which created pseudo auxiliary tasks based on hand-crafted image features to train nonlinear deep networks.

One may ask: why bother training with kCNN, instead of simply combining two independently trained CNN and SVM systems? 
The reason is computational speed: kCNN pays an extra cost to exploit a kernel matrix in the training phase, but in the prediction phase the system uses the CNN alone. In our Caltech101 experiment, the SVM (SPM) needed several seconds to process a new image on a PC with a 3.0 GHz processor, while kCNN can process about 40 images per second. The latest record on Caltech101 was based on combining multiple kernels [5]. We conjecture that kCNN could be further improved by using multiple kernels without sacrificing recognition speed.

(a) CNN-Caltech101  (b) kCNN-Caltech101
Figure 1: First-layer filters on the B channel, learned from Caltech101 (30 examples per class)

Table 4: Percentage accuracy on Caltech101

Training Size         15    30
SVM (SPM) [10]        54.0  64.6
SVM (SPM, Nyström)    52.1  63.1
HMAX [12]             51.0  56.0
CNN (Pretrain) [15]   −     54.0
CNN                   26.5  43.6
kCNN                  59.2  67.4

To conclude, we proposed using kernels to improve the training of deep models. The approach was implemented by stochastic gradient descent, and demonstrated excellent performance on a range of visual recognition tasks. Our experiments showed that prior knowledge can significantly improve the performance of deep models when insufficient labeled data are available in hard recognition problems. The trained model is much faster than kernel systems at making predictions.

Acknowledgment: We thank the reviewers and Douglas Gray for helpful comments.

References

[1] A. Ahmed, K. Yu, W. Xu, Y. Gong, and E. P. Xing. Training hierarchical feed-forward visual recognition models using transfer learning from pseudo tasks. European Conference on Computer Vision, 2008.

[2] R. K. Ando and T. Zhang. A framework for learning predictive structures from multiple tasks and unlabeled data. Journal of Machine Learning Research, 2005.

[3] S. Baluja and H. Rowley. 
Boosting sex identification performance. International Journal of Computer Vision, 2007.

[4] Y. Bengio, P. Lamblin, D. Popovici, and H. Larochelle. Greedy layer-wise training of deep networks. Neural Information Processing Systems, 2007.

[5] A. Bosch, A. Zisserman, and X. Muñoz. Image classification using ROIs and multiple kernel learning. Submitted to International Journal of Computer Vision, 2008.

[6] O. Chapelle, J. Weston, and B. Schölkopf. Cluster kernels for semi-supervised learning. Neural Information Processing Systems, 2003.

[7] L. Fei-Fei, R. Fergus, and P. Perona. Learning generative visual models from few training examples: An incremental Bayesian approach tested on 101 object categories. CVPR Workshop, 2004.

[8] J. Goldberger, S. Roweis, G. Hinton, and R. Salakhutdinov. Neighbourhood components analysis. Neural Information Processing Systems, 2005.

[9] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks. Science, 313(5786):504–507, July 2006.

[10] S. Lazebnik, C. Schmid, and J. Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. IEEE Conference on Computer Vision and Pattern Recognition, 2006.

[11] Y. LeCun, L. Bottou, Y. Bengio, and P. Haffner. Gradient-based learning applied to document recognition. Proceedings of the IEEE, 86(11):2278–2324, 1998.

[12] J. Mutch and D. G. Lowe. Multiclass object recognition with sparse, localized features. IEEE Conference on Computer Vision and Pattern Recognition, 2006.

[13] P. J. Phillips, P. J. Flynn, T. Scruggs, K. W. Bowyer, and W. Worek. Preliminary face recognition grand challenge results. IEEE Conference on Automatic Face and Gesture Recognition, 2006.

[14] R. Raina, A. Battle, H. Lee, B. Packer, and A. Y. Ng. Self-taught learning: Transfer learning from unlabeled data. International Conference on Machine Learning, 2007.

[15] M. 
Ranzato, F.-J. Huang, Y.-L. Boureau, and Y. LeCun. Unsupervised learning of invariant feature hierarchies with applications to object recognition. IEEE Conference on Computer Vision and Pattern Recognition, 2007.

[16] M. Ranzato and M. Szummer. Semi-supervised learning of compact document representations with deep networks. International Conference on Machine Learning, 2008.

[17] T. Serre, L. Wolf, and T. Poggio. Object recognition with features inspired by visual cortex. IEEE Conference on Computer Vision and Pattern Recognition, 2005.

[18] J. Weston, F. Ratle, and R. Collobert. Deep learning via semi-supervised embedding. International Conference on Machine Learning, 2008.

[19] C. Williams and M. Seeger. Using the Nyström method to speed up kernel machines. Neural Information Processing Systems, 2001.
", "award": [], "sourceid": 790, "authors": [{"given_name": "Kai", "family_name": "Yu", "institution": null}, {"given_name": "Wei", "family_name": "Xu", "institution": null}, {"given_name": "Yihong", "family_name": "Gong", "institution": null}]}