{"title": "Label Distribution Learning Forests", "book": "Advances in Neural Information Processing Systems", "page_first": 834, "page_last": 843, "abstract": "Label distribution learning (LDL) is a general learning framework, which assigns to an instance a distribution over a set of labels rather than a single label or multiple labels. Current LDL methods have either restricted assumptions on the expression form of the label distribution or limitations in representation learning, e.g., to learn deep features in an end-to-end manner. This paper presents label distribution learning forests (LDLFs) - a novel label distribution learning algorithm based on differentiable decision trees, which have several advantages: 1) Decision trees have the potential to model any general form of label distributions by a mixture of leaf node predictions. 2) The learning of differentiable decision trees can be combined with representation learning. We define a distribution-based loss function for a forest, enabling all the trees to be learned jointly, and show that an update function for leaf node predictions, which guarantees a strict decrease of the loss function, can be derived by variational bounding. 
The effectiveness of the proposed LDLFs is verified on several LDL tasks and a computer vision application, showing significant improvements to the state-of-the-art LDL methods.", "full_text": "Label Distribution Learning Forests\n\n1 Key Laboratory of Specialty Fiber Optics and Optical Access Networks,\n\nShanghai Institute for Advanced Communication and Data Science,\n\nSchool of Communication and Information Engineering, Shanghai University\n\nWei Shen1,2, Kai Zhao1, Yilu Guo1, Alan Yuille2\n\n2 Department of Computer Science, Johns Hopkins University\n\n{shenwei1231,zhaok1206,gyl.luan0,alan.l.yuille}@gmail.com\n\nAbstract\n\nLabel distribution learning (LDL) is a general learning framework, which assigns\nto an instance a distribution over a set of labels rather than a single label or multiple\nlabels. Current LDL methods have either restricted assumptions on the expression\nform of the label distribution or limitations in representation learning, e.g., to\nlearn deep features in an end-to-end manner. This paper presents label distribution\nlearning forests (LDLFs) - a novel label distribution learning algorithm based on\ndifferentiable decision trees, which have several advantages: 1) Decision trees\nhave the potential to model any general form of label distributions by a mixture\nof leaf node predictions. 2) The learning of differentiable decision trees can be\ncombined with representation learning. We de\ufb01ne a distribution-based loss function\nfor a forest, enabling all the trees to be learned jointly, and show that an update\nfunction for leaf node predictions, which guarantees a strict decrease of the loss\nfunction, can be derived by variational bounding. 
The effectiveness of the proposed\nLDLFs is veri\ufb01ed on several LDL tasks and a computer vision application, showing\nsigni\ufb01cant improvements to the state-of-the-art LDL methods.\n\n1\n\nIntroduction\n\nLabel distribution learning (LDL) [6, 11] is a learning framework to deal with problems of label\nambiguity. Unlike single-label learning (SLL) and multi-label learning (MLL) [26], which assume an\ninstance is assigned to a single label or multiple labels, LDL aims at learning the relative importance\nof each label involved in the description of an instance, i.e., a distribution over the set of labels. Such\na learning strategy is suitable for many real-world problems, which have label ambiguity. An example\nis facial age estimation [8]. Even humans cannot predict the precise age from a single facial image.\nThey may say that the person is probably in one age group and less likely to be in another. Hence it is\nmore natural to assign a distribution of age labels to each facial image (Fig. 1(a)) instead of using a\nsingle age label. Another example is movie rating prediction [7]. Many famous movie review web\nsites, such as Net\ufb02ix, IMDb and Douban, provide a crowd opinion for each movie speci\ufb01ed by the\ndistribution of ratings collected from their users (Fig. 1(b)). If a system could precisely predict such a\nrating distribution for every movie before it is released, movie producers can reduce their investment\nrisk and the audience can better choose which movies to watch.\nMany LDL methods assume the label distribution can be represented by a maximum entropy model [2]\nand learn it by optimizing an energy function based on the model [8, 11, 28, 6]. But, the exponential\npart of this model restricts the generality of the distribution form, e.g., it has dif\ufb01culty in representing\nmixture distributions. 
Some other LDL methods extend existing learning algorithms, e.g., boosting and support vector regression, to deal with label distributions [7, 27]. They avoid this assumption, but have limitations in representation learning, e.g., they do not learn deep features in an end-to-end manner.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\nFigure 1: Real-world data that are suitable to be modeled by label distribution learning. (a) Estimated facial ages (a unimodal distribution). (b) Rating distribution of crowd opinion on a movie (a multimodal distribution).\n\nIn this paper, we present label distribution learning forests (LDLFs), a novel label distribution learning algorithm inspired by differentiable decision trees [20]. Extending differentiable decision trees to deal with the LDL task has two advantages. The first is that decision trees have the potential to model any general form of label distributions by a mixture of the leaf node predictions, which avoids making strong assumptions on the form of the label distributions. The second is that the split node parameters in differentiable decision trees can be learned by back-propagation, which enables a combination of tree learning and representation learning in an end-to-end manner. We define a distribution-based loss function for a tree by the Kullback-Leibler (K-L) divergence between the ground truth label distribution and the distribution predicted by the tree. By fixing split nodes, we show that the optimization of leaf node predictions to minimize the loss function of the tree can be addressed by variational bounding [19, 29], in which the original loss function to be minimized gets iteratively replaced by a decreasing sequence of upper bounds. Following this optimization strategy, we derive a discrete iterative function to update the leaf node predictions. 
To learn a forest, we average the losses of all the individual trees to be the loss for the forest and allow the split nodes from different trees to be connected to the same output unit of the feature learning function. In this way, the split node parameters of all the individual trees can be learned jointly. Our LDLFs can be used as a (shallow) stand-alone model, and can also be integrated with any deep network, i.e., the feature learning function can be a linear transformation and a deep network, respectively. Fig. 2 sketches our LDLFs, showing a forest that consists of two trees.\nWe verify the effectiveness of our model on several LDL tasks, such as crowd opinion prediction on movies and disease prediction based on human genes, as well as one computer vision application, i.e., facial age estimation, showing significant improvements to the state-of-the-art LDL methods. The label distributions for these tasks include both unimodal distributions (e.g., the age distribution in Fig. 1(a)) and mixture distributions (the rating distribution on a movie in Fig. 1(b)). The superiority of our model on both of them verifies its ability to model any general form of label distributions.\n\nFigure 2: Illustration of a label distribution learning forest. The top circles denote the output units of the function f parameterized by \u0398, which can be a feature vector or a fully-connected layer of a deep network. The blue and green circles are split nodes and leaf nodes, respectively. Two index functions \u03d5\u2081 and \u03d5\u2082 are assigned to these two trees respectively. The black dashed arrows indicate the correspondence between the split nodes of these two trees and the output units of function f. Note that one output unit may correspond to split nodes belonging to different trees. Each tree has independent leaf node predictions q (denoted by histograms in leaf nodes). 
The output of the forest is a mixture of the tree predictions. f(\u00b7; \u0398) and q are learned jointly in an end-to-end manner.\n\n2 Related Work\n\nSince our LDL algorithm is inspired by differentiable decision trees, it is necessary to first review some typical techniques of decision trees. Then, we discuss current LDL methods.\nDecision trees. Random forests, or randomized decision trees [16, 1, 3, 4], are a popular ensemble predictive model suitable for many machine learning tasks. In the past, learning of a decision tree was based on heuristics such as a greedy algorithm where locally-optimal hard decisions are made at each split node [1], and thus could not be integrated into a deep learning framework, i.e., be combined with representation learning in an end-to-end manner.\nThe newly proposed deep neural decision forests (dNDFs) [20] overcome this problem by introducing a soft differentiable decision function at the split nodes and a global loss function defined on a tree. This ensures that the split node parameters can be learned by back-propagation and leaf node predictions can be updated by a discrete iterative function.\nOur method extends dNDFs to address LDL problems, but this extension is non-trivial, because learning leaf node predictions is a constrained convex optimization problem. Although a step-size free update function was given in dNDFs to update leaf node predictions, it was only proved to converge for a classification loss. Consequently, it was unclear how to obtain such an update function for other losses. We observed, however, that the update function in dNDFs can be derived from variational bounding, which allows us to extend it to our LDL loss. 
In addition, the strategies used in LDLFs and dNDFs to learn the ensemble of multiple trees (forests) are different: 1) we explicitly define a loss function for forests, while only the loss function for a single tree was defined in dNDFs; 2) we allow the split nodes from different trees to be connected to the same output unit of the feature learning function, while dNDFs did not; 3) all trees in LDLFs can be learned jointly, while trees in dNDFs were learned alternately. These changes in the ensemble learning are important, because as shown in our experiments (Sec. 4.4), LDLFs can get better results by using more trees, but by using the ensemble strategy proposed in dNDFs, the results of forests are even worse than those for a single tree.\nTo sum up, w.r.t. dNDFs [20], the contributions of LDLFs are: first, we extend from classification [20] to distribution learning by proposing a distribution-based loss for the forests and deriving the gradient to learn split nodes w.r.t. this loss; second, we derive the update function for leaf nodes by variational bounding (having observed that the update function in [20] was a special case of variational bounding); last but not least, we propose the above three strategies to learn the ensemble of multiple trees, which are different from [20], but which we show are effective.\nLabel distribution learning. A number of specialized algorithms have been proposed to address the LDL task, and have shown their effectiveness in many computer vision applications, such as facial age estimation [8, 11, 28], expression recognition [30] and hand orientation estimation [10].\nGeng et al. [8] defined the label distribution for an instance as a vector containing the probabilities of the instance having each label. 
They also gave a strategy to assign a proper label distribution to an instance with a single label, i.e., assigning a Gaussian or Triangle distribution whose peak is the single label, and proposed an algorithm called IIS-LLD, which is an iterative optimization process based on a two-layer energy based model. Yang et al. [28] then defined a three-layer energy based model, called SCE-LDL, in which the ability to perform feature learning is improved by adding an extra hidden layer, and sparsity constraints are also incorporated to improve the model. Geng [6] developed an accelerated version of IIS-LLD, called BFGS-LDL, by using quasi-Newton optimization. All the above LDL methods assume that the label distribution can be represented by a maximum entropy model [2], but the exponential part of this model restricts the generality of the distribution form.\nAnother way to address the LDL task is to extend existing learning algorithms to deal with label distributions. Geng and Hou [7] proposed LDSVR, an LDL method that extends the support vector regressor and fits a sigmoid function to each component of the distribution simultaneously by a support vector machine. Xing et al. [27] then extended boosting to address the LDL task by additive weighted regressors. They showed that using the vector tree model as the weak regressor can lead to better performance and named this method AOSO-LDLogitBoost. As the learning of this tree model is based on locally-optimal hard data partition functions at each split node, AOSO-LDLogitBoost cannot be combined with representation learning. Extending current deep learning algorithms to address the LDL task is an interesting topic. 
However, the existing method of this kind, DLDL [5], still focuses on maximum entropy model based LDL.\nOur method, LDLFs, extends differentiable decision trees to address LDL tasks, in which the predicted label distribution for a sample can be expressed as a linear combination of the label distributions of the training data, and thus has no restrictions on the distributions (e.g., no requirement of the maximum entropy model). In addition, thanks to the introduction of differentiable decision functions, LDLFs can be combined with representation learning, e.g., to learn deep features in an end-to-end manner.\n\n3 Label Distribution Learning Forests\n\nA forest is an ensemble of decision trees. We first introduce how to learn a single decision tree by label distribution learning, then describe the learning of a forest.\n\n3.1 Problem Formulation\n\nLet $\\mathcal{X} = \\mathbb{R}^m$ denote the input space and $\\mathcal{Y} = \\{y_1, y_2, \\ldots, y_C\\}$ denote the complete set of labels, where $C$ is the number of possible label values. We consider a label distribution learning (LDL) problem, where for each input sample $x \\in \\mathcal{X}$, there is a label distribution $d = (d_x^{y_1}, d_x^{y_2}, \\ldots, d_x^{y_C})^\\top \\in \\mathbb{R}^C$. Here $d_x^{y_c}$ expresses the probability of the sample $x$ having the $c$-th label $y_c$ and thus has the constraints that $d_x^{y_c} \\in [0, 1]$ and $\\sum_{c=1}^C d_x^{y_c} = 1$. The goal of the LDL problem is to learn a mapping function $g : x \\to d$ between an input sample $x$ and its corresponding label distribution $d$.\nHere, we want to learn the mapping function $g(x)$ by a decision tree based model $\\mathcal{T}$. A decision tree consists of a set of split nodes $\\mathcal{N}$ and a set of leaf nodes $\\mathcal{L}$. Each split node $n \\in \\mathcal{N}$ defines a split function $s_n(\\cdot; \\Theta) : \\mathcal{X} \\to [0, 1]$ parameterized by $\\Theta$ to determine whether a sample is sent to the left or right subtree. Each leaf node $\\ell \\in \\mathcal{L}$ holds a distribution $q_\\ell = (q_{\\ell 1}, q_{\\ell 2}, \\ldots, q_{\\ell C})^\\top$ over $\\mathcal{Y}$, i.e., $q_{\\ell c} \\in [0, 1]$ and $\\sum_{c=1}^C q_{\\ell c} = 1$. To build a differentiable decision tree, following [20], we use a probabilistic split function $s_n(x; \\Theta) = \\sigma(f_{\\phi(n)}(x; \\Theta))$, where $\\sigma(\\cdot)$ is a sigmoid function, $\\phi(\\cdot)$ is an index function to bring the $\\phi(n)$-th output of function $f(x; \\Theta)$ in correspondence with split node $n$, and $f : x \\to \\mathbb{R}^M$ is a real-valued feature learning function depending on the sample $x$ and the parameter $\\Theta$, and can take any form. In a simple form, it can be a linear transformation of $x$, where $\\Theta$ is the transformation matrix; in a complex form, it can be a deep network that performs representation learning in an end-to-end manner, where $\\Theta$ is the network parameter. The correspondence between the split nodes and the output units of function $f$ is indicated by $\\phi(\\cdot)$, which is randomly generated before tree learning, i.e., which output units of $f$ are used for constructing a tree is determined randomly. An example to demonstrate $\\phi(\\cdot)$ is shown in Fig. 2. Then, the probability of the sample $x$ falling into leaf node $\\ell$ is given by\n\n$$p(\\ell|x; \\Theta) = \\prod_{n \\in \\mathcal{N}} s_n(x; \\Theta)^{\\mathbf{1}(\\ell \\in \\mathcal{L}_n^l)} \\big(1 - s_n(x; \\Theta)\\big)^{\\mathbf{1}(\\ell \\in \\mathcal{L}_n^r)}, \\tag{1}$$\n\nwhere $\\mathbf{1}(\\cdot)$ is an indicator function and $\\mathcal{L}_n^l$ and $\\mathcal{L}_n^r$ denote the sets of leaf nodes held by the left and right subtrees of node $n$, $\\mathcal{T}_n^l$ and $\\mathcal{T}_n^r$, respectively. The output of the tree $\\mathcal{T}$ w.r.t. $x$, i.e., the mapping function $g$, is defined by\n\n$$g(x; \\Theta, \\mathcal{T}) = \\sum_{\\ell \\in \\mathcal{L}} p(\\ell|x; \\Theta)\\, q_\\ell. \\tag{2}$$\n\n3.2 Tree Optimization\n\nGiven a training set $\\mathcal{S} = \\{(x_i, d_i)\\}_{i=1}^N$, our goal is to learn a decision tree $\\mathcal{T}$ described in Sec. 3.1 which can output a distribution $g(x_i; \\Theta, \\mathcal{T})$ similar to $d_i$ for each sample $x_i$. 
To this end, a straightforward way is to minimize the Kullback-Leibler (K-L) divergence between each $g(x_i; \\Theta, \\mathcal{T})$ and $d_i$, or equivalently to minimize the following cross-entropy loss:\n\n$$R(q, \\Theta; \\mathcal{S}) = -\\frac{1}{N} \\sum_{i=1}^N \\sum_{c=1}^C d_{x_i}^{y_c} \\log\\big(g_c(x_i; \\Theta, \\mathcal{T})\\big) = -\\frac{1}{N} \\sum_{i=1}^N \\sum_{c=1}^C d_{x_i}^{y_c} \\log\\Big(\\sum_{\\ell \\in \\mathcal{L}} p(\\ell|x_i; \\Theta)\\, q_{\\ell c}\\Big), \\tag{3}$$\n\nwhere $q$ denotes the distributions held by all the leaf nodes $\\mathcal{L}$ and $g_c(x_i; \\Theta, \\mathcal{T})$ is the $c$-th output unit of $g(x_i; \\Theta, \\mathcal{T})$. Learning the tree $\\mathcal{T}$ requires the estimation of two parameters: 1) the split node parameter $\\Theta$ and 2) the distributions $q$ held by the leaf nodes. The best parameters $(\\Theta^*, q^*)$ are determined by\n\n$$(\\Theta^*, q^*) = \\arg\\min_{\\Theta, q} R(q, \\Theta; \\mathcal{S}). \\tag{4}$$\n\nTo solve Eqn. 4, we consider an alternating optimization strategy: first, we fix $q$ and optimize $\\Theta$; then, we fix $\\Theta$ and optimize $q$. These two learning steps are performed alternately, until convergence or a maximum number of iterations is reached (defined in the experiments).\n\n3.2.1 Learning Split Nodes\n\nIn this section, we describe how to learn the parameter $\\Theta$ for split nodes, when the distributions held by the leaf nodes $q$ are fixed. We compute the gradient of the loss $R(q, \\Theta; \\mathcal{S})$ w.r.t. $\\Theta$ by the chain rule:\n\n$$\\frac{\\partial R(q, \\Theta; \\mathcal{S})}{\\partial \\Theta} = \\sum_{i=1}^N \\sum_{n \\in \\mathcal{N}} \\frac{\\partial R(q, \\Theta; \\mathcal{S})}{\\partial f_{\\phi(n)}(x_i; \\Theta)} \\frac{\\partial f_{\\phi(n)}(x_i; \\Theta)}{\\partial \\Theta}, \\tag{5}$$\n\nwhere only the first term depends on the tree and the second term depends on the specific type of the function $f_{\\phi(n)}$. The first term is given by\n\n$$\\frac{\\partial R(q, \\Theta; \\mathcal{S})}{\\partial f_{\\phi(n)}(x_i; \\Theta)} = \\frac{1}{N} \\sum_{c=1}^C d_{x_i}^{y_c} \\Big( s_n(x_i; \\Theta) \\frac{g_c(x_i; \\Theta, \\mathcal{T}_n^r)}{g_c(x_i; \\Theta, \\mathcal{T})} - \\big(1 - s_n(x_i; \\Theta)\\big) \\frac{g_c(x_i; \\Theta, \\mathcal{T}_n^l)}{g_c(x_i; \\Theta, \\mathcal{T})} \\Big), \\tag{6}$$\n\nwhere $g_c(x_i; \\Theta, \\mathcal{T}_n^l) = \\sum_{\\ell \\in \\mathcal{L}_n^l} p(\\ell|x_i; \\Theta)\\, q_{\\ell c}$ and $g_c(x_i; \\Theta, \\mathcal{T}_n^r) = \\sum_{\\ell \\in \\mathcal{L}_n^r} p(\\ell|x_i; \\Theta)\\, q_{\\ell c}$. Note that, letting $\\mathcal{T}_n$ be the tree rooted at the node $n$, we have $g_c(x_i; \\Theta, \\mathcal{T}_n) = g_c(x_i; \\Theta, \\mathcal{T}_n^l) + g_c(x_i; \\Theta, \\mathcal{T}_n^r)$. This means the gradient computation in Eqn. 6 can be started at the leaf nodes and carried out in a bottom-up manner. Thus, the split node parameters can be learned by standard back-propagation.\n\n3.2.2 Learning Leaf Nodes\n\nNow, fixing the parameter $\\Theta$, we show how to learn the distributions held by the leaf nodes $q$, which is a constrained optimization problem:\n\n$$\\min_{q} R(q, \\Theta; \\mathcal{S}), \\quad \\text{s.t.}\\ \\forall \\ell,\\ \\sum_{c=1}^C q_{\\ell c} = 1. \\tag{7}$$\n\nHere, we propose to address this constrained convex optimization problem by variational bounding [19, 29], which leads to a step-size free and fast-converging update rule for $q$. In variational bounding, an original objective function to be minimized gets replaced by its bound in an iterative manner. An upper bound for the loss function $R(q, \\Theta; \\mathcal{S})$ can be obtained by Jensen\u2019s inequality:\n\n$$R(q, \\Theta; \\mathcal{S}) = -\\frac{1}{N} \\sum_{i=1}^N \\sum_{c=1}^C d_{x_i}^{y_c} \\log\\Big(\\sum_{\\ell \\in \\mathcal{L}} p(\\ell|x_i; \\Theta)\\, q_{\\ell c}\\Big) \\leq -\\frac{1}{N} \\sum_{i=1}^N \\sum_{c=1}^C d_{x_i}^{y_c} \\sum_{\\ell \\in \\mathcal{L}} \\xi_\\ell(\\bar{q}_{\\ell c}, x_i) \\log\\Big(\\frac{p(\\ell|x_i; \\Theta)\\, q_{\\ell c}}{\\xi_\\ell(\\bar{q}_{\\ell c}, x_i)}\\Big), \\tag{8}$$\n\nwhere $\\xi_\\ell(q_{\\ell c}, x_i) = \\frac{p(\\ell|x_i; \\Theta)\\, q_{\\ell c}}{g_c(x_i; \\Theta, \\mathcal{T})}$, which sums to one over $\\ell \\in \\mathcal{L}$, so that Jensen\u2019s inequality applies. 
We define\n\n$$\\varphi(q, \\bar{q}) = -\\frac{1}{N} \\sum_{i=1}^N \\sum_{c=1}^C d_{x_i}^{y_c} \\sum_{\\ell \\in \\mathcal{L}} \\xi_\\ell(\\bar{q}_{\\ell c}, x_i) \\log\\Big(\\frac{p(\\ell|x_i; \\Theta)\\, q_{\\ell c}}{\\xi_\\ell(\\bar{q}_{\\ell c}, x_i)}\\Big). \\tag{9}$$\n\nThen $\\varphi(q, \\bar{q})$ is an upper bound for $R(q, \\Theta; \\mathcal{S})$, which has the property that for any $q$ and $\\bar{q}$, $\\varphi(q, \\bar{q}) \\geq R(q, \\Theta; \\mathcal{S})$, and $\\varphi(q, q) = R(q, \\Theta; \\mathcal{S})$. Assume that we are at a point $q^{(t)}$ corresponding to the $t$-th iteration, then $\\varphi(q, q^{(t)})$ is an upper bound for $R(q, \\Theta; \\mathcal{S})$. In the next iteration, $q^{(t+1)}$ is chosen such that $\\varphi(q^{(t+1)}, q^{(t)}) \\leq R(q^{(t)}, \\Theta; \\mathcal{S})$, which implies $R(q^{(t+1)}, \\Theta; \\mathcal{S}) \\leq R(q^{(t)}, \\Theta; \\mathcal{S})$. Consequently, we can minimize $\\varphi(q, \\bar{q})$ instead of $R(q, \\Theta; \\mathcal{S})$ after ensuring that $R(q^{(t)}, \\Theta; \\mathcal{S}) = \\varphi(q^{(t)}, \\bar{q})$, i.e., $\\bar{q} = q^{(t)}$. So we have\n\n$$q^{(t+1)} = \\arg\\min_{q} \\varphi(q, q^{(t)}), \\quad \\text{s.t.}\\ \\forall \\ell,\\ \\sum_{c=1}^C q_{\\ell c} = 1, \\tag{10}$$\n\nwhich leads to minimizing the Lagrangian defined by\n\n$$\\phi(q, q^{(t)}) = \\varphi(q, q^{(t)}) + \\sum_{\\ell \\in \\mathcal{L}} \\lambda_\\ell \\Big(\\sum_{c=1}^C q_{\\ell c} - 1\\Big), \\tag{11}$$\n\nwhere $\\lambda_\\ell$ is the Lagrange multiplier. By setting $\\frac{\\partial \\phi(q, q^{(t)})}{\\partial q_{\\ell c}} = 0$, we have\n\n$$\\lambda_\\ell = \\frac{1}{N} \\sum_{i=1}^N \\sum_{c=1}^C d_{x_i}^{y_c}\\, \\xi_\\ell(q_{\\ell c}^{(t)}, x_i) \\quad \\text{and} \\quad q_{\\ell c}^{(t+1)} = \\frac{\\sum_{i=1}^N d_{x_i}^{y_c}\\, \\xi_\\ell(q_{\\ell c}^{(t)}, x_i)}{\\sum_{c=1}^C \\sum_{i=1}^N d_{x_i}^{y_c}\\, \\xi_\\ell(q_{\\ell c}^{(t)}, x_i)}. \\tag{12}$$\n\nNote that $q_{\\ell c}^{(t+1)}$ satisfies $q_{\\ell c}^{(t+1)} \\in [0, 1]$ and $\\sum_{c=1}^C q_{\\ell c}^{(t+1)} = 1$. Eqn. 12 is the update scheme for distributions held by the leaf nodes. The starting point $q_\\ell^{(0)}$ can be simply initialized by the uniform distribution: $q_{\\ell c}^{(0)} = \\frac{1}{C}$.\n\n3.3 Learning a Forest\n\nA forest is an ensemble of decision trees $\\mathcal{F} = \\{\\mathcal{T}_1, \\ldots, \\mathcal{T}_K\\}$. In the training stage, all trees in the forest $\\mathcal{F}$ use the same parameters $\\Theta$ for the feature learning function $f(\\cdot; \\Theta)$ (but correspond to different output units of $f$ assigned by $\\phi$, see Fig. 2), but each tree has independent leaf node predictions $q$. The loss function for a forest is given by averaging the loss functions for all individual trees: $R_{\\mathcal{F}} = \\frac{1}{K} \\sum_{k=1}^K R_{\\mathcal{T}_k}$, where $R_{\\mathcal{T}_k}$ is the loss function for tree $\\mathcal{T}_k$ defined by Eqn. 3. To learn $\\Theta$ by fixing the leaf node predictions $q$ of all the trees in the forest $\\mathcal{F}$, based on the derivation in Sec. 3.2 and referring to Fig. 2, we have\n\n$$\\frac{\\partial R_{\\mathcal{F}}}{\\partial \\Theta} = \\frac{1}{K} \\sum_{i=1}^N \\sum_{k=1}^K \\sum_{n \\in \\mathcal{N}_k} \\frac{\\partial R_{\\mathcal{T}_k}}{\\partial f_{\\phi_k(n)}(x_i; \\Theta)} \\frac{\\partial f_{\\phi_k(n)}(x_i; \\Theta)}{\\partial \\Theta}, \\tag{13}$$\n\nwhere $\\mathcal{N}_k$ and $\\phi_k(\\cdot)$ are the split node set and the index function of $\\mathcal{T}_k$, respectively. Note that the index function $\\phi_k(\\cdot)$ for each tree is randomly assigned before tree learning, and thus split nodes correspond to a subset of output units of $f$. This strategy is similar to the random subspace method [17], which increases the randomness in training to reduce the risk of overfitting.\nAs for $q$, since each tree in the forest $\\mathcal{F}$ has its own leaf node predictions $q$, we can update them independently by Eqn. 12, given $\\Theta$. For implementational convenience, we do not conduct this update scheme on the whole dataset $\\mathcal{S}$ but on a set of mini-batches $\\mathcal{B}$. The training procedure of an LDLF is shown in Algorithm 1.\n\nAlgorithm 1 The training procedure of an LDLF.\nRequire: $\\mathcal{S}$: training set, $n_B$: the number of mini-batches to update $q$\nInitialize $\\Theta$ randomly and $q$ uniformly, set $\\mathcal{B} = \\emptyset$\nwhile not converged do\n  while $|\\mathcal{B}| < n_B$ do\n    Randomly select a mini-batch $B$ from $\\mathcal{S}$\n    Update $\\Theta$ by computing gradient (Eqn. 13) on $B$\n    $\\mathcal{B} = \\mathcal{B} \\cup \\{B\\}$\n  end while\n  Update $q$ by iterating Eqn. 12 on $\\mathcal{B}$\n  $\\mathcal{B} = \\emptyset$\nend while\n\nIn the testing stage, the output of the forest $\\mathcal{F}$ is given by averaging the predictions from all the individual trees: $g(x; \\Theta, \\mathcal{F}) = \\frac{1}{K} \\sum_{k=1}^K g(x; \\Theta, \\mathcal{T}_k)$.\n\n4 Experimental Results\n\nOur realization of LDLFs is based on \u201cCaffe\u201d [18]. It is modular and implemented as a standard neural network layer. We can either use it as a shallow stand-alone model (sLDLFs) or integrate it with any deep network (dLDLFs). We evaluate sLDLFs on different LDL tasks and compare it with other stand-alone LDL methods. As dLDLFs can be learned from raw image data in an end-to-end manner, we verify dLDLFs on a computer vision application, i.e., facial age estimation. The default settings for the parameters of our forests are: tree number (5), tree depth (7), output unit number of the feature learning function (64), iteration times to update leaf node predictions (20), the number of mini-batches to update leaf node predictions (100), maximum iteration (25000).\n\n4.1 Comparison of sLDLFs to Stand-alone LDL Methods\n\nWe compare our shallow model sLDLFs with other state-of-the-art stand-alone LDL methods. For sLDLFs, the feature learning function $f(x, \\Theta)$ is a linear transformation of $x$, i.e., the $i$-th output unit $f_i(x, \\theta_i) = \\theta_i^\\top x$, where $\\theta_i$ is the $i$-th column of the transformation matrix $\\Theta$. We used 3 popular LDL datasets in [6], Movie, Human Gene and Natural Scene\u00b9. 
The samples in these 3 datasets are represented by numerical descriptors, and the ground truths for them are the rating distributions of crowd opinion on movies, the disease distributions related to human genes and label distributions on scenes, such as plant, sky and cloud, respectively. The label distributions of these 3 datasets are mixture distributions, such as the rating distribution shown in Fig. 1(b). Following [7, 27], we use 6 measures to evaluate the performances of LDL methods, which compute the average similarity/distance between the predicted rating distributions and the real rating distributions, including 4 distance measures (K-L, Euclidean, S\u00f8rensen, Squared \u03c7\u00b2) and two similarity measures (Fidelity, Intersection).\nWe evaluate our shallow model sLDLFs on these 3 datasets and compare it with other state-of-the-art stand-alone LDL methods. The results of sLDLFs and the competitors are summarized in Table 1. For Movie we quote the results reported in [27], as the code of [27] is not publicly available. For the other two datasets, we run the code that the authors made available. In all cases, following [27, 6], we split each dataset into 10 fixed folds and do standard ten-fold cross validation, reporting results as \u201cmean\u00b1standard deviation\u201d, which makes them less dependent on how the training and testing data are divided. As can be seen from Table 1, sLDLFs perform best on all of the six measures.\n\nTable 1: Comparison results on three LDL datasets [6]. \u201c\u2191\u201d and \u201c\u2193\u201d indicate the larger and the smaller the better, respectively.\n\n| Dataset | Method | K-L \u2193 | Euclidean \u2193 | S\u00f8rensen \u2193 | Squared \u03c7\u00b2 \u2193 | Fidelity \u2191 | Intersection \u2191 |\n|---|---|---|---|---|---|---|---|\n| Movie | sLDLF (ours) | 0.073\u00b10.005 | 0.133\u00b10.003 | 0.130\u00b10.003 | 0.070\u00b10.004 | 0.981\u00b10.001 | 0.870\u00b10.003 |\n| Movie | AOSO-LDLogitBoost [27] | 0.086\u00b10.004 | 0.155\u00b10.003 | 0.152\u00b10.003 | 0.084\u00b10.003 | 0.978\u00b10.001 | 0.848\u00b10.003 |\n| Movie | LDLogitBoost [27] | 0.090\u00b10.004 | 0.159\u00b10.003 | 0.155\u00b10.003 | 0.088\u00b10.003 | 0.977\u00b10.001 | 0.845\u00b10.003 |\n| Movie | LDSVR [7] | 0.092\u00b10.005 | 0.158\u00b10.004 | 0.156\u00b10.004 | 0.088\u00b10.004 | 0.977\u00b10.001 | 0.844\u00b10.004 |\n| Movie | BFGS-LDL [6] | 0.099\u00b10.004 | 0.167\u00b10.004 | 0.164\u00b10.003 | 0.096\u00b10.004 | 0.974\u00b10.001 | 0.836\u00b10.003 |\n| Movie | IIS-LDL [11] | 0.129\u00b10.007 | 0.187\u00b10.004 | 0.183\u00b10.004 | 0.120\u00b10.005 | 0.967\u00b10.001 | 0.817\u00b10.004 |\n| Human Gene | sLDLF (ours) | 0.228\u00b10.006 | 0.085\u00b10.002 | 0.212\u00b10.002 | 0.179\u00b10.004 | 0.948\u00b10.001 | 0.788\u00b10.002 |\n| Human Gene | LDSVR [7] | 0.245\u00b10.019 | 0.099\u00b10.005 | 0.229\u00b10.015 | 0.189\u00b10.021 | 0.940\u00b10.006 | 0.771\u00b10.015 |\n| Human Gene | BFGS-LDL [6] | 0.231\u00b10.021 | 0.076\u00b10.006 | 0.231\u00b10.012 | 0.211\u00b10.018 | 0.938\u00b10.008 | 0.769\u00b10.012 |\n| Human Gene | IIS-LDL [11] | 0.239\u00b10.018 | 0.089\u00b10.006 | 0.253\u00b10.009 | 0.205\u00b10.012 | 0.944\u00b10.003 | 0.747\u00b10.009 |\n| Natural Scene | sLDLF (ours) | 0.534\u00b10.013 | 0.317\u00b10.014 | 0.336\u00b10.010 | 0.448\u00b10.017 | 0.824\u00b10.008 | 0.664\u00b10.010 |\n| Natural Scene | LDSVR [7] | 0.852\u00b10.023 | 0.511\u00b10.021 | 0.492\u00b10.016 | 0.595\u00b10.026 | 0.813\u00b10.008 | 0.509\u00b10.016 |\n| Natural Scene | BFGS-LDL [6] | 0.856\u00b10.061 | 0.475\u00b10.029 | 0.508\u00b10.026 | 0.716\u00b10.041 | 0.722\u00b10.021 | 0.492\u00b10.026 |\n| Natural Scene | IIS-LDL [11] | 0.879\u00b10.023 | 0.458\u00b10.014 | 0.539\u00b10.011 | 0.792\u00b10.019 | 0.686\u00b10.009 | 0.461\u00b10.011 |\n\n\u00b9 We download these datasets from http://cse.seu.edu.cn/people/xgeng/LDL/index.htm.\n\n4.2 Evaluation of dLDLFs on Facial Age Estimation\n\nIn some literature [8, 11, 28, 15, 5], age estimation is formulated as an LDL problem. We conduct facial age estimation experiments on Morph [24], which contains more than 50,000 facial images from about 13,000 people of different races. Each facial image is annotated with a chronological age. To generate an age distribution for each face image, we follow the same strategy used in [8, 28, 5], which uses a Gaussian distribution whose mean is the chronological age of the face image (Fig. 1(a)). The predicted age for a face image is simply the age having the highest probability in the predicted label distribution. The performance of age estimation is evaluated by the mean absolute error (MAE) between predicted ages and chronological ages. As the current state-of-the-art result on Morph is obtained by fine-tuning DLDL [5] on VGG-Face [23], we also build a dLDLF on VGG-Face, by replacing the softmax layer in VGGNet by an LDLF. Following [5], we do standard ten-fold cross validation and the results are summarized in Table 2, which shows that dLDLF achieves the state-of-the-art performance on Morph. 
Note that the significant performance gain of deep LDL models\n(DLDL and dLDLF) over non-deep LDL models (IIS-LDL, CPNN, BFGS-LDL) and the superiority\nof dLDLF over DLDL verify the effectiveness of end-to-end learning and of our tree-based\nmodel for LDL, respectively.\n\nTable 2: MAE of age estimation comparison on Morph [24].\n\nMethod | MAE\nIIS-LDL [11] | 5.67\u00b10.15\nCPNN [11] | 4.87\u00b10.31\nBFGS-LDL [6] | 3.94\u00b10.05\nDLDL+VGG-Face [5] | 2.42\u00b10.01\ndLDLF+VGG-Face (ours) | 2.24\u00b10.02\n\nAs the distribution of gender and ethnicity is very unbalanced in Morph, many age estimation methods\n[13, 14, 15] are evaluated on a subset of Morph, called Morph_Sub for short, which consists of\n20,160 selected facial images to avoid the influence of the unbalanced distribution. The best performance\nreported on Morph_Sub is given by D2LDL [15], a data-dependent LDL method. As D2LDL used\nthe output of the \u201cfc7\u201d layer in AlexNet [21] as the face image features, here we integrate an LDLF\nwith AlexNet. Following the experimental setting used in D2LDL, we evaluate our dLDLF and the\ncompetitors, including both SLL and LDL based methods, under six different training set ratios (10%\nto 60%). All of the competitors are trained on the same deep features used by D2LDL. As can be\nseen from Table 3, our dLDLFs significantly outperform the others for all training set ratios.\nNote that the generated age distributions\nare unimodal distributions\nand the label distributions used in\nSec. 
4.1 are mixture distributions.\nThe proposed LDLFs achieve\nstate-of-the-art results on both of\nthem, which verifies that our model\nhas the ability to model any general\nform of label distributions.\n\nFigure 3: MAE of age estimation comparison on\nMorph_Sub.\n\nMethod | Training set ratio: 10% | 20% | 30% | 40% | 50% | 60%\nAAS [22] | 4.9081 | 4.7616 | 4.6507 | 4.5553 | 4.4690 | 4.4061\nLARR [12] | 4.7501 | 4.6112 | 4.5131 | 4.4273 | 4.3500 | 4.2949\nIIS-ALDL [9] | 4.1791 | 4.1683 | 4.1228 | 4.1107 | 4.1024 | 4.0902\nD2LDL [15] | 4.1080 | 3.9857 | 3.9204 | 3.8712 | 3.8560 | 3.8385\ndLDLF (ours) | 3.8495 | 3.6220 | 3.3991 | 3.2401 | 3.1917 | 3.1224\n\n4.3 Time Complexity\n\nLet h and s_B be the tree depth and the\nbatch size, respectively. Each tree has 2^{h\u22121} \u2212 1 split nodes and 2^{h\u22121} leaf nodes. Let D = 2^{h\u22121} \u2212 1.\nFor one tree and one sample, the complexities of a forward pass and a backward pass are O(D +\n(D + 1)\u00d7C) = O(D\u00d7C) and O((D + 1)\u00d7C + D\u00d7C) = O(D\u00d7C), respectively. So for K trees and\nn_B batches, the complexity of a forward and backward pass is O(D\u00d7C\u00d7K\u00d7n_B\u00d7s_B). The complexity\nof an iteration to update leaf nodes is O(n_B\u00d7s_B\u00d7K\u00d7C\u00d7(D + 1)) = O(D\u00d7C\u00d7K\u00d7n_B\u00d7s_B).\nThus, the complexities of the training procedure (one epoch, n_B batches) and the testing procedure\n(one sample) are O(D\u00d7C\u00d7K\u00d7n_B\u00d7s_B) and O(D\u00d7C\u00d7K), respectively. LDLFs are efficient: on\nMorph_Sub (12636 training images, 8424 testing images), our model takes only 5250s for training\n(25000 iterations) and 8s for testing all 8424 images.\n\n4.4 Parameter Discussion\n\nNow we discuss the influence of parameter settings on performance. In this section we report the results of rating\nprediction on Movie (measured by K-L) and age estimation on Morph_Sub with a 60% training set\nratio (measured by MAE) for different parameter settings.\nTree number. 
As a forest is an ensemble model, it is necessary to investigate how performance\nchanges as the number of trees in a forest varies. Note that, as we discussed in Sec. 2, the\nensemble strategy to learn a forest proposed in dNDFs [20] is different from ours. Therefore, it is\nnecessary to see which ensemble strategy is better for learning a forest. To this end, we replace our\nensemble strategy in dLDLFs with the one used in dNDFs, and name this method dNDFs-LDL. The\ncorresponding shallow model is named sNDFs-LDL. We fix the other parameters, i.e., tree depth and\noutput unit number of the feature learning function, to the default setting. As shown in Fig. 4 (a), our\nensemble strategy can improve the performance by using more trees, while the one used in dNDFs\neven leads to worse performance than that of a single tree.\nAs observed from Fig. 4, the performance of LDLFs can be improved by using more trees, but the\nimprovement becomes progressively smaller. Therefore, using much larger ensembles\ndoes not yield a big improvement (on Movie, tree number K = 100: K-L = 0.070 vs K = 20:\nK-L = 0.071). Note that not all random forest based methods use a large number of trees, e.g.,\nShotton et al. [25] obtained very good pose estimation results from depth images with only 3 decision\ntrees.\nTree depth. Tree depth is another important parameter for decision trees. In LDLFs, there is an\nimplicit constraint between tree depth h and output unit number of the feature learning function \u03c4:\n\u03c4 \u2265 2^{h\u22121} \u2212 1. To discuss the influence of tree depth on the performance of dLDLFs, we set \u03c4 = 2^{h\u22121}\nand fix tree number K = 1; the performance change by varying tree depth is shown in Fig. 4 (b).\nWe see that the performance first improves and then decreases as the tree depth increases. 
The\nreason is that, as the tree depth increases, the dimension of learned features increases exponentially, which\ngreatly increases the training difficulty. So using much larger depths may lead to bad performance\n(on Movie, tree depth h = 18: K-L = 0.1162 vs h = 9: K-L = 0.0831).\n\nFigure 4: The performance change of age estimation on Morph_Sub and rating prediction on Movie\nby varying (a) tree number and (b) tree depth. Our approach (dLDLFs/sLDLFs) can improve the\nperformance by using more trees, while using the ensemble strategy proposed in dNDFs\n(dNDFs-LDL/sNDFs-LDL) even leads to worse performance than that of a single tree.\n\n5 Conclusion\n\nWe present label distribution learning forests, a novel label distribution learning algorithm inspired by\ndifferentiable decision trees. We defined a distribution-based loss function for the forests and found\nthat the leaf node predictions can be optimized via variational bounding, which enables all the trees\nand the features they use to be learned jointly in an end-to-end manner. Experimental results showed\nthe superiority of our algorithm for several LDL tasks and a related computer vision application, and\nverified that our model has the ability to model any general form of label distributions.\nAcknowledgement. This work was supported in part by the National Natural Science Foundation of\nChina No. 61672336, in part by the \u201cChen Guang\u201d project supported by Shanghai Municipal Education\nCommission and Shanghai Education Development Foundation No. 15CG43, and in part by ONR\nN00014-15-1-2356.\n\nReferences\n\n[1] Y. Amit and D. Geman. Shape quantization and recognition with randomized trees. Neural Computation, 9(7):1545\u20131588, 1997.\n\n[2] A. L. Berger, S. D. Pietra, and V. J. D. Pietra. A maximum entropy approach to natural language processing. Computational Linguistics, 22(1):39\u201371, 1996.\n\n[3] L. Breiman. Random forests. Machine Learning, 45(1):5\u201332, 2001.\n\n[4] A. 
Criminisi and J. Shotton. Decision Forests for Computer Vision and Medical Image Analysis. Springer, 2013.\n\n[5] B.-B. Gao, C. Xing, C.-W. Xie, J. Wu, and X. Geng. Deep label distribution learning with label ambiguity. arXiv:1611.01731, 2017.\n\n[6] X. Geng. Label distribution learning. IEEE Trans. Knowl. Data Eng., 28(7):1734\u20131748, 2016.\n\n[7] X. Geng and P. Hou. Pre-release prediction of crowd opinion on movies by label distribution learning. In Proc. IJCAI, pages 3511\u20133517, 2015.\n\n[8] X. Geng, K. Smith-Miles, and Z. Zhou. Facial age estimation by learning from label distributions. In Proc. AAAI, 2010.\n\n[9] X. Geng, Q. Wang, and Y. Xia. Facial age estimation by adaptive label distribution learning. In Proc. ICPR, pages 4465\u20134470, 2014.\n\n[10] X. Geng and Y. Xia. Head pose estimation based on multivariate label distribution. In Proc. CVPR, pages 1837\u20131842, 2014.\n\n[11] X. Geng, C. Yin, and Z. Zhou. Facial age estimation by learning from label distributions. IEEE Trans. Pattern Anal. Mach. Intell., 35(10):2401\u20132412, 2013.\n\n[12] G. Guo, Y. Fu, C. R. Dyer, and T. S. Huang. Image-based human age estimation by manifold learning and locally adjusted robust regression. IEEE Trans. Image Processing, 17(7):1178\u20131188, 2008.\n\n[13] G. Guo and G. Mu. Human age estimation: What is the influence across race and gender? In CVPR Workshops, pages 71\u201378, 2010.\n\n[14] G. Guo and C. Zhang. A study on cross-population age estimation. In Proc. CVPR, pages 4257\u20134263, 2014.\n\n[15] Z. He, X. Li, Z. Zhang, F. Wu, X. Geng, Y. Zhang, M.-H. Yang, and Y. Zhuang. Data-dependent label distribution learning for age estimation. IEEE Trans. Image Processing, 2017.\n\n[16] T. K. Ho. Random decision forests. In Proc. ICDAR, pages 278\u2013282, 1995.\n\n[17] T. K. Ho. The random subspace method for constructing decision forests. IEEE Trans. Pattern Anal. 
Mach. Intell., 20(8):832\u2013844, 1998.\n\n[18] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. Caffe: Convolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.\n\n[19] M. I. Jordan, Z. Ghahramani, T. S. Jaakkola, and L. K. Saul. An introduction to variational methods for graphical models. Machine Learning, 37(2):183\u2013233, 1999.\n\n[20] P. Kontschieder, M. Fiterau, A. Criminisi, and S. R. Bul\u00f2. Deep neural decision forests. In Proc. ICCV, pages 1467\u20131475, 2015.\n\n[21] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional neural networks. In Proc. NIPS, pages 1106\u20131114, 2012.\n\n[22] A. Lanitis, C. Draganova, and C. Christodoulou. Comparing different classifiers for automatic age estimation. IEEE Trans. on Cybernetics, 34(1):621\u2013628, 2004.\n\n[23] O. M. Parkhi, A. Vedaldi, and A. Zisserman. Deep face recognition. In Proc. BMVC, pages 41.1\u201341.12, 2015.\n\n[24] K. Ricanek and T. Tesafaye. MORPH: A longitudinal image database of normal adult age-progression. In Proc. FG, pages 341\u2013345, 2006.\n\n[25] J. Shotton, A. W. Fitzgibbon, M. Cook, T. Sharp, M. Finocchio, R. Moore, A. Kipman, and A. Blake. Real-time human pose recognition in parts from single depth images. In Proc. CVPR, pages 1297\u20131304, 2011.\n\n[26] G. Tsoumakas and I. Katakis. Multi-label classification: An overview. International Journal of Data Warehousing and Mining, 3(3):1\u201313, 2007.\n\n[27] C. Xing, X. Geng, and H. Xue. Logistic boosting regression for label distribution learning. In Proc. CVPR, pages 4489\u20134497, 2016.\n\n[28] X. Yang, X. Geng, and D. Zhou. Sparsity conditional energy label distribution learning for age estimation. In Proc. IJCAI, pages 2259\u20132265, 2016.\n\n[29] A. L. Yuille and A. Rangarajan. The concave-convex procedure. 
Neural Computation, 15(4):915\u2013936, 2003.\n\n[30] Y. Zhou, H. Xue, and X. Geng. Emotion distribution recognition from facial expressions. In Proc. MM, pages 1247\u20131250, 2015.", "award": [], "sourceid": 557, "authors": [{"given_name": "Wei", "family_name": "Shen", "institution": "Shanghai University"}, {"given_name": "KAI", "family_name": "ZHAO", "institution": "Nankai University"}, {"given_name": "Yilu", "family_name": "Guo", "institution": "Shanghai University"}, {"given_name": "Alan", "family_name": "Yuille", "institution": "Johns Hopkins University"}]}