{"title": "Self Supervised Boosting", "book": "Advances in Neural Information Processing Systems", "page_first": 681, "page_last": 688, "abstract": null, "full_text": "Self Supervised Boosting\n\nMax Welling, Richard S. Zemel, and Geoffrey E. Hinton\n\nDepartment of Computer Science\n\nUniversity of Toronto\n10 King\u2019s College Road\n\nToronto, M5S 3G5 Canada\n\nAbstract\n\nBoosting algorithms and successful applications thereof abound for clas-\nsi\ufb01cation and regression learning problems, but not for unsupervised\nlearning. We propose a sequential approach to adding features to a ran-\ndom \ufb01eld model by training them to improve classi\ufb01cation performance\nbetween the data and an equal-sized sample of \u201cnegative examples\u201d gen-\nerated from the model\u2019s current estimate of the data density. Training in\neach boosting round proceeds in three stages: \ufb01rst we sample negative\nexamples from the model\u2019s current Boltzmann distribution. Next, a fea-\nture is trained to improve classi\ufb01cation performance between data and\nnegative examples. Finally, a coef\ufb01cient is learned which determines the\nimportance of this feature relative to ones already in the pool. Negative\nexamples only need to be generated once to learn each new feature. The\nvalidity of the approach is demonstrated on binary digits and continuous\nsynthetic data.\n\n1 Introduction\n\nWhile researchers have developed and successfully applied a myriad of boosting algorithms\nfor classi\ufb01cation and regression problems, boosting for density estimation has received rel-\natively scant attention. Yet incremental, stage-wise \ufb01tting is an attractive model for density\nestimation. One can imagine that the initial features, or weak learners, could model the\nrough outlines of the data density, and more detailed carving of the density landscape could\noccur on each successive round. 
Ideally, the algorithm would achieve automatic model selection, determining the requisite number of weak learners on its own. It has proven difficult to formulate an objective for such a system under which the weights on examples and the objective for training a weak learner at each round have a natural gradient-descent interpretation, as in standard boosting algorithms [10] [7]. In this paper we propose an algorithm that provides some progress towards this goal.

A key idea in our algorithm is that unsupervised learning can be converted into supervised learning by using the model's imperfect current estimate of the data to generate negative examples. A form of this idea was previously exploited in the contrastive divergence algorithm [4]. We take the idea a step further here by training a weak learner to discriminate between the positive examples from the original data and the negative examples generated by sampling from the current density estimate. This new weak learner minimizes a simple additive logistic loss function [2].

Our algorithm obtains an important advantage over sampling-based, unsupervised methods that learn features in parallel. Parallel-update methods require a new sample after each iteration of parameter changes, in order to reflect the current model's estimate of the data density. We improve on this by using one sample per boosting round, to fit one weak learner.
The justification for this approach comes from the proposal that, for stagewise additive models, boosting can be considered as gradient-descent in function space, so the new learner can simply optimize its inner product with the gradient of the objective in function space [3].

Unlike other attempts at "unsupervised boosting" [9], where at each round a new component distribution is added to a mixture model, our approach will add features in the log-domain and as such learns a product model.

Our algorithm incrementally constructs random fields from examples. As such, it bears some relation to maximum entropy models, which are popular in natural language processing [8]. In these applications, the features are typically not learned; instead the algorithms greedily select at each round the most informative feature from a large set of pre-enumerated features.

2 The Model

Let the input, or state, $\mathbf{x}$ be a vector of $D$ random variables taking values in some finite domain. The probability of $\mathbf{x}$ is defined by assigning it an energy, $E(\mathbf{x})$, which is converted into a probability using the Boltzmann distribution,

  $p(\mathbf{x}) = \frac{1}{Z}\, e^{-E(\mathbf{x})}, \qquad Z = \sum_{\mathbf{x}} e^{-E(\mathbf{x})}.$   (1)

We furthermore assume that the energy is additive. More explicitly, it will be modelled as a weighted sum of features,

  $E(\mathbf{x}) = \sum_i \alpha_i\, f_i(\mathbf{x}; \theta_i),$   (2)

where the $\alpha_i$ are the weights and each feature $f_i$ may depend on its own set of parameters $\theta_i$.

The model described above is very similar to an "additive random field", otherwise known as a "maximum entropy model". The key difference is that we allow each feature to be flexible through its dependence on the parameters $\theta_i$.

Learning in random fields may proceed by performing gradient ascent on the log-likelihood:

  $\frac{\partial L}{\partial \lambda} = -\left\langle \frac{\partial E(\mathbf{x})}{\partial \lambda} \right\rangle_{\tilde{p}} + \left\langle \frac{\partial E(\mathbf{x})}{\partial \lambda} \right\rangle_{p},$   (3)

where the first average is over the data distribution $\tilde{p}$, the second is over the model distribution $p(\mathbf{x})$, and $\lambda$ is some arbitrary parameter that we want to learn. This equation makes explicit the main philosophy behind learning in random fields: the energy of states "occupied" by data is lowered (weighted by $\tilde{p}$) while the energy of all states is raised (weighted by $p(\mathbf{x})$). Since there are usually an exponential number of states in the system, the second term is often approximated by a sample from $p(\mathbf{x})$. To reduce sampling noise a relatively large sample is necessary and moreover, it must be drawn each time we compute gradients. These considerations make learning in random fields generally very inefficient.

Iterative scaling methods have been developed for models that do not include adaptive feature parameters $\theta_i$ but instead train only the coefficients $\alpha_i$ [8].
These methods make more efficient use of the samples than gradient ascent, but they only minimize a loose bound on the cost function and their terminal convergence can be slow.

3 An Algorithm for Self Supervised Boosting

Boosting algorithms typically implement three phases: a feature (or weak learner) is trained, the relative weight of this feature with respect to the other features already in the pool is determined, and finally the data vectors are reweighted. In the following we will discuss a similar strategy in an unsupervised setting.

3.1 Finding New Features

In [7], boosting is reinterpreted as functional gradient descent on a loss function. Using the log-likelihood as a negative loss function this idea can be used to find features for additive random field models. Consider a change in the energy by adding an infinitesimal multiple of a feature,

  $E(\mathbf{x}) \rightarrow E(\mathbf{x}) + \epsilon f(\mathbf{x}).$   (4)

The optimal feature is then the one that provides the maximal increase in log-likelihood, i.e. the feature that maximizes the derivative in Eqn. 5. Using Eqn. 3 with $\partial E / \partial \epsilon = f$ we obtain

  $\frac{\partial L}{\partial \epsilon}\Big|_{\epsilon=0} = -\frac{1}{N} \sum_{n} f(\mathbf{x}_n) + \left\langle f(\mathbf{x}) \right\rangle_{p},$   (5)

where $p(\mathbf{x})$ is our current estimate of the data distribution. In order to maximize this derivative, the feature should therefore be small at the data and large at all other states. It is however important to realize that the norm of the feature must be bounded, since otherwise the derivative can be made arbitrarily large by simply increasing the length of $f$.

Because the total number of possible states of a model is often exponentially large, the second term of Eqn. 5 must be approximated using samples $\tilde{\mathbf{x}}_m$ from $p(\mathbf{x})$,

  $C = -\frac{1}{N} \sum_{n} f(\mathbf{x}_n) + \frac{1}{M} \sum_{m} f(\tilde{\mathbf{x}}_m).$   (6)

These samples, or "negative examples", inform us about the states that are likely under the current model. Intuitively, because the model is imperfect, we would like to move its density estimate away from these samples and towards the actual data. By labelling the data with $y = -1$ and the negative examples with $y = +1$, we can map this to a supervised problem where a new feature is a classifier. Since a good classifier is negative at the data and positive at the negative examples (so we can use its sign to discriminate them), adding its output to the total energy will lower the energy at states where there are data and raise it at states where there are negative examples. The main difference with supervised boosting is that the negative examples change at every round.

3.2 Weighting the Data

It has been observed [6] that boosting algorithms can outperform classification algorithms that maximize log-likelihood. This has motivated us to use the logistic loss function from the boosting literature for training new features,

  $\text{Loss} = \sum_i \log\left(1 + e^{-y_i E(\mathbf{x}_i)}\right),$   (7)

where $i$ runs over data ($y_i = -1$) and negative examples ($y_i = +1$).
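To make the classification view concrete, here is a small Python sketch (entirely our own illustration; the pool of bounded tanh candidates stands in for a trained weak learner, which the paper instead fits by gradient ascent). It scores candidate features by the cost of Eqn. 6 and evaluates the logistic loss of Eqn. 7 under the sign convention above:

```python
import numpy as np

def cost_C(f, x_data, x_neg):
    """Eqn. 6: C = -(1/N) sum_n f(x_n) + (1/M) sum_m f(x~_m).
    A good new feature is low at the data and high at the negative examples."""
    return -f(x_data).mean() + f(x_neg).mean()

def logistic_loss(E_data, E_neg):
    """Eqn. 7 with labels y=-1 for data, y=+1 for negative examples:
    data terms log(1 + exp(E)), negative-example terms log(1 + exp(-E))."""
    return np.logaddexp(0.0, E_data).sum() + np.logaddexp(0.0, -E_neg).sum()

# Toy 1-D setup: data on the left, negative examples on the right.
rng = np.random.default_rng(0)
x_data = rng.standard_normal(200) - 2.0
x_neg = rng.standard_normal(200) + 2.0
pool = [lambda x, t=t: np.tanh(x - t) for t in np.linspace(-4, 4, 17)]
f = max(pool, key=lambda g: cost_C(g, x_data, x_neg))

# The chosen feature is negative at the data and positive at the negatives,
# so adding a small multiple of it to the energy decreases the loss.
assert f(x_data).mean() < 0 < f(x_neg).mean()
E0, eps = np.zeros(200), 0.1
assert logistic_loss(E0 + eps * f(x_data), E0 + eps * f(x_neg)) < logistic_loss(E0, E0)
```

The bounded range of tanh plays the role of the norm constraint discussed above: without it, the cost could be inflated arbitrarily by scaling the feature.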
Perturbing the energy in this loss function by adding an infinitesimal multiple of a new feature (i.e. $E(\mathbf{x}) \rightarrow E(\mathbf{x}) + \epsilon f(\mathbf{x})$) and computing the derivative w.r.t. $\epsilon$ we derive the following cost function for adding a new feature,

  $C_w = -\sum_n \beta_n f(\mathbf{x}_n) + \sum_m \beta_m f(\tilde{\mathbf{x}}_m), \qquad \beta_n = \sigma(E(\mathbf{x}_n)), \quad \beta_m = \sigma(-E(\tilde{\mathbf{x}}_m)),$   (8)

where $\sigma(\cdot)$ is the logistic sigmoid. The main difference with Eqn. 6 is the weights $\beta$ on data and negative examples, which give poorly "classified" examples (data with very high energy and negative examples with very low energy) a stronger vote in changes to the energy surface. The extra weights (which are bounded between [0,1]) will incur a certain bias w.r.t. the maximum likelihood solution. However, it is expected that the extra effort on "hard cases" will cause the algorithm to converge faster to good density models.

It is important to realize that the loss function Eqn. 7 is a valid cost function only when the negative examples are fixed. The reason is that after a change of the energy surface, the negative examples are no longer a representative sample from the Boltzmann distribution in Eqn. 1. However, as long as we re-sample the negative examples after every change in the energy we may use Eqn. 8 as an objective to decide what feature to add to the energy, i.e. we may consider it as the derivative of some (possibly unknown) weighted log-likelihood $L_w$ w.r.t. $\epsilon$.

By analogy, we can interpret $\sigma(-E(\mathbf{x}))$ as the probability that a certain state $\mathbf{x}$ is occupied by a data-vector and consequently $m_i = y_i E(\mathbf{x}_i)$ as the "margin", so that $\beta_i = \sigma(-m_i)$. Note that the introduction of the weights has given meaning to the "height" of the energy surface, in contrast to the Boltzmann distribution for which only relative energy differences count. In fact, as we will further explain in the next section, the height of the energy will be chosen such that the total weight on data is equal to the total weight on the negative examples.

3.3 Adding the New Feature to the Pool

According to the functional gradient interpretation, the new feature computed as described above represents the infinitesimal change in energy that maximally increases the (weighted) log-likelihood. Consistent with that interpretation we will determine the coefficient $\alpha$ via a line search in the direction of this "gradient". In fact, we will propose a slightly more general change in energy given by,

  $E(\mathbf{x}) \rightarrow E(\mathbf{x}) + \alpha f(\mathbf{x}) + b.$   (9)

As mentioned in the previous section, the constant $b$ will have no effect on the Boltzmann distribution in Eqn. 1. However, it does influence the relative total weight on data versus negative examples. Using the interpretation of the weights $\beta$ in Eqn. 8, it is not hard to see that the derivatives of the weighted log-likelihood $L_w$ w.r.t. $\alpha$ and $b$ are given by,
  $\frac{\partial L_w}{\partial \alpha} = -\sum_n \beta_n f(\mathbf{x}_n) + \sum_m \beta_m f(\tilde{\mathbf{x}}_m),$   (10)

  $\frac{\partial L_w}{\partial b} = -\sum_n \beta_n + \sum_m \beta_m.$   (11)

Therefore, at a stationary point of $b$ the total weight on data and negative examples precisely balances out. Since $f$ is independent of $\alpha$ and $b$, it is easy to compute the second derivatives as well, and we can do Newton updates to compute the stationary point.

When iteratively updating $\alpha$ (and $b$) we not only change the weights $\beta$ but also the Boltzmann distribution, which makes the negative examples no longer representative of the current estimated data distribution. To correct for this we include importance weights $w_m$ on the negative examples that are all initialized at $1/M$. It is very easy to update these weights from iteration to iteration, by multiplying each $w_m$ by the change in the unnormalized model density at $\tilde{\mathbf{x}}_m$ and renormalizing. It is well known that in high dimensions the effective sample size of the weighted sample can rapidly become too small to be useful. We therefore monitor the effective sample size, given by $(\sum_m w_m)^2 / \sum_m w_m^2$, where the sum runs over the negative examples only. If it drops below a threshold we have two choices. We can obtain a new set of negative examples from the updated Boltzmann distribution, reset the importance weights to $1/M$, and resume fitting $\alpha$ at its current value. Alternatively, we simply accept the current value of $\alpha$ and proceed to the next round of boosting. Because we initialize $\alpha = 0$ in the fitting procedure, the latter approach underestimates the importance of this particular feature, which is not a problem since a similar feature can be added in the next round.

Figure 1: (a -- left). Training error (lower curves) and test error (higher curves), in % classification error versus boosting round, for the weighted boosting algorithm (solid curves) and the un-weighted algorithm (dashed curves). (b -- right). Features $\mathbf{w}_i$ found by the learning algorithm.

4 A Binary Example: The Generalized RBM

We propose a simple extension of the "restricted Boltzmann machine" (RBM) with (+1,-1)-units [1] as a model for binary data. Each feature is parametrized by weights $\mathbf{w}_i$ and a bias $b_i$:

  $E(\mathbf{x}) = -\sum_i \alpha_i \log\left(e^{\,\mathbf{w}_i^T \mathbf{x} + b_i} + e^{-(\mathbf{w}_i^T \mathbf{x} + b_i)}\right),$   (12)

where the RBM is obtained by setting all $\alpha_i = 1$.
One can sample from the summed energy model using straightforward Gibbs sampling, where every visible unit is sampled given all the others. Alternatively, one can design a much faster mixing Markov chain by introducing hidden variables and sampling all hidden units independently given the visible units and vice versa. Unfortunately, by including the coefficients $\alpha_i$ this trick is no longer valid. But an approximate Markov chain can be used,

  $p(h_i = \pm 1 \mid \mathbf{x}) = \sigma\left(\pm 2\,\alpha_i (\mathbf{w}_i^T \mathbf{x} + b_i)\right), \qquad p(x_j = \pm 1 \mid \mathbf{h}) = \sigma\left(\pm 2 \sum_i \alpha_i h_i w_{ij}\right).$   (13)

This approximate Gibbs sampling thus involves sampling from an RBM with scaled weights and biases,

  $\tilde{\mathbf{w}}_i = \alpha_i \mathbf{w}_i, \qquad \tilde{b}_i = \alpha_i b_i.$   (14)

When using the above Markov chain, we will not wait until it has reached equilibrium but initialize it at the data-vectors and use it for a fixed number of steps, as is done in contrastive divergence learning [4].

When we fit a new feature we need to make sure its norm is controlled. The appropriate value depends on the number of dimensions in the problem; in the experiment described below we bounded the norm of the vector $(\mathbf{w}_i, b_i)$ to be no larger than a fixed constant. The updates for the feature parameters follow the gradient of the weighted cost in Eqn. 8,

  $\Delta(\mathbf{w}_i, b_i) \propto \sum_n \beta_n \tanh(\mathbf{w}_i^T \mathbf{x}_n + b_i)\,(\mathbf{x}_n, 1) - \sum_m \beta_m \tanh(\mathbf{w}_i^T \tilde{\mathbf{x}}_m + b_i)\,(\tilde{\mathbf{x}}_m, 1),$   (15)

where the weights $\beta$ on the negative examples also incorporate their importance weights, followed by a projection that keeps the norm of $(\mathbf{w}_i, b_i)$ bounded. The coefficients $\alpha_i$ are determined using the procedure of Section 3.3.

To test whether we can learn good models of (fairly) high-dimensional, real-world data, we used the real-valued digits from the "br" set on the CEDAR cdrom #1, and learned completely separate models on binarized "2"s and "3"s. The first part of the data cases of each class was used for training while the remaining digits of each class were used for testing. A fresh set of negative examples was generated to fit each new feature, and a minimum effective sample size was imposed when fitting the coefficients $\alpha_i$. After a new feature was added, the total energies of all "2"s and "3"s were computed under both models. The energies of the training data (under both models) were used as two-dimensional features to compute a separation boundary using logistic regression, which was subsequently applied to the test data to compute the total misclassification. In Figure 1a we show the total error on both training data and test data as a function of the number of features in the model. For comparison we also plot the training and test error for the un-weighted version of the algorithm. The classification error of the weighted algorithm drops to a low value within the early rounds of boosting and only very gradually increases thereafter; this compares well with logistic regression, k-nearest neighbors (with the optimal number of neighbors), and a parallel-trained RBM. The un-weighted learning algorithm converges much more slowly to a good solution, both on training and test data. In Figure 1b we show a subset of the features $\mathbf{w}_i$ found by the learning algorithm for both digits.

5 A Continuous Example: The Dimples Model

For continuous data we propose a different form of feature, which we term a dimple because of its shape in the energy domain. A dimple is a mixture of a narrow Gaussian and a broad Gaussian, with a common mean:

  $f(\mathbf{x}) = -\log\left[\tfrac{1}{2}\,\mathcal{N}(\mathbf{x};\, \boldsymbol{\mu}, \sigma_{\text{nar}}^2) + \tfrac{1}{2}\,\mathcal{N}(\mathbf{x};\, \boldsymbol{\mu}, \sigma_{\text{brd}}^2)\right],$   (16)

where the mixing proportion is constant and equal, and $\sigma_{\text{brd}}^2$ is fixed and large. Each round of the algorithm fits $\boldsymbol{\mu}$ and $\sigma_{\text{nar}}^2$ for a new learner. A nice property of dimples is that they can reduce the entropy of an existing distribution by placing the dimple in a region that already has low energy, but they can also raise the entropy by putting the dimple in a high energy region [5].

Sampling is again simple if all $\alpha_i = 1$, since in that case we can use a Gibbs chain which first picks a narrow or broad Gaussian for every feature given the visible variables and then samples the visible variables from the resulting multivariate Gaussian. For general $\alpha_i$ the situation is less tractable, but we use a similar approximation as for the generalized RBM,

  $\left[\tfrac{1}{2}\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \sigma_{\text{nar}}^2) + \tfrac{1}{2}\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \sigma_{\text{brd}}^2)\right]^{\alpha} \approx \tfrac{1}{2}\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \sigma_{\text{nar}}^2)^{\alpha} + \tfrac{1}{2}\mathcal{N}(\mathbf{x}; \boldsymbol{\mu}, \sigma_{\text{brd}}^2)^{\alpha}.$   (17)

This approximation will be accurate when one Gaussian is dominating the other, i.e., when the responsibilities are close to zero and one.
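The dimple feature of Eqn. 16 can be sketched numerically as follows (our own illustrative code; the broad variance and all constant values are assumptions). The energy has its minimum at the common mean and rises toward the broad-Gaussian plateau away from it:

```python
import numpy as np

def dimple_energy(x, mu, var_narrow, var_broad=25.0):
    """f(x) = -log(0.5*N(x; mu, var_narrow) + 0.5*N(x; mu, var_broad)),
    evaluated per dimension and summed (a sketch of Eqn. 16)."""
    def log_gauss(x, mu, var):
        return -0.5 * np.log(2 * np.pi * var) - 0.5 * (x - mu) ** 2 / var
    ln = log_gauss(x, mu, var_narrow)
    lb = log_gauss(x, mu, var_broad)
    # log(0.5*exp(ln) + 0.5*exp(lb)), computed stably with logaddexp
    return -(np.logaddexp(ln, lb) - np.log(2.0)).sum()

mu = np.zeros(2)
E_center = dimple_energy(np.zeros(2), mu, var_narrow=0.25)
E_far = dimple_energy(np.array([10.0, 10.0]), mu, var_narrow=0.25)
assert E_center < E_far    # the "dimple": lowest energy at the common mean
```

Far from the mean the narrow component's responsibility is essentially zero, which is the regime in which the approximation of Eqn. 17 is accurate.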
Such near-binary responsibilities are expected in high-dimensional applications. In the low-dimensional example discussed below we implemented a simple MCMC chain with an isotropic, normal proposal density, which was initiated at the data-points and run for a fixed number of steps.
Figure 2: (a). Plot of iso-energy contours of the dimples model after several rounds of boosting. The crosses represent the data and the dots the negative examples generated from the model. (b). Three-dimensional plot of the negative energy surface. (c). Contour plot for a mixture of Gaussians learned using EM. (d). Negative energy surface for the mixture of Gaussians.

The type of dimple we used in the experiment below can adapt a common mean $\boldsymbol{\mu}$ and the inverse-variance of the narrow (small) Gaussian, $\tau = 1/\sigma_{\text{nar}}^2$, in each dimension separately. The update rules are given by,

  $\Delta\mu \propto \sum_n \beta_n \left[\frac{\rho_{\text{nar}}(x_n)}{\sigma_{\text{nar}}^2} + \frac{\rho_{\text{brd}}(x_n)}{\sigma_{\text{brd}}^2}\right](x_n - \mu) \;-\; \sum_m \beta_m \left[\frac{\rho_{\text{nar}}(\tilde{x}_m)}{\sigma_{\text{nar}}^2} + \frac{\rho_{\text{brd}}(\tilde{x}_m)}{\sigma_{\text{brd}}^2}\right](\tilde{x}_m - \mu),$   (18)

  $\Delta\tau \propto \sum_n \beta_n\, \rho_{\text{nar}}(x_n) \left[\frac{1}{2\tau} - \frac{(x_n - \mu)^2}{2}\right] \;-\; \sum_m \beta_m\, \rho_{\text{nar}}(\tilde{x}_m) \left[\frac{1}{2\tau} - \frac{(\tilde{x}_m - \mu)^2}{2}\right],$   (19)

where $\rho_{\text{nar}}$ and $\rho_{\text{brd}}$ are the responsibilities for the narrow and broad Gaussian respectively, the weights $\beta$ are given by Eqn. 8, and the updates apply per dimension. Finally, the combination coefficients $\alpha_i$ are computed as described in Section 3.3.

To illustrate the proposed algorithm we fit the dimples model to the two-dimensional data (crosses) shown in Figure 2a. The data were synthetically generated by defining angles and radii in terms of a uniform variable with additive standard-normal noise, which were converted to Euclidean coordinates and mirrored and translated to produce the spirals. The first feature is an isotropic Gaussian with the mean and the variance of the data, while later features were dimples trained in the way described above. Figure 2a also shows the contours of equal energy after several rounds of boosting together with examples (dots) from the model. A 3-dimensional plot of the negative energy surface is shown in Figure 2b. For comparison, similar plots for a mixture of Gaussians, trained in parallel with EM, are depicted in Figures 2c and 2d.

The main qualitative difference between the fits in Figures 2a-b (product of dimples) and 2c-d (mixture of Gaussians) is that the first seems to produce smoother energy surfaces, only creating structure where there is structure in the data. This can be understood by recalling that the role of the negative examples is precisely to remove "dips" in the energy surface where there is no data. The philosophy of avoiding structure in the model that is not dictated by the data is consistent with the ideas behind maximum entropy modelling [11] and is thought to improve generalization.

6 Discussion

This paper discusses a boosting approach to density estimation, which we formulate as a sequential approach to training additive random field models. The philosophy is to view unsupervised learning as a sequence of classification problems where the aim is to discriminate between data-vectors and negative examples generated from the current model.
The sampling step is usually the most time consuming operation, but it is also unavoidable since it informs the algorithm of the states whose energy is too low. The proposed algorithm uses just one sample of negative examples to fit a new feature, which is very economical as compared to most non-sequential algorithms, which must generate an entire new sample for every gradient update.

There are many interesting issues and variations that we have not addressed in this paper. What is the effect of using approximate, e.g. variational, distributions for $p(\mathbf{x})$? Can we improve the accuracy of the model by fitting the feature parameters and the coefficients together? Does re-sampling the negative examples more frequently during learning improve the final model? What is the effect of using different functions to weight the data, and how do the weighting schemes interact with the dimensionality of the problem?

References

[1] Y. Freund and D. Haussler. Unsupervised learning of distributions of binary vectors using 2-layer networks. In Advances in Neural Information Processing Systems, volume 4, pages 912-919, 1992.

[2] J. Friedman, T. Hastie, and R. Tibshirani. Additive logistic regression: A statistical view of boosting. Technical report, Dept. of Statistics, Stanford University, 1998.

[3] J.H. Friedman. Greedy function approximation: A gradient boosting machine. Technical report, Dept. of Statistics, Stanford University, 1999.

[4] G.E. Hinton. Training products of experts by minimizing contrastive divergence. Neural Computation, 14:1771-1800, 2002.

[5] G.E. Hinton and A. Brown. Spiking Boltzmann machines. In Advances in Neural Information Processing Systems, volume 12, 2000.

[6] G. Lebanon and J. Lafferty.
Boosting and maximum likelihood for exponential models. In Advances in Neural Information Processing Systems, volume 14, 2002.

[7] L. Mason, J. Baxter, P. Bartlett, and M. Frean. Boosting algorithms as gradient descent. In Advances in Neural Information Processing Systems, volume 12, 2000.

[8] S. Della Pietra, V.J. Della Pietra, and J.D. Lafferty. Inducing features of random fields. IEEE Transactions on Pattern Analysis and Machine Intelligence, 19(4):380-393, 1997.

[9] S. Rosset and E. Segal. Boosting density estimation. In Advances in Neural Information Processing Systems, volume 15 (this volume), 2002.

[10] R.E. Schapire and Y. Singer. Improved boosting algorithms using confidence-rated predictions. In Computational Learning Theory, pages 80-91, 1998.

[11] S.C. Zhu, Z.N. Wu, and D. Mumford. Minimax entropy principle and its application to texture modeling. Neural Computation, 9(8):1627-1660, 1997.
", "award": [], "sourceid": 2275, "authors": [{"given_name": "Max", "family_name": "Welling", "institution": null}, {"given_name": "Richard", "family_name": "Zemel", "institution": null}, {"given_name": "Geoffrey", "family_name": "Hinton", "institution": null}]}