{"title": "Unsupervised Classifiers, Mutual Information and 'Phantom Targets", "book": "Advances in Neural Information Processing Systems", "page_first": 1096, "page_last": 1101, "abstract": null, "full_text": "Unsupervised Classifiers, Mutual Information \n\nand 'Phantom Targets' \n\nJohn s. Bridle \nAnthony J .R. Heading \nDefence Research Agency \nSt. Andrew's Road, Malvern \n\"\"orcs. \"\\VR14 3PS, U.K. \n\nAbstract \n\nDavid J.e. MacKay \n\nCalifornia Institute of Technology 139-74 \n\nPasadena CA 91125 U.S.A \n\nWe derive criteria for training adaptive classifier networks to perform unsu(cid:173)\npervised data analysis. The first criterion turns a simple Gaussian classifier \ninto a simple Gaussian mixture analyser. The second criterion, which is \nmuch more generally applicable, is based on mutual information. It simpli(cid:173)\nfies to an intuitively reasonable difference between two entropy functions, \none encouraging 'decisiveness,' the other 'fairness' to the alternat.ive in(cid:173)\nterpretations of the input. This 'firm but fair' criterion can be applied \nto any network that produces probability-type outputs, but it does not \nnecessarily lead to useful behavior. \n\n1 Unsupervised Classification \n\nOne of the main distinctions made in discussing neural network architectures, and \npattern analysis algorithms generally, is between supervised and unsupervised data \nanalysis. We should therefore be interested in any method of building bridges \nbetween techniques in these two categories. For instance, it is possible to use an \nunsupervised system such as a Boltzmann machine to learn the joint distribution of \ninputs and a teacher's classificat.ion labels. The particular type of bridge we seek is a \nmethod of taking a supervised pattern classifier and turning it into an unsupervised \ndata analyser. That is, we are interested in methods of \"bootstrapping\" classifiers. \n\nConsider a classifier system. 
Its input is a vector x, and the output is a probability vector y(x). (That is, the elements of y are positive and sum to 1.) The elements of y, (y_i(x), i = 1 ... N_c), are to be taken as the probabilities that x should be assigned to each of N_c classes. (Note that our definition of classifier does not include a decision process.) \n\nTo enforce the conditions we require for the output values, we recommend using a generalised logistic (normalised exponential, or SoftMax) output stage. We call the unnormalised log probabilities of the classes a_i, and the softmax performs: \n\ny_i = e^{a_i}/Z with Z = \\sum_j e^{a_j} (1) \n\nNormally the parameters of such a system would be adjusted using a training set comprising examples of inputs and corresponding classes, {(x_i, c_i)}. We assume that the system includes means to convert derivatives of a training criterion with respect to the outputs into a form suitable for adjusting the values of the parameters, for instance by \"backpropagation\". \n\nImagine however that we have unlabelled data, x_m, m = 1 ... N_ts, and wish to use it to 'improve' the classifier. We could think of this as self-supervised learning, to hone an already good system on lots of easily-obtained unlabelled real-world data, or to adapt to a slowly changing environment, or as a way of turning a classifier into some sort of cluster analyser. (Just what kind depends on details of the classifier itself.) The ideal method would be theoretically well-founded, general-purpose (independent of the details of the classifier), and computationally tractable. 
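As a concrete aside, the SoftMax stage of equation (1) can be sketched in a few lines of plain Python (a minimal sketch; the function name is ours, and the subtraction of max(a) is a standard numerical safeguard that leaves the outputs unchanged because the shift cancels between numerator and Z):

```python
import math

def softmax(a):
    # Generalised logistic (SoftMax): y_i = exp(a_i) / Z, Z = sum_j exp(a_j).
    # Subtracting max(a) rescales numerator and denominator identically,
    # so the outputs are unchanged but large activations cannot overflow.
    m = max(a)
    exps = [math.exp(ai - m) for ai in a]
    z = sum(exps)
    return [e / z for e in exps]
```

By construction the outputs are positive and sum to 1, as required of the probability vector y(x).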
\n\nOne well known approach to unsupervised data analysis is to minimise a reconstruction error: for linear projections and squared Euclidean distance this leads to principal components analysis, while reference-point based classifiers lead to vector quantizer design methods, such as the LBG algorithm. Variants on VQ, such as Kohonen's feature maps, can be motivated by requiring robustness to distortions in the code space. Reconstruction error is only available as a training criterion if reconstruction is defined: in general we are only given class label probabilities. \n\n2 A Data Likelihood Criterion \n\nFor the special case of a Gaussian clustering of an unlabelled data set, it was demonstrated in [1] that gradient ascent on the likelihood of the data has an appealing interpretation in terms of backpropagation in an equivalent unit-Gaussian classifier network: for each input x presented to the network, the output y is doubled to give 'phantom targets' t = 2y; when the derivatives of the log likelihood criterion J = -\\sum_i t_i \\log y_i relative to these targets are propagated back through the network, it turns out that the resulting gradient is identical to the gradient of the likelihood of the data given a Gaussian mixture model. \n\nFor the unit-Gaussian classifier, the activations a_i in (1) are \n\na_i = -|x - w_i|^2 (2) \n\nso the outputs of the network are \n\ny_i = P(class = i | x, w) (3) \n\nwhere we assume the inputs are drawn from equi-probable unit-Gaussian distributions with the mean of the distribution of the ith class equal to w_i. \n\nThis result was only derived in a limited context, and it was speculated that it might be generalisable to arbitrary classification models. The above phantom target rule has been re-derived for a larger class of networks [4], but the conditions for strict applicability are quite severe. 
Briefly, there should be exponential density functions for each class, and the normalizing factors for these densities should be independent of the parameters. Thus Gaussians with fixed covariance matrices are acceptable, but variable covariances are not, and neither are linear transformations preceding the Gaussians. \n\nThe next section introduces a new objective function which is independent of details of the classifier. \n\n3 Mutual Information Criterion \n\nIntuitively, an unsupervised adaptive classifier is doing a plausible job if its outputs usually give a fairly clear indication of the class of an input vector, and if there is also an even distribution of input patterns between the classes. We could label these desiderata 'decisive' and 'fair' respectively. Note that it is trivial to achieve either of them alone. For a poorly regularised model it may also be trivial to achieve both. \n\nThere are several ways to proceed. We could devise ad-hoc measures corresponding to our notions of decisiveness and fairness, or we could consider particular types of classifier and their unsupervised equivalents, seeking a general way of turning one into the other. Our approach is to return to the general idea that the class predictions should retain as much information about the input values as possible. We use a measure of the information about x which is conveyed by the output distribution, i.e. the mutual information between the inputs and the outputs. We interpret the outputs y as a probability distribution over a discrete random variable c (the class label), thus y = p(c|x). The mutual information between x and c is \n\nI(c; x) = \\int \\int dc dx p(c, x) \\log [p(c, x) / (p(c) p(x))] (4) \n= \\int dx p(x) \\int dc p(c|x) \\log [p(c|x) / p(c)] (5) \n= \\int dx p(x) \\int dc p(c|x) \\log [p(c|x) / \\int dx p(x) p(c|x)] (6) \n\nThe elements of this expression are separately recognizable: \\int dx p(x) (.) is equivalent to an average over the training set, (1/N_ts) \\sum_ts (.); p(c|x) is simply the network output y_c; \\int dc (.) is a sum over the class labels and corresponding network outputs. Hence: \n\nI(c; x) = (1/N_ts) \\sum_ts \\sum_{i=1}^{N_c} y_i \\log (y_i / \\bar{y}_i) (7) \n= -\\sum_{i=1}^{N_c} \\bar{y}_i \\log \\bar{y}_i + (1/N_ts) \\sum_ts \\sum_{i=1}^{N_c} y_i \\log y_i (8) \n= H(\\bar{y}) - \\bar{H}(y) (9) \n\nwhere \\bar{y}_i = (1/N_ts) \\sum_ts y_i is the average of the ith output over the training set. \n\nThe objective function I is the difference between the entropy of the average of the outputs, and the average of the entropy of the outputs, where both averages are over the training set. H(\\bar{y}) has its maximum value when the average activities of the separate outputs are equal - this is 'fairness'. \\bar{H}(y) has its minimum value when one output is full on and the rest are off for every training case - this is 'firmness'. \n\nWe now evaluate I for the training set and take the gradient of I. \n\n4 Gradient descent \n\nTo use this criterion with back-propagation network training, we need its derivatives with respect to the network outputs: \n\n\\partial I(c; x) / \\partial y_i = \\partial / \\partial y_i [ -\\sum_{j=1}^{N_c} \\bar{y}_j \\log \\bar{y}_j + (1/N_ts) \\sum_ts \\sum_{j=1}^{N_c} y_j \\log y_j ] (10) \n= (1/N_ts) [ (\\log y_i + 1) - (\\log \\bar{y}_i + 1) ] (11) \n= (1/N_ts) \\log (y_i / \\bar{y}_i) (12) \n\nThe resulting expression is quite simple, but note that the presence of a \\bar{y}_i term means that two passes through the training set are required: the first to calculate the average output node activations, and the second to back-propagate the derivatives. \n\n5 Illustrations \n\nFigure 1 shows I (divided by its maximum possible value, \\log N_c) for a run of a particular unit-Gaussian classifier network. The 30 data points are drawn from a 2-d isotropic Gaussian. Figure 2 shows the fairness and firmness criteria separately. (The upper curve is 'fairness', H(\\bar{y}) / \\log N_c, and the lower curve is 'firmness', 1 - \\bar{H}(y) / \\log N_c.) \n\nThe ten reference points had starting values drawn from the same distribution as the data. Figure 3 shows their movement during training. 
From initial positions within the data cluster, they move outwards into a circle around the data. The resulting classification regions are shown in Figure 4. (The grey level is proportional to the value of the maximum response at each point, and since the outputs are positive and normalised this value drops to 0.5 or less at the decision boundaries.) We observe that the space is being partitioned into regions with roughly equal numbers of points. It might be surprising at first that the reference points do not end up near the data. However, it is only the transformation from data x to outputs y that is being trained, and the reference points are just parameters of that transformation. As the reference points move further away from one another the decision boundaries grow firmer. In this example the fairness criterion happens to decrease in favour of the firmness, and this usually happens. We could consider different weightings of the two components of the criterion. \n\n[Figures: 1. The M.I. criterion; 2. Firm and fair separately; 3. Tracks of reference points; 4. Decision regions.] \n\n6 Comments \n\nThe usefulness of this objective function will depend very much on the form of classifier that it is applied to. For a poorly regularised classifier, maximisation of the criterion alone will not necessarily lead to good solutions to unsupervised classification; it could be maximised by any implausible classification of the input that is completely hard (i.e. the output vector always has one 1 and all the other outputs 0), and that 
chops the training set into regions containing similar numbers of training points; such a solution would be one of many global maxima, regardless of whether it chopped the data into natural classes. \n\nThe meaning of a 'natural' partition in this context is, of course, rather ill-defined. Simple models often do not have the capacity to break a pattern space into highly contorted regions - the decision boundaries shown in Figure 4 are an example of a model producing a reasonable result as a consequence of its inherent simplicity. When we use more complex models, however, we must ensure that we find simpler solutions in preference to more complex ones. Thus this criterion encourages us to pursue objective techniques for regularising classification networks [2, 3]; such techniques are probably long overdue. \n\nCopyright \u00a9 Controller HMSO London 1992 \n\nReferences \n\n[1] J.S. Bridle (1988). The phantom target cluster network: a peculiar relative of (unsupervised) maximum likelihood stochastic modelling and (supervised) error backpropagation. RSRE Research Note SP4: 66, DRA Malvern, UK. \n\n[2] D.J.C. MacKay (1991). Bayesian interpolation. Submitted to Neural Computation. \n\n[3] D.J.C. MacKay (1991). A practical Bayesian framework for backprop networks. Submitted to Neural Computation. \n\n[4] J.S. Bridle and S.J. Cox (1991). RecNorm: simultaneous normalisation and classification applied to speech recognition. In Advances in Neural Information Processing Systems 3. Morgan Kaufmann. \n\n[5] J.S. Bridle (1990). Training stochastic model recognition algorithms as networks can lead to maximum mutual information estimation of parameters. In Advances in Neural Information Processing Systems 2. Morgan Kaufmann. 
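For concreteness, the 'firm but fair' criterion of equation (9) and its derivative (12) can be sketched in plain Python (a minimal sketch with invented names; `outputs` is assumed to hold one SoftMax output vector per training case, so every entry is strictly positive):

```python
import math

def firm_but_fair(outputs):
    # I = H(ybar) - mean_m H(y^(m)): entropy of the average output
    # ('fairness') minus the average output entropy ('firmness').
    n = len(outputs)
    nc = len(outputs[0])
    # First pass over the training set: average output activations ybar_i.
    ybar = [sum(y[i] for y in outputs) / n for i in range(nc)]
    h_of_mean = -sum(p * math.log(p) for p in ybar)              # fairness term
    mean_of_h = -sum(yi * math.log(yi)
                     for y in outputs for yi in y) / n           # firmness term
    # Second pass: dI/dy_i^(m) = (1/N_ts) log(y_i^(m) / ybar_i), as in (12),
    # ready to be back-propagated through the network.
    grads = [[math.log(y[i] / ybar[i]) / n for i in range(nc)]
             for y in outputs]
    return h_of_mean - mean_of_h, grads
```

A uniform set of outputs gives I = 0 and zero gradients; decisive outputs spread evenly over the classes drive I towards its maximum, log N_c.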
\n", "award": [], "sourceid": 440, "authors": [{"given_name": "John", "family_name": "Bridle", "institution": null}, {"given_name": "Anthony", "family_name": "Heading", "institution": null}, {"given_name": "David", "family_name": "MacKay", "institution": null}]}