{"title": "A Weighted Probabilistic Neural Network", "book": "Advances in Neural Information Processing Systems", "page_first": 1110, "page_last": 1117, "abstract": null, "full_text": "A Weighted Probabilistic Neural Network \n\nDavid Montana \n\nBolt Beranek and Newman Inc. \n\n10 Moulton Street \n\nCambridge, MA 02138 \n\nAbstract \n\nThe Probabilistic Neural Network (PNN) algorithm represents the likeli(cid:173)\nhood function of a given class as the sum of identical, isotropic Gaussians. \nIn practice, PNN is often an excellent pattern classifier, outperforming \nother classifiers including backpropagation. However, it. is not. robust with \nrespect to affine transformations of feature space, and this can lead to \npoor performance on certain data. We have derived an extension of PNN \ncalled Weighted PNN (WPNN) which compensates for this flaw by allow(cid:173)\ning anisotropic Gaussians, i.e. Gaussians whose covariance is not a mul(cid:173)\ntiple of the identity matrix. The covariance is optimized using a genetic \nalgorithm, some interesting features of which are its redundant, logarith(cid:173)\nmic encoding and large population size. Experimental results validate our \nclaims. \n\n1 \n\nINTRODUCTION \n\n1.1 PROBABILISTIC NEURAL NETWORKS (PNN) \n\nPNN (Specht 1990) is a pattern classification algorithm which falls into the broad \nclass of \"nearest-neighbor-like\" algorithms. It is called a \"neural network\" because \nof its natural mapping onto a two-layer feedforward network. It works as follows. \nLet the exemplars from class i be the k-vectors iT} for j = 1, ... , Ni. Then, the \nlikelihood function for class i is \n\n1 \n\nN, \n\nLi(i) = - - - -_ \"\"' e-(x-xj)2/ u \n\nNi(27r(j)k/2 ~ \n\n(1) \n\n1110 \n\n\fA Weighted Probabilistic Neural Network \n\n1111 \n\nclass B \n\nclass A \n\nclass B \n\n(a) \n\n(b) \n\nFigure 1: PNN is not robust with respect to affine transformations of feature space. 
Originally (a), A2 is closer to its classmate A1 than to B1; however, after a simple affine transformation (b), A2 is closer to B1. \n\nand the conditional probability for class i is \n\nP_i(x) = L_i(x) / \sum_{j=1}^{M} L_j(x)    (2) \n\nNote that the class likelihood functions are sums of identical isotropic Gaussians centered at the exemplars. \n\nThe single free parameter of this algorithm is σ, the variance of the Gaussians (the rest of the terms in the likelihood functions are determined directly from the training data). Hence, training a PNN consists of optimizing σ relative to some evaluation criterion, typically the number of classification errors during cross-validation (see Sections 2.1 and 3). Since the search space is one-dimensional, the search procedure is trivial and is often performed by hand. \n\n1.2 THE PROBLEM WITH PNN \n\nThe main drawback of PNN and other \"nearest-neighbor-like\" algorithms is that they are not robust with respect to affine transformations (i.e., transformations of the form x \mapsto Ax + b) of feature space. (Note that in theory affine transformations should not affect the performance of backpropagation, but the results of Section 3 show that this is not true in practice.) Figures 1 and 2 depict examples of how affine transformations of feature space affect classification performance. In Figures 1a and 2a, the point A2 is closer (using Euclidean distance) to point A1, which is also from class A, than to point B1, which is from class B. Hence, with a training set consisting of the exemplars A1 and B1, PNN would classify A2 correctly. Figures 1b and 2b depict the feature space after affine transformations. In both cases, A2 is closer to B1 than to A1 and would hence be classified incorrectly. 
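As a concrete illustration, the PNN classifier of Equations 1 and 2 can be sketched in a few lines (a minimal NumPy sketch; the two-class, two-dimensional data below is hypothetical):

```python
import numpy as np

def pnn_likelihood(x, exemplars, sigma):
    """Equation 1: average of identical isotropic Gaussians centered at the exemplars."""
    x = np.asarray(x, dtype=float)
    X = np.asarray(exemplars, dtype=float)
    k = X.shape[1]
    sq = np.sum((X - x) ** 2, axis=1)          # squared distances (x - x_j)^2
    norm = (2.0 * np.pi * sigma) ** (k / 2.0)  # (2*pi*sigma)^(k/2)
    return np.exp(-sq / sigma).sum() / (len(X) * norm)

def pnn_classify(x, classes, sigma):
    """Equation 2: normalize the class likelihoods and pick the largest."""
    L = np.array([pnn_likelihood(x, ex, sigma) for ex in classes])
    P = L / L.sum()                            # conditional probabilities P_i(x)
    return int(np.argmax(P)), P

# Two classes in a 2-d feature space (illustrative data).
class_A = [[0.0, 0.0], [1.0, 0.0]]
class_B = [[0.0, 3.0]]
label, probs = pnn_classify([0.9, 0.1], [class_A, class_B], sigma=0.5)
```

Because the metric inside the exponent is plain Euclidean distance, rescaling or rotating the features (an affine map) changes which exemplars dominate the sums, which is exactly the sensitivity discussed above.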
For the example of Figure 2, the transformation matrix A is not diagonal (i.e., the principal axes of the transformation are not the coordinate axes), and the adverse effects of this transformation cannot be undone by any affine transformation with diagonal A. \n\nThis problem has motivated us to generalize the PNN algorithm in such a way that it is robust with respect to affine transformations of the feature space. \n\nFigure 2: The principal axes of the affine transformation do not necessarily correspond with the coordinate axes. \n\n1.3 A SOLUTION: WEIGHTED PNN (WPNN) \n\nThis flaw of nearest-neighbor-like algorithms has been recognized before, and there have been a few proposed solutions. They all use what Dasarathy (1991) calls \"modified metrics\", which are non-Euclidean distance measures in feature space. All the approaches to modified metrics define criteria which the chosen metric should optimize. Some criteria allow explicit derivation of the new metrics (Short and Fukunaga 1981; Fukunaga and Flick 1984). However, the validity of these derivations relies on there being a very large number of exemplars in the training set. A more recent set of approaches (Atkeson 1991; Kelly and Davis 1991) (i) use criteria which measure the performance on the training set using leaving-one-out cross-validation (see (Stone 1974) and Section 2.1), (ii) restrict the number of parameters of the metric to increase statistical significance, and (iii) optimize the parameters of the metric using non-linear search techniques. For his technique of \"locally weighted regression\", Atkeson (1991) uses an evaluation criterion which is the sum of the squares of the error using leaving-one-out. His metric has the form d^2 = w_1(x_1 - y_1)^2 + ... + w_k(x_k - y_k)^2, and hence has k free parameters w_1, ..., w_k. 
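The weighted metric used by Atkeson (and by WKNN below) can be sketched as follows (a minimal illustration; the weight values are arbitrary):

```python
import numpy as np

def weighted_dist2(x, y, w):
    """d^2 = w_1 (x_1 - y_1)^2 + ... + w_k (x_k - y_k)^2."""
    x, y, w = (np.asarray(v, dtype=float) for v in (x, y, w))
    return float(np.sum(w * (x - y) ** 2))

# With unit weights this reduces to squared Euclidean distance.
d_euclid = weighted_dist2([0, 0], [3, 4], [1, 1])        # 9 + 16 = 25.0
# Down-weighting the second feature shrinks its influence on "nearness".
d_weighted = weighted_dist2([0, 0], [3, 4], [1, 0.25])   # 9 + 4 = 13.0
```

Tuning the weights w_1, ..., w_k is what lets these methods undo a diagonal rescaling of feature space.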
He uses Levenberg-Marquardt to optimize these parameters with respect to the evaluation criterion. For their Weighted K-Nearest Neighbors (WKNN) algorithm, Kelly and Davis (1991) use an evaluation criterion which is the total number of incorrect classifications under leaving-one-out. Their metric is the same as Atkeson's, and their optimization is done with a genetic algorithm. \n\nWe use an approach similar to that of Atkeson (1991) and Kelly and Davis (1991) to make PNN more robust with respect to affine transformations. Our approach, called Weighted PNN (WPNN), works by using anisotropic Gaussians rather than the isotropic Gaussians used by PNN. An anisotropic Gaussian has the form \n\n\frac{1}{(2\pi)^{k/2} (\det \Sigma)^{1/2}} e^{-(x - x_0)^T \Sigma^{-1} (x - x_0)}, \n\nwhere the covariance \Sigma is a nonnegative-definite k x k symmetric matrix. Note that \Sigma enters into the exponent of the Gaussian so as to define a new distance metric, and hence the use of anisotropic Gaussians to extend PNN is analogous to the use of modified metrics to extend other nearest-neighbor-like algorithms. The likelihood function for class i is \n\nL_i(x) = \frac{1}{N_i (2\pi)^{k/2} (\det \Sigma)^{1/2}} \sum_{j=1}^{N_i} e^{-(x - x_j^i)^T \Sigma^{-1} (x - x_j^i)}    (3) \n\nand the conditional probability is still as given in Equation 2. Note that when \Sigma is a multiple of the identity, i.e. \Sigma = \sigma I, Equation 3 reduces to Equation 1. Section 2 describes how we select the value of \Sigma. \n\nTo ensure good generalization, we have so far restricted ourselves to diagonal covariances (and thus metrics of the form used by Atkeson (1991) and Kelly and Davis (1991)). This reduces the number of degrees of freedom of the covariance from k(k+1)/2 to k. 
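With a diagonal covariance, the WPNN likelihood of Equation 3 can be sketched as follows (an illustrative NumPy sketch; note that, matching Equation 3, the exponent uses Σ⁻¹ directly with no factor of 2, and the exemplar data is hypothetical):

```python
import numpy as np

def wpnn_likelihood(x, exemplars, diag_cov):
    """Equation 3 with Sigma = diag(diag_cov): a sum of anisotropic Gaussians."""
    x = np.asarray(x, dtype=float)
    X = np.asarray(exemplars, dtype=float)
    s = np.asarray(diag_cov, dtype=float)       # diagonal entries of Sigma
    k = X.shape[1]
    det = np.prod(s)                            # det(Sigma) for a diagonal matrix
    # (x - x_j)^T Sigma^{-1} (x - x_j) for each exemplar x_j
    mahal = np.sum((X - x) ** 2 / s, axis=1)
    norm = len(X) * (2.0 * np.pi) ** (k / 2.0) * np.sqrt(det)
    return np.exp(-mahal).sum() / norm

# With Sigma = sigma * I this reduces to the PNN likelihood of Equation 1;
# shrinking one diagonal entry sharpens the Gaussian along that axis.
ex = [[0.0, 0.0], [1.0, 1.0]]
iso = wpnn_likelihood([0.5, 0.5], ex, [0.5, 0.5])
aniso = wpnn_likelihood([0.5, 0.5], ex, [0.5, 0.05])
```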
However, this restricted set of covariances is not sufficiently general to solve all the problems of PNN (as demonstrated in Section 3), and we therefore in Section 2 hint at some modifications which would allow us to use arbitrary covariances. \n\n2 OPTIMIZING THE COVARIANCE \n\nWe have used a genetic algorithm (Goldberg 1988) to optimize the covariance of the Gaussians. The code we used was a non-object-oriented C translation of the OOGA (Object-Oriented Genetic Algorithm) code (Davis 1991). This code preserves the features of OOGA including arbitrary encodings, exponential fitness, steady-state replacement, and adaptive operator probabilities. We now describe the distinguishing features of our genetic algorithm: (1) the evaluation function (Section 2.1), (2) the genetic encoding (Section 2.2), and (3) the population size (Section 2.3). \n\n2.1 THE EVALUATION FUNCTION \n\nTo evaluate the performance of a particular covariance matrix on the training set, we use a technique called \"leaving-one-out\", which is a special form of cross-validation (Stone 1974). One exemplar at a time is withheld from the training set, and we then determine how well WPNN with that covariance matrix classifies the withheld exemplar. The full evaluation is the sum of the evaluations on the individual exemplars. \n\nFor the exemplar x_j^i, let l_q(x_j^i) for q = 1, ..., M denote the class likelihoods obtained upon withholding this exemplar and applying Equation 3, and let P_q(x_j^i) be the probabilities obtained from these likelihoods via Equation 2. Then, we define the performance as \n\nE = \sum_{i=1}^{M} \sum_{j=1}^{N_i} \left( (1 - P_i(x_j^i))^2 + \sum_{q \neq i} (P_q(x_j^i))^2 \right)    (4) \n\nWe have incorporated two heuristics to quickly identify covariances which are clearly bad and give them a value of ∞, the worst possible score. 
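The leaving-one-out evaluation of Equation 4 can be sketched as follows (an illustrative sketch; `gauss_like` is a hypothetical stand-in for Equation 3 with a fixed isotropic covariance, and the tiny two-class dataset is invented):

```python
import numpy as np

def loo_evaluation(classes, likelihood_fn):
    """Equation 4: for every exemplar, withheld in turn, accumulate
    (1 - P_i(x))^2 + sum over q != i of P_q(x)^2."""
    E = 0.0
    for i, exemplars in enumerate(classes):
        for j, x in enumerate(exemplars):
            held_out = [ex if q != i else ex[:j] + ex[j + 1:]  # withhold x from its class
                        for q, ex in enumerate(classes)]
            L = np.array([likelihood_fn(x, ex) for ex in held_out])
            P = L / L.sum()                                    # Equation 2
            E += (1.0 - P[i]) ** 2 + float(np.sum(np.delete(P, i) ** 2))
    return E

def gauss_like(x, exemplars, sigma=0.5):
    """Stand-in likelihood (isotropic, 2-d): (2*pi*sigma)^(k/2) with k = 2."""
    X = np.asarray(exemplars, dtype=float)
    sq = np.sum((X - np.asarray(x, dtype=float)) ** 2, axis=1)
    return np.exp(-sq / sigma).sum() / (len(X) * (2.0 * np.pi * sigma))

# Well-separated classes should score near the best possible value of 0.
classes = [[[0.0, 0.0], [0.2, 0.0]], [[3.0, 3.0], [3.2, 3.0]]]
score = loo_evaluation(classes, gauss_like)
```

Lower is better: a covariance that classifies every withheld exemplar confidently and correctly drives E toward zero.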
This greatly speeds up the optimization process because many of the generated covariances can be eliminated this way (see Section 2.3). The first heuristic identifies covariances which are too \"small\" based on the condition that, for some exemplar x_j^i and all q = 1, ..., M, l_q(x_j^i) = 0 to within the precision of IEEE double-precision floating-point format. In this case, the probabilities P_q(x_j^i) are not well-defined. (When \Sigma is this \"small\", WPNN is approximately equivalent to WKNN with k = 1, and if such a small \Sigma is indeed required, then the WKNN algorithm should be used instead.) \n\nThe second heuristic identifies covariances which are too \"big\" in the sense that too many exemplars contribute significantly to the likelihood functions. Empirical observations and theoretical arguments show that PNN (and WPNN) work best when only a small fraction of the exemplars contribute significantly. Hence, we reject a particular \Sigma if, for any exemplar x_j^i, \n\n\sum_{q=1}^{M} \sum_{l=1}^{N_q} e^{-(x_j^i - x_l^q)^T \Sigma^{-1} (x_j^i - x_l^q)} > P    (5) \n\nHere, P is a parameter which we chose for our experiments to equal four. \n\nNote: If we wish to improve the generalization by discarding some of the degrees of freedom of the covariance (which we will need to do when we allow non-diagonal covariances), we should modify the evaluation function by subtracting off a term which is monotonically increasing with the number of degrees of freedom discarded. \n\n2.2 THE GENETIC ENCODING \n\nRecall from Section 1.3 that we have presently restricted the covariance to be diagonal. Hence, the set of all possible covariances is k-dimensional, where k is the dimension of the feature space. We encode the covariances as k+1 integers (a_0, ..., a_k), where the a_i's are in the ranges (a_0)_{min} ≤ a_0 ≤ (a_0)_{max} and 0 ≤ a_i ≤ a_{max} for i = 1, ..., k. The decoding map is \n\n\Sigma_{ii} = C_1^{a_0} C_2^{a_i}, i = 1, ..., k    (6) \n\nWe observe the following about this encoding. First, it is a \"logarithmic encoding\", i.e. 
the encoded parameters are related logarithmically to the original parameters. This provides a large dynamic range without the sacrifice of sufficient resolution at any scale and without making the search space unmanageably large. The constants C_1 and C_2 determine the resolution, while the constants (a_0)_{min}, (a_0)_{max}, and a_{max} determine the range. Second, it is possibly a \"redundant\" encoding, i.e. there may be multiple encodings of a single covariance. We use this redundant encoding, despite the seeming paradox, to reduce the size of the search space. The a_0 term encodes the size of the Gaussian, roughly equivalent to σ in PNN. The other a_i's encode the relative weighting of the various dimensions. If we dropped the a_0 term, the other a_i terms would have to have larger ranges to compensate, thus making the search space larger. \n\nNote: If we wish to improve the generalization by discarding some of the degrees of freedom of the covariance, we need to allow all the entries besides a_0 to take on the value of ∞ in addition to the range of values defined above. When a_i = ∞, its corresponding entry in the covariance matrix is zero and is hence discarded. \n\n2.3 POPULATION SIZE \n\nFor their success, genetic algorithms rely on having multiple individuals with partial information in the population. The problem we have encountered is that the ratio of the area of the search space with partial information to the entire search space is small. In fact, with our very loose heuristics, on Dataset 1 (see Section 3) about 90% of the randomly generated individuals of the initial population evaluated to ∞. Indeed, we estimate very roughly that only 1 in 50 or 1 in 100 randomly generated individuals contain partial information. 
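The integer encoding of Section 2.2 can be illustrated with a small decoder (the multiplicative form Sigma_ii = C1^a0 * C2^ai and the constants below are assumptions for illustration, not necessarily the exact map of the paper):

```python
import numpy as np

def decode(genes, C1=2.0, C2=2.0 ** 0.25):
    """Decode integers (a_0, a_1, ..., a_k) into diagonal covariance entries.
    Illustrative logarithmic map: Sigma_ii = C1**a0 * C2**a_i, so a_0 sets the
    overall scale (like sigma in PNN) and a_i the relative weight of axis i."""
    a0, rest = genes[0], np.asarray(genes[1:], dtype=float)
    return (C1 ** a0) * (C2 ** rest)

# A wide dynamic range is reached with small integer genes.
cov = decode([-2, 0, 4, 8])    # axis scales 0.25, 0.5, 1.0
# Redundancy: since C1 = C2**4 here, shifting a0 by +1 while shifting every
# a_i by -4 encodes exactly the same covariance.
cov2 = decode([-1, -4, 0, 4])
```

The redundancy illustrated by `cov2` is harmless, and keeping the shared scale gene a_0 keeps the per-axis gene ranges small, as argued above.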
To ensure that the initial population has multiple individuals with partial information requires a population size of many hundreds, and we conservatively used a population size of 1600. Note that with such a large population it is essential to use a steady-state genetic algorithm (Davis 1991) rather than generational replacement. \n\n3 EXPERIMENTAL RESULTS \n\nWe have performed a series of experiments to verify our claims about WPNN. To do so, we have constructed a sequence of four datasets designed to illustrate the shortcomings of PNN and how WPNN in its present form can fix some of these shortcomings but not others. Dataset 1 is a training set we generated during an effort to classify simulated sonar signals. It has ten features, five classes, and 516 total exemplars. Dataset 2 is the same as Dataset 1 except that we supplemented the ten features of Dataset 1 with five additional features, which were random numbers uniformly distributed between zero and one (and hence contained no information relevant to classification), thus giving a total of 15 features. Dataset 3 is the same as Dataset 2 except with ten (rather than five) irrelevant features added and hence a total of 20 features. Like Dataset 3, Dataset 4 has 20 features. It is obtained from Dataset 3 as follows. Pair each of the true features with one of the irrelevant features. Call the feature values of the i-th pair f_i and g_i. Then, replace these feature values with the values 0.5(f_i + g_i) and 0.5(f_i - g_i + 1), thus mixing up the relevant features with the irrelevant features via linear combinations. \n\nTo evaluate the performance of different pattern classification algorithms on these four datasets, we have used 10-fold cross-validation (Stone 1974). This involves splitting each dataset into ten disjoint subsets of similar size and similar distribution of exemplars by class. 
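The splitting just described can be sketched as a simple stratified K-fold partition (an illustrative sketch; the labels and fold count below are hypothetical):

```python
import numpy as np

def stratified_folds(labels, n_folds=10, seed=0):
    """Assign each exemplar to one of n_folds disjoint subsets so that every
    class is spread as evenly as possible across the folds."""
    rng = np.random.default_rng(seed)
    labels = np.asarray(labels)
    fold = np.empty(len(labels), dtype=int)
    for c in np.unique(labels):
        idx = np.flatnonzero(labels == c)
        rng.shuffle(idx)
        # deal this class's exemplars round-robin into the folds
        fold[idx] = np.arange(len(idx)) % n_folds
    return fold

labels = [0] * 60 + [1] * 40          # hypothetical 100-exemplar, 2-class dataset
fold = stratified_folds(labels, n_folds=10)
```

Each fold then serves once as the test set while the other nine are pooled for training.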
To evaluate a particular algorithm on a dataset requires ten training and test runs, where each subset is used as the test set for the algorithm trained on a training set consisting of the other nine subsets. \n\nThe pattern classification algorithms we have evaluated are backpropagation (with four hidden nodes), PNN (with σ = 0.05), WPNN and CART. The results of the experiments are shown in Figure 3. Note that the parenthesized quantities denote errors on the training data and are not compensated for the fact that each exemplar of the original dataset is in nine of the ten training sets used for cross-validation. \n\nAlgorithm    Dataset 1   Dataset 2   Dataset 3   Dataset 4 \nBackprop     11 (69)     16 (51)     20 (27)     13 (64) \nPNN           9          94         109          29 \nWPNN         10          11          11          25 \nCART         14          17          18          53 \n\nFigure 3: Performance on the four datasets of backprop, CART, PNN and WPNN (parenthesized quantities are training set errors). \n\nWe can draw a number of conclusions from these results. First, the performance of PNN on Datasets 2-4 clearly demonstrates the problems which arise from its lack of robustness with respect to affine transformations of feature space. In each case, there exists an affine transformation which makes the problem essentially equivalent to Dataset 1 from the viewpoint of Euclidean distance, but the performance is clearly very different. Second, WPNN clearly eliminates this problem with PNN for Datasets 2 and 3 but not for Dataset 4. This points out both the progress we have made so far in using WPNN to make PNN more robust and the importance of extending the WPNN algorithm to allow non-diagonal covariances. Third, although backpropagation is in theory transparent to affine transformations of feature space (because the first layer of weights and biases implements an arbitrary affine transformation), in practice affine transformations affect its performance. 
Indeed, Dataset 4 is obtained from Dataset 3 by an affine transformation, yet backpropagation performs very differently on them. Backpropagation does better on the training sets for Dataset 3 than on the training sets for Dataset 4 but does better on the test sets of Dataset 4 than the test sets of Dataset 3. This implies that for Dataset 4 during the training procedure backpropagation is not finding the globally optimum set of weights and biases but is missing it in such a way that improves its generalization. \n\n4 CONCLUSIONS AND FUTURE WORK \n\nWe have demonstrated through both theoretical arguments and experiments an inherent flaw of PNN, its lack of robustness with respect to affine transformations of feature space. To correct this flaw, we have proposed an extension of PNN, called WPNN, which uses anisotropic Gaussians rather than the isotropic Gaussians used by PNN. Under the assumption that the covariance of the Gaussians is diagonal, we have described how to use a genetic algorithm to optimize the covariance for optimal performance on the training set. Experiments have shown that WPNN can partially remedy the flaw with PNN. \n\nWhat remains to be done is to modify the optimization procedure to allow arbitrary (i.e., non-diagonal) covariances. The main difficulty here is that the covariance matrix has a large number of degrees of freedom (k(k+1)/2, where k is the dimension of feature space), and we therefore need to ensure that the choice of covariance is not overfit to the data. We have presented some general ideas on how to approach this problem, but a true solution still needs to be developed. \n\nAcknowledgements \n\nThis work was partially supported by DARPA via ONR under Contract N00014-89-C-0264 as part of the Artificial Neural Networks Initiative. \n\nThanks to Ken Theriault for his useful comments. \n\nReferences \n\nC.G. Atkeson. 
(1991) Using locally weighted regression for robot learning. Proceedings of the 1991 IEEE Conference on Robotics and Automation, pp. 958-963. Los Alamitos, CA: IEEE Computer Society Press. \n\nB.V. Dasarathy. (1991) Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. Los Alamitos, CA: IEEE Computer Society Press. \n\nL. Davis. (1991) Handbook of Genetic Algorithms. New York: Van Nostrand Reinhold. \n\nK. Fukunaga and T.T. Flick. (1984) An optimal global nearest neighbor metric. IEEE Transactions on Pattern Analysis and Machine Intelligence, Vol. PAMI-6, No. 3, pp. 314-318. \n\nD. Goldberg. (1988) Genetic Algorithms in Machine Learning, Optimization and Search. Redwood City, CA: Addison-Wesley. \n\nJ.D. Kelly, Jr. and L. Davis. (1991) Hybridizing the genetic algorithm and the k nearest neighbors classification algorithm. Proceedings of the Fourth International Conference on Genetic Algorithms, pp. 377-383. San Mateo, CA: Morgan Kaufmann. \n\nR.D. Short and K. Fukunaga. (1981) The optimal distance measure for nearest neighbor classification. IEEE Transactions on Information Theory, Vol. IT-27, No. 5, pp. 622-627. \n\nD.F. Specht. (1990) Probabilistic neural networks. Neural Networks, Vol. 3, No. 1, pp. 109-118. \n\nM. Stone. (1974) Cross-validatory choice and assessment of statistical predictions. Journal of the Royal Statistical Society, Vol. 36, pp. 111-147. \n", "award": [], "sourceid": 485, "authors": [{"given_name": "David", "family_name": "Montana", "institution": null}]}