{"title": "The Variational Ising Classifier (VIC) Algorithm for Coherently Contaminated Data", "book": "Advances in Neural Information Processing Systems", "page_first": 1497, "page_last": 1504, "abstract": null, "full_text": "The Variational Ising Classifier (VIC) algorithm\n for coherently contaminated data\n\n\n\n Oliver Williams Andrew Blake Roberto Cipolla\n Dept. of Engineering Microsoft Research Ltd. Dept. of Engineering\n University of Cambridge Cambridge, UK University of Cambridge\n omcw2@cam.ac.uk\n\n\n\n\n Abstract\n\n There has been substantial progress in the past decade in the development\n of object classifiers for images, for example of faces, humans and vehi-\n cles. Here we address the problem of contaminations (e.g. occlusion,\n shadows) in test images which have not explicitly been encountered in\n training data. The Variational Ising Classifier (VIC) algorithm models\n contamination as a mask (a field of binary variables) with a strong spa-\n tial coherence prior. Variational inference is used to marginalize over\n contamination and obtain robust classification. In this way the VIC ap-\n proach can turn a kernel classifier for clean data into one that can tolerate\n contamination, without any specific training on contaminated positives.\n\n\n\n1 Introduction\n\nRecent progress in discriminative object detection, especially for faces, has yielded good\nperformance and efficiency [1, 2, 3, 4]. Such systems are capable of classifying those\npositives that can be generalized from positive training data. This is restrictive in practice\nin that test data may contain distortions that take it outside the strict ambit of the training\npositives. One example would be lighting changes (to a face) but this can be addressed\nreasonably effectively by a normalizing transformation applied to training and test images;\ndoing so is common practice in face classification. Other sorts of disruption are not so\neasily factored out. 
A prime example is partial occlusion.

The aim of this paper is to extend a classifier trained on clean positives to accept partially occluded positives as well, without further training. The approach is to capture some of the regularity inherent in a typical pattern of contamination, namely its spatial coherence. This can be thought of as extending the generalizing capability of a classifier to tolerate the sorts of image distortion that occur as a result of contamination.

As done previously in one dimension, for image contours [5], the Variational Ising Classifier (VIC) models contamination explicitly as switches with a strong coherence prior in the form of an Ising model, but here over the full two-dimensional image array. In addition, the Ising model is loaded with a bias towards non-contamination. The aim is to incorporate these hidden contamination variables into a kernel classifier such as [1, 3]. In fact the Relevance Vector Machine (RVM) is particularly suitable [6] as it is explicitly probabilistic, so that contamination variables can be incorporated as a hidden layer of random variables.

Figure 1: The 2D Ising model is applied over a graph with edges e between neighbouring pixels (connected 4-wise).

Classification is done by marginalization over all possible configurations of the hidden variable array, and this is made tractable by variational (mean field) inference. The inference scheme makes use of "hallucination" to fill in parts of the object that are unobserved due to occlusion.

Results of VIC are given for face detection. First we show that the classifier performance is not significantly damaged by the inclusion of contamination variables.
Then a contaminated test set is generated using real test images and computer-generated contaminations. Over this test data the VIC algorithm does indeed perform significantly better than a conventional classifier (similar to [4]). The hidden variable layer is shown to operate effectively, successfully inferring areas of contamination. Finally, inference of contamination is shown working on real images with real contaminations.

2 Bayesian modelling of contamination

Classification requires P(F|I), the posterior for the proposition F that an object is present given the image data intensity array I. This can be computed in terms of likelihoods

P(F|I) = P(I|F)P(F) / [P(I|F)P(F) + P(I|\bar{F})P(\bar{F})]   (1)

so the test P(F|I) > 1/2 becomes

\log P(I|F) - \log P(I|\bar{F}) > t   (2)

where t is a prior-dependent threshold that controls the tradeoff between positive and negative classification errors. Suppose we are given a likelihood P(I|\alpha, F) for the presence of a face given contamination \alpha, an array of binary "observation" variables corresponding to each pixel I_j of I, such that \alpha_j = 0 indicates contamination at that pixel, whereas \alpha_j = 1 indicates a successfully observed pixel. Then, in principle,

P(I|F) = \sum_\alpha P(I|\alpha, F) P(\alpha),   (3)

(making the reasonable assumption P(\alpha|F) = P(\alpha), that the pattern of contamination is object-independent) and similarly for \log P(I|\bar{F}). The marginalization itself is intractable, requiring a summation over all 2^N possible configurations of \alpha for images with N pixels. Approximating that marginalization is dealt with in the next section.
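As a concrete illustration of the decision rule in (1)-(2), the sketch below turns two log-likelihoods into a posterior and a thresholded test. The function names and the placeholder inputs are ours, not the paper's:

```python
import math

def log_likelihood_ratio(log_p_face, log_p_nonface):
    """Test statistic of eq. (2): log P(I|F) - log P(I|F-bar)."""
    return log_p_face - log_p_nonface

def posterior_face(log_p_face, log_p_nonface, prior_face=0.5):
    """Eq. (1), rewritten as a logistic sigmoid of the log-odds."""
    log_odds = (log_likelihood_ratio(log_p_face, log_p_nonface)
                + math.log(prior_face / (1.0 - prior_face)))
    return 1.0 / (1.0 + math.exp(-log_odds))

def is_face(log_p_face, log_p_nonface, t=0.0):
    """With equal priors, t = 0 reproduces the test P(F|I) > 1/2."""
    return log_likelihood_ratio(log_p_face, log_p_nonface) > t
```

Shifting t trades false positives against false negatives, which is exactly the knob swept to produce an ROC curve.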
In the meantime, there are two other problems to deal with: specifying the prior P(\alpha); and specifying the likelihood under contamination P(I|\alpha, F) given only training data for the unoccluded object.

2.1 Prior over contaminations

The prior contains two terms: the first expresses the belief that contamination will occur in coherent regions of a subimage. This takes the form of an Ising model [7] with energy U_I(\alpha) that penalizes adjacent pixels which differ in their labelling (see Figure 1); the second term U_C biases generally against contamination a priori, and its balance with the first term is mediated by the constant \lambda. The total prior energy is then

U(\alpha) = U_I(\alpha) + U_C(\alpha) = \nu \sum_e [1 - \delta(\alpha_{e_1} - \alpha_{e_2})] + \lambda \sum_j \delta(\alpha_j),   (4)

where \delta(x) = 1 if x = 0 and 0 otherwise, and e_1, e_2 are the indices of the pixels at either end of edge e (Figure 1). The prior energy determines a probability via a temperature constant 1/T_0 [7]:

P(\alpha) \propto e^{-U(\alpha)/T_0} = e^{-U_I(\alpha)/T_0} e^{-U_C(\alpha)/T_0}   (5)

2.2 Relevance vector machine

An unoccluded classifier P(F|I, \alpha = 1) can be learned from training data using a Relevance Vector Machine (RVM) [6], trained on a database of frontal face and non-face images [8] (see Section 4 for details). The probabilistic properties of the RVM make it a good choice when (later) it comes to marginalising over \alpha. For now we consider how to construct the likelihood itself. First the conventional, unoccluded case is considered, for which the posterior P(F|I) is learned from positive and negative examples. Kernel functions [9] are computed between a candidate image I and a subset of relevance vectors {x_k} retained from the training set. Gaussian kernels are used here to compute

y(I) = \sum_k w_k \exp(-\gamma \sum_j (I_j - x_{kj})^2),   (6)

where w_k are learned weights, and x_{kj} is the jth pixel of the kth relevance vector. Then the posterior is computed via the logistic sigmoid function as

P(F|I, \alpha = 1) = \sigma(y(I)) = 1 / (1 + e^{-y(I)}),   (7)

and finally the unoccluded data-likelihood would be

P(I|F, \alpha = 1) \propto \sigma(y(I)) / P(F).   (8)

2.3 Hallucinating appearance

The aim now is to derive the occluded likelihood from the unoccluded case, in which the contamination mask is known, without any further training. To do this, (8) must be extended to give P(I|F, \alpha) for arbitrary masks \alpha, despite the fact that the pixels I_j from the object are not observed wherever \alpha_j = 0. In principle one should take into account all possible (or at least probable) values for the occluded pixels. Here, for simplicity, a single fixed hallucination is substituted for occluded pixels, and we then proceed as if those values had actually been observed. This gives

P(I|F, \alpha) \propto \sigma(\tilde{y}(\alpha, I)) / P(F)   (9)

where

\tilde{y}(\alpha, I) = y(\tilde{I}(I, \alpha, F))  with  \tilde{I}(I, \alpha, F)_j = I_j if \alpha_j = 1, and (E[I|F])_j otherwise,   (10)

in which E[I|F] is a fixed hallucination, conditioned on the model F and computed as a sample mean over training instances.

3 Approximate marginalization of \alpha by mean field

At this point we return to the task of marginalising over \alpha (3) to obtain P(I|F) and P(I|\bar{F}) for use in classification (2). Due to the connectedness of neighbouring pixels in the Ising prior (Figure 1), P(I, \alpha|F) is a Markov Random Field (MRF) [7]. The marginalized likelihood P(I|F) could be estimated by Gibbs sampling [10], but in our experiments that takes tens of minutes to converge. The following section describes a mean field approximation which converges in a few seconds.
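Before turning to the updates, note that the fill-in rule of (10) is a one-liner. Below is a hedged NumPy sketch; the array and function names are ours:

```python
import numpy as np

def hallucinate(I, alpha, mean_face):
    """Sketch of eq. (10): observed pixels (alpha_j = 1) are kept,
    while occluded pixels (alpha_j = 0) are replaced by the mean-face
    hallucination E[I|F] computed from training instances."""
    return np.where(alpha == 1, I, mean_face)
```

Because the hallucination is a fixed image, it can be precomputed once per model rather than recomputed inside the inference loop.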
The mean field algorithm is given here for P(I|F) but must also be repeated for P(I|\bar{F}), simply substituting \bar{F} for F throughout.

3.1 Variational approximation

Mean field approximation is a form of variational approximation [11] and transforms an inference problem into the optimization of a functional J:

J(Q) = \log P(I|F) - KL[Q(\alpha) || P(\alpha|F, I)],   (11)

where KL is the Kullback-Leibler divergence

KL[Q(\alpha) || P(\alpha|F, I)] = \sum_\alpha Q(\alpha) \log [Q(\alpha) / P(\alpha|F, I)].

The objective functional J(Q) is a lower bound on the log-marginal probability \log P(I|F) [11]; when it is maximized at Q*, it gives both the marginal likelihood J(Q*) = \log P(I|F) and the posterior distribution Q*(\alpha) = P(\alpha|F, I) over the hidden variables. Following [11], J(Q) is simplified using Bayes' rule:

J(Q) = H(Q) + E_Q[\log P(I, \alpha|F)]

where H(.) is the entropy of a distribution [12] and E_Q[g(\alpha)] = \sum_\alpha Q(\alpha) g(\alpha) denotes the expectation of a function g with respect to Q(\alpha). A form of Q(\alpha) must be chosen that makes the maximization of J(Q) tractable. For the mean-field approximation, Q(\alpha) is modelled as a pixel-wise product of factors: Q(\alpha) = \prod_i Q_i(\alpha_i). It is now possible to maximize J iteratively with respect to each marginal Q_i(\alpha_i) in turn, giving the mean field update [11]:

Q_i(\alpha_i) = (1/Z_i) \exp E_{Q|\alpha_i}[\log P(I, \alpha|F)],   (12)

where

Z_i = \sum_{\alpha_i} \exp E_{Q|\alpha_i}[\log P(I, \alpha|F)]

is the partition function and E_{Q|\alpha_i}[.] is the expectation with respect to Q given \alpha_i:

E_{Q|\alpha_i}[g(\alpha)] = \sum_{\{\alpha_{j \ne i}\}} [\prod_{j \ne i} Q_j(\alpha_j)] g(\alpha).

3.2 Taking expectations over P(I, \alpha|F)

To perform the expectation required in (12), the log-joint distribution is written as:

\log P(I, \alpha|F) = -\log(1 + e^{-\tilde{y}(\alpha, I)}) - (1/T_0) U_I(\alpha) - (1/T_0) U_C(\alpha) + const.

The conditional expectation E_{Q|\alpha_i} in (12) is found efficiently from the complete expectations by replacing only the terms involving \alpha_i. Likewise, when one factor of Q changes (12), the complete expectations may be updated without recomputing them ab initio.
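The coordinate-wise update (12) can be illustrated on the prior alone. The sketch below runs factorized updates for a 1-D chain with the Ising energy of (4); the names `nu`, `lam` and `T` stand in for the paper's Greek constants, and the data term is omitted for brevity:

```python
import numpy as np

def mean_field_ising_1d(N, nu, lam, T, iters=50):
    """Illustrative mean-field updates (eq. 12) for a 1-D chain under
    the prior energy of eq. (4): cost nu per disagreeing neighbour
    pair plus a bias lam against contamination (alpha_j = 0).
    Returns q[i] = Q_i(alpha_i = 1)."""
    q = np.full(N, 0.5)
    for _ in range(iters):
        for i in range(N):
            e = np.zeros(2)  # expected energy e[s] for alpha_i = s
            for j in (i - 1, i + 1):
                if 0 <= j < N:
                    e[0] += nu * q[j]        # neighbour=1 disagrees with 0
                    e[1] += nu * (1 - q[j])  # neighbour=0 disagrees with 1
            e[0] += lam                      # bias against contamination
            p = np.exp(-e / T)               # unnormalized Boltzmann weights
            q[i] = p[1] / (p[0] + p[1])      # normalize (partition Z_i)
    return q
```

With a positive contamination bias the fixed point favours the uncontaminated label everywhere, mirroring how the full algorithm defaults to "observed" in the absence of evidence.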
For brevity, we give the expressions for the complete expectations only. For the prior this is simply:

E_Q[U(\alpha)] = \nu \sum_e \sum_{\alpha_{e_1}, \alpha_{e_2}} Q_{e_1}(\alpha_{e_1}) Q_{e_2}(\alpha_{e_2}) [1 - \delta(\alpha_{e_1} - \alpha_{e_2})] + \lambda \sum_j Q_j(\alpha_j = 0).   (13)

For the likelihood it is more difficult. Saul et al. [13] show how to approximate the expectation over the sigmoid function by introducing a dummy variable \xi:

E_Q[\log(1 + e^{-\tilde{y}(\alpha, I)})] \le -\xi E_Q[\tilde{y}(\alpha, I)] + \log( E_Q[e^{\xi \tilde{y}(\alpha, I)}] + E_Q[e^{(\xi - 1) \tilde{y}(\alpha, I)}] ).

The Gaussian RBF in (6) means that it is not feasible to compute the expectation E_Q[e^{\xi \tilde{y}(\alpha, I)}],[1] so a simpler approximation is used:

E_Q[\log \sigma(\tilde{y}(\alpha, I))] \approx \log \sigma(E_Q[\tilde{y}(\alpha, I)]),

where

E_Q[\tilde{y}(\alpha, I)] = \sum_k w_k \prod_j \sum_{\alpha_j} Q_j(\alpha_j) \exp(-\gamma (\tilde{I}(I, \alpha, F)_j - x_{kj})^2).   (14)

4 Results and discussion

The mean field algorithm described above is capable only of local optimization of J(Q). A symptom of this is that it exhibits spontaneous symmetry breaking [11], setting the contamination field to either all contaminated or all uncontaminated. This is alleviated through careful initialization. By performing iterations initially at a high temperature T_h, the prior is weakened. The temperature is then progressively decreased, on a linear annealing schedule [10], until the modelled prior temperature T_0 is reached. Figure 2 shows pseudo-code for the VIC algorithm. Note also that an advantage of hallucinating appearance from the mean face is that the hallucination process requires no computation within the optimization loop. For 19 x 19 subimages, the average time taken for the VIC algorithm to converge is 4 seconds. However, this is an unoptimized Matlab implementation; in C++ it is anticipated to be at least 10 times faster.

The training set used for the RVM [8] contains subimages of registered faces and non-faces which were histogram equalized [14] to reduce the effect of different lighting, with their pixel values scaled to the range [0, 1]. The same is done to each test subimage I.
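Because Q factorizes over pixels, the expectation of each Gaussian kernel term in (14) reduces to a product of per-pixel, Q-weighted mixtures of the observed pixel and the hallucinated mean-face pixel. A hedged NumPy sketch (the array names and the placement of the width constant `gamma` are our reading of (6) and (14)):

```python
import numpy as np

def expected_score(I, q, mean_face, w, X, gamma=0.05):
    """Sketch of eq. (14): expected RVM output E_Q[y-tilde] under the
    factorized posterior q[j] = Q_j(alpha_j = 1), for relevance
    vectors X (one per row) with weights w."""
    obs = np.exp(-gamma * (I[None, :] - X) ** 2)          # alpha_j = 1
    hal = np.exp(-gamma * (mean_face[None, :] - X) ** 2)  # alpha_j = 0
    per_pixel = q[None, :] * obs + (1.0 - q[None, :]) * hal
    return float(w @ np.prod(per_pixel, axis=1))
```

When q is 1 everywhere the expression collapses to the plain kernel score y(I) of (6), which is a useful sanity check.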
The RVM was trained using 1500 face examples and 1500 non-face examples.[2] Parameters were set as follows: the RBF width parameter in (6) is \gamma = 0.05; the contamination cost is \lambda = 0.2; and the temperature constants are T_h = 2.5, T_0 = 1.5 and \Delta T = 0.2.

As a by-product of the VIC algorithm, the posterior pattern P(\alpha|F, I) of contamination is approximately inferred as the value of Q which maximizes J. Figure 3 shows some results of this. As might be expected, for a non-face the algorithm hallucinates an intact face with total contamination (for example, row 4 of the figure); but of course the marginalized posterior probability P(F|I) is very small in such a case.

4.1 Classifier

To assess the classification performance of the VIC, contaminated positives were automatically generated (Figure 4). These were combined with pure faces and pure non-faces (none of which were used in the training set) and tested to produce the Receiver Operating Characteristic (ROC) curves given in Figure 4, for the unaltered RVM acting on the contaminated set and for the new contamination-tolerant VIC outlined in this paper. For comparison, points are shown for a boosted cascade of classifiers [15], which is a publicly available detector based on the system of Viola and Jones [4]. The curve shown for the RVM against an uncontaminated test set confirms that contamination does make the classification task considerably harder. Figure 5 shows some natural face images that the boosted cascade [15] fails to detect, either because of occlusion or due to a degree of deviation from the frontal pose.

[1] The term \exp[\xi \tilde{y}(\alpha, I)] = \exp[\xi \sum_k w_k e^{-\sum_j d_j(I, x_k)}] does not factorize across pixels.
[2] These sizes are limited in practice by the complexity of the training algorithm [6].

Require: Candidate image region I
Require: Parameters T_h, T_0, \Delta T, \lambda
Require: RVM weights and examples w_k, x_k
Require: Mean face appearance E[I|F]
  Initialize Q_i(\alpha_i = 1) = 0.5 for all i
  Compute E_Q[U(\alpha)] (13)
  Compute E_Q[\tilde{y}(\alpha, I)] (14)
  T <- T_h
  while T > T_0 do
    while Q not converged do
      for all image locations i do
        Compute conditional expectations E_{Q|\alpha_i}[U(\alpha)] and E_{Q|\alpha_i}[\tilde{y}(\alpha, I)]
        Compute E_{Q|\alpha_i}[\log P(I, \alpha|F)] = \log \sigma(E_{Q|\alpha_i}[\tilde{y}(\alpha, I)]) - E_{Q|\alpha_i}[U(\alpha)]/T
        Compute the partition function Z_i = \sum_{\alpha_i} \exp E_{Q|\alpha_i}[\log P(I, \alpha|F)]
        Update Q_i(\alpha_i) <- (1/Z_i) \exp E_{Q|\alpha_i}[\log P(I, \alpha|F)]
        Update the complete expectations E_Q[U(\alpha)] and E_Q[\tilde{y}(\alpha, I)]
      end for
    end while
    T <- T - \Delta T
  end while

Figure 2: Pseudo-code for the VIC algorithm

Figure 3: Partially occluded images with inferred areas of probable contamination (dark). Columns: input image I, hallucinated image, and the contamination field Q(\alpha = 1).

Figure 4: ROC curves (true positive rate against false positive rate) for the RVM without contamination, the RVM, the VIC, the boosted cascade, and the cascade without contamination. Also shown are some of the contaminated positives used to generate the curves. These were made by sampling contamination patterns from the prior and using them to mix a face and a non-face artificially.

Figure 5: Images that the boosted cascade [15] failed to detect as faces: the VIC algorithm produces higher posterior face probability by labelling certain regions with unusual appearance (e.g. due to 3D rotation) as contaminated.
The VIC algorithm detects them successfully, however.

4.2 Discussion

Figure 4 shows that, by modelling the contamination field explicitly, the VIC detector improves, over a contaminated test set, on the performance of both a plain RVM and a boosted cascade detector. The algorithm is relatively expensive to execute compared, say, with the contamination-free RVM. However, this could be mitigated by cascading [4], in which a simple and efficient classifier, tuned to return a high rate of false positives for all objects, contaminated and non-contaminated, would make a preliminary sweep of a test image. The contamination-tolerant VIC algorithm would then be applied to the candidate subimages that remain, thereby concentrating computational power on just a few locations.

Figure 5 illustrates the operation of the contamination mechanism on real images, all of which are detected as faces by the VIC algorithm but missed by the boosted cascade. There is no occlusion in these examples, but rotations have distorted the appearance of certain features. The VIC algorithm deals with this by labelling the distortions as contaminated areas and hallucinating face-like texture in their place.

In conclusion, we have developed the VIC algorithm for object detection in the presence of coherently contaminated data. Contamination is modelled as coherent via an Ising prior, and is marginalized out by variational inference. Experiments show that VIC classifies contaminated images more robustly than classifiers designed for clean data. It is worth pointing out that the approach of the VIC algorithm is not limited to RVMs. Any probabilistic detector for which it is possible to estimate the expectation (14) could be modified in a similar way to deal with spatially coherent contamination.
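The annealed initialization described in Section 4 (Figure 2) amounts to the short schedule below; T_h, T_0 and the step size take the paper's reported values, while the name `dT` is ours:

```python
def anneal_schedule(Th=2.5, T0=1.5, dT=0.2):
    """Linear annealing schedule from Figure 2: iterate first at the
    high temperature Th, where the Ising prior is weak, cooling by dT
    per outer loop until the modelled prior temperature T0 is reached."""
    T = Th
    while T > T0:
        yield T
        T -= dT
```

Running the mean-field sweeps at each yielded temperature discourages the spontaneous symmetry breaking (all-contaminated or all-uncontaminated fields) noted in Section 4.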
Future work will address: improved efficiency, by incorporating the VIC into a cascade of simple classifiers; and alternatives to data hallucination using marginalization over missing data, if a tractable means of doing this can be found.

References

[1] E. Osuna, R. Freund, and F. Girosi. Training support vector machines: An application to face detection. In Proc. Conf. Computer Vision and Pattern Recognition, pages 130-136, 1997.

[2] H.A. Rowley, S. Baluja, and T. Kanade. Neural network-based face detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 20(1):23-38, 1998.

[3] S. Romdhani, P. Torr, B. Scholkopf, and A. Blake. Computationally efficient face detection. In Proc. Int. Conf. on Computer Vision, volume 2, pages 524-531, 2001.

[4] P. Viola and M. Jones. Rapid object detection using a boosted cascade of simple features. In Proc. Conf. Computer Vision and Pattern Recognition, 2001.

[5] J. MacCormick and A. Blake. Spatial dependence in the observation of visual contours. In Proc. European Conf. on Computer Vision, pages 765-781, 1998.

[6] M.E. Tipping. Sparse Bayesian learning and the relevance vector machine. Journal of Machine Learning Research, 1:211-244, 2001.

[7] R. Kindermann and J.L. Snell. Markov Random Fields and Their Applications. American Mathematical Society, 1980.

[8] CBCL face database #1. MIT Center for Biological and Computational Learning: http://www.ai.mit.edu/projects/cbcl.

[9] B. Scholkopf and A. Smola. Learning with Kernels: Support Vector Machines, Regularization, Optimization, and Beyond (Adaptive Computation and Machine Learning). MIT Press, 2001.

[10] S. Geman and D. Geman. Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Trans. on Pattern Analysis and Machine Intelligence, 6(6):721-741, 1984.

[11] T. Jaakkola. Tutorial on variational approximation methods. In Advanced Mean Field Methods: Theory and Practice. MIT Press, 2000.

[12] T. Cover and J. Thomas. Elements of Information Theory. John Wiley & Sons, 1991.

[13] L. Saul, T. Jaakkola, and M. Jordan. Mean field theory for sigmoid belief networks. Journal of Artificial Intelligence Research, 4:61-76, 1996.

[14] A.K. Jain. Fundamentals of Digital Image Processing. Prentice-Hall, New Jersey, 1989.

[15] R. Lienhart and J. Maydt. An extended set of Haar-like features for rapid object detection. In Proc. IEEE ICIP, volume 1, pages 900-903, 2002.
", "award": [], "sourceid": 2616, "authors": [{"given_name": "Oliver", "family_name": "Williams", "institution": null}, {"given_name": "Andrew", "family_name": "Blake", "institution": null}, {"given_name": "Roberto", "family_name": "Cipolla", "institution": null}]}