{"title": "Topographic Transformation as a Discrete Latent Variable", "book": "Advances in Neural Information Processing Systems", "page_first": 477, "page_last": 483, "abstract": null, "full_text": "Topographic Transformation as a Discrete Latent Variable \n\nNebojsa Jojic \n\nBeckman Institute \n\nUniversity of Illinois at Urbana \n\nwww.ifp.uiuc.edu/\",jojic \n\nBrendan J. Frey \n\nComputer Science \n\nUniversity of Waterloo \n\nwww.cs.uwaterloo.ca/ ... frey \n\nAbstract \n\nInvariance to topographic transformations such as translation and \nshearing in an image has been successfully incorporated into \nfeedforward mechanisms, e.g., \"convolutional neural networks\" and \n\"tangent propagation\". We describe a way to add transformation \ninvariance to a generative density model by approximating the nonlinear \ntransformation manifold by a discrete set of transformations. An \nEM algorithm for the original model can be extended to the new \nmodel by computing expectations over the set of transformations. \nWe show how to add a discrete transformation variable to Gaussian \nmixture modeling, factor analysis and mixtures of factor analysis. \nWe give results on filtering microscopy images, face and facial pose \nclustering, and handwritten digit modeling and recognition. \n\n1 Introduction \n\nImagine what happens to the point in the N-dimensional space corresponding to an \nN-pixel image of an object, while the object is deformed by shearing. A very small \namount of shearing will move the point only slightly, so deforming the object by \nshearing will trace a continuous curve in the space of pixel intensities. As illustrated \nin Fig. 1a, extensive levels of shearing will produce a highly nonlinear curve (consider \nshearing a thin vertical line), although the curve can be approximated by a straight \nline locally. 
\n\nLinear approximations of the transformation manifold have been used to \nsignificantly improve the performance of feedforward discriminative classifiers such as \nnearest neighbors (Simard et al., 1993) and multilayer perceptrons (Simard et al., \n1992). Linear generative models (factor analysis, mixtures of factor analysis) have \nalso been modified using linear approximations of the transformation manifold to \nbuild in some degree of transformation invariance (Hinton et al., 1997). \nIn general, the linear approximation is accurate for transformations that couple \nneighboring pixels, but is inaccurate for transformations that couple nonneighboring \npixels. In some applications (e.g., handwritten digit recognition), the input can be \nblurred so that the linear approximation becomes more robust. \n\nFor significant levels of transformation, the nonlinear manifold can be better \nmodeled using a discrete approximation. For example, the curve in Fig. 1a can be \nrepresented by a set of points (filled discs). In this approach, a discrete set of \npossible transformations is specified beforehand and parameters are learned so that the \nmodel is invariant to the set of transformations. This approach has been used to \ndesign \"convolutional neural networks\" that are invariant to translation (Le Cun \net al., 1998) and to develop a general purpose learning algorithm for generative \ntopographic maps (Bishop et al., 1998). \n\nFigure 1: (a) An N-pixel greyscale image is represented by a point (unfilled disc) in an \nN-dimensional space. When the object being imaged is deformed by shearing, the point moves \nalong a continuous curve. Locally, the curve is linear, but high levels of shearing produce a \nhighly nonlinear curve, which we approximate by discrete points (filled discs) indexed by ℓ. (b) \nA graphical model showing how a discrete transformation variable ℓ can be added to a density \nmodel p(z) for a latent image z to model the observed image x. The Gaussian pdf p(x|ℓ, z) \ncaptures the ℓth transformation plus a small amount of pixel noise. (We use a box to represent \nvariables that have Gaussian conditional pdfs.) We have explored (c) transformed mixtures \nof Gaussians, where c is a discrete cluster index; (d) transformed component analysis (TCA), \nwhere y is a vector of Gaussian factors, some of which may model locally linear transformation \nperturbations; and (e) mixtures of transformed component analyzers, or transformed mixtures \nof factor analyzers. \n\nWe describe how invariance to a discrete set of known transformations (like \ntranslation) can be built into a generative density model and we show how an EM algorithm \nfor the original density model can be extended to the new model by computing \nexpectations over the set of transformations. We give results for 5 different types of \nexperiment involving translation and shearing. \n\n2 Transformation as a Discrete Latent Variable \n\nWe represent transformation ℓ by a sparse transformation generating matrix G_ℓ that \noperates on a vector of pixel intensities. For example, integer-pixel translations of \nan image can be represented by permutation matrices. Although other types of \ntransformation may not be accurately represented by permutation matrices, \nmany useful types of transformation can be represented by sparse transformation \nmatrices. For example, rotation and blurring can be represented by matrices that \nhave a small number of nonzero elements per row (e.g., at most 6 for rotations). \n\nThe observed image x is linked to the nontransformed latent image z and the \ntransformation index ℓ ∈ {1, ..., L} as follows: \n\np(x|ℓ, z) = N(x; G_ℓ z, Ψ), (1) \n\nwhere Ψ is a diagonal matrix of pixel noise variances. Since the probability of \na transformation may depend on the latent image, the joint distribution over the \nlatent image z, the transformation index ℓ and the observed image x is \n\np(x, ℓ, z) = N(x; G_ℓ z, Ψ) P(ℓ|z) p(z). (2) \n\nThe corresponding graphical model is shown in Fig. 1b. For example, to model noisy \ntransformed images of just one shape, we choose p(z) to be a Gaussian distribution. \n\n2.1 Transformed mixtures of Gaussians (TMG). Fig. 1c shows the graphical \nmodel for a TMG, where different clusters may have different transformation \nprobabilities. Cluster c has mixing proportion π_c, mean μ_c and diagonal covariance \nmatrix Φ_c. The joint distribution is \n\np(x, ℓ, z, c) = N(x; G_ℓ z, Ψ) N(z; μ_c, Φ_c) ρ_{ℓc} π_c, (3) \n\nwhere the probability of transformation ℓ for cluster c is ρ_{ℓc}. Marginalizing over \nthe latent image gives the cluster/transformation conditional likelihood, \n\np(x|ℓ, c) = N(x; G_ℓ μ_c, G_ℓ Φ_c G_ℓ^T + Ψ), (4) \n\nwhich can be used to compute p(x) and the cluster/transformation responsibility \nP(ℓ, c|x). This likelihood looks like the likelihood for a mixture of factor analyzers \n(Ghahramani and Hinton, 1997). However, whereas the likelihood computation for \nN latent pixels takes order N^3 time in a mixture of factor analyzers, it takes linear \ntime, order N, in a TMG, because G_ℓ Φ_c G_ℓ^T + Ψ is sparse. \n\n2.2 Transformed component analysis (TCA). Fig. 1d shows the graphical \nmodel for TCA (or \"transformed factor analysis\"). The latent image is modeled \nusing linearly combined Gaussian factors, y. The joint distribution is \n\np(x, ℓ, z, y) = N(x; G_ℓ z, Ψ) N(z; μ + Λy, Φ) N(y; 0, I) ρ_ℓ, (5) \n\nwhere μ is the mean of the latent image, Λ is a matrix of latent image components \n(the factor loading matrix) and Φ is a diagonal noise covariance matrix for the latent \nimage. 
Marginalizing over the factors and the latent image gives the transformation \nconditional likelihood, \n\np(x|ℓ) = N(x; G_ℓ μ, G_ℓ(ΛΛ^T + Φ)G_ℓ^T + Ψ), (6) \n\nwhich can be used to compute p(x) and the transformation responsibility P(ℓ|x). \nG_ℓ(ΛΛ^T + Φ)G_ℓ^T is not sparse, so computing this likelihood exactly takes order N^3 \ntime. However, the likelihood can be computed in linear time if we assume \n|G_ℓ(ΛΛ^T + Φ)G_ℓ^T + Ψ| ≈ |G_ℓ(ΛΛ^T + Φ)G_ℓ^T|, which corresponds to assuming \nthat the observed noise is smaller than the variation due to the latent image, or \nthat the observed noise is accounted for by the latent noise model, Φ. In our \nexperiments, this approximation did not lead to degenerate behavior and produced \nuseful models. \n\nBy setting columns of Λ equal to the derivatives of μ with respect to continuous \ntransformation parameters, a TCA can accommodate both a local linear \napproximation and a discrete approximation to the transformation manifold. \n\n2.3 Mixtures of transformed component analyzers (MTCA). A combination \nof a TMG and a TCA can be used to jointly model clusters, linear components \nand transformations. Alternatively, a mixture of Gaussians that is invariant \nto a discrete set of transformations and locally linear transformations can be \nobtained by combining a TMG with a TCA whose components are all set equal to \ntransformation derivatives. \n\nThe joint distribution for the combined model in Fig. 1e is \n\np(x, ℓ, z, c, y) = N(x; G_ℓ z, Ψ) N(z; μ_c + Λ_c y, Φ_c) N(y; 0, I) ρ_{ℓc} π_c. (7) \n\nThe cluster/transformation likelihood is p(x|ℓ, c) = N(x; G_ℓ μ_c, G_ℓ(Λ_c Λ_c^T + \nΦ_c)G_ℓ^T + Ψ), which can be approximated in linear time as for TCA. \n\n3 Mixed Transformed Component Analysis (MTCA) \n\nWe present an EM algorithm for MTCA; EM algorithms for TMG or TCA emerge \nby setting the number of factors to 0 or setting the number of clusters to 1. 
\nLet θ represent a parameter in the generative model. For i.i.d. data, the derivative \nof the log-likelihood of a training set x_1, ..., x_T with respect to θ can be written \n\n∂ log p(x_1, ..., x_T) / ∂θ = Σ_{t=1}^T E[ ∂/∂θ log p(x_t, c, ℓ, z, y) | x_t ], (8) \n\nwhere the expectation is taken over p(c, ℓ, z, y|x_t). The EM algorithm iteratively \nsolves for a new set of parameters using the old parameters to compute the \nexpectations. This procedure consistently increases the likelihood of the training data. \n\nBy setting (8) to 0 and solving for the new parameter values, we obtain update \nequations based on the expectations given in the Appendix. Notation: ⟨·⟩ = (1/T) Σ_{t=1}^T (·) \nis a sufficient statistic computed by averaging over the training set; diag(A) gives a \nvector containing the diagonal elements of matrix A; diag(a) gives a diagonal matrix \nwhose diagonal contains the elements of vector a; and a ∘ b gives the element-wise \nproduct of vectors a and b. Denoting the updated parameters by \"~\", we have \n\nπ̃_c = ⟨P(c|x_t)⟩, ρ̃_{ℓc} = ⟨P(ℓ|x_t, c)⟩, (9) \n\nμ̃_c = ⟨P(c|x_t) E[z − Λ_c y | x_t, c]⟩ / ⟨P(c|x_t)⟩, (10) \n\nΦ̃_c = diag(⟨P(c|x_t) E[(z − μ_c − Λ_c y) ∘ (z − μ_c − Λ_c y) | x_t, c]⟩) / ⟨P(c|x_t)⟩, (11) \n\nΨ̃ = diag(⟨E[(x_t − G_ℓ z) ∘ (x_t − G_ℓ z) | x_t]⟩), (12) \n\nΛ̃_c = ⟨P(c|x_t) E[(z − μ_c) y^T | x_t]⟩ ⟨P(c|x_t) E[y y^T | x_t]⟩^{-1}. (13) \n\nTo reduce the number of parameters, we will sometimes assume ρ_{ℓc} does not depend \non c or even that ρ_{ℓc} is held constant at a uniform distribution. \n\n4 Experiments \n\n4.1 Filtering Images from a Scanning Electron Microscope (SEM). \nSEM images (e.g., Fig. 2a) can have a very low signal to noise ratio due to a \nhigh variance in electron emission rate and modulation of this variance by the \nimaged material (Golem and Cohen, 1998). 
To reduce noise, multiple images are \nusually averaged and the pixel variances can be used to estimate certainty in \nrendered structures. Fig. 2b shows the estimated means and variances of the pixels \nfrom 230 SEM images of 140 x 56 pixels like the ones in Fig. 2a. In fact, averaging images \ndoes not take into account spatial uncertainties and filtering in the imaging process \nintroduced by the electron detectors and the high-speed electrical circuits. \n\nWe trained a single-cluster TMG with 5 horizontal shifts and 5 vertical shifts on \nthe 230 SEM images using 30 iterations of EM. To keep the number of parameters \nalmost equal to the number of parameters estimated using simple averaging, the \ntransformation probabilities were not learned and the pixel variances in the observed \nimage were set equal after each M step. So, the TMG had 1 more parameter. Fig. 2c \nshows the mean and variance learned by the TMG. Compared to simple averaging, \nthe TMG finds sharper, more detailed structure. The variances are significantly \nlower, indicating that the TMG produces a more confident estimate of the image. \n\nFigure 2: (a) 140 x 56 pixel SEM images. (b) The mean and variance of the image pixels. \n(c) The mean and variance found by a TMG reveal more structure and less uncertainty. \n\n... | x_t, ℓ, c] + diag(Λ_c β_{ℓ,c} Λ_c') − 2 diag(Λ_c E[(z − μ_c) y' | x_t, ℓ, c]) + (Λ_c E[y | x_t, ℓ, c]) ∘ (Λ_c E[y | x_t, ℓ, c])}, \nE[(x_t − G_ℓ z) ∘ (x_t − G_ℓ z) | x_t] = Σ_{c,ℓ} P(c, ℓ|x_t) {(x_t − G_ℓ E[z | x_t, ℓ, c]) ∘ (x_t − G_ℓ E[z | x_t, ℓ, c]) + diag(G_ℓ Ω_{ℓ,c} G_ℓ') + diag(G_ℓ Ω_{ℓ,c} Φ_c^{-1} Λ_c β_{ℓ,c} Λ_c' Φ_c^{-1} Ω_{ℓ,c} G_ℓ')}, \nP(c|x_t) E[(z − μ) y' | x_t, c] = Σ_ℓ P(c, ℓ|x_t) E[(z − μ) y' | x_t, ℓ, c], \nP(c|x_t) E[y y' | x_t, c] = Σ_ℓ P(c, ℓ|x_t) β_{ℓ,c} + Σ_ℓ P(c, ℓ|x_t) E[y | x_t, ℓ, c] E[y | x_t, ℓ, c]'. 
\n\n\f", "award": [], "sourceid": 1729, "authors": [{"given_name": "Nebojsa", "family_name": "Jojic", "institution": null}, {"given_name": "Brendan", "family_name": "Frey", "institution": null}]}