{"title": "Generalized\u00b2 Linear\u00b2 Models", "book": "Advances in Neural Information Processing Systems", "page_first": 593, "page_last": 600, "abstract": "", "full_text": "Generalized2 Linear2 Models \n\nGeoffrey J. Gordon \nggordon@es.emu.edu \n\nAbstract \n\nWe introduce the Generalized2 Linear2 Model, a statistical estima(cid:173)\ntor which combines features of nonlinear regression and factor anal(cid:173)\nysis. A (GL)2M approximately decomposes a rectangular matrix \nX into a simpler representation j(g(A)h(B)). Here A and Bare \nlow-rank matrices, while j, g, and h are link functions. (GL)2Ms \ninclude many useful models as special cases, including principal \ncomponents analysis, exponential-family peA, the infomax formu(cid:173)\nlation of independent components analysis, linear regression, and \ngeneralized linear models. They also include new and interesting \nspecial cases, one of which we describe below. We also present an \niterative procedure which optimizes the parameters of a (GL)2M. \nThis procedure reduces to well-known algorithms for some of the \nspecial cases listed above; for other special cases, it is new. \n\n1 \n\nIntroduction \n\nLet the m x n matrix X contain an independent sample from some unknown distri(cid:173)\nbution. Each column of X represents a training example, and each row represents a \nmeasured feature of the examples. It is often reasonable to assume that some of the \nfeatures are redundant, that is, that there exists a reduced set of I features which \ncontains all or most of the information in X. \nIf the reduced features are linear functions of the original features and the distri(cid:173)\nbutions of the elements of X are Gaussian, redundancy means we can write X as \nthe product of two smaller matrices U and V with small sum of squared errors. 
\nThis factorization is essentially a singular value decomposition: U must span the \nfirst I dimensions of the left principal subspace of X, while V T must span the first \nI dimensions of the right principal subspace. (Since the above requirements do not \nuniquely determine U and V, the SVD traditionally imposes additional restrictions \nwhich we will ignore in this paper.) \n\nThe SVD has a long list of successes in machine learning, including information \nretrieval applications such as latent semantic analysis [1] and link analysis [2]; pat(cid:173)\ntern recognition applications such as \"eigenfaces\" [3]; structure from motion al(cid:173)\ngorithms [4]; and data compression tools [5]. Unfortunately, the SVD makes two \nassumptions which can limit its accuracy as a learning tool. \n\nThe first assumption is the use of the sum of squared errors between X and UV as \na loss function. Squared error loss means that predicting 1000 when the answer is \n1010 is as bad as saying -7 when the answer is 3. The second assumption is that \n\n\fthe reduced features are linear functions of the original features. Instead, X might \nbe a nonlinear function of UV, and U and V might be nonlinear functions of some \nother matrices A and B. To address these shortcomings, we propose the model \n\nx = f(g(A)h(B)) \n\n(1) \n\nfor the expected value of X. We also propose allowing non-quadratic loss functions \nfor the error (X - X) and the parameter matrices A and B. The fixed functions \n\nare called link functions. By analogy to generalized linear models [6], we call equa(cid:173)\ntion (1) a Generalized2 Linear2 Model: generalized2 because it uses link functions \nfor the parameters A and B as well as the prediction X, and linear2 because like \nthe SVD it is bilinear. \n\nAs long as we choose link and loss functions that match each other (see below for the \ndefinition of matching link and loss), there will exist efficient algorithms for finding \nA and B given X, f, g, and h. 
Because (1) is a generalization of the SVD, (GL)²Ms are drop-in replacements for SVDs in all of the applications mentioned above, with better reconstruction performance when the SVD's error model is inaccurate. In addition, they open up new applications (see section 7 for one) where an SVD would have been unable to provide a sufficiently accurate reconstruction.

2 Matching link and loss functions

Whenever we try to optimize the predictions of a nonlinear model, we need to worry about getting stuck in local minima. One example of this problem is when we try to fit a single sigmoid unit with parameters θ ∈ R^d to training inputs x_i ∈ R^d and target outputs y_i ∈ R under squared error loss:

ŷ_i = logistic(z_i)    z_i = x_i · θ    L = Σ_i (y_i - ŷ_i)²

Even for small training sets, the number of local minima of L can grow exponentially with the dimension d [7]. On the other hand, if we optimize the same predictions ŷ_i under the logarithmic loss function -Σ_i [y_i log ŷ_i + (1 - y_i) log(1 - ŷ_i)] instead of squared error, our optimization problem is convex. Because the logistic link works with the log loss to produce a convex optimization problem, we say they match each other [7]. Matching link-loss pairs are important because minimizing a convex loss function is usually far easier than minimizing a nonconvex one.

We can use any convex function F(z) to generate a matching pair of link and loss functions. The loss function which corresponds to F is

D_F(z | y) = F(z) - y · z + F*(y)   (2)

where F*(y) is defined so that min_z D_F(z | y) = 0. (F* is the convex dual of F [8], and D_F is the generalized Bregman divergence from z to y [9].)

Expression (2) is nonnegative, and it is globally convex in all of the z_i's (and therefore also in θ, since each z_i is a linear function of θ). If we write f for the gradient of F, the derivative of (2) with respect to z_i is f(z_i) - y_i.
So, (2) will be zero if and only if y_i = f(z_i) for all i; in other words, using the loss (2) implies that ŷ_i = f(z_i) is our best prediction of y_i, and f is therefore our matching link function.

We will need two facts about convex duals below. The first is that F* is always convex, and the second is that the gradient of F* is equal to f^-1. (Also, convex duality is defined even when F, G, and H aren't differentiable. If they are not, replace derivatives by subgradients below.)

3 Loss functions for (GL)²Ms

In (GL)²Ms, matching loss functions will be particularly important because we need to deal with three separate nonlinear link functions. We will usually not be able to avoid local minima entirely; instead, the overall loss function will be convex in some groups of parameters if we hold the remaining parameters fixed.

We will specify a (GL)²M by picking three link functions and their matching loss functions. We can then combine these individual loss functions into an overall loss function as described in section 4; fitting a (GL)²M will then reduce to minimizing the overall loss function with respect to our parameters. Each choice of links results in a different (GL)²M and therefore potentially a different decomposition of X.

The choice of link functions is where we should inject our domain knowledge about what sort of noise there is in X and which parameter matrices A and B are a priori most likely. Useful link functions include f(x) = x (corresponding to squared error and Gaussian noise), f(x) = log x (unnormalized KL-divergence and Poisson noise), and f(x) = (1 + e^-x)^-1 (log-loss and Bernoulli noise).

The loss functions themselves are only necessary for the analysis; all of our algorithms need only the link functions and (in some cases) their derivatives.
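The matching-pair construction (2) can be checked numerically. Below is a sketch assuming F(z) = log(1 + e^z), whose gradient is the logistic link and whose induced divergence is the log loss up to a term depending only on y; the function names are ours, not the paper's.

```python
import math

def F(z):                      # convex potential F(z) = log(1 + e^z)
    return math.log1p(math.exp(z))

def f(z):                      # logistic link = gradient of F
    return 1.0 / (1.0 + math.exp(-z))

def Fstar(y):                  # convex dual of F, valid for 0 < y < 1
    return y * math.log(y) + (1 - y) * math.log(1 - y)

def D(z, y):                   # generalized Bregman divergence (2)
    return F(z) - y * z + Fstar(y)

# D is nonnegative, and zero exactly when y = f(z):
y = 0.3
z = math.log(y / (1 - y))      # the z with f(z) = y
assert abs(D(z, y)) < 1e-12
assert D(z + 1.0, y) > 0 and D(z - 1.0, y) > 0
```

The same recipe works for any strictly convex F: differentiate to get the link, dualize to get the constant that makes the minimum of the divergence zero.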
So, we can pick the loss functions and differentiate to get the matching link functions; or, we can pick the link functions directly and not worry about the corresponding loss functions. In order for our analysis to apply, the link functions must be derivatives of some (possibly unknown) convex functions.

Our loss functions are D_F, D_G, and D_H, where

F : R^(m x n) -> R    G : R^(m x l) -> R    H : R^(l x n) -> R

are convex functions. We will abuse notation and call F, G, and H loss functions as well: F is the prediction loss, and its derivative f is the prediction link; it provides our model of the noise in X. G and H are the parameter losses, and their derivatives g and h are the parameter links; they tell us which values of A and B are a priori most likely. By convention, since F takes an m x n matrix argument, we will say that the input and output of f are also m x n matrices (similarly for g and h).

4 The model and its fixed point equations

We will define a (GL)²M by specifying an overall loss function which relates the parameter matrices A and B to the data matrix X. If we write U = g(A) and V = h(B), the (GL)²M loss function is

L(U, V) = F(UV) - X ∘ UV + G*(U) + H*(V)   (3)

Here A ∘ B is the \"matrix dot product,\" often written tr(A^T B).

Expression (3) is a sum of three Bregman divergences, ignoring terms which don't depend on U and V: it is D_F(UV | X) + D_G(0 | U) + D_H(0 | V). The F-divergence tends to pull UV towards X, while the G- and H-divergences favor small U and V.

To further justify (3), we can examine what happens when we compute its derivatives with respect to U and V and set them to 0. The result is a set of fixed-point equations that the optimal parameter settings must satisfy:

U^T (X - f(UV)) = B   (4)
(X - f(UV)) V^T = A   (5)

To understand these equations, we can consider two special cases.
First, if we let G* go to zero (so there is no pressure to keep U and V small), (4) becomes

U^T (X - f(UV)) = 0   (6)

which tells us that each column of the error matrix must be orthogonal to each column of U. Second, if we set the prediction link to be f(UV) = UV, (6) becomes

U^T U V = U^T X

which tells us that for a given U, we must choose V so that UV reconstructs X with the smallest possible sum of squared errors.

5 Algorithms for fitting (GL)²Ms

We could solve equations (4-5) with any of several different algorithms. For example, we could use gradient descent on either U, V or A, B. Or, we could use the generalized gradient descent [9] update rule (with learning rate α):

A ← α (X - f(UV)) V^T    B ← α U^T (X - f(UV))

The advantage of these algorithms is that they are simple to implement and don't require additional assumptions on F, G, and H. They can even work when F, G, and H are nondifferentiable, by using subgradients.

In this paper, though, we will focus on a different algorithm. Our algorithm is based on Newton's method, and it reduces to well-known algorithms for several special cases of (GL)²Ms. Of course, since the end goal is solving (4-5), this algorithm will not always be the method of choice; instead, any given implementation of a (GL)²M should use the simplest algorithm that works.

For our Newton algorithm we will need to place some restrictions on the prediction and parameter loss functions. (These restrictions are only necessary for the Newton algorithm; more general loss functions still give valid (GL)²Ms, but require different algorithms.) First, we will require (4-5) to be differentiable. Second, we will restrict

F(Z) = Σ_ij F_ij(Z_ij)    G(A) = Σ_i G_i(A_i.)    H(B) = Σ_j H_j(B_.j)

These definitions fix most of the second derivatives of L(U, V) to be zero, simplifying and speeding up computation. Write f_ij, g_i, and h_j for the respective derivatives.
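Before the Newton machinery, the simpler gradient-based fitting option mentioned above can be sketched for the Gaussian special case (identity links, weak quadratic parameter losses). All names, sizes, and constants below are our own illustrative choices, not the paper's:

```python
import numpy as np

# Plain gradient descent on U and V for the Gaussian (GL)^2M:
# f = identity, so the overall loss is ||X - UV||^2 / 2 plus a weak
# ridge term standing in for G* and H*.
rng = np.random.default_rng(2)
m, n, l = 8, 30, 3
X = rng.normal(size=(m, l)) @ rng.normal(size=(l, n))  # exactly rank l

U = 0.1 * rng.normal(size=(m, l))
V = 0.1 * rng.normal(size=(l, n))
lam, lr = 1e-6, 0.005            # ridge strength and step size
for _ in range(20000):
    R = X - U @ V                # residual X - f(UV) with f = identity
    U += lr * (R @ V.T - lam * U)
    V += lr * (U.T @ R - lam * V)

# The reconstruction error should end up a tiny fraction of ||X||^2.
assert np.sum((X - U @ V) ** 2) < 1e-2 * np.sum(X ** 2)
```

For non-identity links the same loop applies with `R = X - f(U @ V)` and the chain rule through g and h; as the text notes, a fixed step size may then need to shrink when the loss stops decreasing.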
\nWith these restrictions, we can linearize (4) and (5) around our current guess at \nthe parameters, then solve the resulting equations to find updated parameters. To \nsimplify notation, we can think of (4) as j separate equations, one for each column \nof V. Linearizing with respect to Vj gives: \n\n(UT DjU + Hj)(Vr w - Vj) = UT(X.j - f.j(UV j )) - B. j \n\nwhere the l x l matrix H j is the Hessian of Hi at V j ' or equivalently the inverse of \nthe Hessian of Hj at B.j; and the m x m diagonal matrix Dj contains the second \nderivatives of F with respect to the jth column of its argument. That is, \n\nNow, collecting terms involving Vjew yields: \n\n\fWe can recognize (7) as a weighted least squares problem with weights VJ5j, prior \nprecision H j , prior mean Vj + H j1 B-j , and target outputs \n\nUV j + Dj1(x.j -\n\nf(UV j )) \n\nSimilarly, we can linearize with respect to rows of U to find the equation \n\nUreW(VDiVT + G i ) = ((Xi. - j;.(Ui.V))Di1 + Ui.V)DiVT + Ui. G i - Ai. \n\n(8) \nwhere G i is the Hessian of Gi and Di contains the second derivatives of F with \nrespect to the ith row of its argument. \n\nWe can solve one copy of (7) simultaneously for each column of V, then replace V \nby vnew. Next we can solve one copy of (8) simultaneously for each row of U, then \nreplace U by unew. Alternating between these two updates will tend to reduce (3).1 \n\n6 Related models \n\nThere are many important special cases of (GL)2Ms. We derive two in this section; \nothers include principal components analysis, \"sensible\" PCA, linear regression, \ngeneralized linear models, and the weighted majority algorithm. (Our Newton al(cid:173)\ngorithm turns into power iteration for PCA and iteratively-reweighted least squares \nfor GLMs.) (GL)2Ms are related to generalized bilinear models; the latter include \nsome of the above special cases, but not ICA, weighted majority, or the example of \nsection 7. 
There are natural generalizations of (GL)²Ms to multilinear interactions. Finally, some models such as non-negative matrix factorization [10] and generalized low-rank approximation [11] are cousins of (GL)²Ms: they use a loss function which is convex in either factor with the other fixed, but which is not a Bregman divergence.

6.1 Independent components analysis

In ICA, we assume that there is a hidden matrix V (the same size as X) of independent random variables, and that X was generated from V by applying a square matrix U. We seek to recover the mixing matrix U and the sources V; in other words, we want to decompose X = UV so that the elements of V are as nearly independent as possible.

The infomax algorithm for ICA assumes that the elements of V follow distributions with heavy tails (i.e., high kurtosis). This assumption helps us find independent components because the sum of two heavy-tailed random variables tends to have lighter tails, so we can search for U by trying to make the elements of V follow a heavy-tailed distribution.

In our notation, the fixed point of the infomax algorithm for ICA is

-U^T = tanh(V) X^T   (9)

(see, e.g., equation (11) or (13) of [12]). To reproduce (9), we will let the prediction link f be the identity, and we will let the duals of the parameter loss functions be

G*(U) = -ε log det U    H*(V) = ε Σ_ij log cosh V_ij

where ε is a small positive real number.

¹To guarantee convergence, we can check that (3) decreases and reduce our step size if we encounter problems. (Since U^T D_j U, H_j, V D_i V^T, and G_i are all positive definite, the Newton update directions are descent directions; so, there always exists a small enough step size.) We have not found this check necessary in practice.
Then equations (4) and (5) become

U^T (X - UV) = ε tanh(V)   (10)
(X - UV) V^T = -ε U^-T   (11)

since the derivative of log cosh v is tanh v and the derivative of log det U is U^-T. Right-multiplying (10) by (UV)^T and substituting in (11) yields

-U^T = tanh(V) (UV)^T   (12)

Now since UV → X as ε → 0, (12) is equivalent to (9) in the limit of vanishing ε.

6.2 Exponential family PCA

To duplicate exponential family PCA [13], we can set the prediction link f arbitrarily and let the parameter links g and h be large multiples of the identity. Our Newton algorithm is applicable under the assumptions of [13], and (7) becomes

(U^T D_j U) V_.j^new = U^T D_j (U V_.j + D_j^-1 (X_.j - f_.j(U V_.j)))   (13)

Equation (13), along with the corresponding modification of (8), should provide a much faster algorithm than the one proposed in [13], which updates only part of U or V at a time and keeps updating the same part until convergence before moving on to the next one.

7 Example: robot belief states

Figure 1 shows a map of a corridor in the CMU CS building. A robot navigating in this corridor can sense both side walls and compute an accurate estimate of its lateral position. Unless it is near an observable feature such as the lab door near the middle of the corridor, however, it can't accurately resolve its position along the corridor and it can't tell whether it is pointing left or right.

In order to plan to achieve a goal in this environment, the robot must maintain a belief state (a probability distribution representing its best information about the unobserved state variables). The map shows the robot's starting belief state: it is at one end of the corridor facing in, but it doesn't know which end. We collected a training set of 400 belief states by driving the robot along the corridor and feeding its sensor readings to a belief tracker [14]. To simulate a larger environment with greater uncertainty, we artificially reduced sensor range and increased error.
\nFigure 1 shows two of the collected beliefs. \n\nPlanning is difficult because belief states are high-dimensional: even in this simple \nworld there are 550 states (275 positions at lOcm intervals along the corridor x 2 \norientations), so a belief is a vector in ]R550. Fortunately, the robot never encounters \nmost belief states. This regularity can make planning tractable: if we can identify \na few features which extract the important information from belief states, we can \nplan in low-dimensional feature space instead of high-dimensional belief space. \nWe factored the matrix of belief states using feature space ranks l = 3,4, 5. For the \nprediction link f(Z) we used exp(Z) (componentwise); this link ensures that the \npredicted beliefs are positive, and treats errors in small probabilities as proportion(cid:173)\nally more important than errors in large ones. (The matching loss for f is a Poisson \nlog-likelihood or unnormalized KL-divergence.) For the parameter link h we used \n10 12 I, corresponding to H* = lO- 12 11V11 2 /2 (a weak bias towards small V). \n\n\f~,~I ~,A~\"\"~, _ -----=-----' \n~,:------------c.\n1 \n,. A. 1 ~,~I \n\n-------::::c-,f\\~\"'~\\ _ ~ \n-------c:L,. A-----\"-----..t\n\n~,~I -------c:Lj~\\ .~\\ _ -----=-----,1 \n~t \n\n,/____,________=_\\ -----=-----' \n\nFigure 1: Belief states. Left panel: overhead map of corridor with initial belief b1 ; \nbelief state bso (just before robot finds out which direction it's pointing); belief bgo \n(just after finding out). Right panel: reconstruction of bso with 3, 4, and 5 features. \n\nWe set G* = 1O- 1211U11 2 j2 +6..(U), where 6.. is 0 when the first column of U contains \nall Is and 00 otherwise. This loss function fixes the first column of U, representing \nour knowledge that one feature should be a normalizing constant so that each belief \nsums to 1. 
The subgradient of G* is 10^-12 U + [k, 0], so equation (5) becomes

(X - exp(UV)) V^T = 10^-12 U + [k, 0]

Here [k, 0] is a matrix with an arbitrary first column and all other elements 0; this matrix has enough degrees of freedom to compensate for the constraints on U.

Our Newton algorithm handles this modified fixed point equation without difficulty. So, this (GL)²M is a principled and efficient way to decompose a matrix of probability distributions. So far as we know, this model and algorithm have not been described in the literature.

Figure 1 shows our reconstructions of a representative belief state using l = 3, 4, 5 features (one of which is a normalizing constant that can be discarded for planning). The l = 5 reconstruction is consistently good across all 400 beliefs, while the l = 4 reconstruction has minor artifacts for some beliefs. A small number of restarts is required to achieve good decompositions for l = 3, where the optimization problem is most constrained. For comparison, a traditional SVD requires a matrix of rank about 25 to achieve the same mean-squared reconstruction error as our rank-3 decomposition. It requires rank about 85 to match our rank-5 decomposition.

Examination of the learned U matrix (not shown) for l = 4 reveals that the corridor is mapped into two smooth curves in feature space, one for each orientation. Corresponding states with opposite orientations are mapped into similar feature vectors for one half of the corridor (where the training beliefs were sometimes confused about orientation) but not the other (where there were no training beliefs that indicated any connection between orientations). Reconstruction artifacts occur when a curve nearly self-intersects and causes confusion between states.
This self-intersection happens because of local minima in the loss function; with more flexibility (l = 5) the optimizer is able to untangle the curves and avoid self-intersection.

Our success in compressing the belief state translates directly into success in planning; see [15] for details. By comparison, traditional SVD on either the beliefs or the log beliefs produces feature sets which are unusable for planning because they do not achieve sufficiently good reconstruction with few enough features.

8 Discussion

We have introduced a new general class of nonlinear regression and factor analysis models, presenting a derivation and algorithm, reductions to previously-known special cases, and a practical example. The model is a drop-in replacement for PCA, but can provide much better reconstruction performance in cases where the PCA error model is inaccurate. Future research includes online algorithms for parameter adjustment, extensions for missing data, and exploration of new link functions.

Acknowledgments

Thanks to Nick Roy for helpful comments and for providing the data analyzed in section 7. This work was supported by AFRL contract F30602-01-C-0219, DARPA's MICA program, and by AFRL contract F30602-98-2-0137, DARPA's CoABS program. The opinions and conclusions are the author's and do not reflect those of the US government or its agencies.

References

[1] T. K. Landauer, P. W. Foltz, and D. Laham. Introduction to latent semantic analysis. Discourse Processes, 25:259-284, 1998.

[2] Jon M. Kleinberg. Authoritative sources in a hyperlinked environment. Journal of the ACM, 46(5):604-632, 1999.

[3] M. Turk and A. Pentland. Eigenfaces for recognition. Journal of Cognitive Neuroscience, 3(1):71-86, 1991.

[4] Carlo Tomasi and Takeo Kanade. Shape and motion from image streams under orthography: a factorization method. Int. J.
Computer Vision, 9(2):137-154, 1992.

[5] D. P. O'Leary and S. Peleg. Digital image compression by outer product expansion. IEEE Trans. Communications, 31:441-444, 1983.

[6] P. McCullagh and J. A. Nelder. Generalized Linear Models. Chapman & Hall, London, 2nd edition, 1983.

[7] Peter Auer, Mark Herbster, and Manfred K. Warmuth. Exponentially many local minima for single neurons. In NIPS, vol. 8. MIT Press, 1996.

[8] R. Tyrrell Rockafellar. Convex Analysis. Princeton University Press, New Jersey, 1970.

[9] Geoffrey J. Gordon. Approximate Solutions to Markov Decision Processes. PhD thesis, Carnegie Mellon University, 1999.

[10] Daniel Lee and H. Sebastian Seung. Algorithms for nonnegative matrix factorization. In NIPS, vol. 13. MIT Press, 2001.

[11] Nathan Srebro. Personal communication, 2002.

[12] Anthony J. Bell and Terrence J. Sejnowski. The 'independent components' of natural scenes are edge filters. Vision Research, 37(23):3327-3338, 1997.

[13] Michael Collins, Sanjoy Dasgupta, and Robert Schapire. A generalization of principal component analysis to the exponential family. In NIPS, vol. 14. MIT Press, 2002.

[14] D. Fox, W. Burgard, F. Dellaert, and S. Thrun. Monte Carlo localization: Efficient position estimation for mobile robots. In AAAI, 1999.

[15] Nicholas Roy and Geoffrey J. Gordon. Exponential family PCA for belief compression in POMDPs. In NIPS, vol. 15. MIT Press, 2003.

[16] Sam Roweis. EM algorithms for PCA and SPCA. In NIPS, vol. 10. MIT Press, 1998.
", "award": [], "sourceid": 2144, "authors": [{"given_name": "Geoffrey", "family_name": "Gordon", "institution": null}]}