Efficient Inference for Distributions on Permutations

Advances in Neural Information Processing Systems, pp. 697-704.

Jonathan Huang
Carnegie Mellon University
jch1@cs.cmu.edu

Carlos Guestrin
Carnegie Mellon University
guestrin@cs.cmu.edu

Leonidas Guibas
Stanford University
guibas@cs.stanford.edu

Abstract

Permutations are ubiquitous in many real world problems, such as voting, rankings and data association. Representing uncertainty over permutations is challenging, since there are n! possibilities, and typical compact representations such as graphical models cannot efficiently capture the mutual exclusivity constraints associated with permutations. In this paper, we use the "low-frequency" terms of a Fourier decomposition to represent such distributions compactly. We present Kronecker conditioning, a general and efficient approach for maintaining these distributions directly in the Fourier domain. Low order Fourier-based approximations can lead to functions that do not correspond to valid distributions. To address this problem, we present an efficient quadratic program defined directly in the Fourier domain to project the approximation onto a relaxed form of the marginal polytope. We demonstrate the effectiveness of our approach on a real camera-based multi-person tracking setting.

1 Introduction

Permutations arise naturally in a variety of real situations such as card games, data association problems, ranking analysis, etc. As an example, consider a sensor network that tracks the positions of n people, but can only gather identity information when they walk near certain sensors.
Such mixed-modality sensor networks are an attractive alternative to exclusively using sensors which can measure identity, because they are potentially cheaper, easier to deploy, and less intrusive; see [1] for a real deployment. A typical tracking system maintains tracks of n people and the identity of the person corresponding to each track. What makes the problem difficult is that identities can be confused when tracks cross in what we call mixing events. Maintaining accurate track-to-identity assignments in the face of these ambiguities based on identity measurements is known as the Identity Management Problem [2], and is known to be NP-hard. Permutations pose a challenge for probabilistic inference, because distributions on the group of permutations on n elements require storing at least n! - 1 numbers, which quickly becomes infeasible as n increases. Furthermore, typical compact representations, such as graphical models, cannot capture the mutual exclusivity constraints associated with permutations.

Diaconis [3] proposes maintaining a small subset of Fourier coefficients of the actual distribution, allowing for a principled tradeoff between accuracy and complexity. Schumitsch et al. [4] use similar ideas to maintain a particular subset of Fourier coefficients of the log probability distribution. Kondor et al. [5] allow for general sets of coefficients, but assume a restrictive form of the observation model in order to exploit an efficient FFT factorization. The main contributions of this paper are:

• A new, simple and general algorithm, Kronecker Conditioning, which performs all probabilistic inference operations completely in the Fourier domain.
Our approach is general, in the sense that it can address any transition model or likelihood function that can be represented in the Fourier domain, such as those used in previous work, and can represent the probability distribution with any desired set of Fourier coefficients.

• We show that approximate conditioning can sometimes yield Fourier coefficients which do not correspond to any valid distribution, and present a method for projecting the result back onto a relaxation of the marginal polytope.

• We demonstrate the effectiveness of our approach on a real camera-based multi-person tracking setting.

2 Filtering over permutations

In identity management, a permutation σ represents a joint assignment of identities to internal tracks, with σ(i) being the track belonging to the ith identity. When people walk too closely together, their identities can be confused, leading to uncertainty over σ. To model this uncertainty, we use a Hidden Markov Model on permutations, which is a joint distribution P(σ^(1), ..., σ^(T), z^(1), ..., z^(T)) that factors as:

P(σ^(1), ..., σ^(T), z^(1), ..., z^(T)) = P(z^(1) | σ^(1)) ∏_{t=2}^{T} P(z^(t) | σ^(t)) · P(σ^(t) | σ^(t-1)),

where the σ^(t) are latent permutations and the z^(t) denote observed variables. The conditional probability distribution P(σ^(t) | σ^(t-1)) is called the transition model, and might reflect, for example, that the identities belonging to two tracks were swapped with some probability. The distribution P(z^(t) | σ^(t)) is called the observation model, which might capture a distribution over the color of clothing for each individual.

We focus on filtering, in which one queries the HMM for the posterior at some timestep, conditioned on all past observations. Given the distribution P(σ^(t) | z^(1), ..., z^(t)), we recursively compute P(σ^(t+1) | z^(1), ...
, z^(t+1)) in two steps: a prediction/rollup step and a conditioning step. The first updates the distribution by multiplying by the transition model and marginalizing out the previous timestep:

P(σ^(t+1) | z^(1), ..., z^(t)) = Σ_{σ^(t)} P(σ^(t+1) | σ^(t)) P(σ^(t) | z^(1), ..., z^(t)).

The second conditions the distribution on an observation z^(t+1) using Bayes rule:

P(σ^(t+1) | z^(1), ..., z^(t+1)) ∝ P(z^(t+1) | σ^(t+1)) P(σ^(t+1) | z^(1), ..., z^(t)).

Since there are n! permutations, a single update requires O((n!)^2) flops and is consequently intractable for all but very small n. The approach that we advocate is to maintain a compact approximation to the true distribution based on the Fourier transform. As we discuss later, the Fourier-based approximation is equivalent to maintaining a set of low-order marginals, rather than the full joint, which we regard as being analogous to an Assumed Density Filter [6].

3 Fourier projections of functions on the Symmetric Group

Over the last 50 years, the Fourier Transform has been ubiquitously applied to everything digital, particularly with the invention of the Fast Fourier Transform. On the real line, the Fourier Transform is a well-studied method for decomposing a function into a sum of sine and cosine terms over a spectrum of frequencies. Perhaps less familiar is its group theoretic generalization, which we review in this section with an eye towards approximating functions on the group of permutations, the Symmetric Group. For permutations on n objects, the Symmetric Group will be abbreviated Sn. The formal definition of the Fourier Transform relies on the theory of group representations, which we briefly discuss first. Our goal in this section is to motivate the idea that the Fourier transform of a distribution P is related to certain marginals of P. For references on this subject, see [3].

Definition 1.
A representation of a group G is a map ρ from G to a set of invertible d_ρ × d_ρ matrix operators which preserves algebraic structure, in the sense that for all σ1, σ2 ∈ G, ρ(σ1 σ2) = ρ(σ1) · ρ(σ2). The matrices which lie in the image of this map are called the representation matrices, and we will refer to d_ρ as the degree of the representation.

Representations play the role of basis functions, similar to that of sinusoids, in Fourier theory. The simplest basis functions are constant functions, and our first example of a representation is the trivial representation ρ0 : G → R, which maps every element of G to 1. As a more pertinent example, we define the 1st order permutation representation of Sn to be the degree n representation, τ1, which maps a permutation σ to its corresponding permutation matrix, given by [τ1(σ)]_ij = 1{σ(j) = i}. For example, the permutation in S3 which swaps the second and third elements maps to:

τ1(1 ↦ 1, 2 ↦ 3, 3 ↦ 2) = \begin{pmatrix} 1 & 0 & 0 \\ 0 & 0 & 1 \\ 0 & 1 & 0 \end{pmatrix}.

The τ1 representation can be thought of as a collection of n^2 functions at once, one for each matrix entry [τ1(σ)]_ij. There are other possible permutation representations; for example, the 2nd order unordered permutation representation, τ2, is defined by the action of a permutation on unordered pairs of objects ([τ2(σ)]_{{i,j},{ℓ,k}} = 1{σ({ℓ,k}) = {i,j}}), and is a degree n(n-1)/2 representation. And the list goes on to include many more complicated representations.

It is useful to think of two representations as being the same if the representation matrices are equal up to some consistent change of basis.
This idea is formalized by declaring two representations ρ and τ to be equivalent if there exists an invertible matrix C such that C^{-1} · ρ(σ) · C = τ(σ) for all σ ∈ G. We write this as ρ ≡ τ.

Most representations can be seen as having been built up from smaller representations. We say that a representation ρ is reducible if there exist smaller representations ρ1, ρ2 such that ρ ≡ ρ1 ⊕ ρ2, where ⊕ is defined to be the direct sum representation:

ρ1 ⊕ ρ2(g) ≜ \begin{pmatrix} ρ1(g) & 0 \\ 0 & ρ2(g) \end{pmatrix}.     (1)

In general, there are infinitely many inequivalent representations. However, for any finite group, there is always a finite collection of atomic representations which can be used to build up any other representation using direct sums. These representations are referred to as the irreducibles of a group, and they are simply the collection of representations which are not reducible. We will refer to the set of irreducibles by R. It can be shown that any representation of a finite group G is equivalent to a direct sum of irreducibles [3], and hence, for any representation τ, there exists a matrix C for which C^{-1} · τ · C = ⊕_{ρi ∈ R} ⊕ ρi, where the inner ⊕ refers to some finite number of copies of the irreducible ρi.

Describing the irreducibles of Sn up to equivalence is a subject unto itself; we will simply say that there is a natural way to order the irreducibles of Sn that corresponds to 'simplicity' in the same way that low frequency sinusoids are simpler than higher frequency ones. We will refer to the irreducibles in this order as ρ0, ρ1, ....
For example, the first two irreducibles form the first order permutation representation (τ1 ≡ ρ0 ⊕ ρ1), and the second order permutation representation can be formed by the first 3 irreducibles.

Irreducible representation matrices are not always orthogonal, but they can always be chosen to be so (up to equivalence). For notational convenience, the irreducible representations in this paper will always be assumed to be orthogonal.

3.1 The Fourier transform

On the real line, the Fourier Transform corresponds to computing inner products of a function with sines and cosines at varying frequencies. The analogous definition for finite groups replaces the sinusoids by group representations.

Definition 2. Let f : G → R be any function on a group G and let ρ be any representation on G. The Fourier Transform of f at the representation ρ is defined to be: \hat{f}_ρ = Σ_σ f(σ) ρ(σ).

There are two important points which distinguish this Fourier Transform from the familiar version on the real line: it is matrix-valued, and instead of real numbers, the inputs to \hat{f} are representations of G. The collection of Fourier Transforms of f at all irreducibles forms the Fourier Transform of f. As in the familiar case, there is an inverse transform, given by:

f(σ) = (1/|G|) Σ_k d_{ρk} Tr[ \hat{f}_{ρk}^T · ρk(σ) ],     (2)

where k indexes over the collection of irreducibles of G.

We provide two examples for intuition. For functions on the real line, the Fourier Transform at zero gives the DC component of a signal. This is also true for functions on a group; if f : G → R is any function, then the Fourier Transform of f at the trivial representation is constant, with \hat{f}_{ρ0} = Σ_σ f(σ). Thus, for any probability distribution P, we have \hat{P}_{ρ0} = 1.
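Definition 2 can be made concrete with a small numerical sketch. The setup below is our own illustration (the distribution weights and helper names are hypothetical): it computes the Fourier transform of a distribution on S3 at the trivial and first-order permutation representations.

```python
import itertools

import numpy as np

n = 3
perms = list(itertools.permutations(range(n)))  # the n! elements of S3

def tau1(sigma):
    """First-order permutation representation: [tau1(sigma)]_ij = 1{sigma(j) = i}."""
    M = np.zeros((n, n))
    for j in range(n):
        M[sigma[j], j] = 1.0
    return M

# An arbitrary distribution P over S3 (weights made up for illustration).
rng = np.random.default_rng(0)
w = rng.random(len(perms))
P = dict(zip(perms, w / w.sum()))

# Transform at the trivial representation rho_0 (every sigma maps to 1):
P_hat_triv = sum(P[s] for s in perms)
# Transform at tau1: a matrix-valued Fourier coefficient.
P_hat_tau1 = sum(P[s] * tau1(s) for s in perms)

assert np.isclose(P_hat_triv, 1.0)  # hat{P}_{rho_0} = 1 for any distribution
```

Note that `P_hat_tau1` is a matrix, illustrating the matrix-valued nature of the transform.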
If P were the uniform distribution, then \hat{P}_ρ = 0 at all irreducibles except at the trivial representation. The Fourier Transform at τ1 also has a simple interpretation:

[\hat{f}_{τ1}]_ij = Σ_{σ ∈ Sn} f(σ) [τ1(σ)]_ij = Σ_{σ ∈ Sn} f(σ) 1{σ(j) = i} = Σ_{σ : σ(j) = i} f(σ).

Thus, if P is a distribution, then \hat{P}_{τ1} is a matrix of marginal probabilities, where the ij-th element is the marginal probability that a random permutation drawn from P maps element j to i. Similarly, the Fourier transform of P at the second order permutation representation is a matrix of marginal probabilities of the form P(σ({i,j}) = {k,ℓ}).

In Section 5, we will discuss function approximation by bandlimiting the Fourier coefficients, but this example should illustrate the fact that maintaining Fourier coefficients at low-order irreducibles is the same as maintaining low-order marginal probabilities, while higher order irreducibles correspond to more complicated marginals.

4 Inference in the Fourier domain

Bandlimiting allows for compactly storing a distribution over permutations, but the idea is rather moot if it becomes necessary to transform back to the primal domain each time an inference operation is called. Naively, the Fourier Transform on Sn scales as O((n!)^2), and even the fastest Fast Fourier Transforms for functions on Sn are no faster than O(n! log(n!)) (see [7] for example). To resolve this issue, we present a formulation of inference which operates solely in the Fourier domain, allowing us to avoid a costly transform. We begin by discussing exact inference in the Fourier domain, which is no more tractable than the original problem because there are n! Fourier coefficients, but it will allow us to discuss the bandlimiting approximation in the next section. There are two operations to consider: prediction/rollup, and conditioning.
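For scale, the exact filtering recursions of Section 2 can also be written down directly over all n! permutations. The brute-force sketch below (all names are ours; tractable only for tiny n) is exactly the computation the Fourier approach is designed to avoid.

```python
import itertools

import numpy as np

def compose(t, s):           # (t o s)(i) = t(s(i))
    return tuple(t[s[i]] for i in range(len(s)))

def inverse(s):
    inv = [0] * len(s)
    for i, si in enumerate(s):
        inv[si] = i
    return tuple(inv)

n = 3
perms = list(itertools.permutations(range(n)))
idx = {s: k for k, s in enumerate(perms)}

# Random-walk transition: sigma^{t+1} = tau * sigma^t, with tau ~ Q (uniform here).
Q = np.full(len(perms), 1.0 / len(perms))
prior = np.zeros(len(perms))
prior[idx[tuple(range(n))]] = 1.0    # start at the identity assignment

def prediction_rollup(P, Q):
    """P(s_new) = sum_s Q(s_new * s^{-1}) P(s) -- O((n!)^2) work."""
    out = np.zeros_like(P)
    for s_new in perms:
        out[idx[s_new]] = sum(Q[idx[compose(s_new, inverse(s))]] * P[idx[s]]
                              for s in perms)
    return out

def condition(P, likelihood):
    """Bayes rule: pointwise product followed by normalization."""
    post = P * likelihood
    return post / post.sum()

P1 = prediction_rollup(prior, Q)
# A hypothetical observation: identity 0 is on track 0 with probability 0.9.
L = np.array([0.9 if s[0] == 0 else 0.1 for s in perms])
P1 = condition(P1, L)
```

With the uniform mixing distribution above, the rollup step flattens the prior, so the posterior is proportional to the likelihood alone.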
The assumption for the rest of this section is that the Fourier Transforms of the transition and observation models are known. We discuss methods for obtaining the models in Section 7.

4.1 Fourier prediction/rollup

We will consider one particular type of transition model: that of a random walk over a group. This model assumes that σ^(t+1) is generated from σ^(t) by drawing a random permutation τ^(t) from some distribution Q^(t) and setting σ^(t+1) = τ^(t) σ^(t). In our identity management example, τ^(t) represents a random identity permutation that might occur among tracks when they get close to each other (a mixing event), but the random walk model appears in other applications such as modeling card shuffles [3]. The Fourier domain Prediction/Rollup step is easily formulated using the convolution theorem (see also [3]):

Proposition 3. Let Q and P be probability distributions on Sn. Define the convolution of Q and P to be the function [Q ∗ P](σ1) = Σ_{σ2} Q(σ1 · σ2^{-1}) P(σ2). Then for any representation ρ, [\widehat{Q ∗ P}]_ρ = \hat{Q}_ρ · \hat{P}_ρ, where the operation on the right side is matrix multiplication.

The Prediction/Rollup step for the random walk transition model can be written as a convolution:

P(σ^(t+1)) = Σ_{(σ^(t), τ^(t)) : σ^(t+1) = τ^(t) · σ^(t)} Q^(t)(τ^(t)) · P(σ^(t)) = Σ_{σ^(t)} Q^(t)(σ^(t+1) · (σ^(t))^{-1}) P(σ^(t)) = [Q^(t) ∗ P](σ^(t+1)).

Then, assuming that \hat{P}^{(t)}_ρ and \hat{Q}^{(t)}_ρ are given, the prediction/rollup update rule is simply:

\hat{P}^{(t+1)}_ρ ← \hat{Q}^{(t)}_ρ · \hat{P}^{(t)}_ρ.

Note that the update requires only knowledge of \hat{P} and does not require P.
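Proposition 3 is easy to verify numerically. The sketch below is our own check on S3: it compares a direct convolution against the product of Fourier coefficients at the first-order representation τ1 (the identity holds at any representation, not only irreducibles).

```python
import itertools

import numpy as np

n = 3
perms = list(itertools.permutations(range(n)))

def compose(t, s):
    return tuple(t[s[i]] for i in range(len(s)))

def inverse(s):
    inv = [0] * len(s)
    for i, si in enumerate(s):
        inv[si] = i
    return tuple(inv)

def tau1(s):
    M = np.zeros((n, n))
    for j in range(n):
        M[s[j], j] = 1.0
    return M

rng = np.random.default_rng(1)

def random_dist():
    w = rng.random(len(perms))
    return dict(zip(perms, w / w.sum()))

Q, P = random_dist(), random_dist()

# Direct convolution: [Q * P](s1) = sum_{s2} Q(s1 s2^{-1}) P(s2).
conv = {s1: sum(Q[compose(s1, inverse(s2))] * P[s2] for s2 in perms)
        for s1 in perms}

# Fourier transform at tau1, and the convolution theorem.
ft = lambda f: sum(p * tau1(s) for s, p in f.items())
assert np.allclose(ft(conv), ft(Q) @ ft(P))
```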
Furthermore, the update is pointwise in the Fourier domain, in the sense that the coefficients at the representation ρ affect \hat{P}^{(t+1)} only at ρ.

4.2 Fourier conditioning

An application of Bayes rule to find a posterior distribution P(σ|z) after observing some evidence z requires a pointwise product of likelihood L(z|σ) and prior P(σ), followed by a normalization step. We showed earlier that the normalization constant Σ_σ L(z|σ) · P(σ) is given by the Fourier transform of the pointwise product L^{(t)} P^{(t)} at the trivial representation, and therefore the normalization step of conditioning can be implemented by simply dividing each Fourier coefficient by the scalar [\widehat{L^{(t)} P^{(t)}}]_{ρ0}.

The pointwise product of two functions f and g, however, is trickier to formulate in the Fourier domain. For functions on the real line, the pointwise product of functions can be implemented by convolving the Fourier coefficients of \hat{f} and \hat{g}, and so a natural question is: can we apply a similar operation for functions over other groups? Our answer to this is that there is an analogous (but more complicated) notion of convolution in the Fourier domain of a general finite group. We present a convolution-based conditioning algorithm which we call Kronecker Conditioning, which, in contrast to the pointwise nature of the Fourier domain prediction/rollup step, and much like convolution, smears the information at an irreducible ρk to other irreducibles.

Fourier transforming the pointwise product. Our approach to Fourier Transforming the pointwise product in terms of \hat{f} and \hat{g} is to manipulate the function f(σ)g(σ) so that it can be seen as the result of an inverse Fourier Transform.
Hence, the goal will be to find matrices Ak (as a function of \hat{f}, \hat{g}) such that for any σ ∈ G,

f(σ) · g(σ) = (1/|G|) Σ_k d_{ρk} Tr( A_k^T · ρk(σ) ),     (3)

where A_k = [\widehat{fg}]_{ρk}. For any σ ∈ G we can write the pointwise product in terms of \hat{f} and \hat{g} using the inverse Fourier Transform (Equation 2):

f(σ) · g(σ) = [ (1/|G|) Σ_i d_{ρi} Tr( \hat{f}_{ρi}^T · ρi(σ) ) ] · [ (1/|G|) Σ_j d_{ρj} Tr( \hat{g}_{ρj}^T · ρj(σ) ) ]     (4)
            = (1/|G|)^2 Σ_{i,j} d_{ρi} d_{ρj} [ Tr( \hat{f}_{ρi}^T · ρi(σ) ) · Tr( \hat{g}_{ρj}^T · ρj(σ) ) ].     (5)

Now we want to manipulate this product of traces in the last line to be just one trace (as in Equation 3), by appealing to some properties of the matrix Kronecker product. The connection to the pointwise product (first observed in [8]) lies in the property that for any matrices U, V, Tr(U ⊗ V) = (Tr U) · (Tr V). Applying this to Equation 5, we have:

Tr( \hat{f}_{ρi}^T · ρi(σ) ) · Tr( \hat{g}_{ρj}^T · ρj(σ) ) = Tr( ( \hat{f}_{ρi}^T · ρi(σ) ) ⊗ ( \hat{g}_{ρj}^T · ρj(σ) ) )
                                                         = Tr( ( \hat{f}_{ρi} ⊗ \hat{g}_{ρj} )^T · ( ρi(σ) ⊗ ρj(σ) ) ),

where the last line follows by standard matrix properties. The term on the right, ρi(σ) ⊗ ρj(σ), itself happens to be a representation, called the Kronecker Product Representation. In general, the Kronecker Product representation is reducible, and so it can be decomposed into a direct sum of irreducibles.
This means that if ρi and ρj are any two irreducibles of G, there exists a similarity transform Cij such that for any σ ∈ G,

C_{ij}^{-1} · [ρi ⊗ ρj](σ) · C_{ij} = ⊕_k ⊕_{ℓ=1}^{z_{ijk}} ρk(σ).

The ⊕ symbols here refer to a matrix direct sum as in Equation 1, k indexes over all irreducible representations of Sn, while ℓ indexes over a number of copies of ρk which appear in the decomposition. We index blocks on the right side of this equation by pairs of indices (k, ℓ). The number of copies of each ρk is denoted by the integer z_{ijk}, the collection of which, taken over all triples (i, j, k), is commonly referred to as the Clebsch-Gordan series. Note that we allow the z_{ijk} to be zero, in which case ρk does not contribute to the direct sum. The matrices Cij are known as the Clebsch-Gordan coefficients. The Kronecker Product Decomposition problem is that of finding the irreducible components of the Kronecker product representation, and thus to find the Clebsch-Gordan series/coefficients for each pair of representations (ρi, ρj). Decomposing the Kronecker product inside Equation 5 using the Clebsch-Gordan series/coefficients yields the desired Fourier Transform, which we summarize here:

Proposition 4. Let \hat{f}, \hat{g} be the Fourier Transforms of functions f and g respectively, and for each ordered pair of irreducibles (ρi, ρj), define the matrix A_{ij} ≜ C_{ij}^{-1} · ( \hat{f}_{ρi} ⊗ \hat{g}_{ρj} ) · C_{ij}. Then the Fourier transform of the pointwise product fg is:

[\widehat{fg}]_{ρk} = (1 / (d_{ρk} |G|)) Σ_{ij} d_{ρi} d_{ρj} Σ_{ℓ=1}^{z_{ijk}} A^{kℓ}_{ij},     (6)

where A^{kℓ}_{ij} is the block of A_{ij} corresponding to the (k, ℓ) block in ⊕_k ⊕_{ℓ}^{z_{ijk}} ρk.

See the Appendix for a full proof of Proposition 4.
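The Kronecker trace identity that drives the derivation above can be sanity-checked on random matrices (a standalone check of the identity only, not a full verification of Proposition 4):

```python
import numpy as np

# Tr(U (x) V) = Tr(U) * Tr(V) for any square U, V (sizes need not match).
rng = np.random.default_rng(2)
U = rng.standard_normal((3, 3))
V = rng.standard_normal((4, 4))
assert np.isclose(np.trace(np.kron(U, V)), np.trace(U) * np.trace(V))
```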
The Clebsch-Gordan series, z_{ijk}, plays an important role in Equation 6, which says that the (ρi, ρj) cross-term contributes to the pointwise product at ρk only when z_{ijk} > 0. For example,

ρ1 ⊗ ρ1 ≡ ρ0 ⊕ ρ1 ⊕ ρ2 ⊕ ρ3.     (7)

So z_{1,1,k} = 1 for k ≤ 3 and is zero otherwise.

Unfortunately, there are no analytical formulas for finding the Clebsch-Gordan series or coefficients, and in practice, these computations can take a long time. We emphasize, however, that as fundamental quantities, like the digits of π, they need only be computed once and stored in a table for future reference. Due to space limitations, we will not provide complete details on computing these numbers. We refer the reader to Murnaghan [9], who provides general formulas for computing Clebsch-Gordan series for pairs of low-order irreducibles, and to Appendix 1 for details about computing Clebsch-Gordan coefficients. We will also make precomputed coefficients available on the web.

5 Approximate inference by bandlimiting

We approximate the probability distribution P(σ) by fixing a bandlimit B and maintaining the Fourier transform of P only at irreducibles ρ0, ..., ρB. We refer to this set of irreducibles as B. As on the real line, smooth functions are generally well approximated by only a few Fourier coefficients, while "wigglier" functions require more. For example, when B = 3, B is the set {ρ0, ρ1, ρ2, ρ3}, which corresponds to maintaining marginal probabilities of the form P(σ((i,j)) = (k,ℓ)). During inference, we follow the procedure outlined in the previous section but ignore the higher order terms which are not maintained.
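The sizes of the maintained coefficient matrices follow from the degrees of the low-order irreducibles. As a consistency check on Equation 7 (the degree formulas below are standard values that we supply; the paper does not list them), the degrees on both sides of ρ1 ⊗ ρ1 must agree:

```python
# Degrees of the first four irreducibles of S_n (standard values, our addition):
# d0 = 1, d1 = n-1, d2 = n(n-3)/2, d3 = (n-1)(n-2)/2.
# rho1 (x) rho1 has degree (n-1)^2, and Equation 7 says it decomposes into
# one copy each of rho0..rho3, so the degrees must sum to (n-1)^2.
for n in range(4, 12):
    d = [1, n - 1, n * (n - 3) // 2, (n - 1) * (n - 2) // 2]
    assert sum(d) == (n - 1) ** 2
```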
Pseudocode for bandlimited prediction/rollup and Kronecker conditioning is given in Figures 1 and 2.

Since the Prediction/Rollup step is pointwise in the Fourier domain, the update is exact for the maintained irreducibles, because higher order irreducibles cannot affect those below the bandlimit. As in [5], we find that the error from bandlimiting creeps in through the conditioning step. For example, Equation 7 shows that if B = 1 (so that we maintain first-order marginals), then the pointwise product spreads information to second-order marginals. Conversely, pairs of higher-order irreducibles may propagate information to lower-order irreducibles. If a distribution is diffuse, then most of the energy is stored in low-order Fourier coefficients anyway, and so this is not a big problem. It is when the distribution is sharply concentrated on a small subset of permutations that the low-order Fourier projection is unable to faithfully approximate the distribution, in many circumstances resulting in a bandlimited Fourier Transform with negative "marginal probabilities"! To combat this problem, we present a method for enforcing nonnegativity.

Projecting to a relaxed marginal polytope. The marginal polytope, M, is the set of marginals which are consistent with some joint distribution over permutations. We project our approximation onto a relaxation of the marginal polytope, M′, defined by linear inequality constraints that marginals be nonnegative, and linear equality constraints that they correspond to some legal Fourier transform.
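For first-order marginals, the correction amounts to an L2 projection of the marginal matrix onto the doubly stochastic matrices. The paper formulates this as a quadratic program in the Fourier domain; as an illustrative substitute (our own sketch, not the authors' method), Dykstra's alternating projections between the row/column-sum affine set and the nonnegative orthant converge to the same L2 projection:

```python
import numpy as np

def project_doubly_stochastic(M, iters=1000):
    """L2 projection onto doubly stochastic matrices via Dykstra's algorithm."""
    n = M.shape[0]
    X, p, q = M.copy(), np.zeros_like(M), np.zeros_like(M)
    for _ in range(iters):
        # Closed-form projection onto {X : all row sums = 1, all col sums = 1}.
        Y = X + p
        r = Y.sum(axis=1, keepdims=True)   # row sums
        c = Y.sum(axis=0, keepdims=True)   # column sums
        Z = Y - (r - 1) / n - (c - 1) / n + (Y.sum() - n) / n**2
        p = Y - Z
        # Projection onto the nonnegative orthant.
        W = np.maximum(Z + q, 0.0)
        q = Z + q - W
        X = W
    return X

# A hypothetical "marginal" matrix with a negative entry, as bandlimited
# conditioning can produce:
noisy = np.array([[0.9, 0.2, -0.1],
                  [0.1, 0.5,  0.3],
                  [0.0, 0.3,  0.8]])
D = project_doubly_stochastic(noisy)
```

The result `D` is nonnegative with rows and columns summing to one, i.e. a valid first-order marginal matrix in the sense of the Birkhoff-von Neumann theorem discussed below.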
Intuitively, our relaxation produces matrices of marginals which are doubly stochastic (rows and columns sum to one and all entries are nonnegative) and satisfy lower-order marginal consistency (different high-order marginals are consistent at lower orders).

After each conditioning step, we apply a 'correction' to the approximate posterior P^(t) by finding the bandlimited function in M′ which is closest to P^(t) in an L2 sense. To perform the projection, we employ the Plancherel Theorem [3], which relates the L2 distance between functions on Sn to a distance metric in the Fourier domain.

Proposition 5. Σ_σ (f(σ) - g(σ))^2 = (1/|G|) Σ_k d_{ρk} Tr( ( \hat{f}_{ρk} - \hat{g}_{ρk} )^T · ( \hat{f}_{ρk} - \hat{g}_{ρk} ) ).     (8)

We formulate the optimization as a quadratic program whose objective is to minimize the right side of Equation 8 (with the sum taken only over the set of maintained irreducibles, B), subject to the linear constraints which define M′.

We remark that even though the projection will always produce a Fourier transform corresponding to nonnegative marginals, there might not necessarily exist a joint probability distribution on Sn consistent with those marginals. In the case of first-order marginals, however, the existence of a consistent joint distribution is guaranteed by the Birkhoff-von Neumann theorem [10], which states that a matrix is doubly stochastic if and only if it can be written as a convex combination of permutation matrices. And so for the case of first-order marginals, our relaxation is, in fact, exact.

6 Related Work

The Identity Management problem was first introduced in [2], which maintains a doubly stochastic first order belief matrix to reason over data associations. Schumitsch et al.
[4] exploit a similar idea, but formulate the problem in log-space.

Figure 1: Pseudocode for the Fourier Prediction/Rollup Algorithm.

PREDICTIONROLLUP
  foreach ρk ∈ B do \hat{P}^{(t+1)}_{ρk} ← \hat{Q}^{(t)}_{ρk} · \hat{P}^{(t)}_{ρk} ;

Figure 2: Pseudocode for the Kronecker Conditioning Algorithm.

KRONECKERCONDITIONING
  foreach ρk ∈ B do [\widehat{L^{(t)} P^{(t)}}]_{ρk} ← 0 ;  //Initialize posterior
  //Pointwise product
  foreach ρi ∈ B do
    foreach ρj ∈ B do
      z ← CGseries(ρi, ρj) ;
      Cij ← CGcoefficients(ρi, ρj) ;
      Aij ← Cij^T · ( \hat{f}_{ρi} ⊗ \hat{g}_{ρj} ) · Cij ;
      for ρk ∈ B such that z_{ijk} ≠ 0 do
        for ℓ = 1 to z_{ijk} do
          [\widehat{L^{(t)} P^{(t)}}]_{ρk} ← [\widehat{L^{(t)} P^{(t)}}]_{ρk} + (d_{ρi} d_{ρj} / (d_{ρk} n!)) A^{kℓ}_{ij} ;  //A^{kℓ}_{ij} is the (k, ℓ) block of Aij
  //Normalization
  Z ← [\widehat{L^{(t)} P^{(t)}}]_{ρ0} ;
  foreach ρk ∈ B do [\widehat{L^{(t)} P^{(t)}}]_{ρk} ← (1/Z) [\widehat{L^{(t)} P^{(t)}}]_{ρk} ;

Kondor et al. [5] were the first to show that the data association problem could be approximately handled via the Fourier Transform. For conditioning, they exploit a modified FFT factorization which works on certain simplified observation models. Our approach generalizes the type of observations that can be handled in [5] and is equivalent in the simplified model that they present. We require O(D^3 n^2) time in their setting. Their FFT method saves a factor of D due to the fact that certain representation matrices can be shown to be sparse. Though we do not prove it, we observe that the Clebsch-Gordan coefficients Cij are typically similarly sparse, which yields an equivalent running time in practice. In addition, Kondor et al.
do not address the issue of projecting onto valid marginals, which, as we show in our experimental results, is fundamental in practice.

Willsky [8] was the first to formulate a nonabelian version of the FFT algorithm (for metacyclic groups), as well as to note the connection between pointwise products and Kronecker product decompositions for general finite groups. In this paper, we address approximate inference, which is necessary given the n! complexity of inference for the Symmetric Group.

7 Experimental results

For small n, we compared our algorithm to exact inference on synthetic datasets in which tracks are drawn at random to be observed or swapped. For validation, we measure the L1 distance between true and approximate marginal distributions. In Fig. 3(a), we call several mixings followed by a single observation, after which we measured error. As expected, the Fourier approximation is better when there are either more mixing events, or when more Fourier coefficients are maintained. In Fig. 3(b) we allow for consecutive conditioning steps, and we see that the projection step is fundamental, especially when mixing events are rare, reducing the error dramatically. Comparing running times, it is clear that our algorithm scales gracefully compared to the exact solution (Fig. 3(c)).

We also evaluated our algorithm on data taken from a real network of 8 cameras (Fig. 3(d)). In the data, there are n = 11 people walking around a room in fairly close proximity. To handle the fact that people can freely leave and enter the room, we maintain a list of the tracks which are external to the room. Each time a new track leaves the room, it is added to the list and a mixing event is called to allow for m^2 pairwise swaps amongst the m external tracks.

The number of mixing events is approximately the same as the number of observations. For each observation, the network returns a color histogram of the blob associated with one track.
The task after conditioning on each observation is to predict identities for all tracks inside the room, and the evaluation metric is the fraction of accurate predictions. We compared against a baseline approach of predicting the identity of a track based on the most recently observed histogram at that track. This approach is expected to be accurate when there are many observations and discriminative appearance models, neither of which our problem afforded. As Fig. 3(e) shows, both the baseline and the first order model (without projection) fared poorly, while the projection step dramatically boosted the accuracy. To illustrate the difficulty of predicting based on appearance alone, the rightmost bar reflects the performance of an omniscient tracker which knows the result of each mixing event and is therefore left only with the task of distinguishing between appearances.

Figure 3: Evaluation on synthetic ((a)-(c)) and real camera network ((d),(e)) data. (a) Error of Kronecker Conditioning (n = 8): L1 error at 1st order marginals versus number of mixing events, for b = 1, 2, 3. (b) Projection versus No Projection (n = 6): L1 error at 1st order marginals, averaged over 250 time steps, versus fraction of observation events, for b = 1, 2, 3 with and without projection, and b = 0 (uniform distribution). (c) Running time in seconds of 10 forward algorithm iterations versus n, for b = 1, 2, 3 and exact inference. (d) Sample image from the camera network. (e) Accuracy for camera data: percentage of tracks correctly identified by the omniscient tracker, with projection, without projection, and the baseline.

8 Conclusions
We presented a formulation of hidden Markov model inference in the Fourier domain. In particular, we developed the Kronecker Conditioning algorithm, which performs a convolution-like operation on Fourier coefficients to find the Fourier transform of the posterior distribution. We argued that bandlimited conditioning can result in Fourier coefficients which correspond to no distribution, but that the problem can be remedied by projecting to a relaxation of the marginal polytope. Our evaluation on data from a camera network shows that our methods perform well when compared to the optimal solution in small problems, or to an omniscient tracker in large problems.
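To make the projection step concrete, here is an illustrative stand-in, not the quadratic program used in the paper: after a bandlimited conditioning step, the first-order "marginal" matrix may contain negative or unnormalized entries. Clipping negatives and applying Sinkhorn-style alternating row/column normalization drives the matrix toward the set of doubly stochastic matrices, which is precisely the first-order relaxation of the marginal polytope.

```python
# Illustrative stand-in for the projection step (not the paper's QP):
# repair an invalid first-order marginal matrix by clipping negative
# entries and alternately normalizing rows and columns (Sinkhorn-style),
# approaching the doubly stochastic polytope.
def project_doubly_stochastic(M, iters=200, eps=1e-9):
    n = len(M)
    A = [[max(x, eps) for x in row] for row in M]  # clip negatives to a floor
    for _ in range(iters):
        for i in range(n):                          # normalize each row
            s = sum(A[i])
            A[i] = [x / s for x in A[i]]
        for j in range(n):                          # normalize each column
            s = sum(A[i][j] for i in range(n))
            for i in range(n):
                A[i][j] /= s
    return A

# Example: an invalid "marginal" matrix with a negative entry.
bad = [[0.6, 0.5, -0.1],
       [0.2, 0.1,  0.4],
       [0.3, 0.2,  0.8]]
fixed = project_doubly_stochastic(bad)
```

For strictly positive matrices this iteration is known to converge to a doubly stochastic matrix; the paper's Fourier-domain quadratic program instead projects all maintained coefficients jointly.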
Furthermore, we demonstrated that our projection step is fundamental to obtaining these high-quality results.

We conclude by remarking that the mathematical framework developed in this paper is quite general. In fact, both the prediction/rollup and conditioning formulations hold over any finite group, providing a principled method for approximate inference for problems with underlying group structure.

Acknowledgments
This work is supported in part by the ONR under MURI N000140710747, the ARO under grant W911NF-06-1-0275, the NSF under grants DGE-0333420, EEEC-540865, Nets-NOSS 0626151 and TF 0634803, and by the Pennsylvania Infrastructure Technology Alliance (PITA). Carlos Guestrin was also supported in part by an Alfred P. Sloan Fellowship. We thank Kyle Heath for helping with the camera data, and Emre Oto and Robert Hough for valuable discussions.

References
[1] Y. Ivanov, A. Sorokin, C. Wren, and I. Kaur. Tracking people in mixed modality systems. Technical Report TR2007-11, MERL, 2007.
[2] J. Shin, L. Guibas, and F. Zhao. A distributed algorithm for managing multi-target identities in wireless ad-hoc sensor networks. In IPSN, 2003.
[3] P. Diaconis. Group Representations in Probability and Statistics. IMS Lecture Notes, 1988.
[4] B. Schumitsch, S. Thrun, G. Bradski, and K. Olukotun. The information-form data association filter. In NIPS, 2006.
[5] R. Kondor, A. Howard, and T. Jebara. Multi-object tracking with representations of the symmetric group. In AISTATS, 2007.
[6] X. Boyen and D. Koller. Tractable inference for complex stochastic processes. In UAI, 1998.
[7] R. Kondor. Snob: a C++ library for fast Fourier transforms on the symmetric group, 2006. Available at http://www.cs.columbia.edu/~risi/Snob/.
[8] A. Willsky. On the algebraic structure of certain partially observable finite-state Markov processes. Information and Control, 38:179-212, 1978.
[9] F.D. Murnaghan.
The analysis of the Kronecker product of irreducible representations of the symmetric group. American Journal of Mathematics, 60(3):761-784, 1938.
[10] J. van Lint and R.M. Wilson. A Course in Combinatorics. Cambridge University Press, 2001.