{"title": "Hamming Ball Auxiliary Sampling for Factorial Hidden Markov Models", "book": "Advances in Neural Information Processing Systems", "page_first": 2960, "page_last": 2968, "abstract": "We introduce a novel sampling algorithm for Markov chain Monte Carlo-based Bayesian inference for factorial hidden Markov models. This algorithm is based on an auxiliary variable construction that restricts the model space, allowing iterative exploration in polynomial time. The sampling approach overcomes limitations of common conditional Gibbs samplers, which use asymmetric updates and become easily trapped in local modes. Instead, our method uses symmetric moves that allow joint updating of the latent sequences and improve mixing. We illustrate the application of the approach with a simulated and a real data example.", "full_text": "Hamming Ball Auxiliary Sampling for Factorial Hidden Markov Models\n\nMichalis K. Titsias\nDepartment of Informatics\nAthens University of Economics and Business\nmtitsias@aueb.gr\n\nChristopher Yau\nWellcome Trust Centre for Human Genetics\nUniversity of Oxford\ncyau@well.ox.ac.uk\n\nAbstract\n\nWe introduce a novel sampling algorithm for Markov chain Monte Carlo-based Bayesian inference for factorial hidden Markov models. This algorithm is based on an auxiliary variable construction that restricts the model space, allowing iterative exploration in polynomial time. The sampling approach overcomes limitations of common conditional Gibbs samplers, which use asymmetric updates and become easily trapped in local modes. Instead, our method uses symmetric moves that allow joint updating of the latent sequences and improve mixing. We illustrate the application of the approach with a simulated and a real data example.\n\n1 Introduction\n\nThe hidden Markov model (HMM) [1] is one of the most widely and successfully applied statistical models for the description of discrete time series data. 
Much of its success lies in the availability of efficient computational algorithms that allow the calculation of key quantities necessary for statistical inference [1, 2]. Importantly, the complexity of these algorithms is linear in the length of the sequence and quadratic in the number of states, which allows HMMs to be used in applications that involve long data sequences and reasonably large state spaces with modern computational hardware. In particular, the HMM has seen considerable use in areas such as bioinformatics and computational biology where non-trivially sized datasets are commonplace [3, 4, 5].\n\nThe factorial hidden Markov model (FHMM) [6] is an extension of the HMM where multiple independent hidden chains run in parallel and cooperatively generate the observed data. In a typical setting, we have an observed sequence $Y = (y_1, \\ldots, y_N)$ of length $N$ which is generated through $K$ binary hidden sequences represented by a $K \\times N$ binary matrix $X = (x_1, \\ldots, x_N)$. The interpretation of the latter binary matrix is that each row encodes the presence or absence of a single feature across the observed sequence, while each column $x_i$ represents the different features that are active when generating the observation $y_i$. Different rows of $X$ correspond to independent Markov chains following\n\n$$p(x_{k,i} | x_{k,i-1}) = \\begin{cases} 1 - \\rho_k \\quad \\text{if } x_{k,i} = x_{k,i-1}, \\\\ \\rho_k \\quad \\text{if } x_{k,i} \\neq x_{k,i-1}, \\end{cases} \\qquad (1)$$\n\nand where the initial state $x_{k,1}$ is drawn from a Bernoulli distribution with parameter $\\nu_k$. All hidden chains are parametrized by $2K$ parameters denoted by the vectors $\\rho = \\{\\rho_k\\}_{k=1}^{K}$ and $\\nu = \\{\\nu_k\\}_{k=1}^{K}$. Furthermore, each data point $y_i$ is generated conditional on $x_i$ through a likelihood model $p(y_i | x_i)$ parametrized by $\\phi$. The whole set of model parameters consists of the vector $\\theta = (\\phi, \\rho, \\nu)$, which determines the joint probability density over $(Y, X)$, although for notational simplicity we omit reference to it in our expressions. The joint probability density over $(Y, X)$ is written in the form\n\n$$p(Y, X) = p(Y | X) p(X) = \\left( \\prod_{i=1}^{N} p(y_i | x_i) \\right) \\left( \\prod_{k=1}^{K} p(x_{k,1}) \\prod_{i=2}^{N} p(x_{k,i} | x_{k,i-1}) \\right), \\qquad (2)$$\n\nand it is depicted as a directed graphical model in Figure 1.\n\nFigure 1: Graphical model for a factorial HMM with three hidden chains and three consecutive data points.\n\nWhile the HMM has enjoyed widespread application, the FHMM has been used far less. One considerable challenge in the adoption of FHMMs concerns the computation of the posterior distribution $p(X | Y)$ (conditional on observed data and model parameters), which is a fully dependent distribution over the $2^{KN}$ possible configurations of the binary matrix $X$. Exact Monte Carlo inference can be achieved by applying the standard forward-filtering-backward-sampling (FF-BS) algorithm to simulate a sample from $p(X | Y)$ in $O(2^{2K} N)$ time (the independence of the Markov chains can be exploited to reduce this complexity to $O(2^{K+1} K N)$ [6]). Joint updating of $X$ is highly desirable in time series analysis since alternative strategies involving conditional single-site, single-row or block updates can be notoriously slow due to strong coupling between successive time steps. However, although the use of FF-BS is quite feasible for even very large HMMs, it is only practical for small values of $K$ and $N$ in FHMMs. 
As a consequence, inference in FHMMs has become somewhat synonymous with approximate methods such as variational inference [6, 7].\n\nThe main burden of the FF-BS algorithm is the requirement to sum over all possible configurations of the binary matrix $X$ during the forward filtering phase. The central idea in this work is to avoid this computationally expensive step by applying a restricted sampling procedure with polynomial time complexity that, when applied iteratively, gives exact samples from the true posterior distribution. Whilst regular conditional sampling procedures use locally asymmetric moves that only allow one part of $X$ to be altered at a time, our sampling method employs locally symmetric moves that allow localized joint updating of all the constituent chains, making it less prone to becoming trapped in local modes. The sampling strategy uses an auxiliary variable construction, similar to slice sampling [8] and the Swendsen-Wang algorithm [9], that allows the automatic selection of the sequence of restricted configuration spaces. The size of these restricted configuration spaces is user-defined, allowing control over the balance between sampling efficiency and computational complexity. Our sampler generalizes the standard FF-BS algorithm, which it recovers as a special case.\n\n2 Standard Monte Carlo inference for the FHMM\n\nBefore discussing the details of our new sampler, we first describe the limitations of standard conditional sampling procedures for the FHMM. The most sophisticated conditional sampling schemes are based on alternating between sampling one chain (or a small block of chains) at a time using the FF-BS recursion. 
However, as discussed in the following and illustrated experimentally in Section 4, these algorithms can easily become trapped in local modes, leading to inefficient exploration of the posterior distribution.\n\nOne standard Gibbs sampling algorithm for the FHMM is based on simulating from the posterior conditional distribution over a single row of $X$ given the remaining rows. Each such step can be carried out in $O(4N)$ time using the FF-BS recursion, while a full sweep over all $K$ rows requires $O(4KN)$ time. A straightforward generalization of the above is to apply block Gibbs sampling where at each step a small subset of chains is jointly sampled. For instance, when we consider pairs of chains the time complexity for sampling a pair is $O(16N)$ while a full sweep over all possible pairs requires time $O(16 \\frac{K(K-1)}{2} N)$.\n\nFigure 2: Panel (a) shows an example where from a current state $X^{(t)}$ it is impossible to jump to a new state $X^{(t+1)}$ in a single step using block Gibbs sampling on pairs of rows. In contrast, Hamming ball sampling applied with the smallest valid radius, i.e. $m = 1$, can accomplish such a move through the intermediate simulation of $U$ as illustrated in (b). Specifically, simulating $U$ from the uniform $p(U | X)$ results in a state having one bit flipped per column compared to $X^{(t)}$. 
Then sampling $X^{(t+1)}$ given $U$ flips two further bits, so in total $X^{(t+1)}$ differs from $X^{(t)}$ in four bits that lie in three different rows and two columns.\n\nWhile these schemes can propose large changes to $X$ and be efficiently implemented using forward-backward recursions, they can still easily get trapped in local modes of the posterior distribution. For instance, suppose we sample pairs of rows and we encounter a situation where, in order to escape from a local mode, four bits in two different columns (two bits from each column) must be jointly flipped. Given that these four bits belong to more than two rows, the above Gibbs sampler will fail to move out of the local mode no matter which row-pair, from the $\\frac{K(K-1)}{2}$ possible ones, is jointly simulated. An illustrative example of this phenomenon is given in Figure 2(a).\n\nWe could describe the conditional sampling updates of block Gibbs samplers as being locally asymmetric, in the sense that, in each step, one part of $X$ is restricted to remain unchanged while the other part is free to change. As the above example indicates, these locally asymmetric updates can cause the chain to become trapped in local modes, which can result in slow mixing. This can be particularly problematic in FHMMs where the observations are jointly dependent on the underlying hidden states, which induces a coupling between rows of $X$. Of course, locality in any possible MCMC scheme for FHMMs seems unavoidable; however, such locality does not need to be asymmetric. In the next section, we develop a symmetrically local sampling approach so that each step gives any element of $X$ a chance to be flipped in any single update.\n\n3 Hamming ball auxiliary sampling\n\nHere we develop the theory of the Hamming ball sampler. 
Section 3.1 presents the main idea while Section 3.2 discusses several extensions.\n\n3.1 The basic Hamming ball algorithm\n\nRecall the $K$-dimensional binary vector $x_i$ (the $i$-th column of $X$) that defines the hidden state at the $i$-th location. We consider the set of all $K$-dimensional binary vectors $u_i$ that lie within a certain Hamming distance from $x_i$, so that each $u_i$ is such that\n\n$$h(u_i, x_i) \\le m, \\qquad (3)$$\n\nwhere $m \\le K$. Here, $h(u_i, x_i) = \\sum_{k=1}^{K} I(u_{k,i} \\neq x_{k,i})$ is the Hamming distance between two binary vectors and $I(\\cdot)$ denotes the indicator function. Notice that the Hamming distance is simply the number of elements in which the two binary vectors disagree. We refer to the set of all $u_i$ satisfying (3) as the $i$-th location Hamming ball of radius $m$. For instance, when $m = 1$, the above set includes all $u_i$ vectors restricted to be the same as $x_i$ but with at most one bit flipped, when $m = 2$ these vectors can have at most two bits flipped, and so on. For a given $m$, the cardinality of the $i$-th location Hamming ball is\n\n$$M = \\sum_{j=0}^{m} \\binom{K}{j}. \\qquad (4)$$\n\nFor $m = 1$ this number is equal to $K + 1$, for $m = 2$ it is equal to $\\frac{K(K-1)}{2} + K + 1$, and so on. Clearly, when $m = K$ there is no restriction on the values of $u_i$ and the above number takes its maximum value, i.e. $M = 2^K$. Subsequently, given a certain $X$ we define the full path Hamming ball or simply Hamming ball as the set\n\n$$B_m(X) = \\{U ; h(u_i, x_i) \\le m, i = 1, \\ldots, N\\}, \\qquad (5)$$\n\nwhere $U$ is a $K \\times N$ binary matrix such that $U = (u_1, \\ldots, u_N)$. This Hamming ball, centered at $X$, is simply the intersection of all $i$-th location Hamming balls of radius $m$. Clearly, the Hamming ball set is such that $U \\in B_m(X)$ iff $X \\in B_m(U)$, or more concisely we can write $I(U \\in B_m(X)) = I(X \\in B_m(U))$. Furthermore, the indicator function $I(U \\in B_m(X))$ factorizes as follows,\n\n$$I(U \\in B_m(X)) = \\prod_{i=1}^{N} I(h(u_i, x_i) \\le m). \\qquad (6)$$\n\nWe wish now to consider $U$ as an auxiliary variable generated given $X$ uniformly inside $B_m(X)$, i.e. we define the conditional distribution\n\n$$p(U | X) = \\frac{1}{Z} I(U \\in B_m(X)), \\qquad (7)$$\n\nwhere crucially the normalizing constant $Z$ simply reflects the volume of the ball (here $Z = M^N$) and is independent of $X$. We can augment the initial joint model density from Eq. (2) with the auxiliary variables $U$ and express the augmented model\n\n$$p(Y, X, U) = p(Y | X) p(X) p(U | X). \\qquad (8)$$\n\nBased on this, we can apply Gibbs sampling in the augmented space and iteratively sample $U$ from the posterior conditional, which is just $p(U | X)$, and then sample $X$ given the remaining variables. Sampling from $p(U | X)$ is trivial as it only requires drawing each $u_i$, with $i = 1, \\ldots, N$, independently from the uniform distribution proportional to $I(h(u_i, x_i) \\le m)$, i.e. randomly selecting a $u_i$ within Hamming distance at most $m$ from $x_i$. Then, sampling $X$ is carried out by simulating from the following posterior conditional distribution\n\n$$p(X | Y, U) \\propto p(Y | X) p(X) p(U | X) \\propto \\left( \\prod_{i=1}^{N} p(y_i | x_i) I(h(x_i, u_i) \\le m) \\right) p(X), \\qquad (9)$$\n\nwhere we used Eq. (6). Exact sampling from this distribution can be done using the FF-BS algorithm in $O(M^2 N)$ time, where $M$ is the size of each location-specific Hamming ball given in (4).\n\nThe intuition behind the above algorithm is the following. Sampling from $p(U | X)$ given the current state $X$ can be thought of as an exploration step where $X$ is randomly perturbed to produce an auxiliary matrix $U$. We can imagine this as moving the Hamming ball that initially is centered at $X$ to a new location centered at $U$. 
Subsequently, we take a slice of the model by considering only the binary matrices that exist inside this new Hamming ball, centered at $U$, and draw a new state for $X$ by performing exact sampling in this sliced part of the model. Exact sampling is possible using the FF-BS recursion and it has a user-controllable time complexity that depends on the volume of the Hamming ball. An illustrative example of how the algorithm operates is given in Figure 2(b).\n\nFor the above sampling scheme to be ergodic (under standard conditions), the auxiliary variable $U$ must be allowed to move away from the current $X^{(t)}$ (the value of $X$ at the $t$-th iteration), which implies that the radius $m$ must be strictly larger than zero. Furthermore, the maximum distance a new $X^{(t+1)}$ can travel away from the current $X^{(t)}$ in a single iteration is $2mN$ bits (assuming $m \\le K/2$). This is because resampling $U$ given the current $X^{(t)}$ can select a $U$ that differs from $X^{(t)}$ in at most $mN$ bits, while subsequently sampling $X^{(t+1)}$ given $U$ adds at most another $mN$ bits.\n\n3.2 Extensions\n\nSo far we have defined Hamming ball sampling assuming binary factor chains in the FHMM. It is possible to generalize the whole approach to deal with factor chains that can take values in general finite discrete state spaces. Suppose that each hidden variable takes $P$ values so that the matrix $X \\in \\{1, \\ldots, P\\}^{K \\times N}$. Exactly as in the binary case, the Hamming distance between the auxiliary vector $u_i \\in \\{1, \\ldots, P\\}^{K}$ and the corresponding $i$-th column $x_i$ of $X$ is the number of elements in which these two vectors disagree. Based on this we can define the $i$-th location Hamming ball of radius $m$ as the set of all $u_i$ satisfying Eq. (3), which has cardinality\n\n$$M = \\sum_{j=0}^{m} (P - 1)^j \\binom{K}{j}. \\qquad (10)$$\n\nFor $m = 1$ this is equal to $(P - 1) K + 1$, for $m = 2$ it is equal to $(P - 1)^2 \\frac{K(K-1)}{2} + (P - 1) K + 1$, and so forth. 
Notice that for the binary case, where $P = 2$, all these expressions reduce to the ones from Section 3.1. Then, the sampling scheme from the previous section can be applied unchanged, where in one step we sample $U$ given the current $X$ and in the second step we sample $X$ given $U$ using the FF-BS recursion.\n\nAnother direction for extending the method is to vary the structure of the uniform distribution $p(U | X)$, which essentially determines the exploration area around the current value of $X$. We can even add randomness to the structure of this distribution by further expanding the joint density in Eq. (8) with random variables that determine this structure. For instance, we can consider a distribution $p(m)$ over the radius $m$ that covers a range of possible values and then iteratively sample $(U, m)$ from $p(U | X, m) p(m)$ and $X$ from $p(X | Y, U, m) \\propto p(Y | X) p(X) p(U | X, m)$. This scheme remains valid since it is essentially Gibbs sampling in an augmented probability model where we added the auxiliary variables $(U, m)$. In a practical implementation, such a scheme would place high prior probability on small values of $m$, where sampling iterations would be fast to compute and enable efficient exploration of local structure but, with non-zero probabilities on larger values of $m$, the sampler could still periodically consider larger portions of the model space that would allow more significant changes to the configuration of $X$.\n\nMore generally, we can determine the structure of $p(U | X)$ through a set of radius constraints $m = (m_1, \\ldots, m_Q)$ and base our sampling on the augmented density\n\n$$p(Y, X, U, m) = p(Y | X) p(X) p(U | X, m) p(m). \\qquad (11)$$\n\nFor instance, we can choose $m = (m_1, \\ldots, m_N)$ and consider $m_i$ as determining the radius of the $i$-th location Hamming ball (for the column $x_i$) so that the corresponding uniform distribution over $u_i$ becomes $p(u_i | x_i, m_i) \\propto I(h(u_i, x_i) \\le m_i)$. 
This could allow for asymmetric local moves where in some parts of the hidden sequence (where the $m_i$ are large) we allow for greater exploration compared to others where the exploration can be more constrained. This could lead to more efficient variations of the Hamming ball sampler where the vector $m$ could be automatically tuned during sampling to focus computational effort in regions of the sequence where there is most uncertainty in the underlying latent structure of $X$.\n\nIn a different direction, we could introduce the constraints $m = (m_1, \\ldots, m_K)$ associated with the rows of $X$ instead of the columns. This recovers regular Gibbs sampling as a special case. In particular, if $p(m)$ is chosen so that in a random draw we pick a single $k$ such that $m_k = N$ and the rest $m_{k'} = 0$, then we essentially freeze all rows of $X$ apart from the $k$-th row (footnote 1), so that the subsequent step of sampling $X$ reduces to exact sampling of the $k$-th row of $X$ using the FF-BS recursion. Under this perspective, block Gibbs sampling for FHMMs can be seen as a special case of Hamming ball sampling.\n\nFinally, there may be utility in developing other proposals for sampling $U$ based on distributions other than the uniform approach used here. For example, a local exponentially weighted proposal of the form $p(U | X) \\propto \\prod_{i=1}^{N} \\exp(-\\lambda h(u_i, x_i)) I(h(u_i, x_i) \\le m)$ would keep the centre of the proposed Hamming ball closer to its current location, enabling more efficient exploration of local configurations. However, in developing alternative proposals, it is crucial that the normalizing constant of $p(U | X)$ is computed efficiently so that the overall time complexity remains $O(M^2 N)$.\n\n4 Experiments\n\nTo demonstrate Hamming ball (HB) sampling we consider an additive FHMM such as the one used in [6] and popularized recently for energy disaggregation applications [7, 10, 11]. 
In this model, each $k$-th factor chain interacts with the data through an associated mean vector $w_k \\in \\mathbb{R}^D$ so that each observed output $y_i$ is taken to be a noisy version of the sum of all factor vectors activated at time $i$:\n\n$$y_i = w_0 + \\sum_{k=1}^{K} w_k x_{k,i} + \\eta_i, \\qquad (12)$$\n\nwhere $w_0$ is an extra bias term while $\\eta_i$ is white noise that typically follows a Gaussian: $\\eta_i \\sim \\mathcal{N}(0, \\sigma^2 I)$. Using this model we demonstrate the proposed method using an artificial dataset in Section 4.1 and a real dataset [11] in energy disaggregation in Section 4.2. In all examples, we compare HB with block Gibbs (BG) sampling.\n\nFootnote 1: In particular, for the rows $k' \\neq k$ the corresponding uniform distribution over the $u_{k',i}$ collapses to a point delta mass centred at the previous states $x_{k',i}$.\n\n4.1 Simulated dataset\n\nHere, we wish to investigate the ability of HB and BG sampling schemes to efficiently escape from local modes of the posterior distribution. We consider an artificial data sequence of length $N = 200$ generated as follows. We simulated $K = 5$ factor chains (with $\\nu_k = 0.5$, $\\rho_k = 0.05$, $k = 1, \\ldots, 5$) which subsequently generated observations in the 25-dimensional space according to the additive FHMM from Eq. (12) assuming Gaussian noise with variance $\\sigma^2 = 0.05$. The associated factor vectors were selected to be $w_k = w_k \\ast \\mathrm{Mask}_k$, where on the right-hand side $w_k = 0.8 + 0.05 (k - 1)$, $k = 1, \\ldots, 5$, is a scalar amplitude and $\\mathrm{Mask}_k$ denotes a 25-dimensional binary vector or mask. All binary masks are displayed as $5 \\times 5$ binary images in Figure 1(a) in the supplementary file together with a few examples of generated data points. Finally, the bias term $w_0$ was set to zero.\n\nWe assume that the ground-truth model parameters $\\theta = (\\{\\nu_k, \\rho_k, w_k\\}_{k=1}^{K}, w_0, \\sigma^2)$ that generated the data are known and our objective is to do posterior inference over the latent factors $X \\in \\{0, 1\\}^{5 \\times 200}$, i.e. 
to draw samples from the conditional posterior distribution $p(X | Y, \\theta)$. Since the data have been produced with small noise variance, this exact posterior is highly peaked, with almost all the probability mass concentrated on the single configuration $X_{true}$ that generated the data. So the question is whether the BG and HB schemes will be able to discover the \u201cunknown\u201d $X_{true}$ from a random initialization. We tested three block Gibbs sampling schemes: BG1, BG2 and BG3, which jointly sample blocks of rows of size one, two or three respectively. For each algorithm a full iteration is chosen to be a complete pass over all possible combinations of rows so that the time complexity per iteration for BG1 is $O(20N)$, for BG2 is $O(160N)$ and for BG3 is $O(640N)$. Regarding HB sampling we considered three schemes: HB1, HB2 and HB3 with radius $m = 1$, 2 and 3 respectively. The time complexities for these HB algorithms were $O(36N)$, $O(256N)$ and $O(676N)$. Notice that an exact sample from the posterior distribution can be drawn in $O(1024N)$ time.\n\nWe ran all algorithms from the same random initialization $X^{(0)}$, in which each bit was chosen from the uniform distribution. Figure 3(a) shows the evolution of the number of misclassified bits in $X$, i.e. the number of bits in which the state $X^{(t)}$ disagrees with the ground-truth $X_{true}$. Clearly, HB2 and HB3 quickly discover the optimal solution, with HB3 being slightly faster. HB1 is unable to discover the ground-truth but it outperforms BG1 and BG2. All the block Gibbs sampling schemes, including the most expensive one, BG3, failed to reach $X_{true}$.\n\nFigure 3: The panel in (a) shows the sampling evolution of the Hamming distance between $X_{true}$ and $X^{(t)}$ for the three block Gibbs samplers (dashed lines) and the HB schemes (solid lines). The panel in (b) shows the evolution of the MSE during the MCMC training phase for the REDD dataset. 
The two Gibbs samplers are shown with dashed lines while the two HB algorithms are shown with solid lines. Similarly to (b), the plot in (c) displays the evolution of MSEs for the prediction phase in the REDD example where we only simulate the factors $X$. [Axes: (a) number of errors in $X$ for BG1, BG2, BG3, HB1, HB2, HB3; (b) train MSE and (c) test MSE for BG1, BG2, HB1, HB2; each plotted against sampling iterations.]\n\n4.2 Energy disaggregation\n\nHere, we consider a real-world example from the field of energy disaggregation where the objective is to determine the component devices from an aggregated electricity signal. This technology is useful because having a decomposition, into components for each device, of the total electricity usage in a household or building can be very informative to consumers and increase awareness of energy consumption, which subsequently can lead to possible energy savings. For full details regarding the energy disaggregation application see [7, 10, 11]. Next we consider a publicly available data set (footnote 2), called the Reference Energy Disaggregation Data Set (REDD) [11], to test the HB and BG sampling algorithms. The REDD data set contains several types of home electricity data for many different houses recorded during several weeks. Next, we will consider the main signal power of house_1 for seven days, which is a temporal signal of length 604,800 since power was recorded every second. We further downsampled this signal to every 9 seconds to obtain a sequence of length 67,200 to which we applied the FHMM described below.\n\nEnergy disaggregation can be naturally tackled by an additive FHMM framework, as realized in [10, 11], where an observed total electricity power $y_i$ at time instant $i$ is the sum of individual powers for all devices that are \u201con\u201d at that time. Therefore, the observation model from Eq. 
(12) can be used to model this situation with the constraint that each device contribution $w_k$ (which is a scalar) is restricted to be non-negative. We assume an FHMM with $K = 10$ factors and we follow a Bayesian framework where each $w_k$ is parametrized by the exponential transformation, i.e. $w_k = e^{\\tilde{w}_k}$, and a vague zero-mean Gaussian prior is assigned on $\\tilde{w}_k$. To learn these factors we apply unsupervised learning using as training data the first day of recorded data. This involves applying a Metropolis-within-Gibbs type of MCMC algorithm that iterates between the following three steps: i) sampling $X$, ii) sampling each $\\tilde{w}_k$ individually using its own Gaussian proposal distribution and accepting or rejecting based on the M-H step and iii) sampling the noise variance $\\sigma^2$ based on its conjugate Gamma posterior distribution. Notice that step ii) involves adapting the variance of the Gaussian proposal to achieve an acceptance ratio between 20 and 40 percent following standard ideas from adaptive MCMC. For the first step we consider one of the following four algorithms: BG1, BG2, HB1 and HB2 defined in the previous section. Once the FHMM has been trained, we would like to make predictions and infer the posterior distribution over the hidden factors for a test sequence, consisting of the remaining six days, according to\n\n$$p(X_* | Y_*, Y) = \\int p(X_* | Y_*, W, \\sigma^2) p(W, \\sigma^2 | Y) dW d\\sigma^2 \\approx \\frac{1}{T} \\sum_{t=1}^{T} p(X_* | Y_*, W^{(t)}, (\\sigma^2)^{(t)}), \\qquad (13)$$\n\nwhere $Y_*$ denotes the test observations and $X_*$ the corresponding hidden sequence we wish to infer (footnote 3). This computation requires being able to simulate from $p(X_* | Y_*, W, \\sigma^2)$ for a given fixed setting of the parameters $(W, \\sigma^2)$. 
Such a prediction step will tell us which factors are \u201con\u201d at each time. Such factors could directly correspond to devices in the household, such as Electronics, Lighting, Refrigerator etc.; however, since our learning approach is purely unsupervised we will not attempt to establish correspondences between the inferred factors and the household appliances and, instead, we will focus on comparing the ability of the sampling algorithms to escape from local modes of the posterior distribution. To quantify this ability we will consider the mean squared error (MSE) between the model mean predictions and the actual data. Clearly, MSE for the test data can measure how well the model predicts the unseen electricity powers, while MSE at the training phase can indicate how well the chain mixes and reaches areas with high probability mass (where training data are reconstructed with small error). Figure 3(b) shows the evolution of MSE through the sampling iterations for the four MCMC algorithms used for training. Figure 3(c) shows the corresponding curves for the prediction phase, i.e. when sampling from $p(X_* | Y_*, W, \\sigma^2)$ given a representative sample from the posterior $p(W, \\sigma^2 | Y)$. All four MSE curves in Figure 3(c) are produced by assuming the same setting for $(W, \\sigma^2)$ so that any difference observed between the algorithms depends solely on the ability to sample from $p(X_* | Y_*, W, \\sigma^2)$. Finally, Figure 4 shows illustrative plots of how we fit the data for all seven days (first row) and how we predict the test data on the second day (second row) together with the corresponding inferred factors for the six most dominant hidden states (having the largest inferred $w_k$ values). The plots in Figure 4 were produced based on the HB2 output.\n\nSome conclusions we can draw are the following. 
Firstly, Figure 3(c) clearly indicates that both HB algorithms for the prediction phase, where the factor weights $w_k$ are fixed and given, are much better than block Gibbs samplers in escaping from local modes and discovering hidden state configurations that explain the data more efficiently. Moreover, HB2 is clearly better than HB1, as expected, since it considers larger global moves. When we are jointly sampling weights $w_k$ and their interacting latent binary states (as done in the training MCMC phase), then, as Figure 3(b) shows, block Gibbs samplers can move faster towards fitting the data and exploring local modes while HB schemes are slower in that respect. Nevertheless, the HB2 algorithm eventually reaches an area with smaller MSE than the block Gibbs samplers.\n\nFootnote 2: Available from http://redd.csail.mit.edu/.\n\nFootnote 3: Notice that we have also assumed that the training and test sequences are conditionally independent given the model parameters $(W, \\sigma^2)$.\n\nFigure 4: First row shows the data for all seven days together with the model predictions (the blue solid line corresponds to the training part and the red line to the test part). Second row zooms in on the predictions for the second day, while the third row shows the corresponding activations of the six most dominant factors (displayed with different colors). All these results are based on the HB2 output.\n\n5 Discussion\n\nExact sampling using FF-BS over the entire model space for the FHMM is intractable. Alternative solutions based on conditional updating approaches that use locally asymmetric moves will lead to poor mixing due to the sampler becoming trapped in local modes. We have shown that the Hamming ball sampler gives a relative improvement over conditional approaches through the use of locally symmetric moves that permit joint updating of hidden chains and improve mixing.\n\nWhilst we have presented the Hamming ball sampler applied to the factorial hidden Markov model, it is applicable to any statistical model where each observed data vector $y_i$ depends only on the $i$-th column of a binary latent variable matrix $X$, so that the joint density over $X$ and the observed data $Y$ can be factored as $p(X, Y) \\propto p(X) \\prod_{i=1}^{N} p(y_i | x_i)$. Examples include the spike and slab variable selection models in Bayesian linear regression [12] and multiple membership models including Bayesian nonparametric models that utilize the Indian buffet process [13, 14]. While, in standard versions of these models, the columns of $X$ are independent and posterior inference is trivially parallelizable, the utility of the Hamming ball sampler arises where $K$ is large and sampling individual columns of $X$ is itself computationally very demanding. Other models to which the sampler might be applicable include more complex dependence structures that involve coupling between Markov chains and undirected dependencies.\n\nAcknowledgments\n\nWe thank the reviewers for insightful comments. MKT gratefully acknowledges support from \u201cResearch Funding at AUEB for Excellence and Extroversion, Action 1: 2012-2014\u201d. CY acknowledges the support of a UK Medical Research Council New Investigator Research Grant (Ref No. MR/L001411/1). CY is also affiliated with the Department of Statistics, University of Oxford.\n\nReferences\n\n[1] Lawrence Rabiner. A tutorial on hidden Markov models and selected applications in speech recognition. Proceedings of the IEEE, 77(2):257\u2013286, 1989.\n\n[2] Steven L Scott. Bayesian methods for hidden Markov models. 
Journal of the American Statistical Association, 97(457), 2002.\n\n[3] Na Li and Matthew Stephens. Modeling linkage disequilibrium and identifying recombination hotspots using single-nucleotide polymorphism data. Genetics, 165(4):2213\u20132233, 2003.\n\n[4] Jonathan Marchini and Bryan Howie. Genotype imputation for genome-wide association studies. Nature Reviews Genetics, 11(7):499\u2013511, 2010.\n\n[5] Christopher Yau. OncoSNP-SEQ: a statistical approach for the identification of somatic copy number alterations from next-generation sequencing of cancer genomes. Bioinformatics, 29(19):2482\u20132484, 2013.\n\n[6] Zoubin Ghahramani and Michael I. Jordan. Factorial hidden Markov models. Machine Learning, 29(2-3):245\u2013273, November 1997.\n\n[7] J Zico Kolter and Tommi Jaakkola. Approximate inference in additive factorial HMMs with application to energy disaggregation. In International Conference on Artificial Intelligence and Statistics, pages 1472\u20131482, 2012.\n\n[8] Radford M Neal. Slice sampling. Annals of Statistics, pages 705\u2013741, 2003.\n\n[9] Robert H Swendsen and Jian-Sheng Wang. Nonuniversal critical dynamics in Monte Carlo simulations. Physical Review Letters, 58(2):86\u201388, 1987.\n\n[10] Hyungsul Kim, Manish Marwah, Martin F. Arlitt, Geoff Lyon, and Jiawei Han. Unsupervised disaggregation of low frequency power measurements. In SDM, pages 747\u2013758. SIAM / Omnipress, 2011.\n\n[11] J. Zico Kolter and Matthew J. Johnson. REDD: a public data set for energy disaggregation research. In SustKDD Workshop on Data Mining Applications in Sustainability, 2011.\n\n[12] Toby J Mitchell and John J Beauchamp. Bayesian variable selection in linear regression. Journal of the American Statistical Association, 83(404):1023\u20131032, 1988.\n\n[13] Thomas L Griffiths and Zoubin Ghahramani. Infinite latent feature models and the Indian buffet process. 
In NIPS, volume 18, pages 475\u2013482, 2005.\n\n[14] J. Van Gael, Y. W. Teh, and Z. Ghahramani. The infinite factorial hidden Markov model. In Advances in Neural Information Processing Systems, volume 21, 2009.", "award": [], "sourceid": 1551, "authors": [{"given_name": "Michalis", "family_name": "Titsias RC AUEB", "institution": "Athens University of Economics and Business"}, {"given_name": "Christopher", "family_name": "Yau", "institution": "University of Oxford"}]}