{"title": "Inverse Filtering for Hidden Markov Models", "book": "Advances in Neural Information Processing Systems", "page_first": 4204, "page_last": 4213, "abstract": "This paper considers a number of related inverse filtering problems for hidden Markov models (HMMs). In particular, given a sequence of state posteriors and the system dynamics; i) estimate the corresponding sequence of observations, ii) estimate the observation likelihoods, and iii) jointly estimate the observation likelihoods and the observation sequence. We show how to avoid a computationally expensive mixed integer linear program (MILP) by exploiting the algebraic structure of the HMM filter using simple linear algebra operations, and provide conditions for when the quantities can be uniquely reconstructed. We also propose a solution to the more general case where the posteriors are noisily observed. Finally, the proposed inverse filtering algorithms are evaluated on real-world polysomnographic data used for automatic sleep segmentation.", "full_text": "Inverse Filtering for Hidden Markov Models\n\nRobert Mattila\n\nDepartment of Automatic Control\nKTH Royal Institute of Technology\n\nrmattila@kth.se\n\nVikram Krishnamurthy\n\nCornell Tech\n\nCornell University\n\nvikramk@cornell.edu\n\nCristian R. Rojas\n\nDepartment of Automatic Control\nKTH Royal Institute of Technology\n\ncrro@kth.se\n\nBo Wahlberg\n\nDepartment of Automatic Control\nKTH Royal Institute of Technology\n\nbo@kth.se\n\nAbstract\n\nThis paper considers a number of related inverse \ufb01ltering problems for hidden\nMarkov models (HMMs). In particular, given a sequence of state posteriors and\nthe system dynamics; i) estimate the corresponding sequence of observations,\nii) estimate the observation likelihoods, and iii) jointly estimate the observation\nlikelihoods and the observation sequence. 
We show how to avoid a computationally expensive mixed integer linear program (MILP) by exploiting the algebraic structure of the HMM filter using simple linear algebra operations, and provide conditions for when the quantities can be uniquely reconstructed. We also propose a solution to the more general case where the posteriors are noisily observed. Finally, the proposed inverse filtering algorithms are evaluated on real-world polysomnographic data used for automatic sleep segmentation.

1 Introduction

The hidden Markov model (HMM) is a cornerstone of statistical modeling [1–4]. In it, a latent (i.e., hidden) state evolves according to Markovian dynamics. The state of the system is only indirectly observed via a sensor that provides noisy observations. The observations are sampled independently, conditioned on the state of the system, according to observation likelihood probabilities. Of paramount importance in many applications of HMMs is the classical stochastic filtering problem, namely:

Given observations from an HMM with known dynamics and observation likelihood probabilities, compute the posterior distribution of the latent state.

Throughout the paper, we restrict our attention to discrete-time finite observation-alphabet HMMs. For such HMMs, the solution to the filtering problem is a recursive algorithm known as the HMM filter [1, 4].

In this paper, we consider the inverse of the above problem.
In particular, our aim is to provide solutions to the following inverse filtering problems:

Given a sequence of posteriors (or, more generally, noisily observed posteriors) from an HMM with known dynamics, compute (estimate) the observation likelihood probabilities and/or the observations that generated the posteriors.

To motivate these problems, we give several possible applications of our results below.

31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.

\fApplications The underlying idea of inverse filtering problems (“inform me about your state estimate and I will know your sensor characteristics, including your measurements”) has potential applications in, e.g., autonomous calibration of sensors, fault diagnosis, and detecting Bayesian behavior in agents. In model-based fault-detection [5, 6], sensor information together with solutions to related inverse filtering problems are used to detect abnormal behavior. (As trivial examples: i) if the true sequence of observations is known from a redundant sensor, it can be compared to the reconstructed sequence; if there is a mismatch, something is wrong, or ii) if multiple data batches are available, then change detection can be performed on the sequence of reconstructed observation likelihoods.) They are also of relevance in a revealed-preference context in microeconomics where the aim is to detect expected utility maximization behavior of an agent; estimating the posterior given the agent’s actions is a crucial step, see, e.g., [7].

Recent advances in wearables and smart-sensor technology have led to consumer grade products (smart watches with motion and heart-beat monitoring, sleep trackers, etc.) that produce vast amounts of personal data by performing state estimation. This information can serve as an indicator of health, fitness and stress.
It may be very dif\ufb01cult, or even impossible, to access the raw sensor data since the\nsensor and state estimator usually are tightly integrated and encapsulated in intelligent sensor systems.\nInverse \ufb01ltering provides a framework for reverse engineering and performing fault detection of such\nsensors. In Section 5, we demonstrate our proposed solutions on a system that performs automatic\nsequencing of sleep stages based on electroencephalogram (EEG) data \u2013 the outputs of such an\nautomatic system are exactly posteriors over the different sleep stages [8].\nAnother important application of the inverse \ufb01ltering problem arises in electronic warfare and cyber-\nphysical security. How can one determine how accurate an enemy\u2019s sensors are? In such problems,\nthe state of the underlying Markov chain is usually known (a probing sequence), and one observes\nactions taken by the enemy which are based on \ufb01ltered posterior distributions. The aim is to estimate\nthe observation likelihood probabilities of the enemy, i.e., determine how accurate its sensors are.\n\nOur contributions\nIt is possible to obtain a solution to the inverse \ufb01ltering problem for HMMs by\nemploying a brute-force approach (see Section 2.3) \u2013 essentially by testing observations from the\nalphabet, and at the same time \ufb01nding system parameters consistent with the data. However, this\nleads to a computationally expensive combinatorial optimization problem. Instead, we demonstrate\nin this paper an ef\ufb01cient solution based on linear algebra by exploiting the inherent structure of the\nproblem and the HMM \ufb01lter. In particular, the contributions of this paper are three-fold:\n\n1. We propose analytical solutions to three inverse \ufb01ltering problems for HMMs that avoid\ncomputationally expensive mixed integer linear program (MILP) formulations. Moreover,\nwe establish theorems guaranteeing unique identi\ufb01ability.\n\n2. 
We consider the setting where the output of the HMM \ufb01lter is corrupted by noise, and\n\npropose an inverse \ufb01ltering algorithm based on clustering.\n\n3. We evaluate the algorithm on real-world data for automatic segmentation of the sleep cycle.\n\nRelated work There are only two known cases where the optimal \ufb01lter allows a \ufb01nite dimensional\ncharacterization: the HMM \ufb01lter for (discrete) HMMs, and the Kalman \ufb01lter [9, 10] for linear\nGaussian state-space models. Inverse \ufb01ltering problems for the Kalman \ufb01lter have been considered\nin, e.g., [5, 6, 10], however, inverse \ufb01ltering for HMMs has, to the best knowledge of the authors,\nreceived much less attention.\nThe inverse \ufb01ltering problem has connections to a number of other inverse problems in various \ufb01elds.\nFor example, in control theory, the fundamental inverse optimal control problem, whose formulation\ndates back to 1964 [11], studies the question: given a system and a policy, for what cost criteria is the\npolicy optimal? In microeconomic theory, the related problem of revealed preferences [12] asks the\nquestion: given a set of decisions made by an agent, is it possible to determine if a utility is being\nmaximized, and if so, which?\nIn machine learning, there are clear connections to, e.g., apprenticeship learning, imitation learning\nand inverse reinforcement learning, see, e.g., [13\u201317], which recently have received much attention.\nIn these, the reward function of a Markov decision process (MDP) is learned by observing an expert\ndemonstrating the task that an agent wants to learn to perform.\nThe key difference between these works and our work is the set of system parameters we aim to learn.\n\n2\n\n\f2 Preliminaries\n\nIn this section, we formulate the inverse \ufb01ltering problems, discuss how these can be solved using\ncombinatorial optimization, and state our assumptions formally. 
With regards to notation, all vectors are column vectors, unless transposed. The vector 1 is the vector of all ones. The superscript † denotes the Moore–Penrose pseudoinverse.

2.1 Hidden Markov models (HMMs) and the HMM filter

We consider a discrete-time finite observation-alphabet HMM. Denote its state at time k as x_k ∈ {1, . . . , X} and the corresponding observation y_k ∈ {1, . . . , Y}. The underlying Markov chain x_k evolves according to the row-stochastic transition probability matrix P ∈ R^{X×X}, where [P]_{ij} = Pr[x_{k+1} = j | x_k = i]. The initial state x_0 is sampled from the probability distribution π_0 ∈ R^X, where [π_0]_i = Pr[x_0 = i]. The noisy observations of the underlying Markov chain are obtained from the row-stochastic observation likelihood matrix B ∈ R^{X×Y}, where [B]_{ij} = Pr[y_k = j | x_k = i] are the observation likelihood probabilities. We denote the columns of the observation likelihood matrix as {b_i}_{i=1}^Y, i.e., B = [b_1 . . . b_Y].

In the classical stochastic filtering problem, the aim is to compute the posterior distribution π_k ∈ R^X of the latent state (Markov chain, in our case) at time k, given observations from the system up to time k. The HMM filter [1, 4] computes these posteriors via the following recursive update:

    π_k = B_{y_k} P^T π_{k-1} / (1^T B_{y_k} P^T π_{k-1}),    (1)

initialized by π_0, where [π_k]_i = Pr[x_k = i | y_1, . . . , y_k] is the posterior distribution at time k, B_{y_k} = diag(b_{y_k}) ∈ R^{X×X}, and {y_k}_{k=1}^N is a set of observations.

2.2 Inverse HMM filtering problem formulations

The inverse filtering problem for HMMs is not a single problem – multiple variants can be formulated depending on what information is available a priori.
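For concreteness, one update of the recursion (1) can be sketched in a few lines of NumPy (the toy matrices below are illustrative values, not from the paper):

```python
import numpy as np

def hmm_filter_step(pi_prev, P, B, y):
    """One step of the HMM filter (1): pi_k is proportional to B_{y_k} P^T pi_{k-1}."""
    unnormalized = B[:, y] * (P.T @ pi_prev)   # B_y = diag(b_y), so this is B_y P^T pi_{k-1}
    return unnormalized / unnormalized.sum()   # divide by 1^T B_y P^T pi_{k-1}

# Toy two-state, two-observation HMM (illustrative values).
P = np.array([[0.9, 0.1],
              [0.2, 0.8]])   # row-stochastic transition matrix
B = np.array([[0.7, 0.3],
              [0.4, 0.6]])   # row-stochastic observation likelihoods
pi0 = np.array([0.5, 0.5])

pi1 = hmm_filter_step(pi0, P, B, y=0)   # posterior after observing y_1 = 0
```

Note that the denominator is a scalar, so the update is simply an elementwise product followed by a renormalization.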
We pose and consider a number of variations of increasing levels of generality depending on what data we can extract from the sensor system. To restrict the scope of the paper, we assume throughout that the transition matrix P is known, and is the same in both the system and the HMM filter (i.e., we do not consider mismatched HMM filtering problems). Formally, the inverse filtering problems considered in this paper are as follows:

Problem 1 (Inverse filtering problem with unknown observations). Consider the known data D = {P, B, {π_k}_{k=0}^N}, where the posteriors have been generated by an HMM-filter sensor. Reconstruct the observations {y_k}_{k=1}^N.

Problem 2 (Inverse filtering problem with unknown sensor). Consider the known data D = {P, {y_k}_{k=1}^N, {π_k}_{k=0}^N}, where the posteriors have been generated by an HMM-filter sensor. Reconstruct the observation likelihood matrix B.

Combining these two formulations yields the general problem:

Problem 3 (Inverse filtering problem with unknown sensor and observations). Consider the known data D = {P, {π_k}_{k=0}^N}, where the posteriors have been generated by an HMM-filter sensor. Reconstruct the observations {y_k}_{k=1}^N and the observation likelihood matrix B.

Finally, we consider the more general setting where the posteriors we obtain are corrupted by noise (due to, e.g., quantization, measurement or model uncertainties). In particular, we consider the case where the following sequence of noisy posteriors is obtained over time:

    π̃_k = π_k + noise,    (2)

from the sensor system. We state directly the generalization of Problem 3 (the corresponding generalizations of Problems 1 and 2 follow as special cases):

Problem 4 (Noise-corrupted inverse filtering problem with unknown sensor and observations). Consider the data D = {P, {π̃_k}_{k=0}^N}, where the posteriors π_k have been generated by an HMM-filter sensor, but we obtain noise-corrupted measurements π̃_k. Estimate the observations {y_k}_{k=1}^N and the observation likelihood matrix B.

3

\f2.3 Inverse filtering as an optimization problem

It is possible to formulate Problems 1-4 as optimization problems of increasing levels of generality. As a first step, rewrite the HMM filter equation (1) as:¹

    (1) ⟺ (b_{y_k}^T P^T π_{k-1}) π_k = diag(b_{y_k}) P^T π_{k-1}.    (3)

In Problem 3 we need to find what observation occurred at each time instant (a combinatorial problem), and at the same time reconstruct an observation likelihood matrix consistent with the data. To be consistent with the data, equation (3) has to be satisfied. This feasibility problem can be formulated as the following mixed-integer linear program (MILP):

    min_{{y_k}_{k=1}^N, {b_i}_{i=1}^Y}  Σ_{k=1}^N ‖(b_{y_k}^T P^T π_{k-1}) π_k − diag(b_{y_k}) P^T π_{k-1}‖_∞
    s.t.  y_k ∈ {1, . . . , Y},  for k = 1, . . . , N,
          b_i ≥ 0,  for i = 1, . . . , Y,
          [b_1 . . . b_Y] 1 = 1,    (4)

where the choice of norm is arbitrary since for noise-free data it is possible to exactly fit observations and an observation likelihood matrix. In Problem 1, the b_i:s are dropped as optimization variables and the problem reduces to an integer program (IP). In Problem 2, where the sequence of observations is known, the problem reduces to a linear program (LP).

Despite the ease of formulation, the down-side of this approach is that, even though Problems 1 and 2 are computationally tractable, the MILP formulation of Problem 3 can become computationally very expensive for larger data sets.
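As a numerical sanity check of identity (3), the following NumPy sketch (toy HMM, illustrative values not from the paper) simulates a noise-free filtering run and verifies that (3) holds to machine precision at every step:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy HMM (illustrative values) and a simulated, noise-free filtering run.
P = np.array([[0.8, 0.2],
              [0.3, 0.7]])
B = np.array([[0.7, 0.3],
              [0.2, 0.8]])
x, pi = 0, np.array([0.5, 0.5])
obs, posts = [], [pi]
for _ in range(20):
    x = rng.choice(2, p=P[x])          # Markov chain step
    y = rng.choice(2, p=B[x])          # noisy observation
    u = B[:, y] * (P.T @ pi)           # HMM filter (1)
    pi = u / u.sum()
    obs.append(int(y))
    posts.append(pi)

# Check the multiplied-out identity (3): (b_y^T P^T pi_{k-1}) pi_k = diag(b_y) P^T pi_{k-1}.
residuals = []
for k in range(1, len(posts)):
    b = B[:, obs[k - 1]]
    pred = P.T @ posts[k - 1]
    residuals.append(np.max(np.abs((b @ pred) * posts[k] - b * pred)))
```

The residuals vanish up to floating-point rounding, which is exactly the property the MILP (4) exploits as its feasibility criterion.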
In the following sections, we will outline how the problems can be solved efficiently by exploiting the structure of the HMM filter.

2.4 Assumptions

Before providing solutions to Problems 1-4, we state the assumptions that the HMMs in this paper need to satisfy to guarantee unique solutions. The first assumption serves as a proxy for ergodicity of the HMM and the HMM filter – it is a common assumption in statistical inference for HMMs [18, 4].

Assumption 1 (Ergodicity). The transition matrix P and the observation matrix B are elementwise (strictly) positive.

The second assumption is a natural rank assumption on the observation likelihoods. The assumption says that the conditional distribution of any observation is not a linear combination of the conditional distributions of any other observations.

Assumption 2 (Distinguishable observation likelihoods). The observation likelihood matrix B is full column rank.

We will see that this assumption can be relaxed to the following assumption in problems where only the sequence of observations is to be reconstructed:

Assumption 3 (Non-parallel observation likelihoods). No pair of columns of the observation likelihood matrix B is colinear, i.e., b_i ≠ κb_j for any real number κ and any i ≠ j.

Without Assumption 3, it is impossible to distinguish between observation i and observation j. Note also that Assumption 2 implies Assumption 3.

3 Solution to the inverse filtering problem for HMMs in absence of noise

In this section, we detail our solutions to Problems 1-3. We first provide the following two useful lemmas that will be key to the solutions for Problems 1-4. They give an alternative characterization of the HMM-filter update equation. (Note that all proofs are in the supplementary material.)

¹Multiplication by the denominator is allowed under Assumption 1 – see below.

4

\fLemma 1.
The HMM-filter update equation (3) can equivalently be written

    (π_k (P^T π_{k-1})^T − diag(P^T π_{k-1})) b_{y_k} = 0.    (5)

The second lemma characterizes the solutions to (5).

Lemma 2. Under Assumption 1, the nullspace of the X × X matrix

    π_k (P^T π_{k-1})^T − diag(P^T π_{k-1})    (6)

is of dimension one for k > 1.

3.1 Solution to the inverse filtering problem with unknown observations

In the formulation of Problem 1, we assumed that the observation likelihoods B were known, and aimed to reconstruct the sequence of observations from the posterior data. Equation (5) constrains which columns of the observation matrix B that are consistent with the update of the posterior vector at each time instant. Formally, any sequence

    ŷ_k ∈ { y ∈ {1, . . . , Y} : (π_k (P^T π_{k-1})^T − diag(P^T π_{k-1})) b_y = 0 },    (7)

for k = 1, . . . , N, is consistent with the HMM filter posterior updates. (Recall that b_y denotes column y of the observation matrix B.) Since the problems (7) are decoupled in time k, they can trivially be solved in parallel.

Theorem 1. Under Assumptions 1 and 3, the set in the right-hand side of equation (7) is a singleton, and is equal to the true observation, i.e.,

    ŷ_k = y_k,    (8)

for k > 1.

3.2 Solution to the inverse filtering problem with unknown sensor

The second inverse filtering problem we consider is when the sequence of observations is known, but the observation likelihoods B are unknown (Problem 2). This problem can be solved by exploiting Lemmas 1 and 2.

Computing a basis for the nullspace of the coefficient matrix in formulation (5) of the HMM filter recovers, according to Lemmas 1 and 2, the direction of one column of B. In particular, the direction of the column corresponding to observation y_k, i.e., b_{y_k}.
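Equation (7) suggests a direct recipe for Problem 1: at each time instant, test which column of B is (numerically) annihilated by the coefficient matrix (6). A minimal NumPy sketch, on a toy HMM with illustrative values (not from the paper):

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy HMM satisfying Assumptions 1 and 3 (illustrative values).
P = np.array([[0.8, 0.2],
              [0.3, 0.7]])
B = np.array([[0.7, 0.3],
              [0.2, 0.8]])
x, pi = 0, np.array([0.5, 0.5])
obs, posts = [], [pi]
for _ in range(50):
    x = rng.choice(2, p=P[x])
    y = rng.choice(2, p=B[x])
    u = B[:, y] * (P.T @ pi)      # HMM filter (1)
    pi = u / u.sum()
    obs.append(int(y))
    posts.append(pi)

# Inverse filtering via (7): the true column b_{y_k} lies in the nullspace of the
# coefficient matrix (6); pick the column of B with the smallest residual.
recovered = []
for k in range(1, len(posts)):
    pred = P.T @ posts[k - 1]
    M = np.outer(posts[k], pred) - np.diag(pred)
    recovered.append(int(np.argmin(np.linalg.norm(M @ B, axis=0))))
```

Per Theorem 1, the recovered sequence matches the true observations for k > 1 (and the per-time problems are embarrassingly parallel).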
From such basis vectors, we can construct a matrix C ∈ R^{X×Y} where the yth column is aligned with b_y. Note that to be able to fully construct this matrix, every observation from the set {1, . . . , Y} needs to have been observed at least once. Due to being basis vectors for nullspaces, the columns of C are only determined up to scalings, so we need to exploit the structure of the observation matrix B to properly normalize them. To form an estimate B̂ from C, we employ that the observation likelihood matrix is row-stochastic. This means that we should rescale each column:

    B̂ = C diag(α)    (9)

for some α ∈ R^Y, such that B̂1 = 1. Details are provided in the following theorem.

Theorem 2. If Assumption 1 holds, and every possible observation has been observed (i.e., that {1, . . . , Y} ⊂ {y_k}_{k=1}^N), then:

    i) there exists α ∈ R^Y such that B̂ = B,
    ii) if Assumption 2 holds, then the choice of α is unique, and B̂ is equal to B. In particular, α = C†1.

5

\f3.3 Solution to the inverse filtering problem with unknown sensor and observations

Finally, we turn to the general formulation in which we consider the combination of the previous two problems: both the sequence of observations and the observation likelihoods are unknown (Problem 3). Again, the solution follows from Lemmas 1 and 2. Note that there will be a degree of freedom since we can arbitrarily relabel each observation and correspondingly permute the columns of the observation likelihood matrix.

As in the solution to Problem 2, computing a basis vector, say c̄_k, for the nullspace of the coefficient matrix in equation (5) recovers the direction of one column of the B matrix. However, since the sequence of observations is unknown, we do not know which column. To circumvent this, we concatenate such basis vectors in a matrix²

    C̄ = [c̄_2 . . . c̄_N] ∈ R^{X×(N−1)}.    (10)

For sufficiently large N – essentially when every possible observation has been processed by the HMM filter – the matrix C̄ in (10) will contain Y columns out of which no pair is colinear (due to Assumption 3). All the columns that are parallel correspond to one particular observation. Let {σ_1, . . . , σ_Y} be the indices of Y such columns, and construct

    C = C̄Σ    (11)

using the selection matrix

    Σ = [e_{σ_1} . . . e_{σ_Y}] ∈ R^{(N−1)×Y},    (12)

where e_i is the ith Cartesian basis vector.

Lemma 3. Under Assumption 1 and Assumption 3, the expected number of samples needed to be able to construct the selection matrix Σ is upper-bounded by

    β^{−1} (1 + 1/2 + · · · + 1/Y),    (13)

where B ≥ β > 0 elementwise.

With C constructed in (11), we have obtained the direction of each column of the observation matrix. However, as before, they need to be properly normalized. For this, we exploit the sum-to-one property of the observation matrix as in the previous section. Let

    B̂ = C diag(α),    (14)

for α ∈ R^Y, such that B̂1 = 1. Details on how to find α are provided in the theorem below. This solves the first part of the problem, i.e., reconstructing the observation matrix. Secondly, to recover the sequence of observations, take

    ŷ_k ∈ { y ∈ {1, . . . , Y} : b̂_y = κc̄_k for some real number κ },    (15)

for k > 1. In words: check which columns of B̂ that the nullspace of the HMM filter coefficient matrix (6) is colinear with at each time instant.

Theorem 3.
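The pipeline of this section – nullspace basis vectors, selection of Y mutually non-colinear columns, and normalization via α = C†1 – can be sketched in NumPy as follows (toy HMM with illustrative values; the colinearity threshold 0.999 is an arbitrary choice for the example, not from the paper):

```python
import numpy as np

rng = np.random.default_rng(1)

# Toy HMM (illustrative values); simulate noise-free posteriors as in Section 2.
P = np.array([[0.8, 0.2],
              [0.3, 0.7]])
B = np.array([[0.7, 0.3],
              [0.2, 0.8]])
X, Y = B.shape
x, pi = 0, np.array([0.5, 0.5])
posts = [pi]
for _ in range(50):
    x = rng.choice(X, p=P[x])
    y = rng.choice(Y, p=B[x])
    u = B[:, y] * (P.T @ pi)
    pi = u / u.sum()
    posts.append(pi)

def cosine(a, b):
    return abs(a @ b) / (np.linalg.norm(a) * np.linalg.norm(b))

# Nullspace basis vectors c_bar_k (k >= 2), via the smallest right singular vector.
cbars = []
for k in range(2, len(posts)):
    pred = P.T @ posts[k - 1]
    M = np.outer(posts[k], pred) - np.diag(pred)
    c = np.linalg.svd(M)[2][-1]
    cbars.append(c * np.sign(c.sum()))   # fix the sign: b_y has positive entries

# Select Y mutually non-colinear columns (the matrix C of (11)).
selected = [cbars[0]]
for c in cbars[1:]:
    if all(cosine(c, d) < 0.999 for d in selected):
        selected.append(c)
    if len(selected) == Y:
        break
C = np.array(selected).T

# Normalize so that B_hat is row-stochastic (Theorem 3): alpha = C^dagger 1.
alpha = np.linalg.pinv(C) @ np.ones(X)
B_hat = C @ np.diag(alpha)
```

As Theorem 3 states, B_hat recovers B only up to a permutation of its columns, reflecting the arbitrary relabeling of observations.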
If Assumptions 1 and 3 hold, and the number of samples N is sufficiently large – see Lemma 3 – then:

    i) there exists α ∈ R^Y in equation (14) such that B̂ = BP, where P is a permutation matrix.
    ii) the set on the right-hand side of equation (15) is a singleton. Moreover, the reconstructed observations ŷ_k are, up to relabellings corresponding to P, equal to the true observations y_k.
    iii) if Assumption 2 holds, then the choice of α is unique, and B̂ = BP. In particular, α = C†1.

²We start with c̄_2, since we make no assumption on the positivity of π_0 – see the proof of Lemma 2.

6

\f4 Solution to the inverse filtering problem for HMMs in presence of noise

In this section, we discuss the more general setting where the posteriors obtained from the sensor system are corrupted by noise. We will see that this problem naturally fits in a clustering framework since every posterior update will provide us with a noisy estimate of the direction of one column of the observation likelihood matrix. We consider an additive noise model of the following form:

Assumption 4 (Noise model). The posteriors are corrupted by additive noise w_k:

    π̃_k = π_k + w_k,    (16)

such that 1^T π̃_k = 1 and π̃_k > 0.

This noise model is valid, for example, when each observed posterior vector has been subsequently renormalized after noise that originates from quantization or measurement errors has been added.

In the solution proposed in Section 3.3 for the noise-free case, the matrix C̄ in equation (10) was constructed by concatenating basis vectors for the nullspaces of the coefficient matrix in equation (5). With perturbed posterior vectors, the corresponding system of equations becomes

    (π̃_k (P^T π̃_{k-1})^T − diag(P^T π̃_{k-1})) c̃_k = 0,    (17)

where c̃_k is now a perturbed (and scaled) version of b_{y_k}. That this equation is valid is guaranteed by the generalization of Lemma 2:

Lemma 4. Under Assumptions 1 and 4, the nullspace of the matrix

    π̃_k (P^T π̃_{k-1})^T − diag(P^T π̃_{k-1})    (18)

is of dimension one for k > 1.

Remark 1. In case Assumption 4 does not hold, the problem can instead be interpreted as a perturbed eigenvector problem. The vector c̃_k should then be taken as the eigenvector corresponding to the smallest eigenvalue.

Lemma 4 says that we can construct a matrix C̃ (analogous to C̄ in Section 3.3) by concatenating the basis vectors from the one-dimensional nullspaces in (17). Due to the perturbations, every solution to equation (17) will be a perturbed version of the solution to the corresponding noise-free version of the equation. This means that it will not be possible to construct a selection matrix Σ as was done for C̄ in equation (12). However, because there are only Y unique solutions to the noise-free equations (5), it is natural to circumvent this (assuming that the perturbations are small) by clustering the columns of C̃ into Y clusters. As the columns of C̃ are only unique up to scaling, the clustering has to be performed with respect to their angular separations (using, e.g., the spherical k-means algorithm [19]).

Let C ∈ R^{X×Y} be the matrix of the Y centroids resulting from running a clustering algorithm on the columns of C̃. Each centroid can be interpreted as a noisy estimate of one column of the observation likelihood matrix. To obtain a properly normalized estimate of the observation likelihood matrix, we take

    B̂ = CA,    (19)

where A ∈ R^{Y×Y}.
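As a numerical illustration of the robustness underlying this clustering step (toy HMM with illustrative values; the noise level and clipping are arbitrary choices for the example), each perturbed nullspace vector c̃_k, computed as the right singular vector for the smallest singular value, stays closely aligned with the true column b_{y_k}:

```python
import numpy as np

rng = np.random.default_rng(2)

# Toy HMM (illustrative values); simulate posteriors, then corrupt and
# renormalize them in the spirit of Assumption 4.
P = np.array([[0.8, 0.2],
              [0.3, 0.7]])
B = np.array([[0.7, 0.3],
              [0.2, 0.8]])
x, pi = 0, np.array([0.5, 0.5])
obs, posts = [], [pi]
for _ in range(50):
    x = rng.choice(2, p=P[x])
    y = rng.choice(2, p=B[x])
    u = B[:, y] * (P.T @ pi)
    pi = u / u.sum()
    obs.append(int(y))
    posts.append(pi)

noisy = []
for p in posts:
    q = np.clip(p + 1e-4 * rng.standard_normal(2), 1e-9, None)
    noisy.append(q / q.sum())            # renormalized: sums to one, positive

# Perturbed nullspace vectors c_tilde_k of (17), and their alignment with b_{y_k}.
sims = []
for k in range(2, len(noisy)):
    pred = P.T @ noisy[k - 1]
    M = np.outer(noisy[k], pred) - np.diag(pred)
    c = np.linalg.svd(M)[2][-1]          # smallest singular direction
    b = B[:, obs[k - 1]]
    sims.append(abs(c @ b) / (np.linalg.norm(c) * np.linalg.norm(b)))
```

With small perturbations, the cosine similarities stay near one, which is why clustering the c̃_k by angular separation recovers the columns of B.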
Note that, since C now contains noisy estimates of the directions of the columns of the observation likelihood matrix, we are not certain to be able to properly normalize it by purely rescaling each column (i.e., taking A to be a diagonal matrix as was done in Sections 3.2 and 3.3). A logical choice is the solution to the following LP,

    min_{A ∈ R^{Y×Y}}  max_{i≠j} |[A]_{ij}|
    s.t.  CA ≥ 0,
          CA1 = 1,    (20)

which tries to minimize the off-diagonal elements of A. The resulting rescaling matrix A guarantees that B̂ = CA is a proper stochastic matrix (non-negative and has row-sum equal to one), as well as that the discrepancy between the directions of the columns of C and B̂ is minimized.

The second part of the problem – reconstructing the sequence of observations – follows naturally from the clustering algorithm; an estimate of the sequence is obtained by checking to what cluster the solution c̃_k of equation (17) belongs in for each time instant.

7

\f5 Experimental results for sleep segmentation

In this section, we illustrate the inverse filtering problem on real-world data.

Background Roughly one third of a person’s life is spent sleeping. Sleep disorders are becoming more prevalent and, as public awareness has increased, the usage of sleep trackers is becoming wide-spread. The example below illustrates how the inverse filtering formulation and associated algorithms can be used as a step in real-time diagnosis of failure of sleep-tracking medical equipment.

During the course of sleep, a human transitions through five different sleep stages [20]: wake, S1, S2, slow wave sleep (SWS) and rapid eye movement (REM). An important part of sleep analysis is obtaining a patient’s evolution over these sleep stages.
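The LP (20) can be put in standard form by introducing an epigraph variable t for the objective max_{i≠j} |[A]_{ij}|. A sketch using SciPy's linprog (assuming SciPy is available; here C is a toy noise-free matrix whose columns are arbitrarily rescaled columns of an illustrative B, so the optimal A should come out diagonal and recover B exactly):

```python
import numpy as np
from scipy.optimize import linprog

# Toy setup (illustrative): C holds scaled column-direction estimates of B.
B = np.array([[0.7, 0.3],
              [0.2, 0.8]])
C = B * np.array([2.0, 0.5])    # columns of B, arbitrarily rescaled
X, Y = C.shape

# Variables: vec(A) (row-major, Y*Y entries) followed by the epigraph variable t.
n = Y * Y
c_obj = np.zeros(n + 1)
c_obj[-1] = 1.0                 # minimize t = max_{i != j} |A_ij|

A_ub, b_ub = [], []
for i in range(Y):
    for j in range(Y):
        if i != j:              # |A_ij| <= t, as two linear constraints
            for s in (1.0, -1.0):
                row = np.zeros(n + 1)
                row[i * Y + j] = s
                row[-1] = -1.0
                A_ub.append(row)
                b_ub.append(0.0)
for x in range(X):              # CA >= 0 elementwise: -(CA)_{xj} <= 0
    for j in range(Y):
        row = np.zeros(n + 1)
        for m in range(Y):
            row[m * Y + j] = -C[x, m]
        A_ub.append(row)
        b_ub.append(0.0)

A_eq, b_eq = [], []             # CA1 = 1, row by row
for x in range(X):
    row = np.zeros(n + 1)
    for m in range(Y):
        row[m * Y:m * Y + Y] = C[x, m]
    A_eq.append(row)
    b_eq.append(1.0)

res = linprog(c_obj, A_ub=np.array(A_ub), b_ub=b_ub,
              A_eq=np.array(A_eq), b_eq=b_eq,
              bounds=[(None, None)] * n + [(0, None)])
A_mat = res.x[:n].reshape(Y, Y)
B_hat = C @ A_mat
```

In this noise-free toy case the optimum attains t = 0, forcing A to be diagonal; with noisy centroids the off-diagonal entries absorb the residual misalignment.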
Manual sequencing from all-night polysomnographic (PSG) recordings (including, e.g., electroencephalogram (EEG) readings) can be performed according to the Rechtschaffen and Kales (R&K) rules by well-trained experts [8, 20]. However, this is costly and laborious, so several works, e.g., [8, 20, 21], propose automatic sequencing based on HMMs. These systems usually output a posterior distribution over the sleep stages, or provide a Viterbi path.

A malfunction of such an automatic system could have problematic consequences since medical decisions would be based on faulty information. The inverse filtering problem arises naturally for such reasons of fault-detection. Knowledge of the transition matrix can be assumed, since it is possible to obtain, from public sources, manually labeled data from which an estimate of P can be computed.

Setup A version of the automatic sleep-staging system in [8, 20] was implemented. The mean frequency over the 0-30 Hz band of the EEG (over C3-A2 or C4-A1, according to the international 10-20 system) was used as observations. These readings were encoded to five symbols using a vector-quantization based codebook. The model was trained on data from nine patients in the PhysioNet CAP Sleep Database [22, 23]. The model was then evaluated on another patient – see Fig. 1 – over one full night of sleep. The manually labeled stages according to the R&K rules are dash-marked in the figure. To summarize the resulting posterior distributions over the sleep stages, we plot the mean state estimate when equidistant numbers have been assigned to each state.

For the inverse filtering, the full posterior vectors were elementwise corrupted by Gaussian noise of standard deviation σ, and projected back to the simplex (to ensure a valid posterior probability vector) – simulating a noisy reading from the automatic system. A total of one hundred noise realizations were simulated.
The noise can be a manifestation of measurement or quantization noise in the sensor system, or noise related to model uncertainties (in this case, an error in the transition probability matrix P).

Results After permuting the labels of the observations, the error in the reconstructed observation likelihood matrix and the fraction of correctly reconstructed observations were computed. This is illustrated in Fig. 2. For the 1030 quantized EEG samples from the patient, the entire procedure takes less than one second on a 2.0 GHz Intel Core 2 Duo processor system.

Figure 1: One night of sleep in which polysomnographic (PSG) observation data has been manually processed by an expert sleep analyst according to the R&K rules to obtain the sleep stages (dashed). The posterior distribution over the sleep stages, resulting from an automatic sleep-staging system, has been summarized to a mean state estimate. [Axes: sleep stage (WAKE, S1, S2, SWS, REM) versus hours since bedtime, 0-8.]

8

\fFigure 2: Result of inverse filtering for various noise standard deviations σ. The vector of posterior probabilities is perturbed elementwise with Gaussian noise. Right: Error in the recovered observation likelihood matrix, min_P ‖B̂ − BP‖_F, after permuting the columns to find the best match to the true matrix. Left: Fraction of correctly recovered observations. As the signal-to-noise ratio increases, the inverse filtering algorithm successfully reconstructs the sequence of observations and estimates the observation likelihoods. [Both panels: noise σ from 10^-8 to 10^0 on a logarithmic axis.]

From Fig.
2, we can see that as the variance of the noise decreases, the left-hand side of equation (17) converges to that of equation (5) and the true quantities are recovered. At the other extreme, as the signal-to-noise ratio becomes small, the fraction of correctly estimated observations tends to that of uniformly random guessing, 1/Y = 0.2. This is because the clusters in C̃ become heavily intertwined. The discontinuous nature of the solution of the clustering algorithm is apparent from the plateau-like behaviour in the middle of the scale – a few observations linger on the edge of being assigned to the correct clusters.

In conclusion, the results show that it is possible to estimate the observation sequence processed by the automatic sleep-staging system, as well as its sensor's specifications. This is an important step in performing fault detection for such a device: for example, using several nights of data, change detection can be performed on the observation likelihoods to detect whether the sleep monitoring device has failed.

6 Conclusions

In this paper, we have considered several inverse filtering problems for HMMs. Given posteriors from an HMM filter (or, more generally, noisily observed posteriors), the aim was to reconstruct the observation likelihoods and also the sample path of observations. It was shown that a computationally expensive solution based on combinatorial optimization can be avoided by exploiting the algebraic structure of the HMM filter. We provided solutions to the inverse filtering problems, as well as theorems guaranteeing unique identifiability. The more general case of noise-corrupted posteriors was also considered. A solution based on clustering was proposed and evaluated on real-world data from a system for automatic sleep-staging based on EEG readings.

In the future, it would be interesting to consider other variations and generalizations of inverse filtering.
For example, the case where the system dynamics are unknown and need to be estimated, or where only actions based on the filtered distribution can be observed.

Acknowledgments

This work was partially supported by the Swedish Research Council under contract 2016-06079, the U.S. Army Research Office under grant 12346080, and the National Science Foundation under grant 1714180. The authors would like to thank Alexandre Proutiere for helpful comments during the preparation of this work.

References

[1] V. Krishnamurthy, Partially Observed Markov Decision Processes. Cambridge, UK: Cambridge University Press, 2016.

[2] L. Rabiner, "A tutorial on hidden Markov models and selected applications in speech recognition," Proceedings of the IEEE, vol. 77, pp. 257–286, Feb. 1989.

[3] R. J. Elliott, J. B. Moore, and L. Aggoun, Hidden Markov Models: Estimation and Control. New York, NY: Springer, 1995.

[4] O. Cappé, E. Moulines, and T. Rydén, Inference in Hidden Markov Models. New York, NY: Springer, 2005.

[5] F. Gustafsson, Adaptive Filtering and Change Detection. New York: Wiley, 2000.

[6] J. Chen and R. J. Patton, Robust Model-Based Fault Diagnosis for Dynamic Systems. Boston, MA: Springer, 1999.

[7] A. Caplin and M. Dean, "Revealed preference, rational inattention, and costly information acquisition," The American Economic Review, vol. 105, no. 7, pp. 2183–2203, 2015.

[8] A. Flexer, G. Dorffner, P. Sykacek, and I. Rezek, "An automatic, continuous and probabilistic sleep stager based on a hidden Markov model," Applied Artificial Intelligence, vol. 16, pp. 199–207, Mar. 2002.

[9] D. Koller and N. Friedman, Probabilistic Graphical Models: Principles and Techniques. Cambridge, MA: MIT Press, 2009.

[10] B. Anderson and J. Moore, Optimal Filtering. Englewood Cliffs, NJ: Prentice-Hall, 1979.

[11] R. E.
Kalman, "When is a linear control system optimal?," Journal of Basic Engineering, vol. 86, no. 1, pp. 51–60, 1964.

[12] H. R. Varian, Microeconomic Analysis. New York: Norton, 3rd ed., 1992.

[13] D. Hadfield-Menell, S. J. Russell, P. Abbeel, and A. Dragan, "Cooperative inverse reinforcement learning," in Advances in Neural Information Processing Systems, 2016.

[14] J. Choi and K.-E. Kim, "Nonparametric Bayesian inverse reinforcement learning for multiple reward functions," in Advances in Neural Information Processing Systems, 2012.

[15] E. Klein, M. Geist, B. Piot, and O. Pietquin, "Inverse reinforcement learning through structured classification," in Advances in Neural Information Processing Systems, 2012.

[16] S. Levine, Z. Popovic, and V. Koltun, "Nonlinear inverse reinforcement learning with Gaussian processes," in Advances in Neural Information Processing Systems, 2011.

[17] A. Ng and S. Russell, "Algorithms for inverse reinforcement learning," in Proceedings of the 17th International Conference on Machine Learning (ICML'00), pp. 663–670, 2000.

[18] L. E. Baum and T. Petrie, "Statistical inference for probabilistic functions of finite state Markov chains," The Annals of Mathematical Statistics, vol. 37, no. 6, pp. 1554–1563, 1966.

[19] C. Buchta, M. Kober, I. Feinerer, and K. Hornik, "Spherical k-means clustering," Journal of Statistical Software, vol. 50, no. 10, pp. 1–22, 2012.

[20] S.-T. Pan, C.-E. Kuo, J.-H. Zeng, and S.-F. Liang, "A transition-constrained discrete hidden Markov model for automatic sleep staging," BioMedical Engineering OnLine, vol. 11, no. 1, p. 52, 2012.

[21] Y. Chen, X. Zhu, and W.
Chen, "Automatic sleep staging based on ECG signals using hidden Markov models," in Proceedings of the 37th Annual International Conference of the IEEE Engineering in Medicine and Biology Society (EMBC), pp. 530–533, 2015.

[22] A. L. Goldberger, L. A. Amaral, L. Glass, J. M. Hausdorff, P. C. Ivanov, R. G. Mark, J. E. Mietus, G. B. Moody, C.-K. Peng, and H. E. Stanley, "PhysioBank, PhysioToolkit, and PhysioNet," Circulation, vol. 101, no. 23, pp. e215–e220, 2000.

[23] M. G. Terzano, L. Parrino, A. Sherieri, R. Chervin, S. Chokroverty, C. Guilleminault, M. Hirshkowitz, M. Mahowald, H. Moldofsky, A. Rosa, et al., "Atlas, rules, and recording techniques for the scoring of cyclic alternating pattern (CAP) in human sleep," Sleep Medicine, vol. 2, no. 6, pp. 537–553, 2001.