{"title": "Global Sensitivity Analysis for MAP Inference in Graphical Models", "book": "Advances in Neural Information Processing Systems", "page_first": 2690, "page_last": 2698, "abstract": "We study the sensitivity of a MAP configuration of a discrete probabilistic graphical model with respect to perturbations of its parameters. These perturbations are global, in the sense that simultaneous perturbations of all the parameters (or any chosen subset of them) are allowed. Our main contribution is an exact algorithm that can check whether the MAP configuration is robust with respect to given perturbations. Its complexity is essentially the same as that of obtaining the MAP configuration itself, so it can be promptly used with minimal effort. We use our algorithm to identify the largest global perturbation that does not induce a change in the MAP configuration, and we successfully apply this robustness measure in two practical scenarios: the prediction of facial action units with posed images and the classification of multiple real public data sets. A strong correlation between the proposed robustness measure and accuracy is verified in both scenarios.", "full_text": "Global Sensitivity Analysis\n\nfor MAP Inference in Graphical Models\n\nJasper De Bock\n\nGhent University, SYSTeMS\n\nGhent (Belgium)\n\nCassio P. de Campos\nQueen\u2019s University\n\nBelfast (UK)\n\nAlessandro Antonucci\n\nIDSIA\n\nLugano (Switzerland)\n\njasper.debock@ugent.be\n\nc.decampos@qub.ac.uk\n\nalessandro@idsia.ch\n\nAbstract\n\nWe study the sensitivity of a MAP con\ufb01guration of a discrete probabilistic graph-\nical model with respect to perturbations of its parameters. These perturbations are\nglobal, in the sense that simultaneous perturbations of all the parameters (or any\nchosen subset of them) are allowed. Our main contribution is an exact algorithm\nthat can check whether the MAP con\ufb01guration is robust with respect to given per-\nturbations. 
Its complexity is essentially the same as that of obtaining the MAP configuration itself, so it can be promptly used with minimal effort. We use our algorithm to identify the largest global perturbation that does not induce a change in the MAP configuration, and we successfully apply this robustness measure in two practical scenarios: the prediction of facial action units with posed images and the classification of multiple real public data sets. A strong correlation between the proposed robustness measure and accuracy is verified in both scenarios.

1 Introduction

Probabilistic graphical models (PGMs) such as Markov random fields (MRFs) and Bayesian networks (BNs) are widely used as a knowledge representation tool for reasoning under uncertainty. When coping with such a PGM, it is not always practical to obtain numerical estimates of the parameters (the local probabilities of a BN or the factors of an MRF) with sufficient precision. This is true even for quantifications based on data, but it becomes especially important when eliciting the parameters from experts. An important question is therefore how precise these estimates should be to avoid a degradation in the diagnostic performance of the model. This remains important even if the accuracy can be arbitrarily refined, in order to trade it off against the relative costs. This paper is an attempt to systematically answer this question.

More specifically, we address sensitivity analysis (SA) of discrete PGMs in the case of maximum a posteriori (MAP) inferences, by which we mean the computation of the most probable configuration of some variables given an observation of all others.¹

Let us clarify the way we intend SA here, while giving a short overview of previous work on SA in PGMs. First of all, a distinction should be made between quantitative and qualitative SA.
Quantitative approaches are supposed to evaluate the effect of a perturbation of the parameters on the numerical value of a particular inference. Qualitative SA is concerned with deciding whether or not the perturbed values lead to a different decision, e.g., about the most probable configuration of the queried variable(s). Most of the previous work on SA is quantitative, focusing in particular on updating, i.e., the computation of the posterior probability of a single variable given some evidence, and mostly on BNs. After a first attempt based on a purely empirical investigation [17], a number of analytical methods based on the derivatives of the updated probability with respect to the perturbed parameters have been proposed [3, 4, 5, 11, 14]. Something similar has been done for MRFs as well [6]. To the best of our knowledge, qualitative SA has received almost no attention, with few exceptions [7, 18].

¹ Some authors refer to this problem as MPE (most probable explanation) rather than MAP.

Secondly, we distinguish between local and global SA. The former considers the effect of the perturbation of a single parameter (and of possible additional perturbations that are induced by normalization constraints), while the latter aims at more general perturbations, possibly affecting all the parameters of the PGM. Initial work on SA in PGMs considered the local approach [4, 14], while later work considered global SA as well [3, 5, 11]. Yet, for BNs, global SA has been tackled by methods whose time complexity is exponential in the number of perturbed conditional probability tables (CPTs), as they basically require the computation of all the mixed derivatives. For qualitative SA, as far as we know, only the local approach has been studied [7, 18].
This is unfortunate, as global SA might reveal stronger effects of perturbations due to synergistic effects, which might remain hidden in a local analysis.

In this paper, we study global qualitative SA in discrete PGMs for MAP inferences, thereby intending to fill the existing gap in this topic. Let us introduce it by a simple example.

Example 1. Let X1 and X2 be two Boolean variables. For each i ∈ {1, 2}, Xi takes values in {xi, ¬xi}. The following probabilistic assessments are available: P(x1) = .45, P(x2|x1) = .2, and P(x2|¬x1) = .9. This induces a complete specification of the joint probability mass function P(X1, X2). If no evidence is present, the MAP joint state is (¬x1, x2), its probability being .495. The second most probable joint state is (x1, ¬x2), whose probability is .36. We perturb the above three parameters. Given εx1 ≥ 0, we consider any assessment of P(x1) such that |P(x1) − .45| ≤ εx1. We similarly perturb P(x2|x1) with εx2|x1 and P(x2|¬x1) with εx2|¬x1. The goal is to investigate whether or not (¬x1, x2) is also the unique MAP instantiation for each P(X1, X2) consistent with the above constraints, given a maximum perturbation level of ε = .06 for each parameter. Straightforward calculations show that this is true if only one parameter is perturbed at a time. The state (¬x1, x2) remains the most probable even if two parameters are perturbed (for any pair of them). The situation is different if the perturbation level ε = .06 is applied to all three parameters simultaneously. There is a specification of the parameters consistent with the perturbations such that the MAP instantiation is (x1, ¬x2), which achieves probability .4386, corresponding to P(x1) = .51, P(x2|x1) = .14, and P(x2|¬x1) = .84. The minimum perturbation level for which this behaviour is observed is ε* = .05.
For this value, there is a single specification of the model for which (x1, ¬x2) has the same probability as (¬x1, x2), which, for this value, is the single most probable instantiation for any other specification of the model that is consistent with the perturbations.

The above example can be regarded as a qualitative SA for which the local approach is unable to identify a lack of robustness in the MAP solution, which is revealed instead by the global analysis. In the rest of the paper we develop an algorithm to efficiently detect the minimum perturbation level ε* leading to a different MAP solution. The time complexity of the algorithm is equal to that of the MAP inference in the PGM times the number of variables in the domain, that is, exponential in the treewidth of the graph in the worst case. The approach can be specialized to local SA or to any other choice of parameters on which to perform SA, thus reproducing and extending existing results. The paper is organized as follows: the problem of checking the robustness of a MAP inference is introduced in its general formulation in Section 2. The discussion is then specialized to the case of PGMs in Section 3 and applied to global SA in Section 4. Experiments with real data sets are reported in Section 5, while conclusions and outlooks are given in Section 6.

2 MAP Inference and its Robustness

We start by explaining how we intend SA for MAP inference and how this problem can be translated into an optimisation problem very similar to that used for the computation of MAP itself. For the sake of readability, but without any lack of generality, we begin by considering a single variable only; the multivariate and the conditional cases are discussed in Section 3. Consider a single variable X taking its values in a finite set Val(X).
Given a probability mass function P over X, x̃ ∈ Val(X) is said to be a MAP instantiation for P if

x̃ ∈ arg max_{x ∈ Val(X)} P(x),    (1)

which means that x̃ is the most likely value of X according to P. In principle a mass function P can have multiple (equally probable) MAP instantiations. However, in practice there will often be only one, and we then call it the unique MAP instantiation for P.

As we did in Example 1, SA can be achieved by modeling perturbations of the parameters in terms of (linear) constraints over them, which are used to define the set of all perturbed models whose mass function is consistent with these constraints. Generally speaking, we consider an arbitrary set P of candidate mass functions, one of which is the original unperturbed mass function P. The only imposed restriction is that P must be compact. This way of defining candidate models establishes a link between SA and the theory of imprecise probability, which extends the Bayesian theory of probability to cope with compact (and often convex) sets of mass functions [19].

For the MAP inference in Eq. (1), performing SA with respect to a set of candidate models P requires the identification of the instantiations that are MAP for at least one perturbed mass function, that is,

Val*(X) := { x̃ ∈ Val(X) | ∃P′ ∈ P : x̃ ∈ arg max_{x ∈ Val(X)} P′(x) }.    (2)

These instantiations are called E-admissible [15]. If the above set contains only a single MAP instantiation x̃ (which is then necessarily the unique solution of Eq. (1) as well), then we say that the model P is robust with respect to the perturbation P.

Example 2. Let X take values in Val(X) := {a, b, c, d}. Consider a perturbation P := {P1, P2} that contains only two candidate mass functions over X.
Let P1 be defined by P1(a) = .5, P1(b) = P1(c) = .2 and P1(d) = .1, and let P2 be defined by P2(b) = .35, P2(a) = P2(c) = .3 and P2(d) = .05. Then a and b are the unique MAP instantiations of P1 and P2, respectively. This implies that Val*(X) = {a, b} and that neither P1 nor P2 is robust with respect to P.

For large domains Val(X), for instance in the multivariate case, evaluating Val*(X) is a time-consuming task that is often intractable. However, if we are not interested in evaluating Val*(X), but only want to decide whether or not P is robust with respect to the perturbation described by P, more efficient methods can be used. The following theorem establishes how this decision can be reformulated as an optimisation problem that, as we are about to show in Section 3, can be solved efficiently for PGMs. Due to space constraints, the proofs are provided as supplementary material.

Theorem 1. Let X be a variable taking values in a finite set Val(X) and let P be a set of candidate mass functions over X. Let x̃ be a MAP instantiation for a mass function P ∈ P. Then x̃ is the unique MAP instantiation for every P′ ∈ P, that is, Val*(X) has cardinality one, if and only if

min_{P′ ∈ P} P′(x̃) > 0  and  max_{x ∈ Val(X)\{x̃}} max_{P′ ∈ P} P′(x)/P′(x̃) < 1,    (3)

where the first inequality should be checked first because if it fails, then the left-hand side of the second inequality is ill-defined.

3 PGMs and Efficient Robustness Verification

Let X = (X1, . . . , Xn) be a vector of variables taking values in their respective finite domains Val(X1), . . . , Val(Xn). We will use [n] as a shorthand notation for {1, . . . , n}, and similarly for other natural numbers.
For every non-empty C ⊆ [n], XC is the vector that consists of the variables Xi, i ∈ C, and takes values in Val(XC) := ×_{i∈C} Val(Xi). For C = [n] and C = {i}, we obtain X = X[n] and Xi = X{i} as important special cases. A factor φ over a vector XC is a real-valued map on Val(XC). If φ(xC) ≥ 0 for all xC ∈ Val(XC), then φ is said to be nonnegative.

Let I1, . . . , Im be a collection of index sets such that I1 ∪ · · · ∪ Im = [n] and let Φ = {φ1, . . . , φm} be a set of nonnegative factors over the vectors XI1, . . . , XIm, respectively. We say that Φ is a PGM if it induces a joint probability mass function PΦ over Val(X), defined by

PΦ(x) := (1/ZΦ) ∏_{k=1}^{m} φk(xIk) for all x ∈ Val(X),    (4)

where ZΦ := ∑_{x ∈ Val(X)} ∏_{k=1}^{m} φk(xIk) is the normalising constant, called the partition function. Since Val(X) is finite, Φ is a PGM if and only if ZΦ > 0.

3.1 MAP and Second Best MAP Inference for PGMs

If Φ is a PGM then, by merging Eqs. (1) and (4), we see that x̃ ∈ Val(X) is a MAP instantiation for PΦ if and only if

∏_{k=1}^{m} φk(xIk) ≤ ∏_{k=1}^{m} φk(x̃Ik) for all x ∈ Val(X),

where x̃Ik is the unique element of Val(XIk) that is consistent with x̃, and likewise for xIk and x. Similarly, x(2) ∈ Val(X) is said to be a second best MAP instantiation for PΦ if and only if there is a MAP instantiation x(1) for PΦ such that x(1) ≠ x(2) and

∏_{k=1}^{m} φk(xIk) ≤ ∏_{k=1}^{m} φk(x(2)Ik) for all x ∈ Val(X) \ {x(1)}.    (5)

MAP inference in PGMs is an NP-hard task (see [12] for details). The task can be solved exactly by junction tree algorithms in time exponential in the treewidth of the network's moral graph.
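For intuition on the two definitions above: since neither Eq. (1) nor Eq. (5) involves the normalising constant ZΦ, both the best and the second best instantiation of a small PGM can be found by ranking the unnormalised products of factors. The sketch below is our own illustration by exhaustive enumeration, not the junction-tree machinery used in practice; the encoding of Example 1 as a two-factor PGM is an assumption made for the demonstration.

```python
from itertools import product

def map_and_second_best(domains, factors):
    """Brute-force first and second best instantiations of a small PGM.

    domains: one tuple of values per variable (Val(X_i));
    factors: list of (index_tuple, table) pairs, where table maps a tuple of
    values of the indexed variables to a nonnegative weight phi_k(x_{I_k}).
    The partition function is skipped: it does not change the arg max.
    """
    def score(x):
        s = 1.0
        for idx, table in factors:
            s *= table[tuple(x[i] for i in idx)]
        return s
    ranked = sorted(product(*domains), key=score, reverse=True)
    return ranked[0], ranked[1]

# Example 1 as a two-factor PGM: phi1 = P(X1), phi2 = P(X2 | X1),
# with value 0 encoding x_i and value 1 encoding "not x_i".
phi1 = {(0,): .45, (1,): .55}
phi2 = {(0, 0): .2, (0, 1): .8, (1, 0): .9, (1, 1): .1}
best, second = map_and_second_best(
    [(0, 1), (0, 1)], [((0,), phi1), ((0, 1), phi2)])
assert best == (1, 0)    # (not-x1, x2): probability .495
assert second == (0, 1)  # (x1, not-x2): probability .36
```

Exhaustive enumeration is exponential in n, which is precisely why everything is reduced here to standard MAP queries that junction-tree algorithms can answer in time exponential only in the treewidth.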
While finding the k-th best instantiation might be an even harder task [13] for general k, the second best MAP instantiation can be found by a sequence of MAP queries: (i) compute a first best MAP instantiation x̃(1); (ii) for each queried variable Xi, take the original PGM, add an extra factor for Xi that equals 1 minus the indicator of the value that Xi has in x̃(1), and run the MAP inference; (iii) report the instantiation with the highest probability among all these runs. Because the second best has to differ from the first best in at least one Xi (and this is ensured by that extra factor), this procedure is correct and in the worst case it spends time equal to a single MAP inference multiplied by the number of variables. Faster approaches that directly compute the second best MAP, without reduction to standard MAP queries, have also been proposed (see [8] for an overview).

3.2 Evaluating the Robustness of MAP Inference With Respect to a Family of PGMs

For every k ∈ [m], let ψk be a set of nonnegative factors over the vector XIk. Every combination of factors Φ = {φ1, . . . , φm} from the sets ψ1, . . . , ψm, respectively, is called a selection. Let Ψ := ×_{k=1}^{m} ψk be the set consisting of all these selections. If every selection Φ ∈ Ψ is a PGM, then Ψ is said to be a family of PGMs. We then denote the corresponding set of distributions by PΨ := {PΦ : Φ ∈ Ψ}. In the following theorem, we establish that evaluating the robustness of MAP inference with respect to this set PΨ can be reduced to a second best MAP instantiation problem.

Theorem 2. Let X = (X1, . . . , Xn) be a vector of variables taking values in their respective finite domains Val(X1), . . . , Val(Xn), let I1, . . . , Im be a collection of index sets such that I1 ∪ · · · ∪ Im = [n] and, for every k ∈ [m], let ψk be a compact set of nonnegative factors over XIk such that Ψ = ×_{k=1}^{m} ψk is a family of PGMs. Consider now a PGM Φ ∈ Ψ and a MAP instantiation x̃ for PΦ and define, for every k ∈ [m] and every xIk ∈ Val(XIk):

αk := min_{φ′k ∈ ψk} φ′k(x̃Ik)  and  βk(xIk) := max_{φ′k ∈ ψk} φ′k(xIk)/φ′k(x̃Ik).    (6)

Then x̃ is the unique MAP instantiation for every P′ ∈ PΨ if and only if

(∀k ∈ [m]) αk > 0  and  ∏_{k=1}^{m} βk(x(2)Ik) < 1,    (RMAP)

where x(2) is an arbitrary second best MAP instantiation for the distribution PΦ̃ that corresponds to the PGM Φ̃ := {β1, . . . , βm}. The first criterion in (RMAP) should be checked first because βk(x(2)Ik) is ill-defined if αk = 0.

Theorem 2 provides an algorithm to test the robustness of MAP in PGMs. From a computational point of view, checking (RMAP) can be done as described in the previous subsection, apart from the local computations appearing in Eq. (6). These local computations depend on the particular choice of perturbation. As we will see further on, many natural perturbations induce very efficient local computations (usually because they are related somehow to simple linear or convex programming problems).

In most practical situations, some variables XO, with O ⊂ [n], are observed and therefore known to be in a given configuration y ∈ Val(XO).
In this case, the MAP inference for the conditional mass function PΦ(XQ|y) should be considered, where XQ := X[n]\O are the queried variables. So far we have only discussed MAP inference (and its robustness check) over the whole set of variables of the PGM, but the standard technique employed with MRFs of including additional identity functions to encode observations suffices here, as the probability of the observation (and therefore also the value of the partition function) does not influence the result of MAP inferences. Hence, one can run the MAP inference for the PGM Φ′ augmented with local identity functions that yield y, such that ZΦ′PΦ′(XQ) = ZΦPΦ(XQ, y) (that is, the unnormalized probabilities are equal, so the MAP instantiations are equal too), and hence the very same techniques explained for the unconditional case are applicable to conditional MAP inference (and its robustness check) as well.

4 Global SA in PGMs

The most natural way to perform global SA in a PGM Φ = {φ1, . . . , φm} is by perturbing all its factors. Following the ideas introduced in Sections 2 and 3, we model the effect of the perturbation by replacing the factor φk with a compact set ψk of factors, for each k ∈ [m]. This induces a family Ψ of PGMs. The condition (RMAP) can therefore be used to decide whether or not the MAP instantiation for PΦ is the unique MAP instantiation for every P′ ∈ PΨ.
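On small domains, the resulting decision procedure can be written down directly: compute the local quantities αk and βk of Eq. (6) for the chosen perturbation, and then evaluate the two criteria of (RMAP). The sketch below is our own illustration and finds the best competitor of x̃ in the β-PGM by enumeration (in practice this would be a second best MAP query, as in Section 3.1); the data layout is an assumption.

```python
from itertools import product

def is_robust(domains, alphas, beta_factors, x_map):
    """Evaluate criterion (RMAP) by brute force.

    alphas: the values alpha_k of Eq. (6); beta_factors: (index_tuple, table)
    pairs for the factors beta_k, with beta_k(x~_{I_k}) = 1 by construction;
    x_map: the MAP instantiation x~ of the unperturbed model. Returns True
    iff x~ is the unique MAP instantiation for every model in the family.
    """
    if any(a <= 0 for a in alphas):  # first criterion: all alpha_k > 0
        return False
    def score(x):
        s = 1.0
        for idx, table in beta_factors:
            s *= table[tuple(x[i] for i in idx)]
        return s
    # x~ itself scores 1, so the best competitor is the second best of the
    # beta-PGM; robustness holds iff its score stays strictly below 1.
    competitor = max(score(x) for x in product(*domains) if x != x_map)
    return competitor < 1.0

# Toy check with a single variable: x~ = (0,), so beta(0) = 1 by construction.
assert is_robust([(0, 1)], [0.3], [((0,), {(0,): 1.0, (1,): 0.8})], (0,))
assert not is_robust([(0, 1)], [0.3], [((0,), {(0,): 1.0, (1,): 1.2})], (0,))
```

Since larger perturbations only enlarge the candidate sets, this check is monotone in the perturbation level, so the threshold ε* discussed below can be located by bisection over repeated calls.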
In other words, we have an algorithm to test the robustness of PΦ with respect to the perturbation PΨ.

To characterize the perturbation level, we introduce the notion of a parametrized perturbation ψεk of a factor φk, defined by requiring that: (i) for each ε ∈ [0, 1], ψεk is a compact set of factors, each of which has the same domain as φk; (ii) if ε2 ≥ ε1, then ψε2k ⊇ ψε1k; and (iii) ψ0k = {φk}. Given a parametrized perturbation for each factor of the PGM Φ, we denote by Ψε the corresponding family of PGMs and by PΨε the relative set of joint mass functions.

We define the critical perturbation threshold ε* as the supremum value of ε ∈ [0, 1] such that PΦ is robust with respect to the perturbation PΨε, i.e., such that the condition (RMAP) is still satisfied. Because of property (ii) of parametrized perturbations, we know that if (RMAP) is not satisfied for a particular value of ε then it cannot be satisfied for a larger value and, vice versa, if the criterion is satisfied for a particular value then it will also be satisfied for every smaller value. An algorithm to evaluate ε* can therefore be obtained by iteratively checking (RMAP) according to a bracketing scheme (e.g., bisection) over ε. Local SA, as well as SA of only a selective collection of parameters, comes as a byproduct, as one can perturb only some factors and our results and algorithm still apply.

4.1 Global SA in Markov Random Fields (MRFs)

MRFs are PGMs based on undirected graphs. The factors are associated to cliques of the graph. The specialization of the technique outlined by Theorem 2 is straightforward. A possible perturbation technique is the rectangular one. Given a factor φk, its rectangular parametric perturbation ψεk is:

ψεk = {φ′k ≥ 0 : |φ′k(xIk) − φk(xIk)| ≤ ε∆ for all xIk ∈ Val(XIk)},    (7)

where ∆ > 0 is a chosen maximum perturbation level, achieved for ε = 1.

For this kind of perturbation, the optimization in Eq. (6) is trivial: αk = max{0, φk(x̃Ik) − ε∆} and, if αk > 0, then βk(x̃Ik) = 1 and, for all xIk ∈ Val(XIk) \ {x̃Ik}, βk(xIk) = (φk(xIk) + ε∆)/(φk(x̃Ik) − ε∆). If αk = 0, even for a single k, the criterion (RMAP) is not satisfied and βk need not be computed.

4.2 Global SA in Bayesian Networks (BNs)

BNs are PGMs based on directed graphs. The factors are CPTs, one for each variable, each conditioned on the parents of the variable. Each CPT contains a conditional mass function for each joint state of the parents. Perturbations in BNs can take this into consideration and use perturbations with a direct probabilistic interpretation. Consider an unconditional mass function P over X. A parametrized perturbation Pε of P can be achieved by ε-contamination [2]:

Pε := {(1 − ε)P(X) + εP*(X) : P*(X) any mass function on X}.    (8)

It is a trivial exercise to check that this is a proper parametric perturbation of P(X) and that P1 is the whole probabilistic simplex. We perturb the CPTs of a BN by applying this parametric perturbation to every conditional mass function. Let P(X|Y) =: ψ(X, Y) be a CPT. The optimization in Eq.
(6) is trivial also in this case. We have αk = (1 − ε)P(x̃|ỹ) and, if αk > 0, then βk(x̃Ik) = 1 and, for all xIk ∈ Val(XIk) \ {x̃Ik}, βk(xIk) = ((1 − ε)P(x|y) + ε)/((1 − ε)P(x̃|ỹ)), where x̃ and ỹ are consistent with x̃Ik, and similarly for x, y and xIk.

More general perturbations can also be considered, and the efficiency of their computation relates to the optimization in Eq. (6). Because of that, at least any linear or convex perturbation can be solved efficiently and in polynomial time by convex programming methods, while other more sophisticated perturbations might demand general non-linear optimization and hence can no longer ensure that computations are exact and quick.

5 Experiments

5.1 Facial Action Unit Recognition

We consider the problem of recognizing facial action units from real image data using the CK+ data set [10, 16]. Based on the Facial Action Coding System [9], facial behaviors can be decomposed into a set of 45 action units (AUs), which are related to contractions of specific sets of facial muscles. We work with 23 recurrent AUs (for a complete description, see [9]). Some AUs happen together to show a meaningful facial expression: AU6 (cheek raiser) tends to occur together with AU12 (lip corner puller) when someone is smiling. On the other hand, some AUs may be mutually exclusive: AU25 (lips part) never happens simultaneously with AU24 (lip presser), since they are activated by the same muscles but with opposite motions. The data set contains 68 landmark positions (given by coordinates x and y) of the faces of 589 posed individuals (after filtering out cases with missing data), as well as the labels for the AUs. Our goal is to predict all the AUs happening in a given image.
In this work, we do not aim to outperform other methods designed for this particular task, but to analyse the robustness of a model when applied in this context. In spite of that, we expected to obtain a reasonably good accuracy by using an MRF.

One third of the posed faces are selected for testing, and two thirds for training the model. The labels of the testing data are not available during training and are used only to compute the accuracy of the predictions. Using the training data and following the ideas in [16], we build a linear support vector machine (SVM) separately for each one of the 23 AUs, using the image landmarks to predict that given AU. With these SVMs, we create new variables o1, . . . , o23, one for each selected AU, containing the predicted value from the SVM. This is performed for all the data, including training and testing data. After that, the landmarks are discarded and the data is considered to have 46 variables (true values and SVM-predicted ones). At this point, the accuracy of the SVM measurements on the testing data, if one considers the average Hamming distance between the vector of 23 true values and the vector of 23 predicted ones (that is, the sum of the number of times AUi equals oi over all i and all instances in the testing data, divided by 23 times the number of instances), is about 87%. We now use these 46 variables to build an MRF (we use a very simplistic penalized likelihood approach for learning the MRF, as the goal is not to obtain state-of-the-art classification but to analyse robustness), as shown in Fig. 1(a), where SVM-built variables are treated as observational/measurement nodes and relations are learned between the AUs (AU variables not displayed in the figure are only connected to their corresponding measurements).

Using the MRF, we predict the AU configuration using a MAP algorithm, where all AUs are queried and all measurement nodes are observed.
As before, we characterise the accuracy of this model by the average Hamming distance between predicted vectors and true vectors, obtaining about 89% accuracy. That is, the inclusion of the relations between AUs by means of the MRF was able to slightly improve the accuracy obtained independently for each AU from the SVM. For our present purposes, we are however more interested in the associated perturbation thresholds ε*. For each instance of the testing data (that is, for each vector of 23 measurements), we compute it using the rectangular perturbations of Section 4.1. The higher ε* is, the more robust the issued vector is, because it represents the single optimal MAP instantiation even if one varied all the parameters of the MRF by ε*. To understand the relation between ε* and the accuracy of predictions, we have split the testing instances into bins, according to the Hamming distance between true and predicted vectors.

Figure 1: On the left (a), the graph of the MRF used to compute MAP. On the right (b), boxplots for the robustness measure ε* of MAP solutions, for different values of the Hamming distance to the truth.

Figure 1(b) shows the boxplot of ε* for each value of the Hamming distance between 0 and 4 (a lower ε* of a MAP instantiation means lower robustness). As we can see in the figure, the median robustness ε* decreases monotonically with the distance, indicating that this measure is correlated with the accuracy of the issued predictions, and hence can be used as second-order information about the obtained MAP instantiation for each instance.

The data set also contains information about the emotion expressed in the posed faces (at least for part of the images), which are shown in Figure 2(b): anger, disgust, fear, happy, sadness and surprise.
We have partitioned the testing data according to these six emotions and plotted their robustness measure ε* (Figure 2(a)). It is interesting to see the relation between robustness and emotions. Arguably, it is much easier to identify surprise (because of the stretched face and open mouth) than anger (because of the more restricted muscle movements defining it). Figure 2 corroborates this statement, and suggests that the robustness measure ε* can have further applications.

Figure 2: On the left (a), box plots for the robustness measure ε* of the MAP solutions, split according to the emotion that was presented in the instance where MAP was computed. On the right (b), examples of emotions encoded in the data set [10, 16]. Each row is a different emotion.

Figure 3: Average accuracy of a classifier over 10 times 5-fold cross-validation (accuracy versus ε*, with one curve per data set: audiology, autos, breast-cancer, horse-colic, german-credit, pima-diabetes, hypothyroid, ionosphere, lymphography, mfeat, optdigits, segment, solar-flare, sonar, soybean, sponge, zoo, vowel). Each instance is classified by a MAP inference. Instances are categorized by their ε*, which indicates their robustness (or the amount of perturbation up to which the MAP instantiation remains unique).

5.2 Robustness of Classification

In this second experiment, we turn our attention to the classification problem using data sets from the UCI machine learning repository [1]. Data sets with many different characteristics have been used.
Continuous variables have been discretized by their median before any other use of the data. Our empirical results are obtained out of 10 runs of 5-fold cross-validation (each run splits the data into folds randomly and in a stratified way), so the learning procedure of each classifier is called 50 times per data set. In all tests we have employed a naive Bayes classifier with equivalent sample size equal to one. After the classifier is learned using 4 out of 5 folds, predictions for the other fold are issued based on the MAP solution, and the computation of the robustness measure ε* is done. Here, the value ε* is related to the size of the contamination of the model for which the classification result of a given test instance remains unique and unchanged (as described in Section 4.2). Figure 3 shows the classification accuracy for varying values of ε* that were used to perturb the model (to obtain the curves, the test instances were split into bins according to the computed value ε*, using intervals of length 10⁻²; that is, accuracy was calculated for every instance with ε* between 0 and 0.01, then between 0.01 and 0.02, and so on). We can see a clear relation between accuracy and predicted robustness ε*. We remind the reader that the computation of ε* does not depend on the true MAP instantiation, which is only used to verify the accuracy. Again, the robustness measure provides valuable information about the quality of the obtained MAP results.

6 Conclusions

We consider the sensitivity of the MAP instantiations of discrete PGMs with respect to perturbations of the parameters. Simultaneous perturbations of all the parameters (or any chosen subset of them) are allowed. An exact algorithm to check the robustness of the MAP instantiation with respect to the perturbations is derived.
The worst-case time complexity is that of the original MAP inference times the number of variables in the domain. The algorithm is used to compute a robustness measure, related to changes in the MAP instantiation, which is applied to the prediction of facial action units and to classification problems. A strong association between that measure and accuracy is verified. As future work, we want to develop efficient algorithms to determine, when the result is not robust, what defines such instances, and to study how this robustness can be used to improve classification accuracy.

Acknowledgements

J. De Bock is a PhD Fellow of the Research Foundation Flanders (FWO) and he wishes to acknowledge its financial support. The work of C. P. de Campos has been mostly performed while he was with IDSIA and has been partially supported by the Swiss NSF grant 200021 146606 / 1.

References

[1] A. Asuncion and D.J. Newman. UCI machine learning repository. http://www.ics.uci.edu/~mlearn/MLRepository.html, 2007.

[2] J. Berger. Statistical decision theory and Bayesian analysis. Springer Series in Statistics. Springer, New York, NY, 1985.

[3] E.F. Castillo, J.M. Gutierrez, and A.S. Hadi. Sensitivity analysis in discrete Bayesian networks. IEEE Transactions on Systems, Man, and Cybernetics, Part A, 27(4):412–423, 1997.

[4] H. Chan and A. Darwiche. When do numbers really matter? Journal of Artificial Intelligence Research, 17:265–287, 2002.

[5] H. Chan and A. Darwiche. Sensitivity analysis in Bayesian networks: from single to multiple parameters. In Proceedings of UAI 2004, pages 67–75, 2004.

[6] H. Chan and A. Darwiche. Sensitivity analysis in Markov networks. In Proceedings of IJCAI 2005, pages 1300–1305, 2005.

[7] H. Chan and A. Darwiche. On the robustness of most probable explanations. In Proceedings of UAI 2006, pages 63–71, 2006.

[8] R. Dechter, N.
Flerova, and R. Marinescu. Search algorithms for m best solutions for graphical models. In Proceedings of AAAI 2012, 2012.

[9] P. Ekman and W. V. Friesen. Facial action coding system: A technique for the measurement of facial movement. Consulting Psychologists Press, Palo Alto, CA, 1978.

[10] T. Kanade, J. F. Cohn, and Y. Tian. Comprehensive database for facial expression analysis. In Proceedings of the Fourth IEEE International Conference on Automatic Face and Gesture Recognition, pages 46–53, Grenoble, 2000.

[11] U. Kjaerulff and L.C. van der Gaag. Making sensitivity analysis computationally efficient. In Proceedings of UAI 2000, pages 317–325, 2000.

[12] J. Kwisthout. Most probable explanations in Bayesian networks: complexity and tractability. International Journal of Approximate Reasoning, 52(9):1452–1469, 2011.

[13] J. Kwisthout, H. L. Bodlaender, and L. C. van der Gaag. The complexity of finding k-th most probable explanations in probabilistic networks. In Proceedings of SOFSEM 2011, pages 356–367, 2011.

[14] K. B. Laskey. Sensitivity analysis for probability assessments in Bayesian networks. IEEE Transactions on Systems, Man, and Cybernetics, 25(6):901–909, 1995.

[15] I. Levi. The Enterprise of Knowledge. MIT Press, London, 1980.

[16] P. Lucey, J. F. Cohn, T. Kanade, J. Saragih, Z. Ambadar, and I. Matthews. The Extended Cohn-Kanade Dataset (CK+): A complete expression dataset for action unit and emotion-specified expression. In Proceedings of the Third International Workshop on CVPR for Human Communicative Behavior Analysis, pages 94–101, San Francisco, 2010.

[17] M. Pradhan, M. Henrion, G.M. Provan, B.D. Favero, and K. Huang. The sensitivity of belief networks to imprecise probabilities: an experimental investigation. Artificial Intelligence, 85(1-2):363–397, 1996.

[18] S. Renooij and L.C. van der Gaag.
Evidence and scenario sensitivities in naive Bayesian classifiers. International Journal of Approximate Reasoning, 49(2):398–416, 2008.

[19] P. Walley. Statistical Reasoning with Imprecise Probabilities. Chapman and Hall, London, 1991.