{"title": "On Relating Explanations and Adversarial Examples", "book": "Advances in Neural Information Processing Systems", "page_first": 15883, "page_last": 15893, "abstract": "The importance of explanations (XP's) of machine learning (ML) model predictions and of adversarial examples (AE's) cannot be overstated, with both arguably being essential for the practical success of ML in different settings. There has been recent work on understanding and assessing the relationship between XP's and AE's. However, such work has been mostly experimental and a sound theoretical relationship has been elusive. This paper demonstrates that explanations and adversarial examples are related by a generalized form of hitting set duality, which extends earlier work on hitting set duality observed in model-based diagnosis and knowledge compilation. Furthermore, the paper proposes algorithms, which enable computing adversarial examples from explanations and vice-versa.", "full_text": "On Relating Explanations and Adversarial Examples\n\nAlexey Ignatiev\n\nMonash University, Australia\n\nalexey.ignatiev@monash.edu\n\nNina Narodytska\n\nVMWare Research, CA, USA\nnnarodytska@vmware.com\n\nJoao Marques-Silva\n\nANITI, Toulouse, France\n\njoao.marques-silva@univ-toulouse.fr\n\nAbstract\n\nThe importance of explanations (XP\u2019s) of machine learning (ML) model predictions\nand of adversarial examples (AE\u2019s) cannot be overstated, with both arguably being\nessential for the practical success of ML in different settings. There has been\nrecent work on understanding and assessing the relationship between XP\u2019s and\nAE\u2019s. However, such work has been mostly experimental and a sound theoretical\nrelationship has been elusive. This paper demonstrates that explanations and\nadversarial examples are related by a generalized form of hitting set duality, which\nextends earlier work on hitting set duality observed in model-based diagnosis and\nknowledge compilation. 
Furthermore, the paper proposes algorithms, which enable\ncomputing adversarial examples from explanations and vice-versa.\n\n1\n\nIntroduction\n\nAdversarial examples (AE\u2019s) [54] illustrate the brittleness of machine learning (ML) models, and\nhave been the subject of growing interest in recent years. Explanations (XP\u2019s) of (black-box) ML\nmodels provide trust in ML models, and exemplify the increasing importance of eXplainable AI\n(XAI) [17, 12, 13]. Over the last few years, a number of works realized the existence of some\nconnection between AE\u2019s and XP\u2019s [34, 55, 48, 56, 59, 41, 8]. However, past work has been\nexperimental, and a deeper theoretical connection between AE\u2019s and XP\u2019s has been elusive. This\npaper demonstrates the existence of such a theoretical connection between AE\u2019s and XP\u2019s.\nIn this work we take a formal logic point of view on the analysis of ML models. Namely, we employ\n\ufb01rst order logic (FOL) as a framework to specify an ML model, de\ufb01ne notions of an explanation and a\ncounterexample to that explanation, which can be viewed as a generalization of an adversarial example.\nWe then demonstrate that these notions possess well-known counterparts in the FOL terminology,\nlike prime implicants and implicates, respectively. Such formalization allows us to obtain our main\nresult that reveals a duality relation between explanations and counterexamples. Based on this\nconnection, we show how explanations and counterexample interact. For example, explanations can\nbe used to generate counterexamples. Dually, we can generate explanations given counterexamples.\nFurthermore, we also show how to compute adversarial examples from counterexamples. 
The ideas\nin the paper build on tightly related work in model-based diagnosis [45], namely the hitting set\nduality between diagnoses and con\ufb02icts, but also builds on related work in knowledge compilation,\nconcretely in the use of prime implicants and implicates to compile knowledge [51, 36].\nThe paper is organized as follows. Section 2 introduces concepts used in the remainder of the\npaper. Section 3 investigates the connections between adversarial examples and explanations, and\nproposes algorithms for the enumeration of explanations and adversarial examples. Experimental\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fevidence demonstrating the relationship between explanations and adversarial examples is analyzed\nin Section 4. The paper concludes in Section 5.\n\n2 Background\n\n2.1 Preliminaries\nWe consider an ML model M, represented by a \ufb01nite set of \ufb01rst order logic (FOL) sentences M.\n(Where viable, alternative representations for M can be considered, e.g. fragments of FOL, (mix-\ned-)integer linear programming, constraint language(s), etc.) A set of features F = {f1, . . . , fL}\nis assumed. Each feature fi is categorical (or ordinal), with values taken from some set Di. An\ninstance is an assignment of values to features. The space of instances, also referred to as feature\n(or instance) space, is de\ufb01ned by F = D1 \u00d7 D2 \u00d7 . . . \u00d7 DL. (Domains may or may not have an\norder relation. Also, for real-valued features, a suitable interval discretization can be considered.) A\n(feature) literal \u03bbi is of the form (fi = vi), with vi \u2208 Di. In what follows, a literal will be viewed as\nan atom, i.e. it can take value true or false. As a result, an instance can be viewed as a set of L literals,\ndenoting the L distinct features, i.e. an instance contains a single occurrence of a literal de\ufb01ned on\nany given feature. 
A set of literals is consistent if there exists at most one literal associated with each feature. A consistent set of literals can be interpreted as a conjunction or as a disjunction of literals; this will be clear from the context. When interpreted as a conjunction, the set of literals denotes a cube in instance space, where the unspecified features can take any possible value of their domain. When interpreted as a disjunction, the set of literals denotes a clause in instance space. As before, the unspecified features can take any possible value of their domain.\nThe remainder of the paper assumes a classification problem with a set of classes K = {κ1, . . . , κM}. A prediction π ∈ K is associated with each instance I.\nGiven some target prediction π ∈ K, one can devise representations for the formula FM,π ≜ (M → π) [52, 23]. In particular, we will be interested in computing prime implicants and implicates of FM,π, where a consistent set of feature literals τ is an implicant of FM,π if τ ⊨ FM,π, and a consistent set of feature literals ν is a (negated) implicate of FM,π if FM,π ⊨ ¬ν, or alternatively (ν ⊨ ¬FM,π) ≡ (ν ⊨ ∨ρ≠π(M → ρ)). An implicant τ (implicate ν, resp.) is called prime if none of its proper subsets τ′ ⊊ τ (ν′ ⊊ ν, resp.) is an implicant (implicate, resp.).\nThroughout the paper, it will be convenient to use a more detailed notation, where ML models, prime implicants and prime implicates represent functions, respectively mapping F into K and into {0, 1}. Concretely, the ML model M computes a function M : F → K.¹ As a result, given some instance X ∈ F in feature space, M(X) denotes the prediction computed by the ML model. 
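To make these preliminaries concrete, here is a minimal sketch (the encoding is ours, not the paper's): a literal is a (feature, value) pair, an instance is a total assignment, and a consistent set of literals interpreted conjunctively denotes a cube that an instance may or may not lie in. Only positive literals (fi = vi) are modeled, matching the formal definition of a feature literal.

```python
def consistent(literals):
    """A set of (feature, value) literals is consistent if it fixes
    at most one value per feature."""
    seen = {}
    for feature, value in literals:
        if seen.setdefault(feature, value) != value:
            return False
    return True

def matches(instance, cube):
    """An instance (a total assignment, here a dict) lies in the cube
    denoted by a consistent literal set iff it agrees on every literal;
    unspecified features are unconstrained."""
    return all(instance[feature] == value for feature, value in cube)
```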
Furthermore, the notation τ ⊨ FM,π represents the following first order logic statement:\n\n∀(X ∈ F). τ(X) → (M(X) = π)    (1)\n\nwhere τ is a Boolean function mapping F into {0, 1}, and M is a function mapping F into K. Essentially, a prime implicant is viewed as a Boolean function taking value 1 for a cube (i.e. a set of points) in feature space for which the prediction is π. Similarly, the notation ν ⊨ ¬FM,π represents the following first order logic statement:\n\n∀(X ∈ F). ν(X) → (∨ρ≠π M(X) = ρ)    (2)\n\nwhere ν is a Boolean function mapping F into {0, 1}.\nExample 1. The paper's running example is the restaurant example from Russell & Norvig's book [50, Fig. 18.3, page 700]. For this example, the set of features is: {A(lternate), B(ar), W(eekend), H(ungry), Pa(trons), Pr(ice), Ra(in), Re(serv.), T(ype), E(stim.)}. For instance, A, B, W, H, Ra, Re are Boolean features taking True or False values. T is a categorical feature with four possible values {Burger, French, Italian, Thai}. The other domains are defined similarly. The dataset predicts whether the customer should wait or not in a given situation. So we have the target label Wait, which takes Yes or No values. An example instance is: {A, ¬B, ¬W, H, (Pa = Some), (Pr = $$$), ¬Ra, Re, (T = French), (E = 0–10)}.\n\n¹Since formula M computes a function, the prediction for any instance is unique. It is straightforward to devise ML models where M does not compute a function. Such cases are not considered in this paper.\n\nExample 2. Throughout the paper, the examples will consider decision sets [28]. (The selection of decision sets is motivated by simplicity. The actual experiments consider black-box models.) 
For the example dataset, a decision set obtained with an off-the-shelf tool is:\n\n(R1) IF (Pa = Some) ∧ ¬(E = >60) THEN (Wait = Yes)\n(R2) IF W ∧ ¬(Pr = $$$) ∧ ¬(E = >60) THEN (Wait = Yes)\n(R3) IF ¬W ∧ ¬(Pa = Some) THEN (Wait = No)\n(R4) IF (E = >60) THEN (Wait = No)\n(R5) IF ¬(Pa = Some) ∧ (Pr = $$$) THEN (Wait = No)\n\n2.2 Related Work\n\nWe overview two main research directions: (a) methods for generating adversarial attacks and (b) methods for producing explanations of ML model decisions. These two research directions have similarities, e.g. both types of methods make assumptions about the transparency of the model (i.e. whether it is a white-box or a black-box), enforce different guarantees on the outcome (best effort vs guaranteed solution), etc. However, somewhat surprisingly, a formal connection between adversarial examples and explanations has not been proposed in the literature. Our work bridges this gap.\n\nAdversarial attacks. Szegedy et al. demonstrated that ML models lack robustness: a small perturbation of an input may lead to a significant perturbation of the output of an ML model [54]. This vulnerability can be exploited to augment the original input with a crafted perturbation, invisible to a human but sufficient for the ML model to misclassify this input. A perturbation is required to be small w.r.t. a given metric, e.g. the l1, l2, and l∞ norms [54, 16, 6]. This influential work triggered several new research directions [7].\nDepending on the threat model, adversarial attack generators can be broadly partitioned into black-box and white-box methods. Black-box methods assume that the attacker has no knowledge about the ML model. In this case, adversarial attacks are based on algorithms that recover the original model structure and transfer adversarial examples to the original model [42, 43, 27]. 
In contrast, white-box\nmethods assume that the adversary has complete knowledge about the model, e.g. [54, 6, 37, 38]. In\nthis work, we lean towards a black-box model, however we assume the existence of a logic-based\noracle that can be queried about entailment relations between inputs and outputs. The majority\nof white-box methods to produce adversarial examples are heuristic and rely on gradient descent\nmethods. Hence, they cannot guarantee that an adversarial example will be found even if it exists.\nFor safety-critical applications such uncertainty might not be tolerable, therefore, a new trend is\nemerging focusing on methods with provable guarantees [25, 4, 33, 40, 10, 14, 26, 30]. For example,\nKatz et al. proposed the Reluplex system that \ufb01nds an adversarial example if it exists in ReLU-based\nnetworks [25]. The main idea is to encode a network function as an SMT formula and to prove its\nproperties (e.g. the absence of an adversarial perturbation in a given neighborhood) using an SMT\nand ILP hybrid. A formal approach was also applied to binarized neural networks [11, 40, 10, 26].\nFinally, there is work on understanding adversarial examples. Xu et al. provide sensitivity analysis\nof pixel level perturbations and investigate the effect of these perturbations on internal layers of the\nnetwork [58]. In [59], the authors produce structured attacks, where the attack mechanism achieves\nstrong group sparsity leading to more interpretable examples.\n\nModel explanations. Explainability of ML models depends on the type of the model that a user\nworks with. There is a class of models considered to be interpretable by a human decision maker,\nlike decision trees, lists or sets. 
When considering interpretable ML models, the goal is to compute\nmodels that provide minimal explanations associated with each prediction [28, 2, 39, 24, 49]2.\nIn case of non-interpretable models, like neural networks or ensembles of trees, there are two main\noptions: (a) recompile (augment) the original model into (with) an explainable model [15, 57] or\n(b) extract an explanation from the model. The former approach might not be suitable if we want\nto provide explanations with guarantees w.r.t. the original model. The latter approach is currently\nthe mainstream one. As in the case of adversarial attacks, the method designer needs to make an\n\n2Explanations \ufb01nd a wide range of uses in AI and CS in general. In general, understanding the causes of\ninconsistency in over-constrained systems of constraints corresponds to explaining the reasons of inconsistency.\nIn a similar fashion, diagnosis of failing systems can also be seen as explaining the reasons for system failure [45].\n\n3\n\n\fassumption on transparency of the model to the explainer. As above, white-box methods rely on\ncomputing gradients, e.g. saliency maps or integrated gradients [53]. However, these methods are\nmostly applicable to computer vision tasks. One in\ufb02uential line of research looks into explaining\nblack-box models [19]. Explanations can be local [46, 35, 18] or global [47, 29], depending on\nwhether they only apply to a local neighborhood of a target instance or not. While these methods can\nprovide probabilistic guarantees, they do not provide worst-case guarantees on generated explanations.\nMoreover, there exist concerns regarding robustness of some of these methods [1].\nRecently, two approaches were proposed to compute global explanations. The \ufb01rst method takes a\ncompilation-based approach to computing global explanations [52]. If it is possible to compile an\nML model to a suitable compilation structure, this method can extract all possible global explana-\ntions. 
However, the main drawback of this approach is the exponential worst-case size of the compiled representation. In [23], the authors proposed new methods for computing explanations, by extracting prime implicants. This approach scales better than the compilation-based approach and can generate a number of global explanations on demand. The current work is based on ideas from [23] to generate explanations and counterexamples.\nThere is a recent line of work on defending ML models against adversarial attacks based on interpretability [34, 55]. For example, in [55] the authors identify neurons that correspond to human-perceptible attributes and check whether these attributes are used in classification of the input. If so, the input is deemed non-adversarial; otherwise, it is deemed adversarial. Tomsett et al. [56] stated that adversarial examples and explanations are related notions. Namely, they argue that adversarial examples can improve ML interpretability and vice versa, e.g. neuron activation patterns differ for adversarial and original inputs, which provides an insight into the network's internal representation. Finally, there is work on using advanced training procedures, like robust training, to improve network interpretability [48, 41, 8]. For example, in [48] the authors propose to regularize input gradients to improve robustness to transferred adversarial examples and the quality of gradient-based explanations. Chalasani et al. proposed to employ adversarial training to learn logistic models with the feature-concentration property that are easier for the user to interpret [8].\n\n3 Relating Explanations and Adversarial Examples\n\nThe goal of this section is to establish a tight connection between adversarial examples and explanations. To achieve this goal, we must first formalize the notion of (absolute) explanations and that of counterexamples, and prove a (minimal) hitting set relationship between the two. 
Afterwards, we\ndemonstrate how adversarial examples can be computed from explanations and vice-versa.\n\n3.1 Explanations & Counterexamples\n\nIn this section, we introduce two new notions: absolute explanations and counterexamples over the\ninput features. An absolute explanation for the prediction \u03c0 is a generalization of commonly used\nlocal and global explanations [46, 47, 52, 23]. An absolute explanation is the strongest form of an\nexplanation that does not depend on a concrete input instance and acts globally over the entire feature\nspace. For any instance that matches an absolute explanation, the ML model prediction is guaranteed\nto be \u03c0. By matching, we mean that a set of features shared by the instance and the explanation have\nthe same values. The second notion is a counterexample to the prediction \u03c0, which is a generalization\nof commonly used adversarial and some forms of universal adversarial examples [46, 37]. Intuitively,\na counterexample is a set of input feature values that forces the ML model to output a prediction that\nis different from \u03c0. We are mostly interested in minimal such sets. Again, it is a strong notion as any\ninstance that matches the counterexample must not be classi\ufb01ed as \u03c0.\nGiven an ML model, represented by some logic encoding M, and a prediction \u03c0 \u2208 K, the following\nde\ufb01nitions are considered.\nDe\ufb01nition 1 (Explanation). A(n absolute) explanation (XP) of a prediction \u03c0 is a subset-minimal\nset3 of literals E, representing distinct features, such that E (cid:15) (M\u2192 \u03c0).\n\n3Given a set R, subset-minimality of a set \u03d5 \u2286 R wrt. a predicate P over set R means that (a) P(\u03d5) and\n\n(b) \u2200(\u03d5(cid:48) (cid:40) \u03d5)\u00acP(\u03d5(cid:48)).\n\n4\n\n\fObserve that explanations are often deemed as local [46, 47]. 
Alternatively, global explanations (although they hold over the complete instance space) are relative to an instance I, with E ⊆ I [23]. Explanations in this paper are independent of a concrete instance, and so are referred to as absolute.\nDefinition 2 (Counterexample). A subset-minimal set C of literals is a counterexample (CEx) to a prediction π if C ⊨ (M → ρ), with ρ ∈ K ∧ ρ ≠ π.\nClearly, an explanation E is a prime implicant of FM,π and a counterexample C is a (negated) prime implicate of FM,π.\nExample 3. For the running example, due to (R1), an explanation for the prediction (Wait = Yes) is: (Pa = Some) ∧ ¬(E = >60). Moreover, due to (R5), a counterexample for the prediction (Wait = Yes) is: ¬(Pa = Some) ∧ (Pr = $$$).\nTwo literals are inconsistent if they represent the same feature but refer to different values. We say that a literal τi breaks a set of literals S (each literal denoting a different feature) if S contains a literal inconsistent with τi. Thus we can talk about breaking an explanation E or breaking a counterexample C. Moreover, two sets of literals S1 and S2 break each other if they contain literals on the same feature referring to different values.\nExample 4. For the running example, the explanation corresponding to the set of literals S1 = {(Pa = Some), ¬(E = >60)} breaks the counterexample corresponding to the set of literals S2 = {¬(Pa = Some), (Pr = $$$)} and vice-versa, as (Pa = Some) and ¬(Pa = Some) are the inconsistent literals in this case.\n\nAs hinted in the examples above, we can now state the paper's main result. We start with a general assumption.\nAssumption 1. The ML model M computes a function M : F → K.\nThis assumption is essential to ensure that for any instance in feature space the prediction is unique.\nTheorem 1. 
Given an ML model M, represented by some logic encoding M, and a prediction π, every explanation E of π breaks every counterexample of π, and every counterexample C of π breaks every explanation of π.\nProof. The proof of the theorem statement consists of two parts:\n\n1. Every explanation E of π breaks every counterexample C of π:\n∀(X ∈ F). C(X) → (∨ρ≠π M(X) = ρ)    [Definition of some counterexample C]\n∀(X ∈ F). ¬(∨ρ≠π M(X) = ρ) → ¬C(X)    [Contrapositive]\n∀(X ∈ F). (M(X) = π) → ¬C(X)    [Negation given set of classes]\n∀(X ∈ F). E(X) ⊨ (M(X) = π)    [Definition of some explanation E]\n∀(X ∈ F). E(X) ⊨ ¬C(X)    [Explanation breaks counterexample]\n\n2. Every counterexample C of π breaks every explanation E of π:\n∀(X ∈ F). E(X) → (M(X) = π)    [Definition of some explanation E]\n∀(X ∈ F). ¬(M(X) = π) → ¬E(X)    [Contrapositive]\n∀(X ∈ F). ∨ρ≠π (M(X) = ρ) → ¬E(X)    [Negation given set of classes]\n∀(X ∈ F). C(X) ⊨ ∨ρ≠π (M(X) = ρ)    [Definition of some counterexample C]\n∀(X ∈ F). C(X) ⊨ ¬E(X)    [Counterexample breaks explanation]\n\nAs argued below, by listing all minimal counterexamples, we can extract all minimal explanations, and vice-versa. Furthermore, one can readily conclude that Theorem 1 generalizes the hitting set relationship between diagnoses and conflicts first investigated by Reiter in the 80s [45], and since then studied in different settings [5, 3, 32].\nExample 5 (Duality). For the running example, let the prediction again be (Wait = Yes) and the decision set proposed in Example 2. For this prediction and given the ML model, there are two global explanations:\n\n1. 
(Pa = Some) ∧ ¬(E = >60); and\n2. W ∧ ¬(Pr = $$$) ∧ ¬(E = >60).\n\nThis means that as long as either of these two conjunctions of literals holds, the prediction will be (Wait = Yes). Moreover, there are three counterexamples (i.e. explanations for not predicting (Wait = Yes), which for this example corresponds to predicting (Wait = No)):\n\n1. ¬W ∧ ¬(Pa = Some);\n2. (E = >60); and\n3. ¬(Pa = Some) ∧ (Pr = $$$).\n\nThis means that as long as any of these three conjunctions of literals holds, the prediction will not be (Wait = Yes). It can be readily concluded that the XP's (minimally) break the CEx's and vice-versa.\nRemark 1. As hinted in Example 5, and building on earlier work on model-based diagnosis and enumeration of prime implicants and implicates [45, 51, 44], it is straightforward to compute counterexamples from explanations and vice-versa:\n\n1. If we have the set of explanations for a prediction, then we can compute the set of counterexamples as the consistent minimal breaks of the set of explanations.\n2. Similarly, if we have the set of counterexamples, then we can compute the set of explanations as the consistent minimal breaks of the set of counterexamples.\n\nRemark 2. The assumptions for relating explanations with counterexamples are fairly general. For example, features do not need to be ordinal. Clearly, adversarial examples expect features to be ordinal. This is covered in the next section.\n\n3.2 Relationship with Adversarial Examples\n\nThe previous section showed that each explanation breaks every counterexample, and that each counterexample breaks every explanation. This holds for any machine learning model for which the logic representation of the model computes a function mapping feature space F into K. Adversarial examples were introduced in earlier work [54], denoting small changes to the features w.r.t. 
a given distance measure, that result in a prediction error. In this paper, we use the following definition of an adversarial example. Given an instance I in feature space, corresponding to prediction π, our goal is to find another instance, Iae, corresponding to a different prediction, and which is closest to the original instance. Formally,\n\nmin Dist(Iae, I)  s.t.  Iae ⊨ ∨ρ≠π ρ    (3)\n\nClearly, adversarial examples assume that features are ordinal, enabling the notion of distance to be well-defined. (For real-valued features, we assume an often-used discretization of the input.) We can now relate adversarial examples with counterexamples and with explanations.\nTheorem 2 (From XP's to AE's). Given an ML model M, represented with a logic representation M, a prediction π, with set of explanations E and set of counterexamples C, and an instance I taken from feature space, let Cae denote the counterexample with minimum distance to I. Then Iae corresponds to Cae by setting the unspecified feature values to the values in I.\nProof (Sketch). If we know the set of explanations, then we can compute the set of counterexamples (see Remark 1). Given the set of counterexamples, we can compute one that minimizes some measure of distance to the instance I. A counterexample represents a cube in feature space; we just need to pick the point closest to I. This can be achieved by fixing the free features. Observe that Iae is the complete assignment obtained from the partial assignment Cae by setting the missing coordinates to the feature values specified in the given instance I.\n\n3.3 Exploiting Duality\n\nThe duality between absolute explanations E ∈ E and counterexamples C ∈ C for a prediction π made by model M (represented as formula M) can be exploited directly to compute either E, or C, or both. 
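The construction in the proof of Theorem 2 can be sketched directly. The simplifications below are ours: literals are encoded as (feature, value) pairs, Dist is taken to be the Hamming distance over the features a counterexample fixes, and the function name is hypothetical.

```python
def adversarial_from_cex(instance, counterexamples):
    """Sketch of Theorem 2: pick the counterexample whose fixed features
    disagree least with `instance` (Hamming distance), then complete the
    partial assignment with the instance's remaining feature values."""
    def dist(cex):
        # number of fixed features on which the counterexample differs
        return sum(instance[f] != v for f, v in cex)
    cae = min(counterexamples, key=dist)   # closest counterexample
    iae = dict(instance)                   # free features keep I's values
    iae.update(dict(cae))                  # fixed features take Cae's values
    return iae
```

The returned instance matches the chosen counterexample, so by Definition 2 its prediction necessarily differs from π.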
This can be done following the ideas of prime compilation of Boolean formulas [44]. Alternatively, the classifier can be compiled into a succinct logical representation, e.g. a binary decision diagram (BDD) [52], which allows for efficient enumeration of prime implicants and implicates.\n\nAlgorithm 1: Duality-based computation of all absolute explanations\nInput: formula M and prediction π\nOutput: set E of all absolute explanations of prediction π\n1  (C, E, E) ← (∅, ∅, ∅)\n2  do:\n3      if E ⊨ (M → π):\n4          E ← E ∪ {E}    # E is an explanation; save it\n5      else:\n6          (C, ρ) ← ExtractInstance()    # get an instance C with a prediction ρ, ρ ≠ π\n7          for l ∈ C:\n8              if (C \\ {l}) ⊨ (M → ρ):\n9                  C ← C \\ {l}\n10         C ← C ∪ {C}    # update C with a new counterexample C\n11     E ← MinimumHS(C)    # get a new hitting set of C\n12 while E ≠ ∅\n13 return E\n\nAlgorithm 1 shows a Pythonic-style algorithm to compute the complete set E. It computes the set E of all explanations and a subset of all the counterexamples C. The algorithm utilizes the hitting set duality and represents the implicit hitting set enumeration paradigm [9]. It can be seen as a loop, each iteration of which computes a smallest-size hitting set E of the set C of counterexamples (see line 11) and checks whether or not E is an explanation for prediction π (line 3). (This check can be done by calling an oracle testing unsatisfiability of the formula E ∧ M ∧ ¬π.) If it is, E is added to E. Otherwise, i.e. if E ∧ M ∧ ¬π is satisfiable, a satisfying assignment exists, defining an instance C that is classified by M as some ρ ∈ K s.t. ρ ≠ π. Such a satisfying assignment is typically easy to obtain from the oracle (line 6). 
Instance C is then reduced to a counterexample by removing all redundant, i.e. unnecessary, literals (see the loop in lines 7–9 and, concretely, the oracle call in line 8). The new counterexample is added to the set C (see line 10) and a new hitting set E is obtained (line 11). The algorithm proceeds until there is no more hitting set E of C. Observe that initially E is empty and, thus, the first iteration of Algorithm 1 always results in a new counterexample C being computed and added to C. Note that although Algorithm 1 targets enumerating all explanations E, it can also be applied to computing the set C of all counterexamples, with minimal modifications of the formulas in lines 3 and 8.\nRemark 3. Although the goal of Algorithm 1 is to illustrate a way to exploit the duality, its practical efficiency may not be ideal in some specific settings. The algorithm relies on the oracle calls in lines 3 and 8, which are NP-hard. Extracting a smallest-size hitting set is NP-hard as well, and can be done with the use of modern optimization procedures, e.g. mixed integer linear programming (MILP) [20] or maximum satisfiability (MaxSAT) [31]. Furthermore, in the worst case, the algorithm could end up enumerating both all explanations and all counterexamples, and there might be an exponential number of them. In practice, however, this worst-case scenario is often not observed.\n\n4 Experimental Evidence\n\nThis section practically illustrates the described duality between the concepts of absolute explanation and counterexample for a given model prediction. To do this, the following experiment was performed on a Macbook Pro with an Intel Core i5 2.3GHz CPU and 16GB of memory. The experiment targets the well-known and widely used MNIST digits database⁴, as it enables a visual demonstration of the discovered duality relationship. 
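Before turning to the experimental setup, the loop of Algorithm 1 can be made concrete with a deliberately naive, self-contained Python sketch (ours, not the paper's prototype). The simplifying assumptions: the feature space is small enough to brute-force, so the entailment oracle enumerates instances instead of calling a SAT/MILP solver; MinimumHS is an exact search over candidate literal sets rather than a MaxSAT/MILP call; and found explanations are blocked so that each hitting set is new.

```python
from itertools import combinations, product

def enumerate_explanations(domains, model, target):
    """Implicit hitting set loop of Algorithm 1 (brute-force sketch).
    domains: dict feature -> list of values; model: total function from a
    dict instance to a class; target: the prediction pi to explain."""
    feats = sorted(domains)
    space = [dict(zip(feats, vals))
             for vals in product(*(domains[f] for f in feats))]

    def entails(lits, cls):
        # oracle: every instance matching `lits` is classified as `cls`
        pts = [x for x in space if all(x[f] == v for f, v in lits)]
        return bool(pts) and all(model(x) == cls for x in pts)

    def breaks(e, cex):
        # e breaks cex if they fix the same feature to different values
        return any(f == g and v != u for f, v in e for g, u in cex)

    def minimum_hs(cexs, blocked):
        # smallest consistent literal set breaking every counterexample
        # collected so far and not subsuming an explanation found already
        cand = sorted({(f, v) for c in cexs for f, _ in c
                       for v in domains[f]})
        for k in range(len(cand) + 1):
            for e in combinations(cand, k):
                if (len({f for f, _ in e}) == len(e)      # consistent
                        and all(breaks(e, c) for c in cexs)
                        and not any(b <= set(e) for b in blocked)):
                    return list(e)
        return None                                       # no new hitting set

    cexs, xps, blocked, e = [], [], set(), []
    while e is not None:
        if entails(e, target):
            xps.append(e)                 # e is an explanation; save it
            blocked.add(frozenset(e))
        else:
            # witness instance matching e with a different prediction rho
            x = next(x for x in space
                     if all(x[f] == v for f, v in e) and model(x) != target)
            rho, c = model(x), [(f, x[f]) for f in feats]
            for lit in list(c):           # shrink to a minimal counterexample
                if entails([m for m in c if m != lit], rho):
                    c.remove(lit)
            cexs.append(c)
        e = minimum_hs(cexs, blocked)
    return xps
```

On a toy conjunction model over two binary features, the only explanation of the positive class fixes both features, while the counterexamples collected along the way are the unit-size literal sets on each feature, illustrating the duality.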
As a classifier model, we consider neural networks (NNs) with rectified linear unit (ReLU) activation operators and the known encoding of ReLU-based NNs into MILP [14]. The developed Python-based prototype⁵ follows the prime compilation approach of Algorithm 1 and uses CPLEX 12.8.0 [20] as an MILP oracle, which is invoked at each iteration of the algorithm. The implementation of minimum hitting set enumeration of Algorithm 1 is based on the award-winning maximum satisfiability solver RC2⁶ [22], written on top of the PySAT toolkit [21]. For the sake of simplicity, the networks used are trained to distinguish two digits, e.g. 5 and 6 (because of their visual resemblance). Also, due to a significant number of explanations and counterexamples, the following is assumed: (1) only pixels from a predefined patch area can participate in an explanation/counterexample, with the other pixels being fixed; (2) the images were binarized, i.e. every pixel can be either black or white. However, note that the duality holds in the most general case, with assumptions (1) and (2) being disabled.\n\n⁴http://yann.lecun.com/exdb/mnist/\n⁵https://github.com/alexeyignatiev/xpce-duality/\n⁶https://maxsat-evaluations.github.io/\n\nFigure 1: An example of digit five. Panels: (a) digit “5”; (b) patch area; (c) an XP; (d) all XP's; (e) a CEx; (f) an AE (“6”).\nFigure 2: An example of digit six. Panels: (a) digit “6”; (b) patch area; (c) an XP; (d) all XP's; (e) a CEx; (f) an AE (“5”).\n\nFigure 1a and Figure 2a show two concrete examples of digits 5 and 6. The patch areas for these images are highlighted in Figure 1b and Figure 2b. The patches contain 20 and 7 pixels, respectively. These patch areas are selected intentionally, as their pixels are supposed to be crucial for the prediction being “5” or “6”. 
Enumerating all explanations and counterexamples for these images with the given patches results in 20 (7, resp.) unit-size explanations for digit five (six, resp.). An example of one concrete explanation for these images is shown in Figure 1c and Figure 2c, respectively. Recall that the images are binarized; here, the corresponding pixel is blue (red, resp.) if the explanation sets it black (white, resp.), while the other (gray) pixels of the patch may have any color and the prediction remains unchanged, as long as the pixels outside the patch area are fixed. The unions of all explanations are shown in Figure 1d and Figure 2d, respectively. Also, for both images there is a unique counterexample. These are depicted in Figure 1e and Figure 2e. Observe that the "polarities" of the pixels in explanations and counterexamples are opposite to each other. This clearly exhibits the described duality between the concepts of absolute explanations and counterexamples.

Note that the only counterexample for prediction "5" sets all patch pixels white. Such an image is shown in Figure 1f and represents an adversarial example for Figure 1a, i.e. it is classified as "6". A similar observation can be made with respect to digit six. However, in this case the only counterexample sets all patch pixels black. Thus, an adversarial example for digit six is shown in Figure 2f, and it is classified as "5".

5 Conclusions

Adversarial examples and explanations of ML models are arguably two of the most significant areas of research in ML. This paper shows a tight relationship between the two.
Concretely, the paper proposes the dual concept of counterexample and the notion of breaking an explanation or a counterexample, and shows that each explanation must break every counterexample and vice-versa. This property is tightly related to the concept of hitting set duality between diagnoses and conflicts in model-based diagnosis [45], but also to the computation of prime implicants and implicates of Boolean functions [51]. The paper also overviews algorithms for computing explanations from counterexamples and vice-versa. Furthermore, the paper shows how adversarial examples can be computed given a reference instance in feature space and a counterexample that minimizes the distance to the instance. The experimental evidence illustrates the applicability of the duality relationship between explanations and counterexamples (and adversarial examples). Future work will investigate extensions of this work targeting larger-scale problems.

References

[1] D. Alvarez-Melis and T. S. Jaakkola. On the robustness of interpretability methods. CoRR, abs/1806.08049, 2018.
[2] E. Angelino, N. Larus-Stone, D. Alabi, M. Seltzer, and C. Rudin. Learning certifiably optimal rule lists. In KDD, pages 35–44, 2017.
[3] J. Bailey and P. J. Stuckey. Discovery of minimal unsatisfiable subsets of constraints using hitting set dualization. In PADL, pages 174–186, 2005.
[4] O. Bastani, Y. Ioannou, L. Lampropoulos, D. Vytiniotis, A. V. Nori, and A. Criminisi. Measuring neural net robustness with constraints. In NIPS, pages 2613–2621, 2016.
[5] E. Birnbaum and E. L. Lozinskii. Consistent subsets of inconsistent systems: structure and behaviour. J. Exp. Theor. Artif. Intell., 15(1):25–46, 2003.
[6] N. Carlini and D. A. Wagner. Towards evaluating the robustness of neural networks. In 2017 IEEE Symposium on Security and Privacy, SP 2017, San Jose, CA, USA, May 22-26, 2017, pages 39–57.
IEEE Computer Society, 2017.\n\n[7] A. Chakraborty, M. Alam, V. Dey, A. Chattopadhyay, and D. Mukhopadhyay. Adversarial\n\nattacks and defences: A survey. CoRR, abs/1810.00069, 2018.\n\n[8] P. Chalasani, S. Jha, A. Sadagopan, and X. Wu. Adversarial learning and explainability in\n\nstructured datasets. CoRR, abs/1810.06583, 2018.\n\n[9] K. Chandrasekaran, R. M. Karp, E. Moreno-Centeno, and S. Vempala. Algorithms for implicit\n\nhitting set problems. In SODA, pages 614\u2013629, 2011.\n\n[10] C. Cheng, G. N\u00fchrenberg, and H. Ruess. Veri\ufb01cation of binarized neural networks. CoRR,\n\nabs/1710.03107, 2017.\n\n[11] M. Courbariaux and Y. Bengio. Binarynet: Training deep neural networks with weights and\n\nactivations constrained to +1 or -1. CoRR, abs/1602.02830, 2016.\n\n[12] DARPA. DARPA explainable Arti\ufb01cial Intelligence (XAI) program, 2016.\n[13] EU Data Protection Regulation. Regulation (EU) 2016/679 of the European Parliament and of\n\nthe Council, 2016.\n\n[14] M. Fischetti and J. Jo. Deep neural networks and mixed integer linear optimization. Constraints,\n\n23(3):296\u2013309, 2018.\n\n[15] N. Frosst and G. E. Hinton. Distilling a neural network into a soft decision tree. In CExAIIA,\n\n2017.\n\n[16] I. J. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In\nY. Bengio and Y. LeCun, editors, 3rd International Conference on Learning Representations,\nICLR 2015, San Diego, CA, USA, May 7-9, 2015, Conference Track Proceedings, 2015.\n\n[17] B. Goodman and S. R. Flaxman. European Union regulations on algorithmic decision-making\n\nand a \"right to explanation\". AI Magazine, 38(3):50\u201357, 2017.\n\n[18] R. Guidotti, A. Monreale, S. Ruggieri, D. Pedreschi, F. Turini, and F. Giannotti. Local rule-based\n\nexplanations of black box decision systems. CoRR, abs/1805.10820, 2018.\n\n[19] R. Guidotti, A. Monreale, S. Ruggieri, F. Turini, F. Giannotti, and D. Pedreschi. 
A survey of methods for explaining black box models. ACM Comput. Surv., 51(5):93:1–93:42, 2019.
[20] IBM ILOG: CPLEX optimizer 12.8.0. http://www-01.ibm.com/software/commerce/optimization/cplex-optimizer, 2018.
[21] A. Ignatiev, A. Morgado, and J. Marques-Silva. PySAT: A Python toolkit for prototyping with SAT oracles. In SAT, pages 428–437, 2018.
[22] A. Ignatiev, A. Morgado, and J. Marques-Silva. RC2: An efficient MaxSAT solver. Journal on Satisfiability, Boolean Modeling and Computation, 11:53–64, 2019.
[23] A. Ignatiev, N. Narodytska, and J. Marques-Silva. Abduction-based explanations for machine learning models. In AAAI, 2019.
[24] A. Ignatiev, F. Pereira, N. Narodytska, and J. Marques-Silva. A SAT-based approach to learn explainable decision sets. In IJCAR, pages 627–645, 2018.
[25] G. Katz, C. W. Barrett, D. L. Dill, K. Julian, and M. J. Kochenderfer. Reluplex: An efficient SMT solver for verifying deep neural networks. In CAV (1), pages 97–117, 2017.
[26] E. B. Khalil, A. Gupta, and B. Dilkina. Combinatorial attacks on binarized neural networks. CoRR, abs/1810.03538, 2018.
[27] A. Kurakin, I. J. Goodfellow, and S. Bengio. Adversarial examples in the physical world. In 5th International Conference on Learning Representations, ICLR 2017, Toulon, France, April 24-26, 2017, Workshop Track Proceedings. OpenReview.net, 2017.
[28] H. Lakkaraju, S. H. Bach, and J. Leskovec. Interpretable decision sets: A joint framework for description and prediction. In KDD, pages 1675–1684, 2016.
[29] H. Lakkaraju, E. Kamar, R. Caruana, and J. Leskovec. Faithful and customizable explanations of black box models. In AIES, 2019.
[30] F. Leofante, N. Narodytska, L. Pulina, and A. Tacchella. Automated verification of neural networks: Advances, challenges and perspectives. CoRR, abs/1805.09938, 2018.
[31] C. M. Li and F. Manyà.
MaxSAT, hard and soft constraints. In Handbook of Satis\ufb01ability, pages\n\n613\u2013631. 2009.\n\n[32] M. H. Lif\ufb01ton, A. Previti, A. Malik, and J. Marques-Silva. Fast, \ufb02exible MUS enumeration.\n\nConstraints, 21(2):223\u2013250, 2016.\n\n[33] C. Liu, T. Arnon, C. Lazarus, C. Barrett, and M. J. Kochenderfer. Algorithms for verifying deep\n\nneural networks. CoRR, abs/1903.06758, 2019.\n\n[34] N. Liu, H. Yang, and X. Hu. Adversarial detection with model interpretation. In KDD, pages\n\n1803\u20131811, 2018.\n\n[35] S. M. Lundberg and S.-I. Lee. A uni\ufb01ed approach to interpreting model predictions. In I. Guyon,\nU. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors,\nAdvances in Neural Information Processing Systems 30, pages 4765\u20134774. Curran Associates,\nInc., 2017.\n\n[36] P. Marquis. Consequence \ufb01nding algorithms.\n\nIn Handbook of Defeasible Reasoning and\n\nUncertainty Management Systems, pages 41\u2013145. 2000.\n\n[37] S. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, and P. Frossard. Universal adversarial perturbations.\n\nCoRR, abs/1610.08401, 2016.\n\n[38] S. Moosavi-Dezfooli, A. Fawzi, O. Fawzi, P. Frossard, and S. Soatto. Robustness of classi\ufb01ers\n\nto universal perturbations: A geometric perspective. In ICLR. OpenReview.net, 2018.\n\n[39] N. Narodytska, A. Ignatiev, F. Pereira, and J. Marques-Silva. Learning optimal decision trees\n\nwith SAT. In IJCAI, pages 1362\u20131368, 2018.\n\n[40] N. Narodytska, S. P. Kasiviswanathan, L. Ryzhyk, M. Sagiv, and T. Walsh. Verifying properties\n\nof binarized deep neural networks. In AAAI, pages 6615\u20136624. AAAI Press, 2018.\n\n[41] P. Panda and K. Roy. Explainable learning: Implicit generative modelling during training for\n\nadversarial robustness. CoRR, abs/1807.02188, 2018.\n\n[42] N. Papernot, P. D. McDaniel, and I. J. Goodfellow. Transferability in machine learning: from\n\nphenomena to black-box attacks using adversarial samples. 
CoRR, abs/1605.07277, 2016.
[43] N. Papernot, P. D. McDaniel, I. J. Goodfellow, S. Jha, Z. B. Celik, and A. Swami. Practical black-box attacks against machine learning. In R. Karri, O. Sinanoglu, A. Sadeghi, and X. Yi, editors, Proceedings of the 2017 ACM on Asia Conference on Computer and Communications Security, AsiaCCS 2017, Abu Dhabi, United Arab Emirates, April 2-6, 2017, pages 506–519. ACM, 2017.
[44] A. Previti, A. Ignatiev, A. Morgado, and J. Marques-Silva. Prime compilation of non-clausal formulae. In IJCAI, pages 1980–1988. AAAI Press, 2015.
[45] R. Reiter. A theory of diagnosis from first principles. Artif. Intell., 32(1):57–95, 1987.
[46] M. T. Ribeiro, S. Singh, and C. Guestrin. "Why should I trust you?": Explaining the predictions of any classifier. In B. Krishnapuram, M. Shah, A. J. Smola, C. C. Aggarwal, D. Shen, and R. Rastogi, editors, KDD, pages 1135–1144. ACM, 2016.
[47] M. T. Ribeiro, S. Singh, and C. Guestrin. Anchors: High-precision model-agnostic explanations. In AAAI, pages 1527–1535. AAAI Press, 2018.
[48] A. S. Ross and F. Doshi-Velez. Improving the adversarial robustness and interpretability of deep neural networks by regularizing their input gradients. In AAAI, pages 1660–1669, 2018.
[49] C. Rudin. Please stop explaining black box models for high stakes decisions. CoRR, abs/1811.10154, 2018.
[50] S. J. Russell and P. Norvig. Artificial Intelligence - A Modern Approach, 3rd ed. Prentice Hall, 2010.
[51] R. Rymon. An SE-tree-based prime implicant generation algorithm. Ann. Math. Artif. Intell., 11(1-4):351–366, 1994.
[52] A. Shih, A. Choi, and A. Darwiche. A symbolic approach to explaining Bayesian network classifiers. In IJCAI, pages 5103–5111, 2018.
[53] M. Sundararajan, A. Taly, and Q. Yan. Axiomatic attribution for deep networks. In ICML, volume 70 of Proceedings of Machine Learning Research, pages 3319–3328.
PMLR, 2017.
[54] C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. J. Goodfellow, and R. Fergus. Intriguing properties of neural networks. In ICLR, 2014.
[55] G. Tao, S. Ma, Y. Liu, and X. Zhang. Attacks meet interpretability: Attribute-steered detection of adversarial samples. In NeurIPS, pages 7728–7739, 2018.
[56] R. Tomsett, A. Widdicombe, T. Xing, S. Chakraborty, S. Julier, P. Gurram, R. M. Rao, and M. B. Srivastava. Why the failure? How adversarial examples can provide insights for interpretable machine learning. In FUSION, pages 838–845, 2018.
[57] T. Wang and Q. Lin. Hybrid predictive model: When an interpretable model collaborates with a black-box model. CoRR, abs/1905.04241, 2019.
[58] K. Xu, S. Liu, G. Zhang, M. Sun, P. Zhao, Q. Fan, C. Gan, and X. Lin. Interpreting adversarial examples by activation promotion and suppression. CoRR, abs/1904.02057, 2019.
[59] K. Xu, S. Liu, P. Zhao, P. Chen, H. Zhang, D. Erdogmus, Y. Wang, and X. Lin. Structured adversarial attack: Towards general implementation and better interpretability. CoRR, abs/1808.01664, 2018.