{"title": "Classification-by-Components: Probabilistic Modeling of Reasoning over a Set of Components", "book": "Advances in Neural Information Processing Systems", "page_first": 2792, "page_last": 2803, "abstract": "Abstract Neural networks are state-of-the-art classification approaches but are generally difficult to interpret. This issue can be partly alleviated by constructing a precise decision process within the neural network. In this work, a network architecture, denoted as Classification-By-Components network (CBC), is proposed. It is restricted to follow an intuitive reasoning based decision process inspired by Biederman's recognition-by-components theory from\u00a0cognitive psychology. The network is trained to learn and detect generic components that characterize objects.\u00a0In parallel, a class-wise reasoning strategy based on these components is learned to solve the classification problem. In contrast to other work on reasoning, we propose three different types of reasoning: positive, negative, and indefinite. These three types together form a probability space to provide a probabilistic classifier. The decomposition of objects into generic components combined with the probabilistic reasoning provides by design a clear interpretation of the classification decision process. The evaluation of the approach on MNIST shows that CBCs are viable classifiers. Additionally, we demonstrate that the inherent interpretability offers a profound understanding of the classification behavior such that we can explain the success of an adversarial attack. The method's scalability is successfully tested using the ImageNet dataset.", "full_text": "Classi\ufb01cation-by-Components: Probabilistic\n\nModeling of Reasoning over a Set of Components\n\nSascha Saralajew1,\u2217 Lars Holdijk1,\u2217 Maike Rees1 Ebubekir Asan1 Thomas Villmann2,\u2217\n\n1Dr. Ing. h.c. F. 
Porsche AG, Weissach, Germany,\n\nsascha.saralajew@porsche.de\n\n2University of Applied Sciences Mittweida, Mittweida, Germany,\n\nthomas.villmann@hs-mittweida.de\n\nAbstract\n\nNeural networks are state-of-the-art classi\ufb01cation approaches but are generally\ndif\ufb01cult to interpret. This issue can be partly alleviated by constructing a pre-\ncise decision process within the neural network. In this work, a network archi-\ntecture, denoted as Classi\ufb01cation-By-Components network (CBC), is proposed.\nIt is restricted to follow an intuitive reasoning based decision process inspired\nby BIEDERMAN\u2019s recognition-by-components theory from cognitive psychology.\nThe network is trained to learn and detect generic components that characterize\nobjects. In parallel, a class-wise reasoning strategy based on these components is\nlearned to solve the classi\ufb01cation problem. In contrast to other work on reasoning,\nwe propose three different types of reasoning: positive, negative, and inde\ufb01nite.\nThese three types together form a probability space to provide a probabilistic clas-\nsi\ufb01er. The decomposition of objects into generic components combined with the\nprobabilistic reasoning provides by design a clear interpretation of the classi\ufb01-\ncation decision process. The evaluation of the approach on MNIST shows that\nCBCs are viable classi\ufb01ers. Additionally, we demonstrate that the inherent inter-\npretability offers a profound understanding of the classi\ufb01cation behavior such that\nwe can explain the success of an adversarial attack. The method\u2019s scalability is\nsuccessfully tested using the IMAGENET dataset.\n\n1\n\nIntroduction\n\nNeural Networks (NNs) dominate the \ufb01eld of machine learning in terms of image classi\ufb01cation\naccuracy. Due to their design, considered as black boxes, it is however hard to gain insights into\ntheir decision making process and to interpret why they sometimes behave unexpectedly. 
In general, the interpretability of NNs is controversially discussed [1–4], which has pushed researchers to develop new methods that address these weaknesses [5–7]. This is also highlighted by the topic of robustness of NNs against adversarial examples [8]. Prototype-based classifiers like Learning Vector Quantizers [9, 10] are more interpretable and can provide insights into their classification processes. Unfortunately, they are still hindered by their low base accuracies.

The method proposed in this work aims to answer the question of interpretability by drawing inspiration from BIEDERMAN's recognition-by-components theory [11] from the field of cognitive psychology. Roughly speaking, BIEDERMAN's theory describes how humans recognize complex objects by assuming that objects can be decomposed into generic parts that operate as structural primitives, called components. Objects are then classified by matching the extracted decomposition plan with a class Decomposition Plan (DP) for each potential object class.

∗Authors contributed equally.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: An example realization of the classification process of a CBC on a digit classification task. For simplicity, we illustrate a discrete case where "1" corresponds to detection / positive reasoning, "0" to no detection / negative reasoning, and "(cid:2)" to indefinite reasoning.

Intuitively, the class DPs describe which components are important to be detected and which components are important to not be detected for an object to belong to a specific class. For example, if we consider the classification of a digit as illustrated in Fig. 1, the detection of a component representing a vertical bar provides evidence in favor of the class 1. In other words, we reason positively over the vertical bar component for the class 1.
Similarly, we can reason negatively over all curved components. In contrast\nto other work on reasoning, the presented approach extends these two intuitive reasoning states by\na third type considering inde\ufb01nite reasoning. In Fig. 1, not all components will be important for\nthe recognition of a 1. For instance, we reason neither positively nor negatively over the serif and\nbottom stroke because not all writing styles use them. In Sec. 2, a network architecture is introduced\nthat models the described classi\ufb01cation process in an end-to-end trainable framework such that the\ncomponents as well as the class DPs can be learned. In line with BIEDERMAN\u2019s theory, we call this\na Classi\ufb01cation-By-Components network (CBC).\nIn summary, the contribution of this paper is a classi\ufb01cation method, called CBC, with the fol-\nlowing important characteristics: (1) The method classi\ufb01es its input by applying positive, negative,\nand inde\ufb01nite reasoning over an extracted DP. To the best of our knowledge, this is the \ufb01rst time\nthat optionality of components / features is explicitly modeled. (2) The method uses a probabilis-\ntic reasoning process that directly outputs class hypothesis probabilities without requiring heuristic\nsquashing methods such as softmax. (3) The reasoning process is easily interpretable and simpli\ufb01es\nthe understanding of the classi\ufb01cation decision. (4) The method retains advantages of NNs such as\nbeing end-to-end trainable on large scale datasets and achieving high accuracies on complex tasks.\n\n2 The classi\ufb01cation-by-components network\n\nIn the following, we will describe the CBC architecture and how to train it. We present the ar-\nchitecture using full-size components and consecutively generalize this to patch components. Both\nprinciples are used in the evaluation in Sec. 4. 
The architectures are defined (without loss of generality) for vectorial inputs but can be extended to higher dimensional inputs like images.

2.1 Reasoning over a set of full-size components

The proposed framework relies on a probabilistic model based on a probability tree diagram T. This tree T can be decomposed into sub-trees Tc for each class c with the prior class probability P(c) on the starting edge. Such a sub-tree is depicted in Fig. 2. The whole probability tree diagram is modeled over five random variables: c, indicator variable of the class; k, indicator variable of the component; I, binary random variable for importance; R, binary random variable for reasoning by detection; D, binary random variable for detection. The probabilities in the tree Tc are interpreted in the following way: P(k), probability that the k-th component occurs; P(I|k, c), probability that the k-th component is important for the class c; P(R|k, c), probability that the k-th component has to be detected for the class c; P(D|k, x), probability that the k-th component is detected in the input x. The horizontal bar indicates the complementary event, i. e. P(D̄|k, x) is the probability that the k-th component is not detected in the input x. Based on these definitions we derive the CBC architecture.

Figure 2: The probability tree diagram Tc that represents the reasoning about a class c. For better readability, the variable of class c is dropped in the mathematical expressions and we only show the full sub-tree for the first component. The solid line paths are the paths of agreement.

Extracting the decomposition plan  Given an input x ∈ R^nx and a set of trainable full-size components K = {κk ∈ R^nκ | k = 1, ..., #K} with nx = nκ, the first part of the network detects the presence of a component κk in x.
A feature extractor f(x) = f(x; θ) with trainable weights θ takes an input and outputs a feature vector f(x) ∈ R^mx. The feature extractor is used in a Siamese architecture [12] to extract the features of the input x and of all the components {f(κk)}k. The extracted features are used to measure the probability P(D|k, x) for the detection of a component by a detection probability function dk(x) = d(f(x), f(κk)) ∈ [0, 1] with the requirement that f(x) = f(κk) implies dk(x) = 1. Examples of suitable detection probability functions are the negative exponential over the squared Euclidean distance or the cosine similarity with a suitable handling of its negative part. To finalize the first part of the network, the detection probabilities are collected into the extracted DP as a vector d(x) = (d1(x), ..., d#K(x))^T ∈ [0, 1]^#K.

Modeling of the class decomposition plans  The second part of the network models the class DPs for each class c ∈ C = {1, ..., #C} using the three forms of reasoning discussed earlier. Therefore, we define the reasoning probabilities r+c,k, r−c,k, and r0c,k as trainable parameters of the model. Positive reasoning r+c,k = P(I, R|k, c): the probability that the k-th component is important and must be detected to support the class hypothesis c. Negative reasoning r−c,k = P(I, R̄|k, c): the probability that the k-th component is important and must not be detected to support the class hypothesis c. Indefinite reasoning r0c,k = P(Ī|k, c): the probability that the k-th component is not important for the class hypothesis c.2 Together they form a probability space and hence r+c,k + r−c,k + r0c,k = 1. All reasoning probabilities are collected class-wise into vectors r+c = (r+c,1, ..., r+c,#K)^T ∈ [0, 1]^#K, r−c, and r0c, respectively.

Reasoning  We compute the class hypothesis probability pc(x) regarding the paths of agreement under the condition of importance. An agreement A is a path in the tree T where either a component is detected (D) and requires reasoning by detection (R), or a component is not detected (D̄) and requires reasoning by no detection (R̄). The paths of agreement are marked with solid lines in Fig. 2. Hence, we model pc(x) by P(A|I, x, c):

P(A|I, x, c) = [Σk (P(R|k, c) P(D|k, x) + P(R̄|k, c) P(D̄|k, x)) P(I|k, c) P(k)] / [Σk (1 − P(Ī|k, c)) P(k)].

Substituting the short form notations for the probabilities, assuming that P(k) = 1/#K, and rewriting with matrix calculus yields

pc(x) = [d(x)^T · r+c + (1 − d(x))^T · r−c] / [1^T · (1 − r0c)] = d(x)^T · r̄+c + (1 − d(x))^T · r̄−c,    (1)

where 1 is the one vector of dimension #K and r̄±c are the normalized effective reasoning possibility vectors. The probabilities for all classes are then collected into the class hypothesis possibility vector p(x) = (p1(x), ..., p#C(x))^T to create the network output. We emphasize that p(x) is a possibility vector as Σc pc(x) = 1 does not necessarily hold. See the supplementary material Sec. B.1 for a detailed derivation of Eq. (1) and Sec.
B.2 for a transformation of p(x) into a class probability vector.

2Note that the idea to explicitly model the state that a component does not contribute and to avoid the general probabilistic approach r+c,k = 1 − r−c,k is related to the DEMPSTER–SHAFER theory of evidence [13].

Figure 3: CBC with patch components and spatial reasoning for image inputs.

Training of a CBC  We train the networks end-to-end by minimizing the contrastive loss

l(x, y) = φ(max{pc(x) | c ≠ y, c ∈ C} − py(x)),    (2)

where y ∈ C is the class label of x, using stochastic gradient descent learning. The function φ : [−1, 1] → R is a monotonically increasing, almost everywhere differentiable squashing function. It regulates the generalization-robustness trade-off over the probability gap between the correct and the highest probable incorrect class. This loss is similar to commonly used functions in prototype-based learning [14, 15]. The trainable parameters of a CBC are θ, all κ ∈ K, and r+c, r−c, r0c for all c ∈ C. We refer to the supplementary material Sec. D for detailed information about the training procedure.

2.2 Extension to patch components

Assume the feature extractor f processes different input sizes down to a minimum (receptive field) dimension of n0, similar to most Convolutional NNs (CNNs). To relax the assumption nx = nκ of full-size components and to step closer to the motivating example of Fig. 1, we use a set K of trainable patch components with nx ≥ nκ ≥ n0 such that f(κk) ∈ R^mκ where mx ≥ mκ. Moreover, dk(x) is extended to a sliding operation [16, 17], denoted as (cid:126). The result is a detection possibility stack (extracted spatial DP) of size vd × #K where vd is the spatial dimension after the sliding operation, see Fig. 3 for an image processing CBC.
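For concreteness, the reasoning of Eq. (1) can be sketched in a few lines of NumPy. The function name and array shapes below are illustrative assumptions, not the released implementation:

```python
import numpy as np

def class_hypothesis(d, r_pos, r_neg, r_ind):
    """Eq. (1): class hypothesis possibilities from the extracted DP.

    d: detection probabilities d(x), shape (K,), values in [0, 1].
    r_pos, r_neg, r_ind: reasoning probabilities, shape (C, K), with
    r_pos + r_neg + r_ind == 1 element-wise.
    Returns p(x) of shape (C,); its entries need not sum to one.
    """
    norm = (1.0 - r_ind).sum(axis=1)              # 1^T (1 - r0_c) per class
    return (r_pos @ d + r_neg @ (1.0 - d)) / norm
```

A detected component contributes through its positive reasoning probability and an undetected one through its negative reasoning probability, while indefinite reasoning only enters via the normalization.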
However, Eq. (1) can only handle one detection probability for each component and thus the reasoning process has to be redefined:

Downsampling  A simple approach is to downsample the detection possibility stack over the spatial dimension vd such that the output is a detection possibility vector and Eq. (1) can be applied. This can be achieved by applying global pooling techniques like global max pooling.

Spatial reasoning  Another approach is the extension of the reasoning process to work on the spatial DP, which we call spatial reasoning. For this, the detection possibility stack of size vd × #K is kept as depicted in Fig. 3. To compute the class hypothesis probabilities pc(x), Eq. (1) is redefined to be a weighted mean over the reasoning at each spatial position i = 1, ..., vd. Thereby, αc,i ∈ [0, 1] with Σi αc,i = 1 are the (non-)trainable class-wise pixel probabilities resembling the importance of each pixel position i. See the supplementary material in Sec. C for a further extension.

3 Related Work

Reasoning in neural networks  In its simplest form, one can argue that a NN already yields decisions based on reasoning. If one considers a NN to be entirely similar to a multilayer perceptron, the sign of each weight can be interpreted as either negative or positive reasoning over the corresponding feature. In this case, a weight of zero would model indefinite reasoning. However, the use of Rectified Linear Unit (ReLU) activations forces NNs to be driven by positive reasoning only. Nevertheless, this interpretation of the weights is used in interpretation techniques such as Class-Activation-Mapping (CAM) [5], which is similar to heatmap visualizations of CBCs.

Explicit modeling of reasoning  The use of components, and the inclusion of negative and indefinite reasoning, can be seen as an extension of the work in [7].
However, CBCs do not rely on the complicated three step training procedure presented in the paper and are built upon a probabilistic reasoning model. In [18], a form of reasoning is introduced similar to the indefinite reasoning state by occluding parts of the learned representation. Their components are, however, modeled in a textual form. In general, the reasoning process has slight similarities to ideas mentioned in [19] and the modeling of knowledge via graph structures [20–22].

Figure 4: Learned reasoning process of a CBC with 9 components (labeled a–i) on MNIST. Top row: The learned components. Bottom row: The learned reasoning probabilities collected in reasoning matrices. The class is indicated by the MNIST digit below. The top row corresponds to r+c,k, middle row to r0c,k, and bottom row to r−c,k. White squares depict a probability of one and black squares of zero.

Feature visualization  If the components are defined as trainable parameters in the input space, then the learned components become similar to feature visualization techniques of NNs [23–25]. In contrast, the components are the direct visualizations of the penultimate layer weights (detection probability layer), are not computed via a post-processing, and have a probabilistic interpretation. Moreover, we are not applying regularizations to the components to resemble realistic images.

Prototype-based classification rules and similarity learning  A key ingredient of the proposed network is a Siamese architecture to learn a similarity measure [12, 26–28] and the idea to incorporate a kind of prototype-based classification rule into NNs [29–35]. Currently, the prototype3
Currently, the prototype3\nclassi\ufb01cation principle is gaining a lot of attention in few-shot learning due to its ability to learn fast\nfrom few data [29, 30, 36\u201338]. The idea to replace prototypes with patches in similarity learning has\nalso been gaining attraction, as can be seen in [39] for the use of object tracking.\n\n4 Evaluation\n\nIn this section, the evaluation of the CBCs is presented. Throughout the evaluation, interpretability is\nconsidered as an important characteristic. In this case, something is interpretable if it has a meaning\nto experts. We evaluate CBCs on MNIST [40] and IMAGENET [41]. The input spaces are de\ufb01ned\nover [0, 1] and the datasets are normalized appropriately. Moreover, components that are de\ufb01ned in\nthe input space are constrained to this space as well. The CBCs use the cosine similarity with ReLU\nactivation as detection probability function. They are trained with the margin loss de\ufb01ned as Eq. (2)\nwith \u03c6 (x) = ReLU(x + \u03b2), where \u03b2 is a margin parameter, using the Adam optimizer [42]. An ex-\ntended evaluation including an ablation study regarding the network setting on MNIST is presented\nin the supplementary material in Sec. E. Where possible, we report mean and standard deviation of\nthe results. The source code is available at www.github.com/saralajew/cbc_networks.\n\n4.1 MNIST\n\nThe CNN feature extractors are implemented without the use of batch normalization [43], with\nSwish activation [44], and the convolutional \ufb01lters constraint to a Euclidean norm of one. We trained\nthe components and reasoning probabilities from scratch using random initialization. Moreover, the\nmargin parameter \u03b2 was set to 0.3.\n\n4.1.1 Negative reasoning: Beyond the best matching prototype principle\n\nThe CBC architecture in this experiment uses a 4-layer CNN feature extractor and full-size com-\nponents. 
During the ablation study we found that in nearly all cases this CBC with 10 components converged to the Best Matching Prototype Principle (BMPP) [45] and formed prototypical components. This means that the reasoning for one class is performed with only strong positive reasoning over one component and indefinite reasoning over all the other components, e. g. see the reasoning matrix of class 0 in Fig. 4 and the corresponding prototypical component d. To analyze if the network is able to classify using negative reasoning, we restricted the number of components to be smaller than the number of classes.

3In contrast to prototypes, components are not class-dependent.

Figure 5: Visualization of the α-CBC heatmaps and the (cid:31)-CBC reconstructions for an adversarial input. For simplicity, we illustrate the more meaningful visualization for each model. The model visualizations correspond to the best matching reasoning stack regarding the input. We use the color coding "JET" to map probabilities of 0 to blue and 1 to red.

Fig. 4 shows the learned reasoning process of a CBC with 9 components. Similar to the 10 component version, the CBC learns to classify as many classes as possible by the BMPP. In the example, these are all classes except the class 1, for which the CBC uses weak positive reasoning over the components a, c, f, and h but mostly depends on negative reasoning over component i. This indicates that if an input image is classified as a 1, the network requires it to not look like an 8. A comparison of the shapes of the digits 1 and 8 supports this observation: the 8 consists only of curved edges while the 1 does not contain any, and on average the 1 contains the least white pixels while the 8 requires the most.
This result shows that by incorporating the negative and indefinite reasoning state, the CBCs are able to learn by themselves both the well understood BMPP and unrestricted approaches beyond the intuitive classification principles. Both networks achieved close to the state-of-the-art test accuracies over three runs of (99.32 ± 0.09)%.

4.1.2 Interpretation of the reasoning

In this section, we show the interpretability of CBCs. Similar to interpretation techniques for NNs, we do this by considering input dependent and input independent visualizations. Moreover, to stress the visualizations in such a way that they really show how the model classifies, we: (1) Train two patch component CBCs similar to Fig. 3, one with trainable pixel probabilities, denoted as α-CBC, and one with non-trainable pixel probabilities fixed to αc,i,j = (vd · hd)^−1, denoted as (cid:31)-CBC. (2) Generate an adversarial image for both models with the boundary attack [46] and show how they fool the model. Both CBCs use 8 patch components4 of size vκ, hκ = 7. The feature extractor is a 2-layer CNN which extracts feature stacks of spatial size v′κ, h′κ = 1 and v′x, h′x = 22. The spatial reasoning size of vd, hd = 7 was obtained by including a final max pooling operation of pool size 3 in d(x). Additionally, for each class, two reasoning possibility stacks were learned and winner-take-all was applied to determine pc(x). We call this multiple reasoning as we allow the model to learn multiple concepts for each class. The final test accuracies of both models are quasi equivalent and on average over three runs (97.33 ± 0.19)%. Similar to the previous section, the patch components start to resemble realistic digit parts like strokes, arcs, line-endings, etc.

The interpretability of the CBCs is based on visualizations of how the probability mass is distributed over the tree T. The class hypothesis probability pc(x), see Eq.
(1), is the probability of agreement under the condition of importance, denoted by A|I. This event describes the correct matching of the extracted and class DP. Moreover, we decompose this event into a positive and a negative reasoning part: Positive A|I is the event that a component is detected that should be detected and is denoted by A+|I. Negative A|I is the event that a component that should not be detected is not detected and is denoted by A−|I. Both events can be related to paths in the trees Tc from the root to the leaves, i. e. A+|I is the upper solid line path and A−|I is the lower solid line path in Fig. 2. The probability of A|I can be thought of as evidence in favor of a class. Similarly, we can consider the complementary event of A|I, which is disagreement under the condition of importance, denoted by Ā|I, and occurs when the extracted DP does not match the class DP. Again, this occurs either as positive Ā|I, when a component over which the CBC reasons positively is not detected, denoted by Ā+|I, or as negative Ā|I, when a component with negative reasoning is detected, denoted by Ā−|I. The related paths in the tree Tc in Fig. 2 are the dashed line paths excluding non-importance. In general, the probability of Ā|I is evidence against a class.

According to Eq. (1), the visualizations are based on the probabilities in the tree T for the respective detection possibility vectors zi,j. These probabilities are collected into the following possibility vectors:5 zi,j ◦ r̄+c,i,j for A+|I; (1 − zi,j) ◦ r̄−c,i,j for A−|I; (1 − zi,j) ◦ r̄+c,i,j for Ā+|I; zi,j ◦ r̄−c,i,j for Ā−|I. Moreover, we collect all the possibility vectors of one event for all i, j in a stack. Using such a stack we create the visualizations by three procedures: Probability heatmaps: Upsample a stack to the input size and sum over k. This visualizes the probabilities for the respective event at each position. Reconstructions: Upsample a stack to v′x × h′x × #K, scale each patch component κk by the respective probability, and draw them onto an initially black image of size vx × hx at the respective position. After a normalization step, the resulting reconstruction image gives an impression of the combination of the patches that is used to classify the image. Incorporation of pixel probabilities: Upsample the class-wise pixel probability maps αc to vx × hx and normalize by the maximum value such that the most important pixels have a value of one. This map is finally overlaid over the heatmaps and reconstructions to highlight the impact of each pixel on the overall classification decision.

4The idea is to learn patches of: four quarters of a circle plus two diagonal, horizontal, and vertical lines.

Input independent interpretation  Input independent interpretations are calculated by setting zi,j to the optimal vector with 1 for positive and 0 for negative A|I. They provide an answer to the question: "What has the model learned about the dataset?", see Fig. 5 "x independent". For both models, the learned concepts of the clean and adversarial class are visualized by the optimal A+|I and A−|I. As visible in the heatmaps, the α-CBC learned to recognize only as few parts as needed to distinguish the two classes. In case of the 4, this consists of a check that there is no stroke at the bottom and top, see A−|I, while there is a corner on the left, see A+|I. Such a radical sparse coding is learned for all classes. The reasoning for the 9 is similar except that it requires A+|I instead of A−|I for the top stroke.
In contrast, the (cid:31)-CBC learned the whole concept for digits and not just a sparse coding, as the reconstructions show real digit shapes in the A+|I. Moreover, the model performs interpretable "sanity checks" via A−|I, e. g. no top stroke at the 4.

Input dependent interpretation  Input dependent interpretations are obtained by setting zi,j to di,j(x). To understand why the adversarial images fool the models by human imperceptible "noise" we answer the following question: "Which parts of the input provide evidence for / against the current classification decision?", see Fig. 5 "x dependent". By considering the clean probability histogram p(x) of the α-CBC we see that the clean input perfectly fits the learned concept of a 4 as it had a probability of 1. The adversarial attack has turned the input into a 4 and a 9 at the same time, see the adversarial p(x). Remarkably, the attack found the high similarity between the two learned concepts and attacks the model by highlighting a few pixels in the top bar region in the form of a patch – the manipulation only changes one pixel in d(x). Hence, the concept of a 4 is slightly violated, as we see a highlighting of the top stroke region in the Ā−|I. This causes the probability drop of the class 4. At the same time, these few pixels provide A+|I for the top stroke of a 9 and, hence, raise the probability. For the (cid:31)-CBC, the attack behavior is totally different. Since the clean input already does not match the learned concept perfectly, as p4(x) ≈ 0.8, the attack fools the model by reducing the contrast via background noise. For example, via the Ā+|I the model highlights that the clear detection of the upper part of the 4 is not given. Moreover, it recognizes that there could be a top / bottom stroke, see Ā
A similar interpretation holds for the adversarial class.\n\n5The symbol \u201c\u25e6\u201d denotes the Hadamard product (element-wise multiplication).\n\n7\n\n\f1.00\n\n1.00\n\n1.00\n\n1.00\n\n0.99\n\n0.98\n\n0.88\n\n0.84\n\n0.72\n\n0.70\n\n1.00\n\n1.00\n\n1.00\n\n1.00\n\n1.00\n\n0.98\n\n0.75\n\n0.65\n\n0.61\n\n0.61\n\n1.00\n\n1.00\n\n1.00\n\n1.00\n\n0.83\n\n0.79\n\n0.65\n\n0.61\n\n0.55\n\n0.53\n\nFigure 6: The 10 components with the highest r+\nc,k for three different classes in the IMAGENET\ndataset. From top to bottom the classes are: dalmatian, giant panda, and trolleybus. Below\neach component the r+\n\nc,k (rounded to two digits) is given with respect to the class in question.\n\n\u22121 is trained to learn a strong concept as it\nOverall result The (cid:31)-CBC with \u03b1c,i,j = (vd \u00b7 hd)\ncan only reach py (x) \u2248 1 if it reasons perfectly at each pixel position. Therefore, the probability\nhistogram shows a relatively high base probability for all classes, as the overlap between encoded\ndigits to a spatial size of vd, hd = 7 is often around 50%. Moreover, this restrictive classi\ufb01cation\nprinciple violates the motivating example in Fig. 1 as the model cannot apply inde\ufb01nite reasoning\nover a pixel region. In contrast, the \u03b1-CBC is capable of modeling the motivating example but is\nat the same time a clear example of what happens if we optimize without any constraints as usually\nperformed in NNs. Since the model is trained by minimizing an energy function, it learns to classify\ncorrectly with the lowest effort and, hence, oversimpli\ufb01es. Therefore, the classi\ufb01cation will be\nperformed in a non-intuitive way. 
Moreover, the interpretation shows that the classification of both CBCs is based on non-robust features of f, as both are highly sensitive to background manipulations.

4.2 IMAGENET

To evaluate CBCs on more complex data, we trained a CBC on the IMAGENET dataset. The CBC trained on IMAGENET was implemented using a pre-trained ResNet-50 [47] as a non-trainable feature extractor. In contrast to the CBCs discussed earlier, the patch components of shape m_κ = 2 × 2 × 2048 are defined directly in the feature space. This removes the relation between the components and the input space but drastically improves training time. After downsampling the detection possibility stack of size v_d, h_d = 6 by global max pooling, the reasoning is applied, see Sec. 2.2. The components were initialized by cropping the center of 5 images from each class and subsequently processing them through the feature extractor, resulting in 5 000 patch components. If the component κ_k was initialized by a sample from the class c, then we initialized r^+_{c,k} as a uniform random value from [z, 1] with z = 0.75, and as a uniform random value from [0, 1 − z] otherwise. Afterwards, the initialization of r^-_{c,k} was determined by r^+_{c,k} · (1 − r^+_{c,k}). Hence, we biased the model towards positive reasoning for components that were sampled from the respective class. The CBC was trained with the margin loss and β = 0.1. In compliance with earlier work on IMAGENET, the input images were rescaled by first scaling the shortest side to 224 and then center cropping to a size of 224 × 224. For the same reason, no image augmentation was used.

Interpretability In Fig. 6, the 10 components with the highest positive reasoning probabilities for three exemplary classes are presented.
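The class-biased initialization of the reasoning probabilities described above can be sketched in a few lines of NumPy. The component-to-class assignment and all variable names below are hypothetical; only the component count, z = 0.75, and the rule r^-_{c,k} = r^+_{c,k} · (1 − r^+_{c,k}) are taken from the text.

```python
import numpy as np

rng = np.random.default_rng(0)
n_classes, n_components, z = 1000, 5000, 0.75

# Hypothetical class assignment: 5 components cropped from each of the
# 1000 classes, i.e. component k stems from class k % n_classes.
component_class = np.arange(n_components) % n_classes
same_class = component_class[None, :] == np.arange(n_classes)[:, None]

# r+ ~ U[z, 1] if the component stems from the class, else U[0, 1 - z].
r_plus = np.where(same_class,
                  rng.uniform(z, 1.0, size=same_class.shape),
                  rng.uniform(0.0, 1.0 - z, size=same_class.shape))

# r- is then fixed by r- = r+ * (1 - r+). Because
# r+ + r+ * (1 - r+) = 1 - (1 - r+)**2 <= 1, some probability mass
# always remains available for indefinite reasoning.
r_minus = r_plus * (1.0 - r_plus)
```

Note that the constraint r^+ + r^- ≤ 1 holds automatically for every entry, which is what allows the three reasoning types to form a probability space per component.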
After training the components in the feature space, the input representation of a component is determined by searching for the highest detection probability in the training set for the given component and cropping the corresponding image area in the input space. This method is similar to the approach from [7]. In general, the components with a high positive reasoning probability (above the initialization bound of z) are found to be conceptually meaningful for the respective class. Further investigation of the components shows that the detection of the component with the second highest positive reasoning probability for the dalmatian class in an image also provides evidence in favor of the giant panda class. Similarly, the component with the fifth highest positive reasoning probability for the dalmatian class is also highly important for the classes hyena, snow leopard, and english setter, while the component with the fifth highest positive reasoning probability for the class trolleybus is also important for the class trolley car. Similar shared components can be found across many classes, which shows that the CBC is capable of learning complex class-independent structures.

Averaged across all classes, a positive reasoning probability greater than z was learned for 5.2 ± 0.8 components per class, while a negative reasoning probability greater than z was assigned to 2 781.8 ± 23.3 out of 5 000 components. As can be seen in Fig. 6, in most cases the positive reasoning probabilities assigned to components are close to 1.00. This includes components that were not initialized with a bias towards the class in question. For example, the component with the fifth highest positive reasoning probability for the dalmatian class was initially biased towards the english setter class.
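The back projection of feature-space components to the input space can be sketched as follows. The array layout, the patch size, and the function name are our assumptions; only the rule itself – crop the image region where the component is detected with the highest probability over the training set – is from the text.

```python
import numpy as np

def component_input_patches(detection_probs, images, patch=64):
    """For every component, crop the region of the training image in which
    the component is detected with the highest probability (cf. [7]).

    detection_probs: (n_images, grid_h, grid_w, n_components) detection
                     probabilities on a spatial grid (assumed layout).
    images:          (n_images, height, width, channels) input images.
    patch:           assumed receptive-field size of one grid cell, in pixels.
    """
    n_images, grid_h, grid_w, n_comp = detection_probs.shape
    stride_y = images.shape[1] // grid_h  # assumed grid-cell -> pixel mapping
    stride_x = images.shape[2] // grid_w
    crops = []
    for k in range(n_comp):
        flat = detection_probs[..., k].reshape(n_images, -1)
        best_img = int(flat.max(axis=1).argmax())            # best training image
        y, x = divmod(int(flat[best_img].argmax()), grid_w)  # best grid cell in it
        crops.append(images[best_img,
                            y * stride_y : y * stride_y + patch,
                            x * stride_x : x * stride_x + patch])
    return crops
```

In a real pipeline the grid-cell-to-pixel mapping would follow the receptive field of the feature extractor rather than this simple integer stride.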
The ratio between the number of positive and negative reasoning components suggests that the model heavily relies on negative reasoning to establish a baseline for its classification decision. We hypothesize that in this higher dimensional setting with a large number of components, positive reasoning is primarily utilized to fine-tune the model's classification decision after a rough categorization by negative reasoning.

Performance To evaluate the performance of CBCs, we compare both the accuracy and the inference time to those of a CNN. The resulting CBC had an inference time of (371 ± 6) images / sec, similar to the (369 ± 2) images / sec of a normal ResNet-50 with global average pooling and a fully-connected layer. This shows that the CBC generates no significant computational overhead. The top-5 validation accuracy of 82.4% is on par with earlier CNN generations such as AlexNet with 82.8% [48]. Note that the CBC used had a non-trainable feature extractor and that no parameter tuning was performed. We are confident that the accuracy of CBCs on IMAGENET can be improved in further studies. The CBC was evaluated using one NVIDIA Tesla V100 32 GB GPU.

5 Conclusion and outlook

In this paper, we have presented a probabilistic classification model called classification-by-components network together with several possible realizations. Boiled down to its essence, the change we made is the definition of a probabilistic framework for the final and penultimate layers of a NN. The detection probability layer is an extension of a convolution layer with the requirement that the detection of convolutional filters, called components, is expressed in probabilities. Moreover, the final reasoning layer is still affine but follows a special implicit constraint defined by the probability model. The overall output is a probability value for each class without any artificial squashing.
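To illustrate why no squashing function is needed, consider a simplified sketch of reasoning over pooled detection probabilities. This is our reading, not the paper's exact definition: it assumes uniform component weights and normalizes the agreement by the non-indefinite mass r^+ + r^-.

```python
import numpy as np

def class_probabilities(d, r_plus, r_minus):
    """Agreement between detections d in [0, 1] and class-wise reasoning.

    Each summand d * r+ + (1 - d) * r- is bounded above by r+ + r-, so
    after normalizing by the summed non-indefinite mass the result lies
    in [0, 1] per class by construction -- no squashing required.
    """
    agreement = r_plus @ d + r_minus @ (1.0 - d)
    return agreement / (r_plus + r_minus).sum(axis=1)

d = np.array([1.0, 0.0, 1.0])           # pooled detection probabilities
r_plus = np.array([[0.9, 0.0, 0.8],     # class 0 expects components 0 and 2
                   [0.0, 0.9, 0.0]])    # class 1 expects component 1
r_minus = r_plus * (1.0 - r_plus)       # initialization rule from Sec. 4.2
p = class_probabilities(d, r_plus, r_minus)
```

With these detections, class 0 receives a high probability (its expected components are detected) while class 1 is penalized by negative reasoning over the detected components it does not expect.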
Independent of the feature extractor used in the CBC, we can always take advantage of this relation during inference by redefining the network as a single feedforward NN such that almost no computational overhead is created. This is shown in the experiment on IMAGENET.

Depending on the training setup, the method inherently offers a range of interpretation properties, all founded on the new probability framework. As shown in the MNIST experiments with Siamese architectures, the method can produce human-understandable components and is able to converge to the BMPP without any explicit regularization. Additionally, in an experiment with patch components on MNIST, we have shown that the models can answer questions about the classification decision. More precisely, the model shows what causes the failure on an adversarial example. The conclusion drawn here supports the recently published results in [49]. A drawback of the Siamese architecture is the training overhead and the potential introduction of many parameters, since the components live in the input space. With non-Siamese training, CBCs have almost no downsides compared to NNs. To be able to use all the presented interpretation techniques, the back projection strategy presented in [7] can be applied, as we have shown on IMAGENET. The evaluation on IMAGENET also showed that CBCs are capable of learning high dimensional components that can be utilized by multiple classes. Investigation of these shared components can provide additional insight into the model's classification approach. The heatmap visualizations are always applicable and extend the familiar CAM method by the option to visualize disagreement.

The CBC is a promising new method for classification and motivates further research. An initial robustness evaluation and the use of the class hypothesis possibility vectors for outlier detection show promising results, see the supplementary material in Sec. E.2.4.
Nevertheless, the following questions remain unanswered: What are proper regularizations for α_{c,i}? What are more suitable detection probability functions? What are the advantages of the explicit injection of knowledge into the network in the form of trainable or non-trainable components, as we partly applied in the IMAGENET experiment?

Acknowledgements

We would like to thank Peter Schlicht and Jacek Bodziony from Volkswagen AG, Jensun Ravichandran from the University of Applied Sciences Mittweida, and Frank-Michael Schleif from the University of Applied Sciences Würzburg-Schweinfurt for their valuable input on previous versions of the manuscript. We would also like to thank the whole team at the Innovation Campus of Porsche AG, especially Emilio Oldenziel, Philip Elspas, Mathis Brosowsky, Simon Isele, Simon Mates, and Sebastian Söhner for their continued support and input. Lastly, we would like to thank our attentive anonymous reviewers whose comments have greatly improved this manuscript.

References

[1] B. M. Lake, T. D. Ullman, J. B. Tenenbaum, and S. J. Gershman. Building machines that learn and think like people. Behavioral and Brain Sciences, 40, 2017.

[2] J. Adebayo, J. Gilmer, M. Muelly, I. Goodfellow, M. Hardt, and B. Kim. Sanity checks for saliency maps. In Advances in Neural Information Processing Systems, pages 9505–9515, 2018.

[3] R. Geirhos, P. Rubisch, C. Michaelis, M. Bethge, F. A. Wichmann, and W. Brendel. ImageNet-trained CNNs are biased towards texture; increasing shape bias improves accuracy and robustness. In International Conference on Learning Representations, 2019.

[4] C. Rudin. Stop explaining black box machine learning models for high stakes decisions and use interpretable models instead. Nature Machine Intelligence, 1(5):206, 2019.

[5] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2921–2929, 2016.

[6] Y. Gal and Z. Ghahramani. Dropout as a Bayesian approximation: Representing model uncertainty in deep learning. In International Conference on Machine Learning, pages 1050–1059, 2016.

[7] C. Chen, O. Li, C. Tao, A. J. Barnett, J. Su, and C. Rudin. This looks like that: Deep learning for interpretable image recognition. arXiv preprint arXiv:1806.10574, 2018.

[8] I. Goodfellow, J. Shlens, and C. Szegedy. Explaining and harnessing adversarial examples. In International Conference on Learning Representations, 2015.

[9] J. Bien and R. Tibshirani. Prototype selection for interpretable classification. The Annals of Applied Statistics, 5(4):2403–2424, 2011.

[10] M. Biehl, B. Hammer, and T. Villmann. Prototype-based models in machine learning. Wiley Interdisciplinary Reviews: Cognitive Science, 7(2):92–111, 2016.

[11] I. Biederman. Recognition-by-components: A theory of human image understanding. Psychological Review, 94(2):115, 1987.

[12] J. Bromley, I. Guyon, Y. LeCun, E. Säckinger, and R. Shah. Signature verification using a "Siamese" time delay neural network. In Advances in Neural Information Processing Systems, pages 737–744, 1994.

[13] G. Shafer. A mathematical theory of evidence, volume 42. Princeton University Press, 1976.

[14] A. Sato and K. Yamada. Generalized Learning Vector Quantization. In Advances in Neural Information Processing Systems, pages 423–429, 1996.

[15] K. Crammer, R. Gilad-Bachrach, A. Navot, and A. Tishby. Margin analysis of the LVQ algorithm. In Advances in Neural Information Processing Systems, pages 479–486, 2003.

[16] K. Ghiasi-Shirazi. Generalizing the convolution operator in convolutional neural networks.
Neural Processing Letters, pages 1–20, 2019.

[17] S. Saralajew, L. Holdijk, M. Rees, and T. Villmann. Prototype-based neural network layers: incorporating vector quantization. arXiv preprint arXiv:1812.01214, 2018.

[18] P. Tokmakov, Y.-X. Wang, and M. Hebert. Learning compositional representations for few-shot recognition. arXiv preprint arXiv:1812.09213, 2018.

[19] Z. Akata, F. Perronnin, Z. Harchaoui, and C. Schmid. Label-embedding for attribute-based classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 819–826, 2013.

[20] K. Marino, R. Salakhutdinov, and A. Gupta. The more you know: Using knowledge graphs for image classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2673–2681, 2017.

[21] C. Jiang, H. Xu, X. Liang, and L. Lin. Hybrid knowledge routed modules for large-scale object detection. In Advances in Neural Information Processing Systems, pages 1559–1570, 2018.

[22] X. Chen, L.-J. Li, L. Fei-Fei, and A. Gupta. Iterative visual reasoning beyond convolutions. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7239–7248, 2018.

[23] D. Erhan, Y. Bengio, A. Courville, and P. Vincent. Visualizing higher-layer features of a deep network. University of Montreal, 1341(3):1, 2009.

[24] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In European Conference on Computer Vision, pages 818–833. Springer, 2014.

[25] A. Nguyen, J. Yosinski, and J. Clune. Understanding neural networks via feature visualization: A survey. arXiv preprint arXiv:1904.08939, 2019.

[26] S. Chopra, R. Hadsell, and Y. LeCun. Learning a similarity metric discriminatively, with application to face verification.
In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 539–546, 2005.

[27] R. Salakhutdinov and G. Hinton. Learning a nonlinear embedding by preserving class neighbourhood structure. In Artificial Intelligence and Statistics, pages 412–419, 2007.

[28] G. Koch, R. Zemel, and R. Salakhutdinov. Siamese neural networks for one-shot image recognition. In International Conference on Machine Learning – Deep Learning Workshop, 2015.

[29] T. Mensink, J. Verbeek, F. Perronnin, and G. Csurka. Metric learning for large scale image classification: Generalizing to new classes at near-zero cost. In European Conference on Computer Vision, pages 488–501. Springer, 2012.

[30] J. Snell, K. Swersky, and R. Zemel. Prototypical networks for few-shot learning. In Advances in Neural Information Processing Systems, pages 4077–4087, 2017.

[31] O. Li, H. Liu, C. Chen, and C. Rudin. Deep learning for case-based reasoning through prototypes: A neural network that explains its predictions. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[32] N. Papernot and P. McDaniel. Deep k-nearest neighbors: Towards confident, interpretable and robust deep learning. arXiv preprint arXiv:1803.04765, 2018.

[33] H.-M. Yang, X.-Y. Zhang, F. Yin, and C.-L. Liu. Robust classification with convolutional prototype learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 3474–3482, 2018.

[34] T. Plötz and S. Roth. Neural nearest neighbors networks. In Advances in Neural Information Processing Systems, pages 1093–1104, 2018.

[35] S. O. Arik and T. Pfister. Attention-based prototypical learning towards interpretable, confident and robust deep neural networks. arXiv preprint arXiv:1902.06292, 2019.

[36] O. Vinyals, C. Blundell, T. Lillicrap, K. Kavukcuoglu, and D. Wierstra.
Matching networks for one shot learning. In Advances in Neural Information Processing Systems, pages 3630–3638, 2016.

[37] A. Santoro, S. Bartunov, M. Botvinick, D. Wierstra, and T. Lillicrap. Meta-learning with memory-augmented neural networks. In International Conference on Machine Learning, pages 1842–1850, 2016.

[38] S. Gidaris and N. Komodakis. Dynamic few-shot visual learning without forgetting. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4367–4375, 2018.

[39] L. Bertinetto, J. Valmadre, J. F. Henriques, A. Vedaldi, and P. H. S. Torr. Fully-convolutional siamese networks for object tracking. In European Conference on Computer Vision, pages 850–865. Springer, 2016.

[40] Y. LeCun, C. Cortes, and C. J. C. Burges. The MNIST database of handwritten digits. 1998. http://yann.lecun.com/exdb/mnist/.

[41] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. ImageNet: A large-scale hierarchical image database. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 248–255, 2009.

[42] D. P. Kingma and J. L. Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

[43] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In International Conference on Machine Learning, pages 448–456, 2015.

[44] P. Ramachandran, B. Zoph, and Q. V. Le. Searching for activation functions. arXiv preprint arXiv:1710.05941, 2017.

[45] T. Villmann, A. Bohnsack, and M. Kaden. Can Learning Vector Quantization be an alternative to SVM and deep learning? – Recent trends and advanced variants of Learning Vector Quantization for classification learning. Journal of Artificial Intelligence and Soft Computing Research, 7(1):65–81, 2017.

[46] W. Brendel, J. Rauber, and M. Bethge.
Decision-based adversarial attacks: Reliable attacks against black-box machine learning models. In International Conference on Learning Representations, 2018.

[47] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 770–778, 2016.

[48] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, pages 1097–1105, 2012.

[49] A. Ilyas, S. Santurkar, D. Tsipras, L. Engstrom, B. Tran, and A. Madry. Adversarial examples are not bugs, they are features. arXiv preprint arXiv:1905.02175, 2019.