{"title": "Beyond Novelty Detection: Incongruent Events, when General and Specific Classifiers Disagree", "book": "Advances in Neural Information Processing Systems", "page_first": 1745, "page_last": 1752, "abstract": "Unexpected stimuli are a challenge to any machine learning algorithm. Here we identify distinct types of unexpected events, focusing on 'incongruent events' - when 'general level' and 'specific level' classifiers give conflicting predictions. We define a formal framework for the representation and processing of incongruent events: starting from the notion of label hierarchy, we show how partial order on labels can be deduced from such hierarchies. For each event, we compute its probability in different ways, based on adjacent levels (according to the partial order) in the label hierarchy . An incongruent event is an event where the probability computed based on some more specific level (in accordance with the partial order) is much smaller than the probability computed based on some more general level, leading to conflicting predictions. We derive algorithms to detect incongruent events from different types of hierarchies, corresponding to class membership or part membership. Respectively, we show promising results with real data on two specific problems: Out Of Vocabulary words in speech recognition, and the identification of a new sub-class (e.g., the face of a new individual) in audio-visual facial object recognition.", "full_text": "Beyond Novelty Detection: Incongruent Events, when\n\nGeneral and Speci\ufb01c Classi\ufb01ers Disagree\n\nAbstract\n\nUnexpected stimuli are a challenge to any machine learning algorithm. 
Here we identify distinct types of unexpected events, focusing on 'incongruent events' - when 'general level' and 'specific level' classifiers give conflicting predictions. We define a formal framework for the representation and processing of incongruent events: starting from the notion of label hierarchy, we show how a partial order on labels can be deduced from such hierarchies. For each event, we compute its probability in different ways, based on adjacent levels (according to the partial order) in the label hierarchy. An incongruent event is an event where the probability computed based on some more specific level (in accordance with the partial order) is much smaller than the probability computed based on some more general level, leading to conflicting predictions. We derive algorithms to detect incongruent events from different types of hierarchies, corresponding to class membership or part membership; we show promising results with real data on two corresponding problems: Out Of Vocabulary words in speech recognition, and the identification of a new sub-class (e.g., the face of a new individual) in audio-visual facial object recognition.\n\n1 Introduction\n\nMachine learning builds models of the world using training data from the application domain and prior knowledge about the problem. The models are later applied to future data in order to estimate the current state of the world. An implied assumption is that the future is stochastically similar to the past. The approach fails when the system encounters situations that are not anticipated from past experience. In contrast, successful natural organisms identify new, unanticipated stimuli and situations and frequently generate appropriate responses.\n\nBy definition, an unexpected event is one whose probability of confronting the system is low, based on the data that has been observed previously. 
In line with this observation, much of the computational work on novelty detection has focused on the probabilistic modeling of known classes, identifying outliers of these distributions as novel events (see e.g. [1, 2] for recent reviews). More recently, one-class classifiers have been proposed and used for novelty detection without directly modeling the data distribution [3, 4]. There are many studies of novelty detection in biological systems [5], often focusing on regions of the hippocampus [6].\n\nTo advance beyond the detection of outliers, we observe that there are many different reasons why some stimuli could appear novel. Our work, presented in Section 2, focuses on unexpected events which are indicated by the incongruence between the prediction induced by prior experience (training) and the evidence provided by the sensory data. To identify an item as incongruent, we use two parallel classifiers. One of them is strongly constrained by specific knowledge (both prior and data-derived); the other classifier is more general and less constrained. Both classifiers are assumed to yield class-posterior probabilities in response to a particular input signal. A sufficiently large discrepancy between the posterior probabilities induced by the input data in the two classifiers is taken as an indication that an item is incongruent.\n\nThus, in comparison with most existing work on novelty detection, one new and important characteristic of our approach is that we look for a level of description where the novel event is highly probable. Rather than simply respond to an event which is rejected by all classifiers, which more often than not requires no special attention (as in pure noise), we construct and exploit a hierarchy of representations. 
We attend to those events which are recognized (or accepted) at some more abstract levels of description in the hierarchy, while being rejected by the more concrete classifiers.\n\nThere are various ways to incorporate prior hierarchical knowledge and constraints within different classifier levels, as discussed in Section 3. One approach, used to detect images of unexpected incongruous objects, is to train the more general, less constrained classifier using a larger, more diverse set of stimuli, e.g., the facial images of many individuals. The second classifier is trained using a smaller, more specific set of objects (e.g., the set of Einstein's facial images). An incongruous item (e.g., a new individual) could then be identified by a smaller posterior probability estimated by the more specific classifier relative to the probability from the more general classifier.\n\nA different approach is used to identify unexpected (out-of-vocabulary) lexical items. The more general classifier is trained to sequentially classify speech sounds (phonemes) from relatively short segments of the input speech signal (thus yielding an unconstrained sequence of phoneme labels); the more constrained classifier is trained to classify a particular set of words (highly constrained sequences of phoneme labels) from the information available in the whole speech sentence. A word that does not belong to the expected vocabulary of the more constrained recognizer can then be identified by a discrepancy between the posterior probabilities of phonemes derived from the two classifiers.\n\nOur second contribution in Section 2 is the presentation of a unifying theoretical framework for these two approaches. Specifically, we consider two kinds of hierarchies: Part membership, as in biological taxonomy or speech, and Class membership, as in human categorization (or levels of categorization). 
We de\ufb01ne a notion of partial order on such hierarchies, and identify those events\nwhose probability as computed using different levels of the hierarchy does not agree. In particular,\nwe are interested in those events that receive high probability at more general levels (for example,\nthe system is certain that the new example is a dog), but low probability at more speci\ufb01c levels (in the\nsame example, the system is certain that the new example is not any known dog breed). Such events\ncorrespond to many interesting situations that are worthy of special attention, including incongruous\nscenes and new sub-classes, as shown in Section 3.\n\n2 Incongruent Events - uni\ufb01ed approach\n\n2.1 Introducing label hierarchy\n\nThe set of labels represents the knowledge base about stimuli, which is either given (by a teacher in\nsupervised learning settings) or learned (in unsupervised or semi-supervised settings). In cognitive\nsystems such knowledge is hardly ever a set; often, in fact, labels are given (or can be thought of) as\na hierarchy. In general, hierarchies can be represented as directed graphs. The nodes of the graphs\nmay be divided into distinct subsets that correspond to different entities (e.g., all objects that are\nanimals); we call these subsets \u201clevels\u201d. We identify two types of hierarchies:\nPart membership, as in biological taxonomy or speech. For example, eyes, ears, and nose combine\nto form a head; head, legs and tail combine to form a dog.\nClass membership, as in human categorization \u2013 where objects can be classi\ufb01ed at different levels\nof generality, from sub-ordinate categories (most speci\ufb01c level), to basic level (intermediate level),\nto super-ordinate categories (most general level). 
For example, a Beagle (sub-ordinate category) is\nalso a dog (basic level category), and it is also an animal (super-ordinate category).\n\nThe two hierarchies de\ufb01ned above induce constraints on the observed features in different ways. In\nthe class-membership hierarchy, a parent class admits higher number of combinations of features\nthan any of its children, i.e., the parent category is less constrained than its children classes. In\ncontrast, a parent node in the part-membership hierarchy imposes stricter constraints on the observed\nfeatures than a child node. This distinction is illustrated by the simple \u201dtoy\u201d example shown in Fig. 1.\nRoughly speaking, in the class-membership hierarchy (right panel), the parent node is the disjunction\nof the child categories. In the part-membership hierarchy (left panel), the parent category represents\na conjunction of the children categories. This difference in the effect of constraints between the two\nrepresentations is, of course, re\ufb02ected in the dependency of the posterior probability on the class,\nconditioned on the observations.\n\n2\n\n\f(cid:1)(cid:2)(cid:3)\n(cid:1)(cid:2)(cid:3)\n\n(cid:1)(cid:2)(cid:3)\n(cid:1)(cid:2)(cid:3)\n\n(cid:7)(cid:5)(cid:8)(cid:9)\n(cid:7)(cid:5)(cid:8)(cid:9)\n\n(cid:4)(cid:5)(cid:3)(cid:6)\n(cid:4)(cid:5)(cid:3)(cid:6)\n\n(cid:10)(cid:8)(cid:11)(cid:12)\n(cid:10)(cid:8)(cid:11)(cid:12)\n\n(cid:13)(cid:14)(cid:3)(cid:8)(cid:15)\n(cid:13)(cid:14)(cid:3)(cid:8)(cid:15)\n\n(cid:16)(cid:5)(cid:8)(cid:3)(cid:12)(cid:5)\n(cid:16)(cid:5)(cid:8)(cid:3)(cid:12)(cid:5)\n\n(cid:17)(cid:2)(cid:12)(cid:12)(cid:11)(cid:5)\n(cid:17)(cid:2)(cid:12)(cid:12)(cid:11)(cid:5)\n\nFigure 1: Examples. Left: part-membership hierarchy, the concept of a dog requires a conjunction of parts -\na head, legs and tail. 
Right: class-membership hierarchy, the concept of a dog is defined as the disjunction of more specific concepts - Afghan, Beagle and Collie.\n\nIn order to treat different hierarchical representations uniformly, we invoke the notion of partial order. Intuitively speaking, different levels in each hierarchy are related by a partial order: the more specific concept, which corresponds to a smaller set of events or objects in the world, is always smaller than the more general concept, which corresponds to a larger set of events or objects.\n\nTo illustrate this point, consider Fig. 1 again. For the part-membership hierarchy example (left panel), the concept of 'dog' requires a conjunction of parts, as in $DOG = LEGS \cap HEAD \cap TAIL$, and therefore, for example, $DOG \subset LEGS \Rightarrow DOG \preceq LEGS$. Thus\n\n$DOG \preceq LEGS$, $DOG \preceq HEAD$, $DOG \preceq TAIL$\n\nIn contrast, for the class-membership hierarchy (right panel), the class of dogs is the disjunction (union) of the individual sub-classes, as in $DOG = AFGHAN \cup BEAGLE \cup COLLIE$, and therefore, for example, $DOG \supset AFGHAN \Rightarrow DOG \succeq AFGHAN$. Thus\n\n$DOG \succeq AFGHAN$, $DOG \succeq BEAGLE$, $DOG \succeq COLLIE$\n\n2.2 Definition of Incongruent Events\n\nNotations\n\nWe assume that the data is represented as a graph $\{G, E\}$ of partial orders (GPO). Each node in $G$ is a random variable which corresponds to a class or concept (or event). 
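As an illustration only (not part of the original paper), the partial order induced by the two toy hierarchies of Fig. 1 can be sketched in code; the edge lists and node names below are hypothetical stand-ins for a GPO:

```python
# A minimal sketch (not from the paper) of the partial order induced by
# the toy hierarchies of Fig. 1. Each edge (a, b) encodes a <= b, i.e.
# "a is more specific than b".

# Part membership: DOG is the conjunction of its parts => DOG <= each part.
# Class membership: DOG is the disjunction of breeds => each breed <= DOG.
edges = [("dog", "legs"), ("dog", "head"), ("dog", "tail"),
         ("afghan", "dog"), ("beagle", "dog"), ("collie", "dog")]

nodes = {n for e in edges for n in e}

def preceq(a, b, edges):
    """True iff a <= b holds in the transitive closure of the edge list."""
    if a == b:
        return True
    return any(x == a and preceq(y, b, edges) for x, y in edges)

def more_specific(a, edges):
    """The set A^s of concepts strictly more specific than a."""
    return {b for b in nodes if b != a and preceq(b, a, edges)}

def more_general(a, edges):
    """The set A^g of concepts strictly more general than a."""
    return {b for b in nodes if b != a and preceq(a, b, edges)}

print(sorted(more_specific("dog", edges)))  # ['afghan', 'beagle', 'collie']
print(sorted(more_general("dog", edges)))   # ['head', 'legs', 'tail']
```

Note that transitivity comes for free from the recursive closure, e.g. AFGHAN <= DOG <= LEGS implies AFGHAN <= LEGS.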
Each directed link in $E$ corresponds to a partial order relationship as defined above, where there is a link from node $a$ to node $b$ iff $a \preceq b$.\n\nFor each node (concept) $a$, define $A^s = \{b \in G : b \preceq a\}$ - the set of all nodes (concepts) $b$ more specific (smaller) than $a$ in accordance with the given partial order; similarly, define $A^g = \{b \in G : a \preceq b\}$ - the set of all nodes (concepts) $b$ more general (larger) than $a$ in accordance with the given partial order.\n\nFor each concept $a$ and training data $T$, we train up to 3 probabilistic models, which are derived from $T$ in different ways, in order to determine whether the concept $a$ is present in a new data point $X$:\n\n- $Q_a(X)$: a probabilistic model of class $a$, derived from training data $T$ without using the partial order relations in the GPO.\n\n- If $|A^s| > 1$: $Q^s_a(X)$, a probabilistic model of class $a$ which is based on the probability of the concepts in $A^s$, assuming their independence of each other. Typically, the model incorporates some relatively simple conjunctive and/or disjunctive relations among the concepts in $A^s$.\n\n- If $|A^g| > 1$: $Q^g_a(X)$, a probabilistic model of class $a$ which is based on the probability of the concepts in $A^g$, assuming their independence of each other. Here too, the model typically incorporates some relatively simple conjunctive and/or disjunctive relations among the concepts in $A^g$.\n\nExamples\n\nTo illustrate, we use the simple examples shown in Fig. 1, where our concept of interest $a$ is the concept 'dog':\n\nIn the part-membership hierarchy (left panel), $|A^g| = 3$ (head, legs, tail). We can therefore learn 2 models for the class 'dog' ($Q^s_{dog}$ is not defined):\n\n1. $Q_{dog}$ - obtained using training pictures of 'dogs' and 'not dogs' without body part labels.\n2. 
$Q^g_{dog}$ - obtained using the outcome of models for head, legs and tail, which were trained on the same training set $T$ with body part labels. For example, if we assume that concept $a$ is the conjunction of its part member concepts as defined above, and assuming that these part concepts are independent of each other, we get\n\n$Q^g_{dog} = \prod_{b \in A^g} Q_b = Q_{Head} \cdot Q_{Legs} \cdot Q_{Tail}$   (1)\n\nIn the class-membership hierarchy (right panel), $|A^s| = 3$ (Afghan, Beagle, Collie). If we further assume that a class-membership hierarchy is always a tree, then $|A^g| = 1$. We can therefore learn 2 models for the class 'dog' ($Q^g_{dog}$ is not defined):\n\n1. $Q_{dog}$ - obtained using training pictures of 'dogs' and 'not dogs' without breed labels.\n2. $Q^s_{dog}$ - obtained using the outcome of models for Afghan, Beagle and Collie, which were trained on the same training set $T$ with only specific dog type labels. For example, if we assume that concept $a$ is the disjunction of its sub-class concepts as defined above, and assuming that these sub-class concepts are independent of each other, we get\n\n$Q^s_{dog} = \sum_{b \in A^s} Q_b = Q_{Afghan} + Q_{Beagle} + Q_{Collie}$\n\nIncongruent events\n\nIn general, we expect the different models to provide roughly the same probability for the presence of concept $a$ in data $X$. A mismatch between the predictions of the different models should raise a red flag, possibly indicating that something new and interesting has been observed. In particular, we
In particular, we\nare interested in the following discrepancy:\n\nDe\ufb01nition: Observation X is incongruent if there exists a concept 0a0 such that\n\nQg\n\na(X) (cid:29) Qa(X) or Qa(X) (cid:29) Qs\n\na(X):\n\n(2)\n\nAlternatively, observation X is incongruent if a discrepancy exists between the inference of the two\nclassi\ufb01ers: either the classi\ufb01er based on the more general descriptions from level g accepts the X\nwhile the direct classier rejects it, or the direct classi\ufb01er accepts X while the classi\ufb01er based on the\nmore speci\ufb01c descriptions from level s rejects it. In either case, the concept receives high probability\nat the more general level (according to the GP O), but much lower probability when relying only on\nthe more speci\ufb01c level.\n\nLet us discuss again the examples we have seen before, to illustrate why this de\ufb01nition indeed\ncaptures interesting \u201csurprises\u201d:\n\n(cid:15) In the part-membership hierarchy (left panel of Fig. 1), we have\ndog = QHead (cid:1) QLegs (cid:1) QTail (cid:29) Qdog\n\nQg\n\nIn other words, while the probability of each part is high (since the multiplication of those\nprobabilities is high), the \u2019dog\u2019 classi\ufb01er is rather uncertain about the existence of a dog in\nthis data.\nHow can this happen? Maybe the parts are con\ufb01gured in an unusual arrangement for a dog\n(as in a 3-legged cat), or maybe we encounter a donkey with a cat\u2019s tail (as in Shrek 3).\nThose are two examples of the kind of unexpected events we are interested in.\n\n4\n\n\f(cid:15) In the class-membership hierarchy (right panel of Fig. 1), we have\n\nQs\n\ndog = QAf ghan + QBeagle + QCollie (cid:28) Qdog\n\nIn other words, while the probability of each sub-class is low (since the sum of these prob-\nabilities is low), the \u2019dog\u2019 classi\ufb01er is certain about the existence of a dog in this data.\nHow may such a discrepancy arise? 
Maybe we are seeing a new type of dog that we haven't seen before - a Pointer. The dog model, if it correctly captures the notion of 'dogness', should be able to identify this new object, while the models of previously seen dog breeds (Afghan, Beagle and Collie) correctly fail to recognize the new object.\n\n3 Incongruent events: algorithms\n\nOur definition of incongruent events in the previous section is indeed unified, but as a result quite abstract. In this section we discuss two different algorithmic implementations, one generative and one discriminative, which were developed for the part-membership and class-membership hierarchies respectively (see definition in Section 1). In both cases, we use the notation $Q(X)$ for the class probability as defined above, and $p(X)$ for the estimated probability.\n\n3.1 Part membership - a generative algorithm\n\nConsider the left panel of Fig. 1. The event in the top node is incongruent if its probability is low, while the probability of all its descendants is high.\n\nIn many applications, such as speech recognition, one computes the probability of events (sentences) based on a generative model (corresponding to a specific language) which includes a dictionary of parts (words). At the top level, the event probability is computed conditional on the model; typically the parts are assumed to be independent, and the event probability is computed as the product of the part probabilities conditioned on the model. For example, in speech processing and assuming a specific language (e.g., English), the probability of a sentence is typically computed by multiplying the probability of each word, using an HMM model trained on sentences from the specific language. At the bottom level, the probability of each part is computed independently of the generative model.\n\nMore formally, consider an event $u$ composed of parts $w_k$. 
Using the generative model of events and assuming the conditional independence of the parts given this model, the prior probability of the event is given by the product of the prior probabilities of the parts,\n\n$p(u|L) = \prod_k p(w_k|L)$   (3)\n\nwhere $L$ denotes the generative model (e.g., the language).\n\nFor measurement $X$, we compute $Q(X)$ as follows:\n\n$Q(X) = p(X|L) = \sum_u p(X|u, L) p(u|L) \approx p(X|\bar{u}, L) p(\bar{u}|L) = p(X|\bar{u}) \prod_k p(w_k|L)$   (4)\n\nusing $p(X|u, L) = p(X|u)$ and (3), and where $\bar{u} = \arg\max_u p(u|L)$ is the most likely interpretation. At the risk of notation abuse, $\{w_k\}$ now denote the parts which compose the most likely event $\bar{u}$. We assume that the first sum is dominated by the maximal term.\n\nGiven a part-membership hierarchy, we can use (1) to compute the probability $Q^g(X)$ directly, without using the generative model $L$:\n\n$Q^g(X) = p(X) = \sum_u p(X|u) p(u) \geq p(X|\bar{u}) p(\bar{u}) = p(X|\bar{u}) \prod_k p(w_k)$   (5)\n\nIt follows from (4) and (5) that\n\n$Q(X) / Q^g(X) \leq \prod_k p(w_k|L) / p(w_k)$   (6)\n\nWe can now conclude that $X$ is an incongruent event according to our definition if there exists at least one part $k$ in the final event $\bar{u}$ such that $p(w_k) \gg p(w_k|L)$ (assuming all other parts have roughly the same conditional and unconditional probabilities). In speech processing, a sentence is incongruent if it includes an incongruent word - a word whose probability based on the generative language model is low, but whose direct probability (not constrained by the language model) is high.\n\nExample: Out Of Vocabulary (OOV) words\n\nFor the detection of OOV words, we performed experiments using a Large Vocabulary Continuous Speech Recognition (LVCSR) system on the Wall Street Journal Corpus (WSJ). The evaluation set consists of 2.5 hours. 
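The word-level test implied by (6) can be sketched as follows; this is an illustrative reconstruction, not the system used in the experiments, and all probabilities below are made-up numbers:

```python
import math

# Illustrative sketch of the OOV test implied by (6): a word w_k in the
# best interpretation is flagged when its direct (acoustic) probability
# p(w_k) greatly exceeds its language-model probability p(w_k | L),
# i.e. when log p(w_k) - log p(w_k | L) is large.

def incongruent_parts(p_direct, p_lm, log_ratio_threshold=3.0):
    """Return the words whose log-probability ratio exceeds the threshold."""
    flagged = []
    for w in p_direct:
        if math.log(p_direct[w]) - math.log(p_lm[w]) > log_ratio_threshold:
            flagged.append(w)
    return flagged

# Hypothetical per-word probabilities for one decoded sentence; 'zvonek'
# stands in for an out-of-vocabulary (here, Czech) word.
p_direct = {"the": 0.9, "dog": 0.8, "barked": 0.7, "zvonek": 0.6}
p_lm     = {"the": 0.8, "dog": 0.5, "barked": 0.4, "zvonek": 1e-6}

print(incongruent_parts(p_direct, p_lm))  # ['zvonek']
```

The threshold on the log ratio plays the role of the qualitative condition $p(w_k) \gg p(w_k|L)$; in practice it would be tuned on held-out data.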
To introduce OOV words, the vocabulary was restricted to the 4968 most\nfrequent words from the language training texts, leaving the remaining words unknown to the model.\nA more detailed description is given in [7].\n\nIn this task, we have shown that the comparison between two parallel classi\ufb01ers, based on strong\nand weak posterior streams, is effective for the detection of OOV words, and also for the detection\nof recognition errors. Speci\ufb01cally, we use the derivation above to detect out of vocabulary words,\nby comparing their probability when computed based on the language model, and when computed\nbased on mere acoustic modeling. The best performance was obtained by the system when a Neural\nNetwork (NN) classi\ufb01er was used for the direct estimation of frame-based OOV scores. The network\nwas directly fed by posteriors from the strong and the weak systems. For the WSJ task, we achieved\nperformance of around 11% Equal-Error-Rate (EER) (Miss/False Alarm probability), see Fig. 2.\n\nFigure 2: Several techniques used to detect OOV: (i) Cmax: Con\ufb01dence measure computed ONLY from\nstrongly constrained Large Vocabulary Continuous Speech Recognizer (LVCSR), with frame-based posteriors.\n(ii) LVCSR+weak features: Strongly and weakly constrained recognizers, compared via the KL-divergence\nmetric. (iii) LVCSR+NN posteriors: Combination of strong and weak phoneme posteriors using NN classi\ufb01er.\n(iv) all features: fusion of (ii) and (iii) together.\n\n3.2 Class membership - a discriminative algorithm\n\nConsider the right panel of Fig. 1. The general class in the top node is incongruent if its probability\nis high, while the probability of all its sub-classes is low.\nIn other words, the classi\ufb01er of the\nparent object accepts the new observation, but all the children object classi\ufb01ers reject it. 
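As a sketch only (assumed thresholded posteriors, not the paper's implementation), the class-membership rule just described - the parent classifier accepts while every sub-class classifier rejects - might look like:

```python
# Sketch (not the authors' code) of the class-membership incongruence rule:
# the parent classifier accepts X while every sub-class classifier rejects it.

def is_incongruent(parent_prob, subclass_probs, accept=0.5):
    """parent_prob: posterior of the parent class (e.g. 'dog');
    subclass_probs: posteriors of the known sub-classes (e.g. breeds)."""
    parent_accepts = parent_prob >= accept
    all_children_reject = all(p < accept for p in subclass_probs)
    return parent_accepts and all_children_reject

# A new breed: the 'dog' model is confident, every known-breed model is not.
print(is_incongruent(0.9, [0.1, 0.2, 0.05]))  # True
# A known breed: one sub-class model also accepts.
print(is_incongruent(0.9, [0.8, 0.1, 0.05]))  # False
```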
Brute force computation of this definition may follow the path taken by traditional approaches to novelty detection, e.g., looking for rejection by all one-class classifiers corresponding to the sub-class objects.\n\nThe results we obtained with this method were mediocre, probably because generative models are not well suited for the task. Instead, it seems that discriminative classifiers, trained to discriminate between objects at the sub-class level, could be more successful. We note that unlike traditional approaches to novelty detection, which must use generative models or one-class classifiers in the absence of appropriate discriminative data, our dependence on object hierarchy provides discriminative data as a by-product. In other words, after the recognition by a parent-node classifier, we may use classifiers trained to discriminate between its children to implement a discriminative novelty detection algorithm.\n\nSpecifically, we used the approach described in [8] to build a unified representation for all objects in the sub-class level, which is the representation computed for the parent object whose classifier had accepted (positively recognized) the object. In this feature space, we build a classifier for each sub-class based on the majority vote between pairwise discriminative classifiers. Based on these classifiers, each example (accepted by the parent classifier) is assigned to one of the sub-classes, and the average margin over the classifiers which agree with the final assignment is calculated. The final classifier then uses a threshold on this average margin to identify each object as a known sub-class or a new sub-class. Previous research in the area of face identification can be viewed as an implicit use of this proposed framework, see e.g. 
[9].\n\nExample: new face recognition from audio-visual data\n\nWe tested our algorithm on audio-visual speaker veri\ufb01cation. In this setup, the general parent cate-\ngory level is the \u2018speech\u2019 (audio) and \u2018face\u2019 (visual), and the different individuals are the offspring\n(sub-class) levels. The task is to identify an individual as belonging to the trusted group of individ-\nuals vs. being unknown, i.e. known sub-class vs. new sub-class in a class membership hierarchy.\n\nThe uni\ufb01ed representation of the visual cues was built using the approach described in [8]. All\nobjects in the sub-class level (different individuals) were represented using the representation learnt\nfor the parent level (\u2019face\u2019). For the audio cues we used the Perceptual linear predictive (PLP)\nCepstral features [10] as the uni\ufb01ed representation. We used SVM classi\ufb01ers with RBF kernel as the\npairwise discriminative classi\ufb01ers for each of the different audio/visual representations separately.\n\nData was collected for our experiments using a wearable device, which included stereo panoramic\nvision sensors and microphone arrays. In the recorded scenario, individuals walked towards the\ndevice and then read aloud an identical text; we acquired 30 sequences with 17 speakers (see Fig. 3\nfor an example). We tested our method by choosing six speakers as members of the trusted group,\nwhile the rest were assumed unknown.\n\nThe method was applied separately using each one of the different modalities, and also in an in-\ntegrated manner using both modalities. 
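The average-margin decision described above can be sketched as follows; this is an illustrative reconstruction with an assumed pairwise-margin data layout and made-up numbers, not the actual audio-visual pipeline:

```python
# Illustrative sketch of the average-margin rule of Section 3.2 (assumed
# data layout, not the authors' code). pairwise_margins[(i, j)] is the
# signed margin of the classifier separating sub-class i from sub-class j,
# positive when it votes for i.
from collections import defaultdict

def known_or_new(pairwise_margins, classes, threshold=0.2):
    """Assign X to a sub-class by majority vote over pairwise classifiers,
    then threshold the average margin of the agreeing classifiers:
    a small average margin indicates a new sub-class."""
    votes = defaultdict(int)
    for (i, j), m in pairwise_margins.items():
        votes[i if m > 0 else j] += 1
    winner = max(classes, key=lambda c: votes[c])
    agreeing = [abs(m) for (i, j), m in pairwise_margins.items()
                if (i if m > 0 else j) == winner]
    avg_margin = sum(agreeing) / len(agreeing)
    return ("known", winner) if avg_margin >= threshold else ("new", None)

# Confident margins: the example is assigned to known sub-class 'A'.
margins = {("A", "B"): 0.9, ("A", "C"): 0.8, ("B", "C"): -0.1}
print(known_or_new(margins, ["A", "B", "C"]))  # ('known', 'A')
```

With uniformly small margins the same rule returns `('new', None)`, i.e. a new sub-class.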
For this fusion, the audio and visual signals were synchronized, and the winning classification margins of both signals were normalized to the same scale and averaged to obtain a single margin for the combined method.\n\nSince the goal is to identify novel incongruent events, true positive and false positive rates were calculated by considering all frames from the unknown test sequences as positive events and the known individual test sequences as negative events. We compared our method to novelty detection based on a one-class SVM [3] extended to our multi-class case. The decision was obtained by comparing the maximal margin over all one-class classifiers to a varying threshold. As can be seen in Fig. 3, our method performs substantially better in both modalities as compared to the "standard" one-class approach to novelty detection. Performance is further improved by fusing both modalities.\n\n4 Summary\n\nUnexpected events are typically identified by their low posterior probability. In this paper we employed label hierarchy to obtain several probability values for each event, which allowed us to tease apart different types of unexpected events. In general there are 4 possibilities, based on the classifiers' response at two adjacent levels:\n\n1. Specific level: reject, General level: reject - noisy measurements, or a totally new concept\n2. Specific level: reject, General level: accept - incongruent concept\n3. Specific level: accept, General level: reject - inconsistent with partial order, models are wrong\n4. Specific level: accept, General level: accept - known concept\n\n[Figure 3 plots True positive rate vs. False positive rate for the audio, visual and audio-visual variants of the proposed method, and for the audio and visual One-Class SVM baselines.]\n\nFigure 3: Left: Example: one frame used for the visual verification task. 
Right: True Positive vs. False Positive rates when detecting unknown vs. trusted individuals. The unknown are regarded as positive events. Results are shown for the proposed method using both modalities separately and for the combined method (solid lines). For comparison, we show results with a more traditional novelty detection method using a One-Class SVM (dashed lines).\n\nWe focused above on the second type of events - incongruent concepts, which have not been studied previously in isolation. Such events are characterized by some discrepancy between the responses of two classifiers, which can occur for a number of different reasons. Context: in a given context such as the English language, a sentence containing a Czech word is assigned low probability; in the visual domain, in a given context such as a street scene, otherwise high probability events such as "car" and "elephant" are not likely to appear together. New sub-class: a new object has been encountered, of some known generic type but unknown specifics.\n\nWe described how our approach can be used to design new algorithms to address these problems, showing promising results on real speech and audio-visual facial datasets.\n\nReferences\n\n[1] Markou, M., Singh, S.: Novelty detection: a review - part 1: statistical approaches. Signal Processing 83 (2003) 2499\u20132521\n\n[2] Markou, M., Singh, S.: Novelty detection: a review - part 2: neural network based approaches. Signal Processing 83 (2003) 2481\u20132497\n\n[3] Scholkopf, B., Williamson, R.C., Smola, A.J., Shawe-Taylor, J., Platt, J.: Support vector method for novelty detection. In: Proc. NIPS. Volume 12. (2000) 582\u2013588\n\n[4] Lanckriet, G.R.G., Ghaoui, L.E., Jordan, M.I.: Robust novelty detection with single-class MPM. In: Proc. NIPS. Volume 15. 
(2003) 929\u2013936\n\n[5] Berns, G.S., Cohen, J.D., Mintun, M.A.: Brain regions responsive to novelty in the absence of awareness. Science 276 (1997) 1272\u20131275\n\n[6] Rokers, B., Mercado, E., Allen, M.T., Myers, C.E., Gluck, M.A.: A connectionist model of septohippocampal dynamics during conditioning: closing the loop. Behavioral Neuroscience 116 (2002) 48\u201362\n\n[7] Burget, L., Schwarz, P., Matejka, P., Hannemann, M., Rastrow, A., White, C., Khudanpur, S., Hermansky, H., Cernocky, J.: Combination of strongly and weakly constrained recognizers for reliable detection of OOVs. In: Proceedings of the IEEE Int. Conf. on Acoustics, Speech, and Signal Processing (ICASSP). (2008)\n\n[8] Bar-Hillel, A., Weinshall, D.: Subordinate class recognition using relational object models. Proc. NIPS 19 (2006)\n\n[9] Lanitis, A., Taylor, C.J., Cootes, T.F.: A unified approach to coding and interpreting face images. In: Proc. ICCV. (1995) 368\u2013373\n\n[10] Hermansky, H.: Perceptual linear predictive (PLP) analysis of speech. The Journal of the Acoustical Society of America 87 (1990) 1738\n", "award": [], "sourceid": 250, "authors": [{"given_name": "Daphna", "family_name": "Weinshall", "institution": null}, {"given_name": "Hynek", "family_name": "Hermansky", "institution": null}, {"given_name": "Alon", "family_name": "Zweig", "institution": null}, {"given_name": "Jie", "family_name": "Luo", "institution": null}, {"given_name": "Holly", "family_name": "Jimison", "institution": null}, {"given_name": "Frank", "family_name": "Ohl", "institution": null}, {"given_name": "Misha", "family_name": "Pavel", "institution": null}]}