{"title": "Probabilistic Semantic Video Indexing", "book": "Advances in Neural Information Processing Systems", "page_first": 967, "page_last": 973, "abstract": null, "full_text": "Probabilistic Semantic Video Indexing \n\nMilind R. Naphade, Igor Kozintsev and Thomas Huang \n\nDepartment of Electrical and Computer Engineering \n\nUniversity of Illinois at Urbana-Champaign \n\n{milind, igor,huang}@ifp.uiuc.edu \n\nAbstract \n\nWe propose a novel probabilistic framework for semantic video in(cid:173)\ndexing. We define probabilistic multimedia objects (multijects) \nto map low-level media features to high-level semantic labels. A \ngraphical network of such multijects (multinet) captures scene con(cid:173)\ntext by discovering intra-frame as well as inter-frame dependency \nrelations between the concepts. The main contribution is a novel \napplication of a factor graph framework to model this network. \nWe model relations between semantic concepts in terms of their \nco-occurrence as well as the temporal dependencies between these \nconcepts within video shots. Using the sum-product algorithm [1] \nfor approximate or exact inference in these factor graph multinets, \nwe attempt to correct errors made during isolated concept detec(cid:173)\ntion by forcing high-level constraints. This results in a significant \nimprovement in the overall detection performance. \n\n1 \n\nIntroduction \n\nResearch in video retrieval has traditionally focussed on the paradigm of query-by(cid:173)\nexample (QBE) using low-level features [2]. Query by keywords/key-phrases (QBK) \n(preferably semantic) instead of examples has motivated recent research in semantic \nvideo indexing. For this, we need models which capture the feature representation \ncorresponding to these keywords. A QBK system can support semantic retrieval for \na small set of keywords and also act as the first step in QBE systems to narrow down \nthe search. 
The difficulty lies in the gap between low-level media features and high-level semantics. Recent attempts to address this include detection of audio-visual events like explosion [3] and semantic visual templates [4].\n\nWe propose a statistical pattern recognition approach for training probabilistic multimedia objects (multijects) which map high-level concepts to low-level audio-visual features. We also propose a probabilistic factor graph framework, which models the interaction between concepts within each video frame as well as across the video frames within each video shot. Factor graphs provide an elegant framework to represent the stochastic relationships between concepts, while the sum-product algorithm provides an efficient tool to perform learning and inference in factor graphs. Using exact as well as approximate inference (through loopy probability propagation), we show that there is a significant improvement in detection performance.\n\n2 Proposed Framework\n\nTo support retrieval based on high-level queries like 'Explosion on a beach', we need models for the event explosion and the site beach. User queries might similarly involve sky, helicopter, car-chase, etc. Detection of some of these concepts may be possible, while some others may not be directly observable. To support such queries, we proposed a probabilistic multimedia object (multiject) [3], shown in Figure 1 (a), which has a semantic label and which summarizes a time sequence of features from multiple media. A multiject can belong to any of three categories: objects (car, man, helicopter), sites (outdoor, beach), or events (explosion, man-walking).\n\nIntuitively it is clear that the presence of certain multijects suggests a high possibility of detecting certain other multijects. Similarly, some multijects are less likely to occur in the presence of others. 
The detection of sky and water boosts the chances of detecting a beach, and reduces the chances of detecting indoor. It might also be possible to detect some concepts and infer more complex concepts based on their relation to the detected ones. Detection of human speech in the audio stream and a face in the video stream may lead to the inference of human talking. To integrate all the multijects and model their interaction, we propose a network of multijects, which we term a multinet. A conceptual figure of a multinet is shown in Figure 1 (b), with positive (negative) signs indicating positive (negative) interaction.\n\nFigure 1: (a) A probabilistic multimedia object, which maps audio and video features to a probability such as P(concept = Outdoor | features, other multijects) = 0.7. (b) A conceptual multinet.\n\nIn Section 5 we present a factor graph multinet implementation.\n\n3 Video segmentation and Feature Extraction\n\nWe have digitized movies of different genres to create a large database of a few hours of video data. The video clips are segmented into shots using the algorithm in [5]. We then perform spatio-temporal segmentation [2] within each shot to obtain and track regions homogeneous in color and motion separated by strong edges. Large dominant regions are labeled manually. Each region is then processed to extract features characterizing the color (3-channel histogram [3]), texture (statistical properties of the gray-level co-occurrence matrices at 4 different orientations [6]), structure (edge direction histogram [7]), motion (affine motion parameters) and shape (moment invariants [8]). Details about the extracted features can be found in [9]. For sites we use color, texture and structural features (84 elements), and for objects and events we use all features (98 elements)^1. Audio features are extracted as in [10]. 
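As a rough, hypothetical sketch of two of these descriptors (the actual features follow [3], [6] and [7]; the bin counts and the "strong edge" threshold below are made up for illustration), the color histogram and edge-direction histogram of a region might be computed as:

```python
import numpy as np

def color_histogram(region_rgb, bins=8):
    """3-channel color histogram: one normalized histogram per channel."""
    hists = []
    for c in range(3):
        h, _ = np.histogram(region_rgb[..., c], bins=bins, range=(0, 256))
        hists.append(h / max(h.sum(), 1))
    return np.concatenate(hists)

def edge_direction_histogram(region_gray, bins=12):
    """Histogram of gradient directions at strong-edge pixels."""
    gy, gx = np.gradient(region_gray.astype(float))
    mag = np.hypot(gx, gy)
    ang = np.arctan2(gy, gx)            # directions in [-pi, pi]
    strong = mag > mag.mean()           # crude "strong edge" mask
    h, _ = np.histogram(ang[strong], bins=bins, range=(-np.pi, np.pi))
    return h / max(h.sum(), 1)

# A full site feature vector would concatenate color, texture and structure;
# only the color (24-dim) and structure (12-dim) parts are sketched here.
region = np.random.default_rng(0).integers(0, 256, (32, 32, 3))
x = np.concatenate([color_histogram(region),
                    edge_direction_histogram(region.mean(axis=2))])
```

The texture (co-occurrence matrix statistics), motion and shape components would be appended in the same way to reach the 84- or 98-element vectors used in the paper.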
For training our multiject and multinet models we use 1800 frames from different video shots, and for testing our framework we use 9400 frames. Since consecutive images within a shot are correlated, the video data is subsampled to create the training and testing sets without redundancy.\n\n4 Modeling semantic concepts using Multijects\n\nWe use an identical approach to model concepts in video and audio (independently and jointly). The following site multijects are used in our experiments: sky, water, forest, rocks and snow. Audio-only multijects (human-speech, music) can be found in [10] and audio-visual multijects (explosion) in [3]. Detection of multijects is performed on every segmented region^2 within each video frame. Let the feature vector for region j be Xj. We model the semantic concept as a binary random variable and define the two hypotheses H0 and H1 as\n\nH0 : Xj ~ P0(Xj),    H1 : Xj ~ P1(Xj)    (1)\n\nwhere P0(Xj) and P1(Xj) denote the class-conditional probability density functions conditioned on the null hypothesis (concept absent) and the true hypothesis (concept present). P0(Xj) and P1(Xj) are modeled using a mixture of Gaussian components for the site multijects^3. For objects and events (in video and audio), hidden Markov models replace the Gaussian mixture models, and the feature vectors for all the frames within a shot constitute the time series modeled. The detection performance for the five site multijects on the test set is given in Table 1.\n\nmultiject        Rocks   Sky    Snow   Water   Forest\nDetection (%)    77.0    81.8   81.5   79.4    85.1\nFalse Alarm (%)  24.1    11.9   12.9   15.6    14.9\n\nTable 1: Maximum likelihood binary classification performance for site multijects.\n\n4.1 Frame level semantic features\n\nSince multijects are used as semantic feature detectors at the regional level, it is easy to define multiject-based semantic features at the frame level by integrating the region-level classification. 
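A minimal sketch of this region-to-frame integration combines the hypothesis test of Equation (1), the Bayes posterior of Equation (2), and the OR fusion of Equation (3). The 2-D mixture densities below are toy stand-ins for the trained class-conditional models, not the paper's 84-dimensional ones:

```python
import numpy as np

def mixture_pdf(X, means, weights, var=1.0):
    """Density of an isotropic Gaussian mixture evaluated at the rows of X."""
    d = X.shape[1]
    dens = np.zeros(len(X))
    norm = (2 * np.pi * var) ** (d / 2)
    for w, m in zip(weights, means):
        sq = ((X - np.asarray(m)) ** 2).sum(axis=1)
        dens += w * np.exp(-sq / (2 * var)) / norm
    return dens

# Toy stand-ins for the trained class-conditional densities:
# P0 = concept absent, P1 = concept present.
p0 = lambda X: mixture_pdf(X, means=[(0, 0), (1, 0)], weights=[0.5, 0.5])
p1 = lambda X: mixture_pdf(X, means=[(4, 4), (5, 4)], weights=[0.5, 0.5])

def region_posteriors(X):
    """P(R_ij = 1 | X_j) for each region j, with uniform priors (Eq. 2)."""
    l0, l1 = p0(X), p1(X)
    return l1 / (l0 + l1)

# Frame-level OR fusion (Eq. 3): the concept is present in the frame
# unless it is absent in every region.
regions = np.array([[4.0, 4.0], [0.2, 0.1], [0.5, 0.3]])  # M = 3 regions
r = region_posteriors(regions)
p_frame = 1.0 - np.prod(1.0 - r)
```

Note that one region with a confident "present" decision is enough to drive the frame-level probability close to 1, which is exactly the behavior of the OR function.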
We check each region for each concept individually and obtain probabilities of each concept being present or absent in the region. Imperfect segmentation does not hurt us too much, since these soft decisions are modified in the multinet based on high-level constraints. Defining a binary random variable Rij (Rij = 1/0 if the concept is present/absent) and assuming uniform priors on the presence or absence of a concept in any region, we can use Bayes' rule to obtain:\n\nP(Rij = 1 | Xj) = P(Xj | Rij = 1) / (P(Xj | Rij = 1) + P(Xj | Rij = 0))    (2)\n\nDefining binary random variables Fi, i in {1, ..., N} (N is the number of concepts), to take on value 1 if concept i is present in the frame and value 0 otherwise, we use the OR function to combine the soft decisions for each concept from all regions to obtain Fi. Let X = {X1, ..., XM} (M is the number of regions in a frame); then\n\nP(Fi = 0 | X) = ∏_{j=1}^{M} P(Rij = 0 | Xj)  and  P(Fi = 1 | X) = 1 - P(Fi = 0 | X)    (3)\n\n^1 Automatic feature selection is not addressed here.\n^2 We thank Prof. Chang and D. Zhong for the algorithm [2].\n^3 P0(Xj) used 5 Gaussian components, while P1(Xj) used 10. The number of mixing components can be fixed experimentally and could be different for optimal performance. In general, models for H0 are represented better with more components than those for H1.\n\n5 The multinet as a factor graph\n\nTo model the interaction between multijects in a multinet, we propose to use a factor graph [1] framework. Factor graphs subsume graphical models like Bayesian nets and Markov random fields and have been successfully applied in the area of channel error correction coding [1], specifically iterative decoding. Let x = {x1, x2, ..., xn} be a vector of variables. A factor graph visualizes the factorization of a global function f(x). Let f(x) factor as\n\nf(x) = ∏_{i=1}^{m} fi(x(i))    (4)\n\nwhere x(i) is the set of variables of the function fi. 
A factor graph for f is defined as the bipartite graph with two vertex classes Vf and Vv, of sizes m and n respectively, such that the ith node in Vf is connected to the jth node in Vv iff fi is a function of xj. Figure 2 (a) shows a simple factor graph representation of f(x, y, z) = f1(x, y) f2(y, z) with function nodes f1, f2 and variable nodes x, y, z.\n\nMany signal processing and learning problems are formulated as optimizing a global function f(x) marginalized for a subset of its arguments. The algorithm which allows us to perform this efficiently, though in most cases only approximately, is called the sum-product algorithm. The sum-product algorithm works by computing messages at the nodes using a simple rule and then passing the messages between nodes according to a reasonable schedule. A message from a function node to a variable node is the product of all messages incoming to the function node with the function itself, marginalized for the variable associated with the variable node. A message from a variable node to a function node is simply the product of all messages incoming to the variable node from the other functions connected to it. Pearl's probability propagation working on a Bayesian net is equivalent to the sum-product algorithm applied to the corresponding factor graph. If the factor graph is a tree, exact inference is possible using a single set of forward and backward passes of messages. In all other cases inference is approximate and the message passing is iterative [1], leading to loopy probability propagation. This has a direct bearing on our problem, because relations between semantic concepts are complicated and in general contain numerous cycles (e.g., see Figure 1 (b)).\n\n5.1 Relating semantic concepts in a factor graph\n\nWe now describe a frame-level factor graph to model the probabilistic relations between the various frame-level semantic features Fi obtained using Equation 3. 
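To make the message computations concrete, here is a minimal sketch (with made-up function tables over binary variables) of the sum-product rule on the tree factor graph f(x, y, z) = f1(x, y) f2(y, z), checked against brute-force marginalization:

```python
import numpy as np

# f(x, y, z) = f1(x, y) * f2(y, z) over binary variables, as in Figure 2 (a);
# the table values are arbitrary.
f1 = np.array([[0.9, 0.1],
               [0.4, 0.6]])             # f1[x, y]
f2 = np.array([[0.7, 0.3],
               [0.2, 0.8]])             # f2[y, z]

# Sum-product on this tree: two function-to-variable messages suffice for y.
msg_f1_to_y = f1.sum(axis=0)            # marginalize f1 over x
msg_f2_to_y = f2.sum(axis=1)            # marginalize f2 over z
marginal_y = msg_f1_to_y * msg_f2_to_y  # equals sum_{x,z} f(x, y, z)

# Brute-force check over all 8 configurations.
brute = np.zeros(2)
for x in range(2):
    for y in range(2):
        for z in range(2):
            brute[y] += f1[x, y] * f2[y, z]
```

Because the graph is a tree, this single exchange of messages gives the exact marginal; with cycles, the same updates would be iterated instead.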
To capture the co-occurrence relationships between the five semantic concepts at the frame level, we define a function node which is connected to the five variable nodes representing the concepts, as shown in Figure 2 (b). This function node represents P(F1, F2, ..., FN). The function nodes below the five variable nodes denote the messages passed by the OR function of Equation 3 (P(Fi = 1), P(Fi = 0)). These are then propagated to the function node. At the function node the messages are multiplied by the function, which is estimated from the co-occurrence of the concepts in the training set. The function node then sends back messages summarized for each variable. This modifies the soft decisions at the variable nodes according to the high-level relationships between the five concepts.\n\nFigure 2: (a) An example of a simple factor graph. (b) A multinet: accounting for concept dependencies using a single function node representing the joint density of the 5 semantic concepts, with fusion at the frame level using the OR function. (c) Another multinet: replacing the function in (b) by a product of 10 local functions.\n\nIn general, the distribution at the function node in Figure 2 (b) is exponential in the number of concepts (N), and the computational cost may increase quickly. To alleviate this we can enforce a factorization of the function in Figure 2 (b) as a product of a set of local functions, where each local function accounts for the co-occurrence of two variables only. This modification to the graph in Figure 2 (b) is shown in Figure 2 (c). Each function in Figure 2 (c) represents the joint probability mass of the two variables that are its arguments (and there are N(N-1)/2 = 10 such functions), thus reducing the complexity. The factor graph is no longer a tree, and exact inference becomes hard as the number of loops grows. 
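The single-function-node update of Figure 2 (b) can be sketched by brute-force enumeration over the 2^N concept configurations; the joint table below is random, standing in for co-occurrence statistics estimated from the training set:

```python
import numpy as np
from itertools import product

rng = np.random.default_rng(0)
N = 5                                  # sky, water, forest, rocks, snow

# Hypothetical joint P(F1, ..., FN); in the paper this is estimated from
# concept co-occurrence counts on the training set.
joint = rng.random((2,) * N)
joint /= joint.sum()

# Incoming messages from the OR fusion: P(Fi = 1 | X) for each concept.
msg_in = np.array([0.9, 0.8, 0.1, 0.3, 0.05])

# Sum-product at the function node: multiply the joint by all incoming
# messages, then summarize (marginalize) for each variable.
post = np.zeros(N)
z = 0.0
for cfg in product((0, 1), repeat=N):
    w = joint[cfg]
    for i, fi in enumerate(cfg):
        w *= msg_in[i] if fi else 1.0 - msg_in[i]
    z += w
    for i, fi in enumerate(cfg):
        if fi:
            post[i] += w
post /= z                              # corrected P(Fi = 1) per concept
```

The loop over all 2^N configurations is exactly the exponential cost that motivates the pairwise factorization of Figure 2 (c).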
We then apply iterative techniques based on the sum-product algorithm to overcome this. We can also incorporate temporal dependencies. This can be done by replicating the slice of factor graph in Figure 2 (b) or (c) as many times as the number of frames within a single video shot and by introducing a first-order Markov chain for each concept. Figures 3 (a) and (b) show two consecutive time slices and extend the models in Figures 2 (b) and (c) respectively. The horizontal links in Figures 3 (a), (b) connect the variable node for each concept in a time slice to the corresponding variable node in the next time slice through a function modeling the transition probability. This framework now becomes a dynamic probabilistic network. For inference, messages are iteratively passed locally within each slice. This is followed by message passing across the time slices in the forward direction and then in the backward direction. Accounting for temporal dependencies thus leads to temporal smoothing of the soft decisions within each shot.\n\nFigure 3: (a) A dynamic multinet with an unfactored global distribution for each frame: replicating the multinet in Figure 2 (b) for each frame in a shot and accounting for temporal dependency between the value of each concept in consecutive frames using a Markov chain. (b) A dynamic multinet with a factored global distribution for each frame: repeating this for Figure 2 (c).\n\n6 Results\n\nWe compare the detection performance of the multijects with and without accounting for the concept dependencies and temporal dependencies. The reference system performs multiject detection by thresholding soft decisions (i.e., P(Fi | X)) at the frame level. 
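The per-concept temporal smoothing described in Section 5 can be sketched, for a single concept, as a standard forward-backward pass over its binary Markov chain; the transition and evidence values below are made up, with one noisy frame in the middle of an otherwise "present" run:

```python
import numpy as np

# Transition matrix for one concept's first-order Markov chain: T[s, s'].
T = np.array([[0.9, 0.1],
              [0.2, 0.8]])
# Per-frame evidence (P(Fi = 0 | X_t), P(Fi = 1 | X_t)) from the static
# multinet; frame t = 2 is a noisy outlier.
ev = np.array([[0.1, 0.9],
               [0.2, 0.8],
               [0.7, 0.3],
               [0.1, 0.9],
               [0.2, 0.8]])

# Standard forward-backward smoothing (illustrative, not the exact system).
n = len(ev)
alpha = np.zeros((n, 2))
beta = np.ones((n, 2))
alpha[0] = 0.5 * ev[0]                       # uniform initial state
for t in range(1, n):
    alpha[t] = ev[t] * (alpha[t - 1] @ T)    # forward pass
for t in range(n - 2, -1, -1):
    beta[t] = T @ (ev[t + 1] * beta[t + 1])  # backward pass
gamma = alpha * beta
gamma /= gamma.sum(axis=1, keepdims=True)    # smoothed P(state | all frames)
```

The smoothed posterior pulls the noisy middle frame back toward "present", which is the error-correction effect the dynamic multinet exploits.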
The proposed schemes are then evaluated by thresholding the soft decisions obtained after message passing using the structures in Figures 2 (b), (c) (conceptual dependencies) and Figures 3 (a), (b) (conceptual and temporal dependencies). We use receiver operating characteristic (ROC) curves, which plot the probability of detection against the probability of false alarms for different values of a parameter (the threshold in our case).\n\nFigure 4 shows the ROC curves for the overall performance over the test set across all five multijects. The three curves in Figure 4 (a) correspond to the performance using isolated frame-level classification, the factor graph in Figure 2 (b), and the factor graph in Figure 2 (c) with ten iterations of loopy propagation. The curves in Figure 4 (b) correspond to isolated detection followed by temporal smoothing, the dynamic multinet in Figure 3 (a), and the one in Figure 3 (b) respectively.\n\nFigure 4: ROC curves (probability of detection vs. probability of false alarms) for overall performance using isolated detection and two factor graph representations. (a) With static multinets. (b) With dynamic multinets.\n\nFrom Figure 4 we observe that there is a significant improvement in detection performance by using the multinet to model the dependencies between concepts over not using it. This improvement is especially stark for low Pf, where the detection rate improves by more than 22% for a threshold corresponding to Pf = 0.1. Interestingly, detection based on the factorized functions (Figure 2 (c)) performs better than the one based on the unfactorized function. This suggests that the factorized function is a better representative and can be estimated more reliably due to fewer parameters being involved. 
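The ROC evaluation used above amounts to sweeping the decision threshold over the soft decisions; a minimal sketch with synthetic labels and scores (not the paper's data):

```python
import numpy as np

def roc_points(scores, labels, thresholds):
    """(false-alarm rate, detection rate) pairs from thresholding scores."""
    pts = []
    for th in thresholds:
        pred = scores >= th
        pd = pred[labels == 1].mean()   # probability of detection
        pf = pred[labels == 0].mean()   # probability of false alarm
        pts.append((pf, pd))
    return pts

rng = np.random.default_rng(1)
labels = np.array([0] * 100 + [1] * 100)
# Synthetic soft decisions that separate the two classes:
# negatives fall in [0.35, 0.55), positives in [0.65, 0.85).
scores = 0.3 * labels + 0.35 + 0.2 * rng.random(200)
pts = roc_points(scores, labels, np.linspace(0, 1, 11))
```

Plotting `pts` (Pf on the horizontal axis, Pd on the vertical) yields a curve of the kind shown in Figure 4; a better detector pushes the curve toward the top-left corner.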
Also, by using the models in Figure 3, which account for temporal dependencies across video frames, and by performing smoothing using the forward-backward algorithm, we see further improvement in detection performance in Figure 4 (b). The detection rate corresponding to Pf = 0.1 is 68% for the static multinet (Figure 2 (c)) and 72% for its dynamic counterpart (Figure 3 (b)). Comparison of ROC curves with and without temporal smoothing (not shown here due to lack of space) reveals that temporal smoothing results in better detection irrespective of the threshold or configuration.\n\n7 Conclusions and Future Research\n\nWe propose a probabilistic framework for detecting semantic concepts using multijects and multinets. We present implementations of static and dynamic multinets using factor graphs. We show that there is a significant improvement in detection performance by accounting for the interaction between semantic concepts and the temporal dependency among the concepts. The multinet architecture imposes no restrictions on the classifiers used in the multijects, and we can improve performance by using better multiject models. Our framework can easily be expanded to integrate multiple modalities, if they have not already been integrated in the multijects, to account for the loose coupling between audio and visual streams in movies. It can also support inference of concepts that are observed not through media features but through their relation to those concepts which are observed in media features.\n\nReferences\n\n[1] F. Kschischang, B. Frey, and H.-A. Loeliger, \"Factor graphs and the sum-product algorithm,\" submitted to IEEE Trans. Inform. Theory, July 1998.\n\n[2] D. Zhong and S. F. Chang, \"Spatio-temporal video search using the object-based video representation,\" in Proceedings of the IEEE International Conference on Image Processing, vol. 2, Santa Barbara, CA, Oct. 1997, pp. 21-24.\n\n[3] M. 
Naphade, T. Kristjansson, B. Frey, and T. S. Huang, \"Probabilistic multimedia objects (multijects): A novel approach to indexing and retrieval in multimedia systems,\" in Proceedings of the fifth IEEE International Conference on Image Processing, vol. 3, Chicago, IL, Oct 1998, pp. 536-540.\n\n[4] S. F. Chang, W. Chen, and H. Sundaram, \"Semantic visual templates - linking features to semantics,\" in Proceedings of the fifth IEEE International Conference on Image Processing, vol. 3, Chicago, IL, Oct 1998, pp. 531-535.\n\n[5] M. Naphade, R. Mehrotra, A. M. Ferman, J. Warnick, T. S. Huang, and A. M. Tekalp, \"A high performance shot boundary detection algorithm using multiple cues,\" in Proceedings of the fifth IEEE International Conference on Image Processing, vol. 2, Chicago, IL, Oct 1998, pp. 884-887.\n\n[6] R. Jain, R. Kasturi, and B. Schunck, Machine Vision. MIT Press and McGraw-Hill, 1995.\n\n[7] A. K. Jain and A. Vailaya, \"Shape-based retrieval: A case study with trademark image databases,\" Pattern Recognition, vol. 31, no. 9, pp. 1369-1390, 1998.\n\n[8] S. Dudani, K. Breeding, and R. McGhee, \"Aircraft identification by moment invariants,\" IEEE Trans. on Computers, vol. C-26, pp. 39-45, Jan 1977.\n\n[9] M. R. Naphade and T. S. Huang, \"A probabilistic framework for semantic indexing and retrieval in video,\" to appear in IEEE International Conference on Multimedia and Expo, New York, NY, July 2000. http://www.ifp.uiuc.edu/~milind/cpapers.html\n\n[10] M. R. Naphade and T. S. Huang, \"Stochastic modeling of soundtrack for efficient segmentation and indexing of video,\" in SPIE IS&T Storage and Retrieval for Multimedia Databases, vol. 3972, Jan 2000, pp. 168-176. 
\n\n\f", "award": [], "sourceid": 1932, "authors": [{"given_name": "Milind", "family_name": "Naphade", "institution": null}, {"given_name": "Igor", "family_name": "Kozintsev", "institution": null}, {"given_name": "Thomas", "family_name": "Huang", "institution": null}]}