{"title": "Recurrent Space-time Graph Neural Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 12838, "page_last": 12850, "abstract": "Learning in the space-time domain remains a very challenging problem in machine learning and computer vision. Current computational models for understanding spatio-temporal visual data are heavily rooted in the classical single-image based paradigm. It is not yet well understood how to integrate information in space and time into a single, general model. We propose a neural graph model, recurrent in space and time, suitable for capturing both the local appearance and the complex higher-level interactions of different entities and objects within the changing world scene. Nodes and edges in our graph have dedicated neural networks for processing information. Nodes operate over features extracted from local parts in space and time and over previous memory states. Edges process messages between connected nodes at different locations and spatial scales or between past and present time. Messages are passed iteratively in order to transmit information globally and establish long range interactions. Our model is general and could learn to recognize a variety of high level spatio-temporal concepts and be applied to different learning tasks. We demonstrate, through extensive experiments and ablation studies, that our model outperforms strong baselines and top published methods on recognizing complex activities in video. 
Moreover, we obtain state-of-the-art performance on the challenging Something-Something human-object interaction dataset.", "full_text": "Recurrent Space-time Graph Neural Networks\n\nAndrei Nicolicioiu\u2217, Iulia Duta\u2217\n\nBitdefender, Romania\n\nanicolicioiu, iduta@bitdefender.com\n\nMarius Leordeanu\nBitdefender, Romania\n\nInstitute of Mathematics of the Romanian Academy\n\nUniversity \"Politehnica\" of Bucharest\n\nmarius.leordeanu@imar.ro\n\nAbstract\n\nLearning in the space-time domain remains a very challenging problem in machine learning and\ncomputer vision. Current computational models for understanding spatio-temporal visual data are\nheavily rooted in the classical single-image based paradigm. It is not yet well understood how to\nintegrate information in space and time into a single, general model. We propose a neural graph\nmodel, recurrent in space and time, suitable for capturing both the local appearance and the complex\nhigher-level interactions of different entities and objects within the changing world scene. Nodes\nand edges in our graph have dedicated neural networks for processing information. Nodes operate\nover features extracted from local parts in space and time and over previous memory states. Edges\nprocess messages between connected nodes at different locations and spatial scales or between past\nand present time. Messages are passed iteratively in order to transmit information globally and\nestablish long range interactions. Our model is general and could learn to recognize a variety of high\nlevel spatio-temporal concepts and be applied to different learning tasks. We demonstrate, through\nextensive experiments and ablation studies, that our model outperforms strong baselines and top\npublished methods on recognizing complex activities in video. 
Moreover, we obtain state-of-the-art\nperformance on the challenging Something-Something human-object interaction dataset.\n\n1\n\nIntroduction\n\nVideo data is available almost everywhere. While image level recognition is better understood, visual\nlearning in space and time is far from being solved. The main challenge is how to model interactions\nbetween objects and higher level concepts, within the large spatio-temporal context. For such a\ndif\ufb01cult learning task it is important to ef\ufb01ciently model the local appearance, the spatial relationships\nand the complex interactions and changes that take place over time.\nOften, for different learning tasks, different models are preferred, such that they capture the speci\ufb01c\ndomain priors and biases of the problem [1]. Convolutional neural networks (CNNs) are preferred on\ntasks involving strong local and stationary assumptions about the data. Recurrent models are chosen\nwhen data is sequential in nature. Fully connected models could be preferred when there is no known\nstructure in the data. Our recurrent neural graph ef\ufb01ciently processes information in both space and\ntime and can be applied to different learning tasks in video.\nWe propose Recurrent Space-time Graph (RSTG) neural networks, in which each node receives\nfeatures extracted from a speci\ufb01c region in space-time using a backbone deep neural network.\n\n\u2217Equal contribution.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fFigure 1: The RSTG-to-map architecture: the input to RSTG is a feature volume, extracted by a\nbackbone network, down-sampled according to each scale. Each node receives input from a cell,\ncorresponding to a region of interest in space. The edges between different nodes represent messages\nin space, the red links are spatial updates, while the purple links represent messages in time. 
All\nthe extracted (input to graph) and up-sampled features (output from graph) have the same spatial\nand temporal dimension T \u00d7 H \u00d7 W \u00d7 C and are only represented at different scales for a better\nvisualisation.\n\nGlobal processing is achieved through iterative message passing in space and time. Spatio-temporal\nprocessing is factorized, into a space processing stage and a time processing stage, which are\nalternated within each iteration. We aim to decouple, conceptually, the data from the computational\nmachine that processes the data. Thus, our nodes are processing units that receive inputs from several\nsources: local regions in space at the present time, their neighbor spatial nodes as well as their past\nmemory states (Fig. 1).\nMain contributions. We sum up our contributions into the following three main ideas:\n\n1. We propose a novel computational model for learning in spatio-temporal domain. Space\nand time are treated differently, while they function together in complementary ways. Our\nmodel is general and could be applied to various learning problems. It could also be used\nas a processing block in combination with other powerful models.\n\n2. We factorize space and time and process them differently within a uni\ufb01ed neural graph\nmodel from an unstructured video. In extensive ablation studies we show the importance of\neach graph component and also demonstrate that different temporal and spatial processing\nis crucial for learning in space-time domain. Through recurrent and factorized space-time\nprocessing our model achieves a relatively low computational complexity.\n\n3. We introduce a new synthetic dataset, with complex interactions, to analyse and evaluate\ndifferent spatio-temporal models. We obtain a performance that is superior to several\npowerful baselines and top published methods. 
More importantly, we obtain state-of-the-\nart results on the challenging Something-Something, real world dataset.\n\nRelation to previous work:\nIterative graph based methods have a long history in machine learning\nand are currently enjoying a fast-growing interest [1, 2]. Their main paradigm is the following: at\neach iteration, messages are passed between nodes, information is updated at each node and the\nprocess continues until convergence or a stopping criterion is met. Such ideas trace back to work on\nimage denoising, restoration and labeling [3, 4, 5, 6], with many inference methods, graphical models\nand mathematical formulations being proposed over time for various tasks [7, 8, 9, 10, 11, 12, 13].\nCurrent approaches combine the idea of message passing between graph nodes, from graphical\nmodels, with convolution operations. Thus, the idea of graph convolutions was born. Initial methods\ngeneralizing conv nets to the case of graph structured data [14, 15, 16] learn in the spectral domain\nof the graph. They are approximated [17] by message passing based on linear operations [18] or\n\n2\n\n\fMLPs [19]. Aggregation of messages needs permutation invariant operators such as max or sum, the\nlast one being proved superior in [20], with attention mechanism [21] as an alternative.\nRecurrence in graph models has been proposed for sequential tasks [22, 23] or for iteratively pro-\ncessing the input [24, 25]. Recurrence is used in graph neural nets [22] to tackle symbolic tasks with\nsingle input and sequential language output. Different from them, we have two types of recurrent\nstages, with distinct functionality, one over space and the other over time.\nThe idea of modeling complex, higher order and long range spatial relationships by the spatial\nrecurrence relates to more classical work using pictorial structures [26] to model object parts and their\nrelationships and perform inference through iterative optimization algorithms. 
The idea of combining information at different scales also relates to classic approaches in object recognition, such as the well-known spatial pyramid model [27, 28].\nLong-range dependencies in sequential language are captured in [29] with a self-attention model. It has a stack of attention layers, each with different parameters. It is improved in [24] by performing the operations recurrently. This is similar to our recurrent spatial processing stage. As mentioned before, our model differs by adding another, complementary dimension - the temporal one. In [25] new information is incorporated into the existing memory by self-attention using a temporary new node. Then each node is updated by an LSTM [30]. Their method is applied to program evaluation, simulated environments used in reinforcement learning and language modeling, tasks that do not have a spatial dimension. Their nodes act as a set of memories. Different from them, we receive new information for each node and process it in multiple interleaved iterations of our two stages.\nInitial node information could come from each local spatio-temporal point in convolutional feature maps [31, 32] or from features corresponding to entities detected by external methods, such as objects [33, 34, 35] or skeletons [36]. Also, the approach in [37] extracts objects and forms relations between objects from randomly chosen pairs of time steps. Different from those methods, our nodes are not attached to specific volumes in time and space. Also, we do not need pre-trained higher-level detectors, as our model works on unstructured videos.\nWhile the above methods need access to the whole video at test time, ours is recurrent and can function in an online, continuous manner in time. All space-time positions in the input volume are connected in [32, 33, 38]. In contrast, we treat space and time differently and prove the effectiveness of our choice in experiments. 
A 1D convolution is used in [36] to temporally connect only the nodes corresponding to the same skeleton joint, and recently [34] sends messages between nodes corresponding to the same entities, while we recurrently update in time the state of each node.\nWe could see our different handling of time and space as an efficient factorization into simpler mechanisms that function together along different dimensions. The work in [39, 40] confirms our hypothesis that features could be more efficiently processed by factorization into simpler operations. The models in [41, 42, 43] factorize 3D convolutions into 2D spatial and 1D temporal convolutions.\nFor spatio-temporal processing, some methods, which do not use explicit graph modeling, encode frames individually using 2D convolutions and aggregate them in different ways [44, 45, 46]; others form relations as functions (MLPs) over sets of frames [47] or use 3D convolutions inflated from existing 2D convolutional networks [48]. Optical flow could be used as input to a separate branch of a 2D ConvNet [49] or as part of the model, to guide the kernels of 3D convolutions [50]. To cover both the spatial and temporal dimensions simultaneously, a Convolutional LSTM [51] can be used, augmented with additional memory [52] or with self-attention in order to update the LSTM hidden states [53].\n\n2 Recurrent Space-time Graph Model\n\nThe Recurrent Space-time Graph (RSTG) model is designed to process data in both space and time, to capture both local and long range spatio-temporal interactions (Fig. 1). RSTG takes into consideration local information by computing over features extracted from specific locations and scales at each moment in time. Then it integrates long range spatial and temporal information by iterative message passing at the spatial level between connected nodes and by recurrence in time, respectively. 
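As a concrete sketch of the multi-scale node inputs (the 1x1, 2x2, 3x3 grid sizes follow the implementation details; average pooling and the exact cell boundaries are our assumptions, not the authors' code):

```python
import numpy as np

def extract_nodes(F_t, scales=((1, 1), (2, 2), (3, 3))):
    """Pool an H x W x C feature map into h x w grids, one node feature
    per grid cell, and stack the nodes of all scales."""
    H, W, _ = F_t.shape
    nodes = []
    for h, w in scales:
        for i in range(h):
            for j in range(w):
                # average-pool the cell's region of interest (an assumption)
                cell = F_t[i * H // h:(i + 1) * H // h,
                           j * W // w:(j + 1) * W // w]
                nodes.append(cell.mean(axis=(0, 1)))
    return np.stack(nodes)  # (1 + 4 + 9) x C = 14 x C for the 3 scales

assert extract_nodes(np.random.randn(64, 64, 512)).shape == (14, 512)
```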
The space and time message passing is coupled, with the two stages succeeding one after another.\nOur model takes a video and processes it using a backbone function into a feature volume F \u2208 RT\u00d7H\u00d7W\u00d7C, where T is the time dimension and H, W the spatial ones. The backbone function could be modeled by any deep neural network that operates over single frames or over space-time volumes.\n\nAlgorithm 1 Space-time processing in RSTG model.\n\nInput: Time-space features F \u2208 RT\u00d7H\u00d7W\u00d7C\nrepeat\n    v_i \u2190 extract_features(F_t, i)    \u2200i\n    for k = 0 to K \u2212 1 do\n        v_i = h_i^{t,k} = ftime(v_i, h_i^{t\u22121,k})    \u2200i\n        m_{j,i} = fsend(v_j, v_i)    \u2200i, \u2200j \u2208 N(i)\n        g_i = fgather(v_i, {m_{j,i}}_{j\u2208N(i)})    \u2200i\n        v_i = fspace(v_i, g_i)    \u2200i\n    end for\n    h_i^{t,K} = ftime(v_i, h_i^{t\u22121,K})    \u2200i\n    t = t + 1\nuntil end-of-video\nv_final = faggregate({h_i^{1:T,K}}_{\u2200i})\n\nFigure 2: Two Space Processing Stages (K = 2) from top to bottom, each one preceded by a Temporal Processing Stage.\n\nThus, we extract local spatio-temporal information from the video volume and we process it using our graph, sequentially, time step after time step. This approach makes it possible for our graph to also process a continuous flow of spatio-temporal data and function in an online manner.\nInstead of fully connecting all positions in time and space, which is costly, we establish long range interactions through recurrent and complementary Space and Time Processing Stages. Thus, in the temporal processing stage, each node receives a message from the previous time step. Then, at the spatial stage, the graph nodes, which now have information from both present and past, start exchanging information through message passing. 
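The alternating schedule of Alg. 1 can be sketched as follows. The real model uses an LSTM for ftime and MLP-based message passing for the space stage; the toy stand-in functions below are ours and only make the control flow visible:

```python
import numpy as np

def rstg_schedule(frames, K=3):
    """Sketch of Alg. 1: per frame, K interleaved (time, space) stages,
    a final time stage, then RSTG-to-vec aggregation at the end."""
    f_time = lambda v, h: 0.5 * (v + h)      # stand-in for the recurrent update
    f_space = lambda v: v - v.mean(axis=0)   # stand-in for one space stage
    N, C = frames[0].shape
    h = [np.zeros((N, C)) for _ in range(K + 1)]  # one memory per iteration
    for v in frames:                              # online, frame by frame
        for k in range(K):
            v = h[k] = f_time(v, h[k])            # time stage k
            v = f_space(v)                        # space stage k
        h[K] = f_time(v, h[K])                    # final time stage
    return h[K].sum(axis=0)                       # RSTG-to-vec aggregation

out = rstg_schedule([np.random.randn(14, 8) for _ in range(10)])
assert out.shape == (8,)
```

Note that each of the K iterations keeps its own memory h[k], matching the per-iteration past states of Alg. 1.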
Space and time are coupled and performed alternatively: after each space iteration iter, another time iteration follows, with a message coming from the past memory associated with the same space iteration iter. The processing stages of our algorithm are succinctly presented in Alg. 1 and Fig. 2. They are detailed below. The code for the full model can be found in our repository.2\nGraph Creation. We create N nodes connected in a graph structure and use them to process a feature volume F \u2208 RT\u00d7H\u00d7W\u00d7C. Each node receives input from a specific region (a window defined by a location and scale) of the feature volume at each time step t (Fig. 1). At each scale we downsample the H \u00d7 W feature maps into h \u00d7 w grids, each cell corresponding to one node. Two nodes are connected if they are neighbours in space or if their regions at different scales intersect.\n\n2.1 Space Processing Stage\n\nSpatial interactions are established by exchanging messages between nodes. The process involves 3 steps: send messages between all connected nodes, gather information at node level from the received messages and update the internal node representations. Each step has its own dedicated MLP. Message passing is iterated K times, with time processing steps followed by space processing steps, at each iteration.\n\nMessage sending function. A given message between two nodes should represent relevant information about their pairwise interaction. Thus, the message is a function of both the source and destination nodes j and i, respectively. The function fsend(vj, vi) is modeled as a multilayer perceptron (MLP) applied on the concatenation of the two node features:\n\nfsend(vj, vi) = MLPs([vj|vi]) \u2208 RD.    (1)\nMLPa(x) = \u03c3(Wa2 \u03c3(Wa1 x + ba1) + ba2).    (2)\n\n2https://github.com/IuliaDuta/RSTG\n\nPosition-aware messages. 
The pairwise interactions between nodes should have positional aware-\nness - each node should be aware of the position of the neighbor that sends a particular message.\nTherefore we include the position information as a (linearized) low-resolution 6 \u00d7 6 map in the\nmessage body sent with fsend, by concatenating the map to the rest of the message. The actual map\nis formed by putting ones for the cells corresponding to the region of interest of the sending nodes\nand zeros for the remaining cells, and then applying \ufb01ltering with a Gaussian kernel.\n\nGather function. Each node receives a message from each of its neighbours and aggregates them\nusing the fgather function, which could be a simple sum of all messages or an attention mechanism\nthat gives a different weight to each message, according to its importance. In this way, a node could\nchoose what information to receive. In our implementation, the attentional weight function \u03b1 is\ncomputed as the dot product between features of the two nodes, measuring their similarity.\n\n(cid:88)\n\nj\u2208N (i)\n\nfgather(vi) =\n\n\u03b1(vj, vi)fsend(vj, vi) \u2208 RD.\n\n\u03b1(vj, vi) = (W\u03b11 vj)T (W\u03b12vi) \u2208 R.\n\n(3)\n\n(4)\n\nUpdate function. We update the representation of each node with the information gathered from\nits neighbours, using function fspace modeled as a multilayer perceptron (MLP). 
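The send and gather functions (Eqs. 1, 3, 4) can be sketched as follows. The single tanh layer stands in for the 2-layer MLP of Eq. 2, the attention scores are left unnormalized as in Eq. 4, and the random parameter matrices are placeholders for learned weights:

```python
import numpy as np

D = 16
rng = np.random.default_rng(0)
# stand-ins for the learned parameters of f_send (Eq. 1) and alpha (Eq. 4)
W_s = rng.standard_normal((2 * D, D)) / D
W_a1 = rng.standard_normal((D, D)) / D
W_a2 = rng.standard_normal((D, D)) / D

def f_send(v_j, v_i):
    # Eq. 1: message from node j to node i, an MLP over the concatenation
    return np.tanh(np.concatenate([v_j, v_i]) @ W_s)

def f_gather(v_i, neighbours):
    # Eqs. 3-4: aggregate incoming messages, weighted by dot-product
    # attention scores between the projected node features
    alpha = [(W_a1 @ v_j) @ (W_a2 @ v_i) for v_j in neighbours]
    return sum(a * f_send(v_j, v_i) for a, v_j in zip(alpha, neighbours))

v_i = rng.standard_normal(D)
neighbours = [rng.standard_normal(D) for _ in range(4)]
assert f_gather(v_i, neighbours).shape == (D,)
```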
We want each node to be capable of taking into consideration global information while also maintaining its local identity. The MLP is able to combine efficiently the new information received from neighbours with the local information from the node\u2019s input features.\n\nfspace(vi) = MLPu([vi|fgather(vi)]) \u2208 RD.    (5)\n\nIn general, the parameters Wu, bu could be shared among all nodes at all scales or each set could be specific to the actual scale.\n\n2.2 Time Processing Stage\n\nEach node updates its state in time by aggregating the current spatial representation fspace(vi) with its time representation from the previous step, using a recurrent function. In order to model more expressive spatio-temporal interactions and to give the model the ability to reason about all the information in the scene, with knowledge about past states, we put a Time Processing Stage before each Space Processing Stage, at each iteration, and another Time Processing Stage after the last spatial processing. Thus messages are passed iteratively in both space and time, alternatively. The Time Processing Stage at iteration k updates each node\u2019s internal state v_i^{t,k} with information from its correspondent state v_i^{t\u22121,k}, at the same iteration k, in the previous time step t \u2212 1, resulting in features that take into account both spatial interactions and history (Fig. 2).\n\nh_{i,time}^{t,k} = ftime(v_{i,space}^{k}, h_{i,time}^{t\u22121,k}).    (6)\n\n2.3 Aggregation step\n\nThe aggregation function faggregate could produce two types of final representations, a 1D vector or a 3D map. In the first case, denoted RSTG-to-vec, we obtain the vector encoding by summing the representations of all the nodes from the last time step. In the second case, denoted RSTG-to-map, we create the inverse operation of the node creation, by sending the processed information contained in each node back to its original region in the space-time volume, as shown in Figure 1. 
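Both aggregation modes can be sketched as follows (nearest-neighbour up-sampling is our assumption; the paper does not specify the interpolation used when scattering the grids back):

```python
import numpy as np

def upsample_nearest(grid, H, W):
    # nearest-neighbour up-sampling of an h x w x C grid to H x W x C
    h, w, _ = grid.shape
    return grid[np.arange(H) * h // H][:, np.arange(W) * w // W]

def rstg_to_map(scale_grids, H, W):
    """RSTG-to-map: send each scale's h x w x C node grid back to an
    H x W x C map and sum over scales."""
    return sum(upsample_nearest(g, H, W) for g in scale_grids)

def rstg_to_vec(scale_grids):
    """RSTG-to-vec: sum all node features into a single C-vector."""
    return sum(g.sum(axis=(0, 1)) for g in scale_grids)

grids = [np.random.randn(s, s, 32) for s in (1, 2, 3)]  # the 3 grid scales
assert rstg_to_map(grids, 14, 14).shape == (14, 14, 32)
assert rstg_to_vec(grids).shape == (32,)
```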
For each scale, we have h \u00d7 w nodes with C-channel features, which we arrange in an h \u00d7 w grid, resulting in a volume of size h \u00d7 w \u00d7 C. We up-sample the grid map for each scale into an H \u00d7 W \u00d7 C map and sum the maps of all scales for the final H \u00d7 W \u00d7 C representation.\n\nTable 1: Accuracy on SyncMNIST dataset, showing the capabilities of different parts of our model.\n\nModel | 3 SyncMNIST | 5 SyncMNIST\nMean + LSTM | 77.0 | -\nConv + LSTM | 95.0 | 39.7\nI3D | - | 90.6\nNon-Local | - | 93.5\nRSTG: Space-Only | 61.3 | -\nRSTG: Time-Only | 89.7 | -\nRSTG: Homogeneous | 95.7 | 58.3\nRSTG: 1-temp-stage | 97.0 | 74.1\nRSTG: All-temp-stages | 98.9 | 94.5\nRSTG: Positional All-temp | - | 97.2\n\nFigure 3: On each row we present frames from videos of the 5SyncMNIST dataset. In each video sequence two digits follow the exact same pattern of movement. The correct classes: \"3-9\", \"6-7\" and \"9-1\".\n\n2.4 Computational complexity\n\nWe analyse the computational complexity of the RSTG model. If N is the number of nodes in a frame and E the number of edges, we have O(2E) messages per space-processing stage, as there are two different spatial messages per edge, one in each direction. With a total of T time steps and K (= 3) spatio-temporal message passing iterations, each of the K spatial message passing iterations is preceded by a temporal iteration, resulting in a total complexity of O(T \u00d7 2E \u00d7 K + T \u00d7 N \u00d7 (K + 1)).\nNote that E is upper-bounded by N(N \u2212 1)/2. Without the factorisation, with messages between all the nodes in time and space (similar to [32, 33]), we would arrive at a complexity of O(T^2 \u00d7 N^2 \u00d7 K) in the number of messages, which is quadratic in time. 
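To make the difference concrete, the message counts can be compared for an example configuration (our choice of numbers: T = 10 frames, the N = 14 nodes of the 1x1 + 2x2 + 3x3 grids, K = 3, and E at its upper bound):

```python
# Message counts: factorized RSTG schedule vs. full space-time connectivity.
T, N, K = 10, 14, 3
E = N * (N - 1) // 2                            # upper bound on spatial edges

factorized = T * 2 * E * K + T * N * (K + 1)    # O(T*2E*K + T*N*(K+1))
fully_connected = T**2 * N**2 * K               # O(T^2 * N^2 * K)

assert (factorized, fully_connected) == (6020, 58800)
```

Even for this small configuration, the factorized schedule needs roughly an order of magnitude fewer messages, and the gap grows with T.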
Note that our lower complexity is due to the\nrecurrent nature of our model and the space-time factorization.\n\n3 Experiments\n\nWe perform experiments on two video classi\ufb01cation tasks, which involve complex object interactions.\nWe experiment on a video dataset that we create synthetically, containing complex patterns of move-\nments and shapes, and on the challenging Something-Something-v1 dataset, involving interactions\nbetween a human and other objects [54].\n\n3.1 Learning patterns of movements and shapes\n\nThere are not many available video datasets that require modeling of dif\ufb01cult object interactions.\nImprovements are often made by averaging the \ufb01nal predictions over space and time [38]. The\ncomplex interactions and the structure of the space-time world still seem to escape the modeling\ncapabilities. For this reason, and to better understand the role played by each component of our model\nin relation to some very strong baselines, we introduce a novel dataset, named SyncMNIST.\nWe make several MNIST digits move in complex ways. We designed the dataset such that the\nrelationships involved are challenging in both space and time. The dataset contains 600K videos\nshowing multiple digits, where all of them move randomly, apart from a pair of digits that moves\nsynchronously - that speci\ufb01c pair determines the class of the activity pattern, for a total of 45 unique\ndigit pairs (classes) plus one extra class (no pair is synchronous).\nIn order to recognize the pattern, a given model has to reason about the location in space of each digit,\ntrack them across the entire time in order to learn the association between a label and a pair of digits\nthat moves synchronously. The data has 18 \u00d7 18 size digits moving on a black 64 \u00d7 64 background\nfor 10 frames. In Fig. 3 we present frames from three different videos used in our experiments. 
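A heavily simplified generator in the spirit of SyncMNIST (not the authors' data code: it tracks point objects instead of rendering 18 x 18 digits, and all parameters below are illustrative) shows the labeling rule, a pair sharing the same displacement at every step:

```python
import numpy as np

def toy_sync_video(n_objects=5, T=10, lo=20, hi=44, rng=None):
    """Objects move randomly, except objects 0 and 1, which share the same
    displacement at every step (the synchronous pair defines the class).
    Starting positions stay away from the border so no clipping is needed."""
    rng = rng or np.random.default_rng(0)
    pos = rng.integers(lo, hi, (n_objects, 2))
    track = []
    for _ in range(T):
        step = rng.integers(-2, 3, (n_objects, 2))  # random per-object motion
        step[1] = step[0]                           # synchronize the pair
        pos = pos + step
        track.append(pos.copy())
    return np.stack(track)                          # T x n_objects x 2

v = toy_sync_video()
assert v.shape == (10, 5, 2)
assert np.ptp(v[:, 0] - v[:, 1], axis=0).max() == 0  # constant offset
```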
We trained and evaluated our models first on the easier 3-digit dataset (3SyncMNIST); then only the best models were trained and tested on the harder 5-digit dataset (5SyncMNIST).\nWe compared against four strong baseline models that are often used on video understanding tasks. For all tested models we used a convolutional network as a backbone. It is a small CNN with 3 layers, pre-trained to classify a digit randomly placed in a frame of the video. It is important to notice that published models such as MeanPooling+LSTM, Conv+LSTM, I3D and Non-Local have the same ranking on our SyncMNIST dataset as on other datasets such as UCF-101 [55], HMDB-51 [56], Kinetics (see [48]) and Something-Something (see [33]). The available performance of these models on all datasets can be found in Section A of the Appendix.\nIt is also important that the performance of different models seems to be well correlated with the ability of a specific model to incorporate and process the time axis. This aspect, combined with the fact that, by design, the temporal dimension is important on SyncMNIST, makes the tests on SyncMNIST relevant.\nMean pooling + LSTM: Uses the backbone for feature extraction, mean-pools the features spatially and aggregates them temporally using an LSTM. This model is capable of processing information from distant time-steps but it has a poor understanding of spatial information.\nConvNet + LSTM: Replaces the mean pooling with convolutional layers that are able to capture fine spatial relationships between different parts of the scene. Thus, it is fully capable of analysing the entire video, both in space and in time.\nI3D: We adapt the I3D model [48] with a smaller ResNet [57] backbone to keep the number of parameters comparable to our model. 3D convolutions are capable of capturing some of the longer range relationships, both spatially and temporally.\nNon-Local: We used the previous I3D architecture as a backbone for a Non-Local [32] model. 
We obtained the best results with one non-local block in the second residual block.\n\nImplementation details for RSTG: Our recurrent neural graph model (RSTG) uses the initial 3-layer CNN as backbone, an LSTM with a 512-dimensional hidden state for ftime and RSTG-to-vec as aggregation. We use 3 scales, with 1 \u00d7 1, 2 \u00d7 2 and 3 \u00d7 3 grids, and nodes of dimension 512. We implement our model in the TensorFlow framework [58]. We use cross-entropy as the loss function and train the model end-to-end with SGD with Nesterov momentum of 0.9, starting from a learning rate of 0.0001 and decreasing it by a factor of 10 when performance saturates.\nThe results in Table 1 show that RSTG is significantly more powerful than the competitors. Note that the graph model runs on single-image based features, without any temporal processing at the backbone level. The only temporal information is transmitted between nodes at the higher graph level.\n\n3.1.1 Ablation study\n\nSolving the moving digits task requires a model capable of capturing pairwise interactions both in space and time. RSTG is able to accomplish that through the spatial connections between nodes and the temporal updates of their states. In order to prove the benefits of each element, we perform experiments that show the contribution of each one and present them in Table 1. We also observed the efficient transfer capabilities of our model between the two versions of the SyncMNIST dataset. When pretrained on 3SyncMNIST, our best model, RSTG-all-temp-stages, achieves 90% of its maximum performance in a number of steps in which an uninitialized model only attains 17% of its maximum performance.\nSpace-Only RSTG: We create this model in order to prove the necessity of powerful time modeling. It performs the Space Processing Stage on each frame, but ignores the temporal sequence, replacing the recurrence with an average pool across the time dimension, applied for each node. 
As expected, this model obtains the worst results, because the task is based on the movement of each digit, information that cannot be inferred from spatial exploration alone.\nTime-Only RSTG: This model performs just the Time Processing Stage, without any message-passing between nodes. The features used in the recurrent step are the initial features extracted from the backbone neural network, which takes single frames as input.\nHomogeneous Space-time RSTG: This model allows the graph to interact both spatially and temporally, but learns the same set of parameters for the MLPs that compute messages in time and space. Thus, time and space are processed in the same way.\n\nTable 2: Comparison with state-of-the-art models on Something-Something-v1 dataset showing Top-1 and Top-5 accuracy.\n\nModel | Backbone | Val Top-1 | Val Top-5\nC2D | 2D ResNet-50 | 31.7 | 64.7\nTRN [47] | 2D Inception | 34.4 | -\nours C2D + RSTG | 2D ResNet-50 | 42.8 | 73.6\nMFNet-C50 [59] | 3D ResNet-50 | 40.3 | 70.9\nI3D [33] | 3D ResNet-50 | 41.6 | 72.2\nNL I3D [33] | 3D ResNet-50 | 44.4 | 76.0\nNL I3D + Joint GCN [33] | 3D ResNet-50 | 46.1 | 76.8\nECOLite-16F [60] | 2D Inc+3D Res-18 | 42.2 | -\nMFNet-C101 [59] | 3D ResNet-101 | 43.9 | 73.1\nI3D [42] | 3D Inception | 45.8 | 76.5\nS3D-G [42] | 3D Inception | 48.2 | 78.7\nours I3D + RSTG | 3D ResNet-50 | 49.2 | 78.8\n\nHeterogeneous Space-time RSTG: We developed different schedulers for our spatial and temporal stages. In the first scheduler, used in the 1-temp RSTG model, for each time step we performed 3 successive spatial iterations, followed by a single final temporal update. 
The second scheduler, the\nall-temp RSTG model, alternates between the spatial and temporal stages (as presented in Alg.1).\nWe use one Time Processing Stage before each of the three Space-Processing Stages, and a last Time\nProcessing Stage to obtain the \ufb01nal nodes representation.\nPositional All-temp RSTG: This is the previous all-temp RSTG model, but enriched with positional\nembeddings used in fsend function as explained in Section 2. This model, which is our best and \ufb01nal\nmodel, is also able to reason about global locations of the entities.\n\n3.2 Learning human-object interaction\n\nIn order to evaluate our method in a real world scenario involving complex interactions, we use the\nSomething-Something-v1 dataset [54]. It consists of a collection of 108499 videos with 86017, 11522\nand 10960 videos for train, validation and test splits respectively. It has 174 classes for \ufb01ne-grained\ninteractions between humans and objects. It is designed such that classes can be discriminated not by\nsome global context or background but from the actual speci\ufb01c interactions.\nFor this task we investigate the performance of our graph model combined with two backbones, a 2D\nconvolutional one (C2D [32]), based on ResNet-50 architecture and an I3D [48] model in\ufb02ated also\nfrom the ResNet-50. We start with backbones pretrained on Kinetics-400 [48] dataset as provided by\n[32] and train the whole model end-to-end.\nWe analyse our both aggregation types, described in Section 2.3. For RSTG-to-vec we use the last\nconvolutional features given by the I3D backbone as input to our graph model and obtain a vector\nrepresentation. To facilitate the optimisation process we use residual connections in RSTG, by adding\nthe results of the graph processing to the pooled features of the backbone. For the second case we use\nintermediate features of I3D as input to the graph and also add them to the graph output by a residual\nconnection and continue the I3D model. 
For this purpose we need both the input and the output of the\ngraph to have the same dimension. Thus we use RSTG-to-map to obtain a 3D map at each time step.\n\nTraining and evaluation. For training, we uniformly sample 32 frames from each video resized\nsuch that the height is 256, preserving the aspect ratio and randomly cropped to a 224 \u00d7 224 clip.\nFor inference, we apply the backbone fully convolutional on a 256 \u00d7 256 crop with the graph taking\nfeatures from larger activation maps. We use 11 square clips uniformly sampled on the width of the\nframes for covering the entire spatial size of the video, and use 2 samplings along the time dimension.\nWe mean pool the clips output for the \ufb01nal prediction.\n\n8\n\n\fFigure 4: We show running time (clips / s) on\nthe left axis and \ufb01nal accuracy on the right axis.\n\nTable 3: Ablation study showing where to place\nthe graph inside the ResNet-50 I3D backbone.\nFor our best model we use two different graphs\nafter the res3 and res4 stages of the I3D.\n\nModel\n\nTop-1\n\nTop-5\n\nRSTG-to-vec\nRSTG-to-map res2\nRSTG-to-map res3\nRSTG-to-map res4\nRSTG-to-map res3-4\n\n47.7\n46.9\n47.7\n48.4\n49.2\n\n77.9\n76.8\n77.8\n78.1\n78.8\n\nResults. We analyse how our graph model could be used to improve I3D by applying RSTG-to-map\nat different layers in the backbone and RSTG-to-vec after the last convolutional layer. In all cases\nthe model achieves competitive results, and the best performance is obtained using the graph in\nthe res3 and res4 blocks of the I3D as shown in Table 3. We compare against recent methods on\nthe Something-Something-v1 dataset and show the results in Table 2. Among the models using\n2D ConvNet backbones, ours obtains the best results (with a signi\ufb01cant improvement of more than\n8% over all methods using a 2D backbone, for the Top-1 setup). 
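The residual insertion of the graph into the backbone can be sketched as follows (both stage functions are placeholders; the real model uses I3D residual blocks and RSTG-to-map, and the tensor sizes below are illustrative):

```python
import numpy as np

def stage_with_rstg(x, backbone_block, rstg):
    """Insert the graph after an intermediate backbone stage (e.g. res3 or
    res4): the RSTG-to-map output is added back to the backbone features
    through a residual connection, so the rest of the backbone continues
    unchanged."""
    x = backbone_block(x)
    return x + rstg(x)      # residual connection; shapes must match

# toy stand-ins, only to show the shape contract
feats = np.random.randn(8, 14, 14, 256)         # T x H x W x C features
out = stage_with_rstg(feats, backbone_block=lambda f: f,
                      rstg=lambda f: 0.1 * f)
assert out.shape == feats.shape
```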
When using the I3D backbone, RSTG reaches state-of-the-art results, with a 1% improvement over all methods (Top-1 case) and a 3.1% improvement over the top methods using the same 3D-ResNet-50 backbone (Top-1 case).

Computational requirements. In Figure 4 we show the running times of different variants of our model and of the Non-Local model, all using the ResNet-50 backbone, on Something-Something videos on one Nvidia GTX 1080 Ti GPU. We observe that our RSTG-to-vec model is faster, while having better accuracy than the Non-Local model, whereas our top-performing model, RSTG-to-map res3-4, further improves accuracy at the cost of being about 2× slower than RSTG-to-vec. RSTG-to-vec requires 6.95 GB of memory for training and 1.23 GB for inference, while RSTG-to-map res3-4 requires 7.50 GB and 1.93 GB respectively, with a batch of 2 clips.

4 Conclusions

In this paper we introduce the Recurrent Space-time Graph (RSTG) neural network model, which is specifically designed to learn efficiently in space and time. At each moment in time, the graph starts by receiving local space-time information from features produced by a given backbone network. It then moves towards global understanding by passing messages over space, between different locations and scales, and recurrently in time, keeping a separate past memory for each space-time iteration.
Our model is unique in the literature in the way it processes space and time, with several main contributions: 1) it treats space and time differently; 2) it factorizes them and uses recurrent connections within a unified neural graph model operating on unstructured video, with relatively low computational complexity; 3) it is flexible and general, being relatively easy to adapt to various learning tasks in the spatio-temporal domain; 4) our ablation study justifies the structure and the different components of our model, which obtains state-of-the-art results on the challenging Something-Something dataset. In future work we plan to further study and extend our model to other higher-level tasks, such as semantic segmentation in spatio-temporal data and vision-to-language translation.

Acknowledgements: This work has been supported in part by Bitdefender and UEFISCDI, through projects EEA-RO-2018-0496 and TE-2016-2182.

References

[1] Peter W Battaglia, Jessica B Hamrick, Victor Bapst, Alvaro Sanchez-Gonzalez, Vinicius Zambaldi, Mateusz Malinowski, Andrea Tacchetti, David Raposo, Adam Santoro, Ryan Faulkner, et al. Relational inductive biases, deep learning, and graph networks. arXiv preprint arXiv:1806.01261, 2018.

[2] Justin Gilmer, Samuel S. Schoenholz, Patrick F. Riley, Oriol Vinyals, and George E. Dahl. Neural message passing for quantum chemistry. In Doina Precup and Yee Whye Teh, editors, Proceedings of the 34th International Conference on Machine Learning, volume 70 of Proceedings of Machine Learning Research, pages 1263–1272, 2017.

[3] Julian Besag. On the statistical analysis of dirty pictures. Journal of the Royal Statistical Society. Series B (Methodological), pages 259–302, 1986.

[4] Robert A Hummel and Steven W Zucker. On the foundations of relaxation labeling processes. IEEE Transactions on Pattern Analysis and Machine Intelligence, (3):267–287, 1983.

[5] Stuart Geman and Donald Geman.
Stochastic relaxation, Gibbs distributions, and the Bayesian restoration of images. IEEE Transactions on Pattern Analysis and Machine Intelligence, (6):721–741, 1984.

[6] Stuart Geman and Christine Graffigne. Markov random field image models and their applications to computer vision. In Proceedings of the International Congress of Mathematicians, volume 1, page 2. Berkeley, CA, 1986.

[7] John Lafferty, Andrew McCallum, and Fernando CN Pereira. Conditional random fields: Probabilistic models for segmenting and labeling sequence data. In Proceedings of the 18th International Conference on Machine Learning, 2001.

[8] Sanjiv Kumar and Martial Hebert. Discriminative random fields. International Journal of Computer Vision, 68(2):179–201, 2006.

[9] Judea Pearl. Probabilistic reasoning in intelligent systems: networks of plausible inference. Elsevier, 2014.

[10] Pradeep Ravikumar and John Lafferty. Quadratic programming relaxations for metric labeling and Markov random field MAP estimation. In Proceedings of the 23rd International Conference on Machine Learning, pages 737–744. ACM, 2006.

[11] Satu Elisa Schaeffer. Graph clustering. Computer Science Review, 1(1):27–64, 2007.

[12] Marius Leordeanu, Rahul Sukthankar, and Martial Hebert. Unsupervised learning for graph matching. International Journal of Computer Vision, 96(1):28–45, 2012.

[13] Andrew Y Ng, Michael I Jordan, and Yair Weiss. On spectral clustering: Analysis and an algorithm. In Advances in Neural Information Processing Systems, pages 849–856, 2002.

[14] Joan Bruna, Wojciech Zaremba, Arthur Szlam, and Yann LeCun. Spectral networks and locally connected networks on graphs. CoRR, abs/1312.6203, 2013.

[15] Mikael Henaff, Joan Bruna, and Yann LeCun. Deep convolutional networks on graph-structured data. CoRR, abs/1506.05163, 2015.

[16] Michaël Defferrard, Xavier Bresson, and Pierre Vandergheynst.
Convolutional neural networks on graphs with fast localized spectral filtering. In Advances in Neural Information Processing Systems, pages 3844–3852, 2016.

[17] Thomas N. Kipf and Max Welling. Semi-supervised classification with graph convolutional networks. In International Conference on Learning Representations (ICLR), 2017.

[18] David K Duvenaud, Dougal Maclaurin, Jorge Iparraguirre, Rafael Bombarell, Timothy Hirzel, Alán Aspuru-Guzik, and Ryan P Adams. Convolutional networks on graphs for learning molecular fingerprints. In Advances in Neural Information Processing Systems, pages 2224–2232, 2015.

[19] Peter Battaglia, Razvan Pascanu, Matthew Lai, Danilo Jimenez Rezende, et al. Interaction networks for learning about objects, relations and physics. In Advances in Neural Information Processing Systems, pages 4502–4510, 2016.

[20] Keyulu Xu, Weihua Hu, Jure Leskovec, and Stefanie Jegelka. How powerful are graph neural networks? In International Conference on Learning Representations, 2019.

[21] Petar Veličković, Guillem Cucurull, Arantxa Casanova, Adriana Romero, Pietro Liò, and Yoshua Bengio. Graph attention networks. In International Conference on Learning Representations, 2018.

[22] Yujia Li, Daniel Tarlow, Marc Brockschmidt, and Richard Zemel. Gated graph sequence neural networks. International Conference on Learning Representations (ICLR), 2016.

[23] Ashesh Jain, Amir R Zamir, Silvio Savarese, and Ashutosh Saxena. Structural-RNN: Deep learning on spatio-temporal graphs. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 5308–5317, 2016.

[24] Mostafa Dehghani, Stephan Gouws, Oriol Vinyals, Jakob Uszkoreit, and Lukasz Kaiser. Universal transformers.
In International Conference on Learning Representations, 2019.

[25] Adam Santoro, Ryan Faulkner, David Raposo, Jack Rae, Mike Chrzanowski, Theophane Weber, Daan Wierstra, Oriol Vinyals, Razvan Pascanu, and Timothy Lillicrap. Relational recurrent neural networks. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 7310–7321. Curran Associates, Inc., 2018.

[26] Pedro F Felzenszwalb and Daniel P Huttenlocher. Pictorial structures for object recognition. International Journal of Computer Vision, 61(1):55–79, 2005.

[27] Svetlana Lazebnik, Cordelia Schmid, and Jean Ponce. Beyond bags of features: Spatial pyramid matching for recognizing natural scene categories. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2169–2178. IEEE, 2006.

[28] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Spatial pyramid pooling in deep convolutional networks for visual recognition. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(9):1904–1916, 2015.

[29] Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Advances in Neural Information Processing Systems, pages 5998–6008, 2017.

[30] Sepp Hochreiter and Jürgen Schmidhuber. Long short-term memory. Neural Computation, 9(8):1735–1780, 1997.

[31] Adam Santoro, David Raposo, David G Barrett, Mateusz Malinowski, Razvan Pascanu, Peter Battaglia, and Timothy Lillicrap. A simple neural network module for relational reasoning. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 4967–4976. Curran Associates, Inc., 2017.

[32] Xiaolong Wang, Ross Girshick, Abhinav Gupta, and Kaiming He.
Non-local neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), volume 1, page 4, 2018.

[33] Xiaolong Wang and Abhinav Gupta. Videos as space-time region graphs. In Proceedings of the European Conference on Computer Vision (ECCV), pages 399–417, 2018.

[34] Pallabi Ghosh, Yi Yao, Larry S. Davis, and Ajay Divakaran. Stacked spatio-temporal graph convolutional networks for action segmentation. ArXiv, abs/1811.10575, 2018.

[35] Yao-Hung Hubert Tsai, Santosh Divvala, Louis-Philippe Morency, Ruslan Salakhutdinov, and Ali Farhadi. Video relationship reasoning using gated spatio-temporal energy graph. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 10424–10433, 2019.

[36] Sijie Yan, Yuanjun Xiong, and Dahua Lin. Spatial temporal graph convolutional networks for skeleton-based action recognition. In Thirty-Second AAAI Conference on Artificial Intelligence, 2018.

[37] Fabien Baradel, Natalia Neverova, Christian Wolf, Julien Mille, and Greg Mori. Object level visual reasoning in videos. In ECCV, June 2018.

[38] Yunpeng Chen, Yannis Kalantidis, Jianshu Li, Shuicheng Yan, and Jiashi Feng. A^2-Nets: Double attention networks. In Advances in Neural Information Processing Systems, pages 350–359, 2018.

[39] Christian Szegedy, Vincent Vanhoucke, Sergey Ioffe, Jon Shlens, and Zbigniew Wojna. Rethinking the inception architecture for computer vision. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2818–2826, 2016.

[40] François Chollet. Xception: Deep learning with depthwise separable convolutions. arXiv preprint arXiv:1610.02357, 2017.

[41] Lin Sun, Kui Jia, Dit-Yan Yeung, and Bertram E Shi. Human action recognition using factorized spatio-temporal convolutional networks.
In Proceedings of the IEEE International Conference on Computer Vision, pages 4597–4605, 2015.

[42] Saining Xie, Chen Sun, Jonathan Huang, Zhuowen Tu, and Kevin Murphy. Rethinking spatiotemporal feature learning: Speed-accuracy trade-offs in video classification. In Proceedings of the European Conference on Computer Vision (ECCV), pages 305–321, 2018.

[43] Du Tran, Heng Wang, Lorenzo Torresani, Jamie Ray, Yann LeCun, and Manohar Paluri. A closer look at spatiotemporal convolutions for action recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 6450–6459, 2018.

[44] Andrej Karpathy, George Toderici, Sanketh Shetty, Thomas Leung, Rahul Sukthankar, and Li Fei-Fei. Large-scale video classification with convolutional neural networks. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 1725–1732, 2014.

[45] Joe Yue-Hei Ng, Matthew Hausknecht, Sudheendra Vijayanarasimhan, Oriol Vinyals, Rajat Monga, and George Toderici. Beyond short snippets: Deep networks for video classification. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 4694–4702, 2015.

[46] Jeffrey Donahue, Lisa Anne Hendricks, Sergio Guadarrama, Marcus Rohrbach, Subhashini Venugopalan, Kate Saenko, and Trevor Darrell. Long-term recurrent convolutional networks for visual recognition and description. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 2625–2634, 2015.

[47] Bolei Zhou, Alex Andonian, Aude Oliva, and Antonio Torralba. Temporal relational reasoning in videos. In Proceedings of the European Conference on Computer Vision (ECCV), pages 803–818, 2018.

[48] Joao Carreira and Andrew Zisserman. Quo vadis, action recognition? A new model and the Kinetics dataset. In Computer Vision and Pattern Recognition (CVPR), 2017 IEEE Conference on, pages 4724–4733.
IEEE, 2017.

[49] Karen Simonyan and Andrew Zisserman. Two-stream convolutional networks for action recognition in videos. In Advances in Neural Information Processing Systems, pages 568–576, 2014.

[50] Yue Zhao, Yuanjun Xiong, and Dahua Lin. Trajectory convolution for action recognition. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 2204–2215. Curran Associates, Inc., 2018.

[51] Xingjian Shi, Zhourong Chen, Hao Wang, Dit-Yan Yeung, Wai-Kin Wong, and Wang-chun Woo. Convolutional LSTM network: A machine learning approach for precipitation nowcasting. In NIPS, 2015.

[52] Yunbo Wang, Mingsheng Long, Jianmin Wang, Zhifeng Gao, and Philip S. Yu. PredRNN: Recurrent neural networks for predictive learning using spatiotemporal LSTMs. In NIPS, 2017.

[53] Yunbo Wang, Lu Jiang, Ming-Hsuan Yang, Li-Jia Li, Mingsheng Long, and Li Fei-Fei. Eidetic 3D LSTM: A model for video prediction and beyond. In International Conference on Learning Representations, 2019.

[54] Raghav Goyal, Samira Ebrahimi Kahou, Vincent Michalski, Joanna Materzynska, Susanne Westphal, Heuna Kim, Valentin Haenel, Ingo Fruend, Peter Yianilos, Moritz Mueller-Freitag, et al. The "something something" video database for learning and evaluating visual common sense. In ICCV, volume 1, page 3, 2017.

[55] Khurram Soomro, Amir Roshan Zamir, and Mubarak Shah. UCF101: A dataset of 101 human actions classes from videos in the wild. arXiv preprint arXiv:1212.0402, 2012.

[56] Hildegard Kuehne, Hueihan Jhuang, Estíbaliz Garrote, Tomaso Poggio, and Thomas Serre. HMDB: A large video database for human motion recognition. In 2011 International Conference on Computer Vision, pages 2556–2563. IEEE, 2011.

[57] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition.
In 2016 IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 770–778, 2016.

[58] Martín Abadi, Ashish Agarwal, Paul Barham, Eugene Brevdo, Zhifeng Chen, Craig Citro, Greg S. Corrado, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Ian Goodfellow, Andrew Harp, Geoffrey Irving, Michael Isard, Yangqing Jia, Rafal Jozefowicz, Lukasz Kaiser, Manjunath Kudlur, Josh Levenberg, Dandelion Mané, Rajat Monga, Sherry Moore, Derek Murray, Chris Olah, Mike Schuster, Jonathon Shlens, Benoit Steiner, Ilya Sutskever, Kunal Talwar, Paul Tucker, Vincent Vanhoucke, Vijay Vasudevan, Fernanda Viégas, Oriol Vinyals, Pete Warden, Martin Wattenberg, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: Large-scale machine learning on heterogeneous systems, 2015. Software available from tensorflow.org.

[59] Myunggi Lee, Seungeui Lee, Sung Joon Son, Gyutae Park, and Nojun Kwak. Motion feature network: Fixed motion filter for action recognition. In ECCV, 2018.

[60] Mohammadreza Zolfaghari, Kamaljeet Singh, and Thomas Brox. ECO: Efficient convolutional network for online video understanding. In Proceedings of the European Conference on Computer Vision (ECCV), pages 695–712, 2018.