{"title": "Integrated perception with recurrent multi-task neural networks", "book": "Advances in Neural Information Processing Systems", "page_first": 235, "page_last": 243, "abstract": "Modern discriminative predictors have been shown to match natural intelligences in specific perceptual tasks in image classification, object and part detection, boundary extraction, etc. However, a major advantage that natural intelligences still have is that they work well for all perceptual problems together, solving them efficiently and coherently in an integrated manner. In order to capture some of these advantages in machine perception, we ask two questions: whether deep neural networks can learn universal image representations, useful not only for a single task but for all of them, and how the solutions to the different tasks can be integrated in this framework. We answer by proposing a new architecture, which we call multinet, in which not only deep image features are shared between tasks, but where tasks can interact in a recurrent manner by encoding the results of their analysis in a common shared representation of the data. In this manner, we show that the performance of individual tasks in standard benchmarks can be improved first by sharing features between them and then, more significantly, by integrating their solutions in the common representation.", "full_text": "Integrated Perception with Recurrent Multi-Task Neural Networks\n\nHakan Bilen\n\nAndrea Vedaldi\n\nVisual Geometry Group, University of Oxford\n\n{hbilen,vedaldi}@robots.ox.ac.uk\n\nAbstract\n\nModern discriminative predictors have been shown to match natural intelligences in specific perceptual tasks in image classification, object and part detection, boundary extraction, etc. However, a major advantage that natural intelligences still have is that they work well for all perceptual problems together, solving them efficiently and coherently in an integrated manner. 
In order to capture some of these advantages in machine perception, we ask two questions: whether deep neural networks can learn universal image representations, useful not only for a single task but for all of them, and how the solutions to the different tasks can be integrated in this framework. We answer by proposing a new architecture, which we call multinet, in which not only deep image features are shared between tasks, but where tasks can interact in a recurrent manner by encoding the results of their analysis in a common shared representation of the data. In this manner, we show that the performance of individual tasks in standard benchmarks can be improved first by sharing features between them and then, more significantly, by integrating their solutions in the common representation.\n\n1 Introduction\n\nNatural perception can extract complete interpretations of sensory data in a coherent and efficient manner. By contrast, machine perception remains a collection of disjoint algorithms, each solving specific information extraction sub-problems. Recent advances such as modern convolutional neural networks have dramatically improved the performance of machines in individual perceptual tasks, but it remains unclear how these could be integrated in the same seamless way as natural perception does.\n\nIn this paper, we consider the problem of learning data representations for integrated perception. The first question we ask is whether it is possible to learn universal data representations that can be used to solve all sub-problems of interest. 
In computer vision, fine-tuning or retraining has been shown to be an effective method to transfer deep convolutional networks between different tasks [9, 29]. Here we show that, in fact, it is possible to learn a single, shared representation that performs well on several sub-problems simultaneously, often as well as or even better than specialised ones.\n\nA second question, complementary to the one of feature sharing, is how different perceptual subtasks should be combined. Since each subtask extracts a partial interpretation of the data, the problem is to form a coherent picture of the data as a whole. We consider an incremental interpretation scenario, where subtasks collaborate in parallel or sequentially in order to gradually enrich a shared interpretation of the data, each contributing its own \u201cdimension\u201d to it. Informally, many computer vision systems operate in this stratified manner, with different modules running in parallel or in sequence (e.g. object detection followed by instance segmentation). The question is how this can be done end-to-end and systematically.\n\nIn this paper, we develop an architecture, multinet (fig. 1), that provides an answer to such questions. Multinet builds on the idea of a shared representation, called an integration space, which reflects both the statistics extracted from the data as well as the result of the analysis carried out by the individual subtasks. As a loose metaphor, one can think of the integration space as a \u201ccanvas\u201d which is progressively updated with the information obtained by solving sub-problems. The representation distills this information and makes it available for further task resolution, in a recurrent configuration.\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nFigure 1: Multinet. We propose a modular multi-task architecture in which several perceptual tasks are integrated in a synergistic manner. The subnetwork $\\phi^0_{\\mathrm{enc}}$ encodes the data $x^0$ (an image in the example) producing a representation $h$ shared between $K$ different tasks. Each task estimates one of $K$ different labels $x^\\alpha$ (object class, location, and parts in the example) using $K$ decoder functions $\\psi^\\alpha_{\\mathrm{dec}}$. Each task contributes back to the shared representation by means of a corresponding encoder function $\\phi^\\alpha_{\\mathrm{enc}}$. The loop is closed in a recurrent configuration by means of suitable integrator functions (not shown here to avoid cluttering the diagram).\n\nMultinet has several advantages. First, by learning the latent integration space automatically, synergies between tasks can be discovered automatically. Second, tasks are treated in a symmetric manner, by associating to each of them encoder, decoder, and integrator functions, making the system modular and easily extensible to new tasks. Third, the architecture supports incremental understanding because tasks contribute back to the latent representation, making their output available to other tasks for further processing. Finally, while multinet is applied here to an image understanding setting, the architecture is very general and could be applied to numerous other domains as well.\n\nThe new architecture is described in detail in sect. 2 and an instance specialized for computer vision applications is given in sect. 3. The empirical evaluation in sect. 4 demonstrates the benefits of the approach, including that sharing features between different tasks is not only economical, but also sometimes better for accuracy, and that integrating the outputs of different tasks in the shared representation yields further accuracy improvements. Sect. 
5 summarizes our findings.\n\n1.1 Related work\n\nMultiple task learning (MTL): Multitask learning [5, 25, 1] methods have been studied for over two decades by the machine learning community. The methods are based on the key idea that the tasks share a common low-dimensional representation which is jointly learnt with the task-specific parameters. While MTL trains many tasks in parallel, Mitchell and Thrun [18] propose a sequential transfer method called Explanation-Based Neural Nets (EBNN) which exploits previously learnt domain knowledge to initialise or constrain the parameters of the current task. Breiman and Friedman [3] devise a hybrid method that first learns separate models and then improves their generalisation by exploiting the correlation between the predictions.\n\nMulti-task learning in computer vision: MTL has been shown to improve results in many computer vision problems. Typically, researchers incorporate auxiliary tasks into their target tasks, jointly train them in parallel and achieve performance gains in object tracking [30], object detection [11], and facial landmark detection [31]. Differently, Dai et al. [8] propose multi-task network cascades in which convolutional layer parameters are shared between three tasks and the tasks are predicted sequentially. Unlike [8], our method can train multiple tasks in parallel and does not require a specification of task execution.\n\nRecurrent networks: Our work is also related to recurrent neural networks (RNN) [22], which have been successfully used in language modelling [17], speech recognition [13], handwriting recognition [12], semantic image segmentation [20] and human pose estimation [2]. Related to our work, Carreira et al. [4] propose an iterative segmentation model that progressively updates an initial solution by feeding back an error signal. Najibi et al. [19] propose an efficient grid based object detector that iteratively refines the predicted object coordinates by minimising the training error. While these methods [4, 19] are also based on an iterative solution correcting mechanism, our main goal is to improve generalisation performance for multiple tasks by sharing the previous predictions across them and learning output correlations.\n\nFigure 2: Multinet recurrent architecture. The components in the rounded box are repeated $K$ times, one for each task $\\alpha = 1, \\dots, K$.\n\n2 Method\n\nIn this section, we first introduce the multinet architecture for integrated multi-task prediction (sect. 2.1) and then we discuss ordinary multi-task prediction as a special case of multinet (sect. 2.2).\n\n2.1 Multinet: integrated multiple-task prediction\n\nWe propose a recurrent neural network architecture (fig. 1 and 2) that can address simultaneously multiple data labelling tasks. For symmetry, we drop the usual distinction between input and output spaces and consider instead $K + 1$ label spaces $\\mathcal{X}^\\alpha$, $\\alpha = 0, 1, \\dots, K$. A label in the $\\alpha$-th space is denoted by the symbol $x^\\alpha \\in \\mathcal{X}^\\alpha$. In the following, $\\alpha = 0$ is used for the input (e.g. an image) of the network and is not inferred, whereas $x^1, \\dots, x^K$ are labels estimated by the neural network (e.g. an object class, location, and parts). One reason why it is useful to keep the notation symmetric is that it is possible to ground any label $x^\\alpha$ and treat it as an input instead.\n\nEach task $\\alpha$ is associated with a corresponding encoder function $\\phi^\\alpha_{\\mathrm{enc}}$, which maps the label $x^\\alpha$ to a vectorial representation $r^\\alpha \\in \\mathcal{R}^\\alpha$ given by\n\n$$r^\\alpha = \\phi^\\alpha_{\\mathrm{enc}}(x^\\alpha). \\tag{1}$$\n\nEach task has also a decoder function $\\psi^\\alpha_{\\mathrm{dec}}$ going in the other direction, from a common representation space $h \\in \\mathcal{H}$ to the label $x^\\alpha$:\n\n$$x^\\alpha = \\psi^\\alpha_{\\mathrm{dec}}(h). \\tag{2}$$\n\nThe information $r^0, r^1, \\dots, r^K$ extracted from the data and the different tasks by the encoders is integrated in the shared representation $h$ by using an integrator function $\\Gamma$. Since this update operation is incremental, we associate to it an iteration number $t = 0, 1, 2, \\dots$. By doing so, the update equation can be written as\n\n$$h_{t+1} = \\Gamma(h_t, r^0, r^1_t, \\dots, r^K_t). \\tag{3}$$\n\nNote that, in the equation above, $r^0$ is constant as the corresponding variable $x^0$ is the input of the network, which is grounded and not updated.\n\nOverall, a task $\\alpha$ is specified by the triplet $\\mathcal{T}^\\alpha = (\\mathcal{X}^\\alpha, \\phi^\\alpha_{\\mathrm{enc}}, \\psi^\\alpha_{\\mathrm{dec}})$ and by its contribution to the update rule (3). Full task modularity can be achieved by decomposing the integrator function as a sequence of task-specific updates $h_{t+1} = \\Gamma^K(\\cdot, r^K_t) \\circ \\cdots \\circ \\Gamma^1(h_t, r^1_t)$, such that each task is a quadruplet $(\\mathcal{X}^\\alpha, \\phi^\\alpha_{\\mathrm{enc}}, \\psi^\\alpha_{\\mathrm{dec}}, \\Gamma^\\alpha)$, but this option is not investigated further here.\n\nGiven tasks $\\mathcal{T}^\\alpha$, $\\alpha = 1, \\dots, K$, several variants of the recurrent architecture are possible. A natural one is to process tasks sequentially, but this has the added complication of having to choose a particular order and may in any case be suboptimal; instead, we propose to update all the tasks at each recurrent iteration, as follows:\n\n$t = 0$: Ordinary multi-task prediction. At the first iteration, the measurement $x^0$ is acquired and the shared representation $h$ is initialized as $h_0 = \\phi^0_{\\mathrm{enc}}(x^0) = \\Gamma(\\ast, r^0, \\ast, \\dots, \\ast)$. The symbol $\\ast$ denotes the initial value of a variable (often zero in practice). Given $h_0$, the output $x^\\alpha_0 = \\psi^\\alpha_{\\mathrm{dec}}(h_0) = (\\psi^\\alpha_{\\mathrm{dec}} \\circ \\phi^0_{\\mathrm{enc}})(x^0)$ for each task is computed. This step corresponds to ordinary multi-task prediction, as discussed later (sect. 2.2).\n\n$t > 0$: Iterative updates. Each task $\\alpha = 1, \\dots, K$ is re-encoded using equations $r^\\alpha_t = \\phi^\\alpha_{\\mathrm{enc}}(x^\\alpha_t)$, the shared representation is updated using $h_{t+1} = \\Gamma(h_t, r^0, r^1_t, \\dots, r^K_t)$, and the labels are predicted again using $x^\\alpha_{t+1} = \\psi^\\alpha_{\\mathrm{dec}}(h_{t+1})$.\n\nThe idea of feeding back the network output for further processing appears in several existing recurrent architectures [16, 24]; however, in these cases it is used to process sequential data, passing back the output obtained from the last processed element in the sequence; here, instead, the feedback is used to integrate different and complementary labelling tasks. Our model is also reminiscent of encoder/decoder architectures [15, 21, 28]; however, in our case the encoder and decoder functions are associated to the output labels rather than to the input data.\n\n2.2 Ordinary multi-task learning\n\nOrdinarily, multiple-task learning [5, 25, 1] is based on sharing features or parameters between different tasks. Multinet reduces to ordinary multi-task learning when there is no recurrence. At the first iteration $t = 0$, in fact, multinet simply evaluates $K$ predictor functions $\\psi^1_{\\mathrm{dec}} \\circ \\phi^0_{\\mathrm{enc}}, \\dots, \\psi^K_{\\mathrm{dec}} \\circ \\phi^0_{\\mathrm{enc}}$, one for each task, which share the common subnetwork $\\phi^0_{\\mathrm{enc}}$.\n\nWhile multi-task learning from representation sharing is conceptually simple, it is practically important because it allows learning a universal representation function $\\phi^0_{\\mathrm{enc}}$ which works well for all tasks simultaneously. The possibility of learning such a polyvalent representation, which can only be verified empirically, is a non-trivial and useful fact. In particular, in our experiments in image understanding (sect. 
4), we will see that, for certain image analysis tasks, it is not only possible and efficient to learn such a shared representation, but that in some cases feature sharing can even improve the performance in the individual sub-problems.\n\n3 A multinet for classification, localization, and part detection\n\nIn this section we instantiate multinet for three complementary tasks in computer vision: object classification, object detection, and part detection. The main advantage of multinet compared to ordinary multi-task prediction is that, while sharing parameters across related tasks may improve generalization [5], it is not enough to capture correlations in the task input spaces. For example, in our computer vision application ordinary multi-task prediction would not be able to ensure that the detected parts are contained within a detected object. Multinet can instead capture interactions between the different labels and potentially learn to enforce such constraints. The latter is done in a soft and distributed manner, by integrating back the output of the individual tasks in the shared representation.\n\nNext, we discuss in some detail the specific architecture components used in our application. As a starting point we consider a standard CNN for image classification. While more powerful networks exist, we choose here a well-performing model which is at the same time reasonably efficient to train and evaluate, namely the VGG-M-1024 network of [6]. This model is pre-trained for image classification from the ImageNet ILSVRC 2012 data [23] and was extended in [11] to object detection; here we follow such blueprints, and in particular the Fast R-CNN method of [11], to design the subnetworks for the three tasks. 
These components are described in some detail below, first focusing on the components corresponding to ordinary multi-task prediction, and then moving to the ones used for multiple task integration.\n\nOrdinary multiple-task components. The first several layers of the VGG-M network can be grouped in five convolutional sections, each comprising linear convolution, a non-linear activation function and, in some cases, max pooling and normalization. These are followed by three fully-connected sections, which are the same as the convolutional ones, but with filter support of the same size as the corresponding input. The last layer is softmax and computes a posterior probability vector over the 1,000 ImageNet ILSVRC classes.\n\nVGG-M is adapted for the different tasks as follows. For clarity, we use symbolic names for the tasks rather than numeric indexes, and consider $\\alpha \\in \\{\\mathrm{img}, \\mathrm{cls}, \\mathrm{det}, \\mathrm{part}\\}$ instead of $\\alpha \\in \\{0, 1, 2, 3\\}$. The five convolutional sections of VGG-M are used as the image encoder $\\phi^{\\mathrm{img}}_{\\mathrm{enc}}$ and hence compute the initial value $h_0$ of the shared representation. Cutting VGG-M at the level of the last convolutional layer is motivated by the fact that the fully-connected layers remove or at least dramatically blur spatial information, whereas we would like to preserve it for object and part localization. Hence, the shared representation is a tensor $h \\in \\mathbb{R}^{H \\times W \\times C}$, where $H \\times W$ are the spatial dimensions and $C$ is the number of feature channels as determined by the VGG-M configuration (see sect. 4).\n\nNext, $\\phi^{\\mathrm{img}}_{\\mathrm{enc}}$ is branched off in three directions, choosing a decoder $\\psi^\\alpha_{\\mathrm{dec}}$ for each task: image classification ($\\alpha = \\mathrm{cls}$), object detection ($\\alpha = \\mathrm{det}$), and part detection ($\\alpha = \\mathrm{part}$). For the image classification branch, we choose $\\psi^{\\mathrm{cls}}_{\\mathrm{dec}}$ as the rest of the original VGG-M network for image classification. 
In other words, the decoder function $\\psi^{\\mathrm{cls}}_{\\mathrm{dec}}$ for the image-level labels is initialized to be the same as the fully-connected layers of the original VGG-M, such that $\\phi^{\\mathrm{VGG\\text{-}M}}_{\\mathrm{enc}} = \\psi^{\\mathrm{cls}}_{\\mathrm{dec}} \\circ \\phi^{\\mathrm{img}}_{\\mathrm{enc}}$. There are however two differences. The first is that the last fully-connected layer is reshaped and reinitialized randomly to predict a different number $C^{\\mathrm{cls}}$ of possible objects instead of the 1,000 ImageNet classes. The second difference is that the final output is a vector of binary probabilities obtained using sigmoid instead of a softmax.\n\nThe object and part detection decoders are instead based on the Fast R-CNN architecture [11], and classify individual image regions as belonging to one of the object classes (part types) or background. To do so, the Selective Search Windows (SSW) method [26] is used to generate a shortlist of $M$ region (bounding box) proposals $B(x^{\\mathrm{img}}) = \\{b_1, \\dots, b_M\\}$ from image $x^{\\mathrm{img}}$; this set is passed to the spatial pyramid pooling (SPP) layer [14, 11] $\\psi^{\\mathrm{SPP}}_{\\mathrm{dec}}(h, B(x^{\\mathrm{img}}))$, which extracts subsets of the feature map $h$ in correspondence of each region using max pooling. The object detection decoder (and similarly for the part detector) is then given by $\\psi^{\\mathrm{det}}_{\\mathrm{dec}}(h) = \\psi^{\\mathrm{dec}}_{\\mathrm{det}}(\\psi^{\\mathrm{SPP}}_{\\mathrm{dec}}(h, B(x^{\\mathrm{img}})))$ where $\\psi^{\\mathrm{dec}}_{\\mathrm{det}}$ contains fully connected layers initialized in the same manner as the classification decoder above (hence, before training one also has $\\phi^{\\mathrm{VGG\\text{-}M}}_{\\mathrm{enc}} = \\psi^{\\mathrm{dec}}_{\\mathrm{det}} \\circ \\phi^{\\mathrm{img}}_{\\mathrm{enc}}$). The exception is once more the last layer, reshaped and reinitialized as needed, whereas softmax is still used as regions can have only one class.\n\nSo far, we have described the image encoder $\\phi^{\\mathrm{img}}_{\\mathrm{enc}}$ and the decoder branches $\\psi^{\\mathrm{cls}}_{\\mathrm{dec}}$, $\\psi^{\\mathrm{det}}_{\\mathrm{dec}}$ and $\\psi^{\\mathrm{part}}_{\\mathrm{dec}}$ for the three tasks. Such components are sufficient for ordinary multi-task learning, corresponding to the initial multinet iteration. Next, we specify the components that allow multinet to be iterated several times.\n\nRecurrent components: integrating multiple tasks. For task integration, we need to construct the encoder functions $\\phi^{\\mathrm{cls}}_{\\mathrm{enc}}$, $\\phi^{\\mathrm{det}}_{\\mathrm{enc}}$ and $\\phi^{\\mathrm{part}}_{\\mathrm{enc}}$ for each task as well as the integrator function $\\Gamma$. While several constructions are possible, here we experiment with simple ones.\n\nIn order to encode the image label $x^{\\mathrm{cls}}$, the encoder $r^{\\mathrm{cls}} = \\phi^{\\mathrm{cls}}_{\\mathrm{enc}}(x^{\\mathrm{cls}})$ takes the vector of $C^{\\mathrm{cls}}$ binary probabilities $x^{\\mathrm{cls}} \\in \\mathbb{R}^{C^{\\mathrm{cls}}}$, one for each of the $C^{\\mathrm{cls}}$ possible object classes, and broadcasts the corresponding values to all $H \\times W$ spatial locations $(u, v)$ in $h$. Formally $r^{\\mathrm{cls}} \\in \\mathbb{R}^{H \\times W \\times C^{\\mathrm{cls}}}$ and\n\n$$\\forall u, v, c : \\quad r^{\\mathrm{cls}}_{uvc} = x^{\\mathrm{cls}}_c.$$\n\nEncoding the object detection label $x^{\\mathrm{det}}$ is similar, but reflects the geometric information captured by such labels. In particular, each bounding box $b_m$ of the $M$ extracted by SSW is associated to a vector of $C^{\\mathrm{cls}} + 1$ probabilities (one for each object class plus one more for background) $x^{\\mathrm{det}}_m \\in \\mathbb{R}^{C^{\\mathrm{cls}} + 1}$. This is encoded as a heat map $r^{\\mathrm{det}} \\in \\mathbb{R}^{H \\times W \\times (C^{\\mathrm{cls}} + 1)}$ by max pooling across boxes:\n\n$$\\forall u, v, c : \\quad r^{\\mathrm{det}}_{uvc} = \\max\\big(\\{x^{\\mathrm{det}}_{mc},\\ \\forall m : (u, v) \\in b_m\\} \\cup \\{0\\}\\big).$$\n\nThe part label $x^{\\mathrm{part}}$ is encoded in an entirely analogous manner.\n\nLastly, we need to construct the integrator function $\\Gamma$. We experiment with two simple designs. The first one simply stacks evidence from the different sources: $h = \\mathrm{stack}(r^{\\mathrm{img}}, r^{\\mathrm{cls}}, r^{\\mathrm{det}}, r^{\\mathrm{part}})$. 
Then the update equation is given by\n\n$$h_t = \\Gamma(h_{t-1}, r^{\\mathrm{img}}, r^{\\mathrm{cls}}_t, r^{\\mathrm{det}}_t, r^{\\mathrm{part}}_t) = \\mathrm{stack}(r^{\\mathrm{img}}, r^{\\mathrm{cls}}_t, r^{\\mathrm{det}}_t, r^{\\mathrm{part}}_t). \\tag{4}$$\n\nNote that this formulation requires modifying the first fully-connected layers of each decoder $\\hat\\psi^{\\mathrm{cls}}_{\\mathrm{dec}}$, $\\hat\\psi^{\\mathrm{det}}_{\\mathrm{dec}}$ and $\\hat\\psi^{\\mathrm{part}}_{\\mathrm{dec}}$ as the shared representation $h$ has now $C + 2C^{\\mathrm{cls}} + C^{\\mathrm{part}} + 2$ channels instead of just $C$ as for the original VGG-M architecture. This is done by initializing randomly additional dimensions in the linear maps.\n\nFigure 3: Illustration of the multinet instantiation tackling three computer vision problems: image classification, object detection, and part detection.\n\nWe also experiment with a second update equation\n\n$$h_t = \\Gamma(h_{t-1}, r^{\\mathrm{img}}, r^{\\mathrm{cls}}_t, r^{\\mathrm{det}}_t, r^{\\mathrm{part}}_t) = \\mathrm{ReLU}(A \\ast \\mathrm{stack}(h_{t-1}, r^{\\mathrm{img}}, r^{\\mathrm{cls}}_t, r^{\\mathrm{det}}_t, r^{\\mathrm{part}}_t)) \\tag{5}$$\n\nwhere $A \\in \\mathbb{R}^{1 \\times 1 \\times (2C + 2C^{\\mathrm{cls}} + C^{\\mathrm{part}} + 2) \\times C}$ is a filter bank whose purpose is to reduce the stacked representation back to the original $C$ channels. This is a useful design as it maintains the same representation dimensionality regardless of the number of tasks added. However, due to the compression, it may perform less well.\n\n4 Experiments\n\n4.1 Implementation details and training\n\nThe image encoder $\\phi^{\\mathrm{img}}_{\\mathrm{enc}}$ is initialized from the pre-trained VGG-M model using sections conv1 to conv5. If the input to the network is an RGB image $x^{\\mathrm{img}} \\in \\mathbb{R}^{H^{\\mathrm{img}} \\times W^{\\mathrm{img}} \\times 3}$, then, due to downsampling, the dimensions $H \\times W \\times C$ of $r^{\\mathrm{img}} = \\phi^{\\mathrm{img}}_{\\mathrm{enc}}(x^{\\mathrm{img}})$ satisfy $H \\approx H^{\\mathrm{img}}/16$ and $W \\approx W^{\\mathrm{img}}/16$. The number of feature channels is $C = 512$. As noted above, the decoders contain respectively subnetworks $\\psi^{\\mathrm{dec}}_{\\mathrm{cls}}$, $\\psi^{\\mathrm{dec}}_{\\mathrm{det}}$, and $\\psi^{\\mathrm{dec}}_{\\mathrm{part}}$ comprising layers fc6 and fc7 from VGG-M, followed by a randomly-initialized linear predictor with output dimension equal to, respectively, $C^{\\mathrm{cls}}$, $C^{\\mathrm{cls}} + 1$, and $C^{\\mathrm{part}} + 1$. Max pooling in SPP is performed in a grid of 6 \u00d7 6 spatial bins as in [14, 11]. The task encoders $\\phi^{\\mathrm{cls}}_{\\mathrm{enc}}$, $\\phi^{\\mathrm{det}}_{\\mathrm{enc}}$, $\\phi^{\\mathrm{part}}_{\\mathrm{enc}}$ are given in sect. 3 and contain no parameters.\n\nFor training, each task is associated with a corresponding loss function. For the classification task, the objective is to minimize the sum of negative posterior log-probabilities of whether the image contains a certain object type or not (this allows different objects to be present in a single image). Combined with the fact that the classification branch uses sigmoid, this is the same as binary logistic regression. For the object and part detection tasks, decoders are optimized to classify the target regions as one of the $C^{\\mathrm{cls}}$ or $C^{\\mathrm{part}}$ classes or background (unlike image-level labels, classes in region-level labels are mutually exclusive). Furthermore, we also train a branch performing bounding box refinement to improve the fit of the selective search region as proposed by [11].\n\nThe fully connected layers used for softmax classification and bounding-box regression in object and part detection tasks are initialized from zero-mean Gaussian distributions with 0.01 and 0.001 standard deviations respectively. The fully connected layers used for the object classification task and the adaptation layer $A$ (see eq. 5) are initialized with a zero-mean Gaussian with 0.01 standard deviation. All layers use a learning rate of 1 for filters and 2 for biases. We used SGD to optimize the parameters with a learning rate of 0.001 for 6 epochs and lower it to 0.0001 for another 6 epochs. We observe that running two iterations of recursion is sufficient to reach 99% of the performance, although marginal gains are possible with more. 
We use the publicly available CNN toolbox MatConvNet [27] in our experiments.\n\n4.2 Results\n\nIn this section, we describe and discuss experimental results of our models in two benchmarks.\n\nPASCAL VOC 2010 [10] and Parts [7]: The dataset contains 4998 training and 5105 validation images for 20 object categories and ground truth bounding box annotations for target categories. We use the PASCAL-Part dataset [7], which consists of 193 annotated part categories such as aeroplane engine, bicycle back-wheel, bird left-wing, person right-upper-leg, to obtain bounding box annotations of object parts. After removing annotations that are smaller than 20 pixels on one side and the categories with fewer than 50 training samples, the number of part categories reduces to 152. The dataset provides annotations for only training and validation splits, thus we train our models in the train split and report results in the validation split for all the tasks. We follow the standard PASCAL VOC evaluation and report average precision (AP) and AP at 50% intersection-over-union (IoU) of the detected boxes with the ground truth ones for object classification and detection respectively. For the part detection, we follow [7] and report AP at a more relaxed 40% IoU threshold. The results for the tasks are reported in tab. 1.\n\nIn order to establish the first baseline, we train an independent network for each task. Each network is initialized with the VGG-M model, the last classification and regression layers are initialized with random noise and all the layers are fine-tuned for the respective task. For object and part detection, we use our implementation of Fast R-CNN [11]. Note that, for consistency between the baselines and our method, the minimum dimension of each image is scaled to be 600 pixels for all the tasks including object classification. 
An SPP layer is employed to scale the feature map to a 6 \u00d7 6 spatial size.\n\nFor the second baseline, we train a multi-task network that shares the convolutional layers across the tasks (this setting is called ordinary multi-task prediction in sect. 2.1). We observe in tab. 1 that the multi-task model performs comparably to or better than the independent networks, while being more efficient due to the shared convolutional computations. Since the training images are the same in all cases, this shows that just combining multiple labels together improves efficiency and in some cases even performance.\n\nFinally we test the full multinet model for two settings defined as update rules (1) and (2) corresponding to eq. 4 and 5 respectively. We first see that both models outperform the independent networks and the multi-task network as well. This is remarkable because our model has fewer parameters than the sum of the three independent networks and yet our best model (update 1) consistently outperforms them by roughly 1.5 points in mean AP. Furthermore, multinet improves over the ordinary multi-task prediction by exploiting the correlations in the solutions of the individual tasks. In addition, we observe that update (1) performs better than update (2), which constrains the shared representation space to 512 dimensions regardless of the number of tasks, as can be expected due to the larger capacity. Nevertheless, even with the bottleneck we observe improvements compared to ordinary multi-task prediction.\n\nWe also run a test case to verify whether multinet learns to mix information extracted by the various tasks as presumed. To do so, we check whether the predictions performed by these tasks improve further when ground truth labels are provided at test time. 
At test time we ground the classification label $r^{\\mathrm{cls}}$ in the first iteration of multinet to the ground truth class labels and we read the predictions after one iteration. As expected, the performance in the three tasks improves to 90.1, 58.9 and 39.2 respectively. This shows that the feedback on the class information has a strong effect on class prediction itself, and a more modest but nevertheless significant effect on the other tasks as well.\n\nPASCAL VOC 2007 [10]: The dataset consists of 2501 training, 2510 validation, and 5011 test images containing bounding box annotations for 20 object categories. There are no part annotations available for this dataset, thus we exclude the part detection task and run the same baselines and our best model for object classification and detection. The results are reported for the test split and depicted in tab. 2. Note that our RCNN for the individual networks obtains the same detection score as in [11]. In parallel to the former results, our method consistently outperforms both the baselines in classification and detection tasks.\n\nMethod / Task | classification | object-detection | part-detection\nIndependent | 76.4 | 55.5 | 37.3\nMulti-task | 76.2 | 57.1 | 37.2\nOurs | 77.4 | 57.5 | 38.8\nOurs (with bottleneck) | 76.8 | 57.3 | 38.5\n\nTable 1: Object classification, detection and part detection results in the PASCAL VOC 2010 validation split.\n\nMethod / Task | classification | object-detection\nIndependent | 78.7 | 59.2\nMTL | 78.9 | 60.4\nOurs | 79.8 | 61.3\n\nTable 2: Object classification and detection results in the PASCAL VOC 2007 test split.\n\n5 Conclusions\n\nIn this paper, we have presented multinet, a recurrent neural network architecture to solve multiple perceptual tasks in an efficient and coordinated manner. 
In addition to feature and parameter sharing, which is common to most multi-task learning methods, multinet combines the output of the different tasks by updating a shared representation iteratively.\n\nOur results are encouraging. First, we have shown that such architectures can successfully integrate multiple tasks by sharing a large subset of the data representation while matching or even outperforming specialised networks. Second, we have shown that the iterative update of a common representation is an effective method for sharing information between different tasks which further improves performance.\n\nAcknowledgments\n\nThis work acknowledges the support of the ERC Starting Grant Integrated and Detailed Image Understanding (EP/L024683/1).\n\nReferences\n\n[1] J. Baxter. A model of inductive bias learning. J. Artif. Intell. Res. (JAIR), 12:149\u2013198, 2000.\n\n[2] V. Belagiannis and A. Zisserman. Recurrent human pose estimation. arXiv preprint arXiv:1605.02914, 2016.\n\n[3] L. Breiman and J. H. Friedman. Predicting multivariate responses in multiple linear regression. Journal of the Royal Statistical Society: Series B (Statistical Methodology), 59(1):3\u201354, 1997.\n\n[4] J. Carreira, P. Agrawal, K. Fragkiadaki, and J. Malik. Human pose estimation with iterative error feedback. In CVPR, 2016.\n\n[5] R. Caruana. Multitask learning. Machine Learning, 28(1), 1997.\n\n[6] K. Chatfield, K. Simonyan, A. Vedaldi, and A. Zisserman. Return of the devil in the details: Delving deep into convolutional nets. In BMVC, 2014.\n\n[7] X. Chen, R. Mottaghi, X. Liu, S. Fidler, R. Urtasun, and A. L. Yuille. Detect what you can: Detecting and representing objects using holistic models and body parts. In CVPR, pages 1971\u20131978, 2014.\n\n[8] J. Dai, K. He, and J. Sun. Instance-aware semantic segmentation via multi-task network cascades. In CVPR, 2016.\n\n[9] J. Donahue, Y. Jia, O. Vinyals, J. Hoffman, N. Zhang, E. Tzeng, and T. Darrell. 
DeCAF: A deep\nconvolutional activation feature for generic visual recognition. CoRR, abs/1310.1531, 2013.\n\n[10] M. Everingham, L. Van Gool, C. K. I. Williams, J. Winn, and A. Zisserman. The PASCAL\nVisual Object Classes (VOC) challenge. IJCV, 88(2):303\u2013338, 2010.\n\n[11] R. Girshick. Fast R-CNN. In ICCV, 2015.\n\n[12] A. Graves, M. Liwicki, S. Fern\u00e1ndez, R. Bertolami, H. Bunke, and J. Schmidhuber. A novel\nconnectionist system for unconstrained handwriting recognition. Pattern Analysis and Machine\nIntelligence, IEEE Transactions on, 31(5):855\u2013868, 2009.\n\n[13] A. Graves, A. Mohamed, and G. Hinton. Speech recognition with deep recurrent neural\nnetworks. In ICASSP, pages 6645\u20136649. IEEE, 2013.\n\n[14] K. He, X. Zhang, S. Ren, and J. Sun. Spatial pyramid pooling in deep convolutional networks\nfor visual recognition. In ECCV, pages 346\u2013361, 2014.\n\n[15] G. E. Hinton and R. R. Salakhutdinov. Reducing the dimensionality of data with neural networks.\nScience, 313(5786):504\u2013507, 2006.\n\n[16] S. Hochreiter and J. Schmidhuber. Long short-term memory. Neural computation,\n9(8):1735\u20131780, 1997.\n\n[17] T. Mikolov. Statistical Language Models Based on Neural Networks. PhD thesis,\nBrno University of Technology, 2012.\n\n[18] T. M. Mitchell and S. B. Thrun. Explanation-based neural network learning for robot control.\nIn NIPS, pages 287\u2013287, 1993.\n\n[19] M. Najibi, M. Rastegari, and L. S. Davis. G-CNN: an iterative grid based object detector.\nIn CVPR, 2016.\n\n[20] P. H. O. Pinheiro and R. Collobert. Recurrent convolutional neural networks for scene parsing.\narXiv preprint arXiv:1306.2795, 2013.\n\n[21] M. A. Ranzato, F. J. Huang, Y. Boureau, and Y. LeCun. Unsupervised learning of invariant\nfeature hierarchies with applications to object recognition. In CVPR, pages 1\u20138, 2007.\n\n[22] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. 
Learning representations by back-propagating\nerrors. Cognitive modeling, 5(3):1, 1988.\n\n[23] O. Russakovsky, J. Deng, H. Su, J. Krause, S. Satheesh, S. Ma, S. Huang, A. Karpathy,\nA. Khosla, M. Bernstein, A.C. Berg, and F.F. Li. ImageNet large scale visual recognition\nchallenge. IJCV, 2015.\n\n[24] I. Sutskever, O. Vinyals, and Q. V. Le. Sequence to sequence learning with neural networks. In\nNIPS, pages 3104\u20133112, 2014.\n\n[25] S. Thrun and L. Pratt, editors. Learning to Learn. Kluwer Academic Publishers, 1998.\n\n[26] K. van de Sande, J. Uijlings, T. Gevers, and A. Smeulders. Segmentation as selective search for\nobject recognition. In ICCV, 2011.\n\n[27] A. Vedaldi and K. Lenc. MatConvNet \u2013 convolutional neural networks for MATLAB. In Proceedings\nof the ACM Int. Conf. on Multimedia, 2015.\n\n[28] P. Vincent, H. Larochelle, Y. Bengio, and P. A. Manzagol. Extracting and composing robust\nfeatures with denoising autoencoders. In ICML, pages 1096\u20131103. ACM, 2008.\n\n[29] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. CoRR,\nabs/1311.2901, 2013.\n\n[30] T. Zhang, B. Ghanem, S. Liu, and N. Ahuja. Robust visual tracking via structured multi-task\nsparse learning. IJCV, 101(2):367\u2013383, 2013.\n\n[31] Z. Zhang, P. Luo, C. C. Loy, and X. Tang. Facial landmark detection by deep multi-task learning.\nIn ECCV, pages 94\u2013108. Springer, 2014.\n", "award": [], "sourceid": 158, "authors": [{"given_name": "Hakan", "family_name": "Bilen", "institution": "University of Oxford"}, {"given_name": "Andrea", "family_name": "Vedaldi", "institution": "University of Oxford"}]}