{"title": "DeepUSPS: Deep Robust Unsupervised Saliency Prediction via Self-supervision", "book": "Advances in Neural Information Processing Systems", "page_first": 204, "page_last": 214, "abstract": "Deep neural network (DNN) based salient object detection in images based on high-quality labels is expensive. Alternative unsupervised approaches rely on careful selection of multiple handcrafted saliency methods to generate noisy pseudo-ground-truth labels. In this work, we propose a two-stage mechanism for robust unsupervised object saliency prediction, where the first stage involves refinement of the noisy pseudo labels generated from different handcrafted methods. Each handcrafted method is substituted by a deep network that learns to generate the pseudo labels. These labels are refined incrementally in multiple iterations via our proposed self-supervision technique. In the second stage, the refined labels produced from multiple networks representing multiple saliency methods are used to train the actual saliency detection network. We show that this self-learning procedure outperforms all the existing unsupervised methods over different datasets. Results are even comparable to those of fully-supervised state-of-the-art approaches.", "full_text": "DeepUSPS: Deep Robust Unsupervised Saliency\n\nPrediction With Self-Supervision\n\nDuc Tam Nguyen \u2217\u2020\u2021, Maximilian Dax \u2217\u2021, Chaithanya Kumar Mummadi \u2020\u00a7\n\nThi Phuong Nhung Ngo \u00a7, Thi Hoai Phuong Nguyen \u00b6, Zhongyu Lou \u2021, Thomas Brox \u2020\n\nAbstract\n\nDeep neural network (DNN) based salient object detection in images based on high-\nquality labels is expensive. Alternative unsupervised approaches rely on careful\nselection of multiple handcrafted saliency methods to generate noisy pseudo-\nground-truth labels. 
In this work, we propose a two-stage mechanism for robust unsupervised object saliency prediction, where the first stage involves refinement of the noisy pseudo-labels generated from different handcrafted methods. Each handcrafted method is substituted by a deep network that learns to generate the pseudo-labels. These labels are refined incrementally in multiple iterations via our proposed self-supervision technique. In the second stage, the refined labels produced from multiple networks representing multiple saliency methods are used to train the actual saliency detection network. We show that this self-learning procedure outperforms all the existing unsupervised methods over different datasets. Results are even comparable to those of fully-supervised state-of-the-art approaches. The code is available at https://tinyurl.com/wtlhgo3 .

(a) Input and GT (b) Traditional methods (c) Deep unsupervised methods
Figure 1: Unsupervised object saliency detection based on a given (a) input image. Note that the ground-truth (GT) label is depicted only for illustration purposes and not exploited by any traditional or deep unsupervised methods. (b) Traditional methods use handcrafted priors to predict saliencies and (c) deep unsupervised methods SBF, USD and ours (DeepUSPS) employ the outputs of the handcrafted methods as pseudo-labels when training the saliency prediction network. It can be seen that while SBF results in noisy saliency predictions and USD produces smooth saliency maps, our method yields more fine-grained saliency predictions that closely resemble the ground-truth.

∗Equal contribution, [fixed-term.Maximilian.Dax, Ductam.Nguyen]@de.bosch.com
†Computer Vision Group, University of Freiburg, Germany
‡Bosch Research, Bosch GmbH, Germany
§Bosch Center for AI, Bosch GmbH, Germany
¶Karlsruhe Institute of Technology, Germany

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 2: Evolution of refined pseudo-labels from the handcrafted method DSR in our pipeline. Here, we show that the noisy pseudo-label from the handcrafted method gets improved with inter-image consistency and further refined with our incremental self-supervision technique. While the perceptual differences between pseudo-labels from inter-image consistency and the self-supervision technique are minor, we quantitatively show in Table 2 that this additional refinement improves our prediction results. Results from different handcrafted methods are depicted in Fig. 1 in the Appendix.

1 Introduction

Object saliency prediction aims at finding and segmenting generic objects of interest and helps leverage unlabeled information contained in a scene. It can contribute to binary background/foreground segmentation, image caption generation (Show, 2015), semantic segmentation (Long et al., 2015), or object removal in scene editing (Shetty et al., 2018). In semantic segmentation, for example, a network trained on a fixed set of class labels can only identify objects belonging to these classes, while object saliency detection can highlight an unknown object (e.g., a "bear" crossing a street). Existing techniques on the saliency prediction task primarily fall under supervised and unsupervised settings.
The line of work on supervised approaches (Hou et al., 2017; Luo et al., 2017; Zhang et al., 2017b,c; Wang et al., 2017; Li et al., 2016; Wang et al., 2016; Zhao et al., 2015; Jiang et al., 2013b; Zhu et al., 2014), however, requires large-scale clean and pixel-level human-annotated datasets, which are expensive and time-consuming to acquire. Unsupervised saliency methods do not require any human annotations and can work in the wild on arbitrary datasets. These unsupervised methods are further categorized into traditional handcrafted salient object detectors (Jiang et al., 2013b; Zhu et al., 2014; Li et al., 2013; Jiang et al., 2013a; Zou & Komodakis, 2015) and DNN-based detectors (Zhang et al., 2018, 2017a). The traditional methods are based on specific priors, such as center priors (Goferman et al., 2011), the global contrast prior (Cheng et al., 2014), and the background connectivity assumption (Zhu et al., 2014). Despite their simplicity, these methods perform poorly due to the limited coverage of the hand-picked priors.
DNN-based approaches leverage the noisy pseudo-label outputs of multiple traditional handcrafted saliency models to provide a supervisory signal for training the saliency prediction network. Zhang et al. (2017a) proposes a method (SBF, 'Supervision by fusion') to fuse multiple saliency models to remove noise from the pseudo-ground-truth labels. This method updates the pseudo-labels with the predictions of the saliency detection network and yields very noisy saliency predictions, as shown in Fig. 1c. A slightly different approach (USD, 'Deep unsupervised saliency detection') is taken by Zhang et al. (2018), which introduces an explicit noise modeling module to capture the noise in pseudo-labels of different handcrafted methods. The joint optimization, along with the noise module, enables the saliency-prediction network to learn to generate pseudo-noise-free outputs.
It does so by fitting different noise estimates on the predicted saliency map, based on different noisy pseudo-ground-truth labels. This method produces smooth predictions of salient objects, as seen in Fig. 1c, since it employs a noise modeling module to counteract the influence of noise in pseudo-ground-truth labels from handcrafted saliency models.
Both DNN-based methods, SBF and USD, perform direct pseudo-label fusion on the noisy outputs of handcrafted methods. This implies that the poor-quality pseudo-labels are directly used for training the saliency network. Hence, the final performance of the network primarily depends upon the quality of the chosen handcrafted methods. On the contrary, a better way is to refine the poor pseudo-labels in isolation in order to maximize the strength of each method. The final pseudo-label fusion step to train a network should instead be performed on a set of diverse, high-quality, refined pseudo-labels.

Figure 3: Overview of the sequence of steps involved in our pipeline. Firstly, the training images are processed through different handcrafted methods to generate coarse pseudo-labels. In the second step, which we refer to as inter-image consistency, a deep network is learned from the training images and coarse pseudo-labels to generate consistent label outputs, as shown in Fig. 2. In the next step, the label outputs are further refined with our self-supervision technique in an iterative manner. Lastly, the refined labels from different handcrafted methods are fused for training the saliency prediction network. Details of the individual components in the pipeline are depicted in Fig. 4.

More concretely, we propose a systematic curriculum to incrementally refine the pseudo-labels by substituting each handcrafted method with a deep neural network.
The handcrafted methods operate on single-image priors and do not infer high-level information such as object shapes and perspective projections. Instead, we learn a function or proxy for the handcrafted saliency method that maps the raw images to pseudo-labels. In other words, we train a deep network to generate the pseudo-labels, which benefits from learning representations across a broad set of training images and thus significantly improves the pseudo-ground-truth labels, as seen in Fig. 2 (we refer to this effect as inter-image consistency). We further refine the pseudo-labels obtained after the process of inter-image consistency to clear the remaining noise in the labels via the self-supervision technique in an iterative manner. Instead of using pseudo-labels from the handcrafted methods directly as Zhang et al. (2018, 2017a), we alleviate the weaknesses of each handcrafted method individually. By doing so, the diversity of pseudo-labels from different methods is preserved until the final step when all refined pseudo-labels are fused. The large diversity reduces the over-fitting of the network to the label noise and results in better generalization capability.
The complete schematic overview of our approach is illustrated in Fig. 3. As seen in the figure, the training images are first processed by different handcrafted methods to create coarse pseudo-labels. In the second step, we train a deep network to predict the pseudo-labels (Fig. 4a) of the corresponding handcrafted method using an image-level loss to enforce inter-image consistency among the predictions. As seen in Fig. 2, this step already improves the pseudo-labels over handcrafted methods. In the next step, we employ an iterative self-supervision technique (Fig. 4c) that uses historical moving averages (MVA), which act as an ensemble of various historical models during training (Fig.
4b) to refine the generated pseudo-labels further incrementally. The described pipeline is performed for each handcrafted method individually. In the final step, the saliency prediction network is trained to predict the refined pseudo-labels obtained from multiple saliency methods using a mean image-level loss.
Our contribution in this work is outlined as follows: we propose a novel systematic mechanism to refine the pseudo-ground-truth labels of handcrafted unsupervised saliency methods iteratively via self-supervision. Our experiments show that this improved supervisory signal enhances the training process of the saliency prediction network. We show that our approach improves the saliency prediction results, outperforms previous unsupervised methods, and is comparable to supervised methods on multiple datasets. Since we use the refined pseudo-labels, the training behavior of the saliency prediction network largely resembles supervised training. Hence, the network has a more stable training process compared to existing unsupervised learning approaches.

2 Related work

Various object saliency methods are summarized in Borji et al. (2014) and evaluated on different benchmarks (Borji et al., 2015). In the modern literature, the best performances are achieved by deep supervised methods (Hou et al., 2017; Luo et al., 2017; Zhang et al., 2017b,c; Wang et al., 2017; Li et al., 2016; Wang et al., 2016; Zhao et al., 2015; Jiang et al., 2013b; Zhu et al., 2014), which all at least use some form of label information. The labels might be human-annotated saliency maps or the class of the object at hand. Compared to these fully- and weakly-supervised methods, our approach does not require any label for training. Our method can hence generalize to new datasets without having access to the labels.
From the literature of deep unsupervised saliency prediction, both Zhang et al.
(2018, 2017a) use saliency predictions from handcrafted methods as pseudo-labels to train a deep network. Zhang et al. (2018) proposes a datapoint-dependent noise module to capture the noise among different saliency methods. This additional noise module induces smooth predictions in the desired saliency maps. Croitoru et al. (2019) use an ensemble of teacher models to choose high-quality maps for the fusion steps. Zhang et al. (2017a) defines a manual fusion strategy to combine the pseudo-labels from handcrafted methods on super-pixel and image levels. The resulting combined labels are a linear combination of existing pseudo-labels. This method updates the pseudo-labels with the predictions of a saliency detection network and yields very noisy saliency predictions. In contrast, we refine the pseudo-labels for each handcrafted method in isolation, and hence the diversity of the pseudo-labels is preserved until the last fusion step.
The idea of using handcrafted methods for pseudo-label generation has also been adapted by Makansi et al. (2018) for optical flow prediction. They introduce an assessment network to predict the pixel-wise error of each handcrafted method. Subsequently, they choose the pixel-wise maps to form the best unsupervised saliency maps. These maps are used as data augmentation for a new domain. However, the best maps are bounded by the quality of the existing noisy maps from the handcrafted methods. In contrast to their work, our method improves individual methods gradually by enforcing inter-image consistency, instead of choosing pseudo-labels from the existing set. Further, their method fuses the original pseudo-labels directly in a single step. On the contrary, our fusion step is performed on the refined pseudo-labels at a late stage to preserve diversity.
From the robust learning perspective, Nguyen et al. (2019b) proposes a robust way to learn from wrongly annotated datasets for classification tasks.
These techniques can be combined with our presented method to improve the performance further. These advances also improve one-class-training use cases such as anomaly detection (Nguyen et al., 2019a), where the models are typically sensitive to noisy labeled data.
Compared to all previous unsupervised saliency methods, we are the first to successfully improve the saliency maps from handcrafted methods in isolation. Furthermore, our proposed incremental refining with self-supervision via historical model averaging is unique among this line of research.

3 DeepUSPS: Deep unsupervised saliency prediction via self-supervision

In this section, we explain the technical details of the components in the overall pipeline shown in Fig. 3.

3.1 Enforcing inter-image consistency with an image-level loss

Handcrafted saliency prediction methods are consistent within an image due to the underlying image priors, but not necessarily consistent across images. They only operate on single-image priors and do not infer high-level information such as object shapes and perspective projections. Such inter-image consistency can be enforced by using the outputs from each method as pseudo-labels for training a deep network with an image-level loss. Such a process leads to a refinement of the pseudo-labels suggested by each handcrafted method.
Let D be the set of training examples and M be a handcrafted method. By M(x, p) we denote the output prediction of method M over pixel p of image x ∈ D. To binarize M(x, p), we use a simple function l(x, p) with threshold γ such that: l(x, p) = 1 if M(x, p) > γ; l(x, p) = 0, otherwise. γ equals 1.5 · μ_saliency, the mean saliency of the handcrafted method. This discretization scheme counteracts the method-dependent dynamics in predicting saliency with different degrees of uncertainty.
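To make this stage concrete, a minimal numpy sketch of the pseudo-label binarization above and of the F-measure-based image-level loss L_β = 1 − F_β defined below in this section follows; the function names, array shapes, and the per-image reading of μ_saliency are our own illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def binarize_pseudo_labels(saliency_map):
    """Binarize a handcrafted method's saliency map M(x, .) into l(x, .).

    gamma = 1.5 * mu_saliency; here mu_saliency is taken as the mean
    saliency of this prediction (an illustrative assumption).
    """
    gamma = 1.5 * saliency_map.mean()
    return (saliency_map > gamma).astype(np.float32)

def f_beta_loss(pred, pseudo_label, beta2=0.3, eps=1e-8):
    """Image-level loss L_beta = 1 - F_beta against binarized pseudo-labels.

    precision and recall are accumulated over all pixels p of one image;
    beta2 = 0.3 is the value used in the paper's experiments.
    """
    tp = (pred * pseudo_label).sum()
    precision = tp / (pred.sum() + eps)
    recall = tp / (pseudo_label.sum() + eps)
    f_beta = (1 + beta2) * precision * recall / (beta2 * precision + recall + eps)
    return 1.0 - f_beta
```

A perfect prediction drives the loss to zero, while a prediction with no overlap yields a loss of one; because the loss is linear in precision and recall, single mislabeled pixels are penalized less severely than under a squared-error loss.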
The discretization of pseudo-labels makes the network less sensitive to over-fitting to the large label noise, compared to fitting to continuous, raw pseudo-labels.

(a) Enforcing inter-image consistency (b) Historical moving averages (MVA) (c) Incremental refining via self-supervision (d) Inter-method consistent predictions
Figure 4: A detailed demonstration of each step in our pipeline from Fig. 3. Handcrafted methods only operate on single images and provide poor-quality pseudo-labels. Hence, (a)-(c) are performed for each handcrafted method separately to refine the pseudo-labels with deep network training. In the final stage (d), the refined pseudo-label sets are fused by training a network to minimize the averaged loss between different methods.

Given method M, let θ be the set of learning parameters of the corresponding FCN and y(x, p) be its output for pixel p in image x. The precision and recall of the prediction over image x w.r.t. the pseudo-labels are straightforward and can be found in the Appendix. The image-level loss function w.r.t. each training example x is then defined as L_β = 1 − F_β, where the F-measure F_β reflects the weighted harmonic mean of precision and recall such that:

F_β = (1 + β²) · precision · recall / (β² · precision + recall).

L_β is a linear loss and therefore more robust to outliers and noise compared to higher-order losses such as the Mean-Square-Error. The loss is minimized by training the FCN for a fixed number of epochs. The fixed number is small to prevent the network from strongly over-fitting to the noisy labels.
Historical moving averages of predictions. Due to the large noise ratio in the pseudo-label set, the model snapshots in each training epoch fluctuate strongly.
Therefore, a historical moving average of the network saliency predictions y(x, p) is composed during the training procedure, as shown in Fig. 4b. Concretely, a fully-connected conditional random field (CRF) is applied to y(x, p) after each forward pass during training. These CRF outputs are then accumulated into MVA predictions for each data point at each epoch k as follows:

MVA(x, p, k) = (1 − α) · CRF(y(x, p)) + α · MVA(x, p, k − 1),

where y(x, p) denotes the prediction at epoch k. Since the MVA maps are collected during the training process after each forward pass, they do not require additional forward passes over the entire training set. Besides, the predictions are constructed using a large historical model ensemble, where all model snapshots of the training process contribute to the final results. Due to this historical ensembling of saliency predictions, the resulting maps are more robust and fluctuate less strongly compared to taking direct model snapshots.

3.2 Incremental pseudo-label refining via self-supervision

The moving-average predictions have significantly higher quality than the predictions of the network due to (1) the use of large model ensembles during training and (2) the application of the fully-connected CRF. However, the models from the past training iterations in the ensemble are weak due to strong fluctuations, which is a consequence of the training on the noisy pseudo-labels.
To improve the individual models in the ensemble, our approach utilizes the MVA again as the new set of pseudo-labels to train on (Fig. 4c). Concretely, the network is reinitialized and trained to minimize L_β again w.r.t. the MVA maps from the last training stage. The process is repeated until the MVA predictions have reached a stable state. By doing so, the diversity in the model ensemble is reduced, but the quality of each model is improved over time.
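The moving-average update above can be sketched as follows; the callable `crf` stands in for the fully-connected CRF post-processing (whose implementation is outside this sketch), and the function name is our own:

```python
import numpy as np

def update_mva(mva_prev, prediction, alpha=0.7, crf=None):
    """One MVA step: MVA(x, p, k) = (1 - alpha) * CRF(y(x, p)) + alpha * MVA(x, p, k - 1).

    alpha = 0.7 is the value used in the experiments; `crf` is an optional
    callable standing in for the fully-connected CRF applied after each
    forward pass (identity if omitted, for illustration only).
    """
    refined = crf(prediction) if crf is not None else prediction
    return (1.0 - alpha) * refined + alpha * mva_prev
```

Because the update reuses the forward passes of training itself, accumulating the MVA maps adds no extra inference cost, as noted in the text.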
We refer to this process as self-supervised network training with moving average (MVA) predictions.

3.3 Inter-method consistent saliency predictions

Note that the processes from Fig. 4a to Fig. 4c are applied to refine the outputs from each handcrafted method individually. These steps are intended to refine the quality of each method while retaining the underlying designed priors. Furthermore, refining each method in isolation increases the diversity among the pseudo-labels. Hence, the diversity of pseudo-labels is preserved until the final fusion stage. In the last step (Fig. 4d), the refined saliency maps are fused by minimizing the following loss:

L_en = (1/n) Σ_i L_β^i,

where L_β^i is computed analogously to the aforementioned L_β using the refined pseudo-labels of method M_i, and {M_1, . . . , M_n} is the set of refined handcrafted methods. This fusion scheme is simple and can be exchanged with those from Zhang et al. (2018, 2017a); Makansi et al. (2018).
Our pipeline requires additional computation time to refine the handcrafted methods gradually. Since the training is done in isolation, the added complexity is linear in the number of handcrafted methods. However, the computation of the MVAs does not require additional inference steps, since they are accumulated over the training iterations.

4 Experiments

We first compare our proposed pipeline to existing benchmarks by following the configuration of Zhang et al. (2018). Further, we show in detailed oracle and ablation studies how each component of the pipeline is crucial for the overall competitive performance. Moreover, we analyze the effect of the proposed self-supervision mechanism on the label quality over time.

4.1 Experimental setup

Our method is evaluated on traditional object saliency prediction benchmarks (Borji et al., 2015). Following Zhang et al.
(2018), we extract handcrafted maps from MSRA-B (Liu et al., 2010): 2500 training and 500 validation images, respectively. The remaining test set contains in total 2000 images. Further tests are performed on the ECSSD dataset (Yan et al., 2013) (1000 images), DUT (Yang et al., 2013) (5168 images), and SED2 (Alpert et al., 2011) (100 images). We resize all images to 432x432.
We evaluate the proposed pipeline against different supervised methods, traditional unsupervised methods, and deep unsupervised methods from the literature. We follow the training configuration and setting of the previous unsupervised method of Zhang et al. (2018) to train the saliency detection network. We use the DRN network (Chen et al., 2018), which is pretrained on CityScapes (Cordts et al., 2016). The last fully-convolutional layer of the network is replaced to predict a binary saliency mask. Our ablation study also tests ResNet101 (He et al., 2016), which is pretrained on ImageNet ILSVRC (Russakovsky et al., 2015). Our pseudo-label generation networks are trained for a fixed number of 25 epochs for each handcrafted method, and the saliency detection network is trained for 200 epochs in the final stage. We use ADAM (Kingma & Ba, 2014) with a momentum of 0.9, a batch size of 20, and a learning rate of 1e-6 in the first step when trained on the handcrafted methods. The learning rate is doubled in each later self-supervision iteration. Self-supervision is performed for two iterations. Our models are trained three times to report the mean and standard deviation. Our proposed pipeline needs about 30 hours of computation time on four Geforce Titan X for training.
For the handcrafted methods, we use RBD ('robust background detection') (Zhu et al., 2014), DSR ('dense and sparse reconstruction') (Li et al., 2013), MC ('Markov chain') (Jiang et al., 2013a), and HS ('hierarchy-associated rich features') (Zou & Komodakis, 2015).
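The optimization settings above can be collected in a small sketch; the helper function and dictionary layout are our own, and only the values stated in the text are used:

```python
# Hyperparameters as stated in the text; the helper itself is illustrative.
def training_config(stage="pseudo", self_sup_round=0):
    """Return the optimizer settings for one training stage.

    stage: "pseudo" for a pseudo-label generation network (25 epochs),
           "final" for the saliency detection network (200 epochs).
    self_sup_round: 0 when training on the handcrafted pseudo-labels;
    the learning rate is doubled in each later self-supervision round.
    """
    return {
        "optimizer": "ADAM",
        "momentum": 0.9,
        "batch_size": 20,
        "learning_rate": 1e-6 * (2 ** self_sup_round),
        "epochs": 25 if stage == "pseudo" else 200,
    }
```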
The α-parameter for the exponential moving average of the MVA maps is set to 0.7. Further, the model's predictions are fed into a fully-connected CRF (Krähenbühl & Koltun, 2011). As the evaluation metrics, we utilize the Mean-Average-Error (MAE or L1-loss) and the weighted F-score with β² = 0.3, similar to previous works. Furthermore, the analysis of the self-supervision mechanism includes precision and recall compared against the ground-truth labels. Please refer to Sec. 1 in the Appendix for more details on the definition of these metrics.

4.2 Evaluation on different datasets

Tab. 1 shows the performance of our proposed approach on various traditional benchmarks. Our method outperforms other deep unsupervised works consistently on all datasets by a large margin regarding the MAE. Using the F-score metric, we outperform the state-of-the-art (noise modeling from Zhang et al. (2018)) on three out of four datasets. Across the four datasets, our proposed baseline achieves up to 21% and 29% error reduction on the F-score and MAE metric, respectively. The effects of the different components are analyzed in the subsequent oracle test, ablation study, and detailed analysis of the improvement with self-supervision. Some failure cases are shown in Fig. 5.

Figure 5: Failure cases. The left panel shows images (first column) for which both our approach (fourth column) and the supervised baseline (third column) fail to predict the GT label (second column). In each of these cases, both predictions are close to each other and visually look like justifiable saliency masks despite being significantly different from the GT. We found that these kinds of images are indeed responsible for a major part of the bad scores. The right panel shows images for which our predictions are particularly good compared to the baseline prediction, or vice versa.
These images are often disturbed by additional intricate details.

4.3 Oracle test and ablation studies

Tab. 2 shows an oracle test and an ablation study in which a particular component of the proposed pipeline is removed. In the oracle test, we compare training on the ground-truth against oracle label fusion in the final step, where we choose the pixel-wise best saliency predictions from the refined pseudo-labels. The performance of the oracle label fusion is on par with training on the ground-truth, or even slightly better on MSRA-B and SED2. This experiment indicates that DeepUSPS leads to high-quality pseudo-labels. Despite the simple fusion scheme, the DeepUSPS approach is only slightly inferior to the oracle label fusion. Interchanging the architecture with ResNet101, which is pretrained on ImageNet ILSVRC, results in a similarly strong performance.
The ablation study shows the importance of the components in the pipeline, namely the inter-image consistency training and the self-supervision step. Training on the pseudo-labels from handcrafted methods directly causes consistently poor performance on all datasets. Gradually improving the particular handcrafted maps with our network already leads to a substantial performance improvement.

Table 1: Comparing our results against various approaches, measured in % of F-score (higher is better) and % of MAE (lower is better).
Bold entries represent the best values in unsupervised methods.

Models                       MSRA-B          ECSSD           DUT             SED2
                             F↑     MAE↓     F↑     MAE↓     F↑     MAE↓     F↑     MAE↓
Deep and supervised
Hou et al. (2017)            89.41  04.74    87.96  06.99    72.90  07.60    82.36  10.14
Luo et al. (2017)            89.70  04.78    89.08  06.55    73.60  07.96    -      -
Zhang et al. (2017b)         -      -        88.25  06.07    69.32  09.76    87.45  06.29
Zhang et al. (2017c)         -      -        85.21  07.97    65.95  13.21    84.44  07.42
Wang et al. (2017)           85.06  06.65    82.60  09.22    67.22  08.46    74.47  11.64
Li et al. (2016)             -      -        75.89  16.01    60.45  07.58    77.78  10.74
Wang et al. (2016)           -      -        84.26  09.73    69.18  09.45    76.16  11.40
Zhao et al. (2015)           89.66  04.91    80.61  10.19    67.15  08.85    76.60  11.62
Jiang et al. (2013b)         77.80  10.40    80.97  10.81    67.68  09.16    76.58  11.71
Zhu et al. (2014)            89.73  04.67    83.15  09.06    69.02  09.71    78.40  10.14
Unsupervised and handcrafted
RBD                          75.08  11.71    65.18  18.32    51.00  20.11    79.39  10.96
DSR                          72.27  12.07    63.87  17.42    55.83  13.74    70.53  14.52
MC                           71.65  14.41    61.14  20.37    52.89  18.63    66.19  18.48
HS                           71.29  16.09    62.34  22.83    52.05  22.74    71.68  18.69
Deep and unsupervised
SBF                          -      -        78.70  08.50    58.30  13.50    -      -
USD                          87.70  05.60    87.83  07.04    71.56  08.60    83.80  08.81
DeepUSPS (ours)              90.31  03.96    87.42  06.32    73.58  06.25    84.46  06.96
±                            00.10  00.03    00.46  00.10    00.87  00.02    01.00  00.06

The performance further increases with more iterations of self-supervised training. Leaving out the self-supervision stage also decreases the performance of the overall pipeline.

4.4 Analyzing the quality of the pseudo-labels

Fig. 6 shows an analysis of the quality of the labels of the training images over different steps of our pipeline.
We analyze the quality of the generated saliency maps (pseudo-labels) from the deep networks and also the quality of the aggregated MVA maps. Here, the quality of the pseudo-labels is measured using the ground-truth label information of the training set. It can be seen in the figure that the quality of the labels improves incrementally at each step of our pipeline. Moreover, the quality of the MVA maps improves rapidly when compared with the saliency maps. Our self-supervision technique further aids in slightly improving the quality of the labels. After a few iterations of self-supervision, the F-score and the MAE stagnate due to the stable moving-average predictions, and the saliency output maps also reach the quality level of the MVA maps. Hence, in the case of offline testing (when all test data are available at once), the entire proposed procedure might be used to extract high-quality saliency maps. In addition, the precision and recall of the quality of the labels are shown in Fig. 2 in the Appendix. The handcrafted methods vary strongly in terms of precision as well as recall. This significant variance indicates a large diversity among these pseudo-labels. Our approach is capable of improving the quality of the pseudo-labels of each method in isolation. Thus, the diversity of the different methods is preserved until the last fusion step, which enforces inter-method consistent saliency predictions by the deep network.

5 Conclusion

In this work, we propose to refine the pseudo-labels from different unsupervised handcrafted saliency methods in isolation, to improve the supervisory signal for training the saliency detection network. We learn a pseudo-label generation deep network as a proxy for each handcrafted method, which further enables us to adapt the self-supervision technique to refine the pseudo-labels.
We quantitatively show that refining the pseudo-labels iteratively enhances the results of the saliency prediction network and outperforms previous unsupervised techniques by up to 21% and 29% relative error reduction on the F-score and Mean-Average-Error, respectively. We also show that our results are comparable to the fully-supervised state-of-the-art approaches, which indicates that the refined labels are as good as human annotations. Our studies also reveal that the proposed curriculum learning is crucial to improving the quality of pseudo-labels and hence achieving competitive performance on object saliency detection tasks.

Table 2: Results on extensive ablation studies analyzing the significance of different components in our pipeline using F-score and MAE on different datasets. Our study includes oracle training on GT, oracle label fusion (best pixel-wise choice among different pseudo-label maps), using only the pseudo-labels of a single handcrafted method, and analyzing the influence of the self-supervision technique over iterations.

Models                                      MSRA-B          ECSSD           DUT             SED2
                                            F↑     MAE↓     F↑     MAE↓     F↑     MAE↓     F↑     MAE↓
DeepUSPS (ours)                             90.31  03.96    87.42  06.32    73.58  06.25    84.46  06.96
DeepUSPS (ours)-Resnet101                   90.05  04.17    88.17  06.41    69.60  07.71    82.60  07.31
(Oracle) train on GT                        91.00  03.37    90.32  04.54    74.17  05.46    80.57  07.19
(Oracle) Labels fusion using GT             91.34  03.63    88.80  05.90    74.22  05.88    82.16  07.10
Direct fusion of handcrafted methods        84.57  06.35    74.88  11.17    65.83  08.19    78.36  09.20
Effect of inter-image consistency training
Trained on inter-image cons. RBD-maps       84.49  06.25    80.62  08.82    63.86  09.17    72.05  10.33
Trained on inter-image cons. DSR-maps       85.01  06.37    80.93  09.28    64.57  08.24    65.88  10.71
Trained on inter-image cons. MC-maps        85.72  05.80    83.33  07.73    65.65  08.51    73.90  08.95
Trained on inter-image cons. HS-maps        85.98  05.58    84.02  07.51    66.83  07.83    71.45  08.43
Effect of self-supervision
No self-supervision                         89.52  04.25    85.74  06.93    72.81  06.49    84.00  07.05
Trained on refined RBD-maps after iter. 1   87.10  05.33    83.38  08.03    68.45  07.54    74.75  09.05
Trained on refined RBD-maps after iter. 2   88.08  04.96    84.99  07.51    70.95  06.94    78.37  08.11
Trained on refined DSR-maps after iter. 1   87.11  05.62    82.77  08.68    67.52  07.55    71.40  09.41
Trained on refined DSR-maps after iter. 2   88.34  05.17    84.73  08.08    68.82  07.21    74.24  09.06
Trained on refined MC-maps after iter. 1    87.53  05.22    84.94  07.58    67.82  07.33    70.72  09.48
Trained on refined MC-maps after iter. 2    88.53  04.85    85.74  07.29    69.52  06.92    73.00  09.22
Trained on refined HS-maps after iter. 1    88.23  04.73    86.21  06.66    71.21  06.63    76.75  07.80
Trained on refined HS-maps after iter. 2    89.07  04.52    86.75  06.51    71.64  06.42    78.88  07.22

(a) F-score↑ saliency maps (b) F-score↑ MVA maps (c) MAE↓ saliency maps (d) MAE↓ MVA maps
Figure 6: Illustrating the improvement of the label quality of predicted saliency maps and aggregated MVA maps on the MSRA-B training set from four handcrafted methods over different steps in our pipeline. Steps 0-3 represent the quality of the labels of the four different handcrafted methods, inter-image consistency, and iterations 1 and 2 of self-supervision, with respect to the ground-truth labels.

References

Alpert, S., Galun, M., Brandt, A., and Basri, R. Image segmentation by probabilistic bottom-up aggregation and cue integration. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(2):315–327, 2011.

Borji, A., Cheng, M.-M., Hou, Q., Jiang, H., and Li, J. Salient object detection: A survey. arXiv preprint arXiv:1411.5878, 2014.

Borji, A., Cheng, M.-M., Jiang, H., and Li, J.
Salient object detection: A benchmark. IEEE Transactions on Image Processing, 24(12):5706–5722, 2015.

Chen, L.-C., Papandreou, G., Kokkinos, I., Murphy, K., and Yuille, A. L. DeepLab: Semantic image segmentation with deep convolutional nets, atrous convolution, and fully connected CRFs. IEEE Transactions on Pattern Analysis and Machine Intelligence, 40(4):834–848, 2018.

Cheng, M.-M., Mitra, N. J., Huang, X., Torr, P. H., and Hu, S.-M. Global contrast based salient region detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 37(3):569–582, 2014.

Cordts, M., Omran, M., Ramos, S., Rehfeld, T., Enzweiler, M., Benenson, R., Franke, U., Roth, S., and Schiele, B. The Cityscapes dataset for semantic urban scene understanding. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3213–3223, 2016.

Croitoru, I., Bogolin, S.-V., and Leordeanu, M. Unsupervised learning of foreground object segmentation. International Journal of Computer Vision, 127(9):1279–1302, Sep 2019. ISSN 1573-1405. doi: 10.1007/s11263-019-01183-3. URL https://doi.org/10.1007/s11263-019-01183-3.

Goferman, S., Zelnik-Manor, L., and Tal, A. Context-aware saliency detection. IEEE Transactions on Pattern Analysis and Machine Intelligence, 34(10):1915–1926, 2011.

He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 770–778, 2016.

Hou, Q., Cheng, M.-M., Hu, X., Borji, A., Tu, Z., and Torr, P. H. Deeply supervised salient object detection with short connections. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3203–3212, 2017.

Jiang, B., Zhang, L., Lu, H., Yang, C., and Yang, M.-H. Saliency detection via absorbing Markov chain. In Proceedings of the IEEE International Conference on Computer Vision, pp.
1665–1672, 2013a.

Jiang, H., Wang, J., Yuan, Z., Wu, Y., Zheng, N., and Li, S. Salient object detection: A discriminative regional feature integration approach. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2083–2090, 2013b.

Kingma, D. P. and Ba, J. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.

Krähenbühl, P. and Koltun, V. Efficient inference in fully connected CRFs with Gaussian edge potentials. In Advances in Neural Information Processing Systems, pp. 109–117, 2011.

Li, X., Lu, H., Zhang, L., Ruan, X., and Yang, M.-H. Saliency detection via dense and sparse reconstruction. In Proceedings of the IEEE International Conference on Computer Vision, pp. 2976–2983, 2013.

Li, X., Zhao, L., Wei, L., Yang, M.-H., Wu, F., Zhuang, Y., Ling, H., and Wang, J. DeepSaliency: Multi-task deep neural network model for salient object detection. IEEE Transactions on Image Processing, 25(8):3919–3930, 2016.

Liu, T., Yuan, Z., Sun, J., Wang, J., Zheng, N., Tang, X., and Shum, H.-Y. Learning to detect a salient object. IEEE Transactions on Pattern Analysis and Machine Intelligence, 33(2):353–367, 2010.

Long, J., Shelhamer, E., and Darrell, T. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3431–3440, 2015.

Luo, Z., Mishra, A., Achkar, A., Eichel, J., Li, S., and Jodoin, P.-M. Non-local deep features for salient object detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 6609–6617, 2017.

Makansi, O., Ilg, E., and Brox, T. FusionNet and AugmentedFlowNet: Selective proxy ground truth for training on unlabeled images. arXiv preprint arXiv:1808.06389, 2018.

Nguyen, D. T., Lou, Z., Klar, M., and Brox, T. Anomaly detection with multiple-hypotheses predictions.
In International Conference on Machine Learning, pp. 4800–4809, 2019a.

Nguyen, D. T., Mummadi, C. K., Ngo, T. P. N., Nguyen, T. H. P., Beggel, L., and Brox, T. SELF: Learning to filter noisy labels with self-ensembling. arXiv preprint arXiv:1910.01842, 2019b.

Russakovsky, O., Deng, J., Su, H., Krause, J., Satheesh, S., Ma, S., Huang, Z., Karpathy, A., Khosla, A., Bernstein, M., et al. ImageNet large scale visual recognition challenge. International Journal of Computer Vision, 115(3):211–252, 2015.

Shetty, R. R., Fritz, M., and Schiele, B. Adversarial scene editing: Automatic object removal from weak supervision. In Advances in Neural Information Processing Systems, pp. 7706–7716, 2018.

Xu, K., Ba, J., Kiros, R., Cho, K., Courville, A., Salakhutdinov, R., Zemel, R., and Bengio, Y. Show, attend and tell: Neural image caption generation with visual attention. arXiv preprint arXiv:1502.03044, 2015.

Wang, L., Wang, L., Lu, H., Zhang, P., and Ruan, X. Saliency detection with recurrent fully convolutional networks. In European Conference on Computer Vision, pp. 825–841. Springer, 2016.

Wang, T., Borji, A., Zhang, L., Zhang, P., and Lu, H. A stagewise refinement model for detecting salient objects in images. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4019–4028, 2017.

Yan, Q., Xu, L., Shi, J., and Jia, J. Hierarchical saliency detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1155–1162, 2013.

Yang, C., Zhang, L., Lu, H., Ruan, X., and Yang, M.-H. Saliency detection via graph-based manifold ranking. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 3166–3173, 2013.

Zhang, D., Han, J., and Zhang, Y. Supervision by fusion: Towards unsupervised learning of deep salient object detector. In Proceedings of the IEEE International Conference on Computer Vision, pp. 4048–4056, 2017a.

Zhang, J., Zhang, T., Dai, Y., Harandi, M., and Hartley, R.
Deep unsupervised saliency detection: A multiple noisy labeling perspective. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 9029–9038, 2018.

Zhang, P., Wang, D., Lu, H., Wang, H., and Ruan, X. Amulet: Aggregating multi-level convolutional features for salient object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 202–211, 2017b.

Zhang, P., Wang, D., Lu, H., Wang, H., and Yin, B. Learning uncertain convolutional features for accurate saliency detection. In Proceedings of the IEEE International Conference on Computer Vision, pp. 212–221, 2017c.

Zhao, R., Ouyang, W., Li, H., and Wang, X. Saliency detection by multi-context deep learning. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 1265–1274, 2015.

Zhu, W., Liang, S., Wei, Y., and Sun, J. Saliency optimization from robust background detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pp. 2814–2821, 2014.

Zou, W. and Komodakis, N. HARF: Hierarchy-associated rich features for salient object detection. In Proceedings of the IEEE International Conference on Computer Vision, pp.
406–414, 2015.