{"title": "Unsupervised Object Segmentation by Redrawing", "book": "Advances in Neural Information Processing Systems", "page_first": 12726, "page_last": 12737, "abstract": "Object segmentation is a crucial problem that is usually solved by using supervised learning approaches over very large datasets composed of both images and corresponding object masks. Since the masks have to be provided at pixel level, building such a dataset for any new domain can be very costly. We present ReDO, a new model able to extract objects from images without any annotation in an unsupervised way. It relies on the idea that it should be possible to change the textures or colors of the objects without changing the overall distribution of the dataset. Following this assumption, our approach is based on an adversarial architecture where the generator is guided by an input sample: given an image, it extracts the object mask, then redraws a new object at the same location. The generator is controlled by a discriminator that ensures that the distribution of generated images is aligned to the original one. We experiment with this method on different datasets and demonstrate the good quality of extracted masks.", "full_text": "Unsupervised Object Segmentation by Redrawing\n\nMicka\u00ebl Chen\n\nSorbonne Universit\u00e9, CNRS, LIP6, F-75005, Paris, France\n\nmickael.chen@lip6.fr\n\nThierry Arti\u00e8res\n\nAix Marseille Univ, Universit\u00e9 de Toulon, CNRS, LIS, Marseille, France\n\nEcole Centrale Marseille\n\nthierry.artieres@centrale-marseille.fr\n\nFacebook Arti\ufb01cial Intelligence Research\n\nLudovic Denoyer\n\ndenoyer@fb.com\n\nAbstract\n\nObject segmentation is a crucial problem that is usually solved by using supervised\nlearning approaches over very large datasets composed of both images and corre-\nsponding object masks. Since the masks have to be provided at pixel level, building\nsuch a dataset for any new domain can be very time-consuming. We present ReDO,\na new model able to extract objects from images without any annotation in an unsu-\npervised way. It relies on the idea that it should be possible to change the textures\nor colors of the objects without changing the overall distribution of the dataset.\nFollowing this assumption, our approach is based on an adversarial architecture\nwhere the generator is guided by an input sample: given an image, it extracts the\nobject mask, then redraws a new object at the same location. The generator is\ncontrolled by a discriminator that ensures that the distribution of generated images\nis aligned to the original one. We experiment with this method on different datasets\nand demonstrate the good quality of extracted masks.\n\n1\n\nIntroduction\n\nImage segmentation aims at splitting a given image into a set of non-overlapping regions correspond-\ning to the main components in the image. It has been studied for a long time in an unsupervised\nsetting using prior knowledge on the nature of the region one wants to detect using e.g. normalized\ncuts and graph-based methods. Recently the rise of deep neural networks and their spectacular\nperformances on many dif\ufb01cult computer vision tasks have led to revisit the image segmentation\nproblem using deep networks in a fully supervised setting [5, 20, 58], a problem referred as semantic\nimage segmentation.\nAlthough such modern methods allowed learning successful semantic segmentation systems, their\ntraining requires large-scale labeled datasets with usually a need for pixel-level annotations. 
This\nfeature limits the use of such techniques for many image segmentation tasks for which no such large\nscale supervision is available. To overcome this drawback, we follow here a very recent trend that\naims at revisiting the unsupervised image segmentation problem with new tools and new ideas from\nthe recent history and success of deep learning [55] and from the recent results of supervised semantic\nsegmentation [5, 20, 58].\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fBuilding on the idea of scene composition [4, 14, 18, 56] and on the adversarial learning principle\n[17], we propose to address the unsupervised segmentation problem in a new way. We start by\npostulating an underlying generative process for images that relies on an assumption of independence\nbetween regions of an image we want to detect. This means that replacing one object in the image\nwith another one, e.g. a generated one, should yield a realistic image. We use such a generative model\nas a backbone for designing an object segmentation model we call ReDO (ReDrawing of Objects),\nwhich outputs are then used to modify the input image by redrawing detected objects. Following ideas\nfrom adversarial learning, the supervision of the whole system is provided by a discriminator that is\ntrained to distinguish between real images and fake images generated accordingly to the generative\nprocess. Despite being a simpli\ufb01ed model for images, we \ufb01nd this generative process effective for\nlearning a segmentation model.\nThe paper is organized as follows. We present related work in Section 2, then we describe our method\nin Section 3. We \ufb01rst de\ufb01ne the underlying generative model that we consider in Section 3.2 and detail\nhow we translate this hypothesis into a neural network architecture to learn a segmentation module\nin Section 3.3. Then we give implementation details in Section 4. Finally, we present experimental\nresults on three datasets in Section 5 that explore the feasibility of unsupervised segmentation within\nour framework and compare its performance against a baseline supervised with few labeled examples.\n\n2 Related Work\n\nImage segmentation is a very active topic in deep learning that boasts impressive results when using\nlarge-scale labeled datasets. Those approaches can effectively parse high-resolution images depicting\ncomplex and diverse real-world scenes into informative semantics or instance maps. State-of-the-art\nmethods use clever architectural choices or pipelines tailored to the challenges of the task [5, 20, 58].\nHowever, most of those models use pixel-level supervision, which can be unavailable in some settings,\nor time-consuming to acquire in any case. Some works tackle this problem by using fewer labeled\nimages or weaker overall supervision. One common strategy is to use image-level annotations to\ntrain a classi\ufb01er from which class saliency maps can be obtained. Those saliency maps can then be\nexploited with other means to produce segmentation maps. For instance, WILDCAT [13] uses a\nConditional Random Field (CRF) for spatial prediction in order to post-process class saliency maps\nfor semantic segmentation. PRM [59], instead, \ufb01nds pixels that provoke peaks in saliency maps and\nuses these as a reference to choose the best regions out of a large set of proposals previously obtained\nusing MCG [2], an unsupervised region proposal algorithm. 
Both pipelines use a combination of a\ndeep classi\ufb01er and a method that take advantage of spatial and visual handcrafted image priors.\nCo-segmentation, introduced by Rother et al. 2006 [46], addresses the related problem of segmenting\nobjects that are shared by multiple images by looking for similar data patterns in all those images.\nLike the aforementioned models, in addition to prior image knowledge, deep saliency maps are often\nused to localize those objects [23]. Unsupervised co-segmentation [22], i.e. the task of covering\nobjects of a speci\ufb01c category without additional data annotations, is a setup that resembles ours.\nHowever, unsupervised co-segmentation systems are built on the idea of exploiting features similarity\nand can\u2019t easily be extended to a class-agnostic system. As we aim to ultimately be able to segment\nvery different objects, our approach instead relies on independence between the contents of different\nregions of an image which is a more general concept.\nFully unsupervised approaches have traditionally been more focused on designing handcrafted\nfeatures or energy functions to de\ufb01ne the desired property of objectness. Impressive results have\nbeen obtained when making full use of depth maps in addition to usual RGB images [44, 49] but\nit is much harder to specify good energy functions for purely RGB images. W-NET [55] extracts\nlatent representations via a deep auto-encoder that can then be used by a more classic CRF algorithm.\nKanezaki 2018 [28] further incorporate deep priors and train a neural network to directly minimize\ntheir chosen intra-region pixel distance. A different approach is proposed by Ji et al. 2019 [26] whose\nmethod \ufb01nds clusters of pixels using a learned distance invariant to some known properties. Unlike\nours, none of these approaches are learned entirely from data.\nOur work instead follows a more recent trend by inferring scene decomposition directly from data.\nStemming from DRAW [19], many of those approaches [4, 14] use an attention network to read a\nregion of an image and a Variational Auto-encoder (VAE) to partially reconstruct the image in an\niterative process in order to \ufb02esh out a meaningful decomposition. LR-GAN [56] is able to generate\n\n2\n\n\fsimple scenes recursively, building object after object, and Sbai et al. 2018 [48] decompose an image\ninto single-colored strokes for vector graphics. While iterative processes have the advantage of being\nable to handle an arbitrary number of objects, they are also more unstable and dif\ufb01cult to train. Most\nof those can either only be used in generation [56], or only handle very simple objects [4, 14, 18].\nAs a proof of concept, we decided to \ufb01rst ignore this additional dif\ufb01culty by only handling a set\nnumber of objects but our model can naturally be extended with an iterative composition process.\nThis choice is common among works that, like ours, focus on other aspects of image compositionality.\nVan Steenkiste et al. 2018 [52] advocates for a generative framework that accounts for relationship\nbetween objects. While they do produce masks as part of their generative process, they cannot\nsegment a given image. Closer to our setup, the very recent IODINE [18] propose a VAE adapted\nfor multi-objects representations. Their learned representations include a scene decomposition, but\nthey need a costly iterative re\ufb01nement process whose performance have only been demonstrated on\nsimulated datasets and not real images. 
Like ours, some prior work have tried to \ufb01nd segmentation\nmask by recomposing new images. SEIGAN [42] and Cut & Paste [45] learns to separate object and\nbackground by moving the region corresponding to the object to another background and making\nsure the image is still realistic. These methods however, need to have access to background images\nwithout objects, which might not be easy to obtain.\nOur work also ties to recent research in disentangled representation learning. Multiple techniques\nhave been used to separate information in factored latent representations. One line of work focuses on\nunderstanding and exploiting the innate disentangling properties of Variational Auto-Encoders. It was\n\ufb01rst observed by \u03b2-VAE [21] that VAEs can be constrained to produce disentangled representations\nby imposing a stronger penalty on the Kullback-Leibler divergence term on the VAE loss. FactorVAE\n[30] and \u03b2-TCVAE [7] extract a total correlation term from the KL term of the VAE objective and\nspeci\ufb01cally re-weight it instead of the whole KL term. In a similar fashion, HFVAE [15] introduces a\nhierarchical decomposition of the KL term to impose a structure on the latent space. A similar property\ncan be observed with GAN-based models, as shown by InfoGAN [9] which forces a generator to\nmap a code to interpretable features by maximizing the mutual information between the code and\nthe output. Using adversarial training is also a good way to split and control information in latent\nembeddings. Fader Networks [32] uses adversarial training to remove speci\ufb01c class information\nfrom a vector. This technique is also used in adversarial domain adaptation [16, 36, 51] to align\nembeddings from different domains. Similar methods can be used to build factorial representations\ninstead of simply removing information [6, 10, 11, 38]. Like our work, they use adversarial learning\nto match an implicitly prede\ufb01ned generative model but for purposes unrelated to segmentation.\n\n3 Method\n\n3.1 Overview\nA segmentation process F splits a given image I \u2208 RW\u00d7H\u00d7C into a set of non-overlapping regions.\nF can be described as a function that assigns to each pixel coordinate of I one of n regions. The\nproblem is then to \ufb01nd a correct partition F for any given image I. Lacking supervision, a common\nstrategy is to de\ufb01ne properties one wants the regions to have, and then to \ufb01nd a partition that produces\nregions with such properties. This can be done by de\ufb01ning an energy function and then \ufb01nding an\noptimal split. The challenge is then to accurately describe and model the statistical properties of\nmeaningful regions as a function one can optimize.\nWe address this problem differently. Instead of trying to de\ufb01ne the right properties of regions at\nthe level of each image, we make assumptions about the underlying generative process of images\nin which the different regions are explicitly modeled. Then, by using an adversarial approach, we\nlearn the parameters of the different components of our model so that the overall distribution of the\ngenerated images matches the distribution of the dataset. We detail the generative process in the\nsection 3.2, while the way we learn F is detailed in Section 3.3.\n\n3.2 Generative Process\n\nWe consider that images are produced by a generative process that operates in three steps: \ufb01rst, it\nde\ufb01nes the different regions in the image i.e the organization of the scene (composition step). 
Then, given this segmentation, the process generates the pixels of each region independently (drawing step). Finally, the resulting regions are assembled into the final image (assembling step).

Let us consider a scene composed of n − 1 objects and one background, which we refer to as object n. Let us denote by M^k ∈ {0, 1}^{W×H} the mask corresponding to object k, which associates one binary value to each pixel of the final image so that M^k_{x,y} = 1 iff the pixel of coordinates (x, y) belongs to object k. Since one pixel can only belong to one object, the masks have to satisfy

\sum_{k=1}^{n} M^k_{x,y} = 1 \quad \text{for all } (x, y),

and the background mask M^n can therefore be recovered from the object masks as

M^n = 1 - \sum_{k=1}^{n-1} M^k.

The pixel values of object k are denoted V^k ∈ R^{W×H×C}. Given that the generated image is of size W × H × C, each object is associated with an image of the same size, but only the pixels selected by the mask are used to compose the output image. The final composition of the objects into an image is computed as

I \leftarrow \sum_{k=1}^{n} M^k \odot V^k.

To recap, the underlying generative process can be summarized as follows: i) first, the masks M^k are chosen jointly according to a mask prior p(M); ii) then, for each object independently, the pixel values are chosen according to a distribution p(V^k | M^k, k); iii) finally, the objects are assembled into a complete image.

This process assumes independence between the colors and textures of the different objects composing a scene. This is a naive assumption, since colorimetric factors such as exposure, brightness, or even the true colors of two objects are often related, but this simple model still serves as a good prior for our purposes.
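To make the assembling step concrete, the sketch below implements the composition I = \sum_k M^k \odot V^k for a batch of soft masks and per-region pixel values. It is a minimal illustration assuming PyTorch-style tensors; the function names and shapes are ours and are not taken from the released code.

import torch

def masks_from_logits(logits):
    # logits: (batch, n_regions, H, W) raw scores from the mask network F.
    # A pixel-wise softmax over regions is one simple way to obtain soft masks
    # that satisfy the constraint sum_k M^k_{x,y} = 1.
    return torch.softmax(logits, dim=1).unsqueeze(2)  # (batch, n_regions, 1, H, W)

def compose_image(masks, values):
    # masks:  (batch, n_regions, 1, H, W) soft assignments summing to 1 over regions
    # values: (batch, n_regions, C, H, W) candidate pixel values V^k for every region
    # Returns the assembled image I = sum_k M^k * V^k, of shape (batch, C, H, W).
    return (masks * values).sum(dim=1)

# Example with n = 2 regions (one object and the background) on 128 x 128 images:
masks = masks_from_logits(torch.randn(4, 2, 128, 128))
values = torch.rand(4, 2, 3, 128, 128)
image = compose_image(masks, values)  # (4, 3, 128, 128)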
3.3 From Generative Process to Object Segmentation

Now, instead of considering a purely generative process where the masks are sampled from a prior p(M), we consider the inductive process where the masks are extracted directly from an input image I through the function F, the object segmentation function described previously. The role of F is thus to output a set of masks for any input I. The new generative process acts as follows: i) it takes a random image from the dataset and computes the masks F(I) → M^1, . . . , M^n; ii) it generates new pixel values for the regions of the image according to a distribution p(V^k | M^k, k); iii) it assembles the objects as before.

In order for the output images to match the distribution of the training dataset, all the components (i.e., F and p(V^k | M^k, k)) are learned adversarially following the GAN approach. Let us define D : R^{W×H×C} → R, a discriminator function that classifies images as real or fake. Let us denote by G_F(I, z_1, . . . , z_n) our generator function, which composes a new image given an input image I, an object segmentation function F, and a set of vectors z_1, . . . , z_n, each sampled independently from a prior p(z), one for each object k, background included. Since the pixel values of the different regions are considered independent given the segmentation, our generator can be decomposed into n generators denoted G_k(M^k, z_k), each one in charge of deciding the pixel values of one specific region. The complete image generation process thus operates in three steps:

1) M^1, . . . , M^n ← F(I)   (composition step)
2) V^k ← G_k(M^k, z_k) for k ∈ {1, . . . , n}   (drawing step)
3) G_F(I, z_1, . . . , z_n) = \sum_{k=1}^{n} M^k \odot V^k   (assembling step).

Provided the functions F and G_k are differentiable, they can be learned by solving the following adversarial problem:

\min_{G_F} \max_{D} \; L = \mathbb{E}_{I \sim p_{data}}\big[\log D(I)\big] + \mathbb{E}_{I \sim p_{data},\, z_1, \dots, z_n \sim p(z)}\big[\log(1 - D(G_F(I, z_1, \dots, z_n)))\big].

For differentiability, in practice we have F output soft masks in [0, 1] instead of binary masks. Also, in line with the recent GAN literature [3, 39, 50, 57], we use the hinge version of the adversarial loss [35, 50] instead, and obtain the following formulation:

\max_{G_F} \; L_G = \mathbb{E}_{I \sim p_{data},\, z_1, \dots, z_n \sim p(z)}\big[D(G_F(I, z_1, \dots, z_n))\big]
\max_{D} \; L_D = \mathbb{E}_{I \sim p_{data}}\big[\min(0, -1 + D(I))\big] + \mathbb{E}_{I \sim p_{data},\, z_1, \dots, z_n \sim p(z)}\big[\min(0, -1 - D(G_F(I, z_1, \dots, z_n)))\big].

Still, as it stands, the learning process of this model may fail for two reasons. First, it does not have to extract a meaningful segmentation with regard to the input I: since the values of all output pixels are regenerated, I can be ignored entirely while still producing plausible pictures. For instance, the segmentation could be the same for all inputs regardless of I. Second, it naturally converges to a trivial extractor F that puts the whole image into a single region, the other regions being empty. We thus have to add additional constraints to our model.

Constraining mask extraction by redrawing a single region. The first constraint aims at forcing the model to extract meaningful region masks instead of ignoring the image. To this end, we take advantage of the assumption that the different objects are generated independently. We can, therefore, replace only one region at each iteration instead of regenerating all the regions. Since the generator now has to reuse original pixel values from the image in the reassembled image, it cannot make arbitrary splits. The generation process becomes:

1) M^1, . . . , M^n ← F(I)   (composition step)
2) V^k ← I for k ∈ {1, . . . , n} \ {i}, and V^i ← G_i(M^i, z_i)   (drawing step)
3) G_F(I, z_i, i) = \sum_{k=1}^{n} M^k \odot V^k   (assembling step),

where i designates the index of the only region to redraw and is sampled from U(n), the discrete uniform distribution on {1, . . . , n}. The new learning objectives are:

\max_{G_F} \; L_G = \mathbb{E}_{I \sim p_{data},\, i \sim U(n),\, z_i \sim p(z)}\big[D(G_F(I, z_i, i))\big]
\max_{D} \; L_D = \mathbb{E}_{I \sim p_{data}}\big[\min(0, -1 + D(I))\big] + \mathbb{E}_{I \sim p_{data},\, i \sim U(n),\, z_i \sim p(z)}\big[\min(0, -1 - D(G_F(I, z_i, i)))\big].
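The sketch below illustrates this redrawing step and the corresponding hinge losses for one training iteration. It is a simplified illustration: F, G, and D are assumed to be PyTorch modules with the interfaces described in the comments, the names and shapes are ours, and the information-conservation term introduced in the next paragraph is omitted.

import torch

def redraw_region(F, G, image, z, i):
    # image: (batch, C, H, W); z: (batch, z_dim); i: index of the region to redraw.
    masks = F(image)              # (batch, n_regions, 1, H, W), soft, sums to 1 over regions
    mask_i = masks[:, i]          # (batch, 1, H, W)
    redrawn = G(mask_i, z, i)     # new pixel values V^i for region i, (batch, C, H, W)
    # Original pixels are kept everywhere except in region i, where the redrawn content is pasted.
    return (1.0 - mask_i) * image + mask_i * redrawn, masks

def hinge_losses(D, real, fake):
    # L_D = E[min(0, -1 + D(real))] + E[min(0, -1 - D(fake))], maximized over D;
    # the generator maximizes D(fake). Both are returned as losses to minimize.
    loss_d = -(torch.clamp(D(real) - 1.0, max=0.0).mean()
               + torch.clamp(-D(fake) - 1.0, max=0.0).mean())
    loss_g = -D(fake).mean()
    return loss_d, loss_g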
Conservation of Region Information. The second constraint is that, given a region i generated from a latent vector z_i, the final image G_F(I, z_i, i) must contain information about z_i. This constraint is designed to prevent the mask extractor F from producing empty regions. Indeed, if region i is empty, i.e., M^i_{x,y} = 0 for all (x, y), then z_i cannot be recovered from the final image; equivalently, if z_i can be recovered, then region i is not empty. This information conservation constraint is implemented through an additional term in the loss function. Let us denote by δ_k a function whose objective is to infer the value of z_k from any image I. Such a function can be learned jointly with the generator to promote conservation of information. This strategy is similar to the mutual information maximization used in InfoGAN [9].

The final complete process is illustrated in Figure 1 and corresponds to the following learning objectives:

\max_{G_F, \delta} \; L_G = \mathbb{E}_{I \sim p_{data},\, i \sim U(n),\, z_i \sim p(z)}\big[D(G_F(I, z_i, i)) - \lambda_z \lVert \delta_i(G_F(I, z_i, i)) - z_i \rVert_2^2\big]
\max_{D} \; L_D = \mathbb{E}_{I \sim p_{data}}\big[\min(0, -1 + D(I))\big] + \mathbb{E}_{I \sim p_{data},\, i \sim U(n),\, z_i \sim p(z)}\big[\min(0, -1 - D(G_F(I, z_i, i)))\big],

where λ_z is a fixed hyper-parameter that controls the strength of the information conservation constraint. Note that this constraint is necessary for our model to find non-trivial solutions, as otherwise putting the whole image into a single region is both optimal and easy for the neural networks to discover. The final learning algorithm follows the classical GAN scheme [3, 17, 39, 57] by alternating between the generator and discriminator updates presented in Algorithm 1.

Figure 1: Example generation with G_F(I, z_i, i) for i = 1 and n = 2. Learned functions are in color.

Algorithm 1 Network update functions

procedure GeneratorUpdate
    sample data I ∼ p_data
    sample region i ∼ Uniform({1, . . . , n})
    sample noise vector z_i ∼ p(z)
    I_gen ← G_F(I, z_i, i)                          ▷ generate image
    L_z ← −||δ_i(I_gen) − z_i||^2                   ▷ information conservation loss
    L_G ← D(I_gen)                                  ▷ adversarial loss
    update G_F with ∇_{G_F}[L_G + L_z]
    update δ_i with ∇_{δ_i} L_z

procedure DiscriminatorUpdate
    sample data points I_real, I_input ∼ p_data
    sample region i ∼ Uniform({1, . . . , n})
    sample noise vector z_i ∼ p(z)
    I_gen ← G_F(I_input, z_i, i)                    ▷ generate image
    L_D ← min(0, −1 + D(I_real)) + min(0, −1 − D(I_gen))   ▷ adversarial loss
    update D with ∇_D L_D

4 Implementation

We now provide some information about the architecture of the different components (additional details are given in the Supplementary materials). As usual with GAN-based methods, the choice of a good architecture is crucial. We build on the GAN and image segmentation literature and take inspiration from the neural network architectures they propose.

For the mask generator F, we use an architecture inspired by PSPNet [58]. It is a fully convolutional neural network similar to the one used in image-to-image translation [60], to which we add a Pyramid Pooling Module [58] whose goal is to gather information at different scales via pooling layers. The final representation of a given pixel is thus encouraged to contain local, regional, and global information at the same time.

The region generators G_k, the discriminator D, and the network δ that reconstructs z are based on SAGAN [57], which is frequently used in the recent GAN literature [3, 37]. Notably, we use spectral normalization [39] for weight regularization in all networks except the mask provider F, and we use self-attention [57] in G_k and D to handle non-local relations. To both promote stochasticity in our generators and encourage our latent code z to encode texture and colors, we also use conditional batch-normalization in G_k; a minimal sketch of such a layer is given below.
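For concreteness, the sketch below shows one common way to implement a conditional batch-normalization layer, in which the per-channel scale and shift are predicted from the latent code z_i. It illustrates the general technique only; the class and attribute names are ours, and the exact layer used in the released code may differ.

import torch
import torch.nn as nn

class ConditionalBatchNorm2d(nn.Module):
    # Normalizes feature maps, then applies a scale and shift predicted from the latent code z.
    def __init__(self, num_features, z_dim):
        super().__init__()
        self.bn = nn.BatchNorm2d(num_features, affine=False)  # affine parameters come from z instead
        self.gamma = nn.Linear(z_dim, num_features)            # per-channel scale
        self.beta = nn.Linear(z_dim, num_features)              # per-channel shift

    def forward(self, x, z):
        # x: (batch, num_features, H, W); z: (batch, z_dim)
        h = self.bn(x)
        gamma = 1.0 + self.gamma(z).unsqueeze(-1).unsqueeze(-1)  # centered around the identity
        beta = self.beta(z).unsqueeze(-1).unsqueeze(-1)
        return gamma * h + beta

In the region generators, such layers let the noise vector z_i control the colors and textures of the redrawn object, while the mask constrains its shape and location.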
Conditional normalization emerged from style modeling for style transfer tasks [12, 43] and has since been used in GANs as a means to encode style and to improve stochasticity [1, 8, 54]. All parameters of the different δ_k functions are shared except for their last layers.

As is standard practice for GANs [3], we use orthogonal initialization [47] for our networks and ADAM [31] with β = (0, 0.9) as optimizer. Learning rates are set to 10^{-4}, except for the mask network F, which uses a smaller value of 10^{-5}. We sample noise vectors z_i of size 32 (size 16 for MNIST) from a N(0, Id) distribution. We used mini-batches of size 25 and ran each experiment on a single NVidia Tesla P100 GPU. Despite our conservation-of-information loss, the model can still collapse into generating empty masks in the early steps of training. While the regularization does alleviate the problem, we suppose that the mask generator F can collapse before the network δ has learned anything relevant and can act as a stabilizer. As these failures happen early and are easy to detect, we automatically restart the training when they arise.

We identified λ_z and the initialization scheme as critical hyper-parameters and focused our hyper-parameter search on those. More details, along with the specifics of the implementation used in our experiments, are provided as Supplementary materials. The code, dataset splits, and pre-trained models are also available open-source.1

5 Experiments

5.1 Datasets

We present results on three natural image datasets and one toy dataset. All images are resized and then cropped to 128 × 128.

The Flowers dataset [40, 41] is composed of 8189 images of flowers. The dataset is provided with a set of masks obtained via an automated method built specifically for flowers [40]. We split it into 6149 training images, 1020 validation images, and 1020 test images, and use the provided masks as ground truth for evaluation purposes only.

The Labeled Faces in the Wild dataset [25, 33] contains 13233 faces. A subpart of the funneled version [24] has been segmented and manually annotated [27], providing 2927 ground-truth masks. We use the non-annotated images as our training set. We split the annotated images between validation and test sets so that there is no overlap in the identities of the persons between the two sets. The test set is composed of 1600 images, and the validation set of 1327 images.

The Caltech-UCSD Birds 200 2011 (CUB-200-2011) dataset [53] contains 11788 photographs of birds. We use 10000 images for our training split, 1000 for the test split, and the rest for validation.

As a sanity check, we also build a toy dataset, colored-2-MNIST, in which each sample is composed of a uniform background on which we draw two colored MNIST [34] digits: one odd and one even. Odd and even digits have colors sampled from different distributions so that our model can learn to differentiate them. For this dataset, we set n = 3 as there are three components.

As an additional experiment, we also build a new dataset by fusing the Flowers and LFW datasets. This new Flowers+LFW dataset has more variability and contains different types of objects; a sketch of how such a combined training set can be assembled is given below.
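The combined dataset simply pools the unlabeled training images of both sources. Below is a minimal sketch assuming on-disk folders of images; the paths, class name, and loading details are illustrative and do not describe our released splits.

import glob
from PIL import Image
from torch.utils.data import Dataset

class FusedUnlabeledDataset(Dataset):
    # Pools the training images of two datasets (e.g. Flowers and LFW) into a single
    # unlabeled set; the model never sees which source an image came from.
    def __init__(self, roots, transform=None):
        self.paths = []
        for root in roots:
            self.paths += sorted(glob.glob(f"{root}/*.jpg"))
        self.transform = transform

    def __len__(self):
        return len(self.paths)

    def __getitem__(self, idx):
        image = Image.open(self.paths[idx]).convert("RGB")
        return self.transform(image) if self.transform else image

# Example: resize then center-crop to 128 x 128, as for the other datasets.
# from torchvision import transforms
# tf = transforms.Compose([transforms.Resize(128), transforms.CenterCrop(128), transforms.ToTensor()])
# data = FusedUnlabeledDataset(["flowers/train", "lfw/train"], transform=tf)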
We used this\ndataset to demonstrate that ReDO can work without label information on problems with multiple\ncategories of objects.\n\n5.2 Results\n\nTo evaluate our method ReDO, we use two metrics commonly used for segmentation tasks. The pixel\nclassi\ufb01cation accuracy (Acc) measures the proportion of pixels that have been assigned to the correct\nregion. The intersection over union (IoU) is the ratio between the area of the intersection between\nthe inferred mask and the ground truth over the area of their union. In both cases, higher is better.\nBecause ReDO is unsupervised and we can\u2019t control which output region corresponds to which\nobject or background in the image, we compute our evaluation based on the regions permutation\nthat matches the ground truth the best. For model selection, we used IoU computed on a held out\nlabeled validation set. When available, we present our evaluation on both the training set and a test\nset as, in an unsupervised setting, both can be relevant depending on the speci\ufb01c use case. Results are\npresented in Table 1 and show that ReDO achieves reasonable performance on the three real-world\ndatasets.\n\n1https://github.com/mickaelChen/ReDO\n\n7\n\n\fFigure 2: Generated samples (not cherry-picked, zoom in for better visibility). For each dataset, the\ncolumns are from left to right: 1) input images, 2) ground truth masks, 3) masks inferred by the\nmodel for object one, 4-7) generation by redrawing object one, 8-11) generation by redrawing object\ntwo. As we keep the same zi on any given column, the color and texture of the redrawn object is kept\nconstant across rows. More samples are provided in Supplementary materials. Faces from the LFW\ndataset have been anonymized, in vizualisations only, to protect personality rights.\n\nFigure 3: Results on LFW + Flowers dataset, arranged as in Figure 2. As z is kept constant on a\ncolumn across all rows, we can observe that z codes for different textures depending on the class of\nthe image even though the generator is never given this information explicitly. Faces from the LFW\ndataset have been anonymized, in vizualisations only, to protect personality rights.\n\nWe also compared the performance of ReDO, which is unsupervised, with a supervised method,\nkeeping the same architecture for F in both cases. We analyze how many training samples are needed\nto reach the performance of the unsupervised model (see Figure 4). One can see that the unsupervised\nresults are in the range of the ones obtained with a supervised method, and usually outperform\nsupervised models trained with less than around 50 or 100 examples depending on the dataset. For\ninstance, on the LFW Dataset, the unsupervised model obtains about 92% of accuracy and 79% IoU\nand the supervised model needs 50-60 labeled examples to reach similar performance.\nWe provide random samples of extracted masks (Figure 2) and the corresponding generated images\nwith a redrawn object or background. Note that our objective is not to generate appealing images but\nto learn an object segmentation function. Therefore, ReDO generates images that are less realistic\nthan the ones generated by state-of-the-art GANs. Focus is, instead, put on the extracted masks, and\nwe can see the good quality of the obtained segmentation in many cases. Best and worst masks, as\nwell as more random samples, are displayed in Supplementary materials.\nWe also trained ReDO on the fused Flowers+LFW dataset without labels. 
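Scores for this experiment are computed as for the other datasets: since we cannot control which output region corresponds to the object, we keep the region-to-ground-truth assignment that matches best. A minimal sketch of this best-permutation evaluation for the two-region case is given below (NumPy assumed; the helper names are ours, not those of the released code).

import numpy as np

def iou(pred, gt):
    # pred, gt: boolean masks of shape (H, W) for a single region.
    inter = np.logical_and(pred, gt).sum()
    union = np.logical_or(pred, gt).sum()
    return inter / union if union > 0 else 1.0

def best_permutation_scores(pred_obj, gt_obj):
    # pred_obj, gt_obj: boolean object masks (the background is their complement).
    # Region 1 may correspond to either the object or the background, so both
    # labellings are tried and the one matching the ground truth best is kept.
    scores = []
    for candidate in (pred_obj, np.logical_not(pred_obj)):
        acc = (candidate == gt_obj).mean()       # pixel classification accuracy
        scores.append((acc, iou(candidate, gt_obj)))
    return max(scores, key=lambda s: s[1])       # keep the permutation with the best IoU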
We directly re-used the hyper-parameters we had used to fit the Flowers dataset, without further tuning, and obtained, as preliminary results, a reasonable accuracy of 0.856 and an IoU of 0.691. This shows that ReDO is able to infer class information from masks even in a fully unsupervised setup. Samples are displayed in Figure 3.

Dataset         Train Acc        Train IoU        Test Acc         Test IoU
LFW             -                -                0.917 ± 0.002    0.781 ± 0.005
CUB             0.840 ± 0.012    0.423 ± 0.023    0.845 ± 0.012    0.426 ± 0.025
Flowers*        0.886 ± 0.008    0.780 ± 0.012    0.879 ± 0.008    0.764 ± 0.012
Flowers+LFW     -                -                0.856            0.691

Table 1: Performance of ReDO in accuracy (Acc) and intersection over union (IoU) on retrieved masks. Means and standard deviations are computed over five runs with fixed hyper-parameters. LFW train-set scores are not available since we trained on unlabeled images. *Note that the segmentations provided with the original Flowers dataset [41] were obtained using an automated method; we display the samples with the largest disagreement between ReDO masks and this ground truth in the Supplementary materials, and in those cases we find ours to provide better masks.

Figure 4: Comparison with a supervised baseline as a function of the number of available training samples (test-set accuracy, top, and IoU, bottom, on the LFW, CUB, and Flowers datasets; the number of training samples is on a logarithmic scale).

6 Conclusion

We presented a novel method called ReDO for the unsupervised learning of image segmentation. Our proposal is based on the assumption that if a segmentation model is accurate, then one could edit any real image by replacing any segmented object in a scene with another, randomly generated one, and the result would still be a realistic image. This principle allows casting the unsupervised learning of image segmentation as an adversarial learning problem. Our experimental results on three datasets show that this principle works. In particular, our segmentation model is competitive with supervised approaches trained on a few hundred labeled examples.

Our future work will focus on handling more complex and diverse scenes. As mentioned in Section 2, our model could generalize to an arbitrary number of objects and to objects of unknown classes via an iterative design and/or class-agnostic generators. Currently, we are mostly limited by our ability to effectively train GANs in those more complicated settings, but rapid advances in image generation [3, 29, 37] make it a reasonable goal to pursue in the near future. Meanwhile, we will investigate the use of the model in a semi-supervised or weakly-supervised setup. Indeed, additional information would allow us to guide our model on harder datasets while requiring fewer labels than fully supervised approaches. Conversely, our model could act as a regularizer by providing a prior for any segmentation task.

Acknowledgments

This work was supported by the French National Research Agency projects LIVES (grant number ANR-15-CE23-0026-03) and "Deep in France" (grant number ANR-16-CE23-0006).

References

[1] Amjad Almahairi, Sai Rajeshwar, Alessandro Sordoni, Philip Bachman, and Aaron Courville.
Augmented\ncyclegan: Learning many-to-many mappings from unpaired data. In International Conference on Machine\nLearning, pages 195\u2013204, 2018.\n\n[2] P. Arbel\u00e1ez, J. Pont-Tuset, J. Barron, F. Marques, and J. Malik. Multiscale combinatorial grouping. In\n\nComputer Vision and Pattern Recognition, 2014.\n\n[3] Andrew Brock, Jeff Donahue, and Karen Simonyan. Large scale GAN training for high \ufb01delity natural\n\nimage synthesis. In International Conference on Learning Representations, 2019.\n\n[4] Christopher P Burgess, Loic Matthey, Nicholas Watters, Rishabh Kabra, Irina Higgins, Matt Botvinick,\nand Alexander Lerchner. Monet: Unsupervised scene decomposition and representation. arXiv preprint\narXiv:1901.11390, 2019.\n\n[5] Liang-Chieh Chen, Yukun Zhu, George Papandreou, Florian Schroff, and Hartwig Adam. Encoder-decoder\nwith atrous separable convolution for semantic image segmentation. In Proceedings of the European\nConference on Computer Vision (ECCV), pages 801\u2013818, 2018.\n\n[6] Mickael Chen, Ludovic Denoyer, and Thierry Arti\u00e8res. Multi-view data generation without view supervi-\n\nsion. In International Conference on Learning Representations, 2018.\n\n[7] Tian Qi Chen, Xuechen Li, Roger B Grosse, and David K Duvenaud. Isolating sources of disentanglement\nin variational autoencoders. In Advances in Neural Information Processing Systems, pages 2610\u20132620,\n2018.\n\n[8] Ting Chen, Mario Lucic, Neil Houlsby, and Sylvain Gelly. On self modulation for generative adversarial\n\nnetworks. In International Conference on Learning Representations, 2019.\n\n[9] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel.\n\nInfogan:\nInterpretable representation learning by information maximizing generative adversarial nets. In Advances\nin neural information processing systems, pages 2172\u20132180, 2016.\n\n[10] Emily L Denton and vighnesh Birodkar. Unsupervised learning of disentangled representations from video.\nIn I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors,\nAdvances in Neural Information Processing Systems 30, pages 4414\u20134423. Curran Associates, Inc., 2017.\n\n[11] Chris Donahue, Akshay Balsubramani, Julian McAuley, and Zachary C. Lipton. Semantically decom-\nposing the latent spaces of generative adversarial networks. In International Conference on Learning\nRepresentations, 2018.\n\n[12] Vincent Dumoulin, Jonathon Shlens, and Manjunath Kudlur. A learned representation for artistic style.\n\nProc. of ICLR, 2, 2017.\n\n[13] Thibaut Durand, Taylor Mordan, Nicolas Thome, and Matthieu Cord. Wildcat: Weakly supervised learning\nof deep convnets for image classi\ufb01cation, pointwise localization and segmentation. In The IEEE Conference\non Computer Vision and Pattern Recognition (CVPR), 2017.\n\n[14] SM Ali Eslami, Nicolas Heess, Theophane Weber, Yuval Tassa, David Szepesvari, Geoffrey E Hinton, et al.\nAttend, infer, repeat: Fast scene understanding with generative models. In Advances in Neural Information\nProcessing Systems, pages 3225\u20133233, 2016.\n\n[15] Babak Esmaeili, Hao Wu, Sarthak Jain, Alican Bozkurt, N Siddharth, Brooks Paige, Dana H Brooks,\nJennifer Dy, and Jan-Willem Meent. Structured disentangled representations. In The 22nd International\nConference on Arti\ufb01cial Intelligence and Statistics, pages 2525\u20132534, 2019.\n\n[16] Yaroslav Ganin and Victor Lempitsky. Unsupervised domain adaptation by backpropagation. 
In Proceed-\nings of the 32nd International Conference on International Conference on Machine Learning-Volume 37,\npages 1180\u20131189. JMLR. org, 2015.\n\n[17] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron\nCourville, and Yoshua Bengio. Generative adversarial nets. In Advances in neural information processing\nsystems, pages 2672\u20132680, 2014.\n\n[18] Klaus Greff, Rapha\u00ebl Lopez Kaufmann, Rishab Kabra, Nick Watters, Chris Burgess, Daniel Zoran, Loic\nMatthey, Matthew Botvinick, and Alexander Lerchner. Multi-object representation learning with iterative\nvariational inference. arXiv preprint arXiv:1903.00450, 2019.\n\n10\n\n\f[19] Karol Gregor, Ivo Danihelka, Alex Graves, Danilo Jimenez Rezende, and Daan Wierstra. Draw: a recurrent\nneural network for image generation. In Proceedings of the 32nd International Conference on International\nConference on Machine Learning-Volume 37, pages 1462\u20131471. JMLR. org, 2015.\n\n[20] Kaiming He, Georgia Gkioxari, Piotr Doll\u00e1r, and Ross Girshick. Mask r-cnn. In Proceedings of the IEEE\n\ninternational conference on computer vision, pages 2961\u20132969, 2017.\n\n[21] Irina Higgins, Lo\u00efc Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir\nMohamed, and Alexander Lerchner. beta-vae: Learning basic visual concepts with a constrained variational\nframework. In 5th International Conference on Learning Representations, ICLR 2017, 2017.\n\n[22] Kuang-Jui Hsu, Yen-Yu Lin, and Yung-Yu Chuang. Co-attention cnns for unsupervised object co-\n\nsegmentation. In IJCAI, pages 748\u2013756, 2018.\n\n[23] Kuang-Jui Hsu, Yen-Yu Lin, and Yung-Yu Chuang. Deepco 3: Deep instance co-segmentation by co-\npeak search and co-saliency detection. In Proceedings of Conference on Computer Vision and Pattern\nRecognition (CVPR), 2019.\n\n[24] Gary B. Huang, Vidit Jain, and Erik Learned-Miller. Unsupervised joint alignment of complex images. In\n\nICCV, 2007.\n\n[25] Gary B. Huang, Manu Ramesh, Tamara Berg, and Erik Learned-Miller. Labeled faces in the wild: A\ndatabase for studying face recognition in unconstrained environments. Technical Report 07-49, University\nof Massachusetts, Amherst, October 2007.\n\n[26] Xu Ji, Jo\u00e3o F Henriques, and Andrea Vedaldi. Invariant information distillation for unsupervised image\n\nsegmentation and clustering. In International Conference on Computer Vision (ICCV), 2019.\n\n[27] Andrew Kae, Kihyuk Sohn, Honglak Lee, and Erik Learned-Miller. Augmenting CRFs with Boltzmann\n\nmachine shape priors for image labeling. In CVPR, 2013.\n\n[28] Asako Kanezaki. Unsupervised image segmentation by backpropagation.\n\nIn Proceedings of IEEE\n\nInternational Conference on Acoustics, Speech, and Signal Processing (ICASSP), 2018.\n\n[29] Tero Karras, Samuli Laine, and Timo Aila. A style-based generator architecture for generative adversarial\n\nnetworks. arXiv preprint arXiv:1812.04948, 2018.\n\n[30] Hyunjik Kim and Andriy Mnih. Disentangling by factorising. In International Conference on Machine\n\nLearning, pages 2654\u20132663, 2018.\n\n[31] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint\n\narXiv:1412.6980, 2014.\n\n[32] Guillaume Lample, Neil Zeghidour, Nicolas Usunier, Antoine Bordes, Ludovic Denoyer, et al. Fader\nnetworks: Manipulating images by sliding attributes. In Advances in Neural Information Processing\nSystems, pages 5967\u20135976, 2017.\n\n[33] Gary B. Huang Erik Learned-Miller. 
Labeled faces in the wild: Updates and new reporting procedures.\n\nTechnical Report UM-CS-2014-003, University of Massachusetts, Amherst, May 2014.\n\n[34] Yann LeCun, L\u00e9on Bottou, Yoshua Bengio, Patrick Haffner, et al. Gradient-based learning applied to\n\ndocument recognition. Proceedings of the IEEE, 86(11):2278\u20132324, 1998.\n\n[35] Jae Hyun Lim and Jong Chul Ye. Geometric gan. arXiv preprint arXiv:1705.02894, 2017.\n\n[36] Mingsheng Long, Zhangjie Cao, Jianmin Wang, and Michael I Jordan. Conditional adversarial domain\n\nadaptation. In Advances in Neural Information Processing Systems, pages 1640\u20131650, 2018.\n\n[37] Mario Lucic, Michael Tschannen, Marvin Ritter, Xiaohua Zhai, Olivier Bachem, and Sylvain Gelly.\n\nHigh-\ufb01delity image generation with fewer labels. arXiv preprint arXiv:1903.02271, 2019.\n\n[38] Michael F Mathieu, Junbo Jake Zhao, Junbo Zhao, Aditya Ramesh, Pablo Sprechmann, and Yann LeCun.\nDisentangling factors of variation in deep representation using adversarial training. In Advances in Neural\nInformation Processing Systems, pages 5040\u20135048, 2016.\n\n[39] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for\n\ngenerative adversarial networks. In International Conference on Learning Representations, 2018.\n\n[40] Maria-Elena Nilsback and Andrew Zisserman. Delving into the whorl of \ufb02ower segmentation. In BMVC,\n\nvolume 2007, pages 1\u201310, 2007.\n\n11\n\n\f[41] Maria-Elena Nilsback and Andrew Zisserman. Automated \ufb02ower classi\ufb01cation over a large number of\nclasses. In 2008 Sixth Indian Conference on Computer Vision, Graphics & Image Processing, pages\n722\u2013729. IEEE, 2008.\n\n[42] Pavel Ostyakov, Roman Suvorov, Elizaveta Logacheva, Oleg Khomenko, and Sergey I Nikolenko. Seigan:\nTowards compositional image generation by simultaneously learning to segment, enhance, and inpaint.\narXiv preprint arXiv:1811.07630, 2018.\n\n[43] Ethan Perez, Florian Strub, Harm De Vries, Vincent Dumoulin, and Aaron Courville. Film: Visual\nreasoning with a general conditioning layer. In Thirty-Second AAAI Conference on Arti\ufb01cial Intelligence,\n2018.\n\n[44] Trung T Pham, Thanh-Toan Do, Niko S\u00fcnderhauf, and Ian Reid. Scenecut: joint geometric and object\nsegmentation for indoor scenes. In 2018 IEEE International Conference on Robotics and Automation\n(ICRA), pages 1\u20139. IEEE, 2018.\n\n[45] Tal Remez, Jonathan Huang, and Matthew Brown. Learning to segment via cut-and-paste. In Proceedings\n\nof the European Conference on Computer Vision (ECCV), pages 37\u201352, 2018.\n\n[46] Carsten Rother, Tom Minka, Andrew Blake, and Vladimir Kolmogorov. Cosegmentation of image pairs\nby histogram matching-incorporating a global constraint into mrfs. In 2006 IEEE Computer Society\nConference on Computer Vision and Pattern Recognition (CVPR\u201906), volume 1, pages 993\u20131000. IEEE,\n2006.\n\n[47] Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of\nlearning in deep linear neural networks. In 2nd International Conference on Learning Representations,\nICLR 2014, 2014.\n\n[48] Othman Sbai, Camille Couprie, and Mathieu Aubry. Vector image generation by learning parametric layer\n\ndecomposition. arXiv preprint arXiv:1812.05484, 2018.\n\n[49] Nathan Silberman, Derek Hoiem, Pushmeet Kohli, and Rob Fergus. Indoor segmentation and support\ninference from rgbd images. In European Conference on Computer Vision, pages 746\u2013760. 
Springer, 2012.\n\n[50] Dustin Tran, Rajesh Ranganath, and David M Blei. Deep and hierarchical implicit models. arXiv preprint\n\narXiv:1702.08896, 7, 2017.\n\n[51] Eric Tzeng, Judy Hoffman, Kate Saenko, and Trevor Darrell. Adversarial discriminative domain adaptation.\nIn Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, pages 7167\u20137176,\n2017.\n\n[52] Sjoerd van Steenkiste, Karol Kurach, and Sylvain Gelly. A case for object compositionality in deep\n\ngenerative models of images. arXiv preprint arXiv:1810.10340, 2018.\n\n[53] C. Wah, S. Branson, P. Welinder, P. Perona, and S. Belongie. The Caltech-UCSD Birds-200-2011 Dataset.\n\nTechnical Report CNS-TR-2011-001, California Institute of Technology, 2011.\n\n[54] Xiaolong Wang and Abhinav Gupta. Generative image modeling using style and structure adversarial\n\nnetworks. In European Conference on Computer Vision, pages 318\u2013335. Springer, 2016.\n\n[55] Xide Xia and Brian Kulis. W-net: A deep model for fully unsupervised image segmentation. arXiv preprint\n\narXiv:1711.08506, 2017.\n\n[56] Jianwei Yang, Anitha Kannan, Dhruv Batra, and Devi Parikh. LR-GAN: layered recursive generative\nadversarial networks for image generation. In 5th International Conference on Learning Representations,\nICLR 2017, 2017.\n\n[57] Han Zhang, Ian Goodfellow, Dimitris Metaxas, and Augustus Odena. Self-attention generative adversarial\n\nnetworks. arXiv preprint arXiv:1805.08318, 2018.\n\n[58] Hengshuang Zhao, Jianping Shi, Xiaojuan Qi, Xiaogang Wang, and Jiaya Jia. Pyramid scene parsing\nnetwork. In Proceedings of the IEEE conference on computer vision and pattern recognition, pages\n2881\u20132890, 2017.\n\n[59] Yanzhao Zhou, Yi Zhu, Qixiang Ye, Qiang Qiu, and Jianbin Jiao. Weakly supervised instance segmentation\nusing class peak response. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR),\nJune 2018.\n\n[60] Jun-Yan Zhu, Taesung Park, Phillip Isola, and Alexei A Efros. Unpaired image-to-image translation using\ncycle-consistent adversarial networks. In Proceedings of the IEEE international conference on computer\nvision, pages 2223\u20132232, 2017.\n\n12\n\n\f", "award": [], "sourceid": 6913, "authors": [{"given_name": "Micka\u00ebl", "family_name": "Chen", "institution": "Sorbonne Universit\u00e9"}, {"given_name": "Thierry", "family_name": "Arti\u00e8res", "institution": "Aix-Marseille Universit\u00e9"}, {"given_name": "Ludovic", "family_name": "Denoyer", "institution": "Facebook - FAIR"}]}