{"title": "Deep Set Prediction Networks", "book": "Advances in Neural Information Processing Systems", "page_first": 3212, "page_last": 3222, "abstract": "Current approaches for predicting sets from feature vectors ignore the unordered nature of sets and suffer from discontinuity issues as a result. We propose a general model for predicting sets that properly respects the structure of sets and avoids this problem. With a single feature vector as input, we show that our model is able to auto-encode point sets, predict the set of bounding boxes of objects in an image, and predict the set of attributes of these objects.", "full_text": "Deep Set Prediction Networks\n\nYan Zhang\n\nJonathon Hare\n\nyz5n12@ecs.soton.ac.uk\n\njsh2@ecs.soton.ac.uk\n\nUniversity of Southampton\n\nUniversity of Southampton\n\nSouthampton, UK\n\nSouthampton, UK\n\nAdam Pr\u00fcgel-Bennett\n\nUniversity of Southampton\n\nSouthampton, UK\n\napb@ecs.soton.ac.uk\n\nAbstract\n\nCurrent approaches for predicting sets from feature vectors ignore the unordered\nnature of sets and suffer from discontinuity issues as a result. We propose a general\nmodel for predicting sets that properly respects the structure of sets and avoids this\nproblem. With a single feature vector as input, we show that our model is able to\nauto-encode point sets, predict the set of bounding boxes of objects in an image,\nand predict the set of attributes of these objects.\n\n1\n\nIntroduction\n\nYou are given a rotation angle and your task is to draw the four corner points of a square that is\nrotated by that amount. This is a structured prediction task where the output is a set, since there is no\ninherent ordering to the four points. Such sets are a natural representation for many kinds of data,\nranging from the set of points in a point cloud, to the set of objects in an image (object detection), to\nthe set of nodes in a molecular graph (molecular generation). 
Yet, existing machine learning models often struggle to solve even the simple square prediction task [30].

The main difficulty in predicting sets comes from the ability to permute the elements in a set freely, which means that there are n! equally good solutions for a set of size n. Models that do not take this set structure into account properly (such as MLPs or RNNs) result in discontinuities, which is the reason why they struggle to solve even simple toy set prediction tasks [30]. We give background on the problem in section 2.

How can we build a model that properly respects the set structure of the problem, so that we can predict sets without running into discontinuity issues? In this paper, we aim to address this question. Concretely, we contribute the following:

1. We propose a model (section 3, Algorithm 1) that can predict a set from a feature vector (vector-to-set) while properly taking the structure of sets into account. We explain which properties we make use of to enable this. Our model uses backpropagation through a set encoder to decode a set and works for variable-size sets. The model is applicable to a wide variety of set prediction tasks since it only requires a feature vector as input.

2. We evaluate our model on several set prediction datasets (section 5). First, we demonstrate that the auto-encoder version of our model is sound on a set version of MNIST. Next, we use the CLEVR dataset to show that this works for general set prediction tasks. We predict the set of bounding boxes of objects in an image and we predict the set of object attributes in an image, both from a single feature vector.
Our model is a completely different approach from the usual anchor-based object detectors, because we pose the task as a set prediction problem, which does not need complicated post-processing techniques such as non-maximum suppression.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

2 Background

Representation  We are interested in sets of feature vectors, with the feature vector describing properties of the element, for example the 2d position of a point in a point cloud. A set of size n wherein each feature vector has dimensionality d is represented as a matrix Y ∈ R^(d×n) with the elements as columns in an arbitrary order, Y = [y_1, ..., y_n]. To properly treat this as a set, it is important to only apply operations with certain properties to it [29]: permutation-invariance or permutation-equivariance. In other words, operations on sets should not rely on the arbitrary ordering of the elements.

Set encoders (which turn such sets into feature vectors) are usually built by composing permutation-equivariant operations with a permutation-invariant operation at the end. A simple example is the model in [29]: f(Y) = Σ_i g(y_i), where g is a neural network. Because g is applied to every element individually, it does not rely on the arbitrary order of the elements. We can think of this as turning the set {y_i}_{i=1}^n into {g(y_i)}_{i=1}^n. This is permutation-equivariant because changing the order of elements in the input set affects the output set in a predictable way. Next, the set is summed to produce a single feature vector. Since summing is commutative, the output is the same regardless of what order the elements are in. In other words, summing is permutation-invariant. This gives us an encoder that produces the same feature vector regardless of the arbitrary order the set elements were stored in.

Loss  In set prediction tasks, we need to compute a loss between a predicted set Ŷ = [ŷ_1, ..., ŷ_n] and the target set Y. The main problem is that the elements of each set are in an arbitrary order, so we cannot simply compute a pointwise distance. The usual solution to this is an assignment mechanism that matches up elements from one set to the other set. This gives us a loss function that is permutation-invariant in both its arguments.

One such loss is the O(n²) Chamfer loss, which matches up every element of Ŷ to the closest element in Y and vice versa:

    L_cha(Ŷ, Y) = Σ_i min_j ||ŷ_i − y_j||² + Σ_j min_i ||ŷ_i − y_j||²    (1)

Note that this does not work well for multi-sets: the loss between [a, a, b] and [a, b, b] is 0. A more sophisticated loss that does not have this problem involves the linear assignment problem with the pairwise losses as assignment costs:

    L_hun(Ŷ, Y) = min_{π∈Π} Σ_i ||ŷ_i − y_{π(i)}||²    (2)

where Π is the space of permutations, which can be solved with the Hungarian algorithm in O(n³) time. This has the benefit that every element in one set is associated to exactly one element in the other set, which is not the case for the Chamfer loss.

Responsibility problem  A widely-used approach is to simply ignore the set structure of the problem. A feature vector can be mapped to a set Ŷ by using an MLP that takes the vector as input and directly produces Ŷ with d × n outputs.
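The sum-pooling encoder and the two set losses in equations (1) and (2) can be sketched in a few lines of NumPy/SciPy. This is an illustrative sketch, not the paper's implementation; `g` stands in for an arbitrary per-element network, and sets are (d × n) matrices with elements as columns, as in the text:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

def encode(Y, g):
    """Permutation-invariant encoder in the style of Deep Sets:
    f(Y) = sum_i g(y_i), applied to the columns of the (d x n) matrix Y."""
    return sum(g(Y[:, i]) for i in range(Y.shape[1]))

def pairwise_sq_dists(Y_hat, Y):
    """(n x m) matrix of squared distances between columns of the two sets."""
    return ((Y_hat[:, :, None] - Y[:, None, :]) ** 2).sum(axis=0)

def chamfer_loss(Y_hat, Y):
    """O(n^2) Chamfer loss: match every element to its nearest
    neighbour in the other set, in both directions (Eq. 1)."""
    d = pairwise_sq_dists(Y_hat, Y)
    return d.min(axis=1).sum() + d.min(axis=0).sum()

def hungarian_loss(Y_hat, Y):
    """Assignment loss (Eq. 2): minimum over permutations pi of
    sum_i ||y_hat_i - y_pi(i)||^2, solved exactly in O(n^3)."""
    d = pairwise_sq_dists(Y_hat, Y)
    rows, cols = linear_sum_assignment(d)
    return d[rows, cols].sum()
```

On the multiset example from the text, with a = (0, 0) and b = (1, 0), comparing [a, a, b] against [a, b, b] gives a Chamfer loss of 0 but an assignment loss of 1, illustrating why the Hungarian loss is preferable for multi-sets.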
Algorithm 1  One forward pass of the set prediction algorithm within the training loop.

    1: z = F(x)                                         ▷ encode input with a model
    2: Ŷ(0) ← init                                      ▷ initialise set
    3: for t ← 1, T do
    4:     l ← L_repr(Ŷ(t−1), z)                        ▷ compute representation loss
    5:     Ŷ(t) ← Ŷ(t−1) − η ∂l/∂Ŷ(t−1)                 ▷ gradient descent step on the set
    6: end for
    7: predict Ŷ(T)
    8: L = (1/T) Σ_{t=0}^{T} L_set(Ŷ(t), Y) + λ L_repr(Y, z)    ▷ compute loss of outer optimisation

Since the order of elements in Ŷ does not matter, it appears reasonable to always produce them in a certain order based on the weights of the MLP. While this seems like a promising approach, [30] point out that this results in a discontinuity issue: there are points where a small change in set space requires a large change in the neural network outputs. The model needs to "decide" which of its outputs is responsible for producing which element, and this responsibility must be resolved discontinuously.

The intuition behind this is as follows. Consider an MLP that detects the colour of two otherwise identical objects present in an image, so it has two outputs with dimensionality 3 (R, G, B) corresponding to those two colours. We are given an image with a blue and red object, so let us say that output 1 predicts blue and output 2 predicts red; perhaps the weights of output 1 are more attuned to the blue channel and output 2 is more attuned to the red channel. We are given another image with a blue and green object, so it is reasonable for output 1 to again predict blue and output 2 to now predict green. When we now give the model an image with a red and green object, or two red objects, it is unclear which output should be responsible for predicting which object.
Output 2 "wants" to predict both red and green, but has to decide between one of them, and output 1 now has to be responsible for the other object while previously being a blue detector. This responsibility must be resolved discontinuously, which makes modeling sets with MLPs difficult [30].

The main problem is that there is a notion of output 1 and output 2 – an ordered output representation – in the first place, which forces the model to give the set an order. Instead, it would be better if the outputs of the model were freely interchangeable – in the same way the elements of the set are interchangeable – so as not to impose an order on the outputs. This is exactly what our model accomplishes.

3 Deep Set Prediction Networks

This section contains our primary contribution: a model for decoding a feature vector into a set of feature vectors. As we have previously established, it is important for the model to properly respect the set structure of the problem to avoid the responsibility problem.

Our main idea is based on the observation that the gradient of a set encoder with respect to the input set is permutation-equivariant (see proof in Appendix A): to decode a feature vector into a set, we can use gradient descent to find a set that encodes to that feature vector. Since each update of the set using the gradient is permutation-equivariant, we always properly treat it as a set and avoid the responsibility problem. This gives rise to a nested optimisation: an inner loop that changes a set to encode more similarly to the input feature vector, and an outer loop that changes the weights of the encoder to minimise a loss over a dataset.

With this idea in mind, we build up models of increasing usefulness for predicting sets. We start with the simplest case of auto-encoding fixed-size sets (subsection 3.1), where a latent representation is decoded back into a set.
This is then modified to support variable-size sets, which is necessary for most sets encountered in the real world. Lastly and most importantly, we extend our model to general set prediction tasks where the input no longer needs to be a set (subsection 3.2). This gives us a model that can predict a set of feature vectors from a single feature vector. We give the pseudo-code of this method in Algorithm 1.

3.1 Auto-encoding fixed-size sets

In a set auto-encoder, the goal is to turn the input set Y into a small latent space z = g_enc(Y) with the encoder g_enc and turn it back into the predicted set Ŷ = g_dec(z) with the decoder g_dec. Using our main idea, we define a representation loss and the corresponding decoder as:

    L_repr(Ŷ, z) = ||g_enc(Ŷ) − z||²    (3)

    g_dec(z) = arg min_Ŷ L_repr(Ŷ, z)    (4)

In essence, L_repr compares Ŷ to Y in the latent space. To understand what the decoder does, first consider the simple, albeit not very useful case of the identity encoder g_enc(Y) = Y. Solving g_dec(z) simply means setting Ŷ = Y, which perfectly reconstructs the input as desired.

When we instead choose g_enc to be a set encoder, the latent representation z is a permutation-invariant feature vector. If this representation is "good", Ŷ will only encode to a latent variable similar to that of Y if the two sets themselves are similar. Thus, the minimisation in Equation 4 should still produce a set Ŷ that is the same (up to permutation) as Y, except this has now been achieved with z as a bottleneck. Since the problem is non-convex when g_enc is a neural network, it is infeasible to solve Equation 4 exactly. Instead, we perform gradient descent to approximate a solution.
Starting from some initial set Ŷ(0), gradient descent is performed for a fixed number of steps T with the update rule:

    Ŷ(t+1) = Ŷ(t) − η · ∂L_repr(Ŷ(t), z) / ∂Ŷ(t)    (5)

with η as the learning rate and the prediction being the final state, g_dec(z) = Ŷ(T). This is the aforementioned inner optimisation loop. In practice, we let Ŷ(0) be a learnable R^(d×n) matrix which is part of the neural network parameters.

To obtain a good representation z, we still have to train the weights of g_enc. For this, we compute the auto-encoder objective L_set(Ŷ(T), Y) – with L_set = L_cha or L_hun – and differentiate with respect to the weights as usual, backpropagating through the steps of the inner optimisation. This is the aforementioned outer optimisation loop.

In summary, each forward pass of our auto-encoder first encodes the input set to a latent representation as normal. To decode this back into a set, gradient descent is performed on an initial guess with the aim of obtaining a set that encodes to the same latent representation as the input. The same set encoder is used in the encoding and decoding stages.

Variable-size sets  To extend this from fixed- to variable-size sets, we make a few modifications to this algorithm. First, we pad all sets to a fixed maximum size to allow for efficient batch computation. We then concatenate an additional mask feature m_i to each set element ŷ_i that indicates whether it is a regular element (m_i = 1) or padding (m_i = 0). With this modification to Ŷ, we can optimise the masks in the same way as the set elements are optimised. To ensure that masks stay in the valid range between 0 and 1, we simply clamp values above 1 to 1 and values below 0 to 0 after each gradient descent step.
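The inner loop of Equation 5, with the graph kept alive so the outer loss can backpropagate through the updates, can be sketched in PyTorch. This is a minimal sketch under stated assumptions: `encoder`, `eta` and `T` are stand-ins for g_enc, η and T, and the toy linear sum-pooling encoder and its small learning rate exist only to make the sketch runnable (the paper's hyperparameters differ):

```python
import torch

def decode(encoder, z, y0, eta=0.05, T=10):
    """Inner optimisation (Eq. 5): gradient-descend an initial set so
    that it encodes to the target representation z. Returns every
    intermediate set so the outer loss can be averaged over all steps."""
    y = y0
    states = [y]
    for _ in range(T):
        repr_loss = ((encoder(y) - z) ** 2).sum()  # L_repr(Y(t), z)
        # create_graph=True keeps the computation graph so the outer
        # loss can later backpropagate through all inner updates
        (grad,) = torch.autograd.grad(repr_loss, y, create_graph=True)
        y = y - eta * grad
        states.append(y)
    return states

# toy permutation-invariant encoder: per-element linear map, sum pooling
torch.manual_seed(0)
g = torch.nn.Linear(2, 8)
encoder = lambda y: g(y).sum(dim=0)  # y is an (n, d) set

target = torch.randn(3, 2)           # hypothetical target set Y
z = encoder(target).detach()         # target representation
y0 = torch.zeros(3, 2, requires_grad=True)
states = decode(encoder, z, y0)
# In training, the outer loss would be the average of L_set(Y(t), Y)
# over all states, plus the lambda * L_repr(Y, z) term of Algorithm 1.
```

Because the updates stay in the graph, calling `.backward()` on an outer loss built from `states` differentiates through all T inner steps into the encoder weights, which is exactly the nested optimisation described above.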
This performed better than using a sigmoid in our initial experiments, possibly because it allows exact 0s and 1s to be recovered.

3.2 Predicting sets from a feature vector

In our auto-encoder, we used an encoder both to produce the latent representation and to decode the set. This is no longer possible in the general set prediction setup, since the target representation z can come from a separate model (for example an image encoder F encoding an image x), so there is no longer a set encoder in the model.

When naïvely using z = F(x) as input to our decoder, our decoding process is unable to predict sets correctly from it. Because the set encoder is no longer shared in our set decoder, there is no guarantee that optimising g_enc(Ŷ) to match z converges towards Y (or a permutation thereof). To fix this, we simply add a term to the loss of the outer optimisation that encourages g_enc(Y) ≈ z again. In other words, the target set should itself have a very low representation loss. This gives us an additional L_repr term in the loss function of the outer optimisation for supervised learning:

    L = L_set(Ŷ, Y) + λ L_repr(Y, z)    (6)

with L_set again being either L_cha or L_hun. With this, minimising L_repr(Ŷ, z) in the inner optimisation will converge towards Y. The additional term is not necessary in the pure auto-encoder because z = g_enc(Y), so L_repr(Y, z) is always 0 already.

Practical tricks  For the outer optimisation, we can compute the set loss not only for Ŷ(T), but for all Ŷ(t). That is, we use the average set loss (1/T) Σ_t L_set(Ŷ(t), Y) as loss (similar to [4]). This encourages Ŷ to converge to Y quickly and not diverge with more steps, which significantly increases the robustness of our algorithm.

We sometimes observed divergent training behaviour when the outer learning rate is set inappropriately. By replacing the instances of ||·||² in L_set and L_repr with the Huber loss (squared error for differences below 1, absolute error above 1) – as is commonly done in object detection models – training became less sensitive to hyperparameter choices.

The inner optimisation can be modified to include a momentum term, which stops a prediction from oscillating around a solution. This gives us slightly better results, but we did not use it for any experiments to keep our method as simple as possible.

It is possible to explicitly include the sum of masks as a feature in the representation z for our model. This improves our results on MNIST – likely due to the explicit signal for the model to predict the correct set size – but again, we do not use this for simplicity.

4 Related work

The main approach we compare our method to is the simple method of using an MLP decoder to predict sets. This has been used for predicting point clouds [1; 8], bounding boxes [20; 2], and graphs (sets of nodes and edges) [6; 22]. These predict an ordered representation (list) and treat it as if it is unordered (set). As we discussed in section 2, this approach runs into the responsibility problem. Some works on predicting 3d point clouds make domain-specific assumptions such as independence of points within a set [14] or grid-like structures [27]. To avoid inefficient graph matching losses, Yang et al. [26] compute a permutation-invariant loss between graphs by comparing them in the latent space (similar to our L_repr) in an adversarial setting.

An alternative approach is to use an RNN decoder to generate this list [15; 23; 25]. The problem can be made easier if it can be turned from a set into a sequence problem by giving a canonical order to the elements in the set through domain knowledge [25]. For example, You et al.
[28] generate the nodes of a graph by ordering the set of nodes based on the traversal order of a breadth-first search.

The closest work to ours is by Mordatch [17]. They also iteratively minimise a function (their energy function) in each forward pass of the neural network and differentiate through the iteration to learn the weights. They have only demonstrated that this works for modifying small sets of 2d elements in relatively simple ways, so it is unclear whether their approach scales to harder problems, such as the object detection task we tackle in this paper. In particular, minimising L_repr in our model has the easy-to-understand consequence of making the predicted set more similar to the target set, while it is less clear what minimising their learned energy function E(Ŷ, z) does.

Zhang et al. [30] construct an auto-encoder that pools a set into a feature vector where information from the encoder is shared with their decoder. This is done to make their decoder permutation-equivariant, which they use to avoid the responsibility problem. However, this strictly limits their decoder to usage in auto-encoders – not set prediction – because it requires an encoder to be present during inference.

Greff et al. [9] construct an auto-encoder for images with a set-structured latent space. They are able to find latent sets of variables to describe an image composed of a set of objects with some task-specific assumptions. While interesting from a representation learning perspective, our model is immediately useful in practice because it works for general supervised learning tasks.

Our inspiration for using backpropagation through an encoder as a decoder comes from the line of introspective neural networks [12; 13] for image modeling. An important difference is that in these works, the two optimisation loops (generating predictions and learning the network weights) are performed in sequence, while ours are nested.
The nesting allows our outer optimisation to differentiate through the inner optimisation. This type of nested optimisation to obtain structured outputs with neural networks was first studied in [3; 4], of which our model can be considered an instance. Note that [9] and [17] also differentiate through an optimisation, which suggests that this approach is of general benefit when working with sets. By differentiating through a decoder rather than an encoder, Bojanowski et al. [5] learn a representation instead of a prediction.

Figure 1: Progression of set prediction algorithm on MNIST (Ŷ(t)). Our predictions come from our model with 0.08 × 10⁻³ loss, while the baseline predictions come from an MLP decoder model with 0.09 × 10⁻³ loss.

Table 1: Chamfer reconstruction loss on MNIST in thousandths. Lower is better. Mean and standard deviation over 6 runs.

    Model          Loss
    MLP baseline   0.21 ± 0.18
    RNN baseline   0.49 ± 0.19
    Ours           0.09 ± 0.01

It is important to clearly separate the vector-to-set setting in this paper from some related works on set-to-set mappings, such as the equivariant version of Deep Sets [29] and self-attention [24]. Tasks like object detection, where no set input is available, cannot be solved with set-to-set methods alone; the feature vector from the image encoder has to be turned into a set first, for which a vector-to-set model like ours is necessary. Set-to-set methods do not have to deal with the responsibility problem, because the output usually has the same ordering as the input. Methods like [16] and [31] learn to predict a permutation matrix for a set (set-to-set-of-position-assignments). When this permutation is applied to the input set, the set is turned into a list (set-to-list).
Again, our model is about producing a set as output while not necessarily taking a set as input.

5 Experiments

In the following experiments, we compare our set prediction network to a model that uses an MLP or RNN (LSTM) as set decoder. In all experiments, we fix the hyperparameters of our model to T = 10, η = 800, λ = 0.1. Further details about the model architectures, training settings, and hyperparameters are given in Appendix B. We provide the PyTorch [18] source code to reproduce all experiments at https://github.com/Cyanogenoid/dspn.

5.1 MNIST

We begin with the task of auto-encoding a set version of MNIST. A set is constructed from each image by including the coordinates (x and y, scaled to the interval [0, 1]) of all pixels with a value above the mean pixel value. The size of these sets varies from 32 to 342 across the dataset.

Model  In our model, we use a set encoder that processes each element individually with a 3-layer MLP, followed by FSPool [30] as pooling function to produce 256 latent variables. These are decoded with our algorithm to predict the input set. We compare this against a baseline model with the same encoder, but with a traditional MLP or LSTM as decoder. This approach to decoding sets is used in models such as [1] (AE-CD variant) and [23]; these baselines are representative of the best approaches for set prediction in the literature. Note that these baselines have significantly more parameters than our model, since our decoder has almost no additional parameters by sharing the encoder weights (ours: ~140 000 parameters, MLP: ~530 000, LSTM: ~470 000). For the baselines, we include a mask feature with each element to allow for variable-size sets. Due to the large maximum set size, use of Hungarian matching is too slow.
Instead, we use the Chamfer loss to compute the loss between predicted and target set in this experiment.

Table 2: Average Precision (AP) for different intersection-over-union thresholds for a predicted bounding box to be considered correct. Higher is better. Mean and standard deviation over 6 runs.

    Model            AP50        AP90        AP95         AP98        AP99
    MLP baseline     99.3 ± 0.2  94.0 ± 1.9  57.9 ± 7.9    0.7 ± 0.2  0.0 ± 0.0
    RNN baseline     99.4 ± 0.2  94.9 ± 2.0  65.0 ± 10.3   2.4 ± 0.0  0.0 ± 0.0
    Ours (10 iters)  98.8 ± 0.3  94.3 ± 1.5  85.7 ± 3.0   34.5 ± 5.7  2.9 ± 1.2
    Ours (20 iters)  99.8 ± 0.0  98.7 ± 1.1  86.2 ± 7.2   24.3 ± 8.0  1.4 ± 0.9
    Ours (30 iters)  99.8 ± 0.1  96.7 ± 2.4  75.5 ± 12.3  17.4 ± 7.7  0.9 ± 0.7

Results  Table 1 shows that our model improves over the two baselines. In Figure 1, we show the progression of Ŷ throughout the minimisation, with Ŷ(10) as the final prediction, alongside the ground-truth set and the baseline prediction of an MLP decoder. Observe how every optimisation starts with the same set Ŷ(0), but is transformed differently depending on the gradient of g_enc. Through this minimisation of L_repr by the inner optimisation, the set is gradually changed into a shape that closely resembles the correct digit.

The types of errors of our model and the baseline are different, despite the use of models with similar losses in Figure 1. Errors in our model are mostly due to scattered points outside of the main shape of the digit, which is particularly visible in the third row. We believe that this is due to the limits of the encoder used: an encoder that is not powerful enough maps slightly different sets to the same representation, so there is no L_repr gradient to work with.
It still models the general shape accurately, but misses the fine details of these scattered points. The MLP decoder has less of this scattering, but makes mistakes in the shape of the digit instead. For example, in the third row, the baseline has a different curve at the top and a shorter line at the bottom. This difference in types of errors is also present in the extended examples in Figure 3.

Note that the reconstructions shown in [30] for the same auto-encoding task appear better because their decoder uses additional information outside of the latent space: they copy multiple n × n matrices from the encoder into the decoder. In contrast, all information about the set is completely contained in our permutation-invariant latent space.

5.2 Bounding box prediction

Next, we turn to the task of object detection on the CLEVR dataset [11], which contains 70,000 training and 15,000 validation images. The goal is to predict the set of bounding boxes for the objects in an image. The target set contains at most 10 elements with 4 dimensions each: the (normalised) x-y coordinates of the top-left and bottom-right corners of each box. As the dataset does not canonically contain bounding box information, we use [7] to calculate approximate bounding boxes. This means the ground-truth bounding boxes are not always perfect, which is a source of noise.

Model  We encode the image with a ResNet34 [10] into a 512d feature vector, which is fed into the set decoder. The set decoder predicts the set of bounding boxes from this single feature vector describing the whole image. This is in contrast to existing region proposal networks [19] for bounding box prediction, where the typical anchor-based approach requires the entire feature map. As the set encoder in our model, we use a 2-layer relation network [21] with FSPool [30] as pooling. This is stronger than the FSPool-only model (without RN) we used in the MNIST experiment.
We again compare this against a baseline that uses an MLP or LSTM as set decoder (matching AE-EMD [1] and [20] for the MLP decoder, [23] for the LSTM decoder). Since the sets are much smaller than in our MNIST experiments, we can use the Hungarian loss as set loss. We perform no post-processing (such as non-maximum suppression) on the predictions of the model. The whole model is trained end-to-end.

Results  We show our results in Table 2 using the standard average precision (AP) metric used in object detection, with sample predictions in Figure 2. Our model is able to localise the objects very accurately, with high AP scores even when the intersection-over-union (IoU) threshold for a predicted box to match a ground-truth box is very strict. In particular, our model using 10 iterations (the same number it was trained with) has much better AP95 and AP98 than the baselines. The shown baseline model can predict bounding boxes in the close vicinity of objects, but fails to place the bounding box precisely on the object. This is visible from the decent performance for low IoU thresholds, but bad performance for high IoU thresholds.

Figure 2: Progression of set prediction algorithm for bounding boxes in CLEVR. The shown MLP baseline sometimes struggles with heavily-overlapping objects and often fails to centre the object in the boxes.

We can also run our model with more inner optimisation steps than the 10 it was trained with. Many results improve when doubling the number of steps, which shows that further minimisation of L_repr(Ŷ, z) is still beneficial, even if it is unseen during training. The model "knows" that its prediction is still suboptimal when L_repr is high, and also how to change the set to decrease it.
This confirms that the optimisation is reasonably stable and does not diverge significantly with more steps. Being able to change the number of steps allows for a dynamic trade-off between prediction quality and inference time, depending on what is needed for a given task.

The less-strict AP metrics (which measure large mistakes) improve with more iterations, while the very strict AP98 and AP99 metrics consistently worsen. This is a sign that the inner optimisation learned to reach its best prediction at exactly 10 steps, but slightly overshoots when run for longer. The model has learned that it does not fully converge within 10 steps, so it compensates by slightly biasing the inner optimisation to get a better 10-step prediction. This comes at the expense of the strictest AP metrics worsening with 20 steps, where this bias is no longer necessary.

Bear in mind that we do not intend to directly compete against traditional object detection methods. Our goal is to demonstrate that our model can accurately predict a set from a single feature vector, which is of general use for set prediction tasks not limited to image inputs.

5.3 State prediction

Lastly, we want to directly predict the full state of a scene from images on CLEVR. This is the set of objects with their position in the 3d scene (x, y, z coordinates), shape (sphere, cylinder, cube), colour (eight colours), size (small, large), and material (metal/shiny, rubber/matte) as features. For example, an object can be a "small cyan metal cube" at position (0.95, −2.83, 0.35). We encode the categorical features as one-hot vectors and concatenate them into an 18d feature vector for each object. Note that we do not use bounding box information, so the model has to implicitly learn which object in the image corresponds to which set element with the associated properties.
This makes the task different from usual object detection, since traditional object detection models that rely on anchors require bounding boxes.

Model  We use exactly the same model as for the bounding box prediction in the previous experiment, with all hyperparameters kept the same. The only difference is that it now outputs 18d instead of 4d set elements. For simplicity, we continue using the Hungarian loss with the Huber loss as pairwise cost, as opposed to switching to cross-entropy for the categorical features.

Table 3: Average Precision (AP) in % for different distance thresholds of a predicted set element to be considered correct. AP∞ only requires all attributes to be correct, regardless of 3d position. Higher is better. Mean and standard deviation over 6 runs.

    Model            AP∞         AP1         AP0.5        AP0.25      AP0.125
    MLP baseline      3.6 ± 0.5   1.5 ± 0.4   0.8 ± 0.3    0.2 ± 0.1  0.0 ± 0.0
    RNN baseline      4.0 ± 1.9   1.8 ± 1.2   0.9 ± 0.5    0.2 ± 0.1  0.0 ± 0.0
    Ours (10 iters)  72.8 ± 2.3  59.2 ± 2.8  39.0 ± 4.4   12.4 ± 2.5  1.3 ± 0.4
    Ours (20 iters)  84.0 ± 4.5  80.0 ± 4.9  57.0 ± 12.1  16.6 ± 9.0  1.6 ± 0.9
    Ours (30 iters)  85.2 ± 4.8  81.1 ± 5.2  47.4 ± 17.6  10.8 ± 9.0  0.6 ± 0.7

Results  We show our results in Table 3 and give sample outputs in Appendix C. The evaluation metric is the standard average precision as used in object detection, with the modification that a prediction is considered correct if there is a matching ground-truth object with exactly the same properties and within a given Euclidean distance of the 3d coordinates. Our model clearly outperforms the baselines.
This shows that our model is also suitable for modelling high-dimensional set elements. When evaluating with more steps than the model was trained with, the more lenient metrics keep improving even up to 30 iterations. This time, the results for 20 iterations are all better than those for 10 iterations. This suggests that 10 steps are too few to reach a good solution in training, likely due to the higher difficulty of this task compared to bounding box prediction. Still, the representation z that the input encoder produces is good enough that minimising Lrepr further at evaluation time leads to better results. When going up to 30 iterations, the results for predicting the state only (excluding 3d position) improve further, but the accuracy of the 3d position worsens. We believe that this is again caused by overshooting the target due to the bias from training the model with only 10 iterations.

6 Discussion

In this paper, we showed how to predict sets with a deep neural network in a way that respects the set structure of the problem. We demonstrated in our experiments that this works for small sets (size 10) and large sets (up to size 342), as well as for low-dimensional (2d) and higher-dimensional (18d) set elements. Our model is consistently better than the baselines across all experiments because it predicts sets properly, rather than predicting a list and pretending that it is a set.
The improved results of our approach come at a higher computational cost: each evaluation of the network requires O(T) passes through the set encoder, which makes training take about 75% longer on CLEVR with T = 10. Keep in mind that this only involves the set encoder (which can be fairly small), not the input encoder (such as a CNN or RNN) that produces the target z.
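The inner loop responsible for these O(T) encoder passes, T gradient steps on the predicted set to minimise Lrepr(genc(Ŷ), z), can be sketched with a toy permutation-invariant sum-pooling encoder. The encoder, dimensions, step size, and random initialisation are illustrative assumptions; the full model also backpropagates through this loop during training and uses stronger encoders such as RN or FSPool:

```python
import random

def matvec(W, x):
    """Multiply matrix W (list of rows) by vector x."""
    return [sum(w * v for w, v in zip(row, x)) for row in W]

def encode(W, Y):
    """Toy permutation-invariant set encoder: sum of relu(W @ y) over elements."""
    z = [0.0] * len(W)
    for y in Y:
        for k, pre in enumerate(matvec(W, y)):
            z[k] += max(0.0, pre)
    return z

def decode(W, z_target, n, d, steps=10, lr=0.1, seed=0):
    """Predict a set by gradient descent on Y itself, minimising
    Lrepr = ||encode(Y) - z_target||^2 (the paper's inner optimisation)."""
    rng = random.Random(seed)
    Y = [[rng.uniform(-1.0, 1.0) for _ in range(d)] for _ in range(n)]
    for _ in range(steps):
        r = [a - b for a, b in zip(encode(W, Y), z_target)]  # residual
        for y in Y:
            pre = matvec(W, y)  # pre-activations for this element
            # dL/d(pre_k) = 2 * r_k where relu is active, else 0
            g = [2.0 * r[k] if pre[k] > 0 else 0.0 for k in range(len(W))]
            for j in range(d):  # chain rule back to the element itself
                y[j] -= lr * sum(W[k][j] * g[k] for k in range(len(W)))
    return Y
```

Each of the T steps costs one encoder pass (plus its backward pass), which is where the roughly 75% training overhead comes from when the loop is also differentiated through during training.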
Further study into representationally powerful and efficient set encoders such as RN [21] and FSPool [30], which we found to be critical for good results in our experiments, would be of considerable interest, as it could speed up the convergence and thus the inference time of our method. Another promising approach is to better initialise Y(0), perhaps with an MLP, so that the set needs to be changed less to minimise Lrepr. Our model would then act as a set-aware refinement method for the MLP prediction. Lastly, stopping criteria other than iterating for a fixed 10 steps can be used, such as stopping when Lrepr(genc(Ŷ), z) is below a fixed threshold: this would stop when the encoder thinks Ŷ is of a certain quality corresponding to that threshold.
Our algorithm may be suitable for generating samples under other invariance properties. For example, we may want to generate images of objects where the rotation of the object does not matter (such as aerial images). Using our decoding algorithm with a rotation-invariant image encoder could predict images without forcing the model to choose a fixed orientation of the image, which could be a useful inductive bias.
In conclusion, we are excited about enabling a wider variety of set prediction problems to be tackled with deep neural networks. Our main idea should be readily extensible to similar domains such as graphs to allow for better graph prediction, for example molecular graph generation or end-to-end scene graph prediction from images. We hope that our model inspires further research into graph generation, stronger object detection models, and, more generally, a more principled approach to set prediction.

References

[1] Achlioptas, P., Diamanti, O., Mitliagkas, I., and Guibas, L. J. Learning representations and generative models for 3D point clouds.
In Proceedings of the 35th International Conference on Machine Learning (ICML), 2018.

[2] Balles, L. and Fischbacher, T. Holographic and other point set distances for machine learning, 2019. URL https://openreview.net/forum?id=rJlpUiAcYX.

[3] Belanger, D. and McCallum, A. Structured prediction energy networks. In Proceedings of the 33rd International Conference on Machine Learning (ICML), 2016.

[4] Belanger, D., Yang, B., and McCallum, A. End-to-end learning for structured prediction energy networks. In Proceedings of the 34th International Conference on Machine Learning (ICML), 2017.

[5] Bojanowski, P., Joulin, A., Paz, D. L., and Szlam, A. Optimizing the latent space of generative networks. In Proceedings of the 35th International Conference on Machine Learning (ICML), 2018.

[6] Cao, N. D. and Kipf, T. MolGAN: An implicit generative model for small molecular graphs. In ICML Deep Generative Models Workshop, 2018.

[7] Desta, M. T., Chen, L., and Kornuta, T. Object-based reasoning in VQA. In IEEE Winter Conference on Applications of Computer Vision (WACV), 2018.

[8] Fan, H., Su, H., and Guibas, L. J. A point set generation network for 3D object reconstruction from a single image. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[9] Greff, K., Kaufmann, R. L., Kabra, R., Watters, N., Burgess, C., Zoran, D., Matthey, L., Botvinick, M., and Lerchner, A. Multi-object representation learning with iterative variational inference. arXiv:1903.00450, 2019.

[10] He, K., Zhang, X., Ren, S., and Sun, J. Deep residual learning for image recognition. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[11] Johnson, J., Hariharan, B., van der Maaten, L., Fei-Fei, L., Lawrence Zitnick, C., and Girshick, R.
CLEVR: A diagnostic dataset for compositional language and elementary visual reasoning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[12] Lazarow, J., Jin, L., and Tu, Z. Introspective neural networks for generative modeling. In The IEEE International Conference on Computer Vision (ICCV), pp. 2774–2783, 2017.

[13] Lee, K., Xu, W., Fan, F., and Tu, Z. Wasserstein introspective neural networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[14] Li, C.-L., Zaheer, M., Zhang, Y., Poczos, B., and Salakhutdinov, R. Point cloud GAN. arXiv:1810.05795, 2018.

[15] Li, Y., Vinyals, O., Dyer, C., Pascanu, R., and Battaglia, P. Learning deep generative models of graphs. arXiv:1803.03324, 2018.

[16] Mena, G., Belanger, D., Linderman, S., and Snoek, J. Learning latent permutations with Gumbel-Sinkhorn networks. In International Conference on Learning Representations (ICLR), 2018.

[17] Mordatch, I. Concept learning with energy-based models. arXiv:1811.02486, 2018.

[18] Paszke, A., Gross, S., Chintala, S., Chanan, G., Yang, E., DeVito, Z., Lin, Z., Desmaison, A., Antiga, L., and Lerer, A. Automatic differentiation in PyTorch. NeurIPS Workshop Autodiff, 2017.

[19] Ren, S., He, K., Girshick, R., and Sun, J. Faster R-CNN: Towards real-time object detection with region proposal networks. In Advances in Neural Information Processing Systems 28 (NeurIPS), 2015.

[20] Rezatofighi, S. H., Kaskman, R., Motlagh, F. T., Shi, Q., Cremers, D., Leal-Taixé, L., and Reid, I. Deep perm-set net: Learn to predict sets with unknown permutation and cardinality using deep neural networks. arXiv:1805.00613, 2018.

[21] Santoro, A., Raposo, D., Barrett, D. G., Malinowski, M., Pascanu, R., Battaglia, P., and Lillicrap, T. A simple neural network module for relational reasoning.
In Advances in Neural Information Processing Systems 30 (NeurIPS), 2017.

[22] Simonovsky, M. and Komodakis, N. GraphVAE: Towards generation of small graphs using variational autoencoders. In International Conference on Artificial Neural Networks (ICANN), 2018.

[23] Stewart, R. and Andriluka, M. End-to-end people detection in crowded scenes. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2016.

[24] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In Advances in Neural Information Processing Systems 30 (NeurIPS), 2017.

[25] Vinyals, O., Bengio, S., and Kudlur, M. Order matters: Sequence to sequence for sets. In International Conference on Learning Representations (ICLR), 2015.

[26] Yang, C., Zhuang, P., Shi, W., Luu, A., and Li, P. Conditional structure generation through graph variational generative adversarial nets. In Advances in Neural Information Processing Systems 32 (NeurIPS), 2019.

[27] Yang, Y., Feng, C., Shen, Y., and Tian, D. FoldingNet: Point cloud auto-encoder via deep grid deformation. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[28] You, J., Ying, R., Ren, X., Hamilton, W., and Leskovec, J. GraphRNN: Generating realistic graphs with deep auto-regressive models. In Proceedings of the 35th International Conference on Machine Learning (ICML), 2018.

[29] Zaheer, M., Kottur, S., Ravanbakhsh, S., Poczos, B., Salakhutdinov, R. R., and Smola, A. J. Deep Sets. In Advances in Neural Information Processing Systems 30 (NeurIPS), 2017.

[30] Zhang, Y., Hare, J., and Prügel-Bennett, A. FSPool: Learning set representations with feature-wise sort pooling. arXiv:1906.02795, 2019.

[31] Zhang, Y., Hare, J., and Prügel-Bennett, A. Learning representations of sets through optimized permutations.
In International Conference on Learning Representations (ICLR), 2019.