{"title": "Neural Expectation Maximization", "book": "Advances in Neural Information Processing Systems", "page_first": 6691, "page_last": 6701, "abstract": "Many real world tasks such as reasoning and physical interaction require identification and manipulation of conceptual entities. A first step towards solving these tasks is the automated discovery of distributed symbol-like representations. In this paper, we explicitly formalize this problem as inference in a spatial mixture model where each component is parametrized by a neural network.  Based on the Expectation Maximization framework we then derive a differentiable clustering method that simultaneously learns how to group and represent individual entities.  We evaluate our method on the (sequential) perceptual grouping task and find that it is able to accurately recover the constituent objects.  We demonstrate that the learned representations are useful for next-step prediction.", "full_text": "Neural Expectation Maximization\n\nKlaus Greff\u2217\n\nIDSIA\n\nklaus@idsia.ch\n\nSjoerd van Steenkiste\u2217\n\nJ\u00fcrgen Schmidhuber\n\nsjoerd@idsia.ch\n\njuergen@idsia.ch\n\nIDSIA\n\nIDSIA\n\nAbstract\n\nMany real world tasks such as reasoning and physical interaction require identi\ufb01-\ncation and manipulation of conceptual entities. A \ufb01rst step towards solving these\ntasks is the automated discovery of distributed symbol-like representations. In\nthis paper, we explicitly formalize this problem as inference in a spatial mixture\nmodel where each component is parametrized by a neural network. Based on the\nExpectation Maximization framework we then derive a differentiable clustering\nmethod that simultaneously learns how to group and represent individual entities.\nWe evaluate our method on the (sequential) perceptual grouping task and \ufb01nd that\nit is able to accurately recover the constituent objects. 
We demonstrate that the\nlearned representations are useful for next-step prediction.\n\n1 Introduction\n\nLearning useful representations is an important aspect of unsupervised learning, and one of the\nmain open problems in machine learning. It has been argued that such representations should be\ndistributed [13, 37] and disentangled [1, 31, 3]. The latter has recently received an increasing amount\nof attention, producing representations that can disentangle features like rotation and lighting [4, 12].\nSo far, these methods have mostly focused on the single object case whereas, for real world tasks\nsuch as reasoning and physical interaction, it is often necessary to identify and manipulate multiple\nentities and their relationships. In current systems this is difficult, since superimposing multiple\ndistributed and disentangled representations can lead to ambiguities. This is known as the Binding\nProblem [21, 37, 13] and has been extensively discussed in neuroscience [33]. One solution to\nthis problem involves learning a separate representation for each object. In order to allow these\nrepresentations to be processed identically they must be described in terms of the same (disentangled)\nfeatures. This would then avoid the binding problem, and facilitate a wide range of tasks that require\nknowledge about individual objects. This solution requires a process known as perceptual grouping:\ndynamically splitting (segmenting) each input into its constituent conceptual entities.\nIn this work, we tackle this problem of learning how to group and efficiently represent individual\nentities, in an unsupervised manner, based solely on the statistical structure of the data. Our work\nfollows an approach similar to the recently proposed Tagger [7] and aims to further develop the\nunderstanding, as well as build a theoretical framework, for the problem of symbol-like representation\nlearning. 
We formalize this problem as inference in a spatial mixture model where each component\nis parametrized by a neural network. Based on the Expectation Maximization framework we then\nderive a differentiable clustering method, which we call Neural Expectation Maximization (N-EM). It\ncan be trained in an unsupervised manner to perform perceptual grouping in order to learn an ef\ufb01cient\nrepresentation for each group, and naturally extends to sequential data.\n\n\u2217Both authors contributed equally to this work.\n\n31st Conference on Neural Information Processing Systems (NIPS 2017), Long Beach, CA, USA.\n\n\f2 Neural Expectation Maximization\n\nThe goal of training a system that produces separate representations for the individual conceptual\nentities contained in a given input (here: image) depends on what notion of entity we use. Since we\nare interested in the case of unsupervised learning, this notion can only rely on statistical properties\nof the data. We therefore adopt the intuitive notion of a conceptual entity as being a common cause\n(the object) for multiple observations (the pixels that depict the object). This common cause induces a\ndependency-structure among the affected pixels, while the pixels that correspond to different entities\nremain (largely) independent. Intuitively this means that knowledge about some pixels of an object\nhelps in predicting its remainder, whereas it does not improve the predictions for pixels of other\nobjects. This is especially obvious for sequential data, where pixels belonging to a certain object share\na common fate (e.g. move in the same direction), which makes this setting particularly appealing.\nWe are interested in representing each entity (object) k with some vector \u03b8k that captures all the\nstructure of the affected pixels, but carries no information about the remainder of the image. 
This\nmodularity is a powerful invariant, since it allows the same representation to be reused in different\ncontexts, which enables generalization to novel combinations of known objects. Further, having all\npossible objects represented in the same format makes it easier to work with these representations.\nFinally, having a separate \u03b8k for each object (as opposed to for the entire image) allows \u03b8k to be\ndistributed and disentangled without suffering from the binding problem.\nWe treat each image as a composition of K objects, where each pixel is determined by exactly one\nobject. Which objects are present, as well as the corresponding assignment of pixels, varies from input\nto input. Assuming that we have access to the family of distributions P (x|\u03b8k) that corresponds to an\nobject level representation as described above, we can model each image as a mixture model. Then\nExpectation Maximization (EM) can be used to simultaneously compute a Maximum Likelihood\nEstimate (MLE) for the individual \u03b8k-s and the grouping that we are interested in.\nThe central problem we consider in this work is therefore how to learn such a P (x|\u03b8k) in a\ncompletely unsupervised fashion. We accomplish this by parametrizing this family of distributions by\na differentiable function f\u03c6(\u03b8) (a neural network with weights \u03c6). We show that in that case, the\ncorresponding EM procedure becomes fully differentiable, which allows us to backpropagate an\nappropriate outer loss into the weights of the neural network. In the remainder of this section we\nformalize and derive this method which we call Neural Expectation Maximization (N-EM).\n\n2.1 Parametrized Spatial Mixture Model\nWe model each image x \u2208 R^D as a spatial mixture of K components parametrized by vectors\n\u03b81, . . . , \u03b8K \u2208 R^M. 
A differentiable non-linear function f\u03c6 (a neural network) is used to transform\nthese representations \u03b8k into parameters \u03c8i,k = f\u03c6(\u03b8k)i for separate pixel-wise distributions. These\ndistributions are typically Bernoulli or Gaussian, in which case \u03c8i,k would be a single probability\nor a mean and variance respectively. This parametrization assumes that given the representation,\nthe pixels are independent but not identically distributed (unlike in standard mixture models). A set\nof binary latent variables Z \u2208 {0, 1}^{D\u00d7K} encodes the unknown true pixel assignments, such that\nzi,k = 1 iff pixel i was generated by component k, and \u2211k zi,k = 1. A graphical representation of\nthis model can be seen in Figure 1, where \u03c0 = (\u03c01, . . . , \u03c0K) are the mixing coefficients (or prior for\nz). The full likelihood for x given \u03b8 = (\u03b81, . . . , \u03b8K) is given by:\n\n$$ P(x|\\theta) = \\prod_{i=1}^{D} \\sum_{z_i} P(x_i, z_i|\\psi_i) = \\prod_{i=1}^{D} \\sum_{k=1}^{K} \\underbrace{P(z_{i,k} = 1)}_{\\pi_k} P(x_i|\\psi_{i,k}, z_{i,k} = 1). \\tag{1} $$\n\n2.2 Expectation Maximization\nDirectly optimizing log P (x|\u03c8) with respect to \u03b8 is difficult due to marginalization over z, while for\nmany distributions optimizing log P (x, z|\u03c8) is much easier. Expectation Maximization (EM; [6])\ntakes advantage of this and instead optimizes a lower bound given by the expected log likelihood:\n\n$$ Q(\\theta, \\theta^{\\text{old}}) = \\sum_{z} P(z|x, \\psi^{\\text{old}}) \\log P(x, z|\\psi). \\tag{2} $$\n\nFigure 1: left: The probabilistic graphical model that underlies N-EM. 
right: Illustration of the\ncomputations for two steps of N-EM.\n\nIterative optimization of this bound alternates between two steps: in the E-step we compute a new\nestimate of the posterior probability distribution over the latent variables given \u03b8old from the previous\niteration, yielding a new soft-assignment of the pixels to the components (clusters):\n\n$$ \\gamma_{i,k} := P(z_{i,k} = 1 | x_i, \\psi_i^{\\text{old}}). \\tag{3} $$\n\nIn the M-step we then aim to find the configuration of \u03b8 that would maximize the expected\nlog-likelihood using the posteriors computed in the E-step. Due to the non-linearity of f\u03c6 there exists\nno analytical solution to arg max\u03b8 Q(\u03b8, \u03b8old). However, since f\u03c6 is differentiable, we can improve\nQ(\u03b8, \u03b8old) by taking a gradient ascent step:2\n\n$$ \\theta^{\\text{new}} = \\theta^{\\text{old}} + \\eta \\frac{\\partial Q}{\\partial \\theta} \\quad \\text{where} \\quad \\frac{\\partial Q}{\\partial \\theta_k} \\propto \\sum_{i=1}^{D} \\gamma_{i,k} (\\psi_{i,k} - x_i) \\frac{\\partial \\psi_{i,k}}{\\partial \\theta_k}. \\tag{4} $$\n\nThe resulting algorithm belongs to the class of generalized EM algorithms and is guaranteed (for a\nsufficiently small learning rate \u03b7) to converge to a (local) optimum of the data log likelihood [42].\n\n2.3 Unrolling\n\nIn our model the information about statistical regularities required for clustering the pixels into\nobjects is encoded in the neural network f\u03c6 with weights \u03c6. So far we have considered f\u03c6 to be\nfixed and have shown how we can compute an MLE for \u03b8 alongside the appropriate clustering.\nWe now observe that by unrolling the iterations of the presented generalized EM, we obtain an\nend-to-end differentiable clustering procedure based on the statistical model implemented by f\u03c6. We\ncan therefore use (stochastic) gradient descent and fit the statistical model to capture the regularities\ncorresponding to objects for a given dataset. 
This is implemented by back-propagating an appropriate\nloss (see Section 2.4) through \u201ctime\u201d (BPTT; [39, 41]) into the weights \u03c6. We refer to this trainable\nprocedure as Neural Expectation Maximization (N-EM), an overview of which can be seen in Figure 1.\nUpon inspection of the structure of N-EM we find that\nit resembles K copies of a recurrent neural network\nwith hidden states \u03b8k that, at each timestep, receive\n\u03b3k \u2299 (\u03c8k \u2212 x) as their input. Each copy generates a\nnew \u03c8k, which is then used by the E-step to re-estimate\nthe soft-assignments \u03b3. In order to accurately mimic\nthe M-Step (4) with an RNN, we must impose several\nrestrictions on its weights and structure: the \u201cencoder\u201d\nmust correspond to the Jacobian \u2202\u03c8k/\u2202\u03b8k, and the\nrecurrent update must linearly combine the output of\nthe encoder with \u03b8k from the previous timestep.\nInstead, we introduce a new algorithm named RNN-EM,\nobtained by substituting that part of the computational graph\nof N-EM with an actual RNN (without imposing any\nrestrictions). Although RNN-EM can no longer guarantee\nconvergence of the data log likelihood, its recurrent weights increase the flexibility of the clustering\nprocedure. Moreover, by using a fully parametrized recurrent weight matrix RNN-EM naturally\nextends to sequential data. Figure 2 presents the computational graph of a single RNN-EM timestep.\n\nFigure 2: RNN-EM Illustration. Note the\nchanged encoder and recurrence compared\nto Figure 1.\n\n2 Here we assume that P (xi|zi,k = 1, \u03c8i,k) is given by N (xi; \u00b5 = \u03c8i,k, \u03c3^2) for some fixed \u03c3^2, yet a similar\nupdate arises for many typical parametrizations of pixel distributions.\n\n2.4 Training Objective\n\nN-EM is a differentiable clustering procedure, whose outcome relies on the statistical model f\u03c6. 
We\nare interested in a particular unsupervised clustering that corresponds to grouping entities based on\nthe statistical regularities in the data. To train our system, we therefore require a loss function that\nteaches f\u03c6 to map from representations \u03b8 to parameters \u03c8 that correspond to pixelwise distributions\nfor such objects. We accomplish this with a two-term loss function that guides each of the K networks\nto model the structure of a single object independently of any other information in the image:\n\n$$ L(x) = -\\sum_{i=1}^{D} \\sum_{k=1}^{K} \\Big[ \\underbrace{\\gamma_{i,k} \\log P(x_i, z_{i,k} | \\psi_{i,k})}_{\\text{intra-cluster loss}} - \\underbrace{(1 - \\gamma_{i,k}) D_{KL}[P(x_i) \\,\\|\\, P(x_i | \\psi_{i,k}, z_{i,k})]}_{\\text{inter-cluster loss}} \\Big]. \\tag{5} $$\n\nThe intra-cluster loss corresponds to the same expected data log-likelihood Q as is optimized by\nN-EM. It is analogous to a standard reconstruction loss used for training autoencoders, weighted\nby the cluster assignment. Similar to autoencoders, this objective is prone to trivial solutions in\ncase of overcapacity, which prevent the network from modelling the statistical regularities that we\nare interested in. Standard techniques can be used to overcome this problem, such as making \u03b8 a\nbottleneck or using a noisy version of x to compute the inputs to the network. Furthermore, when\nRNN-EM is used on sequential data we can use a next-step prediction loss.\nWeighting the loss pixelwise is crucial, since it allows each network to specialize its predictions to an\nindividual object. However, it also introduces a problem: the loss for out-of-cluster pixels (\u03b3i,k = 0)\nvanishes. This leaves the network free to predict anything and does not yield specialized\nrepresentations. Therefore, we add a second term (inter-cluster loss) which penalizes the KL divergence between\nout-of-cluster predictions and the pixelwise prior of the data. 
Intuitively this tells each representation\n\u03b8k to contain no information regarding non-assigned pixels xi: P (xi|\u03c8i,k, zi,k) = P (xi).\nA disadvantage of the interaction between \u03b3 and \u03c8 in (5) is that it may yield con\ufb02icting gradients.\nFor any \u03b8k the loss for a given pixel i can be reduced by better predicting xi, or by decreasing \u03b3i,k\n(i.e. taking less responsibility) which is (due to the E-step) realized by being worse at predicting xi. A\npractical solution to this problem is obtained by stopping the \u03b3 gradients, i.e. by setting \u2202L/\u2202\u03b3 = 0\nduring backpropagation.\n\n3 Related work\n\nThe method most closely related to our approach is Tagger [7], which similarly learns perceptual\ngrouping in an unsupervised fashion using K copies of a neural network that work together by\nreconstructing different parts of the input. Unlike in case of N-EM, these copies additionally learn to\noutput the grouping, which gives Tagger more direct control over the segmentation and supports its\nuse on complex texture segmentation tasks. Our work maintains a close connection to EM and relies\non the posterior inference of the E-Step as a grouping mechanism. This facilitates theoretical analysis\nand simpli\ufb01es the task for the resulting networks, which we \ufb01nd can be markedly smaller than in\nTagger. Furthermore, Tagger does not include any recurrent connections on the level of the hidden\nstates, precluding it from next step prediction on sequential tasks.3\nThe Binding problem was \ufb01rst considered in the context of Neuroscience [21, 37] and has sparked\nsome early work in oscillatory neural networks that use synchronization as a grouping mechanism [36,\n38, 24]. Later, complex valued activations have been used to replace the explicit simulation of\noscillation [25, 26]. By virtue of being general computers, any RNN can in principle learn a suitable\nmechanism. 
In practice however it seems hard to learn, and adding a suitable mechanism like\ncompetition [40], fast weights [29], or perceptual grouping as in N-EM seems necessary.\n\n3 RTagger [15], a recurrent extension of Tagger that does support sequential data, was developed concurrently\nwith this work.\n\nFigure 3: Groupings by RNN-EM (bottom row) and\nN-EM (middle row) for six input images (top row).\nBoth methods recover the individual shapes\naccurately when they are separated (a, b, f), even when\nconfronted with the same shape (b). RNN-EM is able\nto handle most occlusion (c, d) but sometimes fails\n(e). The exact assignments are permutation invariant\nand depend on \u03b3 initialization; compare (a) and (f).\n\nUnsupervised Segmentation has been studied in several different contexts [30], from random\nvectors [14] over texture segmentation [10] to images [18, 16]. Early work in unsupervised video\nsegmentation [17] used generalized Expectation Maximization (EM) to infer how to split frames\nof moving sprites. More recently optical flow has been used to train convolutional networks to do\nfigure/ground segmentation [23, 34]. A related line of work under the term of multi-causal\nmodelling [28] has formalized perceptual grouping as inference in a generative compositional model of\nimages. Masked RBMs [20] for example extend Restricted Boltzmann Machines with a latent mask\ninferred through Block-Gibbs sampling.\nGradient backpropagation through inference updates has previously been addressed in the context of\nsparse coding with (Fast) Iterative Shrinkage/Thresholding Algorithms ((F)ISTA; [5, 27, 2]). Here the\nunrolled graph of a fixed number of ISTA iterations is replaced by a recurrent neural network that\nparametrizes the gradient computations and is trained to predict the sparse codes directly [9]. 
We\nderive RNN-EM from N-EM in a similar fashion and likewise obtain a trainable procedure that has\nthe structure of iterative pursuit built into the architecture, while leaving tunable degrees of freedom\nthat can improve its modeling capabilities [32]. An alternative to further empower the network by\nuntying its weights across iterations [11] was not considered for flexibility reasons.\n\n4 Experiments\n\nWe evaluate our approach on a perceptual grouping task for generated static images and video. By\ncomposing images out of simple shapes we have control over the statistical structure of the data, as\nwell as access to the ground-truth clustering. This allows us to verify that the proposed method indeed\nrecovers the intended grouping and learns representations corresponding to these objects. In particular\nwe are interested in studying the role of next-step prediction as an unsupervised objective for perceptual\ngrouping, the effect of the hyperparameter K, and the usefulness of the learned representations.\nIn all experiments we train the networks using ADAM [19] with default parameters, a batch size\nof 64 and 50 000 train + 10 000 validation + 10 000 test inputs. Consistent with earlier work [8, 7],\nwe evaluate the quality of the learned groupings with respect to the ground truth while ignoring the\nbackground and overlap regions. This comparison is done using the Adjusted Mutual Information\n(AMI; [35]) score, which provides a measure of clustering similarity between 0 (random) and 1\n(perfect match). We use early stopping when the validation loss has not improved for 10 epochs.4 A\ndetailed overview of the experimental setup can be found in Appendix A. 
All reported results are\naverages computed over five runs.5\n\n4 Note that we do not stop on the AMI score as this is not part of our objective function and only measured to\nevaluate the performance after training.\n\n5 Code to reproduce all experiments is available at https://github.com/sjoerdvansteenkiste/Neural-EM\n\n4.1 Static Shapes\n\nTo validate that our approach yields the intended behavior we consider a simple perceptual grouping\ntask that involves grouping three randomly chosen regular shapes located in random positions\nof 28 \u00d7 28 binary images [26]. This simple setup serves as a test-bed for comparing N-EM and\nRNN-EM, before moving on to more complex scenarios.\n\nFigure 4: A sequence of 5 shapes flying along random trajectories (bottom row). The next-step\nprediction of each copy of the network (rows 2 to 5) and the soft-assignment of the pixels to each of\nthe copies (top row). Observe that the network learns to separate the individual shapes as a means to\nefficiently solve next-step prediction. Even when many of the shapes are overlapping, as can be seen\nin time-steps 18-20, the network is still able to disentangle the individual shapes from the clutter.\n\nWe implement f\u03c6 by means of a single layer fully connected neural network with a sigmoid output\n\u03c8i,k for each pixel that corresponds to the mean of a Bernoulli distribution. The representation \u03b8k is\na real-valued 250-dimensional vector squashed to the (0, 1) range by a sigmoid function before being\nfed into the network. Similarly for RNN-EM we use a recurrent neural network with 250 sigmoidal\nhidden units and an equivalent output layer. 
Both networks are trained with K = 3 and unrolled for\n15 EM steps.\nAs shown in Figure 3, we observe that both approaches are able to recover the individual shapes as long\nas they are separated, even when confronted with identical shapes. N-EM performs worse if the image\ncontains occlusion, and we find that RNN-EM is in general more stable and produces considerably\nbetter groupings. This observation is in line with findings for Sparse Coding [9]. Similarly we\nconclude that the tunable degrees of freedom in RNN-EM help speed up the optimization process\nresulting in a more powerful approach that requires fewer iterations. The benefit is reflected in\nthe large score difference between the two: 0.826 \u00b1 0.005 AMI for RNN-EM compared to 0.475 \u00b1 0.043 AMI\nfor N-EM. In comparison, Tagger achieves an AMI score of 0.79 \u00b1 0.034 (and 0.97 \u00b1 0.009 with\nlayernorm), while using about twenty times more parameters [7].\n\n4.2 Flying Shapes\n\nWe consider a sequential extension of the static shapes dataset in which the shapes are floating\nalong random trajectories and bounce off walls. An example sequence with 5 shapes can be seen in\nthe bottom row of Figure 4. We use a convolutional encoder and decoder inspired by the discriminator\nand generator networks of infoGAN [4], with a recurrent neural network of 100 sigmoidal units (for\ndetails see Section A.2). At each timestep t the network receives \u03b3k(\u03c8k^(t\u22121) \u2212 x\u0303^(t)) as input, where\nx\u0303^(t) is the current frame corrupted with additional bitflip noise (p = 0.2). The next-step prediction\nobjective is implemented by replacing x with x^(t+1) in (5), and is evaluated at each time-step.\nTable 1 summarizes the results on flying shapes, and an example of a sequence with 5 shapes when\nusing K = 5 can be seen in Figure 4. For 3 shapes we observe that the produced groupings are close\nto perfect (AMI: 0.970 \u00b1 0.005). 
Even in the very cluttered case of 5 shapes the network is able to\nseparate the individual objects in almost all cases (AMI: 0.878 \u00b1 0.003).\nThese results demonstrate the adequacy of the next step prediction task for perceptual grouping.\nHowever, we find that the converse also holds: the corresponding representations are useful for the\nprediction task. In Figure 5 we compare the next-step prediction error of RNN-EM with K = 1\n(which reduces to a recurrent autoencoder that receives the difference between its previous prediction\nand the current frame as input) to RNN-EM with K = 5 on this task. To evaluate RNN-EM\non next-step prediction we computed its loss using P (xi|\u03c8i) = P (xi| maxk \u03c8i,k) as opposed to\nP (xi|\u03c8i) = \u2211k \u03b3i,k P (xi|\u03c8i,k) to avoid including information from the next timestep. The reported\nBCE loss for RNN-EM is therefore an upper bound to the true BCE loss. From the figure we observe\nthat RNN-EM produces significantly lower errors, especially when the number of objects increases.\n\nFigure 5: Binomial Cross Entropy Error\nobtained by RNN-EM and a recurrent\nautoencoder (RNN-EM with K = 1) on the\ndenoising and next-step prediction task.\nRNN-EM produces significantly lower\nBCE across different numbers of objects.\n\nFigure 6: Average AMI score (blue line) measured for\nRNN-EM (trained for 20 steps) across the flying MNIST\ntest-set and corresponding quartiles (shaded areas),\ncomputed for each of 50 time-steps. The learned grouping\ndynamics generalize to longer sequences and even\nfurther improve the AMI score.\n\nTable 1: AMI scores obtained by RNN-EM on flying shapes when varying the number of objects and\nnumber of components K, during training and at test time.\n\nTrain # obj. | Train K | Train AMI | Test # obj. | Test K | Test AMI | Test Generalization # obj. | Test Generalization K | Test Generalization AMI\n3 | 3 | 0.969 \u00b1 0.006 | 3 | 3 | 0.970 \u00b1 0.005 | 3 | 5 | 0.972 \u00b1 0.007\n3 | 5 | 0.997 \u00b1 0.001 | 3 | 5 | 0.997 \u00b1 0.002 | 3 | 3 | 0.914 \u00b1 0.015\n5 | 3 | 0.614 \u00b1 0.003 | 5 | 3 | 0.614 \u00b1 0.003 | 3 | 3 | 0.886 \u00b1 0.010\n5 | 5 | 0.878 \u00b1 0.003 | 5 | 5 | 0.878 \u00b1 0.003 | 3 | 5 | 0.981 \u00b1 0.003\n\nFinally, in Table 1 we also provide insight about the impact of choosing the hyper-parameter K,\nwhich is unknown for many real-world scenarios. Surprisingly we observe that training with too large\nK is in fact favourable, and that the network learns to leave the excess groups empty. When training\nwith too few components we find that the network still learns about the individual shapes and we\nobserve only a slight drop in score when correctly setting the number of components at test time. We\nconclude that RNN-EM is robust towards different choices of K, and specifically that choosing K to\nbe too high is not detrimental.\n\n4.3 Flying MNIST\n\nIn order to incorporate greater variability among the objects we consider a sequential extension of\nMNIST. Here each sequence consists of gray-scale 24 \u00d7 24 images containing two down-sampled\nMNIST digits that start in random positions and float along randomly sampled trajectories within the\nimage for T timesteps. An example sequence can be seen in the bottom row of Figure 7.\nWe deploy a slightly deeper version of the architecture used in flying shapes. Its details can be found\nin Appendix A.3. Since the images are gray-scale we now use a Gaussian distribution for each pixel\nwith fixed \u03c3^2 = 0.25 and \u00b5 = \u03c8i,k as computed by each copy of the network. 
The training procedure\nis identical to flying shapes except that we replace bitflip noise with masked uniform noise: we first\nsample a binary mask from a multi-variate Bernoulli distribution with p = 0.2 and then use this\nmask to interpolate between the original image and samples from a Uniform distribution between the\nminimum and maximum values of the data (0,1).\nWe train with K = 2 and T = 20 on flying MNIST having two digits and obtain an AMI score of\n0.819 \u00b1 0.022 on the test set, measured across 5 runs.\n\nFigure 7: A sequence of 3 MNIST digits flying across random trajectories in the image (bottom row).\nThe next-step prediction of each copy of the network (rows 2 to 4) and the soft-assignment of the\npixels to each of the copies (top row). Although the network was trained (stage-wise) on sequences\nwith two digits, it is able to accurately separate three digits.\n\nIn early experiments we observed that, given the large variability among the 50 000 unique digits, we\ncan boost the model performance by training in stages using 20, 500, and 50 000 digits. Here we exploit\nthe generalization capabilities of RNN-EM to quickly transfer knowledge from a less varying set of\nMNIST digits to unseen variations. 
We used the same hyper-parameter con\ufb01guration as before and\nobtain an AMI score of 0.917 \u00b1 0.005 on the test set, measured across 5 runs.\nWe study the generalization capabilities and robustness of these trained RNN-EM networks by means\nof three experiments. In the \ufb01rst experiment we evaluate them on \ufb02ying MNIST having three digits\n(one extra) and likewise set K = 3. Even without further training we are able to maintain a high\nAMI score of 0.729 \u00b1 0.019 (stage-wise: 0.838 \u00b1 0.008) on the test-set. A test example can be seen\nin Figure 7. In the second experiment we are interested in whether the grouping mechanism that has\nbeen learned can be transferred to static images. We \ufb01nd that using 50 RNN-EM steps we are able to\ntransfer a large part of the learned grouping dynamics and obtain an AMI score of 0.619 \u00b1 0.023\n(stage-wise: 0.772 \u00b1 0.008) for two static digits. As a \ufb01nal experiment we evaluate the directly\ntrained network on the same dataset for a larger number of timesteps. Figure 6 displays the average\nAMI score across the test set as well as the range of the upper and lower quartile for each timestep.\nThe results of these experiments con\ufb01rm our earlier observations for \ufb02ying shapes, in that the learned\ngrouping dynamics are robust and generalize across a wide range of variations. Moreover we \ufb01nd\nthat the AMI score further improves at test time when increasing the sequence length.\n\n5 Discussion\n\nThe experimental results indicate that the proposed Neural Expectation Maximization framework can\nindeed learn how to group pixels according to constituent objects. In doing so the network learns a\nuseful and localized representation for individual entities, which encodes only the information relevant\nto it. 
Each entity is represented separately in the same space, which avoids the binding problem and\nmakes the representations usable as ef\ufb01cient symbols for arbitrary entities in the dataset. We believe\nthat this is useful for reasoning in particular, and a potentially wide range of other tasks that depend\non interaction between multiple entities. Empirically we \ufb01nd that the learned representations are\nalready bene\ufb01cial in next-step prediction with multiple objects, a task in which overlapping objects\nare problematic for standard approaches, but can be handled ef\ufb01ciently when learning a separate\nrepresentation for each object.\nAs is typical in clustering methods, in N-EM there is no preferred assignment of objects to groups and\nso the grouping numbering is arbitrary and only depends on initialization. This property renders our\nresults permutation invariant and naturally allows for instance segmentation, as opposed to semantic\nsegmentation where groups correspond to pre-de\ufb01ned categories. RNN-EM learns to segment in an\nunsupervised fashion, which makes it applicable to settings with little or no labeled data. On the\ndownside this lack of supervision means that the resulting segmentation may not always match the\nintended outcome. This problem is inherent to this task since in real world images the notion of\nan object is ill-de\ufb01ned and task dependent. 
We envision future work to alleviate this by extending\nunsupervised segmentation to hierarchical groupings, and by dynamically conditioning them on the\ntask at hand using top-down feedback and attention.\n\n6 Conclusion\n\nWe have argued for the importance of separately representing conceptual entities contained in the input,\nand suggested clustering based on statistical regularities as an appropriate unsupervised approach\nfor separating them. We formalized this notion and derived a novel framework that combines neural\nnetworks and generalized EM into a trainable clustering algorithm. We have shown how this method\ncan be trained in a fully unsupervised fashion to segment its inputs into entities, and to represent\nthem individually. Using synthetic images and video, we have empirically veri\ufb01ed that our method\ncan recover the objects underlying the data, and represent them in a useful way. 
We believe that\nthis work will help to develop a theoretical foundation for understanding this important problem of\nunsupervised learning, as well as providing a \ufb01rst step towards building practical solutions that make\nuse of these symbol-like representations.\n\nAcknowledgements\n\nThe authors wish to thank Paulo Rauber and the anonymous reviewers for their constructive feedback.\nThis research was supported by the Swiss National Science Foundation grant 200021_165675/1\nand the EU project \u201cINPUT\u201d (H2020-ICT-2015 grant no. 687795). We are grateful to NVIDIA\nCorporation for donating us a DGX-1 as part of the Pioneers of AI Research award, and to IBM for\ndonating a \u201cMinsky\u201d machine.\n\nReferences\n\n[1] H.B. Barlow, T.P. Kaushal, and G.J. Mitchison. Finding Minimum Entropy Codes. Neural Computation, 1(3):412\u2013423, September 1989.\n\n[2] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm with application to wavelet-based image deblurring. In Acoustics, Speech and Signal Processing, 2009. ICASSP 2009. IEEE International Conference On, pages 693\u2013696. IEEE, 2009.\n\n[3] Yoshua Bengio. Deep learning of representations: Looking forward. In International Conference on Statistical Language and Speech Processing, pages 1\u201337. Springer, 2013.\n\n[4] Xi Chen, Yan Duan, Rein Houthooft, John Schulman, Ilya Sutskever, and Pieter Abbeel. InfoGAN: Interpretable Representation Learning by Information Maximizing Generative Adversarial Nets. arXiv:1606.03657 [cs, stat], June 2016.\n\n[5] Ingrid Daubechies, Michel Defrise, and Christine De Mol. An iterative thresholding algorithm for linear inverse problems with a sparsity constraint. Communications on pure and applied mathematics, 57(11):1413\u20131457, 2004.\n\n[6] A. P. Dempster, N. M. Laird, and D. B. Rubin. Maximum likelihood from incomplete data via the EM algorithm. 
Journal of the Royal Statistical Society, pages 1\u201338, 1977.\n\n[7] Klaus Greff, Antti Rasmus, Mathias Berglund, Tele Hotloo Hao, J\u00fcrgen Schmidhuber, and Harri Valpola. Tagger: Deep Unsupervised Perceptual Grouping. arXiv:1606.06724 [cs], June 2016.\n\n[8] Klaus Greff, Rupesh Kumar Srivastava, and J\u00fcrgen Schmidhuber. Binding via Reconstruction Clustering. arXiv:1511.06418 [cs], November 2015.\n\n[9] Karol Gregor and Yann LeCun. Learning fast approximations of sparse coding. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 399\u2013406, 2010.\n\n[10] Jose A. Guerrero-Col\u00f3n, Eero P. Simoncelli, and Javier Portilla. Image denoising using mixtures of Gaussian scale mixtures. In Image Processing, 2008. ICIP 2008. 15th IEEE International Conference On, pages 565\u2013568. IEEE, 2008.\n\n[11] John R. Hershey, Jonathan Le Roux, and Felix Weninger. Deep unfolding: Model-based inspiration of novel deep architectures. arXiv preprint arXiv:1409.2574, 2014.\n\n[12] Irina Higgins, Loic Matthey, Arka Pal, Christopher Burgess, Xavier Glorot, Matthew Botvinick, Shakir Mohamed, and Alexander Lerchner. Beta-VAE: Learning basic visual concepts with a constrained variational framework. In Proceedings of the International Conference on Learning Representations (ICLR), 2017.\n\n[13] Geoffrey E. Hinton. Distributed representations. 1984.\n\n[14] Aapo Hyv\u00e4rinen and Jukka Perki\u00f6. Learning to Segment Any Random Vector. In The 2006 IEEE International Joint Conference on Neural Network Proceedings, pages 4167\u20134172. IEEE, 2006.\n\n[15] Alexander Ilin, Isabeau Pr\u00e9mont-Schwarz, Tele Hotloo Hao, Antti Rasmus, Rinu Boney, and Harri Valpola. Recurrent Ladder Networks. arXiv:1707.09219 [cs, stat], July 2017.\n\n[16] Phillip Isola, Daniel Zoran, Dilip Krishnan, and Edward H. Adelson. Learning visual groups from co-occurrences in space and time. 
arXiv:1511.06811 [cs], November 2015.\n\n[17] Nebojsa Jojic and Brendan J. Frey. Learning \ufb02exible sprites in video layers. In Computer Vision and Pattern Recognition, 2001. CVPR 2001. Proceedings of the 2001 IEEE Computer Society Conference On, volume 1, pages I\u2013I. IEEE, 2001.\n\n[18] Anitha Kannan, John Winn, and Carsten Rother. Clustering appearance and shape by learning jigsaws. In Advances in Neural Information Processing Systems, pages 657\u2013664, 2007.\n\n[19] Diederik Kingma and Jimmy Ba. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980, 2014.\n\n[20] Nicolas Le Roux, Nicolas Heess, Jamie Shotton, and John Winn. Learning a generative model of images by factoring appearance and shape. Neural Computation, 23(3):593\u2013650, 2011.\n\n[21] P. M. Milner. A model for visual shape recognition. Psychological review, 81(6):521, 1974.\n\n[22] Augustus Odena, Vincent Dumoulin, and Chris Olah. Deconvolution and Checkerboard Artifacts. Distill, 2016.\n\n[23] Deepak Pathak, Ross Girshick, Piotr Doll\u00e1r, Trevor Darrell, and Bharath Hariharan. Learning Features by Watching Objects Move. arXiv:1612.06370 [cs, stat], December 2016.\n\n[24] R. A. Rao, G. Cecchi, C. C. Peck, and J. R. Kozloski. Unsupervised segmentation with dynamical units. Neural Networks, IEEE Transactions on, 19(1):168\u2013182, 2008.\n\n[25] R. A. Rao and G. A. Cecchi. An objective function utilizing complex sparsity for ef\ufb01cient segmentation in multi-layer oscillatory networks. International Journal of Intelligent Computing and Cybernetics, 3(2):173\u2013206, 2010.\n\n[26] David P. Reichert and Thomas Serre. Neuronal Synchrony in Complex-Valued Deep Networks. arXiv:1312.6115 [cs, q-bio, stat], December 2013.\n\n[27] Christopher J. Rozell, Don H. Johnson, Richard G. Baraniuk, and Bruno A. Olshausen. Sparse coding via thresholding and local competition in neural circuits. Neural computation, 20(10):2526\u20132563, 2008.\n\n[28] E. 
Saund. A multiple cause mixture model for unsupervised learning. Neural Computation, 7(1):51\u201371, 1995.\n\n[29] J. Schmidhuber. Learning to control fast-weight memories: An alternative to dynamic recurrent networks. Neural Computation, 4(1):131\u2013139, 1992.\n\n[30] J\u00fcrgen Schmidhuber. Learning Complex, Extended Sequences Using the Principle of History Compression. Neural Computation, 4(2):234\u2013242, March 1992.\n\n[31] J\u00fcrgen Schmidhuber. Learning Factorial Codes by Predictability Minimization. Neural Computation, 4(6):863\u2013879, November 1992.\n\n[32] Pablo Sprechmann, Alexander M. Bronstein, and Guillermo Sapiro. Learning ef\ufb01cient sparse and low rank models. IEEE transactions on pattern analysis and machine intelligence, 37(9):1821\u20131833, 2015.\n\n[33] Anne Treisman. The binding problem. Current Opinion in Neurobiology, 6(2):171\u2013178, April 1996.\n\n[34] Sudheendra Vijayanarasimhan, Susanna Ricco, Cordelia Schmid, Rahul Sukthankar, and Katerina Fragkiadaki. SfM-Net: Learning of Structure and Motion from Video. arXiv:1704.07804 [cs], April 2017.\n\n[35] N. X. Vinh, J. Epps, and J. Bailey. Information theoretic measures for clusterings comparison: Variants, properties, normalization and correction for chance. JMLR, 11:2837\u20132854, 2010.\n\n[36] C. von der Malsburg. Binding in models of perception and brain function. Current opinion in neurobiology, 5(4):520\u2013526, 1995.\n\n[37] Christoph von der Malsburg. The Correlation Theory of Brain Function. Departmental technical report, MPI, 1981.\n\n[38] D. Wang and D. Terman. Locally excitatory globally inhibitory oscillator networks. Neural Networks, IEEE Transactions on, 6(1):283\u2013286, 1995.\n\n[39] Paul J. Werbos. Generalization of backpropagation with application to a recurrent gas market model. Neural networks, 1(4):339\u2013356, 1988.\n\n[40] H. Wersing, J. J. Steil, and H. Ritter. 
A competitive-layer model for feature binding and sensory segmentation. Neural Computation, 13(2):357\u2013387, 2001.\n\n[41] Ronald J. Williams. Complexity of exact gradient computation algorithms for recurrent neural networks. Technical Report NU-CCS-89-27, Boston: Northeastern University, College of Computer Science, 1989.\n\n[42] CF Jeff Wu. On the convergence properties of the EM algorithm. The Annals of statistics, pages 95\u2013103, 1983.\n", "award": [], "sourceid": 3354, "authors": [{"given_name": "Klaus", "family_name": "Greff", "institution": "IDSIA"}, {"given_name": "Sjoerd", "family_name": "van Steenkiste", "institution": "The Swiss AI Lab - IDSIA"}, {"given_name": "J\u00fcrgen", "family_name": "Schmidhuber", "institution": "Swiss AI Lab, IDSIA (USI & SUPSI) - NNAISENSE"}]}