{"title": "Learning to Make Coherent Predictions in Domains with Discontinuities", "book": "Advances in Neural Information Processing Systems", "page_first": 372, "page_last": 379, "abstract": null, "full_text": "Learning to Make Coherent Predictions in Domains with Discontinuities \n\nSuzanna Becker and Geoffrey E. Hinton \n\nDepartment of Computer Science, University of Toronto \n\nToronto, Ontario, Canada M5S 1A4 \n\nAbstract \n\nWe have previously described an unsupervised learning procedure that discovers spatially coherent properties of the world by maximizing the information that parameters extracted from different parts of the sensory input convey about some common underlying cause. When given random dot stereograms of curved surfaces, this procedure learns to extract surface depth because that is the property that is coherent across space. It also learns how to interpolate the depth at one location from the depths at nearby locations (Becker and Hinton, 1992). In this paper, we propose two new models which handle surfaces with discontinuities. The first model attempts to detect cases of discontinuities and reject them. The second model develops a mixture of expert interpolators. It learns to detect the locations of discontinuities and to invoke specialized, asymmetric interpolators that do not cross the discontinuities. \n\n1 Introduction \n\nStandard backpropagation is implausible as a model of perceptual learning because it requires an external teacher to specify the desired output of the network. We have shown (Becker and Hinton, 1992) how the external teacher can be replaced by internally derived teaching signals. These signals are generated by using the assumption that different parts of the perceptual input have common causes in the external world. 
Small modules that look at separate but related parts of the perceptual input discover these common causes by striving to produce outputs that agree with each other (see Figure 1a). The modules may look at different modalities (e.g. vision and touch), or the same modality at different times (e.g. the consecutive 2-D views of a rotating 3-D object), or even spatially adjacent parts of the same image. In previous work, we showed that when our learning procedure is applied to adjacent patches of 2-dimensional images, it allows a neural network that has no prior knowledge of the third dimension to discover depth in random dot stereograms of curved surfaces. A more general version of the method allows the network to discover the best way of interpolating the depth at one location from the depths at nearby locations. We first summarize this earlier work, and then introduce two new models which allow coherent predictions to be made in the presence of discontinuities. \n\nFigure 1: a) Two modules that receive input from corresponding parts of stereo images. The first module receives input from stereo patch A, consisting of a horizontal strip from the left image (striped) and a corresponding strip from the right image (hatched). The second module receives input from an adjacent stereo patch B. The modules try to make their outputs, d_a and d_b, convey as much information as possible about some underlying signal (i.e., the depth) which is common to both patches. b) The architecture of the interpolating network, consisting of multiple copies of modules like those in a) plus a layer of interpolating units. 
The network tries to maximize the information that the locally extracted parameter d_c and the contextually predicted parameter d̂_c convey about some common underlying signal. We actually used 10 modules, and the central 6 modules tried to maximize agreement between their outputs and contextually predicted values. We used weight averaging to constrain the interpolating function to be identical for all modules. \n\n2 Learning spatially coherent features in images \n\nThe simplest way to get the outputs of two modules to agree is to use the squared difference between the outputs as a cost function, and to adjust the weights in each module so as to minimize this cost. Unfortunately, this usually causes each module to produce the same constant output that is unaffected by the input to the module and therefore conveys no information about it. What we want is for the outputs of two modules to agree closely (i.e. to have a small expected squared difference) relative to how much they both vary as the input is varied. When this happens, the two modules must be responding to something that is common to their two inputs. In the special case when the outputs, d_a and d_b, of the two modules are scalars, a good measure of agreement is: \n\nI = 0.5 log ( V(d_a + d_b) / V(d_a - d_b) )    (1) \n\nwhere V is the variance over the training cases. If d_a and d_b are both versions of the same underlying Gaussian signal that have been corrupted by independent Gaussian noise, it can be shown that I is the mutual information between the underlying signal and the average of d_a and d_b. By maximizing I we force the two modules to extract as pure a version as possible of the underlying common signal. \n\n2.1 The basic stereo net \n\nWe have shown how this principle can be applied to a multi-layer network that learns to extract depth from random dot stereograms (Becker and Hinton, 1992). 
Each network module received input from a patch of a left image and a corresponding patch of a right image, as shown in Figure 1a). Adjacent modules received input from adjacent stereo image patches, and learned to extract depth by trying to maximize agreement between their outputs. The real-valued depth (relative to the plane of fixation) of each patch of the surface gives rise to a disparity between features in the left and right images; since that disparity is the only property that is coherent across each stereo image, the output units of the modules were able to learn to detect relative depth accurately. \n\n2.2 The interpolating net \n\nThe basic stereo net uses a very simple model of coherence in which an underlying parameter at one location is assumed to be approximately equal to the parameter at a neighbouring location. This model is fine for the depth of fronto-parallel surfaces but it is far from the best model of slanted or curved surfaces. Fortunately, we can use a far more general model of coherence in which the parameter at one location is assumed to be an unknown linear function of the parameters at nearby locations. The particular linear function that is appropriate can be learned by the network. \n\nWe used a network of the type shown in Figure 1b). The depth computed locally by a module, d_c, was compared with the depth d̂_c predicted by a linear combination of the outputs of nearby modules, and the network tried to maximize the agreement between d_c and d̂_c. \n\nThe contextual prediction, d̂_c, was produced by computing a weighted sum of the outputs of the two adjacent modules on either side. The interpolating weights used in this sum, and all other weights in the network, were adjusted so as to maximize agreement between locally computed and contextually predicted depths. 
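The agreement measure of equation 1 and the linear contextual prediction can be sketched in a few lines of Python. This is our own illustrative code, not the authors' implementation: the function names, the toy data, and the noise level are assumptions.

```python
import numpy as np

def agreement(d_a, d_b):
    """Eq. 1: I = 0.5 * log( V(d_a + d_b) / V(d_a - d_b) ),
    with V the variance over training cases."""
    return 0.5 * np.log(np.var(d_a + d_b) / np.var(d_a - d_b))

def predict_depth(neighbours, w):
    """Contextual prediction d_hat_c: a weighted sum of the outputs of the
    two modules on either side, (d_a, d_b, d_d, d_e)."""
    return neighbours @ w

# Two noisy views of a common underlying signal agree strongly (I is large);
# two independent signals give I near zero.
rng = np.random.default_rng(0)
signal = rng.normal(size=1000)
d_a = signal + 0.1 * rng.normal(size=1000)
d_b = signal + 0.1 * rng.normal(size=1000)

# Near-optimal cubic-surface interpolating weights reported in the text.
w = np.array([-0.167, 0.667, 0.667, -0.167])
```

Note that maximizing I, rather than minimizing a squared difference, is what prevents both modules from collapsing to a constant output: a constant output makes both variances shrink together, leaving I unchanged.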
\nTo speed the learning, we first trained the lower layers of the network as before, so that agreement was maximized between neighbouring locally computed outputs. This made it easier to learn good interpolating weights. When the network was trained on stereograms of cubic surfaces, it learned interpolating weights of -0.147, 0.675, 0.656, -0.131 (Becker and Hinton, 1992). Given noise-free estimates of local depth, the optimal linear interpolator for a cubic surface is -0.167, 0.667, 0.667, -0.167. \n\n3 Throwing out discontinuities \n\nIf the surface is continuous, the depth at one patch can be accurately predicted from the depths of two patches on either side. If, however, the training data contain cases in which there are depth discontinuities (see Figure 2), the interpolator will also try to model these cases, and this will contribute considerable noise to the interpolating weights and to the depth estimates. One way of reducing this noise is to treat the discontinuity cases as outliers and to throw them out. Rather than making a hard decision about whether a case is an outlier, we make a soft decision by using a mixture model. For each training case, the network compares the locally extracted depth, d_c, with the depth predicted from the nearby context, d̂_c. It assumes that d_c - d̂_c is drawn from a zero-mean Gaussian if it is a continuity case and from a uniform distribution if it is a discontinuity case. 
It can then estimate the probability of a continuity case: \n\np_cont = N(d_c - d̂_c) / ( N(d_c - d̂_c) + k_discont )    (2) \n\nwhere N is a zero-mean Gaussian density and k_discont is a constant representing a uniform density.[1] \n\nFigure 2: Top: A curved surface strip with a discontinuity, created by fitting 2 cubic splines through randomly chosen control points, 25 pixels apart, separated by a depth discontinuity. Feature points are randomly scattered on each spline with an average of 0.22 features per pixel. Bottom: A stereo pair of \"intensity\" images of the surface strip formed by taking two different projections of the feature points, filtering them through a Gaussian, and sampling the filtered projections at evenly spaced sample points. The sample values in corresponding patches of the two images are used as the inputs to a module. The depth of the surface for a particular image region is directly related to the disparity between corresponding features in the left and right patch. Disparity ranges continuously from -1 to +1 image pixels. Each stereo image was 120 pixels wide and divided into 10 receptive fields 10 pixels wide and separated by 2 pixel gaps, as input for the networks shown in Figure 1. The receptive field of an interpolating unit spanned 58 image pixels, and discontinuities were randomly located a minimum of 40 pixels apart, so only rarely would more than one discontinuity lie within an interpolator's receptive field. \n\nWe can now optimize the average information d_c and d̂_c transmit about their common cause. 
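The soft continuity decision described above can be written as a short function. This is our own illustrative sketch, not the authors' code; the names residual, var_cont, and k_discont are assumptions.

```python
import numpy as np

# Soft outlier decision: the residual d_c - d_hat_c is modelled as a
# zero-mean Gaussian with variance var_cont for continuity cases and as a
# uniform density k_discont for discontinuity cases; the posterior
# probability of continuity follows from Bayes' rule.

def p_continuity(residual, var_cont, k_discont):
    gauss = np.exp(-residual**2 / (2.0 * var_cont)) / np.sqrt(2.0 * np.pi * var_cont)
    return gauss / (gauss + k_discont)
```

A small residual (a likely continuity case) yields a probability near 1; a residual far out in the Gaussian's tail yields a probability near 0. As the footnote notes, k_discont is fixed empirically and the starting value of the continuity variance is shrunk gradually during learning.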
We assume that no information is transmitted in discontinuity cases, so the average information depends on the probability of continuity and on the variances of d_c + d̂_c and d_c - d̂_c measured only in the continuity cases: \n\nI = p_cont 0.5 log ( V_cont(d_c + d̂_c) / V_cont(d_c - d̂_c) )    (3) \n\nwhere V_cont is the variance measured over the continuity cases. We tried several variations of this mixture approach. The network is quite good at rejecting the discontinuity cases, but this leads to only a modest improvement in the performance of the interpolator. In cases where there is a depth discontinuity between d_a and d_b or between d_d and d_e, the interpolator works moderately well because the weights on d_a or d_e are small. Because of the term p_cont in equation 3 there is pressure to include these cases as continuity cases, so they probably contribute noise to the interpolating weights. In the next section we show how to avoid making a forced choice between rejecting these cases or treating them just like all the other continuity cases. \n\n4 Learning a mixture of expert interpolators \n\nThe presence of a depth discontinuity somewhere within a strip of five adjacent patches does not entirely eliminate the coherence of depth across these patches. It just restricts the range over which this coherence operates. So instead of throwing out cases that contain a discontinuity, the network could try to develop a number of different, specialized interpolators, each of which captures the particular type of coherence that remains in the presence of a discontinuity at a particular location. If, for example, there is a depth discontinuity between d_c and d_d, an extrapolator with weights of -1.0, +2.0, 0, 0 would be an appropriate predictor of d_c. \n\nFigure 3 shows the system of five expert interpolators that we used for predicting d_c from the neighbouring depths. 
To allow the system to invoke the appropriate interpolator, each expert has its own \"controller\" which must learn to detect the presence of a discontinuity at a particular location (or the absence of a discontinuity, in the case of the interpolator for pure continuity cases). The outputs of the controllers are normalized, as shown in Figure 3, so that they form a probability distribution. We can think of these normalized outputs as the probability with which the system selects a particular expert. The controllers get to see all five local depth estimates, and most of them learn to detect particular depth discontinuities by using large weights of opposite sign on the local depth estimates of neighbouring patches. \n\n[1] We empirically select a good (fixed) value of k_discont, and we choose a starting value of V_cont(d_c - d̂_c) (some proportion of the initial variance of d_c - d̂_c), and gradually shrink it during learning. \n\nFigure 3: The architecture of the mixture of interpolators and discontinuity detectors. [The figure shows five experts with outputs d̂_c,1, ..., d̂_c,5 and five controllers with outputs x_1, ..., x_5, which are normalized to give selection probabilities p_1, ..., p_5.] We actually used a larger modular network and equality constraints between modules, as described in Figure 1b), with 6 copies of the architecture shown here. Each copy received input from different but overlapping parts of the input. \n\nFigure 4 shows the weights learned by the experts and by their controllers. As expected, there is one interpolator (the top one) that is appropriate for continuity cases and four other interpolators that are appropriate for the four different locations of a discontinuity. 
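The controller normalization and the mixture-of-Gaussians objective used to train the competing experts (section 4.1) can be sketched as follows. We assume a standard softmax normalization of the controller outputs, as in Jacobs, Jordan, Nowlan and Hinton (1991); all names are illustrative, not the authors' code.

```python
import numpy as np

# Sketch of the competing-experts objective. Each expert i predicts
# d_hat[i]; its controller emits x[i]; a softmax over the controller
# outputs gives the mixing proportions p[i].

def softmax(x):
    e = np.exp(x - x.max())          # subtract max for numerical stability
    return e / e.sum()

def log_prob(d_c, d_hat, x, sigma=0.1):
    """log P(d_c) under the mixture of Gaussians defined by the experts:
    each expert's prediction is the mean of a Gaussian with variance
    sigma^2, and its controller's normalized output is that Gaussian's
    mixing proportion."""
    p = softmax(x)
    dens = np.exp(-(d_c - d_hat)**2 / (2 * sigma**2)) / (np.sqrt(2 * np.pi) * sigma)
    return np.log(np.sum(p * dens))
```

Maximizing this log probability by gradient ascent simultaneously adapts the experts (to predict d_c well) and the controllers (to assign high probability to whichever expert predicts best on each case).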
In interpreting the weights of the controllers it is important to remember that a controller which produces a small x value for a particular case may nevertheless assign high probability to its expert if all the other controllers produce even smaller x values. \n\n4.1 The learning procedure \n\nIn the example presented here, we first trained the network shown in Figure 1b) on images with discontinuities. We then used the outputs of the depth extracting layer, d_a, ..., d_e, as the inputs to the expert interpolators and their controllers. The system learned a set of expert interpolators without backpropagating derivatives all the way down to the weights of the local depth extracting modules. So the local depth estimates d_a, ..., d_e did not change as the interpolators were learned. \n\nTo train the system we used an unsupervised version of the competing experts algorithm described by Jacobs, Jordan, Nowlan and Hinton (1991). The output of the ith expert, d̂_c,i, is treated as the mean of a Gaussian distribution with variance σ², and the normalized output of each controller, p_i, is treated as the mixing proportion of that Gaussian. So, for each training case, the outputs of the experts and their controllers define a probability distribution that is a mixture of Gaussians. The aim of the learning is to maximize the log probability density of the desired output, d_c, under this mixture of Gaussians distribution. For a particular training case this log probability is given by: \n\nlog P(d_c) = log Σ_i p_i (1 / (√(2π) σ)) exp( -(d_c - d̂_c,i)² / (2σ²) )    (4) \n\nBy taking derivatives of this objective function we can simultaneously learn the weights in the experts and in the controllers. For the results shown here, the network was trained for 30 conjugate gradient iterations on a set of 1000 random dot stereograms with discontinuities. \n\nFigure 4: a) Typical weights learned by the five competing interpolators and the corresponding five discontinuity detectors. Positive weights are shown in white, and negative weights in black. b) The mean probabilities computed by each discontinuity detector, plotted against the distance from the center of the units' receptive field to the nearest discontinuity. The probabilistic outputs are averaged over an ensemble of 1000 test cases. If the nearest discontinuity is beyond ±30 pixels, it is outside the units' receptive field and the case is therefore a continuity example. \n\nThe rationale for the use of a variance ratio in equation 1 is to prevent the variances of d_a and d_b collapsing to zero. Because the local estimates d_a, ... 
, d_e did not change as the system learned the expert interpolators, it was possible to use (d_c - d̂_c,i)² in the objective function without worrying about the possibility that the variance of d_c across cases would collapse to zero during the learning. Ideally we would like to refine the weights of the local depth estimators to maximize their agreement with the contextually predicted depths produced by the mixture of expert interpolators. One way to do this would be to generalize equation 3 to handle a mixture of expert interpolators: \n\nI = Σ_i p_i 0.5 log ( V_cont,i(d_c + d̂_c,i) / V_cont,i(d_c - d̂_c,i) )    (5) \n\nAlternatively, we could modify equation 4 by normalizing the difference (d_c - d̂_c,i)² by the actual variance of d_c, though this makes the derivatives considerably more complicated. \n\n5 Discussion \n\nThe competing controllers in Figure 3 explicitly represent which regularity applies in a particular region. The outputs of the controllers for nearby regions may themselves exhibit coherence at a larger spatial scale, so the same learning technique could be applied recursively. In 2-D images this should allow the continuity of depth edges to be discovered. \n\nThe approach presented here should be applicable to other domains which contain a mixture of alternative local regularities across space or time. For example, a rigid shape causes a linear constraint between the locations of its parts in an image, so if there are many possible shapes, there are many alternative local regularities (Zemel and Hinton, 1991). \n\nOur learning procedure differs from methods that try to capture as much information as possible about the input (Linsker, 1988; Atick and Redlich, 1990) because we ignore information in the input that is not coherent across space. \n\nAcknowledgements \n\nThis research was funded by grants from NSERC and the Ontario Information Technology Research Centre. 
Hinton is the Noranda Fellow of the Canadian Institute for Advanced Research. Thanks to John Bridle and Steve Nowlan for helpful discussions. \n\nReferences \n\nAtick, J. J. and Redlich, A. N. (1990). Towards a theory of early visual processing. Technical Report IASSNS-HEP-90/10, Institute for Advanced Study, Princeton. \n\nBecker, S. and Hinton, G. E. (1992). A self-organizing neural network that discovers surfaces in random-dot stereograms. Nature, 355:161-163. \n\nJacobs, R. A., Jordan, M. I., Nowlan, S. J., and Hinton, G. E. (1991). Adaptive mixtures of local experts. Neural Computation, 3(1):79-87. \n\nLinsker, R. (1988). Self-organization in a perceptual network. IEEE Computer, 21(3):105-117. \n\nZemel, R. S. and Hinton, G. E. (1991). Discovering viewpoint-invariant relationships that characterize objects. In Advances in Neural Information Processing Systems 3, pages 299-305. Morgan Kaufmann Publishers. \n", "award": [], "sourceid": 534, "authors": [{"given_name": "Suzanna", "family_name": "Becker", "institution": null}, {"given_name": "Geoffrey", "family_name": "Hinton", "institution": null}]}