{"title": "Attentional Neural Network: Feature Selection Using Cognitive Feedback", "book": "Advances in Neural Information Processing Systems", "page_first": 2033, "page_last": 2041, "abstract": "Attentional Neural Network is a new framework that integrates top-down cognitive bias and bottom-up feature extraction in one coherent architecture. The top-down influence is especially effective when dealing with high noise or difficult segmentation problems. Our system is modular and extensible. It is also easy to train and cheap to run, and yet can accommodate complex behaviors. We obtain classification accuracy better than or competitive with state of art results on the MNIST variation dataset, and successfully disentangle overlaid digits with high success rates. We view such a general purpose framework as an essential foundation for a larger system emulating the cognitive abilities of the whole brain.", "full_text": "Attentional Neural Network: Feature Selection Using Cognitive Feedback\n\nQian Wang\nDepartment of Biomedical Engineering\nTsinghua University\nBeijing, China 100084\nqianwang.thu@gmail.com\n\nJiaxing Zhang\nMicrosoft Research Asia\n5 Danning Road, Haidian District\nBeijing, China 100080\njiaxz@microsoft.com\n\nSen Song \u2217\nDepartment of Biomedical Engineering\nTsinghua University\nBeijing, China 100084\nsen.song@gmail.com\n\nZheng Zhang \u2217 \u2020\nDepartment of Computer Science\nNYU Shanghai\n1555 Century Ave, Pudong\nShanghai, China 200122\nzz@nyu.edu\n\nAbstract\n\nAttentional Neural Network is a new framework that integrates top-down cognitive\nbias and bottom-up feature extraction in one coherent architecture. The\ntop-down influence is especially effective when dealing with high noise or difficult\nsegmentation problems. Our system is modular and extensible. It is also\neasy to train and cheap to run, and yet can accommodate complex behaviors. 
We\nobtain classification accuracy better than or competitive with state-of-the-art results\non the MNIST variation dataset, and successfully disentangle overlaid digits with\nhigh success rates. We view such a general-purpose framework as an essential\nfoundation for a larger system emulating the cognitive abilities of the whole brain.\n\n1 Introduction\n\nHow our visual system achieves robust performance against corruptions is a mystery. Although its\nperformance may degrade, it is capable of performing denoising and segmentation tasks with different\nlevels of difficulty using the same underlying architecture. Consider the first two examples\nin Figure 1. Digits overlaid on random images are harder to recognize than those overlaid on random\nnoise, since pixels in the background images are structured and highly correlated. It is even more\nchallenging if two digits are overlaid on each other, in a benchmark that we call MNIST-2. Yet, with\ndifferent levels of effort (and error rates), we are able to recognize these digits in all three cases.\n\nFigure 1: Handwritten digits with different corruptions. From left to right: random background\nnoise, random background images, and MNIST-2.\n\n\u2217These authors supervised the project jointly and are co-corresponding authors.\n\u2020Work partially done while at Microsoft Research Asia.\n\nAnother interesting property of the human visual system is that recognition is fast at low noise\nlevels but takes longer for cluttered scenes. Subjects perform well on recognition tasks even when\nthe exposure duration is short enough to allow only one feed-forward pass [18], while finding the\ntarget in cluttered scenes requires more time [4]. This evidence suggests that our visual system is\nsimultaneously optimized for the common case and over-engineered for the worst. 
One hypothesis is\nthat, when challenged with high noise, top-down \u201cexplanations\u201d propagate downwards via feedback\nconnections, and modulate lower-level features in an iterative refinement process [19].\nInspired by these intuitions, we propose a framework called attentional neural network (aNN). aNN\nis composed of a collection of simple modules. The denoising module performs multiplicative\nfeature selection controlled by a top-down cognitive bias, and returns a modified input. The classification\nmodule receives inputs from the denoising module and generates assignments. If necessary,\nmultiple proposals can be evaluated and compared to pick the final winner. Although the modules\nare simple, their combined behaviors can be complex, and new algorithms can be plugged in to\nrewire the behavior, e.g., a fast pathway for low noise, and an iterative mode for complex problems\nsuch as MNIST-2. We have validated the performance of aNN on the MNIST variation dataset.\nWe obtained accuracy better than or competitive with the state of the art. In the challenging benchmark of\nMNIST-2, we are able to predict at least one digit or both digits correctly more than 95% and 44% of the\ntime, respectively. aNN is easy to train and cheap to run. All the modules are trained with known\ntechniques (e.g. sparse RBM and back propagation), and inference takes far fewer rounds of\niterations than existing proposals based on generative models.\n\n2 Model\n\naNN deals with two related issues: 1) constructing a segmentation module under the influence of\ncognitive bias and 2) its application to the challenging task of classifying highly corrupted data. 
We\ndescribe them in turn, and conclude with a brief description of training methodologies.\n\n2.1 Segmentation with cognitive bias\n\nFigure 2: Segmentation module with cognitive bias (a) and classification based on that (b,c).\n\nAs illustrated in Figure 2(a), the objective of the segmentation module M is to segment out an object\ny belonging to one of N classes in the noisy input image x. Unlike traditional denoising\nmodels such as autoencoders, M is given a cognitive bias vector b \u2208 {0, 1}^N, whose i-th element\nindicates a prior belief on the existence of objects belonging to the i-th class in the noisy image.\nDuring the bottom-up pass, input image x is mapped into a feature vector h = \u03c3(W \u00b7 x), where\nW is the feature weight matrix and \u03c3 represents the element-wise nonlinear sigmoid function. During\nthe top-down pass, b generates a gating vector g = \u03c3(U \u00b7 b) with the feedback weights U. g selects\nand de-selects the features by modifying the hidden activation h_g = h \u2299 g, where \u2299 means element-wise\nmultiplication. Reconstruction occurs from h_g by y = \u03c3(W\u2032 \u00b7 h_g). 
In general, bias b can be a\nprobability distribution indicating a mixture of several guesses, but in this paper we only use two\nsimpler scenarios: a binary vector to indicate whether there is a particular object with its associated\nweights U_G, or a group bias b_G with equal values for all objects, which indicates the presence of\nsome object in general.\n\n[Figure 2 diagram. (a) Module M: h = \u03c3(W \u00b7 x), g = \u03c3(U \u00b7 b), h_g = h \u2299 g, y = \u03c3(W\u2032 \u00b7 h_g). (b) M followed by the gate (y > \u03b5) \u2299 x and the classifier C. (c) The same pipeline with the gated image z fed back into M.]\n\n2.2 Classification\n\nA simple strategy would be to feed the segmented input y into a classifier C. However, this suffers\nfrom the loss of details during M\u2019s reconstruction and is prone to hallucinations, i.e. y transforming\nto a wrong digit when given a wrong bias. We opted to use the reconstruction y to gate the raw\nimage x with a threshold \u03b5 to produce the gated image z = (y > \u03b5) \u2299 x for classification (Figure 2b). To\nsegment complex images, we explored an iterative design that is reminiscent of a recurrent network\n(Figure 2c). At time step t, the input to the segmentation module M is z_t = (y_{t\u22121} > \u03b5) \u2299 x, and the\nresult y_t is used for the next iteration. 
Consulting the raw input x each time prevents hallucination.\nAlternatively, we could feed the intermediate representation h_g to the classifier, and such a strategy\ngives reasonable performance (see the group bias paragraph of Section 3.2), but in general this suffers\nfrom loss of modularity.\nFor iterative classification, we can give the system an initial cognitive bias, and the system produces\na series of guesses b and classification results given by C. If the guess b is confirmed by the output\nof C, then we consider b as a candidate for the final classification result. A wrong bias b will lead the\nnetwork to transform x to a different class, but the segmented images under the correct bias are often\nstill better than the transformed images under the wrong bias. In the simplest version, we can give initial\nbiases b over all classes and compare the fitness of the candidates. Such fitness metrics can be the number\nof iterations it takes C to confirm the guess, the confidence of the confirmation, or a combination\nof many related factors. For simplicity, we use the entropy of the outputs of C, but more sophisticated\nextensions are possible (see the \u201cMaking it scalable\u201d paragraph of Section 3.2).\n\n2.3 Training the model\n\nWe used a shallow network of RBM for the generative model, and autoencoders gave qualitatively\nsimilar results. The parameters to be learned include the feature weights W and the feedback\nweights U. The multiplicative nature of feature selection makes learning both W and U simultaneously\nproblematic, and we overcame this problem with a two-step procedure: first, W is trained\nwith noisy data in a standalone RBM (i.e. with the feedback disabled); next, we fix W and learn\nU with the noisy data as input but with clean data as target, using the standard back propagation\nprocedure. This forces U to learn to select relevant features and de-select distractors. 
We find it\nhelpful to use different noise levels in these two stages. In the results presented below, training W\nand U uses half and full noise intensity, respectively. In practice, this simple strategy is surprisingly\neffective (see Section 3). We found it important to use a sparsity constraint when learning W to\nproduce local features. Global features (e.g. templates) tend to be activated by noise and data\nalike, and tend to be de-selected by the feedback weights. We speculate that feature locality might\nbe especially important when compositionality and segmentation are considered. Jointly training the\nfeatures and the classifier is a tantalizing idea but proves to be difficult in practice, as the procedure is\niterative and the feedback weights need to be handled. Attempts could be made in this direction\nin the future to fine-tune performance for a particular task. Another hyper-parameter is the threshold\n\u03b5. We assume that there is a global minimum, and used binary search on a small validation set.\u00b9\n\n3 Results and Analysis\n\nWe used the MNIST variation dataset and MNIST-2 to evaluate the effectiveness of our framework.\nMNIST-2 is composed by overlaying two randomly chosen clean MNIST digits. Unless otherwise\nstated, we used an off-the-shelf classifier: a 3-layer perceptron with 256 hidden nodes, trained on\nclean MNIST data with a 1.6% error rate. In the following sections, we will discuss bias-induced\nfeature selection and its application in denoising, segmentation and finally classification.\n\n3.1 Effectiveness of feedback\n\nIf feature selection is sensitive to the cognitive bias b, then a given b should lead to the activation\nof the corresponding relevant features. 
In Figure 3(a), we sorted the hidden units by the associated\nweights in U for a given bias from the set {0, 1, 2, 8}, and inspected their associated feature weights\nin W. The top features, when superimposed, successfully compose a crude version of the target digit.\nSince b controls feature selection, it can lead to effective segmentation (shown in Figure 3(b)).\nBy comparing the reconstruction results in the second row (without bias) with those in the third\nand fourth rows (with group bias and correct bias, respectively), it is clear that segmentation quality\nprogressively improves. On the other hand, a wrong bias (fifth row) will try to select features in\nits favor in two ways: selecting features shared with the correct bias, and hallucinating incorrect\nfeatures by segmenting from the background noise. Figure 3(c) goes further to reveal how feature\nselection works. The first row shows features for one noisy input, sorted by their activity levels\nwithout the bias. The next three rows show their deactivation by the cognitive biases. The last column\nshows a reconstructed image using the selected features in this figure. It is clear how a wrong bias\nfails to produce a reasonable reconstructed image.\n\n\u00b9The training and testing code can be found at https://github.com/qianwangthu/feedback-nips2014-wq.git\n\nFigure 3: The effectiveness of bias-controlled feature selection. (a) Top features selected by different\ncognitive biases (0, 1, 2, 8) and their accumulation; (b) reconstruction: denoising without bias, with group bias, correct\nbias and wrong bias (b = 1); (c) feature selection: how bias selects and de-selects features; the second and the third\nrows correspond to the correct and wrong bias, respectively.\n\nFigure 4: Recurrent segmentation examples in six iterations. 
In each iteration, the classification\nresult is shown under the reconstructed image, along with the confidence (red bar; the longer the bar, the\nhigher the confidence).\n\n[Figure 3 panel labels. (a) b=0, b=1, b=2, b=8, sum. (b) input, no bias, group bias, correct bias, wrong bias. (c) activated, group bias, b=1, b=2, sum.]\n\n[Figure 4 panel data, initial guess \u2192 predicted class at each of the six iterations. (a) 1 \u2192 555575, 2 \u2192 222222, 3 \u2192 222222, 4 \u2192 277777, 7 \u2192 777777, 9 \u2192 277777. (b) 1 \u2192 111111, 2 \u2192 222222, 3 \u2192 333333, 4 \u2192 444444, 5 \u2192 555555, 9 \u2192 799999.]\n\nAs described in Section 2, segmentation might take multiple iterations, and each iteration produces a\nreconstruction that can be processed by an off-the-shelf classifier. Figure 4 shows two cases, with associated\npredictions generated by the 3-layer MLP. In the first example (Figure 4(a)), two cognitive\nbias guesses, 2 and 7, are confirmed by the network, and the correct guess 2 has a greater confidence.\nThe second example (Figure 4(b)) illustrates that, under a high-intensity background, transformations\ncan happen and a hallucinated digit can be \u201cbuilt\u201d from a patch of high-intensity region, since such regions can\nindiscriminately activate features. Such transformations constitute false positives (i.e. confirming a\nwrong guess) and pose challenges to classification. More complicated strategies such as local contrast\nnormalization can be used in the future to deal with such cases. This phenomenon is not at all\nuncommon in everyday life: when truth is flooded with high noise, all interpretations\nare possible, and each one picks evidence in its favor while ignoring others.\nAs described in Section 2, we used an entropy confidence metric to select the winner from the candidates.\nThe MLP classifier C produces a predicted score for the likelihood of each class, and we take\nthe total confidence as the entropy of the prediction distribution, normalized by its class average\nobtained under clean data. 
This confidence metric, as well as the associated classification result, is\ndisplayed under each reconstruction. The first example shows that confidence under the right guess\n(i.e. 2) is higher. On the other hand, the second example shows that, with high noise, the confidences\nof many guesses are equally poor. Furthermore, more iterations often lead to higher confidence,\nregardless of whether the guess is correct or not. This self-fulfilling process locks predictions to\ntheir given biases, instead of differentiating them, which is also a familiar scenario.\n\n3.2 Classification\n\nTable 1: Classification performance (error rate, %)\n\nModel             back-rand    back-image\nRBM               11.39        15.42\nimRBM             10.46        16.35\ndiscRBM           10.29        15.56\nDBN-3              6.73        16.31\nCAE-2             10.90        15.50\nPGBM               6.08        12.25\nsDBN               4.48        14.34\naNN - \u03b8_rand       3.22        22.30\naNN - \u03b8_image      6.09        15.33\n\nFigure 5: (a) error vs. background level. (b) error vs. iteration number.\n\nTo compare with previous results, we used the standard training/testing split (12K/50K) of the\nMNIST variation set, and results are shown in Table 1. We ran one-iteration denoising, and\nthen picked the winner by comparing normalized entropies among the candidates, i.e. those with\nbiases matching the prediction of the 3-layer MLP classifier. We trained two parameter sets separately\non the random-noise background (\u03b8_rand) and image background (\u03b8_image) datasets. To test transfer\nabilities, we also applied \u03b8_image to random-noise background data and \u03b8_rand to image background\ndata. On the MNIST-back-rand and MNIST-back-image datasets, \u03b8_rand achieves 3.22% and 22.30% error\nrates respectively, while \u03b8_image achieves 6.09% and 15.33%.\nFigure 5(a) shows how the performance deteriorates with increasing noise level. In these experiments,\nrandom noise and random images are modulated by scaling down their pixel intensity linearly. 
Intuitively, at low noise the performance should approach the default accuracy of the classifier\nC, and this is indeed the case.\n\n[Figure 5 plots. (a) x-axis: background level (0 to 1); y-axis: error rate; curves: mnist-background-noise, mnist-background-image. (b) x-axis: iteration (1 to 5); y-axis: error rate; curves: false negative, false positive.]\n\nThe effect of iterations: We have chosen to run only one iteration because, under high noise, each\nguess will insist on picking features in its favor and some hallucination can still occur. With more\niterations, false positive rates will rise and false negative rates will decrease, as confidence scores for\nboth the right and the wrong guesses will keep on improving. This is shown in Figure 5(b). As such,\nmore iterations do not necessarily lead to better performance. In the current model, the predicted\nclass from the previous step is not fed into the next step, and more sophisticated strategies with\nsuch an extension might produce better results in the future.\nThe power of group bias: For this benchmark, good performance mostly depends on the quality of\nsegmentation. Therefore, a simpler approach is to denoise with a coarse-grained group bias, followed\nby classification. For \u03b8_image, we attached an SVM to the hidden units with b_G turned on, and obtained\na 16.2% error rate. However, when we trained an SVM with 60K samples, the error rate dropped to 12.1%.\nThis confirms that supervised learning can achieve better performance with more training data.\nMaking it scalable: So far, we enumerate over all the guesses. This is clearly not scalable if the number\nof classes is large. One sensible solution is to first denoise with a group bias b_G, pick the top-K\ncandidates from the prediction distribution, and then iterate among them.\nFinally, we emphasize that the above results are obtained with only one up-down pass. This is in\nstark contrast to other generative-model-based systems. 
For example, in PGBM [15], each inference\ntakes 25 rounds.\n\n3.3 MNIST-2 problem\n\nCompared to corruption by background noise, MNIST-2 is a much more challenging task, even for\na human observer. It is a problem of segmentation, not denoising. In fact, such segmentation requires\nsemantic understanding of the object. Knowing which features are task-irrelevant is not sufficient;\nwe need to discover and utilize per-class features. Any denoising architecture that only removes task-irrelevant\nfeatures will fail on such a task without additional mechanisms. In aNN, each bias has its\nown associated features and explicitly calls these features out in the reconstruction phase (modulated\nby input activations). Meanwhile, the framework permits multiple predictions, so it can accommodate\nsuch problems.\n\nFigure 6: Sample results on MNIST-2. In each example, each column is one iteration. The first two\nrows are runs with the two ground truth digits as biases; the others are runs with wrong biases.\n\nFor the MNIST-2 task, we used the same off-the-shelf 3-layer classifier to validate a guess. In the\nfirst two examples in Figure 6, the pair of digits in the ground truth is correctly identified. Supplying\neither digit as the bias successfully segments its features, resulting in imperfect reconstructions that\nare nonetheless confident enough to win over competing proposals. One would expect that the\nrandom nature of MNIST-2 would create much more challenging (and interesting) cases that either\ndefy or confuse any segmentation attempts. This is indeed true. The last example is an overlay of\nthe digits 1 and 5 that looks like a perfect 8. Each of the 5 biases successfully segments out its target\n\u201cdigit\u201d, sometimes creatively. 
It is satisfying to see that a human observer would make similar\nmisjudgements in those cases.\n\n[Figure 6 panel data, bias guess \u2192 predicted class per iteration; the first two guesses in each panel are the ground truth digits. (a) 2 \u2192 222222, 6 \u2192 666666, 1 \u2192 222222, 4 \u2192 644444, 8 \u2192 222222. (b) 2 \u2192 222222, 7 \u2192 777777, 0 \u2192 222222, 1 \u2192 111111, 4 \u2192 774444. (c) 1 \u2192 211111, 5 \u2192 555555, 2 \u2192 888222, 3 \u2192 333333, 4 \u2192 844444.]\n\nFigure 7: Sample results on MNIST-2 when adding background noise. (a), (b) and (c) are examples from\nthree groups of results, in which both digits, one digit, or neither digit is predicted correctly, respectively.\n\nOut of the 5000 MNIST-2 pairs, at least one digit is correctly predicted in 95.46% of cases and\nboth digits are correctly predicted in 44.62% of cases. Given the challenging nature of the benchmark, we\nare surprised by this performance. Contrary to the random background dataset, in this problem more\niterations conclusively lead to better performance. The above accuracy is obtained with 5 iterations,\nand the accuracy for matching both digits drops to 36.28% if only 1 iteration is used. Even more\ninterestingly, this performance is resilient against background noise (Figure 7): the accuracy only\ndrops slightly (to 93.72% and 41.66%). The top-down biases allowed us to achieve segmentation and\ndenoising at the same time.\n\n4 Discussion and Related Work\n\n4.1 Architecture\n\nFeedforward multilayer neural networks have achieved good performance in many classification\ntasks in the past few years, notably achieving the best performance in the ImageNet competition\nin vision [21, 7]. However, they typically give a fixed outcome for each input image, and therefore\ncannot naturally model the influence of cognitive biases and are difficult to incorporate into a larger\ncognitive framework. The current frontier of vision research is to go beyond object recognition\ntowards image understanding [16]. 
Inspired by neuroscience research, we believe that a unified\nmodule which integrates feedback predictions and interpretations with information from the world\nis an important step towards this goal.\nGenerative models have been a popular approach [5, 13]. They are typically based on a probabilistic\nframework such as Boltzmann Machines and can be stacked into a deep architecture. They have\nadvantages over discriminative models in dealing with object occlusion. In addition, prior knowledge\ncan be easily incorporated in generative models in the form of latent variables. However,\ndespite the mathematical beauty of a probabilistic framework, this class of models currently suffers\nfrom the difficulty of generative learning and has been mostly successful in learning small patches\nof natural images and objects [17, 22, 13]. In addition, inferring the hidden variables from images\nis a difficult process and many iterations are typically needed for the model to converge [13, 15]. A\nrecent trend is to first train a DBN or DBM model and then turn the model into a discriminative network\nfor classification. This allows for fast recognition, but the discriminative network loses the generative\nability and cannot combine top-down and bottom-up information.\nWe sought a simple architecture that can flexibly navigate between discriminative and generative\nframeworks. This should ideally allow for one-pass quick recognition for images with easy and\nwell-segmented objects, but naturally allow for iteration and influence by cognitive bias when the\nneed for segmentation arises in corrupted or occluded image settings.\n\n4.2 Models of Attention\n\nIn the field of computational modeling of attention, many models have been proposed to model the\nsaliency map, and have been used to predict where attention will be deployed and to provide fits to eye-tracking\ndata [1]. 
We are instead more interested in how attentional signals propagating back from higher levels\nin the visual hierarchy can be merged with bottom-up information. Volitional top-down control\ncould update, bias or disambiguate the bottom-up information based on high-level tasks, contextual\ncues or behavioral goals. Computational models incorporating this principle have so far mostly\nfocused on spatial attention [12, 1]. For example, in a pedestrian detection task, it was shown that\nvisual search can be sped up if the search is limited to spatial locations of high prior or posterior\nprobabilities [3]. However, human attention abilities go beyond simple highlighting based on location.\nFor example, the ability to segment and disentangle objects based on high-level expectations,\nas in the MNIST-2 dataset, represents an interesting case. Here, we demonstrate that top-down attention\ncan also be used to segment out relevant parts in a cluttered and entangled scene guided\nby top-down interpretation, showing that attentional bias can be successfully deployed at a\nfar more fine-grained level than previously realized.\nWe have chosen the image-denoising and image-segmentation tasks as our test cases. In the context\nof image denoising, feedforward neural networks have been shown to have good performance [6,\n20, 11]. However, these approaches do not include a feedback component and have no generative ability.\nSeveral Boltzmann machine based architectures have been proposed [9, 8]. In PGBM, gates on input\nimages are trained to partition the pixels as belonging to objects or backgrounds, which are modeled\nby two RBMs separately [15]. 
The gates and the RBM components make up a high-order RBM.\nHowever, such a high-order RBM is difficult to train and needs costly iterations during inference.\nsDBN [17] uses an RBM to model the distribution of the hidden layer, and then denoises the hidden\nlayer by Gibbs sampling over the hidden units affected by noise. Besides the complexity of Gibbs\nsampling, the process of iteratively finding which units are affected by noise is also complicated and\ncostly, as there is a process of Gibbs sampling for each unit. When there are multiple digits appearing\nin the image, as in the case of MNIST-2, the hidden layer denoising step leads to uncertain results,\nand the best outcome is an arbitrary choice of one of the mixed digits. A DBM-based architecture has\nalso been proposed for modeling attention, but the complexity of learning and inference also makes\nit difficult to apply in practice [10]. All these works also lack the ability of controlled generation\nand input reconstruction under the direction of a top-down bias.\nIn our work, top-down biases influence the processing of feedforward information at two levels. The\ninputs are gated at the raw image stage by top-down reconstructions. We propose that this might be\nequivalent to the powerful gating influence of the thalamus in the brain [1, 15]. If the influence of\nthe input image is shut off at this stage, then the system can engage in hallucination and might get into a\nstate akin to dreams, as when the thalamic gates are closed. Top-down biases also affect processing\nat a higher stage of high-level features. We think this might be equivalent to the processing level of\nV4 in the visual hierarchy. 
At this level, top-down biases mostly suppress task-irrelevant features,\nand we have modeled the interactions as multiplicative in accordance with results from neuroscience\nresearch [1, 2].\n\n4.3 Philosophical Points\n\nThe issue of whether top-down connections and iterative processing are useful for object recognition\nhas been a point of hot contention. Early work inspired by the Hopfield network and the tradition\nof probabilistic models based on Gibbs sampling argue for the usefulness of feedback and iteration\n[14, 13], but results from neuroscience research and the recent success of purely feedforward networks\nargue against it [18, 7]. In our work, we find that feedforward processing is sufficient for good performance\non clean digits. Feedback connections play an essential role for digit denoising. However,\none pass with a simple cognitive bias towards digits seems to suffice, and iteration seems only to confirm\nthe initial bias and does not improve performance. We hypothesize that this \u201csee what you want\nto see\u201d effect is a side-effect of our ability to denoise a cluttered scene, as the deep hierarchy possesses the\nability to decompose objects into many shareable parts. In the more complex case of MNIST-2, performance\ndoes increase with iteration. This suggests that top-down connections and iteration might\nbe particularly important for good performance in the case of cluttered scenes. The architecture we\nproposed can naturally accommodate all these task requirements simultaneously with essentially no\nfurther fine-tuning. We view such a general-purpose framework as an essential foundation for a\nlarger system emulating the cognitive abilities of the whole brain.\n\nReferences\n\n[1] F. Baluch and L. Itti. Mechanisms of top-down attention. Trends in neurosciences, 34(4):210\u2013224, 2011.\n\n[2] T. \u00c7ukur, S. Nishimoto, A. G. Huth, and J. L. Gallant. 
Attention during natural vision warps\nsemantic representation across the human brain. Nature neuroscience, 16(6):763\u2013770, 2013.\n\n[3] K. A. Ehinger, B. Hidalgo-Sotelo, A. Torralba, and A. Oliva. Modelling search for people in\n900 scenes: A combined source model of eye guidance. Visual cognition, 17(6-7):945\u2013978, 2009.\n\n[4] J. M. Henderson, M. Chanceaux, and T. J. Smith. The influence of clutter on real-world scene\nsearch: Evidence from search efficiency and eye movements. Journal of Vision, 9(1):32, 2009.\n\n[5] G. E. Hinton, S. Osindero, and Y.-W. Teh. A fast learning algorithm for deep belief nets.\nNeural computation, 18(7):1527\u20131554, 2006.\n\n[6] V. Jain and H. S. Seung. Natural image denoising with convolutional networks. In NIPS,\nvolume 8, pages 769\u2013776, 2008.\n\n[7] A. Krizhevsky, I. Sutskever, and G. E. Hinton. Imagenet classification with deep convolutional\nneural networks. In NIPS, volume 1, page 4, 2012.\n\n[8] H. Larochelle and Y. Bengio. Classification using discriminative restricted Boltzmann machines.\nIn Proceedings of the 25th international conference on Machine learning, pages 536\u2013543. ACM, 2008.\n\n[9] V. Nair and G. E. Hinton. Implicit mixtures of restricted Boltzmann machines. In NIPS,\nvolume 21, pages 1145\u20131152, 2008.\n\n[10] D. P. Reichert, P. Series, and A. J. Storkey. A hierarchical generative model of recurrent object-based\nattention in the visual cortex. In Artificial Neural Networks and Machine Learning\u2013ICANN 2011,\npages 18\u201325. Springer, 2011.\n\n[11] S. Rifai, P. Vincent, X. Muller, X. Glorot, and Y. Bengio. Contractive auto-encoders: Explicit\ninvariance during feature extraction. In Proceedings of the 28th International Conference on\nMachine Learning (ICML-11), pages 833\u2013840, 2011.\n\n[12] A. L. Rothenstein and J. K. Tsotsos. Attention links sensing to recognition. Image and Vision\nComputing, 26(1):114\u2013126, 2008.\n\n[13] R. 
Salakhutdinov and G. E. Hinton. Deep Boltzmann machines. In International Conference\non Artificial Intelligence and Statistics, pages 448\u2013455, 2009.\n\n[14] F. Schwenker, F. T. Sommer, and G. Palm. Iterative retrieval of sparsely coded associative\nmemory patterns. Neural Networks, 9(3):445\u2013455, 1996.\n\n[15] K. Sohn, G. Zhou, C. Lee, and H. Lee. Learning and selecting features jointly with point-wise\ngated Boltzmann machines. In Proceedings of The 30th International Conference on\nMachine Learning, pages 217\u2013225, 2013.\n\n[16] C. Tan, J. Z. Leibo, and T. Poggio. Throwing down the visual intelligence gauntlet. In Machine\nLearning for Computer Vision, pages 1\u201315. Springer, 2013.\n\n[17] Y. Tang and C. Eliasmith. Deep networks for robust visual recognition. In Proceedings of the\n27th International Conference on Machine Learning (ICML-10), pages 1055\u20131062, 2010.\n\n[18] S. Thorpe, D. Fize, C. Marlot, et al. Speed of processing in the human visual system. Nature,\n381(6582):520\u2013522, 1996.\n\n[19] S. Ullman. Sequence seeking and counter streams: a computational model for bidirectional\ninformation flow in the visual cortex. Cerebral cortex, 5(1):1\u201311, 1995.\n\n[20] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol. Extracting and composing robust\nfeatures with denoising autoencoders. In Proceedings of the 25th international conference on\nMachine learning, pages 1096\u20131103. ACM, 2008.\n\n[21] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional neural networks.\narXiv preprint arXiv:1311.2901, 2013.\n\n[22] D. Zoran and Y. Weiss. From learning models of natural image patches to whole image restoration. 
In Computer Vision (ICCV), 2011 IEEE International Conference on, pages 479\u2013486.\nIEEE, 2011.\n", "award": [], "sourceid": 1100, "authors": [{"given_name": "Qian", "family_name": "Wang", "institution": "Tsinghua University"}, {"given_name": "Jiaxing", "family_name": "Zhang", "institution": "Microsoft Research"}, {"given_name": "Sen", "family_name": "Song", "institution": "Tsinghua University"}, {"given_name": "Zheng", "family_name": "Zhang", "institution": "Microsoft Research"}]}