{"title": "Image Restoration Using Very Deep Convolutional Encoder-Decoder Networks with Symmetric Skip Connections", "book": "Advances in Neural Information Processing Systems", "page_first": 2802, "page_last": 2810, "abstract": "In this paper, we propose a very deep fully convolutional encoding-decoding framework for image restoration such as denoising and super-resolution. The network is composed of multiple layers of convolution and deconvolution operators, learning end-to-end mappings from corrupted images to the original ones. The convolutional layers act as the feature extractor, which capture the abstraction of image contents while eliminating noises/corruptions. Deconvolutional layers are then used to recover the image details. We propose to symmetrically link convolutional and deconvolutional layers with skip-layer connections, with which the training converges much faster and attains a higher-quality local optimum. First, the skip connections allow the signal to be back-propagated to bottom layers directly, and thus tackles the problem of gradient vanishing, making training deep networks easier and achieving restoration performance gains consequently. Second, these skip connections pass image details from convolutional layers to deconvolutional layers, which is beneficial in recovering the original image. Significantly, with the large capacity, we can handle different levels of noises using a single model. 
Experimental results show that our network achieves better performance than recent state-of-the-art methods.", "full_text": "Image Restoration Using Very Deep Convolutional Encoder-Decoder Networks with Symmetric Skip Connections\n\nXiao-Jiao Mao\u2020, Chunhua Shen*, Yu-Bin Yang\u2020\n\n\u2020State Key Laboratory for Novel Software Technology, Nanjing University, China\n\n*School of Computer Science, University of Adelaide, Australia\n\nAbstract\n\nIn this paper, we propose a very deep fully convolutional encoding-decoding framework for image restoration tasks such as denoising and super-resolution. The network is composed of multiple layers of convolution and deconvolution operators, learning end-to-end mappings from corrupted images to the original ones. The convolutional layers act as a feature extractor, capturing the abstraction of image content while eliminating noise/corruption. Deconvolutional layers are then used to recover the image details. We propose to symmetrically link convolutional and deconvolutional layers with skip-layer connections, with which the training converges much faster and attains a higher-quality local optimum. First, the skip connections allow the signal to be back-propagated to bottom layers directly, thus tackling the problem of vanishing gradients, making deep networks easier to train and consequently improving restoration performance. Second, these skip connections pass image details from convolutional layers to deconvolutional layers, which is beneficial in recovering the original image. Significantly, thanks to the large capacity, we can handle different levels of noise using a single model. Experimental results show that our network achieves better performance than recent state-of-the-art methods.\n\n1 Introduction\n\nThe task of image restoration is to recover a clean image from its corrupted observation, which is known to be an ill-posed inverse problem. 
By accommodating different types of corruption distributions, the same mathematical model applies to problems such as image denoising and super-resolution. Recently, deep neural networks (DNNs) have shown superior performance in image processing and computer vision tasks, ranging from high-level recognition and semantic segmentation to low-level denoising, super-resolution, deblurring, inpainting and recovering raw images from compressed ones. Despite the progress DNNs have achieved, some research questions remain to be answered. For example, can a deeper network in general achieve better performance? Can we design a single deep model capable of handling different levels of corruption?\nObserving the recent superior performance of DNNs on image processing tasks, we propose a convolutional neural network (CNN)-based framework for image restoration. We observe that in order to obtain good restoration performance, it is beneficial to train a very deep model. Meanwhile, we show that it is possible to achieve very promising performance with a single network when processing multiple different levels of corruption, thanks to the benefits of large-capacity networks. Specifically, the proposed framework learns end-to-end fully convolutional mappings from corrupted images to the clean ones. 
The network is composed of multiple layers of convolution and deconvolution operators. As deeper networks tend to be more difficult to train, we propose to symmetrically link convolutional\n\n30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.\n\nand deconvolutional layers with skip-layer connections, with which the training procedure converges much faster and is more likely to attain a high-quality local optimum.\nOur main contributions are summarized as follows.\n\n\u2022 A very deep network architecture, which consists of a chain of symmetric convolutional and deconvolutional layers, for image restoration is proposed in this paper. The convolutional layers act as a feature extractor which encodes the primary components of image content while eliminating the corruption. The deconvolutional layers then decode the image abstraction to recover the image content details.\n\u2022 We propose to add skip connections between corresponding convolutional and deconvolutional layers. These skip connections help to back-propagate the gradients to bottom layers and pass image details to top layers, making training of the end-to-end mapping easier and more effective, and thus achieving performance improvements as the network goes deeper. Relying on the large capacity and fitting ability of our very deep network, we also propose to handle different levels of noise/corruption using a single model.\n\u2022 Experimental results demonstrate the advantages of our network over other recent state-of-the-art methods on image denoising and super-resolution, setting new records on these topics.1\n\nRelated work Extensive work has been done on image restoration in the literature. See detailed reviews in the survey [21]. 
Traditional methods such as total variation [24, 23], the BM3D algorithm [5] and dictionary-learning-based methods [31, 10, 2] have shown very good performance on image restoration topics such as image denoising and super-resolution. Since image restoration is in general an ill-posed problem, the use of regularization [34, 9] has proven essential.\nAn active and probably more promising category of image restoration methods is DNN-based approaches. The stacked denoising auto-encoder [29] is one of the best-known DNN models that can be used for image restoration. Xie et al. [32] combined sparse coding and a DNN pre-trained with a denoising auto-encoder for low-level vision tasks such as image denoising and inpainting. Other neural-network-based methods, such as the multi-layer perceptron [1] and CNN [15] for image denoising, as well as DNNs for image or video super-resolution [4, 30, 7, 14] and compression-artifact reduction [6], have been actively studied in recent years.\nBurger et al. [1] presented a patch-based algorithm learned with a plain multi-layer perceptron. They also concluded that, with large networks and large training data, neural networks can achieve state-of-the-art image denoising performance. Jain and Seung [15] proposed a fully convolutional CNN for denoising. They found that CNNs provide comparable or even superior performance to wavelet and Markov Random Field (MRF) methods. Cui et al. [4] employed non-local self-similarity (NLSS) search on the input image at multiple scales, and then used a collaborative local auto-encoder for super-resolution in a layer-by-layer fashion. Dong et al. [7] proposed to directly learn an end-to-end mapping between the low- and high-resolution images. Wang et al. 
[30] argued that the domain expertise represented by conventional sparse coding can be combined with DNNs to achieve further improved results. An advantage of DNN-based methods is that they are purely data-driven and make no assumptions about the noise distribution.\n\n2 Very deep RED-Net for Image Restoration\n\nThe proposed framework mainly contains a chain of convolutional layers and symmetric deconvolutional layers, as shown in Figure 1. We term our method \u201cRED-Net\u201d\u2014very deep Residual Encoder-Decoder Networks.\n\n2.1 Architecture\n\nThe framework is fully convolutional and deconvolutional. Rectification layers are added after each convolution and deconvolution. The convolutional layers act as a feature extractor, preserving the primary components of objects in the image while eliminating the corruption. The deconvolutional layers then recover the details of the image content; their output is the \u201cclean\u201d version of the input image. Moreover, skip connections\n\n1We have released the evaluation code at https://bitbucket.org/chhshen/image-denoising/\n\nFigure 1: The overall architecture of our proposed network. The network contains layers of symmetric convolution (encoder) and deconvolution (decoder). Skip-layer connections are added every few (in our experiments, two) layers.\n\nare also added from a convolutional layer to its corresponding mirrored deconvolutional layer. The convolutional feature maps are passed to and summed with the deconvolutional feature maps element-wise, and passed to the next layer after rectification.\nFor low-level image restoration problems, we prefer using neither pooling nor unpooling in the network, as pooling usually discards useful image details that are essential for these tasks. 
Motivated by the VGG model [27], the kernel size for convolution and deconvolution is set to 3\u00d73, which has shown excellent image recognition performance. It is worth mentioning that the size of the input image can be arbitrary, since our network essentially performs pixel-wise prediction. The input and output of the network are images of the same size w \u00d7 h \u00d7 c, where w, h and c are the width, height and number of channels. In this paper, we use c = 1, although it is straightforward to apply the network to images with c > 1. We found that using 64 feature maps for the convolutional and deconvolutional layers achieves satisfactory results, although more feature maps lead to slightly better performance. Deriving from the above architecture, in this work we mainly conduct experiments with two networks, which are 20 and 30 layers deep, respectively.\n\n2.1.1 Deconvolution decoder\n\nArchitectures combining layers of convolution and deconvolution [22, 12] have recently been proposed for semantic segmentation. In contrast to convolutional layers, in which multiple input activations within a filter window are fused to output a single activation, deconvolutional layers associate a single input activation with multiple outputs. Deconvolution is usually used as a learnable up-sampling layer.\nOne can simply replace deconvolution with convolution, which results in an architecture that is very similar to recently proposed very deep fully convolutional neural networks [19, 7]. However, there exist differences between a fully convolutional model and ours.\nFirst, in the fully convolutional case, the noise is eliminated step by step, i.e., the noise level is reduced after each layer, and during this process the details of the image content may be lost. In our network, by contrast, convolution preserves the primary image content, and deconvolution is then used to compensate for the details. 
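The scatter-and-sum behavior of deconvolution described above can be sketched in a few lines. This is our toy 1-D illustration, not the paper's Caffe implementation; the box kernel and stride are arbitrary choices for demonstration:

```python
import numpy as np

def conv_transpose_1d(x, kernel, stride=2):
    """Toy 1-D transposed convolution ("deconvolution"): each input
    activation is scattered to len(kernel) output positions and
    overlapping contributions are summed, up-sampling the signal."""
    n, k = len(x), len(kernel)
    out = np.zeros((n - 1) * stride + k)
    for i, v in enumerate(x):
        out[i * stride:i * stride + k] += v * kernel
    return out

# A length-3 input becomes a length-6 output: with a box kernel and
# stride 2, every activation is spread over two adjacent positions.
up = conv_transpose_1d(np.array([1.0, 2.0, 3.0]), np.array([1.0, 1.0]))
# up -> [1. 1. 2. 2. 3. 3.]
```

In the actual network the operation is 2-D, the kernel is learned, and a rectification follows; the sketch only illustrates why a single input activation maps to multiple outputs.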
We compare 5-layer and 10-layer fully convolutional networks with our network (combining convolution and deconvolution, but without skip connections). For the fully convolutional networks, we use padding and up-sample the input to make the input and output the same size. For our network, the first 5 layers are convolutional and the second 5 layers are deconvolutional. All the other training parameters are the same. In terms of Peak Signal-to-Noise Ratio (PSNR), using deconvolution works slightly better than the fully convolutional counterpart.\nOn the other hand, to apply deep learning models on devices with limited computing power such as mobile phones, one has to speed up the testing phase. In this situation, we propose to use down-sampling in the convolutional layers to reduce the size of the feature maps. In order to obtain an output of the same size as the input, deconvolution is used to up-sample the feature maps in the symmetric deconvolutional layers. We experimentally found that the testing efficiency can be improved considerably with almost negligible performance degradation.\n\nFigure 2: An example of a building block in the proposed framework. For ease of visualization, only two skip connections are shown in this example, and the ones in layers represented by fk are omitted.\n\n2.1.2 Skip connections\n\nAn intuitive question is whether deconvolution is able to recover image details from the image abstraction alone. We find that in shallow networks with only a few layers of convolution, deconvolution is able to recover the details. However, when the network goes deeper or uses operations such as max pooling, deconvolution does not work as well, possibly because too much image detail is already lost in the convolution. The second question is whether our network achieves a performance gain as it goes deeper. 
We observe that deeper networks often suffer from vanishing gradients and become hard to train\u2014a problem well documented in the literature.\nTo address the above two problems, inspired by highway networks [28] and deep residual networks [11], we add skip connections between corresponding convolutional and deconvolutional layers, as shown in Figure 1. A building block is shown in Figure 2. There are two reasons for using such connections. First, when the network goes deeper, as mentioned above, image details can be lost, making deconvolution weaker in recovering them. However, the feature maps passed by the skip connections carry much image detail, which helps deconvolution to recover a better clean image. Second, the skip connections also help to back-propagate the gradients to bottom layers, which makes training deeper networks much easier, as observed in [28] and [11]. Note that our skip-layer connections are very different from the ones proposed in [28] and [11], where the only concern is the optimization side; in our case, we also want to pass information of the convolutional feature maps to the corresponding deconvolutional layers.\nInstead of directly learning the mapping from the input X to the output Y, we would like the network to fit the residual [11] of the problem, denoted as F(X) = Y \u2212 X. Such a learning strategy is applied to inner blocks of the encoding-decoding network to make training more effective. Skip connections are passed every two convolutional layers to their mirrored deconvolutional layers. 
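The forward topology just described (save an encoder feature map every two layers, sum it element-wise into the mirrored decoder feature map, then rectify) can be sketched as follows. This is our structural illustration only: scalar gains stand in for the learned 3\u00d73 conv/deconv kernels, so it shows the wiring rather than the paper's trained network.

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def red_forward(x, enc_gains, dec_gains, block=2):
    """Symmetric encoder-decoder pass with RED-style skip connections:
    encoder feature maps saved every `block` layers are summed
    element-wise into the mirrored decoder feature map before
    rectification."""
    skips, h = [], x
    for i, g in enumerate(enc_gains):       # "convolutional" half
        if i % block == 0:
            skips.append(h)                 # feature map to pass forward
        h = relu(g * h)
    for i, g in enumerate(dec_gains):       # "deconvolutional" half
        h = g * h
        if (i + 1) % block == 0 and skips:
            h = h + skips.pop()             # element-wise skip sum
        h = relu(h)
    return h

y = red_forward(np.array([2.0, 4.0]), [1.0] * 4, [1.0] * 4)
# with identity gains each skip adds the input back: y -> [6. 12.]
```

The element-wise sum (rather than concatenation) is what gives the pixel-wise correspondence the paper argues for in Section 2.2.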
Other configurations are possible, and our experiments show that this configuration already works very well. Using such skip connections makes the network easier to train and yields restoration performance gains as the network depth increases.\nThe very deep highway networks [28] are essentially feed-forward long short-term memories (LSTMs) with forget gates, and the CNN layers of deep residual networks [11] are feed-forward LSTMs without gates. Note that our deep residual networks are in general not in the format of standard feed-forward LSTMs.\n\n2.2 Discussions\n\nTraining with symmetric skip connections As mentioned above, using skip connections mainly has two benefits: (1) passing image details forward, which helps to recover clean images, and (2) passing gradients backward, which helps to find a better local minimum. We design experiments to verify these observations.\nWe first compare two networks trained for denoising with \u03c3 = 70. In the first network, we use 5 layers of 3\u00d73 convolution with stride 3. The input size of the training data is 243\u00d7243, which results in a vector after 5 layers of convolution. Then deconvolution is used to recover the input. The second network uses the same settings as the first one, except for adding skip connections. The results are shown in Figure 3(a). We can observe that it is hard for deconvolution to recover details from only a vector encoding the abstraction of the input, which shows that the ability of deconvolution to recover image details is limited. However, if we use skip connections, the network can still recover the input, because details are passed from the convolutional layers by the skip connections.\nWe also train five networks to show that using skip connections helps to back-propagate gradients in training to better fit the end-to-end mapping, as shown in Figure 3(b). 
The five networks are: 10-, 20- and 30-layer networks without skip connections, and 20- and 30-layer networks with skip connections.\n\nFigure 3: Analysis of skip connections: (a) Recovering image details using deconvolution and skip connections; (b) The training loss during training; (c) Comparison of the skip connection types in [11] and our model, where \u201cBlock-i-RED\u201d is the connections in our model with block size i and \u201cBlock-i-He et al.\u201d is the connections in He et al. [11] with block size i. PSNR values at the last iteration for the curves are: 25.08, 24.59, 25.30 and 25.21.\n\nAs we can see, the training loss increases as the network goes deeper without skip connections (a similar phenomenon is also observed in [11]), but we obtain a lower loss value when using them.\nComparison with deep residual networks [11] One may use different types of skip connections in our network; a straightforward alternative is the one in [11], where the skip connections are added to divide the network into sequential blocks. A benefit of our model is that our skip connections have element-wise correspondence, which can be very important in pixel-wise prediction problems. We carry out experiments to compare the two types of skip connections; here the block size indicates the span of the connections. The results are shown in Figure 3(c). We can observe that our connections often converge to a better optimum, demonstrating that element-wise correspondence can be important.\nDealing with different levels of noise/corruption An important question is whether we can handle different levels of corruption with a single model. Almost all existing methods need to train different models for different levels of corruption, and typically they need to estimate the corruption level first. We use a trained model from [1] to denoise different levels of noise, with \u03c3 being 10, 30, 50 and 70. 
The obtained average PSNR on the 14 images is 29.95dB, 27.81dB, 18.62dB and 14.84dB, respectively. The results show that parameters trained on a single noise level cannot handle different levels of noise well. Therefore, in this paper, we aim to train a single model for recovering different levels of corruption, namely different noise levels in the task of image denoising and different scaling factors in image super-resolution. The large capacity of the network is the key to this success.\n\n2.3 Training\n\nLearning the end-to-end mapping from corrupted images to clean ones requires estimating the weights \u0398 represented by the convolutional and deconvolutional kernels. This is achieved by minimizing the Euclidean loss between the outputs of the network and the clean image. Specifically, given a collection of N training sample pairs {Xi, Yi}, where Xi is a corrupted image and Yi is the clean version as the ground-truth, we minimize the following Mean Squared Error (MSE):\n\nL(\u0398) = (1/N) \u2211_{i=1}^{N} ||F(Xi; \u0398) \u2212 Yi||_F^2.  (1)\n\nWe implement and train our network using Caffe [16]. In practice, we find that using Adam [17] with learning rate 10^\u22124 for training converges faster than using traditional stochastic gradient descent (SGD). The base learning rate is the same for all layers, unlike [7, 15], in which a smaller learning rate is set for the last layer; this trick is not necessary in our network.\nFollowing general settings in the literature, we use gray-scale images for denoising and the luminance channel for super-resolution in this paper. 300 images from the Berkeley Segmentation Dataset (BSD) [20] are used to generate the training set. For each image, patches of size 50\u00d750 are sampled as ground-truth. For denoising, we add additive Gaussian noise to the patches multiple times to generate a large training set (about 0.5M). 
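This patch-and-corrupt step can be sketched as follows. The 50\u00d750 patch size and \u03c3 levels follow the paper, but the per-image sampling count and the [0, 255] pixel range are our illustrative assumptions:

```python
import numpy as np

def make_noisy_patches(images, patch=50, sigmas=(10, 30, 50, 70),
                       per_image=4, seed=0):
    """Sample clean ground-truth patches and corrupt each one with
    additive Gaussian noise at several sigma levels, yielding
    (noisy, clean) training pairs."""
    rng = np.random.default_rng(seed)
    noisy, clean = [], []
    for img in images:
        h, w = img.shape
        for _ in range(per_image):
            top = rng.integers(0, h - patch + 1)
            left = rng.integers(0, w - patch + 1)
            p = img[top:top + patch, left:left + patch].astype(np.float64)
            for s in sigmas:                 # reuse the patch per noise level
                clean.append(p)
                noisy.append(p + rng.normal(0.0, s, size=p.shape))
    return np.stack(noisy), np.stack(clean)

# 1 image x 4 patches x 4 noise levels -> 16 training pairs of 50x50
X, Y = make_noisy_patches([np.zeros((60, 60))])
```

Sampling the same clean patch at every noise level is what lets a single model see all corruption strengths during training.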
For super-resolution, we first down-sample a patch and then up-sample it back to its original size, obtaining a low-resolution version as the input of the network.\n\n2.4 Testing\n\nAlthough trained on local patches, our network can perform denoising and super-resolution on images of arbitrary size. Given a testing image, one can simply forward it through the network, which already obtains better performance than existing methods. To achieve smoother results, we propose to process a corrupted image in multiple orientations. Different from segmentation, the filter kernels in our network only eliminate the corruptions, and are thus not sensitive to the orientation of the image content. Therefore, we can rotate and mirror-flip the kernels, perform the forward pass multiple times, and then average the outputs to obtain a smoother image. We find that this leads to slightly better denoising and super-resolution performance.\n\n3 Experiments\n\nIn this section, we evaluate the denoising and super-resolution performance of our models against several existing state-of-the-art methods. Denoising experiments are performed on two datasets: 14 common benchmark images [33, 3, 18, 9] and the BSD200 dataset. We test additive Gaussian noise with zero mean and standard deviation \u03c3 = 10, 30, 50 and 70, respectively. BM3D [5], NCSR [8], EPLL [34], PCLR [3], PGPD [33] and WNNM [9] are compared with our method. For super-resolution, we compare our network with SRCNN [7], NBSRF [25], CSCN [30], CSC [10], TSE [13] and ARFL+ [26] on three datasets: Set5, Set14 and BSD100. Scaling factors of 2, 3 and 4 are tested.\nPeak Signal-to-Noise Ratio (PSNR) and the Structural SIMilarity (SSIM) index are calculated for evaluation. 
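For reference, PSNR, the primary metric reported below, can be computed as follows (a standard definition, assuming pixel values in [0, peak]):

```python
import numpy as np

def psnr(pred, target, peak=255.0):
    """Peak Signal-to-Noise Ratio in dB: 10 * log10(peak^2 / MSE)."""
    mse = np.mean((np.asarray(pred, float) - np.asarray(target, float)) ** 2)
    if mse == 0.0:
        return float("inf")          # identical images
    return 10.0 * np.log10(peak ** 2 / mse)

# A uniform error of one gray level over the whole image gives
# MSE = 1, so PSNR = 10 * log10(255^2), roughly 48.13 dB.
```

Higher is better; the sub-dB gaps reported in the tables below are therefore meaningful at this scale.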
For our method, denoted RED-Net, we implement three versions: RED10 contains 5 convolutional and 5 deconvolutional layers without skip connections, RED20 contains 10 convolutional and 10 deconvolutional layers with skip connections, and RED30 contains 15 convolutional and 15 deconvolutional layers with skip connections.\n\n3.1 Image Denoising\n\nEvaluation on the 14 images Table 1 presents the PSNR and SSIM results for \u03c3 = 10, 30, 50 and 70. We can make some observations from the results. First of all, the 10-layer convolutional and deconvolutional network already achieves better results than the state-of-the-art methods, which demonstrates that combining convolution and deconvolution works well for denoising, even without any skip connections. Moreover, when the network goes deeper, the skip connections proposed in this paper help to achieve even better denoising performance, which exceeds the best existing method WNNM [9] by 0.32dB, 0.43dB, 0.49dB and 0.51dB at noise levels \u03c3 = 10, 30, 50 and 70, respectively. By comparison, WNNM is only slightly better than the second-best existing method PCLR [3], by 0.01dB, 0.06dB, 0.03dB and 0.01dB respectively, which underlines the large improvement of our model. Last, we can observe that the stronger the noise, the larger the improvement our model achieves over other methods. 
Similar observations can be made on the evaluation of SSIM.\n\nTable 1: Average PSNR and SSIM results for \u03c3 = 10, 30, 50, 70 on the 14 images.\n\nPSNR\n         BM3D   EPLL   NCSR   PCLR   PGPD   WNNM   RED10  RED20  RED30\n\u03c3 = 10  34.18  33.98  34.27  34.48  34.22  34.49  34.62  34.74  34.81\n\u03c3 = 30  28.49  28.35  28.44  28.68  28.55  28.74  28.95  29.10  29.17\n\u03c3 = 50  26.08  25.97  25.93  26.29  26.19  26.32  26.51  26.72  26.81\n\u03c3 = 70  24.65  24.47  24.36  24.79  24.71  24.80  24.97  25.23  25.31\n\nSSIM\n\u03c3 = 10  0.9339 0.9332 0.9342 0.9366 0.9309 0.9363 0.9374 0.9392 0.9402\n\u03c3 = 30  0.8204 0.8200 0.8203 0.8263 0.8199 0.8273 0.8327 0.8396 0.8423\n\u03c3 = 50  0.7427 0.7354 0.7415 0.7538 0.7442 0.7517 0.7571 0.7689 0.7733\n\u03c3 = 70  0.6882 0.6712 0.6871 0.6997 0.6913 0.6975 0.7012 0.7177 0.7206\n\nEvaluation on BSD200 For testing efficiency, we convert the images to gray-scale and resize them to smaller ones for BSD200. Then all the methods are run on these images to obtain the average PSNR and SSIM results for \u03c3 = 10, 30, 50 and 70, as shown in Table 2. 
For existing methods, the denoising performance does not differ much, while our model achieves 0.38dB, 0.47dB, 0.49dB and 0.42dB higher PSNR than WNNM.\n\nTable 2: Average PSNR and SSIM results for \u03c3 = 10, 30, 50, 70 on 200 images from BSD.\n\nPSNR\n         BM3D   EPLL   NCSR   PCLR   PGPD   WNNM   RED10  RED20  RED30\n\u03c3 = 10  33.01  33.01  33.09  33.30  33.02  33.25  33.49  33.59  33.63\n\u03c3 = 30  27.31  27.38  27.23  27.54  27.33  27.48  27.79  27.90  27.95\n\u03c3 = 50  25.06  25.17  24.95  25.30  25.18  25.26  25.54  25.67  25.75\n\u03c3 = 70  23.82  23.81  23.58  23.94  23.89  23.95  24.13  24.33  24.37\n\nSSIM\n\u03c3 = 10  0.9218 0.9255 0.9226 0.9261 0.9176 0.9244 0.9290 0.9310 0.9319\n\u03c3 = 30  0.7755 0.7825 0.7738 0.7827 0.7717 0.7807 0.7918 0.7993 0.8019\n\u03c3 = 50  0.6831 0.6870 0.6777 0.6947 0.6841 0.6928 0.7032 0.7117 0.7167\n\u03c3 = 70  0.6240 0.6168 0.6166 0.6336 0.6245 0.6346 0.6367 0.6521 0.6551\n\n3.2 Image super-resolution\n\nThe evaluation on Set5 is shown in Table 3. Our 10-layer network already outperforms the compared methods, and we achieve even better performance with deeper networks. The 30-layer network exceeds the second-best method CSCN by 0.52dB, 0.56dB and 0.47dB on scales 2, 3 and 4, respectively. The evaluation on Set14 is shown in Table 4. The improvement on Set14 is not as significant as that on Set5, but we can still observe that the 30-layer network achieves higher PSNR than the second-best CSCN, by 0.23dB, 0.06dB and 0.10dB. The results on BSD100, shown in Table 5, are similar to those on Set5. The second-best method is still CSCN, which performs worse than our 10-layer network. 
Our deeper networks obtain much larger performance gains than the others.\n\nTable 3: Average PSNR and SSIM results on Set5.\n\nPSNR\n        SRCNN  NBSRF  CSCN   CSC    TSE    ARFL+  RED10  RED20  RED30\ns = 2  36.66  36.76  37.14  36.62  36.50  36.89  37.43  37.62  37.66\ns = 3  32.75  32.75  33.26  32.66  32.62  32.72  33.43  33.80  33.82\ns = 4  30.49  30.44  31.04  30.36  30.33  30.35  31.12  31.40  31.51\n\nSSIM\ns = 2  0.9542 0.9552 0.9567 0.9549 0.9537 0.9559 0.9590 0.9597 0.9599\ns = 3  0.9090 0.9104 0.9167 0.9098 0.9094 0.9094 0.9197 0.9229 0.9230\ns = 4  0.8628 0.8632 0.8775 0.8607 0.8623 0.8583 0.8794 0.8847 0.8869\n\nTable 4: Average PSNR and SSIM results on Set14.\n\nPSNR\n        SRCNN  NBSRF  CSCN   CSC    TSE    ARFL+  RED10  RED20  RED30\ns = 2  32.45  32.45  32.71  32.31  32.23  32.52  32.77  32.87  32.94\ns = 3  29.30  29.25  29.55  29.15  29.16  29.23  29.42  29.61  29.61\ns = 4  27.50  27.42  27.76  27.30  27.40  27.41  27.58  27.80  27.86\n\nSSIM\ns = 2  0.9067 0.9071 0.9095 0.9070 0.9036 0.9074 0.9125 0.9138 0.9144\ns = 3  0.8215 0.8212 0.8271 0.8208 0.8197 0.8201 0.8318 0.8343 0.8341\ns = 4  0.7513 0.7511 0.7620 0.7499 0.7518 0.7483 0.7654 0.7697 0.7718\n\n3.3 Evaluation using a single model\n\nTo construct the training set, we extract image patches with different noise levels and scaling factors for denoising and super-resolution. A 30-layer network is then trained for each of the two tasks. The evaluation results are shown in Table 6 and Table 7. 
Although trained on different levels of corruption, the performance of our network degrades only slightly compared to using separate models for denoising and super-resolution. This may be due to the fact that the network has to fit much more complex mappings. Except for CSCN, which works slightly better on Set14 super-resolution with scales 3 and 4, our network still beats the existing methods, showing that our network works well on image denoising and super-resolution even when using a single model to deal with complex corruption.\n\nTable 5: Average PSNR and SSIM results on BSD100 for super-resolution.\n\nPSNR\n        SRCNN  NBSRF  CSCN   CSC    TSE    ARFL+  RED10  RED20  RED30\ns = 2  31.36  31.30  31.54  31.27  31.18  31.35  31.85  31.95  31.99\ns = 3  28.41  28.36  28.58  28.31  28.30  28.36  28.79  28.90  28.93\ns = 4  26.90  26.88  27.11  26.83  26.85  26.86  27.25  27.35  27.40\n\nSSIM\ns = 2  0.8879 0.8876 0.8908 0.8876 0.8855 0.8885 0.8953 0.8969 0.8974\ns = 3  0.7863 0.7856 0.7910 0.7853 0.7843 0.7851 0.7975 0.7993 0.7994\ns = 4  0.7103 0.7110 0.7191 0.7101 0.7108 0.7091 0.7238 0.7268 0.7290\n\nTable 6: Average PSNR and SSIM results for image denoising using a single 30-layer network.\n\n        14 images                       BSD200\n        \u03c3 = 10  \u03c3 = 30  \u03c3 = 50  \u03c3 = 70  \u03c3 = 10  \u03c3 = 30  \u03c3 = 50  \u03c3 = 70\nPSNR    34.49   29.09   26.75   25.20   33.38   27.88   25.69   24.36\nSSIM    0.9368  0.8414  0.7716  0.7157  0.9280  0.7980  0.7119  0.6544\n\nTable 7: Average PSNR and SSIM results for image super-resolution using a single 30-layer network.\n\n        Set5                    Set14                   BSD100\n        s = 2   s = 3   s = 4   s = 2   s = 3   s = 4   s = 2   s = 3   s = 4\nPSNR    37.56   33.70   31.33   32.81   29.50   27.72   31.96   28.88   27.35\nSSIM    0.9595  0.9222  0.8847  0.9135  0.8334  0.7698  0.8972  0.7993  0.7276\n\n4 Conclusions\n\nIn this paper we have proposed a deep 
encoding and decoding framework for image restoration. Convolution and deconvolution are combined, modeling the restoration problem by extracting primary image content and then recovering details. More importantly, we propose to use skip connections, which help to recover clean images and tackle the optimization difficulty caused by vanishing gradients, thus obtaining performance gains when the network goes deeper. Experimental results and our analysis show that our network achieves better performance than state-of-the-art methods on image denoising and super-resolution.\nThis work was in part supported by the Natural Science Foundation of China (Grants 61673204, 61273257, 61321491), the Program for Distinguished Talents of Jiangsu Province, China (Grant 2013-XXRJ-018), the Fundamental Research Funds for the Central Universities (Grant 020214380026), and an Australian Research Council Future Fellowship (FT120100969). X.-J. Mao\u2019s contribution was made when visiting the University of Adelaide. His visit was supported by the joint PhD program of the China Scholarship Council.\n\nReferences\n[1] H. C. Burger, C. J. Schuler, and S. Harmeling. Image denoising: Can plain neural networks compete with BM3D? In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 2392\u20132399, 2012.\n[2] P. Chatterjee and P. Milanfar. Clustering-based denoising with locally learned dictionaries. IEEE Trans. Image Process., 18(7):1438\u20131451, 2009.\n[3] F. Chen, L. Zhang, and H. Yu. External patch prior guided internal clustering for image denoising. In Proc. IEEE Int. Conf. Comp. Vis., pages 603\u2013611, 2015.\n[4] Z. Cui, H. Chang, S. Shan, B. Zhong, and X. Chen. Deep network cascade for image super-resolution. In Proc. Eur. Conf. Comp. Vis., pages 49\u201364, 2014.\n[5] K. Dabov, A. Foi, V. Katkovnik, and K. O. Egiazarian. Image denoising by sparse 3-D transform-domain collaborative filtering. IEEE Trans. 
Image Process., 16(8):2080\u20132095, 2007.\n\n8\n\n\f[6] C. Dong, Y. Deng, C. C. Loy, and X. Tang. Compression artifacts reduction by a deep convolutional\n\nnetwork. In Proc. IEEE Int. Conf. Comp. Vis., pages 576\u2013584, 2015.\n\n[7] C. Dong, C. C. Loy, K. He, and X. Tang. Image super-resolution using deep convolutional networks. IEEE\n\nTrans. Pattern Anal. Mach. Intell., 38(2):295\u2013307, 2016.\n\n[8] W. Dong, L. Zhang, G. Shi, and X. Li. Nonlocally centralized sparse representation for image restoration.\n\nIEEE Trans. Image Process., 22(4):1620\u20131630, 2013.\n\n[9] S. Gu, L. Zhang, W. Zuo, and X. Feng. Weighted nuclear norm minimization with application to image\n\ndenoising. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 2862\u20132869, 2014.\n\n[10] S. Gu, W. Zuo, Q. Xie, D. Meng, X. Feng, and L. Zhang. Convolutional sparse coding for image\n\nsuper-resolution. In Proc. IEEE Int. Conf. Comp. Vis., pages 1823\u20131831, 2015.\n\n[11] K. He, X. Zhang, S. Ren, and J. Sun. Deep residual learning for image recognition. In Proc. IEEE Conf.\n\nComp. Vis. Patt. Recogn., volume abs/1512.03385, 2016.\n\n[12] S. Hong, H. Noh, and B. Han. Decoupled deep neural network for semi-supervised semantic segmentation.\n\nIn Proc. Advances in Neural Inf. Process. Syst., 2015.\n\n[13] J. Huang, A. Singh, and N. Ahuja. Single image super-resolution from transformed self-exemplars. In\n\nProc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 5197\u20135206, 2015.\n\n[14] Y. Huang, W. Wang, and L. Wang. Bidirectional recurrent convolutional networks for multi-frame\n\nsuper-resolution. In Proc. Advances in Neural Inf. Process. Syst., pages 235\u2013243, 2015.\n\n[15] V. Jain and H. S. Seung. Natural image denoising with convolutional networks. In Proc. Advances in\n\nNeural Inf. Process. Syst., pages 769\u2013776, 2008.\n\n[16] Y. Jia, E. Shelhamer, J. Donahue, S. Karayev, J. Long, R. Girshick, S. Guadarrama, and T. Darrell. 
Caffe:\n\nConvolutional architecture for fast feature embedding. arXiv preprint arXiv:1408.5093, 2014.\n\n[17] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization.\n\nRepresentations, 2015.\n\nIn Proc. Int. Conf. Learn.\n\n[18] H. Liu, R. Xiong, J. Zhang, and W. Gao. Image denoising via adaptive soft-thresholding based on non-local\n\nsamples. In Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 484\u2013492, 2015.\n\n[19] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proc.\n\nIEEE Conf. Comp. Vis. Patt. Recogn., pages 3431\u20133440, 2015.\n\n[20] D. Martin, C. Fowlkes, D. Tal, and J. Malik. A database of human segmented natural images and its\napplication to evaluating segmentation algorithms and measuring ecological statistics. In Proc. IEEE Int.\nConf. Comp. Vis., volume 2, pages 416\u2013423, July 2001.\n\n[21] P. Milanfar. A tour of modern image \ufb01ltering: New insights and methods, both practical and theoretical.\n\nIEEE Signal Process. Mag., 30(1):106\u2013128, 2013.\n\n[22] H. Noh, S. Hong, and B. Han. Learning deconvolution network for semantic segmentation. In Proc. IEEE\n\nInt. Conf. Comp. Vis., pages 1520\u20131528, 2015.\n\n[23] S. Osher, M. Burger, D. Goldfarb, J. Xu, and W. Yin. An iterative regularization method for total\n\nvariation-based image restoration. Multiscale Modeling & Simulation, 4(2):460\u2013489, 2005.\n\n[24] L. I. Rudin, S. Osher, and E. Fatemi. Nonlinear total variation based noise removal algorithms. Phys. D,\n\n60(1-4):259\u2013268, November 1992.\n\n[25] J. Salvador and E. Perez-Pellitero. Naive bayes super-resolution forest. In Proc. IEEE Int. Conf. Comp.\n\nVis., pages 325\u2013333, 2015.\n\n[26] S. Schulter, C. Leistner, and H. Bischof. Fast and accurate image upscaling with super-resolution forests.\n\nIn Proc. IEEE Conf. Comp. Vis. Patt. Recogn., pages 3791\u20133799, 2015.\n\n[27] K. Simonyan and A. Zisserman. 
Very deep convolutional networks for large-scale image recognition. In\n\nProc. Int. Conf. Learn. Representations, 2015.\n\n[28] R. K. Srivastava, K. Greff, and J. Schmidhuber. Training very deep networks. In Proc. Advances in Neural\n\nInf. Process. Syst., pages 2377\u20132385, 2015.\n\n[29] P. Vincent, H. Larochelle, Y. Bengio, and P. Manzagol. Extracting and composing robust features with\n\ndenoising autoencoders. In Proc. Int. Conf. Mach. Learn., pages 1096\u20131103, 2008.\n\n[30] Z. Wang, D. Liu, J. Yang, W. Han, and T. S. Huang. Deep networks for image super-resolution with sparse\n\nprior. In Proc. IEEE Int. Conf. Comp. Vis., pages 370\u2013378, 2015.\n\n[31] Z. Wang, Y. Yang, Z. Wang, S. Chang, J. Yang, and T. S. Huang. Learning super-resolution jointly from\n\nexternal and internal examples. IEEE Trans. Image Process., 24(11):4359\u20134371, 2015.\n\n[32] J. Xie, L. Xu, and E. Chen. Image denoising and inpainting with deep neural networks. In Proc. Advances\n\nin Neural Inf. Process. Syst., pages 350\u2013358, 2012.\n\n[33] J. Xu, L. Zhang, W. Zuo, D. Zhang, and X. Feng. Patch group based nonlocal self-similarity prior learning\n\nfor image denoising. In Proc. IEEE Int. Conf. Comp. Vis., pages 244\u2013252, 2015.\n\n[34] D. Zoran and Y. Weiss. From learning models of natural image patches to whole image restoration. In\n\nProc. IEEE Int. Conf. Comp. Vis., pages 479\u2013486, 2011.\n\n9\n\n\f", "award": [], "sourceid": 1422, "authors": [{"given_name": "Xiaojiao", "family_name": "Mao", "institution": "Nanjing University"}, {"given_name": "Chunhua", "family_name": "Shen", "institution": null}, {"given_name": "Yu-Bin", "family_name": "Yang", "institution": "NanjingUniversity"}]}
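The data flow of the framework summarized above (convolutional feature extraction, deconvolutional reconstruction, and element-wise skip connections between mirrored layers) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the convolution/deconvolution operators are passed in as placeholder callables, the every-two-layers skip interval and the summation before the ReLU follow the description in the text, and the name `red_forward` is chosen here purely for illustration.

```python
import numpy as np

def relu(x):
    return np.maximum(x, 0.0)

def red_forward(x, conv_layers, deconv_layers, skip_step=2):
    """Forward pass of a symmetric encoder-decoder with skip connections.

    conv_layers / deconv_layers: equal-length lists of callables mapping a
    feature map to a feature map (placeholders for real conv/deconv ops).
    The feature map entering every `skip_step`-th convolutional layer is
    saved and later added, element-wise, to the output of the mirrored
    deconvolutional layer before the nonlinearity.
    """
    n = len(conv_layers)
    assert n == len(deconv_layers)
    skips = {}
    h = x
    for i, conv in enumerate(conv_layers):
        if i % skip_step == 0:
            skips[i] = h              # save feature map for the mirror layer
        h = relu(conv(h))
    for j, deconv in enumerate(deconv_layers):
        h = deconv(h)
        mirror = n - 1 - j            # symmetric partner of this deconv layer
        if mirror in skips:
            h = h + skips[mirror]     # element-wise sum, then ReLU
        h = relu(h)
    return h
```

With identity operators in place of real convolutions, the sketch simply demonstrates the routing: each saved encoder feature map takes a shortcut to its mirrored deconvolutional layer, so both the image details and the gradients have a direct path between the two halves of the network.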