{"title": "Blind Super-Resolution Kernel Estimation using an Internal-GAN", "book": "Advances in Neural Information Processing Systems", "page_first": 284, "page_last": 293, "abstract": "Super resolution (SR) methods typically assume that the low-resolution (LR) image was downscaled from the unknown high-resolution (HR) image by a fixed `ideal\u2019 downscaling kernel (e.g. Bicubic downscaling). However, this is rarely the case in real LR images, in contrast to synthetically generated SR datasets. When the assumed downscaling kernel deviates from the true one, the performance of SR methods significantly deteriorates. This gave rise to Blind-SR - namely, SR when the downscaling kernel (``SR-kernel\u2019\u2019) is unknown. It was further shown that the true SR-kernel is the one that maximizes the recurrence of patches across scales of the LR image. In this paper we show how this powerful cross-scale recurrence property can be realized using Deep Internal Learning. We introduce ``KernelGAN\u2019\u2019, an image-specific Internal-GAN, which trains solely on the LR test image at test time, and learns its internal distribution of patches. Its Generator is trained to produce a downscaled version of the LR test image, such that its Discriminator cannot distinguish between the patch distribution of the downscaled image, and the patch distribution of the original LR image. The Generator, once trained, constitutes the downscaling operation with the correct image-specific SR-kernel. KernelGAN is fully unsupervised, requires no training data other than the input image itself, and leads to state-of-the-art results in Blind-SR when plugged into existing SR algorithms.", "full_text": "Blind Super-Resolution Kernel Estimation using an Internal-GAN\n\nSe\ufb01 Bell-Kligler\n\nAssaf Shocher\n\nMichal Irani\n\nDept. 
of Computer Science and Applied Math\n\nThe Weizmann Institute of Science, Israel\n\nProject website: http://www.wisdom.weizmann.ac.il/~vision/kernelgan\n\nAbstract\n\nSuper resolution (SR) methods typically assume that the low-resolution (LR) image was downscaled from the unknown high-resolution (HR) image by a fixed 'ideal' downscaling kernel (e.g. Bicubic downscaling). However, this is rarely the case in real LR images, in contrast to synthetically generated SR datasets. When the assumed downscaling kernel deviates from the true one, the performance of SR methods significantly deteriorates. This gave rise to Blind-SR – namely, SR when the downscaling kernel ("SR-kernel") is unknown. It was further shown that the true SR-kernel is the one that maximizes the recurrence of patches across scales of the LR image. In this paper we show how this powerful cross-scale recurrence property can be realized using Deep Internal Learning. We introduce "KernelGAN", an image-specific Internal-GAN [29], which trains solely on the LR test image at test time, and learns its internal distribution of patches. Its Generator is trained to produce a downscaled version of the LR test image, such that its Discriminator cannot distinguish between the patch distribution of the downscaled image and the patch distribution of the original LR image. The Generator, once trained, constitutes the downscaling operation with the correct image-specific SR-kernel. KernelGAN is fully unsupervised, requires no training data other than the input image itself, and leads to state-of-the-art results in Blind-SR when plugged into existing SR algorithms.
1\n\n1 Introduction\n\nThe basic model of SR assumes that the low-resolution input image I_LR is the result of downscaling a high-resolution image I_HR by a scaling factor s using some kernel k_s (the "SR kernel"), namely:\n\nI_LR = (I_HR * k_s) ↓s    (1)\n\nThe goal is to recover I_HR given I_LR. This problem is ill-posed even when the SR-kernel is assumed known (an assumption made by most SR methods – older [8, 32, 7] and more recent [5, 20, 19, 21, 38, 35, 12]). A boost in SR performance was achieved in the past few years by the introduction of Deep-Learning based methods [5, 20, 19, 21, 38, 35, 12]. However, since most SR methods train on synthetically downscaled images, they implicitly rely on the SR-kernel k_s being fixed and 'ideal' (usually a Bicubic downscaling kernel with antialiasing – MATLAB's default imresize command). Real LR images, however, rarely obey this assumption. This results in poor SR performance by state-of-the-art (SotA) methods when applied to real or 'non-ideal' LR images (see Fig. 1a).\n\nThe SR kernel of real LR images is influenced by the sensor optics as well as by tiny camera motion of the hand-held camera, resulting in a different non-ideal SR-kernel for each LR image, even if taken by the same sensor. It was shown in [26] that the effect of using an incorrect SR-kernel is of greater\n\n1 Project funded by the European Research Council (ERC) under the Horizon 2020 research & innovation program (grant No.
788535)\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nFigure 1: SR×4 on real 'non-ideal' LR images (downloaded from the internet, or downscaled by an unknown kernel). (a) Comparison to SotA SR methods (Bicubic interpolation, EDSR+ [21], RCAN+ [38]) vs. KernelGAN (ours, SR method: [30]) and the ground-truth HR: since they train on 'ideal' LR images, they perform poorly on real non-ideal LR images. (b) Comparison to SotA Blind-SR methods (PDN [34], WDSR [36], and the kernel of [24] with SR method [30]) vs. KernelGAN (ours, SR method: [30]) and the ground-truth HR. Full images and additional examples in the supplementary material (please zoom in on screen).\n\ninfluence on the SR performance than any choice of an image prior. This observation was later shown to hold also for deep-learning based priors [30]. The importance of the SR-kernel accuracy was further analyzed empirically in [27].\n\nThe problem of SR with an unknown kernel is known as Blind SR. Some Blind-SR methods [33, 17, 14, 13, 3] were introduced prior to the deep learning era. A few Deep Blind-SR methods [34, 36] were recently presented in the NTIRE'2018 SR challenge [31]. These methods do not explicitly calculate the SR-kernel, but propose SR networks that are robust to variations in the downscaling kernel. A work concurrent to ours, IKC [10], performs Blind-SR by alternating between kernel estimation and SR image reconstruction. A different family of recent Deep SR methods [30, 37], while not explicitly Blind-SR, allows a different image-specific (pre-estimated) SR-kernel to be provided along with the LR input image at test time. Note that SotA (non-blind) SR methods cannot make any use of an image-specific SR kernel at test time (even if known/provided), since it is different from the fixed downscaling kernel they trained on.
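The downscaling model of Eq. 1 (convolution with the SR-kernel, then subsampling by s) can be sketched in a few lines of numpy. This is our toy illustration, not the authors' code; the box kernel stands in for an arbitrary 'non-ideal' SR-kernel and all names are ours:

```python
import numpy as np

def downscale(hr, kernel, s=2):
    # Eq. 1: I_LR = (I_HR * k_s) ↓s  -- convolve, then keep every s-th pixel
    p = kernel.shape[0] // 2
    padded = np.pad(hr, p, mode="edge")
    blurred = np.zeros_like(hr)
    for i in range(hr.shape[0]):
        for j in range(hr.shape[1]):
            blurred[i, j] = (padded[i:i + kernel.shape[0],
                                    j:j + kernel.shape[1]] * kernel).sum()
    return blurred[::s, ::s]

rng = np.random.default_rng(0)
hr = rng.random((32, 32))
box = np.full((5, 5), 1 / 25.0)   # a toy (non-bicubic) SR-kernel: 5x5 box blur
lr = downscale(hr, box, s=2)
print(lr.shape)  # (16, 16)
```

An SR method trained assuming a bicubic kernel sees a different `lr` than one produced by this kernel, which is exactly the mismatch described above.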
These methods thus produce poor SR results in realistic scenarios – see Fig. 1a (in contrast to their excellent performance on synthetically generated LR images).\n\nThe recurrence of small image patches (5×5, 7×7) across scales of a single image was shown to be a very strong property of natural images [8, 39]. Michaeli & Irani [24] exploited this recurrence property to estimate the unknown SR-kernel directly from the LR input image. An important observation they made was that the correct SR-kernel is also the downscaling kernel which maximizes the similarity of patches across scales of the LR image. Based on this observation, they proposed a nearest-neighbor patch-based method to estimate the kernel. However, their method tends to fail for SR scale factors larger than ×2, and has a very long runtime.\n\nThis internal cross-scale recurrence property is very powerful, since it is image-specific and unsupervised (it requires no prior examples). In this paper we show how this property can be combined with the power of Deep-Learning, to obtain the best of both worlds – unsupervised SotA SR-kernel estimation, with SotA Blind-SR results. We build upon the recent success of Deep Internal Learning [30] (training an image-specific CNN on examples extracted directly from the test image), and in particular on Internal-GAN [29] – a self-supervised GAN which learns the image-specific distribution of patches.\n\nMore specifically, we introduce "KernelGAN" – an image-specific Internal-GAN, which estimates the SR kernel that best preserves the distribution of patches across scales of the LR image. Its Generator is trained to produce a downscaled version of the LR test image, such that its Discriminator cannot distinguish between the patch distribution of the downscaled image and the patch distribution of the original LR image.
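Throughout, the "patch distribution" of an image is simply the collection of all its small overlapping crops. A toy sketch (function name ours) of extracting the patches that such a distribution is defined over:

```python
import numpy as np

def patches(img, p=7):
    # All overlapping p×p patches of a 2-D image, each flattened to a vector
    h, w = img.shape
    return np.stack([img[i:i + p, j:j + p].ravel()
                     for i in range(h - p + 1)
                     for j in range(w - p + 1)])

rng = np.random.default_rng(0)
img = rng.random((16, 16))
P = patches(img)
print(P.shape)  # (100, 49): (16-7+1)^2 patches of 49 pixels each
```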
In other words, G trains to fool D into believing that all the patches of the downscaled image were actually taken from the original one. The Generator, once trained, constitutes the downscaling operation with the correct image-specific SR-kernel. KernelGAN is fully unsupervised, requires no training data other than the input image itself, and leads to state-of-the-art results in Blind-SR when plugged into existing SR algorithms.\n\nSince downscaling by the SR-kernel is a linear operation applied to the LR image (convolution and subsampling – see Eq. 1), our Generator (as opposed to the Discriminator) is a linear network (without non-linear activations). At first glance, it may seem that a single strided convolution layer (which stems from Eq. 1) should suffice as a Generator. Interestingly, we found that using a deep linear network is dramatically superior to a single-layer one. This is consistent with recent findings in theoretical deep-learning [28, 2, 18, 11], where deep linear networks were shown to have optimization advantages over a single-layer network for linear regression models. This is further elaborated in Sec. 4.1.\n\nOur contributions are therefore several-fold:\n\n• This is the first dataset-invariant deep-learning method to estimate the unknown SR kernel (a critical step for true SR of real LR images). KernelGAN is fully unsupervised and requires no training data other than the input image itself, hence enabling true SR "in the wild".\n• KernelGAN leads to state-of-the-art results in Blind-SR when plugged into existing SR algorithms.\n• To the best of our knowledge, this is the first practical use of deep linear networks (so far used mostly for theoretical analysis), with demonstrated practical advantages.\n\nFigure 2: KernelGAN: The patch GAN trains on patches of a single input image (real). D tries to distinguish real patches from those generated by G (fake).
G learns to downscale the image by ×2 while fooling D, i.e. maintaining the same distribution of patches. Both networks are fully convolutional, which in the case of images implies that each pixel in the output is the result of a specific receptive field (i.e. patch) in the input.\n\n2 Overview of the Approach\n\nGiven only the input image, our goal is to find its underlying image-specific SR kernel. We seek the kernel that best preserves the distribution of patches across scales of the LR image. More specifically, we aim to "generate" a downscaled image (e.g. by a factor of 2) such that its patch distribution is as close as possible to that of the LR image.\n\nBy matching distributions rather than individual patches, we can leverage recent advances in distribution modeling using Generative Adversarial Networks (GANs) [9]. GANs can be understood as a tool for distribution matching [9]. A GAN typically learns a distribution of images in a large image dataset. It maps examples from a source distribution, px, to examples indistinguishable from a target distribution, py: G : x → y with x∼px, and G(x)∼py. An internal GAN [29] trains on a single input image and learns its unique internal distribution of patches.\n\nInspired by InGAN [29], KernelGAN is also an image-specific GAN that trains on a single input image. It consists of a downscaling generator (G) and a discriminator (D), as depicted in Fig. 2. Both G and D are fully-convolutional, which implies the network is applied to patches rather than the whole image, as in [16]. Given an input image I_LR, G learns to downscale it, such that, for D, at the patch level, it is indistinguishable from the input image I_LR.\n\nD trains to output a heat map, referred to as the D-map (see Fig. 2), indicating for each pixel how likely its surrounding patch is to be drawn from the original patch-distribution.
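This per-patch real/fake scoring can be sketched with toy L1 map losses in the spirit of the LSGAN-style objective described below. This is our illustration (shapes and names are ours), not the authors' code:

```python
import numpy as np

def d_loss(d_map_real, d_map_fake):
    # D's targets: an all-ones map on real crops, an all-zeros map on fake crops
    return np.abs(d_map_real - 1).mean() + np.abs(d_map_fake).mean()

def g_loss(d_map_fake):
    # G tries to fool D: push D's map on fake (downscaled) crops toward all ones
    return np.abs(d_map_fake - 1).mean()

# Toy 32x32 D-maps: a confident D scores real crops near 1 and fake crops near 0
d_real = np.full((32, 32), 0.9)
d_fake = np.full((32, 32), 0.1)
print(d_loss(d_real, d_fake), g_loss(d_fake))
```

A confident discriminator (real near 1, fake near 0) has a low `d_loss`, while the generator's `g_loss` stays high until G's outputs start fooling D.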
It alternately trains on real examples (crops from the input image) and fake ones (crops from G's output). The loss is the pixel-wise MSE difference between the output D-map and the label map. The labels for training D are a map of all ones for crops extracted from the original LR image, and a map of all zeros for crops extracted from the downscaled image.\n\nWe adopt a variant of the LSGAN [23] with the L1 norm, and define the objective of KernelGAN as:\n\nG*(I_LR) = argmin_G max_D { E_{x∼patches(I_LR)} [ |D(x) − 1| + |D(G(x))| ] + R }    (2)\n\nwhere R is the regularization term on the downscaling SR-kernel resulting from the generator G (see more details in Sec. 4.2). Once converged, the generator G* constitutes, implicitly, the ideal SR downscaling function for the specific LR image.\n\nThe GAN trains for 3,000 iterations, alternating single optimization steps of G and D, with the ADAM optimizer (β1 = 0.5, β2 = 0.999). The learning rate is 2e−4, decaying ×0.1 every 750 iterations.\n\n3 Discriminator\n\nThe goal of the discriminator D is to learn the distribution of patches of the input image I_LR and discriminate between real patches belonging to this distribution and fake patches generated by G. D's real examples are crops from I_LR, while fake examples are crops currently output by G.\n\nWe use a fully-convolutional patch discriminator, as introduced in [16], applied to learn the patch distribution of a single image as in [29]. To discriminate small image patches, we use no pooling\n\nFigure 3: Fully Convolutional Patch Discriminator: A 7×7 convolution filter followed by six 1×1 convolutions, including Spectral normalization [25], Batch normalization [15], ReLU and Sigmoid activations.
An input crop of 32×32 results in a 32×32 map ∈ [0,1].\n\nor strides, thus achieving a small receptive field of a 7×7 patch. In this setting, D is implicitly applied to each patch separately and produces a heat-map (D-map) where each location in the map corresponds to a patch from the input. The labels for the discriminator are maps (matrices of real/fake, i.e. 1/0 labels, respectively) of the same size as the input to the discriminator. Each pixel in the D-map indicates how likely its surrounding patch is to be drawn from the learned patch distribution. See Fig. 3 for architecture details of the discriminator.\n\n4 Deep Linear Generator = The downscaling SR-Kernel\n\n4.1 Deep Linear Generator\n\nThe generator G constitutes the downscaling model. Since downscaling by convolution and subsampling is a linear transform (Eq. 1), we use a linear Generator (without any non-linear activations). We refer to the model of downscaling from Eq. 1. In principle, the expressiveness of a single strided convolutional layer should cover all possible downscaling methods captured by Eq. 1. However, we empirically found that such an architecture does not converge to the correct solution (see Fig. 6). We attribute this behavior to the following conjecture: A generator consisting of a single layer has exactly one set of parameters for the correct solution (the set of weights that makes up the ground-truth kernel). This means that there is only one point on the optimization surface that is acceptable as a solution, and it is the global minimum. Achieving such a global minimum is easy when the loss function is convex (as in linear regression). But in our case the overall loss function is a non-linear neural network (the discriminator D), which is highly non-convex. In such a case, the probability of getting from some random initial state to the global minimum using gradient-based methods is negligible.
In contrast to a single layer, standard (deep) neural networks are assumed to converge from a random initialization due to the fact that there are many good local minima and negligibly few bad local minima [6, 18]. Hence, for a problem that by definition has one valid solution, optimizing a single-layer generator is practically infeasible.\n\nA non-linear generator would not be suitable either. Without an explicit constraint on G to be linear, it will generate physically unwanted solutions to the optimization objective. Such solutions include generating any image that contains valid patches but has no downscaling relation (Eq. 1) to the input. One example is generating a tile of just several patches from the input. This would be a solution that complies with Eq. 2 but is unwanted.\n\nThis conjecture motivated us to use deep linear networks. These are networks consisting of a sequence of linear layers with no activations, and are used for theoretical analysis [28, 2, 18, 11]. The expressiveness of a deep linear network is exactly that of a single linear layer (i.e. linear regression); however, its optimization has several different aspects. While it can be convex with respect to the input, the loss is never convex with respect to the weights (assuming more than one layer). In fact, linear networks have infinitely many equally valued global minima. Any choice of network parameters matching one of these minima is equivalent to any other minimum point – they are just different factorizations of the same matrix. Motivated by these observations, we employ a deep linear generator. By that we allow infinitely many valid solutions to our optimization objective, all equivalent to the same linear solution. Furthermore, it was shown by [2] that gradient-based optimization is faster for deeper linear networks than for shallow ones.\n\nFig. 4 depicts the architecture of G.
We use 5 hidden convolutional layers with 64 channels each. The first 3 filters are 7×7, 5×5, 3×3 and the rest are 1×1. This makes a receptive field of 13×13 (allowing for a 13×13 SR kernel). This setting of filters takes into account the effective receptive field [22]; maintaining the same receptive field while having filters bigger than 1×1 following the first layer encourages the center of the kernel to have higher values. To provide a reasonable initial starting point, the generator's output is constrained to be similar to an ideal downscaling (e.g. bicubic) of the input for the first iterations. Once satisfied, this constraint is discarded.\n\nFigure 4: Deep linear G: A 6-layer convolutional network without non-linear activations. The deep linear network on the left has the same expressive power as the single strided-convolution downscaling model on the right.\n\n4.2 Extracting the explicit kernel\n\nAt any iteration during the training, the generator G constitutes the currently estimated downscaling operation for the specific LR input image. The image-specific kernel is implicitly captured by the trained weights of G. However, there are two reasons to explicitly extract the kernel from G. First, we are interested in having a compact downscaling convolution kernel rather than a downscaling network. This step is trivial: convolving all the filters of G sequentially with stride 1 results in the SR-kernel k (see Fig. 4). Once extracted, the kernel is just a small array that can be supplied to SR algorithms. The second reason for explicitly extracting the kernel is to allow applying explicit and physically-meaningful priors on the kernel. This is the goal of the regularization term R in Eq. 2, which reduces the hypothesis space to a subset of the plausible kernels that obey certain restrictions.
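The kernel-extraction step just mentioned (convolving all of G's filters sequentially with stride 1) rests on the associativity of convolution. A toy numpy sketch with random single-channel stand-ins for the filters (the real G has 64 channels; sizes follow the architecture above):

```python
import numpy as np

def conv2d_full(a, b):
    # Full 2-D convolution of two small kernels (pure numpy)
    out = np.zeros((a.shape[0] + b.shape[0] - 1, a.shape[1] + b.shape[1] - 1))
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            out[i:i + b.shape[0], j:j + b.shape[1]] += a[i, j] * b
    return out

rng = np.random.default_rng(0)
# Toy stand-ins for G's first three linear filters
f1, f2, f3 = rng.random((7, 7)), rng.random((5, 5)), rng.random((3, 3))

# The explicit kernel: all filters convolved sequentially with stride 1
k = conv2d_full(conv2d_full(f1, f2), f3)
print(k.shape)  # (13, 13), matching the receptive field quoted above
```

Because convolution is associative, the grouping does not matter: applying the layers one after another to an image is equivalent to one convolution with `k`.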
The restrictions are that the kernel sums to 1 and is centered, so that it does not shift the image. The regularization also ameliorates faulty tendencies of the optimization process to produce kernels that are too spread out and smooth. However, it is not enough to extract the kernel; this extraction must be differentiable, so that the regularization losses can be back-propagated through it. At each iteration, we apply a differentiable operation of calculating the kernel (convolving all the filters of G sequentially with stride 1). The regularization loss term R is then applied and included in the optimization objective.\n\nThe regularization term in our objective in Eq. 2 is the following:\n\nR = α L_sum_to_1 + β L_boundaries + γ L_sparse + δ L_center    (3)\n\nwhere α = 0.5, β = 0.5, γ = 5, δ = 1, and:\n\n• L_sum_to_1 = |1 − Σ_{i,j} k_{i,j}| encourages k to sum to 1.\n• L_boundaries = Σ_{i,j} |k_{i,j} · m_{i,j}| penalizes non-zero values close to the boundaries, where m is a constant mask of weights growing exponentially with distance from the center of k.\n• L_sparse = Σ_{i,j} |k_{i,j}|^{1/2} encourages sparsity, to prevent the network from over-smoothing kernels.\n• L_center = ‖(x0, y0) − (Σ_{i,j} k_{i,j} · (i,j)) / (Σ_{i,j} k_{i,j})‖_2 encourages k's center of mass to be at the center of the kernel, where (x0, y0) denote the center indices.\n\nSR kernels are not only image-specific, but also depend on the desired scale-factor s. However, there is a simple relation between SR-kernels of different scales. We are thus able to obtain a kernel for SR ×4 from a G that was trained to downscale by ×2. This is advantageous for two reasons: First, it allows extracting kernels for various scales with one run of KernelGAN.
Second, it prevents downscaling the LR image too much. Small LR images downscaled by a scale factor of 4 may result in tiny images (×16 smaller than the HR image), which may not contain enough data to train on. KernelGAN is trained to estimate a kernel k2 for a scale-factor of 2. It is easy to show that the kernel for a scale factor of 4, i.e. k4, can be analytically calculated by requiring:\n\n(I_LR * k4) ↓4 = ((I_LR * k2) ↓2 * k2) ↓2\n\nThis implies that k4 = k2 * k2_dilated, where k2_dilated[n1, n2] = k2[n1/2, n2/2] if n1 and n2 are both even, and 0 otherwise.\n\nFor a mathematical proof of the above derivation, as well as an ablation study of the various kernel constraints – see our project website.\n\n5 Experiments and results\n\nWe evaluated our method on real LR images, as well as on 'non-ideal' synthetically generated LR images with ground-truth (both ground-truth HR images, as well as the true SR-kernels).\n\nFigure 5: SR kernel estimation: We compare the ground-truth kernel, Michaeli & Irani [24], and KernelGAN (ours). The PSNR of ZSSR [30], when supplied with each of these kernels, is noted at the bottom right of each kernel, emphasizing the significance of kernel estimation accuracy.
Our method outperforms [24] in SR performance (as well as in visual similarity to the ground-truth SR kernel).\n\nTable 1: SotA SR performance (PSNR(dB) / SSIM) on the 100 images of DIV2KRK (Sec. 5.2). Red indicates the best performance, blue indicates second best.\n\nType 1 – SotA SR algorithms (trained on bicubically downscaled images):\n• Bicubic Interpolation: ×2: 28.731 / 0.8040; ×4: 25.330 / 0.6795\n• Bicubic kernel + ZSSR [30]: ×2: 29.102 / 0.8215; ×4: 25.605 / 0.6911\n• EDSRplus [21]: ×2: 29.172 / 0.8216; ×4: 25.638 / 0.6928\n• RCANplus [38]: ×2: 29.198 / 0.8223; ×4: 25.659 / 0.6936\nType 2 – Blind-SR, NTIRE'18 [31] winners:\n• PDN [34] (1st in NTIRE track 4): ×2: –; ×4: 26.340 / 0.7190\n• WDSR [36] (1st in NTIRE track 2): ×2: –; ×4: 21.546 / 0.6841\n• WDSR [36] (1st in NTIRE track 3): ×2: –; ×4: 21.539 / 0.7016\n• WDSR [36] (2nd in NTIRE track 4): ×2: –; ×4: 25.636 / 0.7144\nType 3 – kernel estimation + non-blind SR algorithm:\n• Michaeli & Irani [24] + SRMD [37]: ×2: 25.511 / 0.8083; ×4: 23.335 / 0.6530\n• Michaeli & Irani [24] + ZSSR [30]: ×2: 29.368 / 0.8370; ×4: 26.080 / 0.7138\n• KernelGAN (Ours) + SRMD [37]: ×2: 29.565 / 0.8564; ×4: 25.711 / 0.7265\n• KernelGAN (Ours) + ZSSR [30]: ×2: 30.363 / 0.8669; ×4: 26.810 / 0.7316\nType 4 – Upper bound:\n• Ground-truth kernel + SRMD [37]: ×2: 31.962 / 0.8955; ×4: 27.375 / 0.7655\n• Ground-truth kernel + ZSSR [30]: ×2: 32.436 / 0.8992; ×4: 27.527 / 0.7446\n\n5.1 Evaluation Method\n\nThe performance of our method is analyzed in two ways: kernel estimation accuracy and SR performance. The latter is done both visually (see Fig. 1a and the supplementary material) and empirically on the synthetic dataset, analyzing PSNR and SSIM measurements (Table 1). Evaluation is done using the script provided by [19] and used by many works, including [30, 20, 21]. For evaluation of the kernel estimation we chose two non-blind SR algorithms that accept an SR-kernel as input [30, 37], provided them with different kernels (bicubic, the ground-truth SR-kernel, ours, and that of [24]), and compared their SR performance.
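The ×4 kernels used in such evaluations are derived analytically from the estimated ×2 kernel via the dilation relation of Sec. 4.2 (k4 = k2 * k2_dilated). A toy numpy sketch, with a random normalized array standing in for an estimated k2 and helper names that are ours:

```python
import numpy as np

def conv2d_full(a, b):
    # Full 2-D convolution (pure numpy)
    out = np.zeros((a.shape[0] + b.shape[0] - 1, a.shape[1] + b.shape[1] - 1))
    for i in range(a.shape[0]):
        for j in range(a.shape[1]):
            out[i:i + b.shape[0], j:j + b.shape[1]] += a[i, j] * b
    return out

def dilate(k):
    # k_dilated[n1, n2] = k[n1/2, n2/2] for even n1, n2, and 0 otherwise
    out = np.zeros((2 * k.shape[0] - 1, 2 * k.shape[1] - 1))
    out[::2, ::2] = k
    return out

rng = np.random.default_rng(0)
k2 = rng.random((13, 13))
k2 /= k2.sum()            # SR-kernels are constrained to sum to 1
k4 = conv2d_full(k2, dilate(k2))
print(k4.shape)
```

Since convolution multiplies kernel sums, a normalized k2 automatically yields a normalized k4.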
We present four types (categories) of algorithms for the analysis:\n\n• Type 1 includes the non-blind SotA SR methods, trained on bicubically downscaled images.\n• Type 2 are the winners of the NTIRE'2018 Blind-SR challenge [31].\n• Type 3 consists of combinations of two methods for SR-kernel estimation ([24] and ours) with two non-blind SR methods ([30, 37]) that regard the estimated kernel as input. Each such combination is, in itself, a full Blind-SR algorithm.\n• Type 4 is again the combination above, only with the use of the ground-truth SR-kernel, thus providing an upper bound for Type 3.\n\nWhen providing a kernel to ZSSR [30] (whether ours, [24]'s, or the ground-truth kernel), we provide both the ×2 and ×4 kernels, in order to exploit the gradual process of ZSSR (using 2 sequential SR steps). We compare our kernel estimation to Michaeli & Irani [24], which, to the best of our knowledge, is the leading method for SR-kernel estimation from a single image.\n\n5.2 Dataset for Blind-SR\n\nThere is no adequate dataset for quantitatively evaluating Blind-SR. The data used in the NTIRE'2018 Blind-SR challenge [31] has an inherent problem; it suffers from sub-pixel misalignments between the LR image and the HR ground truth, resulting in sub-pixel misalignments between the reconstructed SR images and the HR ground-truth images. Such small misalignments are known to prefer (i.e. give lower error to) blurry images over sharp ones. A different benchmark suggested by [4] is not yet available, and is restricted to only 3 SR blur kernels. As a result, we generated a new Blind-SR benchmark, referred to as DIV2KRK (DIV2K random kernel).\n\nFigure 6: Depth is of the essence for the linear network: The deep linear generator outperforms the single-layer version by 3.8dB and 1.6dB for ×2 and ×4 over DIV2KRK.
Kernel examples emphasize the superiority of the deep linear network.\n\nUsing the validation set (100 images) of the widely used DIV2K [1] dataset, we blurred and subsampled each image with a different, randomly generated kernel. Kernels were 11×11 anisotropic Gaussians with random lengths λ1, λ2 ∼ U(0.6, 5), independently distributed for each axis, rotated by a random angle θ ∼ U[−π, π]. To deviate from a regular Gaussian, we further apply uniform multiplicative noise (up to 25% of each pixel value of the kernel) and normalize the kernel to sum to one. See Figs. 5 and 6 for a few such examples. Data and reproduction code are available on the project website.\n\n5.3 Results\n\nOur method together with ZSSR [30] outperforms SotA SR results visually and numerically, by a large margin of 1dB and 0.47dB for scales ×2 and ×4 respectively. When the kernel deviates from bicubic, as in DIV2KRK and real images, Type 1 (SotA SR) tends to produce blurry results (often comparable to naive bicubic interpolation) and highlights undesired artifacts (e.g. JPEG compression); such examples are shown in Fig. 1a. Type 2, i.e. SotA Blind-SR methods, produce significantly better quantitative results than Type 1, although visually (as shown in Fig. 1b) they tend to over-smooth patterns and over-sharpen edges in an artificial, graphical manner.\n\nOur SR-kernel estimation, plugged into ZSSR [30], shows superior results over both types by a large margin of 1.1dB and 0.47dB for scales ×2 and ×4 respectively, and visually recovers small details from the LR image, as shown in Fig. 1. However, our kernel together with [37] did not perform as well for scale ×4, ranking lower than [34] in PSNR (but higher in SSIM).\n\nRegarding SR-kernel estimation, we analyze Type 3. When fixing the SR algorithm, our kernel estimation outperforms [24] visually, as seen in Fig.
1b, and quantitatively by 1dB and 0.73dB when plugged into [30], and by 4dB and 2.3dB when plugged into [37], for scales ×2 and ×4 respectively.\n\nThe superiority of Types 3 and 4 (as shown in Table 1) shows empirically the importance of performing SR with regard to the image-specific SR-kernel. Moreover, using an SR algorithm with different SR-kernels leads to significantly different results, demonstrating the importance of SR-kernel accuracy. This is also shown in Fig. 5, with a large difference in PSNR despite a small (but visible) difference in the estimated kernel. For more visual results and comparisons, see the project website.\n\nRun-time: Network training is done at test time. There is no actual inference step, since the trained G implicitly contains the resulting SR-kernel. Runtime is 61 or 102 seconds per image on a single Tesla V-100 or Tesla K-80 GPU, respectively. Runtime is independent of both image size and scale factor, since the GAN analyzes patches rather than the whole image. Since the same kernel applies to the entire image, analyzing a fixed number of crops suffices to estimate the kernel. Fixed-size image crops (64×64 to G, and accordingly 32×32 to D) are randomly selected, with probability proportional to their gradient content. KernelGAN first estimates the ×2 SR-kernel, and then derives kernels for other scales (e.g. ×4) analytically. For comparison, [24] runs 45 minutes per image on our DIV2KRK, and its time consumption grows with image size.\n\n6 Conclusion\n\nWe present significant progress towards real-world SR (i.e. Blind-SR), by estimating an image-specific SR-kernel based on the LR image alone. This is done via an image-specific Internal-GAN, which trains solely on the LR input image, and learns its internal distribution of patches without requiring any prior examples. This gives rise to true SR "in the wild".
We show both visually and quantitatively that when our kernel estimation is provided to existing off-the-shelf non-blind SR algorithms, it leads to SotA SR results, by a large margin.\n\nReferences\n\n[1] Eirikur Agustsson and Radu Timofte. NTIRE 2017 challenge on single image super-resolution: Dataset and study. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, July 2017.\n\n[2] Sanjeev Arora, Nadav Cohen, and Elad Hazan. On the optimization of deep networks: Implicit acceleration by overparameterization. In Jennifer G. Dy and Andreas Krause, editors, ICML, volume 80 of Proceedings of Machine Learning Research, pages 244–253. PMLR, 2018.\n\n[3] Isabelle Begin and Frank P. Ferrie. Blind super-resolution using a learning-based approach. In International Conference on Pattern Recognition, ICPR '04, Washington, DC, USA, 2004. IEEE Computer Society.\n\n[4] Jianrui Cai, Hui Zeng, Hongwei Yong, Zisheng Cao, and Lei Zhang. Toward real-world single image super-resolution: A new benchmark and a new model, 2019.\n\n[5] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Learning a deep convolutional network for image super-resolution. In European Conference on Computer Vision (ECCV), 2014.\n\n[6] Anna Choromanska, Mikael Henaff, Michael Mathieu, Gerard Ben Arous, and Yann LeCun. The loss surfaces of multilayer networks. In Guy Lebanon and S. V. N. Vishwanathan, editors, Proceedings of the Eighteenth International Conference on Artificial Intelligence and Statistics, volume 38 of Proceedings of Machine Learning Research, pages 192–204, San Diego, California, USA, 09–12 May 2015. PMLR.\n\n[7] Gilad Freedman and Raanan Fattal. Image and video upscaling from local self-examples. ACM Trans. Graph., April 2011.\n\n[8] Daniel Glasner, Shai Bagon, and Michal Irani. Super-resolution from a single image.
In ICCV, 2009.

[9] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In Z. Ghahramani, M. Welling, C. Cortes, N. D. Lawrence, and K. Q. Weinberger, editors, Advances in Neural Information Processing Systems 27, pages 2672-2680. Curran Associates, Inc., 2014.

[10] Jinjin Gu, Hannan Lu, Wangmeng Zuo, and Chao Dong. Blind super-resolution with iterative kernel correction. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), June 2019.

[11] Moritz Hardt and Tengyu Ma. Identity matters in deep learning. CoRR, abs/1611.04231, 2016.

[12] Muhammad Haris, Greg Shakhnarovich, and Norimichi Ukita. Deep back-projection networks for super-resolution. In IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[13] He He and Wan-Chi Siu. Single image super-resolution using Gaussian process regression. In Proceedings of the 2011 IEEE Conference on Computer Vision and Pattern Recognition, CVPR '11, pages 449-456, Washington, DC, USA, 2011. IEEE Computer Society.

[14] Yu He, Kim-Hui Yap, Li Chen, and Lap-Pui Chau. A soft MAP framework for blind super-resolution image reconstruction. Image Vision Comput., 27(4):364-373, March 2009.

[15] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning - Volume 37, ICML'15, pages 448-456. JMLR.org, 2015.

[16] Phillip Isola, Jun-Yan Zhu, Tinghui Zhou, and Alexei A. Efros. Image-to-image translation with conditional adversarial networks. The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2017.

[17] Neel Joshi, Richard Szeliski, and David J. Kriegman. PSF estimation using sharp edge prediction.
In 2008 IEEE Conference on Computer Vision and Pattern Recognition, pages 1-8, 2008.

[18] Kenji Kawaguchi. Deep learning without poor local minima. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 586-594. Curran Associates, Inc., 2016.

[19] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Accurate image super-resolution using very deep convolutional networks. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1646-1654, June 2016.

[20] Jiwon Kim, Jung Kwon Lee, and Kyoung Mu Lee. Deeply-recursive convolutional network for image super-resolution. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR Oral), June 2016.

[21] Bee Lim, Sanghyun Son, Heewon Kim, Seungjun Nah, and Kyoung Mu Lee. Enhanced deep residual networks for single image super-resolution. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, July 2017.

[22] Wenjie Luo, Yujia Li, Raquel Urtasun, and Richard Zemel. Understanding the effective receptive field in deep convolutional neural networks. In D. D. Lee, M. Sugiyama, U. V. Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 4898-4906. Curran Associates, Inc., 2016.

[23] Xudong Mao, Qing Li, Haoran Xie, Raymond Y. K. Lau, and Zhen Wang. Least squares generative adversarial networks. In IEEE International Conference on Computer Vision (ICCV), 2017.

[24] T. Michaeli and M. Irani. Nonparametric blind super-resolution. In International Conference on Computer Vision (ICCV), 2013.

[25] Takeru Miyato, Toshiki Kataoka, Masanori Koyama, and Yuichi Yoshida. Spectral normalization for generative adversarial networks. In International Conference on Learning Representations, 2018.

[26] Netalee Efrat, Daniel Glasner, Alexander Apartsin, Boaz Nadler, and Anat Levin.
Accurate blur models vs. image priors in single image super-resolution. In ICCV, 2013.

[27] G. Riegler, S. Schulter, M. Rüther, and H. Bischof. Conditioned regression models for non-blind single image super-resolution. In 2015 IEEE International Conference on Computer Vision (ICCV), pages 522-530, Dec 2015.

[28] Andrew M. Saxe, James L. McClelland, and Surya Ganguli. Exact solutions to the nonlinear dynamics of learning in deep linear neural networks, 2013.

[29] Assaf Shocher, Shai Bagon, Phillip Isola, and Michal Irani. InGAN: Capturing and remapping the "DNA" of a natural image. In arXiv, 2019.

[30] Assaf Shocher, Nadav Cohen, and Michal Irani. Zero-shot super-resolution using deep internal learning. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2018.

[31] Radu Timofte, Shuhang Gu, Jiqing Wu, Luc Van Gool, Lei Zhang, Ming-Hsuan Yang, Muhammad Haris, et al. NTIRE 2018 challenge on single image super-resolution: Methods and results. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR) Workshops, June 2018.

[32] Radu Timofte, Vincent De Smet, and Luc Van Gool. A+: Adjusted anchored neighborhood regression for fast super-resolution. In ACCV, 2014.

[33] Qiang Wang, Xiaoou Tang, and Harry Shum. Patch based blind image super resolution. In Proceedings of the Tenth IEEE International Conference on Computer Vision (ICCV '05) Volume 1, ICCV '05, pages 709-716, Washington, DC, USA, 2005. IEEE Computer Society.

[34] Xintao Wang, Ke Yu, Tak-Wai Hui, Chao Dong, Liang Lin, and Chen Change Loy. Deep poly-dense network for image super-resolution. NTIRE challenge, 2018.

[35] W. Yifan, F. Perazzi, B. McWilliams, A. Sorkine-Hornung, O. Sorkine-Hornung, and C. Schroers. A fully progressive approach to single-image super-resolution.
In CVPR Workshops, June 2018.

[36] Jiahui Yu, Yuchen Fan, Jianchao Yang, Ning Xu, Zhaowen Wang, Xinchao Wang, and Thomas S. Huang. Wide activation for efficient and accurate image super-resolution. CoRR, abs/1808.08718, 2018.

[37] Kai Zhang, Wangmeng Zuo, and Lei Zhang. Learning a single convolutional super-resolution network for multiple degradations. In IEEE Conference on Computer Vision and Pattern Recognition, pages 3262-3271, 2018.

[38] Yulun Zhang, Kunpeng Li, Kai Li, Lichen Wang, Bineng Zhong, and Yun Fu. Image super-resolution using very deep residual channel attention networks. In ECCV, 2018.

[39] Maria Zontak and Michal Irani. Internal statistics of a single natural image. In The IEEE Conference on Computer Vision and Pattern Recognition (CVPR), 2011.