{"title": "On The Classification-Distortion-Perception Tradeoff", "book": "Advances in Neural Information Processing Systems", "page_first": 1206, "page_last": 1215, "abstract": "Signal degradation is ubiquitous, and computational restoration of degraded signals has been investigated for many years. Recently, it has been reported that the capability of signal restoration is fundamentally limited by the so-called perception-distortion tradeoff, i.e. the distortion and the perceptual difference between the restored signal and the ideal \"original\" signal cannot both be made minimal simultaneously. Distortion corresponds to signal fidelity and perceptual difference corresponds to perceptual naturalness, both of which are important metrics in practice. In addition, there is another dimension worthy of consideration--the semantic quality of the restored signal, i.e. the utility of the signal for recognition purposes. In this paper, we extend the perception-distortion tradeoff to the classification-distortion-perception (CDP) tradeoff, where we introduce the classification error rate of the restored signal in addition to distortion and perceptual difference. In particular, we consider the classification error rate achieved on the restored signal using a predefined classifier as a representative metric for semantic quality. We rigorously prove the existence of the CDP tradeoff, i.e. the distortion, perceptual difference, and classification error rate cannot all be made minimal simultaneously. We also provide both simulation and experimental results to showcase the CDP tradeoff. Our findings can be especially useful for computer vision research, where some low-level vision tasks (signal restoration) serve high-level vision tasks (visual understanding). 
Our code and models have been published.", "full_text": "On The Classification-Distortion-Perception Tradeoff\n\nDong Liu, Haochen Zhang, Zhiwei Xiong\n\nUniversity of Science and Technology of China, Hefei 230027, China\n\ndongeliu@ustc.edu.cn\n\nAbstract\n\nSignal degradation is ubiquitous, and computational restoration of degraded signals has been investigated for many years. Recently, it has been reported that the capability of signal restoration is fundamentally limited by the so-called perception-distortion tradeoff, i.e. the distortion and the perceptual difference between the restored signal and the ideal \u201coriginal\u201d signal cannot both be made minimal simultaneously. Distortion corresponds to signal fidelity and perceptual difference corresponds to perceptual naturalness, both of which are important metrics in practice. In addition, there is another dimension worthy of consideration\u2013the semantic quality of the restored signal, i.e. the utility of the signal for recognition purposes. In this paper, we extend the perception-distortion tradeoff to the classification-distortion-perception (CDP) tradeoff, where we introduce the classification error rate of the restored signal in addition to distortion and perceptual difference. In particular, we consider the classification error rate achieved on the restored signal using a predefined classifier as a representative metric for semantic quality. We rigorously prove the existence of the CDP tradeoff, i.e. the distortion, perceptual difference, and classification error rate cannot all be made minimal simultaneously. We also provide both simulation and experimental results to showcase the CDP tradeoff. Our findings can be especially useful for computer vision research, where some low-level vision tasks (signal restoration) serve high-level vision tasks (visual understanding). 
Our code and models have been published.\n\n1 Introduction\n\nSignal degradation refers to the corruption of the signal for many different reasons, such as interference and the mixing of the signal of interest with unwanted signals or noise; it is observed ubiquitously in practical information systems. The cause of signal degradation may be physical factors, such as imperfections of data acquisition devices and noise in the data transmission medium, or artificial factors, such as lossy data compression and the transmission of multiple sources over the same medium at the same time. In addition, in cases where we want to enhance a signal, we may assume the signal to have been somehow \u201cdegraded.\u201d For example, if we want to enhance the resolution of an image, we assume the image is a degraded version of an ideal \u201coriginal\u201d image that has high resolution [6].\nTo tackle signal degradation or to achieve signal enhancement, computational restoration of degraded signals has been investigated for many years. There are various signal restoration tasks corresponding to different degradation causes. Taking image as an example, image denoising [23], image deblurring [17], single image super-resolution [6], image contrast enhancement [7], image compression artifact removal [5], image inpainting [22], etc., all belong to image restoration tasks.\nDifferent restoration tasks have various objectives. Some tasks aim to recover the \u201coriginal\u201d signal as faithfully as possible: e.g. image denoising aims to recover the noise-free image, and compression artifact removal aims to recover the uncompressed image. 
Some other tasks are more concerned with the perceptual quality of the restored signal: e.g. image super-resolution aims to produce image details that make the enhanced image look \u201chigh-resolution,\u201d and image inpainting aims to generate a complete image that looks \u201cnatural.\u201d Yet other tasks may serve recognition or understanding purposes: for one example, an image containing a car license plate may be blurred, and image deblurring can achieve a less blurred image so as to recognize the license plate [13]; for another example, an image taken at night is difficult to interpret, and image contrast enhancement can produce a more natural-looking image that is better understood [10]. Recent years have witnessed more and more efforts in the last category [16, 19].\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nGiven the different objectives, it is apparent that a signal restoration method designed for one specific task should be evaluated with the specific metric that corresponds to the task's objective. Indeed, the aforementioned objectives correspond to three groups of evaluation metrics:\n\n1. Signal fidelity metrics that evaluate how similar the restored signal is to the \u201coriginal\u201d signal. These include all the full-reference quality metrics, such as the well-known mean-squared-error (MSE) and its counterpart peak signal-to-noise ratio (PSNR), the structural similarity (SSIM) [21], and the difference between features extracted from the original signal and the restored signal [8], to name a few.\n\n2. Perceptual naturalness metrics that evaluate how \u201cnatural\u201d the restored signal is with respect to human perception. These are usually known as no-reference quality metrics [14, 15]. Recently, the popularity of generative adversarial networks (GAN) has motivated a mathematical formulation of perceptual naturalness [3].\n\n3. Semantic quality metrics that evaluate how \u201cuseful\u201d the restored signal is, in the sense that it better serves the subsequent semantic-related analyses. For example, whether a restored sample can be correctly classified is a measure of the semantic quality. There are only a few studies about semantic quality assessment methods [12].\n\nIt is worth noting that signal fidelity metrics have dominated the research of signal restoration. However, is a method optimized for signal fidelity also optimal for perceptual naturalness or semantic quality? This question had been overlooked for a long while until recently. Blau and Michaeli considered signal fidelity and perceptual naturalness, and concluded that optimizing for the two metrics can be contradictory [3]. Indeed, they provided a rigorous proof of the existence of the perception-distortion tradeoff: with distortion representing signal fidelity and perceptual difference representing perceptual naturalness, one signal restoration method cannot achieve both low distortion and low perceptual difference beyond a certain bound. This conclusion reveals a fundamental limit of the capability of signal restoration, and it quickly inspired the investigation of perceptual naturalness metrics in different tasks [2, 20].\nFollowing the work on the perception-distortion tradeoff, in this paper we consider the three groups of metrics jointly, i.e. we study the relation between signal fidelity, perceptual naturalness, and semantic quality in the context of signal restoration. In particular, we consider classification error rate as a representative of semantic quality metrics, because classification is the most fundamental semantic-related analysis. We adopt the classification error rate achieved on the restored signal using a predefined classifier as the third dimension in addition to distortion and perceptual difference. 
We provide a rigorous proof of the existence of the classification-distortion-perception (CDP) tradeoff, i.e. the distortion, perceptual difference, and classification error rate cannot all be made minimal simultaneously. We also provide both simulation and experimental results to showcase the CDP tradeoff. Our code and models are published at https://github.com/AlanZhang1995/CDP-Tradeoff.\nTo the best of our knowledge, this paper is the first to reveal the fundamental tradeoff between the three kinds of quality metrics\u2013signal fidelity, perceptual naturalness, and semantic quality\u2013in the context of signal restoration. Our results imply that, if a signal restoration method is meant to serve recognition or understanding purposes, then the method is better optimized for semantic quality instead of signal fidelity or perceptual naturalness. This is in contrast to most existing practice, and it calls for more investigation of semantic quality metrics.\n\n2 Problem Formulation\n\nConsider the process X \u2192 Y \u2192 \u02c6X, where X denotes the ideal \u201coriginal\u201d signal, Y denotes the degraded signal, and \u02c6X denotes the restored signal. We consider X, Y, and \u02c6X each as a discrete random variable. The cases of continuous random variables can be deduced in a similar manner and are omitted hereafter. The probability mass function of X is denoted by pX(x). The degradation model is denoted by PY|X, which is characterized by a conditional mass function p(y|x). The restoration method is then denoted by P\u02c6X|Y and characterized by p(\u02c6x|y).\n\n2.1 Distortion, Perceptual Difference, and Classification Error Rate\n\nThere are different categories of quality metrics to evaluate signal restoration methods. 
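Before turning to the individual metrics, the discrete chain X \u2192 Y \u2192 \u02c6X above can be sketched numerically: the degradation P_{Y|X} and the restoration P_{\u02c6X|Y} are row-stochastic matrices, and pushing a pmf through them yields pY and p\u02c6X. Below is a minimal Python sketch; the pmf and channel numbers are hypothetical toy values, not from the paper.

```python
# Minimal sketch of the discrete chain X -> Y -> Xhat.
# p_X is the pmf of X; deg[x][y] = p(y|x) is the degradation model P_{Y|X};
# res[y][xh] = p(xh|y) is the restoration method P_{Xhat|Y}.
# All numbers below are hypothetical toy values for illustration.

def push_through(p, channel):
    """Propagate a pmf through a conditional pmf (a row-stochastic matrix)."""
    n_out = len(channel[0])
    return [sum(p[i] * channel[i][j] for i in range(len(p))) for j in range(n_out)]

p_X = [0.5, 0.5]                  # two-symbol source
deg = [[0.9, 0.1], [0.2, 0.8]]    # degradation p(y|x)
res = [[1.0, 0.0], [0.3, 0.7]]    # restoration p(xhat|y)

p_Y = push_through(p_X, deg)      # pmf of the degraded signal Y
p_Xhat = push_through(p_Y, res)   # pmf of the restored signal Xhat
```

Choosing `res` is exactly choosing a restoration method; the three metrics below are all functionals of this choice.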
For the first category, signal fidelity, we usually adopt distortion, defined as the expectation of a given bivariate function:\n\nDistortion := E[\u2206(X, \u02c6X)] (1)\n\nwhere E takes the expectation over the joint distribution pX,\u02c6X, and \u2206(\u00b7,\u00b7) : X \u00d7 \u02c6X \u2192 R+ is a given function that measures the difference between the original and the restored samples. This definition corresponds to the common practice of using various forms of full-reference loss functions, e.g. MSE, in signal restoration tasks. The definition measures dissimilarity, i.e. the lower the better, while some popular quality metrics such as PSNR and SSIM measure similarity.\nFor the second category, perceptual naturalness, it has been proved in [3] that the perceptual quality evaluated by humans in a real-or-fake test is equivalent to the total-variation (TV) distance between the distribution of the original signal and that of the restored signal. Following [3], we define perceptual difference as\n\nPerceptual Difference := d(pX, p\u02c6X) (2)\n\nwhere d(\u00b7,\u00b7) is a function that measures the difference between two probability mass functions, such as the TV distance or the Kullback-Leibler (KL) distance. Perceptual difference is also the lower the better.\nFor the third category, semantic quality, we focus in this paper on the classification error rate achieved on the restored signal using a predefined classifier. We discuss the case of classifying the signal into two categories, and note that the extension to multiple categories is straightforward.\nWe assume each sample of the original signal belongs to one of two classes, \u03c91 or \u03c92. The a priori probabilities and the class-conditional mass functions are assumed to be known as P1, P2 = 1 \u2212 P1 and pX1(x), pX2(x), respectively. 
In other words, X follows a two-component mixture model: pX(x) = P1pX1(x) + P2pX2(x). Accordingly, Y follows the model pY(y) = P1pY1(y) + P2pY2(y), and \u02c6X follows the model p\u02c6X(\u02c6x) = P1p\u02c6X1(\u02c6x) + P2p\u02c6X2(\u02c6x), where\n\npYi(y) = \u2211_{x\u2208X} p(y|x)pXi(x), i = 1, 2 (3)\n\np\u02c6Xi(\u02c6x) = \u2211_{y\u2208Y} p(\u02c6x|y)pYi(y) = \u2211_y \u2211_x p(\u02c6x|y)p(y|x)pXi(x), i = 1, 2 (4)\n\nA binary classifier can be denoted by\n\nc(t) = c(t|R) = { \u03c91, if t \u2208 R; \u03c92, otherwise } (5)\n\nIf we apply this classifier to the restored signal \u02c6X, we achieve an error rate\n\nClassification Error Rate := \u03b5(\u02c6X|c) = \u03b5(\u02c6X|R) = P2 \u2211_{\u02c6x\u2208R} p\u02c6X2(\u02c6x) + P1 \u2211_{\u02c6x\u2209R} p\u02c6X1(\u02c6x) (6)\n\n2.2 The CDP Function\n\nWe are now ready to define the CDP function, which is the focus of our study.\nDefinition 1. The classification-distortion-perception (CDP) function is\n\nC(D, P) = min_{P\u02c6X|Y} \u03b5(\u02c6X|c0), subject to E[\u2206(X, \u02c6X)] \u2264 D, d(pX, p\u02c6X) \u2264 P (7)\n\nwhere c0 = c(\u00b7|R0) is a predefined binary classifier.\n\nFigure 1: A toy example to showcase the CDP function. See text for details.\n\nThe CDP function characterizes how well a signal restoration method (P\u02c6X|Y) can perform if it is constrained to have a low distortion (at most D) and a low perceptual difference (at most P). Note that if D = \u221e and P = \u221e, then the restoration method is optimized purely for lowering the error rate, and thus the CDP function reaches its minimum. By defining the CDP function, we are interested to know whether such a constrained optimization can perform as well as an unconstrained one. 
This is because the optimization for distortion has been studied extensively, and if optimizing for distortion or perception also led to the optimum for classification, then we would be done. However, this is not the case, as we will prove.\nAnother issue concerns the predefined classifier in the definition of the CDP function. One may be curious whether it is possible to adjust the classifier itself to achieve a lower error rate: this is surely possible. However, there is a practical difficulty in training the optimal classifier for the restored signal, since the distribution of the restored signal depends on the restoration method, which is yet to be decided. Next, we may ask whether it is practical to adjust the restoration method and the classifier simultaneously. This is not necessary, because we can prove that the optimal classifier for \u02c6X cannot outperform the optimal classifier for Y (see the supplementary for the proof). That is, we would not need to perform signal restoration at all if we could train the optimal classifier for the degraded signal. But this raises another practical difficulty: the distribution of the degraded signal is often unknown (the so-called blind restoration setting), so it is not easy to train the optimal classifier for it. In summary, when dealing with blind restoration, i.e. the distribution of the degraded signal is unknown, it is difficult to achieve the optimal classifier for either the degraded or the restored signal, so using a predefined classifier is the more practical choice. When dealing with non-blind restoration, i.e. the distribution of the degraded signal is known, we can achieve the optimal classifier for the degraded signal, and it is not necessary to perform signal restoration prior to classification, as it will not improve the classification performance. 
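The claim that the optimal classifier for \u02c6X cannot outperform the optimal classifier for Y (proved in the paper's supplementary) can be checked numerically. The sketch below uses hypothetical discrete class-conditional pmfs, not the paper's data: the Bayes (optimal-classifier) error of a binary problem is \u2211_t min(P1 p1(t), P2 p2(t)), and it never decreases after Y is passed through a restoration channel p(\u02c6x|y).

```python
# Numerical check of a data-processing fact: restoring Y into Xhat cannot
# lower the Bayes (optimal-classifier) error. All pmfs are hypothetical.

def bayes_error(P1, p1, p2):
    """Minimal error of any binary classifier: sum_t min(P1*p1(t), P2*p2(t))."""
    P2 = 1.0 - P1
    return sum(min(P1 * a, P2 * b) for a, b in zip(p1, p2))

def push_through(p, channel):
    """Propagate a pmf through a conditional pmf p(out|in)."""
    return [sum(p[i] * channel[i][j] for i in range(len(p)))
            for j in range(len(channel[0]))]

P1 = 0.7                                     # a priori probability of class 1
pY1 = [0.6, 0.3, 0.1]                        # degraded signal, class 1
pY2 = [0.1, 0.3, 0.6]                        # degraded signal, class 2
res = [[0.8, 0.2], [0.5, 0.5], [0.1, 0.9]]   # some restoration p(xhat|y)

pX1 = push_through(pY1, res)                 # class-conditional pmfs of Xhat
pX2 = push_through(pY2, res)

eps_Y = bayes_error(P1, pY1, pY2)            # about 0.19
eps_Xhat = bayes_error(P1, pX1, pX2)         # about 0.30
assert eps_Xhat >= eps_Y                     # restoration cannot help the optimal classifier
```

Any other row-stochastic `res` gives the same inequality, which is why non-blind restoration before an optimal classifier brings no benefit.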
In this paper we consider the case of blind restoration, and we leave the case of non-blind restoration for future work.\n\n2.3 Toy Example\n\nTo showcase the characteristics of the CDP function, we conduct simulations with a toy example. As shown in Figure 1, the original signal follows a two-component Gaussian mixture model: P1 = 0.7, P2 = 0.3, pX1(x) = N(\u22121, 1), pX2(x) = N(1, 1). The signal is corrupted by additive white Gaussian noise: Y = X + N, where N \u223c N(0, 1). The denoising method is linear: \u02c6X = aY, where a is an adjustable parameter. For example, the restored signal with a = 0.8 is depicted in Figure 1 (b). We use the binary classifier that is optimal for the original signal to evaluate the error rate. In addition, we use MSE to evaluate distortion and the KL distance to evaluate perception. Under these settings, we can derive closed-form expressions of MSE and error rate with respect to the parameter a (see the supplementary for details). For the KL distance we have no closed-form expression, so we evaluate it numerically. We then use numerical methods to calculate the CDP function and depict it in Figure 2.\n\n[Figure 1 panels: (a) original signal X; (b) restored signal \u02c6X; (c) MSE, KL distance, and error rate as functions of a.]\n\nFigure 2: (a) The CDP function for the toy example, where we can find that the minimal attainable error rate (C) decreases as the maximal allowable MSE (D) and KL divergence (P) increase. (b) Profiles of the CDP function at different P values and D values respectively, from which we can find that the function is convex.\n\nFirst, the CDP function is monotonically non-increasing, i.e. the minimal attainable error rate decreases as the maximal allowable distortion and perception increase. This implies that if one wants a restoration method with better classification performance, it must come at the cost of higher distortion, lower perceptual quality, or both. Second, the CDP function is convex, indicating that if D (P) is smaller, the error rate increases faster; thus, minimizing D (P) can be quite harmful for the classification performance. Third, from Figure 2 (b), we observe that when D is small, C is invariant with P, and when D is large, C is invariant with D. In this example, the feasible domain of P\u02c6X|Y is fully determined by the feasible set of a, which is the intersection of the feasible set of a defined by D and that defined by P. If D is small, the feasible set defined by D is also small and determines the intersection. If D is large, that feasible set is also large and has no effect on the intersection. Similarly, from Figure 2 (b), we observe that when P is small, C is invariant with D, and when P is large, C is invariant with P; this can be interpreted in the same way. Last but not least, note that the areas where D and P are both small are absent from the CDP function, which results from the perception-distortion tradeoff [3].\nIn more general situations it is impossible to solve Eq. (7) analytically, but some properties of the CDP function are still valid, as shown in the following section.\n\n3 The CDP Tradeoff\n\nTheorem 1. Considering the CDP function (7), if d(\u00b7, q) is convex in q, then C(D, P) is\n\n1. monotonically non-increasing,\n\n2. convex in D and P.\n\nProof. For the first point, simply note that when D or P increases, the feasible domain of P\u02c6X|Y is enlarged; as C(D, P) is the minimal value of \u03b5(\u02c6X|c0) over the feasible domain, and the feasible domain is enlarged, the minimal value cannot increase.\nFor the second point, it is equivalent to prove\n\n\u03bbC(D1, P1) + (1 \u2212 \u03bb)C(D2, P2) \u2265 C(\u03bbD1 + (1 \u2212 \u03bb)D2, \u03bbP1 + (1 \u2212 \u03bb)P2) (8)\n\nfor any \u03bb \u2208 [0, 1]. 
First, let \u00b5(\u02c6x|y) (resp. \u03bd(\u02c6x|y)) denote the optimal restoration method under constraint (D1, P1) (resp. (D2, P2)), and let \u02c6X\u00b5 (resp. \u02c6X\u03bd) be the corresponding restored signal, i.e.\n\n\u03b5(\u02c6X\u00b5|c0) = min_{P\u02c6X|Y} \u03b5(\u02c6X|c0), subject to E[\u2206(X, \u02c6X)] \u2264 D1, d(pX, p\u02c6X) \u2264 P1 (9)\n\n\u03b5(\u02c6X\u03bd|c0) = min_{P\u02c6X|Y} \u03b5(\u02c6X|c0), subject to E[\u2206(X, \u02c6X)] \u2264 D2, d(pX, p\u02c6X) \u2264 P2 (10)\n\nThen the left-hand side of (8) becomes\n\n\u03bb\u03b5(\u02c6X\u00b5|c0) + (1 \u2212 \u03bb)\u03b5(\u02c6X\u03bd|c0) = \u03b5(\u02c6X\u03bb|c0) (11)\n\nwhere \u02c6X\u03bb denotes the restored signal corresponding to p\u03bb(\u02c6x|y) = \u03bb\u00b5(\u02c6x|y) + (1 \u2212 \u03bb)\u03bd(\u02c6x|y) (see the supplementary for the proof of this equation). Let D\u03bb = E[\u2206(X, \u02c6X\u03bb)] and P\u03bb = d(pX, p\u02c6X\u03bb); then by definition\n\n\u03b5(\u02c6X\u03bb|c0) \u2265 min_{P\u02c6X|Y} { \u03b5(\u02c6X|c0) : E[\u2206(X, \u02c6X)] \u2264 D\u03bb, d(pX, p\u02c6X) \u2264 P\u03bb } = C(D\u03bb, P\u03bb) (12)\n\nNext, as d(\u00b7,\u00b7) in (7) is convex in its second argument, we have\n\nP\u03bb = d(pX, \u03bbp\u02c6X\u00b5 + (1 \u2212 \u03bb)p\u02c6X\u03bd) \u2264 \u03bbd(pX, p\u02c6X\u00b5) + (1 \u2212 \u03bb)d(pX, p\u02c6X\u03bd) \u2264 \u03bbP1 + (1 \u2212 \u03bb)P2 (13)\n\nwhere the last inequality is due to (9) and (10). Similarly, we have\n\nD\u03bb = E[\u2206(X, \u02c6X\u03bb)] = EY E[\u2206(X, \u02c6X\u03bb)|Y] = EY[\u03bbE[\u2206(X, \u02c6X\u00b5)|Y] + (1 \u2212 \u03bb)E[\u2206(X, \u02c6X\u03bd)|Y]] = \u03bbE[\u2206(X, \u02c6X\u00b5)] + (1 \u2212 \u03bb)E[\u2206(X, \u02c6X\u03bd)] \u2264 \u03bbD1 + (1 \u2212 \u03bb)D2 (14)\n\nwhere the last inequality is again due to (9) and (10). 
Finally, note that C(D, P) is non-increasing with respect to D and P, so\n\nC(D\u03bb, P\u03bb) \u2265 C(\u03bbD1 + (1 \u2212 \u03bb)D2, \u03bbP1 + (1 \u2212 \u03bb)P2) (15)\n\nCombining (11), (12), and (15), we have (8).\n\nDiscussion. Note that the properties of the CDP function are quite similar to those of the perception-distortion function [3], and the proof is similar, too. The theorem assumes the convexity of the function d(\u00b7,\u00b7) in its second argument, which is satisfied by a large number of commonly used functions, including any f-divergence (e.g. KL, TV, Hellinger) and the R\u00e9nyi divergence [4, 18]. The theorem requires no assumption on the function \u2206(\u00b7,\u00b7), implying that the CDP tradeoff exists for any distortion metric, including MSE/PSNR, SSIM, the so-called feature losses calculated between deep features [8], and so on. The convexity of C(D, P) implies the tradeoff is stronger in the low-distortion or low-perception regimes: there, any small improvement in distortion/perception achieved by a restoration algorithm must be accompanied by a large loss of classification accuracy. Similarly, any small improvement in classification accuracy achieved by an algorithm whose error rate is already small must be accompanied by a large increase of distortion and/or perceptual difference.\n\n4 Experiments\n\nIn this section, we demonstrate the CDP tradeoff with real-world datasets and realistic settings. We use the MNIST handwritten digit recognition dataset [11] and the CIFAR-10 image recognition dataset [9]. The restoration tasks we consider are denoising and super-resolution (SR), and we use trained networks to perform the tasks. Since our intention is not to study the restoration methods themselves, we design simple denoising and SR networks inspired by the successful DnCNN [23] and SRCNN [6], respectively. Experimental configurations are summarized in Table 1. More details can be found in the supplementary.\n\nTable 1: Experimental configurations. CNN-2 and CNN-2\u2019 have the same network structure but differ in input image size (28\u00d728 and 32\u00d732).\n\nExp. | Dataset | Task | Classifier\nExp-1 | MNIST | Denoising | Logistic\nExp-2 | MNIST | Denoising | CNN-1\nExp-3 | MNIST | Denoising | CNN-2\nExp-4 | MNIST | SR | CNN-1\nExp-5 | CIFAR-10 | SR | CNN-2\u2019\n\nFigure 3: Profiles of the CDP functions. From top to bottom: Exp-1, Exp-2, and Exp-4. Better classification performance always comes at the cost of higher distortion and worse perceptual quality.\n\nIn order to showcase the CDP tradeoff, we train a restoration (denoising or SR) network with a combination of three loss functions that correspond to distortion, perceptual difference, and classification error rate. In short, the entire loss function is\n\n\u2113restoration = \u03b1\u2113MSE + \u03b2\u2113adv + \u03b3\u2113CE (16)\n\nwhere \u03b1, \u03b2, \u03b3 are weights. The first term is the MSE loss, representing distortion, which is widely used in image restoration research. The second term is an adversarial loss; minimizing it ensures perceptual quality, as suggested in [3]. Here we adopt the Wasserstein GAN [1], and the adversarial loss \u2113adv is proportional to the Wasserstein distance dW(pX, p\u02c6X). Note that in the Wasserstein GAN, the discriminator loss is indeed an estimate of the Wasserstein distance, which can thus be used to assess the perceptual quality of the restored images quantitatively. The third term is the cross entropy, corresponding to classification error rate. To demonstrate that the CDP tradeoff is generic, we use multiple classifiers in the experiments: the first is a simple logistic regression, and the others are CNN-based classifiers. For each classifier, we pretrain it on the clean (i.e. 
noise-free and original-resolution) training data, and use it to evaluate the cross entropy when training the denoising or SR network.\nFor Exp-1, Exp-2, and Exp-3, noisy images are generated by adding Gaussian noise N(0, 1) to the MNIST images. Then the noisy training data, as well as their clean versions, are used to train the denoiser with different combinations of (\u03b1, \u03b2, \u03b3). After training, we use the denoiser to process the noisy MNIST test data, and calculate D (MSE), P (the Wasserstein distance estimated by the discriminator), and C (using the pretrained classifier). For Exp-4, MNIST images are down-sampled by a factor of 6 and then interpolated back to the original resolution. The interpolated images and their clean versions are used to train the SR network. For Exp-5, CIFAR-10 images are down-sampled by a factor of 3.\n\nFigure 4: Visual results of Exp-2 with different combinations of loss weights. As \u03b3 increases, the perceptual quality becomes worse but the restored images are easier to recognize; see for example the digits \u20185\u2019 and \u20182\u2019 highlighted by red boxes.\n\nFig. 3 presents the results of Exp-1/2/4, where we plot each pair of (C, D, P) separately. For example, in (a) we plot the relation between P and D, using color to denote C. We also draw curves to connect the points with approximately the same C. As can be observed, when C is sufficiently large, there is a tradeoff between P and D, which has been characterized in [3]. Once C becomes smaller, the P-D curve elevates, indicating that better classification performance comes at the cost of higher distortion and/or worse perceptual quality. Similarly, in (b) and (c) we can observe the C-P and C-D relations, and all of them are convex, as the theorem forecasts. Moreover, comparing Exp-1 and Exp-2, which use different classifiers, although the error rates differ considerably in value, the trends of the CDP tradeoff are similar. 
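The combined objective in Eq. (16) is a plain weighted sum of the three loss terms. Below is a hedged plain-Python sketch of how the terms combine; `l_mse` and `l_ce` are stand-in implementations, `critic_score` is a hypothetical stand-in for the WGAN critic output used as \u2113adv, and all numeric values are illustrative (real training would use a deep learning framework and the paper's networks).

```python
# Hedged sketch of the combined restoration objective in Eq. (16):
#   l_restoration = alpha*l_MSE + beta*l_adv + gamma*l_CE.
# l_mse and l_ce are plain-Python stand-ins; critic_score is a hypothetical
# stand-in for the WGAN critic output used as l_adv.
import math

def l_mse(x, xhat):
    """Distortion term: mean squared error between original and restoration."""
    return sum((a - b) ** 2 for a, b in zip(x, xhat)) / len(x)

def l_ce(probs, label):
    """Semantic term: cross entropy of a (pretrained) classifier's output."""
    return -math.log(probs[label])

def restoration_loss(x, xhat, critic_score, probs, label, alpha, beta, gamma):
    """Weighted sum of the three losses, as in Eq. (16)."""
    return alpha * l_mse(x, xhat) + beta * critic_score + gamma * l_ce(probs, label)

# Illustrative values; the beta and gamma settings echo weights shown in Figure 4.
loss = restoration_loss(x=[0.0, 1.0], xhat=[0.1, 0.9], critic_score=0.2,
                        probs=[0.8, 0.2], label=0,
                        alpha=1.0, beta=0.02, gamma=0.005)
```

Sweeping (\u03b1, \u03b2, \u03b3) over a grid and recording (D, P, C) after each training run is how the profiles in Fig. 3 are traced out.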
Please check the supplementary for more results.\nFig. 4 presents some results of Exp-2 for visual inspection. As observed, the visual quality of the denoised images in general increases along with the weight \u03b2. Given the same \u03b2, the visual quality decreases when \u03b3 increases, showing a tradeoff. As expected, increasing \u03b3 enhances the semantic quality of the denoised images, which is evaluated by the pretrained classifier. Please note the digits \u20185\u2019 and \u20182\u2019 highlighted by red boxes: these digits may be difficult to recognize when \u03b3 is small, but seem recognizable when \u03b3 is large. There seems to be a positive correlation between the classification error rate (which is evaluated by the classifier) and human recognition (which is evaluated by ourselves). Note that human recognition is different from visual quality: human recognition means whether the class can be correctly recognized by humans, while visual quality (perceptual naturalness as defined in this paper) means whether the image looks like a natural image. More visual examples are provided in the supplementary.\n\n5 Conclusion\n\nWe have addressed the classification-distortion-perception tradeoff both by proving a theorem about the characteristics of the CDP function and by showcasing the CDP functions under simulation and experimental settings. Regardless of the restoration algorithm, the classification error rate on the restored signal evaluated by a predefined classifier cannot be made minimal along with the distortion and perceptual difference. 
The CDP function is convex, indicating that when the error rate is already low, any improvement in classification performance comes at the cost of higher distortion and worse perceptual quality.\nOur findings can be especially useful for computer vision research, where some low-level vision tasks (signal restoration) serve high-level vision tasks (visual understanding). It is worth noting that we have used a predefined classifier to evaluate the classification error rate, but in practice we may have a different metric that directly measures the semantic quality of the restored signal. More studies on this aspect are expected in the future.\n\nAcknowledgments\n\nThis work was supported by the Natural Science Foundation of China under Grant 61772483.\n\nReferences\n\n[1] Martin Arjovsky, Soumith Chintala, and L\u00e9on Bottou. Wasserstein GAN. arXiv preprint arXiv:1701.07875, 2017.\n\n[2] Yochai Blau, Roey Mechrez, Radu Timofte, Tomer Michaeli, and Lihi Zelnik-Manor. The 2018 PIRM challenge on perceptual image super-resolution. In ECCV, pages 1\u201322, 2018.\n\n[3] Yochai Blau and Tomer Michaeli. The perception-distortion tradeoff. In CVPR, pages 6228\u20136237, 2018.\n\n[4] Imre Csisz\u00e1r and Paul C. Shields. Information theory and statistics: A tutorial. Foundations and Trends in Communications and Information Theory, 1(4):417\u2013528, 2004.\n\n[5] Chao Dong, Yubin Deng, Chen Change Loy, and Xiaoou Tang. Compression artifacts reduction by a deep convolutional network. In ICCV, pages 576\u2013584, 2015.\n\n[6] Chao Dong, Chen Change Loy, Kaiming He, and Xiaoou Tang. Learning a deep convolutional network for image super-resolution. In ECCV, pages 184\u2013199, 2014.\n\n[7] Micha\u00ebl Gharbi, Jiawen Chen, Jonathan T. Barron, Samuel W. 
Hasinoff, and Fr\u00e9do Durand.\nDeep bilateral learning for real-time image enhancement. ACM Transactions on Graphics,\n36(4):118, 2017.\n\n[8] Justin Johnson, Alexandre Alahi, and Li Fei-Fei. Perceptual losses for real-time style transfer\n\nand super-resolution. In ECCV, pages 694\u2013711, 2016.\n\n[9] Alex Krizhevsky. Learning multiple layers of features from tiny images. Technical report,\n\nUniversity of Toronto, 2009.\n\n[10] Hulin Kuang, Xianshi Zhang, Yong-Jie Li, Leanne Lai Hang Chan, and Hong Yan. Nighttime\nvehicle detection based on bio-inspired image enhancement and weighted score-level feature\nfusion. IEEE Transactions on Intelligent Transportation Systems, 18(4):927\u2013936, 2017.\n\n[11] Yann LeCun, L\u00e9on Bottou, Yoshua Bengio, and Patrick Haffner. Gradient-based learning\n\napplied to document recognition. Proceedings of the IEEE, 86(11):2278\u20132324, 1998.\n\n[12] Dong Liu, Dandan Wang, and Houqiang Li. Recognizable or not: Towards image semantic\n\nquality assessment for compression. Sensing and Imaging, 18(1):1\u201320, 2017.\n\n[13] Qingbo Lu, Wengang Zhou, Lu Fang, and Houqiang Li. Robust blur kernel estimation for\nlicense plate images from fast moving vehicles. IEEE Transactions on Image Processing,\n25(5):2311\u20132323, 2016.\n\n[14] Anish Mittal, Anush Krishna Moorthy, and Alan Conrad Bovik. No-reference image quality\nassessment in the spatial domain. IEEE Transactions on Image Processing, 21(12):4695\u20134708,\n2012.\n\n[15] Michele A. Saad, Alan C. Bovik, and Christophe Charrier. Blind image quality assessment: A\nnatural scene statistics approach in the DCT domain. IEEE Transactions on Image Processing,\n21(8):3339\u20133352, 2012.\n\n[16] Jacob Shermeyer and Adam Van Etten. The effects of super-resolution on object detection\n\nperformance in satellite imagery. In CVPR Workshops, pages 1\u201310, 2019.\n\n[17] Shuochen Su, Mauricio Delbracio, Jue Wang, Guillermo Sapiro, Wolfgang Heidrich, and Oliver\n\nWang. 
Deep video deblurring for hand-held cameras. In CVPR, pages 1279\u20131288, 2017.\n\n[18] Tim Van Erven and Peter Harremos. R\u00e9nyi divergence and Kullback-Leibler divergence. IEEE\n\nTransactions on Information Theory, 60(7):3797\u20133820, 2014.\n\n[19] Rosaura G. VidalMata, Sreya Banerjee, Brandon RichardWebster, Michael Albright, Pedro\nDavalos, Scott McCloskey, Ben Miller, Asong Tambo, Sushobhan Ghosh, and Sudarshan\nNagesh. Bridging the gap between computational photography and visual recognition. arXiv\npreprint arXiv:1901.09482, 2019.\n\n9\n\n\f[20] Thang Vu, Cao Van Nguyen, Trung X. Pham, Tung M. Luu, and Chang D. Yoo. Fast and\nef\ufb01cient image quality enhancement via desubpixel convolutional neural networks. In ECCV,\npages 1\u201317, 2018.\n\n[21] Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment:\nFrom error visibility to structural similarity. IEEE Transactions on Image Processing, 13(4):600\u2013\n612, 2004.\n\n[22] Jiahui Yu, Zhe Lin, Jimei Yang, Xiaohui Shen, Xin Lu, and Thomas S. Huang. Generative\n\nimage inpainting with contextual attention. In CVPR, pages 5505\u20135514, 2018.\n\n[23] Kai Zhang, Wangmeng Zuo, Yunjin Chen, Deyu Meng, and Lei Zhang. Beyond a Gaussian\ndenoiser: Residual learning of deep CNN for image denoising. IEEE Transactions on Image\nProcessing, 26(7):3142\u20133155, 2017.\n\n10\n\n\f", "award": [], "sourceid": 733, "authors": [{"given_name": "Dong", "family_name": "Liu", "institution": "University of Science and Technology of China"}, {"given_name": "Haochen", "family_name": "Zhang", "institution": "University of Science and Technology of China"}, {"given_name": "Zhiwei", "family_name": "Xiong", "institution": "University of Science and Technology of China"}]}