{"title": "Neural Universal Discrete Denoiser", "book": "Advances in Neural Information Processing Systems", "page_first": 4772, "page_last": 4780, "abstract": "We present a new framework of applying deep neural networks (DNN) to devise a universal discrete denoiser. Unlike other approaches that utilize supervised learning for denoising, we do not require any additional training data. In such setting, while the ground-truth label, i.e., the clean data, is not available, we devise ``pseudo-labels'' and a novel objective function such that DNN can be trained in a same way as supervised learning to become a discrete denoiser. We experimentally show that our resulting algorithm, dubbed as Neural DUDE, significantly outperforms the previous state-of-the-art in several applications with a systematic rule of choosing the hyperparameter, which is an attractive feature in practice.", "full_text": "Neural Universal Discrete Denoiser\n\nTaesup Moon\n\nDGIST\n\nDaegu, Korea 42988\ntsmoon@dgist.ac.kr\n\nSeonwoo Min, Byunghan Lee, Sungroh Yoon\n\nSeoul National University\n\nSeoul, Korea 08826\n\n{mswzeus, styxkr, sryoon}@snu.ac.kr\n\nAbstract\n\nWe present a new framework of applying deep neural networks (DNN) to devise a\nuniversal discrete denoiser. Unlike other approaches that utilize supervised learning\nfor denoising, we do not require any additional training data. In such setting, while\nthe ground-truth label, i.e., the clean data, is not available, we devise \u201cpseudo-\nlabels\u201d and a novel objective function such that DNN can be trained in a same way\nas supervised learning to become a discrete denoiser. 
We experimentally show that our resulting algorithm, dubbed Neural DUDE, significantly outperforms the previous state of the art in several applications, with a systematic rule for choosing the hyperparameter, which is an attractive feature in practice.

1 Introduction

Cleaning noise-corrupted data, i.e., denoising, is a ubiquitous problem in signal processing and machine learning. Discrete denoising, in particular, focuses on the cases in which both the underlying clean and noisy data take their values in some finite set. Such a setting covers several applications in different domains, such as image denoising [1, 2], DNA sequence denoising [3], and channel decoding [4].

A conventional approach to the denoising problem is the Bayesian approach, which can often yield a computationally efficient algorithm with reasonable performance. However, limitations can arise when the assumed stochastic models do not accurately reflect the real data distribution. In particular, while models for the noise can often be obtained relatively reliably, obtaining an accurate model for the original clean data is trickier; the model for the clean data may be wrong, changing, or may not exist at all.

In order to alleviate the above limitations, [5] proposed a universal approach for discrete denoising. Namely, they first considered a general setting in which the clean finite-valued source symbols are corrupted by a discrete memoryless channel (DMC), a noise mechanism that corrupts each source symbol independently and statistically identically. Then, they devised an algorithm called DUDE (Discrete Universal DEnoiser) and showed rigorous performance guarantees for the semi-stochastic setting, namely, one in which no stochastic modeling assumptions are made on the underlying source data, while the corruption mechanism is assumed to be governed by a known DMC. 
DUDE is shown to universally attain the optimum denoising performance for any source data as the data size grows. In addition to the strong theoretical performance guarantee, DUDE can be implemented as a computationally efficient sliding window denoiser; hence, it has been successfully applied and extended to several practical applications, e.g., [1, 3, 4, 2]. However, it also has limitations; namely, its performance is sensitive to the choice of the sliding window size k, which has to be hand-tuned without any systematic rule. Moreover, when k becomes large and the alphabet size of the signal increases, DUDE suffers from a data sparsity problem, which significantly deteriorates the performance.

In this paper, we present a novel framework for addressing the above limitations of DUDE by adopting the machinery of deep neural networks (DNN) [6], which has recently seen great empirical success in many practical applications.

30th Conference on Neural Information Processing Systems (NIPS 2016), Barcelona, Spain.

While there have been some previous attempts at applying neural networks to grayscale image denoising [7, 8], they all remained in the supervised learning setting, i.e., large-scale training data consisting of clean and noisy image pairs was necessary. Such an approach requires significant computational resources and training time and is not always transferable to other denoising applications, in which collecting massive training data is often expensive, e.g., DNA sequence denoising [9].

Henceforth, we stick to the setting of DUDE, which requires no additional data other than the given noisy data. In this case, however, it is not straightforward to adopt a DNN, since there is no ground-truth label for supervised training of the networks. Namely, the target label that a denoising algorithm tries to estimate from the observation is the underlying clean signal; hence, it can never be observed by the algorithm. 
Therefore, we carefully exploit the known DMC assumption and the finiteness of the data values, and devise “pseudo-labels” for training the DNN. They are based on an unbiased estimate of the true loss a denoising algorithm is incurring, and we show that it is possible to train a DNN as a universal discrete denoiser using the devised pseudo-labels and a generalized cross-entropy objective function. As a by-product, we also obtain an accurate estimator of the true denoising performance, with which we can systematically choose the appropriate window size k. As a result, we experimentally verify that our DNN-based denoiser, dubbed Neural DUDE, can achieve significantly better performance than DUDE while maintaining robustness with respect to k. Furthermore, although the work in this paper is focused on discrete denoising, we believe the proposed framework can be extended to the denoising of continuous-valued signals as well; we defer this to future work.

2 Notations and related work

2.1 Problem setting of discrete denoising

Throughout this paper, we will generally denote a sequence (n-tuple) as, e.g., a^n = (a_1, ..., a_n), and a_i^j refers to the subsequence (a_i, ..., a_j). In the discrete denoising problem, we denote the clean, underlying source data as x^n and assume each component x_i takes a value in some finite set X. The source sequence is corrupted by a DMC, resulting in a noisy version of the source, z^n, of which each component z_i takes a value in, again, some finite set Z. The DMC is completely characterized by the channel transition matrix Π ∈ R^{|X|×|Z|}, of which the (x, z)-th element, Π(x, z), stands for Pr(Z_i = z | X_i = x), i.e., the conditional probability of the noisy symbol taking value z given that the original source symbol was x. 
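To make the setting above concrete, here is a minimal sketch (our own illustration, not code from the paper) of the binary instance that appears later in Section 5.1: a binary symmetric channel (BSC) as the matrix Π, the Hamming loss as the matrix Λ, and memoryless corruption of a clean sequence.

```python
import numpy as np

# Minimal sketch of the problem setting (our illustration): binary alphabets
# X = Z = Xhat = {0, 1}, a binary symmetric channel Pi with crossover
# probability delta, and the Hamming loss matrix Lam.
delta = 0.1
Pi = np.array([[1 - delta, delta],
               [delta, 1 - delta]])   # Pi[x, z] = Pr(Z_i = z | X_i = x)
Lam = 1.0 - np.eye(2)                 # Lam[x, xhat]: Hamming loss

# DMC corruption: each clean symbol is corrupted independently.
rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=1000)
z = np.where(rng.random(1000) < delta, 1 - x, x)
```

Since each row of Π is a conditional distribution, its rows sum to one, and for δ < 1/2 the matrix is invertible (its determinant is 1 − 2δ), matching the invertibility assumption made below.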
An essential but natural assumption we make is that Π has full row rank.

Upon observing the entire noisy data z^n, a discrete denoiser reconstructs the original data with X̂^n = (X̂_1(z^n), ..., X̂_n(z^n)), where each reconstructed symbol X̂_i(z^n) also takes its value in a finite set X̂. The goodness of the reconstruction by a discrete denoiser X̂^n is measured by the average loss, L_{X̂^n}(X^n, Z^n) = (1/n) Σ_{i=1}^n Λ(x_i, X̂_i(z^n)), where Λ(x_i, x̂_i) is a single-letter loss function that measures the loss incurred by estimating x_i with x̂_i at location i. The loss function can also be represented with a loss matrix Λ ∈ R^{|X|×|X̂|}. Throughout the paper, for simplicity, we will assume X = Z = X̂ and, thus, that Π is invertible.

2.2 Discrete Universal DEnoiser (DUDE)

DUDE in [5] is a two-pass algorithm that has linear complexity in the data size n. During the first pass, the algorithm with window size k collects the statistics vector

m[z^n, l^k, r^k](a) = |{i : k+1 ≤ i ≤ n−k, z_{i−k}^{i+k} = l^k a r^k}|,  (1)

for all a ∈ Z, which is the count of the occurrences of the symbol a ∈ Z along the noisy sequence z^n with the double-sided context (l^k, r^k) ∈ Z^{2k}. Once the m vector is collected, for the second pass, DUDE applies the rule

X̂_{i,DUDE}(z^n) = arg min_{x̂ ∈ X̂} m[z^n, c_i]^⊤ Π^{−1} [λ_x̂ ⊙ π_{z_i}]  for each k+1 ≤ i ≤ n−k,  (2)

where c_i ≜ (z_{i−k}^{i−1}, z_{i+1}^{i+k}) is the context of z_i, π_{z_i} is the z_i-th column of the channel matrix Π, λ_x̂ is the x̂-th column of the loss matrix Λ, and ⊙ stands for the element-wise product. 
The form of (2) shows that DUDE is a sliding window denoiser with window size 2k+1; namely, DUDE returns the same denoised symbol at all locations i with the same value of z_{i−k}^{i+k}. We will call such denoisers k-th order sliding window denoisers from now on.

DUDE is shown to be universal; i.e., for any underlying clean sequence x^n, it can always attain the performance of the best k-th order sliding window denoiser as long as k|Z|^{2k} = o(n/log n) holds [5, Theorem 2]. For more rigorous analyses, we refer to the original paper [5].

2.3 Deep neural networks (DNN) and related work

Deep neural networks (DNN), often dubbed deep learning algorithms, have recently made significant impacts in several practical applications, such as speech recognition, image recognition, and machine translation. For a thorough review of recent progress on DNN, we refer the readers to [6] and references therein.

Regarding denoising, [7, 8, 10] have successfully applied DNN to grayscale image denoising by utilizing supervised learning at the small image patch level. Namely, they generated clean and noisy image patches and trained neural networks to learn a mapping from noisy to clean patches. While such an approach attained state-of-the-art performance, as mentioned in the Introduction, it has several limitations. That is, it typically requires a massive amount of training data, and multiple copies of the data need to be generated for different noise types and levels to achieve robust performance. Such a requirement of large training data cannot always be met in other applications; e.g., in DNA sequence denoising, collecting large-scale clean DNA sequences is much more expensive than obtaining training images on the web. 
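The two-pass DUDE of Section 2.2 can be sketched in a few lines (our simplified illustration under our own naming; boundary symbols are simply copied, and rule (2) is applied at the interior locations):

```python
import numpy as np
from collections import defaultdict

def dude(z, Pi, Lam, k, A):
    """Sketch of two-pass DUDE (our illustration). z: noisy sequence of ints,
    Pi: channel matrix, Lam: loss matrix, k: one-sided window size,
    A: alphabet size (X = Z = Xhat assumed)."""
    n = len(z)
    Pi_inv = np.linalg.inv(Pi)
    # First pass: count symbol occurrences per double-sided context, as in (1).
    m = defaultdict(lambda: np.zeros(A))
    for i in range(k, n - k):
        c = (tuple(z[i - k:i]), tuple(z[i + 1:i + k + 1]))
        m[c][z[i]] += 1
    # Second pass: apply rule (2) at each interior location.
    xhat = list(z)   # boundary symbols are left as-is in this sketch
    for i in range(k, n - k):
        c = (tuple(z[i - k:i]), tuple(z[i + 1:i + k + 1]))
        scores = [m[c] @ Pi_inv @ (Lam[:, xh] * Pi[:, z[i]]) for xh in range(A)]
        xhat[i] = int(np.argmin(scores))
    return xhat
```

For example, on a BSC-corrupted run of zeros with a single flipped bit, the context (0,0),(0,0) is dominated by zeros, so rule (2) flips the isolated 1 back to 0.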
Moreover, for image denoising, working at the small patch level makes sense, since image patches may share some textual regularities; but in other applications, the characteristics of the given data for denoising could differ from those in the pre-collected training set. For instance, the characteristics of substrings of DNA sequences vary greatly across different species and genes; hence, the universal setting makes more sense in DNA sequence denoising.

3 An alternative interpretation of DUDE

3.1 Unbiased estimated loss

In order to make an alternative interpretation of DUDE, which can also be found in [11], we need the tool developed in [12]. To be self-contained, we recap the idea here. Consider a single-letter case, namely, a clean symbol x is corrupted by Π, resulting in the noisy observation¹ Z. Then, suppose a single-symbol denoiser s : Z → X̂ is applied, yielding the denoised symbol X̂ = s(Z). In this case, the true loss incurred by s for the clean symbol x and the noisy observation Z is Λ(x, s(Z)). It is clear that s cannot evaluate its loss, since it does not know what x is, but the following shows that an unbiased estimate of the expected true loss, based only on Z and s, can be derived.

First, denote S as the set of all possible single-symbol denoisers. Note |S| = |X̂|^{|Z|}. Then, we define a matrix ρ ∈ R^{|X|×|S|} with

ρ(x, s) = Σ_{z ∈ Z} Π(x, z) Λ(x, s(z)) = E_x Λ(x, s(Z)),  x ∈ X, s ∈ S.  (3)

Then, we can define an estimated loss matrix² L ≜ Π^{−1} ρ ∈ R^{|Z|×|S|}. 
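As an illustration (our code; the enumeration order of the denoisers is an arbitrary choice of ours), ρ of (3) and L = Π⁻¹ρ can be computed by enumerating all |X̂|^{|Z|} single-symbol denoisers:

```python
import numpy as np
from itertools import product

def estimated_loss_matrix(Pi, Lam):
    """Build rho of (3) and L = Pi^{-1} rho (our sketch). A single-symbol
    denoiser s : Z -> Xhat is represented as the tuple (s(0), ..., s(|Z|-1))."""
    X, Z = Pi.shape
    Xhat = Lam.shape[1]
    S = list(product(range(Xhat), repeat=Z))          # |S| = |Xhat|^|Z|
    rho = np.array([[sum(Pi[x, z] * Lam[x, s[z]] for z in range(Z))
                     for s in S] for x in range(X)])  # rho(x, s) = E_x Lam(x, s(Z))
    L = np.linalg.inv(Pi) @ rho                       # estimated loss matrix
    return S, rho, L
```

For the BSC with δ = 0.1 and Hamming loss, the "say what you see" denoiser s = (0, 1) has estimated loss 0.1 regardless of the observed symbol, and Π L = ρ recovers the unbiasedness shown next.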
With this definition, we can show that L(Z, s) is an unbiased estimate of E_x Λ(x, s(Z)) as follows (as shown in [12]):

E_x L(Z, s) = Σ_z Π(x, z) L(z, s) = Σ_z Π(x, z) Σ_{x′} Π^{−1}(z, x′) ρ(x′, s) = Σ_{x′} δ(x, x′) ρ(x′, s) = ρ(x, s) = E_x Λ(x, s(Z)).

¹We use the uppercase letter Z to stress that it is a random variable.
²For the general case in which Π is not a square matrix, Π^{−1} can be replaced with the right inverse of Π.

3.2 DUDE: Minimizing the sum of estimated losses

As mentioned in Section 2.2, DUDE with context size k is the k-th order sliding window denoiser. Generally, we can denote such a k-th order sliding window denoiser as s_k : Z^{2k+1} → X̂, which obtains the reconstruction at the i-th location as

X̂_i(z^n) = s_k(z_{i−k}^{i+k}) = s_k(c_i, z_i).  (4)

To recall, c_i = (z_{i−k}^{i−1}, z_{i+1}^{i+k}). Now, from the formulation (4), we can interpret that s_k defines a single-symbol denoiser at location i, i.e., s_k(c_i, ·), depending on c_i. With this view on s_k, as derived in [11], we can show that the DUDE defined in (2) is equivalent to finding a single-symbol denoiser

s_{k,DUDE}(c, ·) = arg min_{s ∈ S} Σ_{i : c_i = c} L(z_i, s),  (5)

for each context c ∈ C_k ≜ {(l^k, r^k) : (l^k, r^k) ∈ Z^{2k}} and obtaining the reconstruction at location i as X̂_{i,DUDE}(z^n) = s_{k,DUDE}(c_i, z_i). The interpretation (5) gives some intuition on why DUDE enjoys the strong theoretical guarantees in [5]: since L(Z_i, s) is an unbiased estimate of E_{x_i} Λ(x_i, s(Z_i)), Σ_{i ∈ {i : c_i = c}} L(Z_i, s) will concentrate on Σ_{i ∈ {i : c_i = c}} Λ(x_i, s(Z_i)) as long as |{i : c_i = c}| is sufficiently large. 
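A small numeric sketch of (5) (our code, for the BSC/Hamming case of Section 5.1): given the noisy symbols that share a context c, DUDE picks the single-symbol denoiser minimizing the summed estimated loss.

```python
import numpy as np
from itertools import product

# Our sketch of rule (5) for the binary BSC + Hamming loss setting.
delta = 0.1
Pi = np.array([[1 - delta, delta], [delta, 1 - delta]])
Lam = 1.0 - np.eye(2)
S = list(product(range(2), repeat=2))   # single-symbol denoisers as tuples
rho = np.array([[sum(Pi[x, z] * Lam[x, s[z]] for z in range(2)) for s in S]
                for x in range(2)])
L = np.linalg.inv(Pi) @ rho

def best_rule(noisy_in_context):
    """arg min_s sum_{i: c_i = c} L(z_i, s) over all single-symbol denoisers."""
    totals = sum(L[z] for z in noisy_in_context)   # add row L[z, :] per occurrence
    return S[int(np.argmin(totals))]
```

With twelve 0's and one 1 sharing a context, the rule "always say 0" wins; with a balanced context, "say what you see" wins, which matches the intuition behind (2).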
Hence, the single-symbol denoiser that minimizes the sum of the estimated losses for each c (i.e., (5)) will also make the sum of the true losses small, which is the goal of a denoiser.

We can also express (5) using vector notation, which will become useful for deriving Neural DUDE in the next section. That is, let Δ^{|S|} be the probability simplex in R^{|S|}. (Suppose we have uniquely assigned each coordinate of R^{|S|} to each single-symbol denoiser in S from now on.) Then, we can define a probability vector for each c,

p̂(c) ≜ arg min_{p ∈ Δ^{|S|}} ( Σ_{i : c_i = c} 1_{z_i}^⊤ L ) p,  (6)

which will be on the vertex of Δ^{|S|} that corresponds to s_{k,DUDE}(c, ·) in (5), because the objective function in (6) is linear in p. Hence, we can simply obtain s_{k,DUDE}(c, ·) = arg max_s p̂(c)_s, where p̂(c)_s stands for the s-th coordinate of p̂(c).

4 Neural DUDE: A DNN based discrete denoiser

As seen in the previous section, DUDE utilizes the estimated loss matrix L, which does not depend on the clean sequence x^n. However, the main drawback of DUDE is that, as can be seen in (5), it treats each context c independently from the others. Namely, when the context size k grows, the number of different contexts |C_k| = |Z|^{2k} grows exponentially with k; hence, the sample size for each context, |{i : c_i = c}|, decreases exponentially for a given sequence length n. Such a phenomenon hinders the concentration of Σ_{i ∈ {i : c_i = c}} L(Z_i, s) mentioned in the previous section, which causes the performance of DUDE to deteriorate when k grows too large.

In order to resolve the above problem, we develop Neural DUDE, which adopts a single neural network such that the information from similar contexts can be shared via the network parameters. 
We note that our usage of DNN resembles that of the neural language model (NLM) [13], which improved upon the conventional N-gram models. The difference is that NLM is essentially a prediction problem; hence, the ground-truth label for supervised training is easily available, but in denoising, this is not the case. Before describing the algorithm in more detail, we need the following lemma.

4.1 A lemma

Let R₊^{|S|} be the space of all |S|-dimensional vectors whose elements are nonnegative. Then, for any g ∈ R₊^{|S|} and any p ∈ Δ^{|S|}, define a cost function C(g, p) ≜ −Σ_{i=1}^{|S|} g_i log p_i, i.e., a generalized cross-entropy function with the first argument not normalized to a probability vector. Note that C(g, p) is linear in g and convex in p. Now, the following lemma shows another way of obtaining DUDE.

Lemma 1 Define L_new ≜ −L + L_max 1 1^⊤, in which L_max ≜ max_{z,s} L(z, s) is the maximum element of L. Using the cost function C(·,·) defined above, for each c ∈ C_k, let us define

p*(c) ≜ arg min_{p ∈ Δ^{|S|}} Σ_{i : c_i = c} C( L_new^⊤ 1_{z_i}, p ).

Then, we have s_{k,DUDE}(c, ·) = arg max_s p*(c)_s.

Proof: The proof of the lemma is given in the Supplementary Material.

4.2 Neural DUDE

The main idea of Neural DUDE is to use a single neural network to learn the k-th order sliding window denoising rule for all c's. Namely, we define p(w, ·) : Z^{2k} → Δ^{|S|} as a feed-forward neural network that takes the context vector c ∈ C_k as input and outputs a probability vector on Δ^{|S|}. We let w stand for all the parameters in the network. The network architecture of p(w, ·) has a softmax output layer and is analogous to that used for multi-class classification. 
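Lemma 1's mechanism can be checked numerically (our sketch, binary BSC with Hamming loss): for fixed aggregated pseudo-labels g, the minimizer of the generalized cross-entropy over the simplex is p* = g / Σ_s g_s, and its argmax coincides with the DUDE choice arg min_s Σ_i L(z_i, s).

```python
import numpy as np
from itertools import product

# Our numerical check of Lemma 1 (binary BSC + Hamming loss).
delta = 0.1
Pi = np.array([[1 - delta, delta], [delta, 1 - delta]])
Lam = 1.0 - np.eye(2)
S = list(product(range(2), repeat=2))
rho = np.array([[sum(Pi[x, z] * Lam[x, s[z]] for z in range(2)) for s in S]
                for x in range(2)])
L = np.linalg.inv(Pi) @ rho
L_new = -L + L.max()                  # L_new = -L + L_max 1 1^T, entrywise >= 0

zs = [0] * 12 + [1]                   # noisy symbols sharing one context c
g = sum(L_new[z] for z in zs)         # aggregated pseudo-labels, in R_+^{|S|}

# For fixed g, min_p C(g, p) = -sum_s g_s log p_s over the simplex is
# attained at p* proportional to g (Gibbs' inequality).
p_star = g / g.sum()
s_lemma = S[int(np.argmax(p_star))]
s_dude = S[int(np.argmin(sum(L[z] for z in zs)))]
```

Since g = Σ_i (L_max − L(z_i, ·)) is a decreasing affine function of the summed estimated losses, maximizing p* is exactly minimizing (5).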
Thus, when the parameters are properly learned, we expect that p(w, c_i) will give predictions on which single-symbol denoiser to apply at location i with the context c_i.

4.2.1 Learning

When not resorting to the supervised learning framework, learning the network parameters w is not straightforward, as mentioned in the Introduction. However, inspired by Lemma 1, we define the objective function to minimize for learning w as

L(w, z^n) ≜ (1/n) Σ_{i=1}^n C( L_new^⊤ 1_{z_i}, p(w, c_i) ),  (7)

which resembles the widely used cross-entropy objective function in supervised multi-class classification. Namely, in (7), {(c_i, L_new^⊤ 1_{z_i})}_{i=1}^n, which depends solely on the noisy sequence z^n, can be analogously thought of as the input-label pairs in supervised learning. (Note that for i ≤ k and i ≥ n − k, dummy variables are padded for obtaining c_i.) But, unlike classification, in which the ground-truth label is given as a one-hot vector, we treat L_new^⊤ 1_{z_i} ∈ R₊^{|S|} as a target “pseudo-label” on S.

Once the objective function is set as in (7), we can use the widely adopted optimization techniques, namely, back-propagation and Stochastic Gradient Descent (SGD)-based methods, for learning the parameters w. In fact, most of the well-known improvements to the SGD method, such as momentum [14], mini-batch SGD, and several others [15, 16], can all be used for learning w. Note that there is no notion of generalization in our setting, since the goal of denoising is simply to achieve as small an average loss as possible on the given noisy sequence z^n, rather than to perform well on separate unseen test data. 
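A minimal sketch of minimizing (7) (our code): a linear softmax model p(w, ·), i.e., the 1-layer "Neural DUDE (1L)" variant, trained by plain batch gradient descent on the pseudo-labels g_i = L_new^⊤ 1_{z_i}. All variable names are ours; the paper itself uses Keras with Adam.

```python
import numpy as np
from itertools import product

delta = 0.1
Pi = np.array([[1 - delta, delta], [delta, 1 - delta]])
Lam = 1.0 - np.eye(2)
S = list(product(range(2), repeat=2))
rho = np.array([[sum(Pi[a, b] * Lam[a, s[b]] for b in range(2)) for s in S]
                for a in range(2)])
L = np.linalg.inv(Pi) @ rho
L_new = -L + L.max()

rng = np.random.default_rng(0)
n, k = 5000, 1
x = np.zeros(n, dtype=int)
for i in range(1, n):                          # binary symmetric Markov source
    x[i] = x[i - 1] ^ int(rng.random() < 0.1)
z = x ^ (rng.random(n) < delta).astype(int)    # BSC corruption

# One-hot encoding of the 2k context symbols, as in the paper's input layer.
feats = np.stack([np.concatenate([np.eye(2)[z[i - k:i]].ravel(),
                                  np.eye(2)[z[i + 1:i + k + 1]].ravel()])
                  for i in range(k, n - k)])
G = L_new[z[k:n - k]]                          # pseudo-labels, rows in R_+^{|S|}

def objective(W):
    a = feats @ W
    p = np.exp(a - a.max(axis=1, keepdims=True))
    p /= p.sum(axis=1, keepdims=True)          # softmax output p(w, c_i)
    return -(G * np.log(p)).sum(axis=1).mean(), p

W = np.zeros((feats.shape[1], len(S)))
loss_start, _ = objective(W)
for _ in range(200):                           # batch gradient descent on (7)
    _, p = objective(W)
    # d/d(logits) of C(g, softmax(a)) is (sum_s g_s) * p - g
    grad = feats.T @ (G.sum(axis=1, keepdims=True) * p - G) / len(feats)
    W -= 0.1 * grad
loss_end, _ = objective(W)
```

For this linear model the objective is convex in W, so the loss decreases monotonically for a small enough step size.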
Hence, we do not use any regularization techniques, such as dropout, in our learning, but simply try to minimize the objective function.

4.2.2 Denoising

After sufficient iterations of weight updates, the objective function (7) will converge, and we denote the converged parameters as w*. The Neural DUDE algorithm then applies the resulting network p(w*, ·) to the exact same noisy sequence z^n used for learning to denoise. Namely, for each c ∈ C_k, we obtain a single-symbol denoiser

s_{k,Neural DUDE}(c, ·) = arg max_s p(w*, c)_s  (8)

and the reconstruction at location i by X̂_{i,Neural DUDE}(z^n) = s_{k,Neural DUDE}(c_i, z_i).

From the objective function (7) and the definition (8), it is apparent that Neural DUDE does share information across different contexts, since w* is learned from all the data and shared across all contexts. Such a property enables Neural DUDE to run robustly with much larger k's than DUDE without running into the data sparsity problem. As shown in the experimental section, Neural DUDE with large k can significantly improve the denoising performance compared to DUDE. Furthermore, in the experimental section, we show that the concentration

(1/n) Σ_{i=1}^n L(Z_i, s_{k,Neural DUDE}(c_i, ·)) ≈ (1/n) Σ_{i=1}^n Λ(x_i, s_{k,Neural DUDE}(c_i, Z_i))  (9)

holds with high probability even for very large k's, whereas such concentration quickly breaks down for DUDE as k grows. While deferring the analysis of why such concentration always holds to future work, we can use this property to provide a systematic mechanism for choosing the best context size k for Neural DUDE: simply choose k* = arg min_k (1/n) Σ_{i=1}^n L(Z_i, s_{k,Neural DUDE}(c_i, ·)). As shown in the experiments, such a choice of k for Neural DUDE gives excellent denoising performance. Algorithm 1 summarizes the Neural DUDE algorithm.

Algorithm 1 Neural DUDE algorithm
Input: Noisy sequence z^n, Π, Λ, maximum context size k_max
Output: Denoised sequence X̂^n_{Neural DUDE} = {X̂_{i,Neural DUDE}(z^n)}_{i=1}^n
Compute L = Π^{−1} ρ as in Section 3.1 and L_new as in Lemma 1
for k = 1, ..., k_max do
  Initialize p(w, ·) with input dimension 2k|Z| (using one-hot encoding of each noisy symbol)
  Obtain w*_k minimizing L(w, z^n) in (7) using an SGD-like optimization method
  Obtain s_{k,Neural DUDE}(c, ·) for all c ∈ C_k as in (8) using w*_k
  Compute L_k ≜ (1/n) Σ_{i=1}^n L(z_i, s_{k,Neural DUDE}(c_i, ·))
end for
Get k* = arg min_k L_k and obtain X̂_{i,Neural DUDE}(z^n) = s_{k*,Neural DUDE}(c_i, z_i) for i = 1, ..., n

Remark: We note that using the cost function in (7) is important. That is, if we use a simpler objective like (5), i.e., (1/n) Σ_{i=1}^n (L^⊤ 1_{z_i})^⊤ p(w, c_i), it becomes highly non-convex in w, and the solution w* becomes very unstable. Moreover, using L_new instead of L in the cost function is important as well, since it guarantees that the cost function C(·,·) is always convex in the second argument.

5 Experimental results

In this section, we show the denoising results of Neural DUDE on synthetic binary data, real binary images, and real Oxford Nanopore MinION DNA sequence data. All of our experiments were done with Python 2.7 and the Keras package (http://keras.io) with the Theano [17] backend.

5.1 Synthetic binary data

We first experimented with simple synthetic binary data to highlight the core strength of Neural DUDE. 
That is, we assume X = Z = X̂ = {0, 1} and that Π is a binary symmetric channel (BSC) with crossover probability δ = 0.1. We set Λ as the Hamming loss. We generated the clean binary sequence x^n of length n = 10^6 from a binary symmetric Markov chain (BSMC) with transition probability α = 0.1. The noise-corrupted sequence z^n is generated by passing x^n through Π. Since we use the Hamming loss, the average loss of a denoiser X̂^n, (1/n) Σ_{i=1}^n Λ(x_i, X̂_i(z^n)), is equal to the bit error rate (BER). Note that in this setting, the noisy sequence z^n is a hidden Markov process. Therefore, when the stochastic model of the clean sequence is exactly known to the denoiser, the Viterbi-like Forward-Backward (FB) recursion algorithm can attain the optimum BER.

Figure 1: Denoising results of DUDE and Neural DUDE for the synthetic binary data with n = 10^6. (a) BER/δ vs. window size k; (b) DUDE; (c) Neural DUDE (4L).

Figure 1 shows the denoising results of DUDE and Neural DUDE, which know nothing about the characteristics of the clean sequence x^n. For DUDE, the window size k is the single hyperparameter to choose. For Neural DUDE, we used feed-forward fully connected neural networks for p(w, ·) and varied the depth of the network between 1 and 4 while also varying k. Neural DUDE (1L) corresponds to the simple linear softmax regression model. For the deeper models, we used 40 hidden nodes in each layer with Rectified Linear Unit (ReLU) activations. We used Adam [16] with the default setting in Keras as the optimizer to minimize (7). We used a mini-batch size of 100 and ran 10 epochs for learning. 
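The data generation just described can be sketched as follows (our code, reproducing the stated parameters α = 0.1 and δ = 0.1; with Hamming loss, the trivial "output z as-is" denoiser has average loss equal to its BER, which concentrates near δ):

```python
import numpy as np

# Our sketch of the Section 5.1 data generation: a binary symmetric Markov
# chain (transition prob alpha) corrupted by a BSC (crossover prob delta).
rng = np.random.default_rng(1)
alpha, delta, n = 0.1, 0.1, 100000
x = np.zeros(n, dtype=int)
for i in range(1, n):
    x[i] = x[i - 1] ^ int(rng.random() < alpha)   # flip the state w.p. alpha
z = x ^ (rng.random(n) < delta).astype(int)       # BSC corruption

ber_noisy = np.mean(x != z)          # BER of the "do nothing" denoiser
flip_rate = np.mean(x[1:] != x[:-1]) # empirical Markov transition rate
```

Any denoiser beating ber_noisy is extracting real structure; the FB recursion mentioned above gives the lower bound of 0.558δ in this setting.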
The performance of Neural DUDE was robust to the initialization of the parameters w.

Figure 1(a) shows the BERs of DUDE and Neural DUDE with respect to varying k. Firstly, we see that the minimum BERs of both DUDE and Neural DUDE (4L), i.e., 0.563δ with k = 5, get very close to the optimum BER (0.558δ) obtained by the Forward-Backward (FB) recursion. Secondly, we observe that Neural DUDE quickly approaches the optimum BER as we increase the depth of the network. This shows that as the discriminative power of the model increases with the depth of the network, p(w, ·) can successfully learn the denoising rule for each context c with a shared parameter w. Thirdly, we clearly see that, in contrast to the performance of DUDE being sensitive to k, that of Neural DUDE (4L) is robust to k thanks to sharing information across contexts. Such robustness with respect to k is obviously a very desirable property in practice.

Figure 1(b) and Figure 1(c) plot the average estimated BER, (1/n) Σ_{i=1}^n L(Z_i, s_k(c_i, ·)), against the true BER for DUDE and Neural DUDE (4L), respectively, to show the concentration phenomenon described in (9). From the figures, we can see that while the estimated BER drastically diverges from the true BER for DUDE as k increases, it strongly concentrates on the true BER for Neural DUDE (4L) for all k. This result suggests the concrete rule for selecting the best k described in Algorithm 1. 
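The estimated-vs-true BER comparison behind Figures 1(b) and 1(c) can be sketched numerically (our code; for simplicity we use a fixed single-symbol rule rather than a learned sliding-window rule): the average estimated loss is computable from the noisy data alone, yet tracks the true loss, which requires the clean data.

```python
import numpy as np
from itertools import product

# Our sketch of the concentration (9): estimated BER (from z only)
# vs. true BER (needs x), for the "say what you see" rule.
delta = 0.1
Pi = np.array([[1 - delta, delta], [delta, 1 - delta]])
Lam = 1.0 - np.eye(2)
S = list(product(range(2), repeat=2))
rho = np.array([[sum(Pi[x, z] * Lam[x, s[z]] for z in range(2)) for s in S]
                for x in range(2)])
L = np.linalg.inv(Pi) @ rho

rng = np.random.default_rng(2)
n = 200000
clean = rng.integers(0, 2, size=n)
noisy = clean ^ (rng.random(n) < delta).astype(int)

j = S.index((0, 1))                    # the "say what you see" rule
est_ber = np.mean(L[noisy, j])         # estimated BER: uses only the noisy data
true_ber = np.mean(Lam[clean, noisy])  # true BER: needs the clean data
```

The two averages agree to within sampling noise, which is exactly what makes the k-selection rule of Algorithm 1 usable without access to x^n.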
Such a rule is used for the experiments with real data in the following subsections.

5.2 Real binary image denoising

In this section, we experiment with real, binary image data. The settings of Π and Λ are identical to Section 5.1, while the clean sequence was generated by converting each image to a 1-D sequence via raster scanning. We tested with 5 representative binary images with various textual characteristics: Einstein, Lena, Barbara, Cameraman, and the scanned Shannon paper. The Einstein and Shannon images had a resolution of 256 × 256, and the rest had 512 × 512. For Neural DUDE, we tested a 4-layer model with 40 hidden nodes with ReLU activations in each layer.

Figure 2: Einstein image (256 × 256) denoising results with δ = 0.1. (a) Clean image; (b) BER results.

Figure 2(b) shows the result of denoising the Einstein image in Figure 2(a) for δ = 0.1. We see that the BER of Neural DUDE (4L) continues to drop as we increase k, whereas DUDE quickly fails to denoise for larger k's. Furthermore, we observe that the estimated BER of Neural DUDE (4L) again strongly correlates with the true BER. Note that when k = 36, there are 2^72 possible different contexts, which is far more than the number of pixels, 2^16 (256 × 256). However, we see that Neural DUDE can still learn a good denoising rule from that many different contexts by aggregating information from similar contexts.

δ     Scheme        Einstein    Lena        Barbara     Cameraman   Shannon
0.15  DUDE          0.578 (5)   0.494 (6)   0.492 (5)   0.298 (6)   0.498 (5)
0.15  Neural DUDE   0.384 (38)  0.405 (38)  0.448 (33)  0.264 (39)  0.410 (38)
0.15  Improvement   33.6%       18.0%       9.0%        11.5%       17.7%
0.1   DUDE          0.563 (5)   0.495 (6)   0.506 (6)   0.310 (5)   0.475 (5)
0.1   Neural DUDE   0.404 (36)  0.403 (38)  0.457 (27)  0.268 (35)  0.402 (35)
0.1   Improvement   28.2%       18.6%       9.7%        13.6%       15.4%

Table 1: BER results for binary images. 
Each number represents the relative BER compared to δ, and “Improvement” stands for the relative BER improvement of Neural DUDE (4L) over DUDE. The numbers inside parentheses are the k values achieving the result.

Table 1 summarizes the denoising results on the five binary images for δ = 0.1, 0.15. We see that Neural DUDE always significantly outperforms DUDE using a much larger context size k. We believe this is a significant result, since DUDE has been shown to outperform many state-of-the-art sliding window denoisers in practice, such as median filters [5, 1]. Furthermore, following DUDE's extension to grayscale image denoising [2], the result gives strong motivation for extending Neural DUDE to grayscale image denoising.

5.3 Nanopore DNA sequence denoising

We now go beyond binary data and apply Neural DUDE to DNA sequence denoising. As surveyed in [9], denoising DNA sequences is becoming increasingly important as sequencing devices are getting cheaper but inject more noise than before. For our experiment, we used simulated MinION Nanopore reads, which were generated as follows: we obtained 16S rDNA reference sequences for 20 species [18] and randomly generated noiseless template reads from them. The number of reads and the read length for each species were set identically to those of real MinION Nanopore reads [18]. Then, based on the Π of the MinION Nanopore sequencer (Figure 3(a)) obtained in [19] (with a 20.375% average error rate), we induced substitution errors in the reads and obtained the corresponding noisy reads. Note that we only consider substitution errors, while insertion/deletion errors also exist in real Nanopore sequenced data. 
The reason is that substitution errors can be directly handled by DUDE and Neural DUDE, so we focus on quantitatively evaluating the performance on those errors. We sequentially merged 2,372 reads from the 20 species and formed a 1-D sequence 2,469,111 base pairs long. We used two Neural DUDE (4L) models with 40 and 80 hidden nodes in each layer, denoted as (40-40-40) and (80-80-80), respectively.

Figure 3: Nanopore DNA sequence denoising results. (a) Π for the nanopore sequencer; (b) BER results.

Figure 3(b) shows the denoising results. We observe that Neural DUDE with large k's (around k = 100) can achieve less than half the error rate of DUDE. Furthermore, as the complexity of the model increases, the performance of Neural DUDE gets significantly better. We could not find a comparable baseline scheme, since most nanopore error correction tools, e.g., Nanocorr [20], do not produce read-by-read corrected sequences but instead return downstream analysis results after denoising. Coral [21], which gives read-by-read denoising results for Illumina data, completely failed on the nanopore data. Given that DUDE outperforms state-of-the-art schemes, including Coral, on Illumina sequenced data, as shown in [3], we expect the improvement of Neural DUDE over DUDE to translate into fruitful downstream analysis gains for nanopore data.

6 Concluding remark and future work

We showed that Neural DUDE significantly improves upon DUDE and has a systematic mechanism for choosing the best k. There are several future research directions. First, we plan to do thorough experiments on DNA sequence denoising and quantify the impact of Neural DUDE on the downstream analyses. Second, we plan to give theoretical analyses of the concentration (9) and justify the derived k selection rule. Third, extending the framework to deal with continuous-valued signals and finding a connection with the SURE principle [22] would be fruitful. 
Finally, applying recurrent neural networks (RNN) in place of DNNs could be another promising direction.

Acknowledgments

T. Moon was supported by the DGIST Faculty Start-up Fund (2016010060) and the Basic Science Research Program through the National Research Foundation of Korea (2016R1C1B2012170), both funded by the Ministry of Science, ICT and Future Planning. S. Min, B. Lee, and S. Yoon were supported in part by the Brain Korea 21 Plus Project (SNU ECE) in 2016.

References

[1] E. Ordentlich, G. Seroussi, S. Verdú, M. J. Weinberger, and T. Weissman. A universal discrete image denoiser and its application to binary images. In IEEE ICIP, 2003.

[2] G. Motta, E. Ordentlich, I. Ramirez, G. Seroussi, and M. J. Weinberger. The iDUDE framework for grayscale image denoising. IEEE Trans. Image Processing, 20:1–21, 2011.

[3] B. Lee, T. Moon, S. Yoon, and T. Weissman. DUDE-Seq: Fast, flexible, and robust denoising of nucleotide sequences. arXiv:1511.04836, 2016.

[4] E. Ordentlich, G. Seroussi, S. Verdú, and K. Viswanathan. Universal algorithms for channel decoding of uncompressed sources. IEEE Trans. Inform. Theory, 54(5):2243–2262, 2008.

[5] T. Weissman, E. Ordentlich, G. Seroussi, S. Verdú, and M. J. Weinberger. Universal discrete denoising: Known channel. IEEE Trans. Inform. Theory, 51(1):5–28, 2005.

[6] Y. LeCun, Y. Bengio, and G. Hinton. Deep learning. Nature, 521:436–444, 2015.

[7] H. Burger, C. Schuler, and S. Harmeling. Image denoising: Can plain neural networks compete with BM3D? In CVPR, 2012.

[8] J. Xie, L. Xu, and E. Chen. Image denoising and inpainting with deep neural networks. In NIPS, 2012.

[9] D. Laehnemann, A. Borkhardt, and A. C. McHardy.
Denoising DNA deep sequencing data – high-throughput sequencing errors and their corrections. Brief Bioinform, 17(1):154–179, 2016.

[10] V. Jain and H. S. Seung. Natural image denoising with convolutional networks. In NIPS, 2008.

[11] T. Moon and T. Weissman. Discrete denoising with shifts. IEEE Trans. Inform. Theory, 2009.

[12] T. Weissman, E. Ordentlich, M. Weinberger, A. Somekh-Baruch, and N. Merhav. Universal filtering via prediction. IEEE Trans. Inform. Theory, 53(4):1253–1264, 2007.

[13] Y. Bengio, R. Ducharme, P. Vincent, and C. Jauvin. A neural probabilistic language model. JMLR, 3:1137–1155, 2003.

[14] Y. Nesterov. A method of solving a convex programming problem with convergence rate O(1/k²). Soviet Mathematics Doklady, 27:372–376, 1983.

[15] T. Tieleman and G. Hinton. RMSProp: Divide the gradient by a running average of its recent magnitude. Lecture Note 6-5, University of Toronto, 2012.

[16] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In ICLR, 2015.

[17] F. Bastien, P. Lamblin, R. Pascanu, J. Bergstra, I. Goodfellow, A. Bergeron, N. Bouchard, D. Warde-Farley, and Y. Bengio. Theano: new features and speed improvements. In NIPS Workshop on Deep Learning and Unsupervised Feature Learning, 2012.

[18] A. Benitez-Paez, K. Portune, and Y. Sanz. Species-level resolution of 16S rRNA gene amplicons sequenced through the MinION portable nanopore sequencer. bioRxiv:021758, 2015.

[19] M. Jain, I. Fiddes, K. Miga, H. Olsen, B. Paten, and M. Akeson. Improved data analysis for the MinION nanopore sequencer. Nature Methods, 12:351–356, 2015.

[20] S. Goodwin, J. Gurtowski, S. Ethe-Sayers, P. Deshpande, M. Schatz, and W. R. McCombie. Oxford Nanopore sequencing, hybrid error correction, and de novo assembly of a eukaryotic genome. Genome Res., 2015.

[21] L. Salmela and J. Schröder.
Correcting errors in short reads by multiple alignments. Bioinformatics, 27(11):1455–1461, 2011.

[22] C. Stein. Estimation of the mean of a multivariate normal distribution. The Annals of Statistics, 9(6):1135–1151, 1981.