{"title": "Non-Parametric Bayesian Dictionary Learning for Sparse Image Representations", "book": "Advances in Neural Information Processing Systems", "page_first": 2295, "page_last": 2303, "abstract": "Non-parametric Bayesian techniques are considered for learning dictionaries for sparse image representations, with applications in denoising, inpainting and compressive sensing (CS). The beta process is employed as a prior for learning the dictionary, and this non-parametric method naturally infers an appropriate dictionary size. The Dirichlet process and a probit stick-breaking process are also considered to exploit structure within an image. The proposed method can learn a sparse dictionary in situ; training images may be exploited if available, but they are not required. Further, the noise variance need not be known, and can be non-stationary. Another virtue of the proposed method is that sequential inference can be readily employed, thereby allowing scaling to large images. Several example results are presented, using both Gibbs and variational Bayesian inference, with comparisons to other state-of-the-art approaches.", "full_text": "Non-Parametric Bayesian Dictionary Learning for\n\nSparse Image Representations\n\nMingyuan Zhou, Haojun Chen, John Paisley, Lu Ren, 1Guillermo Sapiro and Lawrence Carin\n\nDepartment of Electrical and Computer Engineering\nDuke University, Durham, NC 27708-0291, USA\n\n1Department of Electrical and Computer Engineering\nUniversity of Minnesota, Minneapolis, MN 55455, USA\n\n{mz1,hc44,jwp4,lr,lcarin}@ee.duke.edu, {guille}@umn.edu\n\nAbstract\n\nNon-parametric Bayesian techniques are considered for learning dictionaries for\nsparse image representations, with applications in denoising, inpainting and com-\npressive sensing (CS). The beta process is employed as a prior for learning the\ndictionary, and this non-parametric method naturally infers an appropriate dic-\ntionary size. 
The Dirichlet process and a probit stick-breaking process are also considered to exploit structure within an image. The proposed method can learn a sparse dictionary in situ; training images may be exploited if available, but they are not required. Further, the noise variance need not be known, and can be non-stationary. Another virtue of the proposed method is that sequential inference can be readily employed, thereby allowing scaling to large images. Several example results are presented, using both Gibbs and variational Bayesian inference, with comparisons to other state-of-the-art approaches.\n\n1 Introduction\nThere has been significant recent interest in sparse signal expansions in several settings. For example, algorithms such as the support vector machine (SVM) [1], the relevance vector machine (RVM) [2], Lasso [3] and many others have been developed for sparse regression (and classification). A sparse representation has several advantages, including the fact that it encourages a simple model, so over-training is often avoided. The inferred sparse coefficients also often have biological/physical meaning, of interest for model interpretation [4].\nOf relevance for the current paper, there has recently been significant interest in sparse representations in the context of denoising, inpainting [5–10], compressive sensing (CS) [11, 12], and classification [13]. All of these applications exploit the fact that most images may be sparsely represented in an appropriate dictionary. Most of the CS literature assumes “off-the-shelf” wavelet and DCT bases/dictionaries [14], but recent denoising and inpainting research has demonstrated the significant advantages of learning an often over-complete dictionary matched to the signals of interest (e.g., images) [5–10, 12, 15]. 
The purpose of this paper is to perform dictionary learning using new non-parametric Bayesian technology [16, 17] that offers several advantages not found in earlier approaches, which have generally sought point estimates.\nThis paper makes four main contributions:\n• The dictionary is learned using a beta process construction [16, 17], and therefore the number of dictionary elements and their relative importance may be inferred non-parametrically.\n• For the denoising and inpainting applications, we do not have to assume a priori knowledge of the noise variance (it is inferred within the inversion). The noise variance can also be non-stationary.\n• The spatial inter-relationships between different components in images are exploited by use of the Dirichlet process [18] and a probit stick-breaking process [19].\n• Using learned dictionaries, inferred off-line or in situ, the proposed approach yields CS performance that is markedly better than existing standard CS methods as applied to imagery.\n\n2 Dictionary Learning with a Beta Process\nIn traditional sparse coding tasks, one considers a signal x ∈ ℝ^n. The beta process with parameters a > 0 and b > 0, and base measure H0, is represented as BP(a, b, H0), and a draw H ∼ BP(a, b, H0) may be represented as\n\nH(ψ) = ∑_{k=1}^{K} π_k δ_{ψ_k}(ψ),  π_k ∼ Beta(a/K, b(K − 1)/K),  ψ_k ∼ H0,   (1)\n\nwith this a valid measure as K → ∞. The expression δ_{ψ_k}(ψ) equals one if ψ = ψ_k and is zero otherwise. Therefore, H(ψ) represents a vector of K probabilities, each associated with a respective atom ψ_k. In the limit K → ∞, H(ψ) corresponds to an infinite-dimensional vector of probabilities, and each probability has an associated atom ψ_k drawn i.i.d. 
from H0.\nUsing H(ψ), we may now draw N binary vectors, the ith of which is denoted z_i ∈ {0, 1}^K, with the kth component of z_i drawn z_ik ∼ Bernoulli(π_k). These N binary column vectors constitute a matrix Z ∈ {0, 1}^{K×N}, with the ith column corresponding to z_i; the kth row of Z is associated with atom ψ_k, drawn as discussed above. For our problem the atoms ψ_k ∈ ℝ^n correspond to candidate dictionary elements.\nFollowing [9], we may define a linear or bilinear classifier based on the sparse weights α and the associated data x (in the bilinear case), here implemented in the form of a probit classifier. We focus on the linear model, as it is simpler (has fewer parameters), and the results in [9] demonstrated that it was often as good as or better than the bilinear classifier. To account for classification, the model in (2) remains unchanged, and the following may be added to the top of the hierarchy: y_i = 1 if θ^T α̂ + ν > 0, y_i = 2 if θ^T α̂ + ν < 0, θ ∼ N(0, γ_0^{-1}), where α̂ ∈
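As a concrete illustration (not the authors' code), the finite beta–Bernoulli construction of equation (1), together with the Bernoulli draws that form the binary matrix Z, can be sketched in NumPy. The hyperparameters a = b = 1 and the sizes K and N below are arbitrary choices for the sketch:

```python
import numpy as np

def beta_bernoulli_draw(K=256, N=100, a=1.0, b=1.0, seed=0):
    """Finite-K approximation to a beta process draw, plus the
    Bernoulli draws of atom usage for N signals (illustrative values)."""
    rng = np.random.default_rng(seed)
    # pi_k ~ Beta(a/K, b(K-1)/K): probability that atom psi_k is used
    pi = rng.beta(a / K, b * (K - 1) / K, size=K)
    # z_ik ~ Bernoulli(pi_k): kth row of Z is tied to atom psi_k,
    # ith column indicates which atoms signal i uses
    Z = (rng.random((K, N)) < pi[:, None]).astype(np.uint8)
    return pi, Z

pi, Z = beta_bernoulli_draw()
```

Under this parameterization E[π_k] = a/(a + b(K − 1)), so the expected number of active atoms per signal is Ka/(a + b(K − 1)), which tends to a/b as K → ∞; this is the mechanism by which the beta process keeps Z sparse and effectively infers the dictionary size.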