{"title": "A Bridging Framework for Model Optimization and Deep Propagation", "book": "Advances in Neural Information Processing Systems", "page_first": 4318, "page_last": 4327, "abstract": "Optimizing task-related mathematical models is one of the most fundamental methodologies in statistics and learning. However, generally designed schematic iterations may struggle to capture complex data distributions in real-world applications. Recently, training deep propagations (i.e., networks) has achieved promising performance on some particular tasks. Unfortunately, existing networks are often built in a heuristic manner and thus lack principled interpretations and solid theoretical support. In this work, we provide a new paradigm, named Propagation and Optimization based Deep Model (PODM), to bridge the gap between these different mechanisms (i.e., model optimization and deep propagation). On the one hand, we utilize PODM as a deeply trained solver for model optimization. Different from existing network based iterations, which often lack theoretical investigation, we provide a strict convergence analysis for PODM in the challenging nonconvex and nonsmooth scenarios. On the other hand, by relaxing the model constraints and performing end-to-end training, we also develop a PODM based strategy to integrate domain knowledge (formulated as models) and real data distributions (learned by networks), resulting in a generic ensemble framework for challenging real-world applications. 
Extensive experiments verify our theoretical results and demonstrate the superiority of PODM over state-of-the-art approaches.", "full_text": "A Bridging Framework for Model Optimization and Deep Propagation\n\nRisheng Liu1,2\u2217, Shichao Cheng3, Xiaokun Liu1, Long Ma1, Xin Fan1,2, Zhongxuan Luo2,3\n1International School of Information Science & Engineering, Dalian University of Technology\n2Key Laboratory for Ubiquitous Network and Service Software of Liaoning Province\n3School of Mathematical Science, Dalian University of Technology\n\nAbstract\n\nOptimizing task-related mathematical models is one of the most fundamental methodologies in statistics and learning. However, generally designed schematic iterations may struggle to capture complex data distributions in real-world applications. Recently, training deep propagations (i.e., networks) has achieved promising performance on some particular tasks. Unfortunately, existing networks are often built in a heuristic manner and thus lack principled interpretations and solid theoretical support. In this work, we provide a new paradigm, named Propagation and Optimization based Deep Model (PODM), to bridge the gap between these different mechanisms (i.e., model optimization and deep propagation). On the one hand, we utilize PODM as a deeply trained solver for model optimization. Different from existing network based iterations, which often lack theoretical investigation, we provide a strict convergence analysis for PODM in the challenging nonconvex and nonsmooth scenarios. On the other hand, by relaxing the model constraints and performing end-to-end training, we also develop a PODM based strategy to integrate domain knowledge (formulated as models) and real data distributions (learned by networks), resulting in a generic ensemble framework for challenging real-world applications. 
Extensive experiments verify our theoretical results and demonstrate the superiority of PODM over state-of-the-art approaches.\n\n\u2217Corresponding Author. Correspondence to <rsliu@dlut.edu.cn>.\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n1 Introduction\n\nIn the last several decades, many machine learning and computer vision tasks have been formulated as problems of solving mathematically designed optimization models. Indeed, these models are the workhorse of learning, vision and power in most practical algorithms. However, it is actually hard to obtain a theoretically efficient formulation that handles the complex data distributions in different practical problems. Moreover, generally designed optimization models [3, 5] may lack the flexibility and robustness needed to handle the severe corruptions and errors that commonly exist in real-world scenarios.\n\nIn recent years, a variety of deep neural networks (DNNs) have been established and trained in an end-to-end manner for different learning and vision problems. For example, AlexNet [13] first demonstrated the advantages of DNNs in the ImageNet large scale visual recognition challenge. With a careful design, [29] proposed GoogLeNet, which increased the depth and width of the network while keeping the computational budget constant. However, some researchers also found that although increasing the number of layers may improve the performance, it is more difficult to train a deeper network. By introducing shortcut blocks, [10] proposed the well-known residual network. It has been verified that the residual structure can successfully avoid gradient vanishing problems and thus significantly improve the practical training performance of deeper networks. Besides the great success of DNNs in supervised learning, some efforts have also been made on unsupervised learning tasks. 
[8] proposed the generative adversarial network, in which a generator network and a discriminator network contest with each other in a zero-sum game framework to generate realistic samples. Despite relatively good performance on specific applications, interpretability is still a big problem for existing DNNs. That is, it is challenging to reason about what a DNN model actually does due to its opaque, black-box nature.\n\nEmbedding DNNs into the optimization process has recently become popular, and some preliminary works have been developed from different perspectives. For example, [9] trained a feed-forward architecture to speed up sparse coding. [1] introduced deep transformations to address correlation analysis on multi-view data. Very recently, to better address true image degradation, [7, 33, 30] incorporated convolutional DNNs as image priors into the maximum a posteriori inference process for image restoration. Another group of recent works also tried to utilize recurrent neural network (RNN) structures [2] and/or reinforcement strategies [17] to directly learn descent iterations for different learning tasks. It should be pointed out that convergence should be the core concern in optimization algorithm design. Unfortunately, even with relatively good practical performance on some applications, it is still challenging to provide a strict convergence analysis for these deeply trained iterations.\n\n1.1 Our Contributions\n\nAs discussed above, interpretability and guarantees are the most important missing cornerstones of previous experience based networks. Some preliminary investigations have been proposed to combine numerical iterations and learnable architectures for deep propagation design. 
However, due to their naive combination strategies (e.g., directly replacing iterations by architectures), it is still challenging to provide a strict convergence analysis for the resulting deep models. To partially break through these limitations, this paper proposes a theoretically guaranteed paradigm, named Propagation and Optimization based Deep Model (PODM), which incorporates knowledge-driven schematic iterations and data-dependent network architectures to address both model optimization and learning tasks. On the one hand, PODM provides a learnable (i.e., data-dependent) numerical solver (see Fig. 1). Compared with naive unrolling based methods (e.g., [7, 33, 30, 17, 2, 20]), the main advantage of PODM is that it generates iterations that strictly converge to a critical point of the given optimization model, even in complex nonconvex and nonsmooth scenarios. On the other hand, by slightly relaxing the exact optimality constraints during propagation, we also obtain an interpretable framework that integrates mathematical principles (i.e., formulated as model based building blocks) and task experience (i.e., network structures designed in a heuristic manner) for collaborative end-to-end learning.\n\nIn summary, the contributions of this paper mainly include:\n\n\u2022 We provide a model-inspired paradigm to establish building-block modules for deep model design. 
Different from existing trainable iteration methods, in which the architectures are built either from specific prior formulations (e.g., Markov random fields [24]) or in a completely heuristic manner (e.g., replacing the original priors by experience based networks [7, 33]), we develop a flexible framework to integrate both data (investigated from the training set) and knowledge (incorporated into principled priors) for deep propagation construction.\n\n\u2022 By introducing an optimality error checking condition together with a proximal feedback mechanism, we prove in theory that the propagation generated by PODM is globally convergent (see footnote 2) to a critical point of the given optimization model. This strict convergence guarantee is the main advantage over existing deep iterations designed in a heuristic manner (e.g., [7, 33, 30, 17, 2]).\n\n\u2022 As a nontrivial byproduct, the relaxed PODM provides a plug-and-play, collaborative, interpretable, and end-to-end deep learning framework for real-world complex tasks. Extensive experimental results on real-world image restoration applications demonstrate the effectiveness of our PODM and its relaxed extension.\n\nFootnote 2: Here \u201cglobally\u201d indicates that we generate a Cauchy sequence, thus the whole sequence is convergent.\n\nFigure 1: Illustrating the mechanism of PODM for nonconvex model optimization.\n\n2 Existing Trainable Iterations: Lack Generalizations and Guarantees\n\nWe review existing training based iterative methods for model optimization. Specifically, most learning and vision tasks can be formulated as the following regularized optimization model:\n\n$$\\min_x \\Phi(x) := f(x) + g(x), \\qquad (1)$$\n\nwhere f denotes the loss term and g is related to the regularization term. Different from classical numerical solvers, which design their iterations purely based on mathematical derivations, recent studies try to establish their optimization process by training iterative architectures on collected training data. These existing works can be roughly divided into two categories: trainable priors and network based iterations.\n\nThe first category of methods introduces hyperparameters for specific prior formulations (e.g., the $\\ell_1$-norm and RTF) and then unrolls the resulting updating schemes to obtain trainable iterations for Eq. (1). For example, the works in [9, 27] parameterize the $\\ell_1$ regularizer and adopt classical first-order methods to derive their final iterations. The main limitation of these approaches is that their schemes are established based on specific forms of priors and thus cannot be applied to general learning/vision problems. Even worse, the hyperparameters in these approaches (e.g., trade-off or combination weights) are too simple to capture complex data distributions.\n\nOn the other hand, the works in [7, 33] try to directly replace the prior-related numerical computations at each iteration by experientially designed network architectures. In this way, these approaches completely discard the explicit regularization g in their updating schemes. Very recently, recurrent [2], unrolling [19] and reinforcement [17] learning strategies have also been introduced to train network based iterations for model optimization. Since these approaches completely discard the original regularization (i.e., g), no prior knowledge can be enforced in their iterations. More importantly, we must emphasize that due to these embedded inexact computations, it is challenging to provide a strict convergence analysis for most of the above mentioned trainable iterations.\n\n3 Our Model Inspired Building-Blocks\n\nIn this section, we establish two fundamental iterative modules as our trainable architectures for both model optimization and deep propagation. 
Specifically, the deep propagation module is designed as our generic architecture to incorporate domain knowledge into trainable propagations, while the optimality module enforces feedback control to guide the iterations to satisfy our optimality errors. The mechanism of PODM is briefly illustrated in Fig. 1.\n\nSuppose A is the given network architecture (possibly built in a heuristic manner) and denote its output as $x_A = A(x; \\theta_A)$. We would like to design our propagation module based on both A and $\\Phi$ defined in Eq. (1). Specifically, rather than parameterizing g or completely replacing it by networks as in existing works, we integrate these two parts by the following quadratic penalized energy:\n\n$$\\min_x \\Phi(x) + d(x, x_A) - \\langle x, \\epsilon \\rangle = \\min_x \\overbrace{\\underbrace{f(x)}_{\\text{Fidelity}} + \\underbrace{g(x)}_{\\text{Designed prior}}}^{\\text{Knowledge}} + \\overbrace{\\underbrace{d(x, x_A)}_{\\text{Learned prior}}}^{\\text{Data}} - \\underbrace{\\langle x, \\epsilon \\rangle}_{\\text{Error}}. \\qquad (2)$$\n\nHere, $d(x, x_A)$ is the distance function, which is intended to introduce the output of the network into the propagation module. It can be defined as $d(x, x_A) = h(x) - h(x_A) - \\langle \\nabla h(x_A), x - x_A \\rangle$, where $h(x) = \\|x\\|_H^2$ and H denotes a symmetric matrix (see footnote 3). $\\epsilon$ denotes the error with respect to Eq. (1) caused by introducing the network. Both the designed prior $g(x)$ and the learned prior $d(x, x_A)$ are merged to compose our hybrid priors. Please notice that Eq. (2) can also be understood as a hybrid prior based inexact approximation of Eq. 
(1), in which we establish an ensemble of both domain knowledge (i.e., g) and training data (i.e., A). Indeed, we can control the inexactness of the solution by evaluating a specific function of $\\epsilon$.\n\nPropagation Module: We first investigate the following sub-model of Eq. (2) (i.e., only with the fidelity and learned priors):\n\n$$x_F = F(x_A; \\theta_H) := \\arg\\min_x \\{ f(x) + d(x, x_A) \\}, \\qquad (3)$$\n\nwhere $\\theta_H$ denotes the parameter in the distance $d(x, x_A)$. Eq. (3) integrates the principled model fidelity (i.e., f) and the network based priors (i.e., A). Following this formulation, we can define our data-dependent propagation module (P) as the cascade of A and F in the l-th stage, i.e.,\n\n$$\\tilde{x}^l = P(x^{l-1}; \\vartheta^l) := F\\big(A(x^{l-1}; \\theta_A^l); \\theta_H^l\\big),$$\n\nwhere $\\vartheta^l = \\{\\theta_A^l, \\theta_H^l\\}$ is the set of trainable parameters.\n\nOptimality Module: Due to the inexactness of these learning based architectures, the propagation module inevitably introduces errors when optimizing Eq. (1). To effectively control these iterative errors and generate strictly convergent propagations, we recall the designed prior g and assume $x_G^l$ is one solution of Eq. (2) in the l-th stage, i.e.,\n\n$$x_G^l \\in G(\\tilde{x}^l) := \\arg\\min_x f(x) + g(x) + d(x, x_A^l) - \\langle x, \\epsilon^l \\rangle. \\qquad (4)$$\n\nThe error $\\epsilon^l$ has the specific form $\\epsilon^l = \\nabla f(x_G^l) + \\nabla d(x_G^l, x_A^l) + u_{x_G^l}$, obtained from the first-order optimality condition of Eq. (4). Here $u_{x_G^l} \\in \\partial g(x_G^l)$ is a limiting Fr\u00e9chet subgradient of g. Intuitively, it is necessary to introduce some criterion on $\\epsilon^l$ to check whether the current propagation satisfies the desired convergence behavior. 
Fortunately, we can demonstrate that the convergence of our deep propagations can be guaranteed by the following optimality error:\n\n$$\\|\\psi(\\epsilon^l)\\| \\leq c^l \\|x_G^l - x^{l-1}\\|. \\qquad (5)$$\n\nHere $\\psi(\\epsilon^l) = \\epsilon^l + \\mu^l (x_G^l - x^{l-1})/2 - H(x^{l-1} + x_G^l - 2 x_A^l)$ is the error function and $c^l$ is a positive constant that reflects our tolerance of the inexactness at the l-th stage.\n\nTherefore, as stated in the following Eq. (6), we adopt $x_G^l$ as the output of our optimality module in the l-th stage if the criterion in Eq. (5) is satisfied. Otherwise, we return to the previous stage and adopt a standard proximal gradient update (i.e., feedback) to correct the propagation:\n\n$$O(\\tilde{x}^l, x^{l-1}; \\gamma^l) := \\begin{cases} G(\\tilde{x}^l) & \\text{if Eq. (5) is satisfied,} \\\\ \\mathrm{prox}_{\\gamma^l g}\\big(x^{l-1} - \\gamma^l \\nabla f(x^{l-1})\\big) & \\text{otherwise.} \\end{cases} \\qquad (6)$$\n\nIn this way, our optimality module provides a mechanism with a proximal operator to guide the propagations toward convergence.\n\nNotice that both $x_G^l$ and $\\epsilon^l$ are abstract in the above optimality module. In practice, temporarily ignoring the learned prior and the error in Eq. (2), we can provide a practical computable form of $x_G^l$ by solving the model with the traditionally designed prior in Eq. (2) (i.e., Eq. (1), only with the fidelity and designed priors) with a momentum proximal mechanism as follows,\n\n$$x_G^l \\in \\mathrm{prox}_{\\gamma^l g}\\big(\\tilde{x}^l - \\gamma^l (\\nabla f(\\tilde{x}^l) + \\mu^l (\\tilde{x}^l - x^{l-1}))\\big), \\qquad (7)$$\n\nwhere $\\mu^l$ is the trade-off parameter and $\\gamma^l$ denotes the step size. On the other hand, the update of $x_G^l$ can also be reformulated (see footnote 4) as $x_G^l \\in \\mathrm{prox}_{\\gamma^l g}\\big(x_G^l - \\gamma^l (\\nabla f(x_G^l) + \\mu^l (x_G^l - x_A^l)) + \\gamma^l \\epsilon^l\\big)$. Thus, combining it with Eq. (7), we can obtain a practically computable formulation of the error function $\\psi(\\epsilon^l)$ appearing in the optimality error as\n\n$$\\psi(\\epsilon^l) = \\frac{1}{\\gamma^l} (\\tilde{x}^l - x_G^l) - \\frac{\\mu^l}{2} (2\\tilde{x}^l - x_G^l - x^{l-1}) + H(x_G^l - x^{l-1}) + \\nabla f(x_G^l) - \\nabla f(\\tilde{x}^l).$$\n\nFootnote 3: $d(x, x_A)$ is a special Bregman distance. H can be an assigned or learnable matrix. The distance $d(x, x_A) = \\mu \\|x - x_A\\|^2$ if $H = \\mu I$. We will illustrate it in detail on specific applications in the experiments.\n\nFootnote 4: The detailed deductions on this equality reformulation can be found in the supplementary materials.\n\n4 Propagation and Optimization based Deep Model\n\nBased on the above building-block modules, we are ready to introduce the Propagation and Optimization based Deep Model (PODM). We first show how to apply PODM to perform fast and accurate model optimization and analyze its convergence behavior in nonconvex and nonsmooth scenarios. Then we discuss how to establish an end-to-end PODM with relaxed optimality error to perform practical ensemble learning for challenging real-world applications.\n\n4.1 PODM: A Deeply Trained Nonconvex Solver with Strict Convergence Guarantee\n\nWe demonstrate how to apply PODM for fast and accurate nonconvex optimization. It should be emphasized that, different from most existing trainable iteration methods, which either incorporate networks into the iterations in a heuristic manner (e.g., [33, 7]) or directly estimate data-dependent descent directions using networks (e.g., [17, 2]), PODM provides a mechanism with an optimality error to control the training based propagations. As stated in the following, the main advantage of PODM is that the convergence of our iterations can be strictly guaranteed, while no theoretical guarantees are provided for the above mentioned experientially trained iterations.\n\nPODM for Nonconvex and Nonsmooth Optimization: We first illustrate the mechanism of PODM in Fig. 1. 
It can be seen that PODM consists of two fundamental modules, i.e., the experientially designed (trainable) propagation module P and the theoretically designed optimality module O. It should be pointed out that, to guarantee theoretical convergence, only the parameters $\\theta_A^l$ are learned when considering PODM as an accurate numerical solver (see footnote 5).\n\nConvergence Behavior Analysis: Before providing our main theory, we state some assumptions on the functions appearing in our optimization model. Specifically, we assume that f is Lipschitz smooth, g is proper and lower semi-continuous, d is differentiable, and $\\Phi$ is coercive (see footnote 6). All of these assumptions are fairly loose in most model optimization tasks.\n\nProposition 1. Suppose that the optimality error in Eq. (5) (i.e., $\\|\\psi(\\epsilon^l)\\| \\leq c^l \\|x_G^l - x^{l-1}\\|$) is satisfied, then we have $\\Phi(x_G^l) \\leq \\Phi(x^{l-1}) - \\big(\\mu^l/4 - (c^l)^2/\\mu^l\\big) \\|x_G^l - x^{l-1}\\|^2$. In contrast, if the inequality in Eq. (5) is not satisfied and thus the variable is updated by $x^l = \\mathrm{prox}_{\\gamma^l g}(x^{l-1} - \\gamma^l \\nabla f(x^{l-1}))$, then we have $\\Phi(x^l) \\leq \\Phi(x^{l-1}) - \\big(1/(2\\gamma^l) - L_f/2\\big) \\|x^l - x^{l-1}\\|^2$, where $L_f$ is the Lipschitz modulus of $\\nabla f(x)$.\n\nProposition 1 provides a descent property for PODM on the variational energy $\\Phi(x)$ during iterations. That is, it is easy to obtain a non-increasing sequence $\\{\\Phi(x^l)\\}_{l \\in \\mathbb{N}}$, which results in a limit value $\\Phi^*$ such that $\\lim_{l \\to \\infty} \\Phi(x^l) = \\Phi^* < \\infty$. Moreover, if $\\{x^l\\}_{l \\in \\mathbb{N}}$ is bounded, there exists a convergent subsequence such that $\\lim_{j \\to \\infty} x^{l_j} = x^*$, where $\\{l_j\\} \\subset \\{l\\} \\subset \\mathbb{N}$. Then from the conclusions in Proposition 1, we also have that the sum of $\\|x^l - x^{l-1}\\|^2$ from $l = 1$ to $l \\to \\infty$ is bounded. 
Thus we can further prove the following proposition.\n\nProposition 2. Suppose $x^*$ is any accumulation point of the sequence $\\{x^l\\}_{l \\in \\mathbb{N}}$ generated by PODM, then there exists a subsequence $\\{x^{l_j}\\}_{j \\in \\mathbb{N}}$ such that $\\lim_{j \\to \\infty} x^{l_j} = x^*$ and $\\lim_{j \\to \\infty} \\Phi(x^{l_j}) = \\Phi(x^*)$.\n\nBased on the above results, we are ready to establish the convergence results for PODM when considering it as a numerical solver for nonconvex model optimization.\n\nTheorem 1. (Convergence to a Critical Point of Eq. (1)) Suppose f is proper and Lipschitz smooth, g is proper and lower semi-continuous, and $\\Phi$ is coercive. Then the output of PODM (i.e., $\\{x^l\\}_{l \\in \\mathbb{N}}$) satisfies: 1. The set of limit points of $\\{x^l\\}_{l \\in \\mathbb{N}}$ (denoted as $\\Omega$) is a compact set; 2. All elements of $\\Omega$ are critical points of $\\Phi$; 3. If $\\Phi$ is a semi-algebraic function, $\\{x^l\\}_{l \\in \\mathbb{N}}$ converges to a critical point of $\\Phi$.\n\nIn summary, we prove that PODM provides a novel strategy to iteratively guide the propagations of deep networks toward a critical point of the given nonconvex optimization model, leading to a fast and accurate numerical solver.\n\nFootnote 5: Notice that both $\\theta_A^l$ and $\\theta_H^l$ are learnable in the Relaxed PODM, which will be introduced in Section 4.2. The algorithms of PODM and Relaxed PODM are presented in the supplementary materials.\n\nFootnote 6: Due to the space limit, we omit the details of these definitions and all proofs of the following propositions and theorem. 
The detailed version is presented in the supplementary materials.\n\n4.2 Relaxed PODM: An End-to-end Collaborative Learning Framework\n\nIt is shown in the above subsection that by enforcing a carefully designed optimality error and greedily training the networks, we can obtain a theoretically convergent solver for nonconvex optimization. However, it is indeed challenging to utilize strict mathematical models to exactly formulate the complex data distributions in real-world applications. Therefore, in this subsection, we relax the theoretical constraint and develop a novel end-to-end learning framework to address real-world tasks. In particular, rather than only training the parameters $\\theta_A^l$ in the given network A, we also introduce flexible networks to learn the parameters $\\theta_H^l$ in F at each layer. Therefore, at the l-th layer, we have two groups of learnable parameters, namely $\\theta_A^l$ for A and $\\theta_H^l$ for F. The forward propagation of the so-called Relaxed PODM (RPODM) at each stage can be summarized as\n\n$$x^l = G\\big(F\\big(A(x^{l-1}; \\theta_A^l); \\theta_H^l\\big)\\big).$$\n\nWe would like to argue that RPODM provides a way to train the network structure using both domain knowledge and training data, thus resulting in a collaborative learning framework.\n\n5 Experimental Results\n\nWe first analyze the convergence behavior of PODM by applying it to solve the widely used nonconvex $\\ell_p$-regularized sparse coding problem. Then, we evaluate the performance of our Relaxed PODM on practical image restoration applications. 
All the experiments are conducted on a PC with an Intel Core i7 CPU @ 3.6 GHz, 32 GB RAM and an NVIDIA GeForce GTX 1060 GPU.\n\n5.1 PODM for $\\ell_p$-regularized Sparse Coding\n\nNow we consider the nonconvex $\\ell_p$-regularized ($0 < p < 1$) sparse coding model $\\min_\\alpha \\|D\\alpha - o\\|^2 + \\lambda \\|\\alpha\\|_p^p$, which has been widely used for synthetic image modeling [18, 22], subspace clustering [21] and motion segmentation [31], etc. Here $\\lambda$ is the regularization parameter, and o, D and $\\alpha$ denote the observed signal, a given dictionary, and the corresponding sparse codes, respectively. In our experiments, we formulate D as the multiplication of the down-sampling operator and the inverse of a three-stage Haar wavelet transform [3], which results in a sparse coding based single image super-resolution formulation. We adopt the $\\ell_{0.8}$-norm to enforce the sparsity constraint. As for PODM, we define $H = \\mu I/2$ with $\\mu = 10^{-2}$ in the distance function d (i.e., $d(x, x_A) = \\mu \\|x - x_A\\|^2/2$) to establish the propagation and optimality modules. For fair comparison, we adopt the network architecture used in existing works (i.e., IRCNN [33]) as A for PODM.\n\nTo verify the convergence properties of our framework, we plot the iteration behavior of PODM on example images from the commonly used \u201cSet5\u201d super-resolution benchmark [4] and compare it with the most popular numerical solvers (e.g., FISTA [3]) and recently proposed representative network based iteration methods (e.g., IRCNN [33]). Fig. 2 shows the curves of the relative error (i.e., $\\log_{10}(\\|x^{l+1} - x^l\\| / \\|x^l\\|)$), the reconstruction error (i.e., $\\|x^l - x_{gt}\\| / \\|x_{gt}\\|$), the structural similarity (SSIM) and our optimality error defined in Eq. (5). 
It can be observed that the relative error curve (subfigure (a)) of IRCNN keeps oscillating and does not converge even after 200 stages. This is mainly due to its naive network embedding strategy. Although its relative error is somewhat smoother, FISTA is much slower than our PODM. Meanwhile, we observe that PODM also performs better than the other two schemes in terms of the reconstruction error (subfigure (b), lower is better) and SSIM (subfigure (c), higher is better). Furthermore, we also explore the optimality error of our PODM. It can be seen from subfigure (d) that the optimality error is always satisfied. This means that the learnable architectures can always be used to improve the iteration process of PODM.\n\nFigure 2: Convergence curves of FISTA, IRCNN, and our PODM. Subfigures (a)-(c) are the relative error, reconstruction error and SSIM, respectively. Subfigure (d) plots the \u201cOptimality Error\u201d appearing in PODM.\n\nWe then report the average quantitative scores (i.e., PSNR and SSIM) on two benchmark datasets [4, 32]. As shown in Table 1, the quantitative performance of PODM is better than that of all the compared methods on all up-sampling scales (i.e., \u00d72, \u00d73, \u00d74).\n\nTable 1: Averaged quantitative performance on super-resolution with different up-sampling scales.\nScale | Metric | FISTA on [4] | IRCNN on [4] | Ours on [4] | FISTA on [32] | IRCNN on [32] | Ours on [32]\n\u00d72 | PSNR | 35.14 | 37.43 | 37.46 | 31.41 | 32.88 | 33.06\n\u00d72 | SSIM | 0.94 | 0.96 | 0.98 | 0.90 | 0.91 | 0.95\n\u00d73 | PSNR | 31.35 | 33.39 | 33.44 | 28.39 | 29.61 | 29.77\n\u00d73 | SSIM | 0.88 | 0.92 | 0.96 | 0.81 | 0.83 | 0.90\n\u00d74 | PSNR | 29.26 | 31.02 | 31.05 | 26.93 | 27.72 | 27.86\n\u00d74 | SSIM | 0.83 | 0.88 | 0.93 | 0.76 | 0.76 | 0.85\n\n5.2 Relaxed PODM for Image Restoration\n\nImage restoration is one of the most challenging low-level vision problems; it aims to recover a latent clear image u from the blurred and noisy observation o. To evaluate the Relaxed PODM (RPODM) paradigm, we formulate the image restoration task as estimating u by solving $\\min_u \\|k \\otimes u - o\\|^2 + \\chi_\\Omega(u)$, where k and n are respectively the blur kernel and the noise, and $\\chi_\\Omega$ is the indicator function of the set $\\Omega$. Here we define $\\Omega = \\{u \\mid \\|u\\|_0 \\leq s, \\ a \\leq [u]_i \\leq b, \\ i = 1, \\cdots, n\\}$, with $s > 0$, $a = \\min_i \\{[o]_i\\}$ and $b = \\max_i \\{[o]_i\\}$, to enforce our fundamental constraints on u. It is easy to check that the proximal operator corresponding to $\\chi_\\Omega$ can be written as $\\mathrm{prox}_{\\chi_\\Omega}(u) := \\mathrm{HardThre}(\\mathrm{Proj}_{[a,b]}(u))$, where $\\mathrm{HardThre}(\\cdot)$ and $\\mathrm{Proj}_{[a,b]}(\\cdot)$ are the hard thresholding and projection operators, respectively. Then we introduce learnable architectures to build RPODM. Specifically, by considering the distance measure $d(x, x_A)$ in filtered space, we define $H = \\mu \\sum_{n=1}^N f_n^\\top f_n / 2$, where $\\{f_n\\}$ denote the filtering operations. In this way, the propagation module can be directly obtained by solving Eq. (3) in closed form, i.e., $x_F = \\big(K^\\top K + \\mu \\sum_{n=1}^N F_n^\\top F_n\\big)^{-1} \\big(K^\\top o + \\mu \\sum_{n=1}^N F_n^\\top F_n x_A\\big)$, where K and $\\{F_n\\}$ are block-circulant matrices corresponding to convolutions. Inspired by [14], here we introduce a multilayer perceptron and the DCT basis to output the parameter $\\mu$ and construct the filters $\\{f_n\\}$, respectively. Then we adopt a CNN architecture with 6 convolutional layers (the first 5 layers followed by ELU [6] activations) as A for our deep propagation.\n\nComparisons with State-of-the-art Methods: We compared RPODM with several state-of-the-art image restoration approaches, including traditional numerical methods (e.g., TV [16], HL [12]), learning based methods (e.g., EPLL [34], MLP [26]), and deep unrolling methods (e.g., CSF [25], RTF [24], IRCNN [33]). We first conducted experiments on the most widely used Levin et al.\u2019s benchmark [15], with 32 blurry images of size 255 \u00d7 255. We also evaluated all these compared methods on the more challenging Sun et al.\u2019s benchmark [28], which includes 640 blurry images with 1% Gaussian noise and sizes ranging from 620\u00d71024 to 928\u00d71024. Table 2 reports the quantitative results (i.e., PSNR, SSIM and TIME (in seconds)). It can be seen that our proposed method consistently obtains higher quantitative scores than the other approaches, especially on Levin et al.\u2019s dataset [15]. As for the running time, we observed that RPODM is much faster than most conventional optimization based approaches and recently proposed learning based iteration methods (i.e., TV, EPLL, RTF, MLP and IRCNN). HL and CSF are slightly faster than RPODM, but the performance of these two simple methods is much worse than that of our approach. The qualitative comparisons in Fig. 3 also verify the effectiveness of RPODM.\n\nTable 2: Averaged quantitative performance on image restoration.\nBenchmark | Metric | TV | HL | EPLL | CSF | RTF | MLP | IRCNN | Ours\n[15] | PSNR | 29.38 | 30.12 | 31.65 | 32.74 | 33.26 | 31.32 | 32.51 | 34.06\n[15] | SSIM | 0.88 | 0.90 | 0.93 | 0.93 | 0.94 | 0.90 | 0.92 | 0.97\n[15] | TIME | 1.22 | 0.10 | 70.32 | 0.12 | 26.63 | 0.49 | 2.85 | 1.46\n[28] | PSNR | 30.67 | 31.03 | 32.44 | 31.55 | 32.45 | 31.47 | 32.61 | 32.62\n[28] | SSIM | 0.85 | 0.85 | 0.88 | 0.87 | 0.89 | 0.86 | 0.89 | 0.89\n[28] | TIME | 6.38 | 0.49 | 721.98 | 0.50 | 240.98 | 4.59 | 16.67 | 1.95\n\nFigure 3: Image restoration results on two example images, where the inputs on the top and bottom rows are respectively from Levin et al.\u2019s and Sun et al.\u2019s benchmarks. The PSNR / SSIM are reported below each image. Panels: Blurred Image, EPLL, CSF, RTF, IRCNN, Ours. PSNR / SSIM in panel order after the blurred input, top row: 32.00 / 0.94, 34.65 / 0.95, 35.67 / 0.96, 35.47 / 0.95, 36.75 / 0.98; bottom row: 31.28 / 0.86, 29.86 / 0.82, 31.14 / 0.86, 31.35 / 0.87, 31.65 / 0.87.\n\nFigure 4: Image restoration results on the real blurry image. Panels: Blurred Image, EPLL, CSF, RTF, IRCNN, Ours.\n\nReal Blurry Images: Finally, we evaluated RPODM on real-world blurry images [11] (i.e., with unknown blur kernel and 1% additional Gaussian noise). We adopted the method in [23] to estimate a rough blur kernel. In Fig. 4, we compared the image restoration results of RPODM with other competitive methods (top 4 in Table 2, i.e., EPLL, CSF, RTF, and IRCNN) based on this estimated kernel. It can be seen that even with the roughly estimated (possibly inexact) kernel, RPODM can still obtain a clear image with richer details and textures (see zoomed in regions).\n\n6 Conclusions\n\nThis paper proposed the Propagation and Optimization based Deep Model (PODM), a new paradigm that integrates principled domain knowledge and trainable architectures to build deep propagations for model optimization and machine learning. For PODM as a learning based numerical solver, we proved in theory that the generated sequences converge to a critical point of the given nonconvex and nonsmooth optimization model. Furthermore, by relaxing the optimality error, we also obtain a plug-and-play, collaborative, interpretable, and end-to-end deep model for real-world complex tasks. 
Extensive experimental results verified our theoretical investigations and demonstrated the effectiveness of the proposed framework.\n\nAcknowledgments\n\nThis work is partially supported by the National Natural Science Foundation of China (Nos. 61672125, 61733002, 61572096 and 61632019), and the Fundamental Research Funds for the Central Universities.\n\nReferences\n\n[1] Galen Andrew, Raman Arora, Jeff Bilmes, and Karen Livescu. Deep canonical correlation analysis. In ICML, pages 1247\u20131255, 2013.\n\n[2] Marcin Andrychowicz, Misha Denil, Sergio Gomez, Matthew W Hoffman, David Pfau, Tom Schaul, and Nando de Freitas. Learning to learn by gradient descent by gradient descent. In NIPS, pages 3981\u20133989, 2016.\n\n[3] Amir Beck and Marc Teboulle. A fast iterative shrinkage-thresholding algorithm for linear inverse problems. SIAM Journal on Imaging Sciences, 2(1):183\u2013202, 2009.\n\n[4] Marco Bevilacqua, Aline Roumy, Christine Guillemot, and Marie Line Alberi-Morel. Low-complexity single-image super-resolution based on nonnegative neighbor embedding. In BMVC, pages 1\u201310, 2012.\n\n[5] Stephen Boyd, Neal Parikh, Eric Chu, Borja Peleato, and Jonathan Eckstein. Distributed optimization and statistical learning via the alternating direction method of multipliers. Foundations and Trends\u00ae in Machine Learning, 3(1):1\u2013122, 2011.\n\n[6] Djork-Arn\u00e9 Clevert, Thomas Unterthiner, and Sepp Hochreiter. Fast and accurate deep network learning by exponential linear units (ELUs). arXiv preprint arXiv:1511.07289, 2015.\n\n[7] Steven Diamond, Vincent Sitzmann, Felix Heide, and Gordon Wetzstein. Unrolled optimization with deep priors. arXiv preprint arXiv:1705.08041, 2017.\n\n[8] Ian Goodfellow, Jean Pouget-Abadie, Mehdi Mirza, Bing Xu, David Warde-Farley, Sherjil Ozair, Aaron Courville, and Yoshua Bengio. Generative adversarial nets. In NIPS, pages 2672\u20132680, 2014.\n\n[9] Karol Gregor and Yann LeCun. 
Learning fast approximations of sparse coding. In ICML, pages 399\u2013406, 2010.\n\n[10] Kaiming He, Xiangyu Zhang, Shaoqing Ren, and Jian Sun. Deep residual learning for image recognition. In CVPR, pages 770\u2013778, 2016.\n\n[11] Rolf K\u00f6hler, Michael Hirsch, Betty Mohler, Bernhard Sch\u00f6lkopf, and Stefan Harmeling. Recording and playback of camera shake: Benchmarking blind deconvolution with a real-world database. In ECCV, pages 27\u201340, 2012.\n\n[12] Dilip Krishnan and Rob Fergus. Fast image deconvolution using hyper-Laplacian priors. In NIPS, pages 1033\u20131041, 2009.\n\n[13] Alex Krizhevsky, Ilya Sutskever, and Geoffrey E Hinton. ImageNet classification with deep convolutional neural networks. In NIPS, pages 1097\u20131105, 2012.\n\n[14] Jakob Kruse, Carsten Rother, and Uwe Schmidt. Learning to push the limits of efficient FFT-based image deconvolution. In ICCV, pages 4586\u20134594, 2017.\n\n[15] Anat Levin, Yair Weiss, Fredo Durand, and William T. Freeman. Understanding and evaluating blind deconvolution algorithms. In CVPR, pages 1964\u20131971, 2009.\n\n[16] Chengbo Li, Wotao Yin, Hong Jiang, and Yin Zhang. An efficient augmented Lagrangian method with applications to total variation minimization. Computational Optimization & Applications, 56(3):507\u2013530, 2013.\n\n[17] Ke Li and Jitendra Malik. Learning to optimize. arXiv preprint arXiv:1606.01885, 2016.\n\n[18] Risheng Liu, Shichao Cheng, Yi He, Xin Fan, Zhouchen Lin, and Zhongxuan Luo. On the convergence of learning-based iterative methods for nonconvex inverse problems. arXiv preprint arXiv:1808.05331, 2018.\n\n[19] Risheng Liu, Xin Fan, Shichao Cheng, Xiangyu Wang, and Zhongxuan Luo. Proximal alternating direction network: A globally converged deep unrolling framework. In AAAI, 2018.\n\n[20] Risheng Liu, Xin Fan, Minjun Hou, Zhiying Jiang, Zhongxuan Luo, and Lei Zhang. 
Learning aggregated transmission propagation networks for haze removal and beyond. IEEE TNNLS, (99):1\u201314, 2018.\n\n[21] Risheng Liu, Zhouchen Lin, and Zhixun Su. Learning Markov random walks for robust subspace clustering and estimation. Neural Networks, 59:1\u201315, 2014.\n\n[22] Risheng Liu, Long Ma, Yiyang Wang, and Lei Zhang. Learning converged propagations with deep prior ensemble for image enhancement. arXiv preprint arXiv:1810.04012v1, 2018.\n\n[23] Jinshan Pan, Zhouchen Lin, Zhixun Su, and Ming-Hsuan Yang. Robust kernel estimation with outliers handling for image deblurring. In CVPR, pages 2800\u20132808, 2016.\n\n[24] Uwe Schmidt, Jeremy Jancsary, Sebastian Nowozin, Stefan Roth, and Carsten Rother. Cascades of regression tree fields for image restoration. IEEE TPAMI, 38(4):677\u2013689, 2016.\n\n[25] Uwe Schmidt and Stefan Roth. Shrinkage fields for effective image restoration. In CVPR, pages 2774\u20132781, 2014.\n\n[26] Christian J Schuler, Harold Christopher Burger, Stefan Harmeling, and Bernhard Sch\u00f6lkopf. A machine learning approach for non-blind image deconvolution. In CVPR, pages 1067\u20131074, 2013.\n\n[27] Pablo Sprechmann, Alexander M Bronstein, and Guillermo Sapiro. Learning efficient sparse and low rank models. IEEE TPAMI, 37(9):1821\u20131833, 2015.\n\n[28] Libin Sun, Sunghyun Cho, Jue Wang, and James Hays. Edge-based blur kernel estimation using patch priors. In ICCP, pages 1\u20138, 2013.\n\n[29] Christian Szegedy, Wei Liu, Yangqing Jia, Pierre Sermanet, Scott Reed, Dragomir Anguelov, Dumitru Erhan, Vincent Vanhoucke, and Andrew Rabinovich. Going deeper with convolutions. In CVPR, pages 1\u20139, 2015.\n\n[30] Dmitry Ulyanov, Andrea Vedaldi, and Victor Lempitsky. Deep image prior. arXiv preprint arXiv:1711.10925, 2017.\n\n[31] Jingyu Yan and Marc Pollefeys. A general framework for motion segmentation: Independent, articulated, rigid, non-rigid, degenerate and non-degenerate. 
In ECCV, pages 94\u2013106, 2006.\n\n[32] Roman Zeyde, Michael Elad, and Matan Protter. On single image scale-up using sparse-representations. In International Conference on Curves and Surfaces, pages 711\u2013730, 2010.\n\n[33] Kai Zhang, Wangmeng Zuo, Shuhang Gu, and Lei Zhang. Learning deep CNN denoiser prior for image restoration. In CVPR, pages 2808\u20132817, 2017.\n\n[34] Daniel Zoran and Yair Weiss. From learning models of natural image patches to whole image restoration. In ICCV, pages 479\u2013486, 2011.\n", "award": [], "sourceid": 2111, "authors": [{"given_name": "Risheng", "family_name": "Liu", "institution": "Dalian University of Technology"}, {"given_name": "Shichao", "family_name": "Cheng", "institution": "Dalian University of Technology"}, {"given_name": "xiaokun", "family_name": "liu", "institution": "DUT"}, {"given_name": "Long", "family_name": "Ma", "institution": "School of Software Technology, Dalian University of Technology"}, {"given_name": "Xin", "family_name": "Fan", "institution": "Dalian University of Technology"}, {"given_name": "Zhongxuan", "family_name": "Luo", "institution": "DALIAN UNIVERSITY OF TECHNOLOGY"}]}