{"title": "Invert to Learn to Invert", "book": "Advances in Neural Information Processing Systems", "page_first": 446, "page_last": 456, "abstract": "Iterative learning to infer approaches have become popular solvers for inverse problems. However, their memory requirements during training grow linearly with model depth, limiting in practice model expressiveness. In this work, we propose an iterative inverse model with constant memory that relies on invertible networks to avoid storing intermediate activations. As a result, the proposed approach allows us to train models with 400 layers on 3D volumes in an MRI image reconstruction task. In experiments on a public data set, we demonstrate that these deeper, and thus more expressive, networks perform state-of-the-art image reconstruction.", "full_text": "Invert to Learn to Invert\n\nPatrick Putzky\n\nAmlab, University of Amsterdam (UvA), Amsterdam, The Netherlands\n\nMax-Planck-Institute for Intelligent Systems (MPI-IS), T\u00fcbingen, Germany\n\npatrick.putzky@googlemail.com\n\nMax Welling\n\nAmlab, University of Amsterdam (UvA), Amsterdam, The Netherlands\n\nCanadian Institute for Advanced Research (CIFAR), Canada\n\nwelling.max@googlemail.com\n\nAbstract\n\nIterative learning to infer approaches have become popular solvers for inverse\nproblems. However, their memory requirements during training grow linearly with\nmodel depth, limiting in practice model expressiveness. In this work, we propose\nan iterative inverse model with constant memory that relies on invertible networks\nto avoid storing intermediate activations. As a result, the proposed approach allows\nus to train models with 400 layers on 3D volumes in an MRI image reconstruction\ntask. In experiments on a public data set, we demonstrate that these deeper, and\nthus more expressive, networks perform state-of-the-art image reconstruction.\n\n1\n\nIntroduction\n\nWe consider the task of solving inverse problems. 
An inverse problem is described through a so-called forward problem that models either a real-world measurement process or an auxiliary prediction task. The forward problem can be written as

d = A(p, n)    (1)

where d is the measurable data, A is a (non-)linear forward operator that models the measurement process, p is an unobserved signal of interest, and n is observational noise. Solving the inverse problem is then a matter of finding an inverse model p = A^{-1}(d). However, if the problem is ill-posed or the forward problem is non-linear, finding A^{-1} is a non-trivial task. Oftentimes, it is necessary to impose assumptions about the signal p and to solve the task in an iterative fashion [1].

1.1 Learn to Invert

Many recent approaches to solving inverse problems focus on models that learn to invert the forward problem by mimicking the behaviour of an iterative optimization algorithm. Here, we will refer to models of this type as "iterative inverse models". Most iterative inverse models can be described through a recursive update equation of the form

p_{t+1}, s_{t+1} = h_φ(A, d, p_t, s_t)    (2)

where h_φ is a parametric function, and p_t and s_t are an estimate of the signal p and an auxiliary (memory) variable at iteration t, respectively. Because h_φ is an iterative model it is often interpreted as a recurrent neural network (RNN). The functional form of h_φ ultimately characterizes different approaches to iterative inverse models. Figure 1 (A) illustrates the iterative steps of such models.

Training of iterative inverse models is typically done via supervised learning. To generate training data, measurements d are simulated from ground truth observations p through a known forward model A.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Figure 1: Iterative Inverse Models: Unrolled model and individual steps of RIM and i-RIM.
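As a concrete toy instance of the recursion in Eq. (2), the sketch below uses a hand-written h_φ (a momentum-style gradient step on a least-squares data term, with s_t as the memory variable) in place of a learned network; the operator, step size, and iteration count are illustrative choices, not from the paper.

```python
import numpy as np

# Minimal sketch of the recursive update in Eq. (2): a hand-written h_phi
# (momentum gradient step on a data term) stands in for a learned network.

rng = np.random.default_rng(0)
A = rng.standard_normal((20, 5))         # toy linear forward operator
p_true = rng.standard_normal(5)          # unobserved signal of interest
d = A @ p_true                           # noiseless measurements

def h_phi(A, d, p_t, s_t, lr=0.01, beta=0.5):
    """One iteration: update estimate p_t and memory state s_t."""
    grad = A.T @ (A @ p_t - d)           # gradient of 0.5 * ||A p - d||^2
    s_next = beta * s_t + grad           # momentum-like memory variable
    p_next = p_t - lr * s_next
    return p_next, s_next

p, s = np.zeros(5), np.zeros(5)
for t in range(500):
    p, s = h_phi(A, d, p, s)

print(np.linalg.norm(A @ p - d))         # residual shrinks toward zero
```

A learned model replaces the hand-designed update with a network h_φ trained on (p, d) pairs, which is what the remainder of the paper does.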
The training loss then takes the form

L(p; d, A, φ) = Σ_{t=1}^{T} ω_t L(p, p̂_t(A, d, φ))    (3)

where p is the ground truth signal, p̂_t(A, d, φ) is the model estimate at iteration t given data (A, d) and model parameters φ, and ω_t ∈ R⁺ is an importance weight of the t-th iteration.

Models of this form have been successfully applied in several image restoration tasks on natural images [2-7] and on sparse coding [8], and they have more recently found adoption in medical image reconstruction [9-12], a field where sparse coding and compressed sensing have been a dominant force over the past years [1, 13].

1.2 Invert to Learn

In order to perform back-propagation efficiently, we are typically required to store intermediate forward activations in memory. This imposes a trade-off between model complexity and hardware memory constraints which essentially limits network depth. Since iterative inverse models are trained with back-propagation through time, they represent some of the deepest, most memory-consuming models currently used. As a result, one often has to resort to very shallow models at each step of the iterative process. Here, we overcome these model limitations by presenting a memory-efficient way to train very deep iterative inverse models. To do that, our approach follows the training principles presented in Gomez et al. [14]. To save memory, the authors suggested using reversible neural network architectures which make it unnecessary to store intermediate activations, as they can be restored from post-activations. Memory complexity in this approach is O(1) and computational complexity is O(L), where L is the depth of the network. In the original approach, the authors utilized pooling layers which still required storing of intermediate activations at these layers. In practice, memory cost was hence not O(1). Shortly after, Jacobsen et al.
[15] demonstrated that a fully invertible network inspired by RevNets [14] can perform as well as a non-invertible model on discriminative tasks, although the model was not trained with memory savings in mind. Here we will adapt invertible neural networks to allow for memory savings in the same way as in Gomez et al. [14]. We refer to this approach as "invertible learning".

1.3 Invertible Neural Networks

Invertible neural networks have become popular predominantly in likelihood-based generative models that make use of the change-of-variables formula

p_x(x) = p_y(f(x)) |det(∂f(x)/∂x^⊤)|    (4)

which holds for invertible f(x) [16]. Under a fixed distribution p_y(·), an invertible model f(x) can be optimized to maximize the likelihood of data x. Dinh et al. [16] suggested an architecture for invertible networks in which inputs and outputs of a layer are split into two parts such that x = (x_1, x_2) and y = (y_1, y_2). Their suggested invertible layer has the form

y_1 = x_1                    x_2 = y_2 − G(y_1)
y_2 = x_2 + G(y_1)           x_1 = y_1    (5)

where the left-hand equations reflect the forward computation and the right-hand equations are the backward computations. This layer is similar to the layer used later in Gomez et al. [14]. Later work on likelihood-based generative models built on the idea of invertible networks by focusing on non-volume-preserving layers [17, 18]. A recent work avoids the splitting from Dinh et al. [16] and focuses instead on f(x) that can be numerically inverted [19].

Here, we will assume that any invertible neural network can be used as a basis for invertible learning. Apart from the fact that our approach will allow us to have significant memory savings during training, it can further allow us to leverage Eq.
(4) for unsupervised training in the future.

1.4 Recurrent Inference Machines

We will use Recurrent Inference Machines (RIM) [2] as a basis for our iterative inverse model. The update equations of the RIM take the form

s_{t+1} = f(∇D(d, A(Ψ(η_t))), η_t, s_t)    (6)
η_{t+1} = η_t + g(∇D(d, A(Ψ(η_t))), η_t, s_{t+1}),    (7)

with p_t = Ψ(η_t), where Ψ is a link function, and D(d, A(Ψ(η_t))) is a data consistency term which ensures that estimates of p_t stay true to the measured data d under the forward model A. Below, we will call (η_t, s_t) the machine state at time t. The benefit of the RIM is that it simultaneously learns iterative inference and implicitly learns a prior over p. The update equations are general enough that they allow us to use any network for h_φ. Unfortunately, even if h_φ were invertible, the RIM in its current form is not. We will show later how a simple modification of these equations can make the whole iterative process invertible. An illustration of the update block for the machine state can be found in figure 1 (B).

1.5 Contribution

In this work, we marry iterative inverse models with the concept of invertible learning. This leads to the following contributions:

1. The first iterative inverse model that is fully invertible. This allows us to overcome memory constraints during training with invertible learning [14]. It will further allow for semi- and unsupervised training in the future [16].
2. Stable invertible learning of very deep models. In practice, invertible learning can be unstable due to numerical errors that accumulate in very deep networks [14]. In our experience, common invertible layers [16-18] suffered from this problem.
We give intuitions why these layers might\nintroduce training instabilities, and present a new layer that addresses these issues and enables stable\ninvertible learning of very deep networks (400 layers).\n3. Scale to large observations. We demonstrate in experiments that our model can be trained on\nlarge volumes in MRI (3d). Previous iterative inverse models were only able to perform reconstruction\non 2d slices. For data that has been acquired with 3d sequences, however, these approaches are not\nfeasible anymore. Our approach overcomes this issue. This result has implications in other domains\nwith large observations such as synthesis imaging in radio astronomy.\n\n2 Method\n\nOur approach consists of two components which we will describe in the following section. (1) We\npresent a simple way to modify Recurrent Inference Machines (RIM) [2] such that they become fully\ninvertible. We call this model \u201cinvertible Recurrent Inference Machines\u201d (i-RIM). (2) We present a\nnew invertible layer with modi\ufb01ed ideas from Kingma and Dhariwal [18] and Dinh et al. [16] that\nallows stable invertible learning of very deep i-RIMs. We brie\ufb02y discuss why invertible learning with\n\n3\n\n\fconventional layers [16\u201318] can be unstable. An implementation of our approach can be found at\nhttps://github.com/pputzky/invertible_rim.\n\nInvertible Recurrent Inference Machines\n\n2.1\nIn the following, we will assume that we can construct an invertible neural network h(\u00b7) with memory\ncomplexity O(1) during training using the approach in Gomez et al. [14]. 
That is, if the network can be inverted layer-wise such that

h = h_L ∘ h_{L−1} ∘ ··· ∘ h_1    (8)
h^{−1} = (h_1)^{−1} ∘ (h_2)^{−1} ∘ ··· ∘ (h_L)^{−1}    (9)

where h_l is the l-th layer and (h_l)^{−1} is its inverse, we can use invertible learning to do back-propagation without storing activations in a computationally efficient way.

Even though we might have an invertible h(·), the RIM in the formulation given in Eq. (6) and Eq. (7) cannot trivially be made invertible. One issue is that, if naively implemented, h(·) takes three inputs but has only two outputs, as ∇D(d, A(Ψ(η_t))) is not part of the machine state (η_t, s_t). With this, h(·) would have to increase in size with the number of iterations of the model in order to stay fully invertible. The next issue is that the incremental update of η_t in Eq. (7) is conditioned on η_t itself, which prevents us from using the trick from Eq. (5). One solution would be to save all intermediate η_t, but then we would still have memory complexity O(T) < O(T · L), which is an improvement but still too restrictive for large-scale data (compare 3.4, i-RIM 3D). Alternatively, Behrmann et al. [19] show that Eq. (7) could be numerically inverted if Lip(h) < 1 in η. In order to satisfy this condition, we would have to restrict not only h(·) but also A, Ψ, and D(·,·). Since we want our method to be usable with any type of inverse problem described in Eq. (1), such an approach becomes infeasible. Further, for very deep h(·), numerical inversion would put a high computational burden on the training procedure.

We are therefore looking for a way to make the update equations of the RIM trivially invertible. To do that we use the same trick as in Eq. (5).
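The additive coupling trick of Eq. (5) can be sketched in a few lines. G below is an arbitrary (deliberately non-invertible) toy function of our own choosing; the point is that the layer is inverted exactly without ever storing its output activations.

```python
import numpy as np

# Sketch of the additive coupling layer in Eq. (5). G may be any function
# (here a hypothetical tanh-linear map); the coupling is still invertible.

rng = np.random.default_rng(1)
W = rng.standard_normal((8, 8))

def G(z):
    return np.tanh(W @ z)                # arbitrary, non-invertible function

def forward(x1, x2):
    y1 = x1
    y2 = x2 + G(y1)
    return y1, y2

def inverse(y1, y2):
    x2 = y2 - G(y1)                      # same G evaluation cancels exactly
    x1 = y1
    return x1, x2

x1, x2 = rng.standard_normal(8), rng.standard_normal(8)
y1, y2 = forward(x1, x2)
x1_rec, x2_rec = inverse(y1, y2)
print(np.allclose(x1, x1_rec), np.allclose(x2, x2_rec))  # exact up to rounding
```

During backpropagation, the inputs of each such layer can be recomputed from its outputs on the fly, which is what gives invertible learning its O(1) activation memory.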
The update equations of our invertible Recurrent Inference Machines (i-RIM) take the form (left: forward step; right: reverse step):

η′_t = η_t                                   η′_t, s′_t = h_t^{−1}(η_{t+1}, s_{t+1})
s′_t = s_t + g_t(∇D(d, A(Ψ(η′_t))))          s_t = s′_t − g_t(∇D(d, A(Ψ(η′_t))))
η_{t+1}, s_{t+1} = h_t(η′_t, s′_t)           η_t = η′_t    (10)

where we do not require weight sharing over iterations t in g_t or h_t, respectively. As can be seen, the update from (η_t, s_t) to (η′_t, s′_t) is reminiscent of Eq. (5), and we assume that h(·) is an invertible function. Given that h(·) can be inverted layer-wise as above, we can train an i-RIM model using invertible learning with memory complexity O(1). We have visualised the forward and reverse updates of the machine state in an i-RIM in figure 1 (C) & (D).

2.2 An Invertible Layer with Orthogonal 1×1 Convolutions

Here, we introduce a new invertible layer that we will use to form h(·). Much of the recent work on invertible neural networks has focused on modifying the invertible layer of Dinh et al. [16] in order to improve generative modeling. Dinh et al. [17] proposed a non-volume-preserving layer that can still be easily inverted:

y_1 = x_1                                   x_2 = (y_2 − G(y_1)) ⊙ exp(−F(y_1))
y_2 = x_2 ⊙ exp(F(y_1)) + G(y_1)            x_1 = y_1    (11)

While improving over the volume-preserving layer in Dinh et al. [16], the method still required manual splitting of layer activations using a hand-chosen mask. Kingma and Dhariwal [18] addressed this issue by introducing invertible 1×1 convolutions to embed the activations before using the affine layer from Dinh et al. [17]. The idea behind this approach is that this convolution can learn to permute the signal across the channel dimension to make the splitting a parametric approach.
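Returning to the i-RIM update in Eq. (10), the round trip through one step can be sketched with toy stand-ins: below, the data term D is a least-squares term with Ψ the identity, g_t injects the gradient directly, and h_t is a pair of additive couplings acting on (η, s). All of these choices are hypothetical simplifications of the actual learned modules.

```python
import numpy as np

# Toy sketch of one i-RIM step (Eq. (10)): the forward step maps
# (eta_t, s_t) -> (eta_{t+1}, s_{t+1}); the reverse step recovers the
# machine state exactly, so no activations need to be stored for training.

rng = np.random.default_rng(2)
n = 6
A = rng.standard_normal((n, n))
d = rng.standard_normal(n)
We, Ws = rng.standard_normal((n, n)), rng.standard_normal((n, n))

def data_grad(eta):                       # grad of D = 0.5*||A eta - d||^2
    return A.T @ (A @ eta - d)

def h_t(eta, s):                          # invertible coupling on (eta, s)
    s = s + np.tanh(Ws @ eta)
    eta = eta + np.tanh(We @ s)
    return eta, s

def h_t_inv(eta, s):                      # exact inverse of h_t
    eta = eta - np.tanh(We @ s)
    s = s - np.tanh(Ws @ eta)
    return eta, s

def irim_forward(eta_t, s_t):
    eta_p = eta_t                         # eta'_t = eta_t
    s_p = s_t + data_grad(eta_p)          # s'_t = s_t + g_t(grad D)
    return h_t(eta_p, s_p)                # (eta_{t+1}, s_{t+1})

def irim_reverse(eta_next, s_next):
    eta_p, s_p = h_t_inv(eta_next, s_next)
    return eta_p, s_p - data_grad(eta_p)  # undo g_t to recover (eta_t, s_t)

eta0, s0 = rng.standard_normal(n), rng.standard_normal(n)
eta1, s1 = irim_forward(eta0, s0)
eta0_rec, s0_rec = irim_reverse(eta1, s1)
print(np.allclose(eta0, eta0_rec), np.allclose(s0, s0_rec))
```

Because the gradient of the data term enters only through an additive coupling on s, the step stays invertible no matter how complicated A, Ψ, or D are.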
Building on the invertible 1×1 convolution, Hoogeboom et al. [20] introduced a more general form of invertible convolutions which operate not only on the channel dimension but also on spatial dimensions.

We have tried to use the invertible layer of Kingma and Dhariwal [18] in our i-RIM, but without success. We suspect three possible causes of this issue, which we will use as motivation for our proposed invertible layer:

Figure 2: Illustration of the layer used in this work. (A): Invertible layer with orthogonal 1×1 convolutional embedding. (B): Function G used in each invertible layer.

1. Channel permutations. Channel ordering has a semantic meaning in the i-RIM (see the machine-state update equations in Sec. 2.1) which it does not have a priori in the above-described likelihood-based generative models. Further, a permutation in layer l will affect all channel orderings in downstream layers k > l. Both factors may harm the error landscape.
2. Exponential gate. The rescaling term exp(F(x)) in Eq. (11) can make inversions numerically unstable if not properly taken care of.
3. Invertible 1×1 convolutions. Without any restrictions on the eigenvalues of the convolution, it is possible that it will cause vanishing or exploding gradients during training with back-propagation.

For the i-RIM, we have found an invertible layer that addresses all the potential issues mentioned above, is simple to implement, and leads to good performance, as can be seen later. Our invertible layer has the following computational steps:

x′ = Ux    (12)
y′_1 = x′_1    (13)
y′_2 = x′_2 + G(x′_1)    (14)
y = U^⊤ y′    (15)

where x′ = (x′_1, x′_2) and y′ = (y′_1, y′_2), and U is an orthogonal 1×1 convolution which is key to our invertible layer.
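The steps in Eqs. (12)-(15) can be sketched for a single pixel with C channels, where a 1×1 convolution reduces to a C×C matrix. For brevity, U is made orthogonal via a QR decomposition here, whereas the paper builds it from Householder reflections; the splitting into halves and the choice of G are illustrative.

```python
import numpy as np

# Sketch of the invertible layer in Eqs. (12)-(15) on a C-channel vector:
# embed with orthogonal U, apply additive coupling, project back with U^T.

rng = np.random.default_rng(3)
C = 8
U, _ = np.linalg.qr(rng.standard_normal((C, C)))   # orthogonal C x C matrix
W = rng.standard_normal((C // 2, C // 2))

def G(z):
    return np.tanh(W @ z)                          # arbitrary coupling function

def layer(x):
    xp = U @ x                                     # x' = U x        (Eq. 12)
    y1 = xp[:C // 2]                               # y'_1 = x'_1     (Eq. 13)
    y2 = xp[C // 2:] + G(y1)                       # y'_2 = x'_2 + G (Eq. 14)
    return U.T @ np.concatenate([y1, y2])          # y = U^T y'      (Eq. 15)

def layer_inv(y):
    yp = U @ y                                     # undo the final U^T
    x1 = yp[:C // 2]
    x2 = yp[C // 2:] - G(x1)                       # undo the coupling
    return U.T @ np.concatenate([x1, x2])          # undo the embedding

x = rng.standard_normal(C)
print(np.allclose(x, layer_inv(layer(x))))
```

Note that the inverse needs only U^⊤, never a matrix inversion, which is exactly the computational advantage the paper claims for the orthogonal parameterization.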
Recall that the motivation for an invertible 1×1 convolution in Kingma and Dhariwal [18] was to learn a parametric permutation over channels in order to have a suitable embedding for the following affine transformation from Dinh et al. [16, 17]. An orthogonal 1×1 convolution is sufficient to implement this kind of permutation but will not cause any vanishing or exploding gradients during back-propagation, since all singular values of the matrix are 1. Further, U can be trivially inverted with U^{−1} = U^⊤, which will reduce computational cost in our training procedure as we require layer inversions at every optimization step. Below, we will show how to construct an orthogonal 1×1 convolution. Another feature of our invertible layer is Eq. (15), in which we project the outputs of the affine layer back to the original basis of x using U^⊤. This means that our 1×1 convolution will act only locally on its layer while undoing the permutation for downstream layers. A schematic of our layer and its inverse can be found in figure 2. Here, we will omit the inversion of our layer for brevity and refer the reader to the supplement.

2.2.1 Orthogonal 1×1 convolution

A 1×1 convolution can be implemented by using a k×k matrix that is reshaped into a convolutional filter and then used in a convolution operation [18]. In order to guarantee that this matrix is orthogonal, we use the method utilized in Tomczak and Welling [21] and Hoogeboom et al. [20]. Any orthogonal k×k matrix U can be constructed from a series of Householder reflections such that

U = H_K H_{K−1} ... H_1    (16)

where H_k is a Householder reflection with H_k = I − 2 v_k v_k^⊤ / ‖v_k‖², and v_k is a vector which is orthogonal to the reflection hyperplane.
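The Householder construction in Eq. (16) is compact in code. The sketch below uses D = 3 reflections as in the paper's experiments; the vectors v_k would be the trainable parameters, drawn randomly here for illustration.

```python
import numpy as np

# Build an orthogonal matrix as a product of Householder reflections
# (Eq. (16)): U = H_D ... H_1, with H_k = I - 2 v_k v_k^T / ||v_k||^2.

rng = np.random.default_rng(4)
k, D = 8, 3                                # matrix size and reflection count

def householder(v):
    v = v / np.linalg.norm(v)
    return np.eye(len(v)) - 2.0 * np.outer(v, v)

vs = [rng.standard_normal(k) for _ in range(D)]   # trainable in the paper
U = np.eye(k)
for v in vs:
    U = householder(v) @ U                 # left-multiply: U = H_D ... H_1

print(np.allclose(U @ U.T, np.eye(k)))     # orthogonal for any D <= K
```

Because each factor is orthogonal, the product stays orthogonal for any number of reflections, which is what lets the paper truncate to D = 3 and still have U^{−1} = U^⊤ for free.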
In our approach, the vectors v_k represent the parameters to be optimized. Because the Householder matrix is itself orthogonal, a matrix U_D that is constructed from a subset D < K of Householder reflections such that U_D = H_D H_{D−1} ... H_1 is still orthogonal. We use this fact to reduce the number of parameters to train on the one hand, and to reduce the cost of constructing U_D on the other. For our experiments we chose D = 3.

2.2.2 Residual Block with Spatial Downsampling

In this work we utilize a residual block that is inspired by the multiscale approach used in Jacobsen et al. [15]. However, instead of shuffling pixels in the machine state, we perform this action in the residual block. Each block consists of three convolutional layers. The first layer performs a spatial downsampling operation of factor d, i.e. the convolutional filter has size d in every dimension, and we perform a strided convolution with stride d. This is equivalent to shuffling the pixels of a d×d patch into different channels of the same pixel, followed by a 1×1 convolution. The second convolutional layer is a simple 3×3 convolution with stride 1. The last convolutional layer reverses the spatial downsampling operation with a transposed convolution. At the output we have found that a Gated Linear Unit [22] guarantees stable training without the need for any special weight initializations. We use weight normalisation [23] for all convolutional weights in the block, and we disable the bias term for the last convolution. Our residual block has two parameters: d for the spatial downsampling factor, and k for the number of channels in the hidden layers. An illustration of our residual block can be found in figure 2.

2.3 Related Work

Recently, Ardizzone et al. [24] proposed another approach to modeling inverse problems with invertible networks.
In their work, however, the authors assume that the forward problem is unknown, and possibly as difficult as the inverse problem. In our case the forward problem is typically much easier to solve than the inverse problem. The authors suggest training the network bi-directionally, which could potentially help our approach as well. The RIM has found successful applications to several imaging problems in MRI [25, 9] and radio astronomy [26, 27]. We expect that our presented results can be translated to other applications of the RIM as well.

3 Experiments

We evaluate our approach on a public data set for accelerated MRI reconstruction that is part of the so-called fastMRI challenge [28]. Comparisons are made between the U-Net baseline from Zbontar et al. [28], an RIM [2, 9], and an i-RIM, all operating on single 2D slices. To explore future directions and push the memory benefits of our approach to the limit, we also trained a 3D i-RIM. An improvement on the results presented below can be found in Putzky et al. [29].

3.1 Accelerated MRI

The problem in accelerated Magnetic Resonance Imaging (MRI) can be described as a linear measurement problem of the form

d = PFη + n    (17)

where P ∈ R^{m×n} is a sub-sampling matrix, F is a Fourier transform, d ∈ C^m is the sub-sampled K-space data, and η ∈ C^n is an image. Here, we assume that the data has been measured on a single coil, hence the measurement equation leaves out coil sensitivity models. This corresponds to the 'Single-coil task' in the fastMRI challenge [28]. Further, we set Ψ to be the identity function.

3.2 Data

All of our experiments were run on the single-coil data from Zbontar et al. [28]. The data set consists of 973 volumes or 34,742 slices in the training set, 199 volumes or 7,135 slices in the validation set, and 108 volumes or 3,903 slices in the test set.
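The forward model in Eq. (17) is easy to simulate. The sketch below applies it to a toy real-valued image (actual MRI data is complex-valued) with an illustrative row-wise mask and noise level, and also forms the zero-filled inverse FFT that is a common reconstruction baseline.

```python
import numpy as np

# Sketch of the accelerated-MRI forward model d = P F eta + n (Eq. (17)):
# F is the 2D FFT and P keeps a random subset of k-space rows ("lines").
# Image content, mask density, and noise scale are illustrative choices.

rng = np.random.default_rng(5)
eta = rng.standard_normal((32, 32))              # toy image (real-valued here)

mask = rng.random(32) < 0.25                     # keep ~25% of k-space rows
full_kspace = np.fft.fft2(eta)                   # F eta
d = full_kspace[mask, :]                         # P F eta: sub-sampled rows
d = d + 0.01 * (rng.standard_normal(d.shape)
                + 1j * rng.standard_normal(d.shape))  # complex noise n

# Zero-filled reconstruction: insert measured rows, inverse FFT the rest.
zf_kspace = np.zeros((32, 32), dtype=complex)
zf_kspace[mask, :] = d
eta_zf = np.fft.ifft2(zf_kspace).real
print(eta_zf.shape)
```

The zero-filled image is typically aliased; learned iterative models such as the i-RIM start from data like d and the known (P, F) and refine the estimate step by step.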
While training and validation sets are both fully sampled, the test set is only sub-sampled and performance has to be evaluated through the fastMRI website.¹ All volumes in the data set have vastly different sizes. For mini-batch training we therefore reduced the size of image slices to 480×320 (2D models) and the volume size to 32×480×320 (3D model). Not all training volumes had 32 slices; these were excluded for training the 3D model.

¹ http://fastmri.org

Table 1: Comparison of memory consumption during training and testing.

                                     RIM                  i-RIM 2D             i-RIM 3D
Size Machine State (η, s) (CDHW)     130×1×480×320        64×1×480×320         64×32×480×320
Memory Machine State (η, s) (GB)     0.079                0.039                1.258
Number of steps T                    1 / 4 / 8            1 / 4 / 8            1 / 4 / 8
Network Depth (#layers)              5 / 20 / 40          50 / 200 / 400       50 / 200 / 400
Memory during Testing (GB)           0.60 / 0.65 / 0.65   0.20 / 0.24 / 0.31   5.87 / 6.03 / 6.25
Memory during Training (GB)          2.65 / 6.01 / 10.49  2.47 / 2.49 / 2.51   11.51 / 11.76 / 11.89

Table 2: Reconstruction performance on validation and test data from the fastMRI challenge [28] under different metrics. NMSE - normalized mean-squared error (lower is better); PSNR - peak signal-to-noise ratio (higher is better); SSIM - structural similarity index [30] (higher is better).

                 4x Acceleration             8x Acceleration
Validation       NMSE     PSNR    SSIM       NMSE     PSNR    SSIM
U-Net [28]       0.0342   31.91   0.722      0.0482   29.98   0.656
RIM              0.0332   32.24   0.725      0.0484   30.03   0.656
i-RIM 2D         0.0316   32.55   0.734      0.0429   30.76   0.669
i-RIM 3D         0.0322   32.39   0.731      0.0435   30.66   0.667

Test             NMSE     PSNR    SSIM       NMSE     PSNR    SSIM
U-Net [28]       0.0320   32.22   0.754      0.0480   29.45   0.651
RIM              0.0270   33.39   0.759      0.0458   29.71   0.650
i-RIM 2D         0.0255   33.72   0.767      0.0408   30.41   0.664
i-RIM 3D         0.0261   33.54   0.764      0.0413   30.34   0.662
During validation and test, we did not reduce the size of slices and volumes for reconstruction. During training, we simulated sub-sampled K-space data using Eq. (17). As sub-sampling masks we used the masks from the test set (108 distinct masks) in order to simulate the corruption process in the test set. For validation, we generated sub-sampling masks in the way described in Zbontar et al. [28]. Performance was evaluated on the central 320×320 portion of each image slice on magnitude images, as in Zbontar et al. [28]. For evaluation, all slices were treated independently.

3.3 Training

For the U-Net, we followed the training protocol from Zbontar et al. [28]. All other models were trained to reconstruct a complex-valued signal. Real and imaginary parts were treated as separate channels as done in Lønning et al. [25]. The machine state (η_0, s_0) was initialized with zeros for all RIM and i-RIM models. As training loss, we chose the normalized mean squared error (NMSE), which showed overall best performance. To regularize training we masked the model estimate x̂ and the target x with the same random sub-sampling mask m such that

L*(x̂) = NMSE(m ⊙ x̂, m ⊙ x)    (18)

This is a simple way to add gradient noise in cases where the number of pixels is very large. We chose a sub-sampling factor of 0.01, i.e. on average 1% of the pixels (voxels) of each sample were used during a back-propagation step. All iterative models were trained on 8 inference steps.

For the RIM we used a similar model as in Lønning et al. [25]. The model consists of three convolutional layers and two gated recurrent units (GRU) [31] with 64 hidden channels each. During training the loss was averaged across all time steps. For the i-RIM, we chose similar architectures for the 2D and 3D models, respectively.
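The masked loss in Eq. (18) is straightforward to implement. The sketch below uses the paper's 1% keep-rate; the image, noise level, and function names are illustrative.

```python
import numpy as np

# Sketch of the masked NMSE training loss in Eq. (18): estimate and target
# are multiplied by the same random sub-sampling mask m before the NMSE.

rng = np.random.default_rng(6)

def nmse(x_hat, x):
    return np.sum((x_hat - x) ** 2) / np.sum(x ** 2)

def masked_nmse(x_hat, x, keep=0.01, rng=rng):
    m = (rng.random(x.shape) < keep).astype(x.dtype)   # shared random mask m
    return nmse(m * x_hat, m * x)

x = rng.standard_normal((64, 64))                 # toy target
x_hat = x + 0.1 * rng.standard_normal((64, 64))   # toy model estimate
print(nmse(x_hat, x), masked_nmse(x_hat, x))
```

Because a fresh mask is drawn each step, the masked loss is an unbiased but noisy estimate of the full loss, which is the gradient-noise regularization effect the paper describes.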
The models consisted of 10 invertible layers with a fanned downsampling structure at each time step; no parameter sharing was applied across time steps, and the loss was only evaluated on the last time step. The only difference between the 2D and 3D model was that the former used 2D convolutions and the latter used 3D convolutions. More details on the model architectures can be found in Appendix B. As a model for g_t we chose, for simplicity,

g_t(∇D(d, A(Ψ(η′_t)))) = [∇D(d, A(Ψ(η′_t))); 0_{D−2}]    (19)

where 0_{D−2} is a zero-filled tensor with D−2 channels, D being the number of channels in s_t; this is the simplest model for g we could use. The gradient information will be mixed in the downstream processing of h. We chose the number of channels in the machine state (η, s) to be 64.

Figure 3: Reconstructions of a central slice in volume 'file1001458.h5' from the validation set. Top: 4x acceleration (Target Image, U-Net, RIM, i-RIM, i-RIM 3D). Bottom: 8x acceleration (Target Image, U-Net, RIM, i-RIM, i-RIM 3D). Zoom in for better viewing experience.

3.4 Results

The motivation of our work was to reduce the memory consumption of iterative inverse models. In order to emphasize the memory savings achieved with our approach, we compare the memory consumption of each model in table 1. Shown are data for the machine state, and memory consumption during training and testing on a single GPU. As can be seen, for the 3D model we have a machine state that occupies more than 1.25 GB of memory (7.5% of the available memory on a 16 GB GPU). It would have been impossible to train an RIM with this machine state given current hardware. Also note that memory consumption is mostly independent of network depth for both i-RIM models.
A small increase in memory consumption is due to the increase in the number of parameters in a deeper model. We have thus introduced a model for which network depth becomes mostly a computational consideration.

We compared all models on the validation and test set using the metrics suggested in Zbontar et al. [28]. A summary of this comparison can be found in table 2. Both i-RIM models consistently outperform the baselines. At the time of this writing, all three models sit on top of the challenge's Single-Coil leaderboard.² The i-RIM 3D shows almost as good performance as its 2D counterpart, and we believe that with more engineering and longer training it has the potential to outperform the 2D model. A qualitative assessment of reconstructions of a single slice can be found in figure 3. We chose this slice because it contains a lot of detail, which emphasizes the differences across models.

4 Discussion

We proposed a new approach to address the memory issues of training iterative inverse models using invertible neural networks. This enabled us to train very deep models on a large-scale imaging task, which would have been impossible to do with earlier approaches. The resulting models learn to do state-of-the-art image reconstruction. We further introduced a new invertible layer that allows us to train our deep models in a stable way. Due to its structure, our proposed layer lends itself to structured prediction tasks, and we expect it to be useful in other such tasks as well. Because invertible neural networks have been predominantly used for unsupervised training, our approach naturally allows us to exploit these directions as well. In the future, we aim to train our models in an unsupervised or semi-supervised fashion.

² http://fastmri.org/leaderboards, Team Name: NeurIPS_Anon; Model aliases: RIM - model_a, i-RIM 2D - model_b, i-RIM 3D - model_c.
See Supplement for screenshot.\n\n8\n\n\fAcknowledgements\n\nThe authors are grateful for helpful comments from Matthan Caan, Clemens Kornd\u00f6rfer, J\u00f6rn\nJacobsen, Mijung Park, Nicola Pezzotti, and Isabel Valera.\nPatrick Putzky is supported by the Netherlands Organisation for Scienti\ufb01c Research (NWO) and the\nNetherlands Institute for Radio Astronomy (ASTRON) through the big bang, big data grant.\n\nReferences\n\n[1] Emmanuel J. Candes and Terence Tao. Near-Optimal Signal Recovery From Random Pro-\njections: Universal Encoding Strategies? IEEE Transactions on Information Theory, 52(12):\n5406\u20135425, dec 2006.\n\n[2] Patrick Putzky and Max Welling. Recurrent inference machines for solving inverse problems.\n\narXiv preprint arXiv:1706.04008, 2017.\n\n[3] Yunjin Chen, Wei Yu, and Thomas Pock. On learning optimized reaction diffusion processes\nfor effective image restoration. In Proceedings of the IEEE Conference on Computer Vision\nand Pattern Recognition, pages 5261\u20135269, 2015.\n\n[4] Shenlong Wang, Sanja Fidler, and Raquel Urtasun. Proximal deep structured models.\n\nAdvances in Neural Information Processing Systems, pages 865\u2013873, 2016.\n\nIn\n\n[5] Shuai Zheng, Sadeep Jayasumana, Bernardino Romera-Paredes, Vibhav Vineet, Zhizhong Su,\nDalong Du, Chang Huang, and Philip H. S. Torr. Conditional Random Fields as Recurrent\nNeural Networks. In International Conference on Computer Vision (ICCV), page 16, feb 2015.\n\n[6] Uwe Schmidt and Stefan Roth. Shrinkage Fields for Effective Image Restoration. In 2014 IEEE\nConference on Computer Vision and Pattern Recognition, pages 2774\u20132781. IEEE, jun 2014.\n\n[7] Uwe Schmidt, Jeremy Jancsary, Sebastian Nowozin, Stefan Roth, and Carsten Rother. Cascades\nof regression tree \ufb01elds for image restoration. IEEE Transactions on Pattern Analysis and\nMachine Intelligence, 38(4):677\u2013689, 2016.\n\n[8] Karol Gregor and Yann LeCun. Learning Fast Approximations of Sparse Coding. 
In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 399–406, 2010.

[9] Kai Lønning, Patrick Putzky, Jan-Jakob Sonke, Liesbeth Reneman, Matthan W. A. Caan, and Max Welling. Recurrent inference machines for reconstructing heterogeneous MRI data. Medical Image Analysis, 2019.

[10] Jonas Adler and Ozan Öktem. Solving ill-posed inverse problems using iterative deep neural networks. Inverse Problems, 33(12):124007, nov 2017. doi: 10.1088/1361-6420/aa9581. URL https://doi.org/10.1088%2F1361-6420%2Faa9581.

[11] Kerstin Hammernik, Teresa Klatzer, Erich Kobler, Michael P. Recht, Daniel K. Sodickson, Thomas Pock, and Florian Knoll. Learning a variational network for reconstruction of accelerated MRI data. Magnetic Resonance in Medicine, 79(6):3055–3071, 2018. doi: 10.1002/mrm.26977. URL https://onlinelibrary.wiley.com/doi/abs/10.1002/mrm.26977.

[12] Jo Schlemper, Jose Caballero, Joseph V. Hajnal, Anthony Price, and Daniel Rueckert. A deep cascade of convolutional neural networks for MR image reconstruction. In Marc Niethammer, Martin Styner, Stephen Aylward, Hongtu Zhu, Ipek Oguz, Pew-Thian Yap, and Dinggang Shen, editors, Information Processing in Medical Imaging, pages 647–658, Cham, 2017. Springer International Publishing.

[13] Emmanuel J. Candès, Justin K. Romberg, and Terence Tao. Stable signal recovery from incomplete and inaccurate measurements. Communications on Pure and Applied Mathematics, 59(8):1207–1223, aug 2006.

[14] Aidan N. Gomez, Mengye Ren, Raquel Urtasun, and Roger B. Grosse. The reversible residual network: Backpropagation without storing activations. In I. Guyon, U. V. Luxburg, S. Bengio, H. Wallach, R. Fergus, S. Vishwanathan, and R. Garnett, editors, Advances in Neural Information Processing Systems 30, pages 2214–2224. Curran Associates, Inc., 2017.
URL http://papers.nips.cc/paper/6816-the-reversible-residual-network-backpropagation-without-storing-activations.pdf.

[15] Jörn-Henrik Jacobsen, Arnold W. M. Smeulders, and Edouard Oyallon. i-RevNet: Deep invertible networks. In International Conference on Learning Representations, 2018. URL https://openreview.net/forum?id=HJsjkMb0Z.

[16] Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.

[17] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. In International Conference on Learning Representations, 2017. URL https://openreview.net/forum?id=HkpbnH9lx.

[18] Durk P. Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In S. Bengio, H. Wallach, H. Larochelle, K. Grauman, N. Cesa-Bianchi, and R. Garnett, editors, Advances in Neural Information Processing Systems 31, pages 10215–10224. Curran Associates, Inc., 2018. URL http://papers.nips.cc/paper/8224-glow-generative-flow-with-invertible-1x1-convolutions.pdf.

[19] Jens Behrmann, Will Grathwohl, Ricky T. Q. Chen, David Duvenaud, and Jörn-Henrik Jacobsen. Invertible residual networks, 2018.

[20] Emiel Hoogeboom, Rianne van den Berg, and Max Welling. Emerging convolutions for generative normalizing flows, 2019.

[21] Jakub M. Tomczak and Max Welling. Improving variational auto-encoders using Householder flow, 2016.

[22] Yann N. Dauphin, Angela Fan, Michael Auli, and David Grangier. Language modeling with gated convolutional networks. In Proceedings of the 34th International Conference on Machine Learning - Volume 70, ICML'17, pages 933–941. JMLR.org, 2017. URL http://dl.acm.org/citation.cfm?id=3305381.3305478.

[23] Tim Salimans and Durk P. Kingma. Weight normalization: A simple reparameterization to accelerate training of deep neural networks. In D. D. Lee, M. Sugiyama, U. V.
Luxburg, I. Guyon, and R. Garnett, editors, Advances in Neural Information Processing Systems 29, pages 901–909. Curran Associates, Inc., 2016. URL http://papers.nips.cc/paper/6114-weight-normalization-a-simple-reparameterization-to-accelerate-training-of-deep-neural-networks.pdf.

[24] Lynton Ardizzone, Jakob Kruse, Carsten Rother, and Ullrich Köthe. Analyzing inverse problems with invertible neural networks. In International Conference on Learning Representations, 2019. URL https://openreview.net/forum?id=rJed6j0cKX.

[25] Kai Lønning, Patrick Putzky, Matthan W. A. Caan, and Max Welling. Recurrent inference machines for accelerated MRI reconstruction. In 1st Medical Imaging with Deep Learning (MIDL) Conference.

[26] Warren R. Morningstar, Yashar D. Hezaveh, Laurence Perreault Levasseur, Roger D. Blandford, Philip J. Marshall, Patrick Putzky, and Risa H. Wechsler. Analyzing interferometric observations of strong gravitational lenses with recurrent and convolutional neural networks. arXiv preprint arXiv:1808.00011, 2018.

[27] Warren R. Morningstar, Laurence Perreault Levasseur, Yashar D. Hezaveh, Roger Blandford, Phil Marshall, Patrick Putzky, Thomas D. Rueter, Risa Wechsler, and Max Welling. Data-driven reconstruction of gravitationally lensed galaxies using recurrent inference machines. arXiv preprint arXiv:1901.01359, 2019.

[28] Jure Zbontar, Florian Knoll, Anuroop Sriram, Matthew J. Muckley, Mary Bruno, Aaron Defazio, Marc Parente, Krzysztof J. Geras, Joe Katsnelson, Hersh Chandarana, et al. fastMRI: An open dataset and benchmarks for accelerated MRI. arXiv preprint arXiv:1811.08839, 2018.

[29] Patrick Putzky, Dimitrios Karkalousos, Jonas Teuwen, Nikita Miriakov, Bart Bakker, Matthan Caan, and Max Welling. i-RIM applied to the fastMRI challenge, 2019.

[30] Zhou Wang, Alan C. Bovik, Hamid R. Sheikh, and Eero P. Simoncelli. Image quality assessment: From error visibility to structural similarity.
IEEE Transactions on Image Processing, 13(4):600–612, 2004.

[31] Junyoung Chung, Caglar Gulcehre, KyungHyun Cho, and Yoshua Bengio. Empirical evaluation of gated recurrent neural networks on sequence modeling. arXiv preprint arXiv:1412.3555, 2014.