{"title": "Input-Output Equivalence of Unitary and Contractive RNNs", "book": "Advances in Neural Information Processing Systems", "page_first": 15368, "page_last": 15378, "abstract": "Unitary recurrent neural networks (URNNs) have been proposed as a method to overcome the vanishing and exploding gradient problem in modeling data with long-term dependencies. A basic question is how restrictive is the unitary constraint on the possible input-output mappings of such a network? This works shows that for any contractive RNN with ReLU activations, there is a URNN with at most twice the number of hidden states and the identical input-output mapping. Hence, with ReLU activations, URNNs are as expressive as general RNNs. In contrast, for certain smooth activations, it is shown that the input-output mapping of an RNN cannot be matched with a URNN, even with an arbitrary number of states. The theoretical results are supported by experiments on modeling of slowly-varying dynamical systems.", "full_text": "Input-Output Equivalence of Unitary and\n\nContractive RNNs\n\nMelikasadat Emami\n\nMojtaba Sahraee-Ardakan\n\nSundeep Rangan\n\nDept. ECE\n\nUCLA\n\nDept. ECE\n\nNYU\n\nDept. ECE\n\nUCLA\n\nemami@ucla.edu\n\nmsahraee@ucla.edu\n\nsrangan@nyu.edu\n\nAlyson K. Fletcher\n\nDept. Statistics\n\nUCLA\n\nakfletcher@ucla.edu\n\nAbstract\n\nUnitary recurrent neural networks (URNNs) have been proposed as a method to\novercome the vanishing and exploding gradient problem in modeling data with\nlong-term dependencies. A basic question is how restrictive is the unitary constraint\non the possible input-output mappings of such a network? This work shows that\nfor any contractive RNN with ReLU activations, there is a URNN with at most\ntwice the number of hidden states and the identical input-output mapping. Hence,\nwith ReLU activations, URNNs are as expressive as general RNNs. 
In contrast, for\ncertain smooth activations, it is shown that the input-output mapping of an RNN\ncannot be matched with a URNN, even with an arbitrary number of states. The\ntheoretical results are supported by experiments on modeling of slowly-varying\ndynamical systems.\n\n1\n\nIntroduction\n\nRecurrent neural networks (RNNs) \u2013 originally proposed in the late 1980s [20, 6] \u2013 refer to a widely-\nused and powerful class of models for time series and sequential data. In recent years, RNNs have\nbecome particularly important in speech recognition [9, 10] and natural language processing [5, 2, 24]\ntasks.\nA well-known challenge in training recurrent neural networks is the vanishing and exploding gradient\nproblem [3, 18]. RNNs have a transition matrix that maps the hidden state at one time to the next time.\nWhen the transition matrix has an induced norm greater than one, the RNN may become unstable.\nIn this case, small perturbations of the input at some time can result in a change in the output that\ngrows exponentially over the subsequent time. This instability leads to a so-called exploding gradient.\nConversely, when the norm is less than one, perturbations can decay exponentially so inputs at one\ntime have negligible effect in the distant future. As a result, the loss surface associated with RNNs can\nhave steep walls that may be dif\ufb01cult to minimize. Such problems are particularly acute in systems\nwith long-term dependencies, where the output sequence can depend strongly on the input sequence\nmany time steps in the past.\nUnitary RNNs (URNNs) [1] is a simple and commonly-used approach to mitigate the vanishing\nand exploding gradient problem. The basic idea is to restrict the transition matrix to be unitary (an\northogonal matrix for the real-valued case). The unitary transitional matrix is then combined with\na non-expansive activation such as a ReLU or sigmoid. 
As a result, the overall transition mapping\ncannot amplify the hidden states, thereby eliminating the exploding gradient problem. In addition,\nsince all the singular values of a unitary matrix equal 1, the transition matrix does not attenuate the\nhidden state, potentially mitigating the vanishing gradient problem as well. (Due to the activation, the\nhidden state may still be attenuated.) Some early work in URNNs suggested that they could be more\neffective than other methods, such as long short-term memory (LSTM) architectures and standard\nRNNs, for certain learning tasks involving long-term dependencies [13, 1] \u2013 see a short summary\nbelow.\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\nAlthough URNNs may improve the stability of the network for the purpose of optimization, a basic\nissue with URNNs is that the unitary constraint may potentially reduce the set of input-output mappings\nthat the network can model. This paper seeks to rigorously characterize how restrictive the unitary\nconstraint is on an RNN. We evaluate this restriction by comparing the set of input-output mappings\nachievable with URNNs with the set of mappings from all RNNs. As described below, we restrict our\nattention to RNNs that are contractive in order to avoid unstable systems.\nWe show three key results:\n\n1. Given any contractive RNN with n hidden states and ReLU activations, there exists a URNN\nwith at most 2n hidden states and the identical input-output mapping.\n\n2. This result is tight in the sense that, given any n > 0, there exists at least one contractive\nRNN such that any URNN with the same input-output mapping must have at least 2n states.\n\n3. The equivalence of URNNs and RNNs depends on the activation. 
For example, we show\nthat there exists a contractive RNN with sigmoid activations such that there is no URNN\nwith any \ufb01nite number of states that exactly matches the input-output mapping.\n\nThe implication of this result is that, for RNNs with ReLU activations, there is no loss in the\nexpressiveness of model when imposing the unitary constraint. As we discuss below, the penalty is a\ntwo-fold increase in the number of parameters.\nOf course, the expressiveness of a class of models is only one factor in their real performance. Based\non these results alone, one cannot determine if URNNs will outperform RNNs in any particular task.\nEarlier works have found examples where URNNs offer some bene\ufb01ts over LSTMs and RNNs [1, 28].\nBut in the simulations below concerning modeling slowly-varying nonlinear dynamical systems, we\nsee that URNNs with 2n states perform approximately equally to RNNs with n states.\nTheoretical results on generalization error are an active subject area in deep neural networks. Some\nmeasures of model complexity such as [17] are related to the spectral norm of the transition matrices.\nFor RNNs with non-contractive matrices, these complexity bounds will grow exponentially with the\nnumber of time steps. In contrast, since unitary matrices can bound the generalization error, this work\ncan also relate to generalizability.\n\nPrior work\n\nThe vanishing and exploding gradient problem in RNNs has been known almost as early as RNNs\nthemselves [3, 18]. It is part of a larger problem of training models that can capture long-term\ndependencies, and several proposed methods address this issue. Most approaches use some form\nof gate vectors to control the information \ufb02ow inside the hidden states, the most widely-used being\nLSTM networks [11]. Other gated models include Highway networks [21] and gated recurrent units\n(GRUs) [4]. 
L1/L2 penalization on gradient norms and gradient clipping were proposed to solve the\nexploding gradient problem in [18]. With L1/L2 penalization, capturing long-term dependencies\nis still challenging since the regularization term quickly kills the information in the model. A\nmore recent work [19] has successfully trained very deep networks by carefully adjusting the initial\nconditions to impose an approximate unitary structure over many layers.\nUnitary evolution RNNs (URNNs) are a more recent approach first proposed in [1]. Orthogonal\nconstraints were also considered in the context of associative memories [27]. One of the technical\ndifficulties is to efficiently parametrize the set of unitary matrices. The numerical simulations in this\nwork focus on relatively small networks, where the parameterization is not a significant computational\nissue. Nevertheless, for larger numbers of hidden states, several approaches have been proposed.\nThe model in [1] parametrizes the transition matrix as a product of reflection, diagonal, permutation,\nand Fourier transform matrices. This model spans a subspace of the whole unitary space, thereby\nlimiting the expressive power of RNNs. The work [28] overcomes this issue by optimizing over full-capacity unitary matrices.\n\nFigure 1: Recurrent Neural Network (RNN) model. The network computes h(k) = \u03c6(Wh(k\u22121) + Fx(k) + b) and y(k) = Ch(k), shown unfolded over time.\n\nA key limitation in this work, however, is that the projection of weights\nonto the unitary space is not computationally efficient. A tunable, efficient parametrization of\nunitary matrices is proposed in [13]. This model provides O(1) computational complexity per\nparameter. 
The unitary matrix is represented as a product of rotation matrices and a diagonal matrix.\nBy grouping speci\ufb01c rotation matrices, the model provides tunability of the span of the unitary space\nand enables using different capacities for different tasks. Combining the parametrization in [13]\nfor unitary matrices and the \u201cforget\u201d ability of the GRU structure, [4, 12] presented an architecture\nthat outperforms conventional models in several long-term dependency tasks. Other methods such\nas orthogonal RNNs proposed by [16] showed that the unitary constraint is a special case of the\northogonal constraint. By representing an orthogonal matrix as a product of Householder re\ufb02ectors,\nwe are able span the entire space of orthogonal matrices. Imposing hard orthogonality constraints on\nthe transition matrix limits the expressiveness of the model and speed of convergence and performance\nmay degrade [26].\n\n2 RNNs and Input-Output Equivalence\n\nRNNs. We consider recurrent neural networks (RNNs) representing sequence-to-sequence map-\npings of the form\n\nh(k) = \u03c6(Wh(k\u22121) + Fx(k) + b), h(\u22121) = h\u22121,\ny(k) = Ch(k),\n\n(1a)\n(1b)\n\nparameterized by \u0398 = (W, F, b, C, h\u22121). The system is shown in Fig. 1. The system maps\na sequence of inputs x(k) \u2208 Rm, k = 0, 1, . . . , T \u2212 1 to a sequence of outputs y(k) \u2208 Rp. In\nequation (1), \u03c6 is the activation function (e.g. sigmoid or ReLU); h(k) \u2208 Rn is an internal or hidden\nstate; W \u2208 Rn\u00d7n, F \u2208 Rn\u00d7m, and C \u2208 Rp\u00d7n are the hidden-to-hidden, input-to-hidden, and\nhidden-to-output weight matrices respectively; and b is the bias vector. We have considered the\ninitial condition, h\u22121, as part of the parameters, although we will often take h\u22121 = 0. Given a set of\nparameters \u0398, we will let\n\ny = G(x, \u0398)\n\n(2)\ndenote the resulting sequence-to-sequence mapping. 
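As a concrete reference point, the sequence-to-sequence mapping y = G(x, Theta) defined by (1)-(2) can be transcribed directly into NumPy. The sketch below is our own illustration (the name rnn_forward and the default ReLU activation are our choices, not the paper's):

```python
import numpy as np

def rnn_forward(x, W, F, b, C, h_init=None, phi=None):
    """Run the RNN (1): h(k) = phi(W h(k-1) + F x(k) + b), y(k) = C h(k).

    x has shape (T, m); the returned output sequence has shape (T, p).
    """
    if phi is None:
        phi = lambda z: np.maximum(z, 0.0)  # ReLU by default
    h = np.zeros(W.shape[0]) if h_init is None else h_init
    ys = []
    for xk in x:
        h = phi(W @ h + F @ xk + b)
        ys.append(C @ h)
    return np.stack(ys)
```

With phi set to the identity and W = 0, the output reduces to y(k) = CFx(k), which gives a convenient sanity check of the transcription.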
Note that the number of time samples, T, is\nfixed throughout our discussion.\nRecall [23] that a matrix W is unitary if W^H W = WW^H = I. When a unitary matrix is real-\nvalued, it is also called orthogonal. In this work, we will restrict our attention to real-valued matrices,\nbut still use the term unitary for consistency with the URNN literature. A Unitary RNN or URNN\nis simply an RNN (1) with a unitary state-to-state transition matrix W. A key property of unitary\nmatrices is that they are norm-preserving, meaning that ||Wh(k)||_2 = ||h(k)||_2. In the context of (1a),\nthe unitary constraint implies that the transition matrix does not amplify the state.\n\nEquivalence of RNNs. Our goal is to understand the extent to which the unitary constraint in\na URNN restricts the set of input-output mappings. To this end, we say that the RNNs for two\nparameters \u03981 and \u03982 are input-output equivalent if the sequence-to-sequence mappings are identical,\n\nG(x, \u03981) = G(x, \u03982) for all x = (x(0), . . . , x(T\u22121)).\n\n(3)\n\nThat is, for all input sequences x, the two systems have the same output sequence. Note that the\nhidden internal states h(k) in the two systems may be different. We will also say that two RNNs are\nequivalent on a set X of inputs if (3) holds for all x \u2208 X.\nIt is important to recognize that input-output equivalence does not imply that the parameters \u03981 and\n\u03982 are identical. For example, consider the case of linear RNNs where the activation in (1) is the\nidentity, \u03c6(z) = z. Then, for any invertible T, the transformation\n\nW \u2192 TWT^\u22121, C \u2192 CT^\u22121, F \u2192 TF, h\u22121 \u2192 Th\u22121,\n\n(4)\n\nresults in the same input-output mapping. 
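The invariance of the input-output mapping under the transformation (4) is easy to check numerically in the linear case. The following is our own sketch (all variable names are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, p, T_len = 4, 2, 3, 20

W = rng.normal(size=(n, n)) / (2 * n)   # small enough to keep the system stable
F = rng.normal(size=(n, m))
C = rng.normal(size=(p, n))

def linear_rnn(W, F, C, x):
    # the linear special case of (1): phi(z) = z, b = 0, zero initial condition
    h, ys = np.zeros(n), []
    for xk in x:
        h = W @ h + F @ xk
        ys.append(C @ h)
    return np.stack(ys)

x = rng.normal(size=(T_len, m))
y1 = linear_rnn(W, F, C, x)

# apply (4): W -> T W T^-1, F -> T F, C -> C T^-1 for an invertible T
T_mat = rng.normal(size=(n, n)) + 3 * np.eye(n)
Ti = np.linalg.inv(T_mat)
y2 = linear_rnn(T_mat @ W @ Ti, T_mat @ F, C @ Ti, x)

assert np.allclose(y1, y2)   # identical input-output mapping
```

The transformed system runs with internal states Th(k) rather than h(k), yet produces the same output sequence, which is exactly the freedom the equivalence arguments exploit.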
However, the internal states h(k) will be mapped to Th(k).\nThe fact that many parameters can lead to identical input-output mappings will be key to finding\nequivalent RNNs and URNNs.\n\nContractive RNNs. The spectral norm [23] of a matrix W is the maximum gain of the matrix,\n\n||W|| := max_{h \u2260 0} ||Wh||_2 / ||h||_2.\n\nIn an RNN (1), the spectral norm ||W|| measures how much the transition\nmatrix can amplify the hidden state. For URNNs, ||W|| = 1. We will say an RNN is contractive if\n||W|| < 1, expansive if ||W|| > 1, and non-expansive if ||W|| \u2264 1. In the sequel, we will restrict\nour attention to contractive and non-expansive RNNs. In general, given an expansive RNN, we\ncannot expect to find an equivalent URNN. For example, suppose the hidden state h(k) is scalar. Then, the\ntransition matrix W is also a scalar W = w, and the RNN is expansive if and only if |w| > 1. Now suppose\nthe activation is a ReLU \u03c6(h) = max{0, h}. Then, it is possible that a constant input x(k) = x0 can\nresult in an output that grows exponentially with time: y(k) = const \u00d7 w^k. Such an exponential\nincrease is not possible with a URNN. We consider only non-expansive RNNs in the remainder of\nthe paper. Some of our results will also need the assumption that the activation function \u03c6(\u00b7) in (1) is\nnon-expansive:\n\n||\u03c6(x) \u2212 \u03c6(y)||_2 \u2264 ||x \u2212 y||_2, for all x and y.\n\nThis property is satisfied by the two most common activations, sigmoids and ReLUs.\n\nEquivalence of Linear RNNs. To get an intuition of equivalence, it is useful to briefly review the\nconcept in the case of linear systems [14]. Linear systems are RNNs (1) in the special case where the\nactivation function is the identity, \u03c6(z) = z; the initial condition is zero, h\u22121 = 0; and the bias is zero,\nb = 0. 
In this case, it is well-known that two systems are input-output equivalent if and only if they\nhave the same transfer function,\n\nH(s) := C(sI \u2212 W)^\u22121 F.\n\n(5)\n\nIn the case of scalar inputs and outputs, H(s) is a rational function of the complex variable s with\nnumerator and denominator degree of at most n, the dimension of the hidden state h(k). Any state-\nspace system (1) that achieves a particular transfer function is called a realization of the transfer\nfunction. Hence two linear systems are equivalent if and only if they are realizations of the same\ntransfer function.\nA realization is called minimal if it is not equivalent to any linear system with fewer hidden states.\nA basic property of realizations of linear systems is that they are minimal if and only if they are\ncontrollable and observable. The formal definition is in any linear systems text, e.g. [14]. Loosely,\ncontrollability implies that all internal states can be reached with an appropriate input, and observability\nimplies that all hidden states can be observed from the output. In the absence of controllability and\nobservability, some hidden states can be removed while maintaining input-output equivalence.\n\n3 Equivalence Results for RNNs with ReLU Activations\n\nOur first results consider contractive RNNs with ReLU activations. For the remainder of the section,\nwe will restrict our attention to the case of zero initial conditions, h(\u22121) = 0 in (1).\nTheorem 3.1 Let y = G(x, \u0398c) be a contractive RNN with ReLU activation and states of dimension\nn. Fix M > 0 and let X be the set of all sequences such that ||x(k)||_2 \u2264 M < \u221e for all k. Then\nthere exists a URNN with state dimension 2n and parameters \u0398u = (Wu, Fu, bu, Cu) such that for\nall x \u2208 X, G(x, \u0398c) = G(x, \u0398u). 
Hence the input-output mapping is matched for bounded inputs.\n\n4\n\n\fProof See Appendix A.\n\nTheorem 3.1 shows that for any contractive RNN with ReLU activations, there exists a URNN with at\nmost twice the number of hidden states and the identical input-output mapping. Thus, there is no loss\nin the set of input-output mappings with URNNs relative to general contractive RNNs on bounded\ninputs.\nThe penalty for using RNNs is the two-fold increase in state dimension, which in turn increases\nthe number of parameters to be learned. We can estimate this increase in parameters as follows:\nThe raw number of parameters for an RNN (1) with n hidden states, p outputs and m inputs is\nn2 +(p+m+1)n. However, for ReLU activations, the RNNs are equivalent under the transformations\n(4) using diagonal positive T. Hence, the number of degrees of freedom of a general RNN is at most\ndrnn = n2 + (p + m)n. We can compare this value to a URNN with 2n hidden states. The set of\n2n \u00d7 2n unitary W has 2n(2n \u2212 1)/2 degrees of freedom [22]. Hence, the total degrees of freedom\nin a URNN with 2n states is at most durnn = n(2n \u2212 1) + 2n(p + m). We conclude that a URNN\nwith 2n hidden states has slightly fewer than twice the number of parameters as an RNN with n\nhidden states.\nWe note that there are cases that the contractivity assumption is limiting, however, the limitations may\nnot always be prohibitive. We will see in our experiments that imposing the contractivity constraint\ncan improve learning for RNNs when models have suf\ufb01ciently large numbers of time steps. 
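The degrees-of-freedom comparison above (drnn = n^2 + (p + m)n for the RNN versus durnn = n(2n - 1) + 2n(p + m) for the 2n-state URNN) can be tabulated directly; a small sketch of ours:

```python
def dof_rnn(n, m, p):
    # degrees of freedom of an RNN with n hidden states, m inputs, p outputs
    return n * n + (p + m) * n

def dof_urnn_2n(n, m, p):
    # a 2n x 2n orthogonal W has 2n(2n - 1)/2 free parameters [22], plus Fu and Cu
    return n * (2 * n - 1) + 2 * n * (p + m)

for n, m, p in [(4, 2, 2), (8, 2, 2), (16, 3, 1)]:
    # slightly fewer than twice the parameters, as stated in the text
    assert dof_urnn_2n(n, m, p) < 2 * dof_rnn(n, m, p)
```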
Some\nrelated results where bounding the singular values help with the performance can be found in [26].\nWe next show a converse result.\n\nTheorem 3.2 For every positive n, there exists a contractive RNN with ReLU nonlinearity and state\ndimension n such that every equivalent URNN has at least 2n states.\n\nProof See Appendix B.1 in the Supplementary Material.\n\nThe result shows that the 2n achievability bound in Theorem 3.1 is tight, at least in the worst case. In\naddition, the RNN constructed in the proof of Theorem 3.2 is not particularly pathological. We will\nshow in our simulations in Section 5 that URNNs typically need twice the number of hidden states to\nachieve comparable modeling error as an RNN.\n\n4 Equivalence Results for RNNs with Sigmoid Activations\n\nEquivalence between RNNs and URNNs depends on the particular activation. Our next result shows\nthat with sigmoid activations, URNNs are, in general, never exactly equivalent to RNNs, even with\nan arbitrary number of states.\nWe need the following technical de\ufb01nition: Consider an RNN (1) with a standard sigmoid activation\n\u03c6(z) = 1/(1 + e\u2212z). If W is non-expansive, then a simple application of the contraction mapping\nprinciple shows that for any constant input x(k) = x\u2217, there is a \ufb01xed point in the hidden state\nh\u2217 = \u03c6(Wh\u2217 + Fx\u2217 + b). 
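In practice, the fixed point h* can be found by simply iterating the map h -> phi(Wh + Fx* + b): when W is contractive and phi is the (non-expansive) sigmoid, the iteration converges by the contraction mapping principle. A sketch of ours, with illustrative names:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def hidden_fixed_point(W, F, b, x_star, tol=1e-12, max_iter=10000):
    """Iterate h <- sigmoid(W h + F x* + b) until convergence."""
    h = np.zeros(W.shape[0])
    for _ in range(max_iter):
        h_new = sigmoid(W @ h + F @ x_star + b)
        if np.linalg.norm(h_new - h) < tol:
            return h_new
        h = h_new
    return h

rng = np.random.default_rng(1)
n, m = 5, 2
A = rng.normal(size=(n, n))
W = 0.9 * A / np.linalg.norm(A, 2)        # contractive: ||W|| = 0.9 < 1
F, b = rng.normal(size=(n, m)), rng.normal(size=n)
x_star = rng.normal(size=m)

h_star = hidden_fixed_point(W, F, b, x_star)
# h* satisfies the fixed-point equation h* = phi(W h* + F x* + b)
assert np.allclose(h_star, sigmoid(W @ h_star + F @ x_star + b))
```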
We will say that the RNN is controllable and observable at x\u2217 if the\nlinearization of the RNN around (x\u2217, h\u2217) is controllable and observable.\nTheorem 4.1 There exists a contractive RNN with sigmoid activation function \u03c6 with the following\nproperty: If a URNN is controllable and observable at any point x\u2217, then the URNN cannot be\nequivalent to the RNN for inputs x in the neighborhood of x\u2217.\nProof See Appendix B.2 in the Supplementary Material.\n\nThe result provides a converse on equivalence: Contractive RNNs with sigmoid activations are not in\ngeneral equivalent to URNNs, even if we allow the URNN to have an arbitrary number of hidden\nstates. Of course, the approximation error between the URNN and RNN may go to zero as the URNN\nhidden dimension goes to in\ufb01nity (e.g., similar to the approximation results in [8]). However, exact\nequivalence is not possible with sigmoid activations, unlike with ReLU activations. Thus, there is\nfundamental difference in equivalence for smooth and non-smooth activations.\nWe note that the fundamental distinction between Theorem 3.1 and the opposite result in Theorem 4.1\nis that the activation is smooth with a positive slope. With such activations, you can linearize the\n\n5\n\n\fsystem, and the eigenvalues of the transition matrix become visible in the input-output mapping. In\ncontrast, ReLUs can zero out states and suppress these eigenvalues. This is a key insight of the paper\nand a further contribution in understanding nonlinear systems.\n\n5 Numerical Simulations\n\nIn this section, we numerically compare the modeling ability of RNNs and URNNs where the true\nsystem is a contractive RNN with long-term dependencies. Speci\ufb01cally, we generate data from\nmultiple instances of a synthetic RNN where the parameters in (1) are randomly generated. 
For the\ntrue system, we use m = 2 input units, p = 2 output units, and n = 4 hidden units at each time step.\nThe matrices F, C and b are generated as i.i.d. Gaussians. We use a random transition matrix,\n\nW = I \u2212 \u03b5 A^T A/||A||^2,\n\n(6)\n\nwhere A is an i.i.d. Gaussian matrix and \u03b5 is a small value, taken here to be \u03b5 = 0.01. The matrix (6)\nwill be contractive with singular values in (1 \u2212 \u03b5, 1). By making \u03b5 small, the states of the system\nwill vary slowly, hence creating long-term dependencies. In analogy with linear systems, the time\nconstant will be approximately 1/\u03b5 = 100 time steps. We use ReLU activations. To avoid degenerate\ncases where the outputs are always zero, the biases b are adjusted to ensure that each hidden state\nis active a target 60% of the time, using a similar procedure as in [7].\nThe trials have T = 1000 time steps, which corresponds to 10 times the time constant 1/\u03b5 = 100 of\nthe system. We added noise to the output of this system such that the signal-to-noise ratio (SNR) is\n15 dB or 20 dB. In each trial, we generate 700 training samples and 300 test sequences from this\nsystem.\nGiven the input and the output data of this contractive RNN, we attempt to learn the system with: (i)\nstandard RNNs, (ii) URNNs, and (iii) LSTMs. The hidden states in the model are varied in the range\nn = [2, 4, 6, 8, 10, 12, 14], which includes values both above and below the true number of hidden\nstates n_true = 4. We used mean-squared error as the loss function. Optimization is performed using\nthe Adam [15] optimizer with a batch size of 10 and a learning rate of 0.01. All models are implemented\nin the Keras package in TensorFlow. 
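The transition matrix (6) can be generated, and its singular values checked, in a few lines; we also sketch the SVD-based projection onto the orthogonal matrices that is used below for URNN training. This code is our own illustration, with eps = 0.01 as in the text:

```python
import numpy as np

rng = np.random.default_rng(0)
n, eps = 4, 0.01

# equation (6): W = I - eps * A^T A / ||A||^2, a slowly-varying contractive matrix
A = rng.normal(size=(n, n))
W = np.eye(n) - eps * (A.T @ A) / np.linalg.norm(A, 2) ** 2

s = np.linalg.svd(W, compute_uv=False)
# all singular values sit just below 1, within about eps
assert np.all(s >= 1 - eps - 1e-9) and np.all(s < 1.0)

def project_unitary(M):
    """Replace M by the nearest unitary (orthogonal) matrix via its SVD."""
    U, _, Vt = np.linalg.svd(M)
    return U @ Vt

Q = project_unitary(W)
assert np.allclose(Q.T @ Q, np.eye(n))
```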
The experiments are done over 30 realizations of the original\ncontractive system.\nFor the URNN learning, of all the proposed algorithms for enforcing the unitary constraints on\ntransition matrices during training [13, 28, 1, 16], we chose to project the transition matrix onto the full\nspace of unitary matrices after each iteration using the singular value decomposition (SVD). Although\nSVD requires O(n^3) computation for each projection, for our choices of hidden states it performed\nfaster than the aforementioned methods.\nSince we have training noise and since optimization algorithms can get stuck in local minima, we\ncannot expect \u201cexact\u201d equivalence between the learned model and the true system as in the theorems. So,\ninstead, we look at the test error as a measure of the closeness of the learned model to the true system.\nFigure 2 on the left shows the test R^2 for a Gaussian i.i.d. input and output with SNR = 20 dB for\nRNNs, URNNs, and LSTMs. The red dashed line corresponds to the optimal R^2 achievable at the\ngiven noise level.\nNote that even though the true RNN has n_true = 4 hidden states, the RNN model does not obtain the\noptimal test R^2 at n = 4. This is not due to training noise, since the RNN is able to capture the full\ndynamics when we over-parametrize the system to n \u2248 8 hidden states. The test error in the RNN at\nlower numbers of hidden states is likely due to the optimization being caught in a local minimum.\nWhat is important for this work, though, is to compare the URNN test error with that of the RNN. We\nobserve that a URNN requires approximately twice the number of hidden states to obtain the same test\nerror as achieved by an RNN. To make this clear, the right plot shows the same performance data\nwith the number of states adjusted for the URNN. Since our theory indicates that a URNN with 2n hidden\nstates is as powerful as an RNN with n hidden states, we compare a URNN with 2n hidden units\ndirectly with an RNN with n hidden units. 
We call this the adjusted number of hidden units. We see that the\nURNN and RNN have similar test error when we appropriately scale the number of hidden units as\npredicted by the theory.\n\nFigure 2: Test R^2 on synthetic data for a Gaussian i.i.d. input and output SNR = 20 dB.\n\nFor completeness, the left plot in Figure 2 also shows the test error with an LSTM. It is important to\nnote that the URNN has almost the same performance as an LSTM with a considerably smaller number\nof parameters.\nFigure 3 shows similar results for the same task with SNR = 15 dB. For this task, the input is sparse\nGaussian i.i.d., i.e., Gaussian with some probability p = 0.02 and 0 with probability 1 \u2212 p. The left\nplot shows the R^2 vs. the number of hidden units for RNNs and URNNs, and the right plot shows the\nsame results once the number of hidden units for the URNN is adjusted.\nWe also compared the modeling ability of URNNs and RNNs using the Pixel-Permuted MNIST task.\nEach MNIST image is a 28 \u00d7 28 grayscale image with a label between 0 and 9. A fixed random\npermutation is applied to the pixels, and each pixel is fed to the network at each time step as the input;\nthe output is the predicted label for each image [1, 13, 26].\nWe evaluated various models on the Pixel-Permuted MNIST task using validation-based early stopping.\nWithout imposing a contractivity constraint during learning, the RNN is either unstable or requires a\nslow learning rate. Imposing a contractivity constraint improves the performance. Incidentally, using\na URNN improves the performance further. Thus, contractivity can improve learning for RNNs when\nmodels have sufficiently large numbers of time steps.\n\n6 Conclusion\n\nSeveral works empirically show that using unitary recurrent neural networks improves the stability\nand performance of RNNs. In this work, we study how restrictive it is to use URNNs instead of\nRNNs. 
We show that URNNs are at least as powerful as contractive RNNs in modeling input-output\nmappings if enough hidden units are used. More specifically, for any contractive RNN we explicitly\nconstruct a URNN with twice the number of states of the RNN and an identical input-output mapping.\nWe also provide converse results on the number of states and the activation function needed for exact\nmatching. We emphasize that although it has been shown that URNNs outperform standard RNNs\nand LSTMs in many tasks that involve long-term dependencies, our main goal in this paper is to show\nthat from an approximation viewpoint, URNNs are as expressive as general contractive RNNs. At the cost of\na two-fold increase in the number of parameters, we can use the stability benefits that the unitary constraint brings to the\noptimization of neural networks.\n\nFigure 3: Test R^2 on synthetic data for a Gaussian i.i.d. input and output SNR = 15 dB.\n\nFigure 4: Accuracy on the Permuted MNIST task for various models trained with RMSProp, validation-\nbased early termination, and initial learning rate lr. (1) URNN model: RNN model with unitary\nconstraint; (2) ContRNN: RNN with a contractivity constraint; (3 & 4) RNN model with no con-\ntractivity or unitary constraint (two learning rates). We see contractivity improves performance, and\nunitary constraints improve performance further.\n\nAcknowledgements\n\nThe work of M. Emami, M. Sahraee-Ardakan, and A. K. Fletcher was supported in part by the National\nScience Foundation under Grants 1254204 and 1738286, and the Office of Naval Research under\nGrant N00014-15-1-2677. S. Rangan was supported in part by the National Science Foundation\nunder Grants 1116589, 1302336, and 1547332, NIST, the industrial affiliates of NYU WIRELESS,\nand the SRC.\n\nA Proof of Theorem 3.1\n\nThe basic idea is to construct a URNN with 2n states such that the first n states match the states of the RNN\nand the last n states are always zero. 
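The construction can also be checked numerically. The sketch below is our own NumPy transcription: it builds the 2n-state URNN from a random contractive ReLU RNN, completing the first n orthonormal columns [Wc; W3] to a full orthonormal basis via an SVD of the complement projector (one arbitrary choice of W2, W4), and verifies that the two systems produce identical outputs on bounded inputs:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def run(W, F, b, C, x):
    h, ys = np.zeros(W.shape[0]), []
    for xk in x:
        h = relu(W @ h + F @ xk + b)
        ys.append(C @ h)
    return np.stack(ys)

rng = np.random.default_rng(0)
n, m, p, T, M = 3, 2, 2, 50, 1.0

# a random contractive ReLU RNN
Wc = rng.normal(size=(n, n))
Wc *= 0.8 / np.linalg.norm(Wc, 2)            # ||Wc|| = 0.8 < 1
Fc, bc = rng.normal(size=(n, m)), rng.normal(size=n)
Cc = rng.normal(size=(p, n))

# state bound (7): Mh = (||Fc|| M + ||bc||_2) / (1 - rho)
rho = np.linalg.norm(Wc, 2)
Mh = (np.linalg.norm(Fc, 2) * M + np.linalg.norm(bc)) / (1 - rho)

# W3 with W3^T W3 = I - Wc^T Wc (symmetric square root via eigh)
vals, vecs = np.linalg.eigh(np.eye(n) - Wc.T @ Wc)
W3 = vecs @ np.diag(np.sqrt(np.clip(vals, 0.0, None))) @ vecs.T

# first n columns [Wc; W3] are orthonormal; complete to a basis of R^2n
X = np.vstack([Wc, W3])
Pcomp = np.eye(2 * n) - X @ X.T              # projector onto the complement
U, s, _ = np.linalg.svd(Pcomp)
Wu = np.hstack([X, U[:, :n]])                # 2n x 2n orthogonal
assert np.allclose(Wu.T @ Wu, np.eye(2 * n))

# URNN parameters (8): Fu = [Fc; 0], bu = [bc; -Mh 1], Cu = [Cc 0]
Fu = np.vstack([Fc, np.zeros((n, m))])
bu = np.concatenate([bc, -Mh * np.ones(n)])
Cu = np.hstack([Cc, np.zeros((p, n))])

# identical outputs on inputs with ||x(k)||_2 <= M
x = rng.normal(size=(T, m))
x *= M / np.maximum(np.linalg.norm(x, axis=1, keepdims=True), M)
assert np.allclose(run(Wc, Fc, bc, Cc, x), run(Wu, Fu, bu, Cu, x))
```

The large negative bias b2 = -Mh keeps the last n states pinned at zero under the ReLU, which is the mechanism the proof formalizes.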
To this end, consider any contractive RNN,\n\nh_c^(k) = \u03c6(W_c h_c^(k\u22121) + F_c x^(k) + b_c), y^(k) = C_c h_c^(k),\n\nwhere h_c^(k) \u2208 R^n. Since W_c is contractive, we have ||W_c|| \u2264 \u03c1 for some \u03c1 < 1. Also, for a ReLU\nactivation, ||\u03c6(z)|| \u2264 ||z|| for all pre-activation inputs z. Hence,\n\n||h_c^(k)||_2 = ||\u03c6(W_c h_c^(k\u22121) + F_c x^(k) + b_c)||_2 \u2264 ||W_c h_c^(k\u22121) + F_c x^(k) + b_c||_2 \u2264 \u03c1 ||h_c^(k\u22121)||_2 + ||F_c|| ||x^(k)||_2 + ||b_c||_2.\n\nTherefore, with bounded inputs, ||x^(k)|| \u2264 M, the state is bounded,\n\n||h_c^(k)||_2 \u2264 (1/(1 \u2212 \u03c1)) [||F_c|| M + ||b_c||_2] =: M_h.\n\n(7)\n\nWe construct a URNN as,\n\nh_u^(k) = \u03c6(W_u h_u^(k\u22121) + F_u x^(k) + b_u), y^(k) = C_u h_u^(k),\n\nwhere the parameters are of the form,\n\nh_u = [h_1; h_2] \u2208 R^2n, W_u = [W_1, W_2; W_3, W_4], F_u = [F_c; 0], b_u = [b_c; b_2].\n\n(8)\n\nLet W_1 = W_c. Since ||W_c|| < 1, the matrix I \u2212 W_c^T W_c is positive semidefinite. Therefore, there exists W_3 such that\nW_3^T W_3 = I \u2212 W_c^T W_c. With this choice of W_3, the first n columns of W_u are orthonormal. Let\nthe columns [W_2; W_4] extend these to an orthonormal basis for R^2n. Then, the matrix W_u will be orthonormal.\nNext, let b_2 = \u2212M_h 1_{n\u00d71}, where M_h is defined in (7). We show by induction that for all k,\n\nh_1^(k) = h_c^(k), h_2^(k) = 0.\n\n(9)\n\nIf both systems are initialized at zero, (9) is satisfied at k = \u22121. Now, suppose this holds up to time\nk \u2212 1. 
Then,\n\nh_1^(k) = \u03c6(W_1 h_1^(k\u22121) + W_2 h_2^(k\u22121) + F_c x^(k) + b_c) = \u03c6(W_1 h_1^(k\u22121) + F_c x^(k) + b_c) = h_c^(k),\n\nwhere we have used the induction hypothesis that h_2^(k\u22121) = 0. For h_2^(k), note that\n\n||W_3 h_1^(k\u22121)||_\u221e \u2264 ||W_3 h_1^(k\u22121)||_2 \u2264 ||h_1^(k\u22121)||_2 \u2264 M_h,\n\n(10)\n\nwhere the last step follows from (7). Therefore,\n\nW_3 h_1^(k\u22121) + W_4 h_2^(k\u22121) + b_2 = W_3 h_1^(k\u22121) \u2212 M_h 1_{n\u00d71} \u2264 0.\n\n(11)\n\nHence, with ReLU activation, h_2^(k) = \u03c6(W_3 h_1^(k\u22121) + W_4 h_2^(k\u22121) + b_2) = 0. By induction, (9) holds\nfor all k. Then, if we define C_u = [C_c 0], the outputs of the URNN and RNN systems are\nidentical:\n\ny_u^(k) = C_u h_u^(k) = C_c h_1^(k) = y_c^(k).\n\nThis shows that the systems are equivalent.\n\nReferences\n\n[1] Martin Arjovsky, Amar Shah, and Yoshua Bengio. Unitary evolution recurrent neural networks.\nIn International Conference on Machine Learning, pages 1120\u20131128, 2016.\n\n[2] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. Neural machine translation by jointly\nlearning to align and translate. arXiv preprint arXiv:1409.0473, 2014.\n\n[3] Yoshua Bengio, Paolo Frasconi, and Patrice Simard. The problem of learning long-term\ndependencies in recurrent networks. In IEEE International Conference on Neural Networks,\npages 1183\u20131188. IEEE, 1993.\n\n[4] Kyunghyun Cho, Bart van Merrienboer, Dzmitry Bahdanau, and Yoshua Bengio. On the\nproperties of neural machine translation: Encoder\u2013decoder approaches. Proceedings of SSST-8,\nEighth Workshop on Syntax, Semantics and Structure in Statistical Translation, 2014.\n\n[5] Ronan Collobert, Jason Weston, L\u00e9on Bottou, Michael Karlen, Koray Kavukcuoglu, and Pavel\nKuksa. Natural language processing (almost) from scratch. 
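The construction above can be checked numerically: starting from a random contractive ReLU RNN, form $W_3$ as a symmetric square root of $I - W_c^{\mathsf{T}} W_c$, complete the first $n$ columns to an orthogonal $W_u$, and confirm that the two systems produce identical outputs. The following is a minimal NumPy sketch; the dimensions, the factor $0.9$ used to make $W_c$ contractive, and the use of `eigh`/`qr` for the square root and basis completion are illustrative choices, not prescribed by the proof.

```python
import numpy as np

rng = np.random.default_rng(0)
n, m, p, T = 4, 3, 2, 50   # state, input, output dims; sequence length

# Random contractive RNN: rescale so that ||W_c||_2 = 0.9 < 1 (illustrative).
W_c = rng.standard_normal((n, n))
W_c *= 0.9 / np.linalg.norm(W_c, 2)
F_c = rng.standard_normal((n, m))
b_c = rng.standard_normal(n)
C_c = rng.standard_normal((p, n))
relu = lambda z: np.maximum(z, 0.0)

# State bound (7): M_h = (||F_c|| M + ||b_c||) / (1 - rho) for ||x^(k)|| <= M.
M = 1.0
rho = np.linalg.norm(W_c, 2)
M_h = (np.linalg.norm(F_c, 2) * M + np.linalg.norm(b_c)) / (1.0 - rho)

# W_3 with W_3^T W_3 = I - W_c^T W_c (symmetric PSD square root via eigh).
lam, V = np.linalg.eigh(np.eye(n) - W_c.T @ W_c)
W_3 = V @ np.diag(np.sqrt(np.clip(lam, 0.0, None))) @ V.T

# First n columns of W_u are orthonormal; complete them to a basis of R^{2n}.
Q1 = np.vstack([W_c, W_3])                      # 2n x n, Q1^T Q1 = I
Q_full, _ = np.linalg.qr(Q1, mode="complete")   # full orthonormal basis
W_u = np.hstack([Q1, Q_full[:, n:]])            # [W_1 W_2; W_3 W_4]
assert np.allclose(W_u.T @ W_u, np.eye(2 * n))  # W_u is orthogonal

F_u = np.vstack([F_c, np.zeros((n, m))])
b_u = np.concatenate([b_c, -M_h * np.ones(n)])  # b_2 = -M_h * 1
C_u = np.hstack([C_c, np.zeros((p, n))])

# Drive both systems from zero state with bounded inputs ||x^(k)||_2 <= M.
X = rng.standard_normal((T, m))
X /= np.maximum(1.0, np.linalg.norm(X, axis=1, keepdims=True) / M)
h_c, h_u = np.zeros(n), np.zeros(2 * n)
Y_c, Y_u = [], []
for x in X:
    h_c = relu(W_c @ h_c + F_c @ x + b_c)
    h_u = relu(W_u @ h_u + F_u @ x + b_u)
    Y_c.append(C_c @ h_c)
    Y_u.append(C_u @ h_u)
    assert np.allclose(h_u[n:], 0.0)            # second block stays at zero

assert np.allclose(Y_c, Y_u)                    # identical input-output maps
```

The second state block never activates because its pre-activation $W_3 h_1^{(k-1)} - M_h 1$ is kept nonpositive by the bound (7), exactly as in the induction step.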