{"title": "Learning nonlinear level sets for dimensionality reduction in function approximation", "book": "Advances in Neural Information Processing Systems", "page_first": 13220, "page_last": 13229, "abstract": "We developed a Nonlinear Level-set Learning (NLL) method for dimensionality reduction in high-dimensional function approximation with small data. This work is motivated by a variety of design tasks in real-world engineering applications, where practitioners would replace their computationally intensive physical models (e.g., high-resolution fluid simulators) with fast-to-evaluate predictive machine learning models, so as to accelerate the engineering design processes. There are two major challenges in constructing such predictive models: (a) high-dimensional inputs (e.g., many independent design parameters) and (b) small training data, generated by running extremely time-consuming simulations. Thus, reducing the input dimension is critical to alleviate the over-fitting issue caused by data insufficiency. Existing methods, including sliced inverse regression and active subspace approaches, reduce the input dimension by learning a linear coordinate transformation; our main contribution is to extend the transformation approach to a nonlinear regime. Specifically, we exploit reversible networks (RevNets) to learn nonlinear level sets of a high-dimensional function and parameterize its level sets in low-dimensional spaces. A new loss function was designed to utilize samples of the target functions' gradient to encourage the transformed function to be sensitive to only a few transformed coordinates. 
The NLL approach is demonstrated by applying it to three 2D functions and two 20D functions for showing the improved approximation accuracy with the use of nonlinear transformation, as well as to an 8D composite material design problem for optimizing the buckling-resistance performance of composite shells of rocket inter-stages.", "full_text": "Learning nonlinear level sets for dimensionality reduction in function approximation

Guannan Zhang, Computer Science and Mathematics Division, Oak Ridge National Laboratory, zhangg@ornl.gov
Jiaxin Zhang, National Center for Computational Sciences, Oak Ridge National Laboratory, zhangj@ornl.gov
Jacob Hinkle, Computational Science and Engineering Division, Oak Ridge National Laboratory, hinklejd@ornl.gov

Abstract

We developed a Nonlinear Level-set Learning (NLL) method for dimensionality reduction in high-dimensional function approximation with small data. This work is motivated by a variety of design tasks in real-world engineering applications, where practitioners would replace their computationally intensive physical models (e.g., high-resolution fluid simulators) with fast-to-evaluate predictive machine learning models, so as to accelerate the engineering design processes. There are two major challenges in constructing such predictive models: (a) high-dimensional inputs (e.g., many independent design parameters) and (b) small training data, generated by running extremely time-consuming simulations. Thus, reducing the input dimension is critical to alleviate the over-fitting issue caused by data insufficiency. Existing methods, including sliced inverse regression and active subspace approaches, reduce the input dimension by learning a linear coordinate transformation; our main contribution is to extend the transformation approach to a nonlinear regime.
Specifically, we exploit reversible networks (RevNets) to learn nonlinear level sets of a high-dimensional function and parameterize its level sets in low-dimensional spaces. A new loss function was designed to utilize samples of the target function's gradient to encourage the transformed function to be sensitive to only a few transformed coordinates. The NLL approach is demonstrated by applying it to three 2D functions and two 20D functions, showing the improved approximation accuracy obtained with the nonlinear transformation, as well as to an 8D composite material design problem for optimizing the buckling-resistance performance of composite shells of rocket inter-stages.

1 Introduction

High-dimensional function approximation arises in a variety of engineering applications where scientists or engineers rely on accurate and fast-to-evaluate approximators to replace complex and time-consuming physical models (e.g., multiscale fluid models), so as to accelerate scientific discovery or engineering design/manufacture. In most of those applications, training and validation data need to be generated by running expensive simulations, so the amount of training data is often limited due to the high cost of data generation (see §4.3 for an example). Thus, this effort is motivated by the challenge imposed by high dimensionality and small data in the context of function approximation.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

One way to overcome the challenge is to develop dimensionality reduction methods that can build a transformation of the input space to increase the anisotropy of the input-output map. In this work, we assume that the function has a scalar output and that the input consists of high-dimensional independent variables, such that there is no intrinsically low-dimensional structure in the input manifold.
In this case, instead of analyzing the input or output manifold separately, we learn low-dimensional structures of the target function's level sets to reduce the input dimension. Several methods have been developed for this purpose, including sliced inverse regression and active subspace methods; a literature review of those methods is given in §2.1. Despite many successful applications of those methods, their main drawback is that they use linear transformations to capture low-dimensional structures of level sets. For example, the existing methods work well for functions with linear level sets, e.g., f(x) = sin(x1 + x2), where the optimal linear transformation is a 45-degree rotation. When the level sets are nonlinear, e.g., f(x) = sin(‖x‖2), where the spherical transformation is optimal, the number of active input dimensions cannot be reduced by linear transformations.

In this effort, we exploited reversible residual neural networks (RevNets) [6, 18] to learn the target function's level sets and build nonlinear coordinate transformations that reduce the number of active input dimensions of the function. Reversible architectures have been developed in the literature [17, 19, 12] with the purpose of reducing memory usage in backward propagation, whereas we exploit the reversibility to build bijective nonlinear transformations. Since the RevNet is used for a different purpose, we designed a new loss function for training the RevNets, such that a well-trained RevNet can capture the nonlinear geometry of the level sets. The key idea is to utilize samples of the function's gradient to promote the objective that most of the transformed coordinates move on the tangent planes of the target function, i.e., the transformed function is invariant with respect to those coordinates. In addition, we constrain the determinant of the Jacobian matrix of the transformation in order to enforce invertibility.
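This failure mode of linear methods is easy to check numerically. Below is a small sketch (not from the paper) that estimates an active-subspace-style matrix C = E[∇f ∇fᵀ] from gradient samples: for sin(x1 + x2) the spectrum of C is rank-one, so one rotated coordinate captures all variation, while for sin(‖x‖₂²) the radial symmetry makes the eigenvalues comparable and no single linear direction dominates.

```python
import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1.0, 1.0, size=(5000, 2))

def grad_linear(x):
    # f(x) = sin(x1 + x2): every gradient is parallel to (1, 1)
    g = np.cos(x[:, 0] + x[:, 1])
    return np.stack([g, g], axis=1)

def grad_radial(x):
    # f(x) = sin(||x||_2^2): gradients point radially outward
    g = np.cos((x ** 2).sum(axis=1, keepdims=True))
    return 2.0 * g * x

def as_spectrum(G):
    # Eigenvalues (descending) of the AS matrix C = E[grad f grad f^T]
    C = G.T @ G / G.shape[0]
    return np.sort(np.linalg.eigvalsh(C))[::-1]

lam_lin = as_spectrum(grad_linear(X))
lam_rad = as_spectrum(grad_radial(X))
print(lam_lin[1] / lam_lin[0])  # ~0: a single rotated coordinate suffices
print(lam_rad[1] / lam_rad[0])  # ~1: no dominant linear direction exists
```

The near-zero eigenvalue ratio in the first case is exactly what AS exploits; the near-unity ratio in the second is why a nonlinear transformation is needed.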
The main contributions of this effort can be summarized as follows:

• Development of a RevNet-based coordinate transformation model for capturing the geometry of level sets, which extends function dimensionality reduction to the nonlinear regime.

• Design of a new loss function that exploits gradients of the target function to successfully train the proposed RevNet-based nonlinear transformation.

• Demonstration of the performance of the proposed NLL method on a high-dimensional real-world composite material design problem for rocket inter-stage manufacture.

2 Problem formulation

We are interested in approximating a d-dimensional multivariate function of the form

    y = f(x),  x ∈ Ω ⊂ R^d,    (1)

where Ω is a bounded domain in R^d, the input x := (x1, x2, . . . , xd)ᵀ is a d-dimensional vector, and the output y is a scalar value. Ω is equipped with a probability density function ρ : R^d → R+, i.e.,

    0 < ρ(x) < ∞ for x ∈ Ω,  and  ρ(x) = 0 for x ∉ Ω,

such that the manifold {x | x ∼ ρ(x)} does not have any intrinsically low-dimensional structure (e.g., ρ is a uniform distribution in a d-dimensional hypercube). The target function f is assumed to be first-order continuously differentiable, i.e., f ∈ C¹(Ω), and square-integrable with respect to the probability measure ρ, i.e., ∫_Ω f²(x)ρ(x)dx < ∞.

In many engineering applications, e.g., the composite shell design problem in §4.3, f usually represents the input-output relationship of computationally expensive simulators.
In order to accelerate a discovery/design process, practitioners seek to build an approximation of f, denoted by f̃, such that the error f − f̃ is smaller than a prescribed threshold ε > 0, i.e., ‖f(x) − f̃(x)‖_{L²ρ(Ω)} < ε, where ‖·‖_{L²ρ} is the L² norm under the probability measure ρ. As discussed in §1, the main challenge results from the concurrence of high-dimensional input (i.e., large d) and small training data, which means the amount of training data is insufficient to overcome the curse of dimensionality. In this scenario, naive application of existing approximation methods, e.g., sparse polynomials, kernel methods, neural networks (NNs), etc., may lead to severe over-fitting. Therefore, our goal is to reduce the input dimension d by transforming the original input vector x to a lower-dimensional vector z, such that the transformed function can be accurately approximated with small data.

2.1 Related work

Manifold learning for dimensionality reduction. Manifold learning, including linear and nonlinear approaches [28, 27, 2, 14, 29, 26], focuses on reducing data dimension by learning intrinsically low-dimensional structures in the data. Nevertheless, since we assume the input vector x in Eq. (1) consists of independent components and the output f is a scalar, no low-dimensional structure can be identified by separately analyzing the input and the output data. Thus, standard manifold learning approaches are not applicable to the function dimensionality reduction problem under consideration.

Sliced inverse regression (SIR). SIR is a statistical dimensionality reduction approach for the problem under consideration.
In SIR, the input dimension is reduced by constructing/learning a linear coordinate transformation z = Ax, with the expectation that the output of the transformed function y = h(z) = h(Ax) is sensitive to only a very small number of the new coordinates of z. The original version of SIR was developed in [23] and then improved extensively by [10, 24, 11, 9, 25]. To relax the elliptic assumption (e.g., Gaussian) on the data, kernel dimension reduction (KDR) was introduced in [15, 16]. Several recent works, including manifold learning with KDR [31] and localized SIR [30], were developed for classification problems. In §4, SIR will be used to produce baseline results to compare with the performance of our nonlinear method.

Active subspace (AS). The AS method [8, 7] shares the same motivation as SIR, i.e., reducing the input dimension of f(x) by defining a linear transformation z = Ax. The main difference between AS and SIR is the way the matrix A is constructed. The AS method does not need the elliptic assumption required by SIR, but it requires (approximate) gradient samples of f(x) to build A. For both SIR and AS, when the level sets of f are nonlinear, e.g., f(x) = sin(‖x‖₂²), the dimension cannot be effectively reduced using any linear transformation. An initial attempt at a nonlinear AS method was made in [4] by analyzing local structures of isosurfaces, but its main drawback is the high online cost. The AS method will be used as another baseline to compare with our method in §4.

Reversible neural networks. We exploited the RevNets proposed in [6, 18] to define our nonlinear transformation for dimensionality reduction. Those RevNets describe bijective continuous dynamics, while regular residual networks result in crossing and collapsing paths, which correspond to non-bijective continuous dynamics [1, 6].
Recently, RevNets have been shown to produce competitive performance on discriminative tasks [17, 20] and generative tasks [12, 13, 21]. In particular, non-linear independent component estimation (NICE) [12, 13] used RevNets to build nonlinear coordinate transformations that factorize high-dimensional density functions into products of independent 1D distributions. The main difference between NICE and our approach is that NICE seeks convergence in distribution (weak convergence) with the purpose of building an easy-to-sample distribution, while our approach seeks strong convergence, as measured by the norm ‖·‖_{L²ρ}, with the purpose of building an accurate pointwise approximation to the target function in a lower-dimensional input space.

3 Proposed method: Nonlinear Level sets Learning (NLL)

The goal of dimensionality reduction is to construct a bijective nonlinear transformation, denoted by

    z = g(x) ∈ R^d  and  x = g⁻¹(z),    (2)

where z = (z1, . . . , zd)ᵀ, such that the composite function y = f ∘ g⁻¹(z) has a very small number of active input components. In other words, even though z ∈ R^d is still defined in R^d, the components of z can be split into two groups, i.e., z = (z_act, z_inact) with dim(z_act) much smaller than d, such that f ∘ g⁻¹ is only sensitive to perturbations of z_act.
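To make the split z = (z_act, z_inact) concrete, consider a hand-crafted (not learned) transformation for a function with circular level sets: in polar coordinates z = (r, θ), the composite f ∘ g⁻¹ depends only on z1 = r, so z2 = θ is inactive. A minimal sketch, with illustrative names and finite differences standing in for the Jacobian-based condition developed in §3.2:

```python
import numpy as np

def f(x):
    # Target with circular level sets: f depends on x only through ||x||
    return np.sin(x[0] ** 2 + x[1] ** 2)

def g_inv(z):
    # Hand-crafted inverse transform: polar (r, theta) -> Cartesian (x1, x2)
    r, theta = z
    return np.array([r * np.cos(theta), r * np.sin(theta)])

def sensitivity(h, z, i, eps=1e-6):
    # Central finite-difference sensitivity of h with respect to z_i
    zp, zm = z.copy(), z.copy()
    zp[i] += eps
    zm[i] -= eps
    return abs(h(zp) - h(zm)) / (2.0 * eps)

h = lambda z: f(g_inv(z))       # the composite f o g^{-1}
z = np.array([0.8, 0.3])        # an arbitrary point (r, theta)
print(sensitivity(h, z, 0))     # active coordinate z1 = r: O(1)
print(sensitivity(h, z, 1))     # inactive coordinate z2 = theta: ~0
```

The NLL method aims to learn such a g automatically, for targets whose level-set geometry is not known in closed form.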
To this end, our method was inspired by the following observation:

Observation: For a fixed pair (x, z) satisfying z = g(x), if x = g⁻¹(z), viewed as a particle in Ω, moves along a tangent direction of the level set passing through f(x) (i.e., any direction perpendicular to ∇f(x)) under a perturbation of zi (the i-th component of z), then the output of f ∘ g⁻¹(z) does NOT change with zi in the neighbourhood of z.

Based on this observation, we intend to build and train a nonlinear transformation g with the objective that a prescribed number of inactive components of z satisfy the above statement; those inactive components will form z_inact.

Training data: We need two types of data for training g, i.e., samples of the function values and of its gradients, denoted by

    Ξ_train := { (x⁽ˢ⁾, f(x⁽ˢ⁾), ∇f(x⁽ˢ⁾)) : s = 1, . . . , S },

where {x⁽ˢ⁾ : s = 1, . . . , S} are drawn from ρ(x), and ∇f(x⁽ˢ⁾) denotes the gradient of f at x⁽ˢ⁾. The gradient samples describe the tangent directions of the target function's level sets, i.e., the gradient direction is perpendicular to all tangent directions. The requirement of gradient samples may limit the applicability of our approach in real-world applications where the gradient is not available; a detailed discussion on how to mitigate this disadvantage is given in §5.

3.1 The level sets learning model: RevNets

The first step is to define a model for the nonlinear transformation g in Eq. (2). In this effort, we utilize the nonlinear RevNet model proposed in [6, 18], defined by

    u_{n+1} = u_n + h K_{n,1}ᵀ σ(K_{n,1} v_n + b_{n,1}),
    v_{n+1} = v_n − h K_{n,2}ᵀ σ(K_{n,2} u_{n+1} + b_{n,2}),    (3)

for n = 0, 1, . . . , N − 1, where u_n and v_n are partitions of the states, h is the "time step" scalar, K_{n,1}, K_{n,2} are weight matrices, b_{n,1}, b_{n,2} are biases, and σ is the activation function. Since u_n, v_n can be explicitly calculated given u_{n+1}, v_{n+1}, the RevNet in Eq. (3) is reversible by definition. Even though our approach can incorporate any reversible architecture, we chose the model in Eq. (3) because it was shown in [6] to have better nonlinear representability than other types of RevNets.

To define g : x ↦ z, we split the components of x evenly into u_0 and v_0, and split the components of z accordingly into u_N and v_N, i.e.,

    x = (u_0, v_0)ᵀ with u_0 := (x1, . . . , x_⌈d/2⌉)ᵀ and v_0 := (x_⌈d/2⌉+1, . . . , xd)ᵀ,    (4)
    z = (u_N, v_N)ᵀ with u_N := (z1, . . . , z_⌈d/2⌉)ᵀ and v_N := (z_⌈d/2⌉+1, . . . , zd)ᵀ,    (5)

such that the nonlinear transformation g is defined by the map (u_0, v_0) ↦ (u_N, v_N) from the input states of the N-layer RevNet in Eq. (3) to its output states, i.e.,

    x = (u_0, v_0)  —g→  (u_N, v_N) = z,  and  z = (u_N, v_N)  —g⁻¹→  (u_0, v_0) = x.    (6)

It was shown in [18] that the RevNet in Eq. (3) is guaranteed to be stable, so that we can use deep architectures to build a highly nonlinear transformation that captures the geometry of the level sets of f.

3.2 The loss function

The main novelty of this work is the design of the loss function for training the RevNet in Eq. (3). The new loss function includes two components. The first component is inspired by our observation given at the beginning of §3.
Specifically, guided by this observation, we write out the Jacobian matrix of the inverse transformation g⁻¹ : z ↦ x as

    J_{g⁻¹}(z) = [J1(z), J2(z), . . . , Jd(z)]  with  Ji(z) := (∂x1/∂zi (z), . . . , ∂xd/∂zi (z))ᵀ,    (7)

where the i-th column Ji describes the direction in which the particle x moves when zi is perturbed. As such, we can use Ji to rewrite our observation mathematically: the output of f(x) does not change under a perturbation of zi in the neighborhood of z if

    ⟨Ji(z), ∇f(x)⟩ = 0,    (8)

where ⟨·,·⟩ denotes the inner product of two vectors. The relation in Eq. (8) is illustrated in Figure 1. Therefore, the first component of the loss function, denoted by L1, is defined by

    L1 := Σ_{s=1}^{S} Σ_{i=1}^{d} ωi [ ⟨ Ji(z⁽ˢ⁾) / ‖Ji(z⁽ˢ⁾)‖2 , ∇f(x⁽ˢ⁾) ⟩ ]²,    (9)

where ω1, ω2, . . . , ωd are user-defined anisotropy weights determining how strictly the condition in Eq. (8) is enforced in each dimension. One extreme case is ω := (0, 1, 1, . . . , 1), which means the objective is to train the transformation g such that the intrinsic dimension of f ∘ g⁻¹(z) is one when L1 = 0. Another extreme case is ω = (0, . . . , 0), which leads to L1 = 0, so no dimensionality reduction is performed. In practice, the weights ω give us the flexibility to balance training cost against reduction effect. It should be noted that we only normalize Ji in Eq. (9), not ∇f, so that L1 does not penalize too heavily in regions where ∇f is very small. In particular, L1 = 0 if f is a constant function.

The second component of the loss function is designed to guarantee that the nonlinear transformation g is non-singular. It is observed in Eq. (9) that L1 only affects the Jacobian columns Ji with ωi ≠ 0, but has no control over the columns Ji with ωi = 0. To avoid the transformation g becoming singular during training, we define the second loss component L2 as a quadratic penalty on the Jacobian determinant, i.e.,

    L2 := (det(J_{g⁻¹}) − 1)²,    (10)

which pushes the transformation to be non-singular and volume preserving. Note that L2 can be viewed as a regularization term. In summary, the final loss function is defined by

    L := L1 + λ L2,    (11)

where λ is a user-specified constant that balances the two terms.

Figure 1. Illustration of the observation defining the loss L1 in Eq. (9), i.e., f(x) is insensitive to a perturbation of zi in the neighborhood of z if Ji(z) ⊥ ∇f(x), where Ji is defined in Eq. (7).

3.3 Implementation

The RevNet in Eq. (3) with the new loss function in Eq. (11) was implemented in PyTorch 1.1 and tested on a 2014 iMac desktop with a 4 GHz Intel Core i7 CPU and 32 GB DDR3 memory. To make use of automatic differentiation in PyTorch, we implemented a customized loss function in which the entries of the Jacobian matrix J_{g⁻¹} were computed using finite difference schemes, and det(J_{g⁻¹}) was approximately calculated using the PyTorch implementation of singular value decomposition. Since this effort focuses on a proof of concept of the proposed methodology, the current implementation is not optimized for computational efficiency.

4 Numerical experiments

We evaluated our method using three 2D functions in §4.1 to visualize the nonlinear capability, two 20D functions in §4.2 to compare our method with brute-force neural networks and the SIR and AS methods, as well as a composite material design problem in §4.3 to demonstrate the potential impact of our method on real-world engineering problems.
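As a concrete reference for the experiments below, the reversible block of Eq. (3), the transformation g of Eq. (6), and the loss of Eqs. (9)-(11) can be sketched in NumPy. This is a minimal re-implementation for illustration only (the paper's actual code uses PyTorch with automatic differentiation); the shapes, step size, and all names here are illustrative, and the Jacobian is formed by finite differences as described in §3.3.

```python
import numpy as np

def sigma(t):
    return np.tanh(t)

class RevBlock:
    # One block of Eq. (3); its inverse is exact by construction.
    def __init__(self, K1, b1, K2, b2, h=0.25):
        self.K1, self.b1, self.K2, self.b2, self.h = K1, b1, K2, b2, h

    def forward(self, u, v):
        u2 = u + self.h * self.K1.T @ sigma(self.K1 @ v + self.b1)
        v2 = v - self.h * self.K2.T @ sigma(self.K2 @ u2 + self.b2)
        return u2, v2

    def inverse(self, u2, v2):
        v = v2 + self.h * self.K2.T @ sigma(self.K2 @ u2 + self.b2)
        u = u2 - self.h * self.K1.T @ sigma(self.K1 @ v + self.b1)
        return u, v

def g(x, blocks):            # x -> z through stacked blocks, Eqs. (4)-(6)
    u, v = np.split(x, 2)
    for blk in blocks:
        u, v = blk.forward(u, v)
    return np.concatenate([u, v])

def g_inv(z, blocks):        # z -> x, exact inverse
    u, v = np.split(z, 2)
    for blk in reversed(blocks):
        u, v = blk.inverse(u, v)
    return np.concatenate([u, v])

def jacobian_g_inv(z, blocks, eps=1e-6):
    # Finite-difference Jacobian of g^{-1}, column i = J_i of Eq. (7)
    d = z.size
    J = np.empty((d, d))
    for i in range(d):
        zp, zm = z.copy(), z.copy()
        zp[i] += eps
        zm[i] -= eps
        J[:, i] = (g_inv(zp, blocks) - g_inv(zm, blocks)) / (2 * eps)
    return J

def loss(zs, grads, omega, blocks, lam=1.0):
    # L = L1 + lam * L2, Eqs. (9)-(11)
    L1 = L2 = 0.0
    for z, gf in zip(zs, grads):
        J = jacobian_g_inv(z, blocks)
        cols = J / np.linalg.norm(J, axis=0)       # normalize each column J_i
        L1 += np.sum(omega * (cols.T @ gf) ** 2)   # sum_i w_i <J_i/|J_i|, grad f>^2
        L2 += (np.linalg.det(J) - 1.0) ** 2
    return L1 + lam * L2

rng = np.random.default_rng(0)
d = 4
blocks = [RevBlock(0.1 * rng.normal(size=(d, d // 2)), 0.1 * rng.normal(size=d),
                   0.1 * rng.normal(size=(d, d // 2)), 0.1 * rng.normal(size=d))
          for _ in range(3)]
x = rng.uniform(size=d)
z = g(x, blocks)
print(np.allclose(g_inv(z, blocks), x))            # reversibility holds: True
omega = np.array([0.0, 1.0, 1.0, 1.0])             # Eq. (9) weights: z1 kept active
print(loss([z], [np.ones(d)], omega, blocks))      # finite, nonnegative
```

In the paper's setting this loss is minimized over the RevNet parameters by stochastic gradient descent; here it is only evaluated once to show the moving parts.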
To generate baseline results, we used existing AS and SIR codes available at https://github.com/paulcon/active_subspaces and https://github.com/joshloyal/sliced, respectively. Source code for the proposed NLL method is available in the supplemental material.

4.1 Tests on two-dimensional functions

Here we applied our method to the following three 2-dimensional functions:

    f1(x) = (1/2) sin(2π(x1 + x2)) + 1     for x ∈ Ω = [0, 1] × [0, 1],    (12)
    f2(x) = exp(−(x1 − 0.5)² − x2²)        for x ∈ Ω = [0, 1] × [0, 1],    (13)
    f3(x) = x1³ + x2³ + 0.2x1 + 0.6x2      for x ∈ Ω = [−1, 1] × [−1, 1].    (14)

We used the same RevNet architecture for the three functions. Specifically, u and v in Eq. (3) were 1D variables (as the total dimension is 2); the number of layers was N = 10, i.e., 10 blocks of the form in Eq. (3) were connected; K_{n,1}, K_{n,2} were 2 × 1 matrices; b_{n,1}, b_{n,2} were 2D vectors; the activation function was tanh; the time step h was set to 0.25; the stochastic gradient descent method was used to train the RevNet with a learning rate of 0.01; no regularization was applied to the network parameters; the weights in Eq. (9) were set to ω = (0, 1); λ = 1 in the loss function in Eq. (11); the training set included 121 uniformly distributed samples in Ω, and the validation set included 2000 uniformly distributed samples in Ω. We compared our method with either SIR or AS for each of the three functions.

The results for f1, f2, f3 are shown in Figure 2. For f1, it is known that the optimal transformation is a 45-degree rotation of the original coordinate system. The first row in Figure 2 shows that the trained RevNet can approximately recover the 45-degree rotation, which demonstrates that the NLL method can also recover linear transformations.
The level sets of f2 and f3 are nonlinear, and the NLL method successfully captured this nonlinearity. In comparison, the performance of AS and SIR is worse than that of the NLL method because they can only perform linear transformations.

Figure 2. Comparison between NLL and AS/SIR for f1(x), f2(x), f3(x) in Eqs. (12)-(14) (rows 1-3, respectively). The first and fourth columns show the relationship between the function output and z1, where the performance is better if the curve is thinner (i.e., the thickness of the curves shows the variation of f ∘ g⁻¹ w.r.t. z2). The second and fifth columns show the gradient field (gray arrows) and the vector field of the second Jacobian column J2, where the performance is better if the gray and black arrows are perpendicular to each other. The third and sixth columns show the transformation of a Cartesian mesh to the z space. Note that the AS method is shown for f1, f3 while the SIR method is shown for f2; both methods were applied to all functions and showed very similar results. Since a linear transformation (45-degree rotation) is optimal in the case of f1, both NLL and AS can learn such a transformation, but in the other cases the NLL method outperforms the linear methods.

4.2 Tests on 20-dimensional functions

Here we applied the new method to the following two 20-dimensional functions:

    f4(x) = sin(x1² + x2² + ··· + x20²)  and  f5(x) = ∏_{i=1}^{20} (1.2⁻² + xi²)⁻¹,    (15)

for x ∈ Ω = [0, 1]²⁰. We used one RevNet architecture for the two functions. Specifically, u and v in Eq. (3) were 10D variables; the number of layers was N = 30, i.e., 30 blocks of the form in Eq.
(3) were connected; K_{n,1}, K_{n,2} were 20 × 10 matrices; b_{n,1}, b_{n,2} were 10-dimensional vectors; the activation function was tanh; the time step h was set to 0.25; the stochastic gradient descent method was used to train the RevNet with a learning rate of 0.05; λ = 1 for the loss function in Eq. (11); the training set included 500 uniformly distributed samples in Ω.

The effectiveness of the NLL method is shown via relative sensitivity indicators in Figure 3(a) for f4 and Figure 3(b) for f5. The sensitivity of each transformed variable zi is described by the normalized sample mean of the absolute values of the corresponding partial derivative. The definition of ω in Eq. (9) provides the target anisotropy of the transformed function. For f4, we set ω1 = 0 and ωi = 1 for i = 2, . . . , 20; for f5 we set ω1 = ω2 = 0 and ωi = 1 for i = 3, . . . , 20. As expected, the NLL method successfully reduced the sensitivities of the inactive dimensions to two orders of magnitude smaller than the active dimensions. In comparison, the SIR and AS methods can only reduce those sensitivities by one order of magnitude using optimal linear transformations.

Figure 3. Comparison of relative sensitivities of the transformed functions (a) f4 ∘ g⁻¹(z) and (b) f5 ∘ g⁻¹(z) with the original function and the transformed functions using the AS and SIR methods.

Next, we show how the NLL method improves the accuracy of approximating the transformed function f ∘ g⁻¹. We used two fully connected NNs to approximate the transformed functions: one with 2 hidden layers of 20+20 neurons, and the other with a single hidden layer of 10 neurons. The implementation of both networks was based on the neural network toolbox in Matlab 2017a. We used various sizes of training data (100, 500, 10,000), and we used another 10,000 samples as validation data.
All the samples were drawn uniformly in Ω. The approximation error was computed as the relative root mean square error (RMSE) using the validation data. For comparison, we used the same data to run brute-force neural networks without any transformation, as well as the AS and SIR methods.

The results for f4 and f5 are shown in Tables 1 and 2, respectively. For the 20+20 network, when the training data is too small (e.g., 100 samples), all the methods suffer from over-fitting; when the training data is very large (e.g., 10,000 samples), all the methods achieve good accuracy¹. Our method shows significant advantages over the AS and SIR methods when the training data is relatively small, e.g., 500 training samples, which is a common scenario in scientific and engineering applications. For the single-hidden-layer network with 10 neurons, the brute-force NN, AS and SIR cannot achieve good accuracy even with 10,000 training data (no over-fitting), which means the network does not have sufficient expressive power to approximate the original function or the functions transformed by AS or SIR. In comparison, the NLL method still performs well, as shown in Table 1 (Right) and Table 2 (Right). This means the dimensionality reduction has significantly simplified the target functions' structure, such that the transformed functions can be accurately approximated with smaller architectures, reducing the possibility of over-fitting.

¹ 10% or smaller RMSE is considered satisfactory accuracy in many engineering applications.

Table 1: Relative RMSE for approximating f4 in Eq. (15). (Left) 2-hidden-layer fully-connected NN with 20+20 neurons; (Right) 1-hidden-layer fully-connected NN with 10 neurons.

    (Left)       100 data           500 data           10,000 data
                 Valid     Train    Valid     Train    Valid     Train
    NN           96.74%    0.01%    61.22%    1.01%    9.17%     7.72%
    NLL          98.23%    0.02%    13.41%    2.33%    1.84%     1.37%
    AS           95.42%    0.03%    65.98%    1.09%    2.36%     1.81%
    SIR          97.87%    0.01%    56.97%    2.91%    2.61%     1.99%

    (Right)      100 data           500 data           10,000 data
                 Valid     Train    Valid     Train    Valid     Train
    NN           61.93%    0.01%    49.67%    16.93%   30.36%    28.62%
    NLL          28.61%    0.01%    8.54%     2.11%    3.11%     2.83%
    AS           81.64%    0.001%   47.52%    15.73%   29.59%    28.42%
    SIR          76.53%    0.002%   49.34%    15.11%   29.67%    28.11%

Table 2: Relative RMSE for approximating f5 in Eq. (15). (Left) 2-hidden-layer fully-connected NN with 20+20 neurons; (Right) 1-hidden-layer fully-connected NN with 10 neurons.

    (Left)       100 data           500 data           10,000 data
                 Valid     Train    Valid     Train    Valid     Train
    NN           40.95%    0.005%   33.92%    11.10%   3.56%     4.14%
    NLL          77.79%    0.001%   13.36%    4.32%    3.04%     3.12%
    AS           66.64%    0.002%   39.73%    3.38%    6.21%     3.32%
    SIR          80.91%    0.112%   28.17%    9.85%    2.91%     4.19%

    (Right)      100 data           500 data           10,000 data
                 Valid     Train    Valid     Train    Valid     Train
    NN           30.35%    0.001%   25.69%    6.37%    16.32%    14.22%
    NLL          26.93%    0.001%   10.63%    1.43%    6.74%     4.76%
    AS           60.47%    0.002%   24.54%    4.02%    18.65%    13.94%
    SIR          72.45%    0.002%   35.23%    4.66%    19.08%    12.84%

4.3 Design of composite shell for rocket inter-stages

Finally, we demonstrate the NLL method on a real-world composite material design problem. With high specific stiffness and strength, composite materials are increasingly being used for launch-vehicle structures. A series of large-scale composite tests for the shell buckling knockdown factor conducted by NASA (see Figure 4(a)) aimed to develop and validate new analysis-based design guidelines for

Figure 4.
Illustration of the composite shell design problem for rocket inter-stages. Figure 5. Loss function decay.

safer and lighter space structures. Since the experimental cost is extremely high, numerical simulation, e.g., the finite element method (FEM), is often employed to predict the shell buckling knockdown factor given a multi-layer ply stack design [5], as illustrated in Figure 4(c). The goal of this work is to build an accurate approximation of this high-dimensional regression problem, where the inputs are the ply angles of 8 layers and the output is the knockdown factor, which requires high precision for space structure design. However, the high-fidelity FEM simulation is so time-consuming that one analysis takes 10 hours; consequently, it is impractical to collect a large data set for approximating the knockdown factor.

To demonstrate the applicability of our method to this problem, we used a simplified FEM model that runs relatively fast but preserves all the physical properties of the high-fidelity FEM model. As shown in Figure 4(b), a ply angle ranging from 0° to 22.5° is assigned to each of the 8 layers in this example, i.e., the input domain is Ω = [0°, 22.5°]⁸. The RevNet has N = 10 layers; K_{n,1}, K_{n,2} were 8 × 4 matrices; b_{n,1}, b_{n,2} were 4-dimensional vectors; the activation function was tanh; the time step h was set to 0.1; the stochastic gradient descent method was used with a learning rate of 0.05; λ = 1 for the loss function in Eq.
(11).

Table 3: Relative sensitivities of the transformed functions for the composite material design model.

  Method     Dim 1   Dim 2   Dim 3   Dim 4   Dim 5   Dim 6   Dim 7   Dim 8
  Original   0.36    0.45    0.61    0.39    0.51    0.72    1.0     0.85
  NLL        0.018   0.036   0.011   0.024   0.075   0.12    0.68    1.0
  AS         0.15    0.17    0.20    0.15    0.20    0.22    1.0     0.41
  SIR        0.16    0.14    0.16    0.12    0.13    0.18    1.0     0.21

As in the previous examples, we compare the relative sensitivities in Table 3, where we allowed 3 active dimensions in the loss L1, i.e., ω1 = ω2 = ω3 = 0 and ωi = 1 for i = 4, ..., 8. As expected, the NLL method successfully reduced the input dimension by driving the sensitivities of Dim 4-8 two orders of magnitude below that of the most active dimension, outperforming the AS and SIR methods. In Table 4, we show the RMSE of approximating the transformed function using a neural network with a single hidden layer of 20 neurons; the other settings are the same as in the examples of §4.2. As expected, the NLL approach outperforms AS and SIR in the small-data regime, i.e., with 500 training data.

Table 4: Relative RMSE for approximating the composite material design model.

            100 data           500 data           10,000 data
            Valid    Train     Valid    Train     Valid    Train
  NN        65.74%   0.01%     67.57%   24.77%    3.74%    3.52%
  NLL       63.18%   0.02%     11.96%   5.13%     2.51%    2.17%
  AS        58.89%   0.13%     47.27%   19.11%    3.05%    2.91%
  SIR       65.34%   0.21%     54.99%   22.52%    3.32%    3.21%

In Figure 5, we show the decay of the loss function for different choices of the anisotropy weights ω in L1; the more inactive/insensitive dimensions there are (i.e., the more nonzero ωi), the slower the loss function decays.

5 Concluding remarks
We developed a RevNet-based level-set learning method for dimensionality reduction in high-dimensional function approximation.
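The reversible building block underlying this construction can be illustrated with a minimal sketch. The pure-Python fragment below implements one reversible block in the Hamiltonian style of Chang et al. [6]; the scalar weights k1, k2 and scalar biases b1, b2 are illustrative stand-ins for the weight matrices K_{n,1}, K_{n,2} and bias vectors b_{n,1}, b_{n,2} used in our experiments, and the 8-dimensional test point mimics the ply-angle domain of §4.3. This is a sketch of the architectural idea, not our implementation; the point is that the transformation is invertible by construction, with the inverse available in closed form.

```python
import math

def rev_block_forward(u, v, h=0.1, k1=0.8, b1=0.1, k2=-0.5, b2=0.2):
    """One reversible block (Hamiltonian style): the input is partitioned
    into halves (u, v), and each half is updated using only the other half.
    Scalar weights/biases replace the matrices K_{n,1}, K_{n,2} and
    vectors b_{n,1}, b_{n,2} for brevity; h is the time step."""
    u_next = [ui + h * math.tanh(k1 * vi + b1) for ui, vi in zip(u, v)]
    v_next = [vi - h * math.tanh(k2 * ui + b2) for ui, vi in zip(u_next, v)]
    return u_next, v_next

def rev_block_inverse(u_next, v_next, h=0.1, k1=0.8, b1=0.1, k2=-0.5, b2=0.2):
    """Exact closed-form inverse: v is recovered first (its update depended
    only on u_next), then u. No numerical root-finding is needed."""
    v = [vn + h * math.tanh(k2 * un + b2) for un, vn in zip(u_next, v_next)]
    u = [un - h * math.tanh(k1 * vi + b1) for un, vi in zip(u_next, v)]
    return u, v

# Round trip on a point in [0, 22.5]^8, split into two 4-dimensional halves.
x = [3.0, 7.5, 12.0, 20.0, 1.5, 9.0, 15.0, 22.0]
u0, v0 = x[:4], x[4:]
u1, v1 = rev_block_forward(u0, v0)
u2, v2 = rev_block_inverse(u1, v1)
assert max(abs(a - b) for a, b in zip(u0 + v0, u2 + v2)) < 1e-12
```

Stacking N such blocks gives a deep map that remains bijective regardless of depth, which is what lets the learned nonlinear coordinates be mapped back to the original design variables.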
With a custom-designed loss function, the RevNet-based nonlinear transformation can effectively learn the nonlinearity of the target function's level sets, so that the input dimension can be significantly reduced.

Limitations. Despite the successful applications of the NLL method shown in §4, the current NLL algorithm has several limitations, including (a) The need for gradient samples. Many engineering models do not provide gradients as an output. To use the current algorithm, the gradients must be computed by finite differences or other perturbation methods, which increases the computational cost. (b) Non-uniqueness. Unlike the AS and SIR methods, the nonlinear transformation produced by the NLL method is not unique, which poses a challenge in the design of the RevNet architectures. (c) High cost and low accuracy of computing Jacobians. The main cost of backward propagation lies in the computation of the Jacobian matrices and their determinants, which deteriorates the training efficiency and/or accuracy as the depth of the RevNet increases.
Future work. There are several research directions we will pursue in the future. The first is to develop a gradient estimation approach that can approximately compute the gradients needed by our method. Specifically, we will exploit the contour regression method [22] and the manifold tangent learning approach [3], both of which have the potential to estimate gradients from function samples. The second is to improve the computational efficiency of the training algorithm. Since our loss function is more complicated than standard loss functions, extra effort will be required to improve the efficiency of backward propagation.

Acknowledgements
This material was based upon work supported by the U.S.
Department of Energy, Office of Science, Office of Advanced Scientific Computing Research, Applied Mathematics program under contract ERKJ352; and by the Artificial Intelligence Initiative at the Oak Ridge National Laboratory (ORNL). ORNL is operated by UT-Battelle, LLC, for the U.S. Department of Energy under Contract DE-AC05-00OR22725.

References
[1] Jens Behrmann, David Duvenaud, and Jörn-Henrik Jacobsen. Invertible residual networks. arXiv preprint arXiv:1811.00995, 2018.
[2] Mikhail Belkin, Partha Niyogi, and Vikas Sindhwani. Manifold regularization: A geometric framework for learning from labeled and unlabeled examples. Journal of Machine Learning Research, 7(Nov):2399–2434, 2006.
[3] Yoshua Bengio and Martin Monperrus. Non-local manifold tangent learning. In L. K. Saul, Y. Weiss, and L. Bottou, editors, Advances in Neural Information Processing Systems 17, pages 129–136. MIT Press, 2005.
[4] Robert A. Bridges, Anthony D. Gruber, Christopher Felder, Miki E. Verma, and Chelsey Hoff. Active manifolds: A non-linear analogue to active subspaces. In Proceedings of the 36th International Conference on Machine Learning, ICML 2019, 9-15 June 2019, Long Beach, California, USA, pages 764–772, 2019.
[5] Saullo G. P. Castro, Rolf Zimmermann, Mariano A. Arbelo, Regina Khakimova, Mark W. Hilburger, and Richard Degenhardt. Geometric imperfections and lower-bound methods used to calculate knock-down factors for axially compressed composite cylindrical shells. Thin-Walled Structures, 74:118–132, 2014.
[6] Bo Chang, Lili Meng, Eldad Haber, Lars Ruthotto, David Begert, and Elliot Holtham. Reversible architectures for arbitrarily deep residual neural networks. In AAAI Conference on Artificial Intelligence, 2018.
[7] Paul G. Constantine. Active Subspaces: Emerging Ideas for Dimension Reduction in Parameter Studies, volume 2. SIAM, 2015.
[8] Paul G. Constantine, Eric Dow, and Qiqi Wang.
Active subspace methods in theory and practice: Applications to kriging surfaces. SIAM Journal on Scientific Computing, 36(4):A1500–A1524, 2014.
[9] R. Dennis Cook and Liqiang Ni. Sufficient dimension reduction via inverse regression: A minimum discrepancy approach. Journal of the American Statistical Association, 100(470):410–428, 2005.
[10] R. Dennis Cook and Sanford Weisberg. Sliced inverse regression for dimension reduction: Comment. Journal of the American Statistical Association, 86(414):328–332, 1991.
[11] R. Dennis Cook and Xiangrong Yin. Theory & methods: Special invited paper: Dimension reduction and visualization in discriminant analysis (with discussion). Australian & New Zealand Journal of Statistics, 43(2):147–199, 2001.
[12] Laurent Dinh, David Krueger, and Yoshua Bengio. NICE: Non-linear independent components estimation. arXiv preprint arXiv:1410.8516, 2014.
[13] Laurent Dinh, Jascha Sohl-Dickstein, and Samy Bengio. Density estimation using Real NVP. arXiv preprint arXiv:1605.08803, 2016.
[14] David L. Donoho and Carrie Grimes. Hessian eigenmaps: Locally linear embedding techniques for high-dimensional data. Proceedings of the National Academy of Sciences, 100(10):5591–5596, 2003.
[15] Kenji Fukumizu, Francis R. Bach, and Michael I. Jordan. Dimensionality reduction for supervised learning with reproducing kernel Hilbert spaces. Journal of Machine Learning Research, 5(Jan):73–99, 2004.
[16] Kenji Fukumizu, Francis R. Bach, and Michael I. Jordan. Kernel dimension reduction in regression. The Annals of Statistics, 37(4):1871–1905, 2009.
[17] Aidan N. Gomez, Mengye Ren, Raquel Urtasun, and Roger B. Grosse. The reversible residual network: Backpropagation without storing activations. In Advances in Neural Information Processing Systems, pages 2214–2224, 2017.
[18] Eldad Haber and Lars Ruthotto. Stable architectures for deep neural networks.
Inverse Problems, 34:014004, 2018.
[19] Michael Hauser and Asok Ray. Principles of Riemannian geometry in neural networks. In Advances in Neural Information Processing Systems, 2017.
[20] Jörn-Henrik Jacobsen, Arnold Smeulders, and Edouard Oyallon. i-RevNet: Deep invertible networks. arXiv preprint arXiv:1802.07088, 2018.
[21] Durk P. Kingma and Prafulla Dhariwal. Glow: Generative flow with invertible 1x1 convolutions. In Advances in Neural Information Processing Systems, pages 10215–10224, 2018.
[22] Bing Li, Hongyuan Zha, and Francesca Chiaromonte. Contour regression: A general approach to dimension reduction. The Annals of Statistics, 33(4):1580–1616, 2005.
[23] Ker-Chau Li. Sliced inverse regression for dimension reduction. Journal of the American Statistical Association, 86(414):316–327, 1991.
[24] Ker-Chau Li. On principal Hessian directions for data visualization and dimension reduction: Another application of Stein's lemma. Journal of the American Statistical Association, 87(420):1025–1039, 1992.
[25] Lexin Li. Sparse sufficient dimension reduction. Biometrika, 94(3):603–613, 2007.
[26] Mijung Park, Wittawat Jitkrittum, Ahmad Qamar, Zoltán Szabó, Lars Buesing, and Maneesh Sahani. Bayesian manifold learning: The locally linear latent variable model (LL-LVM). In Advances in Neural Information Processing Systems, pages 154–162, 2015.
[27] Sam T. Roweis and Lawrence K. Saul. Nonlinear dimensionality reduction by locally linear embedding. Science, 290(5500):2323–2326, 2000.
[28] Joshua B. Tenenbaum, Vin de Silva, and John C. Langford. A global geometric framework for nonlinear dimensionality reduction. Science, 290(5500):2319–2323, 2000.
[29] Jing Wang, Zhenyue Zhang, and Hongyuan Zha. Adaptive manifold learning. In Advances in Neural Information Processing Systems, pages 1473–1480, 2005.
[30] Qiang Wu, Sayan Mukherjee, and Feng Liang. Localized sliced inverse regression.
In Advances in Neural Information Processing Systems, pages 1785–1792, 2009.
[31] Yi-Ren Yeh, Su-Yun Huang, and Yuh-Jye Lee. Nonlinear dimension reduction with kernel sliced inverse regression. IEEE Transactions on Knowledge and Data Engineering, 21(11):1590–1603, 2008.