{"title": "Provable Non-linear Inductive Matrix Completion", "book": "Advances in Neural Information Processing Systems", "page_first": 11439, "page_last": 11449, "abstract": "Consider a standard recommendation/retrieval problem where given a query, the goal is to retrieve the most relevant items. Inductive matrix completion (IMC) method is a standard approach for this problem where the given query as well as the items are embedded in a common low-dimensional space. The inner product between a query embedding and an item embedding reflects relevance of the (query, item) pair. Non-linear IMC (NIMC) uses non-linear networks to embed the query as well as items, and is known to be highly effective for a variety of tasks, such as video recommendations for users, semantic web search, etc. Despite its wide usage, existing literature lacks rigorous understanding of NIMC models. A key challenge in analyzing such models is to deal with the non-convexity arising out of non-linear embeddings in addition to the non-convexity arising out of the low-dimensional restriction of the embedding space, which is akin to the low-rank restriction in the standard matrix completion problem. In this paper, we provide the first theoretical analysis for a simple NIMC model in the realizable setting, where the relevance score of a (query, item) pair is formulated as the inner product between their single-layer neural representations. Our results show that under mild assumptions we can recover the ground truth parameters of the NIMC model using standard (stochastic) gradient descent methods if the methods are initialized within a small distance to the optimal parameters. We show that a standard tensor method can be used to initialize the solution within the required distance to the optimal parameters. 
Furthermore, we show that the number of query-item relevance observations required, a key parameter in learning such models, scales nearly linearly with the input dimensionality, thus matching existing results for the standard linear inductive matrix completion.", "full_text": "Provable Non-linear Inductive Matrix Completion

Kai Zhong (Amazon) kaizhong@amazon.com
Prateek Jain (Microsoft) prajain@microsoft.com
Zhao Song (University of Washington) magic.linuxkde@gmail.com
Inderjit S. Dhillon (Amazon & University of Texas at Austin) isd@amazon.com

Abstract

Consider a standard recommendation/retrieval problem where, given a query, the goal is to retrieve the most relevant items. The inductive matrix completion (IMC) method is a standard approach to this problem, where the given query as well as the items are embedded in a common low-dimensional space. The inner product between a query embedding and an item embedding reflects the relevance of the (query, item) pair. Non-linear IMC (NIMC) uses non-linear networks to embed the query as well as the items, and is known to be highly effective for a variety of tasks, such as video recommendations for users, semantic web search, etc. Despite its wide usage, the existing literature lacks a rigorous understanding of NIMC models. A key challenge in analyzing such models is to deal with the non-convexity arising out of non-linear embeddings, in addition to the non-convexity arising out of the low-dimensional restriction of the embedding space, which is akin to the low-rank restriction in the standard matrix completion problem. In this paper, we provide the first theoretical analysis for a simple NIMC model in the realizable setting, where the relevance score of a (query, item) pair is formulated as the inner product between their single-layer neural representations.
Our results show that, under mild assumptions, we can recover the ground truth parameters of the NIMC model using standard (stochastic) gradient descent methods if the methods are initialized within a small distance of the optimal parameters. We show that a standard tensor method can be used to initialize the solution within the required distance to the optimal parameters. Furthermore, we show that the number of query-item relevance observations required, a key parameter in learning such models, scales nearly linearly with the input dimensionality, thus matching existing results for the standard linear inductive matrix completion.

1 Introduction

Real-world recommendation systems and information retrieval systems aim to obtain the relevance between "queries" and "items", such as user-item ratings, query-web relevance, query-product relevance, etc. A classic technique to model a recommendation system is matrix completion or collaborative filtering [CR09, GUH16], where the model is learned from a few observed user-item ratings without the need for the entire user/item information. Modern recommendation systems also have access to a large amount of side information about the users and the items. In the meantime, the development of deep learning models has facilitated the extraction of effective neural representations for the users/items. Therefore, modern recommendation systems such as YouTube video recommendation [CAS16], image recommendation [LLL+16], music recommendation [WW14], etc., are adopting deep learning representations for users/items. On the other hand, modern information retrieval systems [HHG+13, NSM+19] are also evolving from lexical relevance between queries and documents to deep semantic relevance by leveraging deep learning techniques.

33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.

Dataset   #movies  #users  # ratings   # movie feat.  # user feat.  NIMC   IMC
ml-100k   1682     943     100,000     39             29            1.034  1.321
ml-1m     3883     6040    1,000,000   38             29            1.021  1.320

Table 1: Test RMSE for recommending existing movies to new users on the Movielens dataset. We use users' demographic information and movies' genre information as features x and y, respectively. We randomly split the users into existing users (training data) and new users (testing data) with ratio 4:1. Hence, we are predicting ratings for completely new users, for whom only demographic features are available. The user features include 21 types of occupations, 7 different age ranges, and one gender feature; the movie features include 18-19 (18 for ml-1m and 19 for ml-100k) genre features and 20 features from the top 20 right singular vectors of the training rating matrix (which has size #training users -by- #movies). k is set to 50, and σ is the ReLU activation.

A common practice for modeling the relevance of a query-item pair is to first extract representations from the original features of the query/item. For instance, convolutional neural networks are used to represent videos/images, and recurrent neural networks/attention-based models are used to extract the embeddings of text features. Then the representations of the pair are used to calculate the relevance by some similarity function, such as cosine similarity or inner product. In particular, if we use the inner product, the query(x)-item(y) relevance can be modeled as follows:

A(x, y) = ⟨U(x), V(y)⟩,   (1)

where x ∈ R^{d1}, y ∈ R^{d2} are the feature vectors, A(x, y) is their relevance, and U : R^{d1} → R^k, V : R^{d2} → R^k are non-linear mappings from the raw feature space to the latent space.
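As a concrete illustration of the scoring rule in Eq. (1), the following sketch computes the relevance of a (query, item) pair as the inner product of two non-linear embeddings. The one-layer sigmoid maps, dimensions, and all names here are our illustrative assumptions, not code from the paper:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def relevance(x, y, U, V):
    """A(x, y) = <U(x), V(y)> with embeddings U(x) = sigmoid(U^T x), V(y) = sigmoid(V^T y)."""
    return sigmoid(U.T @ x) @ sigmoid(V.T @ y)

rng = np.random.default_rng(0)
d1, d2, k = 8, 6, 3            # feature dims and latent dim (k <= d1, d2); illustrative
U = rng.normal(size=(d1, k))   # query-side parameters
V = rng.normal(size=(d2, k))   # item-side parameters
x = rng.normal(size=d1)        # query features
y = rng.normal(size=d2)        # item features
score = relevance(x, y, U, V)  # a scalar relevance score
```

Because each sigmoid embedding lies in (0, 1)^k, the resulting score is confined to (0, k); with a ReLU activation instead, the embeddings would lie in the non-negative orthant.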
We call this model the non-linear inductive matrix completion (NIMC) model, named after the linear inductive matrix completion model [JD13].

Despite the practical success of deep learning models in real-world recommendation and information retrieval systems, a rigorous understanding of why they work is still lacking. Theoretical analyses for the linear case of NIMC [ABEV06], where U, V are linear mappings, have been provided in [JD13, XJZ13, CHD15], where the challenges mainly come from the non-convexity arising from the restriction to the low-dimensional embedding space. For non-linear cases, an additional major challenge is the non-convexity from the non-linear feature extraction mappings.

In this paper, we provide analyses for a simple one-layer neural network style NIMC model. That is, we set U(x) = σ(U*^T x) and V(y) = σ(V*^T y), where σ is a non-linear activation function and U* ∈ R^{d1×k}, V* ∈ R^{d2×k} (k ≤ d1, k ≤ d2). Despite the seemingly simple non-linearity of a one-layer neural network, we can still see a significant improvement over linear IMC, for example on the Movielens [Gro97] dataset, as shown in Table 1. Note that if σ is ReLU, then the latent space is guaranteed to lie in the non-negative orthant, which in itself can be a desirable property for certain recommendation problems.

In particular, we formulate a squared-loss based optimization problem for estimating the parameters U* and V*. We show that under a realizable model and Gaussian input assumption, the objective function is locally strongly convex within a "reasonably large" neighborhood of the ground truth. Moreover, we show that the above strong convexity claim holds even if the number of observed relevance scores is nearly-linear in the dimension and polynomial in the conditioning of the weight matrices.
In particular, for well-conditioned matrices, we can recover the underlying parameters using only poly log(d1 + d2) query-item relevance scores, which is critical for practical recommendation systems as they tend to have very few relevance scores available per query. Our analysis covers popular activation functions, e.g., sigmoid and ReLU, and unearths various subtleties that arise due to the activation function. Finally, we discuss how we can leverage standard tensor decomposition techniques to initialize our parameters well. We would like to stress that practitioners typically use random initialization itself, and hence results studying random initialization for the NIMC model would be of significant interest.

As mentioned above, due to the non-linearity of the activation function along with the non-convexity of the parameter space, existing proof techniques do not apply directly to the problem. Moreover, we have to carefully argue about both the optimization landscape as well as the sample complexity of the algorithm, which has not been carefully studied for neural networks. Our proof establishes some new techniques that might be of independent interest, e.g., how to handle the redundancy in the parameters for the ReLU activation. To the best of our knowledge, this is one of the first theoretically rigorous studies of neural-network based recommendation/retrieval systems and will hopefully be a stepping stone for similar analyses of "deeper" neural-network based recommendation systems. We would also like to highlight that our model can be viewed as a strict generalization of a one-hidden-layer neural network; hence our result represents one of the few rigorous guarantees for models that are more powerful than one-hidden-layer neural networks [LY17, BGMSS18, ZSJ+17].
Finally, we apply our model to synthetic datasets and verify our theoretical analysis.

In summary, the main contribution of this paper is to provide, as far as we know, the first theoretical recovery guarantees for learning a neural-network based inductive matrix completion model using gradient descent when the parameters are initialized by tensor methods.

1.1 Related work

Collaborative filtering: Our model is a non-linear version of the standard inductive matrix completion model [JD13]. Practically, IMC has been applied to gene-disease prediction [ND14], matrix sensing [ZJD15], multi-label classification [YJKD14], blog recommender systems [SCLD15], link prediction [CHD15], and semi-supervised clustering [CHD15, SCH+16]. However, IMC restricts the latent space of users/items to be a linear transformation of the user/item feature space. [SCH+16] extended the model to a three-layer neural network and showed significantly better empirical performance for multi-label/multi-class classification and semi-supervised problems.

Although standard IMC has linear mappings, it is still a non-convex problem due to the bilinear term UV^T. To deal with this non-convex problem, [JD13, Har14] provided recovery guarantees using alternating minimization with sample complexity linear in the dimension. [XJZ13] relaxed this problem to a nuclear-norm problem and also provided recovery guarantees. More general norms have been studied [RSW16, SWZ17a, SWZ17b, SWZ18], e.g., the weighted Frobenius norm and the entry-wise ℓ1 norm. More recently, [ZDG18] used gradient-based non-convex optimization and proved a better sample complexity. [CHD15] studied dirtyIMC models and showed that the sample complexity can be improved if the features are informative when compared to matrix completion. Several low-rank matrix sensing problems [ZJD15, GJZ17] are also closely related to IMC models, where the observations are sampled only from the diagonal elements of the relevance matrix. [Ren10, LY16] introduced and studied an alternate framework for relevance prediction with side information, but the prediction function is linear in their case as well.

Neural networks: Non-linear activation functions play an important role in neural networks. Recently, several powerful results have been discovered for learning one-hidden-layer feedforward neural networks [Tia17, ZSJ+17, JSA15, LY17, BGMSS18, GKKT17, VW18, ZYWG18] and convolutional neural networks [BG17, ZSD17, DLT18a, DLT+18b, GKM18, DWZ+18]. However, our problem is a strict generalization of the one-hidden-layer neural network and is not covered by the above-mentioned results.

Notation. For any function f, we define Õ(f) to be f · log^{O(1)}(f). For two functions f, g, we use the shorthand f ≲ g (resp. f ≳ g) to indicate that f ≤ Cg (resp. f ≥ Cg) for an absolute constant C. We use f ≍ g to mean cf ≤ g ≤ Cf for constants c, C. We use poly(f) to denote f^{O(1)}.

Roadmap. We first present the formal model and the corresponding optimization problem in Section 2. We then present the local strong convexity and local linear convergence results in Section 3. Finally, we show simulation results to verify our theory (Section 4).

2 Problem Formulation

Consider a query-item recommender/retrieval system, where we have n1 queries with feature vectors X := {x_i}_{i∈[n1]} ⊆ R^{d1}, n2 items with feature vectors Y := {y_j}_{j∈[n2]} ⊆ R^{d2}, and a collection of partially observed query-item relevance scores, A_obs = {A(x, y) | (x, y) ∈ Ω ⊆ X × Y}. That is, A(x_i, y_j) is the relevance score that query x_i gave to item y_j. For simplicity, we assume the x_i's and y_j's are sampled i.i.d. from distributions X and Y, respectively. Each element of the index set Ω is
Each element of the index set \u2326 is\nalso sampled independently and uniformly with replacement from S := X \u21e5 Y .\n\n3\n\n\fIn this paper, our goal is to predict the relevance score for any query-item pair with feature vectors\nx and y, respectively. We model the query-item relevance scores as:\n\nA(x, y) = (U\u21e4>x)>(V \u21e4>y),\n\n(2)\nwhere U\u21e4 2 Rd1\u21e5k, V \u21e4 2 Rd2\u21e5k and is a non-linear activation function. Under this realizable\nmodel, our goal is to recover U\u21e4, V \u21e4 from a collection of observed entries, {A(x, y)|(x, y) 2 \u2326}.\nWithout loss of generality, we set d1 = d2. Also we treat k as a constant throughout the paper. Our\nanalysis requires U\u21e4, V \u21e4 to be full column rank, so we require k \uf8ff d. And w.l.o.g., we assume\nk(U\u21e4) = k(V \u21e4) = 1, i.e., the smallest singular value of both U\u21e4 and V \u21e4 is 1.\nNote that this model is similar to one-hidden layer feed-forward network popular in standard classi-\n\ufb01cation/regression tasks. However, as there is an inner product between the output of two non-linear\nlayers, (U\u21e4x) and (V \u21e4y), it cannot be modeled by a single hidden layer neural network (with\nsame number of nodes). Also, for linear activation function, the problem reduces to inductive ma-\ntrix completion [ABEV06, JD13].\nNow, to solve for U\u21e4, V \u21e4, we optimize a simple squared-loss based optimization problem, i.e.,\n\nwhere\n\nmin\n\nU2Rd1\u21e5k,V 2Rd2\u21e5k\n\nf\u2326(U, V ),\n\nf\u2326(U, V ) = X(x,y)2\u2326\n\n((U>x)>(V >y) A(x, y))2.\n\n(3)\n\nNaturally, the above problem is a challenging non-convex optimization problem that is strictly harder\nthan two non-convex optimization problems which are challenging in their own right: a) the linear\ninductive matrix completion where non-convexity arises due to bilinearity of U>V , and b) the stan-\ndard one-hidden layer neural network (NN). 
In fact, recently a lot of research has focused on understanding various properties of both the linear inductive matrix completion problem [GJZ17, JD13] as well as one-hidden-layer NNs [GLM18, ZSJ+17].

In this paper, we show that despite the non-convexity of Problem (3), it behaves as a convex optimization problem close to the optima if the data is sampled stochastically from a Gaussian distribution. This result, combined with standard tensor decomposition based initialization [ZSJ+17, KCL15, JSA15], leads to a polynomial-time algorithm for solving (3) optimally if the data satisfies certain sampling assumptions, as stated in Theorem 2.1. Moreover, we also discuss the effect of various activation functions, especially the difference between a sigmoid activation function and a ReLU activation (see Theorem 3.2 and Theorem 3.4). Informally, our recovery guarantee can be stated as follows.

Theorem 2.1 (Informal Recovery Guarantee). Consider a recommender system with the realizable model of Eq. (2) with sigmoid activation. Assume the features {x_i}_{i∈[n1]} and {y_j}_{j∈[n2]} are sampled i.i.d. from the normal distribution and the observed pairs Ω are sampled i.i.d. from {x_i}_{i∈[n1]} × {y_j}_{j∈[n2]} uniformly at random. Then there exists an algorithm such that U*, V* can be recovered to any precision ε with time complexity and sample complexity (i.e., n1, n2, |Ω|) polynomial in the dimension and the condition number of U*, V*, and logarithmic in 1/ε.

3 Main Results

Our main result shows that, when initialized properly, gradient-based algorithms are guaranteed to converge to the ground truth. We first study the Hessian of the empirical risk for different activation functions; then, based on the positive-definiteness of the Hessian for smooth activations, we show local linear convergence of gradient descent.
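A minimal sketch of this recipe — start near the optimum and run gradient descent on the NIMC objective, drawing a fresh sample set in each iteration — is given below for a sigmoid activation. The step size, problem sizes, and all names are our illustrative assumptions, and the analytic gradient is derived directly from the mean squared loss:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def nimc_loss_grads(U, V, X, Y, A):
    """Mean squared loss over a batch and its analytic gradients for sigma = sigmoid."""
    P, Q = sigmoid(X @ U), sigmoid(Y @ V)        # m x k hidden activations
    r = np.einsum('ij,ij->i', P, Q) - A          # residuals per observed pair
    gU = (2.0 / len(A)) * X.T @ (r[:, None] * P * (1 - P) * Q)   # d/dU mean(r^2)
    gV = (2.0 / len(A)) * Y.T @ (r[:, None] * Q * (1 - Q) * P)   # d/dV mean(r^2)
    return np.mean(r ** 2), gU, gV

rng = np.random.default_rng(3)
d, k, m, eta = 8, 2, 500, 0.5
U_star, V_star = rng.normal(size=(d, k)), rng.normal(size=(d, k))
U = U_star + 0.05 * rng.normal(size=(d, k))      # initialization near (U*, V*)
V = V_star + 0.05 * rng.normal(size=(d, k))
err0 = np.linalg.norm(U - U_star, 'fro') + np.linalg.norm(V - V_star, 'fro')
for _ in range(200):
    X, Y = rng.normal(size=(m, d)), rng.normal(size=(m, d))      # fresh samples
    A = np.einsum('ij,ij->i', sigmoid(X @ U_star), sigmoid(Y @ V_star))
    _, gU, gV = nimc_loss_grads(U, V, X, Y, A)
    U, V = U - eta * gU, V - eta * gV
final_err = np.linalg.norm(U - U_star, 'fro') + np.linalg.norm(V - V_star, 'fro')
```

Resampling a fresh batch per step mirrors the independence assumption used in the convergence argument, at the cost of a sample complexity that grows with the number of iterations.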
The proof sketch is provided in Appendix C.

The positive definiteness of the Hessian does not hold for several activation functions. Here we provide some examples.

Counter Example 1) The Hessian at the ground truth for a linear activation is not positive definite, because for any full-rank matrix R ∈ R^{k×k}, (U*R, V*R^{−⊤}) is also a global optimum.

Counter Example 2) The Hessian at the ground truth for the ReLU activation is not positive definite, because for any diagonal matrix D ∈ R^{k×k} with positive diagonal elements, (U*D, V*D^{−1}) is also a global optimum.

These counter examples have a common property: there is redundancy in the parameters. Surprisingly, for sigmoid and tanh, the Hessian around the ground truth is positive definite. More surprisingly, we will later show that for ReLU, if the parameter space is constrained properly, the Hessian at a given point near the ground truth can also be proved to be positive definite with high probability.

3.1 Local Geometry and Local Linear Convergence for Sigmoid and Tanh

We define two natural condition numbers for the problem that capture the "hardness" of the problem:

Definition 3.1. Define λ := max{λ(U*), λ(V*)} and κ := max{κ(U*), κ(V*)}, where λ(U) = σ_1^k(U)/(Π_{i=1}^k σ_i(U)), κ(U) = σ_1(U)/σ_k(U), and σ_i(U) denotes the i-th singular value of U, with the ordering σ_i ≥ σ_{i+1}.

First we show the result for sigmoid and tanh activations.

Theorem 3.2 (Positive Definiteness of the Hessian for Sigmoid and Tanh). Let the activation function σ in the NIMC model (2) be sigmoid or tanh, and let λ, κ be as defined in Definition 3.1. Then for any t > 1 and any given U, V, if

n1 ≳ t⁴κ²λd log²d,  n2 ≳ t⁴κ²λd log²d,  |Ω| ≳ t⁴κ²λd log²d,  ‖U − U*‖ + ‖V − V*‖ ≲ 1/(λ²κ),

then with probability at least 1 − d^{−t}, the smallest eigenvalue of the Hessian of Eq. (3) is lower bounded by:

λ_min(∇²f_Ω(U, V)) ≳ 1/(λ²κ).

Remark. Theorem 3.2 shows that, given a sufficiently large number of query-item relevance scores and a sufficiently large number of queries/items themselves, the Hessian at a point close enough to the true parameters U*, V* is positive definite with high probability. The sample complexity, including n1, n2 and |Ω|, has a near-linear dependency on the dimension, which matches the linear IMC analysis [JD13]. The strong convexity parameter, as well as the sample complexity, depends on the condition numbers of U*, V* as defined in Definition 3.1. Although we don't explicitly show the dependence on k, both the sample complexity and the minimal eigenvalue scale as a polynomial of k. The proofs can be found in Appendix C.

The above theorem shows that the Hessian is positive definite w.h.p. for a given U, V that is close to the optima. This result, along with the smoothness of the activation function, implies linear convergence of a gradient descent method that samples a fresh batch of samples in each iteration, as shown in the following theorem, whose proof is postponed to Appendix E.1.

Theorem 3.3. Let [U^c, V^c] be the parameters in the c-th iteration. Assume ‖U^c − U*‖ + ‖V^c − V*‖ ≲ 1/(λ²κ). Then, given a fresh sample set Ω that is independent of [U^c, V^c] and satisfies the conditions in Theorem 3.2, the next iterate of one step of gradient descent, i.e., [U^{c+1}, V^{c+1}] = [U^c, V^c] − η∇f_Ω(U^c, V^c), satisfies

‖U^{c+1} − U*‖_F² + ‖V^{c+1} − V*‖_F² ≤ (1 − M_l/M_u)(‖U^c − U*‖_F² + ‖V^c − V*‖_F²)

with probability 1 − d^{−t}, where η = Θ(1/M_u) is the step size, M_l ≳ 1/(λ²κ) is the lower bound on the eigenvalues of the Hessian, and M_u ≲ 1 is the upper bound on the eigenvalues of the Hessian.

Remark. The linear convergence requires a fresh set of samples in each iteration.
However, since it converges linearly to the ground truth, we only need log(1/ε) iterations; therefore, the sample complexity is only logarithmic in 1/ε. This dependency is better than directly using the tensor decomposition method [JSA15], which requires O(1/ε²) samples. Note that we only use tensor decomposition to initialize the parameters; therefore, the sample complexity required by our tensor initialization does not depend on ε.

3.2 Empirical Hessian around the Ground Truth for ReLU

We now present our result for the ReLU activation. As we saw in Counter Example 2, without any further modification, the Hessian for ReLU is not locally strongly convex due to the redundancy in the parameters. Therefore, we reduce the parameter space by fixing one parameter for each (u_i, v_i) pair, i ∈ [k]. In particular, we fix u_{1,i} = u*_{1,i}, ∀i ∈ [k], when minimizing the objective function, Eq. (3), where u_{1,i} is the i-th element of the first row of U. Note that as long as u*_{1,i} ≠ 0, u_{1,i} can be fixed to any other non-zero value; we set u_{1,i} = u*_{1,i} just for simplicity of the proof. The new objective function can be represented as

f_Ω^{ReLU}(W, V) = (1/(2|Ω|)) · Σ_{(x,y)∈Ω} (σ(W^T x_{2:d} + x_1 (u*^{(1)})^T)^T σ(V^T y) − A(x, y))²,   (4)

where u*^{(1)} is the first row of U* and W ∈ R^{(d−1)×k}.

Surprisingly, after fixing one parameter for each (u_i, v_i) pair, the Hessian for ReLU is also positive definite w.h.p. for a given (U, V) around the ground truth.

Theorem 3.4 (Positive Definiteness of the Hessian for ReLU). Define u₀ := min_{i∈[k]} {|u*_{1,i}|}. For any t > 1 and any given U, V, if

n1 ≳ u₀⁻⁴t⁴κ¹²d log²d,  n2 ≳ u₀⁻⁴t⁴κ¹²d log²d,  |Ω| ≳ u₀⁻⁴t⁴κ¹²d log²d,  ‖W − W*‖ + ‖V − V*‖ ≲ u₀⁴/(λ⁴κ¹²),

then with probability 1 − d^{−t}, the minimal eigenvalue of the Hessian of the objective for the ReLU activation function, Eq. (4), is lower bounded:

λ_min(∇²f_Ω^{ReLU}(W, V)) ≳ u₀²/(λ²κ⁴).

Remark. Similar to the sigmoid/tanh case, the sample complexity for the ReLU case also has a near-linear dependency on the dimension. However, here we have a worse dependency on the condition number of the weight matrices. The scale of u₀ can also be important, and in practice one needs to set it carefully. Note that although the activation function is not smooth, the Hessian at a given point can still exist with probability 1, since ReLU is smooth almost everywhere and there are only a finite number of samples. However, owing to the non-smoothness, a proof of convergence of a gradient descent method for ReLU is still an open problem.

3.3 Technical Analysis Challenges

At a high level, the proofs of Theorem 3.2 and Theorem 3.4 include the following steps: 1) show that the population Hessian at the ground truth is positive definite; 2) show that population Hessians near the ground truth are also positive definite; 3) employ the matrix Bernstein inequality to bound the difference between the population Hessian and the empirical Hessian.

Here we describe the challenges in proving the positive definiteness of the Hessian of the population risk at the ground truth. The population risk for Eq.
(3) is given by:

f_D(U, V) = (1/2) · E_{(x,y)∼D}[(σ(U^T x)^T σ(V^T y) − A(x, y))²],   (5)

where D := X × Y.

Let the Hessian of f_D(U, V) at the ground truth (U, V) = (U*, V*) be H* ∈ R^{(2dk)×(2dk)}, which can be decomposed into the following two types of blocks (i ∈ [k], j ∈ [k]):

∂²f_D(U*, V*)/(∂u_i ∂u_j) = E_{x,y}[σ′(u_i*^T x) σ′(u_j*^T x) x x^T σ(v_i*^T y) σ(v_j*^T y)],
∂²f_D(U*, V*)/(∂u_i ∂v_j) = E_{x,y}[σ′(u_i*^T x) σ′(v_j*^T y) x y^T σ(v_i*^T y) σ(u_j*^T x)].

To study the positive definiteness of H*, we characterize the minimal eigenvalue of H* by a constrained optimization problem:

λ_min(H*) = min_{(a,b)∈B} E_{x,y}[(Σ_{i=1}^k σ′(u_i*^T x) σ(v_i*^T y) x^T a_i + σ′(v_i*^T y) σ(u_i*^T x) y^T b_i)²],   (6)

where (a, b) ∈ B denotes that Σ_{i=1}^k (‖a_i‖² + ‖b_i‖²) = 1. Obviously, λ_min(H*) ≥ 0 due to the squared loss and the realizable assumption. However, this is not sufficient for local convexity around the ground truth, which requires positive (semi-)definiteness in a neighborhood of the ground truth. In other words, we need to show that λ_min(H*) is strictly greater than 0, so that we can characterize an area in which the Hessian still preserves positive definiteness (PD) despite the deviation from the ground truth.

As we mentioned previously, there are activation functions that lead to redundancy in the parameters. Hence, one challenge is to distill the properties of the activation functions that preserve PD. Another challenge is the correlation introduced by U* when it is non-orthogonal. So we first study the minimal eigenvalue for orthogonal U* and orthogonal V*, and then link the non-orthogonal case to the orthogonal case.

3.4 Initialization

To achieve the ground truth, our algorithm needs a good initialization method that can initialize the parameters to fall into the neighborhood of the ground truth.
Here we show that this is possible by using a tensor method under the Gaussian assumption. In the following, we consider estimating U*; estimating V* is similar. Define a third-order moment of the input,

M3 := E[A(x, y) · (x^{⊗3} − x ⊗̃ I)],

where x ⊗̃ I := Σ_{j=1}^d [x ⊗ e_j ⊗ e_j + e_j ⊗ x ⊗ e_j + e_j ⊗ e_j ⊗ x]. Define γ_j(α) := E_{z∼N(0,1)}[σ(α · z) z^j], ∀j = 0, 1, 2, 3. Then, under the Gaussian input assumption, M3 = Σ_{i=1}^k α_i (ū_i*)^{⊗3}, where ū_i* = u_i*/‖u_i*‖ and α_i = γ_0(‖v_i*‖)(γ_3(‖u_i*‖) − 3γ_1(‖u_i*‖)). When α_i ≠ 0, we can approximately recover α_i and ū_i* from the empirical version of M3 using non-orthogonal tensor decomposition [KCL15].

Although tensor initialization has nice theoretical guarantees and sample complexity, it heavily depends on the Gaussian assumption and the realizable model assumption. In contrast, practitioners typically use random initialization.

4 Simulation

In this section, we generate synthetic datasets to verify the sample complexity and the convergence of gradient descent. For simplicity, we apply gradient descent with initialization W^{(0)} = (1 − α)W* + αW^{(r)}, where W* is the ground truth, W^{(r)} is a Gaussian random matrix, and α ∈ [0, 1]. In Fig. 1 (a)(b), α = 0.1, and in Fig. 1 (c)(d), α = 1. The sampling rule for the observations follows our previous assumptions. For each (n, m) pair, we run 5 trials and average the number of successful recoveries. We say a solution (U, V) successfully recovers the ground truth parameters when it achieves a relative test error of 0.001, i.e.,

‖σ(X_t U)σ(X_t U)^T − σ(X_t U*)σ(X_t U*)^T‖_F ≤ 0.001 · ‖σ(X_t U*)σ(X_t U*)^T‖_F,

where X_t ∈ R^{n×d} is a newly sampled test dataset. For both ReLU and sigmoid, we minimize the original objective function (3). In (a), k = 5, d = 10, n = 1000, and m = 10000.
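The relative-error success criterion above can be computed as in the following sketch; the ReLU activation, all sizes, and the names (`recovered`, `X_t`) are our illustrative assumptions:

```python
import numpy as np

def relu(z):
    return np.maximum(z, 0.0)

def recovered(U, U_star, X_t, tol=1e-3):
    """True if ||s(XtU)s(XtU)^T - s(XtU*)s(XtU*)^T||_F <= tol * ||s(XtU*)s(XtU*)^T||_F."""
    G = relu(X_t @ U) @ relu(X_t @ U).T            # predicted Gram matrix on test data
    G_star = relu(X_t @ U_star) @ relu(X_t @ U_star).T  # ground-truth Gram matrix
    return np.linalg.norm(G - G_star) <= tol * np.linalg.norm(G_star)

rng = np.random.default_rng(4)
n, d, k = 50, 10, 5
U_star = rng.normal(size=(d, k))
X_t = rng.normal(size=(n, d))                      # fresh test features
exact = recovered(U_star.copy(), U_star, X_t)      # the ground truth itself passes
far = recovered(rng.normal(size=(d, k)), U_star, X_t)  # an unrelated random matrix fails
```

Comparing Gram matrices rather than parameters makes the criterion invariant to the parameter redundancies discussed in Section 3 (e.g., positive diagonal rescalings for ReLU).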
As we can see, (a) shows how the objective value converges, which is almost linear. In (b), we show how the initialization affects the recovery rate: when k, d are large (k = 10, d = 100, n = 500) and the initialization is purely random (α = 1), gradient descent doesn't converge to the ground truth. As shown in Fig. 1 (c)(d), when k = 5, d = 10, pure random initialization can converge to the ground truth. We believe that this is because when k, d are larger, a random initialization can be further away from the ground truth; hence, gradient descent can get stuck in local optima more easily.

[Figure 1 panels: (a) log(obj) vs. iteration; (b) recovery rate (m/(kd)) vs. alpha; (c) recovery rate vs. (m, n) for sigmoid; (d) recovery rate vs. (m, n) for ReLU.]

Figure 1: (a) shows how gradient descent converges, which is almost linear. (b) shows that when k, d are large (k = 10, d = 100) and the initialization is purely random (α = 1), gradient descent doesn't converge to the ground truth. (c) and (d) show the recovery rate for the sigmoid and ReLU activation functions, respectively. White blocks denote 100% recovery rate over 5 trials, while black means 100% failure.

We illustrate the recovery rate and sample complexity for sigmoid and ReLU in Figure 1 (c)(d). For sigmoid, we set the number of samples n1 = n2 = n = {10 · i}_{i=1,2,...,10} and the number of observations |Ω| = m = {2kd · i}_{i=1,2,...,10}.
For ReLU, we set n = {20 · i}_{i=1,2,...,10} and m = {4kd · i}_{i=1,2,...,10}. As we can see, ReLU requires more samples/observations than sigmoid for exact recovery (note that the scales of n and m/(2kd) differ between the two figures). This is consistent with our theoretical results: comparing Theorem 3.2 and Theorem 3.4, we can see that the sample complexity for ReLU has a worse dependency on the conditioning of U*, V* than that for sigmoid. We can also see that when n is sufficiently large, the number of observed ratings required remains the same for both methods. This is also consistent with the theorems, where |Ω| is near-linear in d and independent of n.

5 Conclusions

In this paper, we propose a non-linear IMC model that represents one of the simplest inductive models for neural-network-based recommendation/retrieval systems. We study the local geometry of the empirical risk function and show that, close to the optima, the function is strongly convex for both ReLU and sigmoid activations. Therefore, using a smooth activation function like sigmoid, gradient descent recovers the underlying model with polynomial sample complexity and time complexity if the parameters are initialized by standard tensor methods. Thus, we provide the first theoretically rigorous result for the non-linear recommendation/retrieval system problem, which we hope will spur further progress on the theory of deep-learning-based recommendation/retrieval systems.

References

[ABEV06] Jacob Abernethy, Francis Bach, Theodoros Evgeniou, and Jean-Philippe Vert. Low-rank matrix factorization with attributes. arXiv preprint cs/0611124, 2006.

[BG17] Alon Brutzkus and Amir Globerson. Globally optimal gradient descent for a convnet with Gaussian inputs. In ICML. https://arxiv.org/pdf/1702.07966, 2017.

[BGMSS18] Alon Brutzkus, Amir Globerson, Eran Malach, and Shai Shalev-Shwartz. SGD learns
SGD learns\nover-parameterized networks that provably generalize on linearly separable data. In\nICLR. https://arxiv.org/pdf/1710.10174, 2018.\n\n[CAS16] Paul Covington, Jay Adams, and Emre Sargin. Deep neural networks for YouTube\nIn Proceedings of the 10th ACM Conference on Recommender\n\nrecommendations.\nSystems, pages 191\u2013198. ACM, 2016.\n\n[CHD15] Kai-Yang Chiang, Cho-Jui Hsieh, and Inderjit S Dhillon. Matrix completion with\nnoisy side information. In Advances in Neural Information Processing Systems, pages\n3447\u20133455, 2015.\n\n[CR09] Emmanuel J. Cand\u00e8s and Benjamin Recht. Exact matrix completion via convex op-\ntimization. Foundations of Computational Mathematics, 9(6):717\u2013772, December\n2009.\n\n[DLT18a] Simon S Du, Jason D Lee, and Yuandong Tian. When is a convolutional \ufb01lter easy to\n\nlearn? In ICLR. https://arxiv.org/pdf/1709.06129, 2018.\n\n[DLT+18b] Simon S Du, Jason D Lee, Yuandong Tian, Barnabas Poczos, and Aarti Singh. Gradi-\nent descent learns one-hidden-layer CNN: Don\u2019t be afraid of spurious local minima.\nIn ICML. https://arxiv.org/pdf/1712.00779, 2018.\n\n[DWZ+18] Simon S Du, Yining Wang, Xiyu Zhai, Sivaraman Balakrishnan, Ruslan R Salakhut-\ndinov, and Aarti Singh. How many samples are needed to estimate a convolutional\nneural network? In Advances in Neural Information Processing Systems, pages 371\u2013\n381, 2018.\n\n[GJZ17] Rong Ge, Chi Jin, and Yi Zheng. No spurious local minima in nonconvex low rank\nIn International Conference on Machine\n\nproblems: A uni\ufb01ed geometric analysis.\nLearning, pages 1233\u20131242. https://arxiv.org/pdf/1704.00708, 2017.\n\n[GKKT17] Surbhi Goel, Varun Kanade, Adam Klivans, and Justin Thaler. Reliably learning the\nReLU in polynomial time. In 30th Annual Conference on Learning Theory (COLT).\nhttps://arxiv.org/pdf/1611.10258, 2017.\n\n[GKM18] Surbhi Goel, Adam Klivans, and Reghu Meka. Learning one convolutional layer with\n\noverlapping patches. In ICML. 
https://arxiv.org/pdf/1802.02547, 2018.\n\n[GLM18] Rong Ge, Jason D Lee, and Tengyu Ma. Learning one-hidden-layer neural networks\n\nwith landscape design. In ICLR. https://arxiv.org/pdf/1711.00501, 2018.\n\n[Gro97] GroupLens. Movie lens dataset.\n\nIn University of Minnesota. http://www.\n\ngrouplens.org/taxonomy/term/14, 1997.\n\n[GUH16] Carlos A Gomez-Uribe and Neil Hunt. The Net\ufb02ix recommender system: Algorithms,\nbusiness value, and innovation. ACM Transactions on Management Information Sys-\ntems (TMIS), 6(4):13, 2016.\n\n[Har14] Moritz Hardt. Understanding alternating minimization for matrix completion.\n\nIn\nFoundations of Computer Science (FOCS), 2014 IEEE 55th Annual Symposium on,\npages 651\u2013660. IEEE, 2014.\n\n[HHG+13] Po-Sen Huang, Xiaodong He, Jianfeng Gao, Li Deng, Alex Acero, and Larry Heck.\nLearning deep structured semantic models for web search using clickthrough data. In\nProceedings of the 22nd ACM international conference on Conference on information\n& knowledge management, pages 2333\u20132338. ACM, 2013.\n\n9\n\n\f[HKZ12] Daniel Hsu, Sham M Kakade, and Tong Zhang. A tail inequality for quadratic forms of\nsubGaussian random vectors. Electronic Communications in Probability, 17(52):1\u20136,\n2012.\n\n[JD13] Prateek Jain and Inderjit S Dhillon. Provable inductive matrix completion. arXiv\n\npreprint arXiv:1306.0626, 2013.\n\n[JSA15] Majid Janzamin, Hanie Sedghi, and Anima Anandkumar. Beating the perils of non-\narXiv\n\nconvexity: Guaranteed training of neural networks using tensor methods.\npreprint 1506.08473, 2015.\n\n[KCL15] Volodymyr Kuleshov, Arun Chaganty, and Percy Liang. Tensor factorization via ma-\n\ntrix factorization. In AISTATS, pages 507\u2013516, 2015.\n\n[LLL+16] Chenyi Lei, Dong Liu, Weiping Li, Zheng-Jun Zha, and Houqiang Li. Comparative\ndeep learning of hybrid representations for image recommendations. 
In Proceedings of\nthe IEEE Conference on Computer Vision and Pattern Recognition, pages 2545\u20132553,\n2016.\n\n[LY16] Ming Lin and Jieping Ye. A non-convex one-pass framework for generalized fac-\ntorization machine and rank-one matrix sensing. In Advances in Neural Information\nProcessing Systems, pages 1633\u20131641, 2016.\n\n[LY17] Yuanzhi Li and Yang Yuan. Convergence analysis of two-layer neural networks with\nReLU activation. In Advances in Neural Information Processing Systems, pages 597\u2013\n607. https://arxiv.org/pdf/1705.09886, 2017.\n\n[ND14] Nagarajan Natarajan and Inderjit S Dhillon. Inductive matrix completion for predict-\n\ning gene\u2013disease associations. Bioinformatics, 30(12):i60\u2013i68, 2014.\n\n[NSM+19] Priyanka Nigam, Yiwei Song, Vijai Mohan, Vihan Lakshman, Weitian Ding, Ankit\nShingavi, Choon Hui Teo, Hao Gu, and Bing Yin. Deep semantic product search.\nIn Proceedings of the 25nd ACM SIGKDD International Conference on Knowledge\nDiscovery and Data Mining. ACM, 2019.\n\n[Ren10] Steffen Rendle. Factorization machines. In Data Mining (ICDM), 2010 IEEE 10th\n\nInternational Conference on, pages 995\u20131000. IEEE, 2010.\n\n[RSW16] Ilya Razenshteyn, Zhao Song, and David P Woodruff. Weighted low rank approx-\nimations with provable guarantees. In Proceedings of the forty-eighth annual ACM\nsymposium on Theory of Computing (STOC), pages 250\u2013263. ACM, 2016.\n\n[SCH+16] Si Si, Kai-Yang Chiang, Cho-Jui Hsieh, Nikhil Rao, and Inderjit S Dhillon. Goal-\ndirected inductive matrix completion.\nIn Proceedings of the 22nd ACM SIGKDD\nInternational Conference on Knowledge Discovery and Data Mining, pages 1165\u2013\n1174. ACM, 2016.\n\n[SCLD15] Donghyuk Shin, Suleyman Cetintas, Kuang-Chih Lee, and Inderjit S Dhillon. Tumblr\nblog recommendation with boosted inductive matrix completion. In Proceedings of the\n24th ACM International on Conference on Information and Knowledge Management,\npages 203\u2013212. 
ACM, 2015.\n\n[SWZ17a] Zhao Song, David P Woodruff, and Peilin Zhong. Low rank approximation with en-\ntrywise `1-norm error. In Proceedings of the 49th Annual Symposium on the Theory\nof Computing (STOC). ACM, https://arxiv.org/pdf/1611.00898, 2017.\n\n[SWZ17b] Zhao Song, David P Woodruff, and Peilin Zhong. Relative error tensor low rank\n\napproximation. arXiv preprint arXiv:1704.08246, 2017.\n\n[SWZ18] Zhao Song, David P Woodruff, and Peilin Zhong. Towards a zero-one law for entry-\n\nwise low rank approximation. 2018.\n\n[Tia17] Yuandong Tian. An analytical formula of population gradient for two-layered ReLU\nIn ICML.\n\nnetwork and its applications in convergence and critical point analysis.\nhttps://arxiv.org/pdf/1703.00560, 2017.\n\n10\n\n\f[Tro12] Joel A. Tropp. User-friendly tail bounds for sums of random matrices. Foundations of\n\nComputational Mathematics, 12(4):389\u2013434, 2012.\n\n[VW18] Santosh Vempala and John Wilmes. Polynomial convergence of gradient descent for\n\ntraining one-hidden-layer neural networks. arXiv preprint arXiv:1805.02677, 2018.\n\n[WW14] Xinxi Wang and Ye Wang. Improving content-based and hybrid music recommenda-\ntion using deep learning. In Proceedings of the 22nd ACM international conference\non Multimedia, pages 627\u2013636. ACM, 2014.\n\n[XJZ13] Miao Xu, Rong Jin, and Zhi-Hua Zhou. Speedup matrix completion with side infor-\n\nmation: Application to multi-label learning. In NIPS, pages 2301\u20132309, 2013.\n\n[YJKD14] Hsiang-Fu Yu, Prateek Jain, Purushottam Kar, and Inderjit Dhillon. Large-scale multi-\n\nlabel learning with missing labels. In ICML, pages 593\u2013601, 2014.\n\n[ZDG18] Xiao Zhang, Simon S Du, and Quanquan Gu. Fast and sample ef\ufb01cient inductive\nmatrix completion via multi-phase procrustes \ufb02ow. In ICML. https://arxiv.org/\npdf/1803.01233, 2018.\n\n[ZJD15] Kai Zhong, Prateek Jain, and Inderjit S. Dhillon. Ef\ufb01cient matrix sensing using rank-1\nGaussian measurements. 
In International Conference on Algorithmic Learning The-\nory, pages 3\u201318. Springer, 2015.\n\n[ZSD17] Kai Zhong, Zhao Song, and Inderjit S Dhillon. Learning non-overlapping convolu-\ntional neural networks with multiple kernels. arXiv preprint arXiv:1711.03440, 2017.\n[ZSJ+17] Kai Zhong, Zhao Song, Prateek Jain, Peter L Bartlett, and Inderjit S Dhillon. Recovery\nguarantees for one-hidden-layer neural networks. In ICML. https://arxiv.org/\npdf/1706.03175, 2017.\n\n[ZYWG18] Xiao Zhang, Yaodong Yu, Lingxiao Wang, and Quanquan Gu. Learning one-hidden-\n\nlayer relu networks via gradient descent. arXiv preprint arXiv:1806.07808, 2018.\n\n11\n\n\f", "award": [], "sourceid": 6094, "authors": [{"given_name": "Kai", "family_name": "Zhong", "institution": "Amazon"}, {"given_name": "Zhao", "family_name": "Song", "institution": "UT-Austin"}, {"given_name": "Prateek", "family_name": "Jain", "institution": "Microsoft Research"}, {"given_name": "Inderjit", "family_name": "Dhillon", "institution": "UT Austin & Amazon"}]}