{"title": "The limits of squared Euclidean distance regularization", "book": "Advances in Neural Information Processing Systems", "page_first": 2807, "page_last": 2815, "abstract": "Some of the simplest loss functions considered in Machine Learning are the square loss, the logistic loss and the hinge loss. The most common family of algorithms, including Gradient Descent (GD) with and without Weight Decay, always predict with a linear combination of the past instances. We give a random construction for sets of examples where the target linear weight vector is trivial to learn but any algorithm from the above family is drastically sub-optimal. Our lower bound on the latter algorithms holds even if the algorithms are enhanced with an arbitrary kernel function. This type of result was known for the square loss. However, we develop new techniques that let us prove such hardness results for any loss function satisfying some minimal requirements on the loss function (including the three listed above). We also show that algorithms that regularize with the squared Euclidean distance are easily confused by random features. Finally, we conclude by discussing related open problems regarding feed forward neural networks. We conjecture that our hardness results hold for any training algorithm that is based on the squared Euclidean distance regularization (i.e. Back-propagation with the Weight Decay heuristic).", "full_text": "The limits of squared\n\nEuclidean distance regularization\u2217\n\nMicha\u0142 Derezi\u00b4nski\n\nComputer Science Department\n\nUniversity of California, Santa Cruz\n\nCA 95064, U.S.A.\n\nmderezin@soe.ucsc.edu\n\nManfred K. Warmuth\n\nComputer Science Department\n\nUniversity of California, Santa Cruz\n\nCA 95064, U.S.A.\n\nmanfred@cse.ucsc.edu\n\nAbstract\n\nSome of the simplest loss functions considered in Machine Learning are the square\nloss, the logistic loss and the hinge loss. 
The most common family of algorithms,\nincluding Gradient Descent (GD) with and without Weight Decay, always predict\nwith a linear combination of the past instances. We give a random construction\nfor sets of examples where the target linear weight vector is trivial to learn but any\nalgorithm from the above family is drastically sub-optimal. Our lower bound on\nthe latter algorithms holds even if the algorithms are enhanced with an arbitrary\nkernel function.\nThis type of result was known for the square loss. However, we develop new\ntechniques that let us prove such hardness results for any loss function satisfying\nsome minimal requirements on the loss function (including the three listed above).\nWe also show that algorithms that regularize with the squared Euclidean distance\nare easily confused by random features. Finally, we conclude by discussing re-\nlated open problems regarding feed forward neural networks. We conjecture that\nour hardness results hold for any training algorithm that is based on the squared\nEuclidean distance regularization (i.e. Back-propagation with the Weight Decay\nheuristic).\n\n1\n\nIntroduction\n\nWe de\ufb01ne a set of simple linear learning problems described by an n dimensional square matrix\nM with \u00b11 entries. The rows xi of M are n instances, the columns correspond to the n possible\ntargets, and Mij is the label given by target j to the\ninstance xi (See Figure 1). Note, that Mij = xi \u00b7 ej,\nwhere ej is the j-th unit vector. That is, the j-th target\nis a linear function that picks the j-th column out of\nM. 
It is important to understand that the matrix M, which we call the problem matrix, specifies n learning problems: in the j-th problem each of the n instances (rows) is labeled by the j-th target (column). The rationale for defining a set of problems instead of a single problem follows from the fact that learning a single problem is easy and we need to average the prediction loss over the n problems to obtain a hardness result.

Figure 1: A random ±1 matrix M: the instances are the rows and the targets the columns of the matrix, e.g.

    −1 +1 −1 +1
    −1 +1 +1 −1
    +1 −1 −1 +1
    +1 +1 −1 +1

When the j-th column is the target, then we have a linear learning problem where the j-th unit vector is the target weight vector.

∗This research was supported by the NSF grant IIS-1118028.

The protocol of learning is simple: The algorithm is given k training instances labeled by one of the targets. It then produces a linear weight vector w that aims to incur small average loss on all n instances labeled by the same target.1 Any loss function satisfying some minimal assumptions can be used, including the square, the logistic and the hinge loss. We will show that when M is random, then this type of problem is hard to learn by any algorithm from a certain class of algorithms.2 By hard to learn we mean that the loss is high when we average over instances and targets. The class of algorithms for which we prove our hardness results is any algorithm whose prediction on a new instance vector x is a function of w · x, where the weight vector w is a linear combination of training examples. This includes any algorithm motivated by regularizing with ||w||_2^2 (i.e. 
algorithms motivated by the Representer Theorem [KW71, SHS01]) or alternatively any algorithm that exhibits certain rotation invariance properties [WV05, Ng04, WKZ14]. Note that any version of Gradient Descent or Weight Decay on the three loss functions listed above belongs to this class of algorithms, i.e. it predicts with a linear combination of the instances seen so far.

This class of simple algorithms has many advantages (such as the fact that it can be kernelized). However, we show that this class is very slow at learning the simple learning problems described above. More precisely, our lower bounds for a randomly chosen M have the following form: For some constants A ∈ (0, 1] and B ≥ 1 that depend on the loss function, any algorithm that predicts with linear combinations of k instances has average loss at least A − B·k/n with high probability, where the average is over instances and targets. This means that after seeing a fraction A/(2B) of all n instances, the average loss is still at least the constant A/2 (see the red solid curve in Figure 2 for a typical plot of the average loss of GD).

Note that there are trivial algorithms that learn our learning problem much faster. These algorithms clearly do not predict with a linear combination of the given instances. For example, one simple algorithm keeps track of the set of targets that are consistent with the k examples seen so far (the version space) and chooses one target in the version space at random. This algorithm has the following properties: After seeing k instances, the expected size of the version space is about 1 + (n−1)/2^k, so after O(log₂ n) examples, with high probability there is only one unit vector e_j left in the version space that labels all the examples correctly.

One way to closely approximate the above version space algorithm is to run the Exponentiated Gradient (EG) algorithm [KW97b] with a large learning rate. 
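The version space dynamics described above are easy to check numerically. The following sketch (our own construction; all names are ours, and the multiplicative update is a simplified Hedge-style stand-in for EG with a large learning rate) simulates both the version space algorithm and a multiplicative update on a random problem matrix:

```python
import numpy as np

rng = np.random.default_rng(0)
n = 100
M = rng.choice([-1, 1], size=(n, n))   # random problem matrix
j = 0                                  # the target is the first column

# Version space algorithm: keep only targets consistent with the examples so far.
version_space = set(range(n))
for i in rng.permutation(n):
    version_space = {t for t in version_space if M[i, t] == M[i, j]}
assert version_space == {j}            # only the true target survives

# A Hedge-style multiplicative update (a simplification of EG with a large
# learning rate): each target's weight is multiplied by exp(-eta * mistake).
w = np.ones(n) / n
eta = 5.0
for i in rng.permutation(n)[:20]:      # only 20 of the 100 examples
    losses = (M[i] != M[i, j]).astype(float)
    w = w * np.exp(-eta * losses)
    w = w / w.sum()                    # re-normalize to a probability vector
```

After roughly log₂ n examples the multiplicative weights concentrate on the true target, mirroring the exponential shrinkage of the version space.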
The EG algorithm maintains a weight vector which is a probability vector. It updates the weights by multiplying them by non-negative factors and then re-normalizes them to a probability vector. The factors are the exponentiated negative scaled derivatives of the loss. See the dot-dashed green curve of Figure 2 for a typical plot of the average loss of EG. It converges "exponentially faster" than GD for the problem given in Figure 1. General regret bounds for the EG algorithm are known (see e.g. [KW97b, HKW99]) that grow logarithmically with the dimension n of the problem. Curiously enough, for the EG family of algorithms, the componentwise logarithm of the weight vector is a linear combination of the instances.3 If we add a 1-norm regularization to the loss, then GD behaves more like the EG algorithm (see the dashed blue curve of Figure 2). In Figure 3 we plot the weights of the EG and GD algorithms (with optimized learning rates) when the target is the first column of a 100 dimensional random matrix.

Figure 2: The average logistic loss of Gradient Descent (with and without 1-norm regularization) and the Exponentiated Gradient algorithm for the problem of learning the first column of a 100 dimensional square ±1 matrix. The x-axis is the number of examples k in the training set. Note that the average logistic loss for Gradient Descent decreases roughly linearly.

1 Since the sample space is so small it is cleaner to require small average loss on all n instances than on just the n − k test instances. See [WV05] for a discussion.
2 Our setup is the same as the one used in [WV05], where such hardness results were proved for the square loss only. The generalization to the more general losses is non-trivial.
3 This is a simplification because it ignores the normalization.

Figure 3: In the learning problem the rows of a 100-dimensional random ±1 matrix are labeled by the first column. 
The x-axis is the number of instances k ∈ {1, . . . , 100} seen by the algorithm. We plot all 100 weights of the GD algorithm (left), GD with 1-norm regularization (center) and the EG algorithm (right) as a function of k. The GD algorithm keeps lots of small weights around and the first weight grows only linearly. The EG algorithm wipes out the irrelevant weights much faster and brings up the good weight exponentially fast. GD with 1-norm regularization behaves like GD for small k and like EG for large k.

The GD algorithm keeps all the small weights around and the weight of the first component only grows linearly. In contrast, the EG algorithm grows the target weight much faster. This is because in a GD algorithm the squared 2-norm regularization does not punish small weights enough (because w_i² ≈ 0 when w_i is small). If we add a 1-norm regularization to the loss then the irrelevant weights of GD disappear more quickly and the algorithm behaves more like EG.

Kernelization

We clearly have a simple linear learning problem in Figure 1. So, can we help the class of algorithms that predicts with linear combinations of the instances by "expanding" the instances with a feature map? In other words, we could replace the instance x by φ(x), where φ is any mapping from Rⁿ to Rᵐ, and m might be much larger than n (and can even be infinite dimensional). The weight vector is now a linear combination of the expanded instances and computing the dot product of this weight vector with a new expanded instance requires the computation of dot products between expanded instances.4
Even though the class of algorithms that predicts with a linear combination of instances is good at incorporating such an expansion (also referred to as an embedding into a feature space), we can show that our hardness results still hold even if any such expansion is used. 
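The kernelized form of such algorithms can be sketched concretely. In the following (our own illustration, with a degree-2 polynomial kernel standing in for an arbitrary φ), predictions are computed purely from dot products of expanded instances, and we verify against the explicit feature map of that kernel:

```python
import numpy as np

rng = np.random.default_rng(2)
n, k = 8, 5
M = rng.choice([-1.0, 1.0], size=(n, n))
X = M                                   # the instances are the rows of M

# A degree-2 polynomial kernel stands in for the dot product phi(x).phi(x');
# phi itself is never materialized when predicting.
kernel = lambda A, B: (A @ B.T + 1.0) ** 2

a = rng.normal(size=k)                  # coefficients on the k training instances
# Linear activations on every instance: phi(X) @ (phi(X[:k]).T @ a) = K @ a
activations = kernel(X, X[:k]) @ a

# Sanity check against the explicit expansion for this particular kernel.
def phi(x):
    # explicit feature map of (x . x' + 1)^2: constant, linear, quadratic terms
    out = [1.0]
    out += [np.sqrt(2.0) * v for v in x]
    out += [x[i] * x[j] * (np.sqrt(2.0) if i != j else 1.0)
            for i in range(len(x)) for j in range(i, len(x))]
    return np.array(out)

Phi = np.stack([phi(x) for x in X])
w = Phi[:k].T @ a                       # weight vector in expanded space
assert np.allclose(Phi @ w, activations)
```

The check confirms that the kernel evaluations agree with dot products in the explicit feature space, so the weight vector never needs to be formed explicitly.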
In other words it does not help if the instances (rows) are represented by any other set of vectors in Rᵐ. Note that the learner knows that it will receive examples from one of the n problems specified by the problem matrix M. The expansion is allowed to depend on M, but it has to be chosen before any examples are seen by the learner.

4 This can often be done efficiently via a kernel function. Our result only requires that the dot products between the expanded instances are finite and the φ map can be defined implicitly via a kernel function.

Related work

There is a long history of proving hardness results for the class of algorithms that predict with linear combinations of instances [KW97a, KWA97]. In particular, in [WV05] it was shown for the Hadamard matrix and the square loss that the average loss is at least 1 − k/n even if an arbitrary expansion is used. This means that if the algorithm is given half of all n instances, its average square loss is still half. The underlying model is a simple linear neuron. It was left as an open problem what happens, for example, for a sigmoided linear neuron and the logistic loss. Can the hardness result be circumvented by choosing a different neuron and loss function? In this paper, we are able to show that this type of hardness result for algorithms that predict with a linear combination of the instances is robust to learning with a rather general class of linear neurons and more general loss functions. The hardness result of [WV05] for the square loss followed from a basic property of the Singular Value Decomposition. However, our hardness results require more complicated counting
techniques. For the more general class of loss functions we consider, the Hadamard matrix actually leads to a weaker bound and we had to use random matrices instead.
Moreover, it was shown experimentally in [WV05] (and to some extent theoretically in [Ng04]) that the generalization bounds of 1-norm regularized linear regression grow logarithmically with the dimension n of the problem. Also, a linear lower bound for any algorithm that predicts with linear combinations of instances was given in Theorem 4.3 of [Ng04]. However, that lower bound is based on the fact that the Vapnik–Chervonenkis (VC) dimension of n-dimensional halfspaces is n + 1, and the resulting linear lower bound holds for any algorithm. No particular problem is given that is easy to learn by, say, multiplicative updates and hard to learn by GD. In contrast, we give a random problem in Figure 1 that is trivial to learn by some algorithms, but hard to learn by the natural and most commonly used class of algorithms which predicts with linear combinations of instances. Note that the number of target concepts we are trying to learn is n, and therefore the VC dimension of our problem is at most log₂ n.
There is also a large body of work that shows that certain problems cannot be embedded with a large 2-norm margin (see [FS02, BDES02] and the more recent work on similarity functions [BBS08]). An embedding with large margins allows for good generalization bounds. This means that if a problem cannot be embedded with a large margin, then the generalization bounds based on the margin argument are weak. However, we don't know of any hardness results for the family of algorithms that predict with linear combinations in terms of a margin argument, i.e. 
lower bounds on generalization for this class of algorithms that are based on non-embeddability with large 2-norm margins.

Random features

The purpose of this type of research is to delineate which types of problems can or cannot be efficiently learned by certain classes of algorithms. We give a problem for which the sample complexity of the trivial algorithm is logarithmic in n, whereas it is linear in n for the natural class of algorithms that predicts with linear combinations of instances. However, why should we consider learning problems that pick columns out of a random matrix? Natural data is never random. However, the problem with this class of algorithms is much more fundamental. We will argue in Section 4 that those algorithms get confused by random irrelevant features. This is a problem if datasets are based on some physical phenomena and contain at least some random or noisy features. It seems that because of the weak regularization of small weights (i.e. w_i² ≈ 0 when w_i is small), the algorithms are given the freedom to fit noisy features.

Outline

After giving some notation in the next section and defining the class of loss functions we consider, we prove our main hardness result in Section 3. We then argue that the family of algorithms that predicts with linear combinations of instances gets confused by random features (Section 4). Finally, we conclude by discussing related open problems regarding feed forward neural nets in Section 5: We conjecture that going from single neurons to neural nets does not help as long as the training algorithm is Gradient Descent with a squared Euclidean distance regularization.

2 Notations

We will now describe our learning problem and some notations for representing algorithms that predict with a linear combination of instances. Let M be a ±1 valued problem matrix. For the sake of simplicity we assume M is square (n × n). 
The i-th row of M (denoted as x_i) is the i-th instance vector, while the j-th column of M is the labeling of the instances by the j-th target. We allow the learner to map the instances to an m-dimensional feature space, that is, x_i is replaced by φ(x_i), where φ : Rⁿ → Rᵐ is an arbitrary mapping. We let Z ∈ R^{n×m} denote the new instance matrix with its i-th row being φ(x_i).5

5 The number of features m can even be infinite as long as the n² dot products Z Zᵀ between the expanded instances are all finite. On the other hand, m can also be less than n.

The algorithm is given the first k rows of Z labeled by one of the n targets. We use Ẑ to denote the first k rows of Z. After seeing the rows of Ẑ labeled by target i, the algorithm produces a linear combination w_i of the k rows. Thus the weight vector w_i takes the form w_i = Ẑᵀ a_i, where a_i is the vector of the k linear coefficients. We aggregate the n weight vectors and coefficients into the m × n and k × n matrices, respectively: W := [w_1, . . . , w_n] and A := [a_1, . . . , a_n]. Clearly, W = Ẑᵀ A. By applying the weight matrix to the instance matrix Z we obtain the n × n prediction matrix of the algorithm:

    P = Z W = Z Ẑᵀ A.

Note that P_ij = φ(x_i) · w_j is the linear activation of the algorithm produced for the i-th instance after receiving the first k rows of Z labeled with the j-th target.
We are now interested in comparing the prediction matrix with the problem matrix using a non-negative loss function L : R × {−1, 1} → R_{≥0}. 
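The notation above can be made concrete in a few lines. The following sketch (our own construction; the random linear feature map is just a placeholder for an arbitrary φ) builds the matrices Z, Ẑ, A, W and P, and checks the key structural fact used later: P has rank at most k no matter which coefficients A the algorithm chooses.

```python
import numpy as np

rng = np.random.default_rng(3)
n, m, k = 30, 50, 6
M = rng.choice([-1.0, 1.0], size=(n, n))

# An arbitrary feature map phi: R^n -> R^m (here a random linear one),
# fixed before any examples are seen.
Phi_map = rng.normal(size=(n, m))
Z = M @ Phi_map                 # expanded instance matrix; row i is phi(x_i)
Z_hat = Z[:k]                   # the k training rows shown to the algorithm

A = rng.normal(size=(k, n))     # one coefficient vector a_j per target
W = Z_hat.T @ A                 # each weight vector is a combination of rows
P = Z @ W                       # n x n prediction matrix, equals Z @ Z_hat.T @ A

# Whatever A the algorithm picks, P has rank at most k.
assert np.linalg.matrix_rank(P) <= k
```

This rank bound is exactly the hook for the counting argument in Section 3.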
We define the average loss of the algorithm as

    (1/n²) Σ_{i,j} L(P_ij, M_ij).

Note that the loss is between linear activations and binary labels and we average it over instances and targets.
Definition 1 We call a loss function L : R × {−1, 1} → R_{≥0} C-regular, where C > 0, if L(a, y) ≥ C whenever a · y ≤ 0, i.e. a and y have different signs.

The loss function guarantees that if the algorithm produces a linear activation of a different sign, then a loss of at least C is incurred. Three commonly used 1-regular losses are:

• Square Loss, L(a, y) = (a − y)², used in Linear Regression.
• Logistic Loss, L(a, y) = −((1+y)/2) log₂(σ(a)) − ((1−y)/2) log₂(1 − σ(a)), used in Logistic Regression. Here σ(a) denotes the sigmoid function 1/(1 + exp(−a)).
• Hinge Loss, L(a, y) = max(0, 1 − ay), used in Support Vector Machines.

[WV05] obtained a linear lower bound for the square loss:

Theorem 2 If the problem matrix M is the n dimensional Hadamard matrix, then for any algorithm that predicts with linear combinations of expanded training instances, the average square loss after observing k instances is at least 1 − k/n.

The key observation used in the proof of this theorem is that the prediction matrix P = Z Ẑᵀ A has rank at most k, because Ẑ has only k rows. Using an elementary property of the singular value decomposition, the total squared loss ||P − M||_F² can be bounded from below by the sum of the squares of the last n − k singular values of the problem matrix M. The bound now follows from the fact that Hadamard matrices have a flat spectrum. 
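The spectral argument behind Theorem 2 can be checked numerically. In this sketch (ours; the Sylvester construction is one standard way to build a Hadamard matrix) we compute the best rank-k approximation of the Hadamard matrix, i.e. the most any rank-k prediction matrix could achieve under the square loss, and recover exactly the 1 − k/n bound:

```python
import numpy as np

def hadamard(n):
    # Sylvester construction; n must be a power of two
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H

n, k = 32, 8
M = hadamard(n)

# Any prediction matrix of the considered algorithms has rank <= k, so its
# total squared loss is at least that of the best rank-k approximation of M.
U, s, Vt = np.linalg.svd(M)
P = (U[:, :k] * s[:k]) @ Vt[:k]              # best rank-k approximation
avg_square_loss = ((P - M) ** 2).sum() / n**2

assert np.allclose(s, np.sqrt(n))            # flat spectrum: all s_i = sqrt(n)
assert np.isclose(avg_square_loss, 1 - k / n)
```

Because every singular value equals √n, discarding n − k of them leaves total squared loss (n − k)·n, i.e. average loss exactly 1 − k/n.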
Random matrices have a "flat enough" spectrum and the same technique gives an expected linear lower bound for random problem matrices. Unfortunately the singular value argument only applies to the square loss. For example, for the logistic loss the problem is much different. In that case it would be natural to define the n × n prediction matrix as σ(Z W) = σ(Z Ẑᵀ A). However, the rank of σ(Z W) jumps to n even for small values of k. Instead we keep the prediction matrix P as the n² linear activations Z Ẑᵀ A produced by the algorithm, and define the loss between linear activations and labels. This matrix still has rank at most k. In the next section, we will use this fact in a counting argument involving the possible sign patterns produced by low rank matrices.
If the algorithms are allowed to start with a non-zero initial weight vector, then the hardness results essentially hold for the class of algorithms that predict with linear combinations of this weight vector and the k expanded training instances. The only difference is that the rank of the prediction matrix is now at most k + 1 instead of k, and therefore the lower bound of the above theorem becomes 1 − (k+1)/n instead of 1 − k/n. Our main result also relies on the rank of the prediction matrix and therefore it allows for a similar adjustment of the bound when an initial weight vector is used.

3 Main Result

In this section we present a new technique for proving lower bounds on the average loss for the sparse learning problem discussed in this paper. The lower bound applies to any regular loss and is based on counting the number of sign-patterns that can be generated by a low-rank matrix. Bounds on the number of such sign patterns were first introduced in [AFR85]. As a corollary of our method, we also obtain a lower bound for the "rigidity" of random matrices.
Theorem 3 Let L be a C-regular loss function. 
A random n×n problem matrix M almost certainly has the property that for any algorithm that predicts with linear combinations of expanded training instances, the average loss L after observing k instances is at least 4C (1/20 − k/n).

Proof C-regular losses are at least C if the sign of the linear activation for an example does not match the label. So, we can focus on counting the number of linear activations that have wrong signs. Let P be the n×n prediction matrix after receiving k instances. Furthermore let sign(P) ∈ {−1, 1}^{n×n} denote the sign-pattern of P. For the sake of simplicity, we define sign(0) as 1. This simplification underestimates the number of disagreements. However we still have the property that for any C-regular loss: L(a, y) ≥ C |sign(a) − y| / 2.
We now count the number of entries on which sign(P) disagrees with M. We use the fact that P has rank at most k. The number f(n, m, k) of sign patterns of n × m matrices of rank at most k is bounded as follows (this was essentially shown6 in [AFR85]; the exact bound we use below is a refinement given in [Sre04]):

    f(n, m, k) ≤ ( 8e · 2 · nm / (k(n + m)) )^{k(n+m)}.

Setting n = m = a · k, we get

    f(n, n, n/a) ≤ 2^{(6 + 2 log₂(e·a)) · n²/a}.

Now, suppose that we additionally allow up to r = αn² signs of sign(P) to be flipped. In other words, we consider the set S_n^k(r) of sign-patterns having Hamming distance at most r from some sign-pattern produced by a matrix of rank at most k. For a fixed sign-pattern, the number g(n, α) of matrices obtained by flipping at most r entries is the number of subsets of size r or less that can be flipped:

    g(n, α) = Σ_{i=0}^{αn²} C(n², i) ≤ 2^{H(α) n²}.

Here, H denotes the binary entropy. 
The above bound holds for any α ≤ 1/2. Combining the two bounds described above, we can finally estimate the size of S_n^k(r):

    |S_n^k(r)| ≤ f(n, n, n/a) · g(n, α) ≤ 2^{(6 + 2 log₂(e·a)) · n²/a} · 2^{H(α) n²} = 2^{( (6 + 2 log₂(e·a))/a + H(α) ) n²}.

Notice that if the problem matrix M does not belong to S_n^k(r), then our prediction matrix P will make more than r sign errors. We assumed that M is selected randomly from the set {−1, 1}^{n×n}, which contains 2^{n²} elements. From simple asymptotic analysis, we can conclude that for large enough n, the set S_n^k(r) will be much smaller than {−1, 1}^{n×n} if the following condition holds:

    (6 + 2 log₂(e · a))/a + H(α) ≤ 1 − δ < 1.     (1)

In that case, the probability of a random problem matrix belonging to S_n^k(r) is at most

    2^{(1−δ)n²} / 2^{n²} = 2^{−δn²} → 0.

We can numerically solve Inequality (1) for α by comparing the left-hand side expression to 1. Figure 4 shows the plot of α against the value of k/n = a⁻¹. From this, we can obtain the simple linear bound of 4(1/20 − k/n) = 1/5 − 4k/n, because it satisfies the strict inequality for δ = 0.005. It is easy to estimate that this bound will hold for n = 40 with probability approximately 0.996, and for larger n that probability converges to 1 even faster than exponentially. 

Figure 4: Lower bound for the average error. The solid line is obtained by solving Inequality (1). The dashed line is the simple linear bound.

Figure 5: We plot the distance of the unit vector to a subspace formed by k randomly chosen instances.

6 Note that they count {−1, 0, 1} sign patterns. However, by mapping 0's to 1's we do not increase the number of sign patterns.
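Both the entropy bound on g(n, α) and the numerical solution of Inequality (1) are straightforward to reproduce. The following script (our own sketch; `max_alpha` and its grid search are our names and choices) checks the binomial-sum bound for a small case and confirms that the linear choice α = 4(1/20 − k/n) satisfies condition (1) with δ = 0.005 at a sample point:

```python
import math

def H(p):
    # binary entropy in bits
    if p in (0.0, 1.0):
        return 0.0
    return -p * math.log2(p) - (1 - p) * math.log2(1 - p)

# Check the counting bound: sum_{i <= alpha*N} C(N, i) <= 2^(H(alpha)*N)
N, alpha = 40, 0.2
lhs = sum(math.comb(N, i) for i in range(int(alpha * N) + 1))
assert lhs <= 2 ** (H(alpha) * N)

def max_alpha(a, delta=0.005, grid=100000):
    # largest alpha <= 1/2 with (6 + 2*log2(e*a))/a + H(alpha) <= 1 - delta
    slack = 1 - delta - (6 + 2 * math.log2(math.e * a)) / a
    best = 0.0
    for i in range(1, grid // 2):
        alpha = i / grid
        if H(alpha) <= slack:      # H is increasing on [0, 1/2]
            best = alpha
    return best

# At k/n = 1/100 the linear bound alpha = 4*(1/20 - k/n) = 0.16 is admissible:
a = 100
alpha_lin = 4 * (1 / 20 - 1 / a)
assert alpha_lin <= max_alpha(a)
```

The solid curve of Figure 4 corresponds to `max_alpha` evaluated over a range of a = n/k, and the dashed line to α = 1/5 − 4k/n.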
It remains to observe that each sign error incurs a loss of at least C, which gives us the desired bound for the average loss of the algorithm. ∎

The technique used in our proof also gives an interesting insight into the rigidity of random matrices. Typically, the rigidity R_M(r) of a matrix M is defined as the minimum number of entries that need to be changed to reduce the rank of M to r. In [FS06], a different rigidity measure, R̃_M(r), is considered, which only counts the sign-non-preserving changes. The bounds shown there depend on the SVD spectrum of a matrix. However, if we consider a random matrix, then a much stronger lower bound can be obtained with high probability:
Corollary 4 For a random matrix M ∈ {−1, 1}^{n×n} and 0 < r < n, almost certainly the minimum number of sign-non-preserving changes to a matrix in R^{n×n} that is needed to reduce the rank of the matrix to r is at least

    R̃_M(r) ≥ n²/5 − 4rn.

Note that the rigidity bound given in [FS06] also applies to our problem if we use the Hadamard matrix as the problem matrix. In this case, the lower bound is much weaker and no longer linear. Notably, it implies that at least √n instances are needed to get the average loss down to zero (and this is conjectured to be tight for Hadamard matrices). In contrast, our lower bound for random matrices assures that Ω(n) instances are required to get the average loss down to zero.

4 Random features

In this section, we argue that the family of algorithms whose weight vector is a linear combination of the instances gets confused by random features. Assume we have n instances that are labeled by a single ±1 feature. We represent this feature as a single column. Now, we add additional random features. For the sake of concreteness, we add n − 1 of them. 
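For this setup, the distance of the target e₁ to the span of k instances (the quantity plotted in Figure 5) is easy to measure. A quick reimplementation of the experiment (ours; sizes and trial counts are arbitrary choices) shows the average distance tracking √(1 − k/n):

```python
import numpy as np

rng = np.random.default_rng(4)
n, trials = 64, 30
e1 = np.zeros(n)
e1[0] = 1.0

def dist_to_span(k):
    # average distance of e1 to the span of k random +-1 instances
    d = []
    for _ in range(trials):
        X = rng.choice([-1.0, 1.0], size=(k, n))
        # orthogonal projection of e1 onto the row span via least squares
        c, *_ = np.linalg.lstsq(X.T, e1, rcond=None)
        d.append(np.linalg.norm(e1 - X.T @ c))
    return float(np.mean(d))

# matches sqrt(1 - k/n) closely, e.g. at k = n/2:
assert abs(dist_to_span(n // 2) - np.sqrt(0.5)) < 0.05
```

By exchangeability of the coordinates, the expected squared distance is exactly 1 − k/n, which is what the least-squares measurement recovers.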
So our learning problem is again described by an n dimensional square matrix: the n rows are the instances and the target is the unit vector e₁. In Figure 5, we plot the average distance of the vector e₁ to the subspace formed by a subset of k instances. This is the closest a linear combination of the k instances can get to the target. We show experimentally that this distance is √(1 − k/n) on average. This means that the target e₁ cannot be expressed by linear combinations of instances until essentially all instances are seen (i.e. k is close to n).

It is also very important to understand that expanding the instances using a feature map can be costly, because a few random features may be expanded into many "weakly random" features that are still random enough to confuse the family of algorithms that predict with linear combinations of instances. For example, using a polynomial kernel of degree d, n random features may be expanded to n^d features and now the sample complexity grows with n^d instead of n.

5 Open problems regarding neural networks

We believe that our hardness results for picking single features out of random vectors carry over to feed forward neural nets provided that they are trained with Gradient Descent (Backpropagation) regularized with the squared Euclidean distance (Weight Decay). 
More precisely, we conjecture that if we restrict ourselves to Gradient Descent with squared Euclidean distance regularization, then additional layers cannot improve the average loss on the problem described in Figure 1 and the bounds from Theorem 3 still hold.
On the other hand, if 1-norm regularization is used, then Gradient Descent behaves more like the Exponentiated Gradient algorithm and the hardness result can be avoided.
One can view the feature vectors arriving at the output node as an expansion of the input instances. Our lower bounds already hold for fixed expansions (i.e. the same expansion must be used for all targets). In the neural net setting the expansion arriving at the output node is adjusted during training, and our techniques for proving hardness results fail in this case. However, we conjecture that the features learned from the k training examples cannot help to improve the average performance, provided the training algorithm is based on the Gradient Descent or Weight Decay heuristic.
Note that our conjecture is not fully specified: what initialization is used, which transfer functions, are there bias terms, etc. We believe that the conjecture is robust to many of those details. We have tested our conjecture on neural nets with various numbers of layers and standard transfer functions (including the rectifier function). Also, in our experiments, the dropout heuristic [HSK+12] did not improve the average loss. However, at this point we have only experimental evidence, which will always be insufficient to prove such a conjecture.
It is also an interesting question to study whether random features can confuse a feed forward neural net that is trained with Gradient Descent. Additional layers may hurt such training algorithms when some random features are in the input. We conjecture that any such algorithm requires at least a constant number of
We conjecture that any such algorithm requires at least O(1)\nadditional examples per random redundant feature to achieve the same average accuracy.\n\nReferences\n[AFR85] N. Alon, P. Frankl, and V. R\u00a8odel. Geometrical realization of set systems and probabilis-\ntic commnunication complexity. In Proceedings of the 26th Annual Symposium on the\nFoundations of Computer Science (FOCS), pages 277\u2013280, Portland, OR, USA, 1985.\nIEEE Computer Society.\n\n[BBS08] Maria-Florina Balcan, Avrim Blum, and Nathan Srebro.\n\nImproved Guarantees for\nIn Rocco A. Servedio and Tong Zhang, editors,\n\nLearning via Similarity Functions.\nCOLT, pages 287\u2013298. Omnipress, 2008.\n\n[BDES02] S. Ben-David, N. Eiron, and H. U. Simon. Limitations of learning via embeddings in\nEuclidean half-spaces. Journal of Machine Learning Research, 3:441\u2013461, November\n2002.\n\n[FS02] J. Forster and H. U. Simon. On the smallest possible dimension and the largest possible\nmargin of linear arrangements representing given concept classes. In Proceedings of the\n13th International Conference on Algorithmic Learning Theory, number 2533 in Lec-\nture Notes in Computer Science, pages 128\u2013138, London, UK, 2002. Springer-Verlag.\n[FS06] J. Forster and H. U. Simon. On the smallest possible dimension and the largest possible\nmargin of linear arrangements representing given concept classes. Theor. Comput. Sci.,\npages 40\u201348, 2006.\n\n[HKW99] D. P. Helmbold, J. Kivinen, and M. K. Warmuth. Relative loss bounds for single neu-\n\nrons. IEEE Transactions on Neural Networks, 10(6):1291\u20131304, November 1999.\n\n8\n\n\f[HSK+12] Geoffrey E. Hinton, Nitish Srivastava, Alex Krizhevsky, Ilya Sutskever, and Ruslan R.\nSalakhutdinov. Improving neural networks by preventing co-adaptation of feature de-\ntectors. CoRR, abs/1207.0580, 2012.\n\n[KW71] G. S. Kimeldorf and G. Wahba. Some results on Tchebychef\ufb01an Spline Functions.\n\nJ. Math. Anal. Applic., 33:82\u201395, 1971.\n\n[KW97a] J. 
Kivinen and M. K. Warmuth. Additive versus exponentiated gradient updates for linear prediction. Information and Computation, 132(1):1–64, January 1997.

[KW97b] J. Kivinen and M. K. Warmuth. Exponentiated gradient versus gradient descent for linear predictors. Information and Computation, 132(1):1–64, January 1997.

[KWA97] J. Kivinen, M. K. Warmuth, and P. Auer. The perceptron algorithm vs. winnow: linear vs. logarithmic mistake bounds when few input variables are relevant. Artificial Intelligence, 97:325–343, December 1997.

[Ng04] A. Y. Ng. Feature selection, L1 vs. L2 regularization, and rotational invariance. In Proceedings of the Twenty-first International Conference on Machine Learning, pages 615–622, Banff, Alberta, Canada, 2004. ACM Press.

[SHS01] B. Schölkopf, R. Herbrich, and A. J. Smola. A generalized representer theorem. In D. P. Helmbold and B. Williamson, editors, Proceedings of the 14th Annual Conference on Computational Learning Theory, number 2111 in Lecture Notes in Computer Science, pages 416–426, London, UK, 2001. Springer-Verlag.

[Sre04] N. Srebro. Learning with Matrix Factorizations. PhD thesis, Massachusetts Institute of Technology, 2004.

[WKZ14] M. K. Warmuth, W. Kotłowski, and S. Zhou. Kernelization of matrix updates. Theoretical Computer Science, 2014. Special issue for the 23rd International Conference on Algorithmic Learning Theory (ALT 12), to appear.

[WV05] M. K. Warmuth and S. V. N. Vishwanathan. Leaving the span. In Proceedings of the 18th Annual Conference on Learning Theory (COLT '05), Bertinoro, Italy, June 2005. Springer-Verlag.
", "award": [], "sourceid": 1461, "authors": [{"given_name": "Michal", "family_name": "Derezinski", "institution": "University of California, Santa Cruz"}, {"given_name": "Manfred K.", "family_name": "Warmuth", "institution": "Univ. of Calif. at Santa Cruz"}]}