Saharon Rosset

#### Abstract

Regularization plays a central role in the analysis of modern data, where non-regularized fitting is likely to lead to over-fitted models, useless for both prediction and interpretation. We consider the design of incremen- tal algorithms which follow paths of regularized solutions, as the regu- larization varies. These approaches often result in methods which are both efficient and highly flexible. We suggest a general path-following algorithm based on second-order approximations, prove that under mild conditions it remains "very close" to the path of optimal solutions and illustrate it with examples.

1 Introduction

Given a data sample (xi, yi)n i=1 (with xi Rp and yi R for regression, yi {1} for classification), the generic regularized optimization problem calls for fitting models to the data while controlling complexity by solving a penalized fitting problem:

                        ^ (1)                         () = arg min               C(yi,  xi) + J()                                                    i


where C is a convex loss function and J is a convex model complexity penalty (typically taken to be the lq norm of , with q 1).1

Many commonly used supervised learning methods can be cast in this form, including regularized 1-norm and 2-norm support vector machines [13, 4], regularized linear and logistic regression (i.e. Ridge regression, lasso and their logistic equivalents) and more. In [8] we show that boosting can also be described as approximate regularized optimization, with an l1-norm penalty.

Detailed discussion of the considerations in selecting penalty and loss functions for regu- larized fitting is outside the scope of this paper. In general, there are two main areas we need to consider in this selection:

1. Statistical considerations: robustness (which affects selection of loss), sparsity (l1-norm penalty encourages sparse solutions) and identifiability are among the questions we should

1We assume a linear model in (1), but this is much less limiting than it seems, as the model can be linear in basis expansions of the original predictors, and so our approach covers Kernel methods, wavelets, boosting and more

keep in mind when selecting our formulation. 2. Computational considerations: we should be able to solve the problems we pose with the computational resources at our disposal. Kernel methods and boosting are examples of computational tricks that allow us to solve very high dimensional problems exactly or approximately with a relatively small cost. In this paper we suggest a new computational approach.

Once we have settled on a loss and penalty, we are still faced with the problem of select- ing a "good" regularization parameter , in terms of prediction performance. A common approach is to solve (1) for several values of , then use holdout data (or theoretical ap- proaches, like AIC or SRM) to select a good value. However, if we view the regularized optimization problem as a family of problems, parameterized by the regularization parame- ter , it allows us to define the "path" of optimal solutions { ^ () : 0 }, which is a 1-dimensional curve through Rp. Path following methods attempt to utilize the mathemat- ical properties of this curve to devise efficient procedures for "following" it and generating the full set of regularized solutions with a (relatively) small computational cost.

As it turns out, there is a family of well known and interesting regularized problems for which efficient exact path following algorithms can be devised. These include the lasso [3], 1- and 2-norm support vector machines [13, 4] and many others [9]. The main property of these problems which makes them amenable to such methods is the piecewise linearity of the regularized solution path in Rp. See [9] for detailed exposition of these properties and the resulting algorithms.

However, the path following idea can stretch beyond these exact piecewise linear algo- rithms. The "first order" approach is to use gradient-based approaches. In [8] we have described boosting as an approximate gradient-based algorithm for following l1-norm reg- ularized solution paths. [6] suggest a gradient descent algorithm for finding an optimal so- lution for a fixed value of and are seemingly unaware that the path they are going through is of independent interest as it consists of approximate (alas very approximate) solutions to l1-regularized problems. Gradient-based methods, however, can only follow regularized paths under strict and non-testable conditions, and theoretical "closeness" results to the optimal path are extremely difficult to prove for them (see [8] for details).

In this paper, we suggest a general second-order algorithm for following "curved" regu- larized solution paths (i.e. ones that cannot be followed exactly by piecewise-linear al- gorithms). It consists of iteratively changing the regularization parameter, while making a single Newton step at every iteration towards the optimal penalized solution, for the current value of . We prove that if both the loss and penalty are "nice" (in terms of bounds on their relevant derivatives in the relevant region), then the algorithm is guaranteed to stay "very close" to the true optimal path, where "very close" is defined as:

     If the change in the regularization parameter at every iteration is , then          the solution path we generate is guaranteed to be within O( 2) from the          true path of penalized optimal solutions


In section 2 we present the algorithm, and we then illustrate it on l1- and l2-regularized logistic regression in section 3. Section 4 is devoted to a formal statement and proof outline of our main result. We discuss possible extensions and future work in section 5.

2 Path following algorithm

We assume throughout that the loss function C is twice differentiable. Assume for now also that the penalty J is twice differentiable (this assumption does not apply to the l1- norm penalty which is of great interest and we address this point later). The key to our

method are the normal equations for (1):

(2) C( ^ ()) + J( ^ ()) = 0

Our algorithm iteratively constructs an approximate solution ( ) t by taking "small" Newton-Raphson steps trying to maintain (2) as the regularization changes. Our main result in this paper is to show, both empirically and theoretically, that for small , the dif-

ference ( ) t - ^ (0 + t) is small, and thus that our method successfully tracks the path of optimal solutions to (1).

Algorithm 1 gives a formal description of our quadratic tracking method. We start from a solution to (1) for some fixed 0 (e.g. ^ (0), the non-regularized solution). At each iteration we increase by and take a single Newton-Raphson step towards the solution to (2) with the new value in step 2(b).

Algorithm 1 Approximate incremental quadratic algorithm for regularized optimization

   1. Set ( )                  0      = ^                               (0), set t = 0.

2. While (t < max)

(a) t+1 = t +

(b) ( )                       t+1 =                                                                               -1                       ( )                                2                        t      -    2C(( )                                          t    ) + t+1         J(( )                                                                   t      )          C(( )                                                                                            t         ) + t+1 J(( )                                                                                                                  t      )

(c) t = t + 1


2.1 The l1-norm penalty

The l1-norm penalty, J() = 1, is of special interest because of its favorable statistical properties (e.g. [2]) and its widespread use in popular methods, such as the lasso [10] and 1-norm SVM [13]. However it is not differentiable and so our algorithm does not apply to l1-penalized problems directly.

To understand how we can generalize Algorithm 1 to this situation, we need to consider the Karush-Kuhn-Tucker (KKT) conditions for optimality of the optimization problem implied by (1). It is easy to verify that the normal equations (2) can be replaced by the following KKT-based condition for l1-norm penalty:

(3) | C( ^ ())j| < ^ ()j = 0 ^ (4) ()j = 0 | C( ^ ())j| =

these conditions hold for any differentiable loss and tell us that at each point on the path we have a set A of non-0 coefficients which corresponds to the variables whose current "gen- eralized correlation" | C( ^ ())j| is maximal and equal to . All variables with smaller generalized correlation have 0 coefficient at the optimal penalized solution for this . Note that the l1-norm penalty is twice differentiable everywhere except at 0. So if we carefully manage the set of non-0 coefficients according to these KKT conditions, we can still apply our algorithm in the lower-dimensional subspace spanned by non-0 coefficients only.

Thus we get Algorithm 2, which employs the Newton approach of Algorithm 1 for twice differentiable penalty, limited to the sub-space of "active" coefficients denoted by A. It adds to Algorithm 1 updates for the "add variable to active set" and "remove variable from

active set" events, when a variable becomes "highly correlated" as defined in (4) and when a coefficient hits 0 , respectively. 2

Algorithm 2 Approximate incremental quadratic algorithm for regularized optimization with lasso penalty

     1. Set ( )                      0    = ^                             (0), set t = 0, set A = {j : ^                                                                      (0)j = 0}.

2. While (t < max)

(a) t+1 = t +               (b)

-1                           ( )                            t+1 = ( )                                    t     -      2C(( )                                                          t    )A              C(( )                                                                                   t      )A + t+1sgn(( )                                                                                                        t      )A

(c) A = A  {j /                                     A :        C(( )                                                     t+1)j > t+1}

(d) A = A - {j  A : |( )                 | < }                                                 t+1,j               (e) t = t + 1


2.2 Computational considerations

For a fixed 0 and max, Algorithms 1 and 2 take O(1/ ) steps. At each iteration they need to calculate the Hessians of both the loss and the penalty at a typical computational cost of O(n p2); invert the resulting p p matrix at a cost of O(p3); and perform the gradient calculation and multiplication, which are o(n p2) and so do not affect the complexity calculation. Since we implicitly assume throughout that n p, we get overall complexity of O(n p2/ ). The choice of represents a tradeoff between computational complexity and accuracy (in section 4 we present theoretical results on the relationship between and the accuracy of the path approximation we get). In practice, our algorithm is practical for problems with up to several hundred predictors and several thousand observations. See the example in section 3.

It is interesting to compare this calculation to the obvious alternative, which is to solve O(1/ ) regularized problems (1) separately, using a Newton-Raphson approach, resulting in the same complexity (assuming the number of Newton-Raphson iterations for finding each solution is bounded). There are several reasons why our approach is preferable:

       The number of iterations until convergence of Newton-Raphson may be large even              if it does converge. Our algorithm guarantees we stay very close to the optimal              solution path with a single Newton step at each new value of .

Empirically we observe that in some cases our algorithm is able to follow the path              while direct solution for some values of  fails to converge. We assume this is              related to various numeric properties of the specific problems being solved.

For the interesting case of l1-norm penalty and a "curved" loss function (like logis-              tic log-likelihood), there is no direct Newton-Raphson algorithm. Re-formulating              the problem into differentiable form requires doubling the dimensionality. Using              our Algorithm 2, we can still utilize the same Newton method, with significant              computational savings when many coefficients are 0 and we work in a lower-              dimensional subspace.

2When a coefficient hits 0 it not only hits a non-differentiability point in the penalty, it also ceases to be maximally correlated as defined in (4). A detailed proof of this fact and the rest of the "accounting" approach can be found in [9]


On the flip side, our results in section 4 below indicate that to guarantee successful tracking we require to be small, meaning the number of steps we do in the algorithm may be significantly larger than the number of distinct problems we would typically solve to select using a non-path approach.

2.3 Connection to path following methods from numerical analysis

There is extensive literature on path-following methods for solution paths of general para- metric problems. A good survey is given in [1]. In this context, our method can be described as a "predictor-corrector" method with a redundant first order predictor step. That is, the corrector step starts from the previous approximate solution. These methods are recognized as attractive options when the functions defining the path (in our case, the combination of loss and penalty) are "smooth" and "far from linear". These conditions for efficacy of our approach are reflected in the regularity conditions for the closeness result in Section 4.

3 Example: l2- and l1-penalized logistic regression

Regularized logistic regression has been successfully used as a classification and proba- bility estimation approach [11, 12]. We first illustrate applying our quadratic method to this regularized problem using a small subset of the "spam" data-set, available from the UCI repository (http://www.ics.uci.edu/~mlearn/MLRepository.html) which allows us to present some detailed diagnostics. Next, we apply it to the full "spam" data-set, to demonstrate its time complexity on bigger problems.

We first choose five variables and 300 observations and track the solution paths to two regularized logistic regression problems with the l2-norm and the l1-norm penalties:

                     ^ (5)                      () = arg min log(1 + exp{-yi xi}) +   22                                                                    ^ (6)                      () = arg min log(1 + exp{-yi xi}) +   1


Figure 1 shows the solution paths ( )(t) generated by running Algorithms 1 and 2 on this data using = 0.02 and starting at = 0, i.e. from the non-regularized logistic regression solution. The interesting graphs for our purpose are the ones on the right. They represent the "optimality gap":

                                          C(( )                                   e               t     )                                        t =                   +  t                                               J(( )                                                   t     )


where the division is done componentwise (and so the five curves in each plot correspond to the five variables we are using). Note that the optimal solution ^ (t ) is uniquely defined by the fact that (2) holds and therefore the "optimality gap" is equal to zero componentwise at ^ (t ). By convexity and regularity of the loss and the penalty, there is a correspondence between small values of e and small distance ( )(t) - ^ (t ) . In our example we observe that the components of e seem to be bounded in a small region around 0 for both paths (note the small scale of the y axis in both plots -- the maximal error is under 10-3). We conclude that on this simple example our method tracks the optimal solution paths well, both for the l1- and l2-regularized problems. The plots on the left show the actual coefficient paths -- the curve in R5 is shown as five coefficient traces in R, each corresponding to one variable, with the non-regularized solution (identical for both problems) on the extreme left.

Next, we run our algorithm on the full "spam" data-set, containing p = 57 predic- tors and n = 4601 observations. For both the l1- and l2-penalized paths we used

                                                                                                                                            -4                                                                                                                                            x 10                                      2.5

2                                                                                                                                       4                                      1.5

1                                                  J +                  (/)                                                                                                                2                                                                 0.5                                                           C /                                                                                                                                        0                                       0

-0.5                                                                                             -2                                             0    10    20      30           40                                                             0          10    20    30    40

-4                                                                                                                                            x 10                                      2.5

2                                                                                                                                       4                                      1.5

1                                                                           J +                  (/)                                                                                                                2                                                                      0.5                                                                                    C /                                                                                                                                        0                                       0

-0.5                                                                                             -2                                             0    10    20      30           40                                                             0          10    20    30    40


Figure 1: Solution paths (left) and optimality criterion (right) for l1 penalized logistic re- gression (top) and l2 penalized logistic regression (bottom). These result from running Algorithms 2 and 1, respectively, using = 0.02 and starting from the non-regularized logistic regression solution (i.e. = 0)

0 = 0, max = 50, = 0.02, and the whole path was generated in under 5 minutes using a Matlab implementation on an IBM T-30 Laptop. Like in the small scale example, the "optimality criterion" was uniformly small throughout the two paths, with none of its 57 components exceeding 10-3 at any point.

4 Theoretical closeness result

In this section we prove that our algorithm can track the path of true solutions to (1). We show that under regularity conditions on the loss and penalty (which hold for all the candidates we have examined), if we run Algorithm 1 with a specific step size , then we remain within O( 2) of the true path of optimal regularized solutions.

Theorem 1 Assume 0 > 0, then for small enough and under regularity conditions on the derivatives of C and J ,

               0 < c < max - 0 ,                                     ( )(c/ ) - ^                                                                                                                                                 (0 + c) = O( 2)


So there is a uniform bound O( 2) on the error which does not depend on c.

Proof We give the details of the proof in Appendix A of [7]. Here we give a brief review of the main steps.

Similar to section 3 we define the "optimality gap":

                                                         C(( ) (7)                                                    (        t      ) )j + t = etj                                                              J(( )                                                                 t      ) Also define a "regularity constant" M , which depends on 0 and the first, second and third derivatives of the loss and penalty.


The proof is presented as a succession of lemmas:

                                                        Lemma 2 Let u1 = M  p  2, ut = M (ut-1 +                      p  )2, then: et 2  ut


This lemma gives a recursive expression bounding the error in the optimality gap (7) as the algorithm proceeds. The proof is based on separate Taylor expansions of the numerator and denominator of the ratio C J in the optimality gap and some tedious algebra. 1-4 p M Lemma 3 If p M 1/4 then u 1 t - p - = O( 2) , t 2M 2M

This lemma shows that the recursive bound translates to a uniform O( 2) bound, if is small enough. The proof consists of analytically finding the fixed point of the increasing series ut.

Lemma 4 Under regularity conditions on the penalty and loss functions in the neighbor- hood of the solutions to (1), the O( 2) uniform bound of lemma 3 translates to an O( 2) uniform bound on ( )(c/ ) - ^ (0 + c)

Finally, this lemma translates the optimality gap bound to an actual closeness result. This is proven via a Lipschitz argument.

4.1 Required regularity conditions

Regularity in the loss and the penalty is required in the definition of the regularity constant M and in the translation of the O( 2) bound on the "optimality gap" into one on the distance from the path in lemma 4. The exact derivation of the regularity conditions is highly tech- nical and lengthy. They require us to bound the norm of third derivative "hyper-matrices" for the loss and the penalty as well as the norms of various functions of the gradients and Hessians of both (the boundedness is required only in the neighborhood of the optimal path where our approximate path can venture, obviously). We also need to have 0 > 0 and max < . Refer to Appendix A of [7] for details. Assuming that 0 > 0 and max < these conditions hold for every interesting example we have encountered, including:

     Ridge regression and the lasso (that is, l2- and l1- regularized squared error loss).          l1- and l2-penalized logistic regression. Also Poisson regression and other expo-           nential family models.          l1- and l2-penalized exponential loss.


Note that in our practical examples above we have started from 0 = 0 and our method still worked well. We observe in figure 1 that the tracking algorithm indeed suffers the biggest inaccuracy for the small values of , but manages to "self correct" as increases.