Part of Advances in Neural Information Processing Systems 13 (NIPS 2000)

*Igor Cadez, Padhraic Smyth*

We investigate a general characteristic of the trade-off in learning problems between goodness-of-fit and model complexity. Specifi(cid:173) cally we characterize a general class of learning problems where the goodness-of-fit function can be shown to be convex within first(cid:173) order as a function of model complexity. This general property of "diminishing returns" is illustrated on a number of real data sets and learning problems, including finite mixture modeling and multivariate linear regression.

Introduction, Motivation, and Related Work

1 Assume we have a data set D = {Xl, X2, ... , x n }, where the X i could be vectors, sequences, etc. We consider modeling the data set D using models indexed by a complexity index k, 1 :::; k :::; kmax • For example, the models could be finite mixture probability density functions (PDFs) for vector Xi'S where model complexity is indexed by the number of components k in the mixture. Alternatively, the modeling task could be to fit a conditional regression model y = g(Zk) + e, where now y is one of the variables in the vector X and Z is some subset of size k of the remaining components in the X vector.

Such learning tasks can typically be characterized by the existence of a model and a loss function. A fitted model of complexity k is a function of the data points D and depends on a specific set of fitted parameters B. The loss function (goodness(cid:173) of-fit) is a functional of the model and maps each specific model to a scalar used to evaluate the model, e.g., likelihood for density estimation or sum-of-squares for regression.

Figure 1 illustrates a typical empirical curve for loss function versus complexity, for mixtures of Markov models fitted to a large data set of 900,000 sequences. The complexity k is the number of Markov models being used in the mixture (see Cadez et al. (2000) for further details on the model and the data set). The empirical curve has a distinctly concave appearance, with large relative gains in fit for low complexity models and much more modest relative gains for high complexity models. A natural question is whether this concavity characteristic can be viewed as a general phenomenon in learning and under what assumptions on model classes and

Nwnber of M Ixture Cmnponen1S 11]

Figure 1: Log-likelihood scores for a Markov mixtures data set.

loss functions the concavity can be shown to hold. The goal of this paper is to illustrate that in fact it is a natural characteristic for a broad range of problems in mixture modeling and linear regression.

We note of course that for generalization that using goodness-of-fit alone will lead to the selection of the most complex model under consideration and will not in general select the model which generalizes best to new data. Nonetheless our pri(cid:173) mary focus of interest in this paper is how goodness-of-fit loss functions (such as likelihood and squared error, defined on the training data D) behave in general as a function of model complexity k. Our concavity results have a number of interesting implications. For example, for model selection methods which add a penalty term to the goodness-of-fit (e.g., BIC), the resulting score function as a function of model complexity will be unimodal as a function of complexity k within first order.

Li and Barron (1999) have shown that for finite mixture models the expected value of the log-likelihood for any k is bounded below by a function of the form -C /k where C is a constant which is independent of k. The results presented here are complementary in the sense that we show that the actual maximizing log-likelihood itself is concave to first-order as a function of k. Furthermore, we obtain a more general principle of "diminishing returns," including both finite mixtures and subset selection in regression.

Do not remove: This comment is monitored to verify that the site is working properly