{"title": "Model Selection in Clustering by Uniform Convergence Bounds", "book": "Advances in Neural Information Processing Systems", "page_first": 216, "page_last": 222, "abstract": null, "full_text": "Model selection in clustering by uniform \n\nconvergence bounds* \n\nJoachim M. Buhmann and Marcus Held \n\n{jb,held}@cs.uni-bonn.de \n\nInstitut flir Informatik III, \n\nRomerstraBe 164, D-53117 Bonn, Germany \n\nAbstract \n\nUnsupervised learning algorithms are designed to extract struc(cid:173)\nture from data samples. Reliable and robust inference requires a \nguarantee that extracted structures are typical for the data source, \nLe., similar structures have to be inferred from a second sample \nset of the same data source. The overfitting phenomenon in max(cid:173)\nimum entropy based annealing algorithms is exemplarily studied \nfor a class of histogram clustering models. Bernstein's inequality \nfor large deviations is used to determine the maximally achievable \napproximation quality parameterized by a minimal temperature. \nMonte Carlo simulations support the proposed model selection cri(cid:173)\nterion by finite temperature annealing. \n\n1 \n\nIntroduction \n\nLearning algorithms are designed to extract structure from data. Two classes of \nalgorithms have been widely discussed in the literature - supervised and unsuper(cid:173)\nvised learning. The distinction between the two classes depends on supervision or \nteacher information which is either available to the learning algorithm or missing. \nThis paper applies statistical learning theory to the problem of unsupervised learn(cid:173)\ning. In particular, error bounds as a protection against overfitting are derived for \nthe recently developed Asymmetric Clustering Model (ACM) for co-occurrence \ndata [6]. These theoretical results show that the continuation method \"determin(cid:173)\nistic annealing\" yields robustness of the learning results in the sense of statistical \nlearning theory. 
The computational temperature of annealing algorithms plays the role of a control parameter which regulates the complexity of the learning machine.
Let us assume that a hypothesis class ℋ of loss functions h(x; α) is given. These loss functions measure the quality of structures in data. The complexity of ℋ is controlled by coarsening, i.e., we define a γ-cover of ℋ. Informally, the inference principle advocated by us performs learning in two steps: (i) determine the optimal approximation level γ for consistent learning (in terms of large risk deviations); (ii) given the optimal approximation level γ, average over all hypotheses in an appropriate neighborhood of the empirical minimizer. The result of the inference procedure is not a single hypothesis but a set of hypotheses. This set is represented either by an average of loss functions or, alternatively, by a typical member of this set. This induction approach is named Empirical Risk Approximation (ERA) [2]. The reader should note that the learning algorithm has to return an average structure which is typical in a γ-cover sense; it is not supposed to return the hypothesis with minimal empirical risk, as in Vapnik's "Empirical Risk Minimization" (ERM) induction principle for classification and regression [9]. The loss function with minimal empirical risk is usually a structure with maximal complexity; e.g., in clustering, the ERM principle will necessarily yield a solution with the maximal number of clusters. The ERM principle, therefore, is not suitable as a model selection principle to determine the number of clusters which are stable under sample fluctuations.

*This work has been supported by the German Israel Foundation for Science and Research Development (GIF) under grant #1-0403-001.06/95.
The ERA principle with its approximation accuracy γ solves this problem by controlling the effective complexity of the hypothesis class.
In spirit, this approach is similar to the Gibbs algorithm presented, for example, in [3]. The Gibbs algorithm samples a random hypothesis from the version space to predict the label of the (l+1)-th data point x_{l+1}. The version space is defined as the set of hypotheses which are consistent with the first l given data points. In our approach we use an alternative definition of consistency, where all hypotheses in an appropriate neighborhood of the empirical minimizer define the version space (see also [4]). Averaging over this neighborhood yields a structure with risk equivalent to the expected risk obtained by random sampling from this set of hypotheses. There also exists a tight methodological relationship to [7] and [4], where learning curves for the learning of two-class classifiers are derived using techniques from statistical mechanics.

2 The Empirical Risk Approximation Principle

The data samples Z = {z_r ∈ Ω, 1 ≤ r ≤ l} which have to be analyzed by the unsupervised learning algorithm are elements of a suitable object (resp. feature) space Ω. The samples are distributed according to a measure μ which is not assumed to be known for the analysis.¹
A mathematically precise statement of the ERA principle requires several definitions which formalize the notion of searching for structure in the data. The quality of structures extracted from the data set Z is evaluated by the empirical risk

\hat{R}(\alpha; Z) := \frac{1}{l} \sum_{r=1}^{l} h(z_r; \alpha)

of a structure α given the training set Z. The function h(z; α) is known as a loss function in statistics. It measures the costs for processing a generic datum z with model α. Each value α ∈ A parameterizes an individual loss function, with A denoting the set of possible parameters.
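As a concrete illustration of the definitions above, the empirical risk of a parameterized loss can be computed directly. The quadratic loss and the sample values below are illustrative choices of ours, not part of the paper:

```python
import numpy as np

def empirical_risk(h, z, alpha):
    """Empirical risk R_hat(alpha; Z) = (1/l) * sum_r h(z_r; alpha)."""
    return np.mean([h(z_r, alpha) for z_r in z])

# Toy example: quadratic loss h(z; alpha) = (z - alpha)^2 (a centroid model).
h = lambda z_r, alpha: (z_r - alpha) ** 2
z = np.array([0.0, 1.0, 2.0, 3.0])
print(empirical_risk(h, z, alpha=1.5))  # 1.25
```

Minimizing this quantity over α yields the empirical minimizer discussed next.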
The loss function which minimizes the empirical risk is denoted by \hat{\alpha}^{\perp} := \arg\min_{\alpha \in A} \hat{R}(\alpha; Z). The relevant quality measure for learning is the expected risk

R(\alpha) := \int_{\Omega} h(z; \alpha) \, d\mu(z).

The optimal structure to be inferred from the data is \alpha^{\perp} := \arg\min_{\alpha \in A} R(\alpha). The distribution μ is assumed to decay sufficiently fast, with bounded r-th moments E_\mu\{|h(z; \alpha) - R(\alpha)|^r\} \le r!\, \tau^{r-2}\, V_\mu\{h(z; \alpha)\}, ∀α ∈ A (r > 2). E_μ{·} and V_μ{·} denote expectation and variance of a random variable, respectively; τ is a distribution-dependent constant.
ERA requires the learning algorithm to determine a set of hypotheses on the basis of the finest consistently learnable cover of the hypothesis class. Given a learning accuracy γ, a subset of parameters A_γ = {α_1, …, α_{|A_γ|-1}} ∪ {\hat{\alpha}^{\perp}} can be defined such that the hypothesis class ℋ is covered by the function balls with index sets

B_\gamma(\alpha) := \Big\{ \alpha' : \int_{\Omega} |h(z; \alpha') - h(z; \alpha)| \, d\mu(z) \le \gamma \Big\},

i.e., A ⊆ ∪_{α ∈ A_γ} B_γ(α). The empirical minimizer \hat{\alpha}^{\perp} has been added to the cover to simplify bounding arguments.

¹Knowledge of covering numbers is required in the following analysis, which is a weaker type of information than complete knowledge of the probability measure μ (see also [5]).

Large deviation theory is used to determine the approximation accuracy γ for learning a hypothesis from the hypothesis class ℋ. The expected risk of the empirical minimizer exceeds the global minimum of the expected risk R(\alpha^{\perp}) by \epsilon \sigma_\tau with a probability bounded by Bernstein's inequality [8]:

P\{ R(\hat{\alpha}^{\perp}) - R(\alpha^{\perp}) \ge \epsilon \sigma_\tau \}
  \le P\Big\{ \sup_{\alpha \in A_\gamma} |R(\alpha) - \hat{R}(\alpha)| \ge \tfrac{1}{2} (\epsilon \sigma_\tau - \gamma) \Big\}
  \le 2 |A_\gamma| \exp\left( - \frac{l\, (\epsilon - \gamma/\sigma_\tau)^2}{8 + 4 \tau (\epsilon - \gamma/\sigma_\tau)} \right) =: \delta.   (1)

The complexity |A_γ| of the coarsened hypothesis class has to be small enough to guarantee, with high confidence, small ε-deviations.²
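The bound (1) and the sample complexity it implies can be evaluated numerically. In the sketch below, all constants (|A_γ|, ε, γ, σ_τ, τ, δ) are freely chosen placeholder values, not values from the paper:

```python
import math

def bernstein_confidence(l, card_cover, eps, gamma, sigma_tau, tau):
    """Right-hand side of the Bernstein bound (1): the failure probability delta."""
    u = eps - gamma / sigma_tau          # effective deviation scale
    assert u > 0, "precision must exceed gamma / sigma_tau"
    return 2.0 * card_cover * math.exp(-l * u ** 2 / (8.0 + 4.0 * tau * u))

def sample_complexity(card_cover, eps, gamma, sigma_tau, tau, delta):
    """Solve (2) for l_0, the sample size at which the bound (1) equals delta."""
    u = eps - gamma / sigma_tau
    C = math.log(card_cover) + math.log(2.0 / delta)
    return C * (8.0 + 4.0 * tau * u) / u ** 2

# Placeholder constants for illustration only:
l0 = sample_complexity(card_cover=1e6, eps=0.5, gamma=0.1,
                       sigma_tau=1.0, tau=1.0, delta=0.05)
print(l0)  # required sample size for these assumed constants
```

Plugging the returned l0 back into `bernstein_confidence` recovers delta, which is exactly the defining relation (2).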
This large deviation inequality weighs two competing effects in the learning problem: the probability of a large deviation decreases exponentially with growing sample size l, whereas a large deviation becomes increasingly likely with growing cardinality of the γ-cover of the hypothesis class. According to (1), the sample complexity l_0(\gamma, \epsilon, \delta) is defined by

\log |A_\gamma| - \frac{l_0\, (\epsilon - \gamma/\sigma_\tau)^2}{8 + 4 \tau (\epsilon - \gamma/\sigma_\tau)} + \log \frac{2}{\delta} = 0.   (2)

With probability 1 − δ the deviation of the empirical risk from the expected risk is bounded by \tfrac{1}{2} (\epsilon^{\mathrm{opt}} \sigma_\tau - \gamma) =: \gamma_{\mathrm{app}}. Averaging over a set of functions which exceed the empirical minimizer by no more than 2\gamma_{\mathrm{app}} in empirical risk yields an average hypothesis corresponding to the statistically significant structure in the data, i.e., R(\alpha^{\perp}) - R(\hat{\alpha}^{\perp}) \le R(\alpha^{\perp}) + \gamma_{\mathrm{app}} - (R(\hat{\alpha}^{\perp}) - \gamma_{\mathrm{app}}) \le 2 \gamma_{\mathrm{app}}, since R(\alpha^{\perp}) \le R(\hat{\alpha}^{\perp}) by definition. The key task in the following remains to calculate the minimal precision \epsilon(\gamma) as a function of the approximation γ and to bound from above the cardinality |A_γ| of the γ-cover for specific learning problems.

3 Asymmetric clustering model

The asymmetric clustering model was developed for the analysis resp. grouping of objects characterized by co-occurrence of objects and certain feature values [6]. Application domains for this explorative data analysis approach are, for example, texture segmentation, statistical language modeling, or document retrieval.
Denote by Ω = X × Y the product space of objects x_i ∈ X, 1 ≤ i ≤ n, and features y_j ∈ Y, 1 ≤ j ≤ f. The x_i ∈ X are characterized by observations Z = {z_r} = {(x_{i(r)}, y_{j(r)}), r = 1, …, l}. The sufficient statistics of how often the object-feature pair (x_i, y_j) occurs in the data set Z is measured by the set of frequencies {η_ij := number of observations (x_i, y_j) / total number of observations}. Derived measurements are the frequency of observing object x_i, i.e.,
η_i = Σ_j η_ij, and the frequency of observing feature y_j given object x_i, i.e., η_{j|i} = η_ij / η_i. The asymmetric clustering model defines a generative model of a finite mixture of component probability distributions in feature space, with cluster-conditional distributions q = (q_{j|ν}), 1 ≤ j ≤ f, 1 ≤ ν ≤ k (see [6]). We introduce indicator variables M_{iν} ∈ {0,1} for the membership of object x_i in cluster ν ∈ {1, …, k}. The constraint Σ_{ν=1}^{k} M_{iν} = 1, ∀i : 1 ≤ i ≤ n, enforces uniqueness of assignments.

²The maximal standard deviation \sigma_\tau := \sup_{\alpha \in A_\gamma} \sqrt{V\{h(z; \alpha)\}} defines the scale on which deviations of the empirical risk from the expected risk are measured (see [2]).

Using these variables, the observed data Z are distributed according to the generative model over X × Y:

P\{x_i, y_j \mid M, q\} = \frac{1}{n} \sum_{\nu=1}^{k} M_{i\nu}\, q_{j|\nu}.   (3)

For the analysis of the unknown data source, characterized (at least approximately) by the empirical data Z, a structure α = (M, q) with M ∈ {0,1}^{n×k} has to be inferred. The aim of an ACM analysis is to group the objects x_i, as coded by the unknown indicator variables M_{iν}, and to estimate for each cluster ν a prototypical feature distribution q_{j|ν}.
Using the loss function h(x_i, y_j; \alpha) = \log n - \sum_{\nu=1}^{k} M_{i\nu} \log q_{j|\nu}, the maximization of the likelihood can be formulated as minimization of the empirical risk \hat{R}(\alpha; Z) = \sum_{i=1}^{n} \sum_{j=1}^{f} \eta_{ij}\, h(x_i, y_j; \alpha), where the essential quantity to be minimized is the expected risk R(\alpha) = \sum_{i=1}^{n} \sum_{j=1}^{f} p^{\mathrm{true}}\{x_i, y_j\}\, h(x_i, y_j; \alpha). Using the maximum entropy principle, the following annealing equations are derived [6]:

\hat{q}_{j|\nu} = \frac{\sum_{i=1}^{n} \langle M_{i\nu} \rangle\, \eta_{ij}}{\sum_{i=1}^{n} \langle M_{i\nu} \rangle\, \eta_i}, \qquad
\langle M_{i\nu} \rangle = \frac{\exp\big[ \beta \sum_{j=1}^{f} \eta_{j|i} \log \hat{q}_{j|\nu} \big]}{\sum_{\mu=1}^{k} \exp\big[ \beta \sum_{j=1}^{f} \eta_{j|i} \log \hat{q}_{j|\mu} \big]}.   (4)

The critical temperature: Due to the limited precision of the observed data, it is natural to study histogram clustering as a learning problem with the hypothesis class ℋ = \{ -\sum_{\nu} M_{i\nu} \log q_{j|\nu} : M_{i\nu} \in \{0,1\} \wedge \sum_{\nu} M_{i\nu} = 1 \wedge q_{j|\nu} \in \{\tfrac{1}{l}, \ldots, 1\} \wedge \sum_{j} q_{j|\nu} = 1 \}. The limited number of observations results in a limited precision of the frequencies η_{j|i}. The value q_{j|ν} = 0 has been excluded since it causes infinite expected risk for p^{\mathrm{true}}\{y_j \mid x_i\} > 0. The size of the regularized hypothesis class A_γ can be upper bounded by the cardinality of the complete hypothesis class divided by the minimal cardinality of a γ-function ball centered at a function of the γ-cover A_γ, i.e.,

|A_\gamma| \le |\mathcal{H}| \,\big/\, \min_{\tilde{\alpha}} |B_\gamma(\tilde{\alpha})|.   (5)

The cardinality of a function ball with radius γ can be approximated by adopting techniques from asymptotic analysis [1] (with the step function \Theta(x) = 1 for x \ge 0):

|B_\gamma(\tilde{\alpha})| = \sum_{\{q_{j|\nu}\}} \sum_{\{M_{i\nu}\}} \Theta\Big( \gamma - \sum_{i,j} p^{\mathrm{true}}\{y_j \mid x_i\} \Big| \log \frac{q_{j|m(i)}}{\tilde{q}_{j|\tilde{m}(i)}} \Big| \Big),   (6)

and the entropy S is given by

S(q, \mathcal{Q}, x) = \gamma x - \sum_{\nu} \mathcal{Q}_\nu \Big( \sum_{j} q_{j|\nu} - 1 \Big) + \frac{1}{n} \sum_{i} \log \sum_{\mu} \exp\Big( -x \sum_{j} p^{\mathrm{true}}\{y_j \mid x_i\} \Big| \log \frac{q_{j|\mu}}{\tilde{q}_{j|\tilde{m}(i)}} \Big| \Big).   (7)

The auxiliary variables \mathcal{Q} = \{\mathcal{Q}_\nu\}_{\nu=1}^{k} are Lagrange parameters which enforce the normalizations \sum_j q_{j|\nu} = 1. Choosing q_{j|\nu} = \tilde{q}_{j|\tilde{m}(i)}, we obtain an approximation of the integral. The reader should note that a saddle-point approximation in the usual sense is only applicable for the parameter x but fails for the q, \mathcal{Q} parameters, since the integrand is maximal at the non-differentiability point of the absolute value function. We, therefore, expand S(q, \mathcal{Q}, x) up to linear terms O(q - \hat{q}) and integrate piece-wise.
Using the abbreviation K_{i\mu} := \sum_j p^{\mathrm{true}}\{y_j \mid x_i\} \big| \log ( \hat{q}_{j|\mu} / \hat{q}_{j|m(i)} ) \big|, the following saddle-point approximation for the integral over x is obtained:

\gamma = \frac{1}{n} \sum_{i=1}^{n} \sum_{\mu=1}^{k} P_{i\mu} K_{i\mu}, \quad \text{with} \quad P_{i\mu} = \frac{\exp(-x K_{i\mu})}{\sum_{\rho} \exp(-x K_{i\rho})}.   (8)

The entropy S evaluated at q = \hat{q} yields, in combination with the Laplace approximation [1], an estimate for the cardinality of the γ-cover:

\log |A_\gamma| = n (\log k - S) + \frac{x^2}{2} \sum_{i,\rho} K_{i\rho} P_{i\rho} \Big( \sum_{\mu} P_{i\mu} K_{i\mu} - K_{i\rho} \Big),   (9)

where the second term results from the second-order term of the Taylor expansion around the saddle point. Inserting this complexity in equation (2) yields an equation which determines the required number of samples l_0 for a fixed precision ε and confidence δ. Conversely, this equation defines a functional relationship between the precision ε and the approximation quality γ for fixed sample size l_0 and confidence δ. Under this assumption the precision ε depends on γ in a non-monotone fashion; solving the quadratic equation (2) for ε yields

\epsilon(\gamma) = \frac{\gamma}{\sigma_\tau} + \frac{2}{l_0} \left( \tau C + \sqrt{\tau^2 C^2 + 2 l_0 C} \right),   (10)

using the abbreviation C = \log |A_\gamma| + \log \frac{2}{\delta}. The minimum of the function ε(γ) defines a compromise between uncertainty originating from empirical fluctuations and the loss of precision due to the approximation by a γ-cover. Differentiating with respect to γ and setting the result to zero (d\epsilon(\gamma)/d\gamma = 0) yields an upper bound for the inverse temperature:

\hat{x} \le \frac{1}{\sigma_\tau} \frac{l_0}{2n} \left( \tau + \frac{l_0 + C \tau^2}{\sqrt{2 l_0 C + \tau^2 C^2}} \right)^{-1}.   (11)

Analogous to estimates for k-means, phase transitions occur in ACM while lowering the temperature. The mixture model for the data at hand can be partitioned into more and more components, revealing finer and finer details of the generation process. The critical x^{\mathrm{opt}} defines the resolution limit below which details cannot be resolved in a reliable fashion on the basis of the sample size l_0.
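Equations (8), (9), and (11) lend themselves to a small numeric sketch. Here K, σ_τ, τ, l_0, and δ are placeholder values of ours; the entropy (7) is evaluated at q = q̂, where the Lagrange term vanishes, and clipping log|A_γ| at zero is our own safeguard:

```python
import numpy as np

def gamma_and_logcover(K, x, k):
    """Saddle-point quantities: gamma from (8) and log|A_gamma| from (9).
    K: (n, k) matrix of K_{i,mu}; x: inverse temperature."""
    n = K.shape[0]
    w = np.exp(-x * (K - K.min(axis=1, keepdims=True)))   # stabilized weights
    P = w / w.sum(axis=1, keepdims=True)                  # P_{i,mu}, eq. (8)
    gamma = (P * K).sum() / n
    S = gamma * x + np.log(np.exp(-x * K).sum(axis=1)).mean()
    mean_K = (P * K).sum(axis=1, keepdims=True)
    second = 0.5 * x ** 2 * (K * P * (mean_K - K)).sum()  # Taylor term of (9)
    return gamma, n * (np.log(k) - S) + second

def x_bound(l0, n, C, sigma_tau, tau):
    """Upper bound (11) on the critical inverse temperature."""
    root = np.sqrt(2.0 * l0 * C + tau ** 2 * C ** 2)
    return (1.0 / sigma_tau) * (l0 / (2.0 * n)) / (tau + (l0 + C * tau ** 2) / root)

# Iterating (8)/(9) against (11), with toy K and assumed constants:
rng = np.random.default_rng(0)
K = rng.uniform(0.0, 1.0, size=(30, 5))
x, l0, delta, sigma_tau, tau = 1.0, 1200, 0.05, 1.0, 1.0
for _ in range(50):
    _, log_cover = gamma_and_logcover(K, x, k=5)
    C = max(log_cover, 0.0) + np.log(2.0 / delta)
    x = x_bound(l0, K.shape[0], C, sigma_tau, tau)
print(x)  # upper bound on the critical inverse temperature for these constants
```

This is exactly the two-step iteration described in the text: the current x bounds the effective cover cardinality, which in turn tightens the bound on x.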
Given the inverse temperature x, the effective cardinality of the hypothesis class can be upper bounded via the solution of the fixed-point equation (8). On the other hand, this cardinality, together with (11) and the sample size l_0, defines an upper bound on x. Iterating these two steps, we finally obtain an upper bound for the critical inverse temperature given a sample size l_0.

Empirical Results:
For the evaluation of the derived theoretical result, a series of Monte Carlo experiments on artificial data has been performed for the asymmetric clustering model. Given the number of objects n = 30, the number of groups k = 5, and the size of the histograms f = 15, the generative model for these experiments was created randomly and is summarized in fig. 1. From this generative model, sample sets of arbitrary size can be generated, and the true distributions p^{true}{y_j | x_i} can be calculated.
In figure 2a,b the predicted temperatures are compared to the empirically observed critical temperatures, which have been estimated on the basis of 2000 different samples of randomly generated co-occurrence data for each l_0. The expected risk (solid) and empirical risk (dashed) of these 2000 inferred models are averaged. Overfitting sets in when the expected risk rises as a function of the inverse temperature x.

ν   q_{j|ν}
1   {0.11, 0.01, 0.11, 0.07, 0.08, 0.04, 0.06, 0, 0.13, 0.07, 0.08, 0.1, 0, 0.11, 0.03}
2   {0.18, 0.1, 0.09, 0.02, 0.05, 0.09, 0.08, 0.03, 0.06, 0.07, 0.03, 0.02, 0.07, 0.06, 0.05}
3   {0.17, 0.05, 0.05, 0.06, 0.06, 0.05, 0.03, 0.11, 0.09, 0, 0.02, 0.1, 0.03, 0.07, 0.11}
4   {0.15, 0.07, 0.1, 0.03, 0.09, 0.03, 0.04, 0.05, 0.06, 0.05, 0.08, 0.04, 0.08, 0.09, 0.04}
5   {0.09, 0.09, 0.07, 0.1, 0.07, 0.06, 0.06, 0.11, 0.07, 0.07, 0.1, 0.02, 0.07, 0.02, 0}
m(i) = (5,3,2,5,2,2,5,4,2,2,2,4,1,5,3,5,3,4,1,2,2,3,1,1,2,5,5,2,2,1)

Figure 1: Generative ACM model for the Monte Carlo experiments.
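The generative model of fig. 1 can be used to draw co-occurrence samples as in the Monte Carlo experiments. A sketch, where the uniform choice of objects follows the 1/n factor in (3):

```python
import numpy as np

def sample_cooccurrence(q, m, l0, rng=None):
    """Draw l0 object-feature pairs from the ACM generative model (3):
    pick x_i uniformly over the n objects, then y_j ~ q_{.|m(i)};
    return the frequency matrix eta_ij."""
    rng = np.random.default_rng(rng)
    n, f = len(m), q.shape[1]
    eta = np.zeros((n, f))
    i = rng.integers(0, n, size=l0)                  # objects uniform over X
    for r in range(l0):
        j = rng.choice(f, p=q[m[i[r]]])              # feature from cluster m(i)
        eta[i[r], j] += 1.0 / l0
    return eta

# Cluster-conditional distributions and assignments taken from fig. 1:
q = np.array([
    [0.11,0.01,0.11,0.07,0.08,0.04,0.06,0,0.13,0.07,0.08,0.1,0,0.11,0.03],
    [0.18,0.1,0.09,0.02,0.05,0.09,0.08,0.03,0.06,0.07,0.03,0.02,0.07,0.06,0.05],
    [0.17,0.05,0.05,0.06,0.06,0.05,0.03,0.11,0.09,0,0.02,0.1,0.03,0.07,0.11],
    [0.15,0.07,0.1,0.03,0.09,0.03,0.04,0.05,0.06,0.05,0.08,0.04,0.08,0.09,0.04],
    [0.09,0.09,0.07,0.1,0.07,0.06,0.06,0.11,0.07,0.07,0.1,0.02,0.07,0.02,0],
])
m = np.array([5,3,2,5,2,2,5,4,2,2,2,4,1,5,3,5,3,4,1,2,2,3,1,1,2,5,5,2,2,1]) - 1
eta = sample_cooccurrence(q, m, l0=1200, rng=0)
```

Feeding such an `eta` into the annealing updates (4) reproduces the experimental setup with n = 30, k = 5, f = 15.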
Figure 2c indicates that, on average, the minimal expected risk is attained when the effective number of clusters is smaller than or equal to 5, i.e., the number of clusters of the true generative model. Predicting the right computational temperature, therefore, also enables the data analyst to solve the cluster validation problem for the asymmetric clustering model. Especially for l_0 = 800, the sample fluctuations do not permit the estimation of five clusters, and the minimal computational temperature prevents such an inference result. On the other hand, for l_0 = 1600 and l_0 = 2000, the minimal temperature prevents the algorithm from inferring too many clusters, which would be an instance of overfitting.
As an interesting point, one should note that for an infinite number of observations the critical inverse temperature reaches a finite positive value, and not more than the five effective clusters are extracted. At this point we conclude that, for the case of histogram clustering, Empirical Risk Approximation solves the problem of model validation for realizable rules, i.e., choosing the right number of clusters.
Figure 2d summarizes predictions of the critical temperature on the basis of the empirical distribution η_ij rather than the true distribution p^{true}{x_i, y_j}. The empirical distribution has been generated by a training sample set, with x of eq. (11) being used as a plug-in estimator. The histogram depicts the predicted inverse temperature for l_0 = 1200. The average of these plug-in estimators is equal to the predicted temperature for the true distribution. The estimates of x are biased towards too small inverse temperatures, due to correlations between the parameter estimates and the stopping criterion. It is still an open question and the focus of ongoing work to rigorously bound the variance of this plug-in estimator.
Empirically, we observe a reduction of the variance of the expected risk at the predicted temperature for larger sample sizes l_0.

4 Conclusions

The two conditions, that the empirical risk has to converge uniformly towards the expected risk and that all loss functions within a 2γ_app-range of the global empirical risk minimum have to be considered in the inference process, limit the complexity of the underlying hypothesis class for a given number of samples. The maximum entropy method, which has been widely employed in deterministic annealing procedures for optimization problems, is substantiated by our analysis. Solutions with too many clusters clearly overfit the data and do not generalize. The condition that the hypothesis class should only be divided into function balls of size γ forces us to stop the stochastic search at the lower bound of the computational temperature. Another important result of this investigation is the fact that choosing the right stopping temperature for the annealing process not only avoids overfitting but also solves the cluster validation problem in the realizable case of ACM. A possible inference of too many clusters using the empirical risk functional is suppressed.