Part of Advances in Neural Information Processing Systems 17 (NIPS 2004)
Joachim Giesen, Simon Spalinger, Bernhard Schölkopf
We describe methods for computing an implicit model of a hypersurface that is given only by a finite sampling. The methods work by mapping the sample points into a reproducing kernel Hilbert space and then determining regions in terms of hyperplanes.
1 Introduction
Suppose we are given a finite sampling (in machine learning terms, training data) $x_1, \dots, x_m \in \mathcal{X}$, where the domain $\mathcal{X}$ is some hypersurface in Euclidean space $\mathbb{R}^d$. The case $d = 3$ is especially interesting since these days there are many devices, e.g., laser range scanners, that allow the acquisition of point data from the boundary surfaces of solids. For further processing it is often necessary to transform this data into a continuous model. Today the most popular approach is to add connectivity information to the data by transforming them into a triangle mesh (see [4] for an example of such a transformation algorithm). But recently implicit models, where the surface is modeled as the zero set of some sufficiently smooth function, have also gained popularity [1]. They bear resemblance to level set methods used in computer vision [6]. One advantage of implicit models is that they easily allow the derivation of higher order differential quantities such as curvatures. Another advantage is that an inside-outside test, i.e., testing whether a query point lies on the bounded or unbounded side of the surface, boils down to determining the sign of a function evaluation at the query point. Inside-outside tests are important when one wants to intersect two solids.
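To make the inside-outside test concrete, here is a minimal sketch (ours, not part of the method described in this paper): for a hypothetical implicit model $f(x) = \|x\|^2 - 1$ of the unit sphere, the test amounts to checking the sign of a single function evaluation.

```python
import numpy as np

def f_sphere(x):
    # Hypothetical implicit model: its zero set is the unit sphere in R^3.
    return np.dot(x, x) - 1.0

def is_inside(x):
    # Inside-outside test: only the sign of one function evaluation is needed.
    return f_sphere(x) < 0.0

print(is_inside(np.array([0.2, 0.1, 0.3])))  # True: query point on the bounded side
print(is_inside(np.array([1.5, 0.0, 0.0])))  # False: query point on the unbounded side
```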
The goal of this paper is, loosely speaking, to find a function which takes the value zero on a surface which
(1) contains the training data and
(2) is a "reasonable" implicit model of X .
To capture properties of its shape even in the above general case, we need to exploit some structure on $\mathcal{X}$. In line with a sizeable amount of recent work on kernel methods [11], we assume that this structure is given by a (positive definite) kernel, i.e., a real-valued function
Partially supported by the Swiss National Science Foundation under the project "Non-linear manifold learning".
Figure 1: In the 2-D toy example depicted, the hyperplane $\langle w, \Phi(x) \rangle = \rho$ separates all but one of the points from the origin. The outlier $\Phi(x)$ is associated with a slack variable $\xi$, which is penalized in the objective function (4). The distance from the outlier to the hyperplane is $\xi/\|w\|$; the distance between hyperplane and origin is $\rho/\|w\|$. The latter implies that a small $\|w\|$ corresponds to a large margin of separation from the origin.
$k$ on $\mathcal{X} \times \mathcal{X}$ which can be expressed as
$$ k(x, x') = \langle \Phi(x), \Phi(x') \rangle \qquad (1) $$
for some map $\Phi$ into a Hilbert space $H$. The space $H$ is the reproducing kernel Hilbert space (RKHS) associated with $k$, and $\Phi$ is called its feature map. A popular example, in the case where $\mathcal{X}$ is a normed space, is the Gaussian (where $\sigma > 0$)
$$ k(x, x') = \exp\left(-\frac{\|x - x'\|^2}{2\sigma^2}\right). \qquad (2) $$
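As a minimal sketch (ours), the Gaussian kernel (2) can be evaluated on a point set as follows; the bandwidth $\sigma$ is a user-chosen parameter and all names are illustrative.

```python
import numpy as np

def gaussian_kernel(X, Y, sigma):
    """Gram matrix K[i, j] = exp(-||X[i] - Y[j]||^2 / (2 sigma^2)), cf. (2)."""
    sq_dists = (np.sum(X**2, axis=1)[:, None]
                + np.sum(Y**2, axis=1)[None, :]
                - 2.0 * X @ Y.T)
    return np.exp(-np.maximum(sq_dists, 0.0) / (2.0 * sigma**2))

# Example: kernel values between m = 100 sample points in R^3.
X = np.random.rand(100, 3)
K = gaussian_kernel(X, X, sigma=0.2)
print(K.shape)  # (100, 100)
```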
The advantage of using a positive definite kernel as a similarity measure is that it allows us to construct geometric algorithms in Hilbert spaces.
2 Single-Class SVMs
Single-class SVMs were introduced [8, 10] to estimate quantiles $C = \{x \in \mathcal{X} \mid f(x) \in [0, \infty[\}$ of an unknown distribution $P$ on $\mathcal{X}$ using kernel expansions. Here,
$$ f(x) = \sum_i \alpha_i k(x_i, x) - \rho, \qquad (3) $$
where $x_1, \dots, x_m \in \mathcal{X}$ are unlabeled data generated i.i.d. according to $P$. The single-class SVM approximately computes the smallest set $C \in \mathcal{C}$ containing a specified fraction of all training examples, where smallness is measured in terms of the norm in the RKHS $H$ associated with $k$, and $\mathcal{C}$ is the family of sets corresponding to half-spaces in $H$. Depending on the kernel, this notion of smallness will coincide with the intuitive idea that the quantile estimate should not only contain a specified fraction of the training points, but should also be sufficiently smooth so that the same is approximately true for previously unseen points sampled from $P$.
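As an illustration (our own sketch), the kernel expansion (3) can be evaluated pointwise once the coefficients $\alpha_i$ and the offset $\rho$ are known; here they are placeholders, in the single-class SVM they come from the optimization problems described below.

```python
import numpy as np

def kernel_expansion(x, X_train, alpha, rho, sigma):
    """Evaluate f(x) = sum_i alpha_i * k(x_i, x) - rho with the Gaussian kernel (2)."""
    sq_dists = np.sum((X_train - x) ** 2, axis=1)
    return float(alpha @ np.exp(-sq_dists / (2.0 * sigma**2))) - rho

# Placeholder coefficients; in practice they are obtained from the dual problem below.
X_train = np.random.rand(50, 3)
alpha = np.full(50, 1.0 / 50)   # uniform nonnegative weights, summing to 1
rho = 0.5
print(kernel_expansion(np.array([0.5, 0.5, 0.5]), X_train, alpha, rho, sigma=0.2))
```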
Let us briefly describe the main ideas of the approach. The training points are mapped into $H$ using the feature map $\Phi$ associated with $k$, and we then attempt to separate them from the origin with a large margin by solving the following quadratic program: for $\nu \in (0, 1]$,¹
$$ \operatorname*{minimize}_{w \in H,\; \boldsymbol{\xi} \in \mathbb{R}^m,\; \rho \in \mathbb{R}} \quad \frac{1}{2}\|w\|^2 + \frac{1}{\nu m}\sum_i \xi_i - \rho \qquad (4) $$
$$ \text{subject to} \quad \langle w, \Phi(x_i) \rangle \geq \rho - \xi_i, \quad \xi_i \geq 0. \qquad (5) $$
Since non-zero slack variables $\xi_i$ are penalized in the objective function, we can expect that if $w$ and $\rho$ solve this problem, then the decision function $f(x) = \operatorname{sgn}\left(\langle w, \Phi(x) \rangle - \rho\right)$ will
¹Here and below, boldface Greek characters denote vectors, e.g., $\boldsymbol{\xi} = (\xi_1, \dots, \xi_m)^\top$, and indices $i, j$ by default run over $1, \dots, m$.
Figure 2: Models computed with a single-class SVM using a Gaussian kernel (2). The three examples differ in the value chosen for $\sigma$ in the kernel: a large value (0.224 times the diameter of the hemisphere) in the left figure and a small value (0.062 times the diameter of the hemisphere) in the middle and right figures. In the right figure, non-zero slack variables (outliers) were also allowed. Note that the outliers in the right figure correspond to a sharp feature (non-smoothness) in the original surface.
equal $1$ for most examples $x_i$ contained in the training set,² while the regularization term $\|w\|$ will still be small. For an illustration, see Figure 1. The trade-off between these two goals is controlled by the parameter $\nu$.
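In practice one does not have to implement (4)-(5) from scratch: for instance, scikit-learn's OneClassSVM provides a $\nu$-parameterized single-class SVM with a Gaussian (RBF) kernel. The sketch below is ours, with purely illustrative data and parameter values (note that scikit-learn parameterizes the Gaussian via gamma $= 1/(2\sigma^2)$).

```python
import numpy as np
from sklearn.svm import OneClassSVM

# Illustrative training data: points sampled from a hemisphere in R^3.
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, 500)
phi = rng.uniform(0.0, 0.5 * np.pi, 500)
X = np.column_stack([np.cos(theta) * np.sin(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(phi)])

sigma = 0.2                                  # Gaussian bandwidth, cf. (2)
clf = OneClassSVM(kernel="rbf",
                  gamma=1.0 / (2.0 * sigma**2),
                  nu=0.05)                   # nu in (0, 1], cf. (4)
clf.fit(X)

# Signed value of the kernel expansion at query points (cf. (3) and Figure 1):
print(clf.decision_function(X[:5]))
```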
One can show that the solution takes the form
$$ f(x) = \operatorname{sgn}\left(\sum_i \alpha_i k(x_i, x) - \rho\right), \qquad (6) $$
where the $\alpha_i$ are computed by solving the dual problem,
$$ \operatorname*{minimize}_{\boldsymbol{\alpha} \in \mathbb{R}^m} \quad \frac{1}{2}\sum_{ij} \alpha_i \alpha_j k(x_i, x_j) \qquad (7) $$
$$ \text{subject to} \quad 0 \leq \alpha_i \leq \frac{1}{\nu m} \quad \text{and} \quad \sum_i \alpha_i = 1. \qquad (8) $$
Note that according to (8), the training examples contribute with nonnegative weights $\alpha_i \geq 0$ to the solution (6). One can show that asymptotically, a fraction $\nu$ of all training examples will have strictly positive weights, and the rest will be zero (the "$\nu$-property").
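For small point sets the dual (7)-(8) can also be solved directly with a generic constrained optimizer; the sketch below (ours) uses SciPy's SLSQP and recovers $\rho$ from a coefficient strictly between its bounds, following the usual KKT reasoning for single-class SVMs.

```python
import numpy as np
from scipy.optimize import minimize

def solve_single_class_dual(K, nu):
    """Solve (7)-(8): min 0.5 a^T K a  s.t.  0 <= a_i <= 1/(nu m),  sum_i a_i = 1."""
    m = K.shape[0]
    upper = 1.0 / (nu * m)
    res = minimize(fun=lambda a: 0.5 * a @ K @ a,
                   x0=np.full(m, 1.0 / m),
                   jac=lambda a: K @ a,
                   bounds=[(0.0, upper)] * m,
                   constraints=[{"type": "eq", "fun": lambda a: np.sum(a) - 1.0}],
                   method="SLSQP")
    alpha = res.x
    # For a point with 0 < alpha_i < 1/(nu m), constraint (5) is active with xi_i = 0,
    # so rho = <w, Phi(x_i)> = sum_j alpha_j k(x_j, x_i).
    margin = np.where((alpha > 1e-6) & (alpha < upper - 1e-6))[0]
    if margin.size:
        rho = float(np.mean((K @ alpha)[margin]))
    else:
        rho = float(np.max(K @ alpha))   # rough fallback if no such point exists
    return alpha, rho

# Usage with the Gaussian Gram matrix K of the training points:
# alpha, rho = solve_single_class_dual(K, nu=0.05)
```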
In our application we are not primarily interested in the decision function itself but in the boundaries of the regions in input space defined by it. That is, we are interested in $f^{-1}(0)$, where $f$ is the kernel expansion (3) and the points $x_1, \dots, x_m \in \mathcal{X}$ are sampled from some unknown hypersurface $\mathcal{X} \subset \mathbb{R}^d$. We want to consider $f^{-1}(0)$ as a model for $\mathcal{X}$. In the following we focus on the case $d = 3$. If we assume that the $x_i$ are sampled without noise from $\mathcal{X}$, which for example is a reasonable assumption for data obtained with a state-of-the-art 3D laser scanning device, we should set the slack variables in (4) and (5) to zero. In the dual problem this results in removing the upper constraints on the $\alpha_i$ in (8). Note that sample points with a non-zero slack variable cannot be contained in $f^{-1}(0)$. But sample points whose image in feature space lies above the optimal hyperplane are also not contained in $f^{-1}(0)$ (see Figure 1) -- we will address this in the next section. It turns out that it is useful in practice to allow non-zero slack variables, because they prevent $f^{-1}(0)$ from decomposing into many connected components (see Figure 2 for an illustration).
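One way to use $f^{-1}(0)$ as a surface model in practice (our own sketch, not the procedure of this paper) is to evaluate the kernel expansion on a regular grid and extract the zero level set with marching cubes, e.g., via scikit-image; all data and parameters below are illustrative.

```python
import numpy as np
from sklearn.svm import OneClassSVM
from skimage import measure

# Illustrative sample points from a sphere, and a single-class SVM fit as above.
rng = np.random.default_rng(0)
theta = rng.uniform(0.0, 2.0 * np.pi, 1000)
phi = rng.uniform(0.0, np.pi, 1000)
X = np.column_stack([np.cos(theta) * np.sin(phi),
                     np.sin(theta) * np.sin(phi),
                     np.cos(phi)])
sigma = 0.3
clf = OneClassSVM(kernel="rbf", gamma=1.0 / (2.0 * sigma**2), nu=0.01).fit(X)

# Evaluate f on a regular grid enclosing the data.
n = 48
grid = np.linspace(-1.3, 1.3, n)
xx, yy, zz = np.meshgrid(grid, grid, grid, indexing="ij")
values = clf.decision_function(np.column_stack([xx.ravel(), yy.ravel(), zz.ravel()]))
volume = values.reshape(n, n, n)

# Extract the zero set f^{-1}(0) as a triangle mesh.
step = grid[1] - grid[0]
verts, faces, normals, _ = measure.marching_cubes(volume, level=0.0, spacing=(step,) * 3)
print(verts.shape, faces.shape)
```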
In our experience, ensuring that the images of all sample points in feature space lie close to (or on) the optimal hyperplane can be achieved by choosing $\sigma$ in the Gaussian
²We use the convention that $\operatorname{sgn}(z)$ equals $1$ for $z \geq 0$ and $-1$ otherwise.
Figure 3: Two parallel hyperplanes $\langle w, \Phi(x) \rangle = \rho + \delta^{(*)}$ enclosing all but two of the points. The outlier $\Phi(x^{(*)})$ is associated with a slack variable $\xi^{(*)}$, which is penalized in the objective function (9). The distance from an outlier to its hyperplane is $\xi^{(*)}/\|w\|$; the distance between a hyperplane and the origin is $(\rho + \delta^{(*)})/\|w\|$.
kernel (2) such that the Gaussians in the kernel expansion (3) are highly localized. However, highly localized Gaussians are not well suited for interpolation -- the implicit surface decomposes into several components. Allowing outliers mitigates the situation to a certain extent. Another way to deal with the problem is to further restrict the optimal region in feature space. In the following we will pursue the latter approach.
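As a practical aside on the choice of $\sigma$ discussed above: the values reported in Figure 2 are fractions of the diameter of the sampled surface, which suggests the following simple heuristic (a sketch of ours; the function name and the fractions follow the caption of Figure 2).

```python
import numpy as np
from scipy.spatial.distance import pdist

def sigma_from_diameter(X, fraction):
    """Set the Gaussian bandwidth (2) to a fraction of the point-cloud diameter."""
    return fraction * pdist(X).max()   # diameter = largest pairwise distance

# The two settings of Figure 2: a large and a highly localized bandwidth.
X = np.random.rand(200, 3)                      # placeholder point cloud
sigma_large = sigma_from_diameter(X, 0.224)
sigma_small = sigma_from_diameter(X, 0.062)
gamma_small = 1.0 / (2.0 * sigma_small**2)      # for RBF implementations using gamma
```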