{"title": "Knowledge-Based Support Vector Machine Classifiers", "book": "Advances in Neural Information Processing Systems", "page_first": 537, "page_last": 544, "abstract": "", "full_text": "Knowledge-Based Support Vector Machine Classifiers \n\nGlenn M. Fung, Olvi L. Mangasarian and Jude W. Shavlik \nComputer Sciences Department, University of Wisconsin \nMadison, WI 53706 \ngfung, olvi, shavlik@cs.wisc.edu \n\nAbstract \n\nPrior knowledge in the form of multiple polyhedral sets, each belonging to one of two categories, is introduced into a reformulation of a linear support vector machine classifier. The resulting formulation leads to a linear program that can be solved efficiently. Real-world examples, from DNA sequencing and breast cancer prognosis, demonstrate the effectiveness of the proposed method. Numerical results show improvement in test set accuracy after the incorporation of prior knowledge into ordinary, data-based linear support vector machine classifiers. One experiment also shows that a linear classifier, based solely on prior knowledge, far outperforms the direct application of prior knowledge rules to classify data. \n\nKeywords: use and refinement of prior knowledge, support vector machines, linear programming \n\n1 Introduction \n\nSupport vector machines (SVMs) have played a major role in classification problems [18, 3, 11]. However, unlike other classification tools such as knowledge-based neural networks [16, 17, 7], little work [15] has gone into incorporating prior knowledge into support vector machines. In this work we present a novel approach to incorporating prior knowledge in the form of polyhedral knowledge sets in the input space of the given data. These knowledge sets, which can be as simple as cubes, are supposed to belong to one of two categories into which all the data is divided. 
Thus, a single knowledge set can be interpreted as a generalization of a training example, which typically consists of a single point in input space. In contrast, each of our knowledge sets consists of a region in the same space. By using a powerful tool from mathematical programming, theorems of the alternative [9, Chapter 2], we are able to embed such prior data into a linear program that can be efficiently solved by any of the publicly available solvers. \n\nWe briefly summarize the contents of the paper now. In Section 2 we describe the linear support vector machine classifier and give a linear program for it. We then describe how prior knowledge, in the form of polyhedral knowledge sets belonging to one of two classes, can be characterized. In Section 3 we incorporate these polyhedral sets into our linear programming formulation, which results in our knowledge-based support vector machine (KSVM) formulation (19). This formulation is capable of generating a linear classifier based on real data and/or prior knowledge. Section 4 gives a brief summary of numerical results that compare various linear and nonlinear classifiers with and without the incorporation of prior knowledge. Section 5 concludes the paper. \n\nWe now describe our notation. All vectors will be column vectors unless transposed to a row vector by a prime '. The scalar (inner) product of two vectors x and y in the n-dimensional real space R^n will be denoted by x'y. For a vector x in R^n, the sign function sign(x) is defined as sign(x)_i = 1 if x_i > 0, else sign(x)_i = -1 if x_i ≤ 0, for i = 1, ..., n. For x in R^n, ‖x‖_p denotes the p-norm, p = 1, 2, ∞. The notation A in R^{m×n} will signify a real m × n matrix. For such a matrix, A' will denote the transpose of A and A_i will denote the i-th row of A. A vector of ones in a real space of arbitrary dimension will be denoted by e. 
Thus for e in R^m and y in R^m the notation e'y will denote the sum of the components of y. A vector of zeros in a real space of arbitrary dimension will be denoted by 0. The identity matrix of arbitrary dimension will be denoted by I. A separating plane, with respect to two given point sets A and B in R^n, is a plane that attempts to separate R^n into two halfspaces such that each open halfspace contains points mostly of A or B. A bounding plane to the set A is a plane that places A in one of the two closed halfspaces that the plane generates. The symbol ∧ will denote the logical \"and\". The abbreviation \"s.t.\" stands for \"such that\". \n\n2 Linear Support Vector Machines and Prior Knowledge \n\nWe consider the problem, depicted in Figure 1(a), of classifying m points in the n-dimensional input space R^n, represented by the m × n matrix A, according to membership of each point A_i in the class A+ or A− as specified by a given m × m diagonal matrix D with plus ones or minus ones along its diagonal. For this problem, the linear programming support vector machine [11, 2] with a linear kernel, which is a variant of the standard support vector machine [18, 3], is given by the following linear program with parameter ν > 0: \n\nmin_{(w,γ,y) in R^{n+1+m}} { νe'y + ‖w‖₁ | D(Aw − eγ) + y ≥ e, y ≥ 0 }, (1) \n\nwhere ‖·‖₁ denotes the 1-norm as defined in the Introduction, y is a vector of slack variables measuring empirical error and (w, γ) characterize a separating plane depicted in Figure 1. That this problem is indeed a linear program can easily be seen from the equivalent formulation: \n\nmin_{(w,γ,y,t) in R^{n+1+m+n}} { νe'y + e't | D(Aw − eγ) + y ≥ e, t ≥ w ≥ −t, y ≥ 0 }, (2) \n\nwhere e is a vector of ones of appropriate dimension. For economy of notation we shall use the first formulation (1) with the understanding that computational implementation is via (2). 
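The equivalent formulation (2) is a standard linear program and can be handed directly to any LP solver. The following is a minimal sketch, not the authors' code, assuming SciPy is available; the function name and the variable ordering z = (w, γ, y, t) are our own choices.

```python
import numpy as np
from scipy.optimize import linprog

def linear_svm_1norm(A, d, nu=1.0):
    """Solve LP (2): min nu*e'y + e't  s.t. D(Aw - e*gamma) + y >= e,
    -t <= w <= t, y >= 0, with D = diag(d), d_i in {+1, -1}."""
    m, n = A.shape
    D = np.diag(d)
    # variable order: z = [w (n), gamma (1), y (m), t (n)]
    c = np.concatenate([np.zeros(n), [0.0], nu * np.ones(m), np.ones(n)])
    # D(Aw - e*gamma) + y >= e  rewritten as  -DAw + De*gamma - y <= -e
    A1 = np.hstack([-D @ A, D @ np.ones((m, 1)), -np.eye(m), np.zeros((m, n))])
    # -t <= w <= t  rewritten as  w - t <= 0  and  -w - t <= 0
    A2 = np.hstack([np.eye(n), np.zeros((n, 1)), np.zeros((n, m)), -np.eye(n)])
    A3 = np.hstack([-np.eye(n), np.zeros((n, 1)), np.zeros((n, m)), -np.eye(n)])
    A_ub = np.vstack([A1, A2, A3])
    b_ub = np.concatenate([-np.ones(m), np.zeros(2 * n)])
    bounds = [(None, None)] * (n + 1) + [(0, None)] * (m + n)  # w, gamma free
    res = linprog(c, A_ub=A_ub, b_ub=b_ub, bounds=bounds, method="highs")
    return res.x[:n], res.x[n]  # (w, gamma)
```

On a toy separable problem such as A = [(2,0), (3,1), (-2,0), (-3,-1)] with labels d = (1, 1, -1, -1), the recovered plane classifies via sign(x'w − γ) as in (7).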
As depicted in Figure 1(a), w is the normal to the bounding planes: \n\nx'w = γ + 1, x'w = γ − 1, (3) \n\nthat bound the points belonging to the sets A+ and A− respectively. The constant γ determines their location relative to the origin. When the two classes are strictly linearly separable, that is when the error variable y = 0 in (1) (which is the case shown in Figure 1(a)), the plane x'w = γ + 1 bounds all of the class A+ points, while the plane x'w = γ − 1 bounds all of the class A− points as follows: \n\nA_i w ≥ γ + 1, for D_ii = 1, A_i w ≤ γ − 1, for D_ii = −1. (4) \n\nConsequently, the plane: \n\nx'w = γ, (5) \n\nmidway between the bounding planes (3), is a separating plane that separates points belonging to A+ from those belonging to A− completely if y = 0, else only approximately. The 1-norm term ‖w‖₁ in (1), which is half the reciprocal of the distance 2/‖w‖₁ measured using the ∞-norm distance [10] between the two bounding planes of (3) (see Figure 1(a)), maximizes this distance, often called the \"margin\". Maximizing the margin enhances the generalization capability of a support vector machine [18, 3]. If the classes are linearly inseparable, then the two planes bound the two classes with a \"soft margin\" (i.e. bound approximately with some error) determined by the nonnegative error variable y, that is: \n\nA_i w + y_i ≥ γ + 1, for D_ii = 1, A_i w − y_i ≤ γ − 1, for D_ii = −1. (6) \n\nThe 1-norm of the error variable y is minimized parametrically with weight ν in (1), resulting in an approximate separating plane (5) which classifies as follows: \n\nx in A+ if sign(x'w − γ) = 1, x in A− if sign(x'w − γ) = −1. (7) \n\nSuppose now that we have prior information of the following type. All points x lying in the polyhedral set determined by the linear inequalities: \n\nBx ≤ b, (8) \n\nbelong to class A+. Such inequalities generalize simple box constraints such as a ≤ x ≤ d. 
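As a concrete instance of (8), the box a ≤ x ≤ d is itself a polyhedral knowledge set: stacking x ≤ d and −x ≤ −a gives Bx ≤ b with B = [I; −I] and b = [d; −a]. A small sketch (our illustration; the helper names are our own):

```python
import numpy as np

def box_to_polyhedron(a, d):
    """Write the box a <= x <= d as {x | Bx <= b}:
    rows x <= d use B = I, b = d; rows x >= a use B = -I, b = -a."""
    a, d = np.asarray(a, float), np.asarray(d, float)
    n = a.size
    B = np.vstack([np.eye(n), -np.eye(n)])
    b = np.concatenate([d, -a])
    return B, b

def in_knowledge_set(B, b, x):
    """Membership test for the knowledge set {x | Bx <= b}."""
    return bool(np.all(B @ np.asarray(x, float) <= b + 1e-12))
```

For example, the unit-ish box [0,1] × [0,2] becomes four inequality rows, and membership is just a componentwise check.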
Looking at Figure 1(a) or at the inequalities (4), we conclude that the following implication must hold: \n\nBx ≤ b ⟹ x'w ≥ γ + 1. (9) \n\nThat is, the knowledge set {x | Bx ≤ b} lies on the A+ side of the bounding plane x'w = γ + 1. Later, in (19), we will accommodate the case when the implication (9) cannot be satisfied exactly by the introduction of slack error variables. For now, assuming that the implication (9) holds for a given (w, γ), it follows that (9) is equivalent to: \n\nBx ≤ b, x'w < γ + 1, has no solution x. (10) \n\nThis statement in turn is implied by the following statement: \n\nB'u + w = 0, b'u + γ + 1 ≤ 0, u ≥ 0, has a solution (u, w). (11) \n\nTo see this simple backward implication (10) ⇐ (11), we suppose the contrary, that there exists an x satisfying (10), and obtain the contradiction b'u > b'u as follows: \n\nb'u ≥ u'Bx = −w'x > −γ − 1 ≥ b'u, (12) \n\nwhere the first inequality follows by premultiplying Bx ≤ b by u ≥ 0. In fact, under the natural assumption that the prior knowledge set {x | Bx ≤ b} is nonempty, the forward implication (10) ⟹ (11) is also true, as a direct consequence of the nonhomogeneous Farkas theorem of the alternative [9, Theorem 2.4.8]. We state this equivalence as the following key proposition of our knowledge-based approach. \n\nProposition 2.1 Knowledge Set Classification. Let the set {x | Bx ≤ b} be nonempty. Then for a given (w, γ), the implication (9) is equivalent to the statement (11). In other words, the set {x | Bx ≤ b} lies in the halfspace {x | w'x ≥ γ + 1} if and only if there exists u such that B'u + w = 0, b'u + γ + 1 ≤ 0 and u ≥ 0. \n\nProof. We establish the equivalence of (9) and (11) by showing the equivalence of (10) and (11). 
By the nonhomogeneous Farkas theorem [9, Theorem 2.4.8] we have that (10) is equivalent to either: \n\nB'u + w = 0, b'u + γ + 1 ≤ 0, u ≥ 0, having solution (u, w), (13) \n\nor \n\nB'u = 0, b'u < 0, u ≥ 0, having solution u. (14) \n\nHowever, the second alternative (14) contradicts the nonemptiness of the knowledge set {x | Bx ≤ b}, because for x in this set and u solving (14) we obtain the contradiction: \n\n0 ≥ u'(Bx − b) = x'B'u − b'u = −b'u > 0. (15) \n\nHence (14) is ruled out and we have that (10) is equivalent to (13), which is (11). □ \n\nThis proposition will play a key role in incorporating knowledge sets, such as {x | Bx ≤ b}, into one of two categories in a support vector classifier formulation, as demonstrated in the next section. \n\nFigure 1: (a): A linear SVM separation for 200 points in R² using the linear programming formulation (1). (b): A linear SVM separation for the same 200 points in R² as those in Figure 1(a) but using the linear programming formulation (19), which incorporates three knowledge sets: {x | B¹x ≤ b¹} into the halfspace of A+, and {x | C¹x ≤ c¹}, {x | C²x ≤ c²} into the halfspace of A−, as depicted above. Note the substantial difference between the linear classifiers x'w = γ of both figures. \n\n3 Knowledge-Based SVM Classification \n\nWe describe now how to incorporate prior knowledge in the form of polyhedral sets into our linear programming SVM classifier formulation (1). \n\nWe assume that we are given the following knowledge sets: \n\nk sets belonging to A+: {x | B^i x ≤ b^i}, i = 1, ..., k \nℓ sets belonging to A−: {x | C^j x ≤ c^j}, j = 1, ..., ℓ (16) \n\nIt follows by Proposition 2.1 that, relative to the bounding planes (3): \n\nThere exist u^i, i = 1, ..., k, v^j, j = 1, ..., ℓ, such that: \nB^i' u^i + w = 0, b^i' u^i + γ + 1 ≤ 0, u^i ≥ 0, i = 1, ..., k, \nC^j' v^j − w = 0, c^j' v^j − γ + 1 ≤ 0, v^j ≥ 0, j = 1, ..., ℓ. (17) \n\nWe now incorporate the knowledge sets (16) into the SVM linear programming classifier formulation (1), by adding the conditions (17) as constraints to it as follows: \n\nmin_{w,γ,(y,u^i,v^j) ≥ 0} νe'y + ‖w‖₁ \ns.t. D(Aw − eγ) + y ≥ e, \nB^i' u^i + w = 0, b^i' u^i + γ + 1 ≤ 0, i = 1, ..., k, \nC^j' v^j − w = 0, c^j' v^j − γ + 1 ≤ 0, j = 1, ..., ℓ. (18) \n\nThis linear programming formulation will ensure that each of the knowledge sets {x | B^i x ≤ b^i}, i = 1, ..., k and {x | C^j x ≤ c^j}, j = 1, ..., ℓ lies on the appropriate side of the bounding planes (3). However, there is no guarantee that such bounding planes exist that will precisely separate these two classes of knowledge sets, just as there is no a priori guarantee that the original points belonging to the sets A+ and A− are linearly separable. We therefore add error variables r^i, ρ^i, i = 1, ..., k, s^j, σ^j, j = 1, ..., ℓ, just like the slack error variable y of the SVM formulation (1), and attempt to drive these error variables to zero by modifying our last formulation above as follows: \n\nmin_{w,γ,(y,u^i,r^i,ρ^i,v^j,s^j,σ^j) ≥ 0} νe'y + μ( Σ_{i=1}^{k} (e'r^i + ρ^i) + Σ_{j=1}^{ℓ} (e's^j + σ^j) ) + ‖w‖₁ \ns.t. D(Aw − eγ) + y ≥ e, \n−r^i ≤ B^i' u^i + w ≤ r^i, b^i' u^i + γ + 1 ≤ ρ^i, i = 1, ..., k, \n−s^j ≤ C^j' v^j − w ≤ s^j, c^j' v^j − γ + 1 ≤ σ^j, j = 1, ..., ℓ. (19) \n\nThis is our final knowledge-based linear programming formulation, which incorporates the knowledge sets (16) into the linear classifier with weight μ, while the (empirical) error term e'y is given weight ν. As usual, the values of these two parameters, ν and μ, are chosen by means of a tuning set extracted from the training set. If we set μ = 0 then the linear program (19) degenerates to (1), the linear program associated with an ordinary linear SVM. However, if we set ν = 0, then the linear program (19) generates a linear SVM that is strictly based on knowledge sets, but not on any specific training data. This might be a useful paradigm for situations where training datasets are not easily available, but expert knowledge, such as doctors' experience in diagnosing certain diseases, is readily available. This will be demonstrated in the breast cancer dataset of Section 4. \n\nNote that the 1-norm term ‖w‖₁ can be replaced by one half the 2-norm squared, (1/2)‖w‖₂², which is the usual margin maximization term for ordinary support vector machine classifiers [18, 3]. However, this changes the linear program (19) to a quadratic program, which typically takes longer to solve. \n\nFor standard SVMs, support vectors consist of all data points which are the complement of the data points that can be dropped from the problem without changing the separating plane (5) [18, 11]. Thus for our knowledge-based linear programming formulation (19), support vectors correspond to data points (rows of the matrix A) for which the Lagrange multipliers are nonzero, because solving (19) with these data points only will give the same answer as solving (19) with the entire matrix A. \n\nThe concept of support vectors has to be modified as follows for our knowledge sets. 
Since each knowledge set in (16) is represented by a matrix B^i or C^j, each row of these matrices can be thought of as characterizing a boundary plane of the knowledge set. In our formulation (19) above, such rows are wiped out if the corresponding components of the variables u^i or v^j are zero at an optimal solution. We call the complement of these components of the knowledge sets (16) support constraints. Deleting constraints (rows of B^i or C^j) for which the corresponding components of u^i or v^j are zero will not alter the solution of the knowledge-based linear program (19). This in fact is corroborated by numerical tests that were carried out. Deletion of non-support constraints can be considered a refinement of prior knowledge [17]. Another type of refinement of prior knowledge may occur when the separating plane x'w = γ intersects one of the knowledge sets. In such a case the plane x'w = γ can be added as an inequality to the knowledge set it intersects. This is illustrated in the following example. \n\nWe demonstrate the geometry of incorporating knowledge sets by considering a synthetic example in R² with m = 200 points, 100 of which are in A+ and the other 100 in A−. Figure 1(a) depicts ordinary linear separation using the linear SVM formulation (1). We now incorporate three knowledge sets into the problem: {x | B¹x ≤ b¹} belonging to A+ and {x | C¹x ≤ c¹} and {x | C²x ≤ c²} belonging to A−, and solve our linear program (19) with μ = 100 and ν = 1. We depict the new linear separation in Figure 1(b) and note the substantial change generated in the linear separation by the incorporation of these three knowledge sets. Also note that since the plane x'w = γ intersects the knowledge set {x | B¹x ≤ b¹}, this knowledge set can be refined to {x | B¹x ≤ b¹, w'x ≥ γ}. 
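Proposition 2.1 also gives a direct computational test: whether a nonempty knowledge set {x | Bx ≤ b} lies in the halfspace {x | x'w ≥ γ + 1} reduces to the feasibility of the linear system (11) in u. The sketch below is our own illustration of that check, assuming SciPy is available; it solves a zero-objective feasibility LP.

```python
import numpy as np
from scipy.optimize import linprog

def knowledge_set_in_halfspace(B, b, w, gamma):
    """Proposition 2.1: {x | Bx <= b} (assumed nonempty) lies in
    {x | x'w >= gamma + 1} iff some u >= 0 solves
    B'u + w = 0 and b'u + gamma + 1 <= 0. Check via a feasibility LP."""
    B = np.asarray(B, float)
    b = np.asarray(b, float)
    w = np.asarray(w, float)
    k = B.shape[0]  # one multiplier u_i per knowledge-set constraint
    res = linprog(np.zeros(k),                      # zero objective: pure feasibility
                  A_ub=b.reshape(1, -1), b_ub=[-(gamma + 1.0)],  # b'u <= -(gamma+1)
                  A_eq=B.T, b_eq=-w,                # B'u = -w
                  bounds=[(0, None)] * k, method="highs")
    return res.status == 0  # status 0: feasible optimum; status 2: infeasible
```

For the box 2 ≤ x₁ ≤ 3, 0 ≤ x₂ ≤ 1 and w = (1, 0), the set lies in {x | x₁ ≥ 2} (γ = 1) but not in {x | x₁ ≥ 2.5} (γ = 1.5), and the LP reproduces exactly that.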
\n\n4 Numerical Testing \n\nNumerical tests, which are described in detail in [6], were carried out on the DNA promoter recognition dataset [17] and the Wisconsin prognostic breast cancer dataset WPBC (ftp://ftp.cs.wisc.edu/math-prog/cpo-dataset/machine-learn/cancer/WPBC/). We briefly summarize these results here. \n\nOur first dataset, the promoter recognition dataset, is from the domain of DNA sequence analysis. A promoter, which is a short DNA sequence that precedes a gene sequence, is to be distinguished from a nonpromoter. Promoters are important in identifying starting locations of genes in long uncharacterized sequences of DNA. The prior knowledge for this dataset, which consists of a set of 14 prior rules, matches none of the examples of the training set. Hence these rules by themselves cannot serve as a classifier. However, they do capture significant information about promoters, and it is known that incorporating them into a classifier results in a more accurate classifier [17]. These 14 prior rules were converted in a straightforward manner [6] into 64 knowledge sets. Following the methodology used in prior work [17], we tested our algorithm on this dataset together with the knowledge sets, using a \"leave-one-out\" cross-validation methodology in which the entire training set of 106 elements is repeatedly divided into a training set of size 105 and a test set of size 1. The values of ν and μ associated with both KSVM and SVM₁ [2] were obtained by a tuning procedure which consisted of varying them on a square grid: {2^{-6}, 2^{-5}, ..., 2^6} × {2^{-6}, 2^{-5}, ..., 2^6}. After expressing the prior knowledge in the form of polyhedral sets and applying KSVM, we obtained 5 errors out of 106 (5/106). 
KSVM gave a much better performance than five other methods that do not use prior knowledge: the standard 1-norm support vector machine [2] (9/106), Quinlan's decision tree builder [13] (19/106), the PEBLS nearest-neighbor algorithm [4] with k = 3 (13/106), an empirical method suggested by a biologist based on a collection of \"filters\" to be used for promoter recognition, known as O'Neill's method [12] (12/106), and neural networks with a single layer of hidden units trained using back-propagation [14] (8/106). Except for KSVM and SVM₁, all of these results are taken from an earlier report [17]. KSVM was also compared with [16], where a hybrid learning system maps problem-specific prior knowledge, represented in propositional logic, into neural networks and then refines this reformulated knowledge using back-propagation. This method is known as Knowledge-Based Artificial Neural Networks (KBANN). KBANN was the only approach that performed slightly better than our algorithm; it obtained 4 misclassifications compared to our 5. However, it is important to note that our classifier is a much simpler linear classifier, sign(x'w − γ), while the neural network classifier of KBANN is a considerably more complex nonlinear classifier. Furthermore, we note that KSVM is simpler to implement than KBANN and requires merely a commonly available linear programming solver. In addition, KSVM, which is a linear support vector machine classifier, improves by 44.4% the error of an ordinary linear 1-norm SVM classifier that does not utilize prior knowledge sets. \n\nThe second dataset used in our numerical tests was the Wisconsin breast cancer prognosis dataset WPBC, using a 60-month cutoff for predicting recurrence or nonrecurrence of the disease [2]. 
The prior knowledge utilized in this experiment consisted of the prognosis rules used by doctors [8], which depend on two features from the dataset: tumor size (T) (feature 31), that is, the diameter of the excised tumor in centimeters, and lymph node status (L) (feature 32), which refers to the number of metastasized axillary lymph nodes. The rules are: \n\n(L ≥ 5) ∧ (T ≥ 4) ⟹ RECUR, and (L = 0) ∧ (T ≤ 1.9) ⟹ NONRECUR. \n\nIt is important to note that the rules described above can be applied directly to classify only 32 of the given 110 points of the training dataset, and they correctly classify 22 of these 32 points. The remaining 78 points are not classifiable by the above rules. Hence, if the rules are applied as a classifier by themselves, the classification accuracy on the whole dataset would be 20%. As such, these rules are not very useful by themselves, and doctors use them in conjunction with other rules [8]. However, using our approach, the rules were converted to linear inequalities and used in our KSVM algorithm without any use of the data, i.e. ν = 0 in the linear program (19). The resulting linear classifier in the 2-dimensional space of L(ymph) and T(umor) achieved 66.4% accuracy. The ten-fold cross-validated test set correctness achieved by a standard SVM using all the data is 66.2% [2]. This result is remarkable because our knowledge-based formulation can be applied to problems where training data may not be available whereas expert knowledge may be readily available in the form of knowledge sets. This fact makes this method considerably different from previous hybrid methods like KBANN, where training examples are needed in order to refine prior knowledge. If training data are added to this knowledge-based formulation, no noticeable improvement is obtained. 
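The two prognosis rules translate directly into polyhedral knowledge sets over x = (L, T): each conjunct becomes one row of Bx ≤ b. The sketch below is our own illustration of that conversion (the matrix names are ours; the thresholds come from the rules above), and it also shows why the raw rules leave most points unclassified.

```python
import numpy as np

# RECUR rule:    L >= 5 and T >= 4    ->  -L <= -5  and  -T <= -4
B_recur = np.array([[-1.0, 0.0], [0.0, -1.0]])
b_recur = np.array([-5.0, -4.0])
# NONRECUR rule: L == 0 and T <= 1.9  ->  L <= 0, -L <= 0, and T <= 1.9
B_nonrec = np.array([[1.0, 0.0], [-1.0, 0.0], [0.0, 1.0]])
b_nonrec = np.array([0.0, 0.0, 1.9])

def rule_label(L, T):
    """Apply the raw rules directly: +1 RECUR, -1 NONRECUR,
    None when neither rule fires (the unclassifiable majority)."""
    x = np.array([L, T])
    if np.all(B_recur @ x <= b_recur):
        return 1
    if np.all(B_nonrec @ x <= b_nonrec):
        return -1
    return None
```

A patient with L = 6 metastasized nodes and a 5 cm tumor is labeled RECUR, one with L = 0 and a 1 cm tumor NONRECUR, while intermediate cases such as (L, T) = (2, 3) fall outside both sets; in KSVM the two matrices instead enter (19) as knowledge-set constraints.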
\n\n5 Conclusion & Future Directions \n\nWe have proposed an efficient procedure for incorporating prior knowledge in the form of knowledge sets into a linear support vector machine classifier, either in combination with a given dataset or based solely on the knowledge sets. This novel and promising approach to handling prior knowledge is worthy of further study, especially ways to handle and simplify the combinatorial nature of incorporating prior knowledge into linear inequalities. A class of possible future applications might be problems where training data may not be easily available whereas expert knowledge may be readily available in the form of knowledge sets. This would correspond to solving our knowledge-based linear program (19) with ν = 0. A typical example of this type was breast cancer prognosis [8], where knowledge sets by themselves generated a linear classifier as good as any classifier based on data points. This is a new way of incorporating prior knowledge into powerful support vector machine classifiers. Also, the concept of support constraints, as discussed at the end of Section 3, warrants further study that may lead to a systematic simplification of prior knowledge sets. Other avenues of research include knowledge sets characterized by nonpolyhedral convex sets, as well as nonlinear kernels [18, 11] which are capable of handling more complex classification problems, and the incorporation of prior knowledge into multiple instance learning [1, 5], which might lead to improved classifiers in that field. \n\nAcknowledgments \n\nResearch in this UW Data Mining Institute Report 01-09, November 2001, was supported by NSF Grants CCR-9729842, IRI-9502990 and CDA-9623632, by AFOSR Grant F49620-00-1-0085, by NLM Grant 1 R01 LM07050-01, and by Microsoft. \n\nReferences \n\n[1] P. Auer. On learning from multi-instance examples: Empirical evaluation of a theoretical approach. In Proceedings of the Fourteenth International Conference on Machine Learning, pages 21-29, 1997. \n\n[2] P. S. Bradley and O. L. Mangasarian. Feature selection via concave minimization and support vector machines. In J. Shavlik, editor, Machine Learning Proceedings of the Fifteenth International Conference (ICML '98), pages 82-90, San Francisco, California, 1998. Morgan Kaufmann. ftp://ftp.cs.wisc.edu/math-prog/tech-reports/98-03.ps. \n\n[3] V. Cherkassky and F. Mulier. Learning from Data - Concepts, Theory and Methods. John Wiley & Sons, New York, 1998. \n\n[4] S. Cost and S. Salzberg. A weighted nearest neighbor algorithm for learning with symbolic features. Machine Learning, 10:57-78, 1993. \n\n[5] T. G. Dietterich, R. H. Lathrop, and T. Lozano-Perez. Solving the multiple-instance problem with axis-parallel rectangles. Artificial Intelligence, 89:31-71, 1997. \n\n[6] G. Fung, O. L. Mangasarian, and J. Shavlik. Knowledge-based support vector machine classifiers. Technical Report 01-09, Data Mining Institute, Computer Sciences Department, University of Wisconsin, Madison, Wisconsin, November 2001. ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/01-09.ps. \n\n[7] F. Girosi and N. Chan. Prior knowledge and the creation of \"virtual\" examples for RBF networks. In Neural Networks for Signal Processing, Proceedings of the 1995 IEEE-SP Workshop, pages 201-210, New York, 1995. IEEE Signal Processing Society. \n\n[8] Y.-J. Lee, O. L. Mangasarian, and W. H. Wolberg. Survival-time classification of breast cancer patients. Technical Report 01-03, Data Mining Institute, Computer Sciences Department, University of Wisconsin, Madison, Wisconsin, March 2001. Computational Optimization and Applications, to appear. ftp://ftp.cs.wisc.edu/pub/dmi/tech-reports/01-03.ps. \n\n[9] O. L. Mangasarian. Nonlinear Programming. SIAM, Philadelphia, PA, 1994. \n\n[10] O. L. Mangasarian. Arbitrary-norm separating plane. Operations Research Letters, 24:15-23, 1999. ftp://ftp.cs.wisc.edu/math-prog/tech-reports/97-07r.ps. \n\n[11] O. L. Mangasarian. Generalized support vector machines. In A. Smola, P. Bartlett, B. Scholkopf, and D. Schuurmans, editors, Advances in Large Margin Classifiers, pages 135-146, Cambridge, MA, 2000. MIT Press. ftp://ftp.cs.wisc.edu/math-prog/tech-reports/98-14.ps. \n\n[12] M. C. O'Neill. Escherichia coli promoters: I. Consensus as it relates to spacing class, specificity, repeat substructure, and three-dimensional organization. Journal of Biological Chemistry, 264:5522-5530, 1989. \n\n[13] J. R. Quinlan. Induction of decision trees. Machine Learning, 1:81-106, 1986. \n\n[14] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In D. E. Rumelhart and J. L. McClelland, editors, Parallel Distributed Processing, pages 318-362, Cambridge, Massachusetts, 1986. MIT Press. \n\n[15] B. Scholkopf, P. Simard, A. Smola, and V. Vapnik. Prior knowledge in support vector kernels. In M. Jordan, M. Kearns, and S. Solla, editors, Advances in Neural Information Processing Systems 10, pages 640-646, Cambridge, MA, 1998. MIT Press. \n\n[16] G. G. Towell and J. W. Shavlik. Knowledge-based artificial neural networks. Artificial Intelligence, 70:119-165, 1994. \n\n[17] G. G. Towell, J. W. Shavlik, and M. Noordewier. Refinement of approximate domain theories by knowledge-based artificial neural networks. In Proceedings of the Eighth National Conference on Artificial Intelligence (AAAI-90), pages 861-866, 1990. \n\n[18] V. N. Vapnik. The Nature of Statistical Learning Theory. Springer, New York, second edition, 2000. \n", "award": [], "sourceid": 2222, "authors": [{"given_name": "Glenn", "family_name": "Fung", "institution": null}, {"given_name": "Olvi", "family_name": "Mangasarian", "institution": null}, {"given_name": "Jude", "family_name": "Shavlik", "institution": null}]}