{"title": "Network Structuring and Training Using Rule-based Knowledge", "book": "Advances in Neural Information Processing Systems", "page_first": 871, "page_last": 878, "abstract": null, "full_text": "Network Structuring And Training Using \n\nRule-based Knowledge \n\nVolker Tresp \nSiemens AG \nCentral Research \nOtto-Hahn-Ring 6 \n8000 München 83, Germany \n\nJürgen Hollatz* \nInstitut für Informatik \nTU München \nArcisstraße 21 \n8000 München 2, Germany \n\nSubutai Ahmad \nSiemens AG \nCentral Research \nOtto-Hahn-Ring 6 \n8000 München 83, Germany \n\nAbstract \n\nWe demonstrate in this paper how certain forms of rule-based knowledge can be used to prestructure a neural network of normalized basis functions and give a probabilistic interpretation of the network architecture. We describe several ways to assure that rule-based knowledge is preserved during training and present a method for complexity reduction that tries to minimize the number of rules and the number of conjuncts. After training, the refined rules are extracted and analyzed. \n\n*Mail address: Siemens AG, Central Research, Otto-Hahn-Ring 6, 8000 München 83. \n\n871 \n\n\f872 \n\nTresp, Hollatz, and Ahmad \n\n1 INTRODUCTION \n\nTraining a network to model a high-dimensional input/output mapping with only a small amount of training data is only possible if the underlying map is of low complexity and the network, therefore, can be of low complexity as well. With increasing network complexity, parameter variance increases and the network prediction becomes less reliable. This predicament can be solved if we manage to incorporate prior knowledge to bias the network, as was done by Röscheisen, Hofmann and Tresp (1992). There, prior knowledge was available in the form of an algorithm which summarized the engineering knowledge accumulated over many years. 
Here, we consider the case where prior knowledge is available in the form of a set of rules which specify knowledge about the input/output mapping that the network has to learn. This is a very common situation in industrial and medical applications, where rules can either be given by experts or be extracted from an existing solution to the problem. \n\nThe inclusion of prior knowledge has the additional advantage that if the network is required to extrapolate into regions of the input space where it has not seen any training data, it can rely on this prior knowledge. Furthermore, in many on-line control applications, the network is required to make reasonable predictions right from the beginning. Before it has seen sufficient training data, it has to rely primarily on prior knowledge. \n\nThis situation is also typical of human learning. If we learn a new skill such as driving a car or riding a bicycle, it would be disastrous to start without prior knowledge about the problem. Typically, we are told some basic rules, which we try to follow in the beginning, but which are then refined and altered through experience. The better our initial knowledge about a problem, the faster we can achieve good performance and the less training is required (Towell, Shavlik and Noordewier, 1990). \n\n2 FROM KNOWLEDGE TO NETWORKS \n\nWe consider a neural network y = NN(x) which makes a prediction about the state of y ∈ R given the state of its input x ∈ R^n. We assume that an expert provides information about the same mapping in terms of a set of rules. The premise of a rule specifies the conditions on x under which the conclusion can be applied. This region of the input space is formally described by a basis function b_i(x). 
Instead of allowing only binary values for a basis function (1: premise is valid, 0: premise is not valid), we permit continuous positive values which represent the certainty or weight of a rule given the input. \n\nWe assume that the conclusion of the rule can be described in the form of a mathematical expression, such as conclusion_i: the output is equal to w_i(x), where w_i(x) is a function of the input (or a subset of the input) and can be a constant, a polynomial or even another neural network. \n\nSince several rules can be active for a given state of the input, we define the output of the network to be a weighted average of the conclusions of the active rules, where the weighting factor is proportional to the activity of the basis function given the input: \n\ny(x) = NN(x) = [sum_i w_i(x) b_i(x)] / [sum_j b_j(x)].   (1) \n\nThis is a very general concept since we still have complete freedom to specify the form of the basis function b_i(x) and the conclusion w_i(x). If b_i(x) and w_i(x) are described by neural networks themselves, there is a close relationship with the adaptive mixtures of local experts (Jacobs, Jordan, Nowlan and Hinton, 1991). On the other hand, if we assume that the basis function can be approximated by a multivariate Gaussian, \n\nb_i(x) = K_i exp[-(1/2) sum_j (x_j - mu_ij)^2 / sigma_ij^2],   (2) \n\nand if the w_i are constants, we obtain the network of normalized basis functions which was previously described by Moody and Darken (1989) and Specht (1990). \n\nIn some cases the expert might want to formulate the premise as a simple logical expression. As an example, the rule \n\nIF [(x1 ≈ a) AND (x4 ≈ b)] OR (x2 ≈ c) THEN y = d × x2^2 \n\nis encoded as \n\npremise_i: b_i(x) = exp[-(1/2) ((x1 - a)^2 + (x4 - b)^2) / sigma^2] + exp[-(1/2) (x2 - c)^2 / sigma^2], \nconclusion_i: w_i(x) = d × x2^2. \n\nThis formulation is related to the fuzzy logic approach of Takagi and Sugeno (1985). \n\n3 PRESERVING THE RULE-BASED KNOWLEDGE \n\nEquation 1 can be implemented as a network of normalized basis functions NN_init which describes the rule-based knowledge and which can be used for prediction. Actual training data can be used to improve network performance. We consider four different ways to ensure that the expert knowledge is preserved during training. \n\nForget. We use the data to adapt NN_init with gradient descent (we typically adapt all parameters in the network). The sooner we stop training, the more of the initial expert knowledge is preserved. \n\nFreeze. We freeze the parameters in the initial network and introduce a new basis function whenever prediction and data show a large deviation. In this way the network learns an additive correction to the initial network. \n\nCorrect. Whereas normal weight decay penalizes the deviation of a parameter from zero, we penalize a parameter if it deviates from its initial value q_j^init: \n\nE_p = (1/2) alpha sum_j (q_j - q_j^init)^2,   (3) \n\nwhere q_j is a generic network parameter. \n\nInternal teacher. We formulate a penalty in terms of the mapping rather than in terms of the parameters: \n\nE_p = (1/2) alpha ∫ (NN_init(x) - NN(x))^2 dx. \n\nThis has the advantage that we do not have to specify priors on relatively unintuitive network parameters. Instead, the prior directly reflects the certainty that we associate with the mapping of the initialized network, which can often be estimated. Röscheisen, Hofmann and Tresp (1992) estimated this certainty from problem-specific knowledge. 
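As an illustration, the network of Equations 1 and 2 and the example rule encoding above can be sketched in Python. This is a minimal sketch, not the authors' implementation; the constants a, b, c, d and the shared width sigma are hypothetical values chosen for the demo:

```python
import numpy as np

# Minimal sketch of the normalized-basis-function network (Eqs. 1 and 2).
# Each rule i contributes a basis function b_i(x) (its premise) and a
# conclusion w_i(x); the prediction is the activity-weighted average.

def gaussian_basis(x, mu, sigma, kappa=1.0):
    """b_i(x) = K_i exp(-0.5 * sum_j ((x_j - mu_ij) / sigma_ij)^2)   (Eq. 2)"""
    x, mu = np.asarray(x, float), np.asarray(mu, float)
    return kappa * np.exp(-0.5 * np.sum(((x - mu) / sigma) ** 2))

def predict(x, basis_fns, conclusions):
    """y(x) = sum_i w_i(x) b_i(x) / sum_j b_j(x)   (Eq. 1)"""
    acts = np.array([b(x) for b in basis_fns])
    w = np.array([c(x) for c in conclusions])
    return float(np.dot(w, acts) / np.sum(acts))

# Encode the example rule
#   IF [(x1 ~ a) AND (x4 ~ b)] OR (x2 ~ c) THEN y = d * x2^2
# with hypothetical constants:
a, b0, c, d, sigma = 1.0, 2.0, 0.5, 3.0, 0.3
basis_fns = [
    # AND -> joint Gaussian over (x1, x4); OR -> sum of the two branches
    lambda x: gaussian_basis(x[[0, 3]], [a, b0], sigma)
              + gaussian_basis(x[[1]], [c], sigma),
]
conclusions = [lambda x: d * x[1] ** 2]

x = np.array([1.0, 0.5, 0.0, 2.0])
print(predict(x, basis_fns, conclusions))  # -> 0.75 (= d * x2^2, single rule)
```

With a single rule the normalization in Equation 1 cancels and the network simply returns that rule's conclusion; with several overlapping rules the output interpolates between their conclusions according to the basis-function activities.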
We can approximate the integral in this penalty numerically by Monte Carlo integration, which leads to a training procedure where we adapt the network with a mixture of measured training data and training data artificially generated by NN_init(x) at randomly chosen inputs. The mixing proportion directly relates to the weight of the penalty, alpha (Röscheisen, Hofmann and Tresp, 1992). \n\n4 COMPLEXITY REDUCTION \n\nAfter training, the rules can be extracted again from the network, but we have to ensure that the set of rules is as concise as possible; otherwise the value of the extracted rules is limited. We would like to find the smallest number of rules that can still describe the knowledge sufficiently. Also, the network should be encouraged to find rules with the smallest number of conjuncts, which in this case means that a basis function depends on only a small number of input dimensions. \n\nWe suggest the following pruning strategy for Gaussian basis functions. \n\n1. Prune basis functions. Evaluate the relative weight of each basis function at its center, w_i = b_i(mu_i) / sum_j b_j(mu_i), which is a measure of its importance in the network. Remove the unit with the smallest w_i. Figure 1 illustrates the pruning of basis functions. \n\n2. Prune conjuncts. Successively set the largest sigma_ij equal to infinity, effectively removing input j from basis function i. \n\nSequentially remove basis functions and conjuncts until the error increases above a threshold. Retrain after a unit or a conjunct is removed. \n\n5 A PROBABILISTIC INTERPRETATION \n\nOne of the advantages of our approach is that there is a probabilistic interpretation of the system. In addition, if the expert formulates his or her knowledge in terms of probability distributions, then a number of useful properties can be derived (it is natural here to interpret probability as a subjective degree of belief in an event). 
We assume that the system can be in a number of states s_i which are unobservable. Formally, each of those hidden states corresponds to a rule. The prior probability that the system is in state s_i is equal to P(s_i). Assuming that the system is in state s_i, there is a probability distribution P(x, y|s_i) that we measure an input vector x and an output y, and \n\nP(x, y) = sum_i P(x, y|s_i) P(s_i).   (4) \n\nFor every rule the expert specifies the probability distributions in the last sum. Let's consider the case that P(x, y|s_i) = P(x|s_i) P(y|s_i) and that P(x|s_i) and P(y|s_i) can be approximated by Gaussians. In this case Equation 4 describes a Gaussian mixture model. For every rule, the expert has to specify \n\n• P(s_i), the probability of the occurrence of state s_i (the overall weight of the rule), \n• P(x|s_i) = N_n(x; mu_i, Sigma_i), the probability that an input vector x occurs, given that the system is in state s_i, and \n• P(y|s_i) = N_1(y; w_i, sigma_i^2), the probability of output y given state s_i. \n\nFigure 1: 80 values of a noisy sinusoid (A) are presented as training data to a network of 20 (Cauchy) basis functions, b_i(x) = K_i [1 + sum_j (x_j - mu_ij)^2 / sigma_ij^2]^(-2). (B) shows how this network also tries to approximate the noise in the data. (D) shows the basis functions b_i(x) and (F) the normalized basis functions b_i(x) / sum_j b_j(x). Pruning reduces the network architecture to 5 units placed at the extrema of the sinusoid (basis functions: E, normalized basis functions: G). The network output is shown in (C). \n\nFigure 2: Left: the two rectangles indicate centers and standard deviations of two Gaussians that approximate a density. Right: the figure shows the expected values E(y|x) (continuous line) and E(x|y) (dotted line). \n\nThe evidence for a state given an input x becomes \n\nP(s_i|x) = P(x|s_i) P(s_i) / sum_j P(x|s_j) P(s_j), \n\nand the expected value of the output is \n\nE(y|x) = sum_i [∫ y P(y|x, s_i) dy] P(x|s_i) P(s_i) / sum_j P(x|s_j) P(s_j),   (5) \n\nwhere P(x|s_i) = ∫ P(x, y|s_i) dy. If we substitute b_i(x) = P(x|s_i) P(s_i) and w_i(x) = ∫ y P(y|x, s_i) dy, we can calculate the expected value of y using the same architecture as described in Equation 1. \n\nSubsequent training data can be employed to improve the model. The likelihood of the data {x^k, y^k} becomes \n\nL = prod_k sum_i P(x^k, y^k|s_i) P(s_i), \n\nwhich can be maximized using gradient descent or EM. These adaptation rules are more complicated than supervised learning since, according to our model, the data generating process also makes assumptions about the distributions of the data in the input space. \n\nEquation 4 gives an approximation of the joint probability density of input and output. Input and output are formally equivalent (Figure 2) and, in the case of Gaussian mixtures, we can easily calculate the optimal output given just a subset of inputs (Ahmad and Tresp, 1993). 
A number of authors used clustering and Gaussian approximation on the input space alone, and resources were distributed according to the complexity of the input space. In this method, resources are distributed according to the complexity of both input and output space. (Note that a probabilistic interpretation is only possible if the integral over a basis function is finite, i.e. all variances are finite.) \n\n6 CLASSIFICATION \n\nA conclusion now specifies the correct class. Let {b_ik | i = 1 ... N_k} denote the set of basis functions whose conclusions specify class_k. We set w_ij^k = delta_kj, where w_ij^k is the weight from basis function b_ij to the kth output and delta_kj is the Kronecker symbol. The kth output of the network, \n\ny_k(x) = NN_k(x) = sum_ij w_ij^k b_ij(x) / sum_lm b_lm(x) = sum_i b_ik(x) / sum_lm b_lm(x),   (6) \n\nspecifies the certainty of class_k, given the input. During training, we do not adapt the output weights w_ij^k. Therefore, the outputs of the network are always positive and sum to one. \n\nA probabilistic interpretation can be found if we assume that P(x|class_k) P(class_k) ≈ sum_i b_ik(x). We obtain \n\nP(class_k|x) = P(x|class_k) P(class_k) / sum_l P(x|class_l) P(class_l) \n\nand recover Equation 6. If the basis functions are Gaussians, we again obtain a Gaussian mixture learning problem and, as a special case (one unit per class), a Gaussian classifier. \n\n7 APPLICATIONS \n\nWe have validated our approach on a number of applications, including a network that learned how to control a bicycle and an application in the legal sciences (Hollatz and Tresp, 1992). Here we present results for a well-known data set, the Boston housing data (Breiman et al., 1981), and demonstrate pruning and rule extraction. The task is to predict the housing price in a Boston neighborhood as a function of 13 potentially relevant input features. 
We started with 20 Gaussian basis functions which were adapted using gradient descent. We achieved a generalization error of 0.074. We then pruned units and conjuncts according to the procedure described in Section 4. We achieved the best generalization error (0.058) using 4 units (this is approximately 10% better than the result reported for CART in Breiman et al., 1981). With only two basis functions and 3 conjuncts, we still achieved reasonable prediction accuracy (generalization error of 0.12; simply predicting the mean results in a generalization error of 0.28). Table 1 describes the final network. Interestingly, our network was left with the input features which CART also considered the most relevant. \n\nThe network was trained with normalized inputs. If we translate them back into real-world values, we obtain the rules: \n\nRule_14: IF the number of rooms (RM) is approximately 5.4 (0.62 corresponds to 5.4 rooms, which is smaller than the average of 6.3) AND the pupil/teacher ratio is approximately 20.2 (0.85 corresponds to 20.2 pupils/teacher, which is higher than the average of 18.4) THEN the value of the home is approximately $14000 (0.528 corresponds to $14000, which is lower than the average of $22500). \n\nRule_20: IF the percentage of lower-status population (LSTAT) is approximately 2.5% (0.06 corresponds to 2.5%, which is lower than the average of 12.65%) THEN the value of the home is approximately $34000 (1.6 corresponds to $34000, which is higher than the average of $22500). \n\nTable 1: Network structure after pruning. \n\nunit i | K_i  | conclusion w_i | feature j | center mu_ij | width sigma_ij | CART rating \ni = 14 | 0.17 | 0.528          | RM        | 0.62         | 0.21           | second \n       |      |                | P/T       | 0.85         | 0.35           | third \ni = 20 | 0.83 | 1.6            | LSTAT     | 0.06         | 0.24           | most important 
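The pruning procedure of Section 4 that produced this reduced network can be sketched as follows. This is a hypothetical parameter layout, not the authors' code: the relative weight of each unit is evaluated at its center and the least important unit is removed, and a conjunct is pruned by setting its width to infinity.

```python
import numpy as np

# Sketch of Section 4's pruning heuristics for Gaussian basis functions.
# mus[i], sigmas[i] hold the center and width vectors of unit i (hypothetical).

def basis(x, mu, sigma):
    z = (x - mu) / sigma            # sigma_ij = inf => input j is ignored
    return np.exp(-0.5 * np.sum(z ** 2))

def relative_weights(mus, sigmas):
    """w_i = b_i(mu_i) / sum_j b_j(mu_i): importance of unit i at its center."""
    return np.array([
        basis(mu_i, mus[i], sigmas[i])
        / sum(basis(mu_i, mus[j], sigmas[j]) for j in range(len(mus)))
        for i, mu_i in enumerate(mus)])

def prune_least_important_unit(mus, sigmas):
    keep = np.ones(len(mus), dtype=bool)
    keep[np.argmin(relative_weights(mus, sigmas))] = False
    return mus[keep], sigmas[keep]

def prune_largest_conjunct(sigmas):
    """Set the largest finite sigma_ij to infinity, removing that conjunct."""
    masked = np.where(np.isfinite(sigmas), sigmas, -np.inf)
    i, j = np.unravel_index(np.argmax(masked), sigmas.shape)
    out = sigmas.copy()
    out[i, j] = np.inf
    return out

# Two nearly coincident units and one distant unit:
mus = np.array([[0.0, 0.0], [0.1, 0.1], [5.0, 5.0]])
sigmas = np.ones((3, 2))
mus2, sigmas2 = prune_least_important_unit(mus, sigmas)
print(len(mus2))  # -> 2: one of the overlapping units near the origin is gone
```

In the paper, pruning alternates with retraining and stops once the error rises above a threshold; the retraining step is omitted in this sketch.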
8 CONCLUSION \n\nWe demonstrated how rule-based knowledge can be incorporated into the structuring and training of a neural network. Training with experimental data allows for rule refinement. Rule extraction provides a quantitative interpretation of what is \"going on\" in the network, although, in general, it is difficult to define the domain where a given rule \"dominates\" the network response and along which boundaries the rules partition the input space. \n\nAcknowledgements \n\nWe acknowledge valuable discussions with Ralph Neuneier and his support in the Boston housing data application. V. T. was supported in part by a grant from the Bundesminister für Forschung und Technologie and J. H. by a fellowship from Siemens AG. \n\nReferences \n\nS. Ahmad and V. Tresp. Some solutions to the missing feature problem in vision. This volume, 1993. \n\nL. Breiman et al. Classification and regression trees. Wadsworth and Brooks, 1981. \n\nJ. Hollatz and V. Tresp. A rule-based network architecture. Artificial Neural Networks II, I. Aleksander, J. Taylor, eds., Elsevier, Amsterdam, 1992. \n\nR. A. Jacobs, M. I. Jordan, S. J. Nowlan and G. E. Hinton. Adaptive mixtures of local experts. Neural Computation, Vol. 3, pp. 79-87, 1991. \n\nJ. Moody and C. Darken. Fast learning in networks of locally-tuned processing units. Neural Computation, Vol. 1, pp. 281-294, 1989. \n\nM. Röscheisen, R. Hofmann and V. Tresp. Neural control for rolling mills: incorporating domain theories to overcome data deficiency. In: Advances in Neural Information Processing Systems 4, 1992. \n\nD. F. Specht. Probabilistic neural networks. Neural Networks, Vol. 3, pp. 109-117, 1990. \n\nT. Takagi and M. Sugeno. Fuzzy identification of systems and its applications to modeling and control. IEEE Transactions on Systems, Man and Cybernetics, Vol. 15, No. 1, pp. 116-132, 1985. \n\nG. G. Towell, J. W. Shavlik and M. O. Noordewier. 
Refinement of approximately correct domain theories by knowledge-based neural networks. In Proceedings of the Eighth National Conference on Artificial Intelligence, pp. 861-866, MA, 1990. \n", "award": [], "sourceid": 638, "authors": [{"given_name": "Volker", "family_name": "Tresp", "institution": null}, {"given_name": "J\u00fcrgen", "family_name": "Hollatz", "institution": null}, {"given_name": "Subutai", "family_name": "Ahmad", "institution": null}]}