{"title": "An Information Theoretic Approach to Rule-Based Connectionist Expert Systems", "book": "Advances in Neural Information Processing Systems", "page_first": 256, "page_last": 263, "abstract": null, "full_text": "256 \n\nAN INFORMATION THEORETIC APPROACH TO RULE-BASED CONNECTIONIST EXPERT SYSTEMS \n\nRodney M. Goodman, John W. Miller \nDepartment of Electrical Engineering \nCaltech 116-81 \nPasadena, CA 91125 \n\nPadhraic Smyth \nCommunication Systems Research \nJet Propulsion Laboratory 238-420 \n4800 Oak Grove Drive \nPasadena, CA 91109 \n\nAbstract \n\nWe discuss in this paper architectures for executing probabilistic rule-bases in a parallel manner, using as a theoretical basis recently introduced information-theoretic models. We will begin by describing our (non-neural) learning algorithm and theory of quantitative rule modelling, followed by a discussion of the exact nature of two particular models. Finally, we work through an example of our approach, going from database to rules to inference network, and compare the network's performance with the theoretical limits for specific problems. \n\nIntroduction \n\nWith the advent of relatively cheap mass storage devices it is common in many domains to maintain large databases or logs of data, e.g., in telecommunications, medicine, finance, etc. The question naturally arises as to whether we can extract models from the data in an automated manner and use these models as the basis for an autonomous rational agent in the given domain, i.e., automatically generate \"expert systems\" from data. There are really two aspects to this problem: firstly, learning a model and, secondly, performing inference using this model. What we propose in this paper is a rather novel and hybrid approach to learning and inference. 
Essentially, we combine the qualitative knowledge representation ideas of AI with the distributed computational advantages of connectionist models, using an underlying theoretical basis tied to information theory. The knowledge representation formalism we adopt is the rule-based representation, a scheme which is well supported by cognitive scientists and AI researchers for modeling higher-level symbolic reasoning tasks. We have recently developed an information-theoretic algorithm called ITRULE which extracts an optimal set of probabilistic rules from a given data set [1, 2, 3]. It must be emphasised that we do not use any form of neural learning such as backpropagation in our approach. To put it simply, the ITRULE learning algorithm is far more computationally direct and better understood than (say) backpropagation for this particular learning task of finding the most informative individual rules without reference to their collective properties. Performing useful inference with this model, or set of rules, is quite a difficult problem. Exact theoretical schemes such as maximum entropy (ME) are intractable for real-time applications. \n\nAn Information Theoretic Approach to Expert Systems \n\n257 \n\nWe have been investigating schemes where the rules represent links on a directed graph and the nodes correspond to propositions, i.e., variable-value pairs. Our approach is characterised by loosely connected, multiple-path (arbitrary topology) graph structures, with nodes performing local non-linear decisions as to their true state based on both supporting evidence and their a priori bias. What we have in fact is a recurrent neural network. What is different about this approach compared to a standard connectionist model as learned by a weight-adaptation algorithm such as BP? The difference lies in the semantics of the representation [4]. 
Weights such as log-odds ratios based on log transformations of probabilities possess a clear meaning to the user, as indeed do the nodes themselves. This explicit representation of knowledge is a key requirement for any system which purports to perform reasoning, probabilistic or otherwise. Conversely, the lack of explicit knowledge representation in most current connectionist approaches, i.e., the \"black box\" syndrome, is a major limitation to their application in critical domains where user-confidence and explanation facilities are key criteria for deployment in the field. \n\nLearning the model \n\nConsider that we have M observations or samples available, e.g., the number of items in a database. Each sample datum is described in terms of N attributes or features, which can assume values in a corresponding set of N discrete alphabets. For example, our data might be described in the form of 10-component binary vectors. The requirement for discrete rather than continuous-valued attributes is dictated by the very nature of the rule-based representation. In addition, it is important to note that we do not assume that the sample data is somehow exhaustive and \"correct.\" There is a tendency in both the neural network and AI learning literature to analyse learning in terms of learning a Boolean function from a truth table. The implicit assumption is often made that, given enough samples and a good enough learning algorithm, we can always learn the function exactly. This is a fallacy, since it depends on the feature representation. For any problem of interest there are always hidden causes with a consequent non-zero Bayes misclassification risk, i.e., the function is dependent on non-observable features (unseen columns of the truth table). Only in artificial problems such as game playing is \"perfect\" classification possible; in practical problems nature hides the real features. This phenomenon is well known in the statistical pattern recognition literature and renders invalid those schemes which simply try to perfectly classify or memorise the training data. \n\nWe use the following simple model of a rule, i.e., \n\nIf Y = y then X = x with probability p \n\nwhere X and Y are two attributes (random variables) with \"x\" and \"y\" being values in their respective discrete alphabets. Given sample data as described earlier, we pose the problem as follows: can we find the \"best\" rules from a given data set, say the K best rules? We will refer to this problem as that of generalised rule induction, in order to distinguish it from the special case of deriving classification rules. \n\n258 \n\nGoodman, Miller and Smyth \n\nClearly we require both a preference measure to rank the rules and a learning algorithm which uses the preference measure to find the K best rules. \n\nLet us define the information which the event y yields about the variable X, say j(X;y). Based on the requirements that j(X;y) is both non-negative and that its expectation with respect to Y equals the average mutual information I(X;Y), Blachman [5] showed that the only such function is the j-measure, which is defined as \n\nj(X;y) = p(x|y) log( p(x|y)/p(x) ) + p(x̄|y) log( p(x̄|y)/p(x̄) ) \n\nMore recently we have shown that j(X;y) possesses unique properties as a rule information measure [6]. In general, the j-measure is the average change in bits required to specify X between the a priori distribution p(X) and the a posteriori distribution p(X|y). It can also be interpreted as a special case of the cross-entropy or binary discrimination (Kullback [7]) between these two distributions. We further define J(X;y) as the average information content, where J(X;y) = p(y) · j(X;y). 
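To make the two measures concrete, here is a small Python sketch (our own illustration, not part of the original ITRULE code) that computes j(X;y) and J(X;y) in bits for a two-valued attribute X:

```python
import math

def j_measure(p_x, p_x_given_y):
    """Instantaneous rule information j(X;y) in bits for a two-valued
    attribute X: the change between the a priori distribution p(X)
    and the a posteriori distribution p(X|y)."""
    total = 0.0
    for q, p in ((p_x_given_y, p_x), (1.0 - p_x_given_y, 1.0 - p_x)):
        if q > 0.0:  # q * log(q/p) -> 0 as q -> 0
            total += q * math.log2(q / p)
    return total

def J_measure(p_y, p_x, p_x_given_y):
    """Average information content J(X;y) = p(y) * j(X;y)."""
    return p_y * j_measure(p_x, p_x_given_y)
```

A rule whose posterior equals the prior carries no information (j = 0), while a confident rule about an a priori unlikely conclusion scores highly; J then discounts rules whose left-hand side rarely fires.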
J(X;y) simply weights the instantaneous rule information j(X;y) by the probability that the left-hand side will occur, i.e., that the rule will be fired. This definition is motivated by considerations of learning useful rules in a resource-constrained environment. A rule with high information content must both be a good predictor and have a reasonable probability of being fired, i.e., p(y) cannot be too small. Interestingly enough, our definition of J(X;y) possesses a well-defined interpretation in terms of classical induction theory, trading off hypothesis simplicity against the goodness-of-fit of the hypothesis to the data [8]. \n\nThe ITRULE algorithm [1, 2, 3] uses the J-measure to derive the most informative set of rules from an input data set. The algorithm produces a set of K probabilistic rules, ranked in order of decreasing information content. The parameter K may be user-defined or determined via some statistical significance test based on the size of the sample data set available. The algorithm searches the space of possible rules, trading off generality of the rules against their predictiveness, and using information-theoretic bounds to constrain the search space. \n\nUsing the Model to Perform Inference \n\nHaving learned the model, we now have at our disposal a set of lower-order constraints on the N-th order joint distribution, in the form of probabilistic rules. This is our a priori model. In a typical inference situation we are given some initial conditions (i.e., some nodes are clamped), we are allowed to measure the state of some other nodes (possibly at a cost), and we wish to infer the state or probability of one or more goal propositions or nodes from the available evidence. 
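As a rough illustration of generalised rule induction, the following sketch exhaustively scores all first-order rules \"If Y = y then X = x\" by their J-measure and returns the K best. It is our own simplification: the actual ITRULE algorithm also considers higher-order conjunctive rules and prunes the search with information-theoretic bounds, and the data layout here (a list of attribute dictionaries) is an assumption of this sketch.

```python
import math
from itertools import product

def find_best_rules(samples, K=5):
    """Rank all first-order rules 'If Y = y then X = x' by the J-measure
    J(X;y) = p(y) * j(X;y), with probabilities estimated from `samples`,
    a list of dicts mapping attribute names to discrete values."""
    M = len(samples)
    values = {a: sorted({s[a] for s in samples}) for a in samples[0]}
    scored = []
    for Y, X in product(values, values):
        if Y == X:
            continue
        for y, x in product(values[Y], values[X]):
            m_y = sum(1 for s in samples if s[Y] == y)
            m_x = sum(1 for s in samples if s[X] == x)
            m_xy = sum(1 for s in samples if s[Y] == y and s[X] == x)
            if m_y == 0:
                continue
            p_y, p_x, p_x_y = m_y / M, m_x / M, m_xy / m_y
            j = 0.0
            for q, p in ((p_x_y, p_x), (1.0 - p_x_y, 1.0 - p_x)):
                if q > 0.0 and p > 0.0:  # q*log(q/p) -> 0 as q -> 0
                    j += q * math.log2(q / p)
            scored.append((p_y * j, Y, y, X, x, p_x_y))
    scored.sort(reverse=True)
    return scored[:K]
```

On data where one attribute determines another, the top-ranked rules are exactly the deterministic implications between them.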
It is important to note that this is a much more difficult and general problem than classification of a single, fixed goal variable, since both the initial conditions and the goal propositions may vary considerably from one problem instance to the next. This is the inference problem: determining an a posteriori distribution in the face of incomplete and uncertain information. The exact maximum entropy solution to this problem is intractable and, despite the elegance of the problem formulation, stochastic relaxation techniques (Geman [9]) are at present impractical for real-time robust applications. Our motivation then is to perform an approximation to exact Bayesian inference in a robust manner. With this in mind we have developed two particular models, which we describe as the hypothesis testing network and the uncertainty network. \n\nPrinciples of the Hypothesis Testing Network \n\nIn the first model under consideration, each directed link from y to x is assigned a weight corresponding to the weight of evidence of y on x. This idea is not necessarily new, although our interpretation and approach differ from previous work [10, 4]. Hence we have \n\nW_xy = log( p(x|y)/p(x) ) - log( p(x̄|y)/p(x̄) )   and   R_x = -log( p(x)/p(x̄) ) \n\nand the node x is assigned a threshold term corresponding to its a priori bias. We use a sigmoidal activation function, i.e., \n\na(x) = 1 / (1 + e^(-ΔE_x/T)), where ΔE_x = Σ_{i=1..n} W_{xy_i} · a(y_i) - R_x \n\nbased on multiple binary inputs y_1 ... y_n to x. Let S be the set of all y_i which are hypothesised true (i.e., a(y_i) = 1), so that \n\nΔE_x = log( p(x)/p(x̄) ) + Σ_{y_i ∈ S} [ log( p(x|y_i)/p(x) ) - log( p(x̄|y_i)/p(x̄) ) ] \n\nIf each y_i is conditionally independent given x, then we can write \n\np(x|S)/p(x̄|S) = ( p(x)/p(x̄) ) · Π_{y_i ∈ S} [ ( p(x|y_i) p(x̄) ) / ( p(x̄|y_i) p(x) ) ] \n\nTherefore the updating rule for conditionally independent y_i is \n\nT · log( a(x)/(1 - a(x)) ) = log( p(x|S)/(1 - p(x|S)) ) \n\nHence a(x) > 1/2 iff p(x|S) > 1/2, and if T = 1, a(x) is exactly p(x|S). In terms of a hypothesis test, a(x) is chosen true iff \n\nΣ_{y_i ∈ S} log( p(x|y_i)/p(x̄|y_i) ) ≥ -log( p(x)/p(x̄) ) \n\nSince this describes the Neyman-Pearson decision region for independent measurements (the evidence y_i) with R_x = -log( p(x)/p(x̄) ) [11], this model can be interpreted as a distributed form of hypothesis testing. \n\nPrinciples of the Uncertainty Network \n\nFor this model we define the weight on a directed link from y_i to x as \n\nW_{xy_i} = s_i · j(X;y_i) = s_i · [ p(x|y_i) log( p(x|y_i)/p(x) ) + p(x̄|y_i) log( p(x̄|y_i)/p(x̄) ) ] \n\nwhere s_i = ±1 and the threshold is the same as in the hypothesis testing model. We can interpret W_{xy_i} as the change in bits required to specify the a posteriori distribution of x. If p(x|y_i) > p(x), then W_{xy_i} has positive support for x, i.e., s_i = +1; if p(x|y_i) < p(x), then W_{xy_i} has negative support for x, i.e., s_i = -1. If we interpret the activation a(y_i) as an estimator p̂(y_i) of p(y_i), then for multiple inputs \n\nΔE_x = Σ_i p̂(y_i) · s_i · [ p(x|y_i) log( p(x|y_i)/p(x) ) + p(x̄|y_i) log( p(x̄|y_i)/p(x̄) ) ] \n\nThis sum over input links weighted by activation functions can be interpreted as the total directional change in bits required to specify x, as calculated locally by the node x. One can normalise ΔE_x to obtain an average change in bits by dividing by a suitable temperature T. The node x can then make a local decision by recovering p(x) from an inverse J-measure transformation of ΔE_x (the sigmoid is an approximation to this inverse function). 
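A single node update in the hypothesis testing network can be sketched as follows. This is our own illustrative code, not the paper's simulator; the weights are in nats here, which simply absorbs the log-base constant into the temperature T.

```python
import math

def node_update(p_x, evidence, T=1.0):
    """One node update in the hypothesis-testing network.

    p_x      -- a priori probability of the proposition x
    evidence -- list of (p_x_given_y, a_y) pairs, one per input link y_i,
                where a_y is the current activation a(y_i) in [0, 1]
    Returns the new activation a(x) = sigmoid(dE_x / T)."""
    q_x = 1.0 - p_x
    R_x = -math.log(p_x / q_x)  # a priori bias threshold
    dE = -R_x
    for p_xy, a_y in evidence:
        # weight of evidence of y on x: difference of log-odds terms
        w = math.log(p_xy / p_x) - math.log((1.0 - p_xy) / q_x)
        dE += w * a_y
    return 1.0 / (1.0 + math.exp(-dE / T))
```

With T = 1 and a single fully-on, conditionally independent input, the activation recovers the exact posterior (node_update(0.3, [(0.8, 1.0)]) is approximately 0.8), and with no evidence the node settles at its prior.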
\n\nExperimental Results and Conclusions \n\nIn this section we show how rules can be generated from example data and automatically incorporated into a parallel inference network that takes the form of a multi-layer neural network. The network can then be \"run\" to perform parallel inference. The domain we consider is that of a financial database of mutual funds, using published statistical data [12]. The approach is, however, typical of many different real-world domains. \n\nFigure 1 shows a portion of a set of typical raw data on no-load mutual funds. Each line is an instance of a fund (with name omitted), and each column represents an attribute (or feature) of the fund. Attributes can be numerical or categorical. Typical categorical attributes are the fund type, which reflects the investment objectives of the fund (growth, growth and income, balanced, and aggressive growth); a typical numerical attribute is the five-year return on investment expressed as a percentage. There are a total of 88 fund examples in this data set. From this raw data a second, quantized set of the 88 examples is produced to serve as the input to ITRULE (Figure 2). In this example the attributes have been categorised to binary values so that they can be directly implemented as binary neurons. The ITRULE software then processes this table to produce a set of rules. The rules are ranked in order of decreasing information according to the J-measure. Figure 3 shows a portion (the top ten rules) of the ITRULE output for the mutual fund data set. The hypothesis test log-likelihood metric h(X;y), the instantaneous j-measure j(X;y), and the average J-measure J(X;y) are all shown, together with the rule transition probability p(x|y). \n\nIn order to perform inference with the ITRULE rules we need to map the rules into a neural inference net. 
This is automatically done by ITRULE, which generates a network file that can be loaded into a neural network simulator. Thus rule information metrics become connection weights. Figure 4 shows a typical network derived from the ITRULE rule output for the mutual funds data. For clarity, not all the connections are shown. The architecture consists of two layers of neurons (or \"units\"): an input layer and an output layer, both of which have an activation within the range [0,1]. There is one unit in the input layer (and a corresponding unit in the output layer) for each attribute in the mutual funds data. The output feeds back to the input layer, and each layer is synchronously updated. The output units can be considered to be the right-hand sides of the rules and thus receive inputs from many rules, where the strength of each connection is the rule's metric. The output units implement a sigmoid activation function on the sum of the inputs, and thus compute an activation which is an estimator of the a posteriori value of the right-hand side attribute. The input units simply pass this value on to the output layer and thus have a linear activation. \n\nTo perform inference on the network, a probe vector of attribute values is loaded into the input and output layers. Known values are clamped and cannot change, while unknown or desired attribute values are free to change. The network then relaxes and after several feedback cycles converges to a solution which can be read off the input or output units. To evaluate the models we set up four standard classification tests with varying numbers of nodes clamped as inputs. Unclamped nodes were set to their a priori probability. After relaxing the network, the activation of the \"target\" node was compared with the true attribute values for that sample in order to determine classification performance. 
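The clamp-and-relax procedure just described can be sketched as a synchronous update loop. This is our own illustrative code under the assumption of a simple dense weight matrix, not the simulator used in the paper:

```python
import math

def relax(weights, bias, clamped, init, T=1.0, n_iter=20):
    """Synchronous relaxation of a rule network.

    weights[i][j] -- rule-metric connection weight from unit j to unit i
    bias[i]       -- a priori threshold term R_i for unit i
    clamped       -- dict {unit index: fixed activation} for known values
    init          -- initial activations (a priori probabilities) of all units
    Returns the activation vector after n_iter feedback cycles."""
    a = list(init)
    for i, v in clamped.items():
        a[i] = v
    for _ in range(n_iter):
        new = []
        for i in range(len(a)):
            if i in clamped:
                new.append(clamped[i])  # known values cannot change
            else:
                dE = sum(w * a[j] for j, w in enumerate(weights[i])) - bias[i]
                new.append(1.0 / (1.0 + math.exp(-dE / T)))
        a = new
    return a
```

Free units are read off after convergence; for example, a unit fed only by a clamped unit through weight 2.0, with bias 0.5, settles at sigmoid(1.5).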
The two models were each trained on 10 randomly selected sets of 44 samples. The performance results given in Table 1 are the average classification rate of the models on the other 44 unseen samples. The Bayes risk (for a uniform loss matrix) of each classification test was calculated from the 88 samples. The actual performance of the networks occasionally exceeded this value due to small sample variations on the 44/44 cross-validations. \n\nTable 1 \n\nUnits Clamped | Uncertainty Test | Hypothesis Test | 1 - Bayes' Risk \n9 | 66.8% | 70.4% | 88.6% \n5 | 70.1% | 70.1% | 80.6% \n2 | 48.2% | 63.0% | 63.6% \n1 | 51.4% | 65.7% | 64.8% \n\nWe conclude from the performance of the networks as classifiers that they have indeed learned a model of the data using a rule-based representation. The hypothesis network performs slightly better than the uncertainty model, with both being quite close to the estimated optimal rate (the Bayes' risk). Given that we know the independence assumptions in both models do not hold exactly, we coin the term robust inference to describe this kind of accurate behaviour in the presence of incomplete and uncertain information. Based on these encouraging initial results, our current research is focusing on higher-order rule networks and on extending our theoretical understanding of models of this nature. \n\nAcknowledgments \n\nThis work is supported in part by a grant from Pacific Bell, and by Caltech's program in Advanced Technologies, sponsored by Aerojet General, General Motors and TRW. Part of the research described in this paper was carried out by the Jet Propulsion Laboratory, California Institute of Technology, under a contract with the National Aeronautics and Space Administration. John Miller is supported by NSF grant no. ENG-8711673. \n\nReferences \n\n1. R. M. Goodman and P. 
Smyth, 'An information theoretic model for rule-based expert systems,' presented at the 1988 International Symposium on Information Theory, Kobe, Japan. \n\n2. R. M. Goodman and P. Smyth, 'Information theoretic rule induction,' Proceedings of the 1988 European Conference on AI, Pitman Publishing: London. \n\n3. R. M. Goodman and P. Smyth, 'Deriving rules from databases: the ITRULE algorithm,' submitted for publication. \n\n4. H. Geffner and J. Pearl, 'On the probabilistic semantics of connectionist networks,' Proceedings of the 1987 IEEE ICNN, vol. II, pp. 187-195. \n\n5. N. M. Blachman, 'The amount of information that y gives about X,' IEEE Transactions on Information Theory, vol. IT-14 (1), pp. 27-31, 1968. \n\n6. P. Smyth and R. M. Goodman, 'The information content of a probabilistic rule,' submitted for publication. \n\n7. S. Kullback, Information Theory and Statistics, New York: Wiley, 1959. \n\n8. D. Angluin and C. Smith, 'Inductive inference: theory and methods,' ACM Computing Surveys, 15(3), pp. 237-269, 1983. \n\n9. S. Geman, 'Stochastic relaxation methods for image restoration and expert systems,' in Maximum Entropy and Bayesian Methods in Science and Engineering (Vol. 2), pp. 265-311, Kluwer Academic Publishers, 1988. \n\n10. G. Hinton and T. Sejnowski, 'Optimal perceptual inference,' Proceedings of the IEEE CVPR, 1983. \n\n11. R. E. Blahut, Principles and Practice of Information Theory, Addison-Wesley: Reading, MA, 1987. \n\n12. American Association of Individual Investors, The Individual Investor's Guide to No-Load Mutual Funds, International Publishing Corporation: Chicago, 1987. \n\nFigure 1. Raw Mutual Funds Data. Each row is a fund; the columns give the fund type, 5-year return (%), diversity, beta (risk), bull and bear performance grades, stocks (%), investment income, net asset value, distributions (%NAV), expense ratio (%), turnover rate (%), and total assets ($M). \n\nFigure 2. Quantized Mutual Funds Data. The 88 examples with each attribute quantized to binary values, e.g., 5-year return above/below the S&P's 138%, beta under/over 1, stocks over 90% or not, turnover over/under 100%, total assets over/under $100M, distributions over/under 15% NAV, diversity large/small, and bull/bear performance high (grades A, B) or low (grades C, D, E). \n\nFigure 3. Top Ten Mutual Funds Rules (ITRULE rule output, mutual funds data): \n\n1. IF 5yrRet>S&P above THEN Bull_perf high (p(x|y) = 0.97, j = 0.75, J = 0.235, h = 4.74) \n2. IF Bull_perf low THEN 5yrRet>S&P below (p(x|y) = 0.98, j = 0.41, J = 0.201, h = 4.31) \n3. IF Assets large THEN Bull_perf high (p(x|y) = 0.81, j = 0.28, J = 0.127, h = 2.02) \n4. IF Bull_perf high THEN 5yrRet>S&P above (p(x|y) = 0.40, j = 0.25, J = 0.127, h = -1.71) \n5. IF typeA yes THEN typeG no (p(x|y) = 0.04, j = 0.50, J = 0.123, h = -3.87) \n6. IF Bull_perf low THEN Assets small (p(x|y) = 0.18, j = 0.25, J = 0.121, h = -1.95) \n7. IF typeGI yes THEN typeG no (p(x|y) = 0.05, j = 0.49, J = 0.109, h = -3.74) \n8. IF Bull_perf high THEN Assets large (p(x|y) = 0.72, j = 0.21, J = 0.109, h = 1.64) \n9. IF typeG yes THEN typeA no (p(x|y) = 0.97, j = 0.27, J = 0.108, h = 3.54) \n10. IF Assets small THEN Bull_perf low (p(x|y) = 0.26, j = 0.19, J = 0.103, h = -1.57) \n\nFigure 4. Rule Network. An input layer of linear units (one unit per attribute) feeds an output layer of sigmoid units through rule-metric connection weights; feedback connections of weight 1 return the output activations to the input layer. \n", "award": [], "sourceid": 150, "authors": [{"given_name": "Rodney", "family_name": "Goodman", "institution": null}, {"given_name": "John", "family_name": "Miller", "institution": null}, {"given_name": "Padhraic", "family_name": "Smyth", "institution": null}]}