{"title": "A Framework for the Cooperation of Learning Algorithms", "book": "Advances in Neural Information Processing Systems", "page_first": 781, "page_last": 788, "abstract": null, "full_text": "A Framework for the Cooperation of Learning Algorithms

Leon Bottou
Patrick Gallinari

Laboratoire de Recherche en Informatique
Universite de Paris XI
91405 Orsay Cedex
France

Abstract

We introduce a framework for training architectures composed of several modules. This framework, which uses a statistical formulation of learning systems, provides a single formalism for describing many classical connectionist algorithms as well as complex systems in which several algorithms interact. It makes it possible to design hybrid systems that combine the advantages of connectionist algorithms with those of other learning algorithms.

1 INTRODUCTION

Many recent achievements in the connectionist area have been obtained by designing systems where different algorithms interact. For example, (Bourlard & Morgan, 1991) have combined a Multi-Layer Perceptron (MLP) with a Dynamic Programming algorithm. Another impressive application (Le Cun, Boser & al., 1990) uses a very complex multi-layer architecture, followed by a statistical decision process. Also, in speech or image recognition systems, input signals are processed sequentially through different modules.

Modular systems are the most promising way to achieve such complex tasks. They can be built from simple components and therefore can be easily modified or extended; they also make it possible to incorporate structural a priori knowledge about the task decomposition into the architecture. Of course, this is also true for connectionism, and important progress in this field could be achieved if we were able to train multi-module architectures.
In this paper, we introduce a formal framework for designing and training such cooperative systems. It provides a single formalism for describing both the individual modules and the global system. We show that it is suitable for many connectionist algorithms, which allows them to cooperate in a way that is optimal with respect to the goal of learning. It also makes it possible to train hybrid systems in which connectionist and classical algorithms interact. Our formulation is based on a probabilistic approach to the problem of learning, which is described in section 2. One of the advantages of this approach is to provide a formal definition of the goal of learning. In section 3, we introduce modular architectures where each module can be described using this framework, and we derive explicit formulas for training the global system with a stochastic gradient descent algorithm. Section 4 is devoted to examples, including the case of hybrid algorithms combining MLP and Learning Vector Quantization (Bollivier, Gallinari & Thiria, 1990).

2 LEARNING SYSTEMS

The probabilistic formulation of the problem of learning has been extensively studied for three decades (Tsypkin, 1971), and applied to control, pattern recognition and adaptive signal processing. We recall here the main ideas and refer to (Tsypkin, 1971) for a detailed presentation.

2.1 EXPECTED COST

Let x be an instance of the concept to learn. In a pattern recognition problem, for example, x would be a pair (pattern, class). The concept is mathematically defined by an unknown probability density function p(x) which measures the likelihood of instance x.

We shall use a system parameterized by w to perform some task that depends on p(x). Given an example x, we can define a local cost J(x,w) that measures how well our system behaves on that example.
For instance, in classification J would be zero if the system puts a pattern in the correct class, and positive in case of misclassification.

Learning consists in finding a parameter w* that optimizes some functional of the model parameters. For instance, one would like to minimize the expected cost (1).

C(w) = ∫ J(x,w) p(x) dx    (1)

The expected cost cannot be explicitly computed, because the density p(x) is unknown. Our only knowledge of the process comes from a series of observations {x1, ..., xn} drawn from the unknown density p(x). Therefore, the quality of our system can only be measured through the realizations J(x,w) of the local cost function on the different observations.

2.2 STOCHASTIC GRADIENT DESCENT

Gradient descent algorithms are the simplest minimization algorithms. We cannot, however, compute the gradient of the expected cost (1), because p(x) is unknown. Estimating these derivatives on a training set {x1, ..., xn} gives the gradient algorithm (2), where ∇J denotes the gradient of J(x,w) with respect to w, and εt a small positive constant, the "learning rate".

w_{t+1} = w_t − εt (1/n) Σ_{i=1}^{n} ∇J(x_i, w_t)    (2)

The stochastic gradient descent algorithm (3) is an alternative to algorithm (2). At each iteration, an example x_t is drawn at random, and a new value of w is computed.

w_{t+1} = w_t − εt ∇J(x_t, w_t)    (3)

Algorithm (3) is faster and more reliable than (2); it is the only practical solution for training adaptive systems like neural networks (NN). Such stochastic approximations have been extensively studied in adaptive signal processing (Benveniste, Metivier & Priouret, 1987), (Ljung & Soderstrom, 1983). Under certain conditions, algorithm (3) converges almost surely (Bottou, 1991), (White, 1991) and allows the system to reach an optimal state.
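As an illustration, the batch update (2) and the stochastic update (3) can be sketched in NumPy. The least-mean-squares local cost, the synthetic observations, and all variable names below are our own illustrative assumptions, not part of the framework itself.

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical setup: linear model w.a with local cost J((a,d), w) = 0.5*(d - w.a)^2.
# w_true generates the observations {x1 ... xn}, each observation being a pair (a, d).
w_true = np.array([2.0, -1.0])
A = rng.normal(size=(200, 2))                  # inputs a
d = A @ w_true + 0.1 * rng.normal(size=200)    # noisy targets

def grad_J(w, a, dk):
    """Gradient of the local cost J = 0.5*(dk - w.a)^2 with respect to w."""
    return -(dk - w @ a) * a

# Algorithm (2): gradient descent on the average gradient over the training set.
w = np.zeros(2)
for t in range(100):
    g = np.mean([grad_J(w, A[i], d[i]) for i in range(len(d))], axis=0)
    w = w - 0.1 * g
w_batch = w

# Algorithm (3): stochastic gradient descent, one randomly drawn example per step,
# with a decreasing learning rate eps_t.
w = np.zeros(2)
for t in range(2000):
    i = rng.integers(len(d))
    eps_t = 1.0 / (10 + t)
    w = w - eps_t * grad_J(w, A[i], d[i])
w_sgd = w
```

Both runs approach w_true; the stochastic version touches a single example per update, which is what makes it practical for adaptive systems.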
3 MODULAR LEARNING SYSTEMS

Most often, when the goal of learning is complex, it can be achieved more easily by decomposing the global task into several simpler subtasks which, for instance, reflect some a priori knowledge about the structure of the problem. One can use this decomposition to build modular architectures where each module corresponds to one of the subtasks.

Within this framework, we will use the expected cost (1) as the goal of learning. The problem now is to change the analytical formulation of the functional (1) so as to introduce the modular decomposition of the global task. In (1), the analytic expression of the local cost J(x,w) has two meanings: it describes a parametric relationship between the inputs and the outputs of the system, and it measures the quality of the system. To introduce the decomposition, one may write this local cost J(x,w) as the composition of several functions. One of them takes into account the local error and therefore measures the quality of the system; the others correspond to the decomposition of the parametric relationship between the inputs and the outputs of the system (Figure 1). Each module thus receives some inputs from other modules or from the external world, and produces some outputs which are sent to other modules.

Figure 1: A modular system

In classical systems, these modules correspond to well-defined processing stages, e.g. signal processing, filtering, feature extraction, classification. They are trained sequentially and then linked together to build a complete processing system which takes some inputs (e.g. raw signals) and produces some outputs (e.g. classes). Neither the assumed decomposition nor the behavior of the different modules is guaranteed to contribute optimally to the global goal of learning.
We will show in the following that it is possible to train such systems optimally.

3.1 TRAINING MODULAR SYSTEMS

Each function in the above composition defines a local processing stage or module whose outputs are a parametric function of its inputs (4).

∀ j ∈ Y⁻¹(n),  y_j = f_j( (x_k)_{k∈X⁻¹(n)}, (w_i)_{i∈W⁻¹(n)} )    (4)

Y⁻¹(n) (resp. X⁻¹(n) and W⁻¹(n)) denotes the set of subscripts associated to the outputs y (resp. inputs x and parameters w) of module n. Conversely, output y_j (resp. input x_k and parameter w_i) belongs to module Y(j) (resp. X(k) and W(i)).

Modules are linked so as to build a feed-forward topology, which is expressed by a function φ.

∀ k,  x_k = y_{φ(k)}    (5)

We shall consider that the first module only feeds the system with examples and that the last module only computes y_last = J(x,w).

Following (Le Cun, 1988), we can compute the derivatives of J with a Lagrangian method. Let α and β be the Lagrange coefficients for constraints (4) and (5).

L = J − Σ_k β_k ( x_k − y_{φ(k)} ) − Σ_j α_j ( y_j − f_j( (x_k)_{k∈X⁻¹Y(j)}, (w_i)_{i∈W⁻¹Y(j)} ) )    (6)

By equating the derivatives of L with respect to x and y to zero, we get recursive formulas for computing α and β in a single backward pass along the acyclic graph φ.

α_last = 1,   β_k = Σ_{j∈Y⁻¹X(k)} α_j ∂f_j/∂x_k,   α_j = Σ_{k∈φ⁻¹(j)} β_k    (7)

Then, the derivatives of J with respect to the weights are:

dJ/dw_i = ∂L/∂w_i = Σ_{j∈Y⁻¹W(i)} α_j ∂f_j/∂w_i    (8)

Once we have computed the derivatives of the local cost J(x,w), we can apply the stochastic gradient descent algorithm (3) to minimize the expected cost C(w).

We shall say that each module is defined by the equations in (7) and (8) that characterize its behavior.
These equations are:

• a forward equation (F):   y_j = f_j( (x_k)_{k∈X⁻¹(n)}, (w_i)_{i∈W⁻¹(n)} )
• a backward equation (B):  β_k = Σ_{j∈Y⁻¹X(k)} α_j ∂f_j/∂x_k
• a gradient equation (G):  dJ/dw_i = Σ_{j∈Y⁻¹W(i)} α_j ∂f_j/∂w_i

The remaining equations do not depend on the nature of the modules. They describe how modules interact during training. Like back-propagation, they address the credit assignment problem between modules by globally minimizing a single cost function. Training such a complex system actually consists in cooperatively training its components.

4 EXAMPLES

Most learning algorithms, as well as new algorithms, may be expressed as modular learning systems. Here are some simple examples of modules and systems.

4.1 LINEAR AND QUASI-LINEAR SYSTEMS

MODULE             SYMBOL   FORWARD                  BACKWARD             GRADIENT
Matrix product     Wx       y_i = Σ_k w_ik x_k       β_k = Σ_i α_i w_ik   α_i x_k
Mean square error  MSE      J = ½ Σ_k (d_k − x_k)²   β_k = x_k − d_k      —
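The (F), (B) and (G) equations above can be sketched in NumPy for the two modules of the table, a matrix product Wx followed by an MSE cost. The class and method names are our own illustrative choices, not from the paper; since each MSE input is wired to exactly one Wx output, equation (7) makes the β returned by the cost module the α of the preceding module. A finite-difference check confirms the gradient equation.

```python
import numpy as np

class MatrixProduct:
    """Module computing y = W x  ((F): y_i = sum_k w_ik x_k)."""
    def __init__(self, W):
        self.W = W
    def forward(self, x):
        self.x = x
        return self.W @ x
    def backward(self, alpha):
        # (B): beta_k = sum_i alpha_i w_ik
        return self.W.T @ alpha
    def grad(self, alpha):
        # (G): dJ/dw_ik = alpha_i x_k
        return np.outer(alpha, self.x)

class MSE:
    """Last module: y_last = J = 0.5 * sum_k (d_k - x_k)^2."""
    def __init__(self, d):
        self.d = d
    def forward(self, x):
        self.x = x
        return 0.5 * np.sum((self.d - x) ** 2)
    def backward(self, alpha_last=1.0):
        # (B) with alpha_last = 1: beta_k = x_k - d_k
        return alpha_last * (self.x - self.d)

# Cooperating modules: a linear stage feeding the cost stage.
rng = np.random.default_rng(0)
W = rng.normal(size=(3, 4))
x = rng.normal(size=4)
d = rng.normal(size=3)

lin, cost = MatrixProduct(W), MSE(d)
J = cost.forward(lin.forward(x))   # forward pass along the graph
alpha = cost.backward()            # backward pass: beta of MSE = alpha of Wx
dJ_dW = lin.grad(alpha)            # gradient equation of the Wx module

# Finite-difference check of the analytic gradient dJ/dW.
eps = 1e-6
num = np.zeros_like(W)
for i in range(W.shape[0]):
    for k in range(W.shape[1]):
        Wp = W.copy()
        Wp[i, k] += eps
        num[i, k] = (MSE(d).forward(MatrixProduct(Wp).forward(x)) - J) / eps
```

Plugging dJ_dW into the stochastic update (3) would then train the linear stage; longer chains only require repeating the backward equation module by module.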