{"title": "Combined Neural Networks for Time Series Analysis", "book": "Advances in Neural Information Processing Systems", "page_first": 224, "page_last": 231, "abstract": null, "full_text": "COMBINED NEURAL NETWORKS \n\nFOR TIME SERIES ANALYSIS \n\nIris Ginzburg and David Horn \n\nSchool of Physics and Astronomy \n\nRaymond and Beverly Sackler Faculty of Exact Science \n\nTel-Aviv University \nTel-A viv 96678, Israel \n\nAbstract \n\nWe propose a method for improving the performance of any net(cid:173)\nwork designed to predict the next value of a time series. Vve advo(cid:173)\ncate analyzing the deviations of the network's predictions from the \ndata in the training set. This can be carried out by a secondary net(cid:173)\nwork trained on the time series of these residuals. The combined \nsystem of the two networks is viewed as the new predictor. We \ndemonstrate the simplicity and success of this method, by apply(cid:173)\ning it to the sunspots data. The small corrections of the secondary \nnetwork can be regarded as resulting from a Taylor expansion of \na complex network which includes the combined system. \\\\Te find \nthat the complex network is more difficult to train and performs \nworse than the two-step procedure of the combined system. \n\n1 \n\nINTRODUCTION \n\nThe use of neural networks for computational tasks is based on the idea that the \nefficient way in which the nervous system handles memory and cognition is worth \nimmitating. Artificial implementations are often based on a single network of math(cid:173)\nematical neurons. We note, however, that in biological systems one can find collec(cid:173)\ntions of consecutive networks, performing a complicated task in several stages, with \nlater stages refining the performance of earlier ones. Here we propose to follow this \nstrategy in artificial applications. 
\n\nWe study the analysis of time series, where the problem is to predict the next element on the basis of previous elements of the series. One looks then for a functional relation \n\ny_n = f(y_{n-1}, y_{n-2}, ..., y_{n-m}). \n\n(1) \n\nThis type of representation is particularly useful for the study of dynamical systems. These are characterized by a common continuous variable, time, and many correlated degrees of freedom which combine into a set of differential equations. Nonetheless, each variable can in principle be described by a lag-space representation of the type (1). This is valid even if the y = y(t) solution is unpredictable, as in chaotic phenomena. \n\nWeigend, Huberman and Rumelhart (1990) have studied the experimental series of yearly averages of sunspot activity using this approach. They have realized the lag-space representation on an (m, d, 1) network, where the notation implies a hidden layer of d sigmoidal neurons and one linear output. Using m = 12 and a weight-elimination method which led to d = 3, they obtained results which compare favorably with the leading statistical model (Tong and Lim, 1980). Both models do well in predicting the next element of the sunspots series. Recently, Nowlan and Hinton (1992) have shown that a significantly better network can be obtained if the training procedure includes a complexity penalty term in which the distribution of weights is modelled as a mixture of multiple gaussians whose parameters vary in an adaptive manner as the system is being trained. \n\nWe propose an alternative method which is capable of improving the performance of neural networks: train another network to predict the errors of the first one, to uncover and remove systematic correlations that may be found in the solution given by the trained network, thus correcting the original predictions. 
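As a concrete illustration of the lag-space representation (1), the following sketch builds the input vectors and targets from a scalar series. This code is our own illustration, not part of the original study; the function name `make_lag_dataset` and the toy series are assumptions.

```python
import numpy as np

def make_lag_dataset(series, m):
    """Build (X, y) pairs for the lag-space representation
    y_n = f(y_{n-1}, ..., y_{n-m}): each row of X holds the m
    values preceding position n, the target is the value at n."""
    X = np.array([series[n - m:n] for n in range(m, len(series))])
    y = np.array(series[m:])
    return X, y

# Toy example with m = 3 lags.
series = [0.1, 0.2, 0.4, 0.3, 0.5, 0.6]
X, y = make_lag_dataset(series, m=3)
# X[0] is [0.1, 0.2, 0.4]; its target y[0] is 0.3
```

For the sunspots study one would use m = 12 and the normalized yearly series instead of this toy list.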
This is in agreement with the general philosophy mentioned at the beginning, where we take from Nature the idea that the task does not have to be performed by one complicated network; it is advantageous to break it into stages of consecutive analysis steps. Starting with a network which is trained on the sunspots data with back-propagation, we show that the processed results improve considerably and we find solutions which match the performance of Weigend et al. \n\n2 CONSTRUCTION OF THE PRIMARY NETWORK \n\nLet us start with a simple application of back-propagation to the construction of a neural network describing the sunspots data, which are normalized to lie between 0 and 1. The network is assumed to have one hidden layer of sigmoidal neurons, h_i, i = 1, ..., d, which receives the input of the nth vector: \n\nh_i = \sigma(\sum_{j=1}^{m} w_{ij} y_{n-j} - \theta_i). \n\n(2) \n\nThe output of the network, p_n, is constructed linearly, \n\np_n = \sum_{i=1}^{d} W_i h_i - \theta. \n\n(3) \n\nThe error-function which we minimize is defined by \n\nE = \frac{1}{2} \sum_{n=m+1}^{N} (p_n - y_n)^2, \n\n(4) \n\nwhere we try to equate p_n, the prediction or output of the network, with y_n, the nth value of the series. This is the appropriate formulation for a training set of N data points which are viewed as N - m strings of length m used to predict the point following each string. \n\nWe will work with two sets of data points. One will be labelled T and be used for training the network, and the other, P, will be used for testing its predictive power. Let us define the average error by \n\n\epsilon_S = \sqrt{ \frac{1}{\|S\|} \sum_{n \in S} (p_n - y_n)^2 }, \n\n(5) \n\nwhere the set S is either T or P. An alternative parameter was used by Weigend et al., in which the error is normalized by the standard deviation of the data. This leads to an average relative variance (arv) which is related to the average error through \n\narv_S = \epsilon_S^2 / \hat{\sigma}^2, \n\n(6) \n\nwhere \hat{\sigma} is the standard deviation of the data. Following Weigend et al. 
we choose m = 12 neurons in the first layer and \|T\| = 220 data points for the training set. The following \|P\| = 35 years are used for testing the predictions of our network. We use three sigmoidal units in the hidden layer and run with a slow convergence rate for 7000 periods. This is roughly where cross-validation would indicate that a minimum is reached. The starting parameters of our networks are chosen randomly. Five examples of such networks are presented in Table 1. \n\n3 THE SECONDARY NETWORK \n\nGiven the networks constructed above, we investigate their deviations from the desired values \n\nq_n = y_n - p_n. \n\n(7) \n\nA standard statistical test for the quality of any predictor is the analysis of the correlations between consecutive errors. If such correlations are found, the predictor must be improved. The correlations reflect a systematic deviation of the primary network from the true solution. We propose not to improve the primary network by modifying its architecture but to add to it a secondary network which uses the residuals q_n as its new data. The latter is trained only after the training session of the primary network has been completed. \n\nClearly one may expect some general relation of the type \n\nq_n = g(y_{n-1}, ..., y_{n-m}, q_{n-1}, ..., q_{n-l}) \n\n(8) \n\nto exist. Looking for a structure of this kind enlarges considerably the original space in which we searched for a solution to (1). We wish the secondary network to do a modest task; therefore we assume that much can be gained by looking at the interdependence of the residuals q_n on themselves. This reduces the problem to finding the best values of \n\nr_n = f_1(q_{n-1}, q_{n-2}, ..., q_{n-l}) \n\n(9) \n\nwhich would minimize the new error function \n\nE_2 = \frac{1}{2} \sum_{n=l+1}^{N} (r_n - q_n)^2. \n\n(10) \n\nAlternatively, one may try to express the residual in terms of the functional values \n\nr_n = f_2(y_{n-1}, y_{n-2}, ..., y_{n-l}), \n\n(11) \n\nminimizing again the expression (10). \n\nWhen the secondary network completes its training, we propose to view \n\nt_n = p_n + r_n \n\n(12) \n\nas the new prediction of the combined system. We will demonstrate that a major improvement can be obtained already with a linear perceptron. This means that the linear regression \n\nr_n = \sum_{i=1}^{l} a_i^1 q_{n-i} + \beta^1 \n\n(13) \n\nor \n\nr_n = \sum_{i=1}^{l} a_i^2 y_{n-i} + \beta^2 \n\n(14) \n\nis sufficient to account for a large fraction of the systematic deviations of the primary networks from the true function that they were trained to represent. \n\n4 NUMERICAL RESULTS \n\nWe present in Table 1 five examples of results of (12,3,1) networks, i.e. m = 12 inputs, a hidden layer of three sigmoidal neurons and a linear output neuron. These five examples were chosen from 100 runs of simple back-propagation networks with random initial conditions by selecting the networks with the smallest R values (Ginzburg and Horn, 1992). This is a weak constraint which is based on letting the network generate a large sequence of data by iterating its own predictions, and selecting the networks whose distribution of function values is the closest to the corresponding distribution of the training set. \n\nThe errors of the primary networks, in particular those of the prediction set \epsilon_P, are considerably higher than those quoted by Weigend et al., who started out from a (12,8,1) network and brought it down through a weight-elimination technique to a (12,3,1) structure. They have obtained the values \epsilon_T = 0.059, \epsilon_P = 0.06. We can reduce our errors and reach the same range by activating a secondary network with l = 11 to perform the linear regression (13) on the residuals of the predictions of the primary network. The results are the primed errors quoted in the table. 
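Because the secondary predictor is linear, its error function (10) has a unique minimum, so it can be sketched with an ordinary least-squares fit rather than gradient training. The following illustration (function names are our own; `numpy.linalg.lstsq` stands in for the linear perceptron) fits Eq. (13) on the residuals and forms the combined prediction of Eq. (12):

```python
import numpy as np

def fit_residual_regression(q, l):
    """Least-squares fit of the linear secondary predictor of Eq. (13):
    r_n = sum_{i=1..l} a_i q_{n-i} + beta, where q holds the residuals
    q_n = y_n - p_n of the trained primary network."""
    X = np.array([q[n - l:n] for n in range(l, len(q))])
    X = np.hstack([X, np.ones((len(X), 1))])       # bias column for beta
    coeffs, *_ = np.linalg.lstsq(X, q[l:], rcond=None)
    return coeffs                                  # lag weights, then beta

def combined_prediction(p, q, coeffs, l):
    """t_n = p_n + r_n (Eq. 12): correct each primary prediction p_n
    by the secondary network's predicted residual r_n."""
    lags = np.array([q[n - l:n] for n in range(l, len(q))])
    r = lags @ coeffs[:-1] + coeffs[-1]
    return p[l:] + r
```

On a residual series with a perfectly linear autoregressive structure, e.g. q_n = 0.8 q_{n-1}, the fit recovers the coefficient 0.8 and the combined prediction reproduces the target exactly; on real residuals it removes only the linearly predictable part.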
Characteristically we observe a reduction of \epsilon_T by 3-4% and a reduction of \epsilon_P by more than 10%. \n\nTable 1 \n\n#  \epsilon_T  \epsilon'_T  \epsilon_P  \epsilon'_P \n1  0.0614  0.0587  0.0716  0.0620 \n2  0.0600  0.0585  0.0721  0.0663 \n3  0.0611  0.0580  0.0715  0.0621 \n4  0.0621  0.0594  0.0698  0.0614 \n5  0.0616  0.0589  0.0681  0.0604 \n\nError parameters of five networks. The unprimed errors are those of the primary networks. The primed errors correspond to the combined system which includes correction of the residuals by a linear perceptron with l = 11, which is an autoregression of the residuals. Slightly better results for the short-term predictions are achieved by corrections based on regression of the residuals on the original input vectors, when the regression length is 13 (Table 2). \n\nTable 2 \n\n#  \epsilon_T  \epsilon'_T  \epsilon_P  \epsilon'_P \n1  0.061  0.059  0.072  0.062 \n2  0.060  0.059  0.072  0.065 \n3  0.061  0.058  0.072  0.062 \n4  0.062  0.060  0.070  0.061 \n5  0.062  0.059  0.068  0.059 \n\nError parameters for the same five networks. The primed errors correspond to the combined system which includes correction of the residuals by a linear perceptron based on original input vectors with l = 13. \n\n5 LONG TERM PREDICTIONS \n\nWhen short-term prediction is performed, the output of the original network is corrected by the error predicted by the secondary network. This can be easily generalized to perform long-term predictions by feeding the corrected output produced by the combined system of both networks back as input to the primary network. The corrected residuals predicted by the secondary network are viewed as the residuals needed as further inputs if the secondary network is the one performing autoregression of residuals. 
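The feedback scheme just described can be sketched as follows; here `primary` and `secondary` stand for the two already-trained predictors (any callables mapping a lag vector to a scalar), and the function name and interface are our own assumptions:

```python
import numpy as np

def iterate_combined(primary, secondary, y_hist, q_hist, m, l, steps):
    """Long-term prediction with the combined system: at each step the
    corrected output t_n = p_n + r_n is fed back as input to the primary
    network, and the predicted residual r_n is fed back as input to the
    secondary network (autoregression-on-residuals variant)."""
    y_hist, q_hist = list(y_hist), list(q_hist)
    out = []
    for _ in range(steps):
        p = primary(np.array(y_hist[-m:]))     # primary prediction p_n
        r = secondary(np.array(q_hist[-l:]))   # predicted residual r_n
        t = p + r                              # corrected prediction, Eq. (12)
        y_hist.append(t)                       # feed corrected value back
        q_hist.append(r)                       # predicted residual becomes input
        out.append(t)
    return out
```

For the variant that regresses residuals on functional values, the secondary callable would instead receive `y_hist[-l:]`.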
We run both systems based on regression on residuals and \nregression on functional values to produce long term predictions. \n\nIn table 3 we present the results of this procedure for the case of a secondary \nnetwork performing regression on residuals. The errors of the long term predictions \nare averaged over the test set P of the next 35 years. We see that the errors of \nthe primary networks are reduced by about 20%. The quality of these long term \npredictions is within the range of results presented by Weigend et. al. Using the \nregression on (predicted) functional values, as in Eq. 14 , the results are improved \nby up to 15% as shown in Table 4. \n\n\fCombined Neural Networks for Time Series Analysis \n\n229 \n\n# \n\n1 \n2 \n3 \n4 \n5 \n\nf2 \n\nfj \n\nf5 \n\n0.118 \n0.118 \n0.117 \n0.116 \n0.113 \n\n0.098 \n0.106 \n0.099 \n0.099 \n0.097 \n\n0.162 \n0.164 \n0.164 \n0.152 \n0.159 \n\nf~ \n\n0.109 \n0.125 \n0.112 \n0.107 \n0.112 \n\nfll \n\n0.150 \n0.131 \n0.136 \n0.146 \n0.147 \n\n, \n\nf11 \n\n0.116 \n0.101 \n0.099 \n0.120 \n0.123 \n\nTable 3 \n\nLong term predictions into the future. fn denotes the average error of n time steps \npredictions over the P set. The unprimed errors are those of the primary networks. \nThe primed errors correspond to the combined system which includes correction of \nthe residuals by a linear perceptron. \n\n# \n\n1 \n2 \n3 \n4 \n5 \n\nf2 \n\n0.118 \n0.118 \n0.117 \n0.117 \n0.113 \n\nf' \n2 \n\n0.098 \n0.104 \n0.098 \n0.098 \n0.096 \n\nf5 \n\n0.162 \n0.164 \n0.164 \n0.152 \n0.159 \n\nf' \n5 \n\n0.107 \n0.117 \n0.108 \n0.105 \n0.110 \n\nf11 \n\n0.150 \n0.131 \n0.136 \n0.146 \n0.147 \n\n, \n\nf11 \n\n0.101 \n0.089 \n0.086 \n0.105 \n0.109 \n\nLong term predictions into the future. The primed errors correspond to the com(cid:173)\nbined system which includes correction of the residuals by a linear perceptron based \non the original inputs. 
\n\n6 THE COMPLEX NETWORK \n\nSince the corrections of the secondary network are much smaller than the characteristic weights of the primary network, the corrections can be regarded as resulting from a Taylor expansion of a complex network which includes the combined system. This can be simply implemented in the case of Eq. (14), which can be incorporated in the complex network as direct linear connections from the input layer to the output neuron, in addition to the non-linear hidden layer, i.e., \n\nt_n = \sum_{i=1}^{d} W_i h_i + \sum_{i=1}^{m} v_i y_{n-i} - \theta. \n\n(15) \n\nWe train such a complex network on the same problem to see how it compares with the two-step approach of the combined networks described in the previous sections. \n\nThe results depend strongly on the training rates of the direct connections, as compared with the training rates of the primary connections (i.e. those of the primary network). When the direct connections are trained faster than the primary ones, the result is a network that resembles a linear perceptron, with non-linear corrections. In this case, the assumption of the direct connections being small corrections to the primary ones no longer holds. The training error and prediction capability of such a network are worse than those of the primary network. On the other hand, when the primary connections are trained using a faster training rate, we expect the final network to be similar in nature to the combined system. Still, the quality of training and prediction of these solutions is not as good as the quality of the combined system, unless a big effort is made to find the correct rates. Typical results of the various systems are presented in Table 5. 
\n\nTable 5 \n\ntype of network                          \epsilon_T  \epsilon_P \nprimary network                          0.061       0.072 \nlearning rate of linear weights = 0.1    0.062       0.095 \nlearning rate of linear weights = 0.02   0.061       0.068 \ncombined system                          0.058       0.062 \n\nShort term predictions of various networks. The learning rate of the primary weights is 0.04. \n\nThe performance of the complex network can be better than that of the primary network by itself, but it is surpassed by the achievements of the combined system. \n\n7 DISCUSSION \n\nIt is well known that increasing the complexity of a network is not a guaranteed route to better performance (Geman et al., 1992). In this paper we propose an alternative which adds very few free parameters and focuses on the residual errors one wants to eliminate. Still, one may raise the question whether this cannot be achieved in one complex network. It can, provided we are allowed to use different updating rates for different connections. In the extreme limit in which one rate supersedes by far the other, this is equivalent to a disjoint architecture of a combined two-step system. This emphasizes the point that the solution a feedforward network finds for any given task depends on the architecture of the network as well as on its training procedure. \n\nThe secondary network which we have used was linear, hence it defined a simple regression of the residual on a series of residuals or a series of function values. In both cases the minimum which the network looks for is unique. In the case in which the residual is expressed as a regression on function values, the problem can be recast in a complex architecture. However, the combined procedure guarantees that the linear weights will be small, i.e. we look for a small linear correction to the prediction of the primary network. If one trains all weights of the complex network at the same rate this condition is not met, hence the worse results. 
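For concreteness, a minimal numpy sketch of the complex network of Eq. (15), trained with separate learning rates for the primary (nonlinear) weights and the direct linear connections. All names and initial weight scales here are illustrative assumptions; only the rates 0.04 and 0.02 echo the values quoted around Table 5.

```python
import numpy as np

rng = np.random.default_rng(0)
m, d = 12, 3                                # lag inputs and hidden sigmoids

# Parameters of the complex network, Eq. (15)
W_in  = rng.normal(scale=0.1, size=(d, m))  # input -> hidden weights w_ij
th_h  = np.zeros(d)                         # hidden thresholds theta_i
W_out = rng.normal(scale=0.1, size=d)       # hidden -> output weights W_i
v     = np.zeros(m)                         # direct linear connections v_i
theta = 0.0                                 # output threshold

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def forward(y_lags):
    """t_n = sum_i W_i h_i + sum_i v_i y_{n-i} - theta, Eq. (15)."""
    h = sigmoid(W_in @ y_lags - th_h)
    return W_out @ h + v @ y_lags - theta, h

def sgd_step(y_lags, target, eta_primary=0.04, eta_linear=0.02):
    """One gradient step on (1/2)(t_n - y_n)^2 with a slower rate for
    the direct linear connections, as discussed in Section 6."""
    global W_in, th_h, W_out, v, theta
    t, h = forward(y_lags)
    err = t - target
    delta = err * W_out * h * (1.0 - h)     # backprop through the sigmoids
    W_out -= eta_primary * err * h
    theta += eta_primary * err              # t depends on -theta
    v     -= eta_linear * err * y_lags      # slower rate: direct connections
    W_in  -= eta_primary * np.outer(delta, y_lags)
    th_h  += eta_primary * delta            # h depends on -th_h
    return err
```

Making `eta_linear` larger than `eta_primary` reproduces the failure mode described above: the network drifts toward a linear perceptron with nonlinear corrections, and the small-correction assumption breaks.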
\n\nWe therefore advocate the use of the two-step procedure of the combined set of networks. We note that the secondary networks perform well on all possible tests: they reduce the training errors, they improve short-term predictions and they do better on long-term predictions as well. Since this approach is quite general and can be applied to any time-series forecasting problem, we believe it should always be tried as a correction procedure. \n\nREFERENCES \n\nGeman, S., Bienenstock, E., & Doursat, R., 1992. Neural networks and the bias/variance dilemma. Neural Comp. 4, 1-58. \nGinzburg, I. & Horn, D., 1992. Learning the rule of a time series. Int. Journal of Neural Systems 3, 167-177. \nNowlan, S. J. & Hinton, G. E., 1992. Simplifying neural networks by soft weight-sharing. Neural Comp. 4, 473-493. \nTong, H., & Lim, K. S., 1980. Threshold autoregression, limit cycles and cyclical data. J. R. Stat. Soc. B 42, 245. \nWeigend, A. S., Huberman, B. A. & Rumelhart, D. E., 1990. Predicting the future: a connectionist approach. Int. Journal of Neural Systems 1, 193-209. \n", "award": [], "sourceid": 824, "authors": [{"given_name": "Iris", "family_name": "Ginzburg", "institution": null}, {"given_name": "David", "family_name": "Horn", "institution": null}]}