{"title": "High Performance Neural Net Simulation on a Multiprocessor System with \"Intelligent\" Communication", "book": "Advances in Neural Information Processing Systems", "page_first": 888, "page_last": 895, "abstract": "", "full_text": "High Performance Neural Net Simulation on a Multiprocessor System with \"Intelligent\" Communication \n\nUrs A. M\u00fcller, Michael Kocheisen, and Anton Gunzinger \n\nElectronics Laboratory, Swiss Federal Institute of Technology \n\nCH-8092 Zurich, Switzerland \n\nAbstract \n\nThe performance requirements of experimental research on artificial neural nets often far exceed the capabilities of workstations and PCs. But speed is not the only requirement. Flexibility and short implementation times for new algorithms are usually of equal importance. This paper describes the simulation of neural nets on the MUSIC parallel supercomputer, a system that strikes a good balance among these three requirements and has therefore made possible many research projects that were unthinkable before. (MUSIC stands for Multiprocessor System with Intelligent Communication.) \n\n1 Overview of the MUSIC System \n\nThe goal of the MUSIC project was to build a fast parallel system and to use it in real-world applications such as neural net simulations, image processing, or simulations in chemistry and physics [1, 2]. The system was to be flexible and simple to program, and the realization time was to be short enough that the system would not be obsolete by the time it was finished. Therefore, the fastest available standard components were used. The key idea of the architecture is to support the collection and redistribution of complete data blocks by a simple, efficient, and autonomously working communication network realized in hardware. 
Instead of considering where to send data and where to receive data from, each processing element determines which part of a (virtual) data block it has produced and which other part of the same data block it wants to receive for the continuation of the algorithm. \n\n888 \n\n\fParallel Neural Net Simulation \n\n889 \n\n[Figure 1 diagram; recoverable labels: host computer (Sun, PC, Macintosh) with user terminal and mass storage; SCSI; transputer links; MUSIC boards, each with a board manager and processing elements (PE); I/O board with board manager and I/O; ring bus, 32+8 bit, 5 MHz; outside world.] \n\nFigure 1: Overview of the MUSIC hardware \n\nFigure 1 shows an overview of the MUSIC architecture. A ring architecture was chosen to realize the communication paradigm. Each processing element has a communication interface realized with a XILINX 3090 programmable gate array. During communication the data is shifted through a 40-bit-wide bus (32 data bits and 8 token bits) operated at a 5-MHz clock rate. On each clock cycle, the processing elements shift a data value to their right neighbors and receive a new value from their left neighbors. By counting the clock cycles, each communication interface knows when to copy data from the stream passing by into the local memory of its processing element and, likewise, when to insert data from the local memory into the ring. 
The tokens are used to label invalid data and to determine when a data value has circulated through the complete ring. \n\nThree processing elements are placed on a 9 x 8.5-inch board, each of them consisting of a Motorola 96002 floating-point processor, 2 Mbyte video (dynamic) RAM, 1 Mbyte static RAM, and the above-mentioned communication controller. The video RAM has a parallel port, which is connected to the processor, and a serial port, which is connected to the communication interface. Therefore, data processing is almost unaffected by the communication network's activity, and communication and processing can overlap in time. This allows the available communication bandwidth to be used more efficiently. The processors run at 40 MHz with a peak performance of 60 MFlops. Each board further contains an Inmos T425 transputer as a board manager, responsible for performance measurements and data communication with the host (a Sun workstation, PC, or Macintosh). \n\n\f890 \n\nM\u00fcller, Kocheisen, and Gunzinger \n\nNumber of processing elements:  60 \nPeak performance:               3.6 Gflops \nFloating-point format:          44 bit IEEE single extended precision \nMemory:                         180 Mbyte \nProgramming language:           C, Assembler \nCabinet:                        19-inch rack \nCooling:                        forced air cooling \nTotal power consumption:        less than 800 Watt \nHost computer:                  Sun workstation, PC or Macintosh \n\nTable 1: MUSIC system technical data \n\nIn order to provide the fast data throughput required by many applications, special I/O modules (for instance for real-time video processing applications) can be added which have direct access to the fast ring bus. An SCSI interface module for four parallel SCSI-2 disks, which is currently being developed, will allow the storage of huge amounts of training data for neural nets. Up to 20 boards (60 processing elements) fit into a standard 19-inch rack, resulting in a 3.6-Gflops system. 
MUSIC's technical data is summarized in Table 1. \n\nFor programming the communication network just three library functions are necessary: Init_comm() to specify the data block dimensions and data partitioning, Data_ready() to label a certain amount of data as ready for communication, and Wait_data() to wait for the arrival of the expected data (synchronization). Other functions allow the exchange and automatic distribution of data blocks between the host computer and MUSIC and the calling of individual user functions. The activity of the transputers is embedded in these functions and remains invisible to the user. \n\nEach processing element has its own local program memory, which makes MUSIC a MIMD machine (multiple instructions, multiple data). However, there is usually only one program running on all processing elements (SPMD = single program, multiple data), which makes programming as simple as, or even simpler than, programming a SIMD computer (single instruction, multiple data). The difference from SIMD machines is that each processor can take different program paths on conditional branches without the performance degradation that occurs on SIMD computers in such a case. This is especially important for the simulation of neural nets with nonregular local structures. \n\n2 Parallelization of Neural Net Algorithms \n\nThe first learning algorithm implemented on MUSIC was the well-known back-propagation algorithm applied to fully connected multilayer perceptrons [3]. The motivation was to gain experience in programming the system and to demonstrate its performance on a real-world application. All processing elements work on the same layer at a time, each of them producing an individual part of the output vector (or error vector in the backward path) [1]. The weights are distributed to the processing elements accordingly. 
Since a processing element needs different weight subsets in the forward and in the backward path, two subsets are stored and updated on each processing element. Each weight is therefore stored and updated twice at different locations on the MUSIC system [1]. This is done to avoid communicating the weights during learning, which would saturate the communication network. The estimated and experimentally measured speedup for different sizes of neural nets is illustrated in Figure 2. \n\n[Figure 2 plot: horizontal axis, number of processing elements (0 to 60); vertical axis ticks from 0 to 200; curves labeled 900-600-30, 300-200-10, and 203-80-26.] \n\nFigure 2: Estimated (lines) and measured (points) back-propagation performance for different neural net sizes. \n\nAnother frequently reported parallelization scheme is to replicate the complete network on all processing elements and to let each of them work on an individual subset of the training patterns [4, 5, 6]. The implementation is simpler and the communication is reduced. However, it does not allow continuous weight update, which is known to converge significantly faster than batch learning in many cases. A comparison of MUSIC with other back-propagation implementations reported in the literature is shown in Table 2. \n\nAnother category of neural nets that has been implemented on MUSIC is cellular neural nets (CNNs) [10]. A CNN is a two-dimensional array of nonlinear dynamic cells, where each cell is only connected to a local neighborhood [11, 12]. 
In the MUSIC implementation every processing element computes a different part of the array. Between iteration steps only the overlapping parts of the neighborhoods need to be communicated. Thus, the computation-to-communication ratio is very high, resulting in an almost linear speedup up to the maximum system size. CNNs are used in image processing and for the modeling of biological structures. \n\nSystem                    No. of PEs  Forward [MCPS]  Learning [MCUPS]  Peak (%)  Cont. weight update \nPC (80486, 50 MHz)*       1           1.1             0.47              38.0      Yes \nSun (Sparcstation 10)*    1           3.0             1.1               43.0      Yes \nAlpha Station (150 MHz)*  1           8.3             3.2               8.6       Yes \nHypercluster [7]          64          27.0            9.9               -         - \nWarp [4]                  10          -               17.0              -         No \nCM-2** [6]                64K         180.0           40.0              -         No \nCray Y-MP C90***          1           220.3           65.6              -         Yes \nRAP [8]                   40          574.0           106.0             50.0      Yes \nNEC SX-3***               1           -               130.0             9.6       Yes \nMUSIC*                    60          504.0           247.0             28.0      Yes \nSandy/8** [9]             256         -               583.0             31.0      Yes \nGF11 [5]                  356         -               901.0             54.0      No \n\n*Own measurements \n**Estimated numbers \n***No published reference available. \n\nTable 2: Comparison of floating-point back-propagation implementations. \"PEs\" means processing elements, \"MCPS\" stands for millions of connections per second in the forward path, and \"MCUPS\" is the number of millions of connection updates per second in the learning mode, including both forward and backward path. Note that not all implementations allow continuous weight update. \n\n3 A Neural Net Simulation Environment \n\nAfter all necessary functions for a certain algorithm have been programmed (e.g. forward propagate, backward propagate, weight update, etc.), they need to be combined in order to construct and train a specific neural net or to carry out a series of experiments. 
This can be done using the same programming language that was used to program the neural functions (in the case of MUSIC this would be C). In this case the programmer has maximum flexibility but also needs a good knowledge of the system and the programming language, and the program must be recompiled after each change in the experimental setup. \n\nBecause a set of neural functions is usually used by many different researchers who, in many cases, don't want to be involved in low-level (parallel) programming of the system, it is desirable to have a simpler front-end for the simulator. Such a front-end can be a shell program which allows various parameters of the algorithm to be specified (e.g. number of layers, number of neurons per layer, etc.). Such a shell can be very easy to use, and changes in the experimental setup don't require recompilation of the code. However, a simple shell program usually limits the flexibility for experimental research too much. We have chosen a way in between: a command language to combine the neural functions which is interactive and much simpler to learn and to use than an ordinary programming language like C or Fortran. The command language should have the following properties: \n\n- interactive \n- easy to learn and to use \n- flexible \n- loops and conditional branches \n- variables \n- transparent interface to neural functions. \n\nInstead of defining a new special-purpose command language we decided to adopt an existing one. The choice was Basic, which seems to meet the above requirements best. It is easy to learn and to use, widely used, flexible, and interactive. For this purpose a Basic interpreter, named Neuro-Basic, was written that allows the calling of neural (or other) functions running in parallel on MUSIC. From the Basic level itself the parallelism is completely invisible. 
To allocate a new layer with 300 neurons, for instance, one can type \n\na = new_layer(300) \n\nThe variable a afterwards holds a pointer to the created layer, which can later be used in other functions to reference that layer. The following command propagates layer a to layer b using the weight set w: \n\npropagate(a, b, w) \n\nOther functions allow the randomization of weights, the loading of patterns and weight sets, the computation of mean squared errors, and so on. Each instruction can be assigned to a program line and can then be run as a program. The sequence \n\n10 a = new_layer(300) \n20 b = new_layer(10) \n30 w = new_weights(a, b) \n\nfor instance defines a two-layer perceptron with 300 input and 10 output neurons connected by the weights w. Larger programs, loops, and conditional branches can be used to construct and train complete neural nets or to automatically run complete series of experiments where experimental setups depend on the results of previous experiments. The Basic environment thus supports the whole range of experimental research, from the interactive programming of small experiments to large off-line learning jobs. Extending the simulator with a new learning algorithm means that the programmer just has to write the parallel code of the actual algorithm. It can then be controlled by a Basic program and combined with already existing algorithms. \n\nThe Basic interpreter runs on the host computer, allowing easy access to the input/output devices of the host. However, the time needed for interpreting the commands on the host can easily be of the same order of magnitude as the runtime of the actual functions on the attached parallel processor array. Furthermore, the interpretation of a Basic program is a sequential part of the system (it doesn't run faster if the system size is increased), which is known to be a fundamental limit on speedup (Amdahl's law [13]). 
Therefore the Basic code is not interpreted directly on the host but is first compiled to a simpler stack-oriented meta-code, named b-code, which is afterwards copied to and run on all processing elements at optimum speed. The compilation phase is not really noticeable to the user since compiling 1000 source lines takes less than a second on a workstation. \n\nNote that Basic is not the programming language for the MUSIC system; it is a high-level command language for the easy control of parallel algorithms. The actual programming language for MUSIC is C or Assembler. \n\nOf course, Neuro-Basic is not restricted to the MUSIC system. The same principle can be used for neural net simulation on conventional workstations, vector computers, or other parallel systems. Furthermore, the parallel algorithms of MUSIC also run on sequential computers. Simulations in Neuro-Basic can therefore be executed locally on a workstation or PC as well. \n\n4 Conclusions \n\nNeuro-Basic running on MUSIC proved to be an important tool to support experimental research on neural nets. It made it possible to run many experiments which could not have been carried out otherwise. An important question, however, is how much more programming effort is needed to implement a new algorithm in the Neuro-Basic environment compared to an implementation on a conventional workstation, and how much faster it runs. \n\nAlgorithm                       Additional programming  Speedup \nBack-propagation (C)            x 2                     60 \nBack-propagation (Assembler)    x 8                     240 \nCellular neural nets (CNN)      x 3                     60 \n\nTable 3: Implementation time and performance ratio of a 60-processor MUSIC system compared to a Sun Sparcstation-10 \n\nTable 3 contains these numbers for back-propagation and cellular neural nets. 
It shows that if an additional programming effort of a factor of two to three is invested to program the MUSIC system in C, the return on investment is a speedup of approximately 60 compared to a Sun Sparcstation-10. This means that one year of CPU time on a workstation corresponds to less than a week on the MUSIC system. \n\nAcknowledgements \n\nWe would like to express our gratitude to the many persons who made valuable contributions to the project, especially to Peter Kohler and Bernhard B\u00e4umle for their support of the MUSIC system, Jose Osuna for the CNN implementation, and the students Ivo Hasler, Bjorn Tiemann, Rene Hauck, and Rolf Kr\u00e4henb\u00fchl, who worked for the project during their graduate work. \n\nThis work was funded by the Swiss Federal Institute of Technology, the Swiss National Science Foundation, and the Swiss Commission for Support of Scientific Research (KWF). \n\nReferences \n\n[1] Urs A. M\u00fcller, Bernhard B\u00e4umle, Peter Kohler, Anton Gunzinger, and Walter Guggenb\u00fchl. Achieving supercomputer performance for neural net simulation with an array of digital signal processors. IEEE Micro Magazine, 12(5):55-65, October 1992. \n\n[2] Anton Gunzinger, Urs A. M\u00fcller, Walter Scott, Bernhard B\u00e4umle, Peter Kohler, Hansruedi Vonder M\u00fchll, Florian M\u00fcller-Plathe, Wilfried F. van Gunsteren, and Walter Guggenb\u00fchl. Achieving super computer performance with a DSP array processor. In Robert Werner, editor, Supercomputing '92, pages 543-550. IEEE/ACM, IEEE Computer Society Press, November 16-20, 1992, Minneapolis, Minnesota 1992. \n\n[3] D. E. Rumelhart, G. E. Hinton, and R. J. Williams. Learning internal representations by error propagation. In David E. Rumelhart and James L. McClelland, editors, Parallel Distributed Processing: Explorations in the Microstructure of Cognition, volume 1, pages 318-362. Bradford Books, Cambridge MA, 1986. 
\n\n[4] Dean A. Pomerleau, George L. Gusciora, David S. Touretzky, and H. T. Kung. Neural network simulation at Warp speed: How we got 17 million connections per second. In IEEE International Conference on Neural Networks, pages II.143-150, July 24-27, San Diego, California 1988. \n\n[5] Michael Witbrock and Marco Zagha. An implementation of backpropagation learning on GF11, a large SIMD parallel computer. Parallel Computing, 14(3):329-346, 1990. \n\n[6] Xiru Zhang, Michael McKenna, Jill P. Mesirov, and David L. Waltz. An efficient implementation of the back-propagation algorithm on the Connection Machine CM-2. In David S. Touretzky, editor, Advances in Neural Information Processing Systems (NIPS-89), pages 801-809, 2929 Campus Drive, Suite 260, San Mateo, CA 94403, 1990. Morgan Kaufmann Publishers. \n\n[7] Heinz M\u00fchlbein and Klaus Wolf. Neural network simulation on parallel computers. In David J. Evans, Gerhard R. Joubert, and Frans J. Peters, editors, Parallel Computing-89, pages 365-374, Amsterdam, 1990. North Holland. \n\n[8] Phil Kohn, Jeff Bilmes, Nelson Morgan, and James Beck. Software for ANN training on a Ring Array Processor. In John E. Moody, Steven J. Hanson, and Richard P. Lippmann, editors, Advances in Neural Information Processing Systems 4 (NIPS-91), 2929 Campus Drive, Suite 260, San Mateo, California 94403, 1992. Morgan Kaufmann. \n\n[9] Hideki Yoshizawa, Hideki Kato, Hiroki Ichiki, and Kazuo Asakawa. A highly parallel architecture for back-propagation using a ring-register data path. In 2nd International Conference on Microelectronics for Neural Networks (ICMNN-91), pages 325-332, October 16-18, Munich 1991. \n\n[10] J. A. Osuna, G. S. Moschytz, and T. Roska. A framework for the classification of auditory signals with cellular neural networks. In H. Dedieux, editor, Proceedings of 11. European Conference on Circuit Theory and Design, pages 51-56 (part 1). 
Elsevier, August 20 - Sept. 3, Davos 1993. \n\n[11] Leon O. Chua and Lin Yang. Cellular neural networks: Theory. IEEE Transactions on Circuits and Systems, 35(10):1257-1272, October 1988. \n\n[12] Leon O. Chua and Lin Yang. Cellular neural networks: Applications. IEEE Transactions on Circuits and Systems, 35(10):1273-1290, October 1988. \n\n[13] Gene M. Amdahl. Validity of the single processor approach to achieving large scale computing capabilities. In AFIPS Spring Computer Conference, Atlantic City, NJ, pages 483-485, April 1967. \n", "award": [], "sourceid": 731, "authors": [{"given_name": "Urs", "family_name": "M\u00fcller", "institution": null}, {"given_name": "Michael", "family_name": "Kocheisen", "institution": null}, {"given_name": "Anton", "family_name": "Gunzinger", "institution": null}]}