{"title": "The Ni1000: High Speed Parallel VLSI for Implementing Multilayer Perceptrons", "book": "Advances in Neural Information Processing Systems", "page_first": 747, "page_last": 754, "abstract": null, "full_text": "The NilOOO: High Speed Parallel VLSI \nfor Implementing Multilayer Perceptrons \n\nMichael P. Perrone \n\nLeon N Cooper \n\nThomas J. Watson Research Center \n\nInstitute for Brain and Neural Systems \n\nP.O. Box 704 \n\nYorktown Heights, NY 10598 \n\nmppGwatson.ibm.com \n\nBrown University \n\nProvidence, Ri 02912 \nIncGcns.brown.edu \n\nAbstract \n\nIn this paper we present a new version of the standard multilayer \nperceptron (MLP) algorithm for the state-of-the-art in neural net(cid:173)\nwork VLSI implementations: the Intel Ni1000. This new version of \nthe MLP uses a fundamental property of high dimensional spaces \nwhich allows the 12-norm to be accurately approximated by the \nIt -norm. This approach enables the standard MLP to utilize the \nparallel architecture of the Ni1000 to achieve on the order of 40000, \n256-dimensional classifications per second. \n\n1 The Intel NilOOO VLSI Chip \n\nThe Nestor/Intel radial basis function neural chip (Ni1000) contains the equivalent \nof 1024 256-dimensional artificial digital neurons and can perform at least 40000 \nclassifications per second [Sullivan, 1993]. To attain this great speed, the Ni1000 \nwas designed to calculate \"city block\" distances (Le. \nthe II-norm) and thus to \navoid the large number of multiplication units that would be required to calculate \nEuclidean dot products in parallel. Each neuron calculates the city block distance \nbetween its stored weights and the current input: \n\nneuron activity = L IWi - :eil \n\n(1) \nwhere w, is the neuron's stored weight for the ith input and :ei is the ith input. \nThus the Nil000 is ideally suited to perform both the RCE [Reillyet al., 1982] and \n\n\f748 \n\nMichael P. Perrone. Leon N. 
Cooper \n\nPRCE [Scofield et al., 1987] algorithms or any of the other commonly used radial \nbasis function (RBF) algorithms. However, dot products are central in the calcula(cid:173)\ntions performed by most neural network algorithms (e.g. MLP, Cascade Correlation, \netc.). Furthermore, for high dimensional data, the dot product becomes the compu(cid:173)\ntation bottleneck (i.e. most ofthe network's time is spent calculating dot products). \nIf the dot product can not be performed in parallel there will be little advantage \nusing the NilOOO for such algorithms. In this paper, we address this problem by \nshowing that we can extend the NilOOO to many of the standard neural network \nalgorithms by representing the Euclidean dot product as a function of Euclidean \nnorms and by then using a city block norm approximation to the Euclidean norm. \nSection 2, introduces the approximate dot productj Section 3 describes the City \nBlock MLP which uses the approximate dot productj and Section 4 presents ex(cid:173)\nperiments which demonstrate that the City Block MLP performs well on the NIST \nOCR data and on human face recognition data. \n\n2 Approximate Dot Product \n\nConsider the following approximation [Perrone, 1993]: \n\n1 \n\n11Z11 ~ y'n1Z1 \n\n(2) \nwhere z is some n-dimensional vector, II\u00b7 II is the Euclidean length (i.e. the 12 -\nnorm) and I\u00b7 I is the City Block length (i.e. the 11-norm). This approximation is \nmotivated by the fact that in high dimensional spaces it is accurate for a majority \nof the points in the space. In Figure 1, we suggest an intuitive interpretation of why \nthis approximation is reasonable. It is clear from Figure 1 that the approximation \nis reasonable for about 20% of the points on the arc in 2 dimensions. 1 As the \ndimensionality of the data space increases, the tangent region in Figure 1 expands \nasymptotically to fill the entire space and thus the approximation becomes more \nvalid. 
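The concentration behind Equation 2 is easy to check numerically. The sketch below is our own illustration, not part of the paper (standard library Python; the helper names are ours): it draws points uniformly from the unit cube and measures how tightly the ratio of the Euclidean length to the scaled city block length concentrates.

```python
import math
import random

def city_block(v):
    # l1-norm of a vector
    return sum(abs(x) for x in v)

def euclidean(v):
    # l2-norm of a vector
    return math.sqrt(sum(x * x for x in v))

def ratio_spread(n, trials=2000, seed=0):
    """Coefficient of variation of euclidean(v) / (city_block(v) / sqrt(n))
    over points drawn uniformly from the n-dimensional unit cube.
    A small spread means one scale factor fits almost every vector."""
    rng = random.Random(seed)
    ratios = []
    for _ in range(trials):
        v = [rng.random() for _ in range(n)]
        ratios.append(euclidean(v) * math.sqrt(n) / city_block(v))
    mean = sum(ratios) / trials
    var = sum((r - mean) ** 2 for r in ratios) / trials
    return math.sqrt(var) / mean
```

At n = 256 the spread is roughly one percent, so a single scale factor maps the city block length onto the Euclidean length for almost all sampled vectors, while at n = 4 it is an order of magnitude larger.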
Below we examine how accurate this approximation is and how we can use it with the Ni1000, particularly in the MLP context. Given a set of vectors, V, all with equal city block length, we measure the accuracy of the approximation by the ratio of the variance of the Euclidean lengths in V to the squared mean Euclidean length in V. If the ratio is low, then the approximation is good and all we must do is scale the city block length to the mean Euclidean length to get a good fit.^2 In particular, it can be shown that, assuming all the vectors of the space are equally likely, the following equation holds [Perrone, 1993]:

    $\sigma_n^2 < \left( a_n^2 \left( \tfrac{1}{2} n + 1 \right) - 1 \right) \mu_{\mathrm{lower}}^2$    (3)

where $n$ is the dimension of the space; $\mu_n$ is the average Euclidean length of the set of vectors with fixed city block length $S$; $\sigma_n^2$ is the variance about the average Euclidean length; $\mu_{\mathrm{lower}}$ is the lower bound for $\mu_n$ and is given by $\mu_{\mathrm{lower}} \equiv a_n S / \sqrt{n}$;

^1 In fact, approximately 20% of the points are within 1% of each other and 40% of the points are within 5% of each other.

^2 Note that in Equation 2 we scale by $1/\sqrt{n}$. For high dimensional spaces this is a good approximation to the ratio of the mean Euclidean length to the city block length.

Figure 1: Two dimensional interpretation of the city block approximation. The circle corresponds to all of the vectors with the same Euclidean length. The inner square corresponds to all of the vectors with city block length equal to the Euclidean length of the vectors on the circle. The outer square (tangent to the circle) corresponds to the set of vectors over which we will be making our approximation. In order to scale the outer square to the inner square, we multiply by $1/\sqrt{n}$, where $n$ is the dimensionality of the space. The outer square approximates the circle in the regions near the tangent points.
In high dimensional spaces, these tangent regions approximate a large portion of the total hypersphere and thus the city block distance is a good approximation along most of the hypersphere.

and $a_n$ is defined by

    $a_n = \frac{n-1}{n+1} \sqrt{\,1 + \frac{e\,n}{2\pi(n-1)} \left( \frac{2}{n+1} \right)^{2}\,}$    (4)

From this equation we see that the ratio of $\sigma_n^2$ to $\mu_{\mathrm{lower}}^2$ in the large $n$ limit is bounded above by 0.4. This bound is not very tight due to the complexity of the calculations required; however, Figure 3 suggests that a much tighter bound must exist. A better bound exists if we are willing to add a minor constraint to our high dimensional space [Perrone, 1993]. In the case in which each dimension of the vector is constrained such that the entire vector cannot lie along a single axis,^3 we can show that

    $\sigma_n^2 \lesssim \frac{2(n-1)}{(n+1)^2} \left( \sqrt{\frac{n}{S}} - 1 \right)^{2} \mu_{\mathrm{lower}}^2$    (5)

where $S$ is the city block length of the vector in question. Thus in this case, the ratio of $\sigma_n^2$ to $\mu_{\mathrm{lower}}^2$ decreases at least as fast as $1/n$, since $n/S$ will be some fixed constant independent of $n$.^4 This dependency on $n$ and $S$ is shown in Figure 2. This result suggests that the approximation will be very accurate for many real-world pattern recognition tasks, such as speech and high resolution image recognition, which can typically have thousands or even tens of thousands of dimensions.

^3 For example, when the axes are constrained to be in the range [0, 1] and the city block length of the vector is greater than 1. Note that this is true for the majority of the points in an n-dimensional unit hypercube.

^4 Thus the accuracy improves as $S$ increases towards its maximum value.
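The accuracy measure of Section 2 (variance of the Euclidean lengths over the squared mean, at fixed city block length) can also be estimated by direct simulation. The sketch below is our own illustration, not part of the paper: it samples the $\ell_1$-sphere uniformly via the standard normalized-exponentials construction, and the function names are ours.

```python
import math
import random

def sample_l1_sphere(n, s, rng):
    """Uniform sample from the l1-sphere |v| = s: normalized exponential
    magnitudes (the Dirichlet trick) with independent random signs."""
    mags = [rng.expovariate(1.0) for _ in range(n)]
    total = sum(mags)
    return [rng.choice((-1.0, 1.0)) * s * m / total for m in mags]

def length_variance_ratio(n, s=1.0, trials=3000, seed=0):
    """Variance of the Euclidean length divided by the squared mean
    Euclidean length, over vectors of fixed city block length s."""
    rng = random.Random(seed)
    lengths = [math.sqrt(sum(x * x for x in sample_l1_sphere(n, s, rng)))
               for _ in range(trials)]
    mean = sum(lengths) / trials
    var = sum((l - mean) ** 2 for l in lengths) / trials
    return var / mean ** 2
```

The estimated ratio falls rapidly as n grows, consistent with the 1/n trend of Equation 5.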
Figure 2: Plot of $\sigma_n / \mu_{\mathrm{lower}}$ vs. $n$ for constrained vectors with varying values of $S/n$ ($S/n$ = 0.025, 0.05, 0.1, 0.2 and 0.3). As $S$ grows the ratio shrinks and consequently accuracy improves. If we assume that all of the vectors are uniformly distributed in an $n$-dimensional unit hypercube, it is easy to show that the average city block length is $n/2$ and the variance of the city block length is $n/12$. Since $S/n$ will generally be within one standard deviation of the mean, we find that typically $0.2 < S/n < 0.8$. We can use the same analysis on binary valued vectors to derive similar results.

We explore this phenomenon further by considering the following Monte Carlo simulation. We sampled 200000 points from a uniform distribution over an $n$-dimensional cube. The Euclidean distance of each of these points to a fixed corner of the cube was calculated and all the lengths were normalized by the largest possible length, $\sqrt{n}$. Histograms of the resulting lengths are shown in Figure 3 for four different values of $n$. Note that as the dimension increases, the variance about the mean drops. From Figure 3 we see that for as few as 100 dimensions, the standard deviation is approximately 5% of the mean length.

3 The City Block MLP

In this section we describe how the approximation explained in Section 2 can be used by the Ni1000 to implement MLPs in parallel. Consider the following formula for the dot product

    $\vec{x} \cdot \vec{y} = \frac{1}{4} \left( \|\vec{x} + \vec{y}\|^2 - \|\vec{x} - \vec{y}\|^2 \right)$    (6)

Figure 3: Probability distributions for randomly drawn lengths (2, 10, 100 and 1000 dimensions; horizontal axis is the normalized length). Note that as the dimension increases, the variance about the mean length drops.

where $\|\cdot\|$ is the Euclidean length (i.e. the $\ell_2$-norm).^5 Using Equation 2, we can approximate Equation 6 by

    $\vec{x} \cdot \vec{y} \approx \frac{1}{4n} \left( |\vec{x} + \vec{y}|^2 - |\vec{x} - \vec{y}|^2 \right)$    (7)

where $n$ is the dimension of the vectors and $|\cdot|$ is the city block length. The advantage of the approximation in Equation 7 is that it can be implemented in parallel on the Ni1000 while still behaving like a dot product. Thus we can use this approximation to implement MLPs on an Ni1000. The standard functional form for MLPs is given by [Rumelhart et al., 1986]

    $h_k(\vec{x}; \alpha, \beta) = \sigma\!\left( \alpha_{0k} + \sum_{j=1}^{N} \alpha_{jk}\, \sigma\!\left( \beta_{0j} + \sum_{i=1}^{d} \beta_{ij} x_i \right) \right)$    (8)

where $\sigma$ is a fixed ridge function chosen to be $\sigma(x) = (1 + e^{-x})^{-1}$; $N$ is the number of hidden units; $k$ is the class label; $d$ is the dimensionality of the data space; and $\alpha$ and $\beta$ are adjustable parameters. The alternative which we propose, the City Block MLP, is given by [Perrone, 1993]

    $g_k(\vec{x}; \alpha, \beta) = \sigma\!\left( \alpha_{0k} + \sum_{j=1}^{N} \alpha_{jk}\, \sigma\!\left( \beta_{0j} + \frac{1}{4n} \Bigl( \sum_{i=1}^{d} |\beta_{ij} + x_i| \Bigr)^{2} - \frac{1}{4n} \Bigl( \sum_{i=1}^{d} |\beta_{ij} - x_i| \Bigr)^{2} \right) \right)$    (9)

^5 Note also that depending on the information available to us, we could use either $\vec{x} \cdot \vec{y} = \frac{1}{2} \left( \|\vec{x} + \vec{y}\|^2 - \|\vec{x}\|^2 - \|\vec{y}\|^2 \right)$ or $\vec{x} \cdot \vec{y} = \frac{1}{2}$
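Equations 7 and 9 can be sketched in a few lines. The code below is our own illustration, not part of the paper (standard library Python; the function names are ours): approx_dot implements the city block dot product of Equation 7, and city_block_mlp evaluates one output of Equation 9, in which only city block sums would need to run on Ni1000-style hardware.

```python
import math
import random

def approx_dot(x, y):
    """City block approximation to the dot product (Equation 7):
    x . y  ~  (|x + y|^2 - |x - y|^2) / (4n), with |.| the l1-norm."""
    n = len(x)
    plus = sum(abs(a + b) for a, b in zip(x, y))
    minus = sum(abs(a - b) for a, b in zip(x, y))
    return (plus ** 2 - minus ** 2) / (4.0 * n)

def sigmoid(t):
    # the ridge function of Equation 8
    return 1.0 / (1.0 + math.exp(-t))

def city_block_mlp(x, alpha, beta):
    """Forward pass of the City Block MLP (Equation 9) for one class k:
    alpha = [alpha_0k, alpha_1k, ..., alpha_Nk];
    beta[j] = [beta_0j, beta_1j, ..., beta_dj] for hidden unit j."""
    hidden = [sigmoid(row[0] + approx_dot(row[1:], x)) for row in beta]
    return sigmoid(alpha[0] + sum(a * h for a, h in zip(alpha[1:], hidden)))
```

The two $\ell_1$ sums per hidden unit are exactly the distance computations the Ni1000 parallelizes; the squaring and subtraction are cheap scalar post-processing.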