{"title": "Bat-G net: Bat-inspired High-Resolution 3D Image Reconstruction using Ultrasonic Echoes", "book": "Advances in Neural Information Processing Systems", "page_first": 3720, "page_last": 3731, "abstract": "In this paper, a bat-inspired high-resolution ultrasound 3D imaging system is presented. Live bats demonstrate that the properly used ultrasound can be used to perceive 3D space. With this in mind, a neural network referred to as a Bat-G network is implemented to reconstruct the 3D representation of target objects from the hyperbolic FM (HFM) chirped ultrasonic echoes. The Bat-G network consists of an encoder emulating a bat's central auditory pathway, and a 3D graphical visualization decoder. For the acquisition of the ultrasound data, a custom-made Bat-I sensor module is used. The Bat-G network shows the uniform 3D reconstruction results and achieves precision, recall, and F1-score of 0.896, 0.899 and 0.895, respectively. The experimental results demonstrate the implementation feasibility of a high-resolution non-optical sound-based imaging system being used by live bats. The project web page (https://sites.google.com/view/batgnet) contains additional content summarizing our research.", "full_text": "Bat-G net: Bat-inspired High-Resolution 3D Image\n\nReconstruction using Ultrasonic Echoes\n\nGunpil Hwang\u2217, Seohyeon Kim\u2217, and Hyeon-Min Bae\n\nSchool of Electrical Engineering\n\nKorea Advanced Institute of Science and Technology\n\nDaejeon, South Korea\n\n{gphwang, dddokman, hmbae}@kaist.ac.kr\n\nAbstract\n\nIn this paper, a bat-inspired high-resolution ultrasound 3D imaging system is pre-\nsented. Live bats demonstrate that the properly used ultrasound can be used to\nperceive 3D space. With this in mind, a neural network referred to as a Bat-G\nnetwork is implemented to reconstruct the 3D representation of target objects from\nthe hyperbolic FM (HFM) chirped ultrasonic echoes. 
The Bat-G network consists of an encoder emulating a bat's central auditory pathway and a 3D graphical visualization decoder. For the acquisition of the ultrasound data, a custom-made Bat-I sensor module is used. The Bat-G network shows uniform 3D reconstruction results and achieves a precision, recall, and F1-score of 0.896, 0.899, and 0.895, respectively. The experimental results demonstrate the implementation feasibility of a high-resolution non-optical sound-based imaging system like the one used by live bats.\nThe project web page (https://sites.google.com/view/batgnet) contains additional content summarizing our research.\n\n1\n\nIntroduction\n\nRecent improvements in sensor and information processing technologies have made significant contributions to the progress of numerous unmanned systems (UMS) such as drones, autonomous vehicles, and robots. In order for a UMS to reach a fully autonomous level that requires no human intervention, the data collected from its sensors must suffice to handle all environmental scenarios. Therefore, UMS commonly employ a combination of complementary sensors including RGB-D cameras, RADARs, LIDARs, and ultrasonic sensors.\nBoth RGB-D cameras and LIDARs provide abundant high-resolution visual information; however, their visibility and accuracy can be severely compromised depending on environmental/weather conditions, as shown in Fig. 1(a). In contrast, RADARs and conventional ultrasonic sensors, which measure the time-of-flight of the reflected signal, are relatively less sensitive to operating circumstances but merely provide low-resolution ranging information [1, 2]. Consequently, a clear need exists for an imaging sensor that can precisely visualize 3D space irrespective of environmental conditions.\nIn this paper, a high-resolution ultrasound 3D imaging system emulating the echolocation mechanism of a live bat is presented. 
Among the many outstanding features of a bat's sensory system that enable accurate 3D perception, the following three key points are essential: (1) Bats localize obstacles and discriminate prey by analyzing the echoes of emitted ultrasound pulses, a process called echolocation [3–5]. The ultrasound signal emitted by a bat is frequency-chirped over a wide frequency range, which plays a critical role in recognizing the shape of an object from the echo spectra [6]. Hence, the proposed system employs a frequency-chirped broadcasting ultrasound signal in the range of 20-120 kHz as shown in Fig. 1(b). (2) Echolocation is an inverse problem in which the spatial information of a target is extracted from reflected/scattered echoes.\n\n∗Equal contribution\n\n33rd Conference on Neural Information Processing Systems (NeurIPS 2019), Vancouver, Canada.\n\n\fFigure 1: (a) Measurement results of a camera, an RGB-D camera, and an ultrasound (US) sensor at different brightness/fog levels. (b) Overview of a bat-inspired high-resolution ultrasound 3D imaging system.\nIn general, solving an inverse problem is an extremely laborious and time-consuming task since the problem is often ill-posed and requires several iterations [7]. However, live bats, with a biological neural network, recognize their surroundings in real time. From this, we infer that the echolocation problem can be solved efficiently with the help of an artificial neural network. Therefore, we designed a feed-forward neural network, referred to as a bat-inspired graphical visualization (Bat-G) network, to inversely reconstruct a 3D image from the collected ultrasound data. (3) The sensory-to-image conversion of bats involves the neural interactions between the nuclei on the central auditory pathway (through the brainstem and the midbrain) and the auditory cortex (AC). 
From the sensory input, it is believed that the auditory nuclei extract temporal and spectral features needed for echolocation and then pass them to the AC through monaural, binaural, ipsilateral, and/or contralateral connections. The architecture of the Bat-G net is heavily inspired by the neuroanatomical auditory pathway of bats.\n\n2 Related Work\n\nOver the past decades, airborne ultrasonic sensors have been widely used for range detection. These sensors emit a single-frequency ultrasonic signal and calculate the distance to the object in a 2D horizontal plane by measuring the time-of-flight (TOF) of echoes reflected from the object. Recently, there have been attempts to localize/classify a target object and/or reconstruct the shape of an object, as shown in Table 1. A series of 3D localization strategies have been explored, which include the calculation of the TOF difference between two pairs of microphones [8] and the reception of signals from a designated direction in 3D space using a beamforming (BF) technique [9]. In [10], a biomimetic sonar system performing spectrum-based 3D localization is proposed. 
Another line of research is the classification of target objects by combining different classification parameters such as the angles/distances between the 3D sensor array and an object [11], or by utilizing 16 TOF vectors (4 TXs and 4 RXs) processed by means of principal component analysis (PCA) [12]. However, such techniques, relying on a lookup table for the classification, distinguish only simple objects such as planes, corners, and edges. On the other hand, [13] has attempted to categorize cubes and tetrahedrons by analyzing the spectrum of echoes with the help of a neural network (NN). However, the approach has yet to demonstrate the real potential of NN methods due to limited datasets alongside the rudimentary NN structure.\n\nTable 1: Summary of Related Works\n\n                 [8]    [9]    [10]    [11]    [12]    [13]    [14]    [15]    [16]    This Work\n3D localization   ✓      ✓      ✓       ✓       ✓       ✗       ✓       ✓       ✓       ✓\nClassification    ✗      ✗      ✗       ✓       ✓       ✓       ✓       ✓       ✓       ✓\nReconstruction    ✗      ✗      ✗       ✗       ✗       ✗       ✓       ✓       ✓       ✓\nTX / RX          1/4    1/32    1/2     3/3     4/4     1/1    1/400    1/64    5/3     1/4\nMeasurement       ✓      ✓      ✓       ✓       ✓       ✓       ✓       ✓       ✗       ✓\nMethod1           T      BF    SC,BM   T,AC    T,PCA   NN,BM   SA,BF   NN,HG   T,CS    T,NN,BM\n\nNOTE: 1 T: Time difference of arrival, BF: Beamforming, SC: Spectral Cues, BM: Biomimetics, AC: Angle Change, PCA: Principal-Component-Analysis, NN: Neural Network, SA: Synthetic Aperture, HG: Holography, CS: Compressive Sensing\n\n\fFigure 2: (a) Operational block diagram of a bat's cochlear block. (b) Illustration of the operation of a bat's temporal cue analysis (TCA) block. (c) Fine delay determination mechanism of a bat's spectral cue analysis (SCA) block.\n
Besides the 3D localization and classification of target objects, many efforts have been made to solve the ill-posed inverse problem of reconstructing the 3D shape of an object from the received echoes. Such attempts adopted either BF [14] or holography [15] techniques with large TRX arrays. The compressive sensing (CS) technique, a subset of the inverse problem approach, has also been tried in a simulation domain with a few cuboids, considering the sparse property of the scenes [16]. However, such inverse problem approaches require tremendous computation power and time to process the incoming data from such large arrays. In this paper, a feed-forward Bat-G network is proposed to solve the ill-posed 3D ultrasonic inverse problem. The proposed network reconstructs the 3D representation of diverse objects from measured 4-channel ultrasonic signals.\n\n3 Preliminaries\n\nIn order to understand the 3D spatial perception mechanism of the bat-inspired imaging (Bat-I) sensor, it is essential to understand the structure of a bat's auditory system, which comprises three principal components: the cochlear block and the temporal/spectral cue analysis blocks.\n\n3.1 Cochlear Block\n\nThe position-dependent frequency selectivity of the basilar membrane in the bat's cochlea can be modeled by sharply tuned band-pass filters (BPFs), as described in Fig. 2(a). These filters are typically modeled by 81 parallel constant-bandwidth, 10th-order Butterworth IIR filters whose center frequencies (fc) are hyperbolically spaced in the range of 20-100 kHz. The transmission process linking the excitation of hair cells to the primary auditory neurons through synapses is modeled by half-wave rectification followed by low-pass filtering (LPF) at the output of each of the 81 BPFs [17, 18]. 
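The cochlear front end just described (a bank of sharply tuned BPFs followed by half-wave rectification and low-pass filtering) can be sketched as follows. This is a minimal illustration rather than the authors' implementation; the constant filter bandwidth (here 2 kHz) and the low-pass cutoff (here 5 kHz) are assumed values not specified in the text.

```python
import numpy as np
from scipy.signal import butter, sosfilt

FS = 750_000  # sampling rate of the recorder (750 kSample/s)

def hyperbolic_centers(f_lo=20e3, f_hi=100e3, n=81):
    """Hyperbolically spaced center frequencies: 1/fc is linear in channel index."""
    return 1.0 / np.linspace(1.0 / f_lo, 1.0 / f_hi, n)

def cochleagram(x, fs=FS, bandwidth=2e3, lpf_cut=5e3):
    """81 parallel 10th-order Butterworth BPFs, half-wave rectification, then LPF."""
    lpf = butter(4, lpf_cut, btype="low", fs=fs, output="sos")
    channels = []
    for fc in hyperbolic_centers():
        # butter(5, ...) with btype="bandpass" yields a 10th-order band-pass filter
        bpf = butter(5, [fc - bandwidth / 2, fc + bandwidth / 2],
                     btype="bandpass", fs=fs, output="sos")
        y = sosfilt(bpf, x)
        y = np.maximum(y, 0.0)            # half-wave rectification (hair-cell model)
        channels.append(sosfilt(lpf, y))  # envelope extraction
    return np.stack(channels)             # (81, len(x)) time-frequency image
```

Feeding a pure tone concentrates energy in the channel whose center frequency is nearest the tone, mimicking the place coding of the basilar membrane.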
As a result, the emitted/received sound signal is decomposed into 81 band-pass filtered signals, and the subsequent rectifier and LPF extract the amplitude (or power) of the signals. Consequently, this process produces a time-frequency representation of the acoustic time-domain signal, which is analogous to the spectrogram.\n\n3.2 Temporal/Spectral Cue Analysis (TCA/SCA)\n\nThe TCA block measures the elapsed time between the emitted sound signal and its echoes over each repetitive emission time. Delay-tuned neurons operate as tapped delay lines in each frequency channel, as described in Fig. 2(b). The emitted and echo signals travel along these delay lines sequentially. Coincidence detection neurons in multiple channels detect the coexistence of the emitted signal and the echo signal in each tapped delay line. The activated position of the tapped delay lines determines the delay of the echo. When the number of activated channels exceeds the threshold, the location of the target is declared [17].\nA fine delay, caused by overlapping echoes reflected from two nearby glints, is unresolvable through direct delay measurements by the TCA block [18]. These fine delays are resolved by the SCA block, which analyzes spectral cues such as notches and nulls. Assuming for simplicity that only two glints exist and that the echoes r(t) from the two glints are reflected back with the same magnitude A but with different delays τ1 and τ2, the received signal s(t) is given by\n\ns(t) = Ar(t − τ1) + Ar(t − τ2), (1)\n\nwhere t denotes time. The frequency spectrum S(f) of s(t) can be written as\n\nS(f) = A · R(f)e^(−j2πfτ1)[1 + e^(−j2πf(τ2−τ1))], (2)\n\nwhere f and R(f) are the frequency and the frequency spectrum of the individual echo r(t), respectively. Bats transform the spectral information into a time-delay domain, as shown in Fig. 
2(c), by summing up the |S(fk)|-weighted basis only when the magnitude of the frequency spectrum |S(fk)| of the k-th frequency channel exceeds the threshold Sthr [19, 17, 20–22], namely\n\nxbat(τ) = Σ_{k=1}^{N} |S(fk)| · cos(2πfkτ) if |S(fk)| > Sthr, (3)\n\nwhere xbat(τ) and fk denote the time-delay representation of the bat and the center frequency of the k-th channel, respectively, as shown in Fig. 2(c) [17]. The fine delay is eventually determined by finding the location of the peaks in the time-delay representation. Furthermore, a target can be considered as an object containing several glints and reflecting surfaces [23–26]. Echoes reflected from these glints contribute to the spectral cues of an echo [27]. That is, the shape of a target is expressed with a unique spectral fingerprint. Bats are known to use these spectral signatures to recognize the shape of a target [18, 28, 29]. Consequently, the sophisticated pattern recognition of the spectral cues is central to the spatial perception mechanism.\n\n4 Data Acquisition\n\nThe bat-inspired imaging (Bat-I) sensor (see Fig. 1(b)) emits broadband FM signals and records echoes reflected from the target object. The recorded data, transformed into spectrograms, are fed into the Bat-G network for training, and the network eventually infers the object's 3D representation. In order to train the network, we have adopted a supervised learning algorithm and created a 4-channel ultrasound echo dataset, ECHO-4CH (49 k data for training and 2.6 k data for evaluation). Each echo datum consists of eight spectrograms (256² grayscale images) and one 3D ground-truth label (64³ voxels).\n\n4.1 Data (4-channel Ultrasound Echo)\n\nSystem Setup The ultrasonic electrostatic speaker (UES) (see Fig. 
3(a)), placed at the center of the sensing module, broadcasts the ultrasonic chirp in the frequency range of 20-120 kHz with a maximum power of 78 dB SPL at 1 m. The UES is driven by a class AB speaker driver with a maximum power of 10 W. Four ultrasound condenser microphones (UCMs) are placed to the right of, to the left of, above, and below the UES with a separation of 6 cm. The UCMs have a broad and flat frequency response in 20-150 kHz with less than 6 dB of attenuation. The recorder amplifies the received signals from the UCMs with a maximum gain of 40 dB and digitizes them at a sampling rate of 750 kSample/s.\n\nBroadcasting Signal Bats use a hyperbolic frequency-modulated (HFM) chirp containing multiple harmonics, which has a pulse-compression effect that increases the spatial resolution as well as the receiver sensitivity, assuring robust performance in environments with heavy reverberation [30–35]. Compared to the linear FM chirp, the HFM chirp is less sensitive to the frequency shifts caused by the movement of subjects because of its Doppler tolerance [36]. The waveform of the HFM chirp\n\n\fFigure 3: (a) Custom-made bat-inspired imaging (Bat-I) sensor for ultrasound data acquisition. (b) Polar plot of the signal-to-noise ratio (SNR) of a sphere and a triangular pyramid. (c) The region-of-interest (ROI) pooling of the raw echo data. 
(d) Conversion of the echo signal into two spectrograms.\nxHFM with pulse duration THFM, chosen as the broadcasting signal format in our Bat-I sensor, can be expressed as\n\nxHFM(t) = A(t) · sin[(2π/ξ) ln(1 + ξf1t)], 0 ≤ t ≤ THFM, (4)\n\nwhere ξ = (f1 − fN)/(f1fN THFM), f1 and fN are the first and the last carrier frequencies, respectively, and A(t) denotes a rectangular function given by A(t) = a · rect[(t − THFM/2)/THFM] [37]. The selected HFM parameters are a = 0.3, THFM = 6 ms, f1 = 120 kHz, and fN = 20 kHz.\n\nObjects and Data Acquisition We have chosen 16.2 k geometric object configurations (such as cubes, cones, spheres, and so on) as shown in Fig. 1(b), and created the objects using building blocks and a 3D printer. The geometric objects are randomly placed in a 64³ cm³ space at a distance of 1.48 m from the Bat-I sensor. Each target object is measured five times to desensitize the network to ambient noise (e.g. noise from electronic equipment, footsteps, voices, and so forth). We eventually acquired 81 k measured echo data.\n\nData Processing (1) Thresholding - An object reflects a limited portion of ultrasound energy back to the UCM. The backscattered power ∫0^Ts |xr,i(t)|²dt with scan duration Ts received by the i-th UCM of a RADAR/SONAR system is\n\n∫0^Ts |xr,i(t)|²dt = ∫0^THFM |xHFM(t)|²dt · GtArσe^(−2αRi) / [(4π)²Ri⁴], i = 1, 2, ..., 4 (5)\n\nwhere ∫0^THFM |xHFM(t)|²dt and Gt are the power of the transmitted HFM chirp and the gain of the UES, respectively [38, 39]. Ar is the effective area of the UCM, σ is the sonar cross section (SCS), α is the atmospheric attenuation constant, and Ri is the distance from the UES/i-th UCM to the object. The SCS depends on the object's geometric shape and the orientation of the ultrasound source. 
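For illustration, the HFM broadcasting signal of Eq. (4) can be synthesized directly from the stated parameters (a = 0.3, THFM = 6 ms, f1 = 120 kHz, fN = 20 kHz) at the recorder's 750 kSample/s rate; a minimal sketch, not the authors' signal-generation code:

```python
import numpy as np

def hfm_chirp(a=0.3, t_dur=6e-3, f1=120e3, fn=20e3, fs=750e3):
    """Hyperbolic FM chirp of Eq. (4): x(t) = a*sin[(2*pi/xi)*ln(1 + xi*f1*t)]."""
    xi = (f1 - fn) / (f1 * fn * t_dur)   # sweep constant
    t = np.arange(int(t_dur * fs)) / fs
    return a * np.sin((2 * np.pi / xi) * np.log(1.0 + xi * f1 * t))

# Differentiating the phase gives the instantaneous frequency
# f(t) = f1 / (1 + xi*f1*t), a hyperbolic sweep from f1 at t = 0
# down to fn at t = t_dur.
```

The hyperbolic form of f(t) is what gives the chirp the Doppler tolerance mentioned in the text: a uniform time scaling of the echo shifts the sweep along itself rather than destroying the matched-filter correlation.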
In case an object has a small SCS (e.g. most of the reflective surfaces of the object cause specular reflection), the SNR of the received signal drops below the minimum detectable SNR threshold (see Fig. 3(b)). As such, we have constructed training datasets with only reliable data that meet the following threshold criterion:\n\nXthr = {x^k_r,i(t) | ∆dB[x^k_r,∀(t)] > −6 dB}, 0 ≤ t ≤ Ts, i = 1, 2, ..., 4 (6)\n\nwhere the object-to-sphere power ratio (OSPR) of the k-th measured data of the i-th UCM is ∆dB[x^k_r,i(t)] = 10 log[∫0^Ts |x^k_r,i(t)|²dt / ∫0^Ts |xsph(t)|²dt], and ∫0^Ts |xsph(t)|²dt is the backscattered power of an isotropic sphere (radius = 9 cm).\n(2) Pooling - The 70.5 k-Sample data recorded by the Bat-I sensor covers a scan depth of 17.17 m (the speed of sound c = 343.42 m/s), as described in Fig. 3(c). Processing the raw data requires large computational resources. In order to reduce the input dimension, preliminary information reflecting the fact that an object is placed at a distance of 1.48 m ± 32 cm is considered, and only the 2.8 k-Sample data covering the region-of-interest (ROI) are used. This process reduces the input dimension by 98 %. The ROI pooling can be expressed as\n\nXroi = {´x^k_r,i(t) = x^k_r,i(t + T1) | x^k_r,i ∈ Xthr}, 0 ≤ t ≤ T2 − T1 + τir, i = 1, 2, ..., 4 (7)\n\nwhere T1 and T2 are the start and the end time of the data covering the ROI, and τir is the length of the intrinsic transient response of the UES/UCM.\n\n\fFigure 4: Simplified diagram of the anatomical connections of a bat's auditory system and architecture of the proposed Bat-G network (BN: batch normalization [40], ReLU: rectified linear unit [41]).\n(3) Spectrogram – Bat-G net includes two pathways primarily processing temporal or spectral cues similar to a bat's central auditory pathway (see section 5). 
In order to feed the appropriate signal to each path, the recorded signal is converted into two high-resolution spectrograms (see Fig. 3(d)) produced by the short-time Fourier transform (STFT) with a short/long Hamming window ωs/ωl (33-µs/133-µs window size with 22-µs/90-µs overlap), namely\n\nXsp = {|STFT[´x^k_r,i(t)](ωs)|², |STFT[´x^k_r,i(t)](ωl)|² | ´x^k_r,i(t) ∈ Xroi}, 0 ≤ t ≤ T2 − T1 + τir, i = 1, 2, ..., 4. (8)\n\nAs the two generated spectrograms differ in size, they are resized to 256². As a result, we have gathered 51.6 k data, each composed of eight spectrograms.\n\n4.2 Labels (3D Ground-truth Model)\n\nEach target object of the gathered data is modeled in 3D CAD and voxelized with dimensions of 64³ (voxel size of 1³ cm³). As the acoustic reflection coefficient at the interface between the air and the solid object material is close to one, the field of view (FoV) of the UCM is limited to the front view of the target objects. Therefore, shaded regions, from the back of the object to the end of the ROI, are padded with ones.\n\n5 Architecture of Proposed Bat-G Network\n\nIn this section, the architecture of the proposed Bat-G network, which analyzes the 4-channel ultrasonic echoes and inversely reconstructs the 3D representation of the target objects, is presented. The network consists of two components: (1) a neural encoder that emulates a bat's central auditory pathway and (2) a 3D rendering decoder that is inspired by the expansive path of the U-net [42] without any concatenation from the contracting path.\n\n5.1 Encoder\n\nThe 3D perception mechanism of the FM bats described in section 3 involves the neural interactions between the auditory pathway (through the brainstem and the midbrain) and the auditory cortex (AC). Fig. 
4 depicts the simplified anatomical connections of a bat's auditory system (reconstructed from [43, 44]). The system consists of four main blocks2: (a) the VCN, where cells (e.g. the bushy and the octopus cells) play an important role in extracting the timing information from the auditory nerve, and the DCN, where the principal neurons, including the fusiform cells, perform non-linear spectral analysis considering the location of the head and ears [46]; (b) the SOC (MSO and LSO), which calculates the interaural differences in time and intensity, contributing to sound source localization; (c) the NLL and IC, where organized auditory information and the auditory nerves from the peripheral brainstem nuclei converge; and (d) the AC and PFC, which convert the integrated auditory features into a unified image. The architecture of the proposed Bat-G network emulates two features of a bat's auditory system.\n(1) Spectral/Temporal-Cue Dominant Path - Some neurons are sensitive to temporal- (time) or spectral- (frequency) domain information. These neurons form a nucleus, a cluster of neurons. Each nucleus intensively extracts domain-specific features depending on the nature of the neurons that make up the cluster. We constructed the front cluster of layers employing deformable convolution layers [47], which adjust the receptive field according to the pattern of the temporal/spectral cues. In addition, the network pathway is divided into two paths that dominantly process either the temporal or the spectral cues of the input spectrogram.\n(2) Biomimetic Connections - The nuclei directly or indirectly receive monaural, binaural, ipsilateral, or contralateral signals from the lower auditory nuclei. In terms of network implementation, each ultrasonic echo spectrogram with a short/long window from the four recording channels (right, left, up, and down) is monaurally processed at the corresponding L1-CN, inspired by the CN, as shown in Fig. 4. 
The output feature-maps of L1-CN, L2-SOC1, or L3-SOC2 are binaurally concatenated and then fed into the one-step-deeper layers (highlighted by blue arrows). The feature maps are simultaneously transmitted to deeper layers (>2) via direct connections with successive stride-4 4 x 4 max pooling (highlighted by red arrows). L4-NLL/IC integrates the products of all layers and then forwards the results (4 x 4 dimension vectors with 4096 feature maps) to the 3D visualization decoder. The detailed structure (see Fig. 4) of each layer is as follows. The first three layers (L1-CN, L2-SOC1, and L3-SOC2) are implemented with two 3 x 3 deformable convolutions (dilation factor of 1, 3 x 3 offset, and "same" padding) followed by batch normalization (BN) [40] and rectified linear unit (ReLU) activation [41]. L4-NLL/IC consists of two successive conventional convolutions (3 x 3 kernels, and "same" padding with BN and ReLU).\n\n5.2 Decoder\n\nA 3D inverse rendering decoder projects the output data of the encoder, lying on a low-dimensional manifold, into the volumetric 3D image in the R^(64×64×64) vector space (see Fig. 4). A fully-connected (FC) layer, applied to the 4 x 4 pixel inputs encoded with 4096 feature maps, has 4096 hidden units. The output of the FC layer is reshaped into a 3D vector domain of R^(4×4×4) with 1024 feature maps. Then, the 3D vector passes through three 3D convolution transpose layers, each composed of one 3D convolution transpose (or deconvolution) layer (stride-2 2 x 2 x 2 or stride-4 4 x 4 x 4 kernels, and "same" padding with ReLU) and two 3D convolution layers (3 x 3 x 3 kernels, and "same" padding with BN and ReLU). In order to convert a 16-feature vector into the desired representation, a 1 x 1 x 1 convolution layer is added as the final layer. The detailed structure of each layer is described in Fig. 
4.\n\n2Note that the acronyms of a bat's nerve nuclei are listed in this footnote for readability. CN: the cochlear nucleus (VCN: the ventral CN and DCN: the dorsal CN), SOC: the superior olivary nuclei, MSO: the medial superior olive, LSO: the lateral superior olive, NLL: the nucleus of the lateral lemniscus, IC: the inferior colliculus, PFC: the prefrontal cortex. [45]\n\n\fFigure 5: 3D reconstruction results of target objects when (a) the objects are composed of convex surfaces, (b) the objects have vertices, (c) the objects have a significantly small reflective area, and (d) the echo suffers from multiple diffusion reflections.\n6 Training\n\nThe Bat-G network is trained employing a supervised learning algorithm. The network is repeatedly fed with 49 k training data randomly selected from the ECHO-4CH dataset (51.6 k data). The learning objective is minimizing the 3D reconstruction loss L between the output ŷ = f(´xr) ∈ R^(64×64×64) of the 3D network f, where ´xr ∈ Xsp ∼ R^(8×256×256) is the input spectrogram, and the corresponding ground-truth label y ∈ R^(64×64×64). The loss function is implemented by employing an L2-regularization loss (regularization strength λ = 10^−6) and a cross-entropy loss with softmax activation S, which can be expressed as\n\nL(ŷ, y) = y log[S(ŷ)] + λ Σ_{i=1}^{m} ωi². (9)\n\nWe adopted the Adam optimization algorithm [48] (β1, β2, and ε are 0.9, 0.999, and 10^−8, respectively) with an exponential decay (learning rate, decay rate, and decay steps are 10^−4, 0.9, and 5 k, respectively) for better convergence. To reduce overfitting, dropout with a retention probability of 0.5 [49] is applied to the network during training. 
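The objective in Eq. (9) can be sketched per voxel as follows; this is a minimal NumPy illustration, not the authors' training code, assuming a two-class (empty/occupied) softmax at each voxel, writing the cross-entropy with its conventional negative sign, and treating ω as the vector of network weights being regularized:

```python
import numpy as np

def softmax(logits, axis=-1):
    z = logits - logits.max(axis=axis, keepdims=True)  # numerical stability
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)

def batg_loss(logits, y, weights, lam=1e-6):
    """Voxel-wise cross-entropy with softmax S plus L2 regularization (cf. Eq. (9)).

    logits:  (..., 2) raw scores per voxel (class 0 = empty, 1 = occupied)
    y:       (...)    binary ground-truth occupancy labels
    weights: flat vector of network weights for the L2 penalty
    """
    p = softmax(logits)                                 # S(y_hat)
    onehot = np.stack([1 - y, y], axis=-1)
    ce = -np.sum(onehot * np.log(p + 1e-12)) / y.size   # mean cross-entropy
    return ce + lam * np.sum(weights ** 2)              # + lambda * sum_i w_i^2
```

Logits favoring the correct occupancy at each voxel drive the cross-entropy term toward zero, while the λΣωi² term penalizes large weights (λ = 10⁻⁶ in the paper).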
The network is iteratively trained for 500 k steps on a GTX 1080 Ti GPU and a Threadripper 1900X CPU.\n\n7 Experimental Results\n\nWe first present a qualitative assessment of the 3D rendering results of the Bat-G network. We then quantitatively evaluate the 3D reconstruction performance based on precision, recall, and F1-score metrics. The Bat-G network is evaluated with the 2.6 k test data of the ECHO-4CH dataset.\n\n7.1 Qualitative Assessment\n\nFig. 5 shows the measured objects, the ground-truth labels, and the 3D reconstruction results of the Bat-G network presented in a 3D view and a third-angle projection. When a radiated ultrasonic chirp is reflected from the convex surfaces of target objects, the 3D representation of the measured objects is uniformly reconstructed, as shown in Fig. 5(a). It can be observed that the Bat-G network localizes the measured objects in 3D space and reconstructs the shapes of the objects by inferring based on test data. It is worth noting that the Bat-G network can reconstruct the 3D shapes of objects having vertices (Fig. 5(b)). The results presented in Fig. 5 clearly show that the Bat-G net is sensitive to both azimuth and elevation cues. Examples yielding slightly unsatisfactory outputs compared to those shown in Fig. 5(a)-(b) are presented in Fig. 5(c)-(d). From the reconstruction result in Fig. 5(c), it can be seen that the edge information of the measured objects is not fully retrieved since the reflective area seen by the Bat-I sensor is significantly small. Fig. 5(d) shows that the ultrasound echo reflected from area A is received through multiple diffusion reflection paths, while the reflected echo from area B is measured primarily through the direct path. 
As a result, the Bat-G net erroneously represented the shape of A because of the multiple diffusion reflections.\n\n\fFigure 6: (a) Precision, recall, and F1-score of the proposed Bat-G net and the stacked auto-encoder (SAE) employing the 4-, 2-, or 1-channel UCM input data. (b) Performance of the Bat-G network with/without the spectral/temporal-cue dominant path and/or the biomimetic connections.\n7.2 Quantitative Assessment\nAs the volumetric 3D ground-truth data is unbalanced (90 % of the labels are label 0), the accuracy is always estimated to be higher than 90 % even if the network infers all outputs as label 0. Therefore, we quantitatively assessed the performance based on the precision, recall, and F1-score metrics. The current state-of-the-art image reconstruction method using a neural network [50] demonstrates that an architecture composed of a conventional stacked auto-encoder (SAE) and FC layers can effectively learn a forward reconstruction method composed of two manifold transformations: (a) a diffeomorphism between the sensory input and a latent low-dimensional space and (b) a manifold mapping from the latent space to the output image. Therefore, such an SAE (with an FC layer) structure is employed as the baseline, while maintaining the number of parameters and layers equal to that of the Bat-G network for a fair comparison. The Bat-G net (4-channel UCM) achieves (see Fig. 6(a)) 0.896 in precision, 0.899 in recall, and 0.895 in F1-score, which are 3.0 %, 7.1 %, and 5.4 % increases over the SAE (4-channel UCM), respectively. In addition, the contribution of the number of UCMs is assessed. As the number of UCMs decreases, the performance of the Bat-G net deteriorates (10.9 % and 27.8 % drops in F1-score when the number of UCMs is reduced to two and one, respectively). This suggests that employing the 4-channel UCM data as the input is essential for the Bat-G net to reconstruct a 3D image that is sensitive to both azimuth and elevation cues. 
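The voxel-wise metrics reported above follow the standard definitions; the following small sketch (an illustration, not the authors' evaluation code) also shows why plain accuracy is uninformative on occupancy grids that are about 90 % empty:

```python
import numpy as np

def voxel_metrics(pred, gt):
    """Precision, recall, and F1 over binary occupancy grids (label 1 = occupied)."""
    pred, gt = pred.astype(bool), gt.astype(bool)
    tp = np.sum(pred & gt)    # occupied voxels correctly predicted
    fp = np.sum(pred & ~gt)   # empty voxels predicted as occupied
    fn = np.sum(~pred & gt)   # occupied voxels missed
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return precision, recall, f1

# On a 90 %-empty grid, predicting all zeros scores 90 % accuracy
# but precision = recall = F1 = 0, exposing the class imbalance.
```

Because the metrics are computed only over positive predictions and positive labels, they are insensitive to the dominant empty-voxel class.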
We also present ablation studies to validate the efficacy of the spectral/temporal-cue dominant path and the biomimetic connections emulating a bat's auditory pathways. Employing both the spectral and temporal pathways demonstrated the best performance, which means that the two pathways are complementary to each other (3.8 % or 14.6 % increases in F1-score over using only the spectral or the temporal pathway, respectively). When the biomimetic connections were removed, a performance degradation of 5.1 % was observed. The result shows that the nested biomimetic connections in the Bat-G net contribute significantly to extracting the essential features required for 3D image reconstruction from ultrasonic echoes. More information on the network structures used for the comparison and the ablation studies can be found in the supplementary material.\n\n8 Conclusion\nIn this study, a bat-inspired high-resolution 3D imaging system that can reconstruct the shape of target objects in 3D space using HFM ultrasonic echoes is presented. The proposed imaging system is composed of a Bat-G network and a Bat-I sensor, which are equivalent to the central-auditory-pathway/auditory-cortex and the nose/ear of the bat, respectively. The Bat-G net was implemented using an encoder extracting temporal/spectral features from the hyperbolic chirped ultrasonic echoes, and a decoder reconstructing the 3D representation of a target object from the extracted features. The network is trained using a supervised learning algorithm with custom-made datasets (ECHO-4CH). Through a range of experiments, we have shown that the proposed network can effectively reconstruct the shapes of 3D objects. This work clearly demonstrates the implementation feasibility of a high-resolution ultrasound 3D imaging system like the one used by live bats. 
It also marks a crucial step toward realizing an imaging sensor that can graphically visualize objects and their surroundings irrespective of environmental conditions, unlike conventional electromagnetic-wave-based imaging systems.

Acknowledgments

The authors would like to thank Gain Kim, Soon-Won Kwon, and Sejun Jeon for their thoughtful comments on the manuscript. We thank all anonymous reviewers for their constructive feedback.

References

[1] Matti Kutila, Pasi Pyykönen, Werner Ritter, Oliver Sawade, and Bernd Schäufele. Automotive lidar sensor development scenarios for harsh weather conditions. In 2016 IEEE 19th International Conference on Intelligent Transportation Systems (ITSC), pages 265–270. IEEE, 2016.

[2] Muqaddas Bin Tahir and Musarat Abdullah. Distance measuring (hurdle detection system) for safe environment in vehicles through ultrasonic rays. Global Journal of Research in Engineering, 12(1-B), 2012.

[3] James A Simmons. Bats use a neuronally implemented computational acoustic model to form sonar images. Current Opinion in Neurobiology, 22(2):311–319, 2012.

[4] Donald R Griffin. Listening in the Dark: The Acoustic Orientation of Bats and Men. 1958.

[5] Gerhard Neuweiler. The Biology of Bats. Oxford University Press, 2000.

[6] Mark Denny. Blip, Ping, and Buzz: Making Sense of Radar and Sonar. JHU Press, 2007.

[7] Michael T McCann, Kyong Hwan Jin, and Michael Unser. A review of convolutional neural networks for inverse problems in imaging. arXiv preprint arXiv:1710.04011, 2017.

[8] G Kaniak and H Schweinzer. A 3d airborne ultrasound sensor for high-precision location data estimation and conjunction. In 2008 IEEE Instrumentation and Measurement Technology Conference (IMTC), pages 842–847. IEEE, 2008.

[9] Jan Steckel, Andre Boen, and Herbert Peremans. Broadband 3-d sonar system using a sparse array for indoor navigation.
IEEE Transactions on Robotics, 29(1):161–171, 2013.

[10] Jonas Reijniers and Herbert Peremans. Biomimetic sonar system performing spectrum-based localization. IEEE Transactions on Robotics, 23(6):1151–1159, 2007.

[11] Huzefa Akbarally and Lindsay Kleeman. A sonar sensor for accurate 3d target localisation and classification. In Proceedings of the 1995 IEEE International Conference on Robotics and Automation, volume 3, pages 3003–3008. IEEE, 1995.

[12] Alberto Ochoa, Jesus Urena, Alvaro Hernandez, Manuel Mazo, José Antonio Jiménez, and Ma Carmen Perez. Ultrasonic multitransducer system for classification and 3-d location of reflectors based on pca. IEEE Transactions on Instrumentation and Measurement, 58(9):3031–3041, 2009.

[13] Itiel E Dror, Mark Zagaeski, and Cynthia F Moss. Three-dimensional target recognition via sonar: a neural network model. Neural Networks, 8(1):149–160, 1995.

[14] Marco Moebus and Abdelhak Zoubir. Three-dimensional ultrasound imaging in air for parking and pedestrian protection. In In-Vehicle Corpus and Signal Processing for Driver Behavior, pages 137–147. Springer, 2009.

[15] Sumio Watanabe and Masahide Yoneyama. An ultrasonic visual sensor for three-dimensional object recognition using neural networks. IEEE Transactions on Robotics and Automation, 8(2):240–249, 1992.

[16] Petros Boufounos. Compressive sensing for over-the-air ultrasound. In 2011 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 5972–5975. IEEE, 2011.

[17] Prestor A Saillant, James A Simmons, Steven P Dear, and Teresa A McMullen. A computational model of echo processing and acoustic imaging in frequency-modulated echolocating bats: The spectrogram correlation and transformation receiver. The Journal of the Acoustical Society of America, 94(5):2691–2712, 1993.

[18] James A Simmons.
A view of the world through the bat's ear: the formation of acoustic images in echolocation. Cognition, 33(1-2):155–199, 1989.

[19] James A Simmons, Prestor A Saillant, Janine M Wotton, Tim Haresign, Michael J Ferragamo, and Cynthia F Moss. Composition of biosonar images for target recognition by echolocating bats. Neural Networks, 8(7-8):1239–1261, 1995.

[20] Herbert Peremans and John Hallam. The spectrogram correlation and transformation receiver, revisited. The Journal of the Acoustical Society of America, 104(2):1101–1110, 1998.

[21] F Devaud, G Hayward, and JJ Soraghan. The use of chirp overlapping properties for improved target resolution in an ultrasonic ranging system. In 2004 IEEE Ultrasonics Symposium, volume 3, pages 2041–2044. IEEE, 2004.

[22] G Hayward, F Devaud, and JJ Soraghan. P1g-3 evaluation of a bio-inspired range finding algorithm (bira). In 2006 IEEE Ultrasonics Symposium, pages 1381–1384. IEEE, 2006.

[23] OW Henson Jr. Biosonar imaging of insects by Pteronotus p. parnellii, the mustached bat. Natl. Geogr. Res., 3:82–101, 1987.

[24] Rudolf Kober and Hans-Ulrich Schnitzler. Information in sonar echoes of fluttering insects available for echolocating bats. The Journal of the Acoustical Society of America, 87(2):882–896, 1990.

[25] Lee A Miller and Simon Boel Pedersen. Echoes from insects processed using time delayed spectrometry (tds). In Animal Sonar, pages 803–807. Springer, 1988.

[26] Hans-Ulrich Schnitzler, Dieter Menne, Rudi Kober, and Klaus Heblich. The acoustical image of fluttering insects in echolocating bats. In Neuroethology and Behavioral Physiology, pages 235–250. Springer, 1983.

[27] James A Simmons and Lynda Chen. The acoustic basis for target discrimination by fm echolocating bats. The Journal of the Acoustical Society of America, 86(4):1333–1350, 1989.

[28] James A Simmons.
The processing of sonar echoes by bats. In Animal Sonar Systems, pages 695–714. Springer, 1980.

[29] James A Simmons and Roger A Stein. Acoustic imaging in bat sonar: echolocation signals and the evolution of echolocation. Journal of Comparative Physiology, 135(1):61–84, 1980.

[30] GK Strother. Note on the possible use of ultrasonic pulse compression by bats. The Journal of the Acoustical Society of America, 33(5):696–697, 1961.

[31] James A Simmons. The resolution of target range by echolocating bats. The Journal of the Acoustical Society of America, 54(1):157–173, 1973.

[32] David Allen Cahlander. Echolocation with wide-band waveforms: Bat sonar signals. Technical report, MIT Lincoln Laboratory, Lexington, MA, 1964.

[33] Richard A Altes. Methods of wideband signal design for radar and sonar systems. Technical report, University of Rochester, Department of Electrical Engineering, 1970.

[34] Changsheng Yang, Bingxu Ren, Shuang Wu, and Ran Fang. Time-scale analysis on harmonic signal and bionic anti-reverberation waveform design. In 2014 7th International Congress on Image and Signal Processing (CISP), pages 943–947. IEEE, 2014.

[35] Mary E Bates, James A Simmons, and Tengiz V Zorikov. Bats use echo harmonic structure to distinguish their targets from background clutter. Science, 333(6042):627–630, 2011.

[36] Chang-sheng Yang, Hong Liang, and SG Yang. Analysis of self-cwt ridge of wideband hfm signal. J. System Simulation, 20:5324–5327, 2008.

[37] Lan Zhang, Xiaomei Xu, Wei Feng, and Yougan Chen. Hfm spread spectrum modulation scheme in shallow water acoustic channels. In 2012 Oceans, pages 1–6. IEEE, 2012.

[38] MI Skolnik. Radar Handbook. New York: McGraw-Hill, 1970.

[39] Arjan Boonman, Yinon Bar-On, and Yossi Yovel. It's not black or white—on the range of vision and echolocation in echolocating bats.
Frontiers in Physiology, 4:248, 2013.

[40] Sergey Ioffe and Christian Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. arXiv preprint arXiv:1502.03167, 2015.

[41] Vinod Nair and Geoffrey E Hinton. Rectified linear units improve restricted boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML-10), pages 807–814, 2010.

[42] Olaf Ronneberger, Philipp Fischer, and Thomas Brox. U-net: Convolutional networks for biomedical image segmentation. In International Conference on Medical Image Computing and Computer-Assisted Intervention, pages 234–241. Springer, 2015.

[43] Ellen Covey and John H Casseday. The lower brainstem auditory pathways. In Hearing by Bats, pages 235–295. Springer, 1995.

[44] Melville J Wohlgemuth, Jinhong Luo, and Cynthia F Moss. Three-dimensional auditory localization in the echolocating bat. Current Opinion in Neurobiology, 41:78–86, 2016.

[45] Eric D Young, George A Spirou, John J Rice, and Herbert F Voigt. Neural organization and responses to complex stimuli in the dorsal cochlear nucleus. Phil. Trans. R. Soc. Lond. B, 336(1278):407–413, 1992.

[46] Jan Schnupp, Israel Nelken, and Andrew King. Auditory Neuroscience: Making Sense of Sound. MIT Press, 2011.

[47] Jifeng Dai, Haozhi Qi, Yuwen Xiong, Yi Li, Guodong Zhang, Han Hu, and Yichen Wei. Deformable convolutional networks. CoRR, abs/1703.06211, 2017.

[48] Diederik P Kingma and Jimmy Ba. Adam: A method for stochastic optimization. In International Conference on Learning Representations (ICLR), 2015.

[49] Nitish Srivastava, Geoffrey Hinton, Alex Krizhevsky, Ilya Sutskever, and Ruslan Salakhutdinov. Dropout: a simple way to prevent neural networks from overfitting.
The Journal of Machine Learning Research, 15(1):1929–1958, 2014.

[50] Bo Zhu, Jeremiah Z Liu, Stephen F Cauley, Bruce R Rosen, and Matthew S Rosen. Image reconstruction by domain-transform manifold learning. Nature, 555(7697):487, 2018.