{"title": "Self-Supervised Generation of Spatial Audio for 360\u00b0 Video", "book": "Advances in Neural Information Processing Systems", "page_first": 362, "page_last": 372, "abstract": "We introduce an approach to convert mono audio recorded by a 360\u00b0 video camera into spatial audio, a representation of the distribution of sound over the full viewing sphere. Spatial audio is an important component of immersive 360\u00b0 video viewing, but spatial audio microphones are still rare in current 360\u00b0 video production. Our system consists of end-to-end trainable neural networks that separate individual sound sources and localize them on the viewing sphere, conditioned on multi-modal analysis from the audio and 360\u00b0 video frames. We introduce several datasets, including one filmed ourselves, and one collected in-the-wild from YouTube, consisting of 360\u00b0 videos uploaded with spatial audio. During training, ground truth spatial audio serves as self-supervision and a mixed down mono track forms the input to our network. Using our approach we show that it is possible to infer the spatial localization of sounds based only on a synchronized 360\u00b0 video and the mono audio track.", "full_text": "Self-Supervised Generation of Spatial Audio\n\nfor 360\u25e6 Video\n\nPedro Morgado\n\nUniversity of California, San Diego\u2217\n\nNuno Vasconcelos\n\nUniversity of California, San Diego\n\nTimothy Langlois\n\nAdobe Research, Seattle\n\nOliver Wang\n\nAdobe Research, Seattle\n\nAbstract\n\nWe introduce an approach to convert mono audio recorded by a 360\u25e6 video camera\ninto spatial audio, a representation of the distribution of sound over the full viewing\nsphere. Spatial audio is an important component of immersive 360\u25e6 video viewing,\nbut spatial audio microphones are still rare in current 360\u25e6 video production. 
Our\nsystem consists of end-to-end trainable neural networks that separate individual\nsound sources and localize them on the viewing sphere, conditioned on multi-modal\nanalysis of audio and 360\u25e6 video frames. We introduce several datasets, including\none filmed ourselves, and one collected in-the-wild from YouTube, consisting of\n360\u25e6 videos uploaded with spatial audio. During training, ground-truth spatial\naudio serves as self-supervision and a mixed-down mono track forms the input to\nour network. Using our approach, we show that it is possible to infer the spatial\nlocation of sound sources based only on 360\u25e6 video and a mono audio track.\n\n1 Introduction\n\n360\u25e6 video provides viewers with an immersive viewing experience where they are free to look in any\ndirection, either by turning their heads with a Head-Mounted Display (HMD), or by mouse-control\nwhile watching the video in a browser (e.g., YouTube). Capturing 360\u25e6 video involves filming the\nscene with multiple cameras and stitching the result together. While early systems relied on expensive\nrigs with carefully mounted cameras, recent consumer-level devices combine multiple lenses in a\nsmall fixed-body frame that enables automatic stitching, allowing 360\u25e6 video to be recorded with a\nsingle push of a button.\nAs humans rely on audio localization cues for full scene awareness, spatial audio is a crucial\ncomponent of 360\u25e6 video. Spatial audio enables viewers to experience sound in all directions, while\nadjusting the audio in real time to match the viewing position. This gives users a more immersive\nexperience, as well as providing cues about which part of the scene might have interesting content\nto look at. However, unlike 360\u25e6 video, producing spatial audio content still requires a moderate\ndegree of expertise. 
Most consumer-level 360\u25e6 cameras only record mono audio, and syncing an\nexternal spatial audio microphone can be expensive and technically challenging. As a consequence,\nwhile most video platforms (e.g., YouTube and Facebook) support spatial audio, it is often ignored\nby content creators, and at the time of submission, a random poll of 1000 YouTube 360\u25e6 videos\nyielded fewer than 5% with spatial audio.\n\n\u2217Contact author: pmaravil@eng.ucsd.edu\n\n32nd Conference on Neural Information Processing Systems (NeurIPS 2018), Montr\u00e9al, Canada.\n\n\fFigure 1: Architecture overview. Our approach is composed of four main blocks. The input video and audio\nsignals are fed into the analysis block (a), which extracts high-level features. The separation block (b) then\nlearns k time-frequency attenuation maps $a^i(t, \\omega)$ to modulate the input STFT and produce modified waveforms\n$f^i(t)$. The localization block (c) computes a set of linear transform weights $w^i(t)$ that localize each source. In\nthe ambisonics generation step (d), localization weights are then combined with the separated sound sources to\nproduce the final spatial audio output.\n\nIn order to close this gap between the audio and visual experiences, we introduce three main\ncontributions: (1) we formalize the 360\u25e6 spatialization problem; (2) design the first 360\u25e6 spatialization\nprocedure; and (3) collect two datasets and propose an evaluation protocol to benchmark our and\nfuture algorithms. 360\u25e6 spatialization aims to upconvert a single mono recording into spatial\naudio guided by the full 360\u25e6 view of the video. More specifically, we seek to generate spatial audio in the\nform of a popular encoding format called first-order ambisonics (FOA), given the mono audio and\ncorresponding 360\u25e6 video as inputs. 
In addition to formulating the 360\u25e6 spatialization task, we design\nthe \ufb01rst data-driven system to upgrade mono audio using self-supervision from 360\u25e6 videos recorded\nwith spatial audio. The proposed procedure is based on a novel neural network architecture that\ndisentangles two fundamental challenges in audio spatialization: the separation of sound sources\nfrom a mixed audio input and respective localization of these sources. In order to train and validate\nour approach, we introduce two 360\u25e6 video datasets with spatial audio, one recorded by ourselves\nin a constrained domain, and a large-scale dataset collected in-the-wild from YouTube. During\ntraining, the captured spatial audio serves as ground truth, with a mixed down mono version provided\nas input to our system. Experiments conducted in both datasets show that the proposed neural\nnetwork can generate plausible spatial audio for 360\u25e6 video. We further validate each component of\nthe proposed architecture and show its superiority over a state-of-the-art, but domain-independent\nbaseline architecture.\nIn the interest of reproducibility, code, data and trained models will be made available to the\ncommunity at https://pedro-morgado.github.io/spatialaudiogen.\n\n2 Related Work\n\nTo the best of our knowledge, we propose the \ufb01rst system for audio spatialization. In addition to\nspatial audio, the \ufb01elds most related to our work are self-supervised learning, audio generation, source\nseparation and audio-visual cross-modal learning, which we now brie\ufb02y describe.\n\nSpatial audio Arti\ufb01cial environments, such as those rendered by game engines, can play sounds\nfrom any location in the video. This capability requires recording sound sources separately and\nmixing them according to the desired scene con\ufb01guration (i.e., the positions of each source relative to\nthe user). In a real world recording, however, sound sources cannot be recorded separately. 
In this case\nwhere sound sources are naturally mixed, spatial audio is often encoded using Ambisonics [13, 9, 30].\nAmbisonics aim to approximate the sound pressure field at a single point in space using a spherical\nharmonic decomposition. More specifically, an audio signal $f(\\boldsymbol{\\theta}, t)$ arriving from direction $\\boldsymbol{\\theta} = (\\varphi, \\vartheta)$\n(where $\\varphi$ is the zenith angle and $\\vartheta$ the azimuth angle) at time t is represented by a truncated spherical\nharmonic expansion of order N\n\n$$f(\\boldsymbol{\\theta}, t) = \\sum_{n=0}^{N} \\sum_{m=-n}^{n} Y_n^m(\\varphi, \\vartheta)\\, \\phi_n^m(t), \\tag{1}$$\n\nwhere $Y_n^m(\\varphi, \\vartheta)$ is the real spherical harmonic of order n and degree m, and $\\phi_n^m(t)$ are the coefficients\nof the expansion. For ease of notation, $Y_n^m$ and $\\phi_n^m$ can be stacked into vectors $\\mathbf{y}_N$ and $\\boldsymbol{\\phi}_N$, and\n(Eq. 1) written as $f(\\boldsymbol{\\theta}, t) = \\mathbf{y}_N^T(\\boldsymbol{\\theta})\\, \\boldsymbol{\\phi}_N(t)$.\n\nIn a controlled environment, sound sources with known locations can be synthetically encoded into\nambisonics using their spherical harmonic projection. More specifically, given a set of k audio signals\n$s_1(t), \\ldots, s_k(t)$ originating from directions $\\boldsymbol{\\theta}_1, \\ldots, \\boldsymbol{\\theta}_k$,\n\n$$\\boldsymbol{\\phi}_N(t) = \\sum_{i=1}^{k} \\mathbf{y}_N(\\boldsymbol{\\theta}_i)\\, s_i(t). \\tag{2}$$\n\nFor ambisonics playback, $\\boldsymbol{\\phi}_N$ is then decoded into a set of speaker or headphone signals in order\nto provide a plane-wave reconstruction of the sound field. In sum, the coefficients $\\boldsymbol{\\phi}_N$, also known\nas ambisonic channels, are sufficient to encode and reproduce spatial audio. Hence, our goal is to\ngenerate $\\boldsymbol{\\phi}_N$ from non-spatial audio and the corresponding video.\n\nSelf-supervised learning Neural networks have been successfully trained through self-supervision\nfor tasks such as image super-resolution [10, 27] and image colorization [20, 46]. In the audio\ndomain, self-supervision has also enabled the detection of sound-video misalignment [37] and audio\nsuper-resolution [31]. Inspired by these approaches, we propose a self-supervised technique for audio\nspatialization. We show that the generation of ambisonic audio can be learned using a dataset of 360\u25e6\nvideo with spatial audio collected in-the-wild without additional human intervention.\n\nGenerative models Recent advances in generative models such as Generative Adversarial Networks\n(GANs) [14] or Variational Auto-Encoders (VAE) [29] have enabled the generation of complex\npatterns, such as images [14] or text [23]. In the audio domain, Wavenet [36] has demonstrated the\nability to produce high fidelity audio samples of both speech and music, by generating a waveform\nfrom scratch on a sample-by-sample basis. Furthermore, neural networks have also outperformed\nprior solutions to audio super-resolution [31] (e.g. converting from 4kHz to 16kHz audio) using a\nU-Net encoder-decoder architecture, and have enabled \u201cautomatic-Foley\u201d type applications [41, 38],\ni.e. generating sounds that correspond to image features, and vice-versa. 
In this work, instead of\ngenerating audio from scratch, our goal is to augment the input audio channels so as to introduce\nspatial information. Thus, unlike Wavenet, ef\ufb01cient audio generation can be achieved without\nsacri\ufb01cing audio \ufb01delity, by transforming the input audio. We also demonstrate the advantages of our\napproach, inspired by the ambisonics encoding process in controlled environments, over a generic\nU-Net architecture for spatial audio generation.\n\nSource separation Source separation is a classic problem with an extensive literature. While early\nmethods present the problem as independent component analysis, and focused on maximizing the\nstatistical independence of the extracted signals [24, 7, 6, 2], recent approaches focus on data-driven\nsolutions. For example, [19] proposes a recurrent neural-network for monaural separation of two\nspeakers, [1, 12, 11] seek to isolate sound sources by leveraging synchronized visual information in\naddition to the audio input, and [44] studies a wide range of frequency-based separation methods.\nSimilarly to recent trends, we rely on neural networks guided by cross-modal video analysis. However,\ninstead of only separating human speakers [44] or musical instruments [47], we aim to separate\nmultiple unidenti\ufb01ed types of sound sources. Also, unlike previous algorithms, no explicit supervision\nis available to learn the separation block.\n\nSource localization Sound source localization is a mature area of signal processing and robotics\nresearch [3, 35, 34, 42]. However, unlike the proposed 360\u25e6 spatialization problem, these works rely\non microphone arrays using beamforming techniques [43] or binaural audio and HRTF cues similar\nto those used by humans [18]. 
Furthermore, the need for carefully calibrated microphones limits the\napplicability of these techniques to videos collected in-the-wild.\n\nCross visual-audio analysis Cross-modal analysis has been extensively studied in the vision and\ngraphics community, due to the inherently paired nature of video and audio. For example, [4] learns\naudio feature representations in an unsupervised setting by leveraging synchronized video. [22]\nsegments and localizes dominant sound sources using clustering of video and sound features. Other\nmethods correlate repeated motions with sounds to identify sound sources such as the strumming of a\nguitar using, for example, canonical correlation analysis [25, 26], joint embedding spaces [41, 38] or\nother temporal features [5].\n\n3 Method\nIn this section, we define the 360\u25e6 spatialization task to upconvert common audio recordings to\nsupport spatial audio playback. We then introduce a deep learning architecture to address this task,\nand two datasets to train the proposed architecture.\n\n3.1 Audio spatialization\nThe goal of 360\u25e6 spatialization is to generate ambisonic channels $\\boldsymbol{\\phi}_N(t)$ from non-spatial audio i(t)\nand corresponding video v(t). To handle the most common audio formats supported by commercial\n360\u25e6 cameras and video viewing platforms (e.g., YouTube and Facebook), we upgrade monaural\nrecordings (mono) into first-order ambisonics (FOA). FOA consists of four channels that store the\nfirst-order coefficients, $\\phi_0^0$, $\\phi_1^{-1}$, $\\phi_1^0$ and $\\phi_1^1$, of the spherical harmonic expansion in (Eq. 1). For ease\nof notation, we refer to these tracks as $\\phi_w$, $\\phi_y$, $\\phi_z$ and $\\phi_x$, respectively.\n\nSelf-supervised audio spatialization Converting mono to FOA ideally requires learning from\nvideos with paired mono and ambisonics recordings, which are difficult to collect in-the-wild. 
In\norder to learn from self-supervision, we assume that monaural audio is recorded with an omni-\ndirectional microphone. Under this assumption, mono is equivalent to zeroth-order ambisonics (up\nto an amplitude scale) and, as a consequence, the upconversion only requires the synthesis of the\nmissing higher-order channels. More speci\ufb01cally, we learn to predict the \ufb01rst-order components\n\u03c6x(t), \u03c6y(t), \u03c6z(t) from the (surrogate) mono audio i(t) = \u03c6w(t) and video input v(t). Note that\nthe proposed framework is also applicable to other conversion scenarios, e.g. FOA to second-order\nambisonics (SOA), simply by changing the number of input and output audio tracks (see Sec 5).\n\n3.2 Architecture\n\nAudio spatialization requires solving two fundamental problems: source separation and localization.\nIn controlled environments, where the separated sound sources si(t) and respective localization \u03b8\u03b8\u03b8i\nare known in advance, ambisonics can be generated using (Eq. 2). However, since si(t) and \u03b8\u03b8\u03b8i are\nnot known in practice, we design dedicated modules to isolate sources from the mixed audio input\nand localize them in the video. Also, because audio and video are complementary for identifying\neach source, both separation and localization modules are guided by a multi-modal audio-visual\nanalysis module. A schematic description of our architecture is shown in Fig. 1. We now describe\neach component. Details of network architectures are provided in Appendix A.\n\nAudio and visual analysis Audio features are extracted in the time-frequency domain, which has\nproduced successful audio representations for tasks such as audio classi\ufb01cation [17] and speaker\nidenti\ufb01cation [33]. More speci\ufb01cally, we extract a sequence of short-term Fourier transforms (STFT)\ncomputed on 25ms segments of the input audio with 25% hop size and multiplied by Hann window\nfunctions. 
Then, we apply a (two-dimensional) CNN encoder to the audio spectrogram, which\nprogressively reduces the spectrogram dimensionality and extracts high-level features.\nVideo features are extracted using a two-stream network, based on ResNet-18 [16], to encode both\nappearance (RGB frames) and motion (optical flow predicted by FlowNet2 [21]). Both streams are\ninitialized with weights pre-trained on ImageNet [8] for classification, and fine-tuned on our task.\nA joint audio-visual representation is then obtained by merging the three feature maps (audio, RGB\nand flow) produced at each time t. Since audio features are extracted at a higher frame rate than video\nfeatures, we first synchronize the audio and video feature maps by nearest neighbor up-sampling of\nvideo features. Each feature map is then projected into a feature vector (1024 for audio and 512 for\nRGB and flow), and the outputs are concatenated and fed to the separation and localization modules.\n\nAudio separation Although the number of sources may vary, it is often small in practice.\nFurthermore, psychoacoustic studies have shown that humans can only distinguish a small number of\nsimultaneous sources (three according to [39]). We thus assume an upper-bound of k simultaneous\nsources, and implement a separation network that extracts k audio tracks $f^i(t)$ from the input audio\ni(t). The separation module takes the form of a U-Net decoder that progressively restores the STFT\ndimensionality through a series of transposed convolutions and skip connections from the audio\nanalysis stage of equivalent resolution. Furthermore, to visually guide the separation module, we\nconcatenate the multi-modal features to the lowest resolution layer of the audio encoder. In the last\nup-sampling layer, we produce k sigmoid activated maps $a^i(t, \\omega)$, which are used to modulate the\nSTFT of the mono input $\\boldsymbol{\\Phi}(t; \\omega)$. 
The STFT of the ith source $\\boldsymbol{\\Phi}^i(t; \\omega)$ is thus obtained through\nthe soft-attention mechanism $\\boldsymbol{\\Phi}^i(t; \\omega) = a^i(t, \\omega) \\cdot \\boldsymbol{\\Phi}(t; \\omega)$, and the separated audio track $f^i(t)$ is\nreconstructed as the inverse STFT of $\\boldsymbol{\\Phi}^i(t; \\omega)$ using an overlap-add method.\n\nLocalization To localize the sounds $f^i(t)$ extracted by the separation network, we implement\na module that generates, at each time t, the localization weights $w^i(t) = (w_x^i(t), w_y^i(t), w_z^i(t))$\nassociated with each of the k sources, through a series of fully-connected layers applied to the\nmulti-modal feature vectors of the analysis stage. By analogy with the encoding mechanism of (Eq. 2)\nused in controlled environments, $w^i(t)$ can be interpreted as the spherical harmonics $\\mathbf{y}_N(\\boldsymbol{\\theta}_i(t))$\nevaluated at the predicted position of the ith source $\\boldsymbol{\\theta}_i(t)$.\n\nAmbisonic generation Given the localization weights $w^i(t)$ and separated waveforms\n$f^i(t)$, the first-order ambisonic channels $\\boldsymbol{\\phi}(t) = (\\phi_x(t), \\phi_y(t), \\phi_z(t))$ are generated by $\\boldsymbol{\\phi}(t) = \\sum_{i=1}^{k} w^i(t) f^i(t)$. In summary, we split the generation task into two components: generating the\nattenuation maps $a^i(t, \\omega)$ for source separation, and the localization weights $w^i(t)$. As audio is not\ngenerated from scratch, but through a transformation of the original input inspired by the encoding\nframework of (Eq. 2), we are able to achieve fast deployment speeds with high quality results.\n\n3.3 Evaluation metrics\n\nLet $\\boldsymbol{\\phi}(t)$ and $\\hat{\\boldsymbol{\\phi}}(t)$ be the ground-truth and predicted ambisonics, and $\\boldsymbol{\\Phi}(t; \\omega)$ and $\\hat{\\boldsymbol{\\Phi}}(t; \\omega)$ their\nrespective STFTs. 
We now discuss several metrics used for evaluating the generated signals $\\hat{\\boldsymbol{\\phi}}(t)$.\n\nSTFT distance Our network is trained end-to-end to minimize errors between STFTs, i.e.,\n\n$$\\mathrm{MSE}_{\\mathrm{stft}} = \\sum_{p \\in \\{x,y,z\\}} \\sum_{t} \\sum_{\\omega} \\| \\Phi_p(t, \\omega) - \\hat{\\Phi}_p(t, \\omega) \\|^2, \\tag{3}$$\n\nwhere $\\| \\cdot \\|$ is the euclidean complex norm. $\\mathrm{MSE}_{\\mathrm{stft}}$ has well-defined and smooth partial derivatives\nand, thus, it is a suitable loss function. Furthermore, unlike the euclidean distance between raw\nwaveforms, the STFT loss is able to separate the signal into its frequency components, which enables\nthe network to learn the easier parts of the spectrum without distraction from other errors.\n\nLog-spectral distance (LSD) Distances that only compare the smoothed spectral behavior of audio\nsignals are widely used throughout the audio literature. We use the log-spectral distance [15] between\n$\\boldsymbol{\\Phi}(t; \\omega)$ and $\\hat{\\boldsymbol{\\Phi}}(t; \\omega)$, which measures the distance in dB between the two spectrograms using\n\n$$\\mathrm{LSD} = \\sum_{p \\in \\{x,y,z\\}} \\sum_{t} \\sqrt{ \\frac{1}{K} \\sum_{\\omega=1}^{K} \\left( 10 \\log_{10} \\left| \\frac{\\Phi_p(t,\\omega)}{\\hat{\\Phi}_p(t,\\omega)} \\right| \\right)^2 }. \\tag{4}$$\n\nEnvelope distance (ENV) Due to the high-frequency nature of audio and the human insensitivity to\nphase differences, frame-by-frame comparison of raw waveforms does not capture the perceptual similarity\nof two audio signals. 
Instead, we measure the euclidean distance between envelopes of $\\boldsymbol{\\phi}(t)$ and $\\hat{\\boldsymbol{\\phi}}(t)$,\nwhere the envelope of an audio wave is computed using the Hilbert transform method [40].\n\nEarth Mover\u2019s Distance (EMD) Ambisonics model the sound field $f(\\boldsymbol{\\theta}, t)$ over all directions $\\boldsymbol{\\theta}$.\nThe energy of the sound field measured over a small window $w_t$ around time t along direction $\\boldsymbol{\\theta}$ is\n\n$$E(\\boldsymbol{\\theta}, t) = \\sqrt{ \\frac{1}{T} \\sum_{\\tau \\in w_t} f(\\boldsymbol{\\theta}, \\tau)^2 } = \\sqrt{ \\frac{1}{T} \\sum_{\\tau \\in w_t} \\left( \\mathbf{y}_N^T(\\boldsymbol{\\theta})\\, \\boldsymbol{\\phi}_N(\\tau) \\right)^2 }. \\tag{5}$$\n\nThus, $E(\\boldsymbol{\\theta}, t)$ represents the directional energy map of $\\boldsymbol{\\phi}(t)$. In order to measure the localization\naccuracy of the generated spatial audio, we propose to compute the EMD [32] between the energy\nmaps $E(\\boldsymbol{\\theta}, t)$ associated with $\\boldsymbol{\\phi}(t)$ and $\\hat{\\boldsymbol{\\phi}}(t)$. In practice, we uniformly sample the maps $E(\\boldsymbol{\\theta}, t)$ over\nthe sphere, normalize the sampled map so that $\\sum_i E(\\boldsymbol{\\theta}_i, t) = 1$, and measure the distance between\nsamples over the sphere\u2019s surface using cosine (angular) distances for EMD calculation.\n\nFigure 2: Representative images. Example video frames from each dataset (panels: REC-STREET, YT-ALL, YT-MUSIC, YT-CLEAN).\n\n3.4 Datasets\nTo train our model, we collected two datasets of 360\u25e6 videos with FOA audio. The first dataset,\ndenoted REC-STREET, was recorded by us using a Theta V 360\u25e6 camera with an attached TA-1\nspatial audio microphone. REC-STREET consists of 43 videos of outdoor street scenes, totaling 3.5\nhours and 123k training samples (0.1s each). 
Due to the consistency of capture hardware and scene\ncontent, the audio of REC-STREET videos is relatively easier to spatialize.\nThe second dataset, denoted YT-ALL, was collected in-the-wild by scraping 360\u25e6 videos from\nYouTube using queries related to spatial audio, e.g., spatial audio, ambisonics, and ambix. To\nclean the search results, we automatically removed videos that did not contain valid ambisonics, as\ndescribed by YouTube\u2019s format, keeping only videos containing all 4 channels or with only the Z\nchannel missing (a common spatial audio capture scenario). Finally, we performed a manual curation\nto remove videos containing 1) still images, 2) computer generated content, or 3) post-processed\nand non-visually indicated sounds such as background music or voice-overs. During this pruning\nprocess, 799 videos were removed, resulting in 1146 valid videos totaling 113.1 hours of content\n(3976k training samples). YT-ALL was further separated into live musical performances, YT-MUSIC\n(397 videos), and videos with a small number of super-imposed sources which could be localized in\nthe image, YT-CLEAN (496 videos). Upgrading YT-MUSIC videos into spatial audio is especially\nchallenging due to the large number of mixed sources (voices and instruments). We also identi\ufb01ed\n489 videos that were recorded with a \u201chorizontal\u201d spatial audio microphone (i.e. only containing\n\u03c6w(t),\u03c6x(t) and \u03c6y(t) channels). In this case, we simply ignore the Z channel \u03c6z(t) when computing\neach metric including the STFT loss. Fig. 2 shows illustrative video frames and summarizes the most\ncommon categories for each dataset.\n\n4 Evaluation\n\nFor our experiments, we randomly sample three partitions, each containing 75% of all videos for\ntraining and 25% for testing. Networks are trained to generate audio at 48kHz from input mono audio\nprocessed at 48kHz and video at 10Hz. 
Each training sample consists of a chunk of 0.6s of mono\naudio and a single frame of RGB and flow, which are used to predict 0.1s of spatial audio at the\ncenter of the 0.6s input window. To make the model more robust and remove any bias to content in\nthe center, we augment datasets during training by randomly rotating both video and spatial audio\naround the vertical (z) axis. Spatial audio can be rotated by multiplying the ambisonic channels with\nthe appropriate rotation matrix as described in [30], and video frames (in equirectangular format) can\nbe rotated using horizontal translations with wrapping. Networks are trained by back-propagation\nusing the Adam optimizer [28] for 150k iterations (roughly two days) with parameters $\\beta_1 = 0.9$,\n$\\beta_2 = 0.999$ and $\\epsilon = 10^{-8}$, batch size of 32, learning rate of $10^{-4}$ and weight decay of 0.0005.\nDuring evaluation, we predict a chunk of 0.1s for each second of the test video, and average the results\nacross all chunks. 
Also, to avoid bias towards longer videos, all evaluation metrics are computed for\neach video separately, and averaged across videos.\n\n[Figure 2 histogram: counts of videos (axis 0 to 300) per YouTube category (Autos & Vehicles, Sports, Film & Animation, Science & Technology, Entertainment, Travel & Events, Music, People & Blogs) for YT-All, YT-Music and YT-Clean.]\n\nMethod | REC-STREET (STFT / ENV / EMD) | YT-CLEAN (STFT / ENV / EMD) | YT-MUSIC (STFT / ENV / EMD) | YT-ALL (STFT / ENV / EMD)\nSPATIAL PRIOR | 0.187 / 0.958 / 0.492 | 1.394 / 2.063 / 1.478 | 4.652 / 4.355 / 3.479 | 2.691 / 3.394 / 2.246\nU-NET BASELINE | 0.180 / 0.935 / 0.449 | 1.361 / 2.039 / 1.403 | 4.338 / 4.678 / 2.855 | 2.658 / 3.239 / 2.137\nOURS-NOVIDEO | 0.178 / 0.973 / 0.450 | 1.370 / 2.081 / 1.428 | 4.220 / 4.591 / 2.654 | 2.635 / 3.200 / 2.117\nOURS-NORGB | 0.158 / 0.779 / 0.425 | 1.339 / 1.847 / 1.405 | 3.664 / 3.569 / 2.432 | 2.546 / 2.907 / 2.063\nOURS-NOFLOW | 0.172 / 0.784 / 0.440 | 1.349 / 1.778 / 1.402 | 3.615 / 3.467 / 2.403 | 2.455 / 2.665 / 2.023\nOURS-NOSEP | 0.152 / 0.790 / 0.422 | 1.381 / 1.773 / 1.415 | 3.627 / 3.602 / 2.447 | 2.435 / 2.694 / 2.050\nOURS-FULL | 0.158 / 0.767 / 0.419 | 1.379 / 1.776 / 1.417 | 3.524 / 3.366 / 2.350 | 2.447 / 2.649 / 2.019\n\nTable 1: Quantitative comparisons. We report three quality metrics (Sec 3.3): Envelope distance (ENV),\nLog-spectral distance (LSD), and earth-mover\u2019s distance (EMD), on test videos from different datasets (Sec 3.4).\nLower is better. All results within 0.01 of the top performer are shown in bold.\n\nFigure 3: Qualitative Results. Comparison between predicted (Ours) and recorded (GT) FOA. Spatial audio is visualized\nas a color overlay over the frame, with darker red indicating locations with higher audio energy.\n\nReal time performance The proposed procedure can generate 1s of spatial audio at 48000Hz\nsampling rate in 103ms, using a single 12GB Titan Xp GPU (3840 cores running at 1.6GHz).\n\nBaselines Since spatial audio generation is a novel task, no established methods exist for comparison\npurposes. 
Instead, we ablate our architecture to determine the relevance of each component, and\ncompare it to the prior spatial distribution of audio content and a popular, domain-independent\nbaseline architecture. Quantitative results are shown in Table 1.\nTo determine the role of the visual input, we remove the RGB encoder (NORGB), the flow encoder\n(NOFLOW), or both (NOVIDEO). We also remove the separation block entirely (NOSEP), and\nmultiply the localization weights with the input mono directly. The results indicate that the network\nrelies heavily on visual features, with NOVIDEO being one of the worst performers overall.\nInterestingly, most methods performed well on REC-STREET and YT-CLEAN. However, the visual\nencoder and separation block are necessary for more complex videos as in YT-MUSIC and YT-ALL.\nSince the main sound sources in 360\u25e6 videos often appear in the center, we validate the need\nfor a complex model by directly using the prior distribution of audio content (SPATIAL-PRIOR).\nWe compute the spatial prior $\\bar{E}(\\boldsymbol{\\theta})$ by averaging the energy maps $E(\\boldsymbol{\\theta}, t)$ of (Eq. 5) over all\nvideos in the training set. Then, to induce the same distribution on test videos, we decompose\n$\\bar{E}(\\boldsymbol{\\theta})$ into its spherical harmonics coefficients $(c_w, c_x, c_y, c_z)$ and upconvert the input mono using\n$(\\phi_w(t), \\phi_x(t), \\phi_y(t), \\phi_z(t)) = (1, c_x/c_w, c_y/c_w, c_z/c_w)\\, i(t)$. As shown in Table 1, relying solely on\nthe prior distribution is not enough for accurate ambisonic conversion.\nWe finally compare to a popular encoder-decoder U-NET architecture, which has been successfully\napplied to audio tasks such as audio super-resolution [31]. This network consists of a number of\nconvolutional downsampling layers that progressively reduce the dimension of the signal, distilling\nhigher level features, followed by a number of upsampling layers to restore the signal\u2019s resolution. 
In\neach upsampling layer, a skip connection is added from the encoding layer of equivalent resolution.\nTo generate spatial audio, we modify the U-NET architecture by setting the number of units in the\noutput layer to the number of ambisonic channels, and concatenate video features to the U-Net\nbottleneck (i.e., the lowest resolution layer). Our approach significantly outperforms the U-NET\narchitecture, which demonstrates the importance of an architecture tailored to the task of spatial audio\ngeneration.\n\nFigure 4: Comparisons. Predicted FOA produced by different procedures (panels: Ground-truth, U-NET, NOAUDIO, NOSEP, OURS).\n\nFigure 5: Mono recordings. Predicted FOA on\nvideos recorded with a real mono microphone (unknown FOA).\n\nFigure 6: User studies. Percentage of videos labeled as\n\"Real\" when viewed with audio generated by various methods\n(GT, OURS, U-NET and MONO) under two viewing\nexperiences (using an HMD device, and in-browser viewing).\nError bars represent Wilson score intervals [45] for a\n95% confidence level.\n\nQualitative results Designing robust metrics for comparing spatial audio is an open problem,\nand we found that only so much can be determined by these metrics alone. For example, fully flat\npredictions can have a similar EMD to a misplaced prediction, but perceptually be much worse.\nTherefore, we also rely on qualitative evaluation and a user study. Fig. 3 shows illustrative examples\nof the spatial audio output of our network, and Fig. 4 shows a comparison with other baselines. To\ndepict spatial audio, we overlay the directional energy map $E(\\boldsymbol{\\theta}, t)$ of the predicted ambisonics (Eq. 5)\nover the video frame at time t. As can be seen in most of these examples, our network generates\nspatial audio whose spatial distribution of energy is similar to that of the ground truth. 
Furthermore, due to\nthe form of the audio generator, the sound fidelity of the original mono input is carried over to the\nsynthesized audio. These and other examples, together with the predicted spatial audio, are provided\nin the Supp. material.\nThe results shown in Table 1 and Fig. 3 use videos recorded with ambisonic microphones and\nconverted to mono audio. To test whether our method extends to real mono microphones, we\nscraped additional videos from YouTube that were not recorded with ambisonics, and show that we\ncan still generate convincing spatial audio (see Fig. 5 and Supp. material).\n\nUser study The real criterion for success is whether viewers believe that the generated audio is\ncorrectly spatialized. To evaluate this, we conducted a \u201creal vs. fake\u201d user study, where participants\nwere shown a 360\u00b0 video and asked to decide whether the perceived location of the audio matches\nthe location of its sources in the video (real) or not (fake). Two studies were conducted in different\nviewing environments: on a popular in-browser 360\u00b0 video viewing platform (YouTube), and with a\nhead-mounted display (HMD) in a controlled environment. We recruited 32 participants from Amazon\nMechanical Turk for the in-browser study. For the HMD study, we recruited 9 participants (aged\nbetween 20 and 32, 1 female) through an engineering school email list of a large university. In both\ncases, participants were required to have normal hearing, and to listen to the audio using headphones. In\nthe HMD study, participants were asked to wear a KAMLE VR Headset. To familiarize participants\nwith the spatial audio experience, each participant was first asked to watch two versions of a\npre-selected video, with and without correct spatial audio. 
After the practice round, participants watched\n20 randomly selected videos whose audio was generated by one of four methods: GT, the original\nground-truth recorded spatial audio; MONO, just the mono track (no spatialization); U-NET, the\nbaseline method; and OURS, the result of our full method. After each video, participants were asked\nto decide whether its audio was real or fake. In total, 280 clips per method were watched for the\nin-browser study, and 45 per method in the HMD study.\n\n[Figure 6 plot, % Real (HMD / In-Browser): GT 84.4 / 72.5; Ours 62.2 / 55.0; U-Net 31.1 / 40.4; Mono 22.2 / 35.8]\n\n[Panel labels for Figures 7 and 8: FOA, SOA, Ground truth, Ours]\n\nFigure 7: Limitations. Our algorithm picks\nthe wrong people as laughing in a room\nfull of people (top), and the wrong violin as\ncurrently playing in a live performance (right).\n\nFigure 8: Higher order ambisonics.\n(Top) Examples\nfrom our synthetic FOA to SOA conversion experiment.\n(Bottom) Comparison between Mono to FOA and FOA to\nSOA conversion tasks:\n\nMetric   MONO \u2192 FOA   FOA \u2192 SOA\nENV      1.870        0.333\nLSD      3.228        0.513\nEMD      1.400        0.232\n\nThe results of both studies, shown in Fig. 6, support several conclusions. First, our approach\noutperforms the U-NET baseline and MONO by statistically significant margins in both studies.\nSecond, in comparison to in-browser video platforms, HMD devices offer a more realistic viewing\nexperience, which enables non-spatial audio to be identified more easily. Thus, participants were\nconvinced by the ambisonics predicted by our approach at higher rates while wearing an HMD device\n(62% HMD vs. 55% in-browser). Finally, spatial audio may not always be experienced easily, e.g.,\nwhen the video does not contain clean sound sources. 
As a consequence, even videos with GT\nambisonics were misclassified in both studies at a significant rate.\n\n5 Discussion\n\nLimitations We observe several cases where sound sources are not correctly separated or localized.\nThis occurs with challenging examples such as those with many overlapping sources, reverberant\nenvironments which are hard to separate, or where there is an ambiguous mapping from visual\nappearance to sound source (such as multiple, similar-looking cars). Fig. 7 shows a few examples.\nWhile general purpose spatial audio generation is still an open problem, we provide a first approach.\nWe hope that future advances in audio-visual analysis and audio generation will enable more robust\nsolutions. Also, while the total amount of content (in hours) is on par with other video datasets, the\nnumber of videos is still low, due to the limited number of 360\u00b0 videos with spatial audio available\nfrom online sources. As this number increases, our method should also improve significantly.\nFuture work Although hardware trends change and we begin to see commercial cameras that\ninclude spatial audio microphone arrays capable of recording FOA, we believe that up-converting\nto spatial audio will remain relevant for a number of reasons. Besides the spatialization of legacy\nrecordings with only mono or stereo audio, our method can be used to further increase the ambisonics\nspatial resolution, for example by up-converting first-order into second-order ambisonics (SOA).\nUnfortunately, ground-truth SOA recordings are difficult to collect in-the-wild, since SOA microphones are\nrare and expensive. Instead, to demonstrate future potential, we applied our approach to the FOA to\nSOA conversion task, using a small synthetic dataset where pre-recorded sounds are placed at chosen\nlocations, which move over time in random trajectories. 
These are accompanied by an artificially\nconstructed video consisting of a random background image with identifying icons synchronized\nwith the sound location (see Fig. 8). The results shown in Fig. 8 indicate that converting FOA into\nSOA may be significantly easier than converting mono (ZOA) into FOA. This is because FOA signals already contain\nsubstantial spatial information, and partially separated sounds. Given these findings, a promising area\nfor future work is to synthesize a realistic large scale SOA dataset for learning to convert FOA into\nhigher-order ambisonics in order to support a more realistic viewing experience.\nConclusion We presented the first approach for up-converting conventional mono recordings into\nspatial audio given a 360\u00b0 video, and introduced an end-to-end trainable network tailored to this\ntask. We also demonstrated the benefits of each component of our network and showed that the proposed\ngenerator performs substantially better than a domain-independent baseline.\n\nAcknowledgments This work was partially funded by graduate fellowship SFRH/BD/109135/2015 from the\nPortuguese Ministry of Sciences and Education and NRI Grant IIS-1637941.\n\nReferences\n[1] T. Afouras, J. S. Chung, and A. Zisserman. The conversation: Deep audio-visual speech enhancement. In\nInterspeech, 2018. 3\n\n[2] S. Amari, A. Cichocki, and H. H. Yang. A new learning algorithm for blind signal separation. In Advances\nin Neural Information Processing Systems (NIPS), 1996. 3\n\n[3] S. Argentieri, P. Dan\u00e8s, and P. Sou\u00e8res. A survey on sound source localization in robotics: From binaural\nto array processing methods. Computer Speech & Language, 34(1):87\u2013112, 2015. 3\n\n[4] Y. Aytar, C. Vondrick, and A. Torralba. Soundnet: Learning sound representations from unlabeled video.\nIn Advances in Neural Information Processing Systems (NIPS), 2016. 3\n\n[5] Z. Barzelay and Y. Y. Schechner. Harmony in motion. In IEEE Conf. 
on Computer Vision and Pattern\nRecognition (CVPR), 2007. 3\n\n[6] A. J. Bell and T. J. Sejnowski. An information-maximization approach to blind separation and blind\ndeconvolution. Neural computation, 7(6):1129\u20131159, 1995. 3\n\n[7] P. Comon. Independent component analysis, a new concept? Signal Processing, 36(3):287\u2013314, 1994. 3\n\n[8] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei. Imagenet: A large-scale hierarchical image\ndatabase. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2009. 4\n\n[9] G. Dickins and R. Kennedy. Towards optimal soundfield representation. In Audio Engineering Society\nConvention, 1999. 2\n\n[10] C. Dong, C. C. Loy, K. He, and X. Tang. Learning a deep convolutional network for image super-resolution.\nIn European Conference on Computer Vision (ECCV), 2014. 3\n\n[11] A. Ephrat, I. Mosseri, O. Lang, T. Dekel, K. Wilson, A. Hassidim, W. Freeman, and M. Rubinstein.\nLooking to listen at the cocktail party: A speaker-independent audio-visual model for speech separation.\nIn ACM SIGGRAPH, 2018. 3\n\n[12] A. Gabbay, A. Shamir, and S. Peleg. Visual speech enhancement using noise-invariant training.\nIn Interspeech, 2018. 3\n\n[13] M. A. Gerzon. Periphony: With-height sound reproduction. J. Audio Eng. Soc, 21(1):2\u201310, 1973. 2\n\n[14] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio.\nGenerative adversarial nets. In Advances in Neural Information Processing Systems (NIPS), 2014. 3\n\n[15] A. Gray and J. Markel. Distance measures for speech processing. IEEE Transactions on Acoustics, Speech,\nand Signal Processing, 24(5):380\u2013391, 1976. 5\n\n[16] K. He, X. Zhang, S. Ren, and J. Sun. Identity mappings in deep residual networks. In European Conference\non Computer Vision (ECCV), 2016. 4\n\n[17] S. Hershey, S. Chaudhuri, D. P. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A.\nSaurous, B. 
Seybold, et al. CNN architectures for large-scale audio classification. In IEEE International\nConf. on Acoustics, Speech and Signal Processing (ICASSP), 2017. 4\n\n[18] J. Hornstein, M. Lopes, J. Santos-Victor, and F. Lacerda. Sound localization for humanoid robots-building\naudio-motor maps based on the HRTF. In IEEE/RSJ International Conf. on Intelligent Robots and Systems,\n2006. 3\n\n[19] P.-S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis. Deep learning for monaural speech\nseparation. In IEEE International Conf. on Acoustics, Speech and Signal Processing (ICASSP), 2014. 3\n\n[20] S. Iizuka, E. Simo-Serra, and H. Ishikawa. Let there be color!: joint end-to-end learning of global and\nlocal image priors for automatic image colorization with simultaneous classification. ACM Transactions\non Graphics (TOG), 35(4):110, 2016. 3\n\n[21] E. Ilg, N. Mayer, T. Saikia, M. Keuper, A. Dosovitskiy, and T. Brox. Flownet 2.0: Evolution of optical\nflow estimation with deep networks. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR),\n2017. 4\n\n[22] H. Izadinia, I. Saleemi, and M. Shah. Multimodal analysis for identification and segmentation of\nmoving-sounding objects. IEEE Transactions on Multimedia, 15(2):378\u2013390, 2013. 3\n\n[23] R. Jozefowicz, O. Vinyals, M. Schuster, N. Shazeer, and Y. Wu. Exploring the limits of language modeling.\narXiv preprint arXiv:1602.02410, 2016. 3\n\n[24] C. Jutten and J. Herault. Blind separation of sources, part I: An adaptive algorithm based on neuromimetic\narchitecture. Signal Processing, 24(1), 1991. 3\n\n[25] E. Kidron, Y. Y. Schechner, and M. Elad. Pixels that sound. In IEEE Conf. on Computer Vision and Pattern\nRecognition (CVPR), 2005. 3\n\n[26] E. Kidron, Y. Y. Schechner, and M. Elad. Cross-modal localization via sparsity. IEEE Trans. on Signal\nProcessing (TIP), 55(4):1390\u20131404, 2007. 3\n\n[27] J. Kim, J. Kwon Lee, and K. Mu Lee. 
Accurate image super-resolution using very deep convolutional\nnetworks. In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2016. 3\n\n[28] D. P. Kingma and J. Ba. Adam: A method for stochastic optimization. CoRR, abs/1412.6980, 2014. 6\n\n[29] D. P. Kingma and M. Welling. Auto-encoding variational Bayes. In International Conference on Learning\nRepresentations (ICLR), 2014. 3\n\n[30] M. Kronlachner. Spatial transformations for the alteration of ambisonic recordings. Master\u2019s thesis, Graz\nUniversity of Technology, 2014. 2, 6\n\n[31] V. Kuleshov, S. Z. Enam, and S. Ermon. Audio super resolution using neural networks. In Workshops at\nInternational Conference on Learning Representations (ICLR), 2017. 3, 7\n\n[32] E. Levina and P. Bickel. The earth mover\u2019s distance is the mallows distance: Some insights from statistics.\nIn IEEE International Conference on Computer Vision (ICCV), 2001. 5\n\n[33] A. Nagrani, J. S. Chung, and A. Zisserman. Voxceleb: A large-scale speaker identification dataset. In\nInterspeech, 2017. 4\n\n[34] K. Nakadai, H. G. Okuno, and H. Kitano. Real-time sound source localization and separation for robot\naudition. In International Conference on Spoken Language Processing, 2002. 3\n\n[35] K. Nakamura, K. Nakadai, F. Asano, and G. Ince. Intelligent sound source localization and its application\nto multimodal human tracking. In IEEE/RSJ International Conf. on Intelligent Robots and Systems (IROS),\n2011. 3\n\n[36] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and\nK. Kavukcuoglu. Wavenet: A generative model for raw audio. In 9th ISCA Speech Synthesis Workshop,\n2016. 3\n\n[37] A. Owens and A. A. Efros. Audio-visual scene analysis with self-supervised multisensory features. In\nEuropean Conference on Computer Vision (ECCV), 2018. 3\n\n[38] A. Owens, P. Isola, J. McDermott, A. Torralba, E. H. Adelson, and W. T. Freeman. Visually indicated\nsounds. 
In IEEE Conf. on Computer Vision and Pattern Recognition (CVPR), 2016. 3\n\n[39] O. Santala and V. Pulkki. Directional perception of distributed sound sources. The Journal of the Acoustical\nSociety of America, 129(3):1522\u20131530, 2011. 4\n\n[40] J. O. Smith. Mathematics of the discrete Fourier transform (DFT): with audio applications, chapter The\nAnalytic Signal and Hilbert Transform Filters. Julius Smith, 2007. 5\n\n[41] M. Soler, J.-C. Bazin, O. Wang, A. Krause, and A. Sorkine-Hornung. Suggesting sounds for images from\nvideo collections. In European Conference on Computer Vision (ECCV), 2016. 3\n\n[42] N. Strobel, S. Spors, and R. Rabenstein. Joint audio-video object localization and tracking. IEEE Signal\nProcessing Magazine, 18(1):22\u201331, 2001. 3\n\n[43] J.-M. Valin, F. Michaud, and J. Rouat. Robust localization and tracking of simultaneous moving sound\nsources using beamforming and particle filtering. Robotics and Autonomous Systems, 55(3):216\u2013228, 2007.\n3\n\n[44] D. Wang and J. Chen. Supervised speech separation based on deep learning: An overview. IEEE/ACM\nTransactions on Audio, Speech, and Language Processing, 2018. 3\n\n[45] E. B. Wilson. Probable inference, the law of succession, and statistical inference. Journal of the American\nStatistical Association, 22(158):209\u2013212, 1927. 8\n\n[46] R. Zhang, P. Isola, and A. A. Efros. Colorful image colorization. In European Conference on Computer\nVision (ECCV), 2016. 3\n\n[47] H. Zhao, C. Gan, A. Rouditchenko, C. Vondrick, J. McDermott, and A. Torralba. The sound of pixels. In\nEuropean Conference on Computer Vision (ECCV), 2018. 
3", "award": [], "sourceid": 236, "authors": [{"given_name": "Pedro", "family_name": "Morgado", "institution": "University of California, San Diego"}, {"given_name": "Nuno", "family_name": "Vasconcelos", "institution": "UC San Diego"}, {"given_name": "Timothy", "family_name": "Langlois", "institution": "Adobe Systems Inc"}, {"given_name": "Oliver", "family_name": "Wang", "institution": "Adobe Research"}]}