Robust Sound Source Tracking Using SRP-PHAT and 3D Convolutional Neural Networks

This is the author's version of an article that has been published in this journal. Changes were made to this version by the publisher prior to publication. The final version of record is available at http://dx.doi.org/10.1109/TASLP.2020.3040031

Copyright (c) 2020 IEEE. Personal use is permitted. For any other purposes, permission must be obtained from the IEEE by emailing pubs-permissions@ieee.org.

David Diaz-Guerra, Student Member, IEEE, Antonio Miguel and Jose R. Beltran

D. Diaz-Guerra, A. Miguel and J.R. Beltran are with the Department of Electronic Engineering and Communications, University of Zaragoza, Zaragoza, Spain.

Abstract: In this paper, we present a new single sound source DOA estimation and tracking system based on the well-known SRP-PHAT algorithm and a three-dimensional Convolutional Neural Network. It uses SRP-PHAT power maps as input features of a fully convolutional causal architecture that uses 3D convolutional layers to accurately perform the tracking of a sound source even in highly reverberant scenarios where most of the state of the art techniques fail. Unlike previous methods, since we do not use bidirectional recurrent layers and all our convolutional layers are causal in the time dimension, our system is feasible for real-time applications and it provides a new DOA estimation for each new SRP-PHAT map. To train the model, we introduce a new procedure to simulate random trajectories as they are needed during the training, equivalent to an infinite-size dataset with high flexibility to modify its acoustical conditions such as the reverberation time. We use both acoustical simulations on a large range of reverberation times and the actual recordings of the LOCATA dataset to prove the robustness of our system and its good performance even using low-resolution SRP-PHAT maps.

Index Terms: microphone arrays, direction of arrival estimation, DOA, sound source tracking, SRP-PHAT, convolutional neural networks, CNN.

I. INTRODUCTION

Direction Of Arrival (DOA) estimation and Sound Source Localization with microphone arrays has been widely investigated and used in different applications, such as robot audition [1], [2], acoustic characterization [3], speech recognition [4], [5] or teleconference systems [6]. Most of the techniques in the literature can be roughly classified into i) Time Difference Of Arrival (TDOA) based techniques, which first use the Generalized Cross-Correlation (GCCs) functions [7] to estimate the TDOA and then compute the most reliable DOA for them (it is worth saying that there are also some alternatives to the GCCs such as the eigenvalue decomposition [8] or even deep-learning based techniques [9]), ii) beamforming based techniques, such as SRP-PHAT [10], [11], which search the direction that maximizes the power of the output of a beamformer, and iii) subspace techniques, such as Multiple Signal Classification (MUSIC) [12], [13], based on the eigenstructure of the narrowband cross-correlation matrices. These techniques vary in terms of computational complexity and their robustness against adverse scenarios such as noise and reverberation. When they have to deal with non-stationary signals, such as speech, a tracking algorithm is needed after them to exploit the temporal correlation between source positions [14]–[17].

More recently, some methods based on Deep Neural Networks (DNN) have been proposed. From the original Multilayer Perceptron (MLP) used in the first proposals [18], [19], their architectures have evolved into more sophisticated Convolutional Neural Networks [20]–[22] which can jointly perform DOA estimation and tracking. However, although DNN-based techniques claim to be more robust than the classical methods, their use of the CNNs might not be the most appropriate and many of them add non-causal recurrent layers that make them unfeasible for real-time applications.

In this paper, we propose the use of 3D CNNs over SRP-PHAT power maps to jointly perform the DOA estimation and the tracking of a source in highly reverberant rooms. We present a completely causal technique that provides a new DOA estimation with each new power map and we show its robustness through several simulations in adverse conditions. We analyze how the resolution of the SRP-PHAT power maps affects our technique and we prove that by using CNNs we can obtain resolutions that surpass the search grid employed to compute the maps. Finally, we apply our model to the acoustic source LOCalization And TrAcking (LOCATA) challenge [23] dataset in order to show how the models trained with simulated signals are general enough to work with actual recordings in real conditions. Although we focus on compact arrays and evaluate the performance of our technique with an array with 12 microphones mounted over a NAO robot head, the technique may be used with any array geometry.

It is worth mentioning that we focus on single source scenarios that are supposed to always have an active source; therefore, our tracking does not need to deal with data association and with the birth and death of the source. For a single source scenario, the birth and death problem may be easily solved with a Voice Activity Detector (VAD) but extending our system to deal with multiple sources might be more difficult. However, some ideas are proposed in section III-A.

In order to encourage and facilitate the replicability of this research, the source code of our model and the models used as baselines, as well as everything needed to train and test them, can be found in our public repository (https://github.com/DavidDiazGuerra/Cross3D); we also share the trained models there.

The remainder of this paper is structured as follows. We first review the SRP-PHAT algorithm (section II) and the state of the art of DNN-based DOA estimation techniques (section III). In section IV we present our proposed technique and in section V we analyze its performance with both simulated rooms and actual recordings. Finally, section VI concludes the paper.

II. THE SRP-PHAT ALGORITHM

The signal received at the $n$-th sensor of a microphone array can be modeled as

$$x_n(t) = a_s(t) * h_n(\boldsymbol{\theta}_s, t) + v_n(t), \tag{1}$$

where $*$ denotes convolution, $a_s(t)$ is the signal generated by the source, $\boldsymbol{\theta}_s$ is the source position, $h_n(\boldsymbol{\theta}_s, t)$ is the impulse response from $\boldsymbol{\theta}_s$ to the $n$-th sensor, and $v_n(t)$ is the noise of the sensor, which is typically supposed to be white, Gaussian, and uncorrelated with the source signal and with the noises of other sensors. It is worth mentioning that $\boldsymbol{\theta}$ is written in bold because it can represent an angle, two spherical coordinates, or even a point in 3D Cartesian coordinates depending on the geometry of the array.

One of the most classic and popular approaches to DOA estimation is finding the direction that maximizes the Steered Response Power (SRP) that we would obtain using a filter-and-sum beamformer:

$$\hat{\boldsymbol{\theta}}_s = \underset{\boldsymbol{\theta}}{\operatorname{argmax}}\; P(\boldsymbol{\theta}) \tag{2}$$

$$P(\boldsymbol{\theta}) = \int_{-\infty}^{\infty} \left| \sum_{n=0}^{N-1} G_n(\omega) X_n(\omega) e^{j\omega\tau_n(\boldsymbol{\theta})} \right|^2 d\omega, \tag{3}$$

where $N$ is the number of sensors of the array, $X_n(\omega)$ is the Fourier Transform of $x_n(t)$, $G_n(\omega)$ is the frequency response of the filter for the channel $n$, and $\tau_n(\boldsymbol{\theta})$ is the time delay occurring from the position or direction $\boldsymbol{\theta}$ to the $n$-th sensor.

Although directly implementing (3) would be computationally expensive, it can be computed in terms of the Generalized Cross-Correlation functions as

$$P(\boldsymbol{\theta}) = 2\pi \sum_{n=0}^{N-1} \sum_{m=0}^{N-1} R_{nm}(\Delta\tau_{nm}(\boldsymbol{\theta})), \tag{4}$$

where $\Delta\tau_{nm}(\boldsymbol{\theta}) = \tau_n(\boldsymbol{\theta}) - \tau_m(\boldsymbol{\theta})$ and $R_{nm}$ is the GCC between the signals of the $n$-th and the $m$-th sensor:

$$R_{nm}(\tau) = \frac{1}{2\pi} \int_{-\infty}^{\infty} \Psi_{nm}(\omega) X_n(\omega) X_m^*(\omega) e^{j\omega\tau} d\omega, \tag{5}$$

where $^*$ denotes the complex conjugate and $\Psi_{nm}(\omega) = G_n(\omega) G_m^*(\omega)$ is a weighting function.

Equation (4), combined with the use of the PHAse Transform (PHAT) $G_n(\omega) = 1/|X_n(\omega)|$, is commonly known as the SRP-PHAT algorithm [10], [11], and allows us to obtain an acoustic power map of the environment whose maximum should correspond with the source position.

Although the SRP-PHAT algorithm is a good trade-off between robustness and computational efficiency, obtaining more accurate results than two-step TDOA based techniques with a lower computational cost than most of the broadband subspace techniques, it still presents several issues. The main advantage of (4) is that most of its computational cost is in computing the GCCs and does not increase with the search space. However, the computation of its sums for each direction, especially if it is needed to interpolate $R_{nm}(\Delta\tau_{nm}(\boldsymbol{\theta}))$ from its adjacent samples, may not be negligible; this problem becomes more challenging when the search space is two-dimensional, e.g. the two angles of the spherical coordinates, or even three-dimensional, e.g. XYZ coordinates. Some search strategies have been proposed to reduce the number of evaluations of (4) that need to be computed to accurately find the maximum of $P(\boldsymbol{\theta})$ [24]–[26] but, due to the non-convexity of the SRP-PHAT power maps, the number of SRP-PHAT evaluations needed might still be an issue in some scenarios. In [26], [27], it is proposed to modify (4) to compute the power received from a space region instead of from a point, so they can use hierarchical search strategies over maps with lower resolution.

As we can see in Fig. 1, in favorable scenarios with high SNR and low reverberation, the SRP-PHAT power maps have a clear maximum in the DOA of the sound source that can be used to obtain a good estimation even with low-resolution maps but, when SNR decreases and the reverberation increases, such as in the scenario of Fig. 2, the maps present several local maxima that may be incorrectly interpreted as the DOA of the sound, especially when using low-resolution maps. However, in those maps, in addition to the maxima, we can also observe several patterns that are also related to the DOA of the sound and the geometry of the array and that may be exploited to obtain a more accurate DOA estimation.

Fig. 1. Example of SRP-PHAT power maps with different resolutions (panels a to d) in a favorable scenario: SNR = 30 dB and $T_{60}$ = 0.3 s. The red dot indicates the actual DOA of the sound source and the black dot is at the maximum of the map.

Fig. 2. Example of SRP-PHAT power maps with different resolutions (panels a to d) in an adverse scenario: SNR = 5 dB and $T_{60}$ = 0.9 s. The red dot indicates the actual DOA of the sound source and the black dot is at the maximum of the map.

Due to the non-stationary nature of most of the signals of interest, such as speech or music, a tracking stage is needed after the DOA estimation to exploit the temporal correlation between the source positions and to avoid inaccurate estimations in frames where the power of the signal is low or its autocorrelation makes the maximum of the power map become too wide. The algorithms for one source tracking are typically based on the Kalman filter [14], [15] although more advanced techniques have been proposed to deal with multiple sources, such as those based on particle filtering [16], [17]. However, these approaches use two-step strategies which make them sensitive to potential information loss when only the DOA estimations are used for the tracking; e.g. the absolute maximum of the SRP-PHAT maps is always selected even if another local maximum was much closer to the previous estimations, and the same likelihood is assigned to all the DOA estimations although some of them correspond to wider maxima from frames where the source was weaker. Including some of this information in the tracking algorithms may be possible, but it would increase both their complexity and the number of parameters that would need to be fine-tuned. In [28], a technique to share information between an iterative DOA estimator based on Expectation-Maximization [29] and a tracking system based on particle filtering is proposed. In this paper, we use Neural Networks to jointly perform DOA estimation and tracking, since they have been proved to have an excellent performance in several end-to-end problems in other fields, such as computer vision or speech recognition or synthesis.

III. DOA ESTIMATION WITH DEEP NEURAL NETWORKS

One of the first proposals of using neural networks for DOA estimation was [18], which used a fully connected perceptron with a hidden layer to obtain the DOA from the GCCs as a classification problem. Since then, several techniques have been proposed, differing in the output format, the features that they use as input, and the network architecture.

A. Output format

To obtain the DOA estimation as a classification problem, we first need to define a grid of directions where the source can be found (similar to the resolution of the SRP-PHAT maps) so the network has an output per grid point. The network of [18] had 359 outputs, so they had a maximum resolution of 1 degree for azimuth estimation. However, if we want to estimate both azimuth and elevation (or even XYZ coordinates), the number of outputs would dramatically increase, and therefore so would its computational complexity and the size of the dataset needed to train the network.

In a footnote of [18], it was claimed that they had obtained worse results when they tried to estimate the DOA as a regression problem, and classification approaches seem to be the most popular [19], [30]–[34]. However, since [22] proposed using 3 outputs to estimate a unitary vector pointing to the direction of the source, several models have followed this approach [35]–[37]; further studies about the advantages of using 3 regression outputs to infer the Cartesian coordinates instead of 2 to directly obtain the azimuth and the elevation can be found in [38], [39]. Motivated by the good results of these recent works, we opted to follow a regression approach.

One of the main drawbacks of solving the DOA estimation as a regression problem is that it makes it harder to estimate the DOA of multiple sources. Since they also classify the sources into several classes, Sound Event Localization and Detection (SELD) models usually have a regression output for each source class [22], [35]–[37], supposing that only one source of each class can be active at the same time. Another possible approach might be the use of a single-source DOA estimator (as the one proposed here) combined with a source cancellation technique [40], [41] to iteratively find multiple sources.

B. Input features and network architecture

Initially, the most common input features were the GCCs between the signals of each sensor [18], [33], [34], but we can also find other approaches such as using the eigenvectors of the spatial covariance matrix [19]. In [42], we proposed using low-resolution SRP-PHAT maps, in that case combined with fully connected perceptrons. More recently, some techniques have been proposed using 2D convolutional networks over the spectrogram of the microphone signals, using only the phase information [30], the magnitude information [32] or both of them [22], [31], [35]; some transformations, such as using the cepstrogram [21] or the Mel spectrogram [37], have also been proposed. Other features proposed as inputs of CNNs are Ambisonics intensity vectors [43], raw audio samples [20] and combinations of several of the already mentioned features [21], [36], [37].

One of the most important properties of CNNs is that they are equivariant to translations, which, in plain words, means that if we apply a translation to the input features we get the same output with its equivalent translation. This property is very useful in many computer vision applications, where the same patterns have the same meaning in all the positions of the image. When used for DOA estimation and tracking, 2D CNNs are typically used over spectrograms, so convolution is performed over the time and the frequency axes and each microphone spectrogram is treated as a different channel. Being equivariant to time translations seems to be an advantage since we would expect similar patterns for any source in a position no matter the time instant when it was there. However, since the phase differences for the same source position vary with the frequency, equivariance to frequency shifts may not be an interesting property. Another approach to the use of 2D CNNs is proposed in [30], where the convolution is performed over the frequency and the microphone dimensions and the time evolution is not taken into account by the network, i.e. they do not perform any tracking. As they work with a Uniform Linear Array (ULA), the phase differences expected for a source position are the same for each pair of adjacent microphones, so equivariance is desired, but this would not be the case for other array geometries.

In this paper, we propose the use of CNNs over SRP-PHAT power maps, performing the convolution over the dimensions of the maps and the temporal dimension. Any kind of SRP-PHAT power maps could be employed with this approach depending on the geometry of the array, but as we focus on compact arrays, we use 2D spherical power maps and therefore, since we include the temporal dimension, 3D CNNs. Actually, working over spherical maps, equivariance to spherical translations (i.e. rotations) would be preferred over equivariance to Euclidean translations, but this would lead us to the use of Spherical CNNs [44], which are still less efficient from a computational point of view than classical CNNs. The extension to 4D CNNs over 3D SRP-PHAT maps to perform 3D Sound Source Localization (SSL) with distributed arrays would be straightforward.

Many of the state of the art CNN architectures include bidirectional recurrent units at the last layers of the model. Recurrent Neural Networks (RNNs), as recurrent linear filters, make the output at any time instant dependent on the values of the input at every previous time instant and, therefore, applying them in the backward direction is extremely non-causal. Obviously, any tracking system can benefit greatly from the information of the future positions of the source but, in order to make our system feasible for real-time applications, we opt for using only causal convolutional layers.

IV. PROPOSED TECHNIQUE

A. Training dataset

Due to the difficulty of obtaining an accurately hand-labeled dataset of moving sources recorded with microphone arrays, we opted to train our model with simulated signals; another approach might have been using measured RIRs convolved with speech signals, but this would have reduced the amount of different acoustic conditions seen by the model during training, increasing the possibility of overfitting to those conditions and not generalizing. Instead of generating a dataset and using it to train the network, we simulate the inputs of the networks as they are needed during training. This makes the training slower, but has two important advantages: 1) we have an infinite-size dataset, since all the random parameters of the simulation are modified for each trajectory simulated during the training, which reduces the risk of overfitting, and 2) we have higher flexibility to modify the probability distribution of the parameters of the simulation, such as the signal to noise ratio or the reverberation time, during training so we can perform curriculum learning strategies [45].

As shown in Fig. 3, we use LibriSpeech utterances as sound sources. The LibriSpeech corpus [46] contains 960 hours of speech sampled at $f_s$ = 16 kHz extracted from audiobooks. Although audiobooks could be expected to contain quite clean speech signals, we found that some of them have a strong background noise that, after being filtered by the RIRs, would be located in the same position as the source and would facilitate its localization and tracking in silent segments. To prevent our network from learning to exploit this fact, which will not be present in actual recordings, we use the WebRTC Voice Activity Detector (VAD) [47] to detect silent segments and clean them by completely removing the signal in those frames.

Fig. 3. Dataset generation process. Italic letters represent variables and regular letters represent processes. Right-angled boxes represent deterministic processes and round boxes represent stochastic variables or processes.

The size of each room is randomly selected from the range 3 m × 3 m × 2.5 m to 10 m × 8 m × 6 m and the array is randomly placed inside the room, being restricted to have a separation from the walls of 10% of the room size in each dimension and to be in the lower half of the room for the vertical axis. The Signal to Noise Ratio (SNR) and reverberation time ($T_{60}$) are also randomly selected from the ranges 5 dB to 30 dB and 0.2 s to 1.3 s respectively. Uniform distributions over the specified ranges are used for all the random parameters of the dataset.

We need to randomly generate continuous trajectory points, so it is possible to track them, but with enough diversity to prevent the network from learning how they are generated and overfitting to them. In order to do so, we randomly select two points within the room boundaries to be the starting ($\mathbf{p}_0 = [p_{x0}, p_{y0}, p_{z0}]^T$) and ending ($\mathbf{p}_L = [p_{xL}, p_{yL}, p_{zL}]^T$) points of the trajectory and add to the straight line that connects them a sinusoidal function in each axis with random frequencies ($\boldsymbol{\omega} = [\omega_x, \omega_y, \omega_z]^T$) and amplitudes ($\mathbf{A} = [A_x, A_y, A_z]^T$), ensuring that no more than 2 oscillations are performed during the trajectory in each axis and that the amplitude is low enough to prevent the source from exiting the room:

$$\mathbf{p}_i = \mathbf{p}_0 + \frac{i}{L-1}(\mathbf{p}_L - \mathbf{p}_0) + \mathbf{A} \odot \sin(\boldsymbol{\omega} i), \tag{6}$$

where $L$ is the number of points of the trajectory, $\odot$ stands for the pointwise product and the sin function also operates pointwise. Although the generation model is quite simple, it generates quite diverse trajectories (some examples are shown in Fig. 4) and, since the network only sees the azimuth and elevation coordinates and has a limited temporal perceptive field, the model should not overfit to it. In order to confirm this, we tested our model in a more realistic scenario with the recordings of the LOCATA dataset (see section V-B2).

Fig. 4. Examples of source trajectories used to train the model. The red dots are the trajectory points and the gray points represent the microphones.

To simulate the movement of the source, we use the GPU implementation of the Image Source Method [48] found in the Python library gpuRIR [49]; the use of this library allows us to reduce the simulation time by two orders of magnitude and makes it possible to perform the simulations during the training of the network.

For the results presented in this paper, we simulated a microphone array with 12 sensors designed to be mounted over a NAO robot head; the minimum and maximum inter-microphone distances of the array are 1.3 cm and 12.1 cm and the actual position of each microphone can be found in the documentation of the LOCATA dataset [50]. We did not include the effect of the robot head in the simulation. Including the scattering generated by it might improve the results obtained with the LOCATA dataset, but would have also increased the complexity of the simulations and would have made the training slower. Since different array geometries would lead to different patterns, the model should be re-trained for any new microphone array. We could take advantage of the similarity of the power maps of most compact arrays to apply transfer learning strategies when training models for new arrays. However, this is not a big issue since we do not need to record a dataset with the new array but just simulate it.

Having simulated the propagation of the sound from the moving source to each microphone of the array using the overlap-add method, we add an omnidirectional Gaussian noise to obtain the desired SNR, window the signal using Hanning windows of length $K$ = 4096 samples (i.e. 256 ms) with a hop size of $3K/4$, and apply (4) to obtain the SRP-PHAT map of each window. In order to compute the noise power needed to obtain the desired SNR, we computed the signal power as the average power of all the non-silent frames of the trajectory. Finally, we subtract its mean from each map and divide it by its maximum to fit it to the range [-1,1]. For the sake of computational efficiency, we do not perform any kind of interpolation in the computation of (4) and just approximate the fractional delays to the nearest sample.

We found that, since this simulation process did not include any directional noise, the models trained with it were very sensitive to directional noise sources. For example, in some of the recordings of the LOCATA dataset, the noise of a fan is present and, although its power is very low, the models tracked it when it was the only active sound source. In order to avoid this issue, we use the WebRTC VAD to determine in which frames the speech source is active. We first tried to include the VAD information as an additional input channel to the network. However, as during the training the VAD sometimes failed and classified frames with speech information as silent, the network learned that even the frames classified as silent could contain useful tracking information as long as they contained a directional source and therefore ignored the VAD input. In order to prevent the network from tracking the directional noise sources of the LOCATA dataset, we finally opted to set to zero the maps corresponding to frames classified as silent by the VAD so no directional information was seen by the network when there was not any speech source active.

B. Model architecture

Our model takes as input a 4-dimensional tensor ($\mathbf{M}$) with size $C \times T \times N_\theta \times N_\varphi$, whose first channel $M_{1,t,i,j}$ is built by computing $T$ temporally equispaced SRP-PHAT maps with $N_\theta$ equispaced elevation angles in the range $\theta \in [0, \pi]$ and $N_\varphi$ equispaced azimuth angles in $\varphi \in [-\pi, \pi)$; for planar arrays, the same model could be used sampling the elevation only in $\theta \in [0, \pi/2]$. Using uniform spherical sampling instead of equispaced angles might have led to more precise SRP-PHAT maps, but would prevent us from using standard convolutional layers.

Although the model must learn more complex patterns in order to exploit all the information available in the SRP-PHAT maps, it is obvious that one of the main sources of information about the DOA of the source is the position of the maximum of each map; however, the argmax function is highly non-linear (and non-differentiable) and it is not easy for an artificial neural network to learn and fit it. Since it did not cause a significant increase of the computational complexity of the algorithm, we decided to explicitly indicate to the network the position of the maximum of each map. After trying to introduce this information in different layers, we found that the best results were obtained including it in the input of the network, using $C = 3$ with $M_{2,t,i,j} = \hat{\theta}_t^{SRP}$ and $M_{3,t,i,j} = \hat{\varphi}_t^{SRP}$ for any $t \in \{1, ..., T\}$, $i \in \{1, ..., N_\theta\}$, and $j \in \{1, ..., N_\varphi\}$, where $\hat{\theta}_t^{SRP}$ and $\hat{\varphi}_t^{SRP}$ are the DOA equivalent to the position of the maximum of the map $t$ normalized to be in the range [0,1]. This approach might seem quite redundant and inefficient, but it is a typical approach to condition the output of a CNN to the value of a variable since it is the simplest way to include that information in the first layers of the network keeping its convolutional architecture, whose implementation is extremely optimized in the Deep Learning software libraries.

The first layer of our model is a 3D convolutional layer with 32 kernels of size 5 × 5 × 5 and PReLU activations [51]. It is worth mentioning that, for the temporal axis, we always use causal convolutions, so this model could be used in real time applications generating a new DOA estimation for each new power map available and without introducing any delay.

Pooling layers are typically used in CNNs to progressively reduce the size of the input and make the model generalize; not using them means that the fully connected layers used at the end of most of the convolutional models would have a huge number of trainable parameters which would surely overfit. When the desired output of the CNN is a summary of the information of the whole input, e.g. in image classification tasks, increasing the number of channels with convolutional layers and reducing their size with pooling layers progressively reduces the spatial information and gets higher-level representations of the input. However, since our desired output is not only related to the presence of some patterns but especially to their position, we must be careful when using them.

In order to get the benefits of pooling layers while allowing the spatial information to reach the last layers of the model, we opted to, as shown in Fig. 5, split the model into two branches and apply max pooling in a different dimension in each one. Working this way, the branch which pools the $\varphi$ axis can retain positional information about the $\theta$ coordinate of the maps and vice versa. Specifically, each branch has 4 layers with a convolution with 32 kernels of size 5 × 3 × 3, PReLU activations, and a max-pooling with a kernel size of 1 × 1 × 2 and 1 × 2 × 1 respectively. If the input power maps have less than 16 points in the $\theta$ or the $\varphi$ axes, it would not be possible to perform so many pooling layers; in those cases, we reduce the 4 layers to the maximum number possible: $\log_2(\min(N_\theta, N_\varphi))$. Due to the use of 3D convolutional layers and these perpendicular branches, we named our model Cross3D.

Fig. 5. Model architecture. The noted sizes correspond to a model for 16x32 maps and an input sequence of length 103. For the sake of simplicity, we represented it with only 1 input channel, although it actually has 3.

After the 3D convolutional layers, we concatenate the results of each branch and reshape them so we have a temporal sequence of length $T$ for each one of the elements of each channel and spherical coordinates. Each one of these temporal sequences is used as an input channel of a 1D causal convolutional layer with 128 kernels of length 5 and PReLU activations. Finally, the resultant 128 time sequences are passed through another 1D causal convolutional layer with only 3 kernels of length 5 and tanh activations. These layers are similar to the fully connected layers that most of the CNN architectures have, but we include a temporal convolution so they can still exploit the tracking information. We use a dilation factor of 2 in order to allow the tracking to take into account a longer context without increasing the complexity of the network. With all the temporal convolutions included in the model, each DOA estimation is computed from the last 37 SRP-PHAT maps, i.e. the tracking memory is 7.17 s.

The result of all this process is 3 time sequences of length $T$ whose elements are in the range $(-1, 1)$, which are considered to be the XYZ coordinates of a unitary vector pointing in the direction of the source in each time frame. Tables detailing the network architecture of Cross3D for several SRP-PHAT map resolutions can be found in the supplementary material of this paper.

C. Training

We trained our model to minimize the Euclidean distance between the output of the network and the 3 time sequences obtained from the coordinates of the unitary vectors that point to the direction where the sound source was simulated in each time window. Similarly to the results reported in [22], [38], [39], we obtained better results using this approach than trying

range of SNRs, increased the batch size to 10 trajectories, and reduced the learning rate from 1e-4 to 1e-5.

V. EVALUATION

A. Baseline methods

TABLE I
MODELS EMPLOYED FOR THE EVALUATION

Model          Input                 Trainable parameters   Temporal perceptive field   Window length   Causal
Cross3D        Power maps 4x8        526 372                5.63 s                      4096            Yes
Cross3D        Power maps 8x16       946 340                6.40 s                      4096            Yes
Cross3D        Power maps 16x32      1 693 988              7.17 s                      4096            Yes
Cross3D        Power maps 32x64      5 626 148              7.17 s                      4096            Yes
Cross3D        Power maps 64x128     21 354 788             7.17 s                      4096            Yes
1D CNN         GCCs                  11 282 436             7.17 s                      4096            Yes
1D CNN         Maximums (64x128)     6 899 716              7.17 s                      4096            Yes
2D CNN         Spectrograms          1 882 372              7.17 s                      4096            Yes
SELDnet [22]   Spectrograms          104 643                ∞                           512             No

In order to analyze the convenience of using SRP-PHAT maps as input features of CNNs for DOA estimation, we developed some alternative CNNs to use them as baselines. We designed them to be as similar as possible to our proposed model and to have the same temporal perceptive field so they have the same tracking information.

Since we are including the position of the maximum of each map into the input of the network, we should verify whether our model is actually exploiting the additional information that is within the SRP-PHAT maps or if it is only using the position of its maxima. To do that, we designed a 1D CNN which takes as input 2 time sequences with the coordinates of the maximum of each map normalized to the range [0,1] and applies to them 7 layers of 1D causal convolutions with PReLU activations and without any pooling. All the layers had a kernel size of 5 and the last two layers used a dilation factor of 2, so its temporal perceptive field is 37 frames as in Cross3D, and the number of channels of each layer was {1024, 512, 512, 512, 512, 128, 3}.
The results shown in the to directly obtain the spherical coordinates from the network following sections were obtained training this network with even when using the great-circle distance between the output the same process described in section IV-C and using the and the ground-truth DOA angles as cost function. coordinates of maps with resolution 64x128. Although using an infinite-size dataset the term “epoch” One of the most common input features employed by the does not have the same meaning than in most of the machine first DOA estimation techniques based on neural networks learning systems, we define an epoch as 585 trajectories (the were the GCCs. They typically employed fully connected number of book chapters in the LibriSpeech train-clean-100 perceptrons with not too many hidden layers and, since they subset). We employed 80 epochs with trajectories of 20 s, only used the GCCs computed in a temporal window, did not i.e. 103 SRP-PHAT maps, to train the model with the Adam perform any kind of tracking. Following this idea, but with algorithm [52] using Pytorch [53]. the aim of including tracking information to the network, we As explained in section IV-A, we trained the networks with used the same 1D causal CNN than we used over the map reverberation times and SNRs uniformly distributed from 0:2 s maximums but using as input sequences the temporal evolution to 1:3 s and from 5 dB to 30 dB respectively; however, we of each element of the GCCs which represented an inter- found that the training converged faster with higher SNRs. microphone delay lower than the maximum inter-microphone Therefore, we followed a curriculum learning strategy [45] distance divided by the speed of sound. using batches of 5 trajectories with SNR=30 dB for the first Although, as explained in section III, the use of 2D CNNs 20 epochs and for the following epochs we employed the full over spectrograms may not be optimal, we also implemented a Copyright (c) 2020 IEEE. 
model following this approach since it is quite popular in the literature. For a fair comparison, we used causal convolutions with a similar architecture to Cross3D: one convolution with 256 5x5 kernels, four convolutions with 256 5x5 kernels with 1x4 pooling, a reshape to transform the remaining features into temporal sequences, and two 1D causal convolutions with kernel size 5 and dilation factor 2, the first one with 128 channels and the last one with 3. For computing the spectrogram, we used the same windows as for computing the SRP-PHAT maps, extracted the magnitude and phase of each frequency of the FFT, and finally normalized the magnitude of each window to its maximum and the phase to the range [-1,1].

Finally, we also trained with our simulation procedure a replica of SELDnet [22], but without including the Sound Event Detection (SED) output and with only a DOA output, since we were only interested in tracking one source. This model takes as inputs the magnitude and phase of the spectrograms and has three 2D convolutional layers followed by two bidirectional Gated Recurrent Units (GRU) [54] and two fully connected layers. It is worth saying that this model, due to the bidirectional GRUs, is non-causal and that it uses shorter analysis windows than the other analyzed methods.

For the models that use spectrograms as input features, we found that they did not train properly with the full range of reverberations described in section IV-A, and we got the best results training them with values of $T_{60}$ randomly selected from the range 0 s to 0.3 s.

All the models employed for the evaluation are summarized in Table I, and tables detailing their architectures can be found in the supplementary material of this paper.

B. Results

1) Simulated dataset: We trained different models for several power map resolutions with the whole range of reverberation times and SNRs, and then we tested their performance for several specific values of $T_{60}$ and SNR in order to analyze the robustness of the proposed tracking system.

Since we are using SRP-PHAT power maps as the input of our algorithm, we started our evaluation by comparing our model with the classic SRP-PHAT algorithm. SRP-PHAT does not perform any kind of tracking, so, for a fairer comparison, we did not take into account the silent frames when computing the Root Mean Squared Angular Errors (RMSAE) shown in Fig. 6. As we can see in this figure, when working with high-resolution power maps in almost anechoic rooms with high SNR, using our 3D CNN over the SRP-PHAT maps does not improve the results compared to just taking the maximum of each map; actually, our system seems to slightly degrade the DOA estimation, probably due to the effect of applying an unneeded tracking. However, when the room conditions deteriorate, we can see how Cross3D is robust enough that its performance degrades by only 5° when the $T_{60}$ increases to 1.5 s (which is higher than any reverberation seen during the training), while the SRP-PHAT algorithm is just unable to perform a proper estimation. Also surprising are the results obtained with maps of only 4x8 resolution, which only perform an SRP-PHAT measurement every 45° in the azimuth and 60° in the elevation and, since $P(\theta, \varphi) = P(\theta, 0) \; \forall \varphi \in [0, 2\pi)$ if $\theta = 0$ or $\pi$, only need 18 computations of (4).

[Figure 6: panels (a) to (e); vertical axis: RMSAE.]
Fig. 6. Localization Root Mean Squared Angular Error for several power map resolutions, SNRs and reverberation times. The silent frames were not included in the computation of the RMSAE.

Fig. 7 shows a couple of examples of simulated trajectories and their estimated DOA. In Fig. 7a we can see how, for scenarios with high reverberation and low SNRs, the maximum of the SRP-PHAT maps becomes really noisy, but our proposed system is able to maintain the estimated DOA quite close to the actual one. In linear systems, robust tracking with noisy estimations usually comes at the cost of being slow to track fast changes, at least with causal systems, but we can see how our model was able to follow the sudden change in the azimuth of the source at the fifth second of the trajectory. In Fig. 7b we can see how, when working with low-resolution power maps, our system is able to predict the DOA with much higher precision than the maximums of the maps. This could not be done with a two-step DOA estimation and tracking algorithm that performed the tracking based only on the maximum of the maps. Our system is able to analyze the whole maps, and it was able to learn to exploit the patterns in the SRP-PHAT maps to achieve a higher resolution than the grid used to compute the maps.

[Figure 7: (a) "Tracking example: 32x64 maps, T60=0.9s and SNR=5dB"; (b) "Tracking example: 4x8 maps, T60=0.3s and SNR=30dB". Curves: Elevation, Azimuth. Axes: time [s], DoA [°].]
Fig. 7. Examples of the DOA estimated in a scenario with $T_{60}$ = 0.9 s and SNR=5 dB using maps with 32x64 resolution (a) and in a scenario with $T_{60}$ = 0.3 s and SNR=30 dB using maps with 4x8 resolution (b). The solid line represents the actual DOA of the source, the dashed line the estimated DOA, and the crosses represent the maximum of each SRP-PHAT power map. Grey segments indicate silent frames.

Finally, we also tested the baseline methods under different reverberations and SNRs to compare the robustness of each model. In this case, since all the methods include tracking capabilities, we did not exclude the silent frames when we computed the RMSAEs shown in Fig. 8. We can see how the best results are obtained using our method with high resolution power maps, but that, even reducing the resolution, its performance is still competitive. Using a 1D CNN over the coordinates of the maximums of 64x128 SRP-PHAT maps performs worse than using our 3D CNN over 4x8 maps, so we can conclude that our model is exploiting the patterns present in the SRP-PHAT maps and not only using the information of the position of their maximums (this was also suggested by Fig. 7b). Using 1D CNNs over the GCCs (an approach that is, to the best of the authors' knowledge, unpublished) yields a performance between that of 3D CNNs over 4x8 and 8x16 maps and may be an interesting approach when a lower computational cost is needed. Finally, the models that use spectrograms as inputs perform well in favorable scenarios (SELDnet even outperforms our proposal in low noise anechoic chambers), but they are not very robust against noise and reverberation.

[Figure 8: panels (a) and (b); vertical axis: RMSAE.]
Fig. 8. Tracking Root Mean Squared Angular Error of Cross3D with several power map resolutions and the baseline methods for SNR=30 dB (a) and SNR=5 dB (b) and several reverberation times. The silent frames were also included in the computation of the RMSAE.

TABLE II
RMSAE [°] OF THE DOA ESTIMATED FOR THE LOCATA DATASET WITH CROSS3D USING SEVERAL MAP RESOLUTIONS AND THE BASELINE TRACKING METHODS. THE SILENT FRAMES WERE INCLUDED IN THE COMPUTATION OF THE RMSAE.

| Task | Recording | Cross3D (SRP-PHAT maps 4x8) | Cross3D (8x16) | Cross3D (16x32) | Cross3D (32x64) | Cross3D (64x128) | 1D CNN (GCCs) | 1D CNN (Maximums) | 2D CNN (Spectrograms) | SELDnet (Spectrograms) |
|---|---|---|---|---|---|---|---|---|---|---|
| Task 1 | Recording 1 | 17.93 | 11.92 | 8.30 | 4.62 | 5.16 | 16.18 | 7.54 | 93.76 | 29.70 |
| Task 1 | Recording 2 | 18.90 | 7.68 | 6.68 | 4.90 | 3.91 | 12.60 | 5.19 | 64.18 | 38.44 |
| Task 1 | Recording 3 | 10.35 | 6.34 | 2.98 | 3.25 | 2.24 | 11.57 | 5.09 | 140.21 | 54.81 |
| Task 1 | Average | 15.72 | 8.65 | 5.99 | 4.26 | 3.77 | 13.45 | 5.94 | 99.38 | 40.98 |
| Task 3 | Recording 1 | 23.06 | 18.11 | 13.79 | 12.43 | 9.92 | 13.59 | 14.04 | 70.86 | 50.57 |
| Task 3 | Recording 2 | 20.97 | 13.71 | 10.01 | 8.36 | 9.22 | 14.17 | 12.02 | 83.42 | 48.71 |
| Task 3 | Recording 3 | 21.05 | 12.74 | 9.83 | 7.69 | 6.60 | 15.21 | 13.29 | 82.48 | 57.29 |
| Task 3 | Average | 21.69 | 14.85 | 11.21 | 9.49 | 8.58 | 14.32 | 13.12 | 78.92 | 52.86 |
| Task 5 | Recording 1 | 11.93 | 10.83 | 7.25 | 5.74 | 5.49 | 10.93 | 10.53 | 58.33 | 37.24 |
| Task 5 | Recording 2 | 20.92 | 16.16 | 16.08 | 12.18 | 13.59 | 17.33 | 17.42 | 41.98 | 73.17 |
| Task 5 | Recording 3 | 23.57 | 18.25 | 13.58 | 15.64 | 15.49 | 20.14 | 23.58 | 66.91 | 66.50 |
| Task 5 | Average | 18.81 | 15.08 | 12.31 | 11.19 | 11.52 | 16.13 | 17.18 | 55.74 | 58.97 |
| All | Average | 18.74 | 12.86 | 9.83 | 8.31 | 7.96 | 14.64 | 12.08 | 78.01 | 50.94 |

2) LOCATA dataset: In order to confirm that, although it was trained with a simulated dataset, our system is general enough to track sound sources recorded in real rooms, we tested it with the LOCATA challenge dataset [23], which
contains several recordings with the same array that we had simulated to train the models. We used the development dataset and we focused on tasks 1, 3, and 5 of the challenge: a static loudspeaker recorded with a static array, a moving talker recorded with a static array, and a moving talker recorded with a moving array; it is worth mentioning that the array was static in all the simulations employed to train the model. For the robot head microphone array that we simulated in the training dataset, the development dataset contains 3 recordings for each task and their ground-truth positions.

It is worth saying that the only modification to the proposed technique that we made after seeing its performance with the LOCATA dataset was the use of a VAD. All the hyperparameters of the model and the acoustical properties of the training dataset were selected according only to the results obtained with simulated datasets. In other words, we used simulated signals for training and validation and the LOCATA recordings only for testing.

Table II shows the RMSAE of estimating the DOA of the source of each recording using our technique and the baseline methods. Although it is difficult to draw conclusions from such a low number of recordings, we can see how the proposed tracking system clearly outperforms the baseline methods that use spectrograms as inputs and that it also outperforms the 1D CNN methods when we use maps with at least 16x32 resolution.

According to [23], the reverberation time of the room where the recordings were performed was $T_{60} \approx 0.5$ s, so we can observe some degradation in the performance of Cross3D when it is used over low resolution power maps compared with the results obtained with the simulated dataset (see Fig. 8), but it disappears when the resolution of the maps increases; actually, we even reach lower errors in the LOCATA dataset than with the simulated test dataset. Using a 1D CNN over the GCCs also suffers a similar degradation, but the most dramatic impact is on the methods which use spectrograms as inputs. In contrast, the use of a 1D CNN over the coordinates of the maximums of high resolution SRP-PHAT maps suffers almost no degradation; but it may not be an interesting approach since, having computed the 64x128 resolution maps, we can obtain far better results using the whole maps as inputs of Cross3D.

[Figure 9: estimated DOA curves for Cross3D and the baseline methods.]
Fig. 9. DOA estimated for the second recording of the third task of the LOCATA challenge using Cross3D over 32x64 maps and the baseline methods.

As an example, Fig. 9 shows the DOA estimation of the second recording of the third task of the LOCATA dataset, where all the methods obtained an RMSAE quite close to their average. We can see how Cross3D performs the best estimation of the analyzed methods both for the elevation and for the azimuth. We can also see that the LOCATA dataset has longer silences than the ones present in the simulated dataset, which could also explain why some of the methods obtained worse results with this dataset. In order to make the methods based on CNNs more robust against longer silences, we should include them in the simulation of the training dataset and, probably, increase the temporal receptive field of the models, which could be done by increasing the number of layers, increasing the temporal size of their kernels, or including longer temporal dilations in the convolutions.

VI. CONCLUSIONS

In this paper, we have presented a new sound source DOA estimation and tracking system based on the well-known SRP-PHAT method and a three-dimensional Convolutional Neural Network. The use of a fully causal convolutional architecture, without any bidirectional recurrent layer, makes our proposal feasible for real-time applications, being able to provide a new DOA estimation every 192 ms. We used a 3D CNN over time sequences of elevation and azimuth maps computed from the signals captured by a compact array but, using distributed arrays to compute 3D maps, the extension of the technique to use a 4D CNN would be straightforward.

The experiments performed show that the SRP-PHAT maps are a good input feature to be used in tracking systems based on deep learning, being much more robust to reverberation and noise than the use of spectrograms as proposed in most of the recent literature. They also prove that it is possible to obtain a good tracking performance using only causal convolutional layers and that non-causal recurrent layers are not needed.

Due to the difficulty of recording a hand-labeled dataset of moving sources large enough to train a neural network, we have introduced a new procedure for generating random trajectories and simulating them as they are needed for training. With it, we have an infinite-size dataset whose parameters can be easily modified during training to accelerate the convergence or during test to analyze the performance of the model in specific scenarios. To prove that the models trained with this procedure are general enough to deal with actual recordings, we have tested our model with the LOCATA dataset and obtained satisfactory results.

As a baseline method for our main proposal, we have also introduced a new architecture, based on the use of a causal 1D CNN over the GCCs, that also presents a good performance and robustness and that may be interesting for applications where the computation of the SRP-PHAT maps is not possible due to computational resource limitations.

ACKNOWLEDGMENT

This work was supported in part by the Regional Government of Aragon (Spain) with a grant for postgraduate research contracts (2017-2021) co-funded by the Operative Program FSE Aragon 2014-2020. This material is based upon work supported by Google Cloud.

REFERENCES

[1] C. Rascon and I. Meza, “Localization of sound sources in robotics: A review,” Robotics and Autonomous Systems, vol. 96, pp. 184–210, 2017.
[2] V. Tourbabin and B. Rafaely, “Theoretical Framework for the Optimization of Microphone Array Configuration for Humanoid Robot Audition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 12, pp. 1803–1814, Dec. 2014.
[3] A. Farina and L. Tronchin, “3D Sound Characterisation in Theatres Employing Microphone Arrays,” Acta Acustica united with Acustica, vol. 99, no. 1, pp. 118–125, Jan. 2013.
[4] K. Kumatani, J. McDonough, and B. Raj, “Microphone Array Processing for Distant Speech Recognition: From Close-Talking Microphones to Far-Field Sensors,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 127–140, Nov. 2012.
[5] C. Spille, B. Kollmeier, and B. T. Meyer, “Comparing human and automatic speech recognition in simple and complex acoustic scenes,” Computer Speech & Language, vol. 52, pp. 123–140, 2018.
[6] R. Ma, G. Liu, Q. Hao, and C. Wang, “Smart microphone array design for speech enhancement in financial VR and AR,” in 2017 IEEE SENSORS, Oct. 2017, pp. 1–3.
[7] C. Knapp and G. Carter, “The generalized correlation method for estimation of time delay,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, no. 4, pp. 320–327, Aug. 1976.
[8] J. Benesty, “Adaptive eigenvalue decomposition algorithm for passive acoustic source localization,” The Journal of the Acoustical Society of America, vol. 107, no. 1, pp. 384–391, Dec. 1999.
[9] L. Comanducci, M. Cobos, F. Antonacci, and A. Sarti, “Time Difference of Arrival Estimation from Frequency-Sliding Generalized Cross-Correlations Using Convolutional Neural Networks,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2020, pp. 4945–4949.
[10] J. H. DiBiase, “A high-accuracy, low-latency technique for talker localization in reverberant environments using microphone arrays,” Ph.D. dissertation, Brown University, 2000.
[11] J. H. DiBiase, H. F. Silverman, and M. Brandstein, “Robust Localization in Reverberant Rooms,” in Microphone Arrays: Signal Processing Techniques and Applications. Berlin, Heidelberg: Springer Berlin Heidelberg, 2001.
[12] R. Schmidt, “Multiple emitter location and signal parameter estimation,” IEEE Transactions on Antennas and Propagation, vol. 34, no. 3, pp. 276–280, Mar. 1986.
[13] J. P. Dmochowski, J. Benesty, and S. Affes, “Broadband Music: Opportunities and Challenges for Multiple Source Localization,” in 2007 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, Oct. 2007, pp. 18–21.
[14] J. Traa and P. Smaragdis, “A Wrapped Kalman Filter for Azimuthal Speaker Tracking,” IEEE Signal Processing Letters, vol. 20, no. 12, pp. 1257–1260, Dec. 2013.
[15] Y. Tian, Z. Chen, and F. Yin, “Distributed Kalman filter-based speaker tracking in microphone array networks,” Applied Acoustics, vol. 89, pp. 71–77, Mar. 2015.
[16] D. B. Ward, E. A. Lehmann, and R. C. Williamson, “Particle filtering algorithms for tracking an acoustic source in a reverberant environment,” IEEE Transactions on Speech and Audio Processing, vol. 11, no. 6, pp. 826–836, Nov. 2003.
[17] W.-K. Ma, B.-N. Vo, S. S. Singh, and A. Baddeley, “Tracking an unknown time-varying number of speakers using TDOA measurements: A random finite set approach,” IEEE Transactions on Signal Processing, vol. 54, no. 9, pp. 3291–3304, Sep. 2006.
[18] X. Xiao, S. Zhao, X. Zhong, D. L. Jones, E. S. Chng, and H. Li, “A learning-based approach to direction of arrival estimation in noisy and reverberant environments,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 2015, pp. 2814–2818.
[19] R. Takeda and K. Komatani, “Discriminative multiple sound source localization based on deep neural networks using independent location model,” in 2016 IEEE Spoken Language Technology Workshop (SLT), Dec. 2016, pp. 603–609.
[20] J. M. Vera-Diaz, D. Pizarro, and J. Macias-Guarasa, “Towards End-to-End Acoustic Localization Using Deep Learning: From Audio Signals to Source Position Coordinates,” Sensors, vol. 18, no. 10, p. 3418, Oct. 2018.
[21] E. L. Ferguson, S. B. Williams, and C. T. Jin, “Sound Source Localization in a Multipath Environment Using Convolutional Neural Networks,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 2018, pp. 2386–2390.
[22] S. Adavanne, A. Politis, J. Nikunen, and T. Virtanen, “Sound Event Localization and Detection of Overlapping Sources Using Convolutional Recurrent Neural Networks,” IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 1, pp. 34–48, Mar. 2019.
[23] H. W. Löllmann, C. Evers, A. Schmidt, H. Mellmann, H. Barfuss, P. A. Naylor, and W. Kellermann, “The LOCATA Challenge Data Corpus for Acoustic Source Localization and Tracking,” in 2018 IEEE 10th Sensor Array and Multichannel Signal Processing Workshop (SAM), Jul. 2018, pp. 410–414.
[24] H. Do and H. F. Silverman, “A Fast Microphone Array SRP-PHAT Source Location Implementation using Coarse-To-Fine Region Contraction (CFRC),” in 2007 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, Oct. 2007, pp. 295–298.
[25] H. Do and H. F. Silverman, “Stochastic particle filtering: A fast SRP-PHAT single source localization algorithm,” in 2009 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, Oct. 2009, pp. 213–216.
[26] L. O. Nunes, W. A. Martins, M. V. S. Lima, L. W. P. Biscainho, M. V. M. Costa, F. M. Gonçalves, A. Said, and B. Lee, “A Steered-Response Power Algorithm Employing Hierarchical Search for Acoustic Source Localization Using Microphone Arrays,” IEEE Transactions on Signal Processing, vol. 62, no. 19, pp. 5171–5183, Oct. 2014.
[27] M. Cobos, A. Marti, and J. J. Lopez, “A Modified SRP-PHAT Functional for Robust Real-Time Sound Source Localization With Scalable Spatial Sampling,” IEEE Signal Processing Letters, vol. 18, no. 1, pp. 71–74, Jan. 2011.
[28] C. Evers, Y. Dorfan, S. Gannot, and P. A. Naylor, “Source tracking using moving microphone arrays for robot audition,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Mar. 2017, pp. 6145–6149.
[29] O. Schwartz, Y. Dorfan, E. A. P. Habets, and S. Gannot, “Multi-speaker DOA estimation in reverberation conditions using expectation-maximization,” in 2016 IEEE International Workshop on Acoustic Signal Enhancement (IWAENC), Sep. 2016, pp. 1–5.
[30] S. Chakrabarty and E. A. P. Habets, “Multi-Speaker DOA Estimation Using Deep Convolutional Networks Trained With Noise Signals,” IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 1, pp. 8–21, Mar. 2019.
[31] S. Adavanne, A. Politis, and T. Virtanen, “Direction of Arrival Estimation for Multiple Sound Sources Using Convolutional Recurrent Neural Network,” in 2018 26th European Signal Processing Conference (EUSIPCO), Sep. 2018, pp. 1462–1466.
[32] N. Yalta, K. Nakadai, and T. Ogata, “Sound Source Localization Using Deep Learning Models,” Journal of Robotics and Mechatronics, vol. 29, no. 1, pp. 37–48, 2017.
[33] Y. Sun, J. Chen, C. Yuen, and S. Rahardja, “Indoor Sound Source Localization With Probabilistic Neural Network,” IEEE Transactions on Industrial Electronics, vol. 65, no. 8, pp. 6403–6413, Aug. 2018.
[34] W. He, P. Motlicek, and J.-M. Odobez, “Deep Neural Networks for Multiple Speaker Detection and Localization,” in 2018 IEEE International Conference on Robotics and Automation (ICRA), May 2018, pp. 74–79.
[35] S. Kapka and M. Lewandowski, “Sound Source Detection, Localization and Classification using Consecutive Ensemble of CRNN Models,” in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019). New York University, 2019, pp. 119–123.
[36] H. Cordourier, P. Lopez Meyer, J. Huang, J. Del Hoyo Ontiveros, and H. Lu, “GCC-PHAT Cross-Correlation Audio Features for Simultaneous Sound Event Localization and Detection (SELD) on Multiple Rooms,” in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019). New York University, Oct.
[37] Y. Cao, T. Iqbal, Q. Kong, M. B. Galindo, W. Wang, and M. D. Plumbley, “Two-Stage Sound Event Localization and Detection Using Intensity Vector and Generalized Cross-Correlation,” in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019). New York University, 2019.
[38] L. Perotin, A. Défossez, E. Vincent, R. Serizel, and A. Guérin, “Regression versus classification for neural network based audio source localization,” IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, p. 6, 2019.
[39] Z. Tang, J. D. Kanu, K. Hogan, and D. Manocha, “Regression and Classification for Direction-of-Arrival Estimation with Convolutional Recurrent Neural Networks,” in Interspeech 2019. ISCA, Sep. 2019, pp. 654–658.
[40] D. Diaz-Guerra and J. R. Beltran, “Source cancellation in cross-correlation functions for broadband multisource DOA estimation,” Signal Processing, vol. 170, p. 107442, May 2020.
[41] A. Brutti, M. Omologo, and P. Svaizer, “Multiple Source Localization Based on Acoustic Map De-Emphasis,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2010, no. 1, p. 147495, Dec. 2010.
[42] D. Diaz-Guerra and J. R. Beltran, “Direction of Arrival Estimation with Microphone Arrays Using SRP-PHAT and Neural Networks,” in 2018 IEEE 10th Sensor Array and Multichannel Signal Processing Workshop (SAM), Jul. 2018, pp. 617–621.
[43] L. Perotin, R. Serizel, E. Vincent, and A. Guerin, “CRNN-Based Multiple DoA Estimation Using Acoustic Intensity Features for Ambisonics Recordings,” IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 1, pp. 22–33, Mar. 2019.
[44] T. S. Cohen, M. Geiger, J. Köhler, and M. Welling, “Spherical CNNs,” in International Conference on Learning Representations, 2018.
[45] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, “Curriculum learning,” in Proceedings of the 26th Annual International Conference on Machine Learning, ser. ICML ’09. Montreal, Quebec, Canada: Association for Computing Machinery, Jun. 2009, pp. 41–48.
[46] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 2015, pp. 5206–5210.
[47] J. Wiseman, “Wiseman/py-webrtcvad,” Nov. 2019.
[48] J. B. Allen and D. A. Berkley, “Image method for efficiently simulating small-room acoustics,” The Journal of the Acoustical Society of America, vol. 65, no. 4, pp. 943–950, Apr. 1979.
[49] D. Diaz-Guerra, A. Miguel, and J. R. Beltran, “gpuRIR: A python library for room impulse response simulation with GPU acceleration,” Multimedia Tools and Applications, Oct. 2020.
[50] H. W. Löllmann, C. Evers, A. Schmidt, H. Mellmann, H. Barfuss, P. A. Naylor, and W. Kellermann, “IEEE-AASP Challenge on Acoustic Source Localization and Tracking: Documentation of Final Release,” https://locata.lms.tf.fau.de/datasets/, Jan. 2020.
[51] K. He, X. Zhang, S. Ren, and J. Sun, “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification,” in 2015 IEEE International Conference on Computer Vision (ICCV), Dec. 2015, pp. 1026–1034.
[52] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” in ICLR 2015, San Diego.
[53] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “PyTorch: An Imperative Style, High-Performance Deep Learning Library,” in Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett, Eds. Curran Associates, Inc., 2019, pp. 8026–8037.
[54] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics, Oct. 2014, pp. 1724–1734.

Robust Sound Source Tracking Using SRP-PHAT and 3D Convolutional Neural Networks


References (54)

ISSN
2329-9290
eISSN
ARCH-3348
DOI
10.1109/TASLP.2020.3040031

Abstract

This is the author's version of an article that has been published in this journal. Changes were made to this version by the publisher prior to publication. The final version of record is available at http://dx.doi.org/10.1109/TASLP.2020.3040031

Robust Sound Source Tracking Using SRP-PHAT and 3D Convolutional Neural Networks

David Diaz-Guerra, Student Member, IEEE, Antonio Miguel and Jose R. Beltran

D. Diaz-Guerra, A. Miguel and J.R. Beltran are with the Department of Electronic Engineering and Communications, University of Zaragoza, Zaragoza, Spain.

Abstract: In this paper, we present a new single sound source DOA estimation and tracking system based on the well-known SRP-PHAT algorithm and a three-dimensional Convolutional Neural Network. It uses SRP-PHAT power maps as input features of a fully convolutional causal architecture that uses 3D convolutional layers to accurately perform the tracking of a sound source even in highly reverberant scenarios where most of the state of the art techniques fail. Unlike previous methods, since we do not use bidirectional recurrent layers and all our convolutional layers are causal in the time dimension, our system is feasible for real-time applications and it provides a new DOA estimation for each new SRP-PHAT map. To train the model, we introduce a new procedure to simulate random trajectories as they are needed during the training, equivalent to an infinite-size dataset with high flexibility to modify its acoustical conditions such as the reverberation time. We use both acoustical simulations on a large range of reverberation times and the actual recordings of the LOCATA dataset to prove the robustness of our system and its good performance even using low-resolution SRP-PHAT maps.

Index Terms: microphone arrays, direction of arrival estimation, DOA, sound source tracking, SRP-PHAT, convolutional neural networks, CNN.

I. INTRODUCTION

Direction Of Arrival (DOA) estimation and Sound Source Localization with microphone arrays have been widely investigated and used in different applications, such as robot audition [1], [2], acoustic characterization [3], speech recognition [4], [5] or teleconference systems [6]. Most of the techniques in the literature can be roughly classified into i) Time Difference Of Arrival (TDOA) based techniques, which first use the Generalized Cross-Correlation (GCC) functions [7] to estimate the TDOAs and then compute the most reliable DOA for them (there are also some alternatives to the GCCs, such as the eigenvalue decomposition [8] or even deep-learning based techniques [9]), ii) beamforming based techniques, such as SRP-PHAT [10], [11], which search for the direction that maximizes the power of the output of a beamformer, and iii) subspace techniques, such as Multiple Signal Classification (MUSIC) [12], [13], based on the eigenstructure of the narrowband cross-correlation matrices. These techniques vary in terms of computational complexity and robustness against adverse conditions such as noise and reverberation. When they have to deal with non-stationary signals, such as speech, a tracking algorithm is needed after them to exploit the temporal correlation between source positions [14]–[17].

More recently, some methods based on Deep Neural Networks (DNN) have been proposed. From the original Multilayer Perceptron (MLP) used in the first proposals [18], [19], their architectures have evolved into more sophisticated Convolutional Neural Networks [20]–[22] which can jointly perform DOA estimation and tracking. However, although DNN-based techniques claim to be more robust than the classical methods, their use of the CNNs might not be the most appropriate and many of them add non-causal recurrent layers that make them infeasible for real-time applications.

In this paper, we propose the use of 3D CNNs over SRP-PHAT power maps to jointly perform the DOA estimation and the tracking of a source in highly reverberant rooms. We present a completely causal technique that provides a new DOA estimation with each new power map and we show its robustness through several simulations in adverse conditions. We analyze how the resolution of the SRP-PHAT power maps affects our technique and we prove that by using CNNs we can obtain resolutions that surpass the search grid employed to compute the maps. Finally, we apply our model to the acoustic source LOCalization And TrAcking (LOCATA) challenge [23] dataset in order to show how the models trained with simulated signals are general enough to work with actual recordings in real conditions. Although we focus on compact arrays and evaluate the performance of our technique with an array with 12 microphones mounted over a NAO robot head, the technique may be used with any array geometry.

It is worth mentioning that we focus on single source scenarios that are supposed to always have an active source; therefore, our tracking does not need to deal with data association or with the birth and death of the source. For a single source scenario, the birth and death problem may be easily solved with a Voice Activity Detector (VAD), but extending our system to deal with multiple sources might be more difficult. However, some ideas are proposed in section III-A.

In order to encourage and facilitate the replicability of this research, the source code of our model and the models used as baselines, as well as everything needed to train and test them, can be found in our public repository (https://github.com/DavidDiazGuerra/Cross3D); we also share the trained models there.

The remainder of this paper is structured as follows. We first review the SRP-PHAT algorithm (section II) and the state of the art of DNN-based DOA estimation techniques (section III). In section IV we present our proposed technique and in section V we analyze its performance with both simulated rooms and actual recordings. Finally, section VI concludes the paper.

II. THE SRP-PHAT ALGORITHM

The signal received at the $n$th sensor of a microphone array can be modeled as

$$x_n(t) = a_s(t) * h_n(\boldsymbol{\theta}_s, t) + v_n(t), \tag{1}$$

where $a_s(t)$ is the signal generated by the source, $\boldsymbol{\theta}_s$ is the source position, $h_n(\boldsymbol{\theta}_s, t)$ is the impulse response from $\boldsymbol{\theta}_s$ to the $n$th sensor, and $v_n(t)$ is the noise of the sensor, which is typically supposed to be white, Gaussian, and uncorrelated with the source signal and with the noises of other sensors. It is worth mentioning that $\boldsymbol{\theta}$ is written in bold because it can represent an angle, two spherical coordinates, or even a point in 3D Cartesian coordinates depending on the geometry of the array.

One of the most classic and popular approaches to DOA estimation is finding the direction that maximizes the Steered Response Power (SRP) that we would obtain using a filter-and-sum beamformer:

$$\hat{\boldsymbol{\theta}}_s = \underset{\boldsymbol{\theta}}{\operatorname{argmax}}\; P(\boldsymbol{\theta}), \tag{2}$$

$$P(\boldsymbol{\theta}) = \int \left| \sum_{n=0}^{N-1} G_n(\omega) X_n(\omega) e^{j\omega\tau_n(\boldsymbol{\theta})} \right|^2 d\omega, \tag{3}$$

where $N$ is the number of sensors of the array, $X_n(\omega)$ is the Fourier Transform of $x_n(t)$, $G_n(\omega)$ is the frequency response of the filter for the channel $n$, and $\tau_n(\boldsymbol{\theta})$ is the time delay occurring from the position or direction $\boldsymbol{\theta}$ to the $n$th sensor.

Although directly implementing (3) would be computationally expensive, it can be computed in terms of the Generalized Cross-Correlation functions as

$$P(\boldsymbol{\theta}) = 2\pi \sum_{n=0}^{N-1} \sum_{m=0}^{N-1} R_{nm}(\Delta\tau_{nm}(\boldsymbol{\theta})), \tag{4}$$

where $\Delta\tau_{nm}(\boldsymbol{\theta}) = \tau_n(\boldsymbol{\theta}) - \tau_m(\boldsymbol{\theta})$ and $R_{nm}$ is the GCC between the signals of the $n$th and the $m$th sensor:

$$R_{nm}(\tau) = \int \Psi_{nm}(\omega) X_n(\omega) X_m^*(\omega) e^{j\omega\tau} d\omega, \tag{5}$$

where $^*$ denotes the complex conjugate and $\Psi_{nm}(\omega) = G_n(\omega) G_m^*(\omega)$ is a weighting function.

Equation (4), combined with the use of the PHAse Transform (PHAT) $G_n(\omega) = 1/|X_n(\omega)|$, is commonly known as the SRP-PHAT algorithm [10], [11], and allows us to obtain an acoustic power map of the environment whose maximum should correspond with the source position.

Fig. 1. Example of SRP-PHAT power maps with different resolutions (panels (a) to (d)) in a favorable scenario: SNR = 30 dB and $T_{60}$ = 0.3 s. The red dot indicates the actual DOA of the sound source and the black dot is at the maximum of the map.

Fig. 2. Example of SRP-PHAT power maps with different resolutions (panels (a) to (d)) in an adverse scenario: SNR = 5 dB and $T_{60}$ = 0.9 s. The red dot indicates the actual DOA of the sound source and the black dot is at the maximum of the map.

Although the SRP-PHAT algorithm is a good trade-off between robustness and computational efficiency, obtaining more accurate results than two-step TDOA based techniques with a lower computational cost than most of the broadband subspace techniques, it still presents several issues. The main advantage of (4) is that most of its computational cost is in computing the GCCs and does not increase with the search space. However, the computation of its sums for each direction, especially if $R_{nm}(\Delta\tau_{nm}(\boldsymbol{\theta}))$ needs to be interpolated from its adjacent samples, may not be negligible; this problem becomes more challenging when the search space is two-dimensional, e.g. the two angles of the spherical coordinates, or even three-dimensional, e.g. XYZ coordinates. Some search strategies have been proposed to reduce the number of evaluations of (4) that need to be computed to accurately find the maximum of $P(\boldsymbol{\theta})$ [24]–[26] but, due to the non-convexity of the SRP-PHAT power maps, the number of SRP-PHAT evaluations needed might still be an issue in some scenarios. In [26], [27], it is proposed to modify (4) to compute the power received from a space region instead of from a point, so that hierarchical search strategies can be used over maps with lower resolution.

As we can see in Fig. 1, in favorable scenarios with high SNR and low reverberation, the SRP-PHAT power maps have a clear maximum in the DOA of the sound source that can be used to obtain a good estimation even with low-resolution maps but, when the SNR decreases and the reverberation increases, such as in the scenario of Fig. 2, the maps present several local maxima that may be incorrectly interpreted as the DOA of the sound, especially when using low-resolution maps. However, in those maps, in addition to the maxima, we can also observe several patterns that are also related to the DOA of the sound and the geometry of the array and that may be exploited to obtain a more accurate DOA estimation.
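As a rough illustration of (4) and (5), the following pure-Python sketch computes GCC-PHAT functions with a naive DFT and sums them at the lags of a few candidate directions, rounding the delays to the nearest sample. The function names, the toy 3-sensor delays and the circularly shifted test signal are assumptions made for this example; it is not the authors' implementation, which is available in their repository.

```python
import cmath
import random

def dft(x):
    """Naive O(K^2) discrete Fourier transform (fine for short toy frames)."""
    K = len(x)
    return [sum(x[t] * cmath.exp(-2j * cmath.pi * k * t / K) for t in range(K))
            for k in range(K)]

def gcc_phat(x_n, x_m):
    """Discrete counterpart of (5) with PHAT weighting; index = lag modulo K."""
    K = len(x_n)
    cross = [a * b.conjugate() for a, b in zip(dft(x_n), dft(x_m))]
    # PHAT: keep only the phase of each cross-spectrum bin.
    cross = [c / abs(c) if abs(c) > 1e-12 else 0j for c in cross]
    return [sum(c * cmath.exp(2j * cmath.pi * k * lag / K)
                for k, c in enumerate(cross)).real / K
            for lag in range(K)]

def srp_phat_map(signals, candidate_delays):
    """Evaluate the double sum of (4) for each candidate direction.

    candidate_delays[d][n] is the delay (in samples) of sensor n for candidate d.
    Only pairs n < m are summed: the n == m terms are constant and the (m, n)
    terms mirror the (n, m) ones, so the position of the maximum is unchanged.
    """
    K = len(signals[0])
    N = len(signals)
    gccs = {(n, m): gcc_phat(signals[n], signals[m])
            for n in range(N) for m in range(n + 1, N)}
    power_map = []
    for delays in candidate_delays:
        power = 0.0
        for (n, m), r_nm in gccs.items():
            lag = round(delays[n] - delays[m])  # nearest-sample approximation
            power += r_nm[lag % K]
        power_map.append(power)
    return power_map

# Toy example (hypothetical numbers): 3 sensors, 4 candidate directions.
rng = random.Random(0)
K = 64
source = [rng.gauss(0.0, 1.0) for _ in range(K)]
candidates = [[0, 0, 0], [0, 1, 2], [0, 2, 4], [0, -1, -2]]
true_delays = candidates[2]
# Each sensor receives a circularly delayed copy of the source signal.
signals = [[source[(k - d) % K] for k in range(K)] for d in true_delays]
power_map = srp_phat_map(signals, candidates)
best = max(range(len(candidates)), key=power_map.__getitem__)
```

In this noiseless toy case the GCCs are computed once and each candidate only costs a few lookups, which is the property of (4) highlighted above; with noise and reverberation the map would show the spurious local maxima discussed for Fig. 2.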
Due to the non-stationary nature of most of the signals of interest, such as speech or music, a tracking stage is needed after the DOA estimation to exploit the temporal correlation between the source positions and to avoid inaccurate estimations in frames where the power of the signal is low or its autocorrelation makes the maximum of the power map become too wide. The algorithms for tracking one source are typically based on the Kalman filter [14], [15], although more advanced techniques have been proposed to deal with multiple sources, such as those based on particle filtering [16], [17]. However, these approaches use two-step strategies which make them sensitive to potential information loss when only the DOA estimations are used for the tracking; e.g. the absolute maximum of the SRP-PHAT maps is always selected even if another local maximum was much closer to the previous estimations, and the same likelihood is assigned to all the DOA estimations even though some of them correspond to wider maxima from frames where the source was weaker. Including some of this information in the tracking algorithms may be possible, but it would increase both their complexity and the number of parameters that would need to be fine-tuned. In [28], a technique to share information between an iterative DOA estimator based on Expectation-Maximization [29] and a tracking system based on particle filtering is proposed. In this paper, we use Neural Networks to jointly perform DOA estimation and tracking, since they have been proved to have an excellent performance in several end-to-end problems in other fields, such as computer vision or speech recognition or synthesis.

III. DOA ESTIMATION WITH DEEP NEURAL NETWORKS

One of the first proposals of using neural networks for DOA estimation was [18], which used a fully connected perceptron with a hidden layer to obtain the DOA from the GCCs as a classification problem. Since then, several techniques have been proposed, differing in the output format, the features that they use as input, and the network architecture.

A. Output format

To obtain the DOA estimation as a classification problem, we first need to define a grid of directions where the source can be found (similar to the resolution of the SRP-PHAT maps) so the network has an output per grid point. The network of [18] had 359 outputs, so they had a maximum resolution of 1 degree for azimuth estimation. However, if we want to estimate both azimuth and elevation (or even XYZ coordinates), the number of outputs would dramatically increase, and with it the computational complexity and the size of the dataset needed to train the network.

In a footnote of [18], it was claimed that they had obtained worse results when they tried to estimate the DOA as a regression problem, and classification approaches seem to be the most popular [19], [30]–[34]. However, since [22] proposed using 3 outputs to estimate a unitary vector pointing to the direction of the source, several models have followed this approach [35]–[37]; further studies about the advantages of using 3 regression outputs to infer the Cartesian coordinates instead of 2 to directly obtain the azimuth and the elevation can be found in [38], [39]. Motivated by the good results of these recent works, we opted to follow a regression approach.

One of the main drawbacks of solving the DOA estimation as a regression problem is that it makes it harder to estimate the DOA of multiple sources. Since they also classify the sources into several classes, Sound Event Localization and Detection (SELD) models usually have a regression output for each source class [22], [35]–[37], supposing that only one source of each class can be active at the same time. Another possible approach might be the use of a single-source DOA estimator (such as the one proposed here) combined with a source cancellation technique [40], [41] to iteratively find multiple sources.

B. Input features and network architecture

Initially, the most common input features were the GCCs between the signals of each sensor [18], [33], [34], but we can also find other approaches such as using the eigenvectors of the spatial covariance matrix [19]. In [42], we proposed using low-resolution SRP-PHAT maps, in that case combined with fully connected perceptrons. More recently, some techniques have been proposed using 2D convolutional networks over the spectrogram of the microphone signals, using only the phase information [30], the magnitude information [32] or both of them [22], [31], [35]; some transformations, such as using the cepstrogram [21] or the Mel spectrogram [37], have also been proposed. Other features proposed as inputs of CNNs are Ambisonics intensity vectors [43], raw audio samples [20] and combinations of several of the already mentioned features [21], [36], [37].

One of the most important properties of CNNs is that they are equivariant to translations, which, in plain terms, means that if we apply a translation to the input features we get the same output with its equivalent translation. This property is very useful in many computer vision applications, where the same patterns have the same meaning in all the positions of the image. When used for DOA estimation and tracking, 2D CNNs are typically used over spectrograms, so convolution is performed over the time and the frequency axes and each microphone spectrogram is treated as a different channel. Being equivariant to time translations seems to be an advantage, since we would expect similar patterns for any source in a position no matter the time instant when it was there. However, since the phase differences for the same source position vary with the frequency, equivariance to frequency shifts may not be an interesting property. Another approach to the use of 2D CNNs is proposed in [30], where the convolution is performed over the frequency and the microphone dimensions and the time evolution is not taken into account by the network, i.e. they do not perform any tracking. As they work with a Uniform Linear Array (ULA), the phase differences expected for a source position are the same for each pair of adjacent microphones, so equivariance is desired, but this would not be the case for other array geometries.

In this paper, we propose the use of CNNs over SRP-PHAT power maps, performing the convolution over the dimensions of the maps and the temporal dimension. Any kind of SRP-PHAT power maps could be employed with this approach depending on the geometry of the array, but as we focus on compact arrays, we use 2D spherical power maps and therefore, since we include the temporal dimension, 3D CNNs. Actually, working over spherical maps, equivariance to spherical translations (i.e. rotations) would be preferred over equivariance to Euclidean translations, but this would lead us to the use of Spherical CNNs [44], which are still less efficient from a computational point of view than classical CNNs. The extension to 4D CNNs over 3D SRP-PHAT maps to perform 3D Sound Source Localization (SSL) with distributed arrays would be straightforward.

Many of the state of the art CNN architectures include bidirectional recurrent units at the last layers of the model. Recurrent Neural Networks (RNNs), as recurrent linear filters, make the output at any time instant dependent on the values of the input at every previous time instant and, therefore, applying them in the backward direction is extremely non-causal. Obviously, any tracking system can benefit greatly from the information of the future positions of the source but, in order to make our system feasible for real-time applications, we opt for using only causal convolutional layers.

IV. PROPOSED TECHNIQUE

A. Training dataset

Due to the difficulty of obtaining an accurately hand-labeled dataset of moving sources recorded with microphone arrays, we opted to train our model with simulated signals; another approach might have been using measured RIRs convolved with speech signals, but this would have reduced the number of different acoustic conditions seen by the model during training, increasing the possibility of overfitting to those conditions and not generalizing. Instead of generating a dataset and using it to train the network, we simulate the inputs of the network as they are needed during training. This makes the training slower, but has two important advantages: 1) we have an infinite-size dataset, since all the random parameters of the simulation are modified for each trajectory simulated during the training, which reduces the risk of overfitting, and 2) we have higher flexibility to modify the probability distribution of the parameters of the simulation, such as the signal to noise ratio or the reverberation time, during training, so we can perform curriculum learning strategies [45].

Fig. 3. Dataset generation process. Italic letters represent variables and regular letters represent processes. Right-angled boxes represent deterministic processes and round boxes represent stochastic variables or processes.

As shown in Fig. 3, we use LibriSpeech utterances as sound sources. The LibriSpeech corpus [46] contains 960 hours of speech sampled at $f_s$ = 16 kHz extracted from audiobooks. Although audiobooks could be expected to contain quite clean speech signals, we found that some of them have a strong background noise that, after being filtered by the RIRs, would be located in the same position as the source and would facilitate its localization and tracking in silent segments. To prevent our network from learning to exploit this fact, which will not be present in actual recordings, we use the WebRTC Voice Activity Detector (VAD) [47] to detect silent segments and clean them by completely removing the signal in those frames.

The room sizes are randomly selected from the range 3 m × 3 m × 2.5 m to 10 m × 8 m × 6 m and the array is randomly placed inside the room, being restricted to have a separation from the walls of 10% of the room size in each dimension and to be in the lower half of the room for the vertical axis. The Signal to Noise Ratio (SNR) and reverberation time ($T_{60}$) are also randomly selected from the ranges 5 dB to 30 dB and 0.2 s to 1.3 s respectively. Uniform distributions over the specified ranges are used for all the random parameters of the dataset.
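The random acoustic conditions listed above can be sketched in a few lines of Python. This is only an illustration under stated assumptions: the function name `sample_acoustic_scene` and the dictionary layout are hypothetical, and the placement constraint is read here as "between 10% and 90% of each horizontal dimension, and between 10% and 50% of the room height".

```python
import random

ROOM_MIN = (3.0, 3.0, 2.5)    # metres, smallest simulated room
ROOM_MAX = (10.0, 8.0, 6.0)   # metres, largest simulated room
SNR_RANGE = (5.0, 30.0)       # dB
T60_RANGE = (0.2, 1.3)        # seconds

def sample_acoustic_scene(rng):
    """Draw one set of simulation parameters from uniform distributions."""
    room = [rng.uniform(lo, hi) for lo, hi in zip(ROOM_MIN, ROOM_MAX)]
    # Assumed reading of the constraint: 10 % margin from every wall and
    # array restricted to the lower half of the room on the vertical axis.
    array_min = [0.1 * size for size in room]
    array_max = [0.9 * room[0], 0.9 * room[1], 0.5 * room[2]]
    array_pos = [rng.uniform(lo, hi) for lo, hi in zip(array_min, array_max)]
    return {
        "room_size": room,
        "array_pos": array_pos,
        "snr_db": rng.uniform(*SNR_RANGE),
        "t60_s": rng.uniform(*T60_RANGE),
    }

# A new scene would be drawn for every simulated trajectory during training.
example_scene = sample_acoustic_scene(random.Random(0))
```

Because a fresh scene is drawn for each trajectory, changing `SNR_RANGE` or `T60_RANGE` between training stages is enough to emulate the kind of curriculum mentioned in the text.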
We need to randomly generate continuous trajectories, so that it is possible to track them, but with enough diversity to prevent the network from learning how they are generated and overfitting to them. In order to do so, we randomly select two points within the room boundaries to be the starting ($\mathbf{p}_0 = [p_{x_0}, p_{y_0}, p_{z_0}]^T$) and ending ($\mathbf{p}_L = [p_{x_L}, p_{y_L}, p_{z_L}]^T$) points of the trajectory and add to the straight line that connects them a sinusoidal function in each axis with random frequencies ($\boldsymbol{\omega} = [\omega_x, \omega_y, \omega_z]^T$) and amplitudes ($\mathbf{A} = [A_x, A_y, A_z]^T$), ensuring that no more than 2 oscillations are performed during the trajectory in each axis and that the amplitude is low enough to prevent the source from exiting the room:

$$\mathbf{p}_i = \mathbf{p}_0 + \frac{i}{L-1}(\mathbf{p}_L - \mathbf{p}_0) + \mathbf{A} \odot \sin(\boldsymbol{\omega} i), \tag{6}$$

where $L$ is the number of points of the trajectory, $\odot$ stands for the pointwise product and the sin function also operates pointwise. Although the generation model is quite simple, it generates quite diverse trajectories (some examples are shown in Fig. 4) and, since the network only sees the azimuth and elevation coordinates and has a limited temporal perceptive field, the model should not overfit to it. In order to confirm this, we tested our model in a more realistic scenario with the recordings of the LOCATA dataset (see section V-B2).

Fig. 4. Examples of source trajectories used to train the model. The red dots are the trajectory points and the gray points represent the microphones.

To simulate the movement of the source, we use the GPU implementation of the Image Source Method [48] found in the Python library gpuRIR [49]; the use of this library allows us to reduce the simulation time by two orders of magnitude and makes it possible to perform the simulations during the training of the network.

For the results presented in this paper, we simulated a microphone array with 12 sensors designed to be mounted over a NAO robot head; the minimum and maximum inter-microphone distances of the array are 1.3 cm and 12.1 cm and the actual position of each microphone can be found in the documentation of the LOCATA dataset [50]. We did not include the effect of the robot head in the simulation. Including the scattering generated by it might improve the results obtained with the LOCATA dataset, but it would also have increased the complexity of the simulations and would have made the training slower. Since different array geometries would lead to different patterns, the model should be re-trained for any new microphone array. We could take advantage of the similarity of the power maps of most compact arrays to apply transfer learning strategies when training models for new arrays. However, this is not a big issue since we do not need to record a dataset with the new array but just simulate it.

Having simulated the propagation of the sound from the moving source to each microphone of the array using the overlap-add method, we add an omnidirectional Gaussian noise to obtain the desired SNR, window the signal using Hanning windows of length $K$ = 4096 samples (i.e. 256 ms) with a hop size of $3K/4$, and apply (4) to obtain the SRP-PHAT map of each window. In order to compute the noise power needed to obtain the desired SNR, we computed the signal power as the average power of all the non-silent frames of the trajectory. Finally, we subtract its mean from each map and divide it by its maximum to fit it to the range [-1,1]. For the sake of computational efficiency, we do not perform any kind of interpolation in the computation of (4) and just approximate the fractional delays to the nearest sample.

We found that, since this simulation process did not include any directional noise, the models trained with it were very sensitive to directional noise sources. For example, in some of the recordings of the LOCATA dataset, the noise of a fan is present and, although its power is very low, the models tracked it when it was the only active sound source. In order to avoid this issue, we use the WebRTC VAD to determine in which frames the speech source is active. We first tried to include the VAD information as an additional input channel to the network. However, as during the training the VAD sometimes failed and classified frames with speech information as silent, the network learned that even the frames classified as silent could contain useful tracking information as long as they contained a directional source, and it therefore ignored the VAD input. In order to prevent the network from tracking the directional noise sources of the LOCATA dataset, we finally opted to set to zero the maps corresponding to frames classified as silent by the VAD, so no directional information was seen by the network when there was no active speech source.
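The trajectory model in (6) of section IV-A can be sketched as follows. The text only states that the amplitudes are kept low enough for the source to stay inside the room and that at most 2 oscillations occur per axis; the specific bounds used here (amplitude limited by the distance from the straight segment to the nearest wall on each axis, and $\omega \le 4\pi/(L-1)$) are assumptions of this example, as is the function name `random_trajectory`.

```python
import math
import random

def random_trajectory(room, n_points, rng):
    """Generate n_points (>= 2) source positions following (6).

    room is [size_x, size_y, size_z] in metres; positions are measured
    from one corner of the room.
    """
    p0 = [rng.uniform(0.0, size) for size in room]   # starting point
    pL = [rng.uniform(0.0, size) for size in room]   # ending point
    # Assumed amplitude bound: on each axis the straight line stays between
    # p0 and pL, so the sinusoid may use at most the gap to the nearest wall.
    a_max = [min(min(a, b), size - max(a, b)) for a, b, size in zip(p0, pL, room)]
    amp = [rng.uniform(0.0, m) for m in a_max]
    # At most 2 oscillations per axis: omega * (L - 1) <= 2 * (2 * pi).
    w_max = 4.0 * math.pi / (n_points - 1)
    omega = [rng.uniform(0.0, w_max) for _ in room]
    trajectory = []
    for i in range(n_points):
        frac = i / (n_points - 1)
        trajectory.append([a + frac * (b - a) + A * math.sin(w * i)
                           for a, b, A, w in zip(p0, pL, amp, omega)])
    return trajectory

example_room = [6.0, 5.0, 3.0]
example_trajectory = random_trajectory(example_room, 100, random.Random(0))
```

Each returned point would then be converted to azimuth and elevation relative to the array position to obtain the DOA labels, as indicated by the "Coordinates to DOA" block of Fig. 3.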
information about the DOA of the source is the position of the maximum of each map; however, the argmax function is highly non-linear (and non-differentiable) and it is not easy for an artificial neural network to learn and fit it. Since it did not cause a significant increase of the computational complexity of the algorithm, we decided to explicitly indicate to the network the position of the maximum of each map. After trying to introduce this information in different layers, we found that the best results were obtained including it in the input of the network, using $C = 3$ with $M_{2,t,i,j} = \hat{\theta}^{SRP}_t$ and $M_{3,t,i,j} = \hat{\varphi}^{SRP}_t$ for any $t \in \{1, ..., T\}$, $i \in \{1, ..., N_\theta\}$, and $j \in \{1, ..., N_\varphi\}$, where $\hat{\theta}^{SRP}_t$ and $\hat{\varphi}^{SRP}_t$ are the DOA equivalent to the position of the maximum of the map $t$ normalized to be in the range [0,1]. This approach might seem quite redundant and inefficient, but it is a typical approach to condition the output of a CNN to the value of a variable since it is the simplest way to include that information in the first layers of the network keeping its convolutional architecture, whose implementation is extremely optimized in the Deep Learning software libraries.

The first layer of our model is a 3D convolutional layer with 32 kernels of size 5×5×5 and PReLU activations [51]. It is worth mentioning that, for the temporal axis, we always use causal convolutions, so this model could be used in real-time applications generating a new DOA estimation for each new power map available and without introducing any delay.

Pooling layers are typically used in CNNs to progressively reduce the size of the input and make the model generalize; not using them means that the fully connected layers used at the end of most of the convolutional models would have a huge number of trainable parameters which would surely overfit. When the desired output of the CNN is a summary of the information of the whole input, e.g. in image classification tasks, increasing the number of channels with convolutional layers and reducing their size with pooling layers progressively reduces the spatial information and yields higher-level representations of the input. However, since our desired output is not only related to the presence of some patterns but especially to their position, we must be careful when using them.

In order to get the benefits of pooling layers but allowing the spatial information to reach the last layers of the model, we opted to, as shown in Fig. 5, split the model into two branches and apply max pooling in a different dimension in each one. Working this way, the branch which pools the $\varphi$ axis can retain positional information about the $\theta$ coordinate of the maps and vice versa. Specifically, each branch has 4 layers with a convolution with 32 kernels of size 5×3×3, PReLU activations, and a max-pooling with a kernel size of 1×1×2 and 1×2×1 respectively. If the input power maps have fewer than 16 points in the $\theta$ or the $\varphi$ axes, it would not be possible to perform so many pooling layers; in those cases, we reduce the 4 layers to the maximum number possible: $\log_2(\min(N_\theta, N_\varphi))$. Due to the use of 3D convolutional layers and these perpendicular branches, we named our model Cross3D.

Fig. 5. Model architecture. The noted sizes correspond to a model for 16x32 maps and an input sequence of length 103. For the sake of simplicity, we represented it with only 1 input channel, although it actually has 3.
[Diagram: input 1x103x16x32; 32-channel 5x5x5 conv with PReLU activation (32x103x16x32); two parallel branches, each with 4 blocks of 32-channel 5x3x3 conv, PReLU activation and max pooling (1x1x2 in one branch, output 32x103x16x2; 1x2x1 in the other, output 32x103x1x32); reshape, concatenate and transpose (2048x103); 128-channel length-5 conv with dilation factor 2 and PReLU activation (128x103); 3-channel length-5 conv with dilation factor 2 and tanh activation (3x103).]

After the 3D convolutional layers, we concatenate the results of each branch and reshape them so we have a temporal sequence of length $T$ for each one of the elements of each channel and spherical coordinates. Each one of these temporal sequences is used as an input channel of a 1D causal convolutional layer with 128 kernels of length 5 and PReLU activations. Finally, the resultant 128 time sequences are passed through another 1D causal convolutional layer with only 3 kernels of length 5 and tanh activations. These layers are similar to the fully connected layers that most of the CNN architectures have, but we include a temporal convolution so they can still exploit the tracking information. We use a dilation factor of 2 in order to allow the tracking to take into account a longer context without increasing the complexity of the network. With all the temporal convolutions included in the model, each DOA estimation is computed from the last 37 SRP-PHAT maps, i.e. the tracking memory is 7.17 s.

The result of all this process is 3 time sequences of length $T$ whose elements are in the range $(-1, 1)$, which are considered to be the XYZ coordinates of a unitary vector pointing in the direction of the source in each time frame. Tables detailing the network architecture of Cross3D for several SRP-PHAT map resolutions can be found in the supplementary material of this paper.

TABLE I
MODELS EMPLOYED FOR THE EVALUATION

Model          Input                    Trainable parameters   Temporal perceptive field   Window length   Causal
Cross3D        Power maps 4x8           526 372                5.63 s                      4096            Yes
Cross3D        Power maps 8x16          946 340                6.40 s                      4096            Yes
Cross3D        Power maps 16x32         1 693 988              7.17 s                      4096            Yes
Cross3D        Power maps 32x64         5 626 148              7.17 s                      4096            Yes
Cross3D        Power maps 64x128        21 354 788             7.17 s                      4096            Yes
1D CNN         GCCs                     11 282 436             7.17 s                      4096            Yes
1D CNN         Maximums (64x128)        6 899 716              7.17 s                      4096            Yes
2D CNN         Spectrograms             1 882 372              7.17 s                      4096            Yes
SELDnet [22]   Spectrograms             104 643                ∞                           512             No

C. Training

We trained our model to minimize the Euclidean distance between the output of the network and the 3 time sequences obtained from the coordinates of the unitary vectors that point to the direction where the sound source was simulated in each time window. Similarly to the results reported in [22], [38], [39], we obtained better results using this approach than trying to directly obtain the spherical coordinates from the network even when using the great-circle distance between the output and the ground-truth DOA angles as cost function.

Although using an infinite-size dataset the term “epoch” does not have the same meaning as in most of the machine learning systems, we define an epoch as 585 trajectories (the number of book chapters in the LibriSpeech train-clean-100 subset). We employed 80 epochs with trajectories of 20 s, i.e. 103 SRP-PHAT maps, to train the model with the Adam algorithm [52] using PyTorch [53].

As explained in section IV-A, we trained the networks with reverberation times and SNRs uniformly distributed from 0.2 s to 1.3 s and from 5 dB to 30 dB respectively; however, we found that the training converged faster with higher SNRs. Therefore, we followed a curriculum learning strategy [45] using batches of 5 trajectories with SNR=30 dB for the first 20 epochs and for the following epochs we employed the full range of SNRs, increased the batch size to 10 trajectories, and reduced the learning rate from 1e-4 to 1e-5.

V. EVALUATION

A. Baseline methods

In order to analyze the convenience of using SRP-PHAT maps as input features of CNNs for DOA estimation, we developed some alternative CNNs to use them as baseline. We designed them to be as similar as possible to our proposed model and to have the same temporal perceptive field so they have the same tracking information.

Since we are including the position of the maximum of each map into the input of the network, we should verify if our model is actually exploiting the additional information that is within the SRP-PHAT maps or if it is only using the position of its maximums. To do that, we designed a 1D CNN which takes as input 2 time sequences with the coordinates of the maximum of each map normalized to the range [0,1] and applies to them 7 layers of 1D causal convolutions with PReLU activations and without any pooling. All the layers had a kernel size of 5 and the last two layers used a dilation factor of 2, so its temporal perceptive field is 37 frames as in Cross3D, and the number of channels of each layer was {1024, 512, 512, 512, 512, 128, 3}. The results shown in the following sections were obtained training this network with the same process described in section IV-C and using the coordinates of maps with resolution 64x128.

One of the most common input features employed by the first DOA estimation techniques based on neural networks were the GCCs. They typically employed fully connected perceptrons with not too many hidden layers and, since they only used the GCCs computed in a temporal window, did not perform any kind of tracking. Following this idea, but with the aim of including tracking information in the network, we used the same 1D causal CNN that we used over the map maximums but using as input sequences the temporal evolution of each element of the GCCs which represented an inter-microphone delay lower than the maximum inter-microphone distance divided by the speed of sound.

Although, as explained in section III, the use of 2D CNNs over spectrograms may not be optimal, we also implemented a
model following this approach since it is quite popular in the literature. For a fair comparison, we used causal convolutions with a similar architecture to Cross3D: one convolution with 256 5x5 kernels, four convolutions with 256 5x5 kernels with 1x4 pooling, a reshape to transform the remaining features into temporal sequences and two 1D causal convolutions with kernel size 5 and dilation factor 2, the first one with 128 channels and the last one with 3. For computing the spectrogram, we used the same windows as for computing the SRP-PHAT maps, extracted the magnitude and phase of each frequency of the FFT, and finally normalized the magnitude of each window to its maximum and the phase to the range [-1,1].

Finally, we also trained with our simulation procedure a replica of SELDnet [22] but without including the Sound Event Detection (SED) output and with only a DOA output since we were only interested in tracking one source. This model takes as inputs the magnitude and phase of the spectrograms and has three 2D convolutional layers followed by two bidirectional Gated Recurrent Units (GRU) [54] and two fully connected layers. It is worth saying that this model, due to the bidirectional GRUs, is non-causal and that it uses shorter analysis windows than the other analyzed methods.

For the models that use spectrograms as input features, we found that they did not train properly with the full range of reverberations described in section IV-A, and we got the best results training them with values of $T_{60}$ randomly selected from the range 0 s to 0.3 s.

All the models employed for the evaluation are summarized in Table I and tables detailing their architectures can be found in the supplementary material of this paper.

B. Results

1) Simulated dataset: We trained different models for several power map resolutions with the whole range of reverberation times and SNRs, and then we tested their performance for several specific values of $T_{60}$ and SNR in order to analyze the robustness of the proposed tracking system.

Since we are using SRP-PHAT power maps as the input of our algorithm, we started our evaluation comparing our model with the classic SRP-PHAT algorithm. SRP-PHAT does not perform any kind of tracking, so, for a fairer comparison, we did not take into account the silent frames when computing the Root Mean Squared Angular Errors (RMSAE) shown in Fig. 6. As we can see in this figure, when working with high-resolution power maps in almost anechoic rooms with high SNR, using our 3D CNN over the SRP-PHAT maps does not improve the results compared to just taking the maximum of each map; actually, our system seems to slightly degrade the DOA estimation, probably due to the effect of applying an unneeded tracking. However, when the room conditions deteriorate, we can see how Cross3D is robust enough to get its performance degraded by only 5° when the $T_{60}$ increases to 1.5 s (which is higher than any reverberation seen during the training) while the SRP-PHAT algorithm is just unable to perform a proper estimation. Also surprising are the results obtained with maps of only 4x8 resolution, which only perform an SRP-PHAT measurement every 45° in the azimuth and 60° in the elevation and, since $P(\theta, \varphi) = P(\theta, 0)\ \forall \varphi \in [0, 2\pi)$ if $\theta = 0$ or $\pi$, only need to perform 18 computations of (4).

Fig. 6. Localization Root Mean Squared Angular Error for several power map resolutions, SNRs and reverberation times (panels (a) to (e)). The silent frames were not included in the computation of the RMSAE.

Fig. 7 shows a couple of examples of simulated trajectories and their estimated DOA. In Fig. 7a we can see how, for scenarios with high reverberation and low SNRs, the maximum of the SRP-PHAT maps becomes really noisy but our proposed system is able to maintain the estimated DOA quite close to the actual one. In linear systems, robust tracking with noisy estimations usually comes with the cost of being slow to track fast changes, at least with causal systems, but we can see how our model was able to follow the sudden change in the azimuth of the source at the fifth second of the trajectory. In Fig. 7b we can see how, when working with low-resolution power maps, our system is able to predict the DOA with much higher precision than the maximums of the maps. This could not be done with a two-step DOA estimation and tracking algorithm that performed the tracking based only on the maximum of the maps. Our system is able to analyze the whole maps and it was able to learn to exploit the patterns in the SRP-PHAT maps to achieve higher resolution than the grid used to compute the maps.

Fig. 7. Examples of the DOA estimated in a scenario with $T_{60}$ = 0.9 s and SNR=5 dB using maps with 32x64 resolution (a) and in a scenario with $T_{60}$ = 0.3 s and SNR=30 dB using maps with 4x8 resolution (b). The solid line represents the actual DOA of the source, the dashed line the estimated DOA and the crosses represent the maximum of each SRP-PHAT power map. Grey segments indicate silent frames. [Plots: elevation and azimuth DOA in degrees versus time in seconds.]

Finally, we also tested the baseline methods under different reverberations and SNRs to compare the robustness of each model. In this case, since all the methods include tracking capabilities, we did not exclude the silent frames when we computed the RMSAEs shown in Fig. 8. We can see how the best results are obtained using our method with high resolution power maps, but that, even reducing the resolution, its performance is still competitive. Using a 1D CNN over the coordinates of the maximums of 64x128 SRP-PHAT maps performs worse than using our 3D CNN over 4x8 maps, so we can conclude that our model is exploiting the patterns present in the SRP-PHAT maps and not only using the information of the position of its maximums (this was also suggested by Fig. 7b). Using 1D CNNs over the GCCs (which is an approach, to the best of the authors' knowledge, unpublished) has a performance between using 3D CNNs over 4x8 and 8x16 maps and may be an interesting approach when a lower computational cost is needed. Finally, the models that use spectrograms as inputs perform well in favorable scenarios (SELDnet even outperforms our proposal in low noise anechoic chambers) but they are not very robust against noise and reverberation.

Fig. 8. Tracking Root Mean Squared Angular Error of Cross3D with several power map resolutions and the baseline methods for SNR=30 dB (a) and SNR=5 dB (b) and several reverberation times. The silent frames were also included in the computation of the RMSAE.

2) LOCATA dataset: In order to confirm that, although it was trained with a simulated dataset, our system is general enough to track sound sources recorded in real rooms, we tested it with the LOCATA challenge dataset [23], which contains several recordings with the same array that we had simulated to train the models. We used the development dataset and we focused on tasks 1, 3, and 5 of the challenge: a static loudspeaker recorded with a static array, a moving talker recorded with a static array, and a moving talker recorded with a moving array; it is worth mentioning that the array was static in all the simulations employed to train the model. For the robot head microphone array that we simulated in the training dataset, the development dataset contains 3 recordings for each task and their ground-truth positions.

It is worth saying that the only modification to the proposed technique that we made after seeing its performance with the LOCATA dataset was the use of a VAD. All the hyperparameters of the model and the acoustical properties of the training dataset were selected according only to the results obtained with simulated datasets. In other words, we used simulated signals for training and validation and the LOCATA recordings only for testing.

Table II shows the RMSAE of estimating the DOA of the source of each recording using our technique and with the baseline methods. Although it is difficult to draw conclusions from such a low number of recordings, we can see how the proposed tracking system clearly outperforms the baseline methods that use spectrograms as inputs and that it also outperforms the 1D CNN methods when we use maps with at least 16x32 resolution.

TABLE II
RMSAE [°] OF THE DOA ESTIMATED FOR THE LOCATA DATASET WITH CROSS3D USING SEVERAL MAP RESOLUTIONS AND THE BASELINE TRACKING METHODS. THE SILENT FRAMES WERE INCLUDED IN THE COMPUTATION OF THE RMSAE.

                       Cross3D (SRP-PHAT maps)                  1D CNN              2D CNN         SELDnet
Task     Recording     4x8     8x16    16x32   32x64   64x128   GCCs    Maximums    Spectrograms   Spectrograms
Task 1   Recording 1   17.93   11.92    8.30    4.62    5.16    16.18    7.54        93.76         29.70
Task 1   Recording 2   18.90    7.68    6.68    4.90    3.91    12.60    5.19        64.18         38.44
Task 1   Recording 3   10.35    6.34    2.98    3.25    2.24    11.57    5.09       140.21         54.81
Task 1   Average       15.72    8.65    5.99    4.26    3.77    13.45    5.94        99.38         40.98
Task 3   Recording 1   23.06   18.11   13.79   12.43    9.92    13.59   14.04        70.86         50.57
Task 3   Recording 2   20.97   13.71   10.01    8.36    9.22    14.17   12.02        83.42         48.71
Task 3   Recording 3   21.05   12.74    9.83    7.69    6.60    15.21   13.29        82.48         57.29
Task 3   Average       21.69   14.85   11.21    9.49    8.58    14.32   13.12        78.92         52.86
Task 5   Recording 1   11.93   10.83    7.25    5.74    5.49    10.93   10.53        58.33         37.24
Task 5   Recording 2   20.92   16.16   16.08   12.18   13.59    17.33   17.42        41.98         73.17
Task 5   Recording 3   23.57   18.25   13.58   15.64   15.49    20.14   23.58        66.91         66.50
Task 5   Average       18.81   15.08   12.31   11.19   11.52    16.13   17.18        55.74         58.97
Average                18.74   12.86    9.83    8.31    7.96    14.64   12.08        78.01         50.94

According to [23], the reverberation time of the room where the recordings were performed was $T_{60} \approx 0.5$ s, so we can observe some degradation in the performance of Cross3D when it is used over low resolution power maps compared with the results obtained with the simulated dataset (see Fig. 8), but it disappears when the resolution of the maps increases; actually we even reach lower errors in the LOCATA dataset than with the simulated test dataset. The 1D CNN also suffers a similar degradation, but the most dramatic impact is on the methods which use spectrograms as inputs. In contrast, the use of a 1D CNN over the coordinates of the maximums of high resolution SRP-PHAT maps suffers almost no degradation; but it may not be an interesting approach since, having computed the 64x128 resolution maps, we can obtain far better results using the whole maps as inputs of Cross3D.

As an example, Fig. 9 shows the DOA estimation of the second recording of the third task of the LOCATA dataset, where all the methods obtained an RMSAE quite close to their average. We can see how Cross3D performs the best estimation of the analyzed methods both for the elevation and for the azimuth. We can also see that the LOCATA dataset has longer silences than the ones present in the simulated dataset, which could also explain why some of the methods obtained worse results with this dataset. In order to make the methods based on CNNs more robust against longer silences, we should include them in the simulation of the training dataset and, probably, increase the temporal receptive field of the models, which could be done increasing the number of layers, the temporal size of its kernels, or including longer temporal dilations in the convolutions.

Fig. 9. DOA estimated for the second recording of the third task of the LOCATA challenge using Cross3D over 32x64 maps and the baseline methods.

VI. CONCLUSIONS

In this paper, we have presented a new sound source DOA estimation and tracking system based on the well known SRP-PHAT method and a three-dimensional Convolutional Neural Network. The use of a fully causal convolutional architecture, without any bidirectional recurrent layer, makes our proposal feasible for real-time applications, being able to provide a new DOA estimation every 192 ms. We used a 3D CNN over time sequences of elevation and azimuth maps computed from the signals captured by a compact array but, using distributed arrays to compute 3D maps, the extension of the technique to use a 4D CNN would be straightforward.

The experiments performed show that the SRP-PHAT maps are a good input feature to be used in tracking systems based on deep learning, being much more robust to reverberation and noise than the use of spectrograms as proposed in most of the recent literature. They also prove that it is possible to obtain a good tracking performance using only causal convolutional layers and that non-causal recurrent layers are not needed.

Due to the difficulty of recording a hand-labeled dataset of moving sources large enough to train a neural network, we have introduced a new procedure for generating random trajectories and simulating them as they are needed for training. With it, we have an infinite-size dataset whose parameters can be easily modified during training to accelerate the convergence or during test to analyze the performance of the model in specific scenarios. To prove that the models trained with this procedure are general enough to deal with actual recordings, we have tested our model with the LOCATA dataset and obtained satisfactory results.

As a baseline method for our main proposal, we have also introduced a new architecture, based on the use of a causal 1D CNN over the GCCs, that also presents a good performance and robustness and that may be interesting for applications where the computation of the SRP-PHAT maps is not possible due to computational resource limitations.

ACKNOWLEDGMENT

This work was supported in part by the Regional Government of Aragon (Spain) with a grant for postgraduate research contracts (2017-2021) co-funded by the Operative Program FSE Aragon 2014-2020. This material is based upon work supported by Google Cloud.

REFERENCES

[1] C. Rascon and I. Meza, “Localization of sound sources in robotics: A review,” Robotics and Autonomous Systems, vol. 96, pp. 184–210, 2017.
[2] V. Tourbabin and B. Rafaely, “Theoretical Framework for the Optimization of Microphone Array Configuration for Humanoid Robot Audition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 12, pp. 1803–1814, Dec. 2014.
[3] A. Farina and L. Tronchin, “3D Sound Characterisation in Theatres Employing Microphone Arrays,” Acta Acustica united with Acustica, vol. 99, no. 1, pp. 118–125, Jan. 2013.
[4] K. Kumatani, J. McDonough, and B. Raj, “Microphone Array Processing for Distant Speech Recognition: From Close-Talking Microphones to Far-Field Sensors,” IEEE Signal Processing Magazine, vol. 29, no. 6, pp. 127–140, Nov. 2012.
[5] C. Spille, B. Kollmeier, and B. T. Meyer, “Comparing human and automatic speech recognition in simple and complex acoustic scenes,” Computer Speech & Language, vol. 52, pp. 123–140, 2018.
[6] R. Ma, G. Liu, Q. Hao, and C. Wang, “Smart microphone array design for speech enhancement in financial VR and AR,” in 2017 IEEE SENSORS, Oct. 2017, pp. 1–3.
[7] C. Knapp and G. Carter, “The generalized correlation method for estimation of time delay,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 24, no. 4, pp. 320–327, Aug. 1976.
[8] J. Benesty, “Adaptive eigenvalue decomposition algorithm for passive acoustic source localization,” The Journal of the Acoustical Society of America, vol. 107, no. 1, pp. 384–391, Dec. 1999.
[9] L. Comanducci, M. Cobos, F. Antonacci, and A. Sarti, “Time Difference of Arrival Estimation from Frequency-Sliding Generalized Cross-Correlations Using Convolutional Neural Networks,” in ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2020, pp. 4945–4949.
[10] J. H. DiBiase, “A high-accuracy, low-latency technique for talker localization in reverberant environments using microphone arrays,” Ph.D. dissertation, Brown University, 2000.
[11] J. H. DiBiase, H. F. Silverman, and M. Brandstein, “Robust Localization in Reverberant Rooms,” in Microphone Arrays: Signal Processing Techniques and Applications. Berlin, Heidelberg: Springer Berlin Heidelberg, 2001.
[12] R. Schmidt, “Multiple emitter location and signal parameter estimation,” IEEE Transactions on Antennas and Propagation, vol. 34, no. 3, pp. 276–280, Mar. 1986.
[13] J. P. Dmochowski, J. Benesty, and S. Affes, “Broadband Music: Opportunities and Challenges for Multiple Source Localization,” in 2007 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, Oct. 2007, pp. 18–21.
[14] J. Traa and P. Smaragdis, “A Wrapped Kalman Filter for Azimuthal Speaker Tracking,” IEEE Signal Processing Letters, vol. 20, no. 12, pp. 1257–1260, Dec. 2013.
[15] Y. Tian, Z. Chen, and F. Yin, “Distributed Kalman filter-based speaker tracking in microphone array networks,” Applied Acoustics, vol. 89, pp. 71–77, Mar. 2015.
[16] D. B. Ward, E. A. Lehmann, and R. C. Williamson, “Particle filtering algorithms for tracking an acoustic source in a reverberant environment,” IEEE Transactions on Speech and Audio Processing, vol. 11, no. 6, pp. 826–836, Nov. 2003.
[17] W.-K. Ma, B.-N. Vo, S. S. Singh, and A. Baddeley, “Tracking an unknown time-varying number of speakers using TDOA measurements: A random finite set approach,” IEEE Transactions on Signal Processing, vol. 54, no. 9, pp. 3291–3304, Sep. 2006.
[18] X. Xiao, S. Zhao, X. Zhong, D. L. Jones, E. S. Chng, and H. Li, “A learning-based approach to direction of arrival estimation in noisy and reverberant environments,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 2015, pp. 2814–2818.
[19] R. Takeda and K. Komatani, “Discriminative multiple sound source localization based on deep neural networks using independent location model,” in 2016 IEEE Spoken Language Technology Workshop (SLT), Dec. 2016, pp. 603–609.
[20] J. M. Vera-Diaz, D. Pizarro, and J. Macias-Guarasa, “Towards End-to-End Acoustic Localization Using Deep Learning: From Audio Signals to Source Position Coordinates,” Sensors, vol. 18, no. 10, p. 3418, Oct. 2018.
[21] E. L. Ferguson, S. B. Williams, and C. T. Jin, “Sound Source Localization in a Multipath Environment Using Convolutional Neural Networks,” in 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 2018, pp. 2386–2390.
[22] S. Adavanne, A. Politis, J. Nikunen, and T. Virtanen, “Sound Event Localization and Detection of Overlapping Sources Using Convolutional Recurrent Neural Networks,” IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 1, pp. 34–48, Mar. 2019.
[23] H. W. Löllmann, C. Evers, A. Schmidt, H. Mellmann, H. Barfuss, P. A. Naylor, and W. Kellermann, “The LOCATA Challenge Data Corpus for Acoustic Source Localization and Tracking,” in 2018 IEEE 10th Sensor Array and Multichannel Signal Processing Workshop (SAM), Jul. 2018, pp. 410–414.
[24] H. Do and H. F. Silverman, “A Fast Microphone Array SRP-PHAT Source Location Implementation using Coarse-To-Fine Region Contraction (CFRC),” in 2007 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, Oct. 2007, pp. 295–298.
[25] H. Do and H. F. Silverman, “Stochastic particle filtering: A fast SRP-PHAT single source localization algorithm,” in 2009 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, Oct. 2009, pp. 213–216.
[26] L. O. Nunes, W. A. Martins, M. V. S. Lima, L. W. P. Biscainho, M. V. M. Costa, F. M. Gonçalves, A. Said, and B. Lee, “A Steered-Response Power Algorithm Employing Hierarchical Search for Acoustic Source Localization Using Microphone Arrays,” IEEE Transactions on Signal Processing, vol. 62, no. 19, pp. 5171–5183, Oct. 2014.
[27] M. Cobos, A. Marti, and J. J. Lopez, “A Modified SRP-PHAT Functional for Robust Real-Time Sound Source Localization With Scalable Spatial Sampling,” IEEE Signal Processing Letters, vol. 18, no. 1, pp. 71–74, Jan. 2011.
[28] C. Evers, Y. Dorfan, S. Gannot, and P. A. Naylor, “Source tracking using moving microphone arrays for robot audition,” in 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Mar. 2017, pp. 6145–6149.
[29] O. Schwartz, Y. Dorfan, E. A. P. Habets, and S. Gannot, “Multi-speaker DOA estimation in reverberation conditions using expectation-maximization,” in 2016 IEEE International Workshop on Acoustic Signal Enhancement (IWAENC), Sep. 2016, pp. 1–5.
[30] S. Chakrabarty and E. A. P. Habets, “Multi-Speaker DOA Estimation Using Deep Convolutional Networks Trained With Noise Signals,” IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 1, pp. 8–21, Mar. 2019.
[31] S. Adavanne, A. Politis, and T. Virtanen, “Direction of Arrival Estimation for Multiple Sound Sources Using Convolutional Recurrent Neural Network,” in 2018 26th European Signal Processing Conference (EUSIPCO), Sep. 2018, pp. 1462–1466.
[32] N. Yalta, K. Nakadai, and T. Ogata, “Sound Source Localization Using Deep Learning Models,” Journal of Robotics and Mechatronics, vol. 29, no. 1, pp. 37–48, 2017.
[33] Y. Sun, J. Chen, C. Yuen, and S. Rahardja, “Indoor Sound Source Localization With Probabilistic Neural Network,” IEEE Transactions on Industrial Electronics, vol. 65, no. 8, pp. 6403–6413, Aug. 2018.
[34] W. He, P. Motlicek, and J.-M. Odobez, “Deep Neural Networks for Multiple Speaker Detection and Localization,” in 2018 IEEE International Conference on Robotics and Automation (ICRA), May 2018, pp. 74–79.
[35] S. Kapka and M. Lewandowski, “Sound Source Detection, Localization and Classification using Consecutive Ensemble of CRNN Models,” in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019). New York University, 2019, pp. 119–123.
[36] H. Cordourier, P. Lopez Meyer, J. Huang, J. Del Hoyo Ontiveros, and H. Lu, “GCC-PHAT Cross-Correlation Audio Features for Simultaneous Sound Event Localization and Detection (SELD) on Multiple Rooms,” in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019). New York University, Oct.
[37] Y. Cao, T. Iqbal, Q. Kong, M. B. Galindo, W. Wang, and M. D. Plumbley, “Two-Stage Sound Event Localization and Detection Using Intensity Vector and Generalized Cross-Correlation,” in Proceedings of the Detection and Classification of Acoustic Scenes and Events 2019 Workshop (DCASE2019). New York University, 2019.
[38] L. Perotin, A. Défossez, E. Vincent, R. Serizel, and A. Guérin, “Regression versus classification for neural network based audio source localization,” IEEE Workshop on Applications of Signal Processing to Audio and Acoustics, p. 6, 2019.
[39] Z. Tang, J. D. Kanu, K. Hogan, and D. Manocha, “Regression and Classification for Direction-of-Arrival Estimation with Convolutional Recurrent Neural Networks,” in Interspeech 2019. ISCA, Sep. 2019, pp. 654–658.
[40] D. Diaz-Guerra and J. R. Beltran, “Source cancellation in cross-correlation functions for broadband multisource DOA estimation,” Signal Processing, vol. 170, p. 107442, May 2020.
[41] A. Brutti, M. Omologo, and P. Svaizer, “Multiple Source Localization Based on Acoustic Map De-Emphasis,” EURASIP Journal on Audio, Speech, and Music Processing, vol. 2010, no. 1, p. 147495, Dec. 2010.
[42] D. Diaz-Guerra and J. R. Beltran, “Direction of Arrival Estimation with Microphone Arrays Using SRP-PHAT and Neural Networks,” in 2018 IEEE 10th Sensor Array and Multichannel Signal Processing Workshop (SAM), Jul. 2018, pp. 617–621.
[43] L. Perotin, R. Serizel, E. Vincent, and A. Guerin, “CRNN-Based Multiple DoA Estimation Using Acoustic Intensity Features for Ambisonics Recordings,” IEEE Journal of Selected Topics in Signal Processing, vol. 13, no. 1, pp. 22–33, Mar. 2019.
[44] T. S. Cohen, M. Geiger, J. Köhler, and M. Welling, “Spherical CNNs,” in International Conference on Learning Representations, 2018.
[45] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, “Curriculum learning,” in Proceedings of the 26th Annual International Conference on Machine Learning, ser. ICML ’09. Montreal, Quebec, Canada: Association for Computing Machinery, Jun. 2009, pp. 41–48.
[46] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Librispeech: An ASR corpus based on public domain audio books,” in 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 2015, pp. 5206–5210.
[47] J. Wiseman, “Wiseman/py-webrtcvad,” Nov. 2019.
[48] J. B. Allen and D. A. Berkley, “Image method for efficiently simulating small-room acoustics,” The Journal of the Acoustical Society of America, vol. 65, no. 4, pp. 943–950, Apr. 1979.
[49] D. Diaz-Guerra, A. Miguel, and J. R. Beltran, “gpuRIR: A python library for room impulse response simulation with GPU acceleration,” Multimedia Tools and Applications, Oct. 2020.
[50] H. W. Löllmann, C. Evers, A. Schmidt, H. Mellmann, H. Barfuss, P. A. Naylor, and W. Kellermann, “IEEE-AASP Challenge on Acoustic Source Localization and Tracking: Documentation of Final Release,” https://locata.lms.tf.fau.de/datasets/, Jan. 2020.
[51] K. He, X. Zhang, S. Ren, and J. Sun, “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification,” in 2015 IEEE International Conference on Computer Vision (ICCV), Dec. 2015, pp. 1026–1034.
[52] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” in ICLR 2015, San Diego.
[53] A. Paszke, S. Gross, F. Massa, A. Lerer, J. Bradbury, G. Chanan, T. Killeen, Z. Lin, N. Gimelshein, L. Antiga, A. Desmaison, A. Kopf, E. Yang, Z. DeVito, M. Raison, A. Tejani, S. Chilamkurthy, B. Steiner, L. Fang, J. Bai, and S. Chintala, “PyTorch: An Imperative Style, High-Performance Deep Learning Library,” in Advances in Neural Information Processing Systems 32, H. Wallach, H. Larochelle, A. Beygelzimer, F. d’Alché-Buc, E. Fox, and R. Garnett, Eds. Curran Associates, Inc., 2019, pp. 8026–8037.
[54] K. Cho, B. van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning Phrase Representations using RNN Encoder–Decoder for Statistical Machine Translation,” in Proceedings of the 2014 Conference on Empirical Methods in Natural Language Processing (EMNLP). Doha, Qatar: Association for Computational Linguistics, Oct. 2014, pp. 1724–1734.

Journal

Electrical Engineering and Systems Science, arXiv (Cornell University)

Published: Jun 16, 2020
