Modeling the Comb Filter Effect and Interaural Coherence for Binaural Source Separation

Modeling the Comb Filter Effect and Interaural Coherence for Binaural Source Separation IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, IEEE COPYRIGHT 1 Modeling the Comb Filter Effect and Interaural Coherence for Binaural Source Separation Luca Remaggi, Philip J. B. Jackson, Wenwu Wang, Senior Member IEEE . Abstract—Typical methods for binaural source separation size of the environment, without directional information [12]. consider only the direct sound as the target signal in a mixture. Instead, early reflections affect the human sound perception, However, in most scenarios, this assumption limits the source by conveying a directional sense of the geometry of the separation performance. It is well known that the early reflections environment [13]. This generates auditory effects, for instance interact with the direct sound, producing acoustic effects at modifying the source width perception [14]. Moreover, being the listening position, e.g. the so-called comb filter effect. In this article, we propose a novel source separation model, that coherent with the direct sound, strong early reflections modify utilizes both the direct sound and the first early reflection the perceived sound coloration, by generating a comb filter information to model the comb filter effect. This is done by effect [15]. Hence, acoustic multipath properties should be observing the interaural phase difference obtained from the time- considered in the design of source separation methods [16]. frequency representation of binaural mixtures. Furthermore, a Many different approaches can be found in the literature method is proposed to model the interaural coherence of the to tackle the source separation problem. However, most of signals. Including information related to the sound multipath propagation, the performance of the proposed separation method them do not explicitly model the acoustic multipath properties. 
is improved with respect to the baselines that did not use such For instance, in the well-known Model-based Expectation information, as illustrated by using binaural recordings made in Maximization Source Separation and Localization (MESSL) four rooms, having different sizes and reverberation times. method [17] only the direct sound interaural cues (i.e. the in- Index Terms—Source separation, comb filter effect, RIRs, teraural phase difference (IPD) and interaural level difference IPD, ILD, binaural audio, multipath propagation, interaural (ILD)) were modeled, without considering any early reflection coherence. effect. Furthermore, although a garbage source was defined to indirectly deal with the late reverberation, there was not any I. INTRODUCTION formal attempt to model the reverb. The aim of this article is to investigate how information Source separation is one of the most investigated fields in related to early reflections can improve source separation the signal processing community. Several application areas can methods, in general. Such information can be potentially used benefit from it. For instance, it can improve target detection in many source separation methods, either unsupervised or performance of passive sonar systems [1]. In biomedical supervised. Here, we selected MESSL [17] as a baseline engineering, source separation is often used to analyze elec- method due to its unsupervised nature, and the convenience trocardiograms, electroencephalograms, or magnetic resonance in incorporating the early reflections information into its IPD images [2]. Work on ancient document restoration has utilized model. We extended MESSL [17], by emulating the comb filter source separation for correcting bleed-through distortion [3]. effect produced by the early reflections. To do so, we define Source separation has also been used in a large range of speech parametric functions in the time-frequency (TF) domain, and applications. 
For instance, it is used for improving speech model the behavior of the IPD, by considering the interaction enhancement [4], crosstalk cancellation [5], and automatic between the direct sound and the first arriving early reflection. speech recognition systems [6]. It can also be applied to The first reflection is chosen to be included into the model as improve hearing aids [7], or improve security systems [8]. it is the one that most affects the spatial cues [18]. Similar Spatial audio can also rely on it, to produce object-based to MESSL, we also use an ILD model, which considers the audio [9]. Robust speech processing is another target area [10]. direct sound cue, and the garbage source. In typical conditions, a sound produced by a source interacts In addition to the comb filter effect, we propose a model with its environment during propagation, before it reaches a that separates the reverberation’s effect from the rest of the listening position. This multipath propagation is defined by its RIR’s. This is done by approximating the human capability of room impulse response (RIR), i.e. an acoustic signal describing separating sounds in reverberant environments. Specifically, the propagation of sound from source to listening position. we model the interaural coherence (IC) of indivual sources in RIRs have three parts: direct sound, early reflections, and late the mixture, similar to what was introduced in [19]. However, reverberation [11]. The direct sound carries information related there, the target source was assumed to be in front of the to the source. Late reverberation provides clues about the listener. Here, we propose an approach that is not limited by this, but works for any target source position. IEEE Copyright The authors are with the Centre for Vision, Speech and Signal Pro- The main novelties of this article include: cessing, University of Surrey, Guildford, GU2 7XH, UK. W. 
Wang is a new IPD model, considering both direct sound and first also with Qingdao University of Science and Technology, China. Emails: [l.remaggi, p.jackson, w.wang]@surrey.ac.uk. reflection, to approximate the comb filter effect; arXiv:1910.02127v1 [cs.SD] 4 Oct 2019 IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, IEEE COPYRIGHT 2 an extension of the MESSL IPD model, employing the IPD, relating the azimuthal sound direction of arrival (DOA) target signal IC; to the head orientation [40]. The method presented in [41] an additional novel source separation method, obtained utilized, instead, the so called mixing vector (MV). For each by combining the two new models above; frequency bin, this vector contains the time invariant frequency the application of a source and image source localization response component of the room. In both [17] and [41], the algorithm to initialize the expectation maximization (EM) probability of each TF point belonging to a specific source algorithm used to estimate the Gaussian mixture model in the mixture was determined. From this probability, TF (GMM) parameters, and one deep-learning approach masks were generated. In [42], the two methods proposed using an MLP architecture with two hidden layers to in [17] and [41] were combined, constructing a probability generate the TF mask. distribution that takes into account the three cues ILD, IPD and MV. In [43], a high-dimensional vector, constructed by Since the novel IPD model approximates the early reflection combining the IPD and ILD cues, was projected onto a 2D information, the first new pipeline is named as Early Reflection space, represented by the sound azimuth and elevation DOA. MESSL (ER-MESSL). The second novel pipeline uses the IC A regression approach located the sources, and estimated the of the estimated target signal, hence, its name is IC-MESSL. TF masks. The IC cue was then employed in [44]. 
By combining the new IPD model with the IC based model, In the literature, yet few works can be found that consider we obtain the third proposed method, thus named as ERIC- both direct sound and early reflections. In [45], the source MESSL. Finally, there is need for the employed EM algo- separation problem was divided into different procedures, by rithm to be initialized. Since our proposed methods combine applying deconvolution to each individual reflection. However, the direct sound and first reflection information, we employ the performance degrades with low signal-to-noise ratio (SNR) our Image Source Direction and Ranging (ISDAR) [20] to conditions. In [46], a variation of the ICA method [47] was initialize it, by localizing the target source and related im- used to estimate the time-dependent mixing system, con- age source [21]. A comparative evaluation of early and late sidering the multipath propagation. However, with the ICA models is performed and reported as additional contribution. approach, the effect of its classical permutation problem was The challenging two source binaural speech mixture scenario exacerbated by the incorrect RIR components’ alignment. was analyzed, by employing signal and perceptual objective Deconvolution of the received signals was proposed in [48], measures. In the experimental section, we also evaluate the by employing simulated RIRs. These RIRs were estimated by improvement given by considering early reflection information matching the temporal support of recorded ones. Nevertheless, in a state-of-the-art deep learning based method, for supervised binaural effects, such as head shadowing and pinnae influence, speech separation. Through this, we further demonstrate that were not considered. 
Multichannel microphone arrays were early reflection information improves source separation meth- used in [33], where beamformers were designed to have their ods’ performance, including deep learning, and that this can directivty patterns characterized by multiple beams, to simul- be potentially applied to many approaches in the literature. taneously extract direct sound and early reflections. Results The overall structure of this article is as follows: in Section show improvement with respect to classical beamforming. II, related source separation methods are discussed; Section III However, they were tested only with simulated RIRs. The defines the theoretical foundations of the proposed approach. work in [49] demonstrated the benefit of including reflection In Sections IV and V, the proposed interaural cue models for information in source separation models, by employing a NMF the comb filter and IC are presented, respectively. Section VI approach. Nevertheless, only simulated RIRs were employed. describes the source separation algorithm. In Section VII, the In this article, we consider the first arriving early reflec- experiments are described, with related results and discussion. tion and related direct sound, to propose a binaural model Finally, Section VIII draws the conclusion. that increases the robustness in reverberant environments, by II. RELATED WORK IN SPEECH SOURCE SEPARATION estimating TF masks. It is based on [17], nevertheless, the Many approaches can be found in the literature to tackle proposed model could be potentially adapted to work with the source separation problem. Some of them exploit a-priori other methods described above, from beamformers to DNNs. information about basis functions representing the signals in the mixture [22]. Others employ the non-negative ma- III. BACKGROUND DEFINITIONS trix factorization (NMF) to learn sparse representation of In this section, we provide a general overview of the adopted speech sources [23–26]. 
The independent component analy- approach, and discuss the assumptions. The definitions of the sis (ICA) [27] is also used to decompose the mixture into general elements of the proposed architecture (e.g. binaural independent signals, by projecting the mixtures into different RIRs (BRIRs) and interaural spectrograms) are also given. domains. Scenarios where multiple microphones are available were also investigated [28–31], e.g. using beamformers [32], A. General Overview of the Proposed Method [33]. Recently, deep neural networks (DNNs) became widely Classical source separation methods exploit features related popular, when large training datasets are available [34–38]. TF masking is a popular approach, which assigns different to the direct sound to separate the target sound from a mixture. weights to the mixture, in the TF domain [39]. In [17], the In [17], the authors presented one of the first models to deal authors presented the MESSL method which uses binaural sig- with the reverberation, by proposing the “garbage” source. nals. Two interaural cues were exploited, i.e. the ILD and the In this article, we model two perceptual effects: the comb IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, IEEE COPYRIGHT 3 1,1,l 0,1,l n -n 1,1,l 0,1,l h (n) h (n) 0,1,l 1,1,l Source 1,1,l 0,1,l y (n) h (n) x (n) n 1,1,l l 0,1,l 0,1,l h (n) 0,1,l Left 1,1,l Time (samples) 0,2,l n -n Right 0,2,l 0,1,l 0,2,l h (n) h (n) 0,2,l y (n) 0,2,l 0,2,l 1,2,l h (n) 1,2,l 1,2,l 1,2,l h (n) 1,2,l n -n Time (samples) 1,2,l 0,2,l Fig. 2: Schematic representation of the comb filter effect Fig. 1: Example of an ideal BRIR, zoomed into its direct sound created for the two received sounds (y (n) and y (n)), given (blue) and first reflection (red) components (depicted as Dirac 1 2 the sound produced at the l-th source x (n). The direct sounds pulses). 
The top figure shows the RIR related to sensor i = l and reflections, together with the related delays ( ) and 1, whereas the bottom one the RIR at sensor i = 2. The attenuation factors (B) are the same as those defined in Fig. 1. amplitudes and delays are defined in Equation (2). filter and IC. Through the former we aim to model the first where i 2 [1; 2] 2 N and l are the microphone and source early reflection, in a constructive fashion, to enhance the indexes, respectively; n is the discrete time index, T indi- sound produced by the target speaker. The latter models the cates the last early reflection, and w (n) represents the late i;l reverberation, by aiding the garbage source in suppressing it. reverberation, whereas e is the reflection index (e = 0 indicates the direct sound). h is a function describing the reflection. e;i;l B. Proposed Method Assumptions n represents the reflection times of arrival (TOAs). e;i;l Following the assumption of having dominant specular com- In the proposed source separation method, assumptions were ponents, the early reflections are approximated by Dirac deltas made, defining its scientific boundaries as follows: (n) of different amplitudes P . For source separation e;i;l The number of sources L is known a-priori; purpose, we consider the direct sound and first reflection Source signals are sparse in the TF domain; components (i.e. e = f0; 1g) (see Fig. 1): The mixing system is time invariant; The first reflection has a dominant specular component; h (n) = P (n n ); 0;1;l 0;1;l 0;1;l Sources are sufficiently far from the reflectors; h (n) = P (n n ); 1;1;l 1;1;l 1;1;l The first early reflection is coherent with the direct sound. (2) h (n) = P (n n ); 0;2;l 0;2;l 0;2;l Although L has to be known a-priori, there is no restriction on it with respect to the number of microphones M , thus, h (n) = P (n n ): 1;2;l 1;2;l 1;2;l the method can be also applied to underdetermined scenarios. 
Sparsity over the TF domain corresponds to the assumption of D. Comb Filter and Interaural Coherence having, for each TF bin, only one of the sources dominating In environments where the first reflection is delayed between the mixture. Sources and microphones are assumed to be static 5 ms and 40 ms to the direct sound, the coloration of the sound within a static environment, i.e. the mixing system is time perceived is different from the one produced [14]. In signal invariant. Where the first reflection has a dominant specular processing, the superimposition of a signal with its delayed component, it is detected from RIRs to initialize the EM version is the result of comb filtering the signal, hence, we re-estimation. The sources have to be distant enough from model this perceptual effect as a comb filter effect (see Fig. 2). the reflectors, in order to have the first reflection arriving Reverberation is a diffuse component of the RIR that makes between 5 ms and 40 ms later than the direct sound. Finally, source separation more challenging by smearing the target the assumption of coherence between the first reflection and signal, both temporally and spatially. Thus it is useful for direct sound allow them to be modeled as a comb filter. The robust separation to suppress it. With spaced microphones, later reflections, having a more stochastic nature, are assumed reverberation signals are decorrelated above a certain fre- to be incoherent and modeled through the IC, with the reverb. quency [50]. With binaural microphones, IC measures the two signals correlation, hence we use it to model the reverberation. C. Binaural Room Impulse Response A RIR is a signal that characterizes the acoustics of an E. Interaural Spectrogram environment with respect to source and sensor positions. RIRs that are recorded by microphones in ear canals of a dummy Following the definition of BRIR in Equation (1), the head, are usually known as BRIRs. 
They are defined as: mixtures received at the i-th sensor can be written as: T L X X I (n) = h (n n ) + w (n); (1) y (n) = x (n) I (n) w (n); (3) i;l e;i;l e;i;l i;l i l i;l i;l e=0 l=1 Left Channel Right Channel IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, IEEE COPYRIGHT 4 Fig. 4: On the left the IPD function for a mixture of two Fig. 3: The figure on the left shows the IPD as a function of sources is shown. On the right, our comb filter based ER- frequency for a single source convolved with an ideal BRIR MESSL IPD model (the fluctuating red curve) is employed to formed by only direct sound and first reflection. On the right, fit one of the two sources in the same IPD function. the same IPD function is simultaneously fitted by the MESSL IPD model [17] (the straight green line), and our comb filter based ER-MESSL IPD model (the fluctuating red curve). only the direct sound information was used [17]. By assuming ideal BRIRs as formed by direct sound and first reflection (see Fig. 1), the two channel frequency responses are: where x (n) is the signal generated by the l-th source, w (n) l i;l is the convolutive white Gaussian noise, L is the number of I (!) = P exp[j!n ] + P exp[j!n ]); 1;l 0;1;l 0;1;l 1;1;l 1;1;l sources, and “” is the convolution operator. Since the human I (!) = P exp[j!n ] + P exp[j!n ]): 2;l 0;2;l 0;2;l 1;2;l 1;2;l auditory system analyzes the received mixtures in the TF do- (6) main [51], we use the the short-time Fourier transform (STFT) Their ratio is the interaural frequency response model: to calculate the TF representation of y (n): I (!) 1;l I (!) = = L l I (!) 2;l y (m; !) = x (m; !)I (!)w (m; !); (4) i l i;l i P + P exp[j!(n n )] 0;1;l 1;1;l 1;1;l 0;1;l l=1 P exp[j!(n n )] + P exp[j!(n n )] 0;2;l 0;2;l 0;1;l 1;2;l 1;2;l 0;1;l where m is the discrete time frame index, whereas ! is the (7) ang angular frequency. I (!) 
is not time dependent, by assuming i;l ^ The phase of this equation, denoted as I (!), corresponds the mixing system to be time-invariant. Considering binaural to the proposed IPD model, and it is one of the main novelties systems, the interaural spectrogram is defined as [17]: of this article. For the l-th source, the difference between the IPD ILD y (m; !) observed IPD  (m; !) and its model is the phase residual: IS (m;!)=20 IPD y (m; !) = = 10 exp[j (m; !)]; y (m; !) ang 2 IPD IPD (m; !;C ) =  (m; !) I (!;C ); (8) l l l l l (5) ILD IPD where (m; !) and  (m; !) are the ILD and IPD of that is wrapped into the interval [ ); and: the observation, respectively, and j = 1. DS DF ST C = [n ; n ; n ; P ; P ; P ; P ]; (9) l 0;1;l 1;1;l 0;2;l 1;2;l l l l IV. M ODELING THE C OMB F ILTER EFFECT DS DF where n = n n , n = n n , and 0;2;l 0;1;l 1;1;l 0;1;l l l The IPD and ILD cues can be modeled to generate proba- ST n = n n . An example of the IPD model fitting 1;2;l 1;1;l bility distributions for identifying the dominant source, given an ideal IPD observation is shown in Fig. 3, together with each TF bin. The novel IPD model that approximates the comb a visual comparison of the MESSL IPD model [17]. The filter effect is proposed in this section. Furthermore, the ILD ideal IPD observation was obtained from a synthetic BRIR model (that was presented in [17]) is described. Finally, these composed of only direct sound and first reflection. From this two are combined into a joint probability distribution. figure, it is clear that our proposed ER-MESSL IPD model In the proposed model (as in MESSL [17]), sound sources fits the observed data better than MESSL, by considering the are assumed to be spatially quasi-static: they have to be static comb filter effect. In Fig. 4, we also report the IPD function within the time interval under investigation. 
Nonetheless, as related to a mixture of two sources, generated using recorded a potential extension for future work, one could employ a BRIRs. The two sources’ contributions are well visible from tracking system, that would provide the model with updated the figure on the left, as two linear patterns having opposite time delays (i.e. n ). Using audio only, beamformers could e;i;l gradients. From the figure on the right, it is also visible that be used to estimate constantly the DOAs of the direct sound our proposed ER-MESSL model fits one of the two sources. and early reflections. Alternatively, one could track sources by ILD The ILD cue, (m; !), is modeled, similar to [17], by employing a particle filter [52], or a multimodal approach [53]. considering directly the frequency-dependent BRIR, as: I (!) 1;l ILD A. Interaural Level and Phase Differences a (!) = 20 log ; (10) l 10 I (!) 2;l The proposed IPD model is defined to match the behavior of the observed IPD and is different from previous work where where “jj” indicates the absolute value. IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, IEEE COPYRIGHT 5 B. Interaural Cue Probability Distributions The values of (m; !) are constrained between 0 and 1, 1;2 thus, (m; !) is employed as the TF soft mask that models 1;2 For the ILD cue, the probability of each TF bin being associ- the IC. To do so, it will be used as prior mask during the ated to source l can be written as a Gaussian distribution [42]: posterior probability calculation, that will be described in ILD ILD ILD ILD p( (m; !)jl) = N ( (m; !)j (!);  (!)); l l Section VI-B . (m; !) is computed from the observation 1;2 (11) by employing the equations defined in [55]. ILD ILD where  (!) is the mean, and  (!) is the variance. l l The aim of modeling the IC is to suppress remaining early Regarding the IPD cue, a top-down approach is used to reflections and late reverberation, i.e. the BRIR parts that are IPD wrap the signal phase between  [17].  
(m; !;C ) is l not modeled by the comb filter. A similar approach to calculate modeled by a Gaussian distribution: an IC based TF mask was employed in [19]. However, there, IPD ^ the target source was assumed to be in front of the listener. p( (m; !)jl;C ) = (12) Here, we do not make any assumption regarding the position IPD IPD IPD = N ( (m; !;C )j (!;C );  (!;C )); l l l l l of the target source. Its position is estimated by ISDAR, the IPD IPD algorithm described later, in Section VI-C. Having the target where  (!;C ) and  (!;C ) are the IPD distribution l l l l source position, we then calculate (m; !) by analyzing the 1;2 mean and variance, respectively. BRIR related to the estimated DOA. To sum up, by assuming the IPD and ILD observations as being conditionally independent given their related parameters, B. The Garbage Source their probability distributions can be combined as: Late reflections and reverberation are problematic compo- ILD IPD p( (m; !);  (m; !)jl;C ) = nents of the acoustics that are undesiderable in the comb-filter (13) ILD IPD model, proposed in Section IV, as their first-order statistics = N ( (m; !);  (m; !;C )j ); l l are unreliable. Hence, the IC model described above is used 2 2 ILD ILD IPD IPD where  = f (!);  (!);  (!;C );  (!;C )g. l l l l l l l to suppress these components of the BRIRs by consideration This probability distribution identifies the proposed comb of their second-order statistics. In addition to this, we utilize a filter model, that was conceived to approximate the interaction garbage source, as in [17]. It represents noise dominating the between the received direct sound and first early reflection, TF bins that are not claimed by any of the other sources. i.e. two strongly coherent signals. 
This model does not take The parameters  used to model the garbage source are the into account either later reflections or reverberation, which same as those used by the other sources to define the distribu- are, in this article, dealt by the IC model. tion in Equation (13). The difference is the initialization, since the garbage source is used to model the noise sources, such V. M ODELING THE INTERAURAL COHERENCE as background noise, measurement noise, and reverberation. To suppress reverberation, the idea is to identify those areas VI. SOURCE SEPARATION M ODEL REESTIMATION in the TF domain that are dominated by the direct sound, and The EM is described here, along with the log-likelihood the strong early reflections. The direct sound and a strong used to optimize the parameters of the proposed models. reflection recorded at the two ears are highly correlated and coherent. In contrast, the late reverberation is diffuse, and does A. Parameter Estimation from Mixtures not present correlation between the binaural signals, at every The parameters characterizing the interaural cue probability frequency. Thus, we use the IC to create a probability mask, models are = f ; ; g, where is the marginal based on the coherence level, for every TF bin [19]. l l l;C l;C l l class membership, described as the joint probability of each TF bin being dominated by source l with the IPD model A. Interaural Coherence TF Mask parameters C : = p(l;C ). These parameters can be l l;C l The process we employed to calculate the IC of a signal estimated for a specific source l. This is a trivial problem follows an approach that was originally proposed in [54], upon the availability of the dominant source information for for dereverberation. For each TF bin, the auto-power spectral each TF bin. However, whether the source l is dominating a density of the two channels i = f1; 2g is calculated as: specific TF bin is not directly observable from the mixtures. (m; !) =  (m 1; !) 
+ (1 )jy (m; !)j ; (14) On the other hand, l can be inferred from the interaural cues i i i and observed models, that are not known a-priori. This missing where 0    1 is a smoothing factor determined as data problem is solved by the EM algorithm. = 1=(  f ), with  = 10 ms being a time constant and f s s The log-likelihood of the observations can be then defined the sampling frequency [55]. The cross-power spectral density as in [17], however, with the additional IC distribution: between the two channels is: ILD IPD L( ) = [log p( (m; !);  (m; !);j ) + log (m; !)] 1;2 (m; !) =  (m 1; !) + (1 )y (m; !)y (m; !); 1;2 1;2 1 m;! X X (15) ILD IPD = log p( (m; !)jl)p( (m; !)jl;C ) (m; !): l;C l 1;2 with [] indicating the complex conjugate operation. From m;! l;Cl (17) (14) and (15), the magnitude squared coherence is: (m; !) 1;2 This has been implemented using the MESSL open source code’s option (m; !) = : (16) 1;2 allowing the definition of prior masks: https://github.com/mim/messl. (m; !) (m; !) 1 2 IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, IEEE COPYRIGHT 6 This definition assumes that the IC, IPD and ILD cues are Since the proposed IPD model in ER-MESSL and ERIC- independent. As a result, the joint probability is written MESSL is composed of seven parameters C (Equation (9)), as the product of individual probabilities. In addition, the it involves a seven dimensional space when trying to find number of sources must be specified a-priori [17]. Note that the best combination of them, hence it is computationally the inclusion of the IC into the log-likelihood function is expensive. Therefore, the amplitudes P are fixed; only the e;i;l different from previous approaches, such as [19]. There, the IC initialized value is allowed. The time-dependent parameters’ mask was multiplied by the TF representation of the mixture. allowed ranges were found empirically, as in Table I. Equation (17) represents the proposed ERIC-MESSL. C. Model Initialization B. 
Expectation-Maximization (EM) The initialization part plays a crucial role for the EM The EM algorithm is used to estimate the parameters and algorithm performance, since the log-likelihood is not convex. probability at each TF bin. (m; !jl) is considered as a A poor initialization leads to local maxima, thus affecting the 1;2 prior, and not updated during the iterations. During the E-step, source separation results. The estimated source and image the occupation likelihood of source l with parameters C is source positions are used to initialize the time-dependent ILD IPD DF DS ST calculated for each TF bin, given (m; !) and  (m; !): parameters n , n and n . Instead, the amplitudes P , 0;1;l l l l P , P , P are initialized by analyzing the BRIR that ILD 1;1;l 0;2;l 1;2;l (m; !jC ) = p( (m; !)jl) l l l;C is related to the estimated DOA. Therefore, the early reflection (18) IPD p( (m; !)jl;C )p( (m; !)jl): l 1;2 information is not pre-estimated, but found and refined by the proposed system at each iteration. The microphone array is This expectation is then used in the M-step, to re-estimate only used to initialize the EM algorithm. the parameters, and maximize the likelihood. The ILD param- In [17], only the direct sound was used to model the source, eters are updated as [42]: P and the parameters were initialized by using the GCC-PHAT ILD (m; !) (m; !jC ) l l m;C ILD l algorithm [56]. In our proposed method, correct localization (!) = ; (m; !jC ) l l of the first reflection is also crucial. Source and image source m;C ILD positions are estimated through our ISDAR method [20]. (19) (!) = P P This method relies on RIRs recorded via a multichannel ILD ILD 2 ( (m; !)  (!))  (m; !jC ) l l m l C l microphone array, placed at the same listener position. We P ; (m; !jC ) chose this since, to our knowledge, no method in the literature l l m;C can reliably localize reflections, given binaural recordings. 
whereas the IPD residual parameters are updated as: However, other kinds of approaches could be also employed, (m; !jC ) (m; !jC ) for instance, audio-visual based methods [57]. l l l l IPD m (!jC ) = ; l l ISDAR is based on spherical coordinates. Direct sound (m; !jC ) l l and reflection TOAs n ^ are estimated through the clus- IPD e;i;l (20) (!jC ) = tered dynamic programming projected phase-slope algorithm IPD 2 ( (m; !jC )  (!jC ))  (m; !jC ) l l l l l m l (C-DYPSA), that we proposed in [20], whereas azimuth P : (m; !jC ) l l DOAs  are estimated through the delay-and-sum beam- m e;l former [20], [58]. Considering the listener at the center of the Also the marginal class membership is updated: coordinate system, the radial distances of the source and image 1 M source are calculated as  = (n ^ c ), where c is =  (m; !jC ); (21) e;l e;i;l 0 0 l;C l l M i=1 the sound speed, and n ^ is either the estimated direct sound m;! e;i;l (e = 0) or first reflection (e = 1) TOA. The source and image where B is the total number of TF bins. source positions in the Cartesian coordinate system are given The model parameters that are found during the last EM by b =  cos  and b =  sin  . Knowing x;e;l e;l e;l y;e;l e;l e;l iteration are selected as the final estimation. Probabilistic the listener position, these values are converted into TDOAs masks are generated by marginalizing over the estimated C : to populate Equation (9). The amplitudes P are calculated e;i;l M (m; !) =  (m; !jC ): (22) by directly analyzing the BRIRs at the reflection TOA n ^ . l l l e;i;l C Regarding the ILD distribution, the value of the ILD prior mean is estimated by utilizing a set of synthetic binaural RIRs, The separated source signal l can finally be obtained as: as in [17]. The garbage source is initialized to have a uniform y ^ (m; !) = y (m; !)M (m; !); 8m; 8!: (23) i;l i l distribution across IPD, and a uniform ILD distribution with zero mean for all frequencies. 
The seven interaural model parameters defined in $C_l$ are treated in the EM as hidden variables. Specifically, they are modeled as discrete random variables, where the sets of allowed values are specified a-priori, as in [17]. The parameters in $C_l$ are not internally updated by the EM algorithm. Instead, every allowed value combination is tested [17]. The combination that maximizes the log-likelihood is then chosen.

VII. EXPERIMENTS AND RESULTS

In this section, the results of a set of experiments are described. In these experiments, we consider mixtures of speech signals in four different recorded environments. When only the IC is modeled, and MESSL is used to model only the direct sound, the proposed method is named IC-MESSL. When the comb filter effect is modeled, extending MESSL in that sense, without considering any prior knowledge regarding the IC, the proposed method is ER-MESSL. Otherwise, if both the comb filter and the IC are modeled, the novel method is named ERIC-MESSL. The three proposed methods are compared to MESSL [17]. The ranges of allowed parameters for the comb filter model are in Table I, for each dataset.

TABLE I: Range sizes for the allowed values around the initialized IPD model parameters.

| Parameters | Vislab | DWRC | BBC UL | Studio1 |
|---|---|---|---|---|
| $n_l^{DF}$, $n_l^{DS}$, $n_l^{ST}$ | 0.13 ms | 0.13 ms | 0.19 ms | 0.31 ms |

At the end of this section, we also show that other separation algorithms would benefit from the inclusion of early reflection information. We extend a deep learning based state-of-the-art method. Different from MESSL, which is an unsupervised method, the deep learning approach is used to demonstrate that improvements can be achieved also for supervised methods.

A. Datasets

BRIRs were recorded in four rooms, characterized by different sizes and reverberation times (RT60). The four rooms are named "Vislab", "Digital World Research Centre" (DWRC), "BBC Usability Laboratory" (BBC UL), and "Studio1". Their plan views are shown in Fig. 5, whereas the RT60s are in Table II, together with the number of loudspeaker positions $L_{TOT}$ and their lateral angles. Two different dummy heads were employed (i.e. a Cortex Manikin Mk2 Binaural Head and Torso Simulator and a Neumann KU100 dummy head), depending on their availability for the recordings. To obtain data for the initialization, a 48-channel bi-circular array with a typical microphone spacing of 21 mm and an aperture of 212 mm was utilized to record RIRs [20]. The dummy head and bi-circular array were recorded separately, to avoid interference effects. All the recordings were made by employing the swept-sine technique [59], with a sampling frequency of 48 kHz.

Footnotes: Available at http://cvssp.org/data/s3a, DOI: 10.15126/surreydata.00844867. DOIs: 10.15126/surreydata.00812228 and 10.15126/surreydata.00808465.

Fig. 5: Plan views of the four recorded rooms. The red circles represent the position of the dummy head, whereas the loudspeakers are depicted using their stylized symbol.

TABLE II: Recorded room RT60s, averaged over the octave bands between 500 Hz and 4 kHz, DRRs, and TISAs, averaged over all the tested combinations. $L_{TOT}$ is the number of loudspeakers. The loudspeaker positions are reported as lateral angles with respect to the dummy head orientation.

| | Vislab | DWRC | BBC UL | Studio1 |
|---|---|---|---|---|
| RT60 (s) | 0.32 | 0.27 | 0.28 | 0.94 |
| DRR (dB) | 17.8 | 3.9 | 15.7 | 6.0 |
| AVG-TISA (Deg) | 75 | 37 | 71 | 32 |
| $L_{TOT}$ | 7 | 3 | 5 | 3 |
| Lateral angles (Deg) | 0, ±30, ±60, ±90 | 0, ±27 | 0, ±37, ±110 | 0, ±27 |

Arrangements. Two further measures characterize the datasets: the direct to reverberant ratio (DRR) [60], and the average target-interferer separation angle (AVG-TISA). These will allow a more comprehensive discussion of the separation performance achieved. DRR is calculated as the ratio between the energy carried by the direct sound and the rest of the BRIR. AVG-TISA is the mean lateral angle separating the target source from the interferer, considering all the possible target-interferer combinations. The DRR and AVG-TISA characterizing the four datasets are reported in Table II, together with the related RT60s.

Rooms. Vislab was an acoustically treated room at the University of Surrey, where the "Surrey Sound Sphere", having a radius of 1.68 m, was assembled. The loudspeakers were clamped on the sphere equator. The dummy head employed was the Cortex Manikin Mk2 Binaural Head and Torso Simulator. Both dummy head and bi-circular microphone array were placed at the sound sphere center.

DWRC is furnished as a living room-like area. Its acoustics are representative of typical domestic living rooms. A Cortex Manikin Mk2 Binaural Head and Torso Simulator sat on a sofa. The bi-circular array was positioned right behind it.

BBC UL is a room at the BBC R&D center, in Salford, UK. Similar to DWRC, it is furnished to resemble a typical living room environment. A Neumann KU100 dummy head was positioned on an armchair and the bi-circular array of microphones was separately measured at the same position.

Since the RT60s related to the three already introduced rooms were similar, an additional room was chosen: Studio1, a large recording studio at the University of Surrey. A Cortex Manikin Mk2 Binaural Head and Torso Simulator was used as dummy head. The loudspeaker positions were selected to have a height similar to the dummy head's. The microphone array was positioned about 2 m from the dummy head. Therefore, the image source positions found by this array were first manually modified, according to the dummy head position, before being used to initialize the EM. Depending on the loudspeaker-microphone positions in each room, reflections are generated from either the floor or lateral walls. Examples of RIRs for these two cases are depicted in Fig. 6.

Fig. 6: Two BRIR absolute values (normalized amplitude against time in ms, left and right channels), for a frontal source, zoomed into their direct sound and first reflection. On the left, reflection is generated by the floor, thus it arrives at the two ears simultaneously; on the right, reflection arrives from a lateral wall, thus there is a difference in TOAs and amplitudes.

The Utterances. Fifteen utterances, of 3 s length, were randomly selected from the TIMIT acoustic-phonetic continuous speech corpus [61]. For each combination of target source and interferer(s), $U = 15$ random combinations of the fifteen utterances were selected and tested. Therefore, the number of mixtures generated and tested for each dataset is:

$$\Lambda = \binom{L_{TOT}}{L}\, U, \tag{24}$$

where the symbol "$\binom{\cdot}{\cdot}$" represents the binomial coefficient, $L$ is the number of sources in the mixture, and $L_{TOT}$ is the total number of loudspeaker positions available in the dataset. The utterances were normalized before applying the convolutions to have the same root mean square energy.

B. Evaluation Metrics

The source to distortion ratio (SDR) metric is based on signal energy ratios, thus, is typically reported in dB. Following Equation (4), the ideal target signal $l$, that arrives at channel $i$ free from any interference and noise, can be defined as:

$$y^{\mathrm{tar}}_{i,l}(m,\omega) = x_l(m,\omega)\, I_{i,l}(\omega). \tag{25}$$

Hence, the source $\hat{y}_{i,l}(m,\omega)$, separated by a source separation method as in Equation (23), can be decomposed as [62]:

$$\hat{y}_{i,l}(m,\omega) = y^{\mathrm{tar}}_{i,l}(m,\omega) + E_{\mathrm{interf}} + E_{\mathrm{noise}} + E_{\mathrm{artif}}, \tag{26}$$

where $E_{\mathrm{interf}}$ is the interference error term, $E_{\mathrm{noise}}$ the noise error term, and $E_{\mathrm{artif}}$ the errors provided by general artifacts. We chose the SDR, since it emphasizes all the three error terms [62]:

$$\mathrm{SDR} = 10 \log_{10} \frac{\| y^{\mathrm{tar}}_{i,l}(m,\omega) \|^2}{\| E_{\mathrm{interf}} + E_{\mathrm{noise}} + E_{\mathrm{artif}} \|^2}, \tag{27}$$

where $\|\cdot\|$ represents the Euclidean norm operator. Once the SDR for each of the $\Lambda$ combinations of sources is obtained, the overall result for the dataset is calculated as their mean $\mathrm{SDR} = \frac{1}{\Lambda} \sum_{\lambda=1}^{\Lambda} \mathrm{SDR}_\lambda$, where $\lambda$ is the tested mixture index. As clean reference, we employed the target utterance convolved with the related BRIR direct sound. This is also used for the other performance metrics, described below. To extract the direct sound component from the BRIRs, we truncated them by using a Hamming window, centered at the direct sound TOA.

The perceptual evaluation of speech quality (PESQ) has been widely employed to evaluate processed speech quality [63]. This is related to the Mean Opinion Score (MOS) of human subjective assessments, therefore, the PESQ unit of measure is MOS. Before proceeding with the PESQ value calculation, $\hat{y}_{i,l}(m,\omega)$ and $y^{\mathrm{tar}}_{i,l}(m,\omega)$ are aligned, in terms of amplitudes and delays, by employing Wiener filters [63]. Through two parameters that model symmetric and asymmetric disturbances, a parametric function is then employed, mapping the differences between the processed version of $\hat{y}_{i,l}(m,\omega)$ and $y^{\mathrm{tar}}_{i,l}(m,\omega)$, to subjective assessment results [63]. The overall PESQ is the mean over the $\Lambda$ target-interferer combinations, as $\mathrm{PESQ} = \frac{1}{\Lambda} \sum_{\lambda=1}^{\Lambda} \mathrm{PESQ}_\lambda$.

Another aspect that has to be evaluated in speech signals separated via source separation algorithms is intelligibility. To do so, we employ the extended short-time objective intelligibility (ESTOI) metric [64]. ESTOI is a function of the separated signal $\hat{y}_{i,l}(m,\omega)$ and the clean reference $y^{\mathrm{tar}}_{i,l}(m,\omega)$. The goal of ESTOI is to produce an index (that we name $\mathrm{ESTOI}_\lambda$) that is monotonically related to the intelligibility of $\hat{y}_{i,l}(m,\omega)$ [64]. The overall ESTOI is the mean over the $\Lambda$ target-interferer combinations: $\mathrm{ESTOI} = \frac{1}{\Lambda} \sum_{\lambda=1}^{\Lambda} \mathrm{ESTOI}_\lambda$.

C. Control Masks

Performance bounds are needed to perform a fair evaluation of source separation systems [65]. Reference signals are generated from the mixtures, for comparison with the output of the proposed source separation methods. For the lower bound, random TF masks were applied to the mixture. For the upper bound, we chose to calculate the ideal binary mask $M^{\mathrm{IBM}}_l(m,\omega)$, also known as ORACLE mask [66]. It is generated, for each source $l$, by comparing the $l$-th signal energy $E^{\mathrm{tar}}_l(m,\omega)$, for each TF bin, with respect to the interferers' $E^{\mathrm{int}}_{l'}(m,\omega)$ in the mixture:

$$M^{\mathrm{IBM}}_l(m,\omega) = \begin{cases} 1, & E^{\mathrm{tar}}_l(m,\omega) > E^{\mathrm{int}}_{l'}(m,\omega), \; \forall l' \neq l \\ 0, & \text{otherwise,} \end{cases} \tag{28}$$

where $l'$ refers to a source that is other than $l$. This equation could have also been defined by looking at the source that is louder than the sum of all other sources, instead of the loudest in general. Nevertheless, for our experiments in this article, this would not change the results, since we are focusing on cases where there are only two sources in the mixtures.

D. Source Separation Experiments

The experiments performed were focused on analyzing the source separation performance, employing mixtures composed of two sources ($L = 2$), i.e. target and interferer. These experiments were designed to compare our three novel methods (i.e. IC-MESSL, ER-MESSL and ERIC-MESSL) with the baseline (i.e. MESSL [17]), that models only the direct sound IPD, by calculating the SDR and PESQ scores. Results obtained by applying the ideal masks are also reported as reference.

The maximum number of iterations for the EM algorithm was set, for all the experiments, to 16. The smoothing factor to calculate the IC was set to 0.5. The BRIRs and the utterances introduced in Section VII-A were utilized to create the reverberant mixtures described in Equation (3). Since the BRIRs were recorded having, within the same dataset, constant distance between loudspeakers and listening position, the target-to-interferer ratio (TIR) in the mixture was equal to 0 dB. This choice was made to focus the evaluation on the source separation methods' performance, by avoiding their dependency on the variation in utterance energy and source distance. Furthermore, TIR equal to 0 dB represents a challenging case, where no distinction can be made between target and interferer by looking at their energy levels.

Examples of masks generated by MESSL and the proposed ERIC-MESSL are depicted in Fig. 7. We can observe that differences between the two masks are pronounced. These differences lead the TF representation of the signal separated through ERIC-MESSL to be more similar to the ground-truth target signal, when compared to MESSL's separated signal.

Fig. 7: The top three figures show a zoom into a mixture TF domain absolute value, the related TF masks generated by MESSL, and the TF mask estimated by the proposed ERIC-MESSL. The bottom three figures show the same TF bins of the target signal, the signal separated by MESSL, and ERIC-MESSL, respectively.

For our experiments we used the open-source code of MESSL, where we selected the frequency-dependent parameter modeling option. The tested MESSL model, hence, includes a non-parametric modeling of the "impurities" around the direct sound component. Nevertheless, in MESSL, the early reflection model was not directly defined through parameters. Instead, we drive our system to extract the information related to both direct sound and early reflection. We also use the frequency-dependent parameter modeling (pre-implemented in MESSL) to model the impurities around the estimation.

E. Source Separation Results

TABLE III: SDRs and PESQs obtained by separating the target speech from a two-talker mixture.

| SDR (dB) | Vislab | DWRC | BBC UL | Studio1 | AVG |
|---|---|---|---|---|---|
| Random | 0.43 | 0.61 | 0.96 | 0.06 | 0.49 |
| MESSL [17] | 4.53 | 2.54 | 5.47 | 0.58 | 3.28 |
| IC-MESSL | 4.80 | 2.73 | 5.79 | 0.65 | 3.49 |
| ER-MESSL | 4.98 | 2.68 | 5.67 | 0.67 | 3.50 |
| ERIC-MESSL | 5.14 | 2.70 | 5.89 | 0.75 | 3.62 |
| ORACLE | 6.21 | 5.04 | 6.82 | 0.88 | 4.66 |

| PESQ (MOS) | Vislab | DWRC | BBC UL | Studio1 | AVG |
|---|---|---|---|---|---|
| Random | 1.36 | 1.45 | 1.45 | 1.37 | 1.38 |
| MESSL [17] | 1.96 | 1.93 | 2.06 | 1.82 | 1.94 |
| IC-MESSL | 1.98 | 1.95 | 2.07 | 1.87 | 1.97 |
| ER-MESSL | 2.00 | 1.93 | 2.06 | 1.83 | 1.96 |
| ERIC-MESSL | 2.01 | 1.95 | 2.07 | 1.87 | 1.98 |
| ORACLE | 2.34 | 2.45 | 2.45 | 1.96 | 2.30 |

TABLE IV: ESTOIs obtained by separating the target speech from a two-talker mixture.

| ESTOI | Vislab | DWRC | BBC UL | Studio1 | AVG |
|---|---|---|---|---|---|
| Random | 0.19 | 0.17 | 0.19 | 0.05 | 0.15 |
| MESSL [17] | 0.28 | 0.22 | 0.30 | 0.07 | 0.22 |
| IC-MESSL | 0.29 | 0.23 | 0.31 | 0.07 | 0.23 |
| ER-MESSL | 0.29 | 0.23 | 0.30 | 0.08 | 0.23 |
| ERIC-MESSL | 0.29 | 0.24 | 0.31 | 0.10 | 0.24 |
| ORACLE | 0.34 | 0.29 | 0.36 | 0.10 | 0.27 |

TABLE V: P-values obtained from a paired t-test that compared the SDRs using MESSL, with the SDRs using each of the three proposed methods.

| | Vislab | DWRC | BBC UL | Studio1 | AVG |
|---|---|---|---|---|---|
| IC-MESSL | 0.0 % | 0.0 % | 0.0 % | 7.9 % | 0.0 % |
| ER-MESSL | 0.0 % | 8.6 % | 0.0 % | 12.0 % | 0.0 % |
| ERIC-MESSL | 0.0 % | 68.9 % | 0.0 % | 4.1 % | 0.0 % |

The SDR side of Table III shows that ERIC-MESSL, the proposed source separation method that models both the comb filter and IC, outperforms the baseline (i.e. the MESSL method [17]), when applied to any of the four datasets. Furthermore, it provides better performance if compared to the other proposed methods. However, for the DWRC dataset, the other proposed method IC-MESSL produces the highest SDR. This is due to strong reflections arriving from different directions with respect to the direct sound, which corresponds to a lower impact of the comb filter effect [15]. Observing PESQ in Table III, in general, the two proposed methods that model the IC (i.e. IC-MESSL and ERIC-MESSL) have comparable results, and are both better than the other methods. However, in acoustically controlled environments, such as Vislab, the first reflection direction is initialized more accurately by ISDAR, and the comb filter model performs better, with ERIC-MESSL having a higher PESQ. This shows the importance of an accurate initialization of the GMM parameters. Similar trends are reported in Table IV, where the ESTOIs related to the proposed methods are greater than the baseline. ESTOI results show ERIC-MESSL to be the best proposed method, providing the highest or joint-highest intelligibility for every dataset.

In general, DWRC and Studio1 are more challenging datasets, producing low SDR, PESQ and ESTOI values for every tested method. The reason can be found in Table II: they have low DRRs and narrow AVG-TISAs. Low DRR entails difficulties for each of the algorithms, since the IPD curve, that was described in Fig. 3, is highly distorted by the strong reverberation. At the same time, narrow AVG-TISA affects the overall results, since small angles between target and interferer correspond to small variations between the IPD and ILD cues related to the two signals in the mixture.

Assuming the $\Lambda$ SDR results of each dataset as being normally distributed, the paired t-test was performed to determine whether the results, generated through the three proposed methods, are significantly different from the ones obtained by MESSL. In Table V, the p-values are reported. Each p-value is the probability of observing a difference at least as large as the measured one if the two sets under investigation had the same mean (i.e. a low p-value means that the two sets are statistically different). By looking at the results averaged over all the datasets by comparing every tested sample, with a significance level of 5 %, we can state that the results of IC-MESSL, ER-MESSL, and ERIC-MESSL are statistically different from those of MESSL. Moreover, by looking at each dataset individually, results show that the three proposed methods are statistically different from MESSL in Vislab and BBC UL. However, in DWRC and Studio1 this is valid only for IC-MESSL and ERIC-MESSL, respectively. These results confirm what was already shown in Table III, where the improvement given by IC-MESSL, ER-MESSL, and ERIC-MESSL is, in general, higher in BBC UL and Vislab than in DWRC and Studio1. The statistical significance of the results demonstrates the key point of the manuscript, which is about the importance of considering early reflection information when constructing a source separation model.

For the four datasets, the SDR results can also be reported as a function of the target source location, as shown in Fig. 8. For each target source position, within the dataset, the SDR is calculated by considering each of the corresponding interferer locations. Then, the obtained SDRs are averaged over these interferer positions, leading to one result for each target source location. Due to the cone of confusion, which is well-known for IPD based localization methods [67], it is not possible to discriminate between the IPD of two sources lying at the same lateral angle. Therefore, results are reported in terms of lateral angle, rather than azimuth. Apart from DWRC, the general trend of the results suggests that source separation performs better in situations where the target is frontal to the listener. This situation was, in fact, one of the classical assumptions made to evaluate source separation methods [17]. By reporting results as in Fig. 8, we overcome this assumption. The proposed ERIC-MESSL performs better than the others for almost every position of the target source. For the few positions where it is not the best, either the proposed IC-MESSL or ER-MESSL has higher SDRs. In DWRC, the loudspeaker positioned at 27° stood next to a chest of drawers, that produces scattering. This conflicts with the overall assumption of having reflections with a dominant specular component. Therefore, the localization of the first reflection, for modeling the comb filter, is affected by estimation errors. Similar to 0° in DWRC and 27° in Studio1, for 37° in BBC UL, strong lateral reflections arrive before those from the direct sound direction, making the IC dominate the comb filtering effect [15]. Similar results can be observed in Fig. 9, where the PESQ results are reported as a function of the target source location. It is evident how the proposed ERIC-MESSL, which combines the two proposed models, outperforms, in general, the baseline MESSL [17]. Furthermore, these PESQ results also show what was already observed in Fig. 8 for the SDRs (and discussed above): ERIC-MESSL mainly suffers when early reflections are not completely specular.

Fig. 8: SDRs (dB) obtained by separating a target speech from a two-talker mixture, plotted against lateral angle (Deg) for Vislab, DWRC, BBC UL and Studio1, with curves for MESSL [17], IC-MESSL, ER-MESSL and ERIC-MESSL. These results refer to different target source positions, averaged over every interferer position.

Fig. 9: PESQs (MOS) obtained by separating a target speech from a two-talker mixture, plotted against lateral angle (deg) for Vislab, DWRC, BBC UL and Studio1. These results refer to different target source positions, averaged over every interferer position.

The majority of the setups that we tested had a configuration that produced, as the first reflection, the one corresponding to the floor (i.e. having same azimuth as the direct sound). Nevertheless, in BBC UL, DWRC, and Studio1, there are cases where the first arriving reflection has a different direction of arrival (DOA) than the direct sound (i.e. coming from a lateral wall). The proposed model does not make any assumption regarding the direction of the reflections, however, the condition that better matches the idea behind it (i.e. a strong comb filter effect) is given by the case of direct sound and early reflection coming from the same direction. To better show the strength of the proposed models, in Table VI, we show the results of the experiments by considering only those situations where direct sound and first reflection have the same DOA. These results show that our methods outperform MESSL by a much wider margin than in the overall results in Table III, and ERIC-MESSL is the best.

TABLE VI: SDRs and PESQs obtained by separating the target speech from a two-talker mixture. These results are calculated by considering only recording setups where direct sound and first reflection have same DOA.

| SDR (dB) | DWRC | BBC UL | Studio1 | AVG |
|---|---|---|---|---|
| MESSL [17] | 2.00 | 5.22 | 0.55 | 2.59 |
| IC-MESSL | 2.26 | 5.57 | 0.68 | 2.84 |
| ER-MESSL | 2.43 | 5.60 | 0.80 | 2.94 |
| ERIC-MESSL | 2.70 | 5.80 | 0.87 | 3.12 |

| PESQ (MOS) | DWRC | BBC UL | Studio1 | AVG |
|---|---|---|---|---|
| MESSL [17] | 1.86 | 2.04 | 1.87 | 1.92 |
| IC-MESSL | 1.88 | 2.05 | 2.92 | 1.95 |
| ER-MESSL | 1.86 | 2.06 | 1.92 | 1.95 |
| ERIC-MESSL | 1.88 | 2.07 | 1.95 | 1.97 |

To analyze the effect of separation angle, the source separation performance was calculated with the frontal loudspeaker (0° azimuth) as the target source, and varying the interferer. The results are reported in Fig. 10, as is typical in the literature for source separation [17], [41], [42]. This kind of visualization allows a better understanding of the source separation performance by varying TISA. By observing the results of Vislab and BBC UL (datasets having loudspeaker positions around the listener), the proposed ERIC-MESSL consistently provides the highest performance. However, for the extreme cases of TISA (i.e. 90° in Vislab and 70° in BBC UL), the proposed IC-MESSL performs better. This behavior is best seen in the proposed ER-MESSL results. As for ERIC-MESSL, ER-MESSL is better than IC-MESSL for almost every TISA, apart from the extreme cases (i.e. 90° in Vislab and 70° in BBC UL). Therefore, we can conclude that the comb filter is, on average, more effective than the IC, apart from large TISAs. For both DWRC and Studio1, all the methods show degradation at low TISA. This is a common source separation problem [17]. Studio1 is also confirmed to be problematic, with SDR lower than 1 dB, for every method.

Fig. 10: SDRs (dB) for different interferer positions (angle in Deg), fixing target at 0°, for Vislab, DWRC, BBC UL and Studio1. The black vertical crossed lines refer to ERIC-MESSL, the red circled lines to MESSL [17], the green starred lines to ER-MESSL, and the blue crossed lines to IC-MESSL.

Regarding the overall computational complexity, the average run time, for code run in MATLAB R2014b on an Intel(R) Core(TM) i7-2600 CPU @ 3.40GHz, 16GB RAM PC, is 55 s for ERIC-MESSL and 8 s for MESSL [17]. The parameters are searched within a 7-D space in ERIC-MESSL, making it less efficient than MESSL, where the space was one dimensional.

Early Reflections and Deep Learning. We now evaluate a DNN-based method that is representative of state-of-the-art approaches in speech separation. We modified this reference method to test the key point behind our main work: that the inclusion of early reflection information into source separation methods improves the performance. This test is intended to examine the potential for exploiting this information using a DNN approach, and give a preliminary validation. Further experiments are needed to explore the best way to incorporate early reflection information within DNN architectures for source separation, beyond the present preliminary integration.

The selected pipeline is based on the classic multilayer perceptron (MLP) architecture, as presented in [68]. A similar architecture can be also found in [69]. In our implementation, the MLP has two hidden layers, containing 1024 leaky rectified linear units (ReLU) each. We employed batch normalisation (BN) layers [70] to accelerate convergence, and the Adam optimizer [71] with He initialization [72]. The binary cross-entropy was used as loss function. The mini-batch size was set to 1000.

Recordings from two male speakers and two female speakers in the TIMIT dataset [61] were used for our experiment. For each of these speakers, ten sentences were randomly selected. The binaural mixtures were generated by convolving the randomly chosen utterances with BRIRs recorded in Vislab. The BRIRs used were the ones recorded for the angles at 0°, ±30° and ±60°. To create the mixtures, each of the 4 speakers was combined with the other 3. For each of these 12 combinations, we associated the 10 sentences. In terms of the product rule for counting, this makes a total of 1200 utterance combinations. Regarding the BRIRs, each of the 5 DOAs was combined with the other 4, making a total of 20 combinations. Convolving utterances with BRIRs, we obtain 24000 mixtures: 19200 were randomly selected for training, the rest for testing. These 24000 samples comprising the dataset represent all combinations of the BRIR directions convolved with the individual utterances. A distinct set of direction-utterance samples was used for testing and training, although all directions and some utterances did overlap (but not any specific combination). The performance of the methods tested here would likely decrease when generalizing to new unseen utterances and BRIRs, which is however beyond the scope of the present tests. In fact, as mentioned above, this DNN experiment is to demonstrate that, by adding information about early reflections, a supervised deep learning based source separation method can also be improved, over the case where only the direct sound is considered, as we observed in the main novelty of this article, i.e. the GMM based unsupervised method.

The training was performed by providing the features related to the IPD as input to the network, and matching with the ORACLE masks in output. In both models, the IPD features were calculated through the approach in Sections III and IV. To evaluate the improvement given by the early reflection information, we have trained one model that considers only the direct sound information [68], and a novel one which we propose to also incorporate the early reflections. The ORACLE masks in output to the training stage were generated from Equation (28), by considering $E^{\mathrm{tar}}_l(m,\omega)$ and $E^{\mathrm{int}}_{l'}(m,\omega)$ related to the direct sound for the model used as in [68], and direct sound plus early reflections for our model. This was done by segmenting the related BRIRs through a Hamming window (5 ms, and 30 ms, respectively).

During the test, the masks predicted by the networks are used to separate the sounds, by employing Equation (23). Results are reported in Table VII. There, it is shown how the model containing information about the early reflections offers better performance with respect to the pipeline which considers only direct sound, for every metric (i.e. SDR, PESQ and ESTOI). This has demonstrated the key idea of the manuscript: early reflections carry important information that is helpful for improving the performance of speech separation models, including both unsupervised (e.g. MESSL) and supervised techniques (e.g. DNNs). However, it is important to stress that MESSL [17] and the methods proposed in Section VI are unsupervised techniques, hence do not need any labeling. Therefore, it is inappropriate to directly compare the results in Table VII with those in Tables III and IV.

TABLE VII: Evaluation results for the deep learning based methods over Vislab, in terms of SDR, PESQ and ESTOI.

| | SDR | PESQ | ESTOI |
|---|---|---|---|
| Direct sound information | 8.33 | 2.51 | 0.70 |
| Direct sound and early reflection info | 8.80 | 2.59 | 0.73 |

VIII. CONCLUSION

Two room properties (i.e. early reflections and late reverberation) have been modeled for source separation. Depending on whether they are modeled individually or together, three novel source separation methods have been proposed: ER-MESSL, that models the comb filter effect; IC-MESSL, that models the IC; ERIC-MESSL, that combines the two models together.

Experiments were performed by recording four reverberant environments, and comparing the source separation performance of the proposed methods with MESSL's [17]. In general, the proposed ERIC-MESSL outperforms all the other methods. With respect to MESSL, the improvement given by ERIC-MESSL, averaged over the four tested datasets, is about 10 % for SDR and 2 % for PESQ. It was also shown, by running t-tests, that the ERIC-MESSL results are statistically different from MESSL's. Moreover, this experimental analysis revealed that low DRRs and narrow AVG-TISAs led to a degradation of the results. In addition, results were also observed by varying both the target source and interferer positions. Also in this case, it was consistently observed that ERIC-MESSL is, in general, the better model. We conclude that modeling together the comb filter effect and IC is helpful for improving the performance of classical source separation methods. Furthermore, we have also reported an experiment undertaken by including early reflection information into a DNN based state-of-the-art source separation method. Results showed an improvement for every metric (Table VII), thus confirming the importance of incorporating the early reflection information into both unsupervised and supervised source separation methods.

Future work may be conducted on extending the methods to multichannel arrays of microphones. Furthermore, a combination of audio-visual sensing may be explored, to tackle problematic scenarios where the interferer has a higher level than the target. The proposed models could also be applied to other popular approaches, such as NMF.

ACKNOWLEDGMENTS

This work was supported by the EPSRC Programme Grant S3A: Future Spatial Audio for an Immersive Listener Experience at Home (EP/L000539/1) and BBC as part of the BBC Audio Research Partnership. The authors would like to thank the reviewers and the associate editor for their helpful comments to improve the article.

REFERENCES

[1] A. Sutin, B. Bunin, N. Sedunov, L. Fillinger, M. Tsionskiv, and M. Bruno, "Stevens passive acoustic system for underwater surveillance," in Proc. of the International WaterSide Security Conference, Carrara, Italy, 2010.
[2] M. Ungureanu, C. Bigan, R. Strungaru, and V. Lazarescu, "Independent component analysis applied in biomedical signal processing," Measurement Science Review, vol. 4, no. 2, pp. 1–8, 2004.
[3] A. Tonazzini, E. Salerno, and L. Bedini, "Fast correction of bleed-through distortion in grayscale documents by a blind source separation technique," International Journal of Document Analysis, vol. 10, no. 1, pp. 17–25, 2007.
[4] N. Mohammadiha, P. Smaragdis, and A. Leijon, “Supervised and unsupervised speech enhancement using nonnegative matrix factorization,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 10, pp. 2140–2151, 2013.
[5] M. A. Akeroyd, J. Chambers, D. Bullock, A. R. Palmer, and A. Q. Summerfield, “The binaural performance of a cross-talk cancellation system with matched or mismatched setup and playback acoustics,” J. Acoustical Society of America, vol. 121, no. 2, pp. 1056–1069, 2007.
[6] J. Li, L. Deng, Y. Gong, and R. Haeb-Umbach, “An overview of noise-robust automatic speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 4, pp. 745–777,
[7] E. W. Healy, S. E. Yoho, Y. Wang, and D. Wang, “An algorithm to improve speech recognition in noise for hearing-impaired listeners,” J. Acoustical Society of America, vol. 134, no. 4, pp. 3029–3038, 2013.
[8] C. Crocco, M. Cristiani, A. Trucco, and V. Murino, “Audio surveillance: a systematic review,” ACM Computing Surveys, vol. 48, no. 4, pp. 52:1–52:46, 2016.
[9] Q. Liu, W. Wang, P. J. B. Jackson, and T. J. Cox, “A source separation evaluation method in object-based spatial audio,” in Proc. of the 23rd European Signal Processing Conference (EUSIPCO), Nice, France,
[10] K. Kinoshita, M. Delcroix, S. Gannot, E. A. P. Habets, R. Haeb-Umbach, W. Kellermann, V. Leutnant, R. Maas, T. Nakatani, B. Raj, A. Sehr, and T. Yoshioka, “A summary of the REVERB challenge: state-of-the-art and remaining challenges in reverberant speech processing research,” EURASIP J. on Advances in Signal Processing, vol. 2016, no. 1, pp. 7:1–7:19, 2016.
[11] H. Kuttruff, Room Acoustics - Fifth edition, Spon press, 2009.
[12] B. Blesser, “An interdisciplinary synthesis of reverberation viewpoints,” J. Audio Engineering Society, vol. 49, no. 10, pp. 867–903, 2001.
[13] V. Välimäki, J. A. Parker, L. Savioja, J. O. Smith, and J. S. Abel, “Fifty years of artificial reverberation,” IEEE Transactions on Audio, Speech and Language Processing, vol. 20, no. 5, pp. 1421–1448, 2012.
[14] M. Barron, “The subjective effects of first reflections in concert halls - the need for lateral reflections,” J. of Sound and Vibration, vol. 15, no. 4, pp. 475–494, 1971.
[15] T. Lokki, J. Pätynen, T. Sakar, S. Siltanen, and L. Savioja, “Engaging concert hall acoustics is made up of temporal envelope preserving reflections,” J. Acoustical Society of America Express Letters, vol. 129, no. 6, pp. EL223–EL228, 2011.
[16] E. Vincent, N. Bertin, R. Gribonval, and F. Bimbot, “From blind to guided audio source separation,” IEEE Signal Processing Magazine, vol. 31, no. 3, pp. 107–115, 2014.
[17] M. I. Mandel, R. J. Weiss, and D. P. W. Ellis, “Model-based expectation maximization source separation and localization,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 2, pp. 382–394,
[18] S. Bech, “Spatial aspects of reproduced sound in small rooms,” J. Acoustical Society of America, vol. 103, no. 1, pp. 434–445, 1998.
[19] A. Alinaghi, W. Wang, and P. J. B. Jackson, “Spatial and coherence cues based time-frequency masking for binaural reverberant speech separation,” in Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, Canada, 2013.
[20] L. Remaggi, P. J. B. Jackson, P. Coleman, and W. Wang, “Acoustic reflector localization: novel image source reversion and direct localization methods,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 25, no. 2, pp. 296–309, 2017.
[21] J. B. Allen and D. A. Berkley, “Image method for efficiently simulating small-room acoustics,” J. Acoustical Society of America, vol. 65, no. 4, pp. 943–950, 1979.
[22] G-J. Jang and T-W. Lee, “A maximum likelihood approach to single-channel source separation,” J. of Machine Learning Research, vol. 23, pp. 1365–1392, 2003.
[23] M. N. Schmidt and R. K. Olsson, “Single-channel speech separation using sparse non-negative matrix factorization,” in Proc. of Interspeech, Pittsburgh, USA, 2006.
[24] S. Arberet, A. Ozerov, N. Q. K. Duong, E. Vincent, R. Gribonval, F. Bimbot, and P. Vandergheynst, “Nonnegative matrix factorization and spatial covariance model for under-determined reverberant audio source separation,” in Proc. of the 10th International Conference on Information Science, Signal Processing and their Applications (ISSPA), Kuala Lumpur, Malaysia, 2010.
[25] C. Joder, F. Weninger, F. Eyben, D. Virette, and B. Schuller, “Real-time speech separation by semi-supervised nonnegative matrix factorization,” in Latent Variable Analysis and Signal Separation: 10th International Conference (LVA/ICA). Tel Aviv, Israel, 2012, pp. 322–329, Springer Berlin Heidelberg.
[26] P. Smaragdis, C. Févotte, G. J. Mysore, N. Mohammadiha, and M. Hoffman, “Static and dynamic source separation using nonnegative factorizations: A unified view,” IEEE Signal Processing Magazine, vol. 31, no. 3, pp. 66–75, 2014.
[27] H. Sawada, S. Araki, R. Mukai, and S. Makino, “Blind extraction of dominant target sources using ICA and time-frequency masking,” IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 6, pp. 2165–2173, 2006.
[28] A. Ozerov and C. Févotte, “Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation,” IEEE Transactions on Audio, Speech and Language Processing, vol. 18, no. 3, pp. 550–563,
[29] M. Souden, S. Araki, K. Kinoshita, T. Nakatani, and H. Sawada, “A multichannel MMSE-based framework for speech source separation and noise reduction,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 9, pp. 1913–1928, 2013.
[30] L. Wang, J. D. Reiss, and A. Cavallaro, “Over-determined source separation and localization using distributed microphones,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 24, no. 9, pp. 1573–1588, 2016.
[31] S. Gannot, E. Vincent, S. Markovich-Golan, and A. Ozerov, “A consolidated perspective on multimicrophone speech enhancement and source separation,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 25, no. 4, pp. 692–730, 2017.
[32] H. Saruwatari, T. Kawamura, T. Nishikawa, A. Lee, and K. Shikano, “Blind source separation based on a fast-convergence algorithm combining ICA and beamforming,” IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 6, pp. 2165–2173, 2006.
[33] I. Dokmanić, R. Scheibler, and M. Vetterli, “Raking the cocktail party,” IEEE J. of Selected Topics in Signal Processing, vol. 9, no. 5, pp. 825–836, 2015.
[34] P.-S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, “Joint optimization of masks and deep recurrent neural networks for monaural source separation,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 23, no. 12, pp. 2136–2147, 2015.
[35] A. A. Nugraha, A. Liutkus, and E. Vincent, “Multichannel audio source separation with deep neural networks,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 24, no. 9, pp. 1652–1664,
[36] X.-L. Zhang and D. L. Wang, “A deep ensemble learning method for monaural speech separation,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 24, no. 5, pp. 967–977, 2016.
[37] J. Du, Y. Tu, L-R. Dai, and C.-H. Lee, “A regression approach to single-channel speech separation via high-resolution deep neural networks,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 24, no. 8, pp. 1424–1437, 2016.
[38] Y. Wang, J. Du, L.-R. Dai, and C.-H. Lee, “A gender mixture detection approach to unsupervised single-channel speech separation based on deep neural networks,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 25, no. 7, pp. 1535–1546, 2017.
[39] D. Wang, “Time-frequency masking for speech separation and its potential for hearing aid design,” Trends in Amplification, vol. 12, no. 4, pp. 332–353, 2008.
[40] P. M. Hofman and J. Van Opstal, “Spectro-temporal factors in two-dimensional human sound localization,” J. Acoustical Society of America, vol. 103, no. 5, pp. 2634–2648, 1998.
[41] H. Sawada, S. Araki, and S. Makino, “Underdetermined convolutive blind source separation via frequency bin-wise clustering and permutation alignment,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 3, pp. 516–527, 2011.
[42] A. Alinaghi, P. J. B. Jackson, Q. Liu, and W. Wang, “Joint mixing vector and binaural model based stereo source separation,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 22, no. 9, pp. 1434–1448, 2014.
[43] A. Deleforge, F. Forbes, and R. Horaud, “Acoustic space learning for sound-source separation and localization on binaural manifolds,” International Journal of Neural Systems, vol. 25, no. 1, 2015.
[44] C. Hummersone, R. Mason, and T. Brookes, “Dynamic precedence effect modeling for source separation in reverberant environments,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 7, pp. 1867–1871, 2010.
[45] Y. Huang, J. Benesty, and J. Chen, “A blind channel identification-based two-stage approach to separation and dereverberation of speech signals in a reverberant environment,” IEEE Transactions on Audio, Speech and Language Processing, vol. 13, no. 5, pp. 882–895, 2005.
[46] F. Nesta and M. Omologo, “Convolutive underdetermined source separation through weighted interleaved ICA and spatio-temporal source correlation,” in Latent Variable Analysis and Signal Separation: 10th International Conference (LVA/ICA). Tel Aviv, Israel, 2012, pp. 222–230, Springer Berlin Heidelberg.
[47] S. Makino, H. Sawada, and T. W. Lee, Blind Speech Separation, Springer, 2007.
[48] A. Asaei, M. Golbabaee, H. Bourlard, and V. Cevher, “Structured sparsity models for reverberant speech separation,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 22, no. 30, pp. 620–633, 2014.
[49] R. Scheibler, D. Di Carlo, A. Deleforge, and I. Dokmanić, “Separake: Source separation with a little help from echoes,” arXiv: CoRR, vol. abs/1711.06805, 2017.
[50] E. Vincent, T. Virtanen, and S. Gannot, Audio source separation and speech enhancement, John Wiley & Sons, Ltd, 2018.
[51] G. J. Brown and M. Cook, “Computational auditory scene analysis,” Computer Speech and Language, vol. 8, pp. 297–336, 1994.
[52] J.-M. Valin, F. Michaud, and J. Rouat, “Robust localization and tracking of simultaneous moving sound sources using beamforming and particle filtering,” Robotics and Autonomous Systems, vol. 55, no. 1, pp. 216–228, 2007.
[53] S. M. Naqvi, M. Yu, and J. A. Chambers, “A multimodal approach to blind source separation of moving sources,” IEEE Journal of Selected Topics in Signal Processing, vol. 4, no. 5, pp. 895–910, 2010.
[54] C. Faller and J. Merimaa, “Source localization in complex listening situations: Selection of binaural cues based on interaural coherence,” The Journal of the Acoustical Society of America, vol. 116, no. 5, pp. 3075–3089, 2004.
[55] M. Jeub, M. Schäfer, T. Esch, and P. Vary, “Model-based dereverberation preserving binaural cues,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 7, pp. 1732–1745, 2010.
[56] P. Aarabi, “Self-localizing dynamic microphone arrays,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 32, no. 4, pp. 474–484, 2002.
[57] H. Kim, L. Remaggi, P. J. B. Jackson, F. M. Fazi, and A. Hilton, “3D room geometry reconstruction using audio-visual sensors,” in Proc. of the Conference on 3D Vision (3DV), Qingdao, China, 2017.
[58] B. D. VanVeen and K. M. Buckley, “Beamforming: a versatile approach to spatial filtering,” IEEE Acoustic, Speech and Signal Processing Magazine, vol. 5, no. 2, pp. 4–24, 1988.
[59] A. Farina, “Simultaneous measurement of impulse response and distortion with a swept-sine technique,” in Proc. of the 108th Audio Engineering Society Convention (AES), Paris, France, 2000.
[60] P. Zahorik, “Direct-to-reverberant energy ratio sensitivity,” J. Acoustical Society of America, vol. 112, no. 5, Pt. 1, pp. 2110–2117, 2002.
[61] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallet, and N. L. Dahlgren, “DARPA TIMIT acoustic phonetic continuous speech corpus CDROM,” Tech. Rep., NIST Interagency, 1993.
[62] E. Vincent, R. Gribonval, and C. Févotte, “Performance measurement in blind audio source separation,” IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.
[63] P. C. Loizou, Speech Enhancement: Theory and Practice - Second Edition, CRC Press, 2013.
[64] J. Jensen and C. H. Taal, “An algorithm for predicting the intelligibility of speech masked by modulated noise maskers,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 24, no. 11, pp. 2009–2022, 2016.
[65] E. Vincent, S. Araki, F. Theis, G. Nolte, P. Bofill, H. Sawada, A. Ozerov, V. Gowreesunker, D. Lutter, and N. Q. K. Duong, “The signal separation evaluation campaign (2007-2010): achievements and remaining challenges,” Signal Processing, vol. 92, no. 8, pp. 1928–1936, 2012.
[66] D. Wang, “On ideal binary mask as the computational goal of auditory scene analysis,” in Speech Separation by Humans and Machines, P. Divenyi, Ed., chapter 12, pp. 181–197. Kluwer Academic, 2005.
[67] E. M. Wenzel, M. Arruda, D. J. Kistler, and F. L. Wightman, “Localization using nonindividualized head-related transfer functions,” J. Acoustical Society of America, vol. 94, no. 1, pp. 111–123, 1993.
[68] Q. Liu, Y. Xu, P. J. B. Jackson, W. Wang, and P. Coleman, “Iterative deep neural networks for speaker-independent binaural blind speech separation,” in Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, Canada, 2018.
[69] Z.-Q. Wang and D. Wang, “On spatial features for supervised speech separation and its application to beamforming and robust ASR,” in Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Calgary, Canada, 2018.
[70] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proc. of the International Conference on Machine Learning, Lille, France, 2015.
[71] D. P. Kingma and J. L. Ba, “ADAM: A method for stochastic optimization,” in Proc. of the International Conference on Learning Representations (ICLR), San Diego, USA, 2015.
[72] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: surpassing human-level performance on imagenet classification,” in Proc. of the International Conference on Computer Vision (ICCV), Santiago, Chile, 2015.

Luca Remaggi is Audio Research Engineer at Creative Labs, UK, working on cutting edge spatial audio products. Between 2017 and 2019, he was Research Fellow at the Centre for Vision, Speech and Signal Processing, University of Surrey, UK, where he also pursued his PhD, in 2017. His research interest was to investigate the multipath sound propagation combining acoustic and visual data, for applications in spatial audio and source separation. He received the B.Sc. and M.E. degrees in Electronic Engineering from Università Politecnica delle Marche, Italy, in 2009 and 2012, respectively. During his M.E., he was an intern at the Department of Signal Processing and Acoustics, Aalto University, Finland, where he focused on the sound synthesis of musical instruments.

Philip Jackson is Reader in Machine Audition at the Centre for Vision, Speech & Signal Processing (CVSSP, University of Surrey, UK) with MA in Engineering (Cambridge University, UK) and PhD in Electronic Engineering (University of Southampton, UK). His broad interests in acoustical signals have led to research contributions in sound field control, modeling speech articulation, acoustics and recognition, in audio-visual perception, blind source separation, and spatial audio reverberation, capture, reproduction and quality evaluation [h-index 22; Google Scholar: bit.ly/2oTRw1C]. He led one of four research streams on object-based spatial audio in the S3A programme grant funded in the UK by EPSRC, and enjoys listening.

Wenwu Wang (M’02–SM’11) was born in Anhui, China. He received the B.Sc. degree in 1997, the M.E. degree in 2000, and the Ph.D. degree in 2002, all from Harbin Engineering University, China. He then worked in King’s College London (2002-2003), Cardiff University (2004-2005), Tao Group Ltd. (now Antix Labs Ltd.) (2005-2006), and Creative Labs (2006-2007), before joining University of Surrey, UK, in May 2007, where he is currently a Professor in Signal Processing and Machine Learning, and a Co-Director of the Machine Audition Lab within the Centre for Vision Speech and Signal Processing. He was a Visiting Scholar at Ohio State University, USA, in 2008. He has been a Guest Professor on Machine Perception at Qingdao University of Science and Technology, China, since 2018. His current research interests include blind signal processing, sparse signal processing, audio-visual signal processing, machine learning and perception, artificial intelligence, machine audition (listening), and statistical anomaly detection. He has (co)-authored over 250 publications in these areas. He and his team have won the Best Paper Award on LVA/ICA 2018, the Best Oral Presentation on FSDM 2016, the Top Paper Award in IEEE ICME 2015, Best Student Paper Award shortlists on IEEE ICASSP 2019 and LVA/ICA 2010. His papers are among the Most Downloaded Papers in IEEE/ACM Transactions on Audio Speech and Language Processing in 2018 and 2019, and Featured Articles in IEEE Transactions on Signal Processing 2013. As a team member, he achieved the 2nd place (among 23 teams) in the DCASE 2019 Challenge Sound event localization and detection, the 3rd place (among 558 submitted systems) in the 2018 Kaggle Challenge “Free-sound general purpose audio tagging”, the 1st place (among 35 submitted systems) in the 2017 DCASE Challenge on “Large-scale weakly supervised sound event detection for smart cars”, the TVB Europe Award for Best Achievement in Sound in 2016 and the finalist for GooglePlay Best VR Experience in 2017, and the Best Solution Award on the Dstl Challenge “Under-sampled signal recognition” in 2012. He is a Senior Area Editor (2019-) for IEEE Transactions on Signal Processing and an Associate Editor (2019-) for EURASIP Journal on Audio Speech and Music Processing. He was an Associate Editor (2014-2018) for IEEE Transactions on Signal Processing. He was a Publication Co-Chair for ICASSP 2019, Brighton, UK.

Electrical Engineering and Systems Science, arXiv (Cornell University)

Modeling the Comb Filter Effect and Interaural Coherence for Binaural Source Separation

ISSN
2329-9290
eISSN
ARCH-3348
DOI
10.1109/TASLP.2019.2946043

Abstract

IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, IEEE COPYRIGHT

Modeling the Comb Filter Effect and Interaural Coherence for Binaural Source Separation

Luca Remaggi, Philip J. B. Jackson, Wenwu Wang, Senior Member IEEE

The authors are with the Centre for Vision, Speech and Signal Processing, University of Surrey, Guildford, GU2 7XH, UK. W. Wang is also with Qingdao University of Science and Technology, China. Emails: [l.remaggi, p.jackson, w.wang]@surrey.ac.uk.

arXiv:1910.02127v1 [cs.SD] 4 Oct 2019

Abstract—Typical methods for binaural source separation consider only the direct sound as the target signal in a mixture. However, in most scenarios, this assumption limits the source separation performance. It is well known that the early reflections interact with the direct sound, producing acoustic effects at the listening position, e.g. the so-called comb filter effect. In this article, we propose a novel source separation model, that utilizes both the direct sound and the first early reflection information to model the comb filter effect. This is done by observing the interaural phase difference obtained from the time-frequency representation of binaural mixtures. Furthermore, a method is proposed to model the interaural coherence of the signals. Including information related to the sound multipath propagation, the performance of the proposed separation method is improved with respect to the baselines that did not use such information, as illustrated by using binaural recordings made in four rooms, having different sizes and reverberation times.

Index Terms—Source separation, comb filter effect, RIRs, IPD, ILD, binaural audio, multipath propagation, interaural coherence.

I. INTRODUCTION

Source separation is one of the most investigated fields in the signal processing community. Several application areas can benefit from it. For instance, it can improve target detection performance of passive sonar systems [1]. In biomedical engineering, source separation is often used to analyze electrocardiograms, electroencephalograms, or magnetic resonance images [2]. Work on ancient document restoration has utilized source separation for correcting bleed-through distortion [3]. Source separation has also been used in a large range of speech applications. For instance, it is used for improving speech enhancement [4], crosstalk cancellation [5], and automatic speech recognition systems [6]. It can also be applied to improve hearing aids [7], or improve security systems [8]. Spatial audio can also rely on it, to produce object-based audio [9]. Robust speech processing is another target area [10].

In typical conditions, a sound produced by a source interacts with its environment during propagation, before it reaches a listening position. This multipath propagation is defined by its room impulse response (RIR), i.e. an acoustic signal describing the propagation of sound from source to listening position. RIRs have three parts: direct sound, early reflections, and late reverberation [11]. The direct sound carries information related to the source. Late reverberation provides clues about the size of the environment, without directional information [12]. Instead, early reflections affect the human sound perception, by conveying a directional sense of the geometry of the environment [13]. This generates auditory effects, for instance modifying the source width perception [14]. Moreover, being coherent with the direct sound, strong early reflections modify the perceived sound coloration, by generating a comb filter effect [15]. Hence, acoustic multipath properties should be considered in the design of source separation methods [16].

Many different approaches can be found in the literature to tackle the source separation problem. However, most of them do not explicitly model the acoustic multipath properties. For instance, in the well-known Model-based Expectation Maximization Source Separation and Localization (MESSL) method [17] only the direct sound interaural cues (i.e. the interaural phase difference (IPD) and interaural level difference (ILD)) were modeled, without considering any early reflection effect. Furthermore, although a garbage source was defined to indirectly deal with the late reverberation, there was not any formal attempt to model the reverb.

The aim of this article is to investigate how information related to early reflections can improve source separation methods, in general. Such information can be potentially used in many source separation methods, either unsupervised or supervised. Here, we selected MESSL [17] as a baseline method due to its unsupervised nature, and the convenience in incorporating the early reflections information into its IPD model. We extended MESSL [17], by emulating the comb filter effect produced by the early reflections. To do so, we define parametric functions in the time-frequency (TF) domain, and model the behavior of the IPD, by considering the interaction between the direct sound and the first arriving early reflection. The first reflection is chosen to be included into the model as it is the one that most affects the spatial cues [18]. Similar to MESSL, we also use an ILD model, which considers the direct sound cue, and the garbage source.

In addition to the comb filter effect, we propose a model that separates the reverberation’s effect from the rest of the RIR’s. This is done by approximating the human capability of separating sounds in reverberant environments. Specifically, we model the interaural coherence (IC) of individual sources in the mixture, similar to what was introduced in [19]. However, there, the target source was assumed to be in front of the listener. Here, we propose an approach that is not limited by this, but works for any target source position.

The main novelties of this article include:
• a new IPD model, considering both direct sound and first reflection, to approximate the comb filter effect;
• an extension of the MESSL IPD model, employing the target signal IC;
• an additional novel source separation method, obtained by combining the two new models above;
• the application of a source and image source localization algorithm to initialize the expectation maximization (EM) algorithm used to estimate the Gaussian mixture model (GMM) parameters, and one deep-learning approach using an MLP architecture with two hidden layers to generate the TF mask.

Since the novel IPD model approximates the early reflection information, the first new pipeline is named as Early Reflection MESSL (ER-MESSL). The second novel pipeline uses the IC of the estimated target signal, hence, its name is IC-MESSL. By combining the new IPD model with the IC based model, we obtain the third proposed method, thus named as ERIC-MESSL. Finally, there is need for the employed EM algorithm to be initialized. Since our proposed methods combine the direct sound and first reflection information, we employ our Image Source Direction and Ranging (ISDAR) [20] to initialize it, by localizing the target source and related image source [21]. A comparative evaluation of early and late models is performed and reported as additional contribution. The challenging two source binaural speech mixture scenario was analyzed, by employing signal and perceptual objective measures. In the experimental section, we also evaluate the improvement given by considering early reflection information in a state-of-the-art deep learning based method, for supervised speech separation. Through this, we further demonstrate that early reflection information improves source separation methods’ performance, including deep learning, and that this can be potentially applied to many approaches in the literature.

The overall structure of this article is as follows: in Section II, related source separation methods are discussed; Section III defines the theoretical foundations of the proposed approach. In Sections IV and V, the proposed interaural cue models for the comb filter and IC are presented, respectively. Section VI describes the source separation algorithm. In Section VII, the experiments are described, with related results and discussion. Finally, Section VIII draws the conclusion.

II. RELATED WORK IN SPEECH SOURCE SEPARATION

Many approaches can be found in the literature to tackle the source separation problem. Some of them exploit a-priori information about basis functions representing the signals in the mixture [22]. Others employ the non-negative matrix factorization (NMF) to learn sparse representation of speech sources [23–26]. The independent component analysis (ICA) [27] is also used to decompose the mixture into independent signals, by projecting the mixtures into different domains. Scenarios where multiple microphones are available were also investigated [28–31], e.g. using beamformers [32], [33]. Recently, deep neural networks (DNNs) became widely popular, when large training datasets are available [34–38].

TF masking is a popular approach, which assigns different weights to the mixture, in the TF domain [39]. In [17], the authors presented the MESSL method which uses binaural signals. Two interaural cues were exploited, i.e. the ILD and the IPD, relating the azimuthal sound direction of arrival (DOA) to the head orientation [40]. The method presented in [41] utilized, instead, the so called mixing vector (MV). For each frequency bin, this vector contains the time invariant frequency response component of the room. In both [17] and [41], the probability of each TF point belonging to a specific source in the mixture was determined. From this probability, TF masks were generated. In [42], the two methods proposed in [17] and [41] were combined, constructing a probability distribution that takes into account the three cues ILD, IPD and MV. In [43], a high-dimensional vector, constructed by combining the IPD and ILD cues, was projected onto a 2D space, represented by the sound azimuth and elevation DOA. A regression approach located the sources, and estimated the TF masks. The IC cue was then employed in [44].

In the literature, yet few works can be found that consider both direct sound and early reflections. In [45], the source separation problem was divided into different procedures, by applying deconvolution to each individual reflection. However, the performance degrades with low signal-to-noise ratio (SNR) conditions. In [46], a variation of the ICA method [47] was used to estimate the time-dependent mixing system, considering the multipath propagation. However, with the ICA approach, the effect of its classical permutation problem was exacerbated by the incorrect RIR components’ alignment. Deconvolution of the received signals was proposed in [48], by employing simulated RIRs. These RIRs were estimated by matching the temporal support of recorded ones. Nevertheless, binaural effects, such as head shadowing and pinnae influence, were not considered. Multichannel microphone arrays were used in [33], where beamformers were designed to have their directivity patterns characterized by multiple beams, to simultaneously extract direct sound and early reflections. Results show improvement with respect to classical beamforming. However, they were tested only with simulated RIRs. The work in [49] demonstrated the benefit of including reflection information in source separation models, by employing a NMF approach. Nevertheless, only simulated RIRs were employed.

In this article, we consider the first arriving early reflection and related direct sound, to propose a binaural model that increases the robustness in reverberant environments, by estimating TF masks. It is based on [17], nevertheless, the proposed model could be potentially adapted to work with other methods described above, from beamformers to DNNs.

III. BACKGROUND DEFINITIONS

In this section, we provide a general overview of the adopted approach, and discuss the assumptions. The definitions of the general elements of the proposed architecture (e.g. binaural RIRs (BRIRs) and interaural spectrograms) are also given.

A. General Overview of the Proposed Method

Classical source separation methods exploit features related to the direct sound to separate the target sound from a mixture. In [17], the authors presented one of the first models to deal with the reverberation, by proposing the “garbage” source. In this article, we model two perceptual effects: the comb filter and IC. Through the former we aim to model the first early reflection, in a constructive fashion, to enhance the sound produced by the target speaker. The latter models the reverberation, by aiding the garbage source in suppressing it.

Fig. 1: Example of an ideal BRIR, zoomed into its direct sound (blue) and first reflection (red) components (depicted as Dirac pulses). The top figure shows the RIR related to sensor i = 1, whereas the bottom one the RIR at sensor i = 2. The amplitudes and delays are defined in Equation (2).

Fig. 2: Schematic representation of the comb filter effect created for the two received sounds ($y_1(n)$ and $y_2(n)$), given the sound produced at the l-th source $x_l(n)$. The direct sounds and reflections, together with the related delays and attenuation factors (B), are the same as those defined in Fig. 1.

B. Proposed Method Assumptions

In the proposed source separation method, assumptions were made, defining its scientific boundaries as follows:
• The number of sources L is known a-priori;
• Source signals are sparse in the TF domain;
• The mixing system is time invariant;
• The first reflection has a dominant specular component;
• Sources are sufficiently far from the reflectors;
• The first early reflection is coherent with the direct sound.

Although L has to be known a-priori, there is no restriction on it with respect to the number of microphones M, thus, the method can be also applied to underdetermined scenarios. Sparsity over the TF domain corresponds to the assumption of having, for each TF bin, only one of the sources dominating the mixture. Sources and microphones are assumed to be static within a static environment, i.e. the mixing system is time invariant. Where the first reflection has a dominant specular component, it is detected from RIRs to initialize the EM re-estimation. The sources have to be distant enough from the reflectors, in order to have the first reflection arriving between 5 ms and 40 ms later than the direct sound. Finally, the assumption of coherence between the first reflection and direct sound allows them to be modeled as a comb filter. The later reflections, having a more stochastic nature, are assumed to be incoherent and modeled through the IC, with the reverb.

C. Binaural Room Impulse Response

A RIR is a signal that characterizes the acoustics of an environment with respect to source and sensor positions. RIRs that are recorded by microphones in the ear canals of a dummy head are usually known as BRIRs.

where $i \in [1, 2] \subset \mathbb{N}$ and $l$ are the microphone and source indexes, respectively; $n$ is the discrete time index, $T$ indicates the last early reflection, and $w_{i,l}(n)$ represents the late reverberation, whereas $e$ is the reflection index ($e = 0$ indicates the direct sound). $h_{e,i,l}$ is a function describing the reflection. $n_{e,i,l}$ represents the reflection times of arrival (TOAs). Following the assumption of having dominant specular components, the early reflections are approximated by Dirac deltas $\delta(n)$ of different amplitudes $P_{e,i,l}$. For source separation purpose, we consider the direct sound and first reflection components (i.e. $e = \{0, 1\}$) (see Fig. 1):

$$
\begin{aligned}
h_{0,1,l}(n) &= P_{0,1,l}\,\delta(n - n_{0,1,l}), \\
h_{1,1,l}(n) &= P_{1,1,l}\,\delta(n - n_{1,1,l}), \\
h_{0,2,l}(n) &= P_{0,2,l}\,\delta(n - n_{0,2,l}), \\
h_{1,2,l}(n) &= P_{1,2,l}\,\delta(n - n_{1,2,l}).
\end{aligned}
\qquad (2)
$$

D. Comb Filter and Interaural Coherence

In environments where the first reflection is delayed between 5 ms and 40 ms with respect to the direct sound, the coloration of the sound perceived is different from the one produced [14]. In signal processing, the superimposition of a signal with its delayed version is the result of comb filtering the signal, hence, we model this perceptual effect as a comb filter effect (see Fig. 2).

Reverberation is a diffuse component of the RIR that makes source separation more challenging by smearing the target signal, both temporally and spatially. Thus it is useful for robust separation to suppress it. With spaced microphones, reverberation signals are decorrelated above a certain frequency [50]. With binaural microphones, IC measures the correlation between the two signals, hence we use it to model the reverberation.

E. Interaural Spectrogram

Following the definition of BRIR in Equation (1), the
They are defined as: mixtures received at the i-th sensor can be written as: T L X X I (n) = h (n n ) + w (n); (1) y (n) = x (n) I (n) w (n); (3) i;l e;i;l e;i;l i;l i l i;l i;l e=0 l=1 Left Channel Right Channel IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, IEEE COPYRIGHT 4 Fig. 4: On the left the IPD function for a mixture of two Fig. 3: The figure on the left shows the IPD as a function of sources is shown. On the right, our comb filter based ER- frequency for a single source convolved with an ideal BRIR MESSL IPD model (the fluctuating red curve) is employed to formed by only direct sound and first reflection. On the right, fit one of the two sources in the same IPD function. the same IPD function is simultaneously fitted by the MESSL IPD model [17] (the straight green line), and our comb filter based ER-MESSL IPD model (the fluctuating red curve). only the direct sound information was used [17]. By assuming ideal BRIRs as formed by direct sound and first reflection (see Fig. 1), the two channel frequency responses are: where x (n) is the signal generated by the l-th source, w (n) l i;l is the convolutive white Gaussian noise, L is the number of I (!) = P exp[j!n ] + P exp[j!n ]); 1;l 0;1;l 0;1;l 1;1;l 1;1;l sources, and “” is the convolution operator. Since the human I (!) = P exp[j!n ] + P exp[j!n ]): 2;l 0;2;l 0;2;l 1;2;l 1;2;l auditory system analyzes the received mixtures in the TF do- (6) main [51], we use the the short-time Fourier transform (STFT) Their ratio is the interaural frequency response model: to calculate the TF representation of y (n): I (!) 1;l I (!) = = L l I (!) 2;l y (m; !) = x (m; !)I (!)w (m; !); (4) i l i;l i P + P exp[j!(n n )] 0;1;l 1;1;l 1;1;l 0;1;l l=1 P exp[j!(n n )] + P exp[j!(n n )] 0;2;l 0;2;l 0;1;l 1;2;l 1;2;l 0;1;l where m is the discrete time frame index, whereas ! is the (7) ang angular frequency. I (!) 
is not time dependent, by assuming i;l ^ The phase of this equation, denoted as I (!), corresponds the mixing system to be time-invariant. Considering binaural to the proposed IPD model, and it is one of the main novelties systems, the interaural spectrogram is defined as [17]: of this article. For the l-th source, the difference between the IPD ILD y (m; !) observed IPD  (m; !) and its model is the phase residual: IS (m;!)=20 IPD y (m; !) = = 10 exp[j (m; !)]; y (m; !) ang 2 IPD IPD (m; !;C ) =  (m; !) I (!;C ); (8) l l l l l (5) ILD IPD where (m; !) and  (m; !) are the ILD and IPD of that is wrapped into the interval [ ); and: the observation, respectively, and j = 1. DS DF ST C = [n ; n ; n ; P ; P ; P ; P ]; (9) l 0;1;l 1;1;l 0;2;l 1;2;l l l l IV. M ODELING THE C OMB F ILTER EFFECT DS DF where n = n n , n = n n , and 0;2;l 0;1;l 1;1;l 0;1;l l l The IPD and ILD cues can be modeled to generate proba- ST n = n n . An example of the IPD model fitting 1;2;l 1;1;l bility distributions for identifying the dominant source, given an ideal IPD observation is shown in Fig. 3, together with each TF bin. The novel IPD model that approximates the comb a visual comparison of the MESSL IPD model [17]. The filter effect is proposed in this section. Furthermore, the ILD ideal IPD observation was obtained from a synthetic BRIR model (that was presented in [17]) is described. Finally, these composed of only direct sound and first reflection. From this two are combined into a joint probability distribution. figure, it is clear that our proposed ER-MESSL IPD model In the proposed model (as in MESSL [17]), sound sources fits the observed data better than MESSL, by considering the are assumed to be spatially quasi-static: they have to be static comb filter effect. In Fig. 4, we also report the IPD function within the time interval under investigation. 
Nonetheless, as related to a mixture of two sources, generated using recorded a potential extension for future work, one could employ a BRIRs. The two sources’ contributions are well visible from tracking system, that would provide the model with updated the figure on the left, as two linear patterns having opposite time delays (i.e. n ). Using audio only, beamformers could e;i;l gradients. From the figure on the right, it is also visible that be used to estimate constantly the DOAs of the direct sound our proposed ER-MESSL model fits one of the two sources. and early reflections. Alternatively, one could track sources by ILD The ILD cue, (m; !), is modeled, similar to [17], by employing a particle filter [52], or a multimodal approach [53]. considering directly the frequency-dependent BRIR, as: I (!) 1;l ILD A. Interaural Level and Phase Differences a (!) = 20 log ; (10) l 10 I (!) 2;l The proposed IPD model is defined to match the behavior of the observed IPD and is different from previous work where where “jj” indicates the absolute value. IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, IEEE COPYRIGHT 5 B. Interaural Cue Probability Distributions The values of (m; !) are constrained between 0 and 1, 1;2 thus, (m; !) is employed as the TF soft mask that models 1;2 For the ILD cue, the probability of each TF bin being associ- the IC. To do so, it will be used as prior mask during the ated to source l can be written as a Gaussian distribution [42]: posterior probability calculation, that will be described in ILD ILD ILD ILD p( (m; !)jl) = N ( (m; !)j (!);  (!)); l l Section VI-B . (m; !) is computed from the observation 1;2 (11) by employing the equations defined in [55]. ILD ILD where  (!) is the mean, and  (!) is the variance. l l The aim of modeling the IC is to suppress remaining early Regarding the IPD cue, a top-down approach is used to reflections and late reverberation, i.e. the BRIR parts that are IPD wrap the signal phase between  [17].  
(m; !;C ) is l not modeled by the comb filter. A similar approach to calculate modeled by a Gaussian distribution: an IC based TF mask was employed in [19]. However, there, IPD ^ the target source was assumed to be in front of the listener. p( (m; !)jl;C ) = (12) Here, we do not make any assumption regarding the position IPD IPD IPD = N ( (m; !;C )j (!;C );  (!;C )); l l l l l of the target source. Its position is estimated by ISDAR, the IPD IPD algorithm described later, in Section VI-C. Having the target where  (!;C ) and  (!;C ) are the IPD distribution l l l l source position, we then calculate (m; !) by analyzing the 1;2 mean and variance, respectively. BRIR related to the estimated DOA. To sum up, by assuming the IPD and ILD observations as being conditionally independent given their related parameters, B. The Garbage Source their probability distributions can be combined as: Late reflections and reverberation are problematic compo- ILD IPD p( (m; !);  (m; !)jl;C ) = nents of the acoustics that are undesiderable in the comb-filter (13) ILD IPD model, proposed in Section IV, as their first-order statistics = N ( (m; !);  (m; !;C )j ); l l are unreliable. Hence, the IC model described above is used 2 2 ILD ILD IPD IPD where  = f (!);  (!);  (!;C );  (!;C )g. l l l l l l l to suppress these components of the BRIRs by consideration This probability distribution identifies the proposed comb of their second-order statistics. In addition to this, we utilize a filter model, that was conceived to approximate the interaction garbage source, as in [17]. It represents noise dominating the between the received direct sound and first early reflection, TF bins that are not claimed by any of the other sources. i.e. two strongly coherent signals. 
This model does not take The parameters  used to model the garbage source are the into account either later reflections or reverberation, which same as those used by the other sources to define the distribu- are, in this article, dealt by the IC model. tion in Equation (13). The difference is the initialization, since the garbage source is used to model the noise sources, such V. M ODELING THE INTERAURAL COHERENCE as background noise, measurement noise, and reverberation. To suppress reverberation, the idea is to identify those areas VI. SOURCE SEPARATION M ODEL REESTIMATION in the TF domain that are dominated by the direct sound, and The EM is described here, along with the log-likelihood the strong early reflections. The direct sound and a strong used to optimize the parameters of the proposed models. reflection recorded at the two ears are highly correlated and coherent. In contrast, the late reverberation is diffuse, and does A. Parameter Estimation from Mixtures not present correlation between the binaural signals, at every The parameters characterizing the interaural cue probability frequency. Thus, we use the IC to create a probability mask, models are = f ; ; g, where is the marginal based on the coherence level, for every TF bin [19]. l l l;C l;C l l class membership, described as the joint probability of each TF bin being dominated by source l with the IPD model A. Interaural Coherence TF Mask parameters C : = p(l;C ). These parameters can be l l;C l The process we employed to calculate the IC of a signal estimated for a specific source l. This is a trivial problem follows an approach that was originally proposed in [54], upon the availability of the dominant source information for for dereverberation. For each TF bin, the auto-power spectral each TF bin. However, whether the source l is dominating a density of the two channels i = f1; 2g is calculated as: specific TF bin is not directly observable from the mixtures. (m; !) =  (m 1; !) 
+ (1 )jy (m; !)j ; (14) On the other hand, l can be inferred from the interaural cues i i i and observed models, that are not known a-priori. This missing where 0    1 is a smoothing factor determined as data problem is solved by the EM algorithm. = 1=(  f ), with  = 10 ms being a time constant and f s s The log-likelihood of the observations can be then defined the sampling frequency [55]. The cross-power spectral density as in [17], however, with the additional IC distribution: between the two channels is: ILD IPD L( ) = [log p( (m; !);  (m; !);j ) + log (m; !)] 1;2 (m; !) =  (m 1; !) + (1 )y (m; !)y (m; !); 1;2 1;2 1 m;! X X (15) ILD IPD = log p( (m; !)jl)p( (m; !)jl;C ) (m; !): l;C l 1;2 with [] indicating the complex conjugate operation. From m;! l;Cl (17) (14) and (15), the magnitude squared coherence is: (m; !) 1;2 This has been implemented using the MESSL open source code’s option (m; !) = : (16) 1;2 allowing the definition of prior masks: https://github.com/mim/messl. (m; !) (m; !) 1 2 IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, IEEE COPYRIGHT 6 This definition assumes that the IC, IPD and ILD cues are Since the proposed IPD model in ER-MESSL and ERIC- independent. As a result, the joint probability is written MESSL is composed of seven parameters C (Equation (9)), as the product of individual probabilities. In addition, the it involves a seven dimensional space when trying to find number of sources must be specified a-priori [17]. Note that the best combination of them, hence it is computationally the inclusion of the IC into the log-likelihood function is expensive. Therefore, the amplitudes P are fixed; only the e;i;l different from previous approaches, such as [19]. There, the IC initialized value is allowed. The time-dependent parameters’ mask was multiplied by the TF representation of the mixture. allowed ranges were found empirically, as in Table I. Equation (17) represents the proposed ERIC-MESSL. C. Model Initialization B. 
Expectation-Maximization (EM) The initialization part plays a crucial role for the EM The EM algorithm is used to estimate the parameters and algorithm performance, since the log-likelihood is not convex. probability at each TF bin. (m; !jl) is considered as a A poor initialization leads to local maxima, thus affecting the 1;2 prior, and not updated during the iterations. During the E-step, source separation results. The estimated source and image the occupation likelihood of source l with parameters C is source positions are used to initialize the time-dependent ILD IPD DF DS ST calculated for each TF bin, given (m; !) and  (m; !): parameters n , n and n . Instead, the amplitudes P , 0;1;l l l l P , P , P are initialized by analyzing the BRIR that ILD 1;1;l 0;2;l 1;2;l (m; !jC ) = p( (m; !)jl) l l l;C is related to the estimated DOA. Therefore, the early reflection (18) IPD p( (m; !)jl;C )p( (m; !)jl): l 1;2 information is not pre-estimated, but found and refined by the proposed system at each iteration. The microphone array is This expectation is then used in the M-step, to re-estimate only used to initialize the EM algorithm. the parameters, and maximize the likelihood. The ILD param- In [17], only the direct sound was used to model the source, eters are updated as [42]: P and the parameters were initialized by using the GCC-PHAT ILD (m; !) (m; !jC ) l l m;C ILD l algorithm [56]. In our proposed method, correct localization (!) = ; (m; !jC ) l l of the first reflection is also crucial. Source and image source m;C ILD positions are estimated through our ISDAR method [20]. (19) (!) = P P This method relies on RIRs recorded via a multichannel ILD ILD 2 ( (m; !)  (!))  (m; !jC ) l l m l C l microphone array, placed at the same listener position. We P ; (m; !jC ) chose this since, to our knowledge, no method in the literature l l m;C can reliably localize reflections, given binaural recordings. 
whereas the IPD residual parameters are updated as: However, other kinds of approaches could be also employed, (m; !jC ) (m; !jC ) for instance, audio-visual based methods [57]. l l l l IPD m (!jC ) = ; l l ISDAR is based on spherical coordinates. Direct sound (m; !jC ) l l and reflection TOAs n ^ are estimated through the clus- IPD e;i;l (20) (!jC ) = tered dynamic programming projected phase-slope algorithm IPD 2 ( (m; !jC )  (!jC ))  (m; !jC ) l l l l l m l (C-DYPSA), that we proposed in [20], whereas azimuth P : (m; !jC ) l l DOAs  are estimated through the delay-and-sum beam- m e;l former [20], [58]. Considering the listener at the center of the Also the marginal class membership is updated: coordinate system, the radial distances of the source and image 1 M source are calculated as  = (n ^ c ), where c is =  (m; !jC ); (21) e;l e;i;l 0 0 l;C l l M i=1 the sound speed, and n ^ is either the estimated direct sound m;! e;i;l (e = 0) or first reflection (e = 1) TOA. The source and image where B is the total number of TF bins. source positions in the Cartesian coordinate system are given The model parameters that are found during the last EM by b =  cos  and b =  sin  . Knowing x;e;l e;l e;l y;e;l e;l e;l iteration are selected as the final estimation. Probabilistic the listener position, these values are converted into TDOAs masks are generated by marginalizing over the estimated C : to populate Equation (9). The amplitudes P are calculated e;i;l M (m; !) =  (m; !jC ): (22) by directly analyzing the BRIRs at the reflection TOA n ^ . l l l e;i;l C Regarding the ILD distribution, the value of the ILD prior mean is estimated by utilizing a set of synthetic binaural RIRs, The separated source signal l can finally be obtained as: as in [17]. The garbage source is initialized to have a uniform y ^ (m; !) = y (m; !)M (m; !); 8m; 8!: (23) i;l i l distribution across IPD, and a uniform ILD distribution with zero mean for all frequencies. 
The seven interaural model parameters defined in C are treated in the EM as hidden variables. Specifically, they are VII. E XPERIM ENTS AND RESULTS modeled as discrete random variables, where the sets of allowed values are specified a-priori, as in [17]. The param- In this section, the results of a set of experiments are eters in C are not internally updated by the EM algorithm. described. In these experiments, we consider mixtures of Instead, every allowed value combination is tested [17]. The speech signals in four different recorded environments. When combination that maximizes the log-likelihood is then chosen. only the IC is modeled, and MESSL is used to model only IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, IEEE COPYRIGHT 7 TABLE I: Range sizes for the allowed values around the initialized IPD model parameters. Vislab DWRC BBC UL Studio1 DF DS ST n , n , n 0:13 ms 0:13 ms 0:19 ms 0:31 ms l l l TABLE II: Recorded room RT60s, averaged over the octave bands between 500 Hz and 4 kHz, DRRs, and TISAs, averaged over all the tested combinations. L is the number of TOT loudspeakers. The loudspeaker positions are reported as lateral angles with respect to the dummy head orientation. Vislab DWRC BBC UL Studio1 RT60 (s) 0:32 0:27 0:28 0:94 DRR (dB) 17:8 3:9 15:7 6:0 AVG TISA (Deg) 75 37 71 32 TOT L 7 3 5 3 0; 30, 0; 37, Lateral angles (Deg) 0; 27 0; 27 60; 90 110 and bi-circular array were recorded separately, to avoid inter- ference effects. All the recordings were made by employing the swept-sine technique [59], with f = 48 kHz. Arrangements. Two further measures characterize the Fig. 5: Plan views of the four recorded rooms. The red datasets: the direct to reverberant ratio (DRR) [60], and circles represent the position of the dummy head, whereas the average target-interferer separation angle (AVG-TISA). the loudspeakers are depicted using their stylized symbol. 
These will allow a more comprehensive discussion over the separation performance achieved. DRR is calculated as the ratio between the energy carried by the direct sound and the direct sound, the proposed method is named as IC-MESSL. the rest of the BRIR. AVG-TISA is the mean lateral angle When the comb filter effect is modeled, extending MESSL in separating the target source from the interferer, considering that sense, without considering any prior knowledge regarding all the possible target-interferer combinations. DRR and AVG- the IC, the proposed method is ER-MESSL. Otherwise, if both TISA characterizing the four datasets are reported in Table II, the comb filter and the IC are modeled, the novel method together with the related RT60s, and DRRs. is named as ERIC-MESSL. The three proposed methods are Rooms. Vislab was an acoustically treated room at the compared to MESSL [17]. The ranges of allowed parameters University of Surrey, where the “Surrey Sound Sphere”, having for the comb filter model are in Table I, for each dataset. radius of 1.68 m, was assembled. The loudspeakers were At the end of this section, we also show that other separation clamped on the sphere equator. The dummy head employed algorithms would benefit from the inclusion of early reflection was the Cortex Manikin Mk2 Binaural Head and Torso Sim- information. We extend a deep learning based state-of-the- ulator. Both dummy head and bi-circular microphone array art method. Different from MESSL, which is an unsupervised were placed at the sound sphere center. method, the deep learning approach is used to demonstrate that DWRC is furnished as a living room-like area. Its acoustics improvements can be achieved also for supervised methods. are representative of typical domestic living rooms. A Cortex Manikin Mk2 Binaural Head and Torso Simulator sat on a A. Datasets sofa. The bi-circular array was positioned right behind it. 
BRIRs were recorded in four rooms, characterized with dif- BBC UL is a room at the BBC R&D center, in Salford, ferent size and reverberation time (RT60). The four rooms are UK. Similar to DWRC, it is furnished to resemble a typical named as “Vislab”, “Digital World Research Centre” (DWRC), living room environment. A Neumann KU100 dummy head “BBC Usability Laboratory” (BBC UL), and “Studio1”. Their was positioned on an armchair and the bi-circular array of plan views are shown in Fig. 5, whereas the RT60s are in microphones was separately measured at the same position. Table II, together with the number of loudspeaker positions Since the RT60s related to the three already introduced L and their lateral angles. Two different dummy heads TOT rooms were similar, an additional room was chosen: Studio1, were employed (i.e. a Cortex Manikin Mk2 Binaural Head a large recording studio at the University of Surrey. A Cortex and Torso Simulator and a Neumann KU100 dummy head), Manikin Mk2 Binaural Head and Torso Simulator was used as depending on their availability for the recordings. To obtain dummy head. The loudspeaker positions were selected to have data for the initialization, a 48-channel bi-circular array with their height similar to the dummy head’s. The microphone a typical microphone spacing of 21 mm and an aperture of array was positioned about 2 m far from the dummy head. 212 mm was utilized to record RIRs [20] . The dummy head Therefore, the image source positions found by this array were first manually modified, according to the dummy head posi- Available at http://cvssp.org/data/s3a, DOI: 10.15126/surreydata.00844867 DOIs: 10.15126/surreydata.00812228 and 10.15126/surreydata.00808465 tion, before being used to initialize the EM. Depending on the IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, IEEE COPYRIGHT 8 1.00 the related BRIR direct sound. This is also used for the other Left Channel Left Channel performance metrics, described below. 
To extract the direct Right Channel Right Channel 0.75 Reflection sound component from the BRIRs, we truncated them by using Reflection Reflection a Hamming window, centered at the direct sound TOA. 0.50 Reflection The perceptual evaluation of speech quality (PESQ) has 0.25 been widely employed to evaluate processed speech qual- ity [63]. This is related to the Mean Opinion Score (MOS) 0.00 of human subjective assessments, therefore, the PESQ unit 5 10 15 20 25 5.0 7.5 10.0 12.5 15.0 Time (ms) Time (ms) of measure is MOS. Before proceeding with the PESQ value tar calculation, y ^ (m; !) and y (m; !) are aligned in time, i;l i;l Fig. 6: Two BRIR absolute values, for a frontal source, zoomed in terms of amplitudes and delays, by employing Wiener into their direct sound and first reflection. On the left, reflection filters [63]. Through two parameters that model symmetric is generated by the floor, thus it arrives at the two ears and asymmetric disturbances, a parametric function is then simultaneously; on the right, reflection arrives from a lateral employed, mapping the differences between the processed wall, thus there is a difference in TOAs and amplitudes. tar version of y ^ (m; !) and y (m; !), to subjective assessment i;l i;l results [63]. The overall PESQ is the mean over the  target- interferer combinations, as PESQ = PESQ . loudspeaker-microphone positions in each room, reflections =1 Another aspect that has to be evaluated in speech signals are generated from either the floor or lateral walls. Examples separated via source separation algorithms is intelligibility. of RIRs for these two cases are depicted in Fig. 6. To do so, we employ the extended short-time objective in- The Utterances. Fifteen utterances, of 3 s length, were ran- telligibility (ESTOI) metric [64]. ESTOI is a function of the domly selected from the TIMIT acoustic-phonetic continuous tar separated signal y ^ (m; !) and the clean reference y (m; !). i;l speech corpus [61]. 
For each combination of target source i;l The goal of ESTOI is to produce an index (that we name as and interferer(s), U = 15 random combinations of the fifteen ESTOI ) that is monotonically related to the intelligibility of utterances were selected and tested. Therefore, the number of y ^ (m; !) [64]. The overall ESTOI is the mean over the i;l mixtures generated and tested for each dataset is: target-interferer combinations: ESTOI = ESTOI . TOT  =1 = U; (24) C. Control Masks where the symbol “()” represents the binomial coefficient, L TOT Performance bounds are needed to perform a fair evalu- is the number of sources in the mixture, and L is the total ation of source separation systems [65]. Reference signals number of loudspeaker positions available in the dataset. The are generated from the mixtures, for comparison with the utterances were normalized before applying the convolutions output of the proposed source separation methods. For the to have the same root mean square energy. lower bound, random TF masks were applied to the mixture. For the upper bound, we chose to calculate the ideal binary B. Evaluation Metrics IBM mask M (m; !), also known as ORACLE mask [66]. It The source to distortion ratio (SDR) metric is based on sig- is generated, for each source l, by comparing the l-th signal nal energy ratios, thus, is typically reported in dB. Following tar energy E (n; !), for each TF bin, with respect to the Equation (4), the ideal target signal l, that arrives at channel int interferers’ E 0 (m; !) in the mixture: i free from any interference and noise, can be defined as: tar int 0 1; E (n; !) > E (m; !); 8l 6= l tar IBM l l y (m; !) = x (m; !)I (!): (25) l i;l M (m; !) = i;l 0; otherwise. Hence, the source y ^ (m; !), separated by a source separation i;l (28) method as in Equation (23), can be decomposed as [62]: where l is referred to a source that is other than l. This equation could have also been defined by looking at the source tar y ^ (m; !) 
$= y^{\mathrm{tar}}_{i,l}(m,\omega) + E_{\mathrm{interf}} + E_{\mathrm{noise}} + E_{\mathrm{artif}}, \qquad (26)$

where $E_{\mathrm{interf}}$ is the interference error term, $E_{\mathrm{noise}}$ the noise error term, and $E_{\mathrm{artif}}$ the error introduced by general artifacts. We chose the SDR, since it takes all three error terms into account [62]:

$\mathrm{SDR} = 10 \log \dfrac{\| y^{\mathrm{tar}}_{i,l}(m,\omega) \|}{\| E_{\mathrm{interf}} + E_{\mathrm{noise}} + E_{\mathrm{artif}} \|}, \qquad (27)$

where $\|\cdot\|$ represents the Euclidean norm operator. Once the SDR for each of the tested combinations of sources is obtained, the overall result for the dataset is calculated as their mean over the tested mixture index. As clean reference, we employed the target utterance convolved with

that is louder than the sum of all other sources, instead of the loudest in general. Nevertheless, for our experiments in this article, this would not change the results, since we are focusing on cases where there are only two sources in the mixtures.

D. Source Separation Experiments

The experiments performed were focused on analyzing the source separation performance, employing mixtures composed of two sources (L = 2), i.e. target and interferer. These experiments were designed to compare our three novel methods (i.e. IC-MESSL, ER-MESSL and ERIC-MESSL) with the baseline (i.e. MESSL [17]), which models only the direct sound IPD, by calculating the SDR and PESQ scores. Results obtained by applying the ideal masks are also reported as reference. The maximum number of iterations for the EM algorithm was set, for all the experiments, to 16. The smoothing factor used to calculate the IC was set to 0.5. The BRIRs and the utterances introduced in Section VII-A were utilized to create the reverberant mixtures described in Equation (3). Since the BRIRs were recorded with, within the same dataset, a constant distance between loudspeakers and listening position, the target-to-interferer ratio (TIR) in the mixture was equal to 0 dB. This choice was made to focus the evaluation on the source separation methods’ performance, by avoiding their dependency on the variation in utterance energy and source distance. Furthermore, a TIR equal to 0 dB represents a challenging case, where no distinction can be made between target and interferer by looking at their energy levels.

Examples of masks generated by MESSL and the proposed ERIC-MESSL are depicted in Fig. 7. We can observe that the differences between the two masks are pronounced. These differences make the TF representation of the signal separated through ERIC-MESSL more similar to the ground-truth target signal than MESSL’s separated signal.

Fig. 7: The top three figures show a zoom into a mixture TF domain absolute value, the related TF masks generated by MESSL, and the TF mask estimated by the proposed ERIC-MESSL. The bottom three figures show the same TF bins of the target signal, the signal separated by MESSL, and ERIC-MESSL, respectively.

For our experiments we used the open-source code of MESSL, where we selected the frequency-dependent parameter modeling option. The tested MESSL model, hence, includes a non-parametric modeling of the “impurities” around the direct sound component. Nevertheless, in MESSL, the early reflection model was not directly defined through parameters. Instead, we drive our system to extract the information related to both direct sound and early reflection. We also use the frequency-dependent parameter modeling (pre-implemented in MESSL) to model the impurities around the estimation.

E. Source Separation Results

The SDR side of Table III shows that ERIC-MESSL, the proposed source separation method that models both the comb filter and IC, outperforms the baseline (i.e. the MESSL method [17]), when applied to any of the four datasets. Furthermore, it provides better performance if compared to the other proposed methods. However, for the DWRC dataset, the other proposed method IC-MESSL produces the highest SDR. This is due to strong reflections arriving from different directions with respect to the direct sound, which corresponds to a lower impact of the comb filter effect [15]. Observing PESQ in Table III, in general, the two proposed methods that model the IC (i.e. IC-MESSL and ERIC-MESSL) have comparable results, and are both better than the other methods. However, in acoustically controlled environments, such as Vislab, the first reflection direction is initialized more accurately by ISDAR, and the comb filter model performs better, with ERIC-MESSL having a higher PESQ. This shows the importance of an accurate initialization of the GMM parameters. Similar trends are reported in Table IV, where the ESTOIs related to the proposed methods are greater than the baseline. ESTOI results show ERIC-MESSL to be the best proposed method, providing a greater intelligibility for every dataset.

TABLE III: SDRs (left) and PESQs (right) obtained by separating the target speech from a two-talker mixture.

| SDR (dB) | Vislab | DWRC | BBC UL | Studio1 | AVG |
|---|---|---|---|---|---|
| Random | 0.43 | 0.61 | 0.96 | 0.06 | 0.49 |
| MESSL [17] | 4.53 | 2.54 | 5.47 | 0.58 | 3.28 |
| IC-MESSL | 4.80 | 2.73 | 5.79 | 0.65 | 3.49 |
| ER-MESSL | 4.98 | 2.68 | 5.67 | 0.67 | 3.50 |
| ERIC-MESSL | 5.14 | 2.70 | 5.89 | 0.75 | 3.62 |
| ORACLE | 6.21 | 5.04 | 6.82 | 0.88 | 4.66 |

| PESQ (MOS) | Vislab | DWRC | BBC UL | Studio1 | AVG |
|---|---|---|---|---|---|
| Random | 1.36 | 1.45 | 1.45 | 1.37 | 1.38 |
| MESSL [17] | 1.96 | 1.93 | 2.06 | 1.82 | 1.94 |
| IC-MESSL | 1.98 | 1.95 | 2.07 | 1.87 | 1.97 |
| ER-MESSL | 2.00 | 1.93 | 2.06 | 1.83 | 1.96 |
| ERIC-MESSL | 2.01 | 1.95 | 2.07 | 1.87 | 1.98 |
| ORACLE | 2.34 | 2.45 | 2.45 | 1.96 | 2.30 |

TABLE IV: ESTOIs obtained by separating the target speech from a two-talker mixture.

| ESTOI | Vislab | DWRC | BBC UL | Studio1 | AVG |
|---|---|---|---|---|---|
| Random | 0.19 | 0.17 | 0.19 | 0.05 | 0.15 |
| MESSL [17] | 0.28 | 0.22 | 0.30 | 0.07 | 0.22 |
| IC-MESSL | 0.29 | 0.23 | 0.31 | 0.07 | 0.23 |
| ER-MESSL | 0.29 | 0.23 | 0.30 | 0.08 | 0.23 |
| ERIC-MESSL | 0.29 | 0.24 | 0.31 | 0.10 | 0.24 |
| ORACLE | 0.34 | 0.29 | 0.36 | 0.10 | 0.27 |

In general, DWRC and Studio1 are more challenging datasets, producing low SDR, PESQ and ESTOI values for every tested method. The reason can be found in Table II: they have low DRRs and narrow AVG-TISAs. Low DRR entails difficulties for each of the algorithms, since the IPD curve, which was described in Fig. 3, is highly distorted by the strong reverberation. At the same time, narrow AVG-TISA affects the overall results, since small angles between target and interferer correspond to small variations between the IPD and ILD cues related to the two signals in the mixture.

Assuming the SDR results of each dataset as being normally distributed, the paired t-test was performed to determine whether the results, generated through the three proposed methods, are significantly different from the ones obtained by MESSL. In Table V, the p-values are reported. They represent the probability of observing a difference at least as large as the measured one if the two sets under investigation were not statistically different (i.e. a low p-value means that the two sets are statistically different). By looking at the results averaged over all the datasets by comparing every tested sample, with a significance level of 5 %, we can state that the results of IC-MESSL, ER-MESSL, and ERIC-MESSL are statistically different from those of MESSL. Moreover, by looking at each dataset singularly, results show that the three proposed methods are statistically different from MESSL in Vislab and BBC UL. However, in DWRC and Studio1 this is valid only for IC-MESSL and ERIC-MESSL, respectively. These results confirm what was already shown in Table III, where the improvement given by IC-MESSL, ER-MESSL, and ERIC-MESSL is, in general, higher in BBC UL and Vislab than in DWRC and Studio1. The statistical significance of the results demonstrates the key point of the manuscript, which is about the importance of considering early reflection information when constructing a source separation model.

TABLE V: P-values obtained from a paired t-test that compared the SDRs using MESSL, with the SDRs using each of the three proposed methods.

| P-value (%) | Vislab | DWRC | BBC UL | Studio1 | AVG |
|---|---|---|---|---|---|
| IC-MESSL | 0.0 | 0.0 | 0.0 | 7.9 | 0.0 |
| ER-MESSL | 0.0 | 8.6 | 0.0 | 12.0 | 0.0 |
| ERIC-MESSL | 0.0 | 68.9 | 0.0 | 4.1 | 0.0 |

For the four datasets, the SDR results can also be reported as a function of the target source location, as shown in Fig. 8. For each target source position, within the dataset, the SDR is calculated by considering each of the corresponding interferer locations. Then, the obtained SDRs are averaged over these interferer positions, leading to one result for each target source location. Due to the cone of confusion, which is well-known for IPD based localization methods [67], it is not possible to discriminate between the IPD of two sources lying at the same lateral angle. Therefore, results are reported in terms of lateral angle, rather than azimuth. Apart from DWRC, the general trend of the results suggests that source separation performs better in situations where the target is frontal to the listener. This situation was, in fact, one of the classical assumptions made to evaluate source separation methods [17]. By reporting results as in Fig. 8, we overcome this assumption. The proposed ERIC-MESSL performs better than the others for almost every position of the target source. For the few positions where it is not the best, either the proposed IC-MESSL or ER-MESSL has higher SDRs. In DWRC, the loudspeaker positioned at 27° stood next to a chest of drawers, which produces scattering. This conflicts with the overall assumption of having reflections with a dominant specular component. Therefore, the localization of the first reflection, for modeling the comb filter, is affected by estimation errors. Similar to 0° in DWRC and 27° in Studio1, for 37° in BBC UL, strong lateral reflections arrive before those from the direct sound direction, making the IC dominate the comb filtering effect [15].

Fig. 8: SDRs obtained by separating a target speech from a two-talker mixture. These results refer to different target source positions, averaged over every interferer position. [Panels: Vislab, DWRC, BBC UL, Studio1; x-axis: Angle (Deg); y-axis: SDR (dB); curves: MESSL [17], IC-MESSL, ER-MESSL, ERIC-MESSL.]

Similar results can be observed in Fig. 9, where the PESQ results are reported as a function of the target source location. It is evident how the proposed ERIC-MESSL, which combines the two proposed models, outperforms, in general, the baseline MESSL [17]. Furthermore, these PESQ results also show what was already observed in Fig. 8 for the SDRs (and discussed above): ERIC-MESSL mainly suffers when early reflections are not completely specular.

The majority of the setups that we tested had a certain

TABLE VI: SDRs (left) and PESQs (right) obtained by separating the target speech from a two-talker mixture. These results are calculated by considering only recording setups where direct sound and first reflection have same DOA.
| SDR (dB) | DWRC | BBC UL | Studio1 | AVG |
|---|---|---|---|---|
| MESSL [17] | 2.00 | 5.22 | 0.55 | 2.59 |
| IC-MESSL | 2.26 | 5.57 | 0.68 | 2.84 |
| ER-MESSL | 2.43 | 5.60 | 0.80 | 2.94 |
| ERIC-MESSL | 2.70 | 5.80 | 0.87 | 3.12 |

| PESQ (MOS) | DWRC | BBC UL | Studio1 | AVG |
|---|---|---|---|---|
| MESSL [17] | 1.86 | 2.04 | 1.87 | 1.92 |
| IC-MESSL | 1.88 | 2.05 | 2.92 | 1.95 |
| ER-MESSL | 1.86 | 2.06 | 1.92 | 1.95 |
| ERIC-MESSL | 1.88 | 2.07 | 1.95 | 1.97 |

Fig. 9: PESQs obtained by separating a target speech from a two-talker mixture. These results refer to different target source positions, averaged over every interferer position. [Panels: Vislab, DWRC, BBC UL, Studio1; x-axis: Angle (deg); y-axis: PESQ (MOS); curves: MESSL [17], IC-MESSL, ER-MESSL, ERIC-MESSL.]

configuration that produced, as the first reflection, the one corresponding to the floor (i.e. having same azimuth as the direct sound). Nevertheless, in BBC UL, DWRC, and Studio1, there are cases where the first arriving reflection has a different direction of arrival (DOA) than the direct sound (i.e. coming from a lateral wall). The proposed model does not make any assumption regarding the direction of the reflections. However, the condition that better matches the idea behind it (i.e. a strong comb filter effect) is given by the case of direct sound and early reflection coming from the same direction. To better show the strength of the proposed models, in Table VI, we show the results of the experiments by considering only those situations where direct sound and first reflection have the same DOA. These results show that our methods outperform MESSL by a much wider margin than in the overall results in Table III, and ERIC-MESSL is the best.

To analyze the effect of separation angle, the source separation performance was calculated with the frontal loudspeaker (0° azimuth) as the target source, and varying the interferer. The results are reported in Fig. 10, as is typical in the literature for source separation [17], [41], [42]. This kind of visualization allows a better understanding of the source separation performance by varying TISA. By observing the results of Vislab and BBC UL (datasets having loudspeaker positions around the listener), the proposed ERIC-MESSL consistently provides the highest performance. However, for the extreme cases of TISA (i.e. 90° in Vislab and 70° in BBC UL), the proposed IC-MESSL performs better. This behavior is best seen in the proposed ER-MESSL results. As for ERIC-MESSL, ER-MESSL is better than IC-MESSL for almost every TISA, apart from the extreme cases (i.e. 90° in Vislab and 70° in BBC UL). Therefore, we can conclude that the comb filter is, on average, more effective than the IC, apart from large TISAs. For both DWRC and Studio1, all the methods show degradation at low TISA. This is a common source separation problem [17]. Studio1 is also confirmed to be problematic, with SDR lower than 1 dB, for every method.

Fig. 10: SDRs for different interferer positions, fixing target at 0°. The black vertical crossed lines refer to ERIC-MESSL, the red circled lines to MESSL [17], the green starred lines to ER-MESSL, and the blue crossed lines to IC-MESSL. [Panels: Vislab, DWRC, Studio1, BBC UL; x-axis: Angle (Deg); y-axis: SDR (dB).]

Regarding the overall computational complexity, the average run time, for code run in MATLAB R2014b on a PC with an Intel(R) Core(TM) i7-2600 CPU @ 3.40 GHz and 16 GB of RAM, is 55 s for ERIC-MESSL and 8 s for MESSL [17]. The parameters are searched within a 7-D space in ERIC-MESSL, making it less efficient than MESSL, where the space was one dimensional.

Early Reflections and Deep Learning. We now evaluate a DNN-based method that is representative of state-of-the-art approaches in speech separation. We modified this reference method to test the key point behind our main work: that the inclusion of early reflection information into source separation methods improves the performance. This test is intended to examine the potential for exploiting this information using a DNN approach, and give a preliminary validation. Further experiments are needed to explore the best way to incorporate early reflection information within DNN architectures for source separation, beyond the present preliminary integration.

The selected pipeline is based on the classic multilayer perceptron (MLP) architecture, as presented in [68]. A similar architecture can be also found in [69]. In our implementation, the MLP has two hidden layers, containing 1024 leaky rectified linear units (ReLU) each. We employed batch normalisation (BN) layers [70] to accelerate convergence, and the Adam optimizer [71] with He initialization [72]. The binary cross-entropy was used as loss function. The mini-batch size was set to 1000.

Recordings from two male speakers and two female speakers in the TIMIT dataset [61] were used for our experiment. For each of these speakers, ten sentences were randomly selected. The binaural mixtures were generated by convolving the randomly chosen utterances with BRIRs recorded in Vislab. The BRIRs used were the ones recorded for the angles at 0°, ±30° and ±60°. To create the mixtures, each of the 4 speakers was combined with the other 3. For each of these 12 combinations, we associated the 10 sentences of each speaker. In terms of the product rule for counting, this makes a total of 12 × 10 × 10 = 1200 utterance combinations. Regarding the BRIRs, each of the 5 DOAs was combined with the other 4, making a total of 20 combinations. Convolving utterances with BRIRs, we obtain 1200 × 20 = 24000 mixtures: 19200 were randomly selected for training, the rest for testing. These 24000 samples comprising the dataset represent all combinations of the BRIR directions convolved with the individual utterances. A distinct set of direction-utterance samples was used for testing and training, although all directions and some utterances did overlap (but not any specific combination). The performance of the methods tested here would likely decrease when generalizing to new unseen utterances and BRIRs, which is however beyond the scope of the present tests. In fact, as mentioned above, this DNN experiment is to demonstrate that, by adding information about early reflections, a supervised deep learning based source separation method can also be improved, over the case where only the direct sound is considered, as we observed in the main novelty of this article, i.e. the GMM based unsupervised method.

The training was performed by providing the features related to the IPD as input to the network, and matching with the ORACLE masks in output. In both models, the IPD features were calculated through the approach in Sections III and IV. To evaluate the improvement given by the early reflection information, we have trained one model that considers only the direct sound information [68], and a novel one which we propose to also incorporate the early reflections. The ORACLE masks in output to the training stage were generated from Equation (28), by considering $E^{\mathrm{tar}}_{l}(n,\omega)$ and $E^{\mathrm{int}}_{l}(n,\omega)$ related to the direct sound for the model used as in [68], and direct sound plus early reflections for our model. This was done by segmenting the related BRIRs through a Hamming window (5 ms, and 30 ms, respectively).

During the test, the masks predicted by the networks are used to separate the sounds, by employing Equation (23). Results are reported in Table VII. There, it is shown how the model containing information about the early reflections offers better performance with respect to the pipeline which considers only direct sound, for every metric (i.e. SDR, PESQ and ESTOI). This has demonstrated the key idea of the manuscript: early reflections carry important information that is helpful for improving the performance of speech separation models, including both unsupervised (e.g. MESSL) and supervised techniques (e.g. DNNs). However, it is important to stress that MESSL [17] and the methods proposed in Section VI are unsupervised techniques, hence do not need any labeling. Therefore, it is inappropriate to directly compare the results in Table VII with those in Tables III and IV.

TABLE VII: Evaluation results for the deep learning based methods over Vislab, in terms of SDR, PESQ and ESTOI.

| Method | SDR | PESQ | ESTOI |
|---|---|---|---|
| Direct sound information | 8.33 | 2.51 | 0.70 |
| Direct sound and early reflection info | 8.80 | 2.59 | 0.73 |

VIII. CONCLUSION

Two room properties (i.e. early reflections and late reverberation) have been modeled for source separation. Depending on whether they are modeled individually or together, three novel source separation methods have been proposed: ER-MESSL, which models the comb filter effect; IC-MESSL, which models the IC; ERIC-MESSL, which combines the two models together.

Experiments were performed by recording four reverberant environments, and comparing the source separation performance of the proposed methods with MESSL’s [17]. In general, the proposed ERIC-MESSL outperforms all the other methods. With respect to MESSL, the improvement given by ERIC-MESSL, averaged over the four tested datasets, is about 10 % for SDR and 2 % for PESQ. It was also shown, by running t-tests, that the ERIC-MESSL results are statistically different from MESSL’s. Moreover, this experimental analysis revealed that low DRRs and narrow AVG-TISAs led to a degradation of the results. In addition, results were also observed by varying both the target source and interferer positions. Also in this case, it was consistently observed that ERIC-MESSL is, in general, the best model. We conclude that modeling together the comb filter effect and IC is helpful for improving the performance of classical source separation methods. Furthermore, we have also reported an experiment undertaken by including early reflection information into a DNN based state-of-the-art source separation method. Results showed a great improvement, thus confirming the importance of incorporating the early reflection information into both unsupervised and supervised source separation methods.

Future work may be conducted on extending the methods to multichannel arrays of microphones. Furthermore, a combination of audio-visual sensing may be explored, to tackle problematic scenarios where the interferer has a higher level than the target. The proposed models could also be applied to other popular approaches, such as NMF.

ACKNOWLEDGMENTS

This work was supported by the EPSRC Programme Grant S3A: Future Spatial Audio for an Immersive Listener Experience at Home (EP/L000539/1) and BBC as part of the BBC Audio Research Partnership. The authors would like to thank the reviewers and the associate editor for their helpful comments to improve the article.

REFERENCES

[1] A. Sutin, B. Bunin, N. Sedunov, L. Fillinger, M. Tsionskiv, and M. Bruno, “Stevens passive acoustic system for underwater surveillance,” in Proc. of the International WaterSide Security Conference, Carrara, Italy, 2010.
[2] M. Ungureanu, C. Bigan, R. Strungaru, and V. Lazarescu, “Independent component analysis applied in biomedical signal processing,” Measurement Science Review, vol. 4, no. 2, pp. 1–8, 2004.
[3] A. Tonazzini, E. Salerno, and L. Bedini, “Fast correction of bleed-through distortion in grayscale documents by a blind source separation technique,” International Journal of Document Analysis, vol. 10, no. 1, pp. 17–25, 2007.
[4] N. Mohammadiha, P. Smaragdis, and A. Leijon, “Supervised and unsupervised speech enhancement using nonnegative matrix factorization,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 10, pp. 2140–2151, 2013.
[5] M. A. Akeroyd, J. Chambers, D. Bullock, A. R. Palmer, and A. Q. Summerfield, “The binaural performance of a cross-talk cancellation system with matched or mismatched setup and playback acoustics,” J. Acoustical Society of America, vol. 121, no. 2, pp. 1056–1069, 2007.
[6] J. Li, L. Deng, Y. Gong, and R. Haeb-Umbach, “An overview of noise-robust automatic speech recognition,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 4, pp. 745–777.
[7] E. W. Healy, S. E. Yoho, Y. Wang, and D. Wang, “An algorithm to improve speech recognition in noise for hearing-impaired listeners,” J. Acoustical Society of America, vol. 134, no. 4, pp. 3029–3038, 2013.
[8] C. Crocco, M. Cristiani, A. Trucco, and V. Murino, “Audio surveillance: a systematic review,” ACM Computing Surveys, vol. 48, no. 4, pp. 52:1–52:46, 2016.
[9] Q. Liu, W. Wang, P. J. B. Jackson, and T. J. Cox, “A source separation evaluation method in object-based spatial audio,” in Proc. of the 23rd European Signal Processing Conference (EUSIPCO), Nice, France.
[10] K. Kinoshita, M. Delcroix, S. Gannot, E. A. P. Habets, R. Haeb-Umbach, W. Kellermann, V. Leutnant, R. Maas, T. Nakatani, B. Raj, A. Sehr, and T. Yoshioka, “A summary of the REVERB challenge: state-of-the-art and remaining challenges in reverberant speech processing research,” EURASIP J. on Advances in Signal Processing, vol. 2016, no. 1, pp. 7:1–7:19, 2016.
[11] H. Kuttruff, Room Acoustics - Fifth edition, Spon press, 2009.
[12] B. Blesser, “An interdisciplinary synthesis of reverberation viewpoints,” J. Audio Engineering Society, vol. 49, no. 10, pp. 867–903, 2001.
[13] V. Välimäki, J. A. Parker, L. Savioja, J. O. Smith, and J. S. Abel, “Fifty years of artificial reverberation,” IEEE Transactions on Audio, Speech and Language Processing, vol. 20, no. 5, pp. 1421–1448, 2012.
[14] M. Barron, “The subjective effects of first reflections in concert halls - the need for lateral reflections,” J. of Sound and Vibration, vol. 15, no. 4, pp. 475–494, 1971.
[15] T. Lokki, J. Pätynen, T. Sakar, S. Siltanen, and L. Savioja, “Engaging concert hall acoustics is made up of temporal envelope preserving reflections,” J. Acoustical Society of America Express Letters, vol. 129, no. 6, pp. EL223–EL228, 2011.
[16] E. Vincent, N. Bertin, R. Gribonval, and F. Bimbot, “From blind to guided audio source separation,” IEEE Signal Processing Magazine, vol. 31, no. 3, pp. 107–115, 2014.
[17] M. I. Mandel, R. J. Weiss, and D. P. W. Ellis, “Model-based expectation maximization source separation and localization,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 2, pp. 382–394.
[18] S. Bech, “Spatial aspects of reproduced sound in small rooms,” J. Acoustical Society of America, vol. 103, no. 1, pp. 434–445, 1998.
[19] A. Alinaghi, W. Wang, and P. J. B. Jackson, “Spatial and coherence cues based time-frequency masking for binaural reverberant speech separation,” in Proc. of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), Vancouver, Canada, 2013.
[20] L. Remaggi, P. J. B. Jackson, P. Coleman, and W. Wang, “Acoustic reflector localization: novel image source reversion and direct localization methods,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 25, no. 2, pp. 296–309, 2017.
[21] J. B. Allen and D. A. Berkley, “Image method for efficiently simulating small-room acoustics,” J. Acoustical Society of America, vol. 4, no. 65, pp. 943–950, 1979.
[22] G-J. Jang and T-W. Lee, “A maximum likelihood approach to single-channel source separation,” J. of Machine Learning Research, vol. 23, pp. 1365–1392, 2003.
[23] M. N. Schmidt and R. K. Olsson, “Single-channel speech separation using sparse non-negative matrix factorization,” in Proc. of Interspeech, Pittsburgh, USA, 2006.
[24] S. Arberet, A. Ozerov, N. Q. K. Duong, E. Vincent, R. Gribonval, F. Bimbot, and P. Vandergheynst, “Nonnegative matrix factorization and spatial covariance model for under-determined reverberant audio source separation,” in Proc. of the 10th International Conference on Information Science, Signal Processing and their Applications (ISSPA), Kuala Lumpur, Malaysia, 2010.
[25] C. Joder, F. Weninger, F. Eyben, D. Virette, and B. Schuller, “Real-time speech separation by semi-supervised nonnegative matrix factorization,” in Latent Variable Analysis and Signal Separation: 10th International Conference (LVA/ICA). Tel Aviv, Israel, 2012, pp. 322–329, Springer Berlin Heidelberg.
[26] P. Smaragdis, C. Févotte, G. J. Mysore, N. Mohammadiha, and M. Hoffman, “Static and dynamic source separation using nonnegative factorizations: A unified view,” IEEE Signal Processing Magazine, vol. 31, no. 3, pp. 66–75, 2014.
[27] H. Sawada, S. Araki, R. Mukai, and S. Makino, “Blind extraction of dominant target sources using ICA and time-frequency masking,” IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 6, pp. 2165–2173, 2006.
[28] A. Ozerov and C. Févotte, “Multichannel nonnegative matrix factorization in convolutive mixtures for audio source separation,” IEEE Transactions on Audio, Speech and Language Processing, vol. 18, no. 3, pp. 550–563.
[29] M. Souden, S. Araki, K. Kinoshita, T. Nakatani, and H. Sawada, “A multichannel MMSE-based framework for speech source separation and noise reduction,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 21, no. 9, pp. 1913–1928, 2013.
[30] L. Wang, J. D. Reiss, and A. Cavallaro, “Over-determined source separation and localization using distributed microphones,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 24, no. 9, pp. 1573–1588, 2016.
[31] S. Gannot, E. Vincent, S. Markovich-Golan, and A. Ozerov, “A consolidated perspective on multimicrophone speech enhancement and source separation,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 25, no. 4, pp. 692–730, 2017.
[32] H. Saruwatari, T. Kawamura, T. Nishikawa, A. Lee, and K. Shikano, “Blind source separation based on a fast-convergence algorithm combining ICA and beamforming,” IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 6, pp. 2165–2173, 2006.
[33] I. Dokmanić, R. Scheibler, and M. Vetterli, “Raking the cocktail party,” IEEE J. of Selected Topics in Signal Processing, vol. 9, no. 5, pp. 825–836, 2015.
[34] P.-S. Huang, M. Kim, M. Hasegawa-Johnson, and P. Smaragdis, “Joint optimization of masks and deep recurrent neural networks for monaural source separation,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 23, no. 12, pp. 2136–2147, 2015.
[35] A. A. Nugraha, A. Liutkus, and E. Vincent, “Multichannel audio source separation with deep neural networks,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 24, no. 9, pp. 1652–1664.
[36] X.-L. Zhang and D. L. Wang, “A deep ensemble learning method for monaural speech separation,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 24, no. 5, pp. 967–977, 2016.
[37] J. Du, Y. Tu, L-R. Dai, and C.-H. Lee, “A regression approach to single-channel speech separation via high-resolution deep neural networks,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 24, no. 8, pp. 1424–1437, 2016.
[38] Y. Wang, J. Du, L.-R. Dai, and C.-H. Lee, “A gender mixture detection approach to unsupervised single-channel speech separation based on deep neural networks,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 25, no. 7, pp. 1535–1546, 2017.
[39] D. Wang, “Time-frequency masking for speech separation and its potential for hearing aid design,” Trends in Amplification, vol. 12, no. 4, pp. 332–353, 2008.
[40] P. M. Hofman and J. Van Opstal, “Spectro-temporal factors in two-dimensional human sound localization,” J. Acoustical Society of America, vol. 103, no. 5, pp. 2634–2648, 1998.
[41] H. Sawada, S. Araki, and S. Makino, “Underdetermined convolutive blind source separation via frequency bin-wise clustering and permutation alignment,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 3, pp. 516–527, 2011.
[42] A. Alinaghi, P. J. B. Jackson, Q. Liu, and W. Wang, “Joint mixing vector and binaural model based stereo source separation,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 22, no. 9, pp. 1434–1448, 2014.
[43] A. Deleforge, F. Forbes, and R. Horaud, “Acoustic space learning for sound-source separation and localization on binaural manifolds,” International Journal of Neural Systems, vol. 25, no. 1, 2015.
[44] C. Hummersone, R. Mason, and T. Brookes, “Dynamic precedence effect modeling for source separation in reverberant environments,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 7, pp. 1867–1871, 2010.
[45] Y. Huang, J. Benesty, and J. Chen, “A blind channel identification-based two-stage approach to separation and dereverberation of speech signals in a reverberant environment,” IEEE Transactions on Audio, Speech and Language Processing, vol. 13, no. 5, pp. 882–895, 2005.
[46] F. Nesta and M. Omologo, “Convolutive underdetermined source separation through weighted interleaved ICA and spatio-temporal source correlation,” in Latent Variable Analysis and Signal Separation: 10th International Conference (LVA/ICA). Tel Aviv, Israel, 2012, pp. 222–230, Springer Berlin Heidelberg.
[47] S. Makino, H. Sawada, and T. W. Lee, Blind Speech Separation, Springer, 2007.
[48] A. Asaei, M. Golbabaee, H. Bourlard, and V. Cevher, “Structured sparsity models for reverberant speech separation,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 22, no. 30, pp. 620–633, 2014.
[49] R. Scheibler, D. Di Carlo, A. Deleforge, and I. Dokmanić, “Separake: Source separation with a little help from echoes,” arXiv: CoRR, vol. abs/1711.06805, 2017.
[50] E. Vincent, T. Virtanen, and S. Gannot, Audio source separation and speech enhancement, John Wiley & Sons, Ltd, 2018.
[51] G. J. Brown and M. Cook, “Computational auditory scene analysis,” Computer Speech and Language, vol. 8, pp. 297–336, 1994.
[52] J.-M. Valin, F. Michaud, and J. Rouat, “Robust localization and tracking of simultaneous moving sound sources using beamforming and particle filtering,” Robotics and Autonomous Systems, vol. 55, no. 1, pp. 216–228, 2007.
[53] S. M. Naqvi, M. Yu, and J. A. Chambers, “A multimodal approach to blind source separation of moving sources,” IEEE Journal of Selected Topics in Signal Processing, vol. 4, no. 5, pp. 895–910, 2010.
[54] C. Faller and J. Merimaa, “Source localization in complex listening situations: Selection of binaural cues based on interaural coherence,” The Journal of the Acoustical Society of America, vol. 116, no. 5, pp. 3075–3089, 2004.
[55] M. Jeub, M. Schäfer, T. Esch, and P. Vary, “Model-based dereverberation preserving binaural cues,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 18, no. 7, pp. 1732–1745, 2010.
[56] P. Aarabi, “Self-localizing dynamic microphone arrays,” IEEE Transactions on Systems, Man, and Cybernetics, Part C (Applications and Reviews), vol. 32, no. 4, pp. 474–484, 2002.
[57] H. Kim, L. Remaggi, P. J. B. Jackson, F. M. Fazi, and A. Hilton, “3D room geometry reconstruction using audio-visual sensors,” in Proc. of the Conference on 3D Vision (3DV), Qingdao, China, 2017.
[58] B. D. VanVeen and K. M. Buckley, “Beamforming: a versatile approach to spatial filtering,” IEEE Acoustic, Speech and Signal Processing Magazine, vol. 5, no. 2, pp. 4–24, 1988.
[59] A. Farina, “Simultaneous measurement of impulse response and distortion with a swept-sine technique,” in Proc. of the 108th Audio Engineering Society Convention (AES), Paris, France, 2000.
[60] P. Zahorik, “Direct-to-reverberant energy ratio sensitivity,” J. Acoustical Society of America, vol. 112, no. 5, Pt. 1, pp. 2110–2117, 2002.
[61] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallet, and N. L. Dahlgren, “DARPA TIMIT acoustic phonetic continuous speech corpus CDROM,” Tech. Rep., NIST Interagency, 1993.
[62] E. Vincent, R. Gribonval, and C. Févotte, “Performance measurement in blind audio source separation,” IEEE Transactions on Audio, Speech and Language Processing, vol. 14, no. 4, pp. 1462–1469, 2006.
[63] P. C. Loizou, Speech Enhancement: Theory and Practice - Second Edition, CRC Press, 2013.
[64] J. Jensen and C. H. Taal, “An algorithm for predicting the intelligibility of speech masked by modulated noise maskers,” IEEE/ACM Transactions on Audio, Speech and Language Processing, vol. 24, no. 11, pp. 2009–2022, 2016.
[65] E. Vincent, S. Araki, F. Theis, G. Nolte, P. Bofill, H. Sawada, A. Ozerov, V. Gowreesunker, D. Lutter, and N. Q. K. Duong, “The signal separation evaluation campaign (2007-2010): achievements and remaining challenges,” Signal Processing, vol. 92, no. 8, pp. 1928–1936, 2012.
[66] D. Wang, “On ideal binary mask as the computational goal of auditory scene analysis,” in Speech Separation by Humans and Machines, P. Divenyi, Ed., chapter 12, pp. 181–197. Kluwer Academic, 2005.
[67] E. M. Wenzel, M. Arruda, D. J. Kistler, and F. L. Wightman, “Localization using nonindividualized head-related transfer functions,” J. Acoustical Society of America, vol. 94, no. 1, pp. 111–123, 1993.
[70] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep network training by reducing internal covariate shift,” in Proc. of the International Conference on Machine Learning, Lille, France, 2015.
[71] D. P. Kingma and J. L. Ba, “ADAM: A method for stochastic optimization,” in Proc. of the International Conference on Learning Representations (ICLR), San Diego, USA, 2015.
[72] K. He, X. Zhang, S. Ren, and J. Sun, “Delving deep into rectifiers: surpassing human-level performance on imagenet classification,” in Proc. of the International Conference on Computer Vision (ICCV), Santiago, Chile, 2015.

Luca Remaggi is Audio Research Engineer at Creative Labs, UK, working on cutting edge spatial audio products. Between 2017 and 2019, he was Research Fellow at the Centre for Vision, Speech and Signal Processing, University of Surrey, UK, where he also pursued his PhD, in 2017. His research interest was to investigate the multipath sound propagation combining acoustic and visual data, for applications in spatial audio and source separation. He received the B.Sc. and M.E. degrees in Electronic Engineering from Università Politecnica delle Marche, Italy, in 2009 and 2012, respectively. During his M.E., he was an intern at the Department of Signal Processing and Acoustics, Aalto University, Finland, where he focused on the sound synthesis of musical instruments.

Philip Jackson is Reader in Machine Audition at the Centre for Vision, Speech & Signal Processing (CVSSP, University of Surrey, UK) with MA in Engineering (Cambridge University, UK) and PhD in Electronic Engineering (University of Southampton, UK). His broad interests in acoustical signals have led to research contributions in sound field control, modeling speech articulation, acoustics and recognition, in audio-visual perception, blind source separation, and spatial audio reverberation, capture, reproduction and quality evaluation [h-index 22; Google Scholar: bit.ly/2oTRw1C]. He led one of four research streams on object-based spatial audio in the S3A programme grant funded in the UK by the EPSRC, and enjoys listening.

Wenwu Wang (M’02–SM’11) was born in Anhui, China. He received the B.Sc. degree in 1997, the M.E. degree in 2000, and the Ph.D. degree in 2002, all from Harbin Engineering University, China. He then worked in Kings College London (2002-2003), Cardiff University (2004-2005), Tao Group Ltd. (now Antix Labs Ltd.) (2005-2006), and Creative Labs (2006-2007), before joining University of Surrey, UK, in May 2007, where he is currently a Professor in Signal Processing and Machine Learning, and a Co-Director of the Machine Audition Lab within the Centre for Vision Speech and Signal Processing. He was a Visiting Scholar at Ohio State University, USA, in 2008. He has been a Guest Professor on Machine Perception at Qingdao University of Science and Technology, China, since 2018. His current research interests include blind signal processing, sparse signal processing, audio-visual signal processing, machine learning and perception, artificial intelligence, machine audition (listening), and statistical anomaly detection. He has (co)-authored over 250 publications in these areas. He and his team have won the Best Paper Award on LVA/ICA 2018, the Best Oral Presentation on FSDM 2016, the Top Paper Award in IEEE ICME 2015, Best Student Paper Award shortlists on IEEE ICASSP 2019 and LVA/ICA 2010. His papers are among the Most Downloaded Papers in IEEE/ACM Transactions on Audio Speech and Language Processing in 2018 and 2019, and Featured Articles in IEEE Transactions on Signal Processing 2013. As a team member, he achieved the 2nd place (among 23 teams) in the DCASE 2019 Challenge Sound event localization and detection, the 3rd place (among 558 submitted systems) in the 2018 Kaggle Challenge “Free-sound general purpose audio tagging”, the 1st place (among 35 submitted systems) in the 2017 DCASE Challenge on
”Large-scale weakly supervised sound event detection for smart cars”, the [68] Q. Liu, Y. Xu, P. J. B. Jackson, W. Wang, and P. Coleman, “Iterative TVB Europe Award for Best Achievement in Sound in 2016 and the finalist deep neural networks for speaker-independent binaural blind speech for GooglePlay Best VR Experience in 2017, and the Best Solution Award on separation,” in Proc. of the IEEE International Conference on Acoustics, the Dstl Challenge ”Under-sampled signal signal recognition” in 2012. He is a Speech and Signal Processing (ICASSP), Brisbane, Canada, 2018. Senior Area Editor (2019-) for IEEE Transactions on Signal Processing and an [69] Z.-Q. Wang and D. Wang, “On spatial features for supervised speech Associate Editor (2019-) for EURASIP Journal on Audio Speech and Music separation and its application to beamforming and robust ASR,” in Proc. Processing. He was an Associate Editor (2014-2018) for IEEE Transactions on of the IEEE International Conference on Acoustics, Speech and Signal Signal Processing. He was a Publication Co-Chair for ICASSP 2019, Brighton, Processing (ICASSP), Brisbane, Canada, 2018. UK.

Journal: Electrical Engineering and Systems Science, arXiv (Cornell University)

Published: Oct 4, 2019
