Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

Jointly optimal denoising, dereverberation, and source separation

Jointly optimal denoising, dereverberation, and source separation c 2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. Jointly optimal denoising, dereverberation, and source separation Tomohiro Nakatani, Senior Member, IEEE, Christoph Boeddeker, Student Member, IEEE, Keisuke Kinoshita, Senior Member, IEEE, Rintaro Ikeshita, Member, IEEE, Marc Delcroix, Senior Member, IEEE, Reinhold Haeb-Umbach, Fellow, IEEE Abstract—This paper proposes methods that can optimize a acquired signal. For performing denoising (DN), beamform- Convolutional BeamFormer (CBF) for jointly performing denois- ing techniques have been investigated for decades [1], [2], ing, dereverberation, and source separation (DN+DR+SS) in a [3], [4], and the Minimum Variance Distortionless Response computationally efficient way. Conventionally, a cascade config- (MVDR) beamformer and the Minimum Power Distortionless uration, composed of a Weighted Prediction Error minimization Response (MPDR) beamformer, are now widely used as state- (WPE) dereverberation filter followed by a Minimum Variance Distortionless Response (MVDR) beamformer, has been used as of-the-art techniques. For source separation (SS), a number the state-of-the-art frontend of far-field speech recognition, even of blind signal processing techniques have been developed, though this approach’s overall optimality is not guaranteed. In including independent component analysis [5], independent the blind signal processing area, an approach for jointly opti- vector analysis [6], and spatial clustering-based beamforming mizing dereverberation and source separation (DR+SS) has been [7]. For dereverberation (DR), a Weighted Prediction Error proposed; however, it requires huge computing cost, and has not been extended for applications to DN+DR+SS. To overcome the minimization (WPE) based linear prediction technique [8], above limitations, this paper develops new approaches for jointly [9] and its variants [10] have been actively studied as an optimizing DN+DR+SS in a computationally much more efficient effective approach. With these techniques, for determining the way. To this end, we first present an objective function to optimize coefficients of filtering, it is crucial to accurately estimate a CBF for performing DN+DR+SS based on maximum likelihood such statistics of the speech signals and the noise as their estimation on an assumption that the steering vectors of the target signals are given or can be estimated, e.g., using a neural spatial covariances and time-varying variances. However, the network. This paper refers to a CBF optimized by this objective estimation often becomes inaccurate when the signals are function as a weighted Minimum-Power Distortionless Response mixed under reverberant and noisy conditions, which seriously (wMPDR) CBF. Then, we derive two algorithms for optimizing a degrades the performance of these techniques. wMPDR CBF based on two different ways of factorizing a CBF To enhance the robustness of the above techniques, neural into WPE filters and beamformers: one based on an extension of the conventional joint optimization approach proposed for network-supported microphone array speech enhancement has DR+SS and another based on a novel technique. Experiments been actively studied, and its effectiveness has been iden- using noisy reverberant sound mixtures show that the proposed tified for denoising [11], dereverberation [12], and source optimization approaches greatly improve the performance of the separation [13], [14]. With this approach, neural networks speech enhancement in comparison with the conventional cascade estimate such statistics of the signals and noise as Time- configuration in terms of signal distortion measures and ASR performance. The proposed approaches also greatly reduce the Frequency (TF) masks and time-varying variances [13], [15], computing cost with improved estimation accuracy in comparison [16], [17], while microphone array signal processing performs with the conventional joint optimization approach. speech enhancement. This combination is particularly effective Index Terms—Beamforming, dereverberation, source separa- because neural networks can successfully capture the spectral tion, microphone array, automatic speech recognition, maximum patterns of signals over wide TF ranges and reliably estimate likelihood estimation such statistics of the signals. Conventional signal processing often fails to adequately handle them. On the other hand, neural networks often introduce into the processed signal I. INTRODUCTION nonlinear distortions, which are harmful to perceived speech When a speech signal is captured by distant microphones, quality and ASR. This problem can be avoided by microphone e.g., in a conference room, it often contains reverberation, dif- array techniques. A number of articles have reported the fuse noise, and extraneous speakers’ voices. These components usefulness of this combination, particularly for far-field ASR, are detrimental to the intelligibility of the captured speech e.g., at the REVERB challenge [18] and the CHiME-3/4/5 and often cause serious degradation in many applications challenges [19], [20]. such as hands-free teleconferencing and Automatic Speech Despite the success of neural network-supported micro- Recognition (ASR). phone array speech enhancement, how to optimally combine Microphone array speech enhancement has been scrutinized individual microphone array techniques for simultaneously to minimize the aforementioned detrimental effects in the performing denoising, dereverberation, and source separation (DN+DR+SS) in a computationally efficient way remains T. Nakatani, K. Kinoshita, R. Ikeshita, and M. Delcroix are with NTT inadequately investigated. For example, for denoising and Corporation. C. Boeddeker and R. Haeb-Umbach are with Paderborn Univ. Manuscript received January 1, 2020; revised XXXX XX, 2020. dereverberation (DN+DR), the cascade configuration of a arXiv:2005.09843v3 [eess.AS] 2 Aug 2020 2 WPE filter followed by a MVDR/MPDR beamformer has been computational efficiency. An additional benefit of source-wise widely used as the state-of-the-art frontend, e.g., at the far- factorization is that it can be used, without loss of optimality field ASR challenges [18], [19], [20], [21]. However, since for the extraction of a single target source from a sound the WPE filter and the beamformer are separately optimized, mixture, which is now an important application area of speech the overall optimality of this approach is not guaranteed. enhancement [13], [29]. To optimally perform DN+DR, several techniques have been Experiments based on noisy reverberant sound mixtures proposed using a Kalman filter [22], [23], [24]. A technique, created using the REVERB Challenge dataset [18] show that called Integrated Sidelobe Cancellation and Linear Prediction the proposed optimization approaches substantially improve (ISCLP) [24], optimizes an integrated filter that can cancel the DN+DR+SS performance in comparison to the conven- noise and reverberation from the observed signals using a side- tional cascade configuration in terms of ASR performance and lobe cancellation framework. With this technique, however, signal distortion reduction. These two proposed approaches the steering vector of the target signal needs to be directly can also greatly reduce the computing cost with improved estimated in advance from noisy reverberant speech, which is estimation accuracy in comparison with the conventional joint challenging and limits the overall estimation accuracy. In the optimization approach. blind signal processing area, on the other hand, a technique Certain parts of this paper have already been presented that jointly optimizes a pair comprised of a WPE filter and a in our recent conference papers. The ML formulation for beamformer has been proposed for dereverberation and source optimizing a CBF was derived for DN+DR [30]. Another work separation (DR+SS) under noiseless conditions [25], [26], [31] argued that a CBF for DN+DR can be factorized into a [27]. One advantage of this approach is that we can access WPE filter and a wMPDR (non-convolutional) beamformer, multichannel dereverberated signals obtained as the output and jointly optimized without loss of optimality. Another of the WPE filter during the optimization, and utilize them work [32] presented ways to reliably estimate TF masks for to reliably estimate the beamformer. However, this approach DN+DR+SS. This paper integrates these techniques to perform requires 1) huge computing cost for the optimization, and 2) DN+DR+SS in a computationally efficient way. has not been extended for application to DN+DR+SS. In the remainder of this paper, the models of the observed To overcome the above limitations, this paper develops signal and the CBF are defined in Section II. Then, Section III algorithms for optimizing a Convolutional BeamFormer (CBF) presents our proposed optimization methods, and Section IV that can perform DN+DR+SS in a computationally much more summarizes their characteristics and advantages. Sections V and VI describe experimental results and concluding remarks. efficient way. A CBF is a filter that is applied to a multichannel observed signal to yield the desired output signals. For CBF optimization, this paper first presents a common objective II. MODELS OF SIGNAL AND BEAMFORMER function based on the Maximum Likelihood (ML) criterion This paper assumes that I source signals are captured by by assuming that the steering vectors of the desired signals M(≥ I) microphones in a noisy reverberant environment. The are given, or can be estimated. This paper refers to a CBF captured signal at each TF point in the short-time Fourier optimized by this objective function as a weighted MPDR transform (STFT) domain is modeled by (wMPDR) CBF. After showing that a CBF can be factorized into WPE filter(s) and beamformer(s) in two different ways, I (i) we derive two different algorithms for optimizing the wMPDR x = x + n , (1) t,f t,f t,f CBF, based on the CBF factorization ways. The first approach, i=1 (i) (i) (i) called source-packed factorization, is an extension of the x = d + r , (2) t,f t,f t,f conventional joint optimization technique proposed for DR+SS where t and f are time and frequency indices, respectively, [25], [26], [27]. We first show that its direct application to ⊤ M×1 x = [x , . . . , x ] ∈ C is a column vec- DN+DR+SS suffers from serious problems in terms of the t,f 1,t,f M,t,f tor containing all the microphone signals at a TF point. computational efficiency and estimation accuracy and present (i) Here, (·) denotes the non-conjugate transpose. x = an extension for solving them. The second approach, called t,f (i) (i) source-wise factorization, is based on a novel factorization [x , . . . , x ] is a (noiseless) reverberant signal cor- 1,t,f M,t,f technique that factorizes a CBF into a set of sub-filter pairs, responding to the ith source, and n = [n , . . . , n ] t,f 1,t,f M,t,f (i) each of which is composed of a WPE filter and a beam- is the additive diffuse noise. x for each source in Eq. (1) t,f former, and independently estimates each source. For both is further decomposed into two parts in Eq. (2), one of which approaches, we also present a method that robustly estimates consists of the direct signal and early reflections, referred the steering vectors of the desired signals during the wMPDR (i) to as desired signal d , and the other corresponds to late t,f CBF optimization using the output of the WPE filters. A (i) reverberation r . Hereafter, the frequency indices of the t,f neural network-supported TF-mask estimation technique is symbols are omitted for brevity, assuming that each frequency also incorporated to estimate the steering vectors. Although bin is processed independently in the same way. both approaches work comparably well in terms of estimation (i) In this paper, the goal of DN+DR+SS is to estimate d accuracy, source-wise factorization has advantages in terms of (i) for each source i from x in Eq. (1) by reducing r of (i ) source i, x of all the other sources i 6= i, and diffuse noise Note that the proposed techniques can also be applied to conventional t blind signal processing for DR+SS, as discussed in an article [28]. n . Since in noisy reverberant environments, early reflections t … Mulple- Beamformer Convoluonal target matrix for beamformer for dereverberaon, linear separaon denoising, and predicon and source separaon (LP) denoising (a) MIMO CBF (b) MIMO CBF with source-packed factorization (1) Convoluonal (1) (1) Single-target Beamformer beamformer for = 1 for = 1 LP for = 1 ( ) Convoluonal ( ) Beamformer ( ) Single-target beamformer for = for = LP for = (c) Set of MISO CBFs (d) MISO CBFs with source-wise factorization Fig. 1. Multi-Input Multi-Output (MIMO) CBF and its three different implementations. They are equivalent to each other in the sense that whatever values are set to coefficients of one implementation, certain coefficients of the other implementations can be determined such that they realize identical input-output relationships. Thus, optimal solutions of all implementations are identical as long as they are optimized based on the same objective function. (i) enhance the intelligibility of speech for human perception In this paper, we further assume that d is statistically [33] and improve the ASR performance by computer [34], independent of the following variables: (i) (i) we include them in the desired signal. Hereafter, we use ′ • s ′ for t ≤ t − Δ (and thus d is statistically m = 1 as a reference microphone and describe a method (i) independent of x for t ≤ t− Δ), (i) for estimating desired signal d at the microphone without (i) 1,t ′′ • r for t ≤ t, ′′ loss of generality. ′ (i ) ′ ′ (i) • x and n ′ for all t, t and i 6= i. To achieve the above goal, we further model d : These assumptions are used to derive the optimization algo- (i) (i) (i) (i) (i) d = v s = v ˜ d , (3) rithms described in the following. t t 1,t (i) where s is the ith clean speech at a TF point. In Eq. (3), the (i) (i) (i) A. Definition of a CBF and its three different implementations desired signal of the ith source, d , is modeled by v s , t t i.e., a product in the STFT domain of the clean speech with We now define a CBF, which will later br factorized into (i) transfer function v , hereafter a steering vector, assuming WPE filter(s) and beamformer(s): that the duration of the impulse response corresponding to L−1 the direct signal and early reflections in the time domain is H H y = W x + W x , (6) t t t−τ 0 τ sufficiently short in comparison with the analysis window [35]. τ=Δ (i) (i) We further rewrite the desired signal as v ˜ d , i.e., a product 1,t (i) (i) (i) (1) (I) ⊤ I×1 where y = [y , . . . , y ] ∈ C is the output of the of the desired signal at reference microphone d = v s 1,t 1 t t t CBF corresponding to the estimates of I desired signals, with a Relative Transfer Function (RTF) [36], which is defined M×I W ∈ C for each τ ∈ {0, Δ, Δ+1, . . . , L−1} is a matrix as the steering vector divided by its reference microphone element, composed of the beamformer coefficients, (·) denotes a conjugate transpose, and Δ is the prediction delay of CBF. We (i) (i) (i) v ˜ = v /v . (4) 1 set Δ equal to the mixing time introduced in Eq. (5), so that the desired signals are included only in the first term of Eq. (6) In contrast, assuming that the duration of the late reverber- and are statistically independent of the second term based on ation in the time domain exceeds the analysis window, late the assumptions introduced in the signal model. Then this (i) reverberation r is modeled by a convolution in the STFT paper performs DN+DR+SS by estimating the beamformer domain [37] of the clean speech with a time series of acoustic coefficients that can estimate the desired signals included in transfer functions that corresponds to the late reverberation: the first term of Eq. (6). L −1 a For notational simplicity, we also introduce a matrix repre- (i) (i) (i) r = a s , (5) sentation of a CBF: t τ t−τ τ Δ W x 0 t (i) (i) (i) y = , (7) ⊤ t where a = [a , . . . , a ] for τ ∈ {Δ, . . . , L − 1} x τ a W 1,τ M,τ t are the convolutional acoustic transfer functions, and Δ is the mixing time, which represents the relative frame delay of the See a previous work [8] for more precise discussion of the statistical (i) late reverberation start time to the direct signal. independence between d and s ′ for t ≤ t − Δ. t 4 where W is a matrix containing W for Δ ≤ τ ≤ L− 1 and For example, MISO CBFs were previously used [30], [39]. x is a column vector containing past multichannel observed ISCLP [24] can also be viewed as the realization of a MISO signals x for Δ ≤ τ ≤ L− 1: CBF using a sidelobe cancellation framework [40]. t−τ 3) Source-wise factorization: With the source-wise factor- ⊤ ⊤ M(L−Δ)×I W = W , . . . ,W ∈ C , (8) Δ L−1 ization shown in Fig. 1 (d), we further factorize each MISO ⊤ ⊤ M(L−Δ)×1 CBF defined in Eq. (14) for source i: x = x , . . . ,x ∈ C . (9) t−Δ t−L+1 " # " # (i) Hereafter, we refer to the CBF defined by Eqs. (6) and (7) as I (i) = q , (15) (i) (i) a MIMO CBF. w −G In the following, we further present three different im- (i) (i) M×1 M(L−Δ)×M plementations of CBF, including two ways of factorizing it. where q ∈ C and G ∈ C . Then, Eq. (14) Figure 1 illustrates the MIMO CBF and its three different can be rewritten as a pair of a linear prediction filter and a implementations. beamformer: 1) Source-packed factorization: With the implementation (i) (i) shown in Fig. 1 (b), we directly factorize the MIMO CBF in z = x − G x , (16) t t Eq. (7): (i) (i) (i) y = q z , (17) t t W I 0 M = Q, (10) W −G (i) (i) M×1 where z ∈ C and G are the output and the prediction M×I M(L−Δ)×M M×M (i) where Q ∈ C , G ∈ C , and I ∈ R is M matrix of the linear prediction, and q is the beamformer’s an identity matrix. Then Eq. (6) can be rewritten as a pair of coefficient vector. Because Eq. (16) is performed only to esti- a (convolutional) linear prediction filter followed by a (non- mate the ith source, it is called single-target linear prediction. convolutional) beamformer matrix: 4) Relationship between two factorization approaches: The H difference between the two factorization approaches, namely z = x − G x , (11) t t t Figs. 1 (b) and (d), is based only on how the linear prediction is y = Q z . (12) t t performed: Eq. (11) or Eq. (16). More specifically, it is based (i) M×1 on whether the prediction matrices, G and G , are common Here z ∈ C and G are the output and the prediction to all the sources or different over different sources. Therefore, matrix of the linear prediction, and Q is the coefficient matrix different optimization algorithms with different characteristics of the beamformer. Eq. (11), which is supposed to derever- are derived, as will be shown in Section III. In contrast, berate all the sources at the same time, is thus referred to as (i) the beamformer parts, Q and q in Eqs. (12) and (17) a multiple-target linear prediction, and Eq. (12) is supposed (i) are identical in the two approaches, viewing q as the ith to perform denoising and source separation at the same time. column of Q, because they satisfy W = Q in Eq. (10) and Because individual sources are not distinguished in the WPE (i) (i) w = q in Eq. (15). filter’s output, this implementation is called source-packed In addition, it should be noted that all the above CBF factorization. implementations are equivalent to each other in the sense One example of source-packed factorization is the cascade that whatever values are set to the coefficients of one imple- configuration composed of a WPE filter followed by a beam- mentation, certain coefficients of the other implementations former, which has been widely used for DN+DR+SS in the can be determined such that they realize the same input- far-field speech recognition area [14], [20], [38], and the other output relationship. Thus, the optimal solutions of all the example is one used in the joint optimization of a WPE filter implementations are identical as long as they are based on and a beamformer, which has been investigated for DR+SS in the same objective function. the blind signal processing area [25], [26], [27]. 2) Multi-Input Single-Output (MISO) CBF: Next we define the set of MISO CBFs shown in Fig. 1 (c). They were obtained III. ML ESTIMATION OF CBF by decomposing the beamformer coefficients in Eq. (7): In this section, we derive two different optimization algo- " # (1) (2) (I) rithms using (b) source-packed factorization and (d) source- 0 w w . . . w 0 0 0 = , (13) wise factorization. For the derivations, we assume that the (1) (2) (I) w w . . . w (i) RTFs v ˜ and the time-varying variances of the output signals (i) (i) M×1 (i) M(L−Δ)×1 yielded by the optimal CBF, denoted by λ , are given. where w ∈ C and w ∈ C are column Then in Section III-E, we describe ways for jointly estimating vectors, which respectively contain the ith columns of W (i) λ with CBF coefficients based on the ML criterion and and W; they are used to extract the ith desired signal. Then, (i) estimating v ˜ based on the WPE filter’s output obtained at a Eq. (7) can be rewritten for each source i: " # step of the optimization. (i) w x (i) t y = . (14) (i) w t A. Probabilistic model First, we formulate the objective function for DN+DR+SS The existence of G, which satisfies W = −GQ, is guaranteed for any W when M ≥ I and rank{Q} = I. by reinterpreting the objective function proposed for DN+DR 5 ⊤ ⊤ ⊤ [30]. For this formulation, we interpret DN+DR+SS to be where n = [n , . . . ,n ] . According to the statistical t−Δ t−L+1 (i) composed of a set of separate processing steps, each of independence assumptions introduced in Section II, d is sta- 1,t which applies DN+DR to enhance source i by reducing the (i) (i ) tistically independent of rˆ , xˆ , and n ˆ . Then substituting t t late reverberation of the source (DR) and the additive noise Eq. (21) into Eq. (19) and omitting the constant terms, we including the other sources and the diffuse noise (DN). With obtain the following (in the expectation sense): this interpretation, we introduce the following assumptions, similar to the previous work [30]: (i) (i ) E rˆ + xˆ + n ˆ T ′ t n o t t (i) i 6=i • The output of the optimal CBF for each i, namely y , (i) E L (θ ) = . (i) follows a zero-mean complex Gaussian distribution with T t=1 t (i) (i) (25) time-varying variance λ = E y [8]. t t • The beamformer satisfies a distortionless constraint for The above equation indicates that minimization of the objec- (i) (i ) (i) each source i defined using RTF v ˜ in Eq. (4): tive function indeed minimizes the sum of rˆ , xˆ for i 6= i, t t and n ˆ in Eq. (21). H H t (i) (i) (i) (i) w v ˜ = 1 or q v ˜ = 1 . (18) Before deriving the optimization algorithms, we define a matrix that is frequently used in the derivation, referred to Then based on the previous discussion [30], we can approx- as a variance-normalized spatio-temporal covariance matrix. imately derive the objective function to be minimized for Letting x be a column vector composed of the current and (i) estimating the CBF coefficients for source i, e.g., θ = past observed signals at all the microphones, defined as (i) (i) {w ,w }, according to ML estimation: ⊤ ⊤ M(L−Δ+1)×1   x = x ,x ∈ C , (26) 2 t t t (i)   (i) (i) (i) (i) the matrix is defined: L (θ ) = + log λ s.t. w v ˜ = 1. i   t 0 (i) t=1 t 1 x x (i) t t M(L−Δ+1)×M(L−Δ+1) R = ∈ C . (27) (19) (i) t=1 t The objective function for estimating all the sources can then Its factorized form is also defined: be obtained by summing Eq. (19) over all the sources:   I (i) (i) R P (i) (i) x x (i) (i)   ˜ R = , (28) L (Θ) = L (θ ), s.t. w v = 1 for all i, (20) i x (i) (i) P R i=1 (1) (I) where Θ = θ , . . . , θ . This objective function is used where commonly for all the implementations of a CBF. In this paper, 1 x x we call a CBF optimized by the above objective function (i) t M×M R = ∈ C , (29) (i) a weighted MPDR (wMPDR) CBF because it minimizes t=1 t (i) the average power of output y weighted by time-varying 1 x x (i) t (i) t M(L−Δ)×M variance, λ , of the signal. P = ∈ C , (30) (i) Here, let us briefly explain how DN+DR+SS is performed t=1 t by Eqs. (19) and (20). Substituting Eqs. (1) and (2) into H (i) 1 x x t M(L−Δ)×M(L−Δ) R = ∈ C . (31) Eq. (14) and using the model of the desired signal in Eq. (3) (i) t=1 t and the distortionless constraint in Eq. (18), we obtain (i) (i) (i) (i ) y = d + rˆ + xˆ + n ˆ , (21) t t t 1,t B. Optimization based on source-packed factorization i 6=i This subsection discusses methods for optimizing a CBF (i) (i ) where rˆ , xˆ for i 6= i, and n ˆ are respectively the late t with the source-packed factorization. In the following, after t t reverberation of the ith source, all the other sources, and the describing a method for directly applying the conventional additive diffuse noise remaining in the CBF output, written in joint optimization technique used for DR+SS to DN+DR+SS, MISO CBF form: we summarize the problems in it, and present the solutions to " # " # the problems. (i) (i) (i) w 1) Direct application of a conventional technique: With the rˆ = , (22) t (i) (i) w x t source-packed factorization in Eqs. (11) and (12), simultane- " # " # H ′ ously estimating both Q and G in closed form is difficult (i ) (i) (i ) w x (i) t (i) xˆ = ′ , (23) even when both λ and v are given. Instead, we use an t t (i) (i ) iterative and alternate estimation scheme, following a blind " # signal processing technique [25], [26], [27], where at each (i) w n n ˆ = , (24) estimation step, either Q or G is updated and the other is (i) w t fixed. 6 (i) For updating G, we fix Q at its previously estimated Then q , which minimizes Eq. (40) under the distortionless (i) (i) value. For the algorithm derivation, the representation of linear constraint q v ˜ = 1, can be obtained: prediction in Eq. (11) is slightly modified: −1 (i) (i) R v ˜ (i) z = x − X g, (32) t t t q = . (42) −1 (i) (i) (i) v ˜ R v ˜ where X and g are equivalent to x and G with a modified t t matrix structure defined: Because the above beamformer minimizes the average power of z weighted by the time-varying variance, we call it ⊤ M×M (L−Δ) X = I ⊗ x ∈ C , (33) t M a weighted MPDR (wMPDR) beamformer . As shown in H 2 Section III-C, a wMPDR beamformer is a special case of a ⊤ ⊤ M (L−Δ)×1 g = g , . . . ,g ∈ C , (34) 1 M wMPDR CBF, which is reduced to a wMPDR beamformer when setting the length of the CBF L = 1, i.e., by just where ⊗ is a Kronecker product and g is the mth column converting it into a non-convolutional beamformer. of G. Then, considering that the CBF in Eqs. (11) and (12) (i) The above algorithm, however, has two serious problems. (i) can be written as y = q x − X g and omitting t t First, the size of the covariance matrix in Eq. (38) is too the normalization terms, the objective function in Eq. (20) large, requiring huge computing cost for calculating it and becomes its inverse. Second, as shown in our experiments, the iterative and alternate estimation of Q and G tends to converge to 1 2 L (g) = x − X g , (35) g t t a sub-optimal point. This is probably because the update q,t t=1 of G is performed based only on the output of the fixed beamformer in the iterative and alternate estimation, as in where kxk = x Rx, and Φ is a semi-definite Hermitian q,t Eq. (19); the signal dimension of the beamformer output, i.e., matrix: I, is reduced from that of the original signal space, i.e., M, I with the over-determined case, i.e., I < M. As a consequence, (i) (i) q q M×M signal components that are relevant for the update of G may Φ = ∈ C . (36) q,t (i) i=1 t be reduced in the beamformer output, especially when the estimation of Q is less accurate at the early stage of the Because Eq. (35) is a quadratic form with a lower bound, g, optimization. This can seriously degrade the update of G. which minimizes it, can be obtained: 2) Proposed extension: Next we present two techniques to mitigate the above problems within the source-packed g = Ψ ψ, (37) factorization approach. The first reduces the computing cost. 1 2 2 M (L−Δ)×M (L−Δ) As shown in Appendix A, Eqs. (38) and (39) can be rewritten, Ψ = X Φ X ∈ C , (38) q,t t t using Eq. (28): 1 H 2 M (L−Δ)×1 ψ = X Φ x ∈ C , (39) X q,t t H ⊤ (i) (i) (i) Ψ = q q ⊗ R , (43) i=1 where (·) is the Moore-Penrose pseudo-inverse. Since the I (i) (i) (i) rank of Ψ is equal to or smaller than MI(L−Δ), as shown in ψ = q ⊗ P q , (44) Section III-B2, Ψ is rank deficient for over-determined cases, i=1 namely when M > I, and thus the use of the pseudo-inverse is where () denotes the complex conjugate. In the above equa- indispensable. Eqs. (37) to (39) are equivalent to those used in (i) tions, the majority of the calculation is coming from R . the dereverberation step for DR+SS [25], [26], [27] except that Because the size of the matrix is much smaller than that in our paper denoising is additionally included in the objective of Ψ, we can greatly reduce the computing cost with this and over-determined cases are also considered. We call this a modification in comparison with the direct calculation of multiple-target WPE filter. Eqs. (38) and (39). Although we still need to calculate the For the update of Q, fixing g at its previously estimated inverse of huge matrix Ψ even with this modification, the cost value, the objective in Eq. (20) can be rewritten: is relatively small in comparison with the direct calculation of Ψ. Note that Eq. (43) also shows the rank of Ψ to be equal X 2 H (i) (i) (i) to or smaller than MI(L− Δ). L (Q) = q s.t. q v ˜ = 1, (40) (i) The second technique introduces a heuristic to improve the i=1 update of the WPE filter. To use a whole M-dimensional signal (i) where R is a variance-normalized spatial covariance matrix A wMPDR beamformer was also called a Maximum-Likelihood Distor- of the output of the multiple-target WPE filter, calculated as tionless Response (MLDR) beamformer [41]. In general, the computational complexity of a matrix multiplication 1 z z t exceeds O(n ). Because the size of Ψ is M-times larger than R , the t x (i) R = . (41) computational complexity for calculating Ψ is probably at least M times (i) t=1 t larger than that for calculating R . x 7 (i) space to be considered for the update, we modify the CBF to to the RTF v ˜ with zero padding. Finally, we obtain the output not only I desired signals, but also M − I auxiliary solution: signals that are included in orthogonal complement Q of −1 (i) (i) R v Q and model the auxiliary signals as zero-mean time-varying (i) w = . (51) −1 complex Gaussians. With this modification, the optimization is (i) (i) (i) v R performed by calculating the summation in Eqs. (43) and (44) (I+1) (M) over both 1 ≤ i ≤ I and I < i ≤ M, letting q , . . . ,q The above equation, which gives the simplest form of the be the orthonormal bases for the orthogonal complement Q . solution to a wMPDR CBF, clearly shows that a wMPDR (i) Because distinguishing variances λ of the auxiliary signals CBF is a general case of a wMPDR beamformer. By setting is inconsequential, we use the same value for them, calculated L = 1 in the above solution, namely, by letting it be a as non-convolutional beamformer, it reduces to the solution of a wMPDR beamformer in Eq. (42). M 2 X H ⊥ (i) An advantage of the solution using the MISO CBFs is that λ = q z , (45) M − I it can be obtained by a closed form equation, provided the i=I+1 RTFs and the time-varying variances of the desired signals are given and that we can ignore the interaction between DN and and calculate P and R based on Eqs. (30) and (31) x x DR. With this approach, however, the RTFs must be directly accordingly. In summary, we can implement this modification estimated from a reverberant observation, similar to ISCLP by adding the following terms to Ψ and ψ in Eqs. (43) and [24]. A solution to this problem is to use dereverberation (44): preprocessing based on a WPE filter for the RTF estimation. X H ⊤ Although it was shown that the output of a WPE filter can ⊥ (i) (i) Ψ = q q ⊗ R , (46) be obtained in a computationally efficient way within the i=I+1 framework of this approach [30], the source-wise factorization X ∗ approach described in the following can more naturally solve ⊥ (i) ⊥ (i) ψ = q ⊗ P q . (47) this problem. So, this paper adopts it as the solution. i=i+1 D. Optimization based on source-wise factorization C. Direct optimization of MISO CBFs With source-wise factorization, similar to the case with the direct optimization of the MISO CBFs, the optimization can Before deriving the optimization with source-wise factoriza- be performed separately for each source, and the resultant tion, we show that we can directly optimize the MISO CBFs algorithm is identical to that proposed for DN+DR [31]. in Eq. (14), and summarize their characteristics. With this Considering that a CBF can be written based on Eqs. (16) setting, the CBFs and the objective function are both defined H (i) (i) separately for each source in Eqs. (14) and (19), and thus, the (i) and (17) as y = q x − G x and using the t t optimization can be performed separately for each source. The (i) factorized form of R in Eq. (28), the objective function in resultant algorithm is, therefore, identical to that previously Eq. (19) can be rewritten: proposed for DN+DR [42], where this type of CBF is also called a Weighted Power minimization Distortionless response −1 (i) (i) (i) (i) (i) (i) (WPD) CBF. L G ,q = G − R P q x x (i) For presenting the solution, we introduce the following x vector representation of Eq. (14): (i) + q . (52) H −1 (i) (i) (i) (i) R − P R P x x x (i) (i) y = w x , (48) t t (i) In the above objective function, G is contained only in the first term, and the term can be minimized without depending (i) where w is defined: (i) (i) on the value of q , when G takes the following value: " # (i) −1 (i) (i) (i) (i) w = , (49) G = R P . (53) (i) x x (i) So, this is a solution of G that globally minimizes the (i) (i) Then, when λ and v ˜ are given, Eq. (19) becomes a simple t (i) objective function given time-varing variance λ . Interest- constraint quadratic form: ingly, this solution is identical to that of conventional WPE 2 H dereverberation. This means that the WPE filter, which is (i) (i) (i) (i) L (w ) = w s.t. w v = 1, (50) (i) optimized solely for dereverberation, can perform the optimal dereverberation for the joint optimization without depending (i) where R is the covariance matrix defined in Eq. (28), and h i 6 ⊤ This is not a unique solution. The first term is minimized even when an (i) (i) M(L−Δ+1)×1 (i) v = v ˜ , 0, . . . , 0 ∈ C corresponds arbitrary matrix, whose null space includes q , is added to Eq. (53). 8 Algorithm 1: Source-packed factorization-based optimiza- Algorithm 2: Source-wise factorization-based optimiza- tion for estimation of all sources tion for estimation of ith source Data: Observed signal x for all t Data: Observed signal x for all t t t (i) (i) TF masks γ for all t and 1 ≤ i ≤ I TF masks γ for all t t t (i) (i) Result: Estimated sources y for all t and 1 ≤ i ≤ I Result: Estimated ith source y for all t t t (i) (i) 2 2 1 Initialize λ as ||x || /M for all t and 1 ≤ i ≤ I 1 Initialize λ as ||x || /M for all t t t t I t I M M (i) 2 Initialize q as the ith column of I for 1 ≤ i ≤ I 2 repeat P H (i) T x x 1 t 3 Initialize z as x for all t t t t 3 R ← x (i) T t=1 4 repeat P H (i) T x x 1 t H t (i) T x x 4 P ← 1 t x t (i) T t=1 5 R ← for 1 ≤ i ≤ I λ (i) t x t=1 (i) (i) H (i) (i) T x x 1 t t 5 G ← R P 6 P ← for 1 ≤ i ≤ I x (i) t=1 t  H (i) (i) H (i) I 6 z ← x − G x (i) (i) t t 7 Ψ ← q q ⊗ R i=1 x (i) (i) (i) 7 Estimate v ˜ based on z and γ P t t I (i) (i) (i) (i) (i) 8 ψ ← q ⊗ P q i=1 P z z (i) T t t 8 R ← z (i) t=1 9 Begin Add orthogonal complement beamformer (i) (i) (I+1) (M) ´ (R ) v 10 Set q , . . . ,q as the orthonormal bases (i) z 9 q ← H (i) (i) ´ (i) for orthogonal complement Q of Q v˜ R v˜ ( ) z P (i) (i) M (i) ⊥ 1 (i) 10 y ← q z 11 λ ← q z t t t i=I+1 M−I P (i) (i) T x x 1 t 11 λ ← y 12 R ← t t x t=1 T λ P H T x x 12 until convergence ⊥ 1 t 13 P ← x ⊥ T t=1 H ⊥ (i) (i) 14 Ψ ← Ψ + q q ⊗ R i=I+1 (i) ⊥ (i) output of the single-target WPE filter, calculated as 15 ψ ← ψ + q ⊗ P q i=I+1 16 End (i) (i) z z t t 17 g ← Ψ ψ (i) M×M R = ∈ C . (55) (i) 18 z ← x − X g λ t t t t=1 t (i) (i) 19 Estimate v ˜ based on z and γ for 1 ≤ i ≤ I Then the solution can be obtained, under a distortionless P H (i) T z (z ) 1 t t 20 R ← for 1 ≤ i ≤ I z constraint, as a wMPDR beamformer: (i) T t=1 (i) (i) −1 R v˜ ( ) (i) (i) (i) 21 q ← for 1 ≤ i ≤ I + R v ˜ H z (i) (i) (i) v˜ R v˜ ( ) z (i) q = . (56) H −1 (i) (i) H (i) (i) ´ (i) 22 y ← q z for 1 ≤ i ≤ I t v ˜ R v ˜ (i) (i) 23 λ ← y for 1 ≤ i ≤ I t t Eqs. (54) to (56) closely resemble Eqs. (40) to (42). The 24 until convergence difference is whether the dereverberation is performed by a multiple-target WPE filter or single-target WPE filters. With source-wise factorization, the solution can be obtained (i) (i) in closed form when λ and v ˜ are given, similar to the case on the subsequent beamforming, provided the time-varying with the direct optimization of the MISO CBFs. In addition, (i) variance of the desired source is given for the optimization. In the output of the WPE filter is obtained as z in Eq. (16), addition, unlike the source-packed factorization approach, this and can be efficiently used for the estimation of the RTFs. approach does not need to compensate for the dimensionality Furthermore, since the temporal-spatial covariance matrix in (i) reduction of the beamformer output for the update of G Eq. (31) is much smaller than that in Eq. (38) of the source- because it considers a whole signal space without adding any packed factorization, the computational cost can be reduced. (i) modification. We refer to this filter G as a single-target WPE (See Section IV for more scrutiny of the computing cost.) filter. (i) (i) Once G is obtained as the above solution, the objective (i) E. Processing flow with estimation of λ and v function in Eq. (19) can be rewritten as This subsection describes examples of processing flows in Algorithms 1 and 2, for optimizing a CBF based on source- 2 H (i) (i) (i) (i) packed factorization and source-wise factorization, including L q = q s.t. q v ˜ = 1, (54) (i) (i) estimation of the time-varying variances, λ , and the RTFs, (i) v ˜ . Hereafter, we refer to the algorithms as A-1 and A-2 for (i) where R is a variance-normalized covariance matrix of the brevity. Although A-1 simultaneously estimates all sources, z 9 WPE for wMPDR for was trained so that it receives the WPE filter’s output, which is obtained at the first iteration in the iterative optimization of Dereverb Beamform the CBF, and estimates the TF masks of the desired signals. The network’s input was set as a concatenation of the real Es mate Es mate and imaginary parts of the STFT coefficients, and the loss function was set as the (scale-dependent) signal-to-distortion ratio (SDR) of an enhanced signal obtained by multiplying the Es mate estimated masks to an observed signal. For the training and Es mate validation data, we synthesized mixtures using two utterances TF masks randomly extracted from the WSJ-CAM0 corpus [45] and two st room impulse responses and background noise extracted from For 1 me For subsequent mes the REVERB Challenge training set [18]. (i) Fig. 2. Processing flow of source-wise factorization-based CBF for estimating For the estimation of the RTFs, v ˜ , we adopted a method a source i. based on eigenvalue decomposition with noise covariance (i) whitening [46], [47]. With this technique, steering vector v (i) is first estimated: y for all i, from observed signal x , A-2 estimates only (i) (i) −1 one of the sources, y for a certain i, and (if necessary) is v = R MaxEig R R , (57) t i \i \i repeatedly applied to the observed signal to estimate all the where MaxEig(·) is a function that calculates the eigenvector sources one after another. TF masks are provided as auxiliary (i) corresponding to the maximum eigenvalue and R and R \i inputs for both algorithms. TF mask γ , which is associated are spatial covariance matrices of the i-th desired signal and with a source and a TF point, takes a value between 0 and the other signals estimated as: 1 and indicates whether the source’s desired signal dominates (i) (i) the TF point (γ = 1) or not (γ = 0). The TF masks (i) (i) (i) t t γ z z t t t over all the TF points are used to estimate the RTF(s) of the R = , (58) (i) desired signal(s) in line 19 of A-1 and line 7 of A-2. (See t t Section III-E1 for the estimation detail of the TF masks and (i) (i) (i) 1− γ z z the RTFs.) t t t t (i) R =   . (59) \i Both algorithms estimate time-varying variances λ based (i) 1− γ on the same objective as that for the CBF, defined in Eq. (19). Because no closed form solution to the estimation of the Then, the RTF is obtained by Eq. (4). CBF and the time-varying variances is known, an iterative and IV. DISCUSSION alternate optimization scheme is introduced to both algorithms. (i) In each iteration, the time-varying variances, λ , are updated In summary, our proposed techniques can optimize a CBF in line 23 of A-1 and line 11 of A-2 as the power of the for jointly performing DN+DR+SS with greatly reduced com- (i) previously estimated values of desired signal y , and then the puting cost in comparison with the direct application of the (i) CBF and desired signal y are updated while fixing the time- conventional joint optimization technique proposed for DR+SS varying variances. The iteration is repeated until convergence to DN+DR+SS. With the conventional technique, a huge is obtained. covariance matrix Ψ must be calculated to take into account the dependency of G on Q that is inherently introduced into The optimization methods described in Sections III-B and source-packed factorization. This makes the computing cost of III-D are used in their respective algorithms to update the CBF the conventional technique extremely high. In contrast, since and the desired signal(s). The WPE filter is first estimated in the proposed extension of the source-packed factorization lines 5 to 17 of A-1 and lines 3 to 5 of A-2, and applied in line 18 of A-1 and line 6 of A-2. After the RTF(s) is updated approach substantively reduces the size of the matrix to be using the dereverberated signals, the wMPDR beamformer is calculated from M (L− Δ) for Ψ to M(L− Δ) for R , the estimated in lines 20 and 21 of A-1 and lines 8 and 9 of A-2, computing cost can be effectively reduced. (i) and applied in line 22 of A-1 and line 10 of A-2. On the other hand, with source-wise factorization, G (i) Figure 2 also illustrates the processing flow of a CBF with can be optimized independently of q , which also allows source-wise factorization for estimating a source i. us to reduce the size of the matrix to be calculated to the same as that of the proposed extension of the source-packed 1) Methods for estimating TF masks and RTFs: In our (i) factorization approach. In addition, we can skip the calculation experiments, for estimating TF masks, γ , for all i and t at each frequency, we used a Convolutional Neural Network of an additional matrix, R , and the inverse of the huge −1 that works in the TF domain and is trained using utterance- matrix, Ψ , both of which are required for the proposed level Permutation Invariant Training criterion (CNN-uPIT) extension of the source-packed factorization approach. This [43]. According to our preliminary experiments [32], we set further increases the computational efficiency of the source- the network structure as a CNN with a large receptive field wise factorization approach. A drawback of source-wise fac- similar to one used by a fully-Convolutional Time-domain torization is that it has to handle I-times more dereverberated Audio Separation Network (Conv-TasNet) [44]. The network signals than source-packed factorization. 10 TABLE I CBFS COMPARED IN EXPERIMENTS: (1) AND (2) ARE CONVENTIONAL CASCADE CONFIGURATION APPROACHES, (5) IS A CONVENTIONAL JOINT OPTIMIZATION APPROACH, (6) AND (7) ARE PROPOSED JOINT OPTIMIZATION APPROACHES, AND (3) AND (4) ARE TEST CONDITIONS USED JUST FOR COMPARISON. (5), (6), AND (7) ARE CATEGORIZED AS “JOINTLY OPTIMAL” BECAUSE THEY ARE COMPOSED OF WPE AND WMPDR AND OPTIMIZED BASED ON INTEGRATED VARIANCE ESTIMATION (SEE FIG. 3 FOR THE DIFFERENCE BETWEEN SEPARATE AND INTEGRATED VARIANCE ESTIMATION). Name of method Jointly WPE BF Variance Category optimal estimation (1) WPE+MPDR (separate) Multiple-target MPDR Separate Cascade (conventional) (2) WPE+MVDR (separate) Multiple-target MVDR Separate Cascade (conventional) (3) WPE+wMPDR (separate) Multiple-target wMPDR Separate Test condition (4) WPE+MPDR (integrated) Single-target MPDR Integrated Test condition (5) Source-packed factorization (conventional) X Multiple-target wMPDR Integrated Jointly optimal (conventional) (6) Source-packed factorization (extended) X Multiple-target wMPDR Integrated Jointly optimal (proposed) (7) Source-wise factorization X Single-target wMPDR Integrated Jointly optimal (proposed) The source-wise factorization approach has additional ben- efits w.r.t. computational efficiency when it is used in specific Derev BF Derev BF scenarios listed below: • The source-wise factorization approach can estimate the (a) Separate optimization (b) Integrated optimization CBF by a closed-form equation when time-varying source Fig. 3. Separate and integrated variance optimization schemes: While separate variances are given, or estimated, e.g., using neural net- variance optimization updates λ for Derev as the variance of Derev output, works [15], [12]. In such a case, we can skip iterative integrated variance optimization updates it as the variance of the beamformer optimization. In contrast, the source-packed factorization output. Consequently, λ for Derev is common to all the sources with separate variance optimization. approach needs to maintain iterations to alternately esti- mate Q and g due to their mutual dependency. • The source-wise factorization approach is advantageous A. Dataset and evaluation metrics when it is combined with neural network-based single target speaker extraction that has recently been actively For the evaluation, we prepared a set of noisy reverberant studied [13]. With this combination, we can skip the es- speech mixtures (REVERB-2MIX) using the REVERB Chal- timation of sources other than the target source, allowing lenge dataset (REVERB) [18]. Each utterance in REVERB us to further reduce the computing cost. contains a single reverberant speech with moderate stationary diffuse noise. For generating a set of test data, we mixed two V. EXPERIMENTS utterances extracted from REVERB, one from its development This section experimentally confirms the effectiveness of set (Dev set) and the other from its evaluation set (Eval set), our proposed joint optimization approaches. Table I summa- so that each pair of mixed utterances was recorded in the same rizes the optimization methods that we experimentally com- room, by the same microphone array, and under the same pared (see Sections V-C and V-D for details of the methods) condition (near or far, RealData or SimData). We categorized in the following three aspects. the test data based on the original categories of the data in REVERB (e.g., SimData or RealData). We created the same 1) Effectiveness of joint optimization number of mixtures in the test data as in the REVERB Eval set, We compared a CBF with and without joint optimiza- such that each utterance in the REVERB Eval set is contained tion in terms of estimation accuracy. The source-wise in one of the mixtures in the test data. Furthermore, the length factorization approach (Table I (7)) is compared with of each mixture in the test data was set at the same as that of the conventional cascade configuration (Table I (1) and the corresponding utterance in the REVERB Eval set, and the (2)), and two additional test conditions (Table I (3) and utterance from the Dev set was trimmed or zero-padded at its (4)). end to be the same length as that of Eval set. 2) Comparison among joint optimization approaches For the experiments in Section V-E, we also prepared a We compared three joint optimization approaches, i.e., set of noisy reverberant speech mixtures, each of which is the source-packed factorization approach with its con- composed of three speaker utterances (REVERB-3MIX). We ventional setting (Table I (5)) and its proposed extension created REVERB-3MIX by adding one utterance extracted (Table I (6)), and the source-wise factorization approach from REVERB Dev set to each mixture in REVERB-2MIX. (Table I (7)), respectively described in Sections III-B1, Only RealData (i.e., real recordings of reverberant data) was III-B2, and III-D, in terms of computational efficiency created for REVERB-3MIX. and estimation accuracy. 3) Evaluation using oracle masks In the experiments, we respectively estimated two or three We used oracle masks instead of estimated masks for speech signals from each mixture for REVERB-2MIX and evaluating a CBF to test the performance of a CBF using REVERB-3MIX and evaluated only one of them correspond- different types of masks and also to obtain its top-line ing to the REVERB Eval set using the baseline evaluation tools performance. provided for it. We selected the signal to be evaluated from all 11 (1) WPE+MPDR (separate) (4) WPE+MPDR (integrated) TABLE II (2) WPE+MVDR (separate) (7) Source-wise factorization BEAMFORMER CONFIGURATIONS USED IN EXPERIMENTS (3) WPE+wMPDR (separate) M L at each freq. range (kHz) #Iterations 24 4.4 0.0-0.8 0.8-1.5 1.5-8.0 Config-1 8 20 16 8 10 Config-2 4 20 16 8 10 4.2 TABLE III WER (%) FOR REALDATA AND CD (DB), FWSSNR (DB), PESQ, AND STOI FOR SIMDATA IN REVERB-2MIX OBTAINED USING DIFFERENT BEAMFORMERS AFTER FIVE ESTIMATION ITERATIONS WITH CONFIG-1. 3.8 SCORES FOR REVERB-2MIX AND REVERB (I.E., SINGLE SPEAKER) WITHOUT ENHANCEMENT (NO ENH), ARE ALSO SHOWN. 3.6 Enhancement method WER CD FWSSNR PESQ STOI No Enh (REVERB-2MIX) 62.49 5.44 1.12 1.12 0.55 2 4 6 8 10 2 4 6 8 10 No Enh (REVERB) 18.61 3.97 3.62 1.48 0.75 #iterations #iterations 6 1.85 MPDR (w/o iteration) 30.79 4.40 3.07 1.45 0.73 MVDR (w/o iteration) 30.89 4.43 3.00 1.44 0.73 1.8 5.5 wMPDR 28.75 3.96 4.46 1.60 0.75 1.75 (1) WPE+MPDR (separate) 23.04 4.30 3.77 1.58 0.77 (2) WPE+MVDR (separate) 23.34 4.34 3.66 1.57 0.76 1.7 (3) WPE+wMPDR (separate) 21.53 3.74 5.42 1.77 0.82 1.65 4.5 (4) WPE+MPDR (integrated) 23.22 4.28 3.66 1.56 0.76 (7) Source-wise factorization 20.03 3.67 5.57 1.80 0.81 1.6 1.55 the estimated speech signals based on the correlation between 3.5 1.5 the separated signals and the original signal in the REVERB 2 4 6 8 10 2 4 6 8 10 Eval set. As objective measures for speech enhancement [48], #iterations #iterations we used the Cepstrum Distance (CD), the Frequency-Weighted Fig. 4. Comparison among joint optimization and cascade configuration Segmental SNR (FWSSNR), the Perceptual Evaluation of approaches when using WPE+MPDR and WPE+wMPDR with integrated and Speech Quality (PESQ), and the Short-Time Objective Intel- separate optimization schemed using Config-1 for REVERB-2MIX. ligibility measure (STOI) [49]. To evaluate the ASR perfor- mance, we used a baseline ASR system for REVERB that was filter followed by an MPDR beamformer (WPE+MPDR), recently developed using Kaldi [50]. This system is composed of a Time-Delay Neural Network (TDNN) acoustic model and a WPE filter followed by an MVDR beamformer trained using lattice-free maximum mutual information (LF- (WPE+MVDR). The first combination is required for jointly MMI) and online i-vector extraction, and a trigram language optimal processing, and the others have been used for the conventional cascade configuration. Second, we compared two model. They were trained on the REVERB training set. different variance optimization schemes shown in Fig. 3: “separate” and “integrated.” With the separate variance opti- B. CBF configurations mization, the iterative estimation of the time-varying variance Table I summarizes two configurations of the CBF examined was performed separately for the WPE filter and for the in experiments including the number of microphones M, the beamformer. This is the scheme used by the conventional filter length L, and the number of optimization iterations. The cascade configuration. In contrast, with the integrated variance sampling frequency was 16 kHz. A Hann window was used optimization, the iterative estimation was performed jointly for for a short-time analysis where the frame length and shift were the WPE filter and the beamformer. A significant difference set at 32 and 8 ms. The prediction delay was set at Δ = 4 for between the two schemes is whether the WPE filter uses the WPE filter. the same variances for all the sources or different variances In the iterative optimization, the time-varying variances of dependent on the sources estimated by the beamformer. the sources were initialized as those of the observed signal for Table III compares WERs, CDs, FWSSNRs, PESQs, and the WPE filter and as 1 for the wMPDR beamformer for all STOIs obtained after five estimation iterations using three the methods. beamformers (MPDR, MVDR, and wMPDR), two conven- tional cascade configuration approaches ((1) WPE+MPDR C. Experiment-1: effectiveness of joint optimization and (2) WPE+MVDR), two test conditions ((3) and (4)), In this experiment, we evaluated the effectiveness of the and a proposed joint optimization approach ((7) source-wise joint optimization focusing on its two characteristics. First, factorization). All methods used configuration Config-1 in we compared three different filter combinations: a WPE filter Table I. Table III shows that 1) WPE+MPDR, WPE+MVDR, followed by a wMPDR beamformer (WPE+wMPDR), a WPE and WPE+wMPDR greatly outperformed MPDR, MVDR, FWSSNR (dB) WER (%) PESQ CD (dB) 12 8 8 6 6 4 4 2 2 0 0 0 0.5 1 1.5 0 0.5 1 1.5 Time (s) Time (s) (a) Observed signal (b) MVDR 8 8 6 6 4 4 2 2 0 0 0 0.5 1 1.5 0 0.5 1 1.5 Time (s) Time (s) (c) WPE+MVDR (d) CBF with source-wise factorization Fig. 5. Spectrogram of (a) a noisy reverberant mixture in RealData of REVERB-2MIX and spectrograms of enhanced signals obtained by (b) MVDR, (c) WPE+MVDR and (d) CBF with source-wise factorization. Mixture is composed of two female speakers under far conditions. and wMPDR, respectively, with all the conditions, 2) the D. Experiment-2: Comparison among joint optimization ap- joint optimization approach, i.e., (7) source-wise factorization, proaches substantially outperformed all the other methods in terms of In this experiment, we compared three joint optimiza- all the measures except for a case in terms of STOI where tion approaches, denoted as (5) Source-packed factorization WPE+wMPDR (separate) gave a slightly better score than (7) (conventional), (6) Source-packed factorization (extended), source-wise factorization. Furthermore, Fig. 4 shows the con- and (7) Source-wise factorization. (5) Source-packed factor- vergence curves of the two cascade configuration approaches, ization (conventional) corresponds to the conventional joint two test conditions, and the joint optimization approach. The optimization technique described in Section III-B1, and (6) source-wise factorization performance (7) was the best of Source-packed factorization (extended) and (7) Source-wise all and improved as the number of iterations increased. The factorization correspond to our proposed methods respectively second best was (3) WPE+wMPDR (separate). The other described in Sections III-B2 and III-D. methods did not improve the scores after the first iteration Figure 6 compares the WERs obtained using the three with both the integrated and separate variance optimization approaches with Config-1 and Config-2. Our proposed meth- schemes. ods, i.e., (6) Source-packed factorization (extended) and (7) Figure 5 shows a spectrogram of a noisy reverberant mix- Source-wise factorization, performed comparably well and ture in RealData of REVERB-2MIX, and spectrograms of both greatly outperformed (5) Source-packed factorization enhanced signals obtained using MVDR, WPE+MVDR, and (conventional). CBF with source-wise factorization. The figure shows that all Table IV compares the computing times required for the the enhancement methods were effective and the CBF with three approaches to estimate and apply the CBFs with ten source-wise factorization was the best of all for achieving estimation iterations for processing a mixture utterance whose denoising, dereverberation, and source separation. length is 9.44 s. The computing time was measured by a The above results clearly show that the two characteristics Matlab interpreter as elapsed time. The computing times of the joint optimization approach, i.e., 1) the optimal combi- for estimating the masks were 0.63 s and 7.2 s with and nation of a WPE filter and a wMPDR beamformer, and 2) the without a GPU (NVIDIA 2080ti), and they are not included integrated variance optimization, are both critical for achieving in the table. As shown in the table, for both configurations, optimal performance. (6) Source-packed factorization (extended) greatly reduced Frequency (kHz) Frequency (kHz) Frequency (kHz) Frequency (kHz) 13 (5) Source-packed (conventional) (7) Source-wise fact. TABLE V WER (%) FOR REALDATA AND CD (DB), FWSSNR (DB), PESQ, AND (6) Source-packed (extended) STOI FOR SIMDATA IN REVERB-2MIX OF ENHANCED SIGNALS 29.5 OBTAINED BASED ON ORACLE MASKS USING DIFFERENT BEAMFORMERS AFTER THREE ESTIMATION ITERATIONS WITH CONFIG-1. SCORES FOR REVERB-2MIX WITH NO ENHANCEMENT (NO ENH) AND THOSE OBTAINED BY APPLYING A WMPDR CBF, WPD [30], TO REVERB (I.E., 28.5 SINGLE SPEAKER), ARE ALSO SHOWN. Enhancement method WER CD FWSSNR PESQ STOI 27.5 No Enh (REVERB-2MIX) 62.49 5.44 1.12 1.12 0.55 27 WPD (REVERB) [30] 8.91 2.59 8.29 2.41 0.91 MPDR (w/o iteration) 20.16 3.53 5.49 1.86 0.84 26.5 MVDR (w/o iteration) 20.32 3.56 5.36 1.84 0.83 19 wMPDR 20.12 3.31 6.11 1.96 0.86 2 4 6 8 10 2 4 6 8 10 (1) WPE+MPDR (separate) 12.89 3.39 6.11 2.10 0.87 #iterations #iterations (2) WPE+MVDR (separate) 12.91 3.32 6.30 2.07 0.87 (a) Config-1 (b) Config-2 (3) WPE+wMPDR (separate) 12.59 3.12 6.84 2.21 0.89 (6) Source-packed fact. 12.23 3.02 7.15 2.33 0.90 Fig. 6. WERs (%) obtained for REVERB-2MIX when jointly optimizing (7) Source-wise fact. 12.23 2.98 7.25 2.32 0.90 WPE+wMPDR based on source-packed factorization (conventional/extended) and source-wise factorization approaches. TABLE IV in REVERB-2MIX using signal components in the observed COMPUTING TIME REQUIRED FOR PROCESSING A MIXTURE UTTERANCE signals. In contrast, we can only calculate the oracle masks OF LENGTH OF 9.44 S IN REVERB-2MIX. COMPUTING TIME WAS MEASURED BY ELAPSED TIME ON A MATLAB INTERPRETER. approximately for RealData because we cannot access the signal components. Thus, we first estimated the desired signals Method Time (s) by applying dereverberation and denoising to utterances in Config-1 Config-2 REVERB, and then calculated the oracle masks using the (4) Source-packed factorization (conventional) 3467 688 (5) Source-packed factorization (extended) 209 33 estimated desired signals for REVERB-2MIX and REVERB- (6) Source-wise factorization 40 23 3MIX. Table V shows WERs, CDs, FWSSNRs, PESQs, and STOIs measured on enhanced signals obtained from REVERB-2MIX the computing time in comparison with (5) Source-packed using various (non-convolutional) beamformers and CBFs factorization (conventional), and (7) Source-wise factorization after three estimation iterations. As a reference, the table further reduced the computing time. also includes previously reported scores denoted by WPD The above results clearly demonstrate the superiority of the (REVERB) [30], which were obtained by applying a wMPDR two proposed approaches over the conventional joint optimiza- CBF, referred to as WPD (see also Section III-C in this paper), tion technique in terms of both computational efficiency and to REVERB, i.e., noisy reverberant single speaker utterances. estimation accuracy. However, Table IV indicates that the pro- In addition, the convergence curves obtained using the CBFs posed approaches still require relatively large computing cost, in terms of WERs for REVERB-2MIX and REVERB-3MIX, e.g., 40 s computing time for processing a 9.44 s utterance and those obtained in terms of CDs, FWSSNRs, PESQs, with Config-1, to obtain the high performance gain shown and STOIs for REVERB-2MIX are respectively shown in in Fig. 6 (a). Future work must address this problem. For Figs. 7 and 8. In all these results, the two joint optimization example, it might be mitigated by setting the goal as extraction approaches, (6) source-packed factorization (extended) and (7) of a single target source. Then, due to the characteristics source-wise factorization, outperformed all the other methods of source-wise factorization, we can omit the estimation of in terms of every measurement. As a whole, almost the same the other sources, and omit the iterative estimation, e.g., tendency was observed in the cases using the estimated masks. when we separately estimate source variances using a neural One exception is that the WERs obtained with the source-wise network. As a reference, the computing time (40 s) in Table factorization tended to increase after a few iterations although III required for the source-wise factorization with Config-1 is such a tendency was not observed in terms of signal distortion roughly reduced to 2.0 s for one iteration per source (namely measures. This means that improvement in the signal level 40 s/10/2), which results in the real-time factor being 0.21 distortion does not necessarily result in improvement in WER, (= 2.0 s/9.44 s). and suggests the importance of optimization by ASR level criteria, similar to conventional beamforming techniques [51], E. Experiment-3: Evaluation using oracle masks [52]. In this experiment, we examined the performance of CBFs using a different type of masks, i.e., oracle masks. An oracle VI. CONCLUDING REMARKS mask, which is the power ratio of the desired signal to the observed signal at each TF point, is calculated using reference This paper presented methods for optimizing a CBF that signals. Oracle masks can be precisely calculated for SimData performs DN+DR+SS based on ML estimation. We introduced WER (%) WER (%) 14 (1) WPE+MPDR (separate) (6) Source-packed (extended) the source-packed factorization approach, and into a set of (2) WPE+MVDR (separate) (7) Source-wise factorization single-target WPE filters followed by wMPDR beamformers (3) WPE+wMPDR (separate) using the source-wise factorization approach. This paper also presented the overall processing flows for both approaches 13.5 based on an assumption that TF masks are provided as auxil- iary inputs. In the flows, the time varying source variances, which are required for ML estimation, can be optimally estimated jointly with the CBF using iterative optimization; the steering vectors of the desired signals, which are required for beamformer optimization, can be reliably estimated based 12.5 on the dereverberated multichannel signals obtained at an optimization step. Experiments using noisy reverberant sound mixtures show that the proposed optimization approaches substantially im- 2 4 6 8 10 2 4 6 8 10 proved the CBF performance in comparison with the conven- #iterations #iterations tional cascade configuration in terms of ASR performance (a) REVERB-2MIX (b) REVERB-3MIX and signal distortion reduction. Our proposed approaches Fig. 7. Comparison of WERs among cascade configuration and joint can also greatly reduce the computing cost with improved optimization approaches using Config-1 for REVERB-2MIX and REVERB- estimation accuracy in comparison with the conventional joint 3MIX. optimization technique. The proposed approaches, however, (1) WPE+MPDR (separate) (6) Source-packed (extended) still result in relatively large computing costs to obtain high (2) WPE+MVDR (separate) (7) Source-wise factorization performance gain. Future work will address this problem. (3) WPE+wMPDR (separate) 7.5 APPENDIX A 3.4 DERIVATION OF EQS. (43) AND (44) We can rewrite Ψ in Eq. (38) using Eq. (36): 3.3 Ψ = X Φ X , (60) t q,t t 3.2 XX H H 1 1 6.5 3.1 (i) (i) = q X q X . (61) t t (i) t i t (i) Using Eq. (33), q X can further be rewritten: 2.9 H H 2 4 6 8 10 2 4 6 8 10 (i) (i) T q X = q I ⊗ x , (62) t M #iterations #iterations 0.91 (i) T 2.35 = q ⊗ x . (63) 0.9 2.3 Substituting the above equation in Eq. (61) yields 2.25 XX H 1 1 ⊤ (i) H (i) T 0.89 Ψ = q ⊗ x q ⊗ x , t t (i) 2.2 t i t 0.88 (64) 2.15 XX 1 1 ⊤ 2.1 (i) (i) H = x x , (65) q q ⊗ 0.87 t (i) 2.05 t i t X ⊤ (i) 2 0.86 (i) (i) = q q ⊗ R . (66) 2 4 6 8 10 2 4 6 8 10 #iterations #iterations Similarly, we can obtain Fig. 8. Comparison of CDs, FWSSNRs, PESQs, and STOIS among cascade 1 H configuration and joint optimization approaches using Config-1 for REVERB- ψ = X Φ x , (67) q t 2MIX. XX 1 1 ⊤ (i) H (i) = q ⊗ x q x , (68) (i) t i t two different approaches for factorizing a CBF, i.e., source- XX 1 1 T (i) H (i) packed and source-wise factorization approaches, and derived = q ⊗ x x q , (69) (i) optimization algorithms for the respective approaches. A CBF t i t can be factorized without loss of optimality into a multiple- (i) (i) (i) = q ⊗ P q . (70) target WPE filter followed by wMPDR beamformers using PESQ WER (%) CD (dB) STOI FWSSNR (dB) WER (%) 15 REFERENCES [23] S. Braun and E. A. P. Habets, “Linear prediction based online dereverberation and noise reduction using alternating Kalman filters,” IEEE/ACM trans. on Audio, Speech, and Language Processing, vol. 26, [1] B. D. V. Veen and K. M. Buckley, “Beamforming: A versatile approach no. 6, pp. 1119–1129, 2018. to spatial filtering,” IEEE ASSP Magazine, vol. 5, no. 2, pp. 4–24, 1988. [24] T. Dietzen, S. Doclo, M. Moonen, and T. van Waterschoot, “Joint multi- [2] H. L. V. Trees, Optimum Array Processing, Part IV of Detection, microphone speech dereverberation and noise reduction using integrated Estimation, and Modulation Theory. New York: Wiley-Interscience, sidelobe cancellation and linear prediction,” in Proc. IWAENC, 2018. [25] T. Yoshioka, T. Nakatani, M. Miyoshi, and H. G. Okuno, “Blind [3] H. Cox, “Resolving power and sensitivity to mismatch of optimum array separation and dereverberation of speech mixtures by joint optimization,” processors,” The Journal of the Acoustical Society of America, vol. 54, IEEE Trans. on Audio, Speech, and Language Processing, vol. 19, no. 1, pp. 771–785, 1973. January 2011. [4] M. Souden, J. Benesty, and S. Affes, “On optimal frequency-domain [26] N. Ito, S. Araki, T. Yoshioka, and T. Nakatani, “Relaxed disjointness multichannel linear filtering for noise reduction,” IEEE Trans. Audio, based clustering for joint blind source separation and dereverberation,” Speech, and Language Processing, vol. 18, no. 2, pp. 260–276, 2007. in Proc. IWAENC, 2014. [5] A. Hyva¨rinen, J. Karhunen, and E. Oja, Independent Component Anal- [27] H. Kagami, H. Kameoka, and M. Yukawa, “Joint separation and dere- ysis. New York: John Wiley & Sons, 2001. verberation of reverberant mixtures with determined multichannel non- [6] T. Kim, H. T. Attias, S.-Y. Lee, and T.-W. Lee, “Blind source separa- negative matrix factorization,” in Proc. IEEE ICASSP, 2018, pp. 31–35. tion exploiting higher-order frequency dependencies,” IEEE Trans. on [28] T. Nakatani, R. Ikeshita, K. Kinoshita, H. Sawada, and S. Araki, “Com- Speech, and Audio Processing, vol. 15, no. 1, pp. 70–79, 2006. putationally efficient and versatile framework for joint optimization of [7] M. Souden, S. Araki, K. Kinoshita, T. Nakatani, and H. Sawada, “A blind speech separation and dereverberation,” in Proc. Interspeech, 2020. multichannel MMSE-based framework for speech source separation [29] Z. Koldovsky and P. Tichavsky´, “Gradient algorithms for complex non- and noise reduction,” IEEE Trans. on Audio, Speech, and Language Gaussian independent component/vector extraction, question of conver- Processing, vol. 21, no. 9, pp. 1913–1928, 2010. gence,” IEEE Trans. on Signal Processing, vol. 67, no. 4, pp. 1050–1064, [8] T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, and B.-H. Juang, “Speech dereverberation based on variance-normalized delayed linear [30] T. Nakatani and K. Kinoshita, “Maximum likelihood convolutional prediction,” IEEE trans. on Audio, Speech, and Language Processing, beamformer for simultaneous denoising and dereverberation,” in Proc. vol. 18, no. 7, pp. 1717–1731, 2010. EUSIPCO, 2019. [9] T. Yoshioka and T. Nakatani, “Generalization of multi-channel linear [31] C. Boeddeker, T. Nakatani, K. Kinoshita, and R. Haeb-Umbach, “Jointly prediction methods for blind MIMO impulse response shortening,” IEEE optimal dereverberation and beamforming,” in Proc. ICASSP, 2020, pp. trans. on Audio, Speech and Language Processing, vol. 20, no. 10, pp. 216–220. 2707–2720, 2012. [32] T. Nakatani, R. Takahashi, T. Ochiai, K. Kinoshita, R. Ikeshita, [10] A. Jukic´, T. van Waterschoot, T. Gerkmann, and S. Doclo, “Multi- M. Declroix, and S. Araki, “DNN-supported mask-based convolutional channel linear prediction-based speech dereverberation with sparse pri- beamforming for simultaneous denoising, dereverberation, and source ors,” IEEE/ACM trans. on Audio, Speech and Language Processing, separation,” in Proc. IEEE ICASSP, 2020. vol. 23, no. 9, pp. 1509–1520, 2015. [33] J. S. Bradley, H. Sato, and M. Picard, “On the importance of early [11] J. Heymann, L. Drude, and R. Haeb-Umbach, “Neural network based reflections for speech in rooms,” The Journal of the Acoustic Sociaty of spectral mask estimation for acoustic beamforming,” in Proc. IEEE America, vol. 113, pp. 3233–3244, 2003. ICASSP, 2016, pp. 196–200. [34] T. Nishiura, Y. Hirano, Y. Denda, and M. Nakayama, “Investigations into [12] K. Kinoshita, M. Delcroix, H. Kwon, T. Mori, and T. Nakatani, “Neural early and late reflections on distant-talking speech recognition toward network-based spectrum estimation for online wpe dereverberation,” in suitable reverberation criteria,” in Proc. Interspeech, 2007, pp. 1082– Proc. Interspeech, 2017, pp. 384–388. [13] K. Zmol´ıkova´, M. Delcroix, K. Kinoshita, T. Ochiai, T. Nakatani, [35] Y. Avargel and I. Cohen, “On multiplicative transfer function approxima- L. Burget, and J. Cernocky´, “SpeakerBeam: Speaker aware neural tion in the short-time fourier transform domain,” IEEE Signal Processing network for target speaker extraction in speech mixtures,” IEEE Journal Letters, vol. 14, pp. 337–340, 2007. of Selected Topics in Signal Processing, vol. 13, no. 4, pp. 800–814, [36] I. Cohen, “Relative transfer function identification using speech signals,” IEEE Trans. on Speech, and Audio Processing, vol. 12, no. 5, pp. 451– [14] T. Yoshioka, H. Erdogan, Z. Chen, X. Xiao, , and F. Alleva, “Recogniz- 459, 2004. ing overlapped speech in meetings: A multichannel separation approach [37] T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, and B. H. Juang, using neural networks,” in Proc. Interspeech, 2018. “Blind speech dereverberation with multi-channel linear prediction based [15] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, “A regression approach to on short time Fourier transform representation,” in Proc. IEEE ICASSP, speech enhancement based on deep neural networks,” IEEE/ACM trans. 2008, pp. 85–88. on Audio, Speech, and Language Processing, vol. 23, no. 1, 2015. [38] T. Hori, S. Araki, T. Yoshioka, M. Fujimoto, S. Watanabe, T. Oba, [16] J. R. Hershey, Z. Chen, J. L. Roux, and S. Watanabe, “Deep clustering: A. Ogawa, K. Otsuka, D. Mikami, K. Kinoshita, T. Nakatani, A. Naka- Discriminative embeddings for segmentation and separation,” in Proc. mura, and J. Yamato, “Low-latency real-time meeting recognition and IEEE ICASSP, 2016, pp. 31–35. understanding using distant microphones and omni-directional camera,” [17] M. Kolbæk, D. Yu, Z.-H. Tan, and J. Jensen, “Multitalker speech IEEE Trans. on Audio, Speech, and Language Processing, vol. 20, no. 2, separation with utterance-level permutation invariant training of deep pp. 499–513, 2011. recurrent neural networks,” IEEE Trans. Audio, Speech, and Language [39] R. Ikeshita, N. Ito, T. Nakatani, and H. Sawada, “Independent low-rank Processing, pp. 1901–1913, 2017. matrix analysis with decorrelation learning,” in IEEE WASPAA, 2019. [18] K. Kinoshita, M. Delcroix, S. Gannot, E. A. P. Habets, R. Haeb-Umbach, [40] T. Nakatani and K. Kinoshita, “Simultaneous denoising and dereverber- W. Kellermann, V. Leutnant, R. Maas, T. Nakatani, B. Raj, A. Sehr, and ation for low-latency applications using frame-by-frame online unified T. Yoshioka, “A summary of the REVERB challenge: State-of-the-art convolutional beamformer,” in Proc. Interspeech, 2019. and remaining challenges in reverberant speech processing research,” [41] B. J. Cho, J. Lee, and H. Park, “A beamforming algorithm based on EURASIP Journal on Advances in Signal Processing, 2016. maximum likelihood of a complex Gaussian distribution with time- [19] J. Barker, R. Marxer, E. Vincent, and S. Watanabe, “The third ‘CHiME’ varying variances for robust speech recognition,” IEEE Signal Process- speech separation and recognition challenge: Dataset, task and base- ing Letters, vol. 26, no. 9, pp. 1398–1402, August 2019. lines,” in Proc. IEEE ASRU-2015, 2015, pp. 504–511. [42] T. Nakatani and K. Kinoshita, “A unified convolutional beamformer for [20] N. Kanda, C. Boeddeker, J. Heitkaemper, Y. Fujita, S. Horiguchi, simultaneous denoising and dereverberation,” IEEE Signal Processing K. Nagamatsu, and R. Haeb-Umbach, “Guided source separation meets Letters, vol. 26, no. 6, pp. 903–907, April 2019. a strong asr backend: Hitachi/Paderborn university joint investigation for [43] F. Bahmaninezhad, J. Wu, R. Gu, S.-X. Zhang, Y. Xu, M. Yu, and D. Yu, dinner party ASR,” in Proc. Interspeech, 2019. “A comprehensive study of speech separation: spectrogram vs waveform [21] R. Haeb-Umbach, S. Watanabe, T. Nakatani, M. Bacchiani, B. Hoffmeis- separation,” in Interspeech, 2019. ter, M. Seltzer, H. Zen, and M. Souden, “Speech processing for digital [44] Y. Luo and N. Mesgarani, “Conv-TasNet: Surpassing ideal time- home assistants,” IEEE Signal Processing Magazine, 2019. frequency magnitude masking for speech separation,” IEEE/ACM Trans. [22] M. Togami, “Multichannel online speech dereverberation under noisy on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256– environments,” in Proc. EUSIPCO, 2015, pp. 1078–1082. 1266, 2019. 16 [45] T. Robinson, J. Fransen, D. Pye, J. Foote, and S. Renals, “WSJCAMO: A British English speech corpus for large vocabulary continuous speech recognition,” in Proc. IEEE ICASSP, 1995, pp. 81–84. [46] N. Ito, S. Araki, M. Delcroix, and T. Nakatani, “Probabilistic spatial dictionary based online adaptive beamforming for meeting recognition in noisy and reverberant environments,” in Proc. IEEE ICASSP, 2017, pp. 681–685. [47] S. Markovich-Golan, S. Gannot, and I. Cohen, “Multichannel eigenspace beamforming in a reverberant noisy environment with multiple interfer- ing speech signals,” IEEE Trans. ASLP, vol. 17, no. 6, pp. 1071–1086, [48] Y. Hu and P. C. Loizou, “Evaluation of objective quality measures for speech enhancement,” IEEE Tran. Audio, Speech, and Language Processing, vol. 16, no. 1, pp. 229–238, 2008. [49] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “An algorithm for intelligibility prediction of timefrequency weighted noisy speech,” IEEE Trans. Audio, Speech, and Language Processing, vol. 19, no. 7, [50] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stem- mer, and K. Vesely, “The Kaldi speech recognition toolkit,” in Proc. IEEE ASRU, 2011. [51] J. Heymann, L. Drude, C. Boeddeker, P. Hanebrink, and R. Haeb- Umbach, “Eamnet: End-to-end training of a beamformer-supported multi-channel ASR system,” in Proc. IEEE ICASSP, 2017. [52] A. S. Subramanian, X. Wang, M. K. Baskar, S. Watanabe, T. Taniguchi, D. Tran, and Y. Fujita, “Speech enhancement using end-to-end speech recognition objectives,” in Proc. IEEE WASPAA, 2019. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Electrical Engineering and Systems Science arXiv (Cornell University)

Loading next page...
 
/lp/arxiv-cornell-university/jointly-optimal-denoising-dereverberation-and-source-separation-ZTSaeNqa0V

References (54)

ISSN
2329-9290
eISSN
ARCH-3348
DOI
10.1109/TASLP.2020.3013118
Publisher site
See Article on Publisher Site

Abstract

c 2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works. Jointly optimal denoising, dereverberation, and source separation Tomohiro Nakatani, Senior Member, IEEE, Christoph Boeddeker, Student Member, IEEE, Keisuke Kinoshita, Senior Member, IEEE, Rintaro Ikeshita, Member, IEEE, Marc Delcroix, Senior Member, IEEE, Reinhold Haeb-Umbach, Fellow, IEEE Abstract—This paper proposes methods that can optimize a acquired signal. For performing denoising (DN), beamform- Convolutional BeamFormer (CBF) for jointly performing denois- ing techniques have been investigated for decades [1], [2], ing, dereverberation, and source separation (DN+DR+SS) in a [3], [4], and the Minimum Variance Distortionless Response computationally efficient way. Conventionally, a cascade config- (MVDR) beamformer and the Minimum Power Distortionless uration, composed of a Weighted Prediction Error minimization Response (MPDR) beamformer, are now widely used as state- (WPE) dereverberation filter followed by a Minimum Variance Distortionless Response (MVDR) beamformer, has been used as of-the-art techniques. For source separation (SS), a number the state-of-the-art frontend of far-field speech recognition, even of blind signal processing techniques have been developed, though this approach’s overall optimality is not guaranteed. In including independent component analysis [5], independent the blind signal processing area, an approach for jointly opti- vector analysis [6], and spatial clustering-based beamforming mizing dereverberation and source separation (DR+SS) has been [7]. For dereverberation (DR), a Weighted Prediction Error proposed; however, it requires huge computing cost, and has not been extended for applications to DN+DR+SS. To overcome the minimization (WPE) based linear prediction technique [8], above limitations, this paper develops new approaches for jointly [9] and its variants [10] have been actively studied as an optimizing DN+DR+SS in a computationally much more efficient effective approach. With these techniques, for determining the way. To this end, we first present an objective function to optimize coefficients of filtering, it is crucial to accurately estimate a CBF for performing DN+DR+SS based on maximum likelihood such statistics of the speech signals and the noise as their estimation on an assumption that the steering vectors of the target signals are given or can be estimated, e.g., using a neural spatial covariances and time-varying variances. However, the network. This paper refers to a CBF optimized by this objective estimation often becomes inaccurate when the signals are function as a weighted Minimum-Power Distortionless Response mixed under reverberant and noisy conditions, which seriously (wMPDR) CBF. Then, we derive two algorithms for optimizing a degrades the performance of these techniques. wMPDR CBF based on two different ways of factorizing a CBF To enhance the robustness of the above techniques, neural into WPE filters and beamformers: one based on an extension of the conventional joint optimization approach proposed for network-supported microphone array speech enhancement has DR+SS and another based on a novel technique. Experiments been actively studied, and its effectiveness has been iden- using noisy reverberant sound mixtures show that the proposed tified for denoising [11], dereverberation [12], and source optimization approaches greatly improve the performance of the separation [13], [14]. With this approach, neural networks speech enhancement in comparison with the conventional cascade estimate such statistics of the signals and noise as Time- configuration in terms of signal distortion measures and ASR performance. The proposed approaches also greatly reduce the Frequency (TF) masks and time-varying variances [13], [15], computing cost with improved estimation accuracy in comparison [16], [17], while microphone array signal processing performs with the conventional joint optimization approach. speech enhancement. This combination is particularly effective Index Terms—Beamforming, dereverberation, source separa- because neural networks can successfully capture the spectral tion, microphone array, automatic speech recognition, maximum patterns of signals over wide TF ranges and reliably estimate likelihood estimation such statistics of the signals. Conventional signal processing often fails to adequately handle them. On the other hand, neural networks often introduce into the processed signal I. INTRODUCTION nonlinear distortions, which are harmful to perceived speech When a speech signal is captured by distant microphones, quality and ASR. This problem can be avoided by microphone e.g., in a conference room, it often contains reverberation, dif- array techniques. A number of articles have reported the fuse noise, and extraneous speakers’ voices. These components usefulness of this combination, particularly for far-field ASR, are detrimental to the intelligibility of the captured speech e.g., at the REVERB challenge [18] and the CHiME-3/4/5 and often cause serious degradation in many applications challenges [19], [20]. such as hands-free teleconferencing and Automatic Speech Despite the success of neural network-supported micro- Recognition (ASR). phone array speech enhancement, how to optimally combine Microphone array speech enhancement has been scrutinized individual microphone array techniques for simultaneously to minimize the aforementioned detrimental effects in the performing denoising, dereverberation, and source separation (DN+DR+SS) in a computationally efficient way remains T. Nakatani, K. Kinoshita, R. Ikeshita, and M. Delcroix are with NTT inadequately investigated. For example, for denoising and Corporation. C. Boeddeker and R. Haeb-Umbach are with Paderborn Univ. Manuscript received January 1, 2020; revised XXXX XX, 2020. dereverberation (DN+DR), the cascade configuration of a arXiv:2005.09843v3 [eess.AS] 2 Aug 2020 2 WPE filter followed by a MVDR/MPDR beamformer has been computational efficiency. An additional benefit of source-wise widely used as the state-of-the-art frontend, e.g., at the far- factorization is that it can be used, without loss of optimality field ASR challenges [18], [19], [20], [21]. However, since for the extraction of a single target source from a sound the WPE filter and the beamformer are separately optimized, mixture, which is now an important application area of speech the overall optimality of this approach is not guaranteed. enhancement [13], [29]. To optimally perform DN+DR, several techniques have been Experiments based on noisy reverberant sound mixtures proposed using a Kalman filter [22], [23], [24]. A technique, created using the REVERB Challenge dataset [18] show that called Integrated Sidelobe Cancellation and Linear Prediction the proposed optimization approaches substantially improve (ISCLP) [24], optimizes an integrated filter that can cancel the DN+DR+SS performance in comparison to the conven- noise and reverberation from the observed signals using a side- tional cascade configuration in terms of ASR performance and lobe cancellation framework. With this technique, however, signal distortion reduction. These two proposed approaches the steering vector of the target signal needs to be directly can also greatly reduce the computing cost with improved estimated in advance from noisy reverberant speech, which is estimation accuracy in comparison with the conventional joint challenging and limits the overall estimation accuracy. In the optimization approach. blind signal processing area, on the other hand, a technique Certain parts of this paper have already been presented that jointly optimizes a pair comprised of a WPE filter and a in our recent conference papers. The ML formulation for beamformer has been proposed for dereverberation and source optimizing a CBF was derived for DN+DR [30]. Another work separation (DR+SS) under noiseless conditions [25], [26], [31] argued that a CBF for DN+DR can be factorized into a [27]. One advantage of this approach is that we can access WPE filter and a wMPDR (non-convolutional) beamformer, multichannel dereverberated signals obtained as the output and jointly optimized without loss of optimality. Another of the WPE filter during the optimization, and utilize them work [32] presented ways to reliably estimate TF masks for to reliably estimate the beamformer. However, this approach DN+DR+SS. This paper integrates these techniques to perform requires 1) huge computing cost for the optimization, and 2) DN+DR+SS in a computationally efficient way. has not been extended for application to DN+DR+SS. In the remainder of this paper, the models of the observed To overcome the above limitations, this paper develops signal and the CBF are defined in Section II. Then, Section III algorithms for optimizing a Convolutional BeamFormer (CBF) presents our proposed optimization methods, and Section IV that can perform DN+DR+SS in a computationally much more summarizes their characteristics and advantages. Sections V and VI describe experimental results and concluding remarks. efficient way. A CBF is a filter that is applied to a multichannel observed signal to yield the desired output signals. For CBF optimization, this paper first presents a common objective II. MODELS OF SIGNAL AND BEAMFORMER function based on the Maximum Likelihood (ML) criterion This paper assumes that I source signals are captured by by assuming that the steering vectors of the desired signals M(≥ I) microphones in a noisy reverberant environment. The are given, or can be estimated. This paper refers to a CBF captured signal at each TF point in the short-time Fourier optimized by this objective function as a weighted MPDR transform (STFT) domain is modeled by (wMPDR) CBF. After showing that a CBF can be factorized into WPE filter(s) and beamformer(s) in two different ways, I (i) we derive two different algorithms for optimizing the wMPDR x = x + n , (1) t,f t,f t,f CBF, based on the CBF factorization ways. The first approach, i=1 (i) (i) (i) called source-packed factorization, is an extension of the x = d + r , (2) t,f t,f t,f conventional joint optimization technique proposed for DR+SS where t and f are time and frequency indices, respectively, [25], [26], [27]. We first show that its direct application to ⊤ M×1 x = [x , . . . , x ] ∈ C is a column vec- DN+DR+SS suffers from serious problems in terms of the t,f 1,t,f M,t,f tor containing all the microphone signals at a TF point. computational efficiency and estimation accuracy and present (i) Here, (·) denotes the non-conjugate transpose. x = an extension for solving them. The second approach, called t,f (i) (i) source-wise factorization, is based on a novel factorization [x , . . . , x ] is a (noiseless) reverberant signal cor- 1,t,f M,t,f technique that factorizes a CBF into a set of sub-filter pairs, responding to the ith source, and n = [n , . . . , n ] t,f 1,t,f M,t,f (i) each of which is composed of a WPE filter and a beam- is the additive diffuse noise. x for each source in Eq. (1) t,f former, and independently estimates each source. For both is further decomposed into two parts in Eq. (2), one of which approaches, we also present a method that robustly estimates consists of the direct signal and early reflections, referred the steering vectors of the desired signals during the wMPDR (i) to as desired signal d , and the other corresponds to late t,f CBF optimization using the output of the WPE filters. A (i) reverberation r . Hereafter, the frequency indices of the t,f neural network-supported TF-mask estimation technique is symbols are omitted for brevity, assuming that each frequency also incorporated to estimate the steering vectors. Although bin is processed independently in the same way. both approaches work comparably well in terms of estimation (i) In this paper, the goal of DN+DR+SS is to estimate d accuracy, source-wise factorization has advantages in terms of (i) for each source i from x in Eq. (1) by reducing r of (i ) source i, x of all the other sources i 6= i, and diffuse noise Note that the proposed techniques can also be applied to conventional t blind signal processing for DR+SS, as discussed in an article [28]. n . Since in noisy reverberant environments, early reflections t … Mulple- Beamformer Convoluonal target matrix for beamformer for dereverberaon, linear separaon denoising, and predicon and source separaon (LP) denoising (a) MIMO CBF (b) MIMO CBF with source-packed factorization (1) Convoluonal (1) (1) Single-target Beamformer beamformer for = 1 for = 1 LP for = 1 ( ) Convoluonal ( ) Beamformer ( ) Single-target beamformer for = for = LP for = (c) Set of MISO CBFs (d) MISO CBFs with source-wise factorization Fig. 1. Multi-Input Multi-Output (MIMO) CBF and its three different implementations. They are equivalent to each other in the sense that whatever values are set to coefficients of one implementation, certain coefficients of the other implementations can be determined such that they realize identical input-output relationships. Thus, optimal solutions of all implementations are identical as long as they are optimized based on the same objective function. (i) enhance the intelligibility of speech for human perception In this paper, we further assume that d is statistically [33] and improve the ASR performance by computer [34], independent of the following variables: (i) (i) we include them in the desired signal. Hereafter, we use ′ • s ′ for t ≤ t − Δ (and thus d is statistically m = 1 as a reference microphone and describe a method (i) independent of x for t ≤ t− Δ), (i) for estimating desired signal d at the microphone without (i) 1,t ′′ • r for t ≤ t, ′′ loss of generality. ′ (i ) ′ ′ (i) • x and n ′ for all t, t and i 6= i. To achieve the above goal, we further model d : These assumptions are used to derive the optimization algo- (i) (i) (i) (i) (i) d = v s = v ˜ d , (3) rithms described in the following. t t 1,t (i) where s is the ith clean speech at a TF point. In Eq. (3), the (i) (i) (i) A. Definition of a CBF and its three different implementations desired signal of the ith source, d , is modeled by v s , t t i.e., a product in the STFT domain of the clean speech with We now define a CBF, which will later br factorized into (i) transfer function v , hereafter a steering vector, assuming WPE filter(s) and beamformer(s): that the duration of the impulse response corresponding to L−1 the direct signal and early reflections in the time domain is H H y = W x + W x , (6) t t t−τ 0 τ sufficiently short in comparison with the analysis window [35]. τ=Δ (i) (i) We further rewrite the desired signal as v ˜ d , i.e., a product 1,t (i) (i) (i) (1) (I) ⊤ I×1 where y = [y , . . . , y ] ∈ C is the output of the of the desired signal at reference microphone d = v s 1,t 1 t t t CBF corresponding to the estimates of I desired signals, with a Relative Transfer Function (RTF) [36], which is defined M×I W ∈ C for each τ ∈ {0, Δ, Δ+1, . . . , L−1} is a matrix as the steering vector divided by its reference microphone element, composed of the beamformer coefficients, (·) denotes a conjugate transpose, and Δ is the prediction delay of CBF. We (i) (i) (i) v ˜ = v /v . (4) 1 set Δ equal to the mixing time introduced in Eq. (5), so that the desired signals are included only in the first term of Eq. (6) In contrast, assuming that the duration of the late reverber- and are statistically independent of the second term based on ation in the time domain exceeds the analysis window, late the assumptions introduced in the signal model. Then this (i) reverberation r is modeled by a convolution in the STFT paper performs DN+DR+SS by estimating the beamformer domain [37] of the clean speech with a time series of acoustic coefficients that can estimate the desired signals included in transfer functions that corresponds to the late reverberation: the first term of Eq. (6). L −1 a For notational simplicity, we also introduce a matrix repre- (i) (i) (i) r = a s , (5) sentation of a CBF: t τ t−τ τ Δ W x 0 t (i) (i) (i) y = , (7) ⊤ t where a = [a , . . . , a ] for τ ∈ {Δ, . . . , L − 1} x τ a W 1,τ M,τ t are the convolutional acoustic transfer functions, and Δ is the mixing time, which represents the relative frame delay of the See a previous work [8] for more precise discussion of the statistical (i) late reverberation start time to the direct signal. independence between d and s ′ for t ≤ t − Δ. t 4 where W is a matrix containing W for Δ ≤ τ ≤ L− 1 and For example, MISO CBFs were previously used [30], [39]. x is a column vector containing past multichannel observed ISCLP [24] can also be viewed as the realization of a MISO signals x for Δ ≤ τ ≤ L− 1: CBF using a sidelobe cancellation framework [40]. t−τ 3) Source-wise factorization: With the source-wise factor- ⊤ ⊤ M(L−Δ)×I W = W , . . . ,W ∈ C , (8) Δ L−1 ization shown in Fig. 1 (d), we further factorize each MISO ⊤ ⊤ M(L−Δ)×1 CBF defined in Eq. (14) for source i: x = x , . . . ,x ∈ C . (9) t−Δ t−L+1 " # " # (i) Hereafter, we refer to the CBF defined by Eqs. (6) and (7) as I (i) = q , (15) (i) (i) a MIMO CBF. w −G In the following, we further present three different im- (i) (i) M×1 M(L−Δ)×M plementations of CBF, including two ways of factorizing it. where q ∈ C and G ∈ C . Then, Eq. (14) Figure 1 illustrates the MIMO CBF and its three different can be rewritten as a pair of a linear prediction filter and a implementations. beamformer: 1) Source-packed factorization: With the implementation (i) (i) shown in Fig. 1 (b), we directly factorize the MIMO CBF in z = x − G x , (16) t t Eq. (7): (i) (i) (i) y = q z , (17) t t W I 0 M = Q, (10) W −G (i) (i) M×1 where z ∈ C and G are the output and the prediction M×I M(L−Δ)×M M×M (i) where Q ∈ C , G ∈ C , and I ∈ R is M matrix of the linear prediction, and q is the beamformer’s an identity matrix. Then Eq. (6) can be rewritten as a pair of coefficient vector. Because Eq. (16) is performed only to esti- a (convolutional) linear prediction filter followed by a (non- mate the ith source, it is called single-target linear prediction. convolutional) beamformer matrix: 4) Relationship between two factorization approaches: The H difference between the two factorization approaches, namely z = x − G x , (11) t t t Figs. 1 (b) and (d), is based only on how the linear prediction is y = Q z . (12) t t performed: Eq. (11) or Eq. (16). More specifically, it is based (i) M×1 on whether the prediction matrices, G and G , are common Here z ∈ C and G are the output and the prediction to all the sources or different over different sources. Therefore, matrix of the linear prediction, and Q is the coefficient matrix different optimization algorithms with different characteristics of the beamformer. Eq. (11), which is supposed to derever- are derived, as will be shown in Section III. In contrast, berate all the sources at the same time, is thus referred to as (i) the beamformer parts, Q and q in Eqs. (12) and (17) a multiple-target linear prediction, and Eq. (12) is supposed (i) are identical in the two approaches, viewing q as the ith to perform denoising and source separation at the same time. column of Q, because they satisfy W = Q in Eq. (10) and Because individual sources are not distinguished in the WPE (i) (i) w = q in Eq. (15). filter’s output, this implementation is called source-packed In addition, it should be noted that all the above CBF factorization. implementations are equivalent to each other in the sense One example of source-packed factorization is the cascade that whatever values are set to the coefficients of one imple- configuration composed of a WPE filter followed by a beam- mentation, certain coefficients of the other implementations former, which has been widely used for DN+DR+SS in the can be determined such that they realize the same input- far-field speech recognition area [14], [20], [38], and the other output relationship. Thus, the optimal solutions of all the example is one used in the joint optimization of a WPE filter implementations are identical as long as they are based on and a beamformer, which has been investigated for DR+SS in the same objective function. the blind signal processing area [25], [26], [27]. 2) Multi-Input Single-Output (MISO) CBF: Next we define the set of MISO CBFs shown in Fig. 1 (c). They were obtained III. ML ESTIMATION OF CBF by decomposing the beamformer coefficients in Eq. (7): In this section, we derive two different optimization algo- " # (1) (2) (I) rithms using (b) source-packed factorization and (d) source- 0 w w . . . w 0 0 0 = , (13) wise factorization. For the derivations, we assume that the (1) (2) (I) w w . . . w (i) RTFs v ˜ and the time-varying variances of the output signals (i) (i) M×1 (i) M(L−Δ)×1 yielded by the optimal CBF, denoted by λ , are given. where w ∈ C and w ∈ C are column Then in Section III-E, we describe ways for jointly estimating vectors, which respectively contain the ith columns of W (i) λ with CBF coefficients based on the ML criterion and and W; they are used to extract the ith desired signal. Then, (i) estimating v ˜ based on the WPE filter’s output obtained at a Eq. (7) can be rewritten for each source i: " # step of the optimization. (i) w x (i) t y = . (14) (i) w t A. Probabilistic model First, we formulate the objective function for DN+DR+SS The existence of G, which satisfies W = −GQ, is guaranteed for any W when M ≥ I and rank{Q} = I. by reinterpreting the objective function proposed for DN+DR 5 ⊤ ⊤ ⊤ [30]. For this formulation, we interpret DN+DR+SS to be where n = [n , . . . ,n ] . According to the statistical t−Δ t−L+1 (i) composed of a set of separate processing steps, each of independence assumptions introduced in Section II, d is sta- 1,t which applies DN+DR to enhance source i by reducing the (i) (i ) tistically independent of rˆ , xˆ , and n ˆ . Then substituting t t late reverberation of the source (DR) and the additive noise Eq. (21) into Eq. (19) and omitting the constant terms, we including the other sources and the diffuse noise (DN). With obtain the following (in the expectation sense): this interpretation, we introduce the following assumptions, similar to the previous work [30]: (i) (i ) E rˆ + xˆ + n ˆ T ′ t n o t t (i) i 6=i • The output of the optimal CBF for each i, namely y , (i) E L (θ ) = . (i) follows a zero-mean complex Gaussian distribution with T t=1 t (i) (i) (25) time-varying variance λ = E y [8]. t t • The beamformer satisfies a distortionless constraint for The above equation indicates that minimization of the objec- (i) (i ) (i) each source i defined using RTF v ˜ in Eq. (4): tive function indeed minimizes the sum of rˆ , xˆ for i 6= i, t t and n ˆ in Eq. (21). H H t (i) (i) (i) (i) w v ˜ = 1 or q v ˜ = 1 . (18) Before deriving the optimization algorithms, we define a matrix that is frequently used in the derivation, referred to Then based on the previous discussion [30], we can approx- as a variance-normalized spatio-temporal covariance matrix. imately derive the objective function to be minimized for Letting x be a column vector composed of the current and (i) estimating the CBF coefficients for source i, e.g., θ = past observed signals at all the microphones, defined as (i) (i) {w ,w }, according to ML estimation: ⊤ ⊤ M(L−Δ+1)×1   x = x ,x ∈ C , (26) 2 t t t (i)   (i) (i) (i) (i) the matrix is defined: L (θ ) = + log λ s.t. w v ˜ = 1. i   t 0 (i) t=1 t 1 x x (i) t t M(L−Δ+1)×M(L−Δ+1) R = ∈ C . (27) (19) (i) t=1 t The objective function for estimating all the sources can then Its factorized form is also defined: be obtained by summing Eq. (19) over all the sources:   I (i) (i) R P (i) (i) x x (i) (i)   ˜ R = , (28) L (Θ) = L (θ ), s.t. w v = 1 for all i, (20) i x (i) (i) P R i=1 (1) (I) where Θ = θ , . . . , θ . This objective function is used where commonly for all the implementations of a CBF. In this paper, 1 x x we call a CBF optimized by the above objective function (i) t M×M R = ∈ C , (29) (i) a weighted MPDR (wMPDR) CBF because it minimizes t=1 t (i) the average power of output y weighted by time-varying 1 x x (i) t (i) t M(L−Δ)×M variance, λ , of the signal. P = ∈ C , (30) (i) Here, let us briefly explain how DN+DR+SS is performed t=1 t by Eqs. (19) and (20). Substituting Eqs. (1) and (2) into H (i) 1 x x t M(L−Δ)×M(L−Δ) R = ∈ C . (31) Eq. (14) and using the model of the desired signal in Eq. (3) (i) t=1 t and the distortionless constraint in Eq. (18), we obtain (i) (i) (i) (i ) y = d + rˆ + xˆ + n ˆ , (21) t t t 1,t B. Optimization based on source-packed factorization i 6=i This subsection discusses methods for optimizing a CBF (i) (i ) where rˆ , xˆ for i 6= i, and n ˆ are respectively the late t with the source-packed factorization. In the following, after t t reverberation of the ith source, all the other sources, and the describing a method for directly applying the conventional additive diffuse noise remaining in the CBF output, written in joint optimization technique used for DR+SS to DN+DR+SS, MISO CBF form: we summarize the problems in it, and present the solutions to " # " # the problems. (i) (i) (i) w 1) Direct application of a conventional technique: With the rˆ = , (22) t (i) (i) w x t source-packed factorization in Eqs. (11) and (12), simultane- " # " # H ′ ously estimating both Q and G in closed form is difficult (i ) (i) (i ) w x (i) t (i) xˆ = ′ , (23) even when both λ and v are given. Instead, we use an t t (i) (i ) iterative and alternate estimation scheme, following a blind " # signal processing technique [25], [26], [27], where at each (i) w n n ˆ = , (24) estimation step, either Q or G is updated and the other is (i) w t fixed. 6 (i) For updating G, we fix Q at its previously estimated Then q , which minimizes Eq. (40) under the distortionless (i) (i) value. For the algorithm derivation, the representation of linear constraint q v ˜ = 1, can be obtained: prediction in Eq. (11) is slightly modified: −1 (i) (i) R v ˜ (i) z = x − X g, (32) t t t q = . (42) −1 (i) (i) (i) v ˜ R v ˜ where X and g are equivalent to x and G with a modified t t matrix structure defined: Because the above beamformer minimizes the average power of z weighted by the time-varying variance, we call it ⊤ M×M (L−Δ) X = I ⊗ x ∈ C , (33) t M a weighted MPDR (wMPDR) beamformer . As shown in H 2 Section III-C, a wMPDR beamformer is a special case of a ⊤ ⊤ M (L−Δ)×1 g = g , . . . ,g ∈ C , (34) 1 M wMPDR CBF, which is reduced to a wMPDR beamformer when setting the length of the CBF L = 1, i.e., by just where ⊗ is a Kronecker product and g is the mth column converting it into a non-convolutional beamformer. of G. Then, considering that the CBF in Eqs. (11) and (12) (i) The above algorithm, however, has two serious problems. (i) can be written as y = q x − X g and omitting t t First, the size of the covariance matrix in Eq. (38) is too the normalization terms, the objective function in Eq. (20) large, requiring huge computing cost for calculating it and becomes its inverse. Second, as shown in our experiments, the iterative and alternate estimation of Q and G tends to converge to 1 2 L (g) = x − X g , (35) g t t a sub-optimal point. This is probably because the update q,t t=1 of G is performed based only on the output of the fixed beamformer in the iterative and alternate estimation, as in where kxk = x Rx, and Φ is a semi-definite Hermitian q,t Eq. (19); the signal dimension of the beamformer output, i.e., matrix: I, is reduced from that of the original signal space, i.e., M, I with the over-determined case, i.e., I < M. As a consequence, (i) (i) q q M×M signal components that are relevant for the update of G may Φ = ∈ C . (36) q,t (i) i=1 t be reduced in the beamformer output, especially when the estimation of Q is less accurate at the early stage of the Because Eq. (35) is a quadratic form with a lower bound, g, optimization. This can seriously degrade the update of G. which minimizes it, can be obtained: 2) Proposed extension: Next we present two techniques to mitigate the above problems within the source-packed g = Ψ ψ, (37) factorization approach. The first reduces the computing cost. 1 2 2 M (L−Δ)×M (L−Δ) As shown in Appendix A, Eqs. (38) and (39) can be rewritten, Ψ = X Φ X ∈ C , (38) q,t t t using Eq. (28): 1 H 2 M (L−Δ)×1 ψ = X Φ x ∈ C , (39) X q,t t H ⊤ (i) (i) (i) Ψ = q q ⊗ R , (43) i=1 where (·) is the Moore-Penrose pseudo-inverse. Since the I (i) (i) (i) rank of Ψ is equal to or smaller than MI(L−Δ), as shown in ψ = q ⊗ P q , (44) Section III-B2, Ψ is rank deficient for over-determined cases, i=1 namely when M > I, and thus the use of the pseudo-inverse is where () denotes the complex conjugate. In the above equa- indispensable. Eqs. (37) to (39) are equivalent to those used in (i) tions, the majority of the calculation is coming from R . the dereverberation step for DR+SS [25], [26], [27] except that Because the size of the matrix is much smaller than that in our paper denoising is additionally included in the objective of Ψ, we can greatly reduce the computing cost with this and over-determined cases are also considered. We call this a modification in comparison with the direct calculation of multiple-target WPE filter. Eqs. (38) and (39). Although we still need to calculate the For the update of Q, fixing g at its previously estimated inverse of huge matrix Ψ even with this modification, the cost value, the objective in Eq. (20) can be rewritten: is relatively small in comparison with the direct calculation of Ψ. Note that Eq. (43) also shows the rank of Ψ to be equal X 2 H (i) (i) (i) to or smaller than MI(L− Δ). L (Q) = q s.t. q v ˜ = 1, (40) (i) The second technique introduces a heuristic to improve the i=1 update of the WPE filter. To use a whole M-dimensional signal (i) where R is a variance-normalized spatial covariance matrix A wMPDR beamformer was also called a Maximum-Likelihood Distor- of the output of the multiple-target WPE filter, calculated as tionless Response (MLDR) beamformer [41]. In general, the computational complexity of a matrix multiplication 1 z z t exceeds O(n ). Because the size of Ψ is M-times larger than R , the t x (i) R = . (41) computational complexity for calculating Ψ is probably at least M times (i) t=1 t larger than that for calculating R . x 7 (i) space to be considered for the update, we modify the CBF to to the RTF v ˜ with zero padding. Finally, we obtain the output not only I desired signals, but also M − I auxiliary solution: signals that are included in orthogonal complement Q of −1 (i) (i) R v Q and model the auxiliary signals as zero-mean time-varying (i) w = . (51) −1 complex Gaussians. With this modification, the optimization is (i) (i) (i) v R performed by calculating the summation in Eqs. (43) and (44) (I+1) (M) over both 1 ≤ i ≤ I and I < i ≤ M, letting q , . . . ,q The above equation, which gives the simplest form of the be the orthonormal bases for the orthogonal complement Q . solution to a wMPDR CBF, clearly shows that a wMPDR (i) Because distinguishing variances λ of the auxiliary signals CBF is a general case of a wMPDR beamformer. By setting is inconsequential, we use the same value for them, calculated L = 1 in the above solution, namely, by letting it be a as non-convolutional beamformer, it reduces to the solution of a wMPDR beamformer in Eq. (42). M 2 X H ⊥ (i) An advantage of the solution using the MISO CBFs is that λ = q z , (45) M − I it can be obtained by a closed form equation, provided the i=I+1 RTFs and the time-varying variances of the desired signals are given and that we can ignore the interaction between DN and and calculate P and R based on Eqs. (30) and (31) x x DR. With this approach, however, the RTFs must be directly accordingly. In summary, we can implement this modification estimated from a reverberant observation, similar to ISCLP by adding the following terms to Ψ and ψ in Eqs. (43) and [24]. A solution to this problem is to use dereverberation (44): preprocessing based on a WPE filter for the RTF estimation. X H ⊤ Although it was shown that the output of a WPE filter can ⊥ (i) (i) Ψ = q q ⊗ R , (46) be obtained in a computationally efficient way within the i=I+1 framework of this approach [30], the source-wise factorization X ∗ approach described in the following can more naturally solve ⊥ (i) ⊥ (i) ψ = q ⊗ P q . (47) this problem. So, this paper adopts it as the solution. i=i+1 D. Optimization based on source-wise factorization C. Direct optimization of MISO CBFs With source-wise factorization, similar to the case with the direct optimization of the MISO CBFs, the optimization can Before deriving the optimization with source-wise factoriza- be performed separately for each source, and the resultant tion, we show that we can directly optimize the MISO CBFs algorithm is identical to that proposed for DN+DR [31]. in Eq. (14), and summarize their characteristics. With this Considering that a CBF can be written based on Eqs. (16) setting, the CBFs and the objective function are both defined H (i) (i) separately for each source in Eqs. (14) and (19), and thus, the (i) and (17) as y = q x − G x and using the t t optimization can be performed separately for each source. The (i) factorized form of R in Eq. (28), the objective function in resultant algorithm is, therefore, identical to that previously Eq. (19) can be rewritten: proposed for DN+DR [42], where this type of CBF is also called a Weighted Power minimization Distortionless response −1 (i) (i) (i) (i) (i) (i) (WPD) CBF. L G ,q = G − R P q x x (i) For presenting the solution, we introduce the following x vector representation of Eq. (14): (i) + q . (52) H −1 (i) (i) (i) (i) R − P R P x x x (i) (i) y = w x , (48) t t (i) In the above objective function, G is contained only in the first term, and the term can be minimized without depending (i) where w is defined: (i) (i) on the value of q , when G takes the following value: " # (i) −1 (i) (i) (i) (i) w = , (49) G = R P . (53) (i) x x (i) So, this is a solution of G that globally minimizes the (i) (i) Then, when λ and v ˜ are given, Eq. (19) becomes a simple t (i) objective function given time-varing variance λ . Interest- constraint quadratic form: ingly, this solution is identical to that of conventional WPE 2 H dereverberation. This means that the WPE filter, which is (i) (i) (i) (i) L (w ) = w s.t. w v = 1, (50) (i) optimized solely for dereverberation, can perform the optimal dereverberation for the joint optimization without depending (i) where R is the covariance matrix defined in Eq. (28), and h i 6 ⊤ This is not a unique solution. The first term is minimized even when an (i) (i) M(L−Δ+1)×1 (i) v = v ˜ , 0, . . . , 0 ∈ C corresponds arbitrary matrix, whose null space includes q , is added to Eq. (53). 8 Algorithm 1: Source-packed factorization-based optimiza- Algorithm 2: Source-wise factorization-based optimiza- tion for estimation of all sources tion for estimation of ith source Data: Observed signal x for all t Data: Observed signal x for all t t t (i) (i) TF masks γ for all t and 1 ≤ i ≤ I TF masks γ for all t t t (i) (i) Result: Estimated sources y for all t and 1 ≤ i ≤ I Result: Estimated ith source y for all t t t (i) (i) 2 2 1 Initialize λ as ||x || /M for all t and 1 ≤ i ≤ I 1 Initialize λ as ||x || /M for all t t t t I t I M M (i) 2 Initialize q as the ith column of I for 1 ≤ i ≤ I 2 repeat P H (i) T x x 1 t 3 Initialize z as x for all t t t t 3 R ← x (i) T t=1 4 repeat P H (i) T x x 1 t H t (i) T x x 4 P ← 1 t x t (i) T t=1 5 R ← for 1 ≤ i ≤ I λ (i) t x t=1 (i) (i) H (i) (i) T x x 1 t t 5 G ← R P 6 P ← for 1 ≤ i ≤ I x (i) t=1 t  H (i) (i) H (i) I 6 z ← x − G x (i) (i) t t 7 Ψ ← q q ⊗ R i=1 x (i) (i) (i) 7 Estimate v ˜ based on z and γ P t t I (i) (i) (i) (i) (i) 8 ψ ← q ⊗ P q i=1 P z z (i) T t t 8 R ← z (i) t=1 9 Begin Add orthogonal complement beamformer (i) (i) (I+1) (M) ´ (R ) v 10 Set q , . . . ,q as the orthonormal bases (i) z 9 q ← H (i) (i) ´ (i) for orthogonal complement Q of Q v˜ R v˜ ( ) z P (i) (i) M (i) ⊥ 1 (i) 10 y ← q z 11 λ ← q z t t t i=I+1 M−I P (i) (i) T x x 1 t 11 λ ← y 12 R ← t t x t=1 T λ P H T x x 12 until convergence ⊥ 1 t 13 P ← x ⊥ T t=1 H ⊥ (i) (i) 14 Ψ ← Ψ + q q ⊗ R i=I+1 (i) ⊥ (i) output of the single-target WPE filter, calculated as 15 ψ ← ψ + q ⊗ P q i=I+1 16 End (i) (i) z z t t 17 g ← Ψ ψ (i) M×M R = ∈ C . (55) (i) 18 z ← x − X g λ t t t t=1 t (i) (i) 19 Estimate v ˜ based on z and γ for 1 ≤ i ≤ I Then the solution can be obtained, under a distortionless P H (i) T z (z ) 1 t t 20 R ← for 1 ≤ i ≤ I z constraint, as a wMPDR beamformer: (i) T t=1 (i) (i) −1 R v˜ ( ) (i) (i) (i) 21 q ← for 1 ≤ i ≤ I + R v ˜ H z (i) (i) (i) v˜ R v˜ ( ) z (i) q = . (56) H −1 (i) (i) H (i) (i) ´ (i) 22 y ← q z for 1 ≤ i ≤ I t v ˜ R v ˜ (i) (i) 23 λ ← y for 1 ≤ i ≤ I t t Eqs. (54) to (56) closely resemble Eqs. (40) to (42). The 24 until convergence difference is whether the dereverberation is performed by a multiple-target WPE filter or single-target WPE filters. With source-wise factorization, the solution can be obtained (i) (i) in closed form when λ and v ˜ are given, similar to the case on the subsequent beamforming, provided the time-varying with the direct optimization of the MISO CBFs. In addition, (i) variance of the desired source is given for the optimization. In the output of the WPE filter is obtained as z in Eq. (16), addition, unlike the source-packed factorization approach, this and can be efficiently used for the estimation of the RTFs. approach does not need to compensate for the dimensionality Furthermore, since the temporal-spatial covariance matrix in (i) reduction of the beamformer output for the update of G Eq. (31) is much smaller than that in Eq. (38) of the source- because it considers a whole signal space without adding any packed factorization, the computational cost can be reduced. (i) modification. We refer to this filter G as a single-target WPE (See Section IV for more scrutiny of the computing cost.) filter. (i) (i) Once G is obtained as the above solution, the objective (i) E. Processing flow with estimation of λ and v function in Eq. (19) can be rewritten as This subsection describes examples of processing flows in Algorithms 1 and 2, for optimizing a CBF based on source- 2 H (i) (i) (i) (i) packed factorization and source-wise factorization, including L q = q s.t. q v ˜ = 1, (54) (i) (i) estimation of the time-varying variances, λ , and the RTFs, (i) v ˜ . Hereafter, we refer to the algorithms as A-1 and A-2 for (i) where R is a variance-normalized covariance matrix of the brevity. Although A-1 simultaneously estimates all sources, z 9 WPE for wMPDR for was trained so that it receives the WPE filter’s output, which is obtained at the first iteration in the iterative optimization of Dereverb Beamform the CBF, and estimates the TF masks of the desired signals. The network’s input was set as a concatenation of the real Es mate Es mate and imaginary parts of the STFT coefficients, and the loss function was set as the (scale-dependent) signal-to-distortion ratio (SDR) of an enhanced signal obtained by multiplying the Es mate estimated masks to an observed signal. For the training and Es mate validation data, we synthesized mixtures using two utterances TF masks randomly extracted from the WSJ-CAM0 corpus [45] and two st room impulse responses and background noise extracted from For 1 me For subsequent mes the REVERB Challenge training set [18]. (i) Fig. 2. Processing flow of source-wise factorization-based CBF for estimating For the estimation of the RTFs, v ˜ , we adopted a method a source i. based on eigenvalue decomposition with noise covariance (i) whitening [46], [47]. With this technique, steering vector v (i) is first estimated: y for all i, from observed signal x , A-2 estimates only (i) (i) −1 one of the sources, y for a certain i, and (if necessary) is v = R MaxEig R R , (57) t i \i \i repeatedly applied to the observed signal to estimate all the where MaxEig(·) is a function that calculates the eigenvector sources one after another. TF masks are provided as auxiliary (i) corresponding to the maximum eigenvalue and R and R \i inputs for both algorithms. TF mask γ , which is associated are spatial covariance matrices of the i-th desired signal and with a source and a TF point, takes a value between 0 and the other signals estimated as: 1 and indicates whether the source’s desired signal dominates (i) (i) the TF point (γ = 1) or not (γ = 0). The TF masks (i) (i) (i) t t γ z z t t t over all the TF points are used to estimate the RTF(s) of the R = , (58) (i) desired signal(s) in line 19 of A-1 and line 7 of A-2. (See t t Section III-E1 for the estimation detail of the TF masks and (i) (i) (i) 1− γ z z the RTFs.) t t t t (i) R =   . (59) \i Both algorithms estimate time-varying variances λ based (i) 1− γ on the same objective as that for the CBF, defined in Eq. (19). Because no closed form solution to the estimation of the Then, the RTF is obtained by Eq. (4). CBF and the time-varying variances is known, an iterative and IV. DISCUSSION alternate optimization scheme is introduced to both algorithms. (i) In each iteration, the time-varying variances, λ , are updated In summary, our proposed techniques can optimize a CBF in line 23 of A-1 and line 11 of A-2 as the power of the for jointly performing DN+DR+SS with greatly reduced com- (i) previously estimated values of desired signal y , and then the puting cost in comparison with the direct application of the (i) CBF and desired signal y are updated while fixing the time- conventional joint optimization technique proposed for DR+SS varying variances. The iteration is repeated until convergence to DN+DR+SS. With the conventional technique, a huge is obtained. covariance matrix Ψ must be calculated to take into account the dependency of G on Q that is inherently introduced into The optimization methods described in Sections III-B and source-packed factorization. This makes the computing cost of III-D are used in their respective algorithms to update the CBF the conventional technique extremely high. In contrast, since and the desired signal(s). The WPE filter is first estimated in the proposed extension of the source-packed factorization lines 5 to 17 of A-1 and lines 3 to 5 of A-2, and applied in line 18 of A-1 and line 6 of A-2. After the RTF(s) is updated approach substantively reduces the size of the matrix to be using the dereverberated signals, the wMPDR beamformer is calculated from M (L− Δ) for Ψ to M(L− Δ) for R , the estimated in lines 20 and 21 of A-1 and lines 8 and 9 of A-2, computing cost can be effectively reduced. (i) and applied in line 22 of A-1 and line 10 of A-2. On the other hand, with source-wise factorization, G (i) Figure 2 also illustrates the processing flow of a CBF with can be optimized independently of q , which also allows source-wise factorization for estimating a source i. us to reduce the size of the matrix to be calculated to the same as that of the proposed extension of the source-packed 1) Methods for estimating TF masks and RTFs: In our (i) factorization approach. In addition, we can skip the calculation experiments, for estimating TF masks, γ , for all i and t at each frequency, we used a Convolutional Neural Network of an additional matrix, R , and the inverse of the huge −1 that works in the TF domain and is trained using utterance- matrix, Ψ , both of which are required for the proposed level Permutation Invariant Training criterion (CNN-uPIT) extension of the source-packed factorization approach. This [43]. According to our preliminary experiments [32], we set further increases the computational efficiency of the source- the network structure as a CNN with a large receptive field wise factorization approach. A drawback of source-wise fac- similar to one used by a fully-Convolutional Time-domain torization is that it has to handle I-times more dereverberated Audio Separation Network (Conv-TasNet) [44]. The network signals than source-packed factorization. 10 TABLE I CBFS COMPARED IN EXPERIMENTS: (1) AND (2) ARE CONVENTIONAL CASCADE CONFIGURATION APPROACHES, (5) IS A CONVENTIONAL JOINT OPTIMIZATION APPROACH, (6) AND (7) ARE PROPOSED JOINT OPTIMIZATION APPROACHES, AND (3) AND (4) ARE TEST CONDITIONS USED JUST FOR COMPARISON. (5), (6), AND (7) ARE CATEGORIZED AS “JOINTLY OPTIMAL” BECAUSE THEY ARE COMPOSED OF WPE AND WMPDR AND OPTIMIZED BASED ON INTEGRATED VARIANCE ESTIMATION (SEE FIG. 3 FOR THE DIFFERENCE BETWEEN SEPARATE AND INTEGRATED VARIANCE ESTIMATION). Name of method Jointly WPE BF Variance Category optimal estimation (1) WPE+MPDR (separate) Multiple-target MPDR Separate Cascade (conventional) (2) WPE+MVDR (separate) Multiple-target MVDR Separate Cascade (conventional) (3) WPE+wMPDR (separate) Multiple-target wMPDR Separate Test condition (4) WPE+MPDR (integrated) Single-target MPDR Integrated Test condition (5) Source-packed factorization (conventional) X Multiple-target wMPDR Integrated Jointly optimal (conventional) (6) Source-packed factorization (extended) X Multiple-target wMPDR Integrated Jointly optimal (proposed) (7) Source-wise factorization X Single-target wMPDR Integrated Jointly optimal (proposed) The source-wise factorization approach has additional ben- efits w.r.t. computational efficiency when it is used in specific Derev BF Derev BF scenarios listed below: • The source-wise factorization approach can estimate the (a) Separate optimization (b) Integrated optimization CBF by a closed-form equation when time-varying source Fig. 3. Separate and integrated variance optimization schemes: While separate variances are given, or estimated, e.g., using neural net- variance optimization updates λ for Derev as the variance of Derev output, works [15], [12]. In such a case, we can skip iterative integrated variance optimization updates it as the variance of the beamformer optimization. In contrast, the source-packed factorization output. Consequently, λ for Derev is common to all the sources with separate variance optimization. approach needs to maintain iterations to alternately esti- mate Q and g due to their mutual dependency. • The source-wise factorization approach is advantageous A. Dataset and evaluation metrics when it is combined with neural network-based single target speaker extraction that has recently been actively For the evaluation, we prepared a set of noisy reverberant studied [13]. With this combination, we can skip the es- speech mixtures (REVERB-2MIX) using the REVERB Chal- timation of sources other than the target source, allowing lenge dataset (REVERB) [18]. Each utterance in REVERB us to further reduce the computing cost. contains a single reverberant speech with moderate stationary diffuse noise. For generating a set of test data, we mixed two V. EXPERIMENTS utterances extracted from REVERB, one from its development This section experimentally confirms the effectiveness of set (Dev set) and the other from its evaluation set (Eval set), our proposed joint optimization approaches. Table I summa- so that each pair of mixed utterances was recorded in the same rizes the optimization methods that we experimentally com- room, by the same microphone array, and under the same pared (see Sections V-C and V-D for details of the methods) condition (near or far, RealData or SimData). We categorized in the following three aspects. the test data based on the original categories of the data in REVERB (e.g., SimData or RealData). We created the same 1) Effectiveness of joint optimization number of mixtures in the test data as in the REVERB Eval set, We compared a CBF with and without joint optimiza- such that each utterance in the REVERB Eval set is contained tion in terms of estimation accuracy. The source-wise in one of the mixtures in the test data. Furthermore, the length factorization approach (Table I (7)) is compared with of each mixture in the test data was set at the same as that of the conventional cascade configuration (Table I (1) and the corresponding utterance in the REVERB Eval set, and the (2)), and two additional test conditions (Table I (3) and utterance from the Dev set was trimmed or zero-padded at its (4)). end to be the same length as that of Eval set. 2) Comparison among joint optimization approaches For the experiments in Section V-E, we also prepared a We compared three joint optimization approaches, i.e., set of noisy reverberant speech mixtures, each of which is the source-packed factorization approach with its con- composed of three speaker utterances (REVERB-3MIX). We ventional setting (Table I (5)) and its proposed extension created REVERB-3MIX by adding one utterance extracted (Table I (6)), and the source-wise factorization approach from REVERB Dev set to each mixture in REVERB-2MIX. (Table I (7)), respectively described in Sections III-B1, Only RealData (i.e., real recordings of reverberant data) was III-B2, and III-D, in terms of computational efficiency created for REVERB-3MIX. and estimation accuracy. 3) Evaluation using oracle masks In the experiments, we respectively estimated two or three We used oracle masks instead of estimated masks for speech signals from each mixture for REVERB-2MIX and evaluating a CBF to test the performance of a CBF using REVERB-3MIX and evaluated only one of them correspond- different types of masks and also to obtain its top-line ing to the REVERB Eval set using the baseline evaluation tools performance. provided for it. We selected the signal to be evaluated from all 11 (1) WPE+MPDR (separate) (4) WPE+MPDR (integrated) TABLE II (2) WPE+MVDR (separate) (7) Source-wise factorization BEAMFORMER CONFIGURATIONS USED IN EXPERIMENTS (3) WPE+wMPDR (separate) M L at each freq. range (kHz) #Iterations 24 4.4 0.0-0.8 0.8-1.5 1.5-8.0 Config-1 8 20 16 8 10 Config-2 4 20 16 8 10 4.2 TABLE III WER (%) FOR REALDATA AND CD (DB), FWSSNR (DB), PESQ, AND STOI FOR SIMDATA IN REVERB-2MIX OBTAINED USING DIFFERENT BEAMFORMERS AFTER FIVE ESTIMATION ITERATIONS WITH CONFIG-1. 3.8 SCORES FOR REVERB-2MIX AND REVERB (I.E., SINGLE SPEAKER) WITHOUT ENHANCEMENT (NO ENH), ARE ALSO SHOWN. 3.6 Enhancement method WER CD FWSSNR PESQ STOI No Enh (REVERB-2MIX) 62.49 5.44 1.12 1.12 0.55 2 4 6 8 10 2 4 6 8 10 No Enh (REVERB) 18.61 3.97 3.62 1.48 0.75 #iterations #iterations 6 1.85 MPDR (w/o iteration) 30.79 4.40 3.07 1.45 0.73 MVDR (w/o iteration) 30.89 4.43 3.00 1.44 0.73 1.8 5.5 wMPDR 28.75 3.96 4.46 1.60 0.75 1.75 (1) WPE+MPDR (separate) 23.04 4.30 3.77 1.58 0.77 (2) WPE+MVDR (separate) 23.34 4.34 3.66 1.57 0.76 1.7 (3) WPE+wMPDR (separate) 21.53 3.74 5.42 1.77 0.82 1.65 4.5 (4) WPE+MPDR (integrated) 23.22 4.28 3.66 1.56 0.76 (7) Source-wise factorization 20.03 3.67 5.57 1.80 0.81 1.6 1.55 the estimated speech signals based on the correlation between 3.5 1.5 the separated signals and the original signal in the REVERB 2 4 6 8 10 2 4 6 8 10 Eval set. As objective measures for speech enhancement [48], #iterations #iterations we used the Cepstrum Distance (CD), the Frequency-Weighted Fig. 4. Comparison among joint optimization and cascade configuration Segmental SNR (FWSSNR), the Perceptual Evaluation of approaches when using WPE+MPDR and WPE+wMPDR with integrated and Speech Quality (PESQ), and the Short-Time Objective Intel- separate optimization schemed using Config-1 for REVERB-2MIX. ligibility measure (STOI) [49]. To evaluate the ASR perfor- mance, we used a baseline ASR system for REVERB that was filter followed by an MPDR beamformer (WPE+MPDR), recently developed using Kaldi [50]. This system is composed of a Time-Delay Neural Network (TDNN) acoustic model and a WPE filter followed by an MVDR beamformer trained using lattice-free maximum mutual information (LF- (WPE+MVDR). The first combination is required for jointly MMI) and online i-vector extraction, and a trigram language optimal processing, and the others have been used for the conventional cascade configuration. Second, we compared two model. They were trained on the REVERB training set. different variance optimization schemes shown in Fig. 3: “separate” and “integrated.” With the separate variance opti- B. CBF configurations mization, the iterative estimation of the time-varying variance Table I summarizes two configurations of the CBF examined was performed separately for the WPE filter and for the in experiments including the number of microphones M, the beamformer. This is the scheme used by the conventional filter length L, and the number of optimization iterations. The cascade configuration. In contrast, with the integrated variance sampling frequency was 16 kHz. A Hann window was used optimization, the iterative estimation was performed jointly for for a short-time analysis where the frame length and shift were the WPE filter and the beamformer. A significant difference set at 32 and 8 ms. The prediction delay was set at Δ = 4 for between the two schemes is whether the WPE filter uses the WPE filter. the same variances for all the sources or different variances In the iterative optimization, the time-varying variances of dependent on the sources estimated by the beamformer. the sources were initialized as those of the observed signal for Table III compares WERs, CDs, FWSSNRs, PESQs, and the WPE filter and as 1 for the wMPDR beamformer for all STOIs obtained after five estimation iterations using three the methods. beamformers (MPDR, MVDR, and wMPDR), two conven- tional cascade configuration approaches ((1) WPE+MPDR C. Experiment-1: effectiveness of joint optimization and (2) WPE+MVDR), two test conditions ((3) and (4)), In this experiment, we evaluated the effectiveness of the and a proposed joint optimization approach ((7) source-wise joint optimization focusing on its two characteristics. First, factorization). All methods used configuration Config-1 in we compared three different filter combinations: a WPE filter Table I. Table III shows that 1) WPE+MPDR, WPE+MVDR, followed by a wMPDR beamformer (WPE+wMPDR), a WPE and WPE+wMPDR greatly outperformed MPDR, MVDR, FWSSNR (dB) WER (%) PESQ CD (dB) 12 8 8 6 6 4 4 2 2 0 0 0 0.5 1 1.5 0 0.5 1 1.5 Time (s) Time (s) (a) Observed signal (b) MVDR 8 8 6 6 4 4 2 2 0 0 0 0.5 1 1.5 0 0.5 1 1.5 Time (s) Time (s) (c) WPE+MVDR (d) CBF with source-wise factorization Fig. 5. Spectrogram of (a) a noisy reverberant mixture in RealData of REVERB-2MIX and spectrograms of enhanced signals obtained by (b) MVDR, (c) WPE+MVDR and (d) CBF with source-wise factorization. Mixture is composed of two female speakers under far conditions. and wMPDR, respectively, with all the conditions, 2) the D. Experiment-2: Comparison among joint optimization ap- joint optimization approach, i.e., (7) source-wise factorization, proaches substantially outperformed all the other methods in terms of In this experiment, we compared three joint optimiza- all the measures except for a case in terms of STOI where tion approaches, denoted as (5) Source-packed factorization WPE+wMPDR (separate) gave a slightly better score than (7) (conventional), (6) Source-packed factorization (extended), source-wise factorization. Furthermore, Fig. 4 shows the con- and (7) Source-wise factorization. (5) Source-packed factor- vergence curves of the two cascade configuration approaches, ization (conventional) corresponds to the conventional joint two test conditions, and the joint optimization approach. The optimization technique described in Section III-B1, and (6) source-wise factorization performance (7) was the best of Source-packed factorization (extended) and (7) Source-wise all and improved as the number of iterations increased. The factorization correspond to our proposed methods respectively second best was (3) WPE+wMPDR (separate). The other described in Sections III-B2 and III-D. methods did not improve the scores after the first iteration Figure 6 compares the WERs obtained using the three with both the integrated and separate variance optimization approaches with Config-1 and Config-2. Our proposed meth- schemes. ods, i.e., (6) Source-packed factorization (extended) and (7) Figure 5 shows a spectrogram of a noisy reverberant mix- Source-wise factorization, performed comparably well and ture in RealData of REVERB-2MIX, and spectrograms of both greatly outperformed (5) Source-packed factorization enhanced signals obtained using MVDR, WPE+MVDR, and (conventional). CBF with source-wise factorization. The figure shows that all Table IV compares the computing times required for the the enhancement methods were effective and the CBF with three approaches to estimate and apply the CBFs with ten source-wise factorization was the best of all for achieving estimation iterations for processing a mixture utterance whose denoising, dereverberation, and source separation. length is 9.44 s. The computing time was measured by a The above results clearly show that the two characteristics Matlab interpreter as elapsed time. The computing times of the joint optimization approach, i.e., 1) the optimal combi- for estimating the masks were 0.63 s and 7.2 s with and nation of a WPE filter and a wMPDR beamformer, and 2) the without a GPU (NVIDIA 2080ti), and they are not included integrated variance optimization, are both critical for achieving in the table. As shown in the table, for both configurations, optimal performance. (6) Source-packed factorization (extended) greatly reduced Frequency (kHz) Frequency (kHz) Frequency (kHz) Frequency (kHz) 13 (5) Source-packed (conventional) (7) Source-wise fact. TABLE V WER (%) FOR REALDATA AND CD (DB), FWSSNR (DB), PESQ, AND (6) Source-packed (extended) STOI FOR SIMDATA IN REVERB-2MIX OF ENHANCED SIGNALS 29.5 OBTAINED BASED ON ORACLE MASKS USING DIFFERENT BEAMFORMERS AFTER THREE ESTIMATION ITERATIONS WITH CONFIG-1. SCORES FOR REVERB-2MIX WITH NO ENHANCEMENT (NO ENH) AND THOSE OBTAINED BY APPLYING A WMPDR CBF, WPD [30], TO REVERB (I.E., 28.5 SINGLE SPEAKER), ARE ALSO SHOWN. Enhancement method WER CD FWSSNR PESQ STOI 27.5 No Enh (REVERB-2MIX) 62.49 5.44 1.12 1.12 0.55 27 WPD (REVERB) [30] 8.91 2.59 8.29 2.41 0.91 MPDR (w/o iteration) 20.16 3.53 5.49 1.86 0.84 26.5 MVDR (w/o iteration) 20.32 3.56 5.36 1.84 0.83 19 wMPDR 20.12 3.31 6.11 1.96 0.86 2 4 6 8 10 2 4 6 8 10 (1) WPE+MPDR (separate) 12.89 3.39 6.11 2.10 0.87 #iterations #iterations (2) WPE+MVDR (separate) 12.91 3.32 6.30 2.07 0.87 (a) Config-1 (b) Config-2 (3) WPE+wMPDR (separate) 12.59 3.12 6.84 2.21 0.89 (6) Source-packed fact. 12.23 3.02 7.15 2.33 0.90 Fig. 6. WERs (%) obtained for REVERB-2MIX when jointly optimizing (7) Source-wise fact. 12.23 2.98 7.25 2.32 0.90 WPE+wMPDR based on source-packed factorization (conventional/extended) and source-wise factorization approaches. TABLE IV in REVERB-2MIX using signal components in the observed COMPUTING TIME REQUIRED FOR PROCESSING A MIXTURE UTTERANCE signals. In contrast, we can only calculate the oracle masks OF LENGTH OF 9.44 S IN REVERB-2MIX. COMPUTING TIME WAS MEASURED BY ELAPSED TIME ON A MATLAB INTERPRETER. approximately for RealData because we cannot access the signal components. Thus, we first estimated the desired signals Method Time (s) by applying dereverberation and denoising to utterances in Config-1 Config-2 REVERB, and then calculated the oracle masks using the (4) Source-packed factorization (conventional) 3467 688 (5) Source-packed factorization (extended) 209 33 estimated desired signals for REVERB-2MIX and REVERB- (6) Source-wise factorization 40 23 3MIX. Table V shows WERs, CDs, FWSSNRs, PESQs, and STOIs measured on enhanced signals obtained from REVERB-2MIX the computing time in comparison with (5) Source-packed using various (non-convolutional) beamformers and CBFs factorization (conventional), and (7) Source-wise factorization after three estimation iterations. As a reference, the table further reduced the computing time. also includes previously reported scores denoted by WPD The above results clearly demonstrate the superiority of the (REVERB) [30], which were obtained by applying a wMPDR two proposed approaches over the conventional joint optimiza- CBF, referred to as WPD (see also Section III-C in this paper), tion technique in terms of both computational efficiency and to REVERB, i.e., noisy reverberant single speaker utterances. estimation accuracy. However, Table IV indicates that the pro- In addition, the convergence curves obtained using the CBFs posed approaches still require relatively large computing cost, in terms of WERs for REVERB-2MIX and REVERB-3MIX, e.g., 40 s computing time for processing a 9.44 s utterance and those obtained in terms of CDs, FWSSNRs, PESQs, with Config-1, to obtain the high performance gain shown and STOIs for REVERB-2MIX are respectively shown in in Fig. 6 (a). Future work must address this problem. For Figs. 7 and 8. In all these results, the two joint optimization example, it might be mitigated by setting the goal as extraction approaches, (6) source-packed factorization (extended) and (7) of a single target source. Then, due to the characteristics source-wise factorization, outperformed all the other methods of source-wise factorization, we can omit the estimation of in terms of every measurement. As a whole, almost the same the other sources, and omit the iterative estimation, e.g., tendency was observed in the cases using the estimated masks. when we separately estimate source variances using a neural One exception is that the WERs obtained with the source-wise network. As a reference, the computing time (40 s) in Table factorization tended to increase after a few iterations although III required for the source-wise factorization with Config-1 is such a tendency was not observed in terms of signal distortion roughly reduced to 2.0 s for one iteration per source (namely measures. This means that improvement in the signal level 40 s/10/2), which results in the real-time factor being 0.21 distortion does not necessarily result in improvement in WER, (= 2.0 s/9.44 s). and suggests the importance of optimization by ASR level criteria, similar to conventional beamforming techniques [51], E. Experiment-3: Evaluation using oracle masks [52]. In this experiment, we examined the performance of CBFs using a different type of masks, i.e., oracle masks. An oracle VI. CONCLUDING REMARKS mask, which is the power ratio of the desired signal to the observed signal at each TF point, is calculated using reference This paper presented methods for optimizing a CBF that signals. Oracle masks can be precisely calculated for SimData performs DN+DR+SS based on ML estimation. We introduced WER (%) WER (%) 14 (1) WPE+MPDR (separate) (6) Source-packed (extended) the source-packed factorization approach, and into a set of (2) WPE+MVDR (separate) (7) Source-wise factorization single-target WPE filters followed by wMPDR beamformers (3) WPE+wMPDR (separate) using the source-wise factorization approach. This paper also presented the overall processing flows for both approaches 13.5 based on an assumption that TF masks are provided as auxil- iary inputs. In the flows, the time varying source variances, which are required for ML estimation, can be optimally estimated jointly with the CBF using iterative optimization; the steering vectors of the desired signals, which are required for beamformer optimization, can be reliably estimated based 12.5 on the dereverberated multichannel signals obtained at an optimization step. Experiments using noisy reverberant sound mixtures show that the proposed optimization approaches substantially im- 2 4 6 8 10 2 4 6 8 10 proved the CBF performance in comparison with the conven- #iterations #iterations tional cascade configuration in terms of ASR performance (a) REVERB-2MIX (b) REVERB-3MIX and signal distortion reduction. Our proposed approaches Fig. 7. Comparison of WERs among cascade configuration and joint can also greatly reduce the computing cost with improved optimization approaches using Config-1 for REVERB-2MIX and REVERB- estimation accuracy in comparison with the conventional joint 3MIX. optimization technique. The proposed approaches, however, (1) WPE+MPDR (separate) (6) Source-packed (extended) still result in relatively large computing costs to obtain high (2) WPE+MVDR (separate) (7) Source-wise factorization performance gain. Future work will address this problem. (3) WPE+wMPDR (separate) 7.5 APPENDIX A 3.4 DERIVATION OF EQS. (43) AND (44) We can rewrite Ψ in Eq. (38) using Eq. (36): 3.3 Ψ = X Φ X , (60) t q,t t 3.2 XX H H 1 1 6.5 3.1 (i) (i) = q X q X . (61) t t (i) t i t (i) Using Eq. (33), q X can further be rewritten: 2.9 H H 2 4 6 8 10 2 4 6 8 10 (i) (i) T q X = q I ⊗ x , (62) t M #iterations #iterations 0.91 (i) T 2.35 = q ⊗ x . (63) 0.9 2.3 Substituting the above equation in Eq. (61) yields 2.25 XX H 1 1 ⊤ (i) H (i) T 0.89 Ψ = q ⊗ x q ⊗ x , t t (i) 2.2 t i t 0.88 (64) 2.15 XX 1 1 ⊤ 2.1 (i) (i) H = x x , (65) q q ⊗ 0.87 t (i) 2.05 t i t X ⊤ (i) 2 0.86 (i) (i) = q q ⊗ R . (66) 2 4 6 8 10 2 4 6 8 10 #iterations #iterations Similarly, we can obtain Fig. 8. Comparison of CDs, FWSSNRs, PESQs, and STOIS among cascade 1 H configuration and joint optimization approaches using Config-1 for REVERB- ψ = X Φ x , (67) q t 2MIX. XX 1 1 ⊤ (i) H (i) = q ⊗ x q x , (68) (i) t i t two different approaches for factorizing a CBF, i.e., source- XX 1 1 T (i) H (i) packed and source-wise factorization approaches, and derived = q ⊗ x x q , (69) (i) optimization algorithms for the respective approaches. A CBF t i t can be factorized without loss of optimality into a multiple- (i) (i) (i) = q ⊗ P q . (70) target WPE filter followed by wMPDR beamformers using PESQ WER (%) CD (dB) STOI FWSSNR (dB) WER (%) 15 REFERENCES [23] S. Braun and E. A. P. Habets, “Linear prediction based online dereverberation and noise reduction using alternating Kalman filters,” IEEE/ACM trans. on Audio, Speech, and Language Processing, vol. 26, [1] B. D. V. Veen and K. M. Buckley, “Beamforming: A versatile approach no. 6, pp. 1119–1129, 2018. to spatial filtering,” IEEE ASSP Magazine, vol. 5, no. 2, pp. 4–24, 1988. [24] T. Dietzen, S. Doclo, M. Moonen, and T. van Waterschoot, “Joint multi- [2] H. L. V. Trees, Optimum Array Processing, Part IV of Detection, microphone speech dereverberation and noise reduction using integrated Estimation, and Modulation Theory. New York: Wiley-Interscience, sidelobe cancellation and linear prediction,” in Proc. IWAENC, 2018. [25] T. Yoshioka, T. Nakatani, M. Miyoshi, and H. G. Okuno, “Blind [3] H. Cox, “Resolving power and sensitivity to mismatch of optimum array separation and dereverberation of speech mixtures by joint optimization,” processors,” The Journal of the Acoustical Society of America, vol. 54, IEEE Trans. on Audio, Speech, and Language Processing, vol. 19, no. 1, pp. 771–785, 1973. January 2011. [4] M. Souden, J. Benesty, and S. Affes, “On optimal frequency-domain [26] N. Ito, S. Araki, T. Yoshioka, and T. Nakatani, “Relaxed disjointness multichannel linear filtering for noise reduction,” IEEE Trans. Audio, based clustering for joint blind source separation and dereverberation,” Speech, and Language Processing, vol. 18, no. 2, pp. 260–276, 2007. in Proc. IWAENC, 2014. [5] A. Hyva¨rinen, J. Karhunen, and E. Oja, Independent Component Anal- [27] H. Kagami, H. Kameoka, and M. Yukawa, “Joint separation and dere- ysis. New York: John Wiley & Sons, 2001. verberation of reverberant mixtures with determined multichannel non- [6] T. Kim, H. T. Attias, S.-Y. Lee, and T.-W. Lee, “Blind source separa- negative matrix factorization,” in Proc. IEEE ICASSP, 2018, pp. 31–35. tion exploiting higher-order frequency dependencies,” IEEE Trans. on [28] T. Nakatani, R. Ikeshita, K. Kinoshita, H. Sawada, and S. Araki, “Com- Speech, and Audio Processing, vol. 15, no. 1, pp. 70–79, 2006. putationally efficient and versatile framework for joint optimization of [7] M. Souden, S. Araki, K. Kinoshita, T. Nakatani, and H. Sawada, “A blind speech separation and dereverberation,” in Proc. Interspeech, 2020. multichannel MMSE-based framework for speech source separation [29] Z. Koldovsky and P. Tichavsky´, “Gradient algorithms for complex non- and noise reduction,” IEEE Trans. on Audio, Speech, and Language Gaussian independent component/vector extraction, question of conver- Processing, vol. 21, no. 9, pp. 1913–1928, 2010. gence,” IEEE Trans. on Signal Processing, vol. 67, no. 4, pp. 1050–1064, [8] T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, and B.-H. Juang, “Speech dereverberation based on variance-normalized delayed linear [30] T. Nakatani and K. Kinoshita, “Maximum likelihood convolutional prediction,” IEEE trans. on Audio, Speech, and Language Processing, beamformer for simultaneous denoising and dereverberation,” in Proc. vol. 18, no. 7, pp. 1717–1731, 2010. EUSIPCO, 2019. [9] T. Yoshioka and T. Nakatani, “Generalization of multi-channel linear [31] C. Boeddeker, T. Nakatani, K. Kinoshita, and R. Haeb-Umbach, “Jointly prediction methods for blind MIMO impulse response shortening,” IEEE optimal dereverberation and beamforming,” in Proc. ICASSP, 2020, pp. trans. on Audio, Speech and Language Processing, vol. 20, no. 10, pp. 216–220. 2707–2720, 2012. [32] T. Nakatani, R. Takahashi, T. Ochiai, K. Kinoshita, R. Ikeshita, [10] A. Jukic´, T. van Waterschoot, T. Gerkmann, and S. Doclo, “Multi- M. Declroix, and S. Araki, “DNN-supported mask-based convolutional channel linear prediction-based speech dereverberation with sparse pri- beamforming for simultaneous denoising, dereverberation, and source ors,” IEEE/ACM trans. on Audio, Speech and Language Processing, separation,” in Proc. IEEE ICASSP, 2020. vol. 23, no. 9, pp. 1509–1520, 2015. [33] J. S. Bradley, H. Sato, and M. Picard, “On the importance of early [11] J. Heymann, L. Drude, and R. Haeb-Umbach, “Neural network based reflections for speech in rooms,” The Journal of the Acoustic Sociaty of spectral mask estimation for acoustic beamforming,” in Proc. IEEE America, vol. 113, pp. 3233–3244, 2003. ICASSP, 2016, pp. 196–200. [34] T. Nishiura, Y. Hirano, Y. Denda, and M. Nakayama, “Investigations into [12] K. Kinoshita, M. Delcroix, H. Kwon, T. Mori, and T. Nakatani, “Neural early and late reflections on distant-talking speech recognition toward network-based spectrum estimation for online wpe dereverberation,” in suitable reverberation criteria,” in Proc. Interspeech, 2007, pp. 1082– Proc. Interspeech, 2017, pp. 384–388. [13] K. Zmol´ıkova´, M. Delcroix, K. Kinoshita, T. Ochiai, T. Nakatani, [35] Y. Avargel and I. Cohen, “On multiplicative transfer function approxima- L. Burget, and J. Cernocky´, “SpeakerBeam: Speaker aware neural tion in the short-time fourier transform domain,” IEEE Signal Processing network for target speaker extraction in speech mixtures,” IEEE Journal Letters, vol. 14, pp. 337–340, 2007. of Selected Topics in Signal Processing, vol. 13, no. 4, pp. 800–814, [36] I. Cohen, “Relative transfer function identification using speech signals,” IEEE Trans. on Speech, and Audio Processing, vol. 12, no. 5, pp. 451– [14] T. Yoshioka, H. Erdogan, Z. Chen, X. Xiao, , and F. Alleva, “Recogniz- 459, 2004. ing overlapped speech in meetings: A multichannel separation approach [37] T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, and B. H. Juang, using neural networks,” in Proc. Interspeech, 2018. “Blind speech dereverberation with multi-channel linear prediction based [15] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, “A regression approach to on short time Fourier transform representation,” in Proc. IEEE ICASSP, speech enhancement based on deep neural networks,” IEEE/ACM trans. 2008, pp. 85–88. on Audio, Speech, and Language Processing, vol. 23, no. 1, 2015. [38] T. Hori, S. Araki, T. Yoshioka, M. Fujimoto, S. Watanabe, T. Oba, [16] J. R. Hershey, Z. Chen, J. L. Roux, and S. Watanabe, “Deep clustering: A. Ogawa, K. Otsuka, D. Mikami, K. Kinoshita, T. Nakatani, A. Naka- Discriminative embeddings for segmentation and separation,” in Proc. mura, and J. Yamato, “Low-latency real-time meeting recognition and IEEE ICASSP, 2016, pp. 31–35. understanding using distant microphones and omni-directional camera,” [17] M. Kolbæk, D. Yu, Z.-H. Tan, and J. Jensen, “Multitalker speech IEEE Trans. on Audio, Speech, and Language Processing, vol. 20, no. 2, separation with utterance-level permutation invariant training of deep pp. 499–513, 2011. recurrent neural networks,” IEEE Trans. Audio, Speech, and Language [39] R. Ikeshita, N. Ito, T. Nakatani, and H. Sawada, “Independent low-rank Processing, pp. 1901–1913, 2017. matrix analysis with decorrelation learning,” in IEEE WASPAA, 2019. [18] K. Kinoshita, M. Delcroix, S. Gannot, E. A. P. Habets, R. Haeb-Umbach, [40] T. Nakatani and K. Kinoshita, “Simultaneous denoising and dereverber- W. Kellermann, V. Leutnant, R. Maas, T. Nakatani, B. Raj, A. Sehr, and ation for low-latency applications using frame-by-frame online unified T. Yoshioka, “A summary of the REVERB challenge: State-of-the-art convolutional beamformer,” in Proc. Interspeech, 2019. and remaining challenges in reverberant speech processing research,” [41] B. J. Cho, J. Lee, and H. Park, “A beamforming algorithm based on EURASIP Journal on Advances in Signal Processing, 2016. maximum likelihood of a complex Gaussian distribution with time- [19] J. Barker, R. Marxer, E. Vincent, and S. Watanabe, “The third ‘CHiME’ varying variances for robust speech recognition,” IEEE Signal Process- speech separation and recognition challenge: Dataset, task and base- ing Letters, vol. 26, no. 9, pp. 1398–1402, August 2019. lines,” in Proc. IEEE ASRU-2015, 2015, pp. 504–511. [42] T. Nakatani and K. Kinoshita, “A unified convolutional beamformer for [20] N. Kanda, C. Boeddeker, J. Heitkaemper, Y. Fujita, S. Horiguchi, simultaneous denoising and dereverberation,” IEEE Signal Processing K. Nagamatsu, and R. Haeb-Umbach, “Guided source separation meets Letters, vol. 26, no. 6, pp. 903–907, April 2019. a strong asr backend: Hitachi/Paderborn university joint investigation for [43] F. Bahmaninezhad, J. Wu, R. Gu, S.-X. Zhang, Y. Xu, M. Yu, and D. Yu, dinner party ASR,” in Proc. Interspeech, 2019. “A comprehensive study of speech separation: spectrogram vs waveform [21] R. Haeb-Umbach, S. Watanabe, T. Nakatani, M. Bacchiani, B. Hoffmeis- separation,” in Interspeech, 2019. ter, M. Seltzer, H. Zen, and M. Souden, “Speech processing for digital [44] Y. Luo and N. Mesgarani, “Conv-TasNet: Surpassing ideal time- home assistants,” IEEE Signal Processing Magazine, 2019. frequency magnitude masking for speech separation,” IEEE/ACM Trans. [22] M. Togami, “Multichannel online speech dereverberation under noisy on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256– environments,” in Proc. EUSIPCO, 2015, pp. 1078–1082. 1266, 2019. 16 [45] T. Robinson, J. Fransen, D. Pye, J. Foote, and S. Renals, “WSJCAMO: A British English speech corpus for large vocabulary continuous speech recognition,” in Proc. IEEE ICASSP, 1995, pp. 81–84. [46] N. Ito, S. Araki, M. Delcroix, and T. Nakatani, “Probabilistic spatial dictionary based online adaptive beamforming for meeting recognition in noisy and reverberant environments,” in Proc. IEEE ICASSP, 2017, pp. 681–685. [47] S. Markovich-Golan, S. Gannot, and I. Cohen, “Multichannel eigenspace beamforming in a reverberant noisy environment with multiple interfer- ing speech signals,” IEEE Trans. ASLP, vol. 17, no. 6, pp. 1071–1086, [48] Y. Hu and P. C. Loizou, “Evaluation of objective quality measures for speech enhancement,” IEEE Tran. Audio, Speech, and Language Processing, vol. 16, no. 1, pp. 229–238, 2008. [49] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “An algorithm for intelligibility prediction of timefrequency weighted noisy speech,” IEEE Trans. Audio, Speech, and Language Processing, vol. 19, no. 7, [50] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stem- mer, and K. Vesely, “The Kaldi speech recognition toolkit,” in Proc. IEEE ASRU, 2011. [51] J. Heymann, L. Drude, C. Boeddeker, P. Hanebrink, and R. Haeb- Umbach, “Eamnet: End-to-end training of a beamformer-supported multi-channel ASR system,” in Proc. IEEE ICASSP, 2017. [52] A. S. Subramanian, X. Wang, M. K. Baskar, S. Watanabe, T. Taniguchi, D. Tran, and Y. Fujita, “Speech enhancement using end-to-end speech recognition objectives,” in Proc. IEEE WASPAA, 2019.

Journal

Electrical Engineering and Systems SciencearXiv (Cornell University)

Published: May 20, 2020

There are no references for this article.