J. Barker, Shinji Watanabe, E. Vincent, J. Trmal (2018)
The fifth 'CHiME' Speech Separation and Recognition Challenge: Dataset, task and baselines
Takuya Yoshioka, T. Nakatani, M. Miyoshi, Hiroshi Okuno (2011)
Blind Separation and Dereverberation of Speech Mixtures by Joint Optimization. IEEE Transactions on Audio, Speech, and Language Processing, 19
M. Togami (2015)
Multichannel online speech dereverberation under noisy environments. 2015 23rd European Signal Processing Conference (EUSIPCO)
Naoyuki Kanda, Christoph Böddeker, Jens Heitkaemper, Yusuke Fujita, Shota Horiguchi, Kenji Nagamatsu, Reinhold Häb-Umbach (2019)
Guided Source Separation Meets a Strong ASR Backend: Hitachi/Paderborn University Joint Investigation for Dinner Party ASR. ArXiv, abs/1905.12230
Daniel Povey, Arnab Ghoshal, Gilles Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlícek, Y. Qian, Petr Schwarz, J. Silovský, G. Stemmer, Karel Veselý (2011)
The Kaldi Speech Recognition Toolkit
Takaaki Hori, S. Araki, Takuya Yoshioka, M. Fujimoto, Shinji Watanabe, T. Oba, A. Ogawa, K. Otsuka, Dan Mikami, K. Kinoshita, T. Nakatani, Atsushi Nakamura, Junji Yamato (2012)
Low-Latency Real-Time Meeting Recognition and Understanding Using Distant Microphones and Omni-Directional Camera. IEEE Transactions on Audio, Speech, and Language Processing, 20
T. Nakatani, Takuya Yoshioka, K. Kinoshita, M. Miyoshi, B. Juang (2010)
Speech Dereverberation Based on Variance-Normalized Delayed Linear Prediction. IEEE Transactions on Audio, Speech, and Language Processing, 18
Yi Luo, N. Mesgarani (2018)
Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 27
H. Van Trees (2002)
Optimum Array Processing: Part IV of Detection, Estimation, and Modulation Theory
T. Nakatani, K. Kinoshita (2019)
Maximum likelihood convolutional beamformer for simultaneous denoising and dereverberation. 2019 27th European Signal Processing Conference (EUSIPCO)
T. Nakatani, Riki Takahashi, Tsubasa Ochiai, K. Kinoshita, Rintaro Ikeshita, Marc Delcroix, S. Araki (2020)
DNN-supported Mask-based Convolutional Beamforming for Simultaneous Denoising, Dereverberation, and Source Separation. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Jahn Heymann, Lukas Drude, Christoph Böddeker, Patrick Hanebrink, Reinhold Häb-Umbach (2017)
Beamnet: End-to-end training of a beamformer-supported multi-channel ASR system. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Y. Avargel, I. Cohen (2007)
On Multiplicative Transfer Function Approximation in the Short-Time Fourier Transform Domain. IEEE Signal Processing Letters, 14
I. Cohen (2004)
Relative transfer function identification using speech signals. IEEE Transactions on Speech and Audio Processing, 12
H. Cox (1973)
Resolving power and sensitivity to mismatch of optimum array processors. Journal of the Acoustical Society of America, 54
M. Souden, J. Benesty, S. Affes (2010)
On Optimal Frequency-Domain Multichannel Linear Filtering for Noise Reduction. IEEE Transactions on Audio, Speech, and Language Processing, 18
Rintaro Ikeshita, N. Ito, T. Nakatani, H. Sawada (2019)
Independent Low-Rank Matrix Analysis with Decorrelation Learning. 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)
Christoph Boeddeker, T. Nakatani, K. Kinoshita, R. Haeb-Umbach (2019)
Jointly Optimal Dereverberation and Beamforming. ICASSP 2020 - 2020 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
S. Golan, S. Gannot (2015)
Performance analysis of the covariance subtraction method for relative transfer function estimation and comparison to the covariance whitening method. 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Thomas Dietzen, S. Doclo, M. Moonen, T. Waterschoot (2018)
Joint Multi-Microphone Speech Dereverberation and Noise Reduction Using Integrated Sidelobe Cancellation and Linear Prediction. 2018 16th International Workshop on Acoustic Signal Enhancement (IWAENC)
C. Taal, R. Hendriks, R. Heusdens, J. Jensen (2011)
An Algorithm for Intelligibility Prediction of Time–Frequency Weighted Noisy Speech. IEEE Transactions on Audio, Speech, and Language Processing, 19
Taesu Kim, H. Attias, Soo-Young Lee, Te-Won Lee (2007)
Blind Source Separation Exploiting Higher-Order Frequency Dependencies. IEEE Transactions on Audio, Speech, and Language Processing, 15
Nellie Brown (1896)
Trees. Journal of Education, 43
Zbyněk Koldovský, P. Tichavský (2018)
Gradient Algorithms for Complex Non-Gaussian Independent Component/Vector Extraction, Question of Convergence. IEEE Transactions on Signal Processing, 67
F. Bahmaninezhad, Jian Wu, Rongzhi Gu, Shi-Xiong Zhang, Yong Xu, Meng Yu, Dong Yu (2019)
A comprehensive study of speech separation: spectrogram vs waveform separation. ArXiv, abs/1905.07497
Takuya Yoshioka, T. Nakatani (2012)
Generalization of Multi-Channel Linear Prediction Methods for Blind MIMO Impulse Response Shortening. IEEE Transactions on Audio, Speech, and Language Processing, 20
T. Nakatani, Takuya Yoshioka, K. Kinoshita, M. Miyoshi, B. Juang (2008)
Blind speech dereverberation with multi-channel linear prediction based on short time Fourier transform representation. 2008 IEEE International Conference on Acoustics, Speech and Signal Processing
Y. Hu, P. Loizou (2008)
Evaluation of Objective Quality Measures for Speech Enhancement. IEEE Transactions on Audio, Speech, and Language Processing, 16
T. Nishiura, Yoshiki Hirano, Y. Denda, M. Nakayama (2007)
Investigations into early and late reflections on distant-talking speech recognition toward suitable reverberation criteria
A. Subramanian, Xiaofei Wang, M. Baskar, Shinji Watanabe, T. Taniguchi, Dung Tran, Yuya Fujita (2019)
Speech Enhancement Using End-to-End Speech Recognition Objectives. 2019 IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA)
Jahn Heymann, Lukas Drude, Reinhold Häb-Umbach (2016)
Neural network based spectral mask estimation for acoustic beamforming. 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
T. Nakatani, K. Kinoshita (2018)
A Unified Convolutional Beamformer for Simultaneous Denoising and Dereverberation. IEEE Signal Processing Letters, 26
Takuya Yoshioka, Hakan Erdogan, Zhuo Chen, Xiong Xiao, F. Alleva (2018)
Recognizing Overlapped Speech in Meetings: A Multichannel Separation Approach Using Neural Networks
R. Haeb-Umbach, Shinji Watanabe, T. Nakatani, M. Bacchiani, Björn Hoffmeister, M. Seltzer, H. Zen, M. Souden (2019)
Speech Processing for Digital Home Assistants
K. Kinoshita, Marc Delcroix, Haeyong Kwon, Takuma Mori, T. Nakatani (2017)
Neural Network-Based Spectrum Estimation for Online WPE Dereverberation
T. Nakatani, K. Kinoshita (2019)
Simultaneous Denoising and Dereverberation for Low-Latency Applications Using Frame-by-Frame Online Unified Convolutional Beamformer
K. Kinoshita, Marc Delcroix, S. Gannot, Emanuël Habets, Reinhold Häb-Umbach, Walter Kellermann, Volker Leutnant, R. Maas, T. Nakatani, B. Raj, A. Sehr, Takuya Yoshioka (2016)
A summary of the REVERB challenge: state-of-the-art and remaining challenges in reverberant speech processing research. EURASIP Journal on Advances in Signal Processing, 2016
Barry Van Veen, Kevin Buckley (1988)
Beamforming: a versatile approach to spatial filtering. IEEE ASSP Magazine, 5
T. Nakatani, Rintaro Ikeshita, K. Kinoshita, H. Sawada, S. Araki (2020)
Computationally Efficient and Versatile Framework for Joint Optimization of Blind Speech Separation and Dereverberation
Yong Xu, Jun Du, Lirong Dai, Chin-Hui Lee (2015)
A Regression Approach to Speech Enhancement Based on Deep Neural Networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23
Hideaki Kagami, H. Kameoka, M. Yukawa (2018)
Joint Separation and Dereverberation of Reverberant Mixtures with Determined Multichannel Non-Negative Matrix Factorization. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
T. Robinson, J. Fransen, D. Pye, J. Foote, S. Renals (1995)
WSJCAM0: a British English speech corpus for large vocabulary continuous speech recognition. 1995 International Conference on Acoustics, Speech, and Signal Processing, 1
N. Ito, S. Araki, Marc Delcroix, T. Nakatani (2017)
Probabilistic spatial dictionary based online adaptive beamforming for meeting recognition in noisy and reverberant environments. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
Byung Cho, Jun-Min Lee, Hyung-Min Park (2019)
A Beamforming Algorithm Based on Maximum Likelihood of a Complex Gaussian Distribution With Time-Varying Variances for Robust Speech Recognition. IEEE Signal Processing Letters, 26
N. Ito, S. Araki, Takuya Yoshioka, T. Nakatani (2014)
Relaxed disjointness based clustering for joint blind source separation and dereverberation. 2014 14th International Workshop on Acoustic Signal Enhancement (IWAENC)
Ante Jukic, T. Waterschoot, Timo Gerkmann, S. Doclo (2015)
Multi-Channel Linear Prediction-Based Speech Dereverberation With Sparse Priors. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 23
J. Bradley, H. Sato, M. Picard (2003)
On the importance of early reflections for speech in rooms. The Journal of the Acoustical Society of America, 113(6)
Morten Kolbaek, Dong Yu, Z. Tan, J. Jensen (2017)
Multitalker Speech Separation With Utterance-Level Permutation Invariant Training of Deep Recurrent Neural Networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25
Seungjin Choi (2009)
Independent Component Analysis
Kateřina Žmolíková, Marc Delcroix, K. Kinoshita, Tsubasa Ochiai, T. Nakatani, L. Burget, J. Černocký (2019)
SpeakerBeam: Speaker Aware Neural Network for Target Speaker Extraction in Speech Mixtures. IEEE Journal of Selected Topics in Signal Processing, 13
S. Golan, S. Gannot, I. Cohen (2009)
Multichannel Eigenspace Beamforming in a Reverberant Noisy Environment With Multiple Interfering Speech Signals. IEEE Transactions on Audio, Speech, and Language Processing, 17
J. Hershey, Zhuo Chen, Jonathan Roux, Shinji Watanabe (2015)
Deep clustering: Discriminative embeddings for segmentation and separation. 2016 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP)
M. Souden, S. Araki, K. Kinoshita, T. Nakatani, H. Sawada (2013)
A Multichannel MMSE-Based Framework for Speech Source Separation and Noise Reduction. IEEE Transactions on Audio, Speech, and Language Processing, 21
Sebastian Braun, Emanuël Habets (2018)
Linear Prediction-Based Online Dereverberation and Noise Reduction Using Alternating Kalman Filters. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26
© 2020 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

arXiv:2005.09843v3 [eess.AS] 2 Aug 2020

Jointly optimal denoising, dereverberation, and source separation

Tomohiro Nakatani, Senior Member, IEEE, Christoph Boeddeker, Student Member, IEEE, Keisuke Kinoshita, Senior Member, IEEE, Rintaro Ikeshita, Member, IEEE, Marc Delcroix, Senior Member, IEEE, Reinhold Haeb-Umbach, Fellow, IEEE

T. Nakatani, K. Kinoshita, R. Ikeshita, and M. Delcroix are with NTT Corporation. C. Boeddeker and R. Haeb-Umbach are with Paderborn Univ. Manuscript received January 1, 2020; revised XXXX XX, 2020.

Abstract: This paper proposes methods that can optimize a Convolutional BeamFormer (CBF) for jointly performing denoising, dereverberation, and source separation (DN+DR+SS) in a computationally efficient way. Conventionally, a cascade configuration, composed of a Weighted Prediction Error minimization (WPE) dereverberation filter followed by a Minimum Variance Distortionless Response (MVDR) beamformer, has been used as the state-of-the-art frontend of far-field speech recognition, even though this approach's overall optimality is not guaranteed. In the blind signal processing area, an approach for jointly optimizing dereverberation and source separation (DR+SS) has been proposed; however, it requires huge computing cost, and has not been extended for applications to DN+DR+SS. To overcome the above limitations, this paper develops new approaches for jointly optimizing DN+DR+SS in a computationally much more efficient way. To this end, we first present an objective function to optimize a CBF for performing DN+DR+SS based on maximum likelihood estimation on an assumption that the steering vectors of the target signals are given or can be estimated, e.g., using a neural network. This paper refers to a CBF optimized by this objective function as a weighted Minimum-Power Distortionless Response (wMPDR) CBF. Then, we derive two algorithms for optimizing a wMPDR CBF based on two different ways of factorizing a CBF into WPE filters and beamformers: one based on an extension of the conventional joint optimization approach proposed for DR+SS and another based on a novel technique. Experiments using noisy reverberant sound mixtures show that the proposed optimization approaches greatly improve the performance of the speech enhancement in comparison with the conventional cascade configuration in terms of signal distortion measures and ASR performance. The proposed approaches also greatly reduce the computing cost with improved estimation accuracy in comparison with the conventional joint optimization approach.

Index Terms: Beamforming, dereverberation, source separation, microphone array, automatic speech recognition, maximum likelihood estimation

I. INTRODUCTION

When a speech signal is captured by distant microphones, e.g., in a conference room, it often contains reverberation, diffuse noise, and extraneous speakers' voices. These components are detrimental to the intelligibility of the captured speech and often cause serious degradation in many applications such as hands-free teleconferencing and Automatic Speech Recognition (ASR).

Microphone array speech enhancement has been scrutinized to minimize the aforementioned detrimental effects in the acquired signal. For performing denoising (DN), beamforming techniques have been investigated for decades [1], [2], [3], [4], and the Minimum Variance Distortionless Response (MVDR) beamformer and the Minimum Power Distortionless Response (MPDR) beamformer are now widely used as state-of-the-art techniques. For source separation (SS), a number of blind signal processing techniques have been developed, including independent component analysis [5], independent vector analysis [6], and spatial clustering-based beamforming [7]. For dereverberation (DR), a Weighted Prediction Error minimization (WPE) based linear prediction technique [8], [9] and its variants [10] have been actively studied as an effective approach. With these techniques, for determining the coefficients of filtering, it is crucial to accurately estimate such statistics of the speech signals and the noise as their spatial covariances and time-varying variances. However, the estimation often becomes inaccurate when the signals are mixed under reverberant and noisy conditions, which seriously degrades the performance of these techniques.

To enhance the robustness of the above techniques, neural network-supported microphone array speech enhancement has been actively studied, and its effectiveness has been identified for denoising [11], dereverberation [12], and source separation [13], [14]. With this approach, neural networks estimate such statistics of the signals and noise as Time-Frequency (TF) masks and time-varying variances [13], [15], [16], [17], while microphone array signal processing performs speech enhancement. This combination is particularly effective because neural networks can successfully capture the spectral patterns of signals over wide TF ranges and reliably estimate such statistics of the signals. Conventional signal processing often fails to adequately handle them. On the other hand, neural networks often introduce into the processed signal nonlinear distortions, which are harmful to perceived speech quality and ASR. This problem can be avoided by microphone array techniques. A number of articles have reported the usefulness of this combination, particularly for far-field ASR, e.g., at the REVERB challenge [18] and the CHiME-3/4/5 challenges [19], [20].

Despite the success of neural network-supported microphone array speech enhancement, how to optimally combine individual microphone array techniques for simultaneously performing denoising, dereverberation, and source separation (DN+DR+SS) in a computationally efficient way remains inadequately investigated. For example, for denoising and dereverberation (DN+DR), the cascade configuration of a WPE filter followed by an MVDR/MPDR beamformer has been widely used as the state-of-the-art frontend, e.g., at the far-field ASR challenges [18], [19], [20], [21]. However, since the WPE filter and the beamformer are separately optimized, the overall optimality of this approach is not guaranteed. To optimally perform DN+DR, several techniques have been proposed using a Kalman filter [22], [23], [24]. A technique, called Integrated Sidelobe Cancellation and Linear Prediction (ISCLP) [24], optimizes an integrated filter that can cancel noise and reverberation from the observed signals using a sidelobe cancellation framework. With this technique, however, the steering vector of the target signal needs to be directly estimated in advance from noisy reverberant speech, which is challenging and limits the overall estimation accuracy. In the blind signal processing area, on the other hand, a technique that jointly optimizes a pair comprised of a WPE filter and a beamformer has been proposed for dereverberation and source separation (DR+SS) under noiseless conditions [25], [26], [27]. One advantage of this approach is that we can access multichannel dereverberated signals obtained as the output of the WPE filter during the optimization, and utilize them to reliably estimate the beamformer. However, this approach 1) requires huge computing cost for the optimization, and 2) has not been extended for application to DN+DR+SS.

To overcome the above limitations, this paper develops algorithms for optimizing a Convolutional BeamFormer (CBF) that can perform DN+DR+SS in a computationally much more efficient way. A CBF is a filter that is applied to a multichannel observed signal to yield the desired output signals. For CBF optimization, this paper first presents a common objective function based on the Maximum Likelihood (ML) criterion by assuming that the steering vectors of the desired signals are given, or can be estimated. This paper refers to a CBF optimized by this objective function as a weighted MPDR (wMPDR) CBF. After showing that a CBF can be factorized into WPE filter(s) and beamformer(s) in two different ways, we derive two different algorithms for optimizing the wMPDR CBF, based on the two ways of factorizing the CBF. The first approach, called source-packed factorization, is an extension of the conventional joint optimization technique proposed for DR+SS [25], [26], [27]. We first show that its direct application to DN+DR+SS suffers from serious problems in terms of the computational efficiency and estimation accuracy and present an extension for solving them. The second approach, called source-wise factorization, is based on a novel factorization technique that factorizes a CBF into a set of sub-filter pairs, each of which is composed of a WPE filter and a beamformer, and independently estimates each source. For both approaches, we also present a method that robustly estimates the steering vectors of the desired signals during the wMPDR CBF optimization using the output of the WPE filters. A neural network-supported TF-mask estimation technique is also incorporated to estimate the steering vectors. Although both approaches work comparably well in terms of estimation accuracy, source-wise factorization has advantages in terms of computational efficiency. An additional benefit of source-wise factorization is that it can be used, without loss of optimality, for the extraction of a single target source from a sound mixture, which is now an important application area of speech enhancement [13], [29]. (Footnote: Note that the proposed techniques can also be applied to conventional blind signal processing for DR+SS, as discussed in an article [28].)

Experiments based on noisy reverberant sound mixtures created using the REVERB Challenge dataset [18] show that the proposed optimization approaches substantially improve the DN+DR+SS performance in comparison to the conventional cascade configuration in terms of ASR performance and signal distortion reduction. These two proposed approaches can also greatly reduce the computing cost with improved estimation accuracy in comparison with the conventional joint optimization approach.

Certain parts of this paper have already been presented in our recent conference papers. The ML formulation for optimizing a CBF was derived for DN+DR [30]. Another work [31] argued that a CBF for DN+DR can be factorized into a WPE filter and a wMPDR (non-convolutional) beamformer, and jointly optimized without loss of optimality. Another work [32] presented ways to reliably estimate TF masks for DN+DR+SS. This paper integrates these techniques to perform DN+DR+SS in a computationally efficient way.

In the remainder of this paper, the models of the observed signal and the CBF are defined in Section II. Then, Section III presents our proposed optimization methods, and Section IV summarizes their characteristics and advantages. Sections V and VI describe experimental results and concluding remarks.

II. MODELS OF SIGNAL AND BEAMFORMER

This paper assumes that $I$ source signals are captured by $M$ ($\geq I$) microphones in a noisy reverberant environment. The captured signal at each TF point in the short-time Fourier transform (STFT) domain is modeled by

$$\mathbf{x}_{t,f} = \sum_{i=1}^{I} \mathbf{x}^{(i)}_{t,f} + \mathbf{n}_{t,f}, \tag{1}$$

$$\mathbf{x}^{(i)}_{t,f} = \mathbf{d}^{(i)}_{t,f} + \mathbf{r}^{(i)}_{t,f}, \tag{2}$$

where $t$ and $f$ are time and frequency indices, respectively, and $\mathbf{x}_{t,f} = [x_{1,t,f}, \ldots, x_{M,t,f}]^{\top} \in \mathbb{C}^{M \times 1}$ is a column vector containing all the microphone signals at a TF point. Here, $(\cdot)^{\top}$ denotes the non-conjugate transpose. $\mathbf{x}^{(i)}_{t,f} = [x^{(i)}_{1,t,f}, \ldots, x^{(i)}_{M,t,f}]^{\top}$ is a (noiseless) reverberant signal corresponding to the $i$th source, and $\mathbf{n}_{t,f} = [n_{1,t,f}, \ldots, n_{M,t,f}]^{\top}$ is the additive diffuse noise. $\mathbf{x}^{(i)}_{t,f}$ for each source in Eq. (1) is further decomposed into two parts in Eq. (2), one of which consists of the direct signal and early reflections, referred to as desired signal $\mathbf{d}^{(i)}_{t,f}$, and the other corresponds to late reverberation $\mathbf{r}^{(i)}_{t,f}$. Hereafter, the frequency indices of the symbols are omitted for brevity, assuming that each frequency bin is processed independently in the same way.

In this paper, the goal of DN+DR+SS is to estimate $\mathbf{d}^{(i)}_{t}$ for each source $i$ from $\mathbf{x}_{t}$ in Eq. (1) by reducing $\mathbf{r}^{(i)}_{t}$ of source $i$, $\mathbf{x}^{(i')}_{t}$ of all the other sources $i' \neq i$, and diffuse noise $\mathbf{n}_{t}$. Since in noisy reverberant environments, early reflections enhance the intelligibility of speech for human perception [33] and improve the ASR performance by computer [34], we include them in the desired signal. Hereafter, we use $m = 1$ as a reference microphone and describe a method for estimating desired signal $d^{(i)}_{1,t}$ at the microphone without loss of generality.

To achieve the above goal, we further model $\mathbf{d}^{(i)}_{t}$:

$$\mathbf{d}^{(i)}_{t} = \mathbf{v}^{(i)} s^{(i)}_{t} = \tilde{\mathbf{v}}^{(i)} d^{(i)}_{1,t}, \tag{3}$$

where $s^{(i)}_{t}$ is the $i$th clean speech at a TF point. In Eq. (3), the desired signal of the $i$th source, $\mathbf{d}^{(i)}_{t}$, is modeled by $\mathbf{v}^{(i)} s^{(i)}_{t}$, i.e., a product in the STFT domain of the clean speech with transfer function $\mathbf{v}^{(i)}$, hereafter a steering vector, assuming that the duration of the impulse response corresponding to the direct signal and early reflections in the time domain is sufficiently short in comparison with the analysis window [35]. We further rewrite the desired signal as $\tilde{\mathbf{v}}^{(i)} d^{(i)}_{1,t}$, i.e., a product of the desired signal at reference microphone $d^{(i)}_{1,t} = v^{(i)}_{1} s^{(i)}_{t}$ with a Relative Transfer Function (RTF) [36], which is defined as the steering vector divided by its reference microphone element,

$$\tilde{\mathbf{v}}^{(i)} = \mathbf{v}^{(i)} / v^{(i)}_{1}. \tag{4}$$

In contrast, assuming that the duration of the late reverberation in the time domain exceeds the analysis window, late reverberation $\mathbf{r}^{(i)}_{t}$ is modeled by a convolution in the STFT domain [37] of the clean speech with a time series of acoustic transfer functions that corresponds to the late reverberation:

$$\mathbf{r}^{(i)}_{t} = \sum_{\tau=\Delta}^{L_a - 1} \mathbf{a}^{(i)}_{\tau} s^{(i)}_{t-\tau}, \tag{5}$$

where $\mathbf{a}^{(i)}_{\tau} = [a^{(i)}_{1,\tau}, \ldots, a^{(i)}_{M,\tau}]^{\top}$ for $\tau \in \{\Delta, \ldots, L_a - 1\}$ are the convolutional acoustic transfer functions, and $\Delta$ is the mixing time, which represents the relative frame delay of the late reverberation start time to the direct signal.

In this paper, we further assume that $\mathbf{d}^{(i)}_{t}$ is statistically independent of the following variables:
• $s^{(i)}_{t'}$ for $t' \leq t - \Delta$ (and thus $\mathbf{d}^{(i)}_{t}$ is statistically independent of $\mathbf{x}^{(i)}_{t'}$ for $t' \leq t - \Delta$),
• $\mathbf{r}^{(i)}_{t''}$ for $t'' \leq t$,
• $\mathbf{x}^{(i')}_{t'}$ and $\mathbf{n}_{t'}$ for all $t$, $t'$ and $i' \neq i$.
These assumptions are used to derive the optimization algorithms described in the following. (Footnote: See a previous work [8] for more precise discussion of the statistical independence between $\mathbf{d}^{(i)}_{t}$ and $s^{(i)}_{t'}$ for $t' \leq t - \Delta$.)

[Figure 1 panels: (a) MIMO CBF: convolutional beamformer for dereverberation, denoising, and source separation; (b) MIMO CBF with source-packed factorization: multiple-target linear prediction (LP) followed by a beamformer matrix for separation and denoising; (c) Set of MISO CBFs: one convolutional beamformer for each source $i = 1, \ldots, I$; (d) MISO CBFs with source-wise factorization: single-target LP followed by a beamformer for each source $i = 1, \ldots, I$.]

Fig. 1. Multi-Input Multi-Output (MIMO) CBF and its three different implementations. They are equivalent to each other in the sense that whatever values are set to coefficients of one implementation, certain coefficients of the other implementations can be determined such that they realize identical input-output relationships. Thus, optimal solutions of all implementations are identical as long as they are optimized based on the same objective function.

A. Definition of a CBF and its three different implementations

We now define a CBF, which will later be factorized into WPE filter(s) and beamformer(s):

$$\mathbf{y}_{t} = \mathbf{W}_{0}^{\mathsf{H}} \mathbf{x}_{t} + \sum_{\tau=\Delta}^{L-1} \mathbf{W}_{\tau}^{\mathsf{H}} \mathbf{x}_{t-\tau}, \tag{6}$$

where $\mathbf{y}_{t} = [y^{(1)}_{t}, \ldots, y^{(I)}_{t}]^{\top} \in \mathbb{C}^{I \times 1}$ is the output of the CBF corresponding to the estimates of $I$ desired signals, $\mathbf{W}_{\tau} \in \mathbb{C}^{M \times I}$ for each $\tau \in \{0, \Delta, \Delta+1, \ldots, L-1\}$ is a matrix composed of the beamformer coefficients, $(\cdot)^{\mathsf{H}}$ denotes a conjugate transpose, and $\Delta$ is the prediction delay of the CBF. We set $\Delta$ equal to the mixing time introduced in Eq. (5), so that the desired signals are included only in the first term of Eq. (6) and are statistically independent of the second term based on the assumptions introduced in the signal model. Then this paper performs DN+DR+SS by estimating the beamformer coefficients that can estimate the desired signals included in the first term of Eq. (6).

For notational simplicity, we also introduce a matrix representation of a CBF:

$$\mathbf{y}_{t} = \begin{bmatrix} \mathbf{W}_{0} \\ \bar{\mathbf{W}} \end{bmatrix}^{\mathsf{H}} \begin{bmatrix} \mathbf{x}_{t} \\ \bar{\mathbf{x}}_{t} \end{bmatrix}, \tag{7}$$

where $\bar{\mathbf{W}}$ is a matrix containing $\mathbf{W}_{\tau}$ for $\Delta \leq \tau \leq L-1$ and $\bar{\mathbf{x}}_{t}$ is a column vector containing past multichannel observed signals $\mathbf{x}_{t-\tau}$ for $\Delta \leq \tau \leq L-1$:

$$\bar{\mathbf{W}} = \left[\mathbf{W}_{\Delta}^{\top}, \ldots, \mathbf{W}_{L-1}^{\top}\right]^{\top} \in \mathbb{C}^{M(L-\Delta) \times I}, \tag{8}$$

$$\bar{\mathbf{x}}_{t} = \left[\mathbf{x}_{t-\Delta}^{\top}, \ldots, \mathbf{x}_{t-L+1}^{\top}\right]^{\top} \in \mathbb{C}^{M(L-\Delta) \times 1}. \tag{9}$$

Hereafter, we refer to the CBF defined by Eqs. (6) and (7) as a MIMO CBF.

In the following, we further present three different implementations of CBF, including two ways of factorizing it. Figure 1 illustrates the MIMO CBF and its three different implementations.

1) Source-packed factorization: With the implementation shown in Fig. 1 (b), we directly factorize the MIMO CBF in Eq. (7):

$$\begin{bmatrix} \mathbf{W}_{0} \\ \bar{\mathbf{W}} \end{bmatrix} = \begin{bmatrix} \mathbf{I}_{M} \\ -\mathbf{G} \end{bmatrix} \mathbf{Q}, \tag{10}$$

where $\mathbf{Q} \in \mathbb{C}^{M \times I}$, $\mathbf{G} \in \mathbb{C}^{M(L-\Delta) \times M}$, and $\mathbf{I}_{M} \in \mathbb{R}^{M \times M}$ is an identity matrix. (Footnote: The existence of $\mathbf{G}$, which satisfies $\bar{\mathbf{W}} = -\mathbf{G}\mathbf{Q}$, is guaranteed for any $\bar{\mathbf{W}}$ when $M \geq I$ and $\mathrm{rank}\{\mathbf{Q}\} = I$.) Then Eq. (6) can be rewritten as a pair of a (convolutional) linear prediction filter followed by a (non-convolutional) beamformer matrix:

$$\mathbf{z}_{t} = \mathbf{x}_{t} - \mathbf{G}^{\mathsf{H}} \bar{\mathbf{x}}_{t}, \tag{11}$$

$$\mathbf{y}_{t} = \mathbf{Q}^{\mathsf{H}} \mathbf{z}_{t}. \tag{12}$$

Here $\mathbf{z}_{t} \in \mathbb{C}^{M \times 1}$ and $\mathbf{G}$ are the output and the prediction matrix of the linear prediction, and $\mathbf{Q}$ is the coefficient matrix of the beamformer. Eq. (11), which is supposed to dereverberate all the sources at the same time, is thus referred to as a multiple-target linear prediction, and Eq. (12) is supposed to perform denoising and source separation at the same time. Because individual sources are not distinguished in the WPE filter's output, this implementation is called source-packed factorization.

One example of source-packed factorization is the cascade configuration composed of a WPE filter followed by a beamformer, which has been widely used for DN+DR+SS in the far-field speech recognition area [14], [20], [38], and the other example is one used in the joint optimization of a WPE filter and a beamformer, which has been investigated for DR+SS in the blind signal processing area [25], [26], [27].

2) Multi-Input Single-Output (MISO) CBF: Next we define the set of MISO CBFs shown in Fig. 1 (c). They are obtained by decomposing the beamformer coefficients in Eq. (7):

$$\begin{bmatrix} \mathbf{W}_{0} \\ \bar{\mathbf{W}} \end{bmatrix} = \begin{bmatrix} \mathbf{w}_{0}^{(1)} & \mathbf{w}_{0}^{(2)} & \cdots & \mathbf{w}_{0}^{(I)} \\ \bar{\mathbf{w}}^{(1)} & \bar{\mathbf{w}}^{(2)} & \cdots & \bar{\mathbf{w}}^{(I)} \end{bmatrix}, \tag{13}$$

where $\mathbf{w}_{0}^{(i)} \in \mathbb{C}^{M \times 1}$ and $\bar{\mathbf{w}}^{(i)} \in \mathbb{C}^{M(L-\Delta) \times 1}$ are column vectors, which respectively contain the $i$th columns of $\mathbf{W}_{0}$ and $\bar{\mathbf{W}}$; they are used to extract the $i$th desired signal. Then, Eq. (7) can be rewritten for each source $i$:

$$y^{(i)}_{t} = \begin{bmatrix} \mathbf{w}_{0}^{(i)} \\ \bar{\mathbf{w}}^{(i)} \end{bmatrix}^{\mathsf{H}} \begin{bmatrix} \mathbf{x}_{t} \\ \bar{\mathbf{x}}_{t} \end{bmatrix}. \tag{14}$$

For example, MISO CBFs were previously used [30], [39]. ISCLP [24] can also be viewed as the realization of a MISO CBF using a sidelobe cancellation framework [40].

3) Source-wise factorization: With the source-wise factorization shown in Fig. 1 (d), we further factorize each MISO CBF defined in Eq. (14) for source $i$:

$$\begin{bmatrix} \mathbf{w}_{0}^{(i)} \\ \bar{\mathbf{w}}^{(i)} \end{bmatrix} = \begin{bmatrix} \mathbf{I}_{M} \\ -\mathbf{G}^{(i)} \end{bmatrix} \mathbf{q}^{(i)}, \tag{15}$$

where $\mathbf{q}^{(i)} \in \mathbb{C}^{M \times 1}$ and $\mathbf{G}^{(i)} \in \mathbb{C}^{M(L-\Delta) \times M}$. Then, Eq. (14) can be rewritten as a pair of a linear prediction filter and a beamformer:

$$\mathbf{z}^{(i)}_{t} = \mathbf{x}_{t} - \left(\mathbf{G}^{(i)}\right)^{\mathsf{H}} \bar{\mathbf{x}}_{t}, \tag{16}$$

$$y^{(i)}_{t} = \left(\mathbf{q}^{(i)}\right)^{\mathsf{H}} \mathbf{z}^{(i)}_{t}, \tag{17}$$

where $\mathbf{z}^{(i)}_{t} \in \mathbb{C}^{M \times 1}$ and $\mathbf{G}^{(i)}$ are the output and the prediction matrix of the linear prediction, and $\mathbf{q}^{(i)}$ is the beamformer's coefficient vector. Because Eq. (16) is performed only to estimate the $i$th source, it is called single-target linear prediction.

4) Relationship between two factorization approaches: The difference between the two factorization approaches, namely Figs. 1 (b) and (d), is based only on how the linear prediction is performed: Eq. (11) or Eq. (16). More specifically, it is based on whether the prediction matrices, $\mathbf{G}$ and $\mathbf{G}^{(i)}$, are common to all the sources or different over different sources. Therefore, different optimization algorithms with different characteristics are derived, as will be shown in Section III. In contrast, the beamformer parts, $\mathbf{Q}$ and $\mathbf{q}^{(i)}$ in Eqs. (12) and (17), are identical in the two approaches, viewing $\mathbf{q}^{(i)}$ as the $i$th column of $\mathbf{Q}$, because they satisfy $\mathbf{W}_{0} = \mathbf{Q}$ in Eq. (10) and $\mathbf{w}_{0}^{(i)} = \mathbf{q}^{(i)}$ in Eq. (15).

In addition, it should be noted that all the above CBF implementations are equivalent to each other in the sense that whatever values are set to the coefficients of one implementation, certain coefficients of the other implementations can be determined such that they realize the same input-output relationship. Thus, the optimal solutions of all the implementations are identical as long as they are based on the same objective function.

III. ML ESTIMATION OF CBF

In this section, we derive two different optimization algorithms using (b) source-packed factorization and (d) source-wise factorization. For the derivations, we assume that the RTFs $\tilde{\mathbf{v}}^{(i)}$ and the time-varying variances of the output signals yielded by the optimal CBF, denoted by $\lambda^{(i)}_{t}$, are given. Then in Section III-E, we describe ways for jointly estimating $\lambda^{(i)}_{t}$ with CBF coefficients based on the ML criterion and estimating $\tilde{\mathbf{v}}^{(i)}$ based on the WPE filter's output obtained at a step of the optimization.

A. Probabilistic model

First, we formulate the objective function for DN+DR+SS by reinterpreting the objective function proposed for DN+DR [30]. For this formulation, we interpret DN+DR+SS to be composed of a set of separate processing steps, each of which applies DN+DR to enhance source $i$ by reducing the late reverberation of the source (DR) and the additive noise including the other sources and the diffuse noise (DN). With this interpretation, we introduce the following assumptions, similar to the previous work [30]:
• The output of the optimal CBF for each $i$, namely $y^{(i)}_{t}$, follows a zero-mean complex Gaussian distribution with time-varying variance $\lambda^{(i)}_{t} = E\{|y^{(i)}_{t}|^{2}\}$ [8].
• The beamformer satisfies a distortionless constraint for each source $i$ defined using RTF $\tilde{\mathbf{v}}^{(i)}$ in Eq. (4):

$$\left(\mathbf{w}_{0}^{(i)}\right)^{\mathsf{H}} \tilde{\mathbf{v}}^{(i)} = 1 \quad \left(\text{or } \left(\mathbf{q}^{(i)}\right)^{\mathsf{H}} \tilde{\mathbf{v}}^{(i)} = 1\right). \tag{18}$$

Then based on the previous discussion [30], we can approximately derive the objective function to be minimized for estimating the CBF coefficients for source $i$, e.g., $\theta^{(i)} = \{\mathbf{w}_{0}^{(i)}, \bar{\mathbf{w}}^{(i)}\}$, according to ML estimation:

$$\mathcal{L}_{i}(\theta^{(i)}) = \frac{1}{T} \sum_{t=1}^{T} \left( \frac{|y^{(i)}_{t}|^{2}}{\lambda^{(i)}_{t}} + \log \lambda^{(i)}_{t} \right) \quad \text{s.t. } \left(\mathbf{w}_{0}^{(i)}\right)^{\mathsf{H}} \tilde{\mathbf{v}}^{(i)} = 1. \tag{19}$$

The objective function for estimating all the sources can then be obtained by summing Eq. (19) over all the sources:

$$\mathcal{L}(\Theta) = \sum_{i=1}^{I} \mathcal{L}_{i}(\theta^{(i)}), \quad \text{s.t. } \left(\mathbf{w}_{0}^{(i)}\right)^{\mathsf{H}} \tilde{\mathbf{v}}^{(i)} = 1 \text{ for all } i, \tag{20}$$

where $\Theta = \{\theta^{(1)}, \ldots, \theta^{(I)}\}$. This objective function is used commonly for all the implementations of a CBF. In this paper, we call a CBF optimized by the above objective function a weighted MPDR (wMPDR) CBF because it minimizes the average power of output $y^{(i)}_{t}$ weighted by time-varying variance, $\lambda^{(i)}_{t}$, of the signal.

Here, let us briefly explain how DN+DR+SS is performed by Eqs. (19) and (20). Substituting Eqs. (1) and (2) into Eq. (14) and using the model of the desired signal in Eq. (3) and the distortionless constraint in Eq. (18), we obtain

$$y^{(i)}_{t} = d^{(i)}_{1,t} + \hat{r}^{(i)}_{t} + \sum_{i' \neq i} \hat{x}^{(i')}_{t} + \hat{n}_{t}, \tag{21}$$

where $\hat{r}^{(i)}_{t}$, $\hat{x}^{(i')}_{t}$ for $i' \neq i$, and $\hat{n}_{t}$ are respectively the late reverberation of the $i$th source, all the other sources, and the additive diffuse noise remaining in the CBF output, written in MISO CBF form:

$$\hat{r}^{(i)}_{t} = \begin{bmatrix} \mathbf{w}_{0}^{(i)} \\ \bar{\mathbf{w}}^{(i)} \end{bmatrix}^{\mathsf{H}} \begin{bmatrix} \mathbf{r}^{(i)}_{t} \\ \bar{\mathbf{x}}^{(i)}_{t} \end{bmatrix}, \tag{22}$$

$$\hat{x}^{(i')}_{t} = \begin{bmatrix} \mathbf{w}_{0}^{(i)} \\ \bar{\mathbf{w}}^{(i)} \end{bmatrix}^{\mathsf{H}} \begin{bmatrix} \mathbf{x}^{(i')}_{t} \\ \bar{\mathbf{x}}^{(i')}_{t} \end{bmatrix}, \tag{23}$$

$$\hat{n}_{t} = \begin{bmatrix} \mathbf{w}_{0}^{(i)} \\ \bar{\mathbf{w}}^{(i)} \end{bmatrix}^{\mathsf{H}} \begin{bmatrix} \mathbf{n}_{t} \\ \bar{\mathbf{n}}_{t} \end{bmatrix}, \tag{24}$$

where $\bar{\mathbf{n}}_{t} = [\mathbf{n}_{t-\Delta}^{\top}, \ldots, \mathbf{n}_{t-L+1}^{\top}]^{\top}$. According to the statistical independence assumptions introduced in Section II, $d^{(i)}_{1,t}$ is statistically independent of $\hat{r}^{(i)}_{t}$, $\hat{x}^{(i')}_{t}$, and $\hat{n}_{t}$. Then substituting Eq. (21) into Eq. (19) and omitting the constant terms, we obtain the following (in the expectation sense):

$$E\left\{\mathcal{L}_{i}(\theta^{(i)})\right\} = \frac{1}{T} \sum_{t=1}^{T} \frac{E\left\{\left|\hat{r}^{(i)}_{t} + \sum_{i' \neq i} \hat{x}^{(i')}_{t} + \hat{n}_{t}\right|^{2}\right\}}{\lambda^{(i)}_{t}}. \tag{25}$$

The above equation indicates that minimization of the objective function indeed minimizes the sum of $\hat{r}^{(i)}_{t}$, $\hat{x}^{(i')}_{t}$ for $i' \neq i$, and $\hat{n}_{t}$ in Eq. (21).

Before deriving the optimization algorithms, we define a matrix that is frequently used in the derivation, referred to as a variance-normalized spatio-temporal covariance matrix. Letting $\bar{\bar{\mathbf{x}}}_{t}$ be a column vector composed of the current and past observed signals at all the microphones, defined as

$$\bar{\bar{\mathbf{x}}}_{t} = \left[\mathbf{x}_{t}^{\top}, \bar{\mathbf{x}}_{t}^{\top}\right]^{\top} \in \mathbb{C}^{M(L-\Delta+1) \times 1}, \tag{26}$$

the matrix is defined:

$$\bar{\bar{\mathbf{R}}}^{(i)}_{\mathbf{x}} = \frac{1}{T} \sum_{t=1}^{T} \frac{\bar{\bar{\mathbf{x}}}_{t} \bar{\bar{\mathbf{x}}}_{t}^{\mathsf{H}}}{\lambda^{(i)}_{t}} \in \mathbb{C}^{M(L-\Delta+1) \times M(L-\Delta+1)}. \tag{27}$$

Its factorized form is also defined:

$$\bar{\bar{\mathbf{R}}}^{(i)}_{\mathbf{x}} = \begin{bmatrix} \mathbf{R}^{(i)}_{\mathbf{x}} & \left(\mathbf{P}^{(i)}_{\mathbf{x}}\right)^{\mathsf{H}} \\ \mathbf{P}^{(i)}_{\mathbf{x}} & \bar{\mathbf{R}}^{(i)}_{\mathbf{x}} \end{bmatrix}, \tag{28}$$

where

$$\mathbf{R}^{(i)}_{\mathbf{x}} = \frac{1}{T} \sum_{t=1}^{T} \frac{\mathbf{x}_{t} \mathbf{x}_{t}^{\mathsf{H}}}{\lambda^{(i)}_{t}} \in \mathbb{C}^{M \times M}, \tag{29}$$

$$\mathbf{P}^{(i)}_{\mathbf{x}} = \frac{1}{T} \sum_{t=1}^{T} \frac{\bar{\mathbf{x}}_{t} \mathbf{x}_{t}^{\mathsf{H}}}{\lambda^{(i)}_{t}} \in \mathbb{C}^{M(L-\Delta) \times M}, \tag{30}$$

$$\bar{\mathbf{R}}^{(i)}_{\mathbf{x}} = \frac{1}{T} \sum_{t=1}^{T} \frac{\bar{\mathbf{x}}_{t} \bar{\mathbf{x}}_{t}^{\mathsf{H}}}{\lambda^{(i)}_{t}} \in \mathbb{C}^{M(L-\Delta) \times M(L-\Delta)}. \tag{31}$$

B. Optimization based on source-packed factorization

This subsection discusses methods for optimizing a CBF with the source-packed factorization. In the following, after describing a method for directly applying the conventional joint optimization technique used for DR+SS to DN+DR+SS, we summarize its problems and present solutions to them.

1) Direct application of a conventional technique: With the source-packed factorization in Eqs. (11) and (12), simultaneously estimating both $\mathbf{Q}$ and $\mathbf{G}$ in closed form is difficult even when both $\lambda^{(i)}_{t}$ and $\tilde{\mathbf{v}}^{(i)}$ are given. Instead, we use an iterative and alternate estimation scheme, following a blind signal processing technique [25], [26], [27], where at each estimation step, either $\mathbf{Q}$ or $\mathbf{G}$ is updated and the other is fixed.

For updating $\mathbf{G}$, we fix $\mathbf{Q}$ at its previously estimated value. For the algorithm derivation, the representation of linear prediction in Eq. (11) is slightly modified:

$$\mathbf{z}_{t} = \mathbf{x}_{t} - \bar{\mathbf{X}}_{t} \bar{\mathbf{g}}, \tag{32}$$

where $\bar{\mathbf{X}}_{t}$ and $\bar{\mathbf{g}}$ are equivalent to $\bar{\mathbf{x}}_{t}$ and $\mathbf{G}$ with a modified matrix structure defined:

$$\bar{\mathbf{X}}_{t} = \mathbf{I}_{M} \otimes \bar{\mathbf{x}}_{t}^{\top} \in \mathbb{C}^{M \times M^{2}(L-\Delta)}, \tag{33}$$

Then $\mathbf{q}^{(i)}$, which minimizes Eq. (40) under the distortionless constraint $\left(\mathbf{q}^{(i)}\right)^{\mathsf{H}} \tilde{\mathbf{v}}^{(i)} = 1$, can be obtained:

$$\mathbf{q}^{(i)} = \frac{\left(\mathbf{R}^{(i)}_{\mathbf{z}}\right)^{-1} \tilde{\mathbf{v}}^{(i)}}{\left(\tilde{\mathbf{v}}^{(i)}\right)^{\mathsf{H}} \left(\mathbf{R}^{(i)}_{\mathbf{z}}\right)^{-1} \tilde{\mathbf{v}}^{(i)}}. \tag{42}$$

Because the above beamformer minimizes the average power of $\mathbf{z}_{t}$ weighted by the time-varying variance, we call it a weighted MPDR (wMPDR) beamformer.
As shown in H 2 Section III-C, a wMPDR beamformer is a special case of a ⊤ ⊤ M (L−Δ)×1 g = g , . . . ,g ∈ C , (34) 1 M wMPDR CBF, which is reduced to a wMPDR beamformer when setting the length of the CBF L = 1, i.e., by just where ⊗ is a Kronecker product and g is the mth column converting it into a non-convolutional beamformer. of G. Then, considering that the CBF in Eqs. (11) and (12) (i) The above algorithm, however, has two serious problems. (i) can be written as y = q x − X g and omitting t t First, the size of the covariance matrix in Eq. (38) is too the normalization terms, the objective function in Eq. (20) large, requiring huge computing cost for calculating it and becomes its inverse. Second, as shown in our experiments, the iterative and alternate estimation of Q and G tends to converge to 1 2 L (g) = x − X g , (35) g t t a sub-optimal point. This is probably because the update q,t t=1 of G is performed based only on the output of the ﬁxed beamformer in the iterative and alternate estimation, as in where kxk = x Rx, and Φ is a semi-deﬁnite Hermitian q,t Eq. (19); the signal dimension of the beamformer output, i.e., matrix: I, is reduced from that of the original signal space, i.e., M, I with the over-determined case, i.e., I < M. As a consequence, (i) (i) q q M×M signal components that are relevant for the update of G may Φ = ∈ C . (36) q,t (i) i=1 t be reduced in the beamformer output, especially when the estimation of Q is less accurate at the early stage of the Because Eq. (35) is a quadratic form with a lower bound, g, optimization. This can seriously degrade the update of G. which minimizes it, can be obtained: 2) Proposed extension: Next we present two techniques to mitigate the above problems within the source-packed g = Ψ ψ, (37) factorization approach. The ﬁrst reduces the computing cost. 1 2 2 M (L−Δ)×M (L−Δ) As shown in Appendix A, Eqs. (38) and (39) can be rewritten, Ψ = X Φ X ∈ C , (38) q,t t t using Eq. 
(28): 1 H 2 M (L−Δ)×1 ψ = X Φ x ∈ C , (39) X q,t t H ⊤ (i) (i) (i) Ψ = q q ⊗ R , (43) i=1 where (·) is the Moore-Penrose pseudo-inverse. Since the I (i) (i) (i) rank of Ψ is equal to or smaller than MI(L−Δ), as shown in ψ = q ⊗ P q , (44) Section III-B2, Ψ is rank deﬁcient for over-determined cases, i=1 namely when M > I, and thus the use of the pseudo-inverse is where () denotes the complex conjugate. In the above equa- indispensable. Eqs. (37) to (39) are equivalent to those used in (i) tions, the majority of the calculation is coming from R . the dereverberation step for DR+SS [25], [26], [27] except that Because the size of the matrix is much smaller than that in our paper denoising is additionally included in the objective of Ψ, we can greatly reduce the computing cost with this and over-determined cases are also considered. We call this a modiﬁcation in comparison with the direct calculation of multiple-target WPE ﬁlter. Eqs. (38) and (39). Although we still need to calculate the For the update of Q, ﬁxing g at its previously estimated inverse of huge matrix Ψ even with this modiﬁcation, the cost value, the objective in Eq. (20) can be rewritten: is relatively small in comparison with the direct calculation of Ψ. Note that Eq. (43) also shows the rank of Ψ to be equal X 2 H (i) (i) (i) to or smaller than MI(L− Δ). L (Q) = q s.t. q v ˜ = 1, (40) (i) The second technique introduces a heuristic to improve the i=1 update of the WPE ﬁlter. To use a whole M-dimensional signal (i) where R is a variance-normalized spatial covariance matrix A wMPDR beamformer was also called a Maximum-Likelihood Distor- of the output of the multiple-target WPE ﬁlter, calculated as tionless Response (MLDR) beamformer [41]. In general, the computational complexity of a matrix multiplication 1 z z t exceeds O(n ). Because the size of Ψ is M-times larger than R , the t x (i) R = . 
(41) computational complexity for calculating Ψ is probably at least M times (i) t=1 t larger than that for calculating R . x 7 (i) space to be considered for the update, we modify the CBF to to the RTF v ˜ with zero padding. Finally, we obtain the output not only I desired signals, but also M − I auxiliary solution: signals that are included in orthogonal complement Q of −1 (i) (i) R v Q and model the auxiliary signals as zero-mean time-varying (i) w = . (51) −1 complex Gaussians. With this modiﬁcation, the optimization is (i) (i) (i) v R performed by calculating the summation in Eqs. (43) and (44) (I+1) (M) over both 1 ≤ i ≤ I and I < i ≤ M, letting q , . . . ,q The above equation, which gives the simplest form of the be the orthonormal bases for the orthogonal complement Q . solution to a wMPDR CBF, clearly shows that a wMPDR (i) Because distinguishing variances λ of the auxiliary signals CBF is a general case of a wMPDR beamformer. By setting is inconsequential, we use the same value for them, calculated L = 1 in the above solution, namely, by letting it be a as non-convolutional beamformer, it reduces to the solution of a wMPDR beamformer in Eq. (42). M 2 X H ⊥ (i) An advantage of the solution using the MISO CBFs is that λ = q z , (45) M − I it can be obtained by a closed form equation, provided the i=I+1 RTFs and the time-varying variances of the desired signals are given and that we can ignore the interaction between DN and and calculate P and R based on Eqs. (30) and (31) x x DR. With this approach, however, the RTFs must be directly accordingly. In summary, we can implement this modiﬁcation estimated from a reverberant observation, similar to ISCLP by adding the following terms to Ψ and ψ in Eqs. (43) and [24]. A solution to this problem is to use dereverberation (44): preprocessing based on a WPE ﬁlter for the RTF estimation. 
X H ⊤ Although it was shown that the output of a WPE ﬁlter can ⊥ (i) (i) Ψ = q q ⊗ R , (46) be obtained in a computationally efﬁcient way within the i=I+1 framework of this approach [30], the source-wise factorization X ∗ approach described in the following can more naturally solve ⊥ (i) ⊥ (i) ψ = q ⊗ P q . (47) this problem. So, this paper adopts it as the solution. i=i+1 D. Optimization based on source-wise factorization C. Direct optimization of MISO CBFs With source-wise factorization, similar to the case with the direct optimization of the MISO CBFs, the optimization can Before deriving the optimization with source-wise factoriza- be performed separately for each source, and the resultant tion, we show that we can directly optimize the MISO CBFs algorithm is identical to that proposed for DN+DR [31]. in Eq. (14), and summarize their characteristics. With this Considering that a CBF can be written based on Eqs. (16) setting, the CBFs and the objective function are both deﬁned H (i) (i) separately for each source in Eqs. (14) and (19), and thus, the (i) and (17) as y = q x − G x and using the t t optimization can be performed separately for each source. The (i) factorized form of R in Eq. (28), the objective function in resultant algorithm is, therefore, identical to that previously Eq. (19) can be rewritten: proposed for DN+DR [42], where this type of CBF is also called a Weighted Power minimization Distortionless response −1 (i) (i) (i) (i) (i) (i) (WPD) CBF. L G ,q = G − R P q x x (i) For presenting the solution, we introduce the following x vector representation of Eq. (14): (i) + q . (52) H −1 (i) (i) (i) (i) R − P R P x x x (i) (i) y = w x , (48) t t (i) In the above objective function, G is contained only in the ﬁrst term, and the term can be minimized without depending (i) where w is deﬁned: (i) (i) on the value of q , when G takes the following value: " # (i) −1 (i) (i) (i) (i) w = , (49) G = R P . 
(53) (i) x x (i) So, this is a solution of G that globally minimizes the (i) (i) Then, when λ and v ˜ are given, Eq. (19) becomes a simple t (i) objective function given time-varing variance λ . Interest- constraint quadratic form: ingly, this solution is identical to that of conventional WPE 2 H dereverberation. This means that the WPE ﬁlter, which is (i) (i) (i) (i) L (w ) = w s.t. w v = 1, (50) (i) optimized solely for dereverberation, can perform the optimal dereverberation for the joint optimization without depending (i) where R is the covariance matrix deﬁned in Eq. (28), and h i 6 ⊤ This is not a unique solution. The ﬁrst term is minimized even when an (i) (i) M(L−Δ+1)×1 (i) v = v ˜ , 0, . . . , 0 ∈ C corresponds arbitrary matrix, whose null space includes q , is added to Eq. (53). 8 Algorithm 1: Source-packed factorization-based optimiza- Algorithm 2: Source-wise factorization-based optimiza- tion for estimation of all sources tion for estimation of ith source Data: Observed signal x for all t Data: Observed signal x for all t t t (i) (i) TF masks γ for all t and 1 ≤ i ≤ I TF masks γ for all t t t (i) (i) Result: Estimated sources y for all t and 1 ≤ i ≤ I Result: Estimated ith source y for all t t t (i) (i) 2 2 1 Initialize λ as ||x || /M for all t and 1 ≤ i ≤ I 1 Initialize λ as ||x || /M for all t t t t I t I M M (i) 2 Initialize q as the ith column of I for 1 ≤ i ≤ I 2 repeat P H (i) T x x 1 t 3 Initialize z as x for all t t t t 3 R ← x (i) T t=1 4 repeat P H (i) T x x 1 t H t (i) T x x 4 P ← 1 t x t (i) T t=1 5 R ← for 1 ≤ i ≤ I λ (i) t x t=1 (i) (i) H (i) (i) T x x 1 t t 5 G ← R P 6 P ← for 1 ≤ i ≤ I x (i) t=1 t H (i) (i) H (i) I 6 z ← x − G x (i) (i) t t 7 Ψ ← q q ⊗ R i=1 x (i) (i) (i) 7 Estimate v ˜ based on z and γ P t t I (i) (i) (i) (i) (i) 8 ψ ← q ⊗ P q i=1 P z z (i) T t t 8 R ← z (i) t=1 9 Begin Add orthogonal complement beamformer (i) (i) (I+1) (M) ´ (R ) v 10 Set q , . . . 
,q as the orthonormal bases (i) z 9 q ← H (i) (i) ´ (i) for orthogonal complement Q of Q v˜ R v˜ ( ) z P (i) (i) M (i) ⊥ 1 (i) 10 y ← q z 11 λ ← q z t t t i=I+1 M−I P (i) (i) T x x 1 t 11 λ ← y 12 R ← t t x t=1 T λ P H T x x 12 until convergence ⊥ 1 t 13 P ← x ⊥ T t=1 H ⊥ (i) (i) 14 Ψ ← Ψ + q q ⊗ R i=I+1 (i) ⊥ (i) output of the single-target WPE ﬁlter, calculated as 15 ψ ← ψ + q ⊗ P q i=I+1 16 End (i) (i) z z t t 17 g ← Ψ ψ (i) M×M R = ∈ C . (55) (i) 18 z ← x − X g λ t t t t=1 t (i) (i) 19 Estimate v ˜ based on z and γ for 1 ≤ i ≤ I Then the solution can be obtained, under a distortionless P H (i) T z (z ) 1 t t 20 R ← for 1 ≤ i ≤ I z constraint, as a wMPDR beamformer: (i) T t=1 (i) (i) −1 R v˜ ( ) (i) (i) (i) 21 q ← for 1 ≤ i ≤ I + R v ˜ H z (i) (i) (i) v˜ R v˜ ( ) z (i) q = . (56) H −1 (i) (i) H (i) (i) ´ (i) 22 y ← q z for 1 ≤ i ≤ I t v ˜ R v ˜ (i) (i) 23 λ ← y for 1 ≤ i ≤ I t t Eqs. (54) to (56) closely resemble Eqs. (40) to (42). The 24 until convergence difference is whether the dereverberation is performed by a multiple-target WPE ﬁlter or single-target WPE ﬁlters. With source-wise factorization, the solution can be obtained (i) (i) in closed form when λ and v ˜ are given, similar to the case on the subsequent beamforming, provided the time-varying with the direct optimization of the MISO CBFs. In addition, (i) variance of the desired source is given for the optimization. In the output of the WPE ﬁlter is obtained as z in Eq. (16), addition, unlike the source-packed factorization approach, this and can be efﬁciently used for the estimation of the RTFs. approach does not need to compensate for the dimensionality Furthermore, since the temporal-spatial covariance matrix in (i) reduction of the beamformer output for the update of G Eq. (31) is much smaller than that in Eq. (38) of the source- because it considers a whole signal space without adding any packed factorization, the computational cost can be reduced. (i) modiﬁcation. 
We refer to this ﬁlter G as a single-target WPE (See Section IV for more scrutiny of the computing cost.) ﬁlter. (i) (i) Once G is obtained as the above solution, the objective (i) E. Processing ﬂow with estimation of λ and v function in Eq. (19) can be rewritten as This subsection describes examples of processing ﬂows in Algorithms 1 and 2, for optimizing a CBF based on source- 2 H (i) (i) (i) (i) packed factorization and source-wise factorization, including L q = q s.t. q v ˜ = 1, (54) (i) (i) estimation of the time-varying variances, λ , and the RTFs, (i) v ˜ . Hereafter, we refer to the algorithms as A-1 and A-2 for (i) where R is a variance-normalized covariance matrix of the brevity. Although A-1 simultaneously estimates all sources, z 9 WPE for wMPDR for was trained so that it receives the WPE ﬁlter’s output, which is obtained at the ﬁrst iteration in the iterative optimization of Dereverb Beamform the CBF, and estimates the TF masks of the desired signals. The network’s input was set as a concatenation of the real Es mate Es mate and imaginary parts of the STFT coefﬁcients, and the loss function was set as the (scale-dependent) signal-to-distortion ratio (SDR) of an enhanced signal obtained by multiplying the Es mate estimated masks to an observed signal. For the training and Es mate validation data, we synthesized mixtures using two utterances TF masks randomly extracted from the WSJ-CAM0 corpus [45] and two st room impulse responses and background noise extracted from For 1 me For subsequent mes the REVERB Challenge training set [18]. (i) Fig. 2. Processing ﬂow of source-wise factorization-based CBF for estimating For the estimation of the RTFs, v ˜ , we adopted a method a source i. based on eigenvalue decomposition with noise covariance (i) whitening [46], [47]. 
With this technique, steering vector v (i) is ﬁrst estimated: y for all i, from observed signal x , A-2 estimates only (i) (i) −1 one of the sources, y for a certain i, and (if necessary) is v = R MaxEig R R , (57) t i \i \i repeatedly applied to the observed signal to estimate all the where MaxEig(·) is a function that calculates the eigenvector sources one after another. TF masks are provided as auxiliary (i) corresponding to the maximum eigenvalue and R and R \i inputs for both algorithms. TF mask γ , which is associated are spatial covariance matrices of the i-th desired signal and with a source and a TF point, takes a value between 0 and the other signals estimated as: 1 and indicates whether the source’s desired signal dominates (i) (i) the TF point (γ = 1) or not (γ = 0). The TF masks (i) (i) (i) t t γ z z t t t over all the TF points are used to estimate the RTF(s) of the R = , (58) (i) desired signal(s) in line 19 of A-1 and line 7 of A-2. (See t t Section III-E1 for the estimation detail of the TF masks and (i) (i) (i) 1− γ z z the RTFs.) t t t t (i) R = . (59) \i Both algorithms estimate time-varying variances λ based (i) 1− γ on the same objective as that for the CBF, deﬁned in Eq. (19). Because no closed form solution to the estimation of the Then, the RTF is obtained by Eq. (4). CBF and the time-varying variances is known, an iterative and IV. DISCUSSION alternate optimization scheme is introduced to both algorithms. (i) In each iteration, the time-varying variances, λ , are updated In summary, our proposed techniques can optimize a CBF in line 23 of A-1 and line 11 of A-2 as the power of the for jointly performing DN+DR+SS with greatly reduced com- (i) previously estimated values of desired signal y , and then the puting cost in comparison with the direct application of the (i) CBF and desired signal y are updated while ﬁxing the time- conventional joint optimization technique proposed for DR+SS varying variances. 
The iteration is repeated until convergence to DN+DR+SS. With the conventional technique, a huge is obtained. covariance matrix Ψ must be calculated to take into account the dependency of G on Q that is inherently introduced into The optimization methods described in Sections III-B and source-packed factorization. This makes the computing cost of III-D are used in their respective algorithms to update the CBF the conventional technique extremely high. In contrast, since and the desired signal(s). The WPE ﬁlter is ﬁrst estimated in the proposed extension of the source-packed factorization lines 5 to 17 of A-1 and lines 3 to 5 of A-2, and applied in line 18 of A-1 and line 6 of A-2. After the RTF(s) is updated approach substantively reduces the size of the matrix to be using the dereverberated signals, the wMPDR beamformer is calculated from M (L− Δ) for Ψ to M(L− Δ) for R , the estimated in lines 20 and 21 of A-1 and lines 8 and 9 of A-2, computing cost can be effectively reduced. (i) and applied in line 22 of A-1 and line 10 of A-2. On the other hand, with source-wise factorization, G (i) Figure 2 also illustrates the processing ﬂow of a CBF with can be optimized independently of q , which also allows source-wise factorization for estimating a source i. us to reduce the size of the matrix to be calculated to the same as that of the proposed extension of the source-packed 1) Methods for estimating TF masks and RTFs: In our (i) factorization approach. In addition, we can skip the calculation experiments, for estimating TF masks, γ , for all i and t at each frequency, we used a Convolutional Neural Network of an additional matrix, R , and the inverse of the huge −1 that works in the TF domain and is trained using utterance- matrix, Ψ , both of which are required for the proposed level Permutation Invariant Training criterion (CNN-uPIT) extension of the source-packed factorization approach. This [43]. 
According to our preliminary experiments [32], we set further increases the computational efﬁciency of the source- the network structure as a CNN with a large receptive ﬁeld wise factorization approach. A drawback of source-wise fac- similar to one used by a fully-Convolutional Time-domain torization is that it has to handle I-times more dereverberated Audio Separation Network (Conv-TasNet) [44]. The network signals than source-packed factorization. 10 TABLE I CBFS COMPARED IN EXPERIMENTS: (1) AND (2) ARE CONVENTIONAL CASCADE CONFIGURATION APPROACHES, (5) IS A CONVENTIONAL JOINT OPTIMIZATION APPROACH, (6) AND (7) ARE PROPOSED JOINT OPTIMIZATION APPROACHES, AND (3) AND (4) ARE TEST CONDITIONS USED JUST FOR COMPARISON. (5), (6), AND (7) ARE CATEGORIZED AS “JOINTLY OPTIMAL” BECAUSE THEY ARE COMPOSED OF WPE AND WMPDR AND OPTIMIZED BASED ON INTEGRATED VARIANCE ESTIMATION (SEE FIG. 3 FOR THE DIFFERENCE BETWEEN SEPARATE AND INTEGRATED VARIANCE ESTIMATION). Name of method Jointly WPE BF Variance Category optimal estimation (1) WPE+MPDR (separate) Multiple-target MPDR Separate Cascade (conventional) (2) WPE+MVDR (separate) Multiple-target MVDR Separate Cascade (conventional) (3) WPE+wMPDR (separate) Multiple-target wMPDR Separate Test condition (4) WPE+MPDR (integrated) Single-target MPDR Integrated Test condition (5) Source-packed factorization (conventional) X Multiple-target wMPDR Integrated Jointly optimal (conventional) (6) Source-packed factorization (extended) X Multiple-target wMPDR Integrated Jointly optimal (proposed) (7) Source-wise factorization X Single-target wMPDR Integrated Jointly optimal (proposed) The source-wise factorization approach has additional ben- eﬁts w.r.t. computational efﬁciency when it is used in speciﬁc Derev BF Derev BF scenarios listed below: • The source-wise factorization approach can estimate the (a) Separate optimization (b) Integrated optimization CBF by a closed-form equation when time-varying source Fig. 3. 
Separate and integrated variance optimization schemes: While separate variances are given, or estimated, e.g., using neural net- variance optimization updates λ for Derev as the variance of Derev output, works [15], [12]. In such a case, we can skip iterative integrated variance optimization updates it as the variance of the beamformer optimization. In contrast, the source-packed factorization output. Consequently, λ for Derev is common to all the sources with separate variance optimization. approach needs to maintain iterations to alternately esti- mate Q and g due to their mutual dependency. • The source-wise factorization approach is advantageous A. Dataset and evaluation metrics when it is combined with neural network-based single target speaker extraction that has recently been actively For the evaluation, we prepared a set of noisy reverberant studied [13]. With this combination, we can skip the es- speech mixtures (REVERB-2MIX) using the REVERB Chal- timation of sources other than the target source, allowing lenge dataset (REVERB) [18]. Each utterance in REVERB us to further reduce the computing cost. contains a single reverberant speech with moderate stationary diffuse noise. For generating a set of test data, we mixed two V. EXPERIMENTS utterances extracted from REVERB, one from its development This section experimentally conﬁrms the effectiveness of set (Dev set) and the other from its evaluation set (Eval set), our proposed joint optimization approaches. Table I summa- so that each pair of mixed utterances was recorded in the same rizes the optimization methods that we experimentally com- room, by the same microphone array, and under the same pared (see Sections V-C and V-D for details of the methods) condition (near or far, RealData or SimData). We categorized in the following three aspects. the test data based on the original categories of the data in REVERB (e.g., SimData or RealData). 
We created the same 1) Effectiveness of joint optimization number of mixtures in the test data as in the REVERB Eval set, We compared a CBF with and without joint optimiza- such that each utterance in the REVERB Eval set is contained tion in terms of estimation accuracy. The source-wise in one of the mixtures in the test data. Furthermore, the length factorization approach (Table I (7)) is compared with of each mixture in the test data was set at the same as that of the conventional cascade conﬁguration (Table I (1) and the corresponding utterance in the REVERB Eval set, and the (2)), and two additional test conditions (Table I (3) and utterance from the Dev set was trimmed or zero-padded at its (4)). end to be the same length as that of Eval set. 2) Comparison among joint optimization approaches For the experiments in Section V-E, we also prepared a We compared three joint optimization approaches, i.e., set of noisy reverberant speech mixtures, each of which is the source-packed factorization approach with its con- composed of three speaker utterances (REVERB-3MIX). We ventional setting (Table I (5)) and its proposed extension created REVERB-3MIX by adding one utterance extracted (Table I (6)), and the source-wise factorization approach from REVERB Dev set to each mixture in REVERB-2MIX. (Table I (7)), respectively described in Sections III-B1, Only RealData (i.e., real recordings of reverberant data) was III-B2, and III-D, in terms of computational efﬁciency created for REVERB-3MIX. and estimation accuracy. 3) Evaluation using oracle masks In the experiments, we respectively estimated two or three We used oracle masks instead of estimated masks for speech signals from each mixture for REVERB-2MIX and evaluating a CBF to test the performance of a CBF using REVERB-3MIX and evaluated only one of them correspond- different types of masks and also to obtain its top-line ing to the REVERB Eval set using the baseline evaluation tools performance. provided for it. 
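The length-matching rule used to build REVERB-2MIX (trim or zero-pad the Dev utterance at its end, then add it to the Eval utterance) can be sketched as follows. This is a minimal illustration for one microphone channel, not the authors' data preparation script: the function names `fit_length` and `mix_pair` are hypothetical, and the text does not state any level scaling between the two utterances, so none is applied here.

```python
from typing import List, Sequence


def fit_length(signal: Sequence[float], target_len: int) -> List[float]:
    """Trim or zero-pad `signal` at its end to exactly `target_len` samples."""
    trimmed = list(signal[:target_len])
    # The padding count is zero when the signal was long enough or was trimmed.
    return trimmed + [0.0] * (target_len - len(trimmed))


def mix_pair(eval_utt: Sequence[float], dev_utt: Sequence[float]) -> List[float]:
    """Mix one channel of an Eval utterance with a length-matched Dev utterance.

    The mixture keeps the length of the Eval utterance, as described in the text.
    Assumption: plain sample-wise addition with no gain adjustment.
    """
    dev_fitted = fit_length(dev_utt, len(eval_utt))
    return [e + d for e, d in zip(eval_utt, dev_fitted)]


if __name__ == "__main__":
    # Short Dev utterance: zero-padded at the end before mixing.
    print(mix_pair([1.0, 1.0, 1.0], [0.5]))
    # Long Dev utterance: trimmed at the end before mixing.
    print(mix_pair([1.0, 1.0], [0.5, 0.5, 0.5]))
```

For multichannel recordings the same operation would be applied to each microphone channel separately, so that the spatial cues of both utterances are preserved.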
We selected the signal to be evaluated from all the estimated speech signals based on the correlation between the separated signals and the original signal in the REVERB Eval set. As objective measures for speech enhancement [48], we used the Cepstrum Distance (CD), the Frequency-Weighted Segmental SNR (FWSSNR), the Perceptual Evaluation of Speech Quality (PESQ), and the Short-Time Objective Intelligibility measure (STOI) [49]. To evaluate the ASR performance, we used a baseline ASR system for REVERB that was recently developed using Kaldi [50]. This system is composed of a Time-Delay Neural Network (TDNN) acoustic model trained using lattice-free maximum mutual information (LF-MMI) and online i-vector extraction, and a trigram language model. They were trained on the REVERB training set.

B. CBF configurations

Table II summarizes two configurations of the CBF examined in experiments including the number of microphones $M$, the filter length $L$, and the number of optimization iterations. The sampling frequency was 16 kHz. A Hann window was used for a short-time analysis where the frame length and shift were set at 32 and 8 ms. The prediction delay was set at $\Delta = 4$ for the WPE filter.

In the iterative optimization, the time-varying variances of the sources were initialized as those of the observed signal for the WPE filter and as 1 for the wMPDR beamformer for all the methods.

TABLE II. Beamformer configurations used in experiments.

Configuration | M | L at 0.0-0.8 kHz | L at 0.8-1.5 kHz | L at 1.5-8.0 kHz | #Iterations
Config-1 | 8 | 20 | 16 | 8 | 10
Config-2 | 4 | 20 | 16 | 8 | 10

C. Experiment-1: effectiveness of joint optimization

In this experiment, we evaluated the effectiveness of the joint optimization focusing on its two characteristics. First, we compared three different filter combinations: a WPE filter followed by a wMPDR beamformer (WPE+wMPDR), a WPE filter followed by an MPDR beamformer (WPE+MPDR), and a WPE filter followed by an MVDR beamformer (WPE+MVDR). The first combination is required for jointly optimal processing, and the others have been used for the conventional cascade configuration. Second, we compared two different variance optimization schemes shown in Fig. 3: "separate" and "integrated." With the separate variance optimization, the iterative estimation of the time-varying variance was performed separately for the WPE filter and for the beamformer. This is the scheme used by the conventional cascade configuration. In contrast, with the integrated variance optimization, the iterative estimation was performed jointly for the WPE filter and the beamformer. A significant difference between the two schemes is whether the WPE filter uses the same variances for all the sources or different variances dependent on the sources estimated by the beamformer.

Table III compares WERs, CDs, FWSSNRs, PESQs, and STOIs obtained after five estimation iterations using three beamformers (MPDR, MVDR, and wMPDR), two conventional cascade configuration approaches ((1) WPE+MPDR and (2) WPE+MVDR), two test conditions ((3) and (4)), and a proposed joint optimization approach ((7) source-wise factorization). All methods used configuration Config-1 in Table II. Table III shows that 1) WPE+MPDR, WPE+MVDR, and WPE+wMPDR greatly outperformed MPDR, MVDR, and wMPDR, respectively, with all the conditions, and 2) the joint optimization approach, i.e., (7) source-wise factorization, substantially outperformed all the other methods in terms of all the measures except for a case in terms of STOI where WPE+wMPDR (separate) gave a slightly better score than (7) source-wise factorization. Furthermore, Fig. 4 shows the convergence curves of the two cascade configuration approaches, two test conditions, and the joint optimization approach. The source-wise factorization performance (7) was the best of all and improved as the number of iterations increased. The second best was (3) WPE+wMPDR (separate). The other methods did not improve the scores after the first iteration with both the integrated and separate variance optimization schemes.

TABLE III. WER (%) for RealData and CD (dB), FWSSNR (dB), PESQ, and STOI for SimData in REVERB-2MIX obtained using different beamformers after five estimation iterations with Config-1. Scores for REVERB-2MIX and REVERB (i.e., single speaker) without enhancement (No Enh) are also shown.

Enhancement method | WER | CD | FWSSNR | PESQ | STOI
No Enh (REVERB-2MIX) | 62.49 | 5.44 | 1.12 | 1.12 | 0.55
No Enh (REVERB) | 18.61 | 3.97 | 3.62 | 1.48 | 0.75
MPDR (w/o iteration) | 30.79 | 4.40 | 3.07 | 1.45 | 0.73
MVDR (w/o iteration) | 30.89 | 4.43 | 3.00 | 1.44 | 0.73
wMPDR | 28.75 | 3.96 | 4.46 | 1.60 | 0.75
(1) WPE+MPDR (separate) | 23.04 | 4.30 | 3.77 | 1.58 | 0.77
(2) WPE+MVDR (separate) | 23.34 | 4.34 | 3.66 | 1.57 | 0.76
(3) WPE+wMPDR (separate) | 21.53 | 3.74 | 5.42 | 1.77 | 0.82
(4) WPE+MPDR (integrated) | 23.22 | 4.28 | 3.66 | 1.56 | 0.76
(7) Source-wise factorization | 20.03 | 3.67 | 5.57 | 1.80 | 0.81

[Fig. 4. Comparison among joint optimization and cascade configuration approaches when using WPE+MPDR and WPE+wMPDR with integrated and separate optimization schemes using Config-1 for REVERB-2MIX. Four panels plot WER (%), CD (dB), FWSSNR (dB), and PESQ against #iterations for methods (1), (2), (3), (4), and (7).]

Figure 5 shows a spectrogram of a noisy reverberant mixture in RealData of REVERB-2MIX, and spectrograms of enhanced signals obtained using MVDR, WPE+MVDR, and CBF with source-wise factorization. The figure shows that all the enhancement methods were effective and the CBF with source-wise factorization was the best of all for achieving denoising, dereverberation, and source separation.

[Fig. 5. Spectrogram of (a) a noisy reverberant mixture in RealData of REVERB-2MIX and spectrograms of enhanced signals obtained by (b) MVDR, (c) WPE+MVDR and (d) CBF with source-wise factorization. Mixture is composed of two female speakers under far conditions. Axes: Time (s) and Frequency (kHz).]

The above results clearly show that the two characteristics of the joint optimization approach, i.e., 1) the optimal combination of a WPE filter and a wMPDR beamformer, and 2) the integrated variance optimization, are both critical for achieving optimal performance.

D. Experiment-2: Comparison among joint optimization approaches

In this experiment, we compared three joint optimization approaches, denoted as (5) Source-packed factorization (conventional), (6) Source-packed factorization (extended), and (7) Source-wise factorization. (5) Source-packed factorization (conventional) corresponds to the conventional joint optimization technique described in Section III-B1, and (6) Source-packed factorization (extended) and (7) Source-wise factorization correspond to our proposed methods respectively described in Sections III-B2 and III-D.

Figure 6 compares the WERs obtained using the three approaches with Config-1 and Config-2. Our proposed methods, i.e., (6) Source-packed factorization (extended) and (7) Source-wise factorization, performed comparably well and both greatly outperformed (5) Source-packed factorization (conventional).

[Fig. 6. WERs (%) obtained for REVERB-2MIX when jointly optimizing WPE+wMPDR based on source-packed factorization (conventional/extended) and source-wise factorization approaches. Panels (a) Config-1 and (b) Config-2 plot WER against #iterations for methods (5), (6), and (7).]

TABLE V. WER (%) for RealData and CD (dB), FWSSNR (dB), PESQ, and STOI for SimData in REVERB-2MIX of enhanced signals obtained based on oracle masks using different beamformers after three estimation iterations with Config-1. Scores for REVERB-2MIX with no enhancement (No Enh) and those obtained by applying a wMPDR CBF, WPD [30], to REVERB (i.e., single speaker), are also shown.

Enhancement method | WER | CD | FWSSNR | PESQ | STOI
No Enh (REVERB-2MIX) | 62.49 | 5.44 | 1.12 | 1.12 | 0.55
WPD (REVERB) [30] | 8.91 | 2.59 | 8.29 | 2.41 | 0.91
MPDR (w/o iteration) | 20.16 | 3.53 | 5.49 | 1.86 | 0.84
MVDR (w/o iteration) | 20.32 | 3.56 | 5.36 | 1.84 | 0.83
wMPDR | 20.12 | 3.31 | 6.11 | 1.96 | 0.86
(1) WPE+MPDR (separate) | 12.89 | 3.39 | 6.11 | 2.10 | 0.87
(2) WPE+MVDR (separate) | 12.91 | 3.32 | 6.30 | 2.07 | 0.87
(3) WPE+wMPDR (separate) | 12.59 | 3.12 | 6.84 | 2.21 | 0.89
(6) Source-packed fact. | 12.23 | 3.02 | 7.15 | 2.33 | 0.90
(7) Source-wise fact. | 12.23 | 2.98 | 7.25 | 2.32 | 0.90

Table IV compares the computing times required for the three approaches to estimate and apply the CBFs with ten estimation iterations for processing a mixture utterance whose length is 9.44 s. The computing time was measured by a Matlab interpreter as elapsed time. The computing times for estimating the masks were 0.63 s and 7.2 s with and without a GPU (NVIDIA 2080ti), and they are not included in the table. As shown in the table, for both configurations, (6) Source-packed factorization (extended) greatly reduced

in REVERB-2MIX using signal components in the observed signals. In contrast, we can only calculate the oracle masks

TABLE IV. Computing time required for processing a mixture utterance of length of 9.44 s in REVERB-2MIX.
COMPUTING TIME WAS MEASURED BY ELAPSED TIME ON A MATLAB INTERPRETER. approximately for RealData because we cannot access the signal components. Thus, we ﬁrst estimated the desired signals Method Time (s) by applying dereverberation and denoising to utterances in Conﬁg-1 Conﬁg-2 REVERB, and then calculated the oracle masks using the (4) Source-packed factorization (conventional) 3467 688 (5) Source-packed factorization (extended) 209 33 estimated desired signals for REVERB-2MIX and REVERB- (6) Source-wise factorization 40 23 3MIX. Table V shows WERs, CDs, FWSSNRs, PESQs, and STOIs measured on enhanced signals obtained from REVERB-2MIX the computing time in comparison with (5) Source-packed using various (non-convolutional) beamformers and CBFs factorization (conventional), and (7) Source-wise factorization after three estimation iterations. As a reference, the table further reduced the computing time. also includes previously reported scores denoted by WPD The above results clearly demonstrate the superiority of the (REVERB) [30], which were obtained by applying a wMPDR two proposed approaches over the conventional joint optimiza- CBF, referred to as WPD (see also Section III-C in this paper), tion technique in terms of both computational efﬁciency and to REVERB, i.e., noisy reverberant single speaker utterances. estimation accuracy. However, Table IV indicates that the pro- In addition, the convergence curves obtained using the CBFs posed approaches still require relatively large computing cost, in terms of WERs for REVERB-2MIX and REVERB-3MIX, e.g., 40 s computing time for processing a 9.44 s utterance and those obtained in terms of CDs, FWSSNRs, PESQs, with Conﬁg-1, to obtain the high performance gain shown and STOIs for REVERB-2MIX are respectively shown in in Fig. 6 (a). Future work must address this problem. For Figs. 7 and 8. 
In all these results, the two joint optimization example, it might be mitigated by setting the goal as extraction approaches, (6) source-packed factorization (extended) and (7) of a single target source. Then, due to the characteristics source-wise factorization, outperformed all the other methods of source-wise factorization, we can omit the estimation of in terms of every measurement. As a whole, almost the same the other sources, and omit the iterative estimation, e.g., tendency was observed in the cases using the estimated masks. when we separately estimate source variances using a neural One exception is that the WERs obtained with the source-wise network. As a reference, the computing time (40 s) in Table factorization tended to increase after a few iterations although III required for the source-wise factorization with Conﬁg-1 is such a tendency was not observed in terms of signal distortion roughly reduced to 2.0 s for one iteration per source (namely measures. This means that improvement in the signal level 40 s/10/2), which results in the real-time factor being 0.21 distortion does not necessarily result in improvement in WER, (= 2.0 s/9.44 s). and suggests the importance of optimization by ASR level criteria, similar to conventional beamforming techniques [51], E. Experiment-3: Evaluation using oracle masks [52]. In this experiment, we examined the performance of CBFs using a different type of masks, i.e., oracle masks. An oracle VI. CONCLUDING REMARKS mask, which is the power ratio of the desired signal to the observed signal at each TF point, is calculated using reference This paper presented methods for optimizing a CBF that signals. Oracle masks can be precisely calculated for SimData performs DN+DR+SS based on ML estimation. 
We introduced WER (%) WER (%) 14 (1) WPE+MPDR (separate) (6) Source-packed (extended) the source-packed factorization approach, and into a set of (2) WPE+MVDR (separate) (7) Source-wise factorization single-target WPE ﬁlters followed by wMPDR beamformers (3) WPE+wMPDR (separate) using the source-wise factorization approach. This paper also presented the overall processing ﬂows for both approaches 13.5 based on an assumption that TF masks are provided as auxil- iary inputs. In the ﬂows, the time varying source variances, which are required for ML estimation, can be optimally estimated jointly with the CBF using iterative optimization; the steering vectors of the desired signals, which are required for beamformer optimization, can be reliably estimated based 12.5 on the dereverberated multichannel signals obtained at an optimization step. Experiments using noisy reverberant sound mixtures show that the proposed optimization approaches substantially im- 2 4 6 8 10 2 4 6 8 10 proved the CBF performance in comparison with the conven- #iterations #iterations tional cascade conﬁguration in terms of ASR performance (a) REVERB-2MIX (b) REVERB-3MIX and signal distortion reduction. Our proposed approaches Fig. 7. Comparison of WERs among cascade conﬁguration and joint can also greatly reduce the computing cost with improved optimization approaches using Conﬁg-1 for REVERB-2MIX and REVERB- estimation accuracy in comparison with the conventional joint 3MIX. optimization technique. The proposed approaches, however, (1) WPE+MPDR (separate) (6) Source-packed (extended) still result in relatively large computing costs to obtain high (2) WPE+MVDR (separate) (7) Source-wise factorization performance gain. Future work will address this problem. (3) WPE+wMPDR (separate) 7.5 APPENDIX A 3.4 DERIVATION OF EQS. (43) AND (44) We can rewrite Ψ in Eq. (38) using Eq. (36): 3.3 Ψ = X Φ X , (60) t q,t t 3.2 XX H H 1 1 6.5 3.1 (i) (i) = q X q X . (61) t t (i) t i t (i) Using Eq. 
(33), q X can further be rewritten: 2.9 H H 2 4 6 8 10 2 4 6 8 10 (i) (i) T q X = q I ⊗ x , (62) t M #iterations #iterations 0.91 (i) T 2.35 = q ⊗ x . (63) 0.9 2.3 Substituting the above equation in Eq. (61) yields 2.25 XX H 1 1 ⊤ (i) H (i) T 0.89 Ψ = q ⊗ x q ⊗ x , t t (i) 2.2 t i t 0.88 (64) 2.15 XX 1 1 ⊤ 2.1 (i) (i) H = x x , (65) q q ⊗ 0.87 t (i) 2.05 t i t X ⊤ (i) 2 0.86 (i) (i) = q q ⊗ R . (66) 2 4 6 8 10 2 4 6 8 10 #iterations #iterations Similarly, we can obtain Fig. 8. Comparison of CDs, FWSSNRs, PESQs, and STOIS among cascade 1 H conﬁguration and joint optimization approaches using Conﬁg-1 for REVERB- ψ = X Φ x , (67) q t 2MIX. XX 1 1 ⊤ (i) H (i) = q ⊗ x q x , (68) (i) t i t two different approaches for factorizing a CBF, i.e., source- XX 1 1 T (i) H (i) packed and source-wise factorization approaches, and derived = q ⊗ x x q , (69) (i) optimization algorithms for the respective approaches. A CBF t i t can be factorized without loss of optimality into a multiple- (i) (i) (i) = q ⊗ P q . (70) target WPE ﬁlter followed by wMPDR beamformers using PESQ WER (%) CD (dB) STOI FWSSNR (dB) WER (%) 15 REFERENCES [23] S. Braun and E. A. P. Habets, “Linear prediction based online dereverberation and noise reduction using alternating Kalman ﬁlters,” IEEE/ACM trans. on Audio, Speech, and Language Processing, vol. 26, [1] B. D. V. Veen and K. M. Buckley, “Beamforming: A versatile approach no. 6, pp. 1119–1129, 2018. to spatial ﬁltering,” IEEE ASSP Magazine, vol. 5, no. 2, pp. 4–24, 1988. [24] T. Dietzen, S. Doclo, M. Moonen, and T. van Waterschoot, “Joint multi- [2] H. L. V. Trees, Optimum Array Processing, Part IV of Detection, microphone speech dereverberation and noise reduction using integrated Estimation, and Modulation Theory. New York: Wiley-Interscience, sidelobe cancellation and linear prediction,” in Proc. IWAENC, 2018. [25] T. Yoshioka, T. Nakatani, M. Miyoshi, and H. G. Okuno, “Blind [3] H. 
Cox, “Resolving power and sensitivity to mismatch of optimum array separation and dereverberation of speech mixtures by joint optimization,” processors,” The Journal of the Acoustical Society of America, vol. 54, IEEE Trans. on Audio, Speech, and Language Processing, vol. 19, no. 1, pp. 771–785, 1973. January 2011. [4] M. Souden, J. Benesty, and S. Affes, “On optimal frequency-domain [26] N. Ito, S. Araki, T. Yoshioka, and T. Nakatani, “Relaxed disjointness multichannel linear ﬁltering for noise reduction,” IEEE Trans. Audio, based clustering for joint blind source separation and dereverberation,” Speech, and Language Processing, vol. 18, no. 2, pp. 260–276, 2007. in Proc. IWAENC, 2014. [5] A. Hyva¨rinen, J. Karhunen, and E. Oja, Independent Component Anal- [27] H. Kagami, H. Kameoka, and M. Yukawa, “Joint separation and dere- ysis. New York: John Wiley & Sons, 2001. verberation of reverberant mixtures with determined multichannel non- [6] T. Kim, H. T. Attias, S.-Y. Lee, and T.-W. Lee, “Blind source separa- negative matrix factorization,” in Proc. IEEE ICASSP, 2018, pp. 31–35. tion exploiting higher-order frequency dependencies,” IEEE Trans. on [28] T. Nakatani, R. Ikeshita, K. Kinoshita, H. Sawada, and S. Araki, “Com- Speech, and Audio Processing, vol. 15, no. 1, pp. 70–79, 2006. putationally efﬁcient and versatile framework for joint optimization of [7] M. Souden, S. Araki, K. Kinoshita, T. Nakatani, and H. Sawada, “A blind speech separation and dereverberation,” in Proc. Interspeech, 2020. multichannel MMSE-based framework for speech source separation [29] Z. Koldovsky and P. Tichavsky´, “Gradient algorithms for complex non- and noise reduction,” IEEE Trans. on Audio, Speech, and Language Gaussian independent component/vector extraction, question of conver- Processing, vol. 21, no. 9, pp. 1913–1928, 2010. gence,” IEEE Trans. on Signal Processing, vol. 67, no. 4, pp. 1050–1064, [8] T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, and B.-H. 
Juang, “Speech dereverberation based on variance-normalized delayed linear [30] T. Nakatani and K. Kinoshita, “Maximum likelihood convolutional prediction,” IEEE trans. on Audio, Speech, and Language Processing, beamformer for simultaneous denoising and dereverberation,” in Proc. vol. 18, no. 7, pp. 1717–1731, 2010. EUSIPCO, 2019. [9] T. Yoshioka and T. Nakatani, “Generalization of multi-channel linear [31] C. Boeddeker, T. Nakatani, K. Kinoshita, and R. Haeb-Umbach, “Jointly prediction methods for blind MIMO impulse response shortening,” IEEE optimal dereverberation and beamforming,” in Proc. ICASSP, 2020, pp. trans. on Audio, Speech and Language Processing, vol. 20, no. 10, pp. 216–220. 2707–2720, 2012. [32] T. Nakatani, R. Takahashi, T. Ochiai, K. Kinoshita, R. Ikeshita, [10] A. Jukic´, T. van Waterschoot, T. Gerkmann, and S. Doclo, “Multi- M. Declroix, and S. Araki, “DNN-supported mask-based convolutional channel linear prediction-based speech dereverberation with sparse pri- beamforming for simultaneous denoising, dereverberation, and source ors,” IEEE/ACM trans. on Audio, Speech and Language Processing, separation,” in Proc. IEEE ICASSP, 2020. vol. 23, no. 9, pp. 1509–1520, 2015. [33] J. S. Bradley, H. Sato, and M. Picard, “On the importance of early [11] J. Heymann, L. Drude, and R. Haeb-Umbach, “Neural network based reﬂections for speech in rooms,” The Journal of the Acoustic Sociaty of spectral mask estimation for acoustic beamforming,” in Proc. IEEE America, vol. 113, pp. 3233–3244, 2003. ICASSP, 2016, pp. 196–200. [34] T. Nishiura, Y. Hirano, Y. Denda, and M. Nakayama, “Investigations into [12] K. Kinoshita, M. Delcroix, H. Kwon, T. Mori, and T. Nakatani, “Neural early and late reﬂections on distant-talking speech recognition toward network-based spectrum estimation for online wpe dereverberation,” in suitable reverberation criteria,” in Proc. Interspeech, 2007, pp. 1082– Proc. Interspeech, 2017, pp. 384–388. [13] K. Zmol´ıkova´, M. Delcroix, K. 
Kinoshita, T. Ochiai, T. Nakatani, [35] Y. Avargel and I. Cohen, “On multiplicative transfer function approxima- L. Burget, and J. Cernocky´, “SpeakerBeam: Speaker aware neural tion in the short-time fourier transform domain,” IEEE Signal Processing network for target speaker extraction in speech mixtures,” IEEE Journal Letters, vol. 14, pp. 337–340, 2007. of Selected Topics in Signal Processing, vol. 13, no. 4, pp. 800–814, [36] I. Cohen, “Relative transfer function identiﬁcation using speech signals,” IEEE Trans. on Speech, and Audio Processing, vol. 12, no. 5, pp. 451– [14] T. Yoshioka, H. Erdogan, Z. Chen, X. Xiao, , and F. Alleva, “Recogniz- 459, 2004. ing overlapped speech in meetings: A multichannel separation approach [37] T. Nakatani, T. Yoshioka, K. Kinoshita, M. Miyoshi, and B. H. Juang, using neural networks,” in Proc. Interspeech, 2018. “Blind speech dereverberation with multi-channel linear prediction based [15] Y. Xu, J. Du, L.-R. Dai, and C.-H. Lee, “A regression approach to on short time Fourier transform representation,” in Proc. IEEE ICASSP, speech enhancement based on deep neural networks,” IEEE/ACM trans. 2008, pp. 85–88. on Audio, Speech, and Language Processing, vol. 23, no. 1, 2015. [38] T. Hori, S. Araki, T. Yoshioka, M. Fujimoto, S. Watanabe, T. Oba, [16] J. R. Hershey, Z. Chen, J. L. Roux, and S. Watanabe, “Deep clustering: A. Ogawa, K. Otsuka, D. Mikami, K. Kinoshita, T. Nakatani, A. Naka- Discriminative embeddings for segmentation and separation,” in Proc. mura, and J. Yamato, “Low-latency real-time meeting recognition and IEEE ICASSP, 2016, pp. 31–35. understanding using distant microphones and omni-directional camera,” [17] M. Kolbæk, D. Yu, Z.-H. Tan, and J. Jensen, “Multitalker speech IEEE Trans. on Audio, Speech, and Language Processing, vol. 20, no. 2, separation with utterance-level permutation invariant training of deep pp. 499–513, 2011. recurrent neural networks,” IEEE Trans. Audio, Speech, and Language [39] R. Ikeshita, N. 
Ito, T. Nakatani, and H. Sawada, “Independent low-rank Processing, pp. 1901–1913, 2017. matrix analysis with decorrelation learning,” in IEEE WASPAA, 2019. [18] K. Kinoshita, M. Delcroix, S. Gannot, E. A. P. Habets, R. Haeb-Umbach, [40] T. Nakatani and K. Kinoshita, “Simultaneous denoising and dereverber- W. Kellermann, V. Leutnant, R. Maas, T. Nakatani, B. Raj, A. Sehr, and ation for low-latency applications using frame-by-frame online uniﬁed T. Yoshioka, “A summary of the REVERB challenge: State-of-the-art convolutional beamformer,” in Proc. Interspeech, 2019. and remaining challenges in reverberant speech processing research,” [41] B. J. Cho, J. Lee, and H. Park, “A beamforming algorithm based on EURASIP Journal on Advances in Signal Processing, 2016. maximum likelihood of a complex Gaussian distribution with time- [19] J. Barker, R. Marxer, E. Vincent, and S. Watanabe, “The third ‘CHiME’ varying variances for robust speech recognition,” IEEE Signal Process- speech separation and recognition challenge: Dataset, task and base- ing Letters, vol. 26, no. 9, pp. 1398–1402, August 2019. lines,” in Proc. IEEE ASRU-2015, 2015, pp. 504–511. [42] T. Nakatani and K. Kinoshita, “A uniﬁed convolutional beamformer for [20] N. Kanda, C. Boeddeker, J. Heitkaemper, Y. Fujita, S. Horiguchi, simultaneous denoising and dereverberation,” IEEE Signal Processing K. Nagamatsu, and R. Haeb-Umbach, “Guided source separation meets Letters, vol. 26, no. 6, pp. 903–907, April 2019. a strong asr backend: Hitachi/Paderborn university joint investigation for [43] F. Bahmaninezhad, J. Wu, R. Gu, S.-X. Zhang, Y. Xu, M. Yu, and D. Yu, dinner party ASR,” in Proc. Interspeech, 2019. “A comprehensive study of speech separation: spectrogram vs waveform [21] R. Haeb-Umbach, S. Watanabe, T. Nakatani, M. Bacchiani, B. Hoffmeis- separation,” in Interspeech, 2019. ter, M. Seltzer, H. Zen, and M. Souden, “Speech processing for digital [44] Y. Luo and N. 
Mesgarani, “Conv-TasNet: Surpassing ideal time- home assistants,” IEEE Signal Processing Magazine, 2019. frequency magnitude masking for speech separation,” IEEE/ACM Trans. [22] M. Togami, “Multichannel online speech dereverberation under noisy on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256– environments,” in Proc. EUSIPCO, 2015, pp. 1078–1082. 1266, 2019. 16 [45] T. Robinson, J. Fransen, D. Pye, J. Foote, and S. Renals, “WSJCAMO: A British English speech corpus for large vocabulary continuous speech recognition,” in Proc. IEEE ICASSP, 1995, pp. 81–84. [46] N. Ito, S. Araki, M. Delcroix, and T. Nakatani, “Probabilistic spatial dictionary based online adaptive beamforming for meeting recognition in noisy and reverberant environments,” in Proc. IEEE ICASSP, 2017, pp. 681–685. [47] S. Markovich-Golan, S. Gannot, and I. Cohen, “Multichannel eigenspace beamforming in a reverberant noisy environment with multiple interfer- ing speech signals,” IEEE Trans. ASLP, vol. 17, no. 6, pp. 1071–1086, [48] Y. Hu and P. C. Loizou, “Evaluation of objective quality measures for speech enhancement,” IEEE Tran. Audio, Speech, and Language Processing, vol. 16, no. 1, pp. 229–238, 2008. [49] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “An algorithm for intelligibility prediction of timefrequency weighted noisy speech,” IEEE Trans. Audio, Speech, and Language Processing, vol. 19, no. 7, [50] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stem- mer, and K. Vesely, “The Kaldi speech recognition toolkit,” in Proc. IEEE ASRU, 2011. [51] J. Heymann, L. Drude, C. Boeddeker, P. Hanebrink, and R. Haeb- Umbach, “Eamnet: End-to-end training of a beamformer-supported multi-channel ASR system,” in Proc. IEEE ICASSP, 2017. [52] A. S. Subramanian, X. Wang, M. K. Baskar, S. Watanabe, T. Taniguchi, D. Tran, and Y. 
Fujita, “Speech enhancement using end-to-end speech recognition objectives,” in Proc. IEEE WASPAA, 2019.
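As a supplementary note to Appendix A, the Kronecker-product manipulations behind Eqs. (62)–(65) and (68)–(69) can be checked numerically. The sketch below is only an illustration, not part of the experimental code: it assumes the reading in which the stacked observation matrix equals I_M ⊗ x̄ᵀ, uses arbitrary sizes (M = 3 microphones, a length-4 vector x̄), and relies on small hypothetical helpers (`kron`, `matmul`, `herm`, and so on) written in plain Python so that no external library is required.

```python
import random

# Hypothetical helper functions on matrices stored as lists of rows.
def kron(A, B):
    """Kronecker product A (x) B."""
    return [[a * b for a in row_a for b in row_b]
            for row_a in A for row_b in B]

def matmul(A, B):
    """Ordinary matrix product A B."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]

def transpose(A):
    return [list(col) for col in zip(*A)]

def conj(A):
    return [[complex(v).conjugate() for v in row] for row in A]

def herm(A):
    """Conjugate (Hermitian) transpose."""
    return conj(transpose(A))

def max_abs_diff(A, B):
    return max(abs(a - b) for ra, rb in zip(A, B) for a, b in zip(ra, rb))

def rand_col(n, rng):
    """Random complex column vector of length n."""
    return [[complex(rng.gauss(0, 1), rng.gauss(0, 1))] for _ in range(n)]

rng = random.Random(0)
M, K = 3, 4                # assumed sizes, chosen only for illustration
q = rand_col(M, rng)       # stands in for a beamformer q^(i)
x = rand_col(M, rng)       # stands in for the current observation x_t
xb = rand_col(K, rng)      # stands in for the stacked past observations
I_M = [[1.0 if i == j else 0.0 for j in range(M)] for i in range(M)]

# Eqs. (62)-(63): q^H (I_M (x) xb^T) equals q^H (x) xb^T.
row = kron(herm(q), transpose(xb))
err_62_63 = max_abs_diff(matmul(herm(q), kron(I_M, transpose(xb))), row)

# Eqs. (64)-(65): (q^H (x) xb^T)^H (q^H (x) xb^T)
# equals (q q^H) (x) (xb xb^H)^T.
lhs_65 = matmul(herm(row), row)
rhs_65 = kron(matmul(q, herm(q)), transpose(matmul(xb, herm(xb))))
err_65 = max_abs_diff(lhs_65, rhs_65)

# Eqs. (68)-(69): (q^H (x) xb^T)^H (q^H x) equals q (x) (xb x^H q)^*.
lhs_69 = matmul(herm(row), matmul(herm(q), x))
rhs_69 = kron(q, conj(matmul(matmul(xb, herm(x)), q)))
err_69 = max_abs_diff(lhs_69, rhs_69)

print(err_62_63, err_65, err_69)
```

All three printed deviations should be at the level of floating-point rounding. The practical point of the factorized form is that only M × M and K × K blocks are needed, so the full MK × MK products in Eq. (60) never have to be formed for each frame.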
Electrical Engineering and Systems Science – arXiv (Cornell University)
Published: May 20, 2020