J. Barker, Shinji Watanabe, E. Vincent, J. Trmal (2018)
The fifth 'CHiME' Speech Separation and Recognition Challenge: Dataset, task and baselines.
Morten Kolbæk, Z. Tan, J. Jensen (2018)
Monaural Speech Enhancement Using Deep Neural Networks by Maximizing a Short-Time Objective Intelligibility Measure. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
Wei Han, Xiongwei Zhang, Gang Min, Xingyu Zhou, Wei Zhang (2016)
Perceptual weighting deep neural networks for single-channel speech enhancement. 2016 12th World Congress on Intelligent Control and Automation (WCICA).
Taffeta Elliott, F. Theunissen (2009)
The Modulation Transfer Function for Speech Intelligibility. PLoS Computational Biology, 5.
R. Drullman, J. Festen, R. Plomp (1994)
Effect of temporal envelope smearing on speech reception. The Journal of the Acoustical Society of America, 95(2).
D. Griffin, Jae Lim (1983)
Signal estimation from modified short-time Fourier transform.
R. McAulay, M. Malpass (1980)
Speech enhancement using a soft-decision noise suppression filter. IEEE Transactions on Acoustics, Speech, and Signal Processing, 28.
Deliang Wang, Jitong Chen (2017)
Supervised Speech Separation Based on Deep Learning: An Overview. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26.
Soren Jorgensen, Jens Cubick, T. Dau (2015)
Speech Intelligibility Evaluation for Mobile Phones. Acta Acustica United With Acustica, 101.
Minje Kim, P. Smaragdis (2018)
Bitwise Neural Networks for Efficient Single-Channel Source Separation. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
Yan Zhao, Buye Xu, Ritwik Giri, Zhang Tao (2018)
Perceptually Guided Speech Enhancement Using Deep Neural Networks. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
Hui Zhang, Xueliang Zhang, Guanglai Gao (2018)
Training Supervised Speech Separation System to Improve STOI and PESQ Directly. 2018 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
E. Healy, Masood Delfarah, J. Vasko, Brittney Carter, Deliang Wang (2017)
An algorithm to increase intelligibility for hearing-impaired listeners in the presence of a competing talker. The Journal of the Acoustical Society of America, 141(6).
E. Healy, Sarah Yoho, Jitong Chen, Yuxuan Wang, Deliang Wang (2015)
An algorithm to increase speech intelligibility for hearing-impaired listeners in novel segments of the same noise type. The Journal of the Acoustical Society of America, 138(3).
Leo Lightburn, M. Brookes (2015)
SOBM - a binary mask for noisy speech that optimises an objective intelligibility metric. 2015 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
P. Shivakumar, P. Georgiou (2016)
Perception Optimized Deep Denoising AutoEncoders for Speech Enhancement.
Jitong Chen, Yuxuan Wang, Sarah Yoho, Deliang Wang, E. Healy (2016)
Large-scale training to increase speech intelligibility for hearing-impaired listeners in novel noises. The Journal of the Acoustical Society of America, 139(5).
S. Kay (1993)
Fundamentals of statistical signal processing: estimation theory. Technometrics, 37.
Szu-Wei Fu, Tao-Wei Wang, Yu Tsao, Xugang Lu, H. Kawai (2017)
End-to-End Waveform Utterance Enhancement for Direct Evaluation Metrics Optimization by Fully Convolutional Neural Networks. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 26.
Deliang Wang (2017)
Deep learning reinvents the hearing aid. IEEE Spectrum, 54.
C. Taal, R. Hendriks, R. Heusdens, J. Jensen (2011)
An Algorithm for Intelligibility Prediction of Time–Frequency Weighted Noisy Speech. IEEE Transactions on Audio, Speech, and Language Processing, 19.
Yuma Koizumi, K. Niwa, Yusuke Hioka, Kazunori Kobayashi, Y. Haneda (2017)
DNN-based source enhancement self-optimized by reinforcement learning using sound quality measurements. 2017 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
J. Jensen, C. Taal (2016)
An Algorithm for Predicting the Intelligibility of Speech Masked by Modulated Noise Maskers. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24.
Ashia Wilson, R. Roelofs, Mitchell Stern, N. Srebro, B. Recht (2017)
The Marginal Value of Adaptive Gradient Methods in Machine Learning.
R. Hendriks, Timo Gerkmann, J. Jensen (2013)
DFT-Domain Based Single-Microphone Noise Reduction for Speech Enhancement.
A. Rix, J. Beerends, M. Hollier, A. Hekstra (2001)
Perceptual evaluation of speech quality (PESQ) - a new method for speech quality assessment of telephone networks and codecs. 2001 IEEE International Conference on Acoustics, Speech, and Signal Processing. Proceedings (Cat. No.01CH37221), 2.
Y. Ephraim, D. Malah (1984)
Speech enhancement using a minimum mean-square error log-spectral amplitude estimator. IEEE Trans. Acoust. Speech Signal Process., 33.
P. Loizou (2005)
Speech enhancement based on perceptually motivated bayesian estimators of the magnitude spectrum. IEEE Transactions on Speech and Audio Processing, 13.
A. Andersen, Jan Haan, Z. Tan, J. Jensen (2016)
Predicting the Intelligibility of Noisy and Nonlinearly Processed Binaural Speech. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 24.
Kun Han, Deliang Wang (2012)
A classification based approach to speech segregation. The Journal of the Acoustical Society of America, 132(5).
Xing Hao, Guigang Zhang, Shang Ma (2016)
Deep Learning. Int. J. Semantic Comput., 10.
J. Jensen, C. Taal (2014)
Speech Intelligibility Prediction Based on Mutual Information. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22.
P. Sen, J. Singer (1993)
Large Sample Methods in Statistics: An Introduction with Applications.
Ephraim (1984)
Speech enhancement using a minimum mean square error short-time spectral amplitude estimator. IEEE Transactions on Acoustics, Speech, and Signal Processing, 32.
S. Nawab, T. Quatieri, J. Lim (1983)
Signal reconstruction from short-time Fourier transform magnitude. IEEE Transactions on Acoustics, Speech, and Signal Processing, 31.
T. Falk, V. Parsa, J. Santos, K. Arehart, O. Hazrati, R. Huber, J. Kates, S. Scollie (2015)
Objective Quality and Intelligibility Prediction for Users of Assistive Listening Devices: Advantages and limitations of existing tools. IEEE Signal Processing Magazine, 32.
Diederik Kingma, Jimmy Ba (2014)
Adam: A Method for Stochastic Optimization. CoRR, abs/1412.6980.
Adam Croom (2014)
Auditory Neuroscience: Making Sense of Sound.
A. Corrigan, Roshan Shrestha, I. Zulkipli, N. Hiroi, Yingjun Liu, Naoka Tamura, Bing Yang, Jessica Patel, Akira Funahashi, A. Donald (2013)
Supplemental Material to
(1993)
Rec. P.56: Objective measurement of active speech level.
C. Taal, R. Hendriks, R. Heusdens (2012)
Matching pursuit for channel selection in cochlear implants based on an intelligibility metric. 2012 Proceedings of the 20th European Signal Processing Conference (EUSIPCO).
Morten Kolbæk, Z. Tan, J. Jensen (2017)
Speech Intelligibility Potential of General and Specialized Deep Neural Network Based Speech Enhancement Systems. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 25.
Dong Yu, Adam Eversole, M. Seltzer, K. Yao, B. Guenter, Oleksii Kuchaiev, F. Seide, Huaming Wang, J. Droppo, Zhiheng Huang, G. Zweig, C. Rossbach, J. Currey, Bhaskar Mitra (2014)
An introduction to computational networks and the computational network toolkit (invited talk).
J. Erkelens, R. Hendriks, R. Heusdens, J. Jensen (2007)
Minimum Mean-Square Error Estimation of Discrete Fourier Coefficients With Generalized Gamma Priors. IEEE Transactions on Audio, Speech, and Language Processing, 15.
Rasool Fakoor, Xiaodong He, I. Tashev, Shuayb Zarar (2017)
Reinforcement Learning To Adapt Speech Enhancement to Instantaneous Input Signal Quality. ArXiv, abs/1711.10791.
(1993)
DARPA TIMIT Acoustic Phonetic Continuous Speech Corpus CDROM.
P. Loizou (2007)
Speech Enhancement: Theory and Practice.
J. Allen (1977)
Short term spectral analysis, synthesis, and modification by discrete Fourier transform. IEEE Transactions on Acoustics, Speech, and Signal Processing, 25.
(1993)
CSR-I (WSJ0) complete LDC93S6A.
E. Owens (1977)
Introduction to the Psychology of Hearing. Archives of Otolaryngology-head & Neck Surgery, 103.
R. Patterson, K. Robinson, J. Holdsworth, D. McKeown, C. Zhang, M. Allerhand (1992)
Complex Sounds and Auditory Images.
Hakan Erdogan, J. Hershey, Shinji Watanabe, Jonathan Roux (2017)
Deep Recurrent Networks for Separation and Recognition of Single-Channel Speech in Nonstationary Background Audio.
Stephen Boyd, L. Vandenberghe (2005)
Convex Optimization. Journal of the American Statistical Association, 100.
Jae Lim, A. Oppenheim (1979)
Enhancement and bandwidth compression of noisy speech. Proceedings of the IEEE, 67.
Risheng Xia, Junfeng Li, M. Akagi, Yonghong Yan (2012)
Evaluation of objective intelligibility prediction measures for noise-reduced signals in mandarin. 2012 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP).
Gibak Kim, Yang Lu, Y. Hu, P. Loizou (2009)
An algorithm that improves speech intelligibility in noise for normal-hearing listeners. The Journal of the Acoustical Society of America, 126(3).
K. Hornik, M. Stinchcombe, H. White (1989)
Multilayer feedforward networks are universal approximators. Neural Networks, 2.
On the Relationship Between Short-Time Objective Intelligibility and Short-Time Spectral-Amplitude Mean-Square Error for Speech Enhancement

Morten Kolbæk, Zheng-Hua Tan, Senior Member, IEEE, and Jesper Jensen

arXiv:1806.08404v2 [cs.SD] 4 Dec 2018

Abstract—The majority of deep neural network (DNN) based speech enhancement algorithms rely on the mean-square error (MSE) criterion of short-time spectral amplitudes (STSA), which has no apparent link to human perception, e.g. speech intelligibility. Short-Time Objective Intelligibility (STOI), a popular state-of-the-art speech intelligibility estimator, on the other hand, relies on linear correlation of speech temporal envelopes. This raises the question if a DNN training criterion based on envelope linear correlation (ELC) can lead to improved speech intelligibility performance of DNN based speech enhancement algorithms compared to algorithms based on the STSA-MSE criterion. In this paper we derive that, under certain general conditions, the STSA-MSE and ELC criteria are practically equivalent, and we provide empirical data to support our theoretical results. Furthermore, our experimental findings suggest that the standard STSA minimum-MSE estimator is near optimal, if the objective is to enhance noisy speech in a manner which is optimal with respect to the STOI speech intelligibility estimator.

Index Terms—Speech enhancement, Speech intelligibility, Deep neural networks, Minimum mean-square error estimator.

This research was partly funded by the Oticon Foundation.
M. Kolbæk and Z.-H. Tan are with the Department of Electronic Systems, Aalborg University, Aalborg 9220, Denmark (e-mail: mok@es.aau.dk; zt@es.aau.dk).
J. Jensen is with the Department of Electronic Systems, Aalborg University, Aalborg 9220, Denmark, and also with Oticon A/S, Smørum 2765, Denmark (e-mail: jje@es.aau.dk; jesj@oticon.com).

I. INTRODUCTION

Despite the recent success of deep neural network (DNN) based speech enhancement algorithms [1]–[5], it is yet unknown if these algorithms are optimal in terms of aspects related to human auditory perception, e.g. speech intelligibility, since existing algorithms do not directly optimize criteria designed with human auditory perception in mind.

Many current state-of-the-art DNN based speech enhancement algorithms use a mean squared error (MSE) training criterion [6]–[8] on short-time spectral amplitudes (STSA). This, however, might not be the optimal training criterion if the target is the human auditory system, and improvement in speech intelligibility or speech quality is the desired objective.

It is well known that the frequency sensitivity of the human auditory system is non-linear (e.g. [9], [10]) and, as a consequence, is often approximated in digital signal processing algorithms using e.g. a Gammatone filter bank [11] or a one-third octave band filter bank [12]. It is also well known that preservation of modulation frequencies in the range 4-20 Hz is critical for speech intelligibility [9], [13], [14]. Therefore, it is natural to believe that, if prior knowledge about the human auditory system is incorporated into a speech enhancement algorithm, improvements in speech intelligibility or speech quality can be achieved [15].

Indeed, numerous works exist that attempt to incorporate such knowledge (e.g. [16]–[26] and references therein). In [16] a transform-domain method based on a Gammatone filter bank was used, which incorporates a non-linear frequency resolution mimicking that of the human auditory system. In [17] different perceptually motivated cost functions were used to derive STSA clean speech spectrum estimators in order to emphasize spectral peak information, account for auditory masking or penalize spectral over-attenuation. In [20], [21] similar goals were pursued, but instead of using classical statistically-based models, DNNs were used. Finally, in [22] a deep reinforcement learning technique was used to reward solutions that achieved a large score in terms of perceptual evaluation of speech quality (PESQ) [27], a commonly used speech quality estimator.

Although the works in e.g. [16], [17], [21], [22] include knowledge about the human auditory system, the techniques are not designed specifically to maximize speech intelligibility. While speech processing methods that improve speech intelligibility would be of vital importance for applications such as mobile communications or hearing assistive devices, only very little research has been performed to understand if DNN-based speech enhancement systems can help improve speech intelligibility. Very recent work [23]–[26] has investigated if DNNs trained to maximize a state-of-the-art speech intelligibility estimator are capable of improving speech intelligibility as measured by the estimator [23]–[25] or human listeners [26]. Specifically, DNNs were trained to maximize the short-time objective intelligibility (STOI) [12] estimator and were then compared, in terms of STOI, with DNNs trained to minimize the classical STSA-MSE criterion. Surprisingly, although all DNNs improved STOI, the DNNs trained to maximize STOI showed none or only very modest improvements in STOI compared to the DNNs trained with the classical STSA-MSE criterion [23]–[26].

The STOI speech intelligibility estimator has proven to be able to quite accurately predict the intelligibility of noisy/processed speech in a large range of acoustic scenarios, including speech processed by mobile communication devices [28], ideal time-frequency weighted noisy speech [12], noisy speech enhanced by single-microphone time-frequency weighting-based speech enhancement systems [12], [29], [30], and speech processed by hearing assistive devices such as cochlear implants [31]. STOI has also been shown to be robust to variations in language types, including Danish [12], Dutch [30], and Mandarin [32]. Finally, recent studies, e.g. [6], [7], also show a good correspondence between STOI predictions of noisy speech enhanced by DNN-based speech enhancement systems, and speech intelligibility. As a consequence, STOI is currently perhaps the most commonly used speech intelligibility estimator for objectively evaluating the performance of speech enhancement systems [6]–[8], [16]. Therefore, it is natural to believe that gains in speech intelligibility, as estimated by STOI, can be achieved by utilizing an optimality criterion based on STOI as opposed to the classical criterion based on STSA-MSE.

In this paper we study the potential gain in speech intelligibility that can be achieved, if a DNN is designed to perform optimally with respect to the STOI speech intelligibility estimator. We derive that, under certain general conditions, maximizing an approximate-STOI criterion is equivalent to minimizing a STSA-MSE criterion. Furthermore, we present empirical data using simulation studies with DNNs applied to noisy speech signals, that support our theoretical results. Finally, we show theoretically under which conditions the equality between the approximate-STOI criterion and the STSA-MSE criterion holds for practical systems. Our results are in line with recent empirical work and might explain the somewhat surprising result in [23]–[26], where none or only very modest improvements in STOI were achieved with STOI optimal DNNs compared to MSE optimal DNNs.

II. STFT-DOMAIN BASED SPEECH ENHANCEMENT

Fig. 1 shows a block-diagram of a classical gain-based speech enhancement system [18], [33]. Let $x[n]$ be the $n$th sample of the clean time-domain speech signal and let a noisy observation $y[n]$ be given by

$$y[n] = x[n] + v[n], \tag{1}$$

where $v[n]$ is a sample of additive noise. Furthermore, let $a(k,m)$ and $r(k,m)$, $k = 1, \dots, \frac{K}{2}+1$, $m = 1, \dots, M$, denote the single-sided magnitude spectra of the $K$-point short-time discrete Fourier transform (STFT) of $x[n]$ and $y[n]$, respectively, where $M$ is the number of STFT frames. Also, let $\hat{a}(k,m)$ denote an estimate of $a(k,m)$ obtained as $\hat{a}(k,m) = \hat{g}(k,m)\, r(k,m)$. Here, $\hat{g}(k,m)$ is a scalar gain factor applied to the magnitude spectrum of the noisy speech $r(k,m)$ to arrive at an estimate $\hat{a}(k,m)$ of the clean speech magnitude spectrum $a(k,m)$. It is the goal of many STFT-based speech enhancement systems to find appropriate values for $\hat{g}(k,m)$ based on the available noisy signal $y[n]$. The gain factor $\hat{g}(k,m)$ is typically estimated using either statistical model-based methods such as classical STSA minimum mean-square error (MMSE) estimators [34], [18], [33], or machine learning based techniques such as Gaussian mixture models [35], support vector machines [36], or, more recently, DNNs [6]–[8], [16]. For reconstructing the enhanced speech signal in the time domain, it is common practice to append the short-time phase spectrum of the noisy signal to the estimated short-time magnitude spectrum and then use the overlap-and-add technique [37], [33].

[Fig. 1: block diagram with stages T-F Analysis, Gain Estimator, T-F Synthesis.]
Fig. 1. Classical gain-based speech enhancement system. The noisy time-domain signal $y[n] = x[n] + v[n]$ is first decomposed into a time-frequency (T-F) representation $r(k,m)$ for time-frame $m$ and frequency index $k$. An estimator, e.g. a DNN, estimates a gain $\hat{g}(k,m)$ that is applied to the noisy short-term magnitude spectrum $r(k,m)$ to arrive at an enhanced signal magnitude $\hat{a}(k,m) = \hat{g}(k,m)\, r(k,m)$. Finally, the enhanced time-domain signal $\hat{x}[n]$ is obtained from a T-F synthesis stage using the phase of the noisy signal $\phi_y(k,m)$.

III. SHORT-TIME OBJECTIVE INTELLIGIBILITY (STOI)

In the following, we briefly review the STOI intelligibility estimator [12]. For further details we refer to [12]. Let the $j$th one-third octave band clean-speech amplitude, for time-frame $m$, be defined as

$$a_j(m) = \sqrt{\sum_{k=k_1(j)}^{k_2(j)} a(k,m)^2}, \tag{2}$$

where $k_1(j)$ and $k_2(j)$ denote the first and last STFT bin index, respectively, of the $j$th one-third octave band. Furthermore, let a short-time temporal envelope vector that spans time-frames $m-N+1, \dots, m$, for the clean speech signal be defined as

$$\mathbf{a}_{j,m} = [a_j(m-N+1),\, a_j(m-N+2),\, \dots,\, a_j(m)]^T. \tag{3}$$

In a similar manner we define $\hat{\mathbf{a}}_{j,m}$ and $\mathbf{r}_{j,m}$ for the enhanced speech signal and the noisy observation, respectively.

The parameter $N$ defines the length of the temporal envelope and for STOI $N = 30$, which for the STFT settings used in this study, as well as in [12], corresponds to approximately 384 ms (footnote 1). Finally, the STOI speech intelligibility estimator for a pair of short-time temporal envelope vectors can then be approximated by the sample envelope linear correlation (ELC) between the clean and enhanced envelope vectors $\mathbf{a}_{j,m}$ and $\hat{\mathbf{a}}_{j,m}$ given as

$$\mathcal{L}(\mathbf{a}_{j,m}, \hat{\mathbf{a}}_{j,m}) = \frac{(\mathbf{a}_{j,m} - \boldsymbol{\mu}_{\mathbf{a}_{j,m}})^T (\hat{\mathbf{a}}_{j,m} - \boldsymbol{\mu}_{\hat{\mathbf{a}}_{j,m}})}{\|\mathbf{a}_{j,m} - \boldsymbol{\mu}_{\mathbf{a}_{j,m}}\| \, \|\hat{\mathbf{a}}_{j,m} - \boldsymbol{\mu}_{\hat{\mathbf{a}}_{j,m}}\|}, \tag{4}$$

where $\|\cdot\|$ denotes the Euclidean $\ell_2$-norm and $\boldsymbol{\mu}_{\mathbf{a}_{j,m}}$ and $\boldsymbol{\mu}_{\hat{\mathbf{a}}_{j,m}}$ denote the sample means of $\mathbf{a}_{j,m}$ and $\hat{\mathbf{a}}_{j,m}$, respectively.

Footnote 1: With $N = 30$, STOI is sensitive to temporal modulations of 2.6 Hz and higher, which are frequencies important for speech intelligibility [12].

Note that Eq. (4) is an approximation, since the clipping and normalization steps otherwise used in STOI have been omitted. This has empirically been found not to have any significant effect on intelligibility prediction performance in most cases [19], [29], [38], [39]. Furthermore, since the normalization step is applied for the entire vector $\hat{\mathbf{a}}_{j,m}$, the normalization procedure itself does not influence the final STOI score. Also, as clipping only occurs for time-frequency units for which the signal-to-distortion ratio (see Eq. (4) in [12]) is below $-15$ dB, clipping only occurs for a minority of the envelope vectors and approximating STOI with ELC is well valid, or even exact, in most cases, when evaluating speech signals at practical SNRs.

From $\mathcal{L}(\mathbf{a}_{j,m}, \hat{\mathbf{a}}_{j,m})$, the final STOI score for an entire speech signal is then defined as [12] the scalar, $-1 \le d \le 1$,

$$d = \frac{1}{J(M-N+1)} \sum_{j=1}^{J} \sum_{m=N}^{M} \mathcal{L}(\mathbf{a}_{j,m}, \hat{\mathbf{a}}_{j,m}), \tag{5}$$

where $J$ is the number of one-third octave bands and $M-N+1$ is the total number of short-time temporal envelope vectors. Similarly to [12], we use $J = 15$ with a center frequency of the first one-third octave band at 150 Hz and the last at approximately 3.8 kHz to ensure a frequency range that covers the majority of the spectral information of human speech. The STOI score in general has been shown to often have high correlation with listening tests involving human test subjects, i.e. the higher the numerical value of Eq. (5), the more intelligible is the speech signal.

Since STOI, as approximated by Eq. (5), is a sum of ELC values as given by Eq. (4), maximizing Eq. (4) will also maximize the overall STOI score in Eq. (5). As a consequence, in order to find an estimate $\hat{x}[n]$ of $x[n]$ so that STOI is maximized, one can focus on finding optimal estimates of the individual short-time temporal envelope vectors $\mathbf{a}_{j,m}$. Therefore, we define $\hat{\mathbf{a}}_{j,m} = \mathrm{diag}(\hat{\mathbf{g}}_{j,m})\, \mathbf{r}_{j,m}$ as the short-time temporal one-third octave band envelope vector of the enhanced speech signal, where $\hat{\mathbf{g}}_{j,m}$ is an estimated gain vector and $\mathrm{diag}(\hat{\mathbf{g}}_{j,m})$ is a diagonal matrix with the elements of $\hat{\mathbf{g}}_{j,m}$ on the main diagonal.

IV. ENVELOPE LINEAR CORRELATION ESTIMATOR

We now introduce the approximate-STOI criterion in a stochastic context and derive the speech envelope estimator that maximizes it. We denote this estimator as the maximum mean envelope linear correlation (MMELC) estimator. Let $A_j(m)$ and $R_j(m)$ denote random variables representing a clean and a noisy, respectively, one-third octave band magnitude, for band $j$ and time frame $m$. Furthermore, let

$$\mathbf{A}_j(m) = [A_j(m-N+1),\, \dots,\, A_j(m)]^T \tag{6}$$

and

$$\mathbf{R}_j(m) = [R_j(m-N+1),\, \dots,\, R_j(m)]^T \tag{7}$$

be the stack of these random variables in random envelope vectors. Finally, in a similar manner, let

$$\hat{\mathbf{A}}_j(m) = [\hat{A}_j(m-N+1),\, \dots,\, \hat{A}_j(m)]^T, \tag{8}$$

be a random envelope vector representing an estimate of $\mathbf{A}_j(m)$. Now, the contribution of $\hat{\mathbf{A}}_j(m)$ to speech intelligibility may be approximated as the ELC between the envelope vectors $\mathbf{A}_j(m)$ and $\hat{\mathbf{A}}_j(m)$. In the following, the indices $j$ and $m$ are omitted for convenience. Let $\mathbf{1}$ denote a vector of ones, and let $\boldsymbol{\mu}_{\mathbf{A}} = \frac{1}{N}\mathbf{1}^T\mathbf{A}\,\mathbf{1}$ be a vector whose entries equal the sample mean of the entries in $\mathbf{A}$. Let $\boldsymbol{\mu}_{\hat{\mathbf{A}}}$ be defined in a similar manner. Finally, let the ELC between $\mathbf{A}$ and $\hat{\mathbf{A}}$, which is a random variable, be defined as

$$\rho(\mathbf{A}, \hat{\mathbf{A}}) \triangleq \frac{(\mathbf{A} - \boldsymbol{\mu}_{\mathbf{A}})^T (\hat{\mathbf{A}} - \boldsymbol{\mu}_{\hat{\mathbf{A}}})}{\|\mathbf{A} - \boldsymbol{\mu}_{\mathbf{A}}\| \, \|\hat{\mathbf{A}} - \boldsymbol{\mu}_{\hat{\mathbf{A}}}\|}, \tag{9}$$

and the expected ELC as

$$\begin{aligned}
\Omega_{\mathrm{ELC}} &= \mathbb{E}_{\mathbf{A},\mathbf{R}}\left[\rho(\mathbf{A}, \hat{\mathbf{A}})\right] \\
&= \iint \rho(\mathbf{a}, \hat{\mathbf{a}})\, f_{\mathbf{A},\mathbf{R}}(\mathbf{a}, \mathbf{r})\, d\mathbf{a}\, d\mathbf{r} \\
&= \int \underbrace{\int \rho(\mathbf{a}, \hat{\mathbf{a}})\, f_{\mathbf{A}|\mathbf{R}}(\mathbf{a}|\mathbf{r})\, d\mathbf{a}}_{\Gamma(\mathbf{r})}\; f_{\mathbf{R}}(\mathbf{r})\, d\mathbf{r}.
\end{aligned} \tag{10}$$

Here, $\hat{\mathbf{a}}$ is related to $\mathbf{r}$ via a deterministic map, e.g. a DNN, and $f_{\mathbf{A},\mathbf{R}}(\mathbf{a}, \mathbf{r})$ denotes the joint probability density function (PDF) of clean and noisy/processed one-third octave band envelope vectors. Furthermore, $f_{\mathbf{A}|\mathbf{R}}(\mathbf{a}|\mathbf{r})$ and $f_{\mathbf{R}}(\mathbf{r})$ denote a conditional and marginal PDF, respectively.

An optimal estimator can be found by minimizing the Bayes risk [33], [40], which is equivalent to maximizing Eq. (10), hence arriving at the MMELC estimator, which we denote as $\hat{\mathbf{a}}_{\mathrm{MMELC}}$. To do so, observe that for a particular noisy observation $\mathbf{r}$, maximizing $\Gamma(\mathbf{r})$ maximizes Eq. (10), since $f_{\mathbf{R}}(\mathbf{r}) \ge 0 \;\forall\, \mathbf{r}$. In other words, our goal is to maximize $\Gamma(\mathbf{r})$ for each and every $\mathbf{r}$. Hence, for a particular observation, $\mathbf{r}$, the MMELC estimate is given by

$$\begin{aligned}
\hat{\mathbf{a}}_{\mathrm{MMELC}} &= \arg\max_{\hat{\mathbf{a}}} \int \rho(\mathbf{a}, \hat{\mathbf{a}})\, f_{\mathbf{A}|\mathbf{R}}(\mathbf{a}|\mathbf{r})\, d\mathbf{a} \\
&= \arg\max_{\hat{\mathbf{a}}} \int \frac{(\mathbf{a} - \boldsymbol{\mu}_{\mathbf{a}})^T (\hat{\mathbf{a}} - \boldsymbol{\mu}_{\hat{\mathbf{a}}})}{\|\mathbf{a} - \boldsymbol{\mu}_{\mathbf{a}}\| \, \|\hat{\mathbf{a}} - \boldsymbol{\mu}_{\hat{\mathbf{a}}}\|}\, f_{\mathbf{A}|\mathbf{R}}(\mathbf{a}|\mathbf{r})\, d\mathbf{a} \\
&= \arg\max_{\hat{\mathbf{a}}} \underbrace{\int \frac{(\mathbf{a} - \boldsymbol{\mu}_{\mathbf{a}})^T}{\|\mathbf{a} - \boldsymbol{\mu}_{\mathbf{a}}\|}\, f_{\mathbf{A}|\mathbf{R}}(\mathbf{a}|\mathbf{r})\, d\mathbf{a}}_{\mathbb{E}_{\mathbf{A}|\mathbf{r}}\left[\mathbf{e}(\mathbf{A})^T\right]} \; \underbrace{\frac{\hat{\mathbf{a}} - \boldsymbol{\mu}_{\hat{\mathbf{a}}}}{\|\hat{\mathbf{a}} - \boldsymbol{\mu}_{\hat{\mathbf{a}}}\|}}_{\mathbf{e}(\hat{\mathbf{a}})} \\
&= \arg\max_{\hat{\mathbf{a}}} \mathbb{E}_{\mathbf{A}|\mathbf{r}}\left[\mathbf{e}(\mathbf{A})^T\right] \mathbf{e}(\hat{\mathbf{a}}),
\end{aligned} \tag{11}$$

where $\mathbf{e}(\cdot)$ is a function that normalizes its vector argument to zero sample mean and unit norm and where we used that for a given noisy observation $\mathbf{r}$, $\hat{\mathbf{a}}$ is deterministic. Note that the solution to Eq. (11) is non-unique. For one given solution, say $\hat{\mathbf{a}}^{*}$, any affine transformation $\gamma\hat{\mathbf{a}}^{*} + \eta\mathbf{1}$ with $\gamma > 0$ and $\eta \in \mathbb{R}$ is also a solution, because any such transformation is undone by $\mathbf{e}(\cdot)$. Hence, in the following we focus on finding one such particular solution, namely the zero sample mean, unit norm solution, i.e. the vector $\mathbf{e}(\hat{\mathbf{a}})$ that maximizes the inner product with the vector $\mathbb{E}_{\mathbf{A}|\mathbf{r}}[\mathbf{e}(\mathbf{A}|\mathbf{r})]$. To do so, let $\boldsymbol{\alpha} = \mathbb{E}_{\mathbf{A}|\mathbf{r}}[\mathbf{e}(\mathbf{A}|\mathbf{r})]$, and let $\mathbf{e}(\hat{\mathbf{a}}^{*})$ denote the zero sample mean, unit norm vector that maximizes Eq. (11). Then, using the method of Lagrange multipliers, it can be shown (see Appendix A) that the MMELC estimator is given by

$$\hat{\mathbf{a}}_{\mathrm{MMELC}} = \mathbf{e}(\hat{\mathbf{a}}^{*}) = \frac{\boldsymbol{\alpha}}{\|\boldsymbol{\alpha}\|}, \tag{12}$$

which is nothing more than the vector $\boldsymbol{\alpha}$, normalized to unit norm. The fact that $\boldsymbol{\mu}_{\boldsymbol{\alpha}} = \frac{1}{N}\mathbf{1}^T\boldsymbol{\alpha}\,\mathbf{1} = \mathbf{0}$ follows from Eq. (11), where it is seen that $\boldsymbol{\alpha} = \mathbb{E}_{\mathbf{A}|\mathbf{r}}[\mathbf{e}(\mathbf{A}|\mathbf{r})]$ is an expectation over vectors $(\mathbf{a} - \boldsymbol{\mu}_{\mathbf{a}})/\|\mathbf{a} - \boldsymbol{\mu}_{\mathbf{a}}\|$ whose sample mean is zero. By interpreting the expectation as an infinite linear combination of such vectors, it follows that $\boldsymbol{\mu}_{\boldsymbol{\alpha}} = \mathbf{0}$.

V. RELATION TO STSA-MMSE ESTIMATORS

We now show that the MMELC estimator, Eq. (12), is asymptotically equivalent to the one-third octave band STSA-MMSE estimator for large envelope lengths, i.e. as $N \to \infty$. The STSA-MSE (e.g. [34]) is defined as

$$\Omega_{\mathrm{MSE}} = \mathbb{E}_{\mathbf{A},\mathbf{R}}\left[\|\mathbf{A} - \hat{\mathbf{A}}\|^2\right]. \tag{13}$$

It can be shown (e.g. [18], [33], [34]) that the optimal Bayesian estimator with respect to Eq. (13) is the STSA-MMSE estimator given by the conditional mean defined as

$$\hat{\mathbf{a}}_{\mathrm{MMSE}} = \int \mathbf{a}\, f_{\mathbf{A}|\mathbf{R}}(\mathbf{a}|\mathbf{r})\, d\mathbf{a} = \mathbb{E}_{\mathbf{A}|\mathbf{r}}\left[\mathbf{A}|\mathbf{r}\right]. \tag{14}$$

To show that $\hat{\mathbf{a}}_{\mathrm{MMELC}}$ is asymptotically equivalent to $\hat{\mathbf{a}}_{\mathrm{MMSE}}$, let us introduce the idempotent, symmetric matrix

$$\mathbf{H} = \mathbf{I} - \frac{1}{N}\mathbf{1}\mathbf{1}^T, \tag{15}$$

where $\mathbf{I}$ denotes the $N$-dimensional identity matrix. We can then rewrite the vector $\boldsymbol{\alpha}$ as

$$\begin{aligned}
\boldsymbol{\alpha} &= \int \frac{\mathbf{a} - \boldsymbol{\mu}_{\mathbf{a}}}{\|\mathbf{a} - \boldsymbol{\mu}_{\mathbf{a}}\|}\, f_{\mathbf{A}|\mathbf{R}}(\mathbf{a}|\mathbf{r})\, d\mathbf{a} \\
&= \int \frac{\mathbf{H}\mathbf{a}}{\|\mathbf{H}\mathbf{a}\|}\, f_{\mathbf{A}|\mathbf{R}}(\mathbf{a}|\mathbf{r})\, d\mathbf{a} \\
&= \mathbb{E}_{\mathbf{A}|\mathbf{r}}\left[\frac{\mathbf{H}\mathbf{A}|\mathbf{r}}{\|\mathbf{H}\mathbf{A}|\mathbf{r}\|}\right] \\
&= \mathbb{E}_{\mathbf{A}|\mathbf{r}}\left[\frac{\mathbf{Z}}{\|\mathbf{Z}\|}\right],
\end{aligned} \tag{16}$$

where $\mathbf{A}|\mathbf{r}$ is a random vector, and we introduced the notation $\mathbf{Z} \triangleq \mathbf{H}\mathbf{A}|\mathbf{r}$. We now employ the following conditional independence assumption

$$f_{\mathbf{A}|\mathbf{R}}(\mathbf{a}|\mathbf{r}) = \prod_{j=1}^{N} f_{A_j|R_j=r_j}(a_j|r_j). \tag{17}$$

This is a standard assumption in the area of speech enhancement when operating in the STFT domain, and has been the underlying assumption of a very large number of speech enhancement methods (see e.g. [18], [33], [34], [41], [42] and references therein). The conditional independence assumption is, for example, valid when speech and noise STFT coefficients may be assumed statistically independent across time and frequency and mutually independent [33], [34], [43].

Using Kolmogorov's strong law of large numbers [44, pp. 67-68] and the conditional independence assumption, it can be shown (see Appendix B) that asymptotically, as $N \to \infty$, the expectation in Eq. (16) factorizes as

$$\lim_{N\to\infty} \boldsymbol{\alpha} = \lim_{N\to\infty} \mathbb{E}_{\mathbf{A}|\mathbf{r}}\left[\frac{1}{\|\mathbf{Z}\|}\right] \mathbb{E}_{\mathbf{A}|\mathbf{r}}\left[\mathbf{Z}\right]. \tag{18}$$

Combining this result with Eq. (12) leads to

$$\begin{aligned}
\lim_{N\to\infty} \hat{\mathbf{a}}_{\mathrm{MMELC}} &= \lim_{N\to\infty} \frac{\boldsymbol{\alpha}}{\|\boldsymbol{\alpha}\|} \\
&= \lim_{N\to\infty} \frac{\mathbb{E}_{\mathbf{A}|\mathbf{r}}\left[\frac{1}{\|\mathbf{Z}\|}\right] \mathbb{E}_{\mathbf{A}|\mathbf{r}}[\mathbf{Z}]}{\left\|\mathbb{E}_{\mathbf{A}|\mathbf{r}}\left[\frac{1}{\|\mathbf{Z}\|}\right] \mathbb{E}_{\mathbf{A}|\mathbf{r}}[\mathbf{Z}]\right\|} \\
&= \lim_{N\to\infty} \frac{\mathbb{E}_{\mathbf{A}|\mathbf{r}}\left[\frac{1}{\|\mathbf{Z}\|}\right] \mathbb{E}_{\mathbf{A}|\mathbf{r}}[\mathbf{Z}]}{\mathbb{E}_{\mathbf{A}|\mathbf{r}}\left[\frac{1}{\|\mathbf{Z}\|}\right] \left\|\mathbb{E}_{\mathbf{A}|\mathbf{r}}[\mathbf{Z}]\right\|} \\
&= \lim_{N\to\infty} \frac{\mathbb{E}_{\mathbf{A}|\mathbf{r}}[\mathbf{Z}]}{\left\|\mathbb{E}_{\mathbf{A}|\mathbf{r}}[\mathbf{Z}]\right\|}.
\end{aligned} \tag{19}$$

Since Eq. (11) is invariant to affine transformations of its input arguments, we can scale $\hat{\mathbf{a}}_{\mathrm{MMELC}}$ with the scalar quantity $\|\mathbb{E}_{\mathbf{A}|\mathbf{r}}[\mathbf{Z}]\|$ in Eq. (19) to arrive at

$$\lim_{N\to\infty} \hat{\mathbf{a}}_{\mathrm{MMELC}} = \mathbb{E}_{\mathbf{A}|\mathbf{r}}[\mathbf{Z}]. \tag{20}$$

Finally, as $N \to \infty$, the MMELC estimator $\hat{\mathbf{a}}_{\mathrm{MMELC}}$ is given by

$$\begin{aligned}
\lim_{N\to\infty} \hat{\mathbf{a}}_{\mathrm{MMELC}} &= \mathbb{E}_{\mathbf{A}|\mathbf{r}}[\mathbf{Z}] \\
&= \mathbb{E}_{\mathbf{A}|\mathbf{r}}[\mathbf{H}\mathbf{A}|\mathbf{r}] \\
&= \mathbb{E}_{\mathbf{A}|\mathbf{r}}\left[\left(\mathbf{I} - \frac{1}{N}\mathbf{1}\mathbf{1}^T\right)\mathbf{A}|\mathbf{r}\right] \\
&= \mathbb{E}_{\mathbf{A}|\mathbf{r}}\left[\mathbf{A}|\mathbf{r} - \frac{1}{N}\mathbf{1}\mathbf{1}^T\mathbf{A}|\mathbf{r}\right] \\
&= \mathbb{E}_{\mathbf{A}|\mathbf{r}}[\mathbf{A}|\mathbf{r}] - \frac{1}{N}\mathbf{1}\mathbf{1}^T\,\mathbb{E}_{\mathbf{A}|\mathbf{r}}[\mathbf{A}|\mathbf{r}] \\
&= \hat{\mathbf{a}}_{\mathrm{MMSE}} - \boldsymbol{\mu}_{\hat{\mathbf{a}}_{\mathrm{MMSE}}}.
\end{aligned} \tag{21}$$

In words, the MMELC estimator, $\hat{\mathbf{a}}_{\mathrm{MMELC}}$, is (asymptotically in $N$) an affine transformation of the STSA-MMSE estimator $\hat{\mathbf{a}}_{\mathrm{MMSE}}$. In practice, this means that using the STSA-MMSE estimator leads to the same approximate-STOI criterion value as the estimator, $\hat{\mathbf{a}}_{\mathrm{MMELC}}$, derived to maximize this criterion. In other words, applying the traditional STSA-MMSE estimator leads to maximum speech intelligibility as reflected by the approximate STOI estimator.

VI. EXPERIMENTAL DESIGN

We now investigate empirically the relationship between the MMELC estimator in Eq. (11) and the STSA-MMSE estimator in Eq. (14) using an experimental study. As defined in Eq. (11), the MMELC estimator is the vector that maximizes the expectation of the ELC cost function given by Eq. (10). This expectation, Eq. (10), is defined via an integral of $\rho(\mathbf{a}, \hat{\mathbf{a}})$ for various realizations of $\mathbf{a}$ and $\hat{\mathbf{a}}$, weighted by the joint PDF $f_{\mathbf{A},\mathbf{R}}(\mathbf{a}, \mathbf{r})$. It is, however, well known that the integral may be approximated (arbitrarily well) as a sum of $\rho(\mathbf{a}, \hat{\mathbf{a}})$ terms, where realizations of $\mathbf{a}$ and $\hat{\mathbf{a}}$ are drawn according to $f_{\mathbf{A},\mathbf{R}}(\mathbf{a}, \mathbf{r})$. This is similar to what a DNN approximates during a standard training process, where a gradient based optimization technique is used to minimize the cost on a representative training set [45]. Therefore, training a DNN, e.g. using stochastic gradient ascent, to maximize Eq. (4) may be seen as an approximation of Eq. (11), where the approximation becomes more accurate with increasing training set size.

From the theoretical results presented in Sec. V, we would therefore expect that, for some sufficiently large $N$, one would obtain equality in an ELC sense between a DNN trained to maximize an ELC cost function and one that is trained to minimize the classical STSA-MSE cost function. To validate this expectation we follow the techniques formalized in Secs. II and III and train DNNs to estimate gain vectors, $\hat{\mathbf{g}}_{j,m}$, that we apply to noisy one-third octave band magnitude envelope signals $\mathbf{r}_{j,m}$, to arrive at enhanced signals $\hat{\mathbf{a}}_{j,m}$.

In principle, any supervised learning model would be applicable for these experiments, but considering the universal function approximation capability of DNNs [46], this is our model of choice. We use short-time temporal one-third octave band envelope vectors, as defined in Eq. (3), and train multiple DNNs, one for each of the $J = 15$ one-third octave bands, for various $N$, to investigate if, for sufficiently large $N$, DNNs trained with a STSA-MSE cost function approach the ELC values of DNNs trained with a cost function based on ELC.

We construct two types of enhancement systems: one type is trained using the STSA-MSE cost function, denoted as $\mathrm{ES}_{\mathrm{MSE}}$, and one is trained using the ELC cost function, denoted as $\mathrm{ES}_{\mathrm{ELC}}$. Each of the systems consists of $J = 15$ DNNs, each estimating a gain vector $\hat{\mathbf{g}}_{j,m}$ for a particular one-third octave band directly from the STFT magnitudes of the noisy signal $r(k,m)$, with the input context given by $k = 1, \dots, \frac{K}{2}+1$ and time frames $m-N+1, \dots, m$. This ensures that all DNNs have access to the same information for a particular value of $N$, as they all receive the same input data. Furthermore, we follow common practice (e.g. [6], [7], [16], [23]) and average overlapping estimated gain values, within a one-third octave band, during enhancement. We found during a preliminary study that this technique consistently led to slightly larger STOI scores for both types of systems.

To compute the STFT coefficients for all signals we use a 10 kHz sample frequency and a $K = 256$ point STFT with a Hann-window size of 256 samples (25.6 ms) and a 128 sample frame shift (12.8 ms). These coefficients are then used to compute one-third octave band envelopes for the clean and noisy signals using Eq. (3).

A. Noise-free Speech Mixtures

We have used the Wall Street Journal (WSJ0) speech corpus [47] as the clean speech data for the training, validation, and test sets. Specifically, the noise-free utterances used for training and validation are generated by randomly selecting utterances from 44 male and 47 female speakers from the WSJ0 training set entitled si_tr_s. In total 20000 utterances are used for the training set and 2000 are used for the validation set, which adds up to approximately 37 hours of training data and 4 hours of validation data. For the test set, we have used a similar approach and sampled 1000 utterances among 16 speakers (10 males and 6 females) from the WSJ0 validation set si_dt_05 and evaluation set si_et_05, which is equivalent to approximately 2 hours of data, see [48] for further details. The speakers used in the training and validation sets are different than the speakers used for test, i.e. we test in a speaker independent setting. Finally, since WSJ0 utterances primarily include speech active regions we do not apply a VAD. This is motivated by the fact that noise-only regions are irrelevant for STOI, as these are discarded by an ideal VAD in the STOI front-end [12].

B. Noise Types

To simulate a wide variety of sound scenes we have used six different noise types in our experiments: two synthetic noise signals and four natural noise signals, which are real-life recordings of naturally occurring sound scenes. For the two synthetic noise signals, we use a stationary speech shaped noise (SSN) signal and a highly non-stationary 6-speaker babble (BBL) noise. For the naturally occurring noise signals, we use the street (STR), cafeteria (CAF), bus (BUS), and pedestrian (PED) noise signals from the CHiME3 dataset [49]. The SSN noise signal is Gaussian white noise, spectrally shaped according to the long-term spectrum of the entire TIMIT speech corpus [50]. Similarly, the BBL noise signal is constructed by mixing utterances from both genders from TIMIT. To ensure that all noise types are equally represented and with unique realizations in the training, validation and test sets, all six noise signals are split into non-overlapping segments such that 40 min. is used for training, 5 min. is used for validation and another 5 min. is used for test.

C. Noisy Speech Mixtures

To construct the noisy speech signals used for training, we follow Eq. (1) and combine a noise-free training utterance $x[n]$ with a randomly selected noise sequence $v[n]$, of equal length, from the training noise signal. We scale the noise signal $v[n]$ to achieve a certain signal-to-noise ratio (SNR), according to the active speech level of $x[n]$ as defined by ITU P.56 [51]. For the training and validation sets, the SNRs are chosen uniformly from $[-5, 10]$ dB to ensure that the intelligibility of the noisy speech mixtures $y[n]$ ranges from degraded to perfectly intelligible.

D. Model Architecture and Training

The two types of enhancement systems, $\mathrm{ES}_{\mathrm{ELC}}$ and $\mathrm{ES}_{\mathrm{MSE}}$, each consist of 15 feed-forward DNNs. The DNNs in the $\mathrm{ES}_{\mathrm{ELC}}$ system are trained with the ELC cost function introduced in Eq. (4) and the DNNs in the $\mathrm{ES}_{\mathrm{MSE}}$ system are

Note, since $\mathcal{L}(\mathbf{a}, \hat{\mathbf{a}})$ is invariant to the magnitude of $\|\hat{\mathbf{a}}\|$ (see Eq.
(4)), and a and N are constants during training, the gradient MSE trained using the well-known STSA-MSE cost function given norm of the ELC cost function, Eq. (23), with respect to a ^, is by inversely proportional to the gradient norm of the STSA-MSE cost function, Eq. (27). This suggests that the two cost functions J (a; a ^) = ka a ^k ; (22) have different optimal learning rates. This observation might where the subscripts j and m are omitted for convenience. partly explain why equality with respect to STOI between STOI We train both the ES and ES systems with 20000 ELC MSE optimal and STSA-MSE optimal DNNs were achieved in [23] training utterances and 2000 validation utterances and both but not in [24]–[26], as [23] was the only study that explicitly data sets have been mixed uniformly with the SSN, BBL, CAF, stated that different learning rates for the two cost functions and STR noise signals, which ensures that each noise type were used. In fact, in [24]–[26] the optimization method Adam have been mixed with 25% of the utterances in the training [52] was used, and although Adam is an adaptive gradient and validation sets. During test, we evaluate each system with method, it still has several critical hyper-parameters that can one noise type at a time, i.e. each system is evaluated with inﬂuence convergence [53]. 1000 noisy test utterances per noise type, and since BUS and During a preliminary grid-search using the validation set PED are not included in the training and validation sets, these corrupted with SSN at an SNR of 0 dB and N = 30, we found two noise signals serve as unmatched noise types, whereas learning rates of 0:01 and 5 10 per sample to be optimal for SSN, BBL, CAF, and STR are matched noise types. This will the ES and ES systems, respectively. During training, ELC MSE allow us to study how the ELC optimal DNNs and STSA-MSE the cost on the validation set was evaluated for each epoch optimal DNNs generalize to unmatched noise types. 
and the learning rates were scaled by 0:7, if the cost increased Each feed-forward DNN consists of three hidden layers with compared to the cost for the previous epoch. The training 512 units using ReLU activation functions. The N -dimensional was terminated, if the learning rate was below 10 . We output layer uses sigmoid functions which ensures that the implemented the DNNs using CNTK [54] and the scripts output gain g ^ is conﬁned between zero and one. The needed to reproduce the reported results can be found in [48]. j;m DNNs are trained using stochastic gradient de-/ascent with Note, the goal of these experiments is not to achieve state- the backpropagation technique and batch normalization [45]. of-the-art enhancement performance. In fact, increasing the The DNNs are trained for a maximum of 200 epochs with a size of the dataset or DNNs might likely improve performance, minibatch size of 256 randomly selected short-time temporal although we have not reason to believe it will change the one-third octave band envelope vectors. conclusion. Since the ES and ES systems use different cost ELC MSE functions, they likely have different optimal learning rates. This VII. E XPERIM ENTAL RESULTS is easily seen from the gradient norms of the two cost functions. To study the relationship between ES and ES ELC MSE It can be shown (details omitted due to space limitations) that systems as function of N , we have trained multiple systems for the ` -norm of the gradient of the ELC cost function in Eq. (4), various N . Speciﬁcally, a total of eight ES systems and ELC with respect to the desired signal vector a ^, is given by eight ES systems have been trained with N taking the p MSE 1L(a; a ^) values N = f4; 7; 15; 20; 30; 40; 50; 80g, which correspond to krL(a; a ^)k = ; (23) temporal envelope vectors with durations from approximately ka ^k 50 to 1000 milliseconds. where the gradient rL(a; a ^) is given by rL a; a ^ = A. 
Comparing One-third Octave Bands " # (24) @L a; a ^ @L a; a ^ @L a; a ^ In Fig. 2 we present the ELC scores, as function of envelope ; ; : : : ; ; @a ^ @a ^ @a ^ 1 2 N duration N , for each of the J = 15 one-third octave band DNNs in the ES and ES systems. All DNNs are ELC MSE and tested using speech corrupted with BBL noise at an SNR of 0 @L a; a ^ dB. First, we observe that both systems manage to improve the @a ^ ELC score considerably, when compared to the ELC score of (25) L a; a ^ a ^ L a; a ^ a m the noisy speech signals, i.e. both systems enhance the noisy m a ^ T T speech, which is in line with known results [8]. a ^ a a ^ a ^ a ^ a ^ a ^ Furthermore, we can observe that the DNNs trained with the is the partial derivative of L(a; a ^) with respect to entry m ELC cost function, i.e. the ES systems, in general achieve ELC of vector a ^. Similarly, the gradient of the STSA-MSE cost higher, or similar, ELC scores than the DNNs trained with the function in Eq. (22) is given by STSA-MSE cost function, i.e. the ES systems. This is an MSE important observation, since it veriﬁes that DNNs trained to (26) rJ a; a ^ = (a a ^) ; maximize ELC indeed achieve the highest, or similar, ELC scores compared to DNNs trained to optimize a different cost such that function, STSA-MSE in this case. 
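As a numerical aside, the following minimal Python sketch computes the ELC score of Eq. (4) for a pair of envelope vectors and checks the gradient-norm relation of Eq. (23) against the partial derivatives of Eq. (25), written here in an algebraically equivalent form. The helper names (`center`, `elc`, `elc_grad`, `mse`) and the random test vectors are illustrative assumptions; they are not taken from the authors' CNTK scripts.

```python
import math
import random

def center(x):
    """Subtract the sample mean from every entry of x."""
    mu = sum(x) / len(x)
    return [v - mu for v in x]

def norm(x):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(v * v for v in x))

def elc(a, a_hat):
    """ELC: inner product of the mean-removed, unit-norm versions of a and a_hat."""
    ca, ch = center(a), center(a_hat)
    return sum(p * q for p, q in zip(ca, ch)) / (norm(ca) * norm(ch))

def elc_grad(a, a_hat):
    """Analytic gradient of elc with respect to a_hat (equivalent to Eq. (25))."""
    ca, ch = center(a), center(a_hat)
    score = elc(a, a_hat)
    na, nh = norm(ca), norm(ch)
    return [p / (na * nh) - score * q / (nh * nh) for p, q in zip(ca, ch)]

def mse(a, a_hat):
    """STSA-MSE cost of Eq. (22)."""
    return sum((p - q) ** 2 for p, q in zip(a, a_hat)) / len(a)

random.seed(0)
N = 30                                            # hypothetical envelope length
a = [random.random() for _ in range(N)]           # stand-in for a clean envelope
a_hat = [v + 0.3 * random.random() for v in a]    # stand-in for an estimate

L = elc(a, a_hat)
grad_norm = norm(elc_grad(a, a_hat))
predicted = math.sqrt(1.0 - L * L) / norm(center(a_hat))   # right-hand side of Eq. (23)

# A positive scaling plus an offset of a_hat leaves ELC unchanged but not the MSE.
affine = [2.5 * v + 1.0 for v in a_hat]

print(L, grad_norm, predicted)
print(elc(a, affine), mse(a, a_hat), mse(a, affine))
```

In this sketch `grad_norm` and `predicted` agree to numerical precision, and `elc(a, affine)` equals `elc(a, a_hat)` even though the MSE changes substantially, which illustrates why the two cost functions can only be expected to agree up to an affine transformation.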
Finally, and most importantly, we observe that the difference in ELC score between the ES_ELC and ES_MSE DNNs generally decreases with increasing $N$. For $N = 80$ the ELC scores of the ES_ELC and ES_MSE DNNs practically coincide.

[Figure 2: 15 panels, one per one-third octave band, each showing ELC as a function of envelope duration N (axis ticks at 4, 15, 30, 50, 80). Panel titles: Band 1 (CF: 150 Hz), Band 2 (CF: 189 Hz), Band 3 (CF: 238 Hz), Band 4 (CF: 300 Hz), Band 5 (CF: 378 Hz), Band 6 (CF: 476 Hz), Band 7 (CF: 600 Hz), Band 8 (CF: 756 Hz), Band 9 (CF: 952 Hz), Band 10 (CF: 1200 Hz), Band 11 (CF: 1512 Hz), Band 12 (CF: 1905 Hz), Band 13 (CF: 2400 Hz), Band 14 (CF: 3024 Hz), Band 15 (CF: 3810 Hz).]

Fig. 2. ELC values for ES_ELC and ES_MSE systems trained using various envelope durations, $N$, and tested with corresponding values of $N$ using speech corrupted with BBL noise at an SNR of 0 dB. Each figure shows one out of $J = 15$ one-third octave band DNNs (center frequency (CF) shown in parenthesis). It is seen that as $N \to 80$ the difference between the ES_ELC DNNs and ES_MSE DNNs, as measured by ELC, tends to zero. This is in line with the theoretical results of Sec. V.

B. Comparing ELC across Noise Types

In Fig. 3 we present the ELC score difference, as function of envelope duration $N$, for ES_ELC and ES_MSE systems, when tested using speech material corrupted with various noise types at an SNR of 0 dB. Specifically, we compute the difference in ELC score for each pair of one-third octave band DNNs in the ES_ELC and ES_MSE systems, and then compute the average ELC difference as function of envelope duration $N$. We do this for all the 1000 test utterances and for each of the six noise types introduced in Sec. VI-B: SSN, BBL, CAF, STR, BUS, and PED. Finally, we compute the 95% confidence interval (CI) on the mean ELC difference.

From Fig. 3 we observe that the average ELC difference, i.e. ES_ELC − ES_MSE, appears to be monotonically decreasing with respect to the duration of the envelope $N$. Furthermore, we observe that the average ELC difference approaches zero as the duration of the envelope $N$ increases, and similarly to Fig. 2, for $N = 80$, the difference between the ES_ELC and ES_MSE systems is close to zero. Finally, we observe that the 95% confidence intervals are relatively narrow for all envelope durations and noise types, which indicates that our test set is sufficiently large to provide accurate estimates of the true mean ELC difference. Similarly to Fig. 2, the results in Fig. 3 support the theoretical results of Sec. V. Additionally, the results in Fig. 3 show consistency across multiple noise types, which suggests that the theory in practice applies for various noise type distributions.

[Figure 3: six panels, one per noise type (SSN, BBL, CAF, STR, BUS, PED), each showing the average ELC difference (range 0 to 0.08) with its 95% CI as a function of envelope duration N (axis ticks at 4, 15, 30, 50, 80).]

Fig. 3. Average ELC differences, as function of envelope durations $N$, between ES_ELC and ES_MSE systems, for different noise types. We observe a monotonic decreasing relationship between the average ELC difference and the envelope length, and for $N = 80$ the average ELC difference between the ES_ELC and ES_MSE systems is close to zero. This is in line with the theoretical results of Sec. V.

C. Comparing STOI across Noise Types

We now investigate if the global behavior observed for approximate-STOI, i.e. ELC, in Fig. 3 also applies for real STOI. To do this, we reconstruct the test signals used for Fig. 3 in the time domain. We follow the technique proposed in [23], where a uniform gain across STFT coefficients within a one-third octave band is used before an inverse DFT is applied using the phase of the noisy signal. In Table I we present the STOI scores for ES_ELC and ES_MSE systems, as a function of $N$, when tested using speech material corrupted with different noise types at an SNR of 0 dB. Note that these test signals are similar to the test signals used for Fig. 3 except that we now evaluate them according to STOI and not ELC.

TABLE I
STOI scores as function of N for ES_ELC and ES_MSE systems tested using different noise types at an SNR of 0 dB.

Noise  System  N=4   N=7   N=15  N=20  N=30  N=40  N=50  N=80
SSN    ELC     0.81  0.85  0.88  0.88  0.87  0.86  0.85  0.84
SSN    MSE     0.84  0.87  0.87  0.87  0.87  0.86  0.85  0.84
BBL    ELC     0.77  0.80  0.82  0.82  0.81  0.80  0.80  0.78
BBL    MSE     0.79  0.82  0.82  0.82  0.81  0.80  0.80  0.78
CAF    ELC     0.82  0.85  0.87  0.87  0.86  0.85  0.84  0.83
CAF    MSE     0.85  0.87  0.87  0.87  0.86  0.85  0.85  0.84
STR    ELC     0.83  0.86  0.88  0.89  0.88  0.87  0.87  0.85
STR    MSE     0.86  0.88  0.88  0.88  0.88  0.87  0.87  0.85
PED    ELC     0.77  0.81  0.83  0.83  0.83  0.82  0.81  0.80
PED    MSE     0.80  0.82  0.83  0.83  0.82  0.82  0.81  0.80
BUS    ELC     0.87  0.89  0.90  0.91  0.90  0.89  0.89  0.89
BUS    MSE     0.89  0.90  0.90  0.90  0.90  0.90  0.89  0.89

From Table I we observe that the average STOI difference between the ES_ELC and ES_MSE systems is maximum for $N = 4$, but quickly tends to zero as $N$ increases, and for $N \geq 15$ the STOI difference is practically zero, i.e. $\leq 0.01$. Also, we observe that the gap in STOI between the ES_ELC and ES_MSE systems closes faster, at a lower value of $N$, in Table I compared to Fig. 3. We believe this is due to the transformation of the, potentially "invalid", sequences of (e.g. [55], [56]) modified magnitude spectra when reconstructing enhanced time-domain signals, whose intelligibility is estimated by STOI in Table I. Therefore, STOI in Table I might be computed based on slightly different magnitude spectra compared to the magnitude spectra used for computing the ELC scores in Fig. 3.

Furthermore, we observe that the ES_MSE systems achieve slightly higher STOI scores than the ES_ELC systems for $N = 4$, which might be due to sub-optimal learning rates, as the ones actually used during training of the systems at, e.g. $N = 4$, were found based on a grid-search using systems with $N = 30$ (see Sec. VI-D). More importantly, the maximum improvement in STOI is achieved for $N \in \{15, 20, 30\}$, where both systems achieve similar STOI scores. Finally, while the theoretical results of Sec. V show that the approximate-STOI performance of $\hat{\mathbf{a}}_{\mathrm{MMELC}}$ and $\hat{\mathbf{a}}_{\mathrm{MMSE}}$ is identical, asymptotically, for $N \to \infty$, the empirical results in Table I suggest that $N \geq 15$ is sufficient for practical equality to hold for DNN based speech enhancement systems.

D. Comparing Gain-Values

Figures 2 and 3, and Table I show that ES_ELC systems achieve approximately the same ELC and STOI values as ES_MSE systems and that the ELC and STOI difference between the two types of systems approaches zero as $N$ becomes large. These empirical results are in line with the theoretical results in Sec. V. However, the results in Sec. V predict not only that ES_ELC and ES_MSE systems produce identical ELC scores, they also predict that the systems are, in fact, essentially identical, i.e. up to an affine transformation. Hence, in this section, we compare how the systems actually operate. Specifically, we compare the gains estimated by ES_ELC systems with gains estimated by ES_MSE systems.

In Fig. 4 we present scatter plots, one for each one-third octave band, for pairs of gains estimated by ES_ELC and ES_MSE systems tested with BBL noise at an SNR of 5 dB. Each scatter plot consists of 10000 pairs of gains acquired by sampling 10 gain-pairs randomly and uniformly distributed from each of the 1000 test utterances.
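The sampling of gain pairs described above, together with a sample (Pearson) correlation coefficient as one way of summarizing each scatter plot, can be sketched as follows. This is a minimal illustration on synthetic gains, not the authors' evaluation code: the function names, the toy data generator, the number of gains per utterance, and the choice of sampling without replacement are all assumptions.

```python
import math
import random

def pearson(x, y):
    """Sample (Pearson) correlation coefficient between two equally long sequences."""
    n = len(x)
    mx, my = sum(x) / n, sum(y) / n
    sxy = sum((p - mx) * (q - my) for p, q in zip(x, y))
    sxx = sum((p - mx) ** 2 for p in x)
    syy = sum((q - my) ** 2 for q in y)
    return sxy / math.sqrt(sxx * syy)

def sample_gain_pairs(gains_elc, gains_mse, pairs_per_utterance, rng):
    """Draw the same random positions from both systems' gains, utterance by utterance."""
    xs, ys = [], []
    for g_elc, g_mse in zip(gains_elc, gains_mse):
        # Assumption: positions are drawn uniformly and without replacement.
        for idx in rng.sample(range(len(g_elc)), pairs_per_utterance):
            xs.append(g_elc[idx])
            ys.append(g_mse[idx])
    return xs, ys

def clip01(v):
    """Confine a gain to [0, 1], mimicking a sigmoid output range."""
    return min(1.0, max(0.0, v))

rng = random.Random(0)
num_utterances = 1000          # as in the text
gains_per_utterance = 50       # arbitrary toy size (assumption)

# Synthetic stand-ins for one band: the second system is a noisy copy of the first.
gains_mse = [[rng.random() for _ in range(gains_per_utterance)]
             for _ in range(num_utterances)]
gains_elc = [[clip01(g + rng.gauss(0.0, 0.05)) for g in utt] for utt in gains_mse]

xs, ys = sample_gain_pairs(gains_elc, gains_mse, 10, rng)   # 10 pairs per utterance
r = pearson(xs, ys)
print(len(xs), round(r, 3))
```

With 10 pairs from each of 1000 utterances the sketch yields 10000 pairs per band, matching the count stated in the text; the correlation value itself only reflects the synthetic data used here.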
In Fig. 4, yellow indicates high density of gain-pairs and dark blue indicates low density. From Fig. 4 it is seen that a correlation no smaller than 0.88 is achieved for all 15 one-third octave bands. The highest correlation of $r = 0.98$ is achieved by bands 5 to 7 and the lowest is $r = 0.88$, achieved by band 2, followed by band 1 with $r = 0.89$. It is also seen that a large number of gain values are either zero or one, as one would expect due to the sparse nature of speech in the T-F domain. However, although a strong correlation is observed for all bands, the gain-pairs are slightly more scattered at the first few bands than for the remaining bands. This might be explained simply by the fact that low one-third octave bands correspond to single STFT bins, whereas higher one-third octave bands are sums of a large number of STFT bins. This, in turn, may have the consequence that for finite $N$ ($N = 30$), Kolmogorov's strong law of large numbers (see Appendix B) holds better at higher frequencies than at lower frequencies (so that gain vectors produced by one system are closer to an affine transformation of gain vectors produced by the other system). In fact, if we compute $r$ for models trained with $N = 50$, we get $r = 0.93$, i.e. increased correlation between the gain vectors produced by the two systems. Finally, in Table II we present average correlation coefficients and we observe correlation coefficients $\geq 0.87$ for all, both matched and unmatched, noise types, at multiple SNRs.

[Figure 4: 15 scatter-plot panels, one per one-third octave band, with both axes ranging from 0 to 1.]

Fig. 4. Scatter plots based on gain values from ES_ELC and ES_MSE systems with an envelope length of $N = 30$. Dark blue indicates low density and bright yellow indicates high density. The systems are tested with BBL noise corrupted speech at an SNR of 5 dB. Each figure shows one of 15 ($\hat{g}_1, \hat{g}_2, \dots, \hat{g}_{15}$) one-third octave bands. A correlation no smaller than 0.88 is achieved for all one-third octave bands, which indicates that the ES_ELC and ES_MSE systems estimate fairly similar gain vectors.

TABLE II
Sample correlations between gains from ES_ELC and ES_MSE systems with N = 30. See Fig. 4 for per band correlations.

SNR [dB]  SSN   BBL   CAF   STR   BUS   PED
-5        0.94  0.87  0.89  0.93  0.87  0.90
0         0.94  0.92  0.92  0.93  0.88  0.92
5         0.95  0.95  0.93  0.93  0.90  0.92
10        0.95  0.95  0.92  0.92  0.91  0.93

VIII. CONCLUSION

This study is motivated by the fact that most estimators used for speech enhancement, being either data-driven models, e.g. deep neural networks (DNNs), or statistical model-based techniques such as the short-time spectral amplitude minimum mean-square error (STSA-MMSE) estimator, use the STSA mean-square error (MSE) cost function as a performance indicator. Short-time objective intelligibility (STOI), a state-of-the-art speech intelligibility estimator, on the other hand, relies on the envelope linear correlation (ELC) of speech temporal envelopes. Since the primary goal of many speech enhancement systems is to improve speech intelligibility, it raises the question if estimators can benefit from an ELC cost function.

In this paper we derive the maximum mean envelope linear correlation (MMELC) estimator and study its relationship to the well-known STSA-MMSE estimator. We show theoretically that the MMELC estimator, under a commonly used conditional independence assumption, is asymptotically equivalent to the STSA-MMSE estimator. Furthermore, we demonstrate experimentally that this relationship also holds for DNN based speech enhancement systems, when the DNNs are trained to either maximize ELC or minimize MSE and the systems are evaluated using both ELC and STOI. Finally, our experimental findings suggest that applying the traditional STSA-MMSE estimator on noisy speech signals in practice leads to essentially maximum speech intelligibility as reflected by the STOI speech intelligibility estimator.

APPENDIX A
MAXIMIZING A CONSTRAINED INNER PRODUCT

This appendix derives an expression for the zero-mean, unit-norm vector $\mathbf{e}(\hat{\mathbf{a}})$, which maximizes the inner product with the vector $E_{\mathbf{A}|\mathbf{r}}[\mathbf{e}(\mathbf{A}|\mathbf{r})]$. For notational convenience, let $\boldsymbol{\alpha} = E_{\mathbf{A}|\mathbf{r}}[\mathbf{e}(\mathbf{A}|\mathbf{r})]$, and $\boldsymbol{\beta} = \mathbf{e}(\hat{\mathbf{a}})$. The constrained optimization problem from Eq. (11) is then defined as

$$\begin{aligned}
\underset{\boldsymbol{\beta}}{\text{maximize}} \quad & \boldsymbol{\alpha}^T\boldsymbol{\beta} \\
\text{subject to} \quad & \boldsymbol{\beta}^T\mathbf{1} = 0, \\
& \boldsymbol{\beta}^T\boldsymbol{\beta} = 1.
\end{aligned} \tag{28}$$

The vector $\boldsymbol{\beta}$ that solves Eq. (28) can be found using the method of Lagrange multipliers [57]. Introducing two scalar Lagrange multipliers, $\lambda_1$ and $\lambda_2$, for the two equality constraints, the Lagrangian is given by

$$\mathcal{L}(\boldsymbol{\beta}, \lambda_1, \lambda_2) = -\boldsymbol{\alpha}^T\boldsymbol{\beta} + \lambda_1\boldsymbol{\beta}^T\mathbf{1} + \lambda_2(\boldsymbol{\beta}^T\boldsymbol{\beta} - 1). \tag{29}$$

(Footnote 2: We solve the equivalent problem that minimizes $-\boldsymbol{\alpha}^T\boldsymbol{\beta}$.)

Setting the partial derivatives $\frac{\partial\mathcal{L}}{\partial\boldsymbol{\beta}}$ equal to zero

$$\frac{\partial\mathcal{L}}{\partial\boldsymbol{\beta}} = -\boldsymbol{\alpha} + \lambda_1\mathbf{1} + 2\lambda_2\boldsymbol{\beta} = \mathbf{0}, \tag{30}$$

and solving for $\boldsymbol{\beta}$, we arrive at

$$\boldsymbol{\beta} = \frac{\boldsymbol{\alpha} - \lambda_1\mathbf{1}}{2\lambda_2}. \tag{31}$$

Using the same approach for $\frac{\partial\mathcal{L}}{\partial\lambda_1}$ and $\frac{\partial\mathcal{L}}{\partial\lambda_2}$, substituting in Eq. (31) and solving for $\lambda_1$ and $\lambda_2$ such that the two constraints are fulfilled, we find

$$\lambda_1 = \frac{1}{N}\mathbf{1}^T\boldsymbol{\alpha} = \mu_{\boldsymbol{\alpha}}, \tag{32}$$

and

$$\lambda_2 = \frac{\|\boldsymbol{\alpha} - \mu_{\boldsymbol{\alpha}}\mathbf{1}\|}{2}. \tag{33}$$

Inserting $\lambda_1$ and $\lambda_2$ into Eq. (31) results in

$$\boldsymbol{\beta} = \frac{\boldsymbol{\alpha} - \mu_{\boldsymbol{\alpha}}\mathbf{1}}{\|\boldsymbol{\alpha} - \mu_{\boldsymbol{\alpha}}\mathbf{1}\|}, \tag{34}$$

which is simply the vector $\boldsymbol{\alpha}$, normalized to zero sample mean and unit norm.

APPENDIX B
FACTORIZATION OF EXPECTATION

This appendix shows that the expectation in Eq. (16) factorizes into the product of expectations in Eq. (18), asymptotically as $N \to \infty$. Let

$$\mathbf{Y} \triangleq \mathbf{A}|\mathbf{r}, \tag{35}$$

and

$$\mathbf{H} \triangleq \mathbf{I}_N - \frac{1}{N}\mathbf{1}\mathbf{1}^T, \tag{36}$$

so that

$$\mathbf{Z} = \mathbf{H}\mathbf{Y}, \tag{37}$$

where $\mathbf{I}_N$ denotes the $N$-dimensional identity matrix and $\mathbf{A}|\mathbf{r}$ is a random vector distributed according to the conditional probability density function $f_{\mathbf{A}|\mathbf{R}}(\mathbf{a}|\mathbf{r})$. A specific element $Z_i$ of $\mathbf{Z}$ is then given by

$$Z_i = \mathbf{h}_i^T\mathbf{Y} = S_i - \frac{1}{N}\mathbf{1}^T\mathbf{Y}, \tag{38}$$

where $\mathbf{h}_i$ is the $i$th column of matrix $\mathbf{H}$ and $S_i$ denotes the $i$th element of $\mathbf{Y}$. We now define the covariance between $Z_i$ and $1/\|\mathbf{Z}\|$ as

$$\begin{aligned}
\mathrm{cov}\left(Z_i, \frac{1}{\|\mathbf{Z}\|}\right)
&\triangleq E\left[\left(Z_i - E[Z_i]\right)\left(\frac{1}{\|\mathbf{Z}\|} - E\left[\frac{1}{\|\mathbf{Z}\|}\right]\right)\right] \\
&= E\left[\frac{Z_i}{\|\mathbf{Z}\|}\right] - E[Z_i]\,E\left[\frac{1}{\|\mathbf{Z}\|}\right].
\end{aligned} \tag{39}$$

We can rewrite the factors on the right-hand side of Eq. (39) as follows

$$\begin{aligned}
E[Z_i] &= E\left[\mathbf{h}_i^T\mathbf{Y}\right] \\
&= E\left[S_i - \frac{1}{N}\mathbf{1}^T\mathbf{Y}\right] \\
&= E[S_i] - \frac{1}{N}\mathbf{1}^T E[\mathbf{Y}] \\
&= E[S_i] - \frac{1}{N}\sum_{j=1}^{N} E[S_j],
\end{aligned} \tag{40}$$

$$\begin{aligned}
E\left[\frac{1}{\|\mathbf{Z}\|}\right]
&= E\left[\frac{1}{\sqrt{\mathbf{Y}^T\mathbf{H}^T\mathbf{H}\mathbf{Y}}}\right] \\
&= E\left[\frac{1}{\sqrt{\mathbf{Y}^T\mathbf{H}\mathbf{Y}}}\right] \\
&= E\left[\frac{1}{\sqrt{\mathbf{Y}^T\mathbf{Y} - \frac{1}{N}\mathbf{Y}^T\mathbf{1}\mathbf{1}^T\mathbf{Y}}}\right] \\
&= E\left[\frac{1}{\sqrt{\sum_{j=1}^{N} S_j^2 - \frac{1}{N}\left(\sum_{j=1}^{N} S_j\right)^2}}\right] \\
&= E\left[\frac{\frac{1}{\sqrt{N}}}{\sqrt{\frac{1}{N}\sum_{j=1}^{N} S_j^2 - \left(\frac{1}{N}\sum_{j=1}^{N} S_j\right)^2}}\right],
\end{aligned} \tag{41}$$

and

$$E\left[\frac{Z_i}{\|\mathbf{Z}\|}\right] = E\left[\frac{\frac{1}{\sqrt{N}}\left(S_i - \frac{1}{N}\sum_{j=1}^{N} S_j\right)}{\sqrt{\frac{1}{N}\sum_{j=1}^{N} S_j^2 - \left(\frac{1}{N}\sum_{j=1}^{N} S_j\right)^2}}\right]. \tag{42}$$

In Eqs. (40), (41) and (42) two different sums of random variables occur,

$$\frac{1}{N}\sum_{j=1}^{N} S_j, \tag{43}$$

and

$$\frac{1}{N}\sum_{j=1}^{N} S_j^2. \tag{44}$$

Since, by assumption, Eq. (17), $S_j\ \forall j$ are independent random variables with finite variances (see footnote 3), according to Kolmogorov's strong law of large numbers [44], the sums given by Eqs. (43) and (44) will converge (almost surely, i.e. with probability (Pr) one) to their average means $\mu_S = \frac{1}{N}\sum_{j=1}^{N} E[S_j]$, and $\mu_{S^2} = \frac{1}{N}\sum_{j=1}^{N} E[S_j^2]$, respectively, as $N \to \infty$. Formally, we can express this as

$$\Pr\left(\lim_{N\to\infty}\frac{1}{N}\sum_{j=1}^{N} S_j = \mu_S\right) = 1, \tag{45}$$

and

$$\Pr\left(\lim_{N\to\infty}\frac{1}{N}\sum_{j=1}^{N} S_j^2 = \mu_{S^2}\right) = 1. \tag{46}$$

By substituting Eqs. (45) and (46) into Eqs. (40), (41) and (42), we arrive at

$$\lim_{N\to\infty} E[Z_i] = E[S_i] - \mu_S, \tag{47}$$

$$\lim_{N\to\infty} E\left[\frac{1}{\|\mathbf{Z}\|}\right] = \frac{\lim_{N\to\infty}\frac{1}{\sqrt{N}}}{\sqrt{\mu_{S^2} - \mu_S^2}}, \tag{48}$$

and

$$\begin{aligned}
\lim_{N\to\infty} E\left[\frac{Z_i}{\|\mathbf{Z}\|}\right]
&= \left(E[S_i] - \mu_S\right)\frac{\lim_{N\to\infty}\frac{1}{\sqrt{N}}}{\sqrt{\mu_{S^2} - \mu_S^2}} \\
&= \lim_{N\to\infty} E[Z_i]\,E\left[\frac{1}{\|\mathbf{Z}\|}\right],
\end{aligned} \tag{49}$$

where the last line follows from Eqs. (47) and (48). In words, as $N \to \infty$, the covariance between $Z_i$ and $1/\|\mathbf{Z}\|$ tends to zero and, consequently, the expectation in Eq. (16) factorizes into the product of expectations in Eq. (18).

(Footnote 3: Assuming a finite variance of $S_j$ is motivated by the fact that $S_j$ model speech signals, which always take finite values due to both physical and physiological limitations of sound and speech production systems, respectively.)

REFERENCES

[1] H. Erdogan, J. R. Hershey, S. Watanabe, and J. L. Roux, "Deep Recurrent Networks for Separation and Recognition of Single-Channel Speech in Nonstationary Background Audio," in New Era for Robust Speech Recognition. Springer, 2017, pp. 165–186.
[2] D. Wang, "Deep learning reinvents the hearing aid," IEEE Spectrum, vol. 54, no. 3, pp. 32–37, 2017.
[3] D. Wang and J. Chen, "Supervised Speech Separation Based on Deep Learning: An Overview," arXiv:1708.07524, 2017.
[4] M. Kim and P. Smaragdis, "Bitwise Neural Networks for Efficient Single-Channel Source Separation," in Proc. NIPS Machine Learning for Audio Signal Processing Workshop, 2017.
[5] R. Fakoor, X. He, I. Tashev, and S. Zarar, "Reinforcement Learning To Adapt Speech Enhancement to Instantaneous Input Signal Quality," in Proc. NIPS Machine Learning for Audio Signal Processing Workshop, 2017.
[6] J. Chen, Y. Wang, S. E. Yoho, D. Wang, and E. W. Healy, "Large-scale training to increase speech intelligibility for hearing-impaired listeners in novel noises," J. Acoust. Soc. Am., vol. 139, no. 5, pp. 2604–2612, 2016.
[7] E. W. Healy, M. Delfarah, J. L. Vasko, B. L. Carter, and D. Wang, "An algorithm to increase intelligibility for hearing-impaired listeners in the presence of a competing talker," J. Acoust. Soc. Am., vol. 141, no. 6, pp. 4230–4239, 2017.
[8] M. Kolbæk, Z. H. Tan, and J. Jensen, "Speech Intelligibility Potential of General and Specialized Deep Neural Network Based Speech Enhancement Systems," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 25, no. 1, pp. 153–167, 2017.
[9] J. Schnupp, E. Nelken, and A. King, Auditory Neuroscience - Making Sense of Sound. MIT Press, 2011.
[10] B. Moore, An Introduction to the Psychology of Hearing. Brill, 2013.
[11] R. D. Patterson, K. Robinson, J. Holdsworth, D. Mckeown, C. Zhang, and M. Allerhand, "Complex sounds and auditory images," in Proc. International Symposium on Hearing, 1992, pp. 429–446.
[12] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, "An Algorithm for Intelligibility Prediction of Time-Frequency Weighted Noisy Speech," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 19, no. 7, pp. 2125–2136, 2011.
[13] T. M. Elliott and F. E. Theunissen, "The Modulation Transfer Function for Speech Intelligibility," PLOS Computational Biology, vol. 5, no. 3, 2009.
[14] R. Drullman, J. M. Festen, and R. Plomp, "Effect of temporal envelope smearing on speech reception," J. Acoust. Soc. Am., vol. 95, no. 2, pp. 1053–1064, 1994.
[15] J. S. Lim and A. V. Oppenheim, "Enhancement and bandwidth compression of noisy speech," Proceedings of the IEEE, vol. 67, no. 12, pp. 1586–1604, 1979.
[16] E. W. Healy, S. E. Yoho, J. Chen, Y. Wang, and D. Wang, "An algorithm to increase speech intelligibility for hearing-impaired listeners in novel segments of the same noise type," J. Acoust. Soc. Am., vol. 138, no. 3, pp. 1660–1669, 2015.
[17] P. C. Loizou, "Speech Enhancement Based on Perceptually Motivated Bayesian Estimators of the Magnitude Spectrum," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 13, no. 5, pp. 857–869, 2005.
[18] R. C. Hendriks, T. Gerkmann, and J. Jensen, "DFT-Domain Based Single-Microphone Noise Reduction for Speech Enhancement: A Survey of the State of the Art," Synth. Lect. on Speech and Audio Process., vol. 9, no. 1, pp. 1–80, 2013.
[19] L. Lightburn and M. Brookes, "SOBM - a binary mask for noisy speech that optimises an objective intelligibility metric," in Proc. ICASSP, 2015, pp. 5078–5082.
[20] W. Han, X. Zhang, G. Min, X. Zhou, and W. Zhang, "Perceptual weighting deep neural networks for single-channel speech enhancement," in Proc. WCICA, 2016, pp. 446–450.
[21] P. G. Shivakumar and P. Georgiou, "Perception Optimized Deep Denoising AutoEncoders for Speech Enhancement," in Proc. INTERSPEECH, 2016, pp. 3743–3747.
[22] Y. Koizumi, K. Niwa, Y. Hioka, K. Kobayashi, and Y. Haneda, "DNN-based source enhancement self-optimized by reinforcement learning using sound quality measurements," in Proc. ICASSP, 2017, pp. 81–85.
[23] M. Kolbæk, Z.-H. Tan, and J. Jensen, "Monaural Speech Enhancement using Deep Neural Networks by Maximizing a Short-Time Objective Intelligibility Measure," in Proc. ICASSP, 2018, pp. 5059–5063.
[24] Y. Zhao, B. Xu, R. Giri, and T. Zhang, "Perceptually Guided Speech Enhancement using Deep Neural Networks," in Proc. ICASSP, 2018, pp. 5074–5078.
[25] H. Zhang, X. Zhang, and G. Gao, "Training Supervised Speech Separation System to Improve STOI and PESQ Directly," in Proc. ICASSP, 2018, pp. 5374–5378.
[26] S. W. Fu, T. W. Wang, Y. Tsao, X. Lu, and H. Kawai, "End-to-End Waveform Utterance Enhancement for Direct Evaluation Metrics Optimization by Fully Convolutional Neural Networks," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 26, no. 9, pp. 1570–1584, 2018.
[27] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, "Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs," in Proc. ICASSP, vol. 2, 2001, pp. 749–752.
[28] S. Jørgensen, J. Cubick, and T. Dau, "Speech Intelligibility Evaluation for Mobile Phones," Acta Acustica united with Acustica, vol. 101, pp. 1016–1025, 2015.
[29] J. Jensen and C. H. Taal, "An Algorithm for Predicting the Intelligibility of Speech Masked by Modulated Noise Maskers," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 24, no. 11, pp. 2009–2022, 2016.
[30] J. Jensen and C. H. Taal, "Speech Intelligibility Prediction Based on Mutual Information," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 22, no. 2, pp. 430–440, 2014.
[31] T. H. Falk et al., "Objective Quality and Intelligibility Prediction for Users of Assistive Listening Devices: Advantages and limitations of existing tools," IEEE Sig. Process. Mag., vol. 32, no. 2, pp. 114–124, 2015.
[32] R. Xia, J. Li, M. Akagi, and Y. Yan, "Evaluation of objective intelligibility prediction measures for noise-reduced signals in mandarin," in Proc. ICASSP, 2012, pp. 4465–4468.
[33] P. C. Loizou, Speech Enhancement: Theory and Practice. CRC Press.
[34] Y. Ephraim and D. Malah, "Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator," IEEE Trans. Acoust., Speech, and Sig. Process., vol. 32, no. 6, pp. 1109–1121, 1984.
[35] G. Kim, Y. Lu, Y. Hu, and P. C. Loizou, "An algorithm that improves speech intelligibility in noise for normal-hearing listeners," J. Acoust. Soc. Am., vol. 126, no. 3, pp. 1486–1494, 2009.
[36] K. Han and D. Wang, "A classification based approach to speech segregation," J. Acoust. Soc. Am., vol. 132, no. 5, pp. 3475–3483, 2012.
[37] J. Allen, "Short term spectral analysis, synthesis, and modification by discrete Fourier transform," IEEE Trans. Acoust., Speech, and Sig. Process., vol. 25, no. 3, pp. 235–238, 1977.
[38] C. H. Taal, R. C. Hendriks, and R. Heusdens, "Matching pursuit for channel selection in cochlear implants based on an intelligibility metric," in Proc. EUSIPCO, 2012, pp. 504–508.
[39] A. H. Andersen, J. M. d. Haan, Z. H. Tan, and J. Jensen, "Predicting the Intelligibility of Noisy and Nonlinearly Processed Binaural Speech," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 24, no. 11, pp. 1908–1920, 2016.
[40] S. M. Kay, Fundamentals of Statistical Signal Processing: Estimation Theory. Prentice Hall, 2010.
[41] Y. Ephraim and D. Malah, "Speech enhancement using a minimum mean-square error log-spectral amplitude estimator," IEEE Trans. Acoust., Speech, and Sig. Process., vol. 33, no. 2, pp. 443–445, 1985.
[42] J. S. Erkelens, R. C. Hendriks, R. Heusdens, and J. Jensen, "Minimum Mean-Square Error Estimation of Discrete Fourier Coefficients With Generalized Gamma Priors," IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 15, no. 6, pp. 1741–1752, 2007.
[43] R. McAulay and M. Malpass, "Speech enhancement using a soft-decision noise suppression filter," IEEE Trans. Acoust., Speech, and Sig. Process., vol. 28, no. 2, pp. 137–145, 1980.
[44] P. K. Sen and J. M. Singer, Large Sample Methods in Statistics: An Introduction with Applications. Chapman & Hall, 1994.
[45] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016.
[46] K. Hornik, M. Stinchcombe, and H. White, "Multilayer feedforward networks are universal approximators," Neural Networks, vol. 2, no. 5, pp. 359–366, 1989.
[47] J. Garofolo, D. Graff, P. Doug, and D. Pallett, "CSR-I (WSJ0) Complete LDC93s6a," 1993, Philadelphia: Linguistic Data Consortium.
[48] M. Kolbæk, Z.-H. Tan, and J. Jensen, "Supplemental Material." [Online].
[51] ITU, "Rec. P.56 : Objective measurement of active speech level," 1993, https://www.itu.int/rec/T-REC-P.56/.
[52] D. P. Kingma and J. Ba, "Adam: A Method for Stochastic Optimization," in Proc. ICLR (arXiv:1412.6980), 2014.
[53] A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht, "The Marginal Value of Adaptive Gradient Methods in Machine Learning," in Proc. NIPS, 2017.
[54] A. Agarwal et al., "An introduction to computational networks and the computational network toolkit," Microsoft Technical Report MSR-TR-2014-112, Tech. Rep., 2014.
[55] S. Nawab, T. Quatieri, and J. Lim, "Signal reconstruction from short-time Fourier transform magnitude," IEEE Trans. Acoust., Speech, and Sig. Process., vol. 31, no. 4, pp. 986–998, 1983.
[56] D. Griffin and J. Lim, "Signal estimation from modified short-time Fourier transform," IEEE Trans. Acoust., Speech, and Sig. Process., vol. 32, no. 2, pp. 236–243, 1984.
[57] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge University Press, 2004.

Morten Kolbæk received the B.Eng. degree in electronic design at Aarhus University, Business and Social Sciences, AU Herning, Denmark, in 2013 and the M.Sc. in signal processing and computing from Aalborg University, Denmark, in 2015. He is currently pursuing his PhD degree at the section for Signal and Information Processing at the Department of Electronic Systems, Aalborg University, Denmark. His research interests include speech enhancement and separation, deep learning, and intelligibility improvement of noisy speech.

Zheng-Hua Tan (M'00–SM'06) received the B.Sc. and M.Sc. degrees in electrical engineering from Hunan University, Changsha, China, in 1990 and 1996, respectively, and the Ph.D. degree in electronic engineering from Shanghai Jiao Tong University, Shanghai, China, in 1999. He is a Professor and a Co-Head of the Centre for Acoustic Signal Processing Research (CASPR) at Aalborg University, Aalborg, Denmark. He was a Visiting Scientist at the Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, USA, an Associate Professor at Shanghai Jiao Tong University, and a postdoctoral fellow at KAIST, Daejeon, Korea. His research interests include machine learning, deep learning, pattern recognition, speech and speaker recognition, noise-robust speech processing, multimodal signal processing, and social robotics. He is a member of the IEEE Signal Processing Society Machine Learning for Signal Processing Technical Committee (MLSP TC). He is an Editorial Board Member for Computer Speech and Language and was a Guest Editor for the IEEE Journal of Selected Topics in Signal Processing and Neurocomputing. He was the General Chair for IEEE MLSP 2018 and a TPC co-chair for IEEE SLT 2016.

Jesper Jensen received the M.Sc. degree in electrical engineering and the Ph.D. degree in signal processing from Aalborg University, Aalborg, Denmark, in 1996 and 2000, respectively. From 1996 to 2000, he was with the Center for Person Kommunikation (CPK), Aalborg University, as a Ph.D. student and Assistant Research Professor. From 2000 to 2007, he was a Post-Doctoral Researcher and Assistant Professor with Delft University of Technology, Delft, The Netherlands, and an External Associate Professor with Aalborg University. Currently, he is a Senior Principal Scientist with Oticon A/S, Copenhagen, Denmark, where his main responsibility is scouting and development of new signal processing concepts for hearing aid applications.
He is a Professor with the Section for Signal Available: http://kom.aau.dk/ mok/taslp2018 and Information Processing (SIP), Department of Electronic Systems, at [49] J. Barker, R. Marxer, E. Vincent, and S. Watanabe, “The third ‘CHiME’ Aalborg University. He is also a co-founder of the Centre for Acoustic Signal speech separation and recognition challenge: Dataset, task and baselines,” Processing Research (CASPR) at Aalborg University. His main interests are in Proc. ASRU, 2015, pp. 504–511. in the area of acoustic signal processing, including signal retrieval from noisy [50] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and observations, coding, speech and audio modiﬁcation and synthesis, intelligibility N. L. Dahlgren, “DARPA TIMIT Acoustic Phonetic Continuous Speech enhancement of speech signals, signal processing for hearing aid applications, Corpus CDROM,” 1993. and perceptual aspects of signal processing.
Electrical Engineering and Systems Science – arXiv (Cornell University)
Published: Jun 21, 2018