Get 20M+ Full-Text Papers For Less Than $1.50/day. Subscribe now for You or Your Team.

Learn More →

On the Relationship Between Short-Time Objective Intelligibility and Short-Time Spectral-Amplitude Mean-Square Error for Speech Enhancement

On the Relationship Between Short-Time Objective Intelligibility and Short-Time... On the Relationship Between Short-Time Objective Intelligibility and Short-Time Spectral-Amplitude Mean-Square Error for Speech Enhancement Morten Kolbæk, Zheng-Hua Tan, Senior Member, IEEE, and Jesper Jensen Abstract—The majority of deep neural network (DNN) based algorithms using e.g. a Gammatone filter bank [11] or a one- speech enhancement algorithms rely on the mean-square er- third octave band filter bank [12]. It is also well known that ror (MSE) criterion of short-time spectral amplitudes (STSA), preservation of modulation frequencies in the range 4-20 Hz which has no apparent link to human perception, e.g. speech are critical for speech intelligibility [9], [13], [14]. Therefore, intelligibility. Short-Time Objective Intelligibility (STOI), a pop- it is natural to believe that, if prior knowledge about the human ular state-of-the-art speech intelligibility estimator, on the other hand, relies on linear correlation of speech temporal envelopes. auditory system is incorporated into a speech enhancement This raises the question if a DNN training criterion based on algorithm, improvements in speech intelligibility or speech envelope linear correlation (ELC) can lead to improved speech quality can be achieved [15]. intelligibility performance of DNN based speech enhancement Indeed, numerous works exist that attempt to incorporate algorithms compared to algorithms based on the STSA-MSE such knowledge (e.g. [16]–[26] and references therein). In criterion. In this paper we derive that, under certain general conditions, the STSA-MSE and ELC criteria are practically [16] a transform-domain method based on a Gammatone filter equivalent, and we provide empirical data to support our the- bank was used, which incorporates a non-linear frequency oretical results. Furthermore, our experimental findings suggest resolution mimicking that of the human auditory system. In that the standard STSA minimum-MSE estimator is near optimal, [17] different perceptually motivated cost functions were used if the objective is to enhance noisy speech in a manner which is to derive STSA clean speech spectrum estimators in order optimal with respect to the STOI speech intelligibility estimator. to emphasize spectral peak information, account for auditory masking or penalize spectral over-attenuation. In [20], [21] Index Terms—Speech enhancement, Speech intelligibility, Deep similar goals were pursued, but instead of using classical neural networks, Minimum mean-square error estimator. statistically-based models, DNNs were used. Finally, in [22] a deep reinforcement learning technique was used to reward I. I NTRODUCTION solutions that achieved a large score in terms of perceptual evaluation of speech quality (PESQ) [27], a commonly used ESPITE the recent success of deep neural network (DNN) speech quality estimator. based speech enhancement algorithms [1]–[5], it is yet Although the works in e.g. [16], [17], [21], [22] include unknown if these algorithms are optimal in terms of aspects knowledge about the human auditory system the techniques related to human auditory perception, e.g. speech intelligibility, since existing algorithms do not directly optimize criteria are not designed specifically to maximize speech intelligibility. designed with human auditory perception in mind. While speech processing methods that improve speech intel- ligibility would be of vital importance for applications such Many current state-of-the-art DNN based speech enhance- as mobile communications, or hearing assistive devices, only ment algorithms use a mean squared error (MSE) training very little research has been performed to understand if DNN- criterion [6]–[8] on short-time spectral amplitudes (STSA). This, however, might not be the optimal training criterion based speech enhancement systems can help improve speech if the target is the human auditory system, and improvement in intelligibility. Very recent work [23]–[26] has investigated if speech intelligibility or speech quality is the desired objective. DNNs trained to maximize a state-of-the-art speech intelligibil- ity estimator are capable of improving speech intelligibility as It is well known that the frequency sensitivity of the human auditory system is non-linear ( e.g. [9], [10]) and, as a measured by the estimator [23]–[25] or human listeners [26]. consequence, is often approximated in digital signal processing Specifically, DNNs were trained to maximize the short-time objective intelligibility (STOI) [12] estimator and were then Manuscript received month day, year; revised month day, year; accepted compared, in terms of STOI, with DNNs trained to minimize month day, year. Date of publication month day, year; date of current version the classical STSA-MSE criterion. Surprisingly, although all Month day, year. This research was partly funded by the Oticon Foundation. The associate editor coordinating the review of this manuscript and approving DNNs improved STOI, the DNNs trained to maximize STOI it for publication was xxyyzz xxyyzz. showed none or only very modest improvements in STOI M. Kolbæk and Z.-H. Tan are with the Department of Electronic Sys- compared to the DNNs trained with the classical STSA-MSE tems, Aalborg University, Aalborg 9220, Denmark (e-mail: mok@es.aau.dk; zt@es.aau.dk). criterion [23]–[26]. J. Jensen is with the Department of Electronic Systems, Aalborg University, The STOI speech intelligibility estimator has proven to Aalborg 9220, Denmark, and also with Oticon A/S, Smørum 2765, Denmark be able to quite accurately predict the intelligibility of (e-mail: jje@es.aau.dk; jesj@oticon.com). Digital Object Identifier 00.0000/TASLP.2018.0000000 noisy/processed speech in a large range of acoustic scenar- arXiv:1806.08404v2 [cs.SD] 4 Dec 2018 2 where v[n] is a sample of additive noise. Furthermore, let r(k; m) a(k; m) and r(k; m), k = 1; : : : ; + 1, m = 1; : : : M; denote the single-sided magnitude spectra of the K -point short-time discrete Fourier transform (STFT) of x[n] and y[n] g^(k; m) a^(k; m) x^[n] T-F Gain T-F y[n], respectively, where M is the number of STFT frames. Analysis Estimator Synthesis Also, let a ^(k; m) denote an estimate of a(k; m) obtained as a ^(k; m) = g ^(k; m)r(k; m). Here, g ^(k; m) is a scalar gain factor applied to the magnitude spectrum of the noisy speech (k; m) y r(k; m) to arrive at an estimate a ^(k; m) of the clean speech magnitude spectrum a(k; m). It is the goal of many STFT- Fig. 1. Classical gain-based speech enhancement system. The noisy time- based speech enhancement systems to find appropriate values domain signal y[n] = x[n]+v[n] is first decomposed into a time-frequency (T- F) representation r(k; m) for time-frame m and frequency index k. An for g ^(k; m) based on the available noisy signal y[n]. The gain estimator, e.g. a DNN, estimates a gain g^(k; m) that is applied to the noisy factor g ^(k; m) is typically estimated using either statistical short-term magnitude spectrum r(k; m) to arrive at an enhanced signal model-based methods such as classical STSA minimum mean- magnitude a^(k; m) = g^(k; m)r(k; m). Finally, the enhanced time-domain signal x^[n] is obtained from a T-F synthesis stage using the phase of the noisy square error (MMSE) estimators [34], [18], [33], or machine signal  (k; m). learning based techniques such as Gaussian mixture models [35], support vector machines [36], or, more recently, DNNs [6]–[8], [16]. For reconstructing the enhanced speech signal in ios, including speech processed by mobile communication the time domain, it is common practice to append the short-time devices [28], ideal time-frequency weighted noisy speech [12], phase spectrum of the noisy signal to the estimated short-time noisy speech enhanced by single-microphone time-frequency magnitude spectrum and then use the overlap-and-add technique weighting-based speech enhancement systems [12], [29], [30], [37], [33]. and speech processed by hearing assistive devices such as cochlear implants [31]. STOI has also been shown to be robust III. S HORT-T IME OBJECTIVE I NTELLIGIBILITY (STOI) to variations in language types, including Danish [12], Dutch [30], and Mandarin [32]. Finally, recent studies e.g. [6], [7] In the following, we shortly review the STOI intelligibility also show a good correspondence between STOI predictions estimator [12]. For further details we refer to [12]. Let the jth of noisy speech enhanced by DNN-based speech enhancement one-third octave band clean-speech amplitude, for time-frame systems, and speech intelligibility. As a consequence, STOI m, be defined as is currently the, perhaps, most commonly used speech intelli- k (j) u 2 gibility estimator for objectively evaluating the performance a (m) = t a(k; m) ; (2) of speech enhancement systems [6]–[8], [16]. Therefore, it k=k (j) is natural to believe that gains in speech intelligibility, as estimated by STOI, can be achieved by utilizing an optimality where k (j) and k (j) denote the first and last STFT bin index, 1 2 criterion based on STOI as opposed to the classical criterion respectively, of the jth one-third octave band. Furthermore, let based on STSA-MSE. a short-time temporal envelope vector that spans time-frames In this paper we study the potential gain in speech in- m N + 1; : : : ; m, for the clean speech signal be defined as telligibility that can be achieved, if a DNN is designed to perform optimally with respect to the STOI speech intelligibility a = [a (m N + 1); a (m N + 2); : : : ; a (m)] (3) j j j j;m estimator. We derive that, under certain general conditions, In a similar manner we define a ^ and r for the enhanced j;m j;m maximizing an approximate-STOI criterion is equivalent to speech signal and the noisy observation, respectively. minimizing a STSA-MSE criterion. Furthermore, we present The parameter N defines the length of the temporal envelope empirical data using simulation studies with DNNs applied to and for STOI N = 30 , which for the STFT settings used in noisy speech signals, that support our theoretical results. Finally, this study, as well as in [12], corresponds to approximately we show theoretically under which conditions the equality 384 ms. Finally, the STOI speech intelligibility estimator for between the approximate-STOI criterion and the STSA-MSE a pair of short-time temporal envelope vectors can then be criterion holds for practical systems. Our results are in line approximated by the sample envelope linear correlation (ELC) with recent empirical work and might explain the somewhat between the clean and enhanced envelope vectors a and j;m surprising result in [23]–[26], where none or only very modest a ^ given as j;m improvements in STOI were achieved with STOI optimal DNNs compared to MSE optimal DNNs. a  a ^ j;m j;m j;m a ^ j;m L(a ; a ^ ) = ; (4) j;m j;m II. STFT-DOMAIN BASED SPEECH ENHANCEMENT a  a ^ j;m a j;m j;m a ^ j;m Fig. 1 shows a block-diagram of a classical gain-based where kk denotes the Euclidean ` -norm and  and speech enhancement system [18], [33]. Let x[n] be the nth j;m denote the sample means of a and a ^ , respectively. j;m j;m sample of the clean time-domain speech signal and let a noisy a ^ j;m observation y[n] be given by With N = 30, STOI is sensitive to temporal modulations of 2:6 Hz and y[n] = x[n] + v[n]; (1) higher, which are frequencies important for speech intelligibility [12]. 3 Note that Eq. (4) is an approximation, since the clipping and be a random envelope vector representing an estimate of A (m). normalization steps otherwise used in STOI, have been omitted. Now, the contribution of A (m) to speech intelligibility may This has empirically been found not to have any significant be approximated as the ELC between the envelope vectors effect on intelligibility prediction performance in most cases A (m) and A (m). In the following, the indices j and m are j j [19], [29], [38], [39]. Furthermore, since the normalization omitted for convenience. Let 1 denote a vector of ones, and 1 T step is applied for the entire vector a ^ , the normalization let  = 1 A1 be a vector, whose entries equal the sample j;m procedure itself does not influence the final STOI score. Also, mean of the entries in A. Let  be defined in a similar as clipping only occurs for time-frequency units for which the manner. Finally, let the ELC between A and A, which is a signal-to-distortion ratio (see Eq. (4) in [12]) is below 15 dB, random variable, be defined as clipping only occurs for a minority of the envelope vectors and approximating STOI with ELC is well valid, or even exact, in A  A A ^ most cases, when evaluating speech signals at practical SNRs. A; A , ; (9) FromL(a ; a ^ ), the final STOI score for an entire speech j;m j;m A  A signal is then defined as [12] the scalar, 1  d  1, and the expected ELC as J M X X h  i d = L(a ; a ^ ); (5) j;m j;m ^ = E  A; A ELC J (M N + 1) A;R j=1 m=N Z Z =  (a; a ^) f (a; r) da dr where J is the number of one-third octave bands and MN +1 A;R (10) Z Z is the total number of short-time temporal envelope vectors. =  (a; a ^) f (ajr) da f (r) dr: Similarly to [12], we use J = 15 with a center frequency AjR R | {z } of the first one-third octave band at 150 Hz and the last at (r) approximately 3.8 kHz to ensure a frequency range that covers the majority of the spectral information of human speech. The Here, a ^ is related to r via a deterministic map, e.g. a STOI score in general has been shown to often have high DNN, and f (a; r) denotes the joint probability density A;R correlation with listening tests involving human test subjects, function (PDF) of clean and noisy/processed one-third octave i.e. the higher numerical value of Eq. (5), the more intelligible band envelope vectors. Furthermore, f (ajr) and f (r) AjR R is the speech signal. denote a conditional and marginal PDF, respectively. Since STOI, as approximated by Eq. (5), is a sum of ELC An optimal estimator can be found by minimizing the Bayes values as given by Eq. (4), maximizing Eq. (4) will also risk [33], [40], which is equivalent to maximizing Eq. (10), maximize the overall STOI score in Eq. (5). As a consequence, hence arriving at the MMELC estimator, which we denote in order to find an estimate x ^[n] of x[n] so that STOI is as a ^ . To do so, observe that for a particular noisy MMELC maximized, one can focus on finding optimal estimates of observation r maximizing (r) maximizes Eq. (10), since the individual short-time temporal envelope vectors a . j;m f (r)  0 8 r. In other words, our goal is to maximize (r) Therefore, we define a ^ = diag(g ^ )r as the short-time j;m j;m for each and every r. Hence, for a particular observation, r, j;m temporal one-third octave band envelope vector of the enhanced the MMELC estimate is given by speech signal, where g ^ is an estimated gain vector and j;m a ^ = arg max  (a; a ^) f (ajr) da diag(g ^ ) is a diagonal matrix with the elements of g ^ on MMELC AjR j;m j;m a ^ the main diagonal. a  a ^ a a ^ = arg max f (ajr) da AjR IV. E NVELOPE LINEAR CORRELATION ESTIMATOR a  a ^ a ^ a a ^ We now introduce the approximate-STOI criterion in a a  a ^ a a ^ stochastic context and derive the speech envelope estimator that = arg max f (ajr) da AjR a  a ^ a ^ a a ^ maximizes it. We denote this estimator as the maximum mean | {z }| {z } envelope linear correlation (MMELC) estimator. Let A (m) E e(A) e(a ^) [ ] Ajr and R (m) denote random variables representing a clean and a noisy, respectively, one-third octave band magnitude, for band = arg max E e(A) e(a ^); Ajr a ^ j and time frame m. Furthermore, let (11) A (m) = [A (m N + 1); : : : A (m)] (6) j j where e() is a function that normalizes its vector argument to zero sample mean and unit norm and where we used that and for a given noisy observation r, a ^ is deterministic. Note that R (m) = [R (m N + 1); : : : R (m)] (7) j j the solution to Eq. (11) is non-unique. For one given solution, say a ^ , any affine transformation, a ^ + 1 8 ; 2 R, is be the stack of these random variables in random envelope also a solution, because any such transformation is undone by vectors. Finally, in a similar manner, let e(). Hence, in the following we focus on finding one such h i ^ ^ ^ A (m) = A (m N + 1); : : : A (m) ; (8) particular solution, namely the zero sample mean, unit norm j j j 4 solution, i.e. the vector e(a ^) that maximizes the inner product This is a standard assumption in the area of speech enhance- with the vector E [e(Ajr)]. To do so, let = E [e(Ajr)], ment, when operating in the STFT domain and has been Ajr Ajr the underlying assumption of a very large number of speech and let e(a ^ ) denote the zero sample mean, unit norm vector enhancement methods (see e.g. [18], [33], [34], [41], [42] and that maximizes Eq. (11). Then, using the method of Lagrange references therein). The conditional independence assumption multipliers, it can be shown (see Appendix A) that the MMELC is, for example, valid, when speech and noise STFT coefficients estimator is given by may be assumed statistically independent across time and a ^ = e(a ^ ) MMELC frequency and mutually independent [33], [34], [43]. Using Kolmogorovs strong law of large numbers [44, pp. (12) 67-68] and the conditional independence assumption, it can be shown (see Appendix B) that asymptotically, as N ! 1, the = ; expectation in Eq. (16) factorizes as k k " # which is nothing more than the vector , normalized to unit 1 T lim = lim E E [Z ] : (18) norm. The fact that  = 1 1 = 0 follows from Eq. (11), Ajr Ajr N!1 N!1 Z where it is seen that = E [e(Ajr)] is an expectation over Ajr Combining this result with Eq. (12) leads to vectors (a  ) a  whose sample mean is zero. By a a interpreting the expectation as an infinite linear combination lim a ^ = lim MMELC of such vectors, it follows that  = 0. N!1 N!1 h i E E [Z ] V. R ELATION TO STSA-MMSE E STIMATORS Ajr Ajr kZk h i = lim We now show that the MMELC estimator, Eq. (12), is N!1 1 E E [Z ] Ajr Ajr kZk asymptotically equivalent to the one-third octave band STSA- h i (19) MMSE estimator for large envelope lengths, i.e. as N ! 1. E E [Z ] Ajr Ajr kZk The STSA-MSE (e.g. [34]) is defined as h i = lim N!1 1 E E [Z ] 2 Ajr Ajr kZk = E A A : (13) MSE A;R E [Z ] Ajr = lim : N!1 It can be shown (e.g. [18], [33], [34]) that the optimal E [Z ] Ajr Bayesian estimator with respect to Eq. (13), is the STSA- MMSE estimator given by the conditional mean defined as Since Eq. (11) is invariant to affine transformations of its input arguments, we can scale a ^ with the scalar quantity MMELC a ^ = a f (ajr) da MMSE AjR kE [Z ]k in Eq. (19) to arrive at Ajr (14) = E [Ajr] : Ajr lim a ^ = E [Z ] : MMELC (20) Ajr N!1 To show that a ^ is asymptotically equivalent to a ^ , MMELC MMSE let us introduce the idempotent, symmetric matrix Finally, as N ! 1, the MMELC estimator a ^ is given MMELC by H = I 11 ; (15) lim a ^ = E [Z ] MMELC Ajr N!1 where I denotes the N -dimensional identity matrix. We can = E HAjr then rewrite the vector as Ajr " # a  1 a T = E I 11 Ajr = f (ajr) da Ajr AjR a ^ a ^ (21) Ha = E Ajr 11 Ajr = f (ajr) da Ajr AjR Ha " # (16) 1 HAjr = E Ajr 11 E Ajr Ajr Ajr = E Ajr HAjr = a ^  : MMSE a ^ " # MMSE In words, the MMELC estimator, a ^ , is (asymptotically = E ; MMELC Ajr in N ) an affine transformation of the STSA-MMSE estimator a ^ . In practice, this means that using the STSA-MMSE where Ajr is a random vector, and we introduced the notation MMSE estimator leads to the same approximate-STOI criterion value Z , HAjr. We now employ the following conditional as the estimator, a ^ , derived to maximize this criterion. independence assumption MMELC In other words, applying the traditional STSA-MMSE estimator leads to maximum speech intelligibility as reflected f (ajr) = f (a jr ): (17) A jR =r j j j j j AjR by the approximate STOI estimator. j=1 5 VI. EXPERIM ENTAL DESIGN A. Noise-free Speech Mixtures We have used the Wall Street Journal (WSJ0) speech corpus We now investigate empirically the relationship between [47] as the clean speech data for both the training set, validation the MMELC estimator in Eq. (14) and the STSA-MMSE set, and test set. Specifically, the noise-free utterances used for estimator in Eq. (11) using an experimental study. As defined training and validation are generated by randomly selecting in Eq. (11), the MMELC estimator is the vector that maximizes utterances from 44 male and 47 female speakers from the WSJ0 the expectation of the ELC cost function given by Eq. (10). This training set entitled si tr s. In total 20000 utterances are used expectation, Eq. (10), is defined via an integral of  (a; a ^) for for the training set and 2000 are used for the validation set, various realizations of a and a ^, and weighted by the joint PDF which adds up to approximately 37 hours of training data and 4 f (a; r). It is however, well known, that the integral may be A;R hours of validation data. For the test set, we have used a similar approximated (arbitrarily well) as a sum of  (a; a ^) terms, where approach and sampled 1000 utterances among 16 speakers (10 realizations of a and a ^ are drawn according to f (a; r). A;R males and 6 females) from the WSJ0 validation set si dt 05 and This is similar to what a DNN approximates during a standard evaluation set si et 05, which is equivalent to approximately 2 training process, where a gradient based optimization technique hours of data, see [48] for further details. The speakers used in is used to minimize the cost on a representative training set the training and validation sets are different than the speakers [45]. Therefore, training a DNN, e.g. using stochastic gradient used for test, i.e. we test in a speaker independent setting. ascent, to maximize Eq. (4) may be seen as an approximation Finally, since WSJ0 utterances primarily include speech active of Eq. (11), where the approximation becomes more accurate regions we do not apply a VAD. This is motivated by the fact with increasing training set size. that noise-only regions are irrelevant for STOI, as these are From the theoretical results presented in Sec. V, we would discarded by an ideal VAD in the STOI front-end [12]. therefore expect that, for some sufficiently large N , one would obtain equality in an ELC sense, between a DNN trained to B. Noise Types maximize an ELC cost function and one that is trained to minimize the classical STSA-MSE cost function. To validate To simulate a wide variety of sound scenes we have used this expectation we follow the techniques formalized in Secs. II six different noise types in our experiments: two synthetic and III and train DNNs to estimate gain vectors, g ^ , that noise signals and four natural noise signals, which are real-life j;m we apply to noisy one-third octave band magnitude envelope recordings of naturally occurring sound scenes. For the two signals r , to arrive at enhanced signals a ^ . synthetic noise signals, we use a stationary speech shaped j;m j;m In principle, any supervised learning model would be noise (SSN) signal and a highly non-stationary 6-speaker applicable for these experiments but considering the universal babble (BBL) noise. For the naturally occurring noise signals, function approximation capability of DNNs [46], this is our we use the street (STR), cafeteria (CAF), bus (BUS), and model of choice. We use short-time temporal one-third octave pedestrian (PED) noise signals from the CHiME3 dataset [49]. band envelope vectors, as defined in Eq. (3), and train multiple The SSN noise signal is Gaussian white noise, spectrally shaped DNNs, one for each of the J = 15 one-third octave bands, according to the long-term spectrum of the entire TIMIT speech for various N , to investigate if for sufficiently large N , DNNs corpus [50]. Similarly, the BBL noise signal is constructed by mixing utterances from both genders from TIMIT. To ensure trained with a STSA-MSE cost function approach the ELC that all noise types are equally represented and with unique values of DNNs trained with a cost function based on ELC. realizations in the training, validation and test sets, all six noise We construct two types of enhancement systems, one type is signals are split into non-overlapping segments such that 40 trained using the STSA-MSE cost function, denoted as ES , MSE min. is used for training, 5 min. is used for validation and and one that is trained using the ELC cost function denoted as another 5 min. is used for test. ES . Each of the systems consists of J = 15 DNNs, each ELC estimating a gain vector g ^ for a particular one-third octave j;m band directly from the STFT magnitudes of the noisy signal C. Noisy Speech Mixtures r(k; m), with the input context given by k = 1; : : : ; + 1, To construct the noisy speech signals used for training, we m N + 1 : : : ; m. This ensures that all DNNs have access to follow Eq. (1) and combine a noise-free training utterance x[n] the same information for a particular value of N , as they all with a randomly selected noise sequence v[n], of equal length, receive the same input data. Furthermore, we follow common from the training noise signal. We scale the noise signal v[n], practice (e.g. [6], [7], [16], [23]) and average overlapping to achieve a certain signal-to-noise ratio (SNR), according to estimated gain values, within a one-third octave band, during the active speech level of x[n] as defined by ITU P.56 [51]. enhancement. We found during a preliminary study that this For the training and validation sets, the SNRs are chosen technique consistently lead to slightly larger STOI scores for uniformly from [5; 10] dB to ensure that the intelligibility both types of systems. of the noisy speech mixtures y[n] ranges from degraded to To compute the STFT coefficients for all signals we use a perfectly intelligible. 10 kHz sample frequency and a K = 256 point STFT with a Hann-window size of 256 samples (25.6 ms) and a 128 D. Model Architecture and Training sample frame shift (12.8 ms). These coefficients are then used to compute one-third octave band envelopes for the clean and The two types of enhancement systems, ES and ES , ELC MSE noisy signals using Eq. (3). each consist of 15 feed-forward DNNs. The DNNs in the 6 ES system are trained with the ELC cost function intro- Note, since L(a; a ^) is invariant to the magnitude of ka ^k (see ELC duced in Eq. (4) and the DNNs in the ES system are Eq. (4)), and a and N are constants during training, the gradient MSE trained using the well-known STSA-MSE cost function given norm of the ELC cost function, Eq. (23), with respect to a ^, is by inversely proportional to the gradient norm of the STSA-MSE cost function, Eq. (27). This suggests that the two cost functions J (a; a ^) = ka a ^k ; (22) have different optimal learning rates. This observation might where the subscripts j and m are omitted for convenience. partly explain why equality with respect to STOI between STOI We train both the ES and ES systems with 20000 ELC MSE optimal and STSA-MSE optimal DNNs were achieved in [23] training utterances and 2000 validation utterances and both but not in [24]–[26], as [23] was the only study that explicitly data sets have been mixed uniformly with the SSN, BBL, CAF, stated that different learning rates for the two cost functions and STR noise signals, which ensures that each noise type were used. In fact, in [24]–[26] the optimization method Adam have been mixed with 25% of the utterances in the training [52] was used, and although Adam is an adaptive gradient and validation sets. During test, we evaluate each system with method, it still has several critical hyper-parameters that can one noise type at a time, i.e. each system is evaluated with influence convergence [53]. 1000 noisy test utterances per noise type, and since BUS and During a preliminary grid-search using the validation set PED are not included in the training and validation sets, these corrupted with SSN at an SNR of 0 dB and N = 30, we found two noise signals serve as unmatched noise types, whereas learning rates of 0:01 and 5 10 per sample to be optimal for SSN, BBL, CAF, and STR are matched noise types. This will the ES and ES systems, respectively. During training, ELC MSE allow us to study how the ELC optimal DNNs and STSA-MSE the cost on the validation set was evaluated for each epoch optimal DNNs generalize to unmatched noise types. and the learning rates were scaled by 0:7, if the cost increased Each feed-forward DNN consists of three hidden layers with compared to the cost for the previous epoch. The training 512 units using ReLU activation functions. The N -dimensional was terminated, if the learning rate was below 10 . We output layer uses sigmoid functions which ensures that the implemented the DNNs using CNTK [54] and the scripts output gain g ^ is confined between zero and one. The needed to reproduce the reported results can be found in [48]. j;m DNNs are trained using stochastic gradient de-/ascent with Note, the goal of these experiments is not to achieve state- the backpropagation technique and batch normalization [45]. of-the-art enhancement performance. In fact, increasing the The DNNs are trained for a maximum of 200 epochs with a size of the dataset or DNNs might likely improve performance, minibatch size of 256 randomly selected short-time temporal although we have not reason to believe it will change the one-third octave band envelope vectors. conclusion. Since the ES and ES systems use different cost ELC MSE functions, they likely have different optimal learning rates. This VII. E XPERIM ENTAL RESULTS is easily seen from the gradient norms of the two cost functions. To study the relationship between ES and ES ELC MSE It can be shown (details omitted due to space limitations) that systems as function of N , we have trained multiple systems for the ` -norm of the gradient of the ELC cost function in Eq. (4), various N . Specifically, a total of eight ES systems and ELC with respect to the desired signal vector a ^, is given by eight ES systems have been trained with N taking the p MSE 1L(a; a ^) values N = f4; 7; 15; 20; 30; 40; 50; 80g, which correspond to krL(a; a ^)k = ; (23) temporal envelope vectors with durations from approximately ka ^k 50 to 1000 milliseconds. where the gradient rL(a; a ^) is given by rL a; a ^ = A. Comparing One-third Octave Bands " # (24) @L a; a ^ @L a; a ^ @L a; a ^ In Fig. 2 we present the ELC scores, as function of envelope ; ; : : : ; ; @a ^ @a ^ @a ^ 1 2 N duration N , for each of the J = 15 one-third octave band DNNs in the ES and ES systems. All DNNs are ELC MSE and tested using speech corrupted with BBL noise at an SNR of 0 @L a; a ^ dB. First, we observe that both systems manage to improve the @a ^ ELC score considerably, when compared to the ELC score of (25) L a; a ^ a ^ L a; a ^ a  m the noisy speech signals, i.e. both systems enhance the noisy m a ^ T T speech, which is in line with known results [8]. a ^  a  a ^  a ^ a ^ a ^ a ^ Furthermore, we can observe that the DNNs trained with the is the partial derivative of L(a; a ^) with respect to entry m ELC cost function, i.e. the ES systems, in general achieve ELC of vector a ^. Similarly, the gradient of the STSA-MSE cost higher, or similar, ELC scores than the DNNs trained with the function in Eq. (22) is given by STSA-MSE cost function, i.e. the ES systems. This is an MSE important observation, since it verifies that DNNs trained to (26) rJ a; a ^ = (a a ^) ; maximize ELC indeed achieve the highest, or similar, ELC scores compared to DNNs trained to optimize a different cost such that function, STSA-MSE in this case. Finally, and most importantly, (27) rJ a; a ^ = ka a ^k : we observe that the difference in ELC score between the N 7 Band 1 (CF: 150 Hz) Band 2 (CF: 189 Hz) Band 3 (CF: 238 Hz) Band 4 (CF: 300 Hz) Band 5 (CF: 378 Hz) 0.8 0.8 0.8 0.8 0.8 0.7 0.6 0.6 0.6 0.6 0.6 0.5 0.4 0.4 0.4 0.4 0.4 0.2 0.2 0.2 0.2 4 15 30 50 80 4 15 30 50 80 4 15 30 50 80 4 15 30 50 80 4 15 30 50 80 Band 6 (CF: 476 Hz) Band 7 (CF: 600 Hz) Band 8 (CF: 756 Hz) Band 9 (CF: 952 Hz) Band 10 (CF: 1200 Hz) 0.8 0.8 0.8 0.8 0.8 0.6 0.6 0.6 0.6 0.6 0.4 0.4 0.4 0.4 0.4 0.2 0.2 0.2 0.2 0.2 4 15 30 50 80 4 15 30 50 80 4 15 30 50 80 4 15 30 50 80 4 15 30 50 80 Band 11 (CF: 1512 Hz) Band 12 (CF: 1905 Hz) Band 13 (CF: 2400 Hz) Band 14 (CF: 3024 Hz) Band 15 (CF: 3810 Hz) 0.8 0.8 0.8 0.8 0.8 0.7 0.7 0.7 0.7 0.6 0.6 0.6 0.6 0.6 0.5 0.5 0.4 0.5 0.5 0.4 0.4 0.2 0.3 0.3 0.4 0.4 4 15 30 50 80 4 15 30 50 80 4 15 30 50 80 4 15 30 50 80 4 15 30 50 80 Fig. 2. ELC values for ES and ES systems trained using various envelope durations, N , and tested with corresponding values of N using speech ELC MSE corrupted with BBL noise at an SNR of 0 dB. Each figure shows one out of J = 15 one-third octave band DNNs (center frequency (CF) shown in parenthesis). It is seen that as N ! 80 the difference between the ES DNNs and ES DNNs, as measured by ELC, tends to zero. This is in line with the ELC MSE theoretical results of Sec. V. ES and ES DNNs generally decrease with increasing durations and noise types, which indicate that our test set is ELC MSE N . For N = 80 the ELC score of the ES and ES sufficiently large to provide accurate estimates of the true mean ELC MSE DNNs practically coincide. ELC difference. Similarly to Fig. 2, the results in Fig. 3 support the theoretical results of Sec. V. Additionally, the results in B. Comparing ELC across Noise Types Fig. 3 show consistency across multiple noise types, which suggests that the theory in practice applies for various noise In Fig. 3 we present the ELC score difference, as function of type distributions. envelope duration N , for ES and ES systems, when ELC MSE tested using speech material corrupted with various noise types C. Comparing STOI across Noise Types at an SNR of 0 dB. Specifically, we compute the difference in ELC score for each pair of one-third octave band DNNs in the We now investigate if the global behavior observed for ES and ES systems, and then compute the average approximate-STOI, i.e. ELC, in Fig. 3 also applies for real ELC MSE ELC difference as function of envelope duration N . We do this STOI. To do this, we reconstruct the test signals used for Fig. 3 for all the 1000 test utterances and for each of the six noise in the time domain. We follow the technique proposed in [23], types introduced in Sec VI-B: SSN, BBL, CAF, STR, BUS, where a uniform gain across STFT coefficients within a one- and PED. Finally, we compute the 95% confidence interval (CI) third octave band is used before an inverse DFT is applied on the mean ELC difference. using the phase of the noisy signal. In Table I we present the From Fig. 3 we observe that the average ELC difference, STOI scores for ES and ES systems, as a function of ELC MSE i.e. ES ES , appears to be monotonically decreasing N , when tested using speech material corrupted with different ELC MSE with respect to the duration of the envelope N . Furthermore, noise types at an SNR of 0 dB. Note that these test signals we observe that the average ELC difference approaches zero are similar to the test signals used for Fig. 3 except that we as the duration of the envelope N increases, and similarly to now evaluate them according to STOI and not ELC. Fig. 2, for N = 80, the difference between the ES and From Table I we observe that the average STOI difference ELC ES systems is close to zero. Finally, we observe that the between the ES and ES systems is maximum for MSE ELC MSE 95% confidence intervals are relatively narrow for all envelope N = 4, but quickly tends to zero as N increases and for 8 Noise type: SSN Noise type: BBL Noise type: CAF 0.08 0.08 0.08 95% CI 95% CI 95% CI 0.06 0.06 0.06 0.04 0.04 0.04 0.02 0.02 0.02 0 0 0 4 15 30 50 80 4 15 30 50 80 4 15 30 50 80 Noise type: STR Noise type: BUS Noise type: PED 0.08 0.08 0.08 95% CI 95% CI 95% CI 0.06 0.06 0.06 0.04 0.04 0.04 0.02 0.02 0.02 0 0 0 4 15 30 50 80 4 15 30 50 80 4 15 30 50 80 Fig. 3. Average ELC differences, as function of envelope durations N , between ES and ES systems, for different noise types. We observe a ELC MSE monotonic decreasing relationship between the average ELC difference and the envelope length and for N = 80, the average ELC difference between the ES and ES systems is close to zero. This is in line with the theoretical results of Sec. V. ELC MSE N  15, the STOI difference is practically zero, i.e.  0:01. TABLE I STOI SCORES AS FUNCTION OF N FOR ES AND ES SYSTEMS ELC MSE Also, we observe that the gap in STOI between the ES and ELC TESTED USING DIFFERENT NOISE TYPES AT AN SNR OF 0 DB. ES systems closes faster at a lower value of N in Table I MSE compared to Fig. 3. We believe this is due to the transformation N : 4 7 15 20 30 40 50 80 of the, potentially ”invalid”, sequences of (e.g. [55], [56]) ELC : 0.81 0.85 0.88 0.88 0.87 0.86 0.85 0.84 SSN: modified magnitude spectra, when reconstructing enhanced MSE : 0.84 0.87 0.87 0.87 0.87 0.86 0.85 0.84 time-domain signals, whose intelligibility is estimated by STOI ELC : 0.77 0.80 0.82 0.82 0.81 0.80 0.80 0.78 BBL: in Table I. Therefore, STOI in Table I might be computed MSE : 0.79 0.82 0.82 0.82 0.81 0.80 0.80 0.78 based on slightly different magnitude spectra compared to the ELC : 0.82 0.85 0.87 0.87 0.86 0.85 0.84 0.83 CAF: magnitude spectra used for computing the ELC scores in Fig. 3. MSE : 0.85 0.87 0.87 0.87 0.86 0.85 0.85 0.84 Furthermore, we observe that the ES achieve slightly MSE ELC : 0.83 0.86 0.88 0.89 0.88 0.87 0.87 0.85 STR: higher STOI scores than the ES systems for N = 4, which ELC MSE : 0.86 0.88 0.88 0.88 0.88 0.87 0.87 0.85 might be due to sub-optimal learning rates as the ones actually ELC : 0.77 0.81 0.83 0.83 0.83 0.82 0.81 0.80 PED: used during training of the systems at, e.g. N = 4, were found MSE : 0.80 0.82 0.83 0.83 0.82 0.82 0.81 0.80 based on a grid-search using systems with N = 30 (see Sec. ELC : 0.87 0.89 0.90 0.91 0.90 0.89 0.89 0.89 BUS: VI.D). More importantly, the maximum improvement in STOI MSE : 0.89 0.90 0.90 0.90 0.90 0.90 0.89 0.89 is achieved for N = f15; 20; 30g, where both systems achieve similar STOI scores. Finally, while the theoretical results of Sec. V show that approximate-STOI performance of a ^ MMELC results in Sec. V. However, the results in Sec. V predict that and a ^ is identical, asymptotically, for N ! 1, the MMSE not only do ES , and ES systems produce identical ELC MSE empirical results in Table I suggest that N  15 is sufficient for ELC scores, they also predict that the systems are, in fact, practical equality to hold for DNN based speech enhancement essentially identical, i.e. up to an affine transformation. Hence, systems. in this section, we compare how the systems actually operate. Specifically, we compare the gains estimated by ES ELC D. Comparing Gain-Values systems with gains estimated by ES systems. MSE Figures 2 and 3, and Table I show that ES systems In Fig. 4 we present scatter plots, one for each one-third ELC achieve approximately the same ELC and STOI values as octave band for pairs of gains estimated by ES and ES ELC MSE ES systems and that the ELC and STOI difference systems tested with BBL noise at an SNR of 5 dB. Each scatter MSE between the two types of systems approach zero as N becomes plot consists of 10000 pairs of gains acquired by sampling 10 large. These empirical results are in line with the theoretical gain-pairs randomly and uniformly distributed from each of the 9 TABLE II speech enhancement systems, when the DNNs are trained to S AMPLE CORRELATIONS BETWEEN GAINS FROM ES AND ES ELC MSE either maximize ELC or minimize MSE and the systems are SYSTEMS WITH N = 30. S EE F IG. 4 FOR PER BAND CORRELATIONS. evaluated using both ELC and STOI. Finally, our experimental findings suggest, that applying the traditional STSA-MMSE SNR SSN BBL CAF STR BUS PED estimator on noisy speech signals in practice leads to essentially [dB] maximum speech intelligibility as reflected by the STOI speech -5 0.94 0.87 0.89 0.93 0.87 0.90 intelligibility estimator. 0 0.94 0.92 0.92 0.93 0.88 0.92 5 0.95 0.95 0.93 0.93 0.90 0.92 10 0.95 0.95 0.92 0.92 0.91 0.93 APPENDIX A M AXIM IZING A CONSTRAINED INNER PRODUCT 1000 test utterances. In Fig. 4, yellow indicates high density This appendix derives an expression for the zero-mean, unit- of gain-pairs and dark blue indicates low density. From Fig. 4 norm vector e(a ^), which maximizes the inner product with it is seen that a correlation no smaller than 0:88 is achieved the vector E [e(Ajr)]. For notational convenience, let = Ajr for all 15 one-third octave bands. The highest correlation of E [e(Ajr)], and = e(a ^). The constrained optimization Ajr r = 0:98 is achieved by bands 5 to 7 and the lowest is r = 0:88 problem from Eq. (11) is then defined as achieved by band 2 followed by band 1 with r = 0:89. It is also seen that a large number of gain values are either zero, or maximize one, as one would expect due to the sparse nature of speech T (28) in the T-F domain. However, although a strong correlation subject to 1 = 0; is observed for all bands, the gain-pairs are slightly more = 1: scattered at the first few bands than for the remaining bands. This might be explained simply by the fact that low one-third The vector that solves Eq. (28) can be found using octave bands correspond to single STFT bins, whereas higher the method of Lagrange multipliers [57]. Introducing two one-third octave bands are sums of a large number of STFT scalar Lagrange multipliers,  and  , for the two equality 1 2 bins. This, in turn, may have the consequence that for finite constraints, the Lagrangian is given by N (N = 30), Kolmogorovs strong law of large numbers (see T T T Appendix. B) is better valid at higher frequencies than at lower L( ;  ;  ) = +  1 +  ( 1): (29) 1 2 1 2 frequencies (so that gain vectors produced by one system is @L closer to an affine transformation of gain vectors produced by Setting the partial derivatives equal to zero the other system). In fact, if we compute r for models trained with N = 50, we get r = 0:93, i.e. increased correlation @L between the gain vectors produced by the two systems. Finally, = +  1 + 2 = 0; (30) 1 2 in Table. II we present average correlation coefficients and we observe correlation coefficients  0:87 for all, both matched and solving for , we arrive at and unmatched, noise types, at multiple SNRs. = : (31) VIII. CONCLUSION 2 @L @L This study is motivated by the fact that most estimators Using the same approach for and , substituting in @ @ 1 2 used for speech enhancement, being either data-driven models, Eq. (31) and solving for  , and  such that the two constraints 1 2 e.g. deep neural networks (DNNs), or statistical model-based are fulfilled, we find techniques such as the short-time spectral amplitude minimum mean-square error (STSA-MMSE) estimator, use the STSA T (32) = 1 =  ; mean-square error (MSE) cost function as a performance indicator. Short-time objective intelligibility (STOI), a state- and of-the-art speech intelligibility estimator, on the other hand, k  1k (33) rely on the envelope linear correlation (ELC) of speech temporal = : envelopes. Since the primary goal of many speech enhancement systems is to improve speech intelligibility, it raises the question Inserting  and  into Eq. (31) results in 1 2 if estimators can benefit from an ELC cost function. In this paper we derive the maximum mean envelope linear = ; (34) correlation (MMELC) estimator and study its relationship to k  1k the well-known STSA-MMSE estimator. We show theoretically that the MMELC estimator, under a commonly used conditional which is simply the vector , normalized to zero sample mean independence assumption, is asymptotically equivalent to and unit norm. the STSA-MMSE estimator. Furthermore, we demonstrate 2 T experimentally that this relationship also holds for DNN based We solve the equivalent problem that minimizes . 10 1 1 1 1 1 0 0 0 0 0 0 1 0 1 0 1 0 1 0 1 1 1 1 1 1 0 0 0 0 0 0 1 0 1 0 1 0 1 0 1 1 1 1 1 1 0 0 0 0 0 0 1 0 1 0 1 0 1 0 1 Fig. 4. Scatter plots based on gain values from ES and ES systems with an envelope length of N = 30. Dark blue indicate low density and bright ELC MSE ^ ^ ^ yellow indicate high density. The systems are tested with BBL noise corrupted speech at an SNR of 5 dB. Each figure shows one of 15 (g ; g ; : : : ; g ) 1 2 15 one-third octave bands. A correlation no smaller than 0:88 is achieved for all one-third octave bands, which indicates that the ES and ES systems ELC MSE estimate fairly similar gain vectors. APPENDIX B We can rewrite the factors on the right-hand side of Eq. (39) FACTORIZATION OF E XPECTATION as follows h i This appendix shows that the expectation in Eq. (16) factor- E [Z ] = E h Y izes into the product of expectations in Eq. (18), asymptotically as N ! 1. Let = E S 1 Y Y , Ajr; (35) 1 (40) = E [S ] 1 E [Y ] and N 1 N H , I 11 ; (36) = E [S ] E [S ] ; i j j=1 so that 2 3 " # Z = HY ; (37) 1 1 4 5 E = E Z T T where I denotes the N -dimensional identity matrix and Ajr N Y HH Y is a random vector distributed according to the conditional 2 3 probability density function f (ajr). A specific element Z , AjR 4 5 = E q of Z is then given by Y HY 2 3 Z = h Y (38) 1 T 4 5 = E q = S 1 Y ; T T 1 T Y Y Y 11 Y (41) 2 3 where h is the ith column of matrix H . We now define the covariance between Z and 1=kZk as i 6 7 6 7 = E " " # # 4 5 P P N N 1 1 1 2 1 S S cov(Z ; ) , E Z E [Z ] E j=1 j N j=1 i i i Z Z Z 2 3 " # " # Z 1 6 7 i N 6 7 = E E [Z ] E : i = E r ; 4   5 Z Z 2 P P N N 1 1 S S j=1 j j=1 N N (39) 11 and REFERENCES 2 3 [1] H. Erdogan, J. R. Hershey, S. Watanabe, and J. L. Roux, “Deep Recurrent " # 1 1 S S Networks for Separation and Recognition of Single-Channel Speech 6 i j 7 j=1 Z N N 6 7 in Nonstationary Background Audio,” in New Era for Robust Speech E = E r : (42) 4   5 Z Recognition. Springer, 2017, pp. 165–186. P P N N 1 1 S S [2] D. Wang, “Deep learning reinvents the hearing aid,” IEEE Spectrum, j=1 j j=1 N N vol. 54, no. 3, pp. 32–37, 2017. [3] D. Wang and J. Chen, “Supervised Speech Separation Based on Deep In Eqs. (40), (41) and (42) two different sums of random Learning: An Overview,” arXiv:1708.07524, 2017. variables occur, [4] M. Kim and P. Smaragdis, “Bitwise Neural Networks for Efficient Single- Channel Source Separation,” in Proc. NIPS Machine Learning for Audio Signal Processing Workshop, 2017. [5] R. Fakoor, X. He, I. Tashev, and S. Zarar, “Reinforcement Learning To S ; (43) N Adapt Speech Enhancement to Instantaneous Input Signal Quality,” in j=1 Proc. NIPS Machine Learning for Audio Signal Processing Workshop, and [6] J. Chen, Y. Wang, S. E. Yoho, D. Wang, and E. W. Healy, “Large-scale training to increase speech intelligibility for hearing-impaired listeners in novel noises,” J. Acoust. Soc. Am., vol. 139, no. 5, pp. 2604–2612, S : (44) j=1 [7] E. W. Healy, M. Delfarah, J. L. Vasko, B. L. Carter, and D. Wang, “An algorithm to increase intelligibility for hearing-impaired listeners in the Since, by assumption, Eq. (17), S 8 j are independent random presence of a competing talker,” J. Acoust. Soc. Am., vol. 141, no. 6, pp. 3 4230–4239, 2017. variables with finite variances , according to Kolmogorovs [8] M. Kolbæk, Z. H. Tan, and J. Jensen, “Speech Intelligibility Potential strong law of large numbers [44], the sums given by Eqs. (43) of General and Specialized Deep Neural Network Based Speech and (44) will converge (almost surely, i.e. with probability (Pr) Enhancement Systems,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 25, no. 1, pp. 153–167, 2017. one) to their average means  = E[S ], and  2 = S j S j=1 [9] J. Schnupp, E. Nelken, and A. King, Auditory Neuroscience - Making E[S ], respectively, as N ! 1. Formally, we can Sense of Sound. MIT Press, 2011. j=1 j [10] B. Moore, An Introduction to the Psychology of Hearing. Brill, 2013. express this as [11] R. D. Patterson, K. Robinson, J. Holdsworth, D. Mckeown, C. Zhang, 0 1 and M. Allerhand, “Complex sounds and auditory images,” in In Proc. 1 International Symposium on Hearing, 1992, pp. 429–446. @ A Pr lim S =  = 1; (45) j S [12] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “An Algorithm N!1 N for Intelligibility Prediction of Time-Frequency Weighted Noisy Speech,” j=1 IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 19, no. 7, pp. 2125–2136, 2011. and [13] T. M. Elliott and F. E. Theunissen, “The Modulation Transfer Function 0 1 for Speech Intelligibility,” PLOS Computational Biology, vol. 5, no. 3, @ A Pr lim S =  2 = 1: (46) j S [14] R. Drullman, J. M. Festen, and R. Plomp, “Effect of temporal envelope N!1 j=1 smearing on speech reception,” J. Acoust. Soc. Am., vol. 95, no. 2, pp. 1053–1064, 1994. [15] J. S. Lim and A. V. Oppenheim, “Enhancement and bandwidth com- By substituting Eqs. (45), and (46) into Eqs. (40), (41) and pression of noisy speech,” Proceedings of the IEEE, vol. 67, no. 12, pp. (42), we arrive at 1586–1604, 1979. [16] E. W. Healy, S. E. Yoho, J. Chen, Y. Wang, and D. Wang, “An algorithm lim E [Z ] = E [S ]  ; i i S to increase speech intelligibility for hearing-impaired listeners in novel (47) N!1 segments of the same noise type,” J. Acoust. Soc. Am., vol. 138, no. 3, pp. 1660–1669, 2015. " # 1 [17] P. C. Loizou, “Speech Enhancement Based on Perceptually Motivated lim Bayesian Estimators of the Magnitude Spectrum,” IEEE/ACM Trans. N!1 (48) lim E = p ; Audio, Speech, Lang. Process., vol. 13, no. 5, pp. 857–869, 2005. N!1 [18] R. C. Hendriks, T. Gerkmann, and J. Jensen, “DFT-Domain Based Single- Microphone Noise Reduction for Speech Enhancement: A Survey of the and State of the Art,” Synth. Lect. on Speech and Audio Process., vol. 9, no. 1, pp. 1–80, 2013. " # 1 [19] L. Lightburn and M. Brookes, “SOBM - a binary mask for noisy speech lim that optimises an objective intelligibility metric,” in Proc. ICASSP, 2015, i N!1 lim E = (E [S ]  ) i S pp. 5078–5082. N!1 [20] W. Han, X. Zhang, G. Min, X. Zhou, and W. Zhang, “Perceptual (49) " # weighting deep neural networks for single-channel speech enhancement,” in Proc. WCICA, 2016, pp. 446–450. = lim E [Z ] E ; N!1 [21] P. G. Shivakumar and P. Georgiou, “Perception Optimized Deep Denoising AutoEncoders for Speech Enhancement - Semantic Scholar,” in Proc. INTERSPEECH, 2016, pp. 3743–3747. where the last line follows from Eq. (47) and (48). In words, [22] Y. Koizumi, K. Niwa, Y. Hioka, K. Kobayashi, and Y. Haneda, “DNN- as N ! 1, the covariance between Z and 1=kZk tends to i based source enhancement self-optimized by reinforcement learning using sound quality measurements,” in Proc. ICASSP, 2017, pp. 81–85. zero and, consequently, the expectation in Eq. (16) factorizes [23] M. Kolbæk, Z.-H. Tan, and J. Jensen, “Monaural Speech Enhancement into the product of expectations in Eq. (18). using Deep Neural Networks by Maximizing a Short-Time Objective Intelligibility Measure,” in Proc. ICASSP, 2018, pp. 5059 – 5063. Assuming a finite variance of S is motivated by the fact that S model [24] Y. Zhao, B. Xu, R. Giri, and T. Zhang, “Perceptually Guided Speech j j speech signals, which always take finite values due to both physical and Enhancement using Deep Neural Networks,” in Proc. ICASSP, 2018, pp. physiological limitations of sound and speech production systems, respectively. 5074–5078. 12 [25] H. Zhang, X. Zhang, and G. Gao, “Training Supervised Speech Separation [51] ITU, “Rec. P.56 : Objective measurement of active speech level,” 1993, System to Improve STOI and PESQ Directly,” in Proc. ICASSP, 2018, https://www.itu.int/rec/T-REC-P.56/. pp. 5374–5378. [52] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” [26] S. W. Fu, T. W. Wang, Y. Tsao, X. Lu, and H. Kawai, “End-to- in Proc. ICLR (arXiv:1412.6980), 2014. End Waveform Utterance Enhancement for Direct Evaluation Metrics [53] A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht, “The Optimization by Fully Convolutional Neural Networks,” IEEE/ACM Marginal Value of Adaptive Gradient Methods in Machine Learning,” in Trans. Audio, Speech, Lang. Process., vol. 26, no. 9, pp. 570 – 1584, Proc. NIPS, 2017. [54] A. Agarwal et al., “An introduction to computational networks and the [27] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, “Perceptual computational network toolkit,” Microsoft Technical Report fMSR-TRg- evaluation of speech quality (PESQ)-a new method for speech quality 2014-112, Tech. Rep., 2014. assessment of telephone networks and codecs,” in Proc. ICASSP, vol. 2, [55] S. Nawab, T. Quatieri, and J. Lim, “Signal reconstruction from short-time 2001, pp. 749–752. Fourier transform magnitude,” IEEE Trans. Acoust., Speech, and Sig. [28] S. Jørgensen, J. Cubick, and T. Dau, “Speech Intelligibility Evaluation Process., vol. 31, no. 4, pp. 986–998, 1983. for Mobile Phones.” Acustica United with Acta Acustica, vol. 101, pp. [56] D. Griffin and J. Lim, “Signal estimation from modified short-time 1016–1025, 2015. Fourier transform,” IEEE Trans. Acoust., Speech, and Sig. Process., [29] J. Jensen and C. H. Taal, “An Algorithm for Predicting the Intelligibility vol. 32, no. 2, pp. 236–243, 1984. of Speech Masked by Modulated Noise Maskers,” IEEE/ACM Trans. [57] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge Audio, Speech, Lang. Process., vol. 24, no. 11, pp. 2009–2022, 2016. University Press, 2004. [30] ——, “Speech Intelligibility Prediction Based on Mutual Information,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 22, no. 2, pp. 430–440, 2014. Morten Kolbæk received the B.Eng. degree in [31] T. H. Falk et al., “Objective Quality and Intelligibility Prediction for electronic design at Aarhus University, Business and Users of Assistive Listening Devices: Advantages and limitations of Social Sciences, AU Herning, Denmark, in 2013 existing tools,” IEEE Sig. Process. Mag., vol. 32, no. 2, pp. 114–124, and the M.Sc. in signal processing and computing from Aalborg University, Denmark, in 2015. He is [32] R. Xia, J. Li, M. Akagi, and Y. Yan, “Evaluation of objective intelligibility currently pursuing his PhD degree at the section for prediction measures for noise-reduced signals in mandarin,” in Proc. Signal and Information Processing at the Department ICASSP, 2012, pp. 4465–4468. of Electronic Systems, Aalborg University, Denmark. [33] P. C. Loizou, Speech Enhancement: Theory and Practice. CRC Press, His research interests include speech enhancement and separation, deep learning, and intelligibility [34] Y. Ephraim and D. Malah, “Speech enhancement using a minimum- improvement of noisy speech. mean square error short-time spectral amplitude estimator,” IEEE Trans. Acoust., Speech, and Sig. Process., vol. 32, no. 6, pp. 1109–1121, 1984. [35] G. Kim, Y. Lu, Y. Hu, and P. C. Loizou, “An algorithm that improves Zheng-Hua Tan (M’00–SM’06) received the B.Sc. speech intelligibility in noise for normal-hearing listeners,” J. Acoust. and M.Sc. degrees in electrical engineering from Soc. Am., vol. 126, no. 3, pp. 1486–1494, 2009. Hunan University, Changsha, China, in 1990 and [36] K. Han and D. Wang, “A classification based approach to speech 1996, respectively, and the Ph.D. degree in electronic segregation,” J. Acoust. Soc. Am., vol. 132, no. 5, pp. 3475–3483, 2012. engineering from Shanghai Jiao Tong University, [37] J. Allen, “Short term spectral analysis, synthesis, and modification Shanghai, China, in 1999. He is a Professor and a Co- by discrete Fourier transform,” IEEE Trans. Acoust., Speech, and Sig. Head of the Centre for Acoustic Signal Processing Process., vol. 25, no. 3, pp. 235–238, 1977. Research (CASPR) at Aalborg University, Aalborg, [38] C. H. Taal, R. C. Hendriks, and R. Heusdens, “Matching pursuit for Denmark. He was a Visiting Scientist at the Computer channel selection in cochlear implants based on an intelligibility metric,” in Proc. EUSIPCO, 2012, pp. 504–508. Science and Artificial Intelligence Laboratory, MIT, [39] A. H. Andersen, J. M. d. Haan, Z. H. Tan, and J. Jensen, “Predicting Cambridge, USA, an Associate Professor at Shanghai the Intelligibility of Noisy and Nonlinearly Processed Binaural Speech,” Jiao Tong University, and a postdoctoral fellow at KAIST, Daejeon, Korea. IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 24, no. 11, pp. His research interests include machine learning, deep learning, pattern 1908–1920, 2016. recognition, speech and speaker recognition, noise-robust speech processing, [40] S. M. Kay, Fundamentals of Statistical Signal Processing: Estimation multimodal signal processing, and social robotics. He is a member of the IEEE Theory. Prentice Hall, 2010. Signal Processing Society Machine Learning for Signal Processing Technical [41] Y. Ephraim and D. Malah, “Speech enhancement using a minimum Committee (MLSP TC). He is an Editorial Board Member for Computer mean-square error log-spectral amplitude estimator,” IEEE Trans. Acoust., Speech and Language and was a Guest Editor for the IEEE Journal of Selected Speech, and Sig. Process., vol. 33, no. 2, pp. 443–445, 1985. Topics in Signal Processing and Neurocomputing. He was the General Chair [42] J. S. Erkelens, R. C. Hendriks, R. Heusdens, and J. Jensen, “Minimum for IEEE MLSP 2018 and a TPC co-chair for IEEE SLT 2016. Mean-Square Error Estimation of Discrete Fourier Coefficients With Generalized Gamma Priors,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 15, no. 6, pp. 1741–1752, 2007. Jesper Jensen received the M.Sc. degree in electrical [43] R. McAulay and M. Malpass, “Speech enhancement using a soft-decision engineering and the Ph.D. degree in signal processing noise suppression filter,” IEEE Trans. Acoust., Speech, and Sig. Process., from Aalborg University, Aalborg, Denmark, in 1996 vol. 28, no. 2, pp. 137–145, 1980. and 2000, respectively. From 1996 to 2000, he was [44] P. K. Sen and J. M. Singer, Large Sample Methods in Statistics: An with the Center for Person Kommunikation (CPK), Introduction with Applications. Chapman & Hall, 1994. Aalborg University, as a Ph.D. student and Assistant [45] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, Research Professor. From 2000 to 2007, he was a 2016. Post-Doctoral Researcher and Assistant Professor [46] K. Hornik, M. Stinchcombe, and H. White, “Multilayer feedforward with Delft University of Technology, Delft, The networks are universal approximators,” Neural Networks, vol. 2, no. 5, Netherlands, and an External Associate Professor pp. 359–366, 1989. with Aalborg University. Currently, he is a Senior [47] J. Garofolo, D. Graff, P. Doug, and D. Pallett, “CSR-I (WSJ0) Complete Principal Scientist with Oticon A/S, Copenhagen, Denmark, where his main LDC93s6a,” 1993, philadelphia: Linguistic Data Consortium. responsibility is scouting and development of new signal processing concepts [48] M. Kolbæk, Z.-H. Tan, and J. Jensen, “Supplemental Material.” [Online]. for hearing aid applications. He is a Professor with the Section for Signal Available: http://kom.aau.dk/ mok/taslp2018 and Information Processing (SIP), Department of Electronic Systems, at [49] J. Barker, R. Marxer, E. Vincent, and S. Watanabe, “The third ‘CHiME’ Aalborg University. He is also a co-founder of the Centre for Acoustic Signal speech separation and recognition challenge: Dataset, task and baselines,” Processing Research (CASPR) at Aalborg University. His main interests are in Proc. ASRU, 2015, pp. 504–511. in the area of acoustic signal processing, including signal retrieval from noisy [50] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and observations, coding, speech and audio modification and synthesis, intelligibility N. L. Dahlgren, “DARPA TIMIT Acoustic Phonetic Continuous Speech enhancement of speech signals, signal processing for hearing aid applications, Corpus CDROM,” 1993. and perceptual aspects of signal processing. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Electrical Engineering and Systems Science arXiv (Cornell University)

On the Relationship Between Short-Time Objective Intelligibility and Short-Time Spectral-Amplitude Mean-Square Error for Speech Enhancement

Loading next page...
 
/lp/arxiv-cornell-university/on-the-relationship-between-short-time-objective-intelligibility-and-OLym7hqBP9

References (60)

ISSN
2329-9290
eISSN
ARCH-3348
DOI
10.1109/TASLP.2018.2877909
Publisher site
See Article on Publisher Site

Abstract

On the Relationship Between Short-Time Objective Intelligibility and Short-Time Spectral-Amplitude Mean-Square Error for Speech Enhancement Morten Kolbæk, Zheng-Hua Tan, Senior Member, IEEE, and Jesper Jensen Abstract—The majority of deep neural network (DNN) based algorithms using e.g. a Gammatone filter bank [11] or a one- speech enhancement algorithms rely on the mean-square er- third octave band filter bank [12]. It is also well known that ror (MSE) criterion of short-time spectral amplitudes (STSA), preservation of modulation frequencies in the range 4-20 Hz which has no apparent link to human perception, e.g. speech are critical for speech intelligibility [9], [13], [14]. Therefore, intelligibility. Short-Time Objective Intelligibility (STOI), a pop- it is natural to believe that, if prior knowledge about the human ular state-of-the-art speech intelligibility estimator, on the other hand, relies on linear correlation of speech temporal envelopes. auditory system is incorporated into a speech enhancement This raises the question if a DNN training criterion based on algorithm, improvements in speech intelligibility or speech envelope linear correlation (ELC) can lead to improved speech quality can be achieved [15]. intelligibility performance of DNN based speech enhancement Indeed, numerous works exist that attempt to incorporate algorithms compared to algorithms based on the STSA-MSE such knowledge (e.g. [16]–[26] and references therein). In criterion. In this paper we derive that, under certain general conditions, the STSA-MSE and ELC criteria are practically [16] a transform-domain method based on a Gammatone filter equivalent, and we provide empirical data to support our the- bank was used, which incorporates a non-linear frequency oretical results. Furthermore, our experimental findings suggest resolution mimicking that of the human auditory system. In that the standard STSA minimum-MSE estimator is near optimal, [17] different perceptually motivated cost functions were used if the objective is to enhance noisy speech in a manner which is to derive STSA clean speech spectrum estimators in order optimal with respect to the STOI speech intelligibility estimator. to emphasize spectral peak information, account for auditory masking or penalize spectral over-attenuation. In [20], [21] Index Terms—Speech enhancement, Speech intelligibility, Deep similar goals were pursued, but instead of using classical neural networks, Minimum mean-square error estimator. statistically-based models, DNNs were used. Finally, in [22] a deep reinforcement learning technique was used to reward I. I NTRODUCTION solutions that achieved a large score in terms of perceptual evaluation of speech quality (PESQ) [27], a commonly used ESPITE the recent success of deep neural network (DNN) speech quality estimator. based speech enhancement algorithms [1]–[5], it is yet Although the works in e.g. [16], [17], [21], [22] include unknown if these algorithms are optimal in terms of aspects knowledge about the human auditory system the techniques related to human auditory perception, e.g. speech intelligibility, since existing algorithms do not directly optimize criteria are not designed specifically to maximize speech intelligibility. designed with human auditory perception in mind. While speech processing methods that improve speech intel- ligibility would be of vital importance for applications such Many current state-of-the-art DNN based speech enhance- as mobile communications, or hearing assistive devices, only ment algorithms use a mean squared error (MSE) training very little research has been performed to understand if DNN- criterion [6]–[8] on short-time spectral amplitudes (STSA). This, however, might not be the optimal training criterion based speech enhancement systems can help improve speech if the target is the human auditory system, and improvement in intelligibility. Very recent work [23]–[26] has investigated if speech intelligibility or speech quality is the desired objective. DNNs trained to maximize a state-of-the-art speech intelligibil- ity estimator are capable of improving speech intelligibility as It is well known that the frequency sensitivity of the human auditory system is non-linear ( e.g. [9], [10]) and, as a measured by the estimator [23]–[25] or human listeners [26]. consequence, is often approximated in digital signal processing Specifically, DNNs were trained to maximize the short-time objective intelligibility (STOI) [12] estimator and were then Manuscript received month day, year; revised month day, year; accepted compared, in terms of STOI, with DNNs trained to minimize month day, year. Date of publication month day, year; date of current version the classical STSA-MSE criterion. Surprisingly, although all Month day, year. This research was partly funded by the Oticon Foundation. The associate editor coordinating the review of this manuscript and approving DNNs improved STOI, the DNNs trained to maximize STOI it for publication was xxyyzz xxyyzz. showed none or only very modest improvements in STOI M. Kolbæk and Z.-H. Tan are with the Department of Electronic Sys- compared to the DNNs trained with the classical STSA-MSE tems, Aalborg University, Aalborg 9220, Denmark (e-mail: mok@es.aau.dk; zt@es.aau.dk). criterion [23]–[26]. J. Jensen is with the Department of Electronic Systems, Aalborg University, The STOI speech intelligibility estimator has proven to Aalborg 9220, Denmark, and also with Oticon A/S, Smørum 2765, Denmark be able to quite accurately predict the intelligibility of (e-mail: jje@es.aau.dk; jesj@oticon.com). Digital Object Identifier 00.0000/TASLP.2018.0000000 noisy/processed speech in a large range of acoustic scenar- arXiv:1806.08404v2 [cs.SD] 4 Dec 2018 2 where v[n] is a sample of additive noise. Furthermore, let r(k; m) a(k; m) and r(k; m), k = 1; : : : ; + 1, m = 1; : : : M; denote the single-sided magnitude spectra of the K -point short-time discrete Fourier transform (STFT) of x[n] and y[n] g^(k; m) a^(k; m) x^[n] T-F Gain T-F y[n], respectively, where M is the number of STFT frames. Analysis Estimator Synthesis Also, let a ^(k; m) denote an estimate of a(k; m) obtained as a ^(k; m) = g ^(k; m)r(k; m). Here, g ^(k; m) is a scalar gain factor applied to the magnitude spectrum of the noisy speech (k; m) y r(k; m) to arrive at an estimate a ^(k; m) of the clean speech magnitude spectrum a(k; m). It is the goal of many STFT- Fig. 1. Classical gain-based speech enhancement system. The noisy time- based speech enhancement systems to find appropriate values domain signal y[n] = x[n]+v[n] is first decomposed into a time-frequency (T- F) representation r(k; m) for time-frame m and frequency index k. An for g ^(k; m) based on the available noisy signal y[n]. The gain estimator, e.g. a DNN, estimates a gain g^(k; m) that is applied to the noisy factor g ^(k; m) is typically estimated using either statistical short-term magnitude spectrum r(k; m) to arrive at an enhanced signal model-based methods such as classical STSA minimum mean- magnitude a^(k; m) = g^(k; m)r(k; m). Finally, the enhanced time-domain signal x^[n] is obtained from a T-F synthesis stage using the phase of the noisy square error (MMSE) estimators [34], [18], [33], or machine signal  (k; m). learning based techniques such as Gaussian mixture models [35], support vector machines [36], or, more recently, DNNs [6]–[8], [16]. For reconstructing the enhanced speech signal in ios, including speech processed by mobile communication the time domain, it is common practice to append the short-time devices [28], ideal time-frequency weighted noisy speech [12], phase spectrum of the noisy signal to the estimated short-time noisy speech enhanced by single-microphone time-frequency magnitude spectrum and then use the overlap-and-add technique weighting-based speech enhancement systems [12], [29], [30], [37], [33]. and speech processed by hearing assistive devices such as cochlear implants [31]. STOI has also been shown to be robust III. S HORT-T IME OBJECTIVE I NTELLIGIBILITY (STOI) to variations in language types, including Danish [12], Dutch [30], and Mandarin [32]. Finally, recent studies e.g. [6], [7] In the following, we shortly review the STOI intelligibility also show a good correspondence between STOI predictions estimator [12]. For further details we refer to [12]. Let the jth of noisy speech enhanced by DNN-based speech enhancement one-third octave band clean-speech amplitude, for time-frame systems, and speech intelligibility. As a consequence, STOI m, be defined as is currently the, perhaps, most commonly used speech intelli- k (j) u 2 gibility estimator for objectively evaluating the performance a (m) = t a(k; m) ; (2) of speech enhancement systems [6]–[8], [16]. Therefore, it k=k (j) is natural to believe that gains in speech intelligibility, as estimated by STOI, can be achieved by utilizing an optimality where k (j) and k (j) denote the first and last STFT bin index, 1 2 criterion based on STOI as opposed to the classical criterion respectively, of the jth one-third octave band. Furthermore, let based on STSA-MSE. a short-time temporal envelope vector that spans time-frames In this paper we study the potential gain in speech in- m N + 1; : : : ; m, for the clean speech signal be defined as telligibility that can be achieved, if a DNN is designed to perform optimally with respect to the STOI speech intelligibility a = [a (m N + 1); a (m N + 2); : : : ; a (m)] (3) j j j j;m estimator. We derive that, under certain general conditions, In a similar manner we define a ^ and r for the enhanced j;m j;m maximizing an approximate-STOI criterion is equivalent to speech signal and the noisy observation, respectively. minimizing a STSA-MSE criterion. Furthermore, we present The parameter N defines the length of the temporal envelope empirical data using simulation studies with DNNs applied to and for STOI N = 30 , which for the STFT settings used in noisy speech signals, that support our theoretical results. Finally, this study, as well as in [12], corresponds to approximately we show theoretically under which conditions the equality 384 ms. Finally, the STOI speech intelligibility estimator for between the approximate-STOI criterion and the STSA-MSE a pair of short-time temporal envelope vectors can then be criterion holds for practical systems. Our results are in line approximated by the sample envelope linear correlation (ELC) with recent empirical work and might explain the somewhat between the clean and enhanced envelope vectors a and j;m surprising result in [23]–[26], where none or only very modest a ^ given as j;m improvements in STOI were achieved with STOI optimal DNNs compared to MSE optimal DNNs. a  a ^ j;m j;m j;m a ^ j;m L(a ; a ^ ) = ; (4) j;m j;m II. STFT-DOMAIN BASED SPEECH ENHANCEMENT a  a ^ j;m a j;m j;m a ^ j;m Fig. 1 shows a block-diagram of a classical gain-based where kk denotes the Euclidean ` -norm and  and speech enhancement system [18], [33]. Let x[n] be the nth j;m denote the sample means of a and a ^ , respectively. j;m j;m sample of the clean time-domain speech signal and let a noisy a ^ j;m observation y[n] be given by With N = 30, STOI is sensitive to temporal modulations of 2:6 Hz and y[n] = x[n] + v[n]; (1) higher, which are frequencies important for speech intelligibility [12]. 3 Note that Eq. (4) is an approximation, since the clipping and be a random envelope vector representing an estimate of A (m). normalization steps otherwise used in STOI, have been omitted. Now, the contribution of A (m) to speech intelligibility may This has empirically been found not to have any significant be approximated as the ELC between the envelope vectors effect on intelligibility prediction performance in most cases A (m) and A (m). In the following, the indices j and m are j j [19], [29], [38], [39]. Furthermore, since the normalization omitted for convenience. Let 1 denote a vector of ones, and 1 T step is applied for the entire vector a ^ , the normalization let  = 1 A1 be a vector, whose entries equal the sample j;m procedure itself does not influence the final STOI score. Also, mean of the entries in A. Let  be defined in a similar as clipping only occurs for time-frequency units for which the manner. Finally, let the ELC between A and A, which is a signal-to-distortion ratio (see Eq. (4) in [12]) is below 15 dB, random variable, be defined as clipping only occurs for a minority of the envelope vectors and approximating STOI with ELC is well valid, or even exact, in A  A A ^ most cases, when evaluating speech signals at practical SNRs. A; A , ; (9) FromL(a ; a ^ ), the final STOI score for an entire speech j;m j;m A  A signal is then defined as [12] the scalar, 1  d  1, and the expected ELC as J M X X h  i d = L(a ; a ^ ); (5) j;m j;m ^ = E  A; A ELC J (M N + 1) A;R j=1 m=N Z Z =  (a; a ^) f (a; r) da dr where J is the number of one-third octave bands and MN +1 A;R (10) Z Z is the total number of short-time temporal envelope vectors. =  (a; a ^) f (ajr) da f (r) dr: Similarly to [12], we use J = 15 with a center frequency AjR R | {z } of the first one-third octave band at 150 Hz and the last at (r) approximately 3.8 kHz to ensure a frequency range that covers the majority of the spectral information of human speech. The Here, a ^ is related to r via a deterministic map, e.g. a STOI score in general has been shown to often have high DNN, and f (a; r) denotes the joint probability density A;R correlation with listening tests involving human test subjects, function (PDF) of clean and noisy/processed one-third octave i.e. the higher numerical value of Eq. (5), the more intelligible band envelope vectors. Furthermore, f (ajr) and f (r) AjR R is the speech signal. denote a conditional and marginal PDF, respectively. Since STOI, as approximated by Eq. (5), is a sum of ELC An optimal estimator can be found by minimizing the Bayes values as given by Eq. (4), maximizing Eq. (4) will also risk [33], [40], which is equivalent to maximizing Eq. (10), maximize the overall STOI score in Eq. (5). As a consequence, hence arriving at the MMELC estimator, which we denote in order to find an estimate x ^[n] of x[n] so that STOI is as a ^ . To do so, observe that for a particular noisy MMELC maximized, one can focus on finding optimal estimates of observation r maximizing (r) maximizes Eq. (10), since the individual short-time temporal envelope vectors a . j;m f (r)  0 8 r. In other words, our goal is to maximize (r) Therefore, we define a ^ = diag(g ^ )r as the short-time j;m j;m for each and every r. Hence, for a particular observation, r, j;m temporal one-third octave band envelope vector of the enhanced the MMELC estimate is given by speech signal, where g ^ is an estimated gain vector and j;m a ^ = arg max  (a; a ^) f (ajr) da diag(g ^ ) is a diagonal matrix with the elements of g ^ on MMELC AjR j;m j;m a ^ the main diagonal. a  a ^ a a ^ = arg max f (ajr) da AjR IV. E NVELOPE LINEAR CORRELATION ESTIMATOR a  a ^ a ^ a a ^ We now introduce the approximate-STOI criterion in a a  a ^ a a ^ stochastic context and derive the speech envelope estimator that = arg max f (ajr) da AjR a  a ^ a ^ a a ^ maximizes it. We denote this estimator as the maximum mean | {z }| {z } envelope linear correlation (MMELC) estimator. Let A (m) E e(A) e(a ^) [ ] Ajr and R (m) denote random variables representing a clean and a noisy, respectively, one-third octave band magnitude, for band = arg max E e(A) e(a ^); Ajr a ^ j and time frame m. Furthermore, let (11) A (m) = [A (m N + 1); : : : A (m)] (6) j j where e() is a function that normalizes its vector argument to zero sample mean and unit norm and where we used that and for a given noisy observation r, a ^ is deterministic. Note that R (m) = [R (m N + 1); : : : R (m)] (7) j j the solution to Eq. (11) is non-unique. For one given solution, say a ^ , any affine transformation, a ^ + 1 8 ; 2 R, is be the stack of these random variables in random envelope also a solution, because any such transformation is undone by vectors. Finally, in a similar manner, let e(). Hence, in the following we focus on finding one such h i ^ ^ ^ A (m) = A (m N + 1); : : : A (m) ; (8) particular solution, namely the zero sample mean, unit norm j j j 4 solution, i.e. the vector e(a ^) that maximizes the inner product This is a standard assumption in the area of speech enhance- with the vector E [e(Ajr)]. To do so, let = E [e(Ajr)], ment, when operating in the STFT domain and has been Ajr Ajr the underlying assumption of a very large number of speech and let e(a ^ ) denote the zero sample mean, unit norm vector enhancement methods (see e.g. [18], [33], [34], [41], [42] and that maximizes Eq. (11). Then, using the method of Lagrange references therein). The conditional independence assumption multipliers, it can be shown (see Appendix A) that the MMELC is, for example, valid, when speech and noise STFT coefficients estimator is given by may be assumed statistically independent across time and a ^ = e(a ^ ) MMELC frequency and mutually independent [33], [34], [43]. Using Kolmogorovs strong law of large numbers [44, pp. (12) 67-68] and the conditional independence assumption, it can be shown (see Appendix B) that asymptotically, as N ! 1, the = ; expectation in Eq. (16) factorizes as k k " # which is nothing more than the vector , normalized to unit 1 T lim = lim E E [Z ] : (18) norm. The fact that  = 1 1 = 0 follows from Eq. (11), Ajr Ajr N!1 N!1 Z where it is seen that = E [e(Ajr)] is an expectation over Ajr Combining this result with Eq. (12) leads to vectors (a  ) a  whose sample mean is zero. By a a interpreting the expectation as an infinite linear combination lim a ^ = lim MMELC of such vectors, it follows that  = 0. N!1 N!1 h i E E [Z ] V. R ELATION TO STSA-MMSE E STIMATORS Ajr Ajr kZk h i = lim We now show that the MMELC estimator, Eq. (12), is N!1 1 E E [Z ] Ajr Ajr kZk asymptotically equivalent to the one-third octave band STSA- h i (19) MMSE estimator for large envelope lengths, i.e. as N ! 1. E E [Z ] Ajr Ajr kZk The STSA-MSE (e.g. [34]) is defined as h i = lim N!1 1 E E [Z ] 2 Ajr Ajr kZk = E A A : (13) MSE A;R E [Z ] Ajr = lim : N!1 It can be shown (e.g. [18], [33], [34]) that the optimal E [Z ] Ajr Bayesian estimator with respect to Eq. (13), is the STSA- MMSE estimator given by the conditional mean defined as Since Eq. (11) is invariant to affine transformations of its input arguments, we can scale a ^ with the scalar quantity MMELC a ^ = a f (ajr) da MMSE AjR kE [Z ]k in Eq. (19) to arrive at Ajr (14) = E [Ajr] : Ajr lim a ^ = E [Z ] : MMELC (20) Ajr N!1 To show that a ^ is asymptotically equivalent to a ^ , MMELC MMSE let us introduce the idempotent, symmetric matrix Finally, as N ! 1, the MMELC estimator a ^ is given MMELC by H = I 11 ; (15) lim a ^ = E [Z ] MMELC Ajr N!1 where I denotes the N -dimensional identity matrix. We can = E HAjr then rewrite the vector as Ajr " # a  1 a T = E I 11 Ajr = f (ajr) da Ajr AjR a ^ a ^ (21) Ha = E Ajr 11 Ajr = f (ajr) da Ajr AjR Ha " # (16) 1 HAjr = E Ajr 11 E Ajr Ajr Ajr = E Ajr HAjr = a ^  : MMSE a ^ " # MMSE In words, the MMELC estimator, a ^ , is (asymptotically = E ; MMELC Ajr in N ) an affine transformation of the STSA-MMSE estimator a ^ . In practice, this means that using the STSA-MMSE where Ajr is a random vector, and we introduced the notation MMSE estimator leads to the same approximate-STOI criterion value Z , HAjr. We now employ the following conditional as the estimator, a ^ , derived to maximize this criterion. independence assumption MMELC In other words, applying the traditional STSA-MMSE estimator leads to maximum speech intelligibility as reflected f (ajr) = f (a jr ): (17) A jR =r j j j j j AjR by the approximate STOI estimator. j=1 5 VI. EXPERIM ENTAL DESIGN A. Noise-free Speech Mixtures We have used the Wall Street Journal (WSJ0) speech corpus We now investigate empirically the relationship between [47] as the clean speech data for both the training set, validation the MMELC estimator in Eq. (14) and the STSA-MMSE set, and test set. Specifically, the noise-free utterances used for estimator in Eq. (11) using an experimental study. As defined training and validation are generated by randomly selecting in Eq. (11), the MMELC estimator is the vector that maximizes utterances from 44 male and 47 female speakers from the WSJ0 the expectation of the ELC cost function given by Eq. (10). This training set entitled si tr s. In total 20000 utterances are used expectation, Eq. (10), is defined via an integral of  (a; a ^) for for the training set and 2000 are used for the validation set, various realizations of a and a ^, and weighted by the joint PDF which adds up to approximately 37 hours of training data and 4 f (a; r). It is however, well known, that the integral may be A;R hours of validation data. For the test set, we have used a similar approximated (arbitrarily well) as a sum of  (a; a ^) terms, where approach and sampled 1000 utterances among 16 speakers (10 realizations of a and a ^ are drawn according to f (a; r). A;R males and 6 females) from the WSJ0 validation set si dt 05 and This is similar to what a DNN approximates during a standard evaluation set si et 05, which is equivalent to approximately 2 training process, where a gradient based optimization technique hours of data, see [48] for further details. The speakers used in is used to minimize the cost on a representative training set the training and validation sets are different than the speakers [45]. Therefore, training a DNN, e.g. using stochastic gradient used for test, i.e. we test in a speaker independent setting. ascent, to maximize Eq. (4) may be seen as an approximation Finally, since WSJ0 utterances primarily include speech active of Eq. (11), where the approximation becomes more accurate regions we do not apply a VAD. This is motivated by the fact with increasing training set size. that noise-only regions are irrelevant for STOI, as these are From the theoretical results presented in Sec. V, we would discarded by an ideal VAD in the STOI front-end [12]. therefore expect that, for some sufficiently large N , one would obtain equality in an ELC sense, between a DNN trained to B. Noise Types maximize an ELC cost function and one that is trained to minimize the classical STSA-MSE cost function. To validate To simulate a wide variety of sound scenes we have used this expectation we follow the techniques formalized in Secs. II six different noise types in our experiments: two synthetic and III and train DNNs to estimate gain vectors, g ^ , that noise signals and four natural noise signals, which are real-life j;m we apply to noisy one-third octave band magnitude envelope recordings of naturally occurring sound scenes. For the two signals r , to arrive at enhanced signals a ^ . synthetic noise signals, we use a stationary speech shaped j;m j;m In principle, any supervised learning model would be noise (SSN) signal and a highly non-stationary 6-speaker applicable for these experiments but considering the universal babble (BBL) noise. For the naturally occurring noise signals, function approximation capability of DNNs [46], this is our we use the street (STR), cafeteria (CAF), bus (BUS), and model of choice. We use short-time temporal one-third octave pedestrian (PED) noise signals from the CHiME3 dataset [49]. band envelope vectors, as defined in Eq. (3), and train multiple The SSN noise signal is Gaussian white noise, spectrally shaped DNNs, one for each of the J = 15 one-third octave bands, according to the long-term spectrum of the entire TIMIT speech for various N , to investigate if for sufficiently large N , DNNs corpus [50]. Similarly, the BBL noise signal is constructed by mixing utterances from both genders from TIMIT. To ensure trained with a STSA-MSE cost function approach the ELC that all noise types are equally represented and with unique values of DNNs trained with a cost function based on ELC. realizations in the training, validation and test sets, all six noise We construct two types of enhancement systems, one type is signals are split into non-overlapping segments such that 40 trained using the STSA-MSE cost function, denoted as ES , MSE min. is used for training, 5 min. is used for validation and and one that is trained using the ELC cost function denoted as another 5 min. is used for test. ES . Each of the systems consists of J = 15 DNNs, each ELC estimating a gain vector g ^ for a particular one-third octave j;m band directly from the STFT magnitudes of the noisy signal C. Noisy Speech Mixtures r(k; m), with the input context given by k = 1; : : : ; + 1, To construct the noisy speech signals used for training, we m N + 1 : : : ; m. This ensures that all DNNs have access to follow Eq. (1) and combine a noise-free training utterance x[n] the same information for a particular value of N , as they all with a randomly selected noise sequence v[n], of equal length, receive the same input data. Furthermore, we follow common from the training noise signal. We scale the noise signal v[n], practice (e.g. [6], [7], [16], [23]) and average overlapping to achieve a certain signal-to-noise ratio (SNR), according to estimated gain values, within a one-third octave band, during the active speech level of x[n] as defined by ITU P.56 [51]. enhancement. We found during a preliminary study that this For the training and validation sets, the SNRs are chosen technique consistently lead to slightly larger STOI scores for uniformly from [5; 10] dB to ensure that the intelligibility both types of systems. of the noisy speech mixtures y[n] ranges from degraded to To compute the STFT coefficients for all signals we use a perfectly intelligible. 10 kHz sample frequency and a K = 256 point STFT with a Hann-window size of 256 samples (25.6 ms) and a 128 D. Model Architecture and Training sample frame shift (12.8 ms). These coefficients are then used to compute one-third octave band envelopes for the clean and The two types of enhancement systems, ES and ES , ELC MSE noisy signals using Eq. (3). each consist of 15 feed-forward DNNs. The DNNs in the 6 ES system are trained with the ELC cost function intro- Note, since L(a; a ^) is invariant to the magnitude of ka ^k (see ELC duced in Eq. (4) and the DNNs in the ES system are Eq. (4)), and a and N are constants during training, the gradient MSE trained using the well-known STSA-MSE cost function given norm of the ELC cost function, Eq. (23), with respect to a ^, is by inversely proportional to the gradient norm of the STSA-MSE cost function, Eq. (27). This suggests that the two cost functions J (a; a ^) = ka a ^k ; (22) have different optimal learning rates. This observation might where the subscripts j and m are omitted for convenience. partly explain why equality with respect to STOI between STOI We train both the ES and ES systems with 20000 ELC MSE optimal and STSA-MSE optimal DNNs were achieved in [23] training utterances and 2000 validation utterances and both but not in [24]–[26], as [23] was the only study that explicitly data sets have been mixed uniformly with the SSN, BBL, CAF, stated that different learning rates for the two cost functions and STR noise signals, which ensures that each noise type were used. In fact, in [24]–[26] the optimization method Adam have been mixed with 25% of the utterances in the training [52] was used, and although Adam is an adaptive gradient and validation sets. During test, we evaluate each system with method, it still has several critical hyper-parameters that can one noise type at a time, i.e. each system is evaluated with influence convergence [53]. 1000 noisy test utterances per noise type, and since BUS and During a preliminary grid-search using the validation set PED are not included in the training and validation sets, these corrupted with SSN at an SNR of 0 dB and N = 30, we found two noise signals serve as unmatched noise types, whereas learning rates of 0:01 and 5 10 per sample to be optimal for SSN, BBL, CAF, and STR are matched noise types. This will the ES and ES systems, respectively. During training, ELC MSE allow us to study how the ELC optimal DNNs and STSA-MSE the cost on the validation set was evaluated for each epoch optimal DNNs generalize to unmatched noise types. and the learning rates were scaled by 0:7, if the cost increased Each feed-forward DNN consists of three hidden layers with compared to the cost for the previous epoch. The training 512 units using ReLU activation functions. The N -dimensional was terminated, if the learning rate was below 10 . We output layer uses sigmoid functions which ensures that the implemented the DNNs using CNTK [54] and the scripts output gain g ^ is confined between zero and one. The needed to reproduce the reported results can be found in [48]. j;m DNNs are trained using stochastic gradient de-/ascent with Note, the goal of these experiments is not to achieve state- the backpropagation technique and batch normalization [45]. of-the-art enhancement performance. In fact, increasing the The DNNs are trained for a maximum of 200 epochs with a size of the dataset or DNNs might likely improve performance, minibatch size of 256 randomly selected short-time temporal although we have not reason to believe it will change the one-third octave band envelope vectors. conclusion. Since the ES and ES systems use different cost ELC MSE functions, they likely have different optimal learning rates. This VII. E XPERIM ENTAL RESULTS is easily seen from the gradient norms of the two cost functions. To study the relationship between ES and ES ELC MSE It can be shown (details omitted due to space limitations) that systems as function of N , we have trained multiple systems for the ` -norm of the gradient of the ELC cost function in Eq. (4), various N . Specifically, a total of eight ES systems and ELC with respect to the desired signal vector a ^, is given by eight ES systems have been trained with N taking the p MSE 1L(a; a ^) values N = f4; 7; 15; 20; 30; 40; 50; 80g, which correspond to krL(a; a ^)k = ; (23) temporal envelope vectors with durations from approximately ka ^k 50 to 1000 milliseconds. where the gradient rL(a; a ^) is given by rL a; a ^ = A. Comparing One-third Octave Bands " # (24) @L a; a ^ @L a; a ^ @L a; a ^ In Fig. 2 we present the ELC scores, as function of envelope ; ; : : : ; ; @a ^ @a ^ @a ^ 1 2 N duration N , for each of the J = 15 one-third octave band DNNs in the ES and ES systems. All DNNs are ELC MSE and tested using speech corrupted with BBL noise at an SNR of 0 @L a; a ^ dB. First, we observe that both systems manage to improve the @a ^ ELC score considerably, when compared to the ELC score of (25) L a; a ^ a ^ L a; a ^ a  m the noisy speech signals, i.e. both systems enhance the noisy m a ^ T T speech, which is in line with known results [8]. a ^  a  a ^  a ^ a ^ a ^ a ^ Furthermore, we can observe that the DNNs trained with the is the partial derivative of L(a; a ^) with respect to entry m ELC cost function, i.e. the ES systems, in general achieve ELC of vector a ^. Similarly, the gradient of the STSA-MSE cost higher, or similar, ELC scores than the DNNs trained with the function in Eq. (22) is given by STSA-MSE cost function, i.e. the ES systems. This is an MSE important observation, since it verifies that DNNs trained to (26) rJ a; a ^ = (a a ^) ; maximize ELC indeed achieve the highest, or similar, ELC scores compared to DNNs trained to optimize a different cost such that function, STSA-MSE in this case. Finally, and most importantly, (27) rJ a; a ^ = ka a ^k : we observe that the difference in ELC score between the N 7 Band 1 (CF: 150 Hz) Band 2 (CF: 189 Hz) Band 3 (CF: 238 Hz) Band 4 (CF: 300 Hz) Band 5 (CF: 378 Hz) 0.8 0.8 0.8 0.8 0.8 0.7 0.6 0.6 0.6 0.6 0.6 0.5 0.4 0.4 0.4 0.4 0.4 0.2 0.2 0.2 0.2 4 15 30 50 80 4 15 30 50 80 4 15 30 50 80 4 15 30 50 80 4 15 30 50 80 Band 6 (CF: 476 Hz) Band 7 (CF: 600 Hz) Band 8 (CF: 756 Hz) Band 9 (CF: 952 Hz) Band 10 (CF: 1200 Hz) 0.8 0.8 0.8 0.8 0.8 0.6 0.6 0.6 0.6 0.6 0.4 0.4 0.4 0.4 0.4 0.2 0.2 0.2 0.2 0.2 4 15 30 50 80 4 15 30 50 80 4 15 30 50 80 4 15 30 50 80 4 15 30 50 80 Band 11 (CF: 1512 Hz) Band 12 (CF: 1905 Hz) Band 13 (CF: 2400 Hz) Band 14 (CF: 3024 Hz) Band 15 (CF: 3810 Hz) 0.8 0.8 0.8 0.8 0.8 0.7 0.7 0.7 0.7 0.6 0.6 0.6 0.6 0.6 0.5 0.5 0.4 0.5 0.5 0.4 0.4 0.2 0.3 0.3 0.4 0.4 4 15 30 50 80 4 15 30 50 80 4 15 30 50 80 4 15 30 50 80 4 15 30 50 80 Fig. 2. ELC values for ES and ES systems trained using various envelope durations, N , and tested with corresponding values of N using speech ELC MSE corrupted with BBL noise at an SNR of 0 dB. Each figure shows one out of J = 15 one-third octave band DNNs (center frequency (CF) shown in parenthesis). It is seen that as N ! 80 the difference between the ES DNNs and ES DNNs, as measured by ELC, tends to zero. This is in line with the ELC MSE theoretical results of Sec. V. ES and ES DNNs generally decrease with increasing durations and noise types, which indicate that our test set is ELC MSE N . For N = 80 the ELC score of the ES and ES sufficiently large to provide accurate estimates of the true mean ELC MSE DNNs practically coincide. ELC difference. Similarly to Fig. 2, the results in Fig. 3 support the theoretical results of Sec. V. Additionally, the results in B. Comparing ELC across Noise Types Fig. 3 show consistency across multiple noise types, which suggests that the theory in practice applies for various noise In Fig. 3 we present the ELC score difference, as function of type distributions. envelope duration N , for ES and ES systems, when ELC MSE tested using speech material corrupted with various noise types C. Comparing STOI across Noise Types at an SNR of 0 dB. Specifically, we compute the difference in ELC score for each pair of one-third octave band DNNs in the We now investigate if the global behavior observed for ES and ES systems, and then compute the average approximate-STOI, i.e. ELC, in Fig. 3 also applies for real ELC MSE ELC difference as function of envelope duration N . We do this STOI. To do this, we reconstruct the test signals used for Fig. 3 for all the 1000 test utterances and for each of the six noise in the time domain. We follow the technique proposed in [23], types introduced in Sec VI-B: SSN, BBL, CAF, STR, BUS, where a uniform gain across STFT coefficients within a one- and PED. Finally, we compute the 95% confidence interval (CI) third octave band is used before an inverse DFT is applied on the mean ELC difference. using the phase of the noisy signal. In Table I we present the From Fig. 3 we observe that the average ELC difference, STOI scores for ES and ES systems, as a function of ELC MSE i.e. ES ES , appears to be monotonically decreasing N , when tested using speech material corrupted with different ELC MSE with respect to the duration of the envelope N . Furthermore, noise types at an SNR of 0 dB. Note that these test signals we observe that the average ELC difference approaches zero are similar to the test signals used for Fig. 3 except that we as the duration of the envelope N increases, and similarly to now evaluate them according to STOI and not ELC. Fig. 2, for N = 80, the difference between the ES and From Table I we observe that the average STOI difference ELC ES systems is close to zero. Finally, we observe that the between the ES and ES systems is maximum for MSE ELC MSE 95% confidence intervals are relatively narrow for all envelope N = 4, but quickly tends to zero as N increases and for 8 Noise type: SSN Noise type: BBL Noise type: CAF 0.08 0.08 0.08 95% CI 95% CI 95% CI 0.06 0.06 0.06 0.04 0.04 0.04 0.02 0.02 0.02 0 0 0 4 15 30 50 80 4 15 30 50 80 4 15 30 50 80 Noise type: STR Noise type: BUS Noise type: PED 0.08 0.08 0.08 95% CI 95% CI 95% CI 0.06 0.06 0.06 0.04 0.04 0.04 0.02 0.02 0.02 0 0 0 4 15 30 50 80 4 15 30 50 80 4 15 30 50 80 Fig. 3. Average ELC differences, as function of envelope durations N , between ES and ES systems, for different noise types. We observe a ELC MSE monotonic decreasing relationship between the average ELC difference and the envelope length and for N = 80, the average ELC difference between the ES and ES systems is close to zero. This is in line with the theoretical results of Sec. V. ELC MSE N  15, the STOI difference is practically zero, i.e.  0:01. TABLE I STOI SCORES AS FUNCTION OF N FOR ES AND ES SYSTEMS ELC MSE Also, we observe that the gap in STOI between the ES and ELC TESTED USING DIFFERENT NOISE TYPES AT AN SNR OF 0 DB. ES systems closes faster at a lower value of N in Table I MSE compared to Fig. 3. We believe this is due to the transformation N : 4 7 15 20 30 40 50 80 of the, potentially ”invalid”, sequences of (e.g. [55], [56]) ELC : 0.81 0.85 0.88 0.88 0.87 0.86 0.85 0.84 SSN: modified magnitude spectra, when reconstructing enhanced MSE : 0.84 0.87 0.87 0.87 0.87 0.86 0.85 0.84 time-domain signals, whose intelligibility is estimated by STOI ELC : 0.77 0.80 0.82 0.82 0.81 0.80 0.80 0.78 BBL: in Table I. Therefore, STOI in Table I might be computed MSE : 0.79 0.82 0.82 0.82 0.81 0.80 0.80 0.78 based on slightly different magnitude spectra compared to the ELC : 0.82 0.85 0.87 0.87 0.86 0.85 0.84 0.83 CAF: magnitude spectra used for computing the ELC scores in Fig. 3. MSE : 0.85 0.87 0.87 0.87 0.86 0.85 0.85 0.84 Furthermore, we observe that the ES achieve slightly MSE ELC : 0.83 0.86 0.88 0.89 0.88 0.87 0.87 0.85 STR: higher STOI scores than the ES systems for N = 4, which ELC MSE : 0.86 0.88 0.88 0.88 0.88 0.87 0.87 0.85 might be due to sub-optimal learning rates as the ones actually ELC : 0.77 0.81 0.83 0.83 0.83 0.82 0.81 0.80 PED: used during training of the systems at, e.g. N = 4, were found MSE : 0.80 0.82 0.83 0.83 0.82 0.82 0.81 0.80 based on a grid-search using systems with N = 30 (see Sec. ELC : 0.87 0.89 0.90 0.91 0.90 0.89 0.89 0.89 BUS: VI.D). More importantly, the maximum improvement in STOI MSE : 0.89 0.90 0.90 0.90 0.90 0.90 0.89 0.89 is achieved for N = f15; 20; 30g, where both systems achieve similar STOI scores. Finally, while the theoretical results of Sec. V show that approximate-STOI performance of a ^ MMELC results in Sec. V. However, the results in Sec. V predict that and a ^ is identical, asymptotically, for N ! 1, the MMSE not only do ES , and ES systems produce identical ELC MSE empirical results in Table I suggest that N  15 is sufficient for ELC scores, they also predict that the systems are, in fact, practical equality to hold for DNN based speech enhancement essentially identical, i.e. up to an affine transformation. Hence, systems. in this section, we compare how the systems actually operate. Specifically, we compare the gains estimated by ES ELC D. Comparing Gain-Values systems with gains estimated by ES systems. MSE Figures 2 and 3, and Table I show that ES systems In Fig. 4 we present scatter plots, one for each one-third ELC achieve approximately the same ELC and STOI values as octave band for pairs of gains estimated by ES and ES ELC MSE ES systems and that the ELC and STOI difference systems tested with BBL noise at an SNR of 5 dB. Each scatter MSE between the two types of systems approach zero as N becomes plot consists of 10000 pairs of gains acquired by sampling 10 large. These empirical results are in line with the theoretical gain-pairs randomly and uniformly distributed from each of the 9 TABLE II speech enhancement systems, when the DNNs are trained to S AMPLE CORRELATIONS BETWEEN GAINS FROM ES AND ES ELC MSE either maximize ELC or minimize MSE and the systems are SYSTEMS WITH N = 30. S EE F IG. 4 FOR PER BAND CORRELATIONS. evaluated using both ELC and STOI. Finally, our experimental findings suggest, that applying the traditional STSA-MMSE SNR SSN BBL CAF STR BUS PED estimator on noisy speech signals in practice leads to essentially [dB] maximum speech intelligibility as reflected by the STOI speech -5 0.94 0.87 0.89 0.93 0.87 0.90 intelligibility estimator. 0 0.94 0.92 0.92 0.93 0.88 0.92 5 0.95 0.95 0.93 0.93 0.90 0.92 10 0.95 0.95 0.92 0.92 0.91 0.93 APPENDIX A M AXIM IZING A CONSTRAINED INNER PRODUCT 1000 test utterances. In Fig. 4, yellow indicates high density This appendix derives an expression for the zero-mean, unit- of gain-pairs and dark blue indicates low density. From Fig. 4 norm vector e(a ^), which maximizes the inner product with it is seen that a correlation no smaller than 0:88 is achieved the vector E [e(Ajr)]. For notational convenience, let = Ajr for all 15 one-third octave bands. The highest correlation of E [e(Ajr)], and = e(a ^). The constrained optimization Ajr r = 0:98 is achieved by bands 5 to 7 and the lowest is r = 0:88 problem from Eq. (11) is then defined as achieved by band 2 followed by band 1 with r = 0:89. It is also seen that a large number of gain values are either zero, or maximize one, as one would expect due to the sparse nature of speech T (28) in the T-F domain. However, although a strong correlation subject to 1 = 0; is observed for all bands, the gain-pairs are slightly more = 1: scattered at the first few bands than for the remaining bands. This might be explained simply by the fact that low one-third The vector that solves Eq. (28) can be found using octave bands correspond to single STFT bins, whereas higher the method of Lagrange multipliers [57]. Introducing two one-third octave bands are sums of a large number of STFT scalar Lagrange multipliers,  and  , for the two equality 1 2 bins. This, in turn, may have the consequence that for finite constraints, the Lagrangian is given by N (N = 30), Kolmogorovs strong law of large numbers (see T T T Appendix. B) is better valid at higher frequencies than at lower L( ;  ;  ) = +  1 +  ( 1): (29) 1 2 1 2 frequencies (so that gain vectors produced by one system is @L closer to an affine transformation of gain vectors produced by Setting the partial derivatives equal to zero the other system). In fact, if we compute r for models trained with N = 50, we get r = 0:93, i.e. increased correlation @L between the gain vectors produced by the two systems. Finally, = +  1 + 2 = 0; (30) 1 2 in Table. II we present average correlation coefficients and we observe correlation coefficients  0:87 for all, both matched and solving for , we arrive at and unmatched, noise types, at multiple SNRs. = : (31) VIII. CONCLUSION 2 @L @L This study is motivated by the fact that most estimators Using the same approach for and , substituting in @ @ 1 2 used for speech enhancement, being either data-driven models, Eq. (31) and solving for  , and  such that the two constraints 1 2 e.g. deep neural networks (DNNs), or statistical model-based are fulfilled, we find techniques such as the short-time spectral amplitude minimum mean-square error (STSA-MMSE) estimator, use the STSA T (32) = 1 =  ; mean-square error (MSE) cost function as a performance indicator. Short-time objective intelligibility (STOI), a state- and of-the-art speech intelligibility estimator, on the other hand, k  1k (33) rely on the envelope linear correlation (ELC) of speech temporal = : envelopes. Since the primary goal of many speech enhancement systems is to improve speech intelligibility, it raises the question Inserting  and  into Eq. (31) results in 1 2 if estimators can benefit from an ELC cost function. In this paper we derive the maximum mean envelope linear = ; (34) correlation (MMELC) estimator and study its relationship to k  1k the well-known STSA-MMSE estimator. We show theoretically that the MMELC estimator, under a commonly used conditional which is simply the vector , normalized to zero sample mean independence assumption, is asymptotically equivalent to and unit norm. the STSA-MMSE estimator. Furthermore, we demonstrate 2 T experimentally that this relationship also holds for DNN based We solve the equivalent problem that minimizes . 10 1 1 1 1 1 0 0 0 0 0 0 1 0 1 0 1 0 1 0 1 1 1 1 1 1 0 0 0 0 0 0 1 0 1 0 1 0 1 0 1 1 1 1 1 1 0 0 0 0 0 0 1 0 1 0 1 0 1 0 1 Fig. 4. Scatter plots based on gain values from ES and ES systems with an envelope length of N = 30. Dark blue indicate low density and bright ELC MSE ^ ^ ^ yellow indicate high density. The systems are tested with BBL noise corrupted speech at an SNR of 5 dB. Each figure shows one of 15 (g ; g ; : : : ; g ) 1 2 15 one-third octave bands. A correlation no smaller than 0:88 is achieved for all one-third octave bands, which indicates that the ES and ES systems ELC MSE estimate fairly similar gain vectors. APPENDIX B We can rewrite the factors on the right-hand side of Eq. (39) FACTORIZATION OF E XPECTATION as follows h i This appendix shows that the expectation in Eq. (16) factor- E [Z ] = E h Y izes into the product of expectations in Eq. (18), asymptotically as N ! 1. Let = E S 1 Y Y , Ajr; (35) 1 (40) = E [S ] 1 E [Y ] and N 1 N H , I 11 ; (36) = E [S ] E [S ] ; i j j=1 so that 2 3 " # Z = HY ; (37) 1 1 4 5 E = E Z T T where I denotes the N -dimensional identity matrix and Ajr N Y HH Y is a random vector distributed according to the conditional 2 3 probability density function f (ajr). A specific element Z , AjR 4 5 = E q of Z is then given by Y HY 2 3 Z = h Y (38) 1 T 4 5 = E q = S 1 Y ; T T 1 T Y Y Y 11 Y (41) 2 3 where h is the ith column of matrix H . We now define the covariance between Z and 1=kZk as i 6 7 6 7 = E " " # # 4 5 P P N N 1 1 1 2 1 S S cov(Z ; ) , E Z E [Z ] E j=1 j N j=1 i i i Z Z Z 2 3 " # " # Z 1 6 7 i N 6 7 = E E [Z ] E : i = E r ; 4   5 Z Z 2 P P N N 1 1 S S j=1 j j=1 N N (39) 11 and REFERENCES 2 3 [1] H. Erdogan, J. R. Hershey, S. Watanabe, and J. L. Roux, “Deep Recurrent " # 1 1 S S Networks for Separation and Recognition of Single-Channel Speech 6 i j 7 j=1 Z N N 6 7 in Nonstationary Background Audio,” in New Era for Robust Speech E = E r : (42) 4   5 Z Recognition. Springer, 2017, pp. 165–186. P P N N 1 1 S S [2] D. Wang, “Deep learning reinvents the hearing aid,” IEEE Spectrum, j=1 j j=1 N N vol. 54, no. 3, pp. 32–37, 2017. [3] D. Wang and J. Chen, “Supervised Speech Separation Based on Deep In Eqs. (40), (41) and (42) two different sums of random Learning: An Overview,” arXiv:1708.07524, 2017. variables occur, [4] M. Kim and P. Smaragdis, “Bitwise Neural Networks for Efficient Single- Channel Source Separation,” in Proc. NIPS Machine Learning for Audio Signal Processing Workshop, 2017. [5] R. Fakoor, X. He, I. Tashev, and S. Zarar, “Reinforcement Learning To S ; (43) N Adapt Speech Enhancement to Instantaneous Input Signal Quality,” in j=1 Proc. NIPS Machine Learning for Audio Signal Processing Workshop, and [6] J. Chen, Y. Wang, S. E. Yoho, D. Wang, and E. W. Healy, “Large-scale training to increase speech intelligibility for hearing-impaired listeners in novel noises,” J. Acoust. Soc. Am., vol. 139, no. 5, pp. 2604–2612, S : (44) j=1 [7] E. W. Healy, M. Delfarah, J. L. Vasko, B. L. Carter, and D. Wang, “An algorithm to increase intelligibility for hearing-impaired listeners in the Since, by assumption, Eq. (17), S 8 j are independent random presence of a competing talker,” J. Acoust. Soc. Am., vol. 141, no. 6, pp. 3 4230–4239, 2017. variables with finite variances , according to Kolmogorovs [8] M. Kolbæk, Z. H. Tan, and J. Jensen, “Speech Intelligibility Potential strong law of large numbers [44], the sums given by Eqs. (43) of General and Specialized Deep Neural Network Based Speech and (44) will converge (almost surely, i.e. with probability (Pr) Enhancement Systems,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 25, no. 1, pp. 153–167, 2017. one) to their average means  = E[S ], and  2 = S j S j=1 [9] J. Schnupp, E. Nelken, and A. King, Auditory Neuroscience - Making E[S ], respectively, as N ! 1. Formally, we can Sense of Sound. MIT Press, 2011. j=1 j [10] B. Moore, An Introduction to the Psychology of Hearing. Brill, 2013. express this as [11] R. D. Patterson, K. Robinson, J. Holdsworth, D. Mckeown, C. Zhang, 0 1 and M. Allerhand, “Complex sounds and auditory images,” in In Proc. 1 International Symposium on Hearing, 1992, pp. 429–446. @ A Pr lim S =  = 1; (45) j S [12] C. H. Taal, R. C. Hendriks, R. Heusdens, and J. Jensen, “An Algorithm N!1 N for Intelligibility Prediction of Time-Frequency Weighted Noisy Speech,” j=1 IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 19, no. 7, pp. 2125–2136, 2011. and [13] T. M. Elliott and F. E. Theunissen, “The Modulation Transfer Function 0 1 for Speech Intelligibility,” PLOS Computational Biology, vol. 5, no. 3, @ A Pr lim S =  2 = 1: (46) j S [14] R. Drullman, J. M. Festen, and R. Plomp, “Effect of temporal envelope N!1 j=1 smearing on speech reception,” J. Acoust. Soc. Am., vol. 95, no. 2, pp. 1053–1064, 1994. [15] J. S. Lim and A. V. Oppenheim, “Enhancement and bandwidth com- By substituting Eqs. (45), and (46) into Eqs. (40), (41) and pression of noisy speech,” Proceedings of the IEEE, vol. 67, no. 12, pp. (42), we arrive at 1586–1604, 1979. [16] E. W. Healy, S. E. Yoho, J. Chen, Y. Wang, and D. Wang, “An algorithm lim E [Z ] = E [S ]  ; i i S to increase speech intelligibility for hearing-impaired listeners in novel (47) N!1 segments of the same noise type,” J. Acoust. Soc. Am., vol. 138, no. 3, pp. 1660–1669, 2015. " # 1 [17] P. C. Loizou, “Speech Enhancement Based on Perceptually Motivated lim Bayesian Estimators of the Magnitude Spectrum,” IEEE/ACM Trans. N!1 (48) lim E = p ; Audio, Speech, Lang. Process., vol. 13, no. 5, pp. 857–869, 2005. N!1 [18] R. C. Hendriks, T. Gerkmann, and J. Jensen, “DFT-Domain Based Single- Microphone Noise Reduction for Speech Enhancement: A Survey of the and State of the Art,” Synth. Lect. on Speech and Audio Process., vol. 9, no. 1, pp. 1–80, 2013. " # 1 [19] L. Lightburn and M. Brookes, “SOBM - a binary mask for noisy speech lim that optimises an objective intelligibility metric,” in Proc. ICASSP, 2015, i N!1 lim E = (E [S ]  ) i S pp. 5078–5082. N!1 [20] W. Han, X. Zhang, G. Min, X. Zhou, and W. Zhang, “Perceptual (49) " # weighting deep neural networks for single-channel speech enhancement,” in Proc. WCICA, 2016, pp. 446–450. = lim E [Z ] E ; N!1 [21] P. G. Shivakumar and P. Georgiou, “Perception Optimized Deep Denoising AutoEncoders for Speech Enhancement - Semantic Scholar,” in Proc. INTERSPEECH, 2016, pp. 3743–3747. where the last line follows from Eq. (47) and (48). In words, [22] Y. Koizumi, K. Niwa, Y. Hioka, K. Kobayashi, and Y. Haneda, “DNN- as N ! 1, the covariance between Z and 1=kZk tends to i based source enhancement self-optimized by reinforcement learning using sound quality measurements,” in Proc. ICASSP, 2017, pp. 81–85. zero and, consequently, the expectation in Eq. (16) factorizes [23] M. Kolbæk, Z.-H. Tan, and J. Jensen, “Monaural Speech Enhancement into the product of expectations in Eq. (18). using Deep Neural Networks by Maximizing a Short-Time Objective Intelligibility Measure,” in Proc. ICASSP, 2018, pp. 5059 – 5063. Assuming a finite variance of S is motivated by the fact that S model [24] Y. Zhao, B. Xu, R. Giri, and T. Zhang, “Perceptually Guided Speech j j speech signals, which always take finite values due to both physical and Enhancement using Deep Neural Networks,” in Proc. ICASSP, 2018, pp. physiological limitations of sound and speech production systems, respectively. 5074–5078. 12 [25] H. Zhang, X. Zhang, and G. Gao, “Training Supervised Speech Separation [51] ITU, “Rec. P.56 : Objective measurement of active speech level,” 1993, System to Improve STOI and PESQ Directly,” in Proc. ICASSP, 2018, https://www.itu.int/rec/T-REC-P.56/. pp. 5374–5378. [52] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” [26] S. W. Fu, T. W. Wang, Y. Tsao, X. Lu, and H. Kawai, “End-to- in Proc. ICLR (arXiv:1412.6980), 2014. End Waveform Utterance Enhancement for Direct Evaluation Metrics [53] A. C. Wilson, R. Roelofs, M. Stern, N. Srebro, and B. Recht, “The Optimization by Fully Convolutional Neural Networks,” IEEE/ACM Marginal Value of Adaptive Gradient Methods in Machine Learning,” in Trans. Audio, Speech, Lang. Process., vol. 26, no. 9, pp. 570 – 1584, Proc. NIPS, 2017. [54] A. Agarwal et al., “An introduction to computational networks and the [27] A. W. Rix, J. G. Beerends, M. P. Hollier, and A. P. Hekstra, “Perceptual computational network toolkit,” Microsoft Technical Report fMSR-TRg- evaluation of speech quality (PESQ)-a new method for speech quality 2014-112, Tech. Rep., 2014. assessment of telephone networks and codecs,” in Proc. ICASSP, vol. 2, [55] S. Nawab, T. Quatieri, and J. Lim, “Signal reconstruction from short-time 2001, pp. 749–752. Fourier transform magnitude,” IEEE Trans. Acoust., Speech, and Sig. [28] S. Jørgensen, J. Cubick, and T. Dau, “Speech Intelligibility Evaluation Process., vol. 31, no. 4, pp. 986–998, 1983. for Mobile Phones.” Acustica United with Acta Acustica, vol. 101, pp. [56] D. Griffin and J. Lim, “Signal estimation from modified short-time 1016–1025, 2015. Fourier transform,” IEEE Trans. Acoust., Speech, and Sig. Process., [29] J. Jensen and C. H. Taal, “An Algorithm for Predicting the Intelligibility vol. 32, no. 2, pp. 236–243, 1984. of Speech Masked by Modulated Noise Maskers,” IEEE/ACM Trans. [57] S. Boyd and L. Vandenberghe, Convex Optimization. Cambridge Audio, Speech, Lang. Process., vol. 24, no. 11, pp. 2009–2022, 2016. University Press, 2004. [30] ——, “Speech Intelligibility Prediction Based on Mutual Information,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 22, no. 2, pp. 430–440, 2014. Morten Kolbæk received the B.Eng. degree in [31] T. H. Falk et al., “Objective Quality and Intelligibility Prediction for electronic design at Aarhus University, Business and Users of Assistive Listening Devices: Advantages and limitations of Social Sciences, AU Herning, Denmark, in 2013 existing tools,” IEEE Sig. Process. Mag., vol. 32, no. 2, pp. 114–124, and the M.Sc. in signal processing and computing from Aalborg University, Denmark, in 2015. He is [32] R. Xia, J. Li, M. Akagi, and Y. Yan, “Evaluation of objective intelligibility currently pursuing his PhD degree at the section for prediction measures for noise-reduced signals in mandarin,” in Proc. Signal and Information Processing at the Department ICASSP, 2012, pp. 4465–4468. of Electronic Systems, Aalborg University, Denmark. [33] P. C. Loizou, Speech Enhancement: Theory and Practice. CRC Press, His research interests include speech enhancement and separation, deep learning, and intelligibility [34] Y. Ephraim and D. Malah, “Speech enhancement using a minimum- improvement of noisy speech. mean square error short-time spectral amplitude estimator,” IEEE Trans. Acoust., Speech, and Sig. Process., vol. 32, no. 6, pp. 1109–1121, 1984. [35] G. Kim, Y. Lu, Y. Hu, and P. C. Loizou, “An algorithm that improves Zheng-Hua Tan (M’00–SM’06) received the B.Sc. speech intelligibility in noise for normal-hearing listeners,” J. Acoust. and M.Sc. degrees in electrical engineering from Soc. Am., vol. 126, no. 3, pp. 1486–1494, 2009. Hunan University, Changsha, China, in 1990 and [36] K. Han and D. Wang, “A classification based approach to speech 1996, respectively, and the Ph.D. degree in electronic segregation,” J. Acoust. Soc. Am., vol. 132, no. 5, pp. 3475–3483, 2012. engineering from Shanghai Jiao Tong University, [37] J. Allen, “Short term spectral analysis, synthesis, and modification Shanghai, China, in 1999. He is a Professor and a Co- by discrete Fourier transform,” IEEE Trans. Acoust., Speech, and Sig. Head of the Centre for Acoustic Signal Processing Process., vol. 25, no. 3, pp. 235–238, 1977. Research (CASPR) at Aalborg University, Aalborg, [38] C. H. Taal, R. C. Hendriks, and R. Heusdens, “Matching pursuit for Denmark. He was a Visiting Scientist at the Computer channel selection in cochlear implants based on an intelligibility metric,” in Proc. EUSIPCO, 2012, pp. 504–508. Science and Artificial Intelligence Laboratory, MIT, [39] A. H. Andersen, J. M. d. Haan, Z. H. Tan, and J. Jensen, “Predicting Cambridge, USA, an Associate Professor at Shanghai the Intelligibility of Noisy and Nonlinearly Processed Binaural Speech,” Jiao Tong University, and a postdoctoral fellow at KAIST, Daejeon, Korea. IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 24, no. 11, pp. His research interests include machine learning, deep learning, pattern 1908–1920, 2016. recognition, speech and speaker recognition, noise-robust speech processing, [40] S. M. Kay, Fundamentals of Statistical Signal Processing: Estimation multimodal signal processing, and social robotics. He is a member of the IEEE Theory. Prentice Hall, 2010. Signal Processing Society Machine Learning for Signal Processing Technical [41] Y. Ephraim and D. Malah, “Speech enhancement using a minimum Committee (MLSP TC). He is an Editorial Board Member for Computer mean-square error log-spectral amplitude estimator,” IEEE Trans. Acoust., Speech and Language and was a Guest Editor for the IEEE Journal of Selected Speech, and Sig. Process., vol. 33, no. 2, pp. 443–445, 1985. Topics in Signal Processing and Neurocomputing. He was the General Chair [42] J. S. Erkelens, R. C. Hendriks, R. Heusdens, and J. Jensen, “Minimum for IEEE MLSP 2018 and a TPC co-chair for IEEE SLT 2016. Mean-Square Error Estimation of Discrete Fourier Coefficients With Generalized Gamma Priors,” IEEE/ACM Trans. Audio, Speech, Lang. Process., vol. 15, no. 6, pp. 1741–1752, 2007. Jesper Jensen received the M.Sc. degree in electrical [43] R. McAulay and M. Malpass, “Speech enhancement using a soft-decision engineering and the Ph.D. degree in signal processing noise suppression filter,” IEEE Trans. Acoust., Speech, and Sig. Process., from Aalborg University, Aalborg, Denmark, in 1996 vol. 28, no. 2, pp. 137–145, 1980. and 2000, respectively. From 1996 to 2000, he was [44] P. K. Sen and J. M. Singer, Large Sample Methods in Statistics: An with the Center for Person Kommunikation (CPK), Introduction with Applications. Chapman & Hall, 1994. Aalborg University, as a Ph.D. student and Assistant [45] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, Research Professor. From 2000 to 2007, he was a 2016. Post-Doctoral Researcher and Assistant Professor [46] K. Hornik, M. Stinchcombe, and H. White, “Multilayer feedforward with Delft University of Technology, Delft, The networks are universal approximators,” Neural Networks, vol. 2, no. 5, Netherlands, and an External Associate Professor pp. 359–366, 1989. with Aalborg University. Currently, he is a Senior [47] J. Garofolo, D. Graff, P. Doug, and D. Pallett, “CSR-I (WSJ0) Complete Principal Scientist with Oticon A/S, Copenhagen, Denmark, where his main LDC93s6a,” 1993, philadelphia: Linguistic Data Consortium. responsibility is scouting and development of new signal processing concepts [48] M. Kolbæk, Z.-H. Tan, and J. Jensen, “Supplemental Material.” [Online]. for hearing aid applications. He is a Professor with the Section for Signal Available: http://kom.aau.dk/ mok/taslp2018 and Information Processing (SIP), Department of Electronic Systems, at [49] J. Barker, R. Marxer, E. Vincent, and S. Watanabe, “The third ‘CHiME’ Aalborg University. He is also a co-founder of the Centre for Acoustic Signal speech separation and recognition challenge: Dataset, task and baselines,” Processing Research (CASPR) at Aalborg University. His main interests are in Proc. ASRU, 2015, pp. 504–511. in the area of acoustic signal processing, including signal retrieval from noisy [50] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, D. S. Pallett, and observations, coding, speech and audio modification and synthesis, intelligibility N. L. Dahlgren, “DARPA TIMIT Acoustic Phonetic Continuous Speech enhancement of speech signals, signal processing for hearing aid applications, Corpus CDROM,” 1993. and perceptual aspects of signal processing.

Journal

Electrical Engineering and Systems SciencearXiv (Cornell University)

Published: Jun 21, 2018

There are no references for this article.