On Loss Functions for Supervised Monaural Time-Domain Speech Enhancement

On Loss Functions for Supervised Monaural Time-Domain Speech Enhancement On Loss Functions for Supervised Monaural Time-Domain Speech Enhancement Morten Kolbæk, Zheng-Hua Tan, Senior Member, IEEE, Søren Holdt Jensen, and Jesper Jensen Abstract—Many deep learning-based speech enhancement al- acoustical conditions [6], [13]–[19]. However, despite the recent gorithms are designed to minimize the mean-square error (MSE) success of deep learning-based speech enhancement algorithms, in some transform domain between a predicted and a target many of the techniques referenced above are fundamentally speech signal. However, optimizing for MSE does not necessarily limited, as they primarily focus on enhancement in the guarantee high speech quality or intelligibility, which is the ulti- short-time spectral amplitude (STSA) domain and therefore mate goal of many speech enhancement algorithms. Additionally, only little is known about the impact of the loss function on ignore potentially useful phase information. Numerous recent the emerging class of time-domain deep learning-based speech deep learning-based speech enhancement techniques exist, enhancement systems. however, that incorporate phase information (e.g. [20]–[22]). We study how popular loss functions influence the performance The most successful approaches to date are arguably end- of time-domain deep learning-based speech enhancement systems. to-end techniques based on fully convolutional neural net- First, we demonstrate that perceptually inspired loss functions might be advantageous over classical loss functions like MSE. works (FCNN) that do not apply the short-time discrete Fourier Furthermore, we show that the learning rate is a crucial design transform (STFT) or other pre-processing stages, but operate parameter even for adaptive gradient-based optimizers, which directly in the time-domain (e.g. [23]–[30]). These techniques, has been generally overlooked in the literature. 
Also, we found however, might still be limited as most of them rely on a loss that waveform matching performance metrics must be used with function based on the mean square error (MSE) between time- caution as they in certain situations can fail completely. Finally, we show that a loss function based on scale-invariant signal- domain waveforms. This is most likely suboptimal with respect to-distortion ratio (SI-SDR) achieves good general performance to speech quality and intelligibility, as time-domain MSE has across a range of popular speech enhancement evaluation metrics, no apparent relation to human perception or the human auditory which suggests that SI-SDR is a good candidate as a general- system in general. Furthermore, as the works above use widely purpose loss function for speech enhancement systems. different network architectures, development datasets, noise Index Terms—Speech Enhancement, Fully Convolutional Neu- types, hyperparameters, etc., it is not yet established how the ral Networks, Time-Domain, Objective Intelligibility. loss functions influence the performance of such systems and if alternative loss functions that are more perceptually meaningful I. INTRODUCTION might be advantageous. Speech enhancement algorithms for improving speech quality In this paper we study the influence of loss functions on the and speech intelligibility of single-channel recordings of noisy performance of end-to-end time-domain deep learning-based speech are of high demand in a wide range of applications e.g. speech enhancement systems. Specifically, we adopt a general- hearing aids design, mobile communications devices, voice- purpose FCNN architecture that takes as input a time-domain operated human-machine interfaces, etc. 
Consequently, devel- waveform of a noisy speech signal and is trained using various oping successful monaural speech enhancement algorithms has loss functions to predict as output the enhanced speech signal been a long-lasting goal in both academia and industry. as a time-domain waveform that optimize the loss function in In fact, over the last decade, monaural speech enhancement question. algorithms based on machine learning, and deep learning in par- We focus on six loss functions: time-domain mean-square ticular, have received a tremendous amount of attention (see e.g. error (MSE) L , short-time spectral amplitude (STSA) TIME-MSE [1]–[10] as well as [11], [12] and references therein). Specifi- MSE L [31], short-time objective intelligibility (STOI) STSA-MSE cally, in recent years, deep learning-based speech enhancement L [32], Extended STOI L [33], scale-invariant signal- STOI ESTOI algorithms, facilitated by powerful general-purpose graphics to-distortion ratio (SI-SDR) L [34], and perceptual metric SI-SDR processing units and large amounts of training data, have shown for speech quality evaluation (PMSQE) L [35]. We study PMSQE impressive results by improving speech intelligibility in narrow these loss functions as they jointly cover a large range of useful properties, e.g. close relationships to human perception Manuscript received August 27, 2019; revised December 19, 2019; accepted or mathematical simplicity, that usually are of interest for January 18, 2020. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Tan Lee. speech enhancement systems (described in detail in Sec. II). Corresponding author: Morten Kolbæk Furthermore, the six loss functions have all been applied in M. Kolbæk, Z.-H. Tan, and S. H. Jensen are with the Department of recent deep learning-based speech processing techniques e.g. 
Electronic Systems, Aalborg University, Aalborg 9220, Denmark (e-mail: mok@es.aau.dk; zt@es.aau.dk; shj@es.aau.dk). [8]–[11], [23], [25], [35]–[47]. However, no existing work has J. Jensen is with the Department of Electronic Systems, Aalborg University, studied these loss functions jointly under identical conditions Aalborg 9220, Denmark, and also with Oticon A/S, Smørum 2765, Denmark and evaluated them in a structured manner with end-to-end (e-mail: jje@es.aau.dk). Digital Object Identifier 10.1109/TASLP.2020.2968738 time-domain deep learning-based speech enhancement systems. arXiv:1909.01019v2 [cs.SD] 30 Jan 2020 2 Fig. 1: End-to-end speech enhancement system based on a fully convolutional neural network. Obviously, one would expect, that a system trained to minimize Let x 2 R be L samples of a clean time-domain speech a specific loss function would also achieve the minimum signal and let the corresponding noisy observation y 2 R be numerical value among all systems for that particular loss y = x + v; (1) function during test. However, as training of machine learning models in general, and FCNNs in particular, is a highly non- where v 2 R is an additive noise signal. The goal is then to linear process, which depends on the loss function itself, this find an estimate x ^ of x from y using a FCNN, is not guaranteed in practice. Furthermore, we argue that the x ^ = f (y; ); (2) FCNN learning-rate is a crucial design parameter, when conducting such experiments, even for adaptive gradient-based optimizers where  represents the parameters of the FCNN. Using a such as ADAM [48], which has been generally overlooked in supervised learning approach, the parameters are found such the literature despite its obvious importance. 
Therefore, it is of that they minimize a loss L over a training dataset, consisting interest to establish exactly how large the difference in speech of corresponding pairs of clean x and noisy speech signals train enhancement performance is between time-domain waveform- y . Our objective is then to study how the quality of x ^, train based speech enhancement systems trained using different loss measured using different performance metrics, is affected by functions and evaluated using popular speech enhancement the choice of loss function L. In the following, we review each performance metrics. In particular, one might hope to find a of the loss functions we have selected for our experiments. ”universally good” loss function that performs almost optimally with respect to other loss functions. This is the goal of the paper A. Time-Domain Mean Square Error and our findings might serve as a guideline in loss-function The first loss function we consider is the time-domain mean- selection for deep learning-based speech processing systems. square error (MSE). This loss function is given as The rest of the paper is organized as follows. In Sec. II we describe the monaural speech separation problem and present L = kx ^ xk ; (3) TIME-MSE the six loss functions we will study. In Sec. III we present the design of our experimental study including the speech and where kk is the ` -norm. We include this loss function noise material used for training. In Sec. IV we present and because it is computationally very simple and because it is one discusses the results. Finally, in Sec. V we conclude the paper. of the most used loss functions in machine learning and signal processing in general [52]–[54]. However, little is known about the performance of time-domain speech enhancement systems optimized end-to-end for this loss function, when evaluated II. S PEECH ENHANCEMENT S YSTEM using standard speech enhancement metrics such as STOI and PESQ. Fig. 
1 shows a block-diagram of the speech enhancement system we use for all experiments. The system is based on a B. Short-Time Spectral Amplitude Mean Square Error fully convolutional neural network (FCNN), which is trained, end-to-end, to estimate the noise-free speech waveform from The second loss function we consider is the classical STSA- a noisy single-channel recording. Note, the architecture in MSE, which is one of the most popular loss functions used Fig. 1 resembles that used in a large body of state-of-the- for deep neural network based speech enhancement [10], [11]. art deep learning-based speech enhancement literature (e.g. The STSA-MSE function also plays a major role in more [23], [26], [28]–[30], [49]–[51]). Therefore, we argue that our classical non-machine learning based speech enhancement experimental findings based on this particular architecture are algorithms [31], [55], but has also been used in recent time- representative and generally valid for a large range of deep domain techniques [23]. learning-based speech enhancement methods. The architecture Let x(k; m), k = 1; : : : ; K , m = 1; : : : M; be the K -point in Fig. 1 is further described in Sec. III-D. short-time discrete Fourier transform (STFT) of x, where M = 3 b c 1 is the number of STFT frames with truncation and Similarly, we define a ^ as the short-time temporal envelope j;m I is the frame shift in samples. Furthermore, let a(k; m) = vector for the enhanced speech signal. The vector a ^ is j;m jx(k; m)j, k = 1; : : : ; + 1, m = 1; : : : M; denote the single- normalized and clipped for each entry a ^ (n) according to j;m sided amplitude spectra of x(k; m). Finally, let a ^(k; m) denote ka k j;m 0 0:75 the estimate of a(k; m). a ^ (n) = min a ^ (n); (1 + 10 )a (n) ; j;m j;m j;m ka ^ k j;m The STSA-MSE is then given as (7) for n = 1; 2; : : : ; N . 
X X L =  (a ^(k; m) a(k; m)) ; STSA-MSE The intermediate intelligibility measure for a pair of short- + 1 M k=1 m=1 time temporal envelope vectors a and a ^ is then defined j;m j;m (4) as the sample linear correlation between the clean and enhanced which is the MSE between the single-sided amplitude spectra envelope vectors given as of the true target signal x and the estimated signal x ^. Note, L is only sensitive to variations in spectral amplitudes STSA-MSE a  a ^ j;m j;m j;m a ^ and not to variations in the short-time phase spectrum of j;m d = ; (8) j;m the signals. This is different from L (Eq. (??)), as TIME-MSE a  a ^ j;m j;m j;m a ^ j;m L is operating in the time-domain. For all experiments TIME-MSE in this paper we use K = 256 and I = 128. where  and  are the sample mean vectors of a j;m j;m a ^ j;m and a ^ , respectively. From d , the final STOI score for j;m j;m C. Short-Time Objective Intelligibility an entire speech signal is then defined as the scalar, 1 The third loss function we consider is based on the short-time d  1, STOI objective intelligibility (STOI) speech intelligibility estimator J M X X [32]. STOI is currently the, perhaps, most commonly used d = d ; (9) STOI j;m J (M N + 1) speech intelligibility estimator for objectively evaluating the j=1 m=N performance of speech enhancement systems [6], [7], [9], [13]. where J = 15 is the number of one-third octave bands This is presumably driven by the fact that STOI predictions and M N + 1 is the total number of short-time temporal have shown a good correspondence with measured intelligibility envelope vectors. With J = 15, the center frequency of the of noisy/processed speech in a large range of acoustic scenarios, first one-third octave band is 150 Hz and the last one is at including ideal time-frequency weighted noisy speech [32] and approximately 3.8 kHz. 
These frequencies are chosen such noisy speech enhanced by single-microphone time-frequency that they span the frequency range in which human speech weighting-based speech enhancement systems [32] (see also normally lie [32]. Finally, with N = 30, STOI is sensitive to [33], [56]). Therefore, it is natural to believe that gains in temporal modulation frequencies of 2:6 Hz and higher, which speech intelligibility, as estimated by STOI, can be achieved are frequencies important for speech intelligibility [32]. by utilizing a loss function based on STOI. In the following, We define our STOI loss function to be minimized as we introduce the STOI loss function L , which essentially STOI is identical to STOI. The main difference is that we omit the L = d : (10) STOI STOI voice activity detector (VAD) otherwise used by STOI. We do, Note, except for the min(;) operator in Eq. (??) the entire however, apply the VAD from STOI on the dataset used for STOI loss function is differentiable and computing the required training and validation (described further in Sec. III). gradients for gradient based optimization is straight forward Let a(k; m) k = 1; : : : ; + 1, m = 1; : : : M; denote (see e.g. [10]). Furthermore, the min(;) operator requires the single-sided STFT amplitude spectra of the clean speech th only two subgradients, so the computational complexity of its spectrum as defined in Sec. II-B. We then define the j one- gradient computation is similar to the standard ReLU activation third octave band clean-speech amplitude, for time-frame m, function, which is nothing more than the max operator. To that as [32] end, L is suitable as a loss function for training DNN-based k (j) STOI u 2 2 speech enhancement systems. a (m) = a(k; m) ; (5) k=k (j) D. Extended Short-Time Objective Intelligibility where k (j) and k (j) denote the first and last STFT bin 1 2 th The fourth loss function we include is the extended short- index, respectively, of the j one-third octave band. 
In a similar fashion we define a ^ (m) as the jth one-third octave time objective intelligibility (ESTOI) speech intelligibility esti- band estimated clean-speech amplitude, for time-frame m. mator [33]. As the name implies, ESTOI is inspired by STOI Furthermore, let a short-time temporal envelope vector that and was developed in an attempt to improve STOI. Specifically, spans time-frames m N + 1; : : : ; m, in the jth frequency in [33] it was shown that the performance of certain speech band for the clean speech signal be defined as intelligibility estimators, including STOI, was sensitive to spectro-temporal modulations of the noise component and that a = [a (m N + 1); a (m N + 2); : : : ; a (m)] ; (6) j j j j;m STOI did not correlate as well ( = 0:47 [33]) with listening where N = 30, which corresponds to approximately 384 ms test results, when the noise components were highly fluctuating with a sampling frequency of 10 kHz. (as e.g. with a competing talker). 4 To alleviate this drawback of STOI, ESTOI was proposed proposed as an alternative to the often used SDR measure [33]. It was shown that ESTOI significantly outperformed from the BSS eval toolbox [57]. Differently from SDR, SI- STOI ( > 0:90 [33]), as well as other speech intelligibility SDR is invariant to the scale of the processed signal, but not estimators, in conditions when the noise type is highly to deformations caused by finite-impulse response filters as fluctuating, while performing on par with these estimators in SDR is [34]. less fluctuating noise conditions. Consequently, it is of interest The SI-SDR is defined as to study how ESTOI compares with STOI, as a loss function, k xk SI-SDR = 10 log ; (17) for time-domain DNN-based speech enhancement. 10 k x x ^k Similarly to STOI, ESTOI is based on an average correlation where coefficient between one-third octave band short-time temporal x ^ x envelope vectors. 
Specifically, let, = = argmink x x ^k : (18) kxk 2 3 a (m N + 1) : : : a (m) 1 1 It is seen from Eqs. (??), that SI-SDR is simply the signal-to- 6 7 . . A = . . (11) 4 5 noise (SNR) ratio between the weighted clean speech signal m . . and the residual noise defined as k x x ^k . Hence, a (m N + 1) : : : a (m) J J 0 1 denote a short-time spectrogram matrix of the clean speech ^ x x kxk B C signal, where the rows of A are given by a , which j;m SI-SDR = 10 log @ A are short-time temporal envelope vectors in a one-third band x x x x ^ 2 (19) kxk defined by Eq. (??). The jth mean- and variance-normalized x x ^ row of A is then given by = 10 log : T T x xx ^ x ^ x x ^ a  = (a  ): (12) j;m j;m a j;m The scaling of the reference signal x ensures that the SI-SDR k(a  )k j;m a j;m measure is invariant to the scale of x ^, which might be desirable ESTOI now introduces the row-normalized spectrogram matrix in applications, where the speech processing algorithm do not 2 3 guarantee a proper scaling of the processed signal, such as 1;m 6 7 many DNN-based systems. This is also motivated by the fact A = (13) 4 5 m . that both speech quality and intelligibility to a large extent is J;m invariant to scaling [58]. Note, that maximizing L is equivalent to maximizing SI-SDR and defines a  as the mean- and variance-normalized nth n;m the sample correlation between x and x ^, while producing the column, n = 1; 2; : : : ; N of A , where the normalization of solution with the minimum energy [36], [37]. Furthermore, the columns is performed analogously to Eq. (??). similarly to SNR, SI-SDR is expressed in units of decibel (dB) Finally, define and is defined in the range 1 < SI-SDR < 1, which A = a  : : : a  (14) motivates us to define the SI-SDR loss function as 1;m N;m as the row and column normalized spectrogram matrix. Simi- L = SI-SDR: (20) SI-SDR larly, we define a ^ as the columns of the row and column n;m normalized spectrogram matrix for the enhanced speech signal F. 
Perceptual Metric for Speech Quality Evaluation A . Finally, the ESTOI speech intelligibility index is defined The sixth, and last loss function is the perceptual metric as M N for speech quality evaluation (PMSQE) [35]. The PMSQE X X 1 00 d = a  a ^ : (15) loss function, L , is designed to approximate the non- ESTOI PMSQE n;m n;m NM m=1 n=1 differentiable perceptual evaluation of speech quality (PESQ) speech quality estimator. The PESQ speech quality estimator is Similarly to L , we define the ESTOI loss as STOI furthermore designed to predict the mean opinion score (MOS) L = d : (16) ESTOI ESTOI of a speech quality listening test for certain degradations. Consequently, the PESQ score of a processed speech signal is Note, differently from L , L does not include the STOI ESTOI a scalar between 1 and 4:5, where 1 indicates extremely poor clipping step, i.e. the min(;) operator in Eq. (??), which quality and 4:5 corresponds to no distortion at all [59], [60]. makes L fully differentiable. Also, similarly to the ESTOI Along the same lines, the PMSQE loss function is designed to definition of L , we have ignored the VAD otherwise used STOI be inversely proportional to PESQ, such that a low PMSQE by L as we apply the VAD on the data prior to training. ESTOI value corresponds to a high PESQ value and vice versa. In practice PMSQE is defined in the range from 3 to 0, where 0 E. Scale-Invariant Signal-to-Distortion Ratio is equivalent to an undistorted signal and 3:0 corresponds to The fifth loss function we include is the scale-invariant signal- an extremely poor quality. to-distortion ratio (SI-SDR) [34]. The SI-SDR is an objective Fig. 
2 shows the correspondence between PESQ and PMSQE performance measure that was introduced for evaluating the for a speech signal corrupted with either a stationary speech performance of speech processing algorithms and it was shaped noise (SSN) signal or a non-stationary 6-speaker babble 5 Finally, the test set is based on 1000 randomly selected 4.5 3 spoken utterances from si et 05 and si dt 05, which consists 4 2.5 of 1857 utterances divided among ten males and six females. 3.5 2 Note, as the training and validation sets consist of approx- imately three times as many utterances as their respective 3 1.5 subsets of si tr s, each utterance from WSJ0 will on average be 2.5 1 selected three times. However, as each utterance is mixed with 2 0.5 its own unique noise signal, the redundancy in speech material 1.5 0 increases the total variability in the dataset and ultimately -10 0 10 20 30 40 improves the generalizability capability of the system. Also note that the speakers used in the training and validation sets Fig. 2: PESQ ITU P.862.1 and PMSQE scores as function of are different than the speakers used for test, i.e. the tests are SNR for SSN and BBL noise-corrupted speech. conducted in a speaker independent setting. Furthermore, as we are primarily interested in speech active regions during training, we apply the voice activity noise signal, at various SNRs. It is seen that PESQ and detector (VAD) from STOI (and ESTOI) [32], [33] on the PMSQE are approximately inversely proportional and have training and validation set to ensure that any potentially long a monotonic relationship with respect to SNR. Hence, it is silent regions are removed prior to training. Specifically, the assumed that if PMSQE is minimized, PESQ will be maximized. 
VAD analyzes the clean waveform in 25 ms segments and The L loss function is essentially a log-domain STSA- PMSQE removes the segments where the signal energy is more than MSE loss function with additional key terms that are inspired 40 dB below the energy of the segment with the maximum by human perception. Consequently, an outline of L is PMSQE energy in the waveform. rather involved, and we refer the reader to [35] for details Finally, all utterances used with L , L , L STOI ESTOI TIME-MSE regarding the design of PMSQE. Furthermore, as PMSQE, and L are downsampled to 10 kHz, as STOI and ESTOI SI-SDR similarly to PESQ, is defined for sampling rates at either 8 are defined for this sampling frequency, and to allow an efficient kHz or 16 kHz, we use a 8 kHz sampling frequency when training scheme using minibatch training, each utterance is training L systems, and we downsample test signals to PMSQE truncated or zero-padded to four seconds. The utterances used 8 kHz, when we evaluate speech enhancement systems using with L are downsampled to 8 kHz to comply with the PMSQE PESQ. definition of PMSQE, which results in an utterance duration of approximately five seconds. III. EXPERIM ENTAL DESIGN To study how the loss functions presented in Sec. II affect the B. Noise Types performance of FCNN-based speech enhancement systems in To ensure a diverse noise variability we include four different realistic acoustical conditions, we train multiple systems using noise types in the training dataset: two synthetic noise signals a large noisy-speech dataset with a high degree of speaker and and two real-life recordings of natural sound scenes. This noise variability. In the following, we introduce the dataset, is motivated by the fact that a priori knowledge about the noise types and mixture conditions used for all experiments noise type might lead to unrealistic performance estimates [9]. presented in Sec. IV. 
The two synthetic noise signals are a stationary speech shaped noise (SSN) and a non-stationary 6-speaker babble (BBL) noise. The SSN signal is synthetically generated Gaussian white noise A. Noise-free Speech Mixtures that is spectrally shaped using a 12th-order all-pole filter with We have evaluated the six loss functions using the WSJ0 coefficients found from linear predictive coding analysis of speech corpus [61]. Specifically, using a sampling-with- the concatenation of 100 randomly chosen TIMIT sentences replacement scheme, the training data is based on 30000 [62]. The BBL noise signal is constructed as a linear mix of randomly selected spoken utterances from a subset of the randomly selected utterances from the TIMIT corpus such that si tr s part of WSJ0. The dataset size was found during six speakers are speaking at any given time. Using the entire preliminary experiments to be a good trade-off between training TIMIT database of 6300 utterances results in a BBL noise time and speech enhancement performance. This si tr s subset sequence with a duration of more than 50 min. For the real-life of WSJ0 consists in total of 11613 utterances approximately noise signals, we use the street (STR) and cafeteria (CAF), equally divided among 44 male speakers and 47 female noise signals from the CHiME3 dataset, which are signals that speakers. This ensures that the training dataset contains a large have been recorded in a natural occurring sound scene [63]. speaker variability, which allows the final speech enhancement Finally, we divide the noise signals such that 40 minutes system to be largely speaker independent [9]. is used for training, five minutes is used for validation and Similarly, the validation set is based on 3000 randomly another five minutes is used for test. 
This ensures that each selected spoken utterances from another subset of si tr s, which noise type is equally represented and with unique realizations consists of 1163 spoken utterances divided among five male in each dataset. speakers and five female speakers, which are not present in the training set. zero-padding constitutes only 3:9 % of the total number of samples. 6 To evaluate the performance of the speech enhancement Sec. IV-A. We use optimized (and different) learning rates for systems to unseen or unmatched noise signals we also test the different loss functions as further described in Sec. IV-A. using the bus (BUS), and pedestrian (PED) noise signals from The learning rates are shown in Table II. Finally, a batch size [63]. These noise signals are also real-life recordings, but they of eight is used, and training is stopped, if the validation loss represent different noise statistics compared to the four noise has not decreased for five epochs or a maximum of 200 epochs types used for training. has elapsed. We have implemented the speech enhancement systems 2 3 using Keras with a TensorFlow backend and the python C. Noisy Speech Mixtures implementation of the models and loss functions, as well as To construct the noisy speech signals, we follow Eq. (??) and audio samples, are available online . combine a noise-free training utterance x with an equal length and randomly selected noise sequence v. The noise signal v IV. E XPERIM ENTAL RESULTS is scaled according to the active speech level of x as defined We now investigate empirically how each of the loss by ITU P.56 [64] to achieve a certain SNR. For the training functions presented in Sec. II affects the speech enhancement and validation datasets, this SNR is chosen uniformly from performance of the time-domain FCNN-based speech enhance- [10; 10], which ensures that the intelligibility of the noisy ment system presented in Sec. III. Specifically, in Sec. 
IV-A speech waveforms y ranges from poor to perfectly intelligible. we study the sensitivity of speech enhancement performance with respect to learning rate. Such a study is a prerequisite to D. Model Architecture and Training allow a fair comparison between the custom loss functions in The speech enhancement system (Fig. 1) consists of a FCNN subsequent studies. We then study in Sec. IV-B how the signal with 18 layers configured in an encoder/decoder architecture integrity varies among the loss functions. Lastly, in Sec. IV-C, [65] using parameterized ReLU (PReLU) activation functions we study the speech enhancement performance for each loss [66]. The input dimension is L = 38656 and except for the function in various both matched and unmatched noise types first layer all remaining layers in the encoder use a stride of at a wide range of SNRs. We evaluate the speech enhancement two, which drives the final dimension in the bottleneck to be performance of all the systems using the following popular of dimension L=256. Similarly, except for the last layer, which and often used metrics: STOI [32], ESTOI [33], SI-SDR [34], has dimension L, all layers in the decoder uses upsampling SDR [57], and PESQ [59]. with a factor of two. Additionally, skip-connections where incoming channels are concatenated with existing channels A. Learning Rate vs. Performance Metric are used between the first eight layers in the encoder and the Since the goal of this paper is to make a comparison corresponding eight layers in the decoder. Similarly to [23], between loss functions, it is important that the comparison during training 20 % dropout is used for every third layer. is just. However, as the loss functions presented in Sec. 
II Furthermore, in (inChannel, outChannel, stride) format, have different processing steps, they have different partial the FCNN model has one (1,64,1), two (64,64,2), one derivatives, which might lead to different gradient norms and a (64,128,2), two (128,128,2), one (128,256,2), two (256,256,2), varying sensitivity to the choice of learning rate during gradient two (512,256,1), three (256,128,1), three (128,64,1), and one based optimization (e.g. [10], [67]). Therefore, to study the (128,1,1) convolutional layers with a filter size of 11 samples, influence that the learning rate can have on the performance which makes the model comparable to other enhancement of time-domain FCNN-based speech enhancement systems, models in the literature (see e.g. [23], [24], [50]). In total, the we have trained multiple systems with various learning rates. model has approximately 6.8 million parameters. Specifically, for each of the six loss functions in Sec. II, we Note, due to the encoder/decoder architecture, the receptive have trained a system using the following five learning rates: field is 2561 samples, which means that 2561 samples need to 2 3 4 4 5 10 , 10 , 5 10 , 10 , and 10 . The learning rates have be available before the system can produce a single output. In been selected from preliminary experiments in order to cover other words, with a 10 kHz sampling frequency the latency of the two training extremes, when training either diverge, i.e. a the speech enhancement system is 256 ms. For applications too large learning rate is used, or when training converge too where hard real-time requirements apply, e.g. hearing aids, this slowly and ultimately ends up at a plateau with a validation latency can likely be reduced significantly using alternative loss higher than the validation loss achieved using a larger architectures (e.g. [42]). learning rate. 
The speech enhancement system is trained using the ADAM optimizer [48] with $\beta_1 = 0.9$ and $\beta_2 = 0.999$ and a learning rate schedule that reduces the learning rate by a factor of two if the validation loss has not decreased for two epochs. The six loss functions considered in this study have different gradients and ultimately different gradient norms. Consequently, a learning rate used for one loss function might not be the optimal learning rate for another loss function. In fact, using a non-optimal learning rate might result in radically different solutions and potentially erroneous conclusions, as we show in

Footnotes: https://keras.io/ ; https://tensorflow.org/ ; https://git.its.aau.dk/mok/Speech Enhancement Loss.git

The systems for this particular experiment have been trained using SSN at an SNR of 0 dB. (Footnote: Preliminary experiments using BBL indicated similar results.)

In Table I we present different performance scores for time-domain FCNN-based speech enhancement systems trained using different loss functions and learning rates. The largest performance scores with respect to each loss function (i.e. column-wise) are highlighted in boldface. It is evident from Table I that a learning rate of $10^{-2}$ is too large for all loss functions, as none of the systems manage to improve the validation loss. Similarly, it is seen from Table I that a learning rate of $10^{-5}$ is too small for all loss functions, as none of the systems, except for $\mathcal{L}_{\text{SI-SDR}}$ evaluated using STOI, achieve the largest scores for this particular learning rate. However, the $\mathcal{L}_{\text{SI-SDR}}$ systems achieve the same STOI score for the three smallest learning rates. Furthermore, it is seen that the learning rates in the middle range, i.e. $5 \cdot 10^{-4}$ and $10^{-4}$, achieve particularly large scores. Specifically, it is seen from Table I that $\mathcal{L}_{\text{TIME-MSE}}$, $\mathcal{L}_{\text{SI-SDR}}$, $\mathcal{L}_{\text{STSA-MSE}}$, and $\mathcal{L}_{\text{PMSQE}}$ all achieve the largest overall performance scores using a learning rate of $5 \cdot 10^{-4}$, whereas the remaining loss functions $\mathcal{L}_{\text{STOI}}$ and $\mathcal{L}_{\text{ESTOI}}$ achieve their maximum performance scores with a learning rate of $10^{-4}$.

More importantly, it is seen that choosing a non-optimal learning rate might actually lead to a wrong conclusion, if the systems were compared based on the same learning rate. This is a consideration that has been generally absent in the literature. For example, with the standard learning rate of $10^{-3}$ the $\mathcal{L}_{\text{TIME-MSE}}$ and $\mathcal{L}_{\text{ESTOI}}$ systems both achieve an ESTOI score of 0.79, which might lead to the, perhaps faulty, conclusion that both loss functions possess the same potential with respect to ESTOI improvements. However, with a learning rate of e.g. $5 \cdot 10^{-4}$, it is seen that the $\mathcal{L}_{\text{TIME-MSE}}$ system still achieves an ESTOI score of 0.79, whereas the $\mathcal{L}_{\text{ESTOI}}$ system achieves a considerably larger ESTOI score of 0.83, which leads to the correct conclusion that the $\mathcal{L}_{\text{ESTOI}}$ loss function has potential to outperform $\mathcal{L}_{\text{TIME-MSE}}$ in terms of ESTOI. Similar observations can be made for other loss functions, e.g. with respect to $\mathcal{L}_{\text{TIME-MSE}}$ and $\mathcal{L}_{\text{SI-SDR}}$. Furthermore, it is seen that the SI-SDR scores have a large variance and with a learning rate of e.g. $10^{-5}$ they can vary from -22.36 dB for systems optimized for $\mathcal{L}_{\text{ESTOI}}$ to 10.15 dB for systems optimized for $\mathcal{L}_{\text{SI-SDR}}$, while achieving comparable STOI, ESTOI, and PESQ scores. This phenomenon is somewhat surprising and is further studied in Sec. IV-B. Also, Table I suggests that when a system is trained with a specific loss function, no other system achieves a larger performance score with respect to that particular metric. This expected result indicates that training has evolved correctly and that the learning rates used in Table I are close to optimal. Finally, it is seen that although $\mathcal{L}_{\text{PMSQE}}$ is an approximation of PESQ, in Table I, $\mathcal{L}_{\text{SI-SDR}}$ and $\mathcal{L}_{\text{STSA-MSE}}$ consistently lead to larger PESQ scores than $\mathcal{L}_{\text{PMSQE}}$ despite $\mathcal{L}_{\text{PMSQE}}$ consistently achieving the lowest PMSQE values of the two loss functions. $\mathcal{L}_{\text{PMSQE}}$ does, however, lead to larger PESQ scores than $\mathcal{L}_{\text{STOI}}$ and $\mathcal{L}_{\text{ESTOI}}$ for several testing conditions.

TABLE I: Performance of different speech enhancement systems measured using various performance metrics. The systems have been tested in matched noise-type conditions using SSN at 0 dB SNR.

| Learning rate | Metric | Noisy | $\mathcal{L}_{\text{TIME-MSE}}$ | $\mathcal{L}_{\text{STOI}}$ | $\mathcal{L}_{\text{ESTOI}}$ | $\mathcal{L}_{\text{SI-SDR}}$ | $\mathcal{L}_{\text{STSA-MSE}}$ | $\mathcal{L}_{\text{PMSQE}}$ |
|---|---|---|---|---|---|---|---|---|
| $10^{-2}$ | STOI | 0.75 | - | - | - | - | - | - |
| $10^{-2}$ | ESTOI | 0.46 | - | - | - | - | - | - |
| $10^{-2}$ | SI-SDR | -1.05 | - | - | - | - | - | - |
| $10^{-2}$ | SDR | -0.92 | - | - | - | - | - | - |
| $10^{-2}$ | PMSQE | 2.53 | - | - | - | - | - | - |
| $10^{-2}$ | PESQ | 1.79 | - | - | - | - | - | - |
| $10^{-3}$ | STOI | 0.75 | 0.92 | 0.93 | 0.91 | 0.91 | 0.92 | 0.89 |
| $10^{-3}$ | ESTOI | 0.46 | 0.79 | 0.81 | 0.79 | 0.77 | 0.79 | 0.73 |
| $10^{-3}$ | SI-SDR | -1.05 | 10.24 | 3.55 | -4.37 | 9.82 | 6.59 | -1.07 |
| $10^{-3}$ | SDR | -0.92 | 10.88 | 6.47 | 4.77 | 10.52 | 9.39 | 1.89 |
| $10^{-3}$ | PMSQE | 2.53 | 1.10 | 1.47 | 1.48 | 1.20 | 1.22 | 1.14 |
| $10^{-3}$ | PESQ | 1.79 | 2.72 | 2.67 | 2.51 | 2.65 | 2.77 | 2.65 |
| $5 \cdot 10^{-4}$ | STOI | 0.75 | 0.92 | 0.93 | 0.93 | 0.92 | 0.92 | 0.89 |
| $5 \cdot 10^{-4}$ | ESTOI | 0.46 | 0.79 | 0.82 | 0.83 | 0.80 | 0.80 | 0.75 |
| $5 \cdot 10^{-4}$ | SI-SDR | -1.05 | 10.30 | 2.09 | 3.12 | 10.70 | -4.32 | -8.52 |
| $5 \cdot 10^{-4}$ | SDR | -0.92 | 11.00 | 7.73 | 7.84 | 11.32 | 2.27 | 4.21 |
| $5 \cdot 10^{-4}$ | PMSQE | 2.53 | 1.07 | 1.43 | 1.27 | 1.01 | 1.19 | 1.05 |
| $5 \cdot 10^{-4}$ | PESQ | 1.79 | 2.73 | 2.70 | 2.68 | 2.77 | 2.80 | 2.72 |
| $10^{-4}$ | STOI | 0.75 | 0.92 | 0.93 | 0.93 | 0.92 | 0.92 | 0.89 |
| $10^{-4}$ | ESTOI | 0.46 | 0.79 | 0.82 | 0.83 | 0.80 | 0.80 | 0.74 |
| $10^{-4}$ | SI-SDR | -1.05 | 10.13 | 1.95 | -12.03 | 10.59 | 8.10 | -6.96 |
| $10^{-4}$ | SDR | -0.92 | 10.78 | 4.99 | 0.61 | 11.22 | 9.46 | 4.76 |
| $10^{-4}$ | PMSQE | 2.53 | 1.11 | 1.46 | 1.24 | 1.03 | 1.21 | 1.12 |
| $10^{-4}$ | PESQ | 1.79 | 2.72 | 2.69 | 2.68 | 2.77 | 2.79 | 2.67 |
| $10^{-5}$ | STOI | 0.75 | 0.90 | 0.92 | 0.92 | 0.92 | 0.91 | 0.85 |
| $10^{-5}$ | ESTOI | 0.46 | 0.74 | 0.80 | 0.81 | 0.79 | 0.77 | 0.67 |
| $10^{-5}$ | SI-SDR | -1.05 | 8.78 | -2.22 | -22.36 | 10.15 | -3.51 | -6.52 |
| $10^{-5}$ | SDR | -0.92 | 9.46 | 7.28 | 4.81 | 10.78 | 4.76 | -0.65 |
| $10^{-5}$ | PMSQE | 2.53 | 1.40 | 1.56 | 1.47 | 1.12 | 1.36 | 1.51 |
| $10^{-5}$ | PESQ | 1.79 | 2.53 | 2.58 | 2.59 | 2.72 | 2.69 | 2.37 |

For the learning rate of $10^{-2}$, the systems could not improve the training or validation loss (no convergence).

In conclusion, selecting the learning rate can have a profound impact on the performance of FCNN-based speech enhancement systems, and selecting the proper learning rate is crucial when systems trained using different loss functions are compared. Table II summarizes the learning rates that we will use for training the systems presented in Sec. IV-C. The learning rates are selected as the ones that maximize the performance metric most similar to the loss function the systems were trained with. Please note that we used four significant digits when the learning rates were selected to ensure a proper resolution.

TABLE II: Optimal learning rates for different loss functions.

| Loss | $\mathcal{L}_{\text{TIME-MSE}}$ | $\mathcal{L}_{\text{STOI}}$ | $\mathcal{L}_{\text{ESTOI}}$ | $\mathcal{L}_{\text{SI-SDR}}$ | $\mathcal{L}_{\text{STSA-MSE}}$ | $\mathcal{L}_{\text{PMSQE}}$ |
|---|---|---|---|---|---|---|
| LR | $5 \cdot 10^{-4}$ | $10^{-4}$ | $10^{-4}$ | $5 \cdot 10^{-4}$ | $5 \cdot 10^{-4}$ | $5 \cdot 10^{-4}$ |

B. Signal Integrity vs. Performance Metric

We now study the signal integrity achieved by the systems trained to minimize the different loss functions. Specifically, we compare the waveforms (Fig. 3) and amplitude spectra (Fig. 4) of representative clean, noisy, and enhanced speech signal segments processed by speech enhancement systems trained using the loss functions presented in Sec. II and the learning rates given in Table II.

Figure 3 presents the waveforms of a specific 10 ms realization of clean, noisy, and enhanced speech signals from the experiments in Table I. At first, if polarity is ignored, it is seen from Fig. 3 that systems trained with the six loss functions manage to enhance the noisy speech signal, as we see a reasonably good correspondence between the clean speech signal (red-solid) and the enhanced speech signal (blue-dashed), with the enhanced signal having considerably less noise compared to the noisy speech signal (yellow-dotted). From Fig. 3a it appears that the $\mathcal{L}_{\text{TIME-MSE}}$ loss function achieves the most per-sample-accurate estimate of the clean signal and $\mathcal{L}_{\text{STOI}}$, $\mathcal{L}_{\text{ESTOI}}$, and $\mathcal{L}_{\text{PMSQE}}$ (Figs.
3b, 3c, and 3f) appear to achieve the least per-sample-accurate estimate. It is not surprising that $\mathcal{L}_{\text{TIME-MSE}}$ achieves the best estimate, as $\mathcal{L}_{\text{TIME-MSE}}$ is a waveform matching loss function and consequently penalizes time-domain samples that deviate from the samples of the clean signal. However, a perfect sample-wise waveform reconstruction is not necessarily the only optimum, if the receiver is the human auditory system and the goal is to achieve high speech intelligibility or quality as perceived by humans. For example, in Figs. 3b and 3c it is seen that the waveforms of the enhanced signals are inverted, and somewhat different from Fig. 3a, although the processed signals achieve similar or higher STOI and ESTOI scores (Table I), i.e. the signals should ideally represent similar or higher levels of intelligibility. This is because $\mathcal{L}_{\text{STOI}}$ and $\mathcal{L}_{\text{ESTOI}}$ are loss functions based on matching of short-time energy in one-third octave bands. As a consequence, these loss functions are e.g. invariant to the signal polarity.

Footnote (on the standard learning rate of $10^{-3}$): value originally proposed in [48] and currently default in https://keras.io/.

Fig. 3: Time-domain waveform of a clean speech signal (solid red), noisy speech signal (dotted yellow) and processed speech signals (dashed blue) processed by systems trained using different loss functions. Panels: (a) $\mathcal{L}_{\text{TIME-MSE}}$, (b) $\mathcal{L}_{\text{STOI}}$, (c) $\mathcal{L}_{\text{ESTOI}}$, (d) $\mathcal{L}_{\text{SI-SDR}}$, (e) $\mathcal{L}_{\text{STSA-MSE}}$, (f) $\mathcal{L}_{\text{PMSQE}}$. Axes: Time [ms] (0 to 10) vs. Amplitude.

Fig. 4: Magnitude spectra of the signals presented in Fig. 3. Panels: (a) $\mathcal{L}_{\text{TIME-MSE}}$, (b) $\mathcal{L}_{\text{STOI}}$, (c) $\mathcal{L}_{\text{ESTOI}}$, (d) $\mathcal{L}_{\text{SI-SDR}}$, (e) $\mathcal{L}_{\text{STSA-MSE}}$, (f) $\mathcal{L}_{\text{PMSQE}}$. Axes: Frequency [Hz] (0 to 1000) vs. Magnitude.

Furthermore, by careful inspection of e.g. Fig. 3c it can be observed that the enhanced signal is slightly time-shifted compared to the clean signal. This phenomenon is more evident in Fig. 3e, where it is easy to see that the enhanced signal is time-shifted a few samples with respect to the clean signal. Loss functions such as $\mathcal{L}_{\text{STOI}}$, $\mathcal{L}_{\text{ESTOI}}$, $\mathcal{L}_{\text{PMSQE}}$ and $\mathcal{L}_{\text{STSA-MSE}}$ are primarily based on short-time magnitude spectra and do not penalize waveform deviations. Hence, they may allow for the enhanced speech signal to be time-shifted with respect to the clean signal. That being said, the amount of time-shift that we have observed is less than 1 ms, which is considerably smaller than the 10-30 ms usually required before the time-shift may be perceivable in real-life low-latency speech processing applications such as hearing aids and mobile communications devices [68], [69].

Furthermore, when evaluating speech enhancement performance with waveform-matching metrics such as SNR, SDR, or SI-SDR, exact time-matching is critical and a delay of a few samples can cause a complete failure of such performance metrics. This is exactly what we observe in Table I, where the SI-SDR scores, and to a smaller extent the SDR scores, have a high variance with no obvious correspondence with the remaining metrics such as STOI and ESTOI, which are stable and show a more consistent behavior.
Since SI-SDR and SDR are scale-invariant waveform matching functions, they fail if the processed signal and the reference signal are not perfectly aligned. Consequently, SI-SDR and SDR should be used with caution when they are used to evaluate time-domain speech enhancement or separation systems with the capability to modify the phase, such as time-domain FCNNs. Furthermore, they should generally be avoided when loss functions like $\mathcal{L}_{\text{STOI}}$, $\mathcal{L}_{\text{ESTOI}}$, $\mathcal{L}_{\text{PMSQE}}$ and $\mathcal{L}_{\text{STSA-MSE}}$ are utilized.

Finally, in Fig. 4 we show the corresponding amplitude spectra of the signals from Fig. 3 using a 40 ms window, centered around the 10 ms time-domain segment from Fig. 3, to ensure a sufficient frequency resolution. It is seen from Fig. 4 that all six loss functions lead to enhanced signals whose magnitude spectra resemble the magnitude spectrum of the clean speech signal. Furthermore, it is seen that the enhanced signals capture the dominating harmonics of the clean speech signal, while attenuating the major frequency components of the noise. Similarly to Fig. 3a, it is seen that $\mathcal{L}_{\text{TIME-MSE}}$ achieves an accurate estimate of the amplitude spectrum, but $\mathcal{L}_{\text{STSA-MSE}}$ (Fig. 4e) also achieves an accurate estimate, which is expected as $\mathcal{L}_{\text{STSA-MSE}}$ is a frequency-domain energy-matching loss function. In fact, by careful inspection of Fig. 4e it can be observed that $\mathcal{L}_{\text{STSA-MSE}}$ manages to preserve the higher order harmonics to a larger extent than $\mathcal{L}_{\text{TIME-MSE}}$ (Fig. 4a). Also, as expected, we can conclude that the small time-shift induced by $\mathcal{L}_{\text{STSA-MSE}}$ in Fig. 3e has no apparent effect on the accuracy of the amplitude spectrum estimate, which indicates that the time-shift is approximately constant over the window length of 40 ms.

C. Loss Function vs. Performance Metric

We now turn our attention towards the speech enhancement potential of the systems trained to minimize the loss functions in question. Specifically, we study the speech enhancement performance in terms of STOI [32], ESTOI [33], SI-SDR [34], SDR [57], and PESQ [59] of six different time-domain FCNN-based speech enhancement systems when trained using the loss functions, training data, and noise types presented in Sec. II and the learning rates given in Table II. The six systems have been tested using the matched noise types SSN, BBL, CAF, and STR and the unmatched noise types PED and BUS, at SNRs from -10 dB to 20 dB, and the systems are evaluated by their ability to improve the above-mentioned performance metrics. Note that, in contrast to the training and validation data, a VAD has not been applied to the test data during inference. In other words, the speech enhancement systems process the test signals in their entirety, including any short naturally occurring leading and trailing silent regions and speech pauses in between spoken words. This is done to simulate a realistic usage scenario, where exact knowledge about speech activity is generally not available prior to speech processing.

In Table III we present scores by the above-mentioned performance metrics partitioned into loss functions horizontally and SNR vertically. The largest performance score for each metric and SNR is highlighted in boldface. From Table III it is seen that all systems in general are able to improve all performance metrics with respect to the performance scores of the noisy unprocessed signals. An exception occurs for the SI-SDR and SDR metrics which, under some circumstances, can fail completely as previously discussed (Sec. IV-B), and it is important to emphasize that these systems, despite the occasionally very low SDR and SI-SDR scores, still successfully enhance the speech signals in terms of perception. This is also supported by the STOI, ESTOI, and PESQ performance metrics. In other words, systems trained with the six loss functions seem to be successful in terms of their ability to attenuate the noise and enhance the speech signal. More interestingly, although not surprisingly, it is seen that systems trained using $\mathcal{L}_{\text{STOI}}$, $\mathcal{L}_{\text{ESTOI}}$, and $\mathcal{L}_{\text{SI-SDR}}$ also achieve the maximum STOI, ESTOI, and SI-SDR scores, respectively. Somewhat surprisingly, systems trained to minimize $\mathcal{L}_{\text{PMSQE}}$ do not achieve the maximum PESQ score, despite the fact that $\mathcal{L}_{\text{PMSQE}}$ is designed to resemble PESQ and we see a monotonic relationship between the two functions in Fig. 2. Instead, it is seen from Table III that systems trained to minimize $\mathcal{L}_{\text{SI-SDR}}$ generally achieve the maximum PESQ score. In fact, systems trained to minimize $\mathcal{L}_{\text{SI-SDR}}$ seem to perform well in general, as they generally achieve large improvements across all performance metrics and often perform on par with systems trained to minimize $\mathcal{L}_{\text{STOI}}$ and $\mathcal{L}_{\text{ESTOI}}$, which are fundamentally different loss functions compared to $\mathcal{L}_{\text{SI-SDR}}$.

In Table IV we present performance scores achieved by the systems from Table III but in unseen noise type conditions, using the pedestrian and bus noise types. From Table IV it is seen, similarly to Table III, that systems trained using $\mathcal{L}_{\text{STOI}}$, $\mathcal{L}_{\text{ESTOI}}$, and $\mathcal{L}_{\text{SI-SDR}}$ also achieve the maximum STOI, ESTOI, and SI-SDR scores, respectively. It is also seen that systems trained to minimize $\mathcal{L}_{\text{SI-SDR}}$ generally achieve the maximum, or close to the maximum, performance scores and also achieve larger PESQ scores than the systems trained to minimize $\mathcal{L}_{\text{PMSQE}}$. In other words, the behavior observed in Table III, where the systems were tested using matched noise types, also seems to hold for unmatched noise types.

Finally, from Table III and Table IV we can conclude that if the goal is to maximize a specific performance metric, gains can in general be achieved by training systems to minimize a loss function designed specifically to resemble that particular performance metric. For example, if the goal is to maximize ESTOI, the largest ESTOI scores are achieved by training systems that minimize the $\mathcal{L}_{\text{ESTOI}}$ loss function. However, if the goal is to perform well in general across a wide range of performance metrics, a loss function like $\mathcal{L}_{\text{SI-SDR}}$ seems to be a good candidate, as systems trained to minimize $\mathcal{L}_{\text{SI-SDR}}$ achieve high improvements over a range of performance metrics. Also, and more importantly, these findings seem to be generally valid over a wide range of SNRs, unseen male and female speakers, as well as matched and unmatched noise types.

V. CONCLUSION

In this paper the speech enhancement potential of six state-of-the-art loss functions for time-domain deep neural network-based monaural speech enhancement has been investigated. Specifically, we have conducted multiple experimental studies

TABLE III: Performance of different speech enhancement systems measured using STOI, ESTOI, SI-SDR, SDR, and PESQ. The systems have been trained using different loss functions ($\mathcal{L}_{\text{TIME-MSE}}$, $\mathcal{L}_{\text{STOI}}$, $\mathcal{L}_{\text{ESTOI}}$, $\mathcal{L}_{\text{SI-SDR}}$, $\mathcal{L}_{\text{STSA-MSE}}$, $\mathcal{L}_{\text{PMSQE}}$) and tested using four matched (SSN, BBL, CAF, STR) noise types at seven different SNRs (-10 dB, -5 dB, 0 dB, 5 dB, 10 dB, 15 dB, and 20 dB). The maximum score is highlighted in boldface for each SNR and performance measure. See text for details.
(a) Speech Shaped Noise (matched)

| SNR | Metric | Noisy | TIME-MSE | STOI | ESTOI | SI-SDR | STSA-MSE | PMSQE |
|---|---|---|---|---|---|---|---|---|
| -10 dB | STOI | 0.50 | 0.63 | 0.68 | 0.64 | 0.64 | 0.63 | 0.56 |
| -10 dB | ESTOI | 0.16 | 0.34 | 0.40 | 0.43 | 0.36 | 0.35 | 0.27 |
| -10 dB | SI-SDR | -11.07 | 0.41 | -7.34 | -16.18 | 0.67 | -6.93 | -24.50 |
| -10 dB | SDR | -10.32 | 1.84 | -0.76 | -1.63 | 2.05 | 0.66 | -1.52 |
| -10 dB | PESQ | 1.48 | 1.59 | 1.60 | 1.31 | 1.62 | 1.68 | 1.47 |
| -5 dB | STOI | 0.62 | 0.82 | 0.85 | 0.83 | 0.83 | 0.82 | 0.77 |
| -5 dB | ESTOI | 0.30 | 0.60 | 0.65 | 0.67 | 0.62 | 0.60 | 0.54 |
| -5 dB | SI-SDR | -6.05 | 5.74 | -3.53 | -13.41 | 6.08 | -3.74 | -20.08 |
| -5 dB | SDR | -5.77 | 6.60 | 4.21 | 3.63 | 6.87 | 5.48 | 2.37 |
| -5 dB | PESQ | 1.59 | 2.19 | 2.18 | 2.03 | 2.22 | 2.24 | 2.11 |
| 0 dB | STOI | 0.75 | 0.90 | 0.92 | 0.92 | 0.91 | 0.90 | 0.88 |
| 0 dB | ESTOI | 0.46 | 0.75 | 0.80 | 0.81 | 0.78 | 0.75 | 0.71 |
| 0 dB | SI-SDR | -1.05 | 9.46 | -2.16 | -12.73 | 9.96 | -2.66 | -17.95 |
| 0 dB | SDR | -0.92 | 10.09 | 7.54 | 6.73 | 10.55 | 8.75 | 4.69 |
| 0 dB | PESQ | 1.79 | 2.59 | 2.58 | 2.56 | 2.65 | 2.62 | 2.57 |
| 5 dB | STOI | 0.85 | 0.94 | 0.95 | 0.95 | 0.95 | 0.94 | 0.92 |
| 5 dB | ESTOI | 0.63 | 0.84 | 0.87 | 0.88 | 0.86 | 0.84 | 0.80 |
| 5 dB | SI-SDR | 3.96 | 12.56 | -1.51 | -12.46 | 13.20 | -2.12 | -17.23 |
| 5 dB | SDR | 4.04 | 13.07 | 10.00 | 8.68 | 13.71 | 11.30 | 6.07 |
| 5 dB | PESQ | 2.03 | 2.89 | 2.88 | 2.91 | 3.00 | 2.89 | 2.89 |
| 10 dB | STOI | 0.92 | 0.96 | 0.97 | 0.97 | 0.97 | 0.96 | 0.94 |
| 10 dB | ESTOI | 0.78 | 0.90 | 0.91 | 0.92 | 0.91 | 0.90 | 0.85 |
| 10 dB | SI-SDR | 8.96 | 15.39 | -1.18 | -12.25 | 16.10 | -1.86 | -16.98 |
| 10 dB | SDR | 9.02 | 15.83 | 11.84 | 9.98 | 16.60 | 13.39 | 6.85 |
| 10 dB | PESQ | 2.32 | 3.13 | 3.12 | 3.16 | 3.27 | 3.10 | 3.13 |
| 15 dB | STOI | 0.96 | 0.97 | 0.98 | 0.98 | 0.98 | 0.97 | 0.95 |
| 15 dB | ESTOI | 0.88 | 0.93 | 0.94 | 0.94 | 0.94 | 0.93 | 0.88 |
| 15 dB | SI-SDR | 13.96 | 17.89 | -1.02 | -12.11 | 18.71 | -1.74 | -16.95 |
| 15 dB | SDR | 14.02 | 18.36 | 13.06 | 10.76 | 19.31 | 14.94 | 7.26 |
| 15 dB | PESQ | 2.63 | 3.31 | 3.33 | 3.36 | 3.47 | 3.29 | 3.32 |
| 20 dB | STOI | 0.98 | 0.98 | 0.98 | 0.98 | 0.99 | 0.98 | 0.96 |
| 20 dB | ESTOI | 0.94 | 0.94 | 0.95 | 0.95 | 0.96 | 0.94 | 0.90 |
| 20 dB | SI-SDR | 18.96 | 19.73 | -0.95 | -12.05 | 20.89 | -1.68 | -16.97 |
| 20 dB | SDR | 19.02 | 20.29 | 13.75 | 11.17 | 21.71 | 15.80 | 7.45 |
| 20 dB | PESQ | 2.95 | 3.48 | 3.55 | 3.54 | 3.63 | 3.50 | 3.46 |

(b) 6-Speaker Babble Noise (matched)

| SNR | Metric | Noisy | TIME-MSE | STOI | ESTOI | SI-SDR | STSA-MSE | PMSQE |
|---|---|---|---|---|---|---|---|---|
| -10 dB | STOI | 0.46 | 0.67 | 0.69 | 0.66 | 0.68 | 0.67 | 0.59 |
| -10 dB | ESTOI | 0.18 | 0.41 | 0.43 | 0.44 | 0.42 | 0.41 | 0.33 |
| -10 dB | SI-SDR | -11.04 | 0.39 | -6.70 | -15.84 | 0.59 | -6.49 | -25.93 |
| -10 dB | SDR | -10.29 | 1.69 | -0.51 | -1.38 | 1.86 | 0.57 | -1.67 |
| -10 dB | PESQ | 1.73 | 1.70 | 1.62 | 1.41 | 1.70 | 1.72 | 1.51 |
| -5 dB | STOI | 0.59 | 0.82 | 0.84 | 0.83 | 0.83 | 0.82 | 0.76 |
| -5 dB | ESTOI | 0.31 | 0.62 | 0.65 | 0.66 | 0.64 | 0.61 | 0.55 |
| -5 dB | SI-SDR | -6.04 | 5.59 | -3.29 | -13.54 | 5.96 | -3.69 | -20.38 |
| -5 dB | SDR | -5.75 | 6.45 | 4.36 | 3.66 | 6.75 | 5.32 | 2.24 |
| -5 dB | PESQ | 1.69 | 2.18 | 2.16 | 2.02 | 2.21 | 2.18 | 2.04 |
| 0 dB | STOI | 0.73 | 0.90 | 0.92 | 0.91 | 0.91 | 0.90 | 0.87 |
| 0 dB | ESTOI | 0.47 | 0.76 | 0.80 | 0.81 | 0.78 | 0.76 | 0.71 |
| 0 dB | SI-SDR | -1.04 | 9.54 | -1.94 | -12.81 | 10.12 | -2.48 | -18.13 |
| 0 dB | SDR | -0.91 | 10.17 | 7.78 | 6.82 | 10.69 | 8.83 | 4.58 |
| 0 dB | PESQ | 1.84 | 2.57 | 2.60 | 2.54 | 2.64 | 2.57 | 2.49 |
| 5 dB | STOI | 0.84 | 0.94 | 0.95 | 0.95 | 0.95 | 0.94 | 0.92 |
| 5 dB | ESTOI | 0.63 | 0.85 | 0.87 | 0.88 | 0.87 | 0.84 | 0.80 |
| 5 dB | SI-SDR | 3.96 | 12.75 | -1.40 | -12.55 | 13.49 | -2.03 | -17.25 |
| 5 dB | SDR | 4.04 | 13.26 | 10.20 | 8.76 | 13.99 | 11.46 | 5.99 |
| 5 dB | PESQ | 2.10 | 2.88 | 2.92 | 2.90 | 2.98 | 2.86 | 2.84 |
| 10 dB | STOI | 0.91 | 0.96 | 0.97 | 0.97 | 0.97 | 0.96 | 0.94 |
| 10 dB | ESTOI | 0.77 | 0.90 | 0.92 | 0.92 | 0.92 | 0.89 | 0.85 |
| 10 dB | SI-SDR | 8.96 | 15.55 | -1.15 | -12.31 | 16.41 | -1.82 | -16.97 |
| 10 dB | SDR | 9.03 | 16.02 | 11.94 | 10.00 | 16.91 | 13.49 | 6.79 |
| 10 dB | PESQ | 2.39 | 3.12 | 3.17 | 3.17 | 3.25 | 3.10 | 3.11 |
| 15 dB | STOI | 0.96 | 0.97 | 0.98 | 0.98 | 0.98 | 0.97 | 0.95 |
| 15 dB | ESTOI | 0.86 | 0.93 | 0.94 | 0.94 | 0.94 | 0.92 | 0.88 |
| 15 dB | SI-SDR | 13.96 | 17.96 | -0.99 | -12.11 | 18.96 | -1.72 | -16.94 |
| 15 dB | SDR | 14.02 | 18.46 | 13.09 | 10.74 | 19.56 | 14.95 | 7.20 |
| 15 dB | PESQ | 2.70 | 3.33 | 3.38 | 3.38 | 3.46 | 3.32 | 3.30 |
| 20 dB | STOI | 0.98 | 0.98 | 0.98 | 0.98 | 0.99 | 0.98 | 0.96 |
| 20 dB | ESTOI | 0.93 | 0.94 | 0.95 | 0.95 | 0.96 | 0.94 | 0.89 |
| 20 dB | SI-SDR | 18.96 | 19.70 | -0.94 | -12.06 | 21.03 | -1.68 | -16.94 |
| 20 dB | SDR | 19.02 | 20.28 | 13.75 | 11.15 | 21.86 | 15.80 | 7.40 |
| 20 dB | PESQ | 3.02 | 3.50 | 3.57 | 3.54 | 3.62 | 3.52 | 3.44 |

(c) Cafeteria Noise (matched)

| SNR | Metric | Noisy | TIME-MSE | STOI | ESTOI | SI-SDR | STSA-MSE | PMSQE |
|---|---|---|---|---|---|---|---|---|
| -10 dB | STOI | 0.56 | 0.74 | 0.77 | 0.74 | 0.74 | 0.74 | 0.69 |
| -10 dB | ESTOI | 0.25 | 0.48 | 0.52 | 0.53 | 0.49 | 0.48 | 0.42 |
| -10 dB | SI-SDR | -11.04 | 3.26 | -5.16 | -14.90 | 3.78 | -5.37 | -19.30 |
| -10 dB | SDR | -10.32 | 4.21 | 1.95 | 1.23 | 4.74 | 2.90 | 0.41 |
| -10 dB | PESQ | 1.60 | 1.99 | 1.97 | 1.80 | 2.04 | 2.03 | 1.87 |
| -5 dB | STOI | 0.68 | 0.85 | 0.87 | 0.86 | 0.86 | 0.86 | 0.82 |
| -5 dB | ESTOI | 0.38 | 0.65 | 0.70 | 0.71 | 0.67 | 0.66 | 0.60 |
| -5 dB | SI-SDR | -6.03 | 7.77 | -2.69 | -13.38 | 8.33 | -3.20 | -17.83 |
| -5 dB | SDR | -5.75 | 8.49 | 6.22 | 5.34 | 9.06 | 7.14 | 3.57 |
| -5 dB | PESQ | 1.69 | 2.39 | 2.39 | 2.29 | 2.45 | 2.41 | 2.29 |
| 0 dB | STOI | 0.78 | 0.92 | 0.93 | 0.93 | 0.92 | 0.92 | 0.89 |
| 0 dB | ESTOI | 0.52 | 0.77 | 0.81 | 0.82 | 0.80 | 0.78 | 0.73 |
| 0 dB | SI-SDR | -1.03 | 11.17 | -1.69 | -12.83 | 11.88 | -2.35 | -17.30 |
| 0 dB | SDR | -0.90 | 11.75 | 9.22 | 8.01 | 12.50 | 10.18 | 5.54 |
| 0 dB | PESQ | 1.99 | 2.70 | 2.73 | 2.69 | 2.79 | 2.72 | 2.65 |
| 5 dB | STOI | 0.87 | 0.95 | 0.96 | 0.96 | 0.95 | 0.95 | 0.93 |
| 5 dB | ESTOI | 0.66 | 0.85 | 0.88 | 0.88 | 0.87 | 0.85 | 0.81 |
| 5 dB | SI-SDR | 3.97 | 14.11 | -1.28 | -12.42 | 14.90 | -1.98 | -17.10 |
| 5 dB | SDR | 4.05 | 14.61 | 11.32 | 9.67 | 15.50 | 12.55 | 6.66 |
| 5 dB | PESQ | 2.33 | 2.96 | 3.01 | 3.01 | 3.08 | 2.97 | 2.96 |
| 10 dB | STOI | 0.93 | 0.97 | 0.97 | 0.97 | 0.97 | 0.97 | 0.95 |
| 10 dB | ESTOI | 0.79 | 0.90 | 0.92 | 0.92 | 0.92 | 0.90 | 0.86 |
| 10 dB | SI-SDR | 8.97 | 16.78 | -1.09 | -12.18 | 17.58 | -1.81 | -17.03 |
| 10 dB | SDR | 9.03 | 17.28 | 12.76 | 10.66 | 18.27 | 14.38 | 7.22 |
| 10 dB | PESQ | 2.66 | 3.20 | 3.27 | 3.27 | 3.32 | 3.21 | 3.21 |
| 15 dB | STOI | 0.96 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.95 |
| 15 dB | ESTOI | 0.88 | 0.93 | 0.94 | 0.94 | 0.94 | 0.93 | 0.89 |
| 15 dB | SI-SDR | 13.97 | 18.91 | -1.00 | -12.06 | 19.94 | -1.72 | -17.02 |
| 15 dB | SDR | 14.03 | 19.48 | 13.61 | 11.18 | 20.84 | 15.55 | 7.47 |
| 15 dB | PESQ | 2.99 | 3.42 | 3.50 | 3.49 | 3.53 | 3.44 | 3.40 |
| 20 dB | STOI | 0.98 | 0.98 | 0.98 | 0.98 | 0.99 | 0.98 | 0.96 |
| 20 dB | ESTOI | 0.94 | 0.94 | 0.95 | 0.95 | 0.96 | 0.94 | 0.90 |
| 20 dB | SI-SDR | 18.97 | 20.22 | -0.96 | -12.02 | 21.75 | -1.68 | -17.01 |
| 20 dB | SDR | 19.02 | 20.85 | 14.01 | 11.40 | 22.91 | 16.10 | 7.55 |
| 20 dB | PESQ | 3.33 | 3.60 | 3.70 | 3.67 | 3.72 | 3.63 | 3.52 |

(d) Street Noise (matched)

| SNR | Metric | Noisy | TIME-MSE | STOI | ESTOI | SI-SDR | STSA-MSE | PMSQE |
|---|---|---|---|---|---|---|---|---|
| -10 dB | STOI | 0.58 | 0.76 | 0.80 | 0.77 | 0.77 | 0.76 | 0.70 |
| -10 dB | ESTOI | 0.24 | 0.50 | 0.56 | 0.57 | 0.52 | 0.50 | 0.43 |
| -10 dB | SI-SDR | -11.05 | 3.92 | -4.74 | -15.07 | 4.34 | -5.07 | -18.52 |
| -10 dB | SDR | -10.35 | 4.92 | 2.64 | 1.79 | 5.32 | 3.53 | 0.46 |
| -10 dB | PESQ | 1.42 | 2.05 | 2.02 | 1.84 | 2.07 | 2.07 | 1.90 |
| -5 dB | STOI | 0.68 | 0.87 | 0.89 | 0.88 | 0.88 | 0.87 | 0.83 |
| -5 dB | ESTOI | 0.36 | 0.68 | 0.72 | 0.73 | 0.69 | 0.68 | 0.61 |
| -5 dB | SI-SDR | -6.04 | 8.12 | -2.58 | -13.50 | 8.58 | -3.24 | -17.63 |
| -5 dB | SDR | -5.77 | 8.88 | 6.60 | 5.67 | 9.32 | 7.46 | 3.46 |
| -5 dB | PESQ | 1.63 | 2.45 | 2.44 | 2.36 | 2.50 | 2.46 | 2.35 |
| 0 dB | STOI | 0.78 | 0.92 | 0.94 | 0.93 | 0.93 | 0.92 | 0.90 |
| 0 dB | ESTOI | 0.49 | 0.79 | 0.82 | 0.83 | 0.81 | 0.79 | 0.74 |
| 0 dB | SI-SDR | -1.04 | 11.39 | -1.66 | -12.81 | 11.97 | -2.38 | -17.32 |
| 0 dB | SDR | -0.92 | 12.03 | 9.46 | 8.20 | 12.62 | 10.36 | 5.46 |
| 0 dB | PESQ | 1.94 | 2.76 | 2.78 | 2.75 | 2.84 | 2.76 | 2.71 |
| 5 dB | STOI | 0.86 | 0.95 | 0.96 | 0.96 | 0.96 | 0.95 | 0.93 |
| 5 dB | ESTOI | 0.63 | 0.86 | 0.88 | 0.89 | 0.87 | 0.86 | 0.82 |
| 5 dB | SI-SDR | 3.96 | 14.28 | -1.26 | -12.45 | 14.90 | -1.99 | -17.16 |
| 5 dB | SDR | 4.04 | 14.83 | 11.53 | 9.79 | 15.54 | 12.69 | 6.66 |
| 5 dB | PESQ | 2.28 | 3.01 | 3.05 | 3.04 | 3.12 | 2.99 | 3.00 |
| 10 dB | STOI | 0.92 | 0.97 | 0.97 | 0.97 | 0.97 | 0.97 | 0.95 |
| 10 dB | ESTOI | 0.76 | 0.90 | 0.92 | 0.92 | 0.92 | 0.90 | 0.87 |
| 10 dB | SI-SDR | 8.96 | 16.96 | -1.09 | -12.17 | 17.60 | -1.80 | -17.06 |
| 10 dB | SDR | 9.02 | 17.51 | 12.92 | 10.75 | 18.34 | 14.53 | 7.27 |
| 10 dB | PESQ | 2.61 | 3.23 | 3.29 | 3.29 | 3.35 | 3.21 | 3.24 |
| 15 dB | STOI | 0.96 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.96 |
| 15 dB | ESTOI | 0.87 | 0.93 | 0.94 | 0.94 | 0.94 | 0.93 | 0.89 |
| 15 dB | SI-SDR | 13.96 | 19.13 | -1.00 | -12.08 | 19.98 | -1.72 | -17.04 |
| 15 dB | SDR | 14.02 | 19.74 | 13.71 | 11.24 | 20.94 | 15.70 | 7.52 |
| 15 dB | PESQ | 2.93 | 3.45 | 3.53 | 3.51 | 3.55 | 3.45 | 3.43 |
| 20 dB | STOI | 0.98 | 0.98 | 0.98 | 0.98 | 0.99 | 0.98 | 0.96 |
| 20 dB | ESTOI | 0.93 | 0.94 | 0.95 | 0.96 | 0.96 | 0.94 | 0.90 |
| 20 dB | SI-SDR | 18.96 | 20.39 | -0.96 | -12.00 | 21.79 | -1.68 | -17.04 |
| 20 dB | SDR | 19.02 | 21.04 | 14.07 | 11.44 | 23.02 | 16.21 | 7.59 |
| 20 dB | PESQ | 3.27 | 3.62 | 3.73 | 3.69 | 3.74 | 3.64 | 3.56 |

TABLE IV: As Table III but for the two unmatched pedestrian and bus noise types.

(a) Pedestrian Noise (unmatched)

| SNR | Metric | Noisy | TIME-MSE | STOI | ESTOI | SI-SDR | STSA-MSE | PMSQE |
|---|---|---|---|---|---|---|---|---|
| -10 dB | STOI | 0.52 | 0.66 | 0.71 | 0.67 | 0.67 | 0.67 | 0.61 |
| -10 dB | ESTOI | 0.19 | 0.38 | 0.44 | 0.45 | 0.39 | 0.38 | 0.31 |
| -10 dB | SI-SDR | -11.04 | -0.06 | -7.53 | -17.82 | 0.38 | -8.05 | -20.99 |
| -10 dB | SDR | -10.31 | 1.08 | -0.90 | -1.80 | 1.53 | -0.40 | -1.91 |
| -10 dB | PESQ | 1.68 | 1.73 | 1.72 | 1.45 | 1.74 | 1.77 | 1.60 |
| -5 dB | STOI | 0.62 | 0.82 | 0.85 | 0.83 | 0.83 | 0.82 | 0.77 |
| -5 dB | ESTOI | 0.29 | 0.59 | 0.64 | 0.66 | 0.61 | 0.59 | 0.52 |
| -5 dB | SI-SDR | -6.04 | 5.45 | -3.73 | -14.37 | 5.86 | -4.24 | -18.59 |
| -5 dB | SDR | -5.75 | 6.21 | 4.19 | 3.40 | 6.60 | 4.87 | 2.14 |
| -5 dB | PESQ | 1.61 | 2.21 | 2.21 | 2.04 | 2.23 | 2.24 | 2.11 |
| 0 dB | STOI | 0.73 | 0.90 | 0.92 | 0.91 | 0.91 | 0.90 | 0.87 |
| 0 dB | ESTOI | 0.43 | 0.74 | 0.78 | 0.79 | 0.76 | 0.74 | 0.69 |
| 0 dB | SI-SDR | -1.04 | 9.52 | -2.08 | -13.23 | 10.05 | -2.74 | -17.53 |
| 0 dB | SDR | -0.91 | 10.12 | 7.93 | 6.91 | 10.66 | 8.70 | 4.80 |
| 0 dB | PESQ | 1.82 | 2.60 | 2.62 | 2.55 | 2.66 | 2.62 | 2.54 |
| 5 dB | STOI | 0.83 | 0.94 | 0.95 | 0.95 | 0.95 | 0.94 | 0.92 |
| 5 dB | ESTOI | 0.58 | 0.83 | 0.86 | 0.87 | 0.85 | 0.83 | 0.79 |
| 5 dB | SI-SDR | 3.96 | 12.82 | -1.40 | -12.62 | 13.47 | -2.10 | -17.20 |
| 5 dB | SDR | 4.04 | 13.32 | 10.57 | 9.05 | 14.03 | 11.54 | 6.32 |
| 5 dB | PESQ | 2.13 | 2.91 | 2.94 | 2.92 | 3.01 | 2.92 | 2.88 |
| 10 dB | STOI | 0.91 | 0.96 | 0.97 | 0.97 | 0.97 | 0.96 | 0.94 |
| 10 dB | ESTOI | 0.73 | 0.89 | 0.91 | 0.91 | 0.91 | 0.89 | 0.85 |
| 10 dB | SI-SDR | 8.96 | 15.77 | -1.12 | -12.28 | 16.45 | -1.84 | -17.02 |
| 10 dB | SDR | 9.03 | 16.23 | 12.34 | 10.32 | 17.05 | 13.75 | 7.07 |
| 10 dB | PESQ | 2.46 | 3.17 | 3.21 | 3.20 | 3.29 | 3.16 | 3.16 |
| 15 dB | STOI | 0.96 | 0.97 | 0.98 | 0.98 | 0.98 | 0.97 | 0.95 |
| 15 dB | ESTOI | 0.85 | 0.92 | 0.94 | 0.94 | 0.94 | 0.92 | 0.88 |
| 15 dB | SI-SDR | 13.96 | 18.28 | -1.01 | -12.16 | 19.08 | -1.73 | -17.00 |
| 15 dB | SDR | 14.02 | 18.79 | 13.40 | 10.99 | 19.83 | 15.23 | 7.39 |
| 15 dB | PESQ | 2.78 | 3.39 | 3.45 | 3.42 | 3.50 | 3.39 | 3.36 |
| 20 dB | STOI | 0.98 | 0.98 | 0.98 | 0.98 | 0.99 | 0.98 | 0.96 |
| 20 dB | ESTOI | 0.92 | 0.94 | 0.95 | 0.95 | 0.96 | 0.94 | 0.90 |
| 20 dB | SI-SDR | 18.96 | 19.95 | -0.95 | -12.04 | 21.23 | -1.68 | -17.02 |
| 20 dB | SDR | 19.02 | 20.55 | 13.93 | 11.31 | 22.23 | 15.97 | 7.50 |
| 20 dB | PESQ | 3.10 | 3.55 | 3.66 | 3.61 | 3.67 | 3.58 | 3.48 |

(b) Bus Noise (unmatched)

| SNR | Metric | Noisy | TIME-MSE | STOI | ESTOI | SI-SDR | STSA-MSE | PMSQE |
|---|---|---|---|---|---|---|---|---|
| -10 dB | STOI | 0.71 | 0.88 | 0.90 | 0.90 | 0.89 | 0.88 | 0.86 |
| -10 dB | ESTOI | 0.39 | 0.70 | 0.75 | 0.76 | 0.73 | 0.70 | 0.66 |
| -10 dB | SI-SDR | -11.04 | 8.49 | -2.34 | -13.72 | 9.19 | -3.24 | -16.39 |
| -10 dB | SDR | -10.35 | 9.29 | 7.48 | 6.40 | 10.00 | 7.60 | 3.72 |
| -10 dB | PESQ | 1.67 | 2.53 | 2.59 | 2.51 | 2.61 | 2.55 | 2.50 |
| -5 dB | STOI | 0.78 | 0.93 | 0.94 | 0.94 | 0.94 | 0.93 | 0.91 |
| -5 dB | ESTOI | 0.49 | 0.80 | 0.83 | 0.84 | 0.82 | 0.80 | 0.76 |
| -5 dB | SI-SDR | -6.04 | 11.57 | -1.59 | -12.92 | 12.33 | -2.46 | -16.89 |
| -5 dB | SDR | -5.77 | 12.34 | 10.03 | 8.62 | 13.15 | 10.47 | 5.44 |
| -5 dB | PESQ | 2.03 | 2.86 | 2.91 | 2.87 | 2.95 | 2.86 | 2.81 |
| 0 dB | STOI | 0.85 | 0.95 | 0.96 | 0.96 | 0.96 | 0.95 | 0.94 |
| 0 dB | ESTOI | 0.60 | 0.86 | 0.88 | 0.89 | 0.88 | 0.86 | 0.82 |
| 0 dB | SI-SDR | -1.04 | 14.16 | -1.26 | -12.44 | 14.97 | -2.05 | -17.17 |
| 0 dB | SDR | -0.92 | 14.89 | 11.85 | 10.05 | 15.82 | 12.65 | 6.59 |
| 0 dB | PESQ | 2.37 | 3.13 | 3.17 | 3.16 | 3.22 | 3.11 | 3.08 |
| 5 dB | STOI | 0.91 | 0.97 | 0.97 | 0.97 | 0.97 | 0.97 | 0.95 |
| 5 dB | ESTOI | 0.71 | 0.90 | 0.92 | 0.92 | 0.92 | 0.90 | 0.87 |
| 5 dB | SI-SDR | 3.96 | 16.51 | -1.11 | -12.20 | 17.35 | -1.85 | -17.18 |
| 5 dB | SDR | 4.04 | 17.21 | 13.06 | 10.89 | 18.27 | 14.32 | 7.26 |
| 5 dB | PESQ | 2.69 | 3.36 | 3.41 | 3.40 | 3.46 | 3.34 | 3.31 |
| 10 dB | STOI | 0.95 | 0.98 | 0.98 | 0.98 | 0.98 | 0.97 | 0.96 |
| 10 dB | ESTOI | 0.82 | 0.93 | 0.94 | 0.94 | 0.94 | 0.92 | 0.89 |
| 10 dB | SI-SDR | 8.96 | 18.55 | -1.03 | -12.09 | 19.51 | -1.75 | -17.12 |
| 10 dB | SDR | 9.02 | 19.27 | 13.77 | 11.33 | 20.59 | 15.49 | 7.56 |
| 10 dB | PESQ | 3.00 | 3.58 | 3.63 | 3.61 | 3.66 | 3.54 | 3.50 |
| 15 dB | STOI | 0.97 | 0.98 | 0.98 | 0.98 | 0.99 | 0.98 | 0.96 |
| 15 dB | ESTOI | 0.90 | 0.94 | 0.95 | 0.95 | 0.96 | 0.94 | 0.91 |
| 15 dB | SI-SDR | 13.96 | 19.98 | -0.99 | -12.02 | 21.27 | -1.70 | -17.09 |
| 15 dB | SDR | 14.02 | 20.71 | 14.10 | 11.51 | 22.58 | 16.14 | 7.65 |
| 15 dB | PESQ | 3.33 | 3.74 | 3.81 | 3.77 | 3.84 | 3.71 | 3.62 |
| 20 dB | STOI | 0.99 | 0.98 | 0.99 | 0.98 | 0.99 | 0.98 | 0.96 |
| 20 dB | ESTOI | 0.95 | 0.95 | 0.96 | 0.96 | 0.97 | 0.95 | 0.91 |
| 20 dB | SI-SDR | 18.96 | 20.70 | -0.96 | -12.01 | 22.46 | -1.68 | -17.06 |
| 20 dB | SDR | 19.02 | 21.41 | 14.21 | 11.56 | 23.95 | 16.38 | 7.64 |
| 20 dB | PESQ | 3.66 | 3.80 | 3.91 | 3.87 | 3.97 | 3.81 | 3.68 |

using speech enhancement systems based on time-domain convolutional neural networks and studied the impact the loss functions have on the performance of those systems, when they are evaluated using five commonly used performance metrics for monaural speech enhancement algorithms. The goal of the study is to establish if, and to what extent, a loss function designed specifically to resemble a certain performance metric is advantageous compared to standard loss functions such as the time-domain mean-square error (MSE) loss function or the short-time spectral amplitude (STSA)-MSE, whose strongest justification is mathematical convenience. In addition to the classical loss functions based on time-domain MSE and STSA-MSE, we have studied a loss function based on scale-invariant signal to distortion ratio (SI-SDR), as well as two loss functions based on two often used speech intelligibility predictors, namely the short-time objective intelligibility (STOI) and the Extended-STOI (ESTOI). Lastly, we have studied a loss function based on perceptual evaluation of speech quality (PESQ), which is a commonly used speech quality predictor. In general, we found that all six loss functions are good candidates for monaural speech enhancement systems, as they all managed to improve the performance metrics employed with respect to the performance scores of noisy unprocessed speech signals. More importantly, we found that these results were generally valid across a wide range of SNRs, unseen male and female speakers, as well as matched and unmatched noise types. However, we also found that if the goal is to perform optimally with respect to a specific performance metric, it is generally advantageous to optimize with respect to a loss function designed specifically to resemble that particular performance metric. This is particularly interesting for loss functions based on STOI and ESTOI, as these performance metrics predict speech intelligibility, a metric many speech enhancement algorithms attempt to maximize without explicitly being designed to do so. Furthermore, we found that the learning rate used when training systems to minimize a particular loss function can have a critical impact on the performance of such systems; it is paramount that the optimal learning rate is identified for each loss function, as a sub-optimal learning rate can lead to sub-optimal results and erroneous conclusions when systems trained to optimize different loss functions are compared. Despite its obvious importance, this is a consideration that has been generally absent in the academic literature. Additionally, we found that waveform matching performance metrics such as SDR and SI-SDR, despite achieving good general performance, must be used with caution when they are used in combination with speech enhancement systems with the capability of modifying the phase of the processed signals, such as time-domain FCNN-based speech enhancement systems. In particular, SDR and SI-SDR may severely under-estimate the performance of systems that are trained using loss functions that do not penalize time-shifts. We observed on multiple occasions that both SDR and SI-SDR failed completely when the reference signal and the processed signal were not perfectly aligned.

In conclusion, we found that a loss function based on SI-SDR achieves good general performance across a range of popular speech enhancement evaluation metrics, which suggests that SI-SDR is a good candidate as a general-purpose loss function for supervised monaural time-domain speech enhancement.

ACKNOWLEDGMENT

We would like to thank Juan M. Martín-Doñas for valuable insight and discussions regarding the implementation of the PMSQE loss function.

REFERENCES

[1] G. Kim et al., “An algorithm that improves speech intelligibility in noise for normal-hearing listeners,” The Journal of the Acoustical Society of America, vol. 126, no. 3, pp. 1486–1494, 2009.
[2] K. Han and D. Wang, “A classification based approach to speech segregation,” The Journal of the Acoustical Society of America, vol. 132, no. 5, pp. 3475–3483, 2012.
[3] Y. Wang and D. Wang, “Towards Scaling Up Classification-Based Speech Separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 21, no. 7, pp. 1381–1390, 2013.
[4] Y. Xu et al., “An Experimental Study on Speech Enhancement Based on Deep Neural Networks,” IEEE Signal Processing Letters, vol. 21, no. 1, pp. 65–68, 2014.
[5] F. Weninger, F. Eyben, and B. Schuller, “Single-channel speech separation with memory-enhanced recurrent neural networks,” in Proc. ICASSP, 2014, pp. 3709–3713.
[6] E. W. Healy et al., “An algorithm to increase speech intelligibility for hearing-impaired listeners in novel segments of the same noise type,” The Journal of the Acoustical Society of America, vol. 138, no. 3, pp. 1660–1669, 2015.
[7] J. Chen et al., “Large-scale training to increase speech intelligibility for hearing-impaired listeners in novel noises,” The Journal of the Acoustical Society of America, vol. 139, no. 5, pp. 2604–2612, 2016.
[8] H. Erdogan et al., “Deep Recurrent Networks for Separation and Recognition of Single-Channel Speech in Nonstationary Background Audio,” in New Era for Robust Speech Recognition. Springer, 2017, pp. 165–186.
[9] M. Kolbæk, Z. H. Tan, and J. Jensen, “Speech Intelligibility Potential of General and Specialized Deep Neural Network Based Speech Enhancement Systems,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 1, pp. 153–167, 2017.
[10] M. Kolbæk, Z. Tan, and J. Jensen, “On the Relationship Between Short-Time Objective Intelligibility and Short-Time Spectral-Amplitude Mean-Square Error for Speech Enhancement,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 2, pp. 283–295, 2019.
[11] D. Wang and J. Chen, “Supervised Speech Separation Based on Deep Learning: An Overview,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 10, pp. 1702–1726, 2018.
[19] E. W. Healy et al., “A deep learning algorithm to increase intelligibility for hearing-impaired listeners in the presence of a competing talker and reverberation,” The Journal of the Acoustical Society of America, vol. 145, no. 3, pp. 1378–1388, Mar. 2019. [Online]. Available: https://asa.scitation.org/doi/full/10.1121/1.5093547
[20] J. L. Roux et al., “The Phasebook: Building Complex Masks via Discrete Representations for Source Separation,” in Proc. ICASSP, 2019, pp. 66–70.
[21] Z. Wang, K. Tan, and D. Wang, “Deep Learning Based Phase Reconstruction for Speaker Separation: A Trigonometric Perspective,” in Proc. ICASSP, 2019, pp. 71–75.
[22] Z.-Q. Wang et al., “End-to-End Speech Separation with Unfolded Iterative Phase Reconstruction,” in Proc. Interspeech, 2018, pp. 2708–2712.
[23] A. Pandey and D. Wang, “A New Framework for Supervised Speech Enhancement in the Time Domain,” in Proc. Interspeech, 2018, pp. 1136–1140.
[24] A. Pandey and D. Wang, “A New Framework for CNN-Based Speech Enhancement in the Time Domain,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 7, pp. 1179–1188, 2019.
[25] S. W. Fu et al., “End-to-End Waveform Utterance Enhancement for Direct Evaluation Metrics Optimization by Fully Convolutional Neural Networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 9, pp. 1570–1584, 2018.
[26] S. R. Park and J. Lee, “A Fully Convolutional Neural Network for Speech Enhancement,” in Proc. Interspeech, 2017, pp. 1993–1997.
[27] S. W. Fu et al., “Raw waveform-based speech enhancement by fully convolutional networks,” in Proc. APSIPA, 2017, pp. 6–12.
[28] A. Pandey and D. Wang, “TCNN: Temporal Convolutional Neural Network for Real-time Speech Enhancement in the Time Domain,” in Proc. ICASSP, 2019, pp. 6875–6879.
[29] T. Grzywalski and S. Drgas, “Using Recurrences in Time and Frequency within U-net Architecture for Speech Enhancement,” in Proc. ICASSP, 2019, pp. 6970–6974.
[30] K. Tan, X. Zhang, and D. Wang, “Real-time Speech Enhancement Using an Efficient Convolutional Recurrent Network for Dual-microphone Mobile Phones in Close-talk Scenarios,” in Proc. ICASSP, 2019, pp. 5751–5755.
[31] Y. Ephraim and D. Malah, “Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 6, pp. 1109–1121, 1984.
[32] C. H. Taal et al., “An Algorithm for Intelligibility Prediction of Time-Frequency Weighted Noisy Speech,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2125–2136, 2011.
[33] J. Jensen and C. H. Taal, “An Algorithm for Predicting the Intelligibility of Speech Masked by Modulated Noise Maskers,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 11, pp. 2009–2022, 2016.
[34] J. L. Roux et al., “SDR – Half-baked or Well Done?” in Proc. ICASSP, 2019, pp. 626–630.
[35] J. M. Martín-Doñas et al., “A Deep Learning Loss Function Based on the Perceptual Evaluation of the Speech Quality,” IEEE Signal Processing Letters, vol. 25, no. 11, pp. 1680–1684, 2018.
[12] M. Kolbæk, “Single-Microphone Speech Enhancement and Separation [36] S. Venkataramani, R.
Higa, and P. Smaragdis, “Performance Based Cost Using Deep Learning,” Ph.D. dissertation, Aalborg Universitetsforlag, Functions for End-to-End Speech Separation,” Proc. APSIPA, pp. 350– 2018. [Online]. Available: kolbaek-phd.aau.dk 355, 2018. [13] E. W. Healy et al., “An algorithm to increase intelligibility for hearing- [37] S. Venkataramani, J. Casebeer, and P. Smaragdis, “End-to-end Source impaired listeners in the presence of a competing talker,” The Journal of Separation with Adaptive Front-Ends,” in Proc. NIPS Machine Learning the Acoustical Society of America, vol. 141, no. 6, pp. 4230–4239, 2017. for Audio Signal Processing Workshop, 2017. [14] F. Bolner et al., “Speech enhancement based on neural networks applied [38] Y. Zhao et al., “Perceptually Guided Speech Enhancement using Deep to cochlear implant coding strategies,” in Proc. ICASSP, 2016, pp. 6520– Neural Networks,” in Proc. ICASSP, 2018, pp. 5074–5078. [39] H. Zhang, X. Zhang, and G. Gao, “Training Supervised Speech Separation [15] J. J. M. Monaghan et al., “Auditory inspired machine learning techniques System to Improve STOI and PESQ Directly,” in Proc. ICASSP, 2018, can improve speech intelligibility and quality for hearing-impaired pp. 5374–5378. listeners,” The Journal of the Acoustical Society of America, vol. 141, [40] F. Bahmaninezhad et al., “A Comprehensive Study of Speech Separation: no. 3, pp. 1985–1998, 2017. Spectrogram vs Waveform Separation,” in Proc. Interspeech, 2019, pp. [16] T. Goehring et al., “Speech enhancement based on neural networks 4574–4578. improves speech intelligibility in noise for cochlear implant users,” [41] Y. Luo and N. Mesgarani, “Conv-TasNet: Surpassing Ideal Hearing Research, vol. 344, pp. 183–194, 2017. Time–Frequency Magnitude Masking for Speech Separation,” IEEE/ACM [17] Y. H. Lai et al., “A Deep Denoising Autoencoder Approach to Improving Transactions on Audio, Speech, and Language Processing, vol. 27, no. 
8, the Intelligibility of Vocoded Speech in Cochlear Implant Simulation,” pp. 1256–1266, May 2019. IEEE Transactions on Biomedical Engineering, vol. 64, no. 7, pp. 1568– [42] ——, “TaSNet: Time-Domain Audio Separation Network for Real-Time, 1578, 2017. Single-Channel Speech Separation,” in Proc. ICASSP, 2018, pp. 696–700. [18] Y.-H. Lai et al., “Deep Learning-Based Noise Reduction Approach to [43] M. Kolbæk, Z.-H. Tan, and J. Jensen, “Monaural Speech Enhancement Improve Speech Intelligibility for Cochlear Implant Recipients,” Ear and using Deep Neural Networks by Maximizing a Short-Time Objective Hearing, vol. 39, no. 4, pp. 795–809, 2018. Intelligibility Measure,” in Proc. ICASSP, 2018, pp. 5059 – 5063. 13 [44] M. Kolbæk et al., “Multi-talker Speech Separation With Utterance-Level Morten Kolbæk received the B.Eng. degree in Permutation Invariant Training of Deep Recurrent Neural Networks,” electronic design at Aarhus University, in 2013 and IEEE/ACM Transactions on Audio, Speech, and Language Processing, the M.Sc. in signal processing and computing from vol. 25, no. 10, pp. 1901–1913, Jul. 2017. Aalborg University, Denmark, in 2015. He received [45] Y. Wang, A. Narayanan, and D. Wang, “On Training Targets for the PhD degree from Aalborg University, Denmark, Supervised Speech Separation,” IEEE/ACM Transactions on Audio, in 2018 for the thesis entitled Single-Microphone Speech, and Language Processing, vol. 22, no. 12, pp. 1849–1858, Speech Enhancement and Separation Using Deep 2014. Learning (kolbaek-phd.aau.dk). He is currently a [46] G. Naithani et al., “Deep Neural Network Based Speech Separation post-doctoral researcher at the section for Signal Optimizing an Objective Estimator of Intelligibility for Low Latency and Information Processing at the Department of Applications,” in Proc. IWAENC, 2018, pp. 386–390. Electronic Systems, Aalborg University, Denmark. [47] K. Tan, J. Chen, and D. 
Wang, “Gated Residual Networks With His research interests include speech enhancement and separation, deep Dilated Convolutions for Monaural Speech Enhancement,” IEEE/ACM learning, and intelligibility improvement of noisy speech. Transactions on Audio, Speech, and Language Processing, vol. 27, no. 1, pp. 189–198, 2019. [48] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” in Proc. ICLR (arXiv:1412.6980), 2015. [49] D. Baby and S. Verhulst, “Sergan: Speech Enhancement Using Relativistic Generative Adversarial Networks with Gradient Penalty,” in ICASSP Zheng-Hua Tan (M’00–SM’06) received the B.Sc. 2019 - 2019 IEEE International Conference on Acoustics, Speech and and M.Sc. degrees in electrical engineering from Signal Processing (ICASSP), May 2019, pp. 106–110. Hunan University, Changsha, China, in 1990 and [50] S. Pascual, A. Bonafonte, and J. Serra, ` “SEGAN: Speech Enhancement 1996, respectively, and the Ph.D. degree in electronic Generative Adversarial Network,” in Proc. INTERSPEECH, 2017, pp. engineering from Shanghai Jiao Tong University, 3642–3646. Shanghai, China, in 1999. He is a Professor and a Co- [51] O. Ernst et al., “Speech Dereverberation Using Fully Convolutional Head of the Centre for Acoustic Signal Processing Networks,” in Proc. EUSIPCO, 2018, pp. 390–394. Research (CASPR) at Aalborg University, Aalborg, [52] P. C. Loizou, Speech Enhancement: Theory and Practice. CRC Press, Denmark. He was a Visiting Scientist at the Computer Science and Artificial Intelligence Laboratory, MIT, [53] C. M. Bishop, Pattern Recognition and Machine Learning. Springer, Cambridge, USA, an Associate Professor at Shanghai Jiao Tong University, and a postdoctoral fellow at KAIST, Daejeon, Korea. His [54] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, research interests include machine learning, deep learning, pattern recognition, speech and speaker recognition, noise-robust speech processing, multimodal [55] R. C. Hendriks, T. 
Gerkmann, and J. Jensen, “DFT-Domain Based Single- signal processing, and social robotics. He is the vice chair of the IEEE Microphone Noise Reduction for Speech Enhancement: A Survey of the Signal Processing Society Machine Learning for Signal Processing Technical State of the Art,” Synthesis Lectures on Speech and Audio Processing, Committee (MLSP TC). He is an Associate Editor for IEEE/ACM Transactions vol. 9, no. 1, pp. 1–80, 2013. on Audio, Speech and Language Processing, an Editorial Board Member for [56] J. Jensen and C. H. Taal, “Speech Intelligibility Prediction Based on Computer Speech and Language and was a Guest Editor for the IEEE Journal Mutual Information,” IEEE/ACM Transactions on Audio, Speech, and of Selected Topics in Signal Processing and Neurocomputing. He was the Language Processing, vol. 22, no. 2, pp. 430–440, 2014. General Chair for IEEE MLSP 2018 and a TPC co-chair for IEEE SLT 2016. [57] C. Fev ´ otte, R. Gribonval, and E. Vincent, “BSS EVAL Toolbox User Guide – Revision 2.0,” IRISA, Tech. Rep. inria-00564760, 2011. [58] B. Moore, An Introduction to the Psychology of Hearing. Brill, 2013. [59] A. W. Rix et al., “Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs,” in Proc. ICASSP, vol. 2, 2001, pp. 749–752. [60] “International Telecommunication Union - Recommendation Søren Holdt Jensen (S’87–M’88–SM’00) received P.862.1 : Mapping function for transforming P.862 raw the M.Sc. degree in electrical engineering from result scores to MOS-LQO,” 2003. [Online]. Available: Aalborg University (AAU), Aalborg, Denmark, in https://www.itu.int/rec/T-REC- P.862.1-200311- I/en 1988, and the Ph.D. degree (in signal processing) [61] J. S. Garofolo et al., “CSR-I (WSJ0) Complete LDC93S6A,” 1993, from the Technical University of Denmark (DTU), philadelphia: Linguistic Data Consortium. Lyngby, Denmark, in 1995. 
He is Full Professor [62] ——, “TIMIT Acoustic-Phonetic Continuous Speech Corpus LDC93S1,” in Signal Processing at Aalborg University. Before 1993, linguistic Data Consortium. joining the Department of Electronic Systems, Aal- [63] J. Barker et al., “The third ‘CHiME’ speech separation and recognition borg University, he was with the Telecommunications challenge: Dataset, task and baselines,” in Proc. ASRU, 2015, pp. 504– Laboratory of Telecom Denmark, Ltd, Taastrup (Copenhagen), Denmark; the Electronics Institute [64] ITU, “Rec. P.56 : Objective measurement of active speech level,” 2011, of Technical University of Denmark; the Scientific Computing Group of https://www.itu.int/rec/T-REC-P.56/. Danish Computing Center for Research and Education (UNIC), Lyngby; the [65] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional Networks Electrical Engineering Department (ESAT-SISTA) of Katholieke Universiteit for Biomedical Image Segmentation,” in Proc. MICCAI, N. Navab et al., Leuven, Leuven, Belgium; and the Center for PersonKommunikation (CPK) Eds., 2015, pp. 234–241. of Aalborg University. His current research interest are in statistical signal [66] K. He et al., “Delving Deep into Rectifiers: Surpassing Human-Level processing, numerical algorithms, optimization engineering, machine learning, Performance on ImageNet Classification,” in Proc. ICCV, 2015, pp. and digital processing of acoustic, audio, communication, multimedia, and 1026–1034. speech, signals. He is co-author of the textbook Software-Defined GPS and [67] L. Liu et al., “On the Variance of the Adaptive Learning Rate and Galileo Receiver—A Single-Frequency Approach, Birkhauser ¨ , Boston, USA, Beyond,” in Proc. ICLR, 2020. also translated to Chinese: National Defence Industry Press, China. Prof. Jensen [68] M. A. Stone and B. C. Moore, “Tolerable hearing aid delays. I. 
Estimation has been Associate Editor for the IEEE Transactions on Signal Processing, of limits imposed by the auditory path alone using simulated hearing IEEE/ACM Transactions on Audio, Speech and Language Processing, Elsevier losses,” Ear and Hearing, vol. 20, no. 3, pp. 182–192, 1999. Signal Processing, and EURASIP Journal on Advances in Signal Processing. [69] L. Bramsløw, “Preferred signal path delay and high-pass cut-off in open He is a recipient of an individual European Community Marie Curie (HCM: fittings,” International Journal of Audiology, vol. 49, no. 9, pp. 634–644, Human Capital and Mobility) Fellowship, former Chairman of the IEEE Denmark Section and the IEEE Denmark Section’s Signal Processing Chapter (founder and first chaiman). He is member of the Danish Academy of Technical Sciences (ATV) and has been member of the Danish Council for Independent Research (2011–2016) appointed by Danish Ministers of Science. 14 Jesper Jensen received the M.Sc. degree in electrical engineering and the Ph.D. degree in signal processing from Aalborg University, Aalborg, Denmark, in 1996 and 2000, respectively. From 1996 to 2000, he was with the Center for Person Kommunikation (CPK), Aalborg University, as a Ph.D. student and Assistant Research Professor. From 2000 to 2007, he was a Post-Doctoral Researcher and Assistant Professor with Delft University of Technology, Delft, The Netherlands, and an External Associate Professor with Aalborg University. Currently, he is a Senior Principal Scientist with Oticon A/S, Copenhagen, Denmark, where his main responsibility is scouting and development of new signal processing concepts for hearing aid applications. He is a Professor with the Section for Signal and Information Processing (SIP), Department of Electronic Systems, at Aalborg University. He is also a co-founder of the Centre for Acoustic Signal Processing Research (CASPR) at Aalborg University. 
His main interests are in the area of acoustic signal processing, including signal retrieval from noisy observations, coding, speech and audio modification and synthesis, intelligibility enhancement of speech signals, signal processing for hearing aid applications, and perceptual aspects of signal processing. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Electrical Engineering and Systems Science arXiv (Cornell University)



References (68)

ISSN
2329-9290
eISSN
ARCH-3348
DOI
10.1109/TASLP.2020.2968738

Abstract

On Loss Functions for Supervised Monaural Time-Domain Speech Enhancement

Morten Kolbæk, Zheng-Hua Tan, Senior Member, IEEE, Søren Holdt Jensen, and Jesper Jensen

Abstract: Many deep learning-based speech enhancement algorithms are designed to minimize the mean-square error (MSE) in some transform domain between a predicted and a target speech signal. However, optimizing for MSE does not necessarily guarantee high speech quality or intelligibility, which is the ultimate goal of many speech enhancement algorithms. Additionally, only little is known about the impact of the loss function on the emerging class of time-domain deep learning-based speech enhancement systems.

We study how popular loss functions influence the performance of time-domain deep learning-based speech enhancement systems. First, we demonstrate that perceptually inspired loss functions might be advantageous over classical loss functions like MSE. Furthermore, we show that the learning rate is a crucial design parameter even for adaptive gradient-based optimizers, which has been generally overlooked in the literature. Also, we found that waveform matching performance metrics must be used with caution as they in certain situations can fail completely. Finally, we show that a loss function based on scale-invariant signal-to-distortion ratio (SI-SDR) achieves good general performance across a range of popular speech enhancement evaluation metrics, which suggests that SI-SDR is a good candidate as a general-purpose loss function for speech enhancement systems.

Index Terms: Speech Enhancement, Fully Convolutional Neural Networks, Time-Domain, Objective Intelligibility.

Manuscript received August 27, 2019; revised December 19, 2019; accepted January 18, 2020. The associate editor coordinating the review of this manuscript and approving it for publication was Prof. Tan Lee. Corresponding author: Morten Kolbæk. M. Kolbæk, Z.-H. Tan, and S. H. Jensen are with the Department of Electronic Systems, Aalborg University, Aalborg 9220, Denmark (e-mail: mok@es.aau.dk; zt@es.aau.dk; shj@es.aau.dk). J. Jensen is with the Department of Electronic Systems, Aalborg University, Aalborg 9220, Denmark, and also with Oticon A/S, Smørum 2765, Denmark (e-mail: jje@es.aau.dk). Digital Object Identifier 10.1109/TASLP.2020.2968738. arXiv:1909.01019v2 [cs.SD] 30 Jan 2020.

I. INTRODUCTION

Speech enhancement algorithms for improving speech quality and speech intelligibility of single-channel recordings of noisy speech are in high demand in a wide range of applications, e.g. hearing aids design, mobile communications devices, voice-operated human-machine interfaces, etc. Consequently, developing successful monaural speech enhancement algorithms has been a long-lasting goal in both academia and industry.

In fact, over the last decade, monaural speech enhancement algorithms based on machine learning, and deep learning in particular, have received a tremendous amount of attention (see e.g. [1]–[10] as well as [11], [12] and references therein). Specifically, in recent years, deep learning-based speech enhancement algorithms, facilitated by powerful general-purpose graphics processing units and large amounts of training data, have shown impressive results by improving speech intelligibility in narrow acoustical conditions [6], [13]–[19]. However, despite the recent success of deep learning-based speech enhancement algorithms, many of the techniques referenced above are fundamentally limited, as they primarily focus on enhancement in the short-time spectral amplitude (STSA) domain and therefore ignore potentially useful phase information. Numerous recent deep learning-based speech enhancement techniques exist, however, that incorporate phase information (e.g. [20]–[22]).

The most successful approaches to date are arguably end-to-end techniques based on fully convolutional neural networks (FCNN) that do not apply the short-time discrete Fourier transform (STFT) or other pre-processing stages, but operate directly in the time-domain (e.g. [23]–[30]). These techniques, however, might still be limited as most of them rely on a loss function based on the mean square error (MSE) between time-domain waveforms. This is most likely suboptimal with respect to speech quality and intelligibility, as time-domain MSE has no apparent relation to human perception or the human auditory system in general. Furthermore, as the works above use widely different network architectures, development datasets, noise types, hyperparameters, etc., it is not yet established how the loss functions influence the performance of such systems and if alternative loss functions that are more perceptually meaningful might be advantageous.

In this paper we study the influence of loss functions on the performance of end-to-end time-domain deep learning-based speech enhancement systems. Specifically, we adopt a general-purpose FCNN architecture that takes as input a time-domain waveform of a noisy speech signal and is trained using various loss functions to predict as output the enhanced speech signal as a time-domain waveform that optimizes the loss function in question.

We focus on six loss functions: time-domain mean-square error (MSE) $\mathcal{L}_{\text{TIME-MSE}}$, short-time spectral amplitude (STSA) MSE $\mathcal{L}_{\text{STSA-MSE}}$ [31], short-time objective intelligibility (STOI) $\mathcal{L}_{\text{STOI}}$ [32], Extended STOI $\mathcal{L}_{\text{ESTOI}}$ [33], scale-invariant signal-to-distortion ratio (SI-SDR) $\mathcal{L}_{\text{SI-SDR}}$ [34], and perceptual metric for speech quality evaluation (PMSQE) $\mathcal{L}_{\text{PMSQE}}$ [35]. We study these loss functions as they jointly cover a large range of useful properties, e.g. close relationships to human perception or mathematical simplicity, that usually are of interest for speech enhancement systems (described in detail in Sec. II). Furthermore, the six loss functions have all been applied in recent deep learning-based speech processing techniques, e.g. [8]–[11], [23], [25], [35]–[47]. However, no existing work has studied these loss functions jointly under identical conditions and evaluated them in a structured manner with end-to-end time-domain deep learning-based speech enhancement systems.

Obviously, one would expect that a system trained to minimize a specific loss function would also achieve the minimum numerical value among all systems for that particular loss function during test. However, as training of machine learning models in general, and FCNNs in particular, is a highly non-linear process, which depends on the loss function itself, this is not guaranteed in practice. Furthermore, we argue that the learning-rate is a crucial design parameter when conducting such experiments, even for adaptive gradient-based optimizers such as ADAM [48], which has been generally overlooked in the literature despite its obvious importance. Therefore, it is of interest to establish exactly how large the difference in speech enhancement performance is between time-domain waveform-based speech enhancement systems trained using different loss functions and evaluated using popular speech enhancement performance metrics. In particular, one might hope to find a "universally good" loss function that performs almost optimally with respect to other loss functions. This is the goal of the paper and our findings might serve as a guideline in loss-function selection for deep learning-based speech processing systems.

The rest of the paper is organized as follows. In Sec. II we describe the monaural speech separation problem and present the six loss functions we will study. In Sec. III we present the design of our experimental study including the speech and noise material used for training. In Sec. IV we present and discuss the results. Finally, in Sec. V we conclude the paper.

II. SPEECH ENHANCEMENT SYSTEM

Fig. 1: End-to-end speech enhancement system based on a fully convolutional neural network.

Fig. 1 shows a block-diagram of the speech enhancement system we use for all experiments. The system is based on a fully convolutional neural network (FCNN), which is trained, end-to-end, to estimate the noise-free speech waveform from a noisy single-channel recording. Note, the architecture in Fig. 1 resembles that used in a large body of state-of-the-art deep learning-based speech enhancement literature (e.g. [23], [26], [28]–[30], [49]–[51]). Therefore, we argue that our experimental findings based on this particular architecture are representative and generally valid for a large range of deep learning-based speech enhancement methods. The architecture in Fig. 1 is further described in Sec. III-D.

Let $x \in \mathbb{R}^L$ be $L$ samples of a clean time-domain speech signal and let the corresponding noisy observation $y \in \mathbb{R}^L$ be

$$y = x + v, \tag{1}$$

where $v \in \mathbb{R}^L$ is an additive noise signal. The goal is then to find an estimate $\hat{x}$ of $x$ from $y$ using a FCNN,

$$\hat{x} = f_{\text{FCNN}}(y; \theta), \tag{2}$$

where $\theta$ represents the parameters of the FCNN. Using a supervised learning approach, the parameters are found such that they minimize a loss $\mathcal{L}$ over a training dataset, consisting of corresponding pairs of clean $x_{\text{train}}$ and noisy speech signals $y_{\text{train}}$. Our objective is then to study how the quality of $\hat{x}$, measured using different performance metrics, is affected by the choice of loss function $\mathcal{L}$. In the following, we review each of the loss functions we have selected for our experiments.

A. Time-Domain Mean Square Error

The first loss function we consider is the time-domain mean-square error (MSE). This loss function is given as

$$\mathcal{L}_{\text{TIME-MSE}} = \frac{1}{L} \| \hat{x} - x \|^2, \tag{3}$$

where $\| \cdot \|$ is the $\ell_2$-norm. We include this loss function because it is computationally very simple and because it is one of the most used loss functions in machine learning and signal processing in general [52]–[54]. However, little is known about the performance of time-domain speech enhancement systems optimized end-to-end for this loss function, when evaluated using standard speech enhancement metrics such as STOI and PESQ.

B. Short-Time Spectral Amplitude Mean Square Error

The second loss function we consider is the classical STSA-MSE, which is one of the most popular loss functions used for deep neural network based speech enhancement [10], [11]. The STSA-MSE function also plays a major role in more classical non-machine learning based speech enhancement algorithms [31], [55], but has also been used in recent time-domain techniques [23].

Let $x(k, m)$, $k = 1, \dots, K$, $m = 1, \dots, M$, be the $K$-point short-time discrete Fourier transform (STFT) of $x$, where $M = \lfloor L/I \rfloor - 1$ is the number of STFT frames with truncation and $I$ is the frame shift in samples. Furthermore, let $a(k, m) = |x(k, m)|$, $k = 1, \dots, \frac{K}{2} + 1$, $m = 1, \dots, M$, denote the single-sided amplitude spectra of $x(k, m)$. Finally, let $\hat{a}(k, m)$ denote the estimate of $a(k, m)$.

The STSA-MSE is then given as

$$\mathcal{L}_{\text{STSA-MSE}} = \frac{1}{\left(\frac{K}{2} + 1\right) M} \sum_{k=1}^{\frac{K}{2}+1} \sum_{m=1}^{M} \left( \hat{a}(k, m) - a(k, m) \right)^2, \tag{4}$$

which is the MSE between the single-sided amplitude spectra of the true target signal $x$ and the estimated signal $\hat{x}$. Note, $\mathcal{L}_{\text{STSA-MSE}}$ is only sensitive to variations in spectral amplitudes and not to variations in the short-time phase spectrum of the signals. This is different from $\mathcal{L}_{\text{TIME-MSE}}$ (Eq. (??)), as $\mathcal{L}_{\text{TIME-MSE}}$ is operating in the time-domain. For all experiments in this paper we use $K = 256$ and $I = 128$.

C. Short-Time Objective Intelligibility

The third loss function we consider is based on the short-time objective intelligibility (STOI) speech intelligibility estimator [32]. STOI is currently perhaps the most commonly used speech intelligibility estimator for objectively evaluating the performance of speech enhancement systems [6], [7], [9], [13]. This is presumably driven by the fact that STOI predictions have shown a good correspondence with measured intelligibility of noisy/processed speech in a large range of acoustic scenarios, including ideal time-frequency weighted noisy speech [32] and noisy speech enhanced by single-microphone time-frequency weighting-based speech enhancement systems [32] (see also [33], [56]). Therefore, it is natural to believe that gains in speech intelligibility, as estimated by STOI, can be achieved by utilizing a loss function based on STOI. In the following, we introduce the STOI loss function $\mathcal{L}_{\text{STOI}}$, which essentially is identical to STOI. The main difference is that we omit the voice activity detector (VAD) otherwise used by STOI. We do, however, apply the VAD from STOI on the dataset used for training and validation (described further in Sec. III).

Let $a(k, m)$, $k = 1, \dots, \frac{K}{2} + 1$, $m = 1, \dots, M$, denote the single-sided STFT amplitude spectra of the clean speech spectrum as defined in Sec. II-B. We then define the $j$th one-third octave band clean-speech amplitude, for time-frame $m$, as [32]

$$a_j(m) = \sqrt{ \sum_{k=k_1(j)}^{k_2(j)} a(k, m)^2 }, \tag{5}$$

where $k_1(j)$ and $k_2(j)$ denote the first and last STFT bin index, respectively, of the $j$th one-third octave band. In a similar fashion we define $\hat{a}_j(m)$ as the $j$th one-third octave band estimated clean-speech amplitude, for time-frame $m$. Furthermore, let a short-time temporal envelope vector that spans time-frames $m - N + 1, \dots, m$, in the $j$th frequency band for the clean speech signal be defined as

$$a_{j,m} = \left[ a_j(m - N + 1),\ a_j(m - N + 2),\ \dots,\ a_j(m) \right]^T, \tag{6}$$

where $N = 30$, which corresponds to approximately 384 ms with a sampling frequency of 10 kHz.

Similarly, we define $\hat{a}_{j,m}$ as the short-time temporal envelope vector for the enhanced speech signal. The vector $\hat{a}_{j,m}$ is normalized and clipped for each entry $\hat{a}_{j,m}(n)$ according to

$$\hat{a}'_{j,m}(n) = \min\left( \frac{\| a_{j,m} \|}{\| \hat{a}_{j,m} \|} \hat{a}_{j,m}(n),\ \left(1 + 10^{0.75}\right) a_{j,m}(n) \right), \tag{7}$$

for $n = 1, 2, \dots, N$.

The intermediate intelligibility measure for a pair of short-time temporal envelope vectors $a_{j,m}$ and $\hat{a}_{j,m}$ is then defined as the sample linear correlation between the clean and enhanced envelope vectors given as

$$d_{j,m} = \frac{ \left( a_{j,m} - \mu_{a_{j,m}} \right)^T \left( \hat{a}_{j,m} - \mu_{\hat{a}_{j,m}} \right) }{ \left\| a_{j,m} - \mu_{a_{j,m}} \right\| \left\| \hat{a}_{j,m} - \mu_{\hat{a}_{j,m}} \right\| }, \tag{8}$$

where $\mu_{a_{j,m}}$ and $\mu_{\hat{a}_{j,m}}$ are the sample mean vectors of $a_{j,m}$ and $\hat{a}_{j,m}$, respectively. From $d_{j,m}$, the final STOI score for an entire speech signal is then defined as the scalar, $-1 \le d_{\text{STOI}} \le 1$,

$$d_{\text{STOI}} = \frac{1}{J (M - N + 1)} \sum_{j=1}^{J} \sum_{m=N}^{M} d_{j,m}, \tag{9}$$

where $J = 15$ is the number of one-third octave bands and $M - N + 1$ is the total number of short-time temporal envelope vectors. With $J = 15$, the center frequency of the first one-third octave band is 150 Hz and the last one is at approximately 3.8 kHz. These frequencies are chosen such that they span the frequency range in which human speech normally lies [32]. Finally, with $N = 30$, STOI is sensitive to temporal modulation frequencies of 2.6 Hz and higher, which are frequencies important for speech intelligibility [32].

We define our STOI loss function to be minimized as

$$\mathcal{L}_{\text{STOI}} = -d_{\text{STOI}}. \tag{10}$$

Note, except for the $\min(\cdot, \cdot)$ operator in Eq. (??) the entire STOI loss function is differentiable and computing the required gradients for gradient based optimization is straightforward (see e.g. [10]). Furthermore, the $\min(\cdot, \cdot)$ operator requires only two subgradients, so the computational complexity of its gradient computation is similar to the standard ReLU activation function, which is nothing more than the max operator. To that end, $\mathcal{L}_{\text{STOI}}$ is suitable as a loss function for training DNN-based speech enhancement systems.

D. Extended Short-Time Objective Intelligibility

The fourth loss function we include is the extended short-time objective intelligibility (ESTOI) speech intelligibility estimator [33]. As the name implies, ESTOI is inspired by STOI and was developed in an attempt to improve STOI. Specifically, in [33] it was shown that the performance of certain speech intelligibility estimators, including STOI, was sensitive to spectro-temporal modulations of the noise component and that STOI did not correlate as well ($\rho = 0.47$ [33]) with listening test results, when the noise components were highly fluctuating (as e.g. with a competing talker).

To alleviate this drawback of STOI, ESTOI was proposed [33]. It was shown that ESTOI significantly outperformed STOI ($\rho > 0.90$ [33]), as well as other speech intelligibility estimators, in conditions when the noise type is highly fluctuating, while performing on par with these estimators in less fluctuating noise conditions. Consequently, it is of interest to study how ESTOI compares with STOI, as a loss function, for time-domain DNN-based speech enhancement.

Similarly to STOI, ESTOI is based on an average correlation coefficient between one-third octave band short-time temporal envelope vectors. Specifically, let

$$A_m = \begin{bmatrix} a_1(m - N + 1) & \cdots & a_1(m) \\ \vdots & \ddots & \vdots \\ a_J(m - N + 1) & \cdots & a_J(m) \end{bmatrix} \tag{11}$$

denote a short-time spectrogram matrix of the clean speech signal, where the rows of $A_m$ are given by $a_{j,m}^T$, which are short-time temporal envelope vectors in a one-third band defined by Eq. (??). The $j$th mean- and variance-normalized row of $A_m$ is then given by

$$\bar{a}_{j,m} = \frac{1}{\left\| a_{j,m} - \mu_{a_{j,m}} \right\|} \left( a_{j,m} - \mu_{a_{j,m}} \right). \tag{12}$$

ESTOI now introduces the row-normalized spectrogram matrix

$$\bar{A}_m = \begin{bmatrix} \bar{a}_{1,m}^T \\ \vdots \\ \bar{a}_{J,m}^T \end{bmatrix} \tag{13}$$

and defines $\check{a}_{n,m}$ as the mean- and variance-normalized $n$th column, $n = 1, 2, \dots, N$, of $\bar{A}_m$, where the normalization of the columns is performed analogously to Eq. (??).

Finally, define

$$\check{A}_m = \begin{bmatrix} \check{a}_{1,m} & \cdots & \check{a}_{N,m} \end{bmatrix} \tag{14}$$

as the row and column normalized spectrogram matrix. Similarly, we define $\check{\hat{a}}_{n,m}$ as the columns of the row and column normalized spectrogram matrix $\check{\hat{A}}_m$ for the enhanced speech signal. Finally, the ESTOI speech intelligibility index is defined as

$$d_{\text{ESTOI}} = \frac{1}{N M} \sum_{m=1}^{M} \sum_{n=1}^{N} \check{a}_{n,m}^T \check{\hat{a}}_{n,m}. \tag{15}$$

Similarly to $\mathcal{L}_{\text{STOI}}$, we define the ESTOI loss as

$$\mathcal{L}_{\text{ESTOI}} = -d_{\text{ESTOI}}. \tag{16}$$

Note, differently from $\mathcal{L}_{\text{STOI}}$, $\mathcal{L}_{\text{ESTOI}}$ does not include the clipping step, i.e. the $\min(\cdot, \cdot)$ operator in Eq. (??), which makes $\mathcal{L}_{\text{ESTOI}}$ fully differentiable. Also, similarly to the definition of $\mathcal{L}_{\text{STOI}}$, we have ignored the VAD otherwise used by ESTOI, as we apply the VAD on the data prior to training.

E. Scale-Invariant Signal-to-Distortion Ratio

The fifth loss function we include is the scale-invariant signal-to-distortion ratio (SI-SDR) [34]. The SI-SDR is an objective [...] proposed as an alternative to the often used SDR measure from the BSS eval toolbox [57]. Differently from SDR, SI-SDR is invariant to the scale of the processed signal, but not to deformations caused by finite-impulse response filters as SDR is [34].

The SI-SDR is defined as

$$\text{SI-SDR} = 10 \log_{10} \frac{ \| \alpha x \|^2 }{ \| \alpha x - \hat{x} \|^2 }, \tag{17}$$

where

$$\alpha = \frac{\hat{x}^T x}{\| x \|^2} = \underset{\alpha}{\operatorname{argmin}} \ \| \alpha x - \hat{x} \|^2. \tag{18}$$

It is seen from Eqs. (??) that SI-SDR is simply the signal-to-noise ratio (SNR) between the weighted clean speech signal and the residual noise defined as $\| \alpha x - \hat{x} \|^2$. Hence,

$$\text{SI-SDR} = 10 \log_{10} \left( \frac{ \left\| \frac{\hat{x}^T x}{\| x \|^2} x \right\|^2 }{ \left\| \frac{\hat{x}^T x}{\| x \|^2} x - \hat{x} \right\|^2 } \right) = 10 \log_{10} \left( \frac{ \left( x^T \hat{x} \right)^2 }{ x^T x \, \hat{x}^T \hat{x} - \left( x^T \hat{x} \right)^2 } \right). \tag{19}$$

The scaling of the reference signal $x$ ensures that the SI-SDR measure is invariant to the scale of $\hat{x}$, which might be desirable in applications where the speech processing algorithm does not guarantee a proper scaling of the processed signal, such as many DNN-based systems. This is also motivated by the fact that both speech quality and intelligibility to a large extent are invariant to scaling [58]. Note that maximizing SI-SDR is equivalent to maximizing the sample correlation between $x$ and $\hat{x}$, while producing the solution with the minimum energy [36], [37]. Furthermore, similarly to SNR, SI-SDR is expressed in units of decibel (dB) and is defined in the range $-\infty < \text{SI-SDR} < \infty$, which motivates us to define the SI-SDR loss function as

$$\mathcal{L}_{\text{SI-SDR}} = -\text{SI-SDR}. \tag{20}$$

F. Perceptual Metric for Speech Quality Evaluation

The sixth, and last loss function is the perceptual metric for speech quality evaluation (PMSQE) [35]. The PMSQE loss function, $\mathcal{L}_{\text{PMSQE}}$, is designed to approximate the non-differentiable perceptual evaluation of speech quality (PESQ) speech quality estimator. The PESQ speech quality estimator is furthermore designed to predict the mean opinion score (MOS) of a speech quality listening test for certain degradations. Consequently, the PESQ score of a processed speech signal is a scalar between 1 and 4.5, where 1 indicates extremely poor quality and 4.5 corresponds to no distortion at all [59], [60]. Along the same lines, the PMSQE loss function is designed to be inversely proportional to PESQ, such that a low PMSQE value corresponds to a high PESQ value and vice versa. In practice PMSQE is defined in the range from 3 to 0, where 0 is equivalent to an undistorted signal and 3.0 corresponds to an extremely poor quality.

Fig.
2 shows the correspondence between PESQ and PMSQE performance measure that was introduced for evaluating the for a speech signal corrupted with either a stationary speech performance of speech processing algorithms and it was shaped noise (SSN) signal or a non-stationary 6-speaker babble 5 Finally, the test set is based on 1000 randomly selected 4.5 3 spoken utterances from si et 05 and si dt 05, which consists 4 2.5 of 1857 utterances divided among ten males and six females. 3.5 2 Note, as the training and validation sets consist of approx- imately three times as many utterances as their respective 3 1.5 subsets of si tr s, each utterance from WSJ0 will on average be 2.5 1 selected three times. However, as each utterance is mixed with 2 0.5 its own unique noise signal, the redundancy in speech material 1.5 0 increases the total variability in the dataset and ultimately -10 0 10 20 30 40 improves the generalizability capability of the system. Also note that the speakers used in the training and validation sets Fig. 2: PESQ ITU P.862.1 and PMSQE scores as function of are different than the speakers used for test, i.e. the tests are SNR for SSN and BBL noise-corrupted speech. conducted in a speaker independent setting. Furthermore, as we are primarily interested in speech active regions during training, we apply the voice activity noise signal, at various SNRs. It is seen that PESQ and detector (VAD) from STOI (and ESTOI) [32], [33] on the PMSQE are approximately inversely proportional and have training and validation set to ensure that any potentially long a monotonic relationship with respect to SNR. Hence, it is silent regions are removed prior to training. Specifically, the assumed that if PMSQE is minimized, PESQ will be maximized. 
VAD analyzes the clean waveform in 25 ms segments and The L loss function is essentially a log-domain STSA- PMSQE removes the segments where the signal energy is more than MSE loss function with additional key terms that are inspired 40 dB below the energy of the segment with the maximum by human perception. Consequently, an outline of L is PMSQE energy in the waveform. rather involved, and we refer the reader to [35] for details Finally, all utterances used with L , L , L STOI ESTOI TIME-MSE regarding the design of PMSQE. Furthermore, as PMSQE, and L are downsampled to 10 kHz, as STOI and ESTOI SI-SDR similarly to PESQ, is defined for sampling rates at either 8 are defined for this sampling frequency, and to allow an efficient kHz or 16 kHz, we use a 8 kHz sampling frequency when training scheme using minibatch training, each utterance is training L systems, and we downsample test signals to PMSQE truncated or zero-padded to four seconds. The utterances used 8 kHz, when we evaluate speech enhancement systems using with L are downsampled to 8 kHz to comply with the PMSQE PESQ. definition of PMSQE, which results in an utterance duration of approximately five seconds. III. EXPERIM ENTAL DESIGN To study how the loss functions presented in Sec. II affect the B. Noise Types performance of FCNN-based speech enhancement systems in To ensure a diverse noise variability we include four different realistic acoustical conditions, we train multiple systems using noise types in the training dataset: two synthetic noise signals a large noisy-speech dataset with a high degree of speaker and and two real-life recordings of natural sound scenes. This noise variability. In the following, we introduce the dataset, is motivated by the fact that a priori knowledge about the noise types and mixture conditions used for all experiments noise type might lead to unrealistic performance estimates [9]. presented in Sec. IV. 
The two synthetic noise signals are a stationary speech shaped noise (SSN) and a non-stationary 6-speaker babble (BBL) noise. The SSN signal is synthetically generated Gaussian white noise A. Noise-free Speech Mixtures that is spectrally shaped using a 12th-order all-pole filter with We have evaluated the six loss functions using the WSJ0 coefficients found from linear predictive coding analysis of speech corpus [61]. Specifically, using a sampling-with- the concatenation of 100 randomly chosen TIMIT sentences replacement scheme, the training data is based on 30000 [62]. The BBL noise signal is constructed as a linear mix of randomly selected spoken utterances from a subset of the randomly selected utterances from the TIMIT corpus such that si tr s part of WSJ0. The dataset size was found during six speakers are speaking at any given time. Using the entire preliminary experiments to be a good trade-off between training TIMIT database of 6300 utterances results in a BBL noise time and speech enhancement performance. This si tr s subset sequence with a duration of more than 50 min. For the real-life of WSJ0 consists in total of 11613 utterances approximately noise signals, we use the street (STR) and cafeteria (CAF), equally divided among 44 male speakers and 47 female noise signals from the CHiME3 dataset, which are signals that speakers. This ensures that the training dataset contains a large have been recorded in a natural occurring sound scene [63]. speaker variability, which allows the final speech enhancement Finally, we divide the noise signals such that 40 minutes system to be largely speaker independent [9]. is used for training, five minutes is used for validation and Similarly, the validation set is based on 3000 randomly another five minutes is used for test. 
This ensures that each selected spoken utterances from another subset of si tr s, which noise type is equally represented and with unique realizations consists of 1163 spoken utterances divided among five male in each dataset. speakers and five female speakers, which are not present in the training set. zero-padding constitutes only 3:9 % of the total number of samples. 6 To evaluate the performance of the speech enhancement Sec. IV-A. We use optimized (and different) learning rates for systems to unseen or unmatched noise signals we also test the different loss functions as further described in Sec. IV-A. using the bus (BUS), and pedestrian (PED) noise signals from The learning rates are shown in Table II. Finally, a batch size [63]. These noise signals are also real-life recordings, but they of eight is used, and training is stopped, if the validation loss represent different noise statistics compared to the four noise has not decreased for five epochs or a maximum of 200 epochs types used for training. has elapsed. We have implemented the speech enhancement systems 2 3 using Keras with a TensorFlow backend and the python C. Noisy Speech Mixtures implementation of the models and loss functions, as well as To construct the noisy speech signals, we follow Eq. (??) and audio samples, are available online . combine a noise-free training utterance x with an equal length and randomly selected noise sequence v. The noise signal v IV. E XPERIM ENTAL RESULTS is scaled according to the active speech level of x as defined We now investigate empirically how each of the loss by ITU P.56 [64] to achieve a certain SNR. For the training functions presented in Sec. II affects the speech enhancement and validation datasets, this SNR is chosen uniformly from performance of the time-domain FCNN-based speech enhance- [10; 10], which ensures that the intelligibility of the noisy ment system presented in Sec. III. Specifically, in Sec. 
IV-A speech waveforms y ranges from poor to perfectly intelligible. we study the sensitivity of speech enhancement performance with respect to learning rate. Such a study is a prerequisite to D. Model Architecture and Training allow a fair comparison between the custom loss functions in The speech enhancement system (Fig. 1) consists of a FCNN subsequent studies. We then study in Sec. IV-B how the signal with 18 layers configured in an encoder/decoder architecture integrity varies among the loss functions. Lastly, in Sec. IV-C, [65] using parameterized ReLU (PReLU) activation functions we study the speech enhancement performance for each loss [66]. The input dimension is L = 38656 and except for the function in various both matched and unmatched noise types first layer all remaining layers in the encoder use a stride of at a wide range of SNRs. We evaluate the speech enhancement two, which drives the final dimension in the bottleneck to be performance of all the systems using the following popular of dimension L=256. Similarly, except for the last layer, which and often used metrics: STOI [32], ESTOI [33], SI-SDR [34], has dimension L, all layers in the decoder uses upsampling SDR [57], and PESQ [59]. with a factor of two. Additionally, skip-connections where incoming channels are concatenated with existing channels A. Learning Rate vs. Performance Metric are used between the first eight layers in the encoder and the Since the goal of this paper is to make a comparison corresponding eight layers in the decoder. Similarly to [23], between loss functions, it is important that the comparison during training 20 % dropout is used for every third layer. is just. However, as the loss functions presented in Sec. 
II Furthermore, in (inChannel, outChannel, stride) format, have different processing steps, they have different partial the FCNN model has one (1,64,1), two (64,64,2), one derivatives, which might lead to different gradient norms and a (64,128,2), two (128,128,2), one (128,256,2), two (256,256,2), varying sensitivity to the choice of learning rate during gradient two (512,256,1), three (256,128,1), three (128,64,1), and one based optimization (e.g. [10], [67]). Therefore, to study the (128,1,1) convolutional layers with a filter size of 11 samples, influence that the learning rate can have on the performance which makes the model comparable to other enhancement of time-domain FCNN-based speech enhancement systems, models in the literature (see e.g. [23], [24], [50]). In total, the we have trained multiple systems with various learning rates. model has approximately 6.8 million parameters. Specifically, for each of the six loss functions in Sec. II, we Note, due to the encoder/decoder architecture, the receptive have trained a system using the following five learning rates: field is 2561 samples, which means that 2561 samples need to 2 3 4 4 5 10 , 10 , 5 10 , 10 , and 10 . The learning rates have be available before the system can produce a single output. In been selected from preliminary experiments in order to cover other words, with a 10 kHz sampling frequency the latency of the two training extremes, when training either diverge, i.e. a the speech enhancement system is 256 ms. For applications too large learning rate is used, or when training converge too where hard real-time requirements apply, e.g. hearing aids, this slowly and ultimately ends up at a plateau with a validation latency can likely be reduced significantly using alternative loss higher than the validation loss achieved using a larger architectures (e.g. [42]). learning rate. 
The speech enhancement system is trained using the ADAM optimizer [48] with $\beta_1 = 0.9$ and $\beta_2 = 0.999$ and a learning rate schedule that reduces the learning rate by a factor of two if the validation loss has not decreased for two epochs. The six loss functions considered in this study have different gradients and ultimately different gradient norms. Consequently, a learning rate used for one loss function might not be the optimal learning rate for another loss function. In fact, using a non-optimal learning rate might result in radically different solutions and potentially erroneous conclusions, as we show in Sec. IV-A.

Keras: https://keras.io/. TensorFlow: https://tensorflow.org/. Implementation of the models and loss functions, and audio samples: https://git.its.aau.dk/mok/Speech_Enhancement_Loss.git

The systems for this particular experiment have been trained using SSN at an SNR of 0 dB. (Preliminary experiments using BBL indicated similar results.)

In Table I we present different performance scores for time-domain FCNN-based speech enhancement systems trained using different loss functions and learning rates. The largest performance scores with respect to each loss function (i.e. column-wise) are highlighted in boldface. It is evident from Table I that a learning rate of $10^{-2}$ is too large for all loss functions, as none of the loss functions manage to improve the validation loss. Similarly, it is seen from Table I that a learning rate of $10^{-5}$ is too small for all loss functions, as none of the systems, except for $\mathcal{L}_{\text{SI-SDR}}$ evaluated using STOI, achieve the largest scores for this particular learning rate. However, the $\mathcal{L}_{\text{SI-SDR}}$ systems achieve the same STOI score for the three smallest learning rates. Furthermore, it is seen that the learning rates in the middle range, i.e. $5 \cdot 10^{-4}$ and $10^{-4}$, achieve particularly large scores. Specifically, it is seen from Table I that $\mathcal{L}_{\text{TIME-MSE}}$, $\mathcal{L}_{\text{SI-SDR}}$, $\mathcal{L}_{\text{STSA-MSE}}$, and $\mathcal{L}_{\text{PMSQE}}$ all achieve the largest overall performance scores using a learning rate of $5 \cdot 10^{-4}$, whereas the remaining loss functions $\mathcal{L}_{\text{STOI}}$ and $\mathcal{L}_{\text{ESTOI}}$ achieve their maximum performance scores with a learning rate of $10^{-4}$.

TABLE I: Performance of different speech enhancement systems measured using various performance metrics. The systems have been tested in matched noise-type conditions using SSN at 0 dB SNR. With a learning rate of $10^{-2}$, none of the systems could improve the training or validation loss (no convergence), so only the four remaining learning rates are listed. SI-SDR and SDR are in dB.

| Learning rate | Metric | Noisy | L_TIME-MSE | L_STOI | L_ESTOI | L_SI-SDR | L_STSA-MSE | L_PMSQE |
|---|---|---|---|---|---|---|---|---|
| $10^{-3}$ | STOI | 0.75 | 0.92 | 0.93 | 0.91 | 0.91 | 0.92 | 0.89 |
| $10^{-3}$ | ESTOI | 0.46 | 0.79 | 0.81 | 0.79 | 0.77 | 0.79 | 0.73 |
| $10^{-3}$ | SI-SDR | -1.05 | 10.24 | 3.55 | -4.37 | 9.82 | 6.59 | -1.07 |
| $10^{-3}$ | SDR | -0.92 | 10.88 | 6.47 | 4.77 | 10.52 | 9.39 | 1.89 |
| $10^{-3}$ | PMSQE | 2.53 | 1.10 | 1.47 | 1.48 | 1.20 | 1.22 | 1.14 |
| $10^{-3}$ | PESQ | 1.79 | 2.72 | 2.67 | 2.51 | 2.65 | 2.77 | 2.65 |
| $5 \cdot 10^{-4}$ | STOI | 0.75 | 0.92 | 0.93 | 0.93 | 0.92 | 0.92 | 0.89 |
| $5 \cdot 10^{-4}$ | ESTOI | 0.46 | 0.79 | 0.82 | 0.83 | 0.80 | 0.80 | 0.75 |
| $5 \cdot 10^{-4}$ | SI-SDR | -1.05 | 10.30 | 2.09 | 3.12 | 10.70 | -4.32 | -8.52 |
| $5 \cdot 10^{-4}$ | SDR | -0.92 | 11.00 | 7.73 | 7.84 | 11.32 | 2.27 | 4.21 |
| $5 \cdot 10^{-4}$ | PMSQE | 2.53 | 1.07 | 1.43 | 1.27 | 1.01 | 1.19 | 1.05 |
| $5 \cdot 10^{-4}$ | PESQ | 1.79 | 2.73 | 2.70 | 2.68 | 2.77 | 2.80 | 2.72 |
| $10^{-4}$ | STOI | 0.75 | 0.92 | 0.93 | 0.93 | 0.92 | 0.92 | 0.89 |
| $10^{-4}$ | ESTOI | 0.46 | 0.79 | 0.82 | 0.83 | 0.80 | 0.80 | 0.74 |
| $10^{-4}$ | SI-SDR | -1.05 | 10.13 | 1.95 | -12.03 | 10.59 | 8.10 | -6.96 |
| $10^{-4}$ | SDR | -0.92 | 10.78 | 4.99 | 0.61 | 11.22 | 9.46 | 4.76 |
| $10^{-4}$ | PMSQE | 2.53 | 1.11 | 1.46 | 1.24 | 1.03 | 1.21 | 1.12 |
| $10^{-4}$ | PESQ | 1.79 | 2.72 | 2.69 | 2.68 | 2.77 | 2.79 | 2.67 |
| $10^{-5}$ | STOI | 0.75 | 0.90 | 0.92 | 0.92 | 0.92 | 0.91 | 0.85 |
| $10^{-5}$ | ESTOI | 0.46 | 0.74 | 0.80 | 0.81 | 0.79 | 0.77 | 0.67 |
| $10^{-5}$ | SI-SDR | -1.05 | 8.78 | -2.22 | -22.36 | 10.15 | -3.51 | -6.52 |
| $10^{-5}$ | SDR | -0.92 | 9.46 | 7.28 | 4.81 | 10.78 | 4.76 | -0.65 |
| $10^{-5}$ | PMSQE | 2.53 | 1.40 | 1.56 | 1.47 | 1.12 | 1.36 | 1.51 |
| $10^{-5}$ | PESQ | 1.79 | 2.53 | 2.58 | 2.59 | 2.72 | 2.69 | 2.37 |

More importantly, it is seen that choosing a non-optimal learning rate might actually lead to a wrong conclusion if the systems were compared based on the same learning rate. This is a consideration that has been generally absent in the literature. For example, with the standard learning rate of $10^{-3}$ (the value originally proposed in [48] and currently default in https://keras.io/), the $\mathcal{L}_{\text{TIME-MSE}}$ and $\mathcal{L}_{\text{ESTOI}}$ systems both achieve an ESTOI score of 0.79, which might lead to the, perhaps faulty, conclusion that both loss functions possess the same potential with respect to ESTOI improvements. However, with a learning rate of e.g. $5 \cdot 10^{-4}$, it is seen that the $\mathcal{L}_{\text{TIME-MSE}}$ system still achieves an ESTOI score of 0.79, whereas the $\mathcal{L}_{\text{ESTOI}}$ system achieves a considerably larger ESTOI score of 0.83, which leads to the correct conclusion that the $\mathcal{L}_{\text{ESTOI}}$ loss function has the potential to outperform $\mathcal{L}_{\text{TIME-MSE}}$ in terms of ESTOI. Similar observations can be made for other loss functions, e.g. with respect to $\mathcal{L}_{\text{TIME-MSE}}$ and $\mathcal{L}_{\text{SI-SDR}}$. Furthermore, it is seen that the SI-SDR scores have a large variance, and with a learning rate of e.g. $10^{-5}$ they can vary from $-22.36$ dB for systems optimized for $\mathcal{L}_{\text{ESTOI}}$ to 10.15 dB for systems optimized for $\mathcal{L}_{\text{SI-SDR}}$, while achieving comparable STOI, ESTOI, and PESQ scores. This phenomenon is somewhat surprising and is further studied in Sec. IV-B. Also, Table I suggests that when a system is trained with a specific loss function, no other system achieves a larger performance score with respect to that particular metric. This expected result indicates that training has evolved correctly and that the learning rates used in Table I are close to optimal. Finally, it is seen that although $\mathcal{L}_{\text{PMSQE}}$ is an approximation of PESQ, in Table I, $\mathcal{L}_{\text{SI-SDR}}$ and $\mathcal{L}_{\text{STSA-MSE}}$ consistently lead to larger PESQ scores than $\mathcal{L}_{\text{PMSQE}}$ despite $\mathcal{L}_{\text{PMSQE}}$ consistently achieving the lowest PMSQE values of the two loss functions. $\mathcal{L}_{\text{PMSQE}}$ does, however, lead to larger PESQ scores than $\mathcal{L}_{\text{STOI}}$ and $\mathcal{L}_{\text{ESTOI}}$ for several testing conditions.

In conclusion, selecting the learning rate can have a profound impact on the performance of FCNN-based speech enhancement systems, and selecting the proper learning rate is crucial when systems trained using different loss functions are compared. Table II summarizes the learning rates that we will use for training the systems presented in Sec. IV-C. The learning rates are selected as the ones that maximize the performance metric most similar to the loss function the systems were trained with. Please note that we used four significant digits when the learning rates were selected to ensure a proper resolution.

TABLE II: Optimal learning rates for different loss functions.

| Loss | L_TIME-MSE | L_STOI | L_ESTOI | L_SI-SDR | L_STSA-MSE | L_PMSQE |
|---|---|---|---|---|---|---|
| LR | $5 \cdot 10^{-4}$ | $10^{-4}$ | $10^{-4}$ | $5 \cdot 10^{-4}$ | $5 \cdot 10^{-4}$ | $5 \cdot 10^{-4}$ |

B. Signal Integrity vs. Performance Metric

We now study the signal integrity achieved by the systems trained to minimize the different loss functions. Specifically, we compare the waveforms (Fig. 3) and amplitude spectra (Fig. 4) of representative clean, noisy, and enhanced speech signal segments processed by speech enhancement systems trained using the loss functions presented in Sec. II and the learning rates given in Table II.

Figure 3 presents the waveforms of a specific 10 ms realization of clean, noisy, and enhanced speech signals from the experiments in Table I. At first, if polarity is ignored, it is seen from Fig. 3 that systems trained with the six loss functions manage to enhance the noisy speech signal, as we see a reasonably good correspondence between the clean speech signal (red-solid) and the enhanced speech signal (blue-dashed), with the enhanced signal having considerably less noise compared to the noisy speech signal (yellow-dotted). From Fig. 3a it appears that the $\mathcal{L}_{\text{TIME-MSE}}$ loss function achieves the most per-sample-accurate estimate of the clean signal, and $\mathcal{L}_{\text{STOI}}$, $\mathcal{L}_{\text{ESTOI}}$, and $\mathcal{L}_{\text{PMSQE}}$ (Figs. 3b, 3c, and 3f) appear to achieve the least per-sample-accurate estimate. It is not surprising that $\mathcal{L}_{\text{TIME-MSE}}$ achieves the best estimate, as $\mathcal{L}_{\text{TIME-MSE}}$ is a waveform matching loss function and consequently penalizes time-domain samples that deviate from the samples of the clean signal. However, a perfect sample-wise waveform reconstruction is not necessarily the only optimum if the receiver is the human auditory system and the goal is to achieve high speech intelligibility or quality as perceived by humans. For example, in Figs. 3b and 3c it is seen that the waveforms of the enhanced signals are inverted and somewhat different from Fig. 3a, although the processed signals achieve similar or higher STOI and ESTOI scores (Table I), i.e. the signals should ideally represent similar or higher levels of intelligibility. This is because $\mathcal{L}_{\text{STOI}}$ and $\mathcal{L}_{\text{ESTOI}}$ are loss functions based on matching of short-time energy in one-third octave bands. As a consequence, these loss functions are e.g. invariant to the signal polarity.

Fig. 3: Time-domain waveform of a clean speech signal (solid red), noisy speech signal (dotted yellow) and processed speech signals (dashed blue) processed by systems trained using different loss functions. Panels: (a) L_TIME-MSE, (b) L_STOI, (c) L_ESTOI, (d) L_SI-SDR, (e) L_STSA-MSE, (f) L_PMSQE. Axes: time [ms] versus amplitude.

Furthermore, by carefully inspecting e.g. Fig. 3c it can be observed that the enhanced signal is slightly time-shifted compared to the clean signal. This phenomenon is more evident in Fig. 3e, where it is easy to see that the enhanced signal is time-shifted a few samples with respect to the clean signal. Loss functions such as $\mathcal{L}_{\text{STOI}}$, $\mathcal{L}_{\text{ESTOI}}$, $\mathcal{L}_{\text{PMSQE}}$ and $\mathcal{L}_{\text{STSA-MSE}}$ are primarily based on short-time magnitude spectra and do not penalize waveform deviations. Hence, they may allow for the enhanced speech signal to be time-shifted with respect to the clean signal. That being said, the amount of time-shift that we have observed is less than 1 ms, which is considerably smaller than the 10-30 ms usually required before the time-shift may be perceivable in real-life low-latency speech processing applications such as hearing aids and mobile communications devices [68], [69].

Furthermore, when evaluating speech enhancement performance with waveform-matching metrics such as SNR, SDR, or SI-SDR, exact time-matching is critical, and a delay of a few samples can cause a complete failure of such performance metrics. This is exactly what we observe in Table I, where the SI-SDR scores, and to a smaller extent the SDR scores, have a high variance with no obvious correspondence with the remaining metrics such as STOI and ESTOI, which are stable and show a more consistent behavior. Since SI-SDR and SDR are scale-invariant waveform matching functions, they fail if the processed signal and the reference signal are not perfectly aligned. Consequently, SI-SDR and SDR should be used with caution when they are used to evaluate time-domain speech enhancement or separation systems with the capability to modify the phase, such as time-domain FCNNs. Furthermore, they should generally be avoided when loss functions like $\mathcal{L}_{\text{STOI}}$, $\mathcal{L}_{\text{ESTOI}}$, $\mathcal{L}_{\text{PMSQE}}$ and $\mathcal{L}_{\text{STSA-MSE}}$ are utilized.

Finally, in Fig. 4 we show the corresponding amplitude spectra of the signals from Fig. 3 using a 40 ms window, centered around the 10 ms time-domain segment from Fig. 3, to ensure a sufficient frequency resolution. It is seen from Fig. 4 that all six loss functions lead to enhanced signals whose magnitude spectra resemble that of the clean speech signal. Furthermore, it is seen that the enhanced signals capture the dominating harmonics of the clean speech signal, while attenuating the major frequency components of the noise. Similarly to Fig. 3a, it is seen that $\mathcal{L}_{\text{TIME-MSE}}$ achieves an accurate estimate of the amplitude spectrum, but $\mathcal{L}_{\text{STSA-MSE}}$ (Fig. 4e) also achieves an accurate estimate, which is expected as $\mathcal{L}_{\text{STSA-MSE}}$ is a frequency-domain energy-matching loss function. In fact, by careful inspection of Fig. 4e it can be observed that $\mathcal{L}_{\text{STSA-MSE}}$ manages to preserve the higher order harmonics to a larger extent than $\mathcal{L}_{\text{TIME-MSE}}$ (Fig. 4a). Also, as expected, we can conclude that the small time-shift induced by $\mathcal{L}_{\text{STSA-MSE}}$ in Fig. 3e has no apparent effect on the accuracy of the amplitude spectrum estimate, which indicates that the time-shift is approximately constant over the window length of 40 ms.

Fig. 4: Magnitude spectra of the signals presented in Fig. 3. Panels: (a) L_TIME-MSE, (b) L_STOI, (c) L_ESTOI, (d) L_SI-SDR, (e) L_STSA-MSE, (f) L_PMSQE. Axes: frequency [Hz] versus magnitude.

C. Loss Function vs. Performance Metric

We now turn our attention towards the speech enhancement potential of the systems trained to minimize the loss functions in question. Specifically, we study the speech enhancement performance in terms of STOI [32], ESTOI [33], SI-SDR [34], SDR [57], and PESQ [59] of six different time-domain FCNN-based speech enhancement systems when trained using the loss functions presented in Sec. II, the training data and noise types presented in Sec. III, and the learning rates given in Table II. The six systems have been tested using the matched noise types, SSN, BBL, CAF, and STR, and the unmatched noise types, PED and BUS, at SNRs from -10 dB to 20 dB, and the systems are evaluated by their ability to improve the above-mentioned performance metrics.

Note that, in contrast to the training and validation data, a VAD has not been applied to the test data during inference. In other words, the speech enhancement systems process the test signals in their entirety, including any short naturally occurring leading and trailing silent regions and speech pauses in between spoken words. This is done to simulate a realistic usage scenario, where exact knowledge about speech activity is generally not available prior to speech processing.

In Table III we present scores by the above-mentioned performance metrics partitioned into loss functions horizontally and SNR vertically. The largest performance score for each metric and SNR is highlighted in boldface. From Table III it is seen that all systems in general are able to improve all performance metrics with respect to the performance scores of the noisy unprocessed signals. An exception occurs for the SI-SDR and SDR metrics which, under some circumstances, can fail completely as previously discussed (Sec. IV-B), and it is important to emphasize that these systems, despite the occasionally very low SDR and SI-SDR scores, still successfully enhance the speech signals in terms of perception. This is also supported by the STOI, ESTOI, and PESQ performance metrics. In other words, systems trained with the six loss functions seem to be successful in terms of their ability to attenuate the noise and enhance the speech signal. More interestingly, although not surprising, it is seen that systems trained using $\mathcal{L}_{\text{STOI}}$, $\mathcal{L}_{\text{ESTOI}}$, and $\mathcal{L}_{\text{SI-SDR}}$ also achieve the maximum STOI, ESTOI, and SI-SDR scores, respectively. Somewhat surprisingly, systems trained to minimize $\mathcal{L}_{\text{PMSQE}}$ do not achieve the maximum PESQ score, despite the fact that $\mathcal{L}_{\text{PMSQE}}$ is designed to resemble PESQ and we see a monotonic relationship between the two functions in Fig. 2. Instead, it is seen from Table III that systems trained to minimize $\mathcal{L}_{\text{SI-SDR}}$ generally achieve the maximum PESQ score. In fact, systems trained to minimize $\mathcal{L}_{\text{SI-SDR}}$ seem to perform well in general, as they generally achieve large improvements across all performance metrics and often perform on par with systems trained to minimize $\mathcal{L}_{\text{STOI}}$ and $\mathcal{L}_{\text{ESTOI}}$, which are fundamentally different loss functions compared to $\mathcal{L}_{\text{SI-SDR}}$.

In Table IV we present performance scores achieved by the systems from Table III but in unseen noise type conditions, using the pedestrian and bus noise types. From Table IV it is seen, similarly to Table III, that systems trained using $\mathcal{L}_{\text{STOI}}$, $\mathcal{L}_{\text{ESTOI}}$, and $\mathcal{L}_{\text{SI-SDR}}$ also achieve the maximum STOI, ESTOI, and SI-SDR scores, respectively. It is also seen that systems trained to minimize $\mathcal{L}_{\text{SI-SDR}}$ generally achieve the maximum, or close to the maximum, performance scores and also achieve larger PESQ scores than the systems trained to minimize $\mathcal{L}_{\text{PMSQE}}$. In other words, the behavior observed in Table III, where the systems were tested using matched noise types, also seems to hold for unmatched noise types.

Finally, from Table III and Table IV we can conclude that if the goal is to maximize a specific performance metric, gains can in general be achieved by training systems to minimize a loss function designed specifically to resemble that particular performance metric. For example, if the goal is to maximize ESTOI, the largest ESTOI scores are achieved by training systems that minimize the $\mathcal{L}_{\text{ESTOI}}$ loss function. However, if the goal is to perform well in general across a wide range of performance metrics, a loss function like $\mathcal{L}_{\text{SI-SDR}}$ seems to be a good candidate, as systems trained to minimize $\mathcal{L}_{\text{SI-SDR}}$ achieve high improvements over a range of performance metrics. Also, and more importantly, these findings seem to be generally valid over a wide range of SNRs, unseen male and female speakers, as well as matched and unmatched noise types.

V. CONCLUSION

In this paper the speech enhancement potential of six state-of-the-art loss functions for time-domain deep neural network-based monaural speech enhancement has been investigated. Specifically, we have conducted multiple experimental studies

TABLE III: Performance of different speech enhancement systems measured using STOI, ESTOI, SI-SDR, SDR, and PESQ. The systems have been trained using different loss functions ($\mathcal{L}_{\text{TIME-MSE}}$, $\mathcal{L}_{\text{STOI}}$, $\mathcal{L}_{\text{ESTOI}}$, $\mathcal{L}_{\text{SI-SDR}}$, $\mathcal{L}_{\text{STSA-MSE}}$, $\mathcal{L}_{\text{PMSQE}}$) and tested using four matched (SSN, BBL, CAF, STR) noise types at seven different SNRs (-10 dB, -5 dB, 0 dB, 5 dB, 10 dB, 15 dB, and 20 dB). The maximum score is highlighted in boldface for each SNR and performance measure. See text for details.
In each panel, the columns to the right of "Noisy" report scores for processed signals, one column per training loss function.

(a) Speech Shaped Noise (matched)

| SNR | Metric | Noisy | $\mathcal{L}_{\text{TIME-MSE}}$ | $\mathcal{L}_{\text{STOI}}$ | $\mathcal{L}_{\text{ESTOI}}$ | $\mathcal{L}_{\text{SI-SDR}}$ | $\mathcal{L}_{\text{STSA-MSE}}$ | $\mathcal{L}_{\text{PMSQE}}$ |
|---|---|---|---|---|---|---|---|---|
| -10 dB | STOI | 0.50 | 0.63 | 0.68 | 0.64 | 0.64 | 0.63 | 0.56 |
| -10 dB | ESTOI | 0.16 | 0.34 | 0.40 | 0.43 | 0.36 | 0.35 | 0.27 |
| -10 dB | SI-SDR | -11.07 | 0.41 | -7.34 | -16.18 | 0.67 | -6.93 | -24.50 |
| -10 dB | SDR | -10.32 | 1.84 | -0.76 | -1.63 | 2.05 | 0.66 | -1.52 |
| -10 dB | PESQ | 1.48 | 1.59 | 1.60 | 1.31 | 1.62 | 1.68 | 1.47 |
| -5 dB | STOI | 0.62 | 0.82 | 0.85 | 0.83 | 0.83 | 0.82 | 0.77 |
| -5 dB | ESTOI | 0.30 | 0.60 | 0.65 | 0.67 | 0.62 | 0.60 | 0.54 |
| -5 dB | SI-SDR | -6.05 | 5.74 | -3.53 | -13.41 | 6.08 | -3.74 | -20.08 |
| -5 dB | SDR | -5.77 | 6.60 | 4.21 | 3.63 | 6.87 | 5.48 | 2.37 |
| -5 dB | PESQ | 1.59 | 2.19 | 2.18 | 2.03 | 2.22 | 2.24 | 2.11 |
| 0 dB | STOI | 0.75 | 0.90 | 0.92 | 0.92 | 0.91 | 0.90 | 0.88 |
| 0 dB | ESTOI | 0.46 | 0.75 | 0.80 | 0.81 | 0.78 | 0.75 | 0.71 |
| 0 dB | SI-SDR | -1.05 | 9.46 | -2.16 | -12.73 | 9.96 | -2.66 | -17.95 |
| 0 dB | SDR | -0.92 | 10.09 | 7.54 | 6.73 | 10.55 | 8.75 | 4.69 |
| 0 dB | PESQ | 1.79 | 2.59 | 2.58 | 2.56 | 2.65 | 2.62 | 2.57 |
| 5 dB | STOI | 0.85 | 0.94 | 0.95 | 0.95 | 0.95 | 0.94 | 0.92 |
| 5 dB | ESTOI | 0.63 | 0.84 | 0.87 | 0.88 | 0.86 | 0.84 | 0.80 |
| 5 dB | SI-SDR | 3.96 | 12.56 | -1.51 | -12.46 | 13.20 | -2.12 | -17.23 |
| 5 dB | SDR | 4.04 | 13.07 | 10.00 | 8.68 | 13.71 | 11.30 | 6.07 |
| 5 dB | PESQ | 2.03 | 2.89 | 2.88 | 2.91 | 3.00 | 2.89 | 2.89 |
| 10 dB | STOI | 0.92 | 0.96 | 0.97 | 0.97 | 0.97 | 0.96 | 0.94 |
| 10 dB | ESTOI | 0.78 | 0.90 | 0.91 | 0.92 | 0.91 | 0.90 | 0.85 |
| 10 dB | SI-SDR | 8.96 | 15.39 | -1.18 | -12.25 | 16.10 | -1.86 | -16.98 |
| 10 dB | SDR | 9.02 | 15.83 | 11.84 | 9.98 | 16.60 | 13.39 | 6.85 |
| 10 dB | PESQ | 2.32 | 3.13 | 3.12 | 3.16 | 3.27 | 3.10 | 3.13 |
| 15 dB | STOI | 0.96 | 0.97 | 0.98 | 0.98 | 0.98 | 0.97 | 0.95 |
| 15 dB | ESTOI | 0.88 | 0.93 | 0.94 | 0.94 | 0.94 | 0.93 | 0.88 |
| 15 dB | SI-SDR | 13.96 | 17.89 | -1.02 | -12.11 | 18.71 | -1.74 | -16.95 |
| 15 dB | SDR | 14.02 | 18.36 | 13.06 | 10.76 | 19.31 | 14.94 | 7.26 |
| 15 dB | PESQ | 2.63 | 3.31 | 3.33 | 3.36 | 3.47 | 3.29 | 3.32 |
| 20 dB | STOI | 0.98 | 0.98 | 0.98 | 0.98 | 0.99 | 0.98 | 0.96 |
| 20 dB | ESTOI | 0.94 | 0.94 | 0.95 | 0.95 | 0.96 | 0.94 | 0.90 |
| 20 dB | SI-SDR | 18.96 | 19.73 | -0.95 | -12.05 | 20.89 | -1.68 | -16.97 |
| 20 dB | SDR | 19.02 | 20.29 | 13.75 | 11.17 | 21.71 | 15.80 | 7.45 |
| 20 dB | PESQ | 2.95 | 3.48 | 3.55 | 3.54 | 3.63 | 3.50 | 3.46 |

(b) 6-Speaker Babble Noise (matched)

| SNR | Metric | Noisy | $\mathcal{L}_{\text{TIME-MSE}}$ | $\mathcal{L}_{\text{STOI}}$ | $\mathcal{L}_{\text{ESTOI}}$ | $\mathcal{L}_{\text{SI-SDR}}$ | $\mathcal{L}_{\text{STSA-MSE}}$ | $\mathcal{L}_{\text{PMSQE}}$ |
|---|---|---|---|---|---|---|---|---|
| -10 dB | STOI | 0.46 | 0.67 | 0.69 | 0.66 | 0.68 | 0.67 | 0.59 |
| -10 dB | ESTOI | 0.18 | 0.41 | 0.43 | 0.44 | 0.42 | 0.41 | 0.33 |
| -10 dB | SI-SDR | -11.04 | 0.39 | -6.70 | -15.84 | 0.59 | -6.49 | -25.93 |
| -10 dB | SDR | -10.29 | 1.69 | -0.51 | -1.38 | 1.86 | 0.57 | -1.67 |
| -10 dB | PESQ | 1.73 | 1.70 | 1.62 | 1.41 | 1.70 | 1.72 | 1.51 |
| -5 dB | STOI | 0.59 | 0.82 | 0.84 | 0.83 | 0.83 | 0.82 | 0.76 |
| -5 dB | ESTOI | 0.31 | 0.62 | 0.65 | 0.66 | 0.64 | 0.61 | 0.55 |
| -5 dB | SI-SDR | -6.04 | 5.59 | -3.29 | -13.54 | 5.96 | -3.69 | -20.38 |
| -5 dB | SDR | -5.75 | 6.45 | 4.36 | 3.66 | 6.75 | 5.32 | 2.24 |
| -5 dB | PESQ | 1.69 | 2.18 | 2.16 | 2.02 | 2.21 | 2.18 | 2.04 |
| 0 dB | STOI | 0.73 | 0.90 | 0.92 | 0.91 | 0.91 | 0.90 | 0.87 |
| 0 dB | ESTOI | 0.47 | 0.76 | 0.80 | 0.81 | 0.78 | 0.76 | 0.71 |
| 0 dB | SI-SDR | -1.04 | 9.54 | -1.94 | -12.81 | 10.12 | -2.48 | -18.13 |
| 0 dB | SDR | -0.91 | 10.17 | 7.78 | 6.82 | 10.69 | 8.83 | 4.58 |
| 0 dB | PESQ | 1.84 | 2.57 | 2.60 | 2.54 | 2.64 | 2.57 | 2.49 |
| 5 dB | STOI | 0.84 | 0.94 | 0.95 | 0.95 | 0.95 | 0.94 | 0.92 |
| 5 dB | ESTOI | 0.63 | 0.85 | 0.87 | 0.88 | 0.87 | 0.84 | 0.80 |
| 5 dB | SI-SDR | 3.96 | 12.75 | -1.40 | -12.55 | 13.49 | -2.03 | -17.25 |
| 5 dB | SDR | 4.04 | 13.26 | 10.20 | 8.76 | 13.99 | 11.46 | 5.99 |
| 5 dB | PESQ | 2.10 | 2.88 | 2.92 | 2.90 | 2.98 | 2.86 | 2.84 |
| 10 dB | STOI | 0.91 | 0.96 | 0.97 | 0.97 | 0.97 | 0.96 | 0.94 |
| 10 dB | ESTOI | 0.77 | 0.90 | 0.92 | 0.92 | 0.92 | 0.89 | 0.85 |
| 10 dB | SI-SDR | 8.96 | 15.55 | -1.15 | -12.31 | 16.41 | -1.82 | -16.97 |
| 10 dB | SDR | 9.03 | 16.02 | 11.94 | 10.00 | 16.91 | 13.49 | 6.79 |
| 10 dB | PESQ | 2.39 | 3.12 | 3.17 | 3.17 | 3.25 | 3.10 | 3.11 |
| 15 dB | STOI | 0.96 | 0.97 | 0.98 | 0.98 | 0.98 | 0.97 | 0.95 |
| 15 dB | ESTOI | 0.86 | 0.93 | 0.94 | 0.94 | 0.94 | 0.92 | 0.88 |
| 15 dB | SI-SDR | 13.96 | 17.96 | -0.99 | -12.11 | 18.96 | -1.72 | -16.94 |
| 15 dB | SDR | 14.02 | 18.46 | 13.09 | 10.74 | 19.56 | 14.95 | 7.20 |
| 15 dB | PESQ | 2.70 | 3.33 | 3.38 | 3.38 | 3.46 | 3.32 | 3.30 |
| 20 dB | STOI | 0.98 | 0.98 | 0.98 | 0.98 | 0.99 | 0.98 | 0.96 |
| 20 dB | ESTOI | 0.93 | 0.94 | 0.95 | 0.95 | 0.96 | 0.94 | 0.89 |
| 20 dB | SI-SDR | 18.96 | 19.70 | -0.94 | -12.06 | 21.03 | -1.68 | -16.94 |
| 20 dB | SDR | 19.02 | 20.28 | 13.75 | 11.15 | 21.86 | 15.80 | 7.40 |
| 20 dB | PESQ | 3.02 | 3.50 | 3.57 | 3.54 | 3.62 | 3.52 | 3.44 |

(c) Cafeteria Noise (matched)

| SNR | Metric | Noisy | $\mathcal{L}_{\text{TIME-MSE}}$ | $\mathcal{L}_{\text{STOI}}$ | $\mathcal{L}_{\text{ESTOI}}$ | $\mathcal{L}_{\text{SI-SDR}}$ | $\mathcal{L}_{\text{STSA-MSE}}$ | $\mathcal{L}_{\text{PMSQE}}$ |
|---|---|---|---|---|---|---|---|---|
| -10 dB | STOI | 0.56 | 0.74 | 0.77 | 0.74 | 0.74 | 0.74 | 0.69 |
| -10 dB | ESTOI | 0.25 | 0.48 | 0.52 | 0.53 | 0.49 | 0.48 | 0.42 |
| -10 dB | SI-SDR | -11.04 | 3.26 | -5.16 | -14.90 | 3.78 | -5.37 | -19.30 |
| -10 dB | SDR | -10.32 | 4.21 | 1.95 | 1.23 | 4.74 | 2.90 | 0.41 |
| -10 dB | PESQ | 1.60 | 1.99 | 1.97 | 1.80 | 2.04 | 2.03 | 1.87 |
| -5 dB | STOI | 0.68 | 0.85 | 0.87 | 0.86 | 0.86 | 0.86 | 0.82 |
| -5 dB | ESTOI | 0.38 | 0.65 | 0.70 | 0.71 | 0.67 | 0.66 | 0.60 |
| -5 dB | SI-SDR | -6.03 | 7.77 | -2.69 | -13.38 | 8.33 | -3.20 | -17.83 |
| -5 dB | SDR | -5.75 | 8.49 | 6.22 | 5.34 | 9.06 | 7.14 | 3.57 |
| -5 dB | PESQ | 1.69 | 2.39 | 2.39 | 2.29 | 2.45 | 2.41 | 2.29 |
| 0 dB | STOI | 0.78 | 0.92 | 0.93 | 0.93 | 0.92 | 0.92 | 0.89 |
| 0 dB | ESTOI | 0.52 | 0.77 | 0.81 | 0.82 | 0.80 | 0.78 | 0.73 |
| 0 dB | SI-SDR | -1.03 | 11.17 | -1.69 | -12.83 | 11.88 | -2.35 | -17.30 |
| 0 dB | SDR | -0.90 | 11.75 | 9.22 | 8.01 | 12.50 | 10.18 | 5.54 |
| 0 dB | PESQ | 1.99 | 2.70 | 2.73 | 2.69 | 2.79 | 2.72 | 2.65 |
| 5 dB | STOI | 0.87 | 0.95 | 0.96 | 0.96 | 0.95 | 0.95 | 0.93 |
| 5 dB | ESTOI | 0.66 | 0.85 | 0.88 | 0.88 | 0.87 | 0.85 | 0.81 |
| 5 dB | SI-SDR | 3.97 | 14.11 | -1.28 | -12.42 | 14.90 | -1.98 | -17.10 |
| 5 dB | SDR | 4.05 | 14.61 | 11.32 | 9.67 | 15.50 | 12.55 | 6.66 |
| 5 dB | PESQ | 2.33 | 2.96 | 3.01 | 3.01 | 3.08 | 2.97 | 2.96 |
| 10 dB | STOI | 0.93 | 0.97 | 0.97 | 0.97 | 0.97 | 0.97 | 0.95 |
| 10 dB | ESTOI | 0.79 | 0.90 | 0.92 | 0.92 | 0.92 | 0.90 | 0.86 |
| 10 dB | SI-SDR | 8.97 | 16.78 | -1.09 | -12.18 | 17.58 | -1.81 | -17.03 |
| 10 dB | SDR | 9.03 | 17.28 | 12.76 | 10.66 | 18.27 | 14.38 | 7.22 |
| 10 dB | PESQ | 2.66 | 3.20 | 3.27 | 3.27 | 3.32 | 3.21 | 3.21 |
| 15 dB | STOI | 0.96 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.95 |
| 15 dB | ESTOI | 0.88 | 0.93 | 0.94 | 0.94 | 0.94 | 0.93 | 0.89 |
| 15 dB | SI-SDR | 13.97 | 18.91 | -1.00 | -12.06 | 19.94 | -1.72 | -17.02 |
| 15 dB | SDR | 14.03 | 19.48 | 13.61 | 11.18 | 20.84 | 15.55 | 7.47 |
| 15 dB | PESQ | 2.99 | 3.42 | 3.50 | 3.49 | 3.53 | 3.44 | 3.40 |
| 20 dB | STOI | 0.98 | 0.98 | 0.98 | 0.98 | 0.99 | 0.98 | 0.96 |
| 20 dB | ESTOI | 0.94 | 0.94 | 0.95 | 0.95 | 0.96 | 0.94 | 0.90 |
| 20 dB | SI-SDR | 18.97 | 20.22 | -0.96 | -12.02 | 21.75 | -1.68 | -17.01 |
| 20 dB | SDR | 19.02 | 20.85 | 14.01 | 11.40 | 22.91 | 16.10 | 7.55 |
| 20 dB | PESQ | 3.33 | 3.60 | 3.70 | 3.67 | 3.72 | 3.63 | 3.52 |

(d) Street Noise (matched)

| SNR | Metric | Noisy | $\mathcal{L}_{\text{TIME-MSE}}$ | $\mathcal{L}_{\text{STOI}}$ | $\mathcal{L}_{\text{ESTOI}}$ | $\mathcal{L}_{\text{SI-SDR}}$ | $\mathcal{L}_{\text{STSA-MSE}}$ | $\mathcal{L}_{\text{PMSQE}}$ |
|---|---|---|---|---|---|---|---|---|
| -10 dB | STOI | 0.58 | 0.76 | 0.80 | 0.77 | 0.77 | 0.76 | 0.70 |
| -10 dB | ESTOI | 0.24 | 0.50 | 0.56 | 0.57 | 0.52 | 0.50 | 0.43 |
| -10 dB | SI-SDR | -11.05 | 3.92 | -4.74 | -15.07 | 4.34 | -5.07 | -18.52 |
| -10 dB | SDR | -10.35 | 4.92 | 2.64 | 1.79 | 5.32 | 3.53 | 0.46 |
| -10 dB | PESQ | 1.42 | 2.05 | 2.02 | 1.84 | 2.07 | 2.07 | 1.90 |
| -5 dB | STOI | 0.68 | 0.87 | 0.89 | 0.88 | 0.88 | 0.87 | 0.83 |
| -5 dB | ESTOI | 0.36 | 0.68 | 0.72 | 0.73 | 0.69 | 0.68 | 0.61 |
| -5 dB | SI-SDR | -6.04 | 8.12 | -2.58 | -13.50 | 8.58 | -3.24 | -17.63 |
| -5 dB | SDR | -5.77 | 8.88 | 6.60 | 5.67 | 9.32 | 7.46 | 3.46 |
| -5 dB | PESQ | 1.63 | 2.45 | 2.44 | 2.36 | 2.50 | 2.46 | 2.35 |
| 0 dB | STOI | 0.78 | 0.92 | 0.94 | 0.93 | 0.93 | 0.92 | 0.90 |
| 0 dB | ESTOI | 0.49 | 0.79 | 0.82 | 0.83 | 0.81 | 0.79 | 0.74 |
| 0 dB | SI-SDR | -1.04 | 11.39 | -1.66 | -12.81 | 11.97 | -2.38 | -17.32 |
| 0 dB | SDR | -0.92 | 12.03 | 9.46 | 8.20 | 12.62 | 10.36 | 5.46 |
| 0 dB | PESQ | 1.94 | 2.76 | 2.78 | 2.75 | 2.84 | 2.76 | 2.71 |
| 5 dB | STOI | 0.86 | 0.95 | 0.96 | 0.96 | 0.96 | 0.95 | 0.93 |
| 5 dB | ESTOI | 0.63 | 0.86 | 0.88 | 0.89 | 0.87 | 0.86 | 0.82 |
| 5 dB | SI-SDR | 3.96 | 14.28 | -1.26 | -12.45 | 14.90 | -1.99 | -17.16 |
| 5 dB | SDR | 4.04 | 14.83 | 11.53 | 9.79 | 15.54 | 12.69 | 6.66 |
| 5 dB | PESQ | 2.28 | 3.01 | 3.05 | 3.04 | 3.12 | 2.99 | 3.00 |
| 10 dB | STOI | 0.92 | 0.97 | 0.97 | 0.97 | 0.97 | 0.97 | 0.95 |
| 10 dB | ESTOI | 0.76 | 0.90 | 0.92 | 0.92 | 0.92 | 0.90 | 0.87 |
| 10 dB | SI-SDR | 8.96 | 16.96 | -1.09 | -12.17 | 17.60 | -1.80 | -17.06 |
| 10 dB | SDR | 9.02 | 17.51 | 12.92 | 10.75 | 18.34 | 14.53 | 7.27 |
| 10 dB | PESQ | 2.61 | 3.23 | 3.29 | 3.29 | 3.35 | 3.21 | 3.24 |
| 15 dB | STOI | 0.96 | 0.98 | 0.98 | 0.98 | 0.98 | 0.98 | 0.96 |
| 15 dB | ESTOI | 0.87 | 0.93 | 0.94 | 0.94 | 0.94 | 0.93 | 0.89 |
| 15 dB | SI-SDR | 13.96 | 19.13 | -1.00 | -12.08 | 19.98 | -1.72 | -17.04 |
| 15 dB | SDR | 14.02 | 19.74 | 13.71 | 11.24 | 20.94 | 15.70 | 7.52 |
| 15 dB | PESQ | 2.93 | 3.45 | 3.53 | 3.51 | 3.55 | 3.45 | 3.43 |
| 20 dB | STOI | 0.98 | 0.98 | 0.98 | 0.98 | 0.99 | 0.98 | 0.96 |
| 20 dB | ESTOI | 0.93 | 0.94 | 0.95 | 0.96 | 0.96 | 0.94 | 0.90 |
| 20 dB | SI-SDR | 18.96 | 20.39 | -0.96 | -12.00 | 21.79 | -1.68 | -17.04 |
| 20 dB | SDR | 19.02 | 21.04 | 14.07 | 11.44 | 23.02 | 16.21 | 7.59 |
| 20 dB | PESQ | 3.27 | 3.62 | 3.73 | 3.69 | 3.74 | 3.64 | 3.56 |

TABLE IV: As Table III but for the two unmatched pedestrian and bus noise types.

(a) Pedestrian Noise (unmatched)

| SNR | Metric | Noisy | $\mathcal{L}_{\text{TIME-MSE}}$ | $\mathcal{L}_{\text{STOI}}$ | $\mathcal{L}_{\text{ESTOI}}$ | $\mathcal{L}_{\text{SI-SDR}}$ | $\mathcal{L}_{\text{STSA-MSE}}$ | $\mathcal{L}_{\text{PMSQE}}$ |
|---|---|---|---|---|---|---|---|---|
| -10 dB | STOI | 0.52 | 0.66 | 0.71 | 0.67 | 0.67 | 0.67 | 0.61 |
| -10 dB | ESTOI | 0.19 | 0.38 | 0.44 | 0.45 | 0.39 | 0.38 | 0.31 |
| -10 dB | SI-SDR | -11.04 | -0.06 | -7.53 | -17.82 | 0.38 | -8.05 | -20.99 |
| -10 dB | SDR | -10.31 | 1.08 | -0.90 | -1.80 | 1.53 | -0.40 | -1.91 |
| -10 dB | PESQ | 1.68 | 1.73 | 1.72 | 1.45 | 1.74 | 1.77 | 1.60 |
| -5 dB | STOI | 0.62 | 0.82 | 0.85 | 0.83 | 0.83 | 0.82 | 0.77 |
| -5 dB | ESTOI | 0.29 | 0.59 | 0.64 | 0.66 | 0.61 | 0.59 | 0.52 |
| -5 dB | SI-SDR | -6.04 | 5.45 | -3.73 | -14.37 | 5.86 | -4.24 | -18.59 |
| -5 dB | SDR | -5.75 | 6.21 | 4.19 | 3.40 | 6.60 | 4.87 | 2.14 |
| -5 dB | PESQ | 1.61 | 2.21 | 2.21 | 2.04 | 2.23 | 2.24 | 2.11 |
| 0 dB | STOI | 0.73 | 0.90 | 0.92 | 0.91 | 0.91 | 0.90 | 0.87 |
| 0 dB | ESTOI | 0.43 | 0.74 | 0.78 | 0.79 | 0.76 | 0.74 | 0.69 |
| 0 dB | SI-SDR | -1.04 | 9.52 | -2.08 | -13.23 | 10.05 | -2.74 | -17.53 |
| 0 dB | SDR | -0.91 | 10.12 | 7.93 | 6.91 | 10.66 | 8.70 | 4.80 |
| 0 dB | PESQ | 1.82 | 2.60 | 2.62 | 2.55 | 2.66 | 2.62 | 2.54 |
| 5 dB | STOI | 0.83 | 0.94 | 0.95 | 0.95 | 0.95 | 0.94 | 0.92 |
| 5 dB | ESTOI | 0.58 | 0.83 | 0.86 | 0.87 | 0.85 | 0.83 | 0.79 |
| 5 dB | SI-SDR | 3.96 | 12.82 | -1.40 | -12.62 | 13.47 | -2.10 | -17.20 |
| 5 dB | SDR | 4.04 | 13.32 | 10.57 | 9.05 | 14.03 | 11.54 | 6.32 |
| 5 dB | PESQ | 2.13 | 2.91 | 2.94 | 2.92 | 3.01 | 2.92 | 2.88 |
| 10 dB | STOI | 0.91 | 0.96 | 0.97 | 0.97 | 0.97 | 0.96 | 0.94 |
| 10 dB | ESTOI | 0.73 | 0.89 | 0.91 | 0.91 | 0.91 | 0.89 | 0.85 |
| 10 dB | SI-SDR | 8.96 | 15.77 | -1.12 | -12.28 | 16.45 | -1.84 | -17.02 |
| 10 dB | SDR | 9.03 | 16.23 | 12.34 | 10.32 | 17.05 | 13.75 | 7.07 |
| 10 dB | PESQ | 2.46 | 3.17 | 3.21 | 3.20 | 3.29 | 3.16 | 3.16 |
| 15 dB | STOI | 0.96 | 0.97 | 0.98 | 0.98 | 0.98 | 0.97 | 0.95 |
| 15 dB | ESTOI | 0.85 | 0.92 | 0.94 | 0.94 | 0.94 | 0.92 | 0.88 |
| 15 dB | SI-SDR | 13.96 | 18.28 | -1.01 | -12.16 | 19.08 | -1.73 | -17.00 |
| 15 dB | SDR | 14.02 | 18.79 | 13.40 | 10.99 | 19.83 | 15.23 | 7.39 |
| 15 dB | PESQ | 2.78 | 3.39 | 3.45 | 3.42 | 3.50 | 3.39 | 3.36 |
| 20 dB | STOI | 0.98 | 0.98 | 0.98 | 0.98 | 0.99 | 0.98 | 0.96 |
| 20 dB | ESTOI | 0.92 | 0.94 | 0.95 | 0.95 | 0.96 | 0.94 | 0.90 |
| 20 dB | SI-SDR | 18.96 | 19.95 | -0.95 | -12.04 | 21.23 | -1.68 | -17.02 |
| 20 dB | SDR | 19.02 | 20.55 | 13.93 | 11.31 | 22.23 | 15.97 | 7.50 |
| 20 dB | PESQ | 3.10 | 3.55 | 3.66 | 3.61 | 3.67 | 3.58 | 3.48 |

(b) Bus Noise (unmatched)

| SNR | Metric | Noisy | $\mathcal{L}_{\text{TIME-MSE}}$ | $\mathcal{L}_{\text{STOI}}$ | $\mathcal{L}_{\text{ESTOI}}$ | $\mathcal{L}_{\text{SI-SDR}}$ | $\mathcal{L}_{\text{STSA-MSE}}$ | $\mathcal{L}_{\text{PMSQE}}$ |
|---|---|---|---|---|---|---|---|---|
| -10 dB | STOI | 0.71 | 0.88 | 0.90 | 0.90 | 0.89 | 0.88 | 0.86 |
| -10 dB | ESTOI | 0.39 | 0.70 | 0.75 | 0.76 | 0.73 | 0.70 | 0.66 |
| -10 dB | SI-SDR | -11.04 | 8.49 | -2.34 | -13.72 | 9.19 | -3.24 | -16.39 |
| -10 dB | SDR | -10.35 | 9.29 | 7.48 | 6.40 | 10.00 | 7.60 | 3.72 |
| -10 dB | PESQ | 1.67 | 2.53 | 2.59 | 2.51 | 2.61 | 2.55 | 2.50 |
| -5 dB | STOI | 0.78 | 0.93 | 0.94 | 0.94 | 0.94 | 0.93 | 0.91 |
| -5 dB | ESTOI | 0.49 | 0.80 | 0.83 | 0.84 | 0.82 | 0.80 | 0.76 |
| -5 dB | SI-SDR | -6.04 | 11.57 | -1.59 | -12.92 | 12.33 | -2.46 | -16.89 |
| -5 dB | SDR | -5.77 | 12.34 | 10.03 | 8.62 | 13.15 | 10.47 | 5.44 |
| -5 dB | PESQ | 2.03 | 2.86 | 2.91 | 2.87 | 2.95 | 2.86 | 2.81 |
| 0 dB | STOI | 0.85 | 0.95 | 0.96 | 0.96 | 0.96 | 0.95 | 0.94 |
| 0 dB | ESTOI | 0.60 | 0.86 | 0.88 | 0.89 | 0.88 | 0.86 | 0.82 |
| 0 dB | SI-SDR | -1.04 | 14.16 | -1.26 | -12.44 | 14.97 | -2.05 | -17.17 |
| 0 dB | SDR | -0.92 | 14.89 | 11.85 | 10.05 | 15.82 | 12.65 | 6.59 |
| 0 dB | PESQ | 2.37 | 3.13 | 3.17 | 3.16 | 3.22 | 3.11 | 3.08 |
| 5 dB | STOI | 0.91 | 0.97 | 0.97 | 0.97 | 0.97 | 0.97 | 0.95 |
| 5 dB | ESTOI | 0.71 | 0.90 | 0.92 | 0.92 | 0.92 | 0.90 | 0.87 |
| 5 dB | SI-SDR | 3.96 | 16.51 | -1.11 | -12.20 | 17.35 | -1.85 | -17.18 |
| 5 dB | SDR | 4.04 | 17.21 | 13.06 | 10.89 | 18.27 | 14.32 | 7.26 |
| 5 dB | PESQ | 2.69 | 3.36 | 3.41 | 3.40 | 3.46 | 3.34 | 3.31 |
| 10 dB | STOI | 0.95 | 0.98 | 0.98 | 0.98 | 0.98 | 0.97 | 0.96 |
| 10 dB | ESTOI | 0.82 | 0.93 | 0.94 | 0.94 | 0.94 | 0.92 | 0.89 |
| 10 dB | SI-SDR | 8.96 | 18.55 | -1.03 | -12.09 | 19.51 | -1.75 | -17.12 |
| 10 dB | SDR | 9.02 | 19.27 | 13.77 | 11.33 | 20.59 | 15.49 | 7.56 |
| 10 dB | PESQ | 3.00 | 3.58 | 3.63 | 3.61 | 3.66 | 3.54 | 3.50 |
| 15 dB | STOI | 0.97 | 0.98 | 0.98 | 0.98 | 0.99 | 0.98 | 0.96 |
| 15 dB | ESTOI | 0.90 | 0.94 | 0.95 | 0.95 | 0.96 | 0.94 | 0.91 |
| 15 dB | SI-SDR | 13.96 | 19.98 | -0.99 | -12.02 | 21.27 | -1.70 | -17.09 |
| 15 dB | SDR | 14.02 | 20.71 | 14.10 | 11.51 | 22.58 | 16.14 | 7.65 |
| 15 dB | PESQ | 3.33 | 3.74 | 3.81 | 3.77 | 3.84 | 3.71 | 3.62 |
| 20 dB | STOI | 0.99 | 0.98 | 0.99 | 0.98 | 0.99 | 0.98 | 0.96 |
| 20 dB | ESTOI | 0.95 | 0.95 | 0.96 | 0.96 | 0.97 | 0.95 | 0.91 |
| 20 dB | SI-SDR | 18.96 | 20.70 | -0.96 | -12.01 | 22.46 | -1.68 | -17.06 |
| 20 dB | SDR | 19.02 | 21.41 | 14.21 | 11.56 | 23.95 | 16.38 | 7.64 |
| 20 dB | PESQ | 3.66 | 3.80 | 3.91 | 3.87 | 3.97 | 3.81 | 3.68 |

using speech enhancement systems based on time-domain convolutional neural networks and studied the impact the loss functions have on the performance of those systems, when they are evaluated using five commonly used performance metrics for monaural speech enhancement algorithms. The goal of the study is to establish if, and to what extent, a loss function designed specifically to resemble a certain performance metric is advantageous compared to standard loss functions such as the time-domain mean-square error (MSE) loss function or the short-time spectral amplitude (STSA)-MSE, whose strongest justification is mathematical convenience. In addition to the classical loss functions based on time-domain MSE and STSA-MSE, we have studied a loss function based on scale-invariant signal to distortion ratio (SI-SDR), as well as two loss functions based on two often used speech intelligibility predictors, namely the short-time objective intelligibility (STOI) and the Extended-STOI (ESTOI). Lastly, we have studied a loss function based on perceptual evaluation of speech quality (PESQ), which is a commonly used speech quality predictor.

In general, we found that all six loss functions are good candidates for monaural speech enhancement systems, as they all managed to improve the performance metrics employed with respect to the performance scores of noisy unprocessed speech signals. More importantly, we found that these results were generally valid across a wide range of SNRs, unseen male and female speakers, as well as matched and unmatched noise types. However, we also found that if the goal is to perform optimally with respect to a specific performance metric, it is generally advantageous to optimize with respect to a loss function designed specifically to resemble that particular performance metric. This is particularly interesting for loss functions based on STOI and ESTOI, as these performance metrics predict speech intelligibility, a metric many speech enhancement algorithms attempt to maximize without explicitly being designed to do so.

Furthermore, we found that the learning rate used when training systems to minimize a particular loss function can have a critical impact on the performance of such systems; it is paramount that the optimal learning rate is identified for each loss function, as a sub-optimal learning rate can lead to sub-optimal results and erroneous conclusions when systems trained to optimize different loss functions are compared. Despite its obvious importance, this is a consideration that has been generally absent in the academic literature.

Additionally, we found that waveform matching performance metrics such as SDR and SI-SDR, despite achieving good general performance, must be used with caution when they are used in combination with speech enhancement systems with the capability of modifying the phase of the processed signals, such as time-domain FCNN-based speech enhancement systems. In particular, SDR and SI-SDR may severely under-estimate the performance of systems that are trained using loss functions that do not penalize time-shifts. We observed on multiple occasions that both SDR and SI-SDR failed completely when the reference signal and the processed signal were not perfectly aligned.

In conclusion, we found that a loss function based on SI-SDR achieves good general performance across a range of popular speech enhancement evaluation metrics, which suggests that SI-SDR is a good candidate as a general-purpose loss function for supervised monaural time-domain speech enhancement.

ACKNOWLEDGMENT

We would like to thank Juan M. Martín-Doñas for valuable insight and discussions regarding the implementation of the PMSQE loss function.

REFERENCES

[1] G. Kim et al., “An algorithm that improves speech intelligibility in noise for normal-hearing listeners,” The Journal of the Acoustical Society of America, vol. 126, no. 3, pp. 1486–1494, 2009.
[2] K. Han and D. Wang, “A classification based approach to speech segregation,” The Journal of the Acoustical Society of America, vol. 132, no. 5, pp. 3475–3483, 2012.
[3] Y. Wang and D. Wang, “Towards Scaling Up Classification-Based Speech Separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 21, no. 7, pp. 1381–1390, 2013.
[4] Y. Xu et al., “An Experimental Study on Speech Enhancement Based on Deep Neural Networks,” IEEE Signal Processing Letters, vol. 21, no. 1, pp. 65–68, 2014.
[5] F. Weninger, F. Eyben, and B. Schuller, “Single-channel speech separation with memory-enhanced recurrent neural networks,” in Proc. ICASSP, 2014, pp. 3709–3713.
[6] E. W. Healy et al., “An algorithm to increase speech intelligibility for hearing-impaired listeners in novel segments of the same noise type,” The Journal of the Acoustical Society of America, vol. 138, no. 3, pp. 1660–1669, 2015.
[7] J. Chen et al., “Large-scale training to increase speech intelligibility for hearing-impaired listeners in novel noises,” The Journal of the Acoustical Society of America, vol. 139, no. 5, pp. 2604–2612, 2016.
[8] H. Erdogan et al., “Deep Recurrent Networks for Separation and Recognition of Single-Channel Speech in Nonstationary Background Audio,” in New Era for Robust Speech Recognition. Springer, 2017, pp. 165–186.
[9] M. Kolbæk, Z. H. Tan, and J. Jensen, “Speech Intelligibility Potential of General and Specialized Deep Neural Network Based Speech Enhancement Systems,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 1, pp. 153–167, 2017.
[10] M. Kolbæk, Z. Tan, and J. Jensen, “On the Relationship Between Short-Time Objective Intelligibility and Short-Time Spectral-Amplitude Mean-Square Error for Speech Enhancement,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 2, pp. 283–295,
[11] D. Wang and J. Chen, “Supervised Speech Separation Based on Deep Learning: An Overview,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 10, pp. 1702–1726, 2018.
[12] M. Kolbæk, “Single-Microphone Speech Enhancement and Separation Using Deep Learning,” Ph.D. dissertation, Aalborg Universitetsforlag, 2018. [Online]. Available: kolbaek-phd.aau.dk
[13] E. W. Healy et al., “An algorithm to increase intelligibility for hearing-impaired listeners in the presence of a competing talker,” The Journal of the Acoustical Society of America, vol. 141, no. 6, pp. 4230–4239, 2017.
[14] F. Bolner et al., “Speech enhancement based on neural networks applied to cochlear implant coding strategies,” in Proc. ICASSP, 2016, pp. 6520–
[15] J. J. M. Monaghan et al., “Auditory inspired machine learning techniques can improve speech intelligibility and quality for hearing-impaired listeners,” The Journal of the Acoustical Society of America, vol. 141, no. 3, pp. 1985–1998, 2017.
[16] T. Goehring et al., “Speech enhancement based on neural networks improves speech intelligibility in noise for cochlear implant users,” Hearing Research, vol. 344, pp. 183–194, 2017.
[17] Y. H. Lai et al., “A Deep Denoising Autoencoder Approach to Improving the Intelligibility of Vocoded Speech in Cochlear Implant Simulation,” IEEE Transactions on Biomedical Engineering, vol. 64, no. 7, pp. 1568–1578, 2017.
[18] Y.-H. Lai et al., “Deep Learning-Based Noise Reduction Approach to Improve Speech Intelligibility for Cochlear Implant Recipients,” Ear and Hearing, vol. 39, no. 4, pp. 795–809, 2018.
[19] E. W. Healy et al., “A deep learning algorithm to increase intelligibility for hearing-impaired listeners in the presence of a competing talker and reverberation,” The Journal of the Acoustical Society of America, vol. 145, no. 3, pp. 1378–1388, Mar. 2019. [Online]. Available: https://asa.scitation.org/doi/full/10.1121/1.5093547
[20] J. L. Roux et al., “The Phasebook: Building Complex Masks via Discrete Representations for Source Separation,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019, pp. 66–70.
[21] Z. Wang, K. Tan, and D. Wang, “Deep Learning Based Phase Reconstruction for Speaker Separation: A Trigonometric Perspective,” in Proc. ICASSP, 2019, pp. 71–75.
[22] Z.-Q. Wang et al., “End-to-End Speech Separation with Unfolded Iterative Phase Reconstruction,” in Proc. Interspeech, 2018, pp. 2708–2712.
[23] A. Pandey and D. Wang, “A New Framework for Supervised Speech Enhancement in the Time Domain,” in Proc. Interspeech, 2018, pp. 1136–1140.
[24] A. Pandey and D. Wang, “A New Framework for CNN-Based Speech Enhancement in the Time Domain,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 7, pp. 1179–1188, 2019.
[25] S. W. Fu et al., “End-to-End Waveform Utterance Enhancement for Direct Evaluation Metrics Optimization by Fully Convolutional Neural Networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 9, pp. 1570–1584, 2018.
[26] S. R. Park and J. Lee, “A Fully Convolutional Neural Network for Speech Enhancement,” in Proc. Interspeech, 2017, pp. 1993–1997.
[27] S. W. Fu et al., “Raw waveform-based speech enhancement by fully convolutional networks,” in Proc. APSIPA, 2017, pp. 6–12.
[28] A. Pandey and D. Wang, “TCNN: Temporal Convolutional Neural Network for Real-time Speech Enhancement in the Time Domain,” in Proc. ICASSP, 2019, pp. 6875–6879.
[29] T. Grzywalski and S. Drgas, “Using Recurrences in Time and Frequency within U-net Architecture for Speech Enhancement,” in Proc. ICASSP, 2019, pp. 6970–6974.
[30] K. Tan, X. Zhang, and D. Wang, “Real-time Speech Enhancement Using an Efficient Convolutional Recurrent Network for Dual-microphone Mobile Phones in Close-talk Scenarios,” in Proc. ICASSP, 2019, pp. 5751–5755.
[31] Y. Ephraim and D. Malah, “Speech enhancement using a minimum-mean square error short-time spectral amplitude estimator,” IEEE Transactions on Acoustics, Speech, and Signal Processing, vol. 32, no. 6, pp. 1109–1121, 1984.
[32] C. H. Taal et al., “An Algorithm for Intelligibility Prediction of Time-Frequency Weighted Noisy Speech,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2125–2136, 2011.
[33] J. Jensen and C. H. Taal, “An Algorithm for Predicting the Intelligibility of Speech Masked by Modulated Noise Maskers,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 11, pp. 2009–2022, 2016.
[34] J. L. Roux et al., “SDR – Half-baked or Well Done?” in ICASSP 2019, 2019, pp. 626–630.
[35] J. M. Martín-Doñas et al., “A Deep Learning Loss Function Based on the Perceptual Evaluation of the Speech Quality,” IEEE Signal Processing Letters, vol. 25, no. 11, pp. 1680–1684, 2018.
[36] S. Venkataramani, R. Higa, and P. Smaragdis, “Performance Based Cost Functions for End-to-End Speech Separation,” Proc. APSIPA, pp. 350–355, 2018.
[37] S. Venkataramani, J. Casebeer, and P. Smaragdis, “End-to-end Source Separation with Adaptive Front-Ends,” in Proc. NIPS Machine Learning for Audio Signal Processing Workshop, 2017.
[38] Y. Zhao et al., “Perceptually Guided Speech Enhancement using Deep Neural Networks,” in Proc. ICASSP, 2018, pp. 5074–5078.
[39] H. Zhang, X. Zhang, and G. Gao, “Training Supervised Speech Separation System to Improve STOI and PESQ Directly,” in Proc. ICASSP, 2018, pp. 5374–5378.
[40] F. Bahmaninezhad et al., “A Comprehensive Study of Speech Separation: Spectrogram vs Waveform Separation,” in Proc. Interspeech, 2019, pp. 4574–4578.
[41] Y. Luo and N. Mesgarani, “Conv-TasNet: Surpassing Ideal Time–Frequency Magnitude Masking for Speech Separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 8, pp. 1256–1266, May 2019.
[42] Y. Luo and N. Mesgarani, “TaSNet: Time-Domain Audio Separation Network for Real-Time, Single-Channel Speech Separation,” in Proc. ICASSP, 2018, pp. 696–700.
[43] M. Kolbæk, Z.-H. Tan, and J. Jensen, “Monaural Speech Enhancement using Deep Neural Networks by Maximizing a Short-Time Objective Intelligibility Measure,” in Proc. ICASSP, 2018, pp. 5059–5063.
[44] M. Kolbæk et al., “Multi-talker Speech Separation With Utterance-Level Permutation Invariant Training of Deep Recurrent Neural Networks,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 25, no. 10, pp. 1901–1913, Jul. 2017.
[45] Y. Wang, A. Narayanan, and D. Wang, “On Training Targets for Supervised Speech Separation,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 12, pp. 1849–1858, 2014.
[46] G. Naithani et al., “Deep Neural Network Based Speech Separation Optimizing an Objective Estimator of Intelligibility for Low Latency Applications,” in Proc. IWAENC, 2018, pp. 386–390.
[47] K. Tan, J. Chen, and D. Wang, “Gated Residual Networks With Dilated Convolutions for Monaural Speech Enhancement,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 27, no. 1, pp. 189–198, 2019.
[48] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” in Proc. ICLR (arXiv:1412.6980), 2015.
[49] D. Baby and S. Verhulst, “Sergan: Speech Enhancement Using Relativistic Generative Adversarial Networks with Gradient Penalty,” in ICASSP 2019 - 2019 IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), May 2019, pp. 106–110.
[50] S. Pascual, A. Bonafonte, and J. Serrà, “SEGAN: Speech Enhancement Generative Adversarial Network,” in Proc. INTERSPEECH, 2017, pp. 3642–3646.
[51] O. Ernst et al., “Speech Dereverberation Using Fully Convolutional Networks,” in Proc. EUSIPCO, 2018, pp. 390–394.
[52] P. C. Loizou, Speech Enhancement: Theory and Practice. CRC Press,
[53] C. M. Bishop, Pattern Recognition and Machine Learning. Springer,
[54] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press,
[55] R. C. Hendriks, T. Gerkmann, and J. Jensen, “DFT-Domain Based Single-Microphone Noise Reduction for Speech Enhancement: A Survey of the State of the Art,” Synthesis Lectures on Speech and Audio Processing, vol. 9, no. 1, pp. 1–80, 2013.
[56] J. Jensen and C. H. Taal, “Speech Intelligibility Prediction Based on Mutual Information,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 22, no. 2, pp. 430–440, 2014.
[57] C. Févotte, R. Gribonval, and E. Vincent, “BSS EVAL Toolbox User Guide – Revision 2.0,” IRISA, Tech. Rep. inria-00564760, 2011.
[58] B. Moore, An Introduction to the Psychology of Hearing. Brill, 2013.
[59] A. W. Rix et al., “Perceptual evaluation of speech quality (PESQ)-a new method for speech quality assessment of telephone networks and codecs,” in Proc. ICASSP, vol. 2, 2001, pp. 749–752.
[60] “International Telecommunication Union - Recommendation P.862.1: Mapping function for transforming P.862 raw result scores to MOS-LQO,” 2003. [Online]. Available: https://www.itu.int/rec/T-REC-P.862.1-200311-I/en
[61] J. S. Garofolo et al., “CSR-I (WSJ0) Complete LDC93S6A,” 1993, Philadelphia: Linguistic Data Consortium.
[62] J. S. Garofolo et al., “TIMIT Acoustic-Phonetic Continuous Speech Corpus LDC93S1,” 1993, Linguistic Data Consortium.
[63] J. Barker et al., “The third ‘CHiME’ speech separation and recognition challenge: Dataset, task and baselines,” in Proc. ASRU, 2015, pp. 504–
[64] ITU, “Rec. P.56: Objective measurement of active speech level,” 2011, https://www.itu.int/rec/T-REC-P.56/.
[65] O. Ronneberger, P. Fischer, and T. Brox, “U-Net: Convolutional Networks for Biomedical Image Segmentation,” in Proc. MICCAI, N. Navab et al., Eds., 2015, pp. 234–241.
[66] K. He et al., “Delving Deep into Rectifiers: Surpassing Human-Level Performance on ImageNet Classification,” in Proc. ICCV, 2015, pp. 1026–1034.
[67] L. Liu et al., “On the Variance of the Adaptive Learning Rate and Beyond,” in Proc. ICLR, 2020.
[68] M. A. Stone and B. C. Moore, “Tolerable hearing aid delays. I. Estimation of limits imposed by the auditory path alone using simulated hearing losses,” Ear and Hearing, vol. 20, no. 3, pp. 182–192, 1999.
[69] L. Bramsløw, “Preferred signal path delay and high-pass cut-off in open fittings,” International Journal of Audiology, vol. 49, no. 9, pp. 634–644,

Morten Kolbæk received the B.Eng. degree in electronic design at Aarhus University, in 2013 and the M.Sc. in signal processing and computing from Aalborg University, Denmark, in 2015. He received the PhD degree from Aalborg University, Denmark, in 2018 for the thesis entitled Single-Microphone Speech Enhancement and Separation Using Deep Learning (kolbaek-phd.aau.dk). He is currently a post-doctoral researcher at the section for Signal and Information Processing at the Department of Electronic Systems, Aalborg University, Denmark. His research interests include speech enhancement and separation, deep learning, and intelligibility improvement of noisy speech.

Zheng-Hua Tan (M’00–SM’06) received the B.Sc. and M.Sc. degrees in electrical engineering from Hunan University, Changsha, China, in 1990 and 1996, respectively, and the Ph.D. degree in electronic engineering from Shanghai Jiao Tong University, Shanghai, China, in 1999. He is a Professor and a Co-Head of the Centre for Acoustic Signal Processing Research (CASPR) at Aalborg University, Aalborg, Denmark. He was a Visiting Scientist at the Computer Science and Artificial Intelligence Laboratory, MIT, Cambridge, USA, an Associate Professor at Shanghai Jiao Tong University, and a postdoctoral fellow at KAIST, Daejeon, Korea. His research interests include machine learning, deep learning, pattern recognition, speech and speaker recognition, noise-robust speech processing, multimodal signal processing, and social robotics. He is the vice chair of the IEEE Signal Processing Society Machine Learning for Signal Processing Technical Committee (MLSP TC). He is an Associate Editor for IEEE/ACM Transactions on Audio, Speech and Language Processing, an Editorial Board Member for Computer Speech and Language, and was a Guest Editor for the IEEE Journal of Selected Topics in Signal Processing and Neurocomputing. He was the General Chair for IEEE MLSP 2018 and a TPC co-chair for IEEE SLT 2016.

Søren Holdt Jensen (S’87–M’88–SM’00) received the M.Sc. degree in electrical engineering from Aalborg University (AAU), Aalborg, Denmark, in 1988, and the Ph.D. degree (in signal processing) from the Technical University of Denmark (DTU), Lyngby, Denmark, in 1995. He is Full Professor in Signal Processing at Aalborg University. Before joining the Department of Electronic Systems, Aalborg University, he was with the Telecommunications Laboratory of Telecom Denmark, Ltd, Taastrup (Copenhagen), Denmark; the Electronics Institute of Technical University of Denmark; the Scientific Computing Group of Danish Computing Center for Research and Education (UNIC), Lyngby; the Electrical Engineering Department (ESAT-SISTA) of Katholieke Universiteit Leuven, Leuven, Belgium; and the Center for PersonKommunikation (CPK) of Aalborg University. His current research interests are in statistical signal processing, numerical algorithms, optimization engineering, machine learning, and digital processing of acoustic, audio, communication, multimedia, and speech signals. He is co-author of the textbook Software-Defined GPS and Galileo Receiver: A Single-Frequency Approach, Birkhäuser, Boston, USA, also translated to Chinese: National Defence Industry Press, China. Prof. Jensen has been Associate Editor for the IEEE Transactions on Signal Processing, IEEE/ACM Transactions on Audio, Speech and Language Processing, Elsevier Signal Processing, and EURASIP Journal on Advances in Signal Processing. He is a recipient of an individual European Community Marie Curie (HCM: Human Capital and Mobility) Fellowship, former Chairman of the IEEE Denmark Section and the IEEE Denmark Section’s Signal Processing Chapter (founder and first chairman). He is member of the Danish Academy of Technical Sciences (ATV) and has been member of the Danish Council for Independent Research (2011–2016) appointed by Danish Ministers of Science.

Jesper Jensen received the M.Sc. degree in electrical engineering and the Ph.D. degree in signal processing from Aalborg University, Aalborg, Denmark, in 1996 and 2000, respectively. From 1996 to 2000, he was with the Center for Person Kommunikation (CPK), Aalborg University, as a Ph.D. student and Assistant Research Professor. From 2000 to 2007, he was a Post-Doctoral Researcher and Assistant Professor with Delft University of Technology, Delft, The Netherlands, and an External Associate Professor with Aalborg University. Currently, he is a Senior Principal Scientist with Oticon A/S, Copenhagen, Denmark, where his main responsibility is scouting and development of new signal processing concepts for hearing aid applications. He is a Professor with the Section for Signal and Information Processing (SIP), Department of Electronic Systems, at Aalborg University. He is also a co-founder of the Centre for Acoustic Signal Processing Research (CASPR) at Aalborg University. His main interests are in the area of acoustic signal processing, including signal retrieval from noisy observations, coding, speech and audio modification and synthesis, intelligibility enhancement of speech signals, signal processing for hearing aid applications, and perceptual aspects of signal processing.
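
The caution raised in the conclusion about waveform-matching metrics and time-shifts can be illustrated with a minimal sketch. It assumes the commonly used SI-SDR definition (projection of the estimate onto the reference, as in [34]) with mean removal as an implementation choice. It also uses synthetic white noise as a stand-in for speech and an assumed 16 kHz sampling rate. The function name `si_sdr` and all variable names are illustrative and not taken from the paper's implementation.

```python
import numpy as np


def si_sdr(reference: np.ndarray, estimate: np.ndarray, eps: float = 1e-12) -> float:
    """Scale-invariant SDR in dB (hypothetical helper, common definition)."""
    # Remove the mean so that a DC offset does not affect the score.
    reference = reference - reference.mean()
    estimate = estimate - estimate.mean()
    # Project the estimate onto the reference to obtain the scaled target.
    alpha = np.dot(estimate, reference) / (np.dot(reference, reference) + eps)
    target = alpha * reference
    residual = estimate - target
    ratio = (np.dot(target, target) + eps) / (np.dot(residual, residual) + eps)
    return float(10.0 * np.log10(ratio))


rng = np.random.default_rng(0)
# Synthetic stand-in for one second of a clean signal (assumed 16 kHz).
clean = rng.standard_normal(16000)
# "Processed" signal: the clean signal plus a small amount of distortion.
processed = clean + 0.1 * rng.standard_normal(16000)

aligned = si_sdr(clean, processed)                # perfectly aligned
rescaled = si_sdr(clean, 0.5 * processed)         # gain change only
shifted = si_sdr(clean, np.roll(processed, 8))    # 8-sample circular delay

print(f"aligned:  {aligned:7.2f} dB")
print(f"rescaled: {rescaled:7.2f} dB")
print(f"shifted:  {shifted:7.2f} dB")
```

In this sketch the gain change leaves the score unchanged, whereas the short delay drives it far below 0 dB even though the delayed signal is otherwise identical. White noise decorrelates faster under a shift than real speech does, so the size of the drop here should not be read as representative of the systems evaluated in Table III and Table IV.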

Journal

Electrical Engineering and Systems Science, arXiv (Cornell University)

Published: Sep 3, 2019
