DNN-based Source Enhancement to Increase Objective Sound Quality Assessment Score

Yuma Koizumi (1), Member, IEEE, Kenta Niwa (1), Member, IEEE, Yusuke Hioka (2), Senior Member, IEEE, Kazunori Kobayashi (1), and Yoichi Haneda (3), Senior Member, IEEE

Abstract: We propose a training method for deep neural network (DNN)-based source enhancement to increase objective sound quality assessment (OSQA) scores such as the perceptual evaluation of speech quality (PESQ). In many conventional studies, DNNs have been used as a mapping function to estimate time-frequency masks and trained to minimize an analytically tractable objective function such as the mean squared error (MSE). Since OSQA scores have been used widely for sound-quality evaluation, constructing DNNs to increase OSQA scores would be better than using the minimum-MSE to create high-quality output signals. However, since most OSQA scores are not analytically tractable, i.e., they are black boxes, the gradient of the objective function cannot be calculated by simply applying back-propagation. To calculate the gradient of the OSQA-based objective function, we formulated a DNN optimization scheme on the basis of black-box optimization, which is used for training a computer that plays a game. For a black-box-optimization scheme, we adopt the policy gradient method for calculating the gradient on the basis of a sampling algorithm. To simulate output signals using the sampling algorithm, DNNs are used to estimate the probability-density function of the output signals that maximize OSQA scores. The OSQA scores are calculated from the simulated output signals, and the DNNs are trained to increase the probability of generating the simulated output signals that achieve high OSQA scores. Through several experiments, we found that OSQA scores significantly increased by applying the proposed method, even though the MSE was not minimized.

Index Terms: Sound-source enhancement, time-frequency mask, deep learning, objective sound quality assessment (OSQA) score.

I. INTRODUCTION

Sound-source enhancement has been studied for many years [1]–[6] because of the high demand for its use in various practical applications such as automatic speech recognition [7]–[9], hands-free telecommunication [10], [11], hearing aids [12]–[15], and immersive audio field representation [16], [17]. In this study, we aimed at generating an enhanced target source with high listening quality because the processed sounds are assumed to be perceived by humans.

Recently, deep learning [18] has been successfully used for sound-source enhancement [8], [15], [19]–[35]. In many of these conventional studies, deep neural networks (DNNs) were used as a regression function to estimate time-frequency (T-F) masks [19]–[22] and/or amplitude-spectra of the target source [23]–[31]. The parameters of the DNNs were trained using back-propagation [36] to minimize an analytically tractable objective function such as the mean squared error (MSE) between supervised outputs and DNN outputs. In recent studies, advanced analytical objective functions were used, such as the maximum-likelihood (ML) [31], [32], the combination of multi-types of MSE [25]–[27], the Kullback-Leibler and/or Itakura-Saito divergence [33], the modified short-time intelligibility measure (STOI) [22], the clustering cost [34], and the discriminative cost of a clean target source and output signal using a generative adversarial network (GAN) [35].

When output sound is perceived by humans, the objective function that reflects human perception may not be analytically tractable, i.e., it is a black-box function. In the past few years, objective sound quality assessment (OSQA) scores, such as the perceptual evaluation of speech quality (PESQ) [37] and STOI [38], have been commonly used to evaluate output sound quality. Thus, it might be better to construct DNNs to increase OSQA scores directly. However, since typical OSQA scores are not analytically defined (i.e., they are black-box functions), the gradient of the objective function cannot be calculated by simply applying back-propagation.

We previously proposed a DNN training method to estimate T-F masks and increase OSQA scores [39]. To overcome the problem that the objective function to maximize the OSQA scores is not analytically tractable, we developed a DNN-training method on the basis of the black-box optimization framework [40], as used in predicting the winning percentage of the game Go [41]. The basic idea of black-box optimization is estimating a gradient from randomly simulated output. For example, in the training of a DNN for the Go-playing computer, the computer determines a "move" (where to put a Go-stone) depending on the DNN output. Then, when the computer wins the game, a gradient is calculated to increase the selection probability of the selected "moves". We adopt this strategy to increase the OSQA scores; some output signals are randomly simulated and a DNN is trained to increase the generation probability of the simulated output signals that achieved high OSQA scores. For the first trial, we prepared a finite number of T-F mask templates and trained DNNs to select the best template that maximizes the OSQA score.

(1): NTT Media Intelligence Laboratories, NTT Corporation, Tokyo, Japan (e-mail: koizumi.yuma@ieee.org, niwa.kenta, kobayashi.kazunori@lab.ntt.co.jp)
(2): Department of Mechanical Engineering, University of Auckland, 20 Symonds Street, Auckland, 1010 New Zealand (e-mail: yusuke.hioka@ieee.org)
(3): Department of Informatics, The University of Electro-Communications, Tokyo, Japan (e-mail: haneda.yoichi@uec.ac.jp)
Copyright (c) 2018 IEEE. This article is the "accepted" version. Digital Object Identifier: 10.1109/TASLP.2018.2842156
arXiv:1810.09137v1 [stat.ML] 22 Oct 2018

Although we found that the OSQA scores increased using this method, the output performance would improve by extending the method from the template-selection scheme to a more flexible T-F mask design scheme.

In this study, to arbitrarily estimate T-F masks, we modified the DNN source enhancement architecture to estimate the latent parameters in a continuous probability density function (PDF) of the T-F mask processing output signals, as shown in Fig. 1. To calculate the gradient of the objective function, we adopt the policy gradient method [42] as a black-box optimization scheme. With our method, the estimated latent parameters construct a continuous PDF as the "policy" of T-F-mask estimation to increase OSQA scores. On the basis of this policy, the output signals are directly simulated using the sampling algorithm. Then, the gradient of the DNN is estimated to increase/decrease the generation probability of output signals with high/low OSQA scores, respectively. The sampling from a continuous PDF causes the estimate of the gradient to fluctuate, resulting in unstable training behavior. To avoid this problem, we additionally formulate two tricks: i) score normalization to reduce the variance in the estimated gradient, and ii) a sampling algorithm to simulate output signals to satisfy the constraint of T-F mask processing.

Fig. 1. Concept of proposed method

The rest of this paper is organized as follows. Section II introduces DNN source enhancement based on the ML approach. In Section III, we propose our DNN training method to increase OSQA scores on the basis of the black-box optimization. After investigating the sound quality of output signals through several experiments in Section IV, we conclude this paper in Section V.

II. CONVENTIONAL METHOD

A. Sound source enhancement with time-frequency mask

Let us consider the problem of estimating a target source $S_{\omega,\tau} \in \mathbb{C}$, which is surrounded by ambient noise $N_{\omega,\tau} \in \mathbb{C}$. A signal observed with a single microphone $X_{\omega,\tau} \in \mathbb{C}$ is assumed to be modeled as

$$X_{\omega,\tau} = S_{\omega,\tau} + N_{\omega,\tau}, \tag{1}$$

where $\omega \in \{1, 2, \ldots, \Omega\}$ and $\tau \in \{1, 2, \ldots, T\}$ denote the frequency and time indices, respectively.

In sound-source enhancement using T-F masks, the output signal $\hat{S}_{\omega,\tau}$ is obtained by multiplying a T-F mask by $X_{\omega,\tau}$ as

$$\hat{S}_{\omega,\tau} = G_{\omega,\tau} X_{\omega,\tau}, \tag{2}$$

where $0 \le G_{\omega,\tau} \le 1$ is a T-F mask. The IRM $G^{\mathrm{IRM}}_{\omega,\tau}$ [8] is an implementation of the T-F mask, which is defined by

$$G^{\mathrm{IRM}}_{\omega,\tau} = \frac{|S_{\omega,\tau}|}{|S_{\omega,\tau}| + |N_{\omega,\tau}|}. \tag{3}$$

The IRM maximizes the signal-to-noise ratio (SNR) when the phase spectrum of $S_{\omega,\tau}$ coincides with that of $N_{\omega,\tau}$. However, this assumption is rarely satisfied in practical cases. To compensate for this mismatch, the phase sensitive spectrum approximation (PSA) [19], [20] was proposed:

$$G^{\mathrm{PSA}}_{\omega,\tau} = \min\left(1, \max\left(0, \frac{|S_{\omega,\tau}|}{|X_{\omega,\tau}|} \cos\left(\theta^{(S)}_{\omega,\tau} - \theta^{(X)}_{\omega,\tau}\right)\right)\right), \tag{4}$$

where $\theta^{(S)}_{\omega,\tau}$ and $\theta^{(X)}_{\omega,\tau}$ are the phase spectra of $S_{\omega,\tau}$ and $X_{\omega,\tau}$, respectively. Since the PSA $G^{\mathrm{PSA}}_{\omega,\tau}$ is a T-F mask that minimizes the squared error between $S_{\omega,\tau}$ and $\hat{S}_{\omega,\tau}$ on the complex plane, we use this as a T-F masking scheme.

B. Maximum-likelihood-based DNN training for T-F mask estimation

In many conventional studies of DNN-based source enhancement, DNNs were used as a mapping function to estimate T-F masks. In this section, we explain DNN training based on ML estimation, on which the proposed method is based. Since the ML-based approach explicitly models the PDF of the target source, it becomes possible to simulate output signals by generating random numbers from the PDF.

In ML-based training, the DNNs are constructed to estimate the parameters of the conditional PDF of the target source given the observation, $p(S_\tau | X_\tau, \Theta)$. Here, $\Theta$ denotes the DNN parameters. Its example on a fully connected DNN is described later (after (16)). The target and observation source are assumed to be vectorized for all frequency bins as

$$S_\tau := (S_{1,\tau}, \ldots, S_{\Omega,\tau})^\top, \tag{5}$$
$$X_\tau := (X_{1,\tau}, \ldots, X_{\Omega,\tau})^\top, \tag{6}$$

where $\top$ is transposition. Then $\Theta$ is trained to maximize the expectation of the log-likelihood as

$$\Theta \leftarrow \arg\max_{\Theta} J^{\mathrm{ML}}(\Theta), \tag{7}$$

where the objective function $J^{\mathrm{ML}}(\Theta)$ is defined by

$$J^{\mathrm{ML}}(\Theta) = \mathbb{E}_{S,X}\left[\ln p(S | X, \Theta)\right], \tag{8}$$

and $\mathbb{E}_x[\cdot]$ denotes the expectation operator for $x$. However, since (8) is difficult to calculate analytically, the expectation calculation is replaced with the average over the training dataset as

$$J^{\mathrm{ML}}(\Theta) \approx \frac{1}{T} \sum_{\tau=1}^{T} \ln p(S_\tau | X_\tau, \Theta). \tag{9}$$

The back-propagation algorithm [36] is used in training to maximize (9). When $p(S_\tau | X_\tau, \Theta)$ is composed of differentiable functions with respect to $\Theta$, the gradient is calculated as

$$\nabla_\Theta J^{\mathrm{ML}}(\Theta) \approx \frac{1}{T} \sum_{\tau=1}^{T} \nabla_\Theta \ln p(S_\tau | X_\tau, \Theta), \tag{10}$$

where $\nabla_x$ is a partial differential operator with respect to $x$.

To calculate (10), $p(S_\tau | X_\tau, \Theta)$ is modeled by assuming that the estimation error of $S_{\omega,\tau}$ is independent for all frequency bins and follows the zero-mean complex Gaussian distribution with the variance $\sigma^2_{\omega,\tau}$. The assumption is based on state-of-the-art methods, which train DNNs to minimize the MSE between $S_{\omega,\tau}$ and $\hat{G}_{\omega,\tau} X_{\omega,\tau}$ on the complex plane [19], [20]. The minimum-MSE (MMSE) on the complex plane is equivalent to assuming that the errors are independent for all frequency bins and follow the zero-mean complex Gaussian distribution with variance 1. Our assumption relaxes the assumption of the conventional methods; the variances of each frequency bin vary according to the error values to maximize the likelihood. Thus, since $\hat{S}_{\omega,\tau}$ is given by $\hat{G}_{\omega,\tau} X_{\omega,\tau}$, $p(S_\tau | X_\tau, \Theta)$ is modeled by the following complex Gaussian distribution:

$$p(S_\tau | X_\tau, \Theta) = \prod_{\omega=1}^{\Omega} \frac{1}{2\pi\sigma^2_{\omega,\tau}} \exp\left\{ -\frac{\left|S_{\omega,\tau} - \hat{G}_{\omega,\tau} X_{\omega,\tau}\right|^2}{2\sigma^2_{\omega,\tau}} \right\}. \tag{11}$$

In this model, the MSE between $S_{\omega,\tau}$ and $\hat{S}_{\omega,\tau}$ on the complex plane is extended to the likelihood of $S_{\omega,\tau}$ defined on the complex Gaussian distribution, the mean and variance parameters of which are $\hat{S}_{\omega,\tau}$ and $\sigma^2_{\omega,\tau}$, respectively. Equation (11) includes unknown parameters: the T-F mask $\hat{G}_{\omega,\tau}$ and error variance $\sigma^2_{\omega,\tau}$. Thus, we construct DNNs to estimate $\hat{G}_{\omega,\tau}$ and $\sigma^2_{\omega,\tau}$ from $X_\tau$, as shown in Fig. 2. The vectorized T-F masks and error variances for all frequency bins are defined as

$$\hat{G}(x_\tau) := (\hat{G}_{1,\tau}, \ldots, \hat{G}_{\Omega,\tau})^\top, \tag{12}$$
$$\sigma^2(x_\tau) := (\sigma^2_{1,\tau}, \ldots, \sigma^2_{\Omega,\tau})^\top. \tag{13}$$

Here $x_\tau$ is the input vector of DNNs that is prepared by concatenating several frames of observations to account for previous and future $Q$ frames as $x_\tau = (X_{\tau-Q}, \ldots, X_\tau, \ldots, X_{\tau+Q})^\top$, and $\hat{G}(x_\tau)$ and $\sigma^2(x_\tau)$ are estimated by

$$\hat{G}(x_\tau) \leftarrow \phi_g\left\{ W^{(g)} z^{(L-1)} + b^{(g)} \right\}, \tag{14}$$
$$\sigma^2(x_\tau) \leftarrow \phi_\sigma\left\{ W^{(\sigma)} z^{(L-1)} + b^{(\sigma)} \right\} + C, \tag{15}$$
$$z^{(l)} = \phi_h\left\{ W^{(l)} z^{(l-1)} + b^{(l)} \right\}, \tag{16}$$

where $C$ is a small positive constant value to prevent the variance from being very small. Here, $l$, $L$, $W^{(l)}$, and $b^{(l)}$ are the layer index, number of layers, weight matrix, and bias vector, respectively. $W^{(g)}, W^{(\sigma)}$ are the weight matrices and $b^{(g)}, b^{(\sigma)}$ are the bias vectors to estimate the T-F mask and variance, respectively. The DNN parameters are composed of $\Theta = \{W^{(g)}, b^{(g)}, W^{(\sigma)}, b^{(\sigma)}, W^{(l)}, b^{(l)} \mid l \in (2, \ldots, L-1)\}$. The functions $\phi_g$, $\phi_\sigma$, and $\phi_h$ are nonlinear activation functions, and in conventional studies, sigmoid and exponential functions were used as an implementation of $\phi_g$ [19], [20] and $\phi_\sigma$ [32], respectively. The input vector $x_\tau$ is passed to the first layer of the network as $z^{(1)} = x_\tau$.

Fig. 2. ML-based DNN architecture used in T-F mask estimation

III. PROPOSED METHOD

Our proposed DNN-training method increases OSQA scores. With the proposed method, the policy gradient method [42] is used to statistically calculate the gradient with respect to $\Theta$ by using a sampling algorithm, even though the objective function is not differentiable. However, sampling-based gradient estimation would frequently make the DNN training behavior unstable. To avoid this problem, we introduce two tricks: i) score normalization that reduces the variance in the estimated gradient (in Sec. III-B), and ii) a sampling algorithm to simulate output signals to satisfy the constraint of T-F mask processing (in Sec. III-C). Finally, the overall training procedure of the proposed method is summarized in Sec. III-D.

A. Policy gradient-based DNN training for T-F mask estimation

Let $B(\hat{S}, X)$ be a scoring function that quantifies the sound quality of the estimated sound signal $\hat{S} := (\hat{S}_{1,\tau}, \ldots, \hat{S}_{\Omega,\tau})^\top$ defined by (2). Subjective evaluation would be the simplest way to implement $B(\hat{S}, X)$. However, it would be difficult to use in practice because DNN training requires a massive amount of listening-test results. Thus, $B(\hat{S}, X)$ quantifies the sound quality based on OSQA scores, as shown in Fig. 1, and the details of its implementation are discussed in Sec. III-B. We assume $B(\hat{S}, X)$ is non-differentiable with respect to $\Theta$, because most OSQA scores are black-box functions.

Let us consider maximizing the expectation of $B(\hat{S}, X)$ as a performance metric of sound-source enhancement that increases OSQA scores:

$$\mathbb{E}_{\hat{S},X}\left[B(\hat{S}, X)\right] = \iint B(\hat{S}, X)\, p(\hat{S}, X)\, d\hat{S}\, dX. \tag{17}$$

Since the output signal $\hat{S}$ is calculated from the observation $X$, we decompose the joint PDF $p(\hat{S}, X)$ into the conditional PDF of the output signal given the observation $p(\hat{S}|X)$ and the marginal PDF of the observation $p(X)$ as $p(\hat{S}, X) = p(\hat{S}|X)\, p(X)$. Then, (17) can be rewritten as

$$\mathbb{E}_{\hat{S},X}\left[B(\hat{S}, X)\right] = \int p(X) \int B(\hat{S}, X)\, p(\hat{S}|X)\, d\hat{S}\, dX. \tag{18}$$

We use DNNs to estimate the parameters of the conditional PDF of the output signal $p(\hat{S}|X, \Theta)$, as with the case of ML-based training. For example, the complex Gaussian distribution in (11) can be used as $p(\hat{S}|X, \Theta)$. To train $\Theta$, $\mathbb{E}_{\hat{S},X}[B(\hat{S}, X)]$ is used as an objective function by replacing the conditional PDF $p(\hat{S}|X)$ with $p(\hat{S}|X, \Theta)$ as

$$J(\Theta) = \mathbb{E}_{\hat{S},X}\left[B(\hat{S}, X)\right] \tag{19}$$
$$= \int p(X) \int B(\hat{S}, X)\, p(\hat{S}|X, \Theta)\, d\hat{S}\, dX. \tag{20}$$
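As a rough illustration of the model above, the following NumPy sketch writes out the output layers (14)–(16), the log of the Gaussian likelihood (11), and a plain Monte Carlo estimate of the inner expectation in (20) for a single frame. It is a minimal sketch under stated assumptions, not the authors' implementation: the function names (`forward`, `log_likelihood`, `estimate_objective`), the argument layout, and the stand-in scorer `score_fn` are all hypothetical, and the sampler gives the real and imaginary parts a variance of $\sigma^2$ each, which matches the $1/(2\pi\sigma^2)$ normalization in (11).

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def forward(x, hidden, W_g, b_g, W_s, b_s, C=1e-4):
    """Hypothetical forward pass: hidden ReLU layers (16), mask head (14), variance head (15)."""
    z = x                                   # z^(1) = x
    for W, b in hidden:                     # hidden layers, eq. (16) with ReLU
        z = np.maximum(W @ z + b, 0.0)
    G = sigmoid(W_g @ z + b_g)              # eq. (14): T-F mask in (0, 1)
    sigma2 = np.exp(W_s @ z + b_s) + C      # eq. (15): variance kept above C
    return G, sigma2

def log_likelihood(S, X, G, sigma2):
    """Log of eq. (11) for one frame (all arrays indexed by frequency bin)."""
    err2 = np.abs(S - G * X) ** 2
    return float(np.sum(-np.log(2.0 * np.pi * sigma2) - err2 / (2.0 * sigma2)))

def estimate_objective(X, G, sigma2, score_fn, K, rng):
    """Monte Carlo estimate of the inner expectation in (20) for one frame.

    score_fn is a stand-in for the black-box score B; it is an assumption
    of this sketch, not an OSQA implementation.
    """
    total = 0.0
    for _ in range(K):
        noise = rng.standard_normal(X.shape) + 1j * rng.standard_normal(X.shape)
        S_hat = G * X + np.sqrt(sigma2) * noise   # sample from the Gaussian in (11)
        total += score_fn(S_hat, X)
    return total / K
```

Because `score_fn` is only evaluated, never differentiated, this estimate works for any black-box scorer; it gives the value of the objective but not its gradient with respect to $\Theta$.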
The 4 Since B(S; X) is non-dierentiable with respect to , the gradient of (20) cannot be analytically obtained by simply ap- plying back-propagation. Hence, we apply the policy-gradient method [42], which can statistically calculate the gradient of (ii) a black-box objective function. By assuming that the function , , (i) ˆ ˆ form of B(S; X) is smooth, B(S; X) is a continuous function , , , and its derivative exists. In addition, we assume p(SjX; ) , , is composed with dierentiable functions with respect to . Then, the gradient of (20) can be calculated using a log- derivative trick [42] r p(x) = p(x)r ln p(x) as x x Z Z ˆ ˆ ˆ r J () = p(X) B(S; X)r p(SjX; )dSdX; (21) Fig. 3. T-F mask sampling procedure of proposed method on complex plane. h h ii (i) (i) (i) (i;k) ˆ ˜ The black, red, blue, and green points represent X , G X , S , and !; !; !; !; ˆ ˆ = E E B(S; X)r ln p(SjX; ) : (22) (i;k) (i) (i) X ˆ SjX ˆ ˆ G X , respectively. First, the parameters of p(S jX ; ), i.e., the T- !; !; !; !; (i) (i;k) ˆ ˜ F mask G and the variance are estimated using a DNN. Then, S is !; !; Since the expectation in (22) cannot be analytically calculated, (i) sampled from p(S jX ; ) by using a typical sampling algorithm; which !; !; the expectation with respect to X is approximated by averaging (i;k) is shown as arrow-(i). Finally, the simulated T-F mask G is calculated to !; (i;k) (i;k) (i) the training data, and the average of S is calculated using the ˜ ˆ minimize the MSE between S and the simulated output signal G X !; !; !; by (29); which is shown as arrow-(ii). sampling algorithm as T K X X 1 1 (k) (k) ˆ ˆ r J () B(S ; X )r ln p(S jX ; ); (23) large variance in the scoring-function output [42]. To stabilize T K =1 k=1 the training, instead of directly using a raw OSQA score (k) ˆ ˆ S p(SjX ; ); (24) as B(S; X), a normalized OSQA score is used to reduce its (k) variance. 
Hereafter, a raw OSQA score calculated from S, where S is the k-th simulated output signal and K is the ˆ ˆ X and S is written as Z(S; X) to distinguish between a raw number of samplings, which is assumed to be suciently ˆ ˆ OSQA score Z(S; X) and normalized OSQA score B(S; X). large. The superscript (k) represents the variable of the k-th From (25) and (26), the total gradient r J () is a weighted sampling, and is a sampling operator from the right-side sum of the i-th gradient of the log-likelihood function, and distribution. The details of the sampling process for (24) are B(S; X) is used as its weight. Since typical OSQA scores described in Sec. III-C. vary not only by the performance of source enhancement but Most OSQA scores, such as PESQ, are designed for their (1;:::;I) also by the SNRs of each input signal X , r J () also scores to be calculated using several time frames such as one (1;:::;I) (k) varies by the OSQA scores and SNRs of X . To reduce the utterance of a speech sentence. Since B(S ; X ) of every time variance in the estimate of the gradient, it would be better to frame cannot be obtained, the gradient cannot be calculated remove such external factors according to the input conditions by (23). Thus, instead of using the average of , we use the of each input signal, e.g., input SNRs. As a possible solution, average of I utterances. We deﬁne the observation of the i-th (i) (i) (i) the external factors involved in the OSQA score would be utterance as X := (X ; :::; X ), and the k-th output signal of (i) estimated by calculating the expectation of the OSQA score of (i;k) (i;k) (i;k) ˆ ˆ ˆ the i-th utterance as S := (S ; :::; S ). Then the gradient (i) the input signal. 
Thus, subtracting the conditional expectation can be calculated as ˆ ˆ of Z(S; X) given by each input signal E [Z(S; X)] from SjX Z(S; X) might be eective in reducing the variance as (i) r J () r J (); (25) h i ˆ ˆ ˆ i=1 B S; X = Z(S; X) E Z(S; X) : (27) SjX (i;k) (i) (i) K T B S ; X X X This implementation is known as “baseline-subtraction” [42], (i) (i;k) (i) r J () r ln p(S jX ; ); (26) [43]. Here, E [Z(S; X)] cannot be analytically calculated, so (i) ˆ SjX KT k=1 =1 we replace the expectation with the average of OSQA scores. (i) Then the scoring function is designed as where T is the frame length of the i-th utterance, and we assume that the output signal of each time frame is (i;k) (i;k) 1 (i; j) (i) (i) (i) calculated independently. The details of the deviation of (25) ˆ ˆ ˆ B S ; X = Z(S ; X ) Z(S ; X ): (28) are described in the Appendix A. j=1 B. Scoring-function design for stable training C. Sampling-algorithm to simulate T-F-mask-processed out- put signal We now introduce a design of a scoring function B(S; X) to stabilize the training process. Because the expectation for The sampling operator used in (24) is an intuitive method the gradient calculation in (22) is approximated using the that uses a typical pseudo random number generator such as sampling algorithm, the training may become unstable. One the Mersenne-Twister [44]. However, this sampling operator reason for unstable training behavior is that the variance in would in fact be dicult to use because typical sampling the estimated gradient becomes large in accordance with the algorithms simulate output signals that do not satisfy the 5 constraint of real-valued T-F-mask processing deﬁned by (2). In addition, a large gradient value r J () leads to unstable (i;k) To avoid this problem, we calculate the T-F mask G and training. 
One reason for the large gradient is that the log- !; (i;k) (i;k) (i) output signal S from the simulated output signal by using likelihood r ln p(S jX ; ) in (26) becomes large. To re- !; (i;k) (i;k) (i;k) ˜ ˆ ˆ a typical sampling algorithm S , so that G and S duce the gradient of the log-likelihood, the dierence between !; !; !; (i) (i;k) ˆ ˆ satisfy the constraint of T-F-mask processing and minimize the mean T-F mask G and simulated T-F mask G is !; !; (i;k) (i;k) ˆ ˜ the squared error between S and S . truncated to conﬁne it within the range of [; ] as !; !; Figure 3 illustrates the overview of the problem and the (i;k) (i;k) (i) ˆ ˆ ˆ G G G (33) !; !; !; proposed solution on the complex plane. In this study, we use (i;k) (G > ) the real-value T-F mask within the range of 0 G 1. > !; !; (i;k) (i;k) (i;k) ˆ ˆ ˆ Thus, the output signal is constrained to exist on the dotted G G ( G ) ; (34) !; !; !; > (i;k) line in Fig. 3, i.e., T-F mask processing aects only the norm : ˆ (G < ) !; (i;k) ˆ ˆ of S . However, since p(SjX; ) is modeled by a continuous !; (i;k) (i) (i;k) ˆ ˆ ˆ G G + G : (35) !; !; !; PDF such as the complex Gaussian distribution in (11), a (i;k) typical sampling algorithm possibly generates output signals Then, the output signal S is calculated by T-F-mask that do not satisfy the T-F-mask constraint, i.e., the phase (i;k) (i) processing (30), and the OSQA scores Z(S ; X ) and (i;k) (i) spectrum of S does not coincide with that of X . To (i;k) !; !; (i) B(S ; X ) are calculated by (28). After applying these solve this problem, we formulate the PSA-based T-F-mask re- (i;k) procedures for I utterances, is updated using the back- calculation. First, a temporary output signal S is sampled !; propagation algorithm using the gradient calculated by (25). using a sampling algorithm (Fig. 3 arrow-(i)). Then, the T-F (i;k) (i;k) ˆ ˜ mask G that minimizes the squared error between S and !; !; (i;k) (i) IV. 
EXPERIMENTS G X is calculated using the PSA equation as !; !; 0 0 11 We conducted objective experiments to evaluate the perfor- (i;k) B B CC jS j B B (i;k) (i) CC !; ˜ (i;k) B B (S ) (X ) CC ˆ mance of the proposed method. The experimental conditions B B CC G = min 1; max 0; cos ; (29) B B CC !; @ @ !; !; AA (i) jX j are described in Sec. IV-A. To investigate whether a DNN !; (i;k) (i) source-enhancement function can be trained to increase OSQA (S ) (X ) (i;k) (i) where and are the phase spectra of S and X , !; !; !; !; scores, we ﬁrst investigated the relationship between the respectively. Then, the output signal is calculated by number of updates and OSQA scores (Sec. IV-B). Second, (i;k) (i;k) (i) ˆ ˆ the source enhancement performance of the proposed method S = G X ; (30) !; !; !; was compared with those of conventional methods by using as shown with arrow-(ii) in Fig. 3. several objective measurements (Sec. IV-C). Finally, subjective evaluations for sound quality and ineligibility were conducted D. Training procedure (Sec. IV-D). For comparison methods, we used four DNN We describe the overall training procedure of the proposed source-enhancement methods; two T-F-mask mapping func- method, as shown in Fig. 4. Hereafter, to simplify the sam- tions trained using an MMSE-based objective function [19] pling algorithm, we use the complex Gaussian distribution as and the ML-based objective function described in Sec. II-B, p(SjX; ) described in (11)–(16). and two T-F-mask selection functions trained for increasing (i) First, the i-th observation utterance X is simulated by (1) the PESQ and STOI [39]. using a randomly selected target-source ﬁle and a noise source with equal frame size from the training dataset. Next, the T-F A. Experimental conditions (i) (i) mask G(x ) and variance (x ) are estimated by (11)–(16). 
(i;k) 1) Dataset: The ATR Japanese speech database [45] was Then, to simulate the k-th output signal S , the temporary (i;k) used as the training dataset of the target source. The dataset output signal S is sampled from the complex Gaussian !; consists of 6640 utterances spoken by 11 males and 11 distribution using a pseudo random number generator, such as females. The utterances were randomly separated into 5976 the Mersenne-Twister [44], as 2 3 0 2 3 1 for the development set and 664 for the validation set. As (i;k) (i) 6 ˜ 7 B 6 7 C 6< S 7 B 6< X 7 C !; !; 6 7 B (i) 6 7 2 C 6 7 B ˆ 6 7 C the training dataset of noise, a noise dataset of CHiME-3 was 6 7 N BG 6 7 ; IC ; (31) (i;k) !; (i) !; 4 5 @ 4 5 A = S = X !; !; used that consisted of four types of background noise ﬁles including noise in cafes, street junctions, public transport, where I is the 2 2 identity matrix, and < and = denote and pedestrian areas [46]. The noisy-mixture dataset was the real and imaginary parts of the complex number, respec- (i;k) generated by mixing clean speech utterances with various tively. After that, T-F mask G is calculated using (29). To !; noisy and SNR conditions using the following procedure; i) the accelerate the algorithm convergence, we additionally use the (i;k) noise is randomly selected from noise dataset, ii) the amplitude -greedy algorithm to calculate G . With probability 1 !; of noise is adjusted to be the desired SNR-level, and iii) the applied to each time-frequency bin, the maximum a posteriori (i) speech and noise source is added in the time-domain. As the (MAP) T-F mask G estimated using DNNs is used instead !; (i;k) test dataset, a Japanese speech database consisting of 300 of G as !; utterances spoken by 3 males and 3 females was used for (i;k) > ˆ G (with prob. 
) target-source dataset, and an ambient noise database recorded < !; (i;k) G : (32) !; (i) : ˆ at airports (Airp.), amusement parks (Amuse.), oces (Oce), G (otherwise) !; … … Repeat times Sampling process (24) ( times) |, Θ Training data T-F mask sampling (29)(31)-(35) Calculate Calculate Target … Random Θ Θ select T-F masking (30) (26) (25) Noise Update Calc. , (27) parameters Fig. 4. Training procedure of proposed method 4 1 TABLE I regularization parameter in (15) was C = 10 . The Adam Experimental conditions method [47] was used as a gradient method. To avoid over- ﬁtting, input vectors and DNN outputs, i.e., the T-F masks Parameters for signal processing and error variances, were compressed using a B = 64 Mel- Sampling rate 16.0 kHz transformation matrix, and the estimated T-F masks and error FFT length 512 pts FFT shift length 256 pts variances were transformed into a linear frequency domain # of mel-ﬁlterbanks 64 using the Mel-transform’s pseudo-inverse [48]. Smoothing parameter 0.3 A PSA objective function [19], [20] was used as the MMSE- min Lower threshold G 0.158 (= 16 dB) based objective function. Since the PSA objective function Training SNR (dB) -6, 0, 6, 12 does not use the variance parameter (x ), DNNs estimate DNN architecture # of hidden layers for DNNs 3 only T-F masks G(x ). For the ML-based objective function, # of hidden units for DNNs 1024 we used (9) with the complex Gaussian distribution described Activation function (T-F mask, ) sigmoid in Sec. II-B. To train both methods, the dropout algorithm Activation function (variance, ) exponential was used and initialized by layer-by-layer pre-training [49]. 
Activation function (hidden, ) ReLU An early-stopping algorithm [17] was used for ﬁne-tuning Context window size Q 5 4 7 Variance regularization parameter C 10 with the initial step-size 10 and the step-size threshold 10 , Parameters for MMSE and ML-based DNN training and L2 normalization with the parameter 10 was used as a Initial step-size 10 regularization algorithm. Step-size threshold for early-stopping 10 For the T-F-mask selection-based method [39], to improve Dropout probability (input layer) 0.2 the ﬂexibility of T-F-mask selection, we used 128 T-F-mask Dropout probability (hidden layer) 0.5 L normalization parameter 10 templates. The DNN architecture, except for the output layer, Parameters for T-F mask selection is the same as MMSE- and ML-based methods. # of T-F mask templates 128 For the proposed method, DNN parameters were initialized -greedy parameter 0.01 by ML-based training, and their step-size was 10 . To calcu- Parameters for proposed DNN training later J (), the iteration parameters I = 10 and K = 20 were Step-size 10 # of utterance I 10 used. The -greedy parameter was 0.05, and the clipping # of T-F mask sampling K 20 parameter was determined as 0:05 according to preliminary Clipping parameter 0.05 2 informal experiments . As the OSQA scores, we used the -greedy parameter 0.05 PSEQ, which is a speech quality measure, and the STOI, which is a speech intelligibility measure. To avoid adjusting the step- size of the gradient method for each OSQA, we normalized OSQA scores to uniform the range of the each OSQA score. In this experiments, each OSQA score was normalized so that and party rooms (Party) was used as the noisy dataset. All its maximum and minimum values were 100 and 0 as samples were recorded at the sampling rate of 16 kHz. The SNR levels of the training/test dataset were -6, 0, 6, and 12 PESQ ˆ ˆ dB. 
Z (S; X) = 20:0 PESQ(S; X) + 0:5 ; STOI ˆ ˆ Z (S; X) = 100:0 STOI(S; X): 2) DNN architecture and setup: For the proposed and all conventional methods, a fully connected DNN was used In preliminary experiments using candidate values C 2 that has 3 hidden layers and 1024 hidden units. All input 2 3 4 f10 ; 10 ; 10 g, there were no distinct dierences in training stability and vectors were mean-and-variance normalized using the training results. Thus, to eliminate the eect of regularization, we used the minimum parameter of the candidate values. data statistics. The activation functions for the T-F mask , We tested some possible combinations of these parameters by grid-search. variance , and hidden units were the sigmoid function, Then, we found that the listed parameters achieved a stable training and exponential function, and rectiﬁed linear unit (ReLU), respec- realistic computational time (2 days using an Intel Xeon Processor E5-2630 tively. The context window size was Q = 5, and the variance v3 CPU and a Tesla M-40 GPU). 7 Input SNR: -6 dB Input SNR: 0 dB Input SNR: 6 dB Input SNR: 12 dB Average 0.66 0.58 0.46 0.58 0.56 0.64 0.56 0.56 0.44 0.54 0.62 0.54 0.54 0.42 0.52 0.6 0.52 0.52 0.4 0.5 0.5 0.58 0.5 0.38 0.48 0.56 0.48 0.48 0 5000 10000 0 5000 10000 0 5000 10000 0 5000 10000 0 5000 10000 1.5 3 0.8 1.5 0.6 0.4 0.5 0.5 0.2 0 0 -1 -0.5 -0.2 -0.5 -1 -2 -0.4 -1 0 5000 10000 0 5000 10000 0 5000 10000 0 5000 10000 0 5000 10000 # of update # of update # of update # of update # of update Fig. 5. OSQA score improvement depending on number of updates. X-axis shows number of updates, and y-axis shows average dierence between OSQA score of proposed method and that of observed signal. Solid lines and gray area are average and standard-error, respectively. 
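The score normalization above, together with the mask sampling steps (29)–(35) and the baseline subtraction (28), can be sketched in NumPy as follows. This is a minimal sketch under stated assumptions rather than the authors' code: the function names, the array layout (frequency by time), the defaults for `eps` and `lam`, and the callable `score_fn` are hypothetical. In particular, `score_fn` stands in for a range-normalized OSQA score; computing real PESQ or STOI would require time-domain signals and a clean reference, which are outside this sketch. The PESQ mapping assumes raw values in $[-0.5, 4.5]$ and the STOI mapping assumes raw values in $[0, 1]$.

```python
import numpy as np

def normalize_pesq(pesq):
    """Map a raw PESQ value to 0..100 (assumes raw range [-0.5, 4.5])."""
    return 20.0 * (pesq + 0.5)

def normalize_stoi(stoi):
    """Map a raw STOI value in [0, 1] to 0..100."""
    return 100.0 * stoi

def sample_mask(X, G_mean, sigma2, rng, eps=0.05, lam=0.05):
    """One simulated T-F mask for an utterance, following (31), (29), (32)-(35)."""
    # (31): temporary output; real and imaginary parts each have variance sigma2
    noise = rng.standard_normal(X.shape) + 1j * rng.standard_normal(X.shape)
    S_tmp = G_mean * X + np.sqrt(sigma2) * noise
    # (29): PSA re-calculation yields a real-valued mask in [0, 1]
    ratio = np.abs(S_tmp) / np.maximum(np.abs(X), 1e-12)
    G = np.clip(ratio * np.cos(np.angle(S_tmp) - np.angle(X)), 0.0, 1.0)
    # (32): keep the sampled mask with probability eps, otherwise the DNN mask
    G = np.where(rng.random(X.shape) < eps, G, G_mean)
    # (33)-(35): confine the deviation from the DNN mask to [-lam, lam]
    return G_mean + np.clip(G - G_mean, -lam, lam)

def normalized_scores(X, G_mean, sigma2, score_fn, K, rng):
    """Sample K masks, score the masked outputs (30), and subtract the baseline (28)."""
    masks = [sample_mask(X, G_mean, sigma2, rng) for _ in range(K)]
    Z = np.array([score_fn(G * X, X) for G in masks])
    return masks, Z - Z.mean()
```

The returned values play the role of the weights $B(\hat{S}^{(i,k)}, X^{(i)})$ in (26): samples scoring above the per-utterance average receive a positive weight and those below it a negative one, so the weights of one utterance sum to zero.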
The training algorithm was stopped after 10,000 executions of the whole parameter update process shown in Fig. 4.

Fig. 6. Mean squared error (MSE) depending on number of updates (panels: input SNR of -6, 0, 6, and 12 dB, and average). OSQA scores used for training of proposed method were (a) PESQ and (b) STOI. X-axis shows number of updates, and y-axis shows MSE. Solid lines and gray area are average and standard-error, respectively.

3) Other conditions: It is known that T-F-mask processing causes artificial distortion, so-called musical noise [50]. For all methods, to reduce musical noise, flooring [6], [51] and smoothing [52], [53] were applied to $\hat{G}_{\omega,\tau}$ before T-F-mask processing as

$$\hat{G}_{\omega,\tau} \leftarrow \max\left( \hat{G}_{\omega,\tau},\ G^{\min} \right), \tag{36}$$
$$\hat{G}_{\omega,\tau} \leftarrow \beta\, \hat{G}_{\omega,\tau} + (1 - \beta)\, \hat{G}_{\omega,\tau-1}, \tag{37}$$

where we used the lower threshold of the T-F mask $G^{\min} = 0.158$ and smoothing parameter $\beta = 0.3$. The frame size of the short-time Fourier transform (STFT) was 512, and the frame was shifted by 256 samples. All the above-mentioned conditions are summarized in Table I.

TABLE II
Correlation coefficients between MSE and OSQA score improvements. The signs of the values are not preserved in the source text; magnitudes are shown as extracted.

| OSQA | -6 dB | 0 dB | 6 dB | 12 dB | Average |
| --- | --- | --- | --- | --- | --- |
| PESQ | 0.120 | 0.081 | 0.020 | 0.089 | 0.020 |
| STOI | 0.756 | 0.672 | 0.951 | 0.980 | 0.482 |

B. Investigation of relationship between number of updates and OSQA score

To investigate whether the DNN source-enhancement function can be trained to increase OSQA scores, we first investigated the relationship between the number of updates and improvement of the OSQA scores. We define "OSQA score improvement" as the difference in the score value from the baseline OSQA score. For the baseline, we use the OSQA score obtained from the observed signal. Since the DNN parameters of the proposed method were initialized by ML-based training, each OSQA score was compared with the OSQA

TABLE III
Evaluation results on three objective measurements. Asterisks indicate scores significantly higher than that of MMSE and ML in paired one-sided t-test. Gray cells indicate the highest score in same noise and input SNR condition. Asterisks, cell shading, and the signs of the values are not preserved in the source text; magnitudes are shown as extracted.

Input SNR: -6 dB

| Method | SDR [dB] Airp. | Amuse. | Office | Party | Ave. | PESQ Airp. | Amuse. | Office | Party | Ave. | STOI [%] Airp. | Amuse. | Office | Party | Ave. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OBS | 4.28 | 6.98 | 5.64 | 1.50 | 4.6 | 1.24 | 1.38 | 1.33 | 1.14 | 1.27 | 72.1 | 76.7 | 73.8 | 69.1 | 72.9 |
| MMSE | 3.22 | 5.87 | 4.66 | 3.77 | 4.38 | 1.66 | 1.89 | 1.80 | 1.48 | 1.71 | 68.9 | 73.6 | 71.0 | 66.7 | 70.1 |
| ML | 3.31 | 6.12 | 4.87 | 3.63 | 4.48 | 1.68 | 1.95 | 1.80 | 1.54 | 1.74 | 69.2 | 74.3 | 72.0 | 64.9 | 70.1 |
| C-PESQ | 0.28 | 1.38 | 0.03 | 1.67 | 0.69 | 1.55 | 1.77 | 1.64 | 1.44 | 1.60 | 72.2 | 76.4 | 73.4 | 70.4 | 73.2 |
| C-STOI | 0.21 | 2.02 | 0.68 | 2.17 | 1.27 | 1.48 | 1.64 | 1.56 | 1.34 | 1.50 | 75.0 | 79.8 | 76.6 | 71.1 | 75.6 |
| P-PESQ | 3.13 | 6.34 | 4.72 | 3.50 | 4.42 | 1.78 | 2.07 | 1.91 | 1.57 | 1.83 | 71.0 | 76.0 | 72.4 | 67.9 | 71.8 |
| P-STOI | 2.18 | 6.60 | 3.90 | 4.15 | 4.21 | 1.63 | 1.93 | 1.73 | 1.59 | 1.72 | 74.9 | 80.1 | 76.6 | 71.3 | 75.7 |
| P-MIX | 2.93 | 6.20 | 4.39 | 3.49 | 4.25 | 1.77 | 2.08 | 1.89 | 1.59 | 1.83 | 72.1 | 77.4 | 73.8 | 68.2 | 72.9 |

Input SNR: 0 dB

| Method | SDR [dB] Airp. | Amuse. | Office | Party | Ave. | PESQ Airp. | Amuse. | Office | Party | Ave. | STOI [%] Airp. | Amuse. | Office | Party | Ave. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OBS | 1.67 | 1.19 | 0.36 | 4.46 | 1.32 | 1.71 | 1.88 | 1.81 | 1.54 | 1.73 | 84.5 | 87.8 | 85.2 | 82.9 | 85.1 |
| MMSE | 8.03 | 10.0 | 9.55 | 8.44 | 9.00 | 2.17 | 2.36 | 2.27 | 2.09 | 2.22 | 80.7 | 84.7 | 83.1 | 80.1 | 82.1 |
| ML | 8.62 | 10.4 | 9.97 | 8.66 | 9.40 | 2.20 | 2.42 | 2.30 | 2.14 | 2.27 | 82.5 | 86.4 | 84.6 | 79.6 | 83.3 |
| C-PESQ | 6.36 | 7.08 | 6.49 | 7.89 | 6.95 | 2.11 | 2.33 | 2.23 | 2.00 | 2.16 | 83.7 | 86.2 | 84.0 | 82.7 | 84.2 |
| C-STOI | 7.30 | 8.07 | 7.18 | 8.70 | 7.81 | 2.03 | 2.18 | 2.10 | 1.89 | 2.05 | 86.8 | 89.9 | 87.4 | 84.7 | 87.2 |
| P-PESQ | 8.40 | 10.3 | 9.77 | 8.28 | 9.19 | 2.30 | 2.55 | 2.41 | 2.20 | 2.37 | 82.7 | 86.4 | 84.1 | 80.3 | 83.4 |
| P-STOI | 8.45 | 11.2 | 9.52 | 9.74 | 9.74 | 2.12 | 2.36 | 2.21 | 2.11 | 2.20 | 86.7 | 90.0 | 87.5 | 85.0 | 87.3 |
| P-MIX | 8.09 | 9.85 | 9.12 | 8.11 | 8.79 | 2.31 | 2.57 | 2.41 | 2.23 | 2.38 | 84.2 | 87.8 | 85.5 | 81.6 | 84.7 |

Input SNR: 6 dB

| Method | SDR [dB] Airp. | Amuse. | Office | Party | Ave. | PESQ Airp. | Amuse. | Office | Party | Ave. | STOI [%] Airp. | Amuse. | Office | Party | Ave. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| OBS | 7.67 | 4.96 | 6.29 | 10.5 | 7.34 | 2.18 | 2.33 | 2.28 | 2.02 | 2.20 | 92.2 | 93.8 | 92.7 | 91.8 | 92.6 |
| MMSE | 12.1 | 13.6 | 13.4 | 12.6 | 12.9 | 2.54 | 2.68 | 2.63 | 2.49 | 2.58 | 88.9 | 91.2 | 90.4 | 88.6 | 89.8 |
| ML | 13.1 | 14.2 | 14.1 | 13.5 | 13.7 | 2.59 | 2.77 | 2.69 | 2.54 | 2.65 | 91.1 | 93.0 | 92.2 | 89.8 | 91.5 |
| C-PESQ | 11.5 | 11.9 | 11.4 | 12.6 | 11.9 | 2.54 | 2.75 | 2.69 | 2.45 | 2.61 | 90.5 | 91.8 | 90.9 | 89.9 | 90.8 |
| C-STOI | 13.2 | 13.6 | 13.1 | 14.3 | 13.5 | 2.50 | 2.62 | 2.57 | 2.38 | 2.52 | 93.4 | 94.8 | 93.9 | 92.8 | 93.8 |
| P-PESQ | 12.6 | 13.8 | 13.6 | 12.6 | 13.2 | 2.70 | 2.89 | 2.80 | 2.64 | 2.76 | 90.2 | 92.1 | 91.2 | 89.1 | 90.6 |
| P-STOI | 13.4 | 15.3 | 14.3 | 14.8 | 14.4 | 2.49 | 2.69 | 2.60 | 2.45 | 2.56 | 93.4 | 94.9 | 94.0 | 92.8 | 93.8 |
| P-MIX | 11.5 | 12.3 | 12.1 | 11.6 | 11.9 | 2.69 | 2.90 | 2.79 | 2.66 | 2.76 | 91.5 | 93.1 | 92.3 | 90.4 | 91.8 |

Input SNR: 12 dB

| Method | SDR [dB] Airp. | Amuse. | Office | Party | Ave. | PESQ Airp. | Amuse. | Office | Party | Ave. | STOI [%] Airp. | Amuse. | Office | Party | Ave. |
| --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
OBS 13:6 11:0 12:3 16:4 13:3 2:61 2:76 2:72 2:47 2:64 96:1 96:9 96:4 96:2 96:4 MMSE 15:9 16:9 16:8 16:3 16:5 2:84 2:95 2:92 2:77 2:87 93:5 94:7 94:4 93:2 94:0 ML 17:5 18:0 18:0 18:1 17:9 2:95 3:09 3:03 2:88 2:98 95:5 96:3 96:0 94:9 95:7 C-PESQ 15:5 15:8 15:3 16:3 15:7 2:95 3:14 3:08 2:86 3:01 94:2 94:9 94:4 94:0 94:4 C-STOI 18:2 18:6 18:2 19:0 18:5 2:94 3:05 3:01 2:81 2:95 96:7 97:4 97:0 96:6 96:9 P-PESQ 16:5 17:2 17:1 16:6 16:8 3:04 3:19 3:12 2:97 3:08 94:4 95:2 94:9 93:8 94:6 P-STOI 18:2 19:5 18:8 19:7 19:1 2:85 3:02 2:96 2:78 2:90 96:8 97:5 97:1 96:7 97:0 P-MIX 13:6 13:9 13:9 13:8 13:8 3:01 3:18 3:10 2:97 3:07 95:3 96:0 95:7 94:7 95:4 score that had zero updates. Thus, if DNN parameters were TABLE IV Objective scores of example results shown in Fig. 7. successfully trained with the proposed method, the OSQA score improvement would increase in accordance with the Performance measurement number of updates. Method SDR [dB] PESQ STOI [%] OBS 2:36 1:79 81:5 Figure 5 shows the OSQA score improvements evaluated on MMSE 9:31 2:32 80:0 the test dataset. Both OSQA score improvements increased ML 11:3 2:48 82:1 P-PESQ 10:7 2:55 81:4 as the number of updates increased for all SNR conditions. P-STOI 11:2 2:40 86:3 These results suggest that the proposed method is eective P-MIX 11:2 2:55 83:4 at increasing arbitrary OSQA scores, such as the PESQ and STOI. on the input SNR condition. Thus, these results suggest that We also investigated the relationship between the number minimization of MSE does not necessarily maximize OSQA of updates and MSE using the test dataset. Figure 6 shows scores. MSE depending on the number of updates. Under most SNR conditions, MSE did not decrease despite OSQA scores increasing. Table II shows the correlation coecients between C. Objective evaluation OSQA score improvements and MSE values. 
There was little correlation between PESQ improvement and MSE, and the The source-enhancement performance of the proposed correlation between STOI improvement and MSE depended method was compared with those of conventional methods 9 [dB] [dB] Target Observation 8 8 0 0 6 6 -20 -20 4 4 -40 -40 2 2 -60 -60 0 0 0 2 4 0 2 4 Time [s] Time [s] [dB] (a) (b) (c) (d) (e) 8 8 0 8 0 8 8 0 0 0 6 6 6 6 6 -20 -20 -20 -20 -20 4 4 4 4 4 -40 -40 -40 -40 -40 2 2 2 2 2 -60 -60 -60 -60 -60 0 0 0 0 0 0 2 4 0 2 4 0 2 4 0 2 4 0 2 4 8 8 0 8 0 8 0 8 0 0 6 6 -5 6 -5 6 -5 6 -5 -5 4 4 -10 4 -10 4 -10 4 -10 -10 2 2 2 2 2 -15 -15 -15 -15 -15 0 0 0 -20 -20 0 -20 0 -20 -20 0 2 4 0 2 4 0 2 4 0 2 4 0 2 4 Time [s] Time [s] Time [s] Time [s] Time [s] Fig. 7. Examples of estimated T-F mask and output signal. Top ﬁgures show spectrogram of target source S (left) and observed signal X (right), !; !; ˆ ˆ respectively. Middle ﬁgures show spectrogram of output signal S and bottom ﬁgures show estimated T-F mask G , respectively. White dotted box and !; !; circle show larger or less noise reduction areas which modiﬁed by training of P-PESQ and P-STOI, respectively. (a) MMSE, (b) ML, (c) P-PESQ, (d) P-STOI, and (e) P-MIX. using three objective measurements: the signal-to-distortion the conventional MMSE/ML-based objective function than ratio (SDR), PESQ, and STOI. The SDR was deﬁned as the proposed method under low SNR conditions. The PESQ P P and STOI of P-PESQ and P-STOI were higher than those of jS j !; =1 !=1 SDR [dB] := 10 log ; (38) MMSE and ML, respectively. For each method, the PESQ and P P jS S j !; !; =1 !=1 STOI improved by around 0.1 and 2–5 %, respectively, and and calculated using the “BSS-Eval toolbox [54].” These signiﬁcant dierences were observed for all noise and SNR measurements were evaluated on the observed signal (OBS), conditions. 
These results suggest that the proposed method the MMSE- and ML-based DNN training (MMSE and ML), a T- was able to train the DNN source-enhancement function to F-mask selection method to increase the PESQ and STOI [39] directly increase black-box OSQA scores. (C-PESQ and C-STOI), and the proposed method to increase In mixed-OSQA experiments, both PESQ and STOI of the PESQ and STOI (P-PESQ and P-STOI). To investigate P-MIX were higher than those of MMSE and ML under almost all whether the proposed method enables training of a DNN to noise and SNR conditions. In the comparison to the results of increase a metric that consists of multiple OSQA scores, we the mixed-OSQA and single-OSQA (i.e. P-PESQ and P-STOI), also trained a DNN to increase a mixed-OSQA score (P-MIX). P-MIX achieved almost the same or slightly lower PESQ As the ﬁrst trial, we mixed the PESQ and the STOI. The and STOI scores than P-PESQ and P-STOI, respectively. In mixed-OSQA is deﬁned as addition, P-MIX outperformed STOI and PESQ scores than P-PESQ and P-STOI, respectively. These results suggest that MIX PESQ STOI ˆ ˆ ˆ Z (S; X) = Z (S; X) + (1 )Z (S; X): the use of the mixed-OSQA would be an eective way to In this trial, in order to conﬁrm whether multiple OSQA scores increase multiple-perceptual qualities. increase simultaneously, the additive coecient = 0:5 was In Table III we also show that the proposed method outper- determined in such a way that both OSQA scores had the same formed the T-F mask selection-based methods [39] in terms MIX contribution to Z (S; X). of the target OSQA under almost all noise types and SNR Table III lists the evaluation results of each objective conditions. Such favorable experimental results would have measurement on four noise types and four input SNR con- been observed because of the ﬂexibility of the T-F mask esti- ditions. The asterisk indicates that the score was signiﬁcantly mation achieved by the proposed method. 
In this experiment, higher than both MMSE and ML in a paired one-sided t-test the number of the T-F mask template (= 128) was larger than ( = 0:05). The SDRs tended to be higher when using that used in the previous work (= 32) [39]. However, since Freq. [kHz] Freq. [kHz] Freq. [kHz] Freq. [kHz] 10 5 5 5 * * * * * 4 4 3 3 3 2 2 1 1 1 ML P-PESQ P-STOI ML P-PESQ P-STOI ML P-PESQ P-STOI Fig. 8. Evaluation results of sound-quality test according to ITU-T P.835. Bar graphs and error bar indicate average and standard error, respectively. Asterisks indicate signiﬁcant dierence observed in paired one-sided t-test. D. Subjective evaluation 1) Sound quality evaluation: To investigate the sound qual- ity of the output signals, subjective speech-quality tests were conducted according to ITU-T P.835 [55]. In the tests, the participants rated three dierent factors in the samples: Speech mean-opinion-score (S-MOS): the speech sam- ple was rated 5–not distorted, 4–slightly distorted, 3– somewhat distorted, 2–fairly distorted, or 1–very dis- torted. Subjective noise MOS (N-MOS): the background of the sample was 5–not noticeable, 4–slightly noticeable, 3– ML P-PESQ P-STOI noticeable but not intrusive, 2–somewhat intrusive, or 1– very intrusive. Fig. 9. Evaluation results of word-intelligibility test. Asterisks indicate signiﬁcant dierence observed in unpaired one-sided t-test. Overall MOS (G-MOS): the sound quality of the sample was 5–excellent, 4–good, 3–fair, 2–poor, or 1–bad. Sixteen participants evaluated the sound quality of the output signals of ML, P-PESQ, and P-STOI. The participants evaluated the T-F masks were generated by a combination of the ﬁnite 20 ﬁles for each method; the 20 ﬁles consisted of ﬁve number of templates, the patterns of the T-F mask were still randomly selected ﬁles from the test dataset for each of the limited. These results suggested that by adopting the policy- four types of noise. The input SNR was 6 dB. 
gradient method to optimize the parameters of a continuous Figure 8 shows the results of the subjective tests. For PDF of the T-F mask processing, the ﬂexibility of the T-F all factors, P-PESQ achieved a higher score than ML, and mask estimation was improved. statistically signiﬁcant dierences from ML were observed in Figure 7 shows examples of the estimated T-F masks and a paired one-sided t-test ( p-value = 0:05). The reason for output signal, and Table IV lists its objective scores. The this result suggested that participants may have perceived the SNR of the observed signal was adjusted to 0 dB using degrade of the speech quality from both the speech distortion amusement parks noise. Figure 7 shows that the estimated T- and the residual noise in speech frame in the output signal of F masks reﬂect the characteristics of each objective function. ML. In addition, although there was no statistically signiﬁcant In comparison to the results of MMSE and ML that reduced dierence between P-PESQ and P-STOI in terms of S-MOS the distortion of the target source on average, the T-F mask score, N-MOS score of P-STOI was signiﬁcantly lower than estimated by P-PESQ strongly reduced the residual noise, even that of P-PESQ. Thus, G-MOS score of P-STOI was also lower when it distorted the target sound at a middle/high frequency than that of P-PESQ. It would be because P-STOI weakly (e.g. Fig. 7 white dotted box), and achieved the best PESQ. In reduced noise to avoid distorting the target source, even when contrast, the T-F mask estimated by P-STOI weakly reduced the noise remained in the non-speech frames as shown in Sec. noise to avoid distorting the target source, even when the noise IV.C. remained in the non-speech frames (e.g. Fig. 7 white dotted circle), and achieved the best STOI. 
This may be because the 2) Speech intelligibility test: We conducted a word- residual noise degrades the sound quality and the distortion intelligibility test to investigate speech intelligibility. We se- of the target source degrades speech intelligibility. The T-F lected 50 low familiarity words from familiarity-controlled mask estimated by P-MIX involved both characteristics and word lists 2003 (FW03) [56] as the test dataset of speech. relaxed the disadvantage of P-PESQ and P-STOI, and both The selected dataset consisted of Japanese four-mora words OSQA scores were higher than those of ML and MMSE. Namely, whose accent type was Low-High-High-High. The noisy test speech distortion at a middle/high frequency was reduced (e.g. dataset was created by adding a randomly selected noise at Fig. 7 white dotted box) and residual noise in the non-speech SNR of 6 dB from the noisy dataset, which was used in the frames were reduced (e.g. Fig. 7 white dotted circle). objective evaluation. Sixteen participants attempted to write a Intelligibility score [%] S-MOS N-MOS G-MOS 11 phonetic transcription for output signals of ML, P-PESQ, and [8] A. Narayanan and D. Wang, “Ideal ratio mask estimation using deep neural networks for robust speech recognition,” in Proc. ICASSP, 2013. P-STOI. The percentage of correct answers was used as the [9] T. Ochiai, S. Watanabe, T. Hori, and J. R. Hershey, “Multichannel End- intelligibility score. to-end Speech Recognition,” in Proc. ICML, 2017. Figure 9 shows the intelligibility score of each method. [10] K. Kobayashi, Y. Haneda, K. Furuya, and A. Kataoka, “A hands-free unit with noise reduction by using adaptive beamformer,” IEEE Trans. P-STOI achieved the highest score. In addition, statistically on Consumer Electronics, Vol.54-1, 2008. signiﬁcant dierences from ML were observed in an unpaired [11] Y. Hioka, K. Furuya, K. Kobayashi, S. Sakauchi, and Y. Haneda, “Angu- one-sided t-test ( p-value = 0:05). 
From both sound-quality and lar region-wise speech enhancement for hands-free speakerphone,” IEEE Trans. on Consumer Electronics, Vol.58-4, 2012. speech-intelligibility tests, we found that the proposed method [12] B. C. J. Moore, “Speech processing for the hearing-impaired: successes, could improve the speciﬁc hearing quality corresponding to the failures, and implications for speech mechanisms,” Speech Communica- OSQA score used as the objective function. tion, Vol. 41, Issue 1, pp.81–91, 2003. [13] D. L. Wang, “Time-frequency masking for speech separation and its potential for hearing aid design,” Trends in Ampliﬁcation, vol. 12, pp. V. CONCLUSIONS 332–353, 2008. [14] T. Zhang, F. Mustiere, and C. Micheyl, “Intelligent Hearing Aids: The We proposed a training method for the DNN-based source- Next Revolution,” In Proc. EMBC, 2016. enhancement function to increase OSQA scores such as the [15] Y. Zhao, D. Wang, I. Merks, and T. Zhang, “DNN-based enhancement PESQ. The diculty is that the gradient of OSQA scores of noisy and reverberant speech,” In Proc. ICASSP, 2016. [16] R. Oldﬁeld, B. Shirley and J. Spille, “Object-based audio for interac- may not be analytically calculated by simply applying the tive football broadcast,” Multimedia Tools and Applications, Vol. 74, back-propagation algorithm because most OSQA scores are pp.2717–2741, 2015. black boxes. To calculate the gradient of the OSQA-based [17] Y. Koizumi, K. Niwa, Y. Hioka, K. Kobayashi and H. Ohmuro, “In- objective function, we formulated a DNN-optimization scheme formative acoustic feature selection to maximize mutual information for collecting target sources,” IEEE/ACM Trans. Audio, Speech and on the basis of the policy-gradient method. In the experiment, Language Processing, pp.768–779, 2017. 1) it was revealed that the DNN-based source-enhancement [18] Y. LeCun, Y. Bengio, and G. Hinton, “Deep Learning,” Nature, 521, function can be trained using the gradient of the OSQA pp.436–444, 2015. [19] F. 
Weninger, H. Erdogan, S. Watanabe, E. Vincent, J. L. Roux, J. R. Her- obtained with the policy-gradient method. In addition, 2) the shey, and B. Schuller, “Speech Enhancement with LSTM Recurrent OSQA score and speciﬁc hearing quality corresponding to Neural Networks and its Application to Noise-Robust ASR,” in Proc. the OSQA score used as the objective function improved. LVA/ICA, 2015. [20] H. Erdogan, J. R. Hershey, S. Watanabe, and J. L. Roux, “Phase-sensitive Therefore, it can be concluded that this method made it and recognition-boosted speech separation using deep recurrent neural possible to use not only analytical objective functions but also networks,” in Proc. ICASSP, 2015. black-box functions for the training of the DNN-based source- [21] D. S. Williamson and D. L. Wang, “Time-frequency masking in the complex domain for speech dereverberation and denoising,” IEEE/ACM enhancement function. Trans. Audio, Speech and Language Processing, 2017. Although we focused on maximization of OSQA in this [22] Y. Zhao, B. Xu, R. Giri, and T. Zhang, “Perceptually Guided Speech study, the proposed method potentially increases other black- Enhancement using deep neural networks,” in Proc. ICASSP, 2018. [23] Y. Xu, J. Du, L. R. Dai, and C. H. Lee, “An experimental study on speech box measurements. In the future, we will aim to adopt the enhancement based on deep neural networks,” IEEE Signal Processing proposed method to increase other black-box objective mea- Letters, pp.65–68, 2014. sures such as the subjective score obtained from a “human-in- [24] Y. Xu, J. Du, L. R. Dai and C. H. Lee, “A regression approach to speech enhancement based on deep neural networks,” IEEE/ACM Trans. Audio, the-loop” audio-system [57] and word accuracy of a black-box Speech and Language Processing, pp.7–19, 2015. automatic-speech-recognition system [58]. We found that both [25] Y. Xu, J. Du, Z. Huang, L. R. Dai, and C. H. 
Lee, “Multi-objective the PESQ and STOI could increase simultaneously by mixing learning and mask-based post-processing for deep neural network based speech enhancement,” in Proc. INTERSPEECH, 2015. multiple OSQA scores as an objective function. In the future, [26] T. Gao, J. Du, L. R. Dai, and C. H. Lee, “SNR-Based Progressive we will also investigate the optimality of the OSQA score and Learning of Deep Neural Network for Speech Enhancement,” in Proc. its mixing ratio for the proposed method. INTERSPEECH, 2016. [27] Q. Wang, J. Du, L. R. Dai and C. H. Lee, “A multiobjective learning and ensembling approach to high-performance speech enhancement with References compact neural network architectures,” IEEE/ACM Trans. Audio, Speech [1] J. Benesty, S. Makino, and J. Chen, Eds., “Speech enhancement,” and Language Processing, pp.1181–1193, 2018. Springer, 2005. [28] T. Kawase, K. Niwa, K. Kobayashi, and Y. Hioka, “Application of neural [2] Y. Ephraim and D. Malah, “Speech enhancement using a minimum network to source PSD estimation for Wiener ﬁlter based sound source mean-square error short-time spectral amplitude estimator,” IEEE Trans. separation,” in Proc. IWAENC, 2016. Audio, Speech and Language Processing, pp.1109–1121, 1984. [29] K. Niwa, Y. Koizumi, T. Kawase, K. Kobayashi and Y. Hioka, “Super- [3] R. Zelinski “A microphone array with adaptive post-ﬁltering for noise vised Source Enhancement Composed of Non-negative Auto-Encoders reduction in reverberant rooms,” in Proc. ICASSP, pp. 2578 –2581, 1988. and Complementarity Subtraction” in Proc. ICASSP, 2017. [4] Y. Hioka, K. Furuya, K. Kobayashi, K. Niwa and Y. Haneda, “Un- [30] P. Smaragdis and S. Venkataramani, “A Neural Network Alternative to derdetermined sound source separation using power spectrum density Non-Negative Audio Models,” in Proc. ICASSP, 2017. estimated by combination of directivity gain,” IEEE Trans. Audio, [31] L. Chai, J. Du and Y. 
Wang, “Gaussian Density Guided Deep Neural Speech and Language Processing, pp.1240–1250, 2013. Network For Single-Channel Speech Enhancement,” in Proc. MLSP, [5] K. Niwa, Y. Hioka, and K. Kobayashi, “Optimal Microphone Array 2017. Observation for Clear Recording of Distant Sound Sources,” IEEE/ACM [32] K. Kinoshita, M. Delcroix, A. Ogawa, T. Higuchi, and T. Nakatani, Trans. Audio, Speech and Language Processing, pp.1785–1795, 2016. “Deep Mixture Density Network for Statistical Model-based Feature [6] L. Lightburn, E. D. Sena, A. Moore, P. A. Naylor, M. Brookes, Enhancement,” in Proc. ICASSP, 2017. “Improving the perceptual quality of ideal binary masked speech,” in [33] A. A. Nugraha, A. Liutkus, and E. Vincent, “Multichannel Audio Source Proc. ICASSP, 2017. Separation With Deep Neural Networks,” IEEE/ACM Trans. Audio, [7] T. Yoshioka, A. Sehr, M. Delcroix, K. Kinoshita, R. Maas, T. Nakatani, Speech and Language Processing, 2016. and W. Kellermann, “Making machines understand us in reverberant [34] J. Hershy, Z. Chen, J. L. Roux, and S. Watanabe, “Deep clustering: rooms: robustness against reverberation for automatic speech recogni- Discriminative embeddings for segmentation and separation,” In Proc. tion,” IEEE Signal Processing Magazine, pp. 114–126, 2012. ICASSP, 2016. 12 [35] S. Pascual, A. Bonafonte, and J. Serra, “SEGAN: Speech Enhancement Appendix Generative Adversarial Network,” In Proc INTERSPEECH, 2017. A. Deviation of (25) [36] D. E. Rumelhart, G. E. Hinton, E. Georey and R. J. Williams, “Learning representations by back-propagating errors,” Nature, 323, We describe the deviation of (25). First, as with (19) and pp.533–536, 1986. (20), the objective function is deﬁned as the expectation of [37] ITU-T Recommendation P.862, “Perceptual evaluation of speech quality B(S; X) as (PESQ): An objective method for end-to-end speech quality assessment h i of narrow-band telephone networks and speech codecs,” 2001. J () = E B(S; X) ; (39) S;X [38] C. H. Taal, R. C. 
Hendriks, R. Heusdens, and J. Jensen, “An Algo- Z Z rithm for Intelligibility Prediction of Time-Frequency Weighted Noisy ˆ ˆ ˆ = p(X) B(S; X) p(SjX; )dSdX: (40) Speech,” IEEE Transactions on Audio, Speech and Language Process- ing, Vol. 19, pp.2125–2136, 2011. Then, the gradient of (40) can be calculated using a log- [39] Y. Koizumi, K. Niwa, Y. Hioka, K. Kobayashi and Y. Haneda, ‘DNN- based Source Enhancement Self-optimized by Reinforcement Learning derivative trick as using Sound Quality Measurements,” in Proc. ICASSP, 2017. h h ii ˆ ˆ [40] E. S. Sutton and A. G. Barto, “Reinforcement Learning: An Introduc- r J () = E E B(S; X)r ln p(SjX; ) : (41) X ˆ SjX tion,” A Bradford Book, 1998. By approximating the expectation on X by the average on I [41] D. Silver, A. Huang, C. J. Maddison, A. Guez, L. Sifre, G. Driessche, J. Schrittwieser, I. Antonoglou, V. Panneershelvam, M. Lanctot, S. Diele- ˆ utterances and that of S by the average on K times sampling, man, D. Grewe, J. Nham, N. Kalchbrenner, I. Sutskever, T. Lillicrap, (41) can be calculated as M. Leach, K. Kavukcuoglu, T. Graepel and D. Hassabis, “ Mastering the game of Go with deep neural networks and tree search,” Nature, I K X X 1 1 (i;k) (i;k) pp.484—489, 2016. (i) (i) ˆ ˆ r J () B(S ; X )r ln p(S jX ; ): [42] R. J. Williams, “Simple Statistical Gradient-Following Algorithms for I K =1 k=1 Connectionist Reinforcement Learning,” Machine Learning, Vol. 8, (42) [43] R. S. Sutton, D. McAllester, S. Singh, and Y. Mansour, “Policy Gradient We assume that the output signal on each time frame is Methods for Reinforcement Learning with Function Approximation,” In calculated independently. Then, ln p(SjX; ) can be reformed Proc. NIPS, 1999. to [44] M. Matsumoto and T. Nishimura, “Mersenne Twister: A 623- dimensionally Equidistributed Uniform Pseudorandom Number Gener- ˆ ˆ ln p(SjX; ) = ln p(S jX ; ); (43) ator,” ACM Trans. on Modeling and Computer Simulations, 1998. =1 [45] A. Kurematsu, K. Takeda, Y. Sagisaka, S. 
Katagiri, H. Kuwabara, and K. Shikano, “ATR Japanese speech database as a tool of speech recognition and its gradient can be calculated by and synthesis,” Speech communication, pp.357–363, 1990. (i) [46] J. Barker, R. Marxer, E. Vincent and S. Watanabe, “The third ‘CHiME’ X (i;k) (i) (i;k) (i) ˆ ˆ speech separation and recognition challenge: dataset, task and baseline,” r ln p S jX ; = r ln p(S jX ; ); (44) in Proc. ASRU, 2015. =1 [47] D. P. Kingma and J. Ba, “Adam: A Method for Stochastic Optimization,” (i) in Proc ICLR, 2015. (i;k) (i) r ln p(S jX ; ): (45) (i) [48] F. Weninger, J. R. Hershey, J. L. Roux and B. Schuller, “Discrimina- =1 tively Trained Recurrent Neural Networks for Single-Channel Speech Separation,” in Proc. GlobalSIP, 2014. (i) To normalize the dierence in frame length T , we multiplied [49] F. Seide, G. Li, X. Chen and D. Yu, “Feature engineering in context- (i) 1=T by the original gradient. The log-likelihood function dependent deep neural networks for conversational speech transcription,” (i;k) (i) ln p(S jX ; ) can be expanded as in Proc. ASRU, pp. 24–29, 2011. [50] R. Miyazaki, H. Saruwatari, T. Inoue, Y. Takahashi, K. Shikano and (i;k) (i;k) L +L K. Kondo, “Musical-Noise-Free Speech Enhancement Based on Op- <;!; =;!; (i;k) (i) 2 (i) ln p(S jX ; ) = ln( ) + ; (46) timized Iterative Spectral Subtraction,” IEEE Transactions on Audio, !; 2 (i) 2( ) !; Speech and Language Processing, Vol. 20, pp.2080–2094, 2012. !=1 [51] I. Cohen, “Optimal Speech Enhancement Under Signal Presence Uncer- (i;k) (i;k) (i) (i) (i) ˆ ˆ L = G < X G < X ; (47) !; !; !; !; <;!; tainty Using Log-Spectral Amplitude Estimator,” IEEE Signal Process- ing Letters, Vol. 9, pp.113–116, 2002. (i;k) (i;k) (i) (i) (i) ˆ ˆ L = G = X G = X ; (48) !; !; !; !; =;!; [52] E. Vincent, “An Experimental Evaluation of Wiener Filter Smoothing Techniques Applied to Under-Determined Audio Source Separation,” in (i) 2 (i) where G and ( ) can be estimated by forward- Proc. LVA/ICA, 2010. 
!; !; (i;k) [53] K. Niwa, Y. Hioka, and K. Kobayashi, “Post-Filter Design for Speech propagation of the DNN as (12)–(16), and G is given by the !; Enhancement in Various Noisy Environments,” in Proc IWAENC, 2014. sampling algorithm of the proposed method. By using above [54] E. Vincent, R. Gribonval and C. Fevotte, “Performance measurement procedure, r J () can be calculated by simply applying in blind audio source separation,” IEEE Trans. Audio, Speech and (i) 2 (i) back-propagation with respect to G and ( ) . Please note Language Processing, 14(4), pp.1462–1469, 2006. !; !; (i;k) that since the simulated output signal S deals with the “label [55] ITU-T Recommendation P.835, “Subjective test methodology for eval- (i;k) uating speech communication systems that include noise suppression ˆ data”, the back-propagation algorithm is not applied for G . !; algorithm,” 2003. [56] S. Amano, S. Sakamoto, T. Kondo, and Y. Suzuki, “Development of familiarity-controlled word lists 2003 (FW03) to assess spoken-word intelligibility in Japanese,” Speech Communication, pp. 76–82, 2009. [57] K. Niwa, K. Ohtani and K, Takeda, “Music Staging AI,” in Proc. ICASSP, 2017. [58] S. Watanabe and J. L. Roux, “Black Box Optimization for Automatic Speech Recognition,” in Proc. ICASSP, 2014. 13 Yuma Koizumi (M’15) received the B.S. and M.S. Yoichi Haneda (M’97-SM’06) received the B.S., from Hosei University, Tokyo, in 2012 and 2014, M.S., and Ph.D. degrees from Tohoku University, and the Ph.D. degree from the University of Electro- Sendai, in 1987, 1989, and 1999. From 1989 to Communications in 2017. Since joining the Nip- 2012, he was with the NTT, Japan. In 2012, he pon Telegraph and Telephone Corporation (NTT) joined the University of Electro-Communications, in 2014, he has been researching acoustic signal where he is a Professor. His research interests in- processing and machine learning. 
He was awarded clude modeling of acoustic transfer functions, micro- the IPSJ Yamashita SIG Research Award from the phone arrays, loudspeaker arrays, and acoustic echo Information Processing Society of Japan (IPSJ) in cancellers. He received paper awards from the ASJ 2014 and the Awaya Prize from the Acoustical and from the IEICE of Japan in 2002. Dr. Haneda is Society of Japan (ASJ) in 2017. He is a member a senior member of IEICE, and a member of AES, of the ASJ and the Institute of Electronics, Information and Communication ASA and ASJ. Engineers (IEICE). Kenta Niwa (M’09) received his B.E., M.E., and Ph.D. in information science from Nagoya Univer- sity in 2006, 2008, and 2014. Since joining the NTT in 2008, he has been engaged in research on microphone array signal processing as a research engineer at NTT Media Intelligence Laboratories. From 2017, he is also a visiting researcher at Victoria University of Wellington, New Zealand. He was awarded the Awaya Prize by the ASJ in 2010. He is a member of the ASJ and the IEICE. Yusuke Hioka (S’04-M’05-SM’12) received his B.E., M.E., and Ph.D. degrees in engineering in 2000, 2002, and 2005 from Keio University, Yoko- hama, Japan. From 2005 to 2012, he was with the NTT Cyber Space Laboratories (now NTT Media Intelligence Laboratories), NTT in Tokyo. From 2010 to 2011, he was also a visiting researcher at Victoria University of Wellington, New Zealand. In 2013 he permanently moved to New Zealand and was appointed as a Lecturer at the University of Canterbury, Christchurch. Then in 2014, he joined the Department of Mechanical Engineering, the University of Auckland, Auckland, where he is currently a Senior Lecturer. His research interests include audio and acoustic signal processing especially microphone arrays, room acoustics, human auditory perception and psychoacoustics. He is a Senior Member of IEEE and a Member of the Acoustical Society of New Zealand, ASJ, and the IEICE. 
Kazunori Kobayashi received the B.E., M.E., and Ph.D. degrees in Electrical and Electronic System Engineering from Nagaoka University of Technol- ogy in 1997, 1999, and 2003. Since joining NTT in 1999, he has been engaged in research on micro- phone arrays, acoustic echo cancellers and hands- free systems. He is now Senior Research Engineer of NTT Media Intelligence Laboratories. He is a member of the ASJ and the IEICE.
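The SDR definition in (38) and the score scaling used for the mixed-OSQA objective can be sketched as below. This is a direct evaluation of (38) on complex spectrograms, not the BSS-Eval toolbox that the paper actually used, and the helper names (`sdr_db`, `z_pesq`, `z_stoi`, `z_mix`) are hypothetical. The scaling assumes that the PESQ term is 20.0 × (PESQ + 0.5) and that STOI is given in [0, 1], so that both scores span roughly 0 to 100.

```python
import numpy as np


def sdr_db(target, estimate):
    """Signal-to-distortion ratio of (38) for complex spectrograms of equal shape."""
    signal_power = np.sum(np.abs(target) ** 2)
    error_power = np.sum(np.abs(target - estimate) ** 2)
    return 10.0 * np.log10(signal_power / error_power)


def z_pesq(pesq):
    """Map a PESQ value to a 0-100 style scale (assumed parenthesization)."""
    return 20.0 * (pesq + 0.5)


def z_stoi(stoi):
    """Map a STOI value in [0, 1] to a percentage."""
    return 100.0 * stoi


def z_mix(pesq, stoi, gamma=0.5):
    """Mixed-OSQA score: weighted sum of the two scaled scores."""
    return gamma * z_pesq(pesq) + (1.0 - gamma) * z_stoi(stoi)


if __name__ == "__main__":
    s = np.array([[1.0 + 0.0j, 0.0 + 1.0j]])
    print(sdr_db(s, 0.9 * s))   # estimate with a 10 % amplitude error
    print(z_mix(2.55, 0.834))   # PESQ and STOI of P-MIX in Table IV
```

With the mixing coefficient set to 0.5, a change of 0.1 in PESQ and a change of 2 percentage points in STOI move the mixed score by the same amount, which matches the stated aim that both scores contribute equally.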
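The Monte Carlo gradient in (42) together with the log-likelihood of (46) to (48) can be illustrated for a single time frame as follows. This is a minimal sketch under stated assumptions, not the authors' implementation: the sampled masks and the scores B are taken as given inputs (the sampling algorithm and the exact definition of B are in earlier sections), the gradients are written out by hand with respect to the DNN outputs (mean mask and variance) instead of being back-propagated to the network parameters, and all function names are hypothetical.

```python
import numpy as np


def log_likelihood(g_sampled, g_mean, var, x):
    """Frame-wise log-likelihood of (46)-(48), up to an additive constant.

    g_sampled: sampled mask (n_freq,); g_mean: DNN mask estimate (n_freq,);
    var: DNN variance estimate (n_freq,); x: complex observation (n_freq,).
    """
    l_re = (g_sampled * x.real - g_mean * x.real) ** 2   # (47)
    l_im = (g_sampled * x.imag - g_mean * x.imag) ** 2   # (48)
    return -np.sum(np.log(var) + (l_re + l_im) / (2.0 * var))  # (46)


def grad_log_likelihood(g_sampled, g_mean, var, x):
    """Analytic gradients of (46) with respect to g_mean and var."""
    power = np.abs(x) ** 2
    diff = g_sampled - g_mean
    d_mean = diff * power / var
    d_var = -1.0 / var + (diff ** 2) * power / (2.0 * var ** 2)
    return d_mean, d_var


def policy_gradient_estimate(scores, g_samples, g_mean, var, x):
    """Average of score-weighted log-likelihood gradients over K samples, as in (42)."""
    d_mean = np.zeros_like(g_mean)
    d_var = np.zeros_like(var)
    for score, g_sampled in zip(scores, g_samples):
        gm, gv = grad_log_likelihood(g_sampled, g_mean, var, x)
        d_mean += score * gm
        d_var += score * gv
    k = len(scores)
    return d_mean / k, d_var / k
```

The sampled mask enters only as fixed "label data", so the gradient flows through the mean mask and the variance alone. Samples with a high score pull the mean mask toward themselves, and samples with a low or negative score push it away.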
Electrical Engineering and Systems Science – arXiv (Cornell University)
Published: Oct 22, 2018