Waveform Modeling and Generation Using Hierarchical Recurrent Neural Networks for Speech Bandwidth Extension

JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2007
arXiv:1801.07910v1 [cs.SD] 24 Jan 2018

Zhen-Hua Ling, Member, IEEE, Yang Ai, Yu Gu, and Li-Rong Dai

This work was partially funded by National Key Research and Development Project of China (Grant No. 2017YFB1002202) and the National Natural Science Foundation of China (Grants No. U1636201). Z.-H. Ling, Y. Ai, and L.-R. Dai are with the National Engineering Laboratory of Speech and Language Information Processing, University of Science and Technology of China, Hefei, 230027, China (e-mail: zhling@ustc.edu.cn, ay8067@mail.ustc.edu.cn, lrdai@ustc.edu.cn). Y. Gu is with Baidu Speech Department, Baidu Technology Park, Beijing, 100193, China (e-mail: guyu04@baidu.com). This work was done when he was a graduate student at the National Engineering Laboratory of Speech and Language Information Processing, University of Science and Technology of China.

Abstract: This paper presents a waveform modeling and generation method using hierarchical recurrent neural networks (HRNN) for speech bandwidth extension (BWE). Different from conventional BWE methods which predict spectral parameters for reconstructing wideband speech waveforms, this BWE method models and predicts waveform samples directly without using vocoders. Inspired by SampleRNN, which is an unconditional neural audio generator, the HRNN model represents the distribution of each wideband or high-frequency waveform sample conditioned on the input narrowband waveform samples using a neural network composed of long short-term memory (LSTM) layers and feed-forward (FF) layers. The LSTM layers form a hierarchical structure and each layer operates at a specific temporal resolution to efficiently capture long-span dependencies between temporal sequences. Furthermore, additional conditions, such as the bottleneck (BN) features derived from narrowband speech using a deep neural network (DNN)-based state classifier, are employed as auxiliary input to further improve the quality of generated wideband speech. The experimental results of comparing several waveform modeling methods show that the HRNN-based method can achieve better speech quality and run-time efficiency than the dilated convolutional neural network (DCNN)-based method and the plain sample-level recurrent neural network (SRNN)-based method. Our proposed method also outperforms the conventional vocoder-based BWE method using LSTM-RNNs in terms of the subjective quality of the reconstructed wideband speech.

Index Terms: speech bandwidth extension, recurrent neural networks, dilated convolutional neural networks, bottleneck features

I. INTRODUCTION

Speech communication is important in people's daily life. However, due to the limitation of transmission channels and the restriction of speech acquisition equipment, the bandwidth of speech signal is usually limited to a narrowband of frequencies. For example, the bandwidth of speech signal in the public switched telephone network (PSTN) is less than 4 kHz. The absence of high-frequency components of speech signal usually leads to low naturalness and intelligibility, such as the difficulty of distinguishing fricatives and similar voices. Therefore, speech bandwidth extension (BWE), which aims to restore the missing high-frequency components of narrowband speech using the correlations that exist between the low and high-frequency components of the wideband speech signal, has attracted the attention of many researchers. BWE methods can not only be applied to real-time voice communication, but also benefit other speech signal processing areas such as text-to-speech (TTS) synthesis [1], speech recognition [2], [3], and speech enhancement [4], [5].

Many researchers have made considerable efforts in the field of BWE. Some early studies adopted the source-filter model of speech production and attempted to restore high-frequency residual signals and spectral envelopes respectively from input narrowband signals. The high-frequency residual signals were usually estimated from the narrowband residual signals by spectral folding [6]. Estimating high-frequency spectral envelopes from narrowband signals is always a difficult task. To achieve this goal, simple methods, such as codebook mapping [7] and linear mapping [4], and statistical methods using Gaussian mixture models (GMMs) [8]–[11] and hidden Markov models (HMMs) [12]–[15], have been proposed. In statistical methods, acoustic models were built to represent the mapping relationship between narrowband spectral parameters and high-frequency spectral parameters. Although these statistical methods achieved better performance than simple mapping methods, the inadequate modeling ability of GMMs and HMMs may lead to over-smoothed spectral parameters, which constrains the quality of reconstructed speech signals [16].

In recent years, deep learning has become an emerging field in machine learning research. Deep learning techniques have been successfully applied to many signal processing tasks. In speech signal processing, neural networks with deep structures have been introduced to the speech generation tasks including speech synthesis [17], [18], voice conversion [19], [20], speech enhancement [21], [22], and so on. In the field of BWE, neural networks have also been adopted to predict either the spectral parameters representing vocal-tract filter properties [23]–[25] or the original log-magnitude spectra derived by short-time Fourier transform (STFT) [26], [27]. The studied model architectures included deep neural networks (DNN) [28]–[30], recurrent temporal restricted Boltzmann machines (RBM) [31], recurrent neural networks (RNN) with long short-term memory (LSTM) cells [32], and so on. These methods achieved better BWE performance than using conventional statistical models, like GMMs and HMMs, since deep-structured neural networks are more capable of modeling the complicated and nonlinear mapping relationship between input and output acoustic parameters.

However, all these existing methods are vocoder-based ones, which means vocoders are used to extract spectral parameters from narrowband waveforms and then to reconstruct waveforms from the predicted wideband or high-frequency spectral parameters. This may lead to two deficiencies. First, the parameterization process of vocoders usually degrades speech quality. For example, the spectral details are always lost in the reconstructed waveforms when low-dimensional spectral parameters, such as mel-cepstra or line spectral pairs (LSP), are adopted to represent spectral envelopes in vocoders. The spectral shapes of the noise components at voiced frames are always ignored when only F0 values and binary voiced/unvoiced flags are used to describe the excitation. Second, it is difficult to parameterize and to predict phase spectra due to the phase-wrapping issue. Thus, simple estimation methods, such as mirror inversion, are popularly used to predict the high-frequency phase spectra in existing methods [26], [32]. This also constrains the quality of the reconstructed wideband speech.

Recently, neural network-based speech waveform synthesizers, such as WaveNet [33] and SampleRNN [34], have been presented. In WaveNet [33], the distribution of each waveform sample conditioned on previous samples and additional conditions was represented using a neural network with dilated convolutional neural layers and residual architectures. SampleRNN [34] adopted recurrent neural layers with a hierarchical structure for unconditional audio generation. Inspired by WaveNet, a waveform modeling and generation method using stacked dilated CNNs for BWE has been proposed in our previous work [35], which achieved better subjective BWE performance than the vocoder-based approach utilizing LSTM-RNNs. On the other hand, the methods of applying RNNs to directly model and generate speech waveforms for BWE have not yet been investigated.

Therefore, this paper proposes a waveform modeling and generation method using RNNs for BWE. As discussed above, direct waveform modeling and generation can help avoid the spectral representation and phase modeling issues in vocoder-based BWE methods. Considering the sequence memory and modeling ability of RNNs and LSTM units, this paper adopts LSTM-RNNs to model and generate the wideband or high-frequency waveform samples directly given input narrowband waveforms. Inspired by SampleRNN [34], a hierarchical RNN (HRNN) structure is presented for the BWE task. There are multiple recurrent layers in an HRNN and each layer operates at a specific temporal resolution. Compared with plain sample-level deep RNNs, HRNNs are more capable and efficient at capturing long-span dependencies in temporal sequences. Furthermore, additional conditions, such as the bottleneck (BN) features [32], [36], [37] extracted from narrowband speech using a DNN-based state classifier, are introduced into HRNN modeling to further improve the performance of BWE.

The contributions of this paper are twofold. First, this paper makes the first successful attempt to model and generate speech waveforms directly at sample-level using RNNs for the BWE task. Second, various RNN architectures for waveform-based BWE, including plain sample-level LSTM-RNNs, HRNNs, and HRNNs with additional conditions, are implemented and evaluated in this paper. The experimental results of comparing several waveform modeling methods show that the HRNN-based method achieves better speech quality and run-time efficiency than the stacked dilated CNN-based method [35] and the plain sample-level RNN-based method. Our proposed method also outperforms the conventional vocoder-based BWE method using LSTM-RNNs in terms of the subjective quality of the reconstructed wideband speech.

This paper is organized as follows. In Section II, we briefly review previous BWE methods including vocoder-based ones and the dilated CNN-based one. In Section III, the details of our proposed method are presented. Section IV reports our experimental results, and conclusions are given in Section V.

II. PREVIOUS WORK

A. Vocoder-Based BWE Using Neural Networks

The vocoder-based BWE methods using DNNs or RNNs have been proposed in recent years [26], [32]. In these methods, spectral parameters such as logarithmic magnitude spectra (LMS) were first extracted by short-time Fourier transform (STFT) [38]. Then, DNNs or LSTM-RNNs were trained under the minimum mean square error (MMSE) criterion to establish a mapping relationship from the LMS of narrowband speech to the LMS of the high-frequency components of wideband speech. Some additional features extracted from narrowband speech, such as bottleneck features, can be used as auxiliary inputs to improve the performance of networks [32]. At the stage of reconstruction, the LMS of wideband speech were reconstructed by concatenating the LMS of input narrowband speech and the LMS of high-frequency components predicted by the trained DNN or LSTM-RNN. The phase spectra of wideband speech were usually generated by some simple mapping algorithms, such as mirror inversion [26]. Finally, the inverse FFT (IFFT) and the overlap-add algorithm were carried out to reconstruct the wideband waveforms from the predicted LMS and phase spectra.

The experimental results of previous work showed that LSTM-RNNs can achieve better performance than DNNs in the vocoder-based BWE [32]. Nevertheless, there are still some issues with the vocoder-based BWE approach as discussed in Section I, such as the quality degradation caused by the parameterization of vocoders and the inadequacy of restoring phase spectra.

B. Waveform-Based BWE Using Stacked Dilated CNNs

Recently, a novel waveform generation model named WaveNet was proposed [33] and has been successfully applied to the speech synthesis task [39]–[41]. This model utilizes stacked dilated CNNs to describe the autoregressive generation process of audio waveforms without using frequency analysis and vocoders. A stacked dilated CNN consists of many convolutional layers with different dilation factors. The length of its receptive field grows exponentially in terms of the network depth [33].

Motivated by this idea, a waveform modeling and generation method for BWE was proposed [35], which described the conditional distribution of the output wideband or high-frequency waveform sequence $y = [y_1, y_2, \ldots, y_T]$ conditioned on the input narrowband waveform sequence $x = [x_1, x_2, \ldots, x_T]$ using stacked dilated CNNs. Similar to WaveNet, the samples $x_t$ and $y_t$ were all discretized by 8-bit μ-law quantization [42] and a softmax output layer was adopted. Residual and parameterized skip connections together with gated activation functions were also employed to facilitate the training of deep networks and to accelerate the convergence of model estimation. Different from WaveNet, this method modeled the mapping relationship between two waveform sequences, not the autoregressive generation process of the output waveform sequence. Both causal and non-causal model structures were implemented and experimental results showed that the non-causal structure achieved better performance than the causal one [35]. The stacked dilated non-causal CNN, as illustrated in Fig. 1, described the conditional distribution as

$$p(y \mid x) = \prod_{t=1}^{T} p(y_t \mid x_{t-N/2}, x_{t-N/2+1}, \ldots, x_{t+N/2}), \tag{1}$$

where $N + 1$ is the length of the receptive field.

Fig. 1. The structure of stacked dilated non-causal CNNs [35].

At the extension stage, given input narrowband speech, each output sample was obtained by selecting the quantization level with maximum posterior probability. Finally, the generated waveforms were processed by a high-pass filter and then added with the input narrowband waveforms to reconstruct the final wideband waveforms. Experimental results showed that this method achieved better subjective BWE performance than the vocoder-based method using LSTM-RNNs [35].

III. PROPOSED METHODS

Inspired by SampleRNN [34], which is an unconditional audio generator containing recurrent neural layers with a hierarchical structure, this paper proposes waveform modeling and generation methods using RNNs for BWE. In this section, we first introduce the plain sample-level RNNs (SRNN) for waveform modeling. Then the structures of hierarchical RNNs (HRNN) and conditional HRNNs are explained in detail. Finally, the flowchart of BWE using RNNs is introduced.

A. Sample-Level Recurrent Neural Networks

The LSTM-RNNs for speech generation are usually built at frame-level in order to model the acoustic parameters extracted by vocoders with a fixed frame shift [32], [43]. It is straightforward to model and generate speech waveforms at sample-level using a similar LSTM-RNN framework. The structure of sample-level recurrent neural networks (SRNNs) for BWE is shown in Fig. 2, which is composed of a cascade of LSTM layers and feed-forward (FF) layers. Both the input waveform samples $x = [x_1, x_2, \ldots, x_T]$ and output waveform samples $y = [y_1, y_2, \ldots, y_T]$ are quantized to discrete values by μ-law. The embedding layer maps each discrete sample value $x_t$ to a real-valued vector $e_t$. The LSTM layers model the sequence of embedding vectors in a recurrent manner. When there is only one LSTM layer, the calculation process can be formulated as

$$h_t = \mathcal{H}(h_{t-1}, e_t), \tag{2}$$

where $h_t$ is the output of LSTM layers at time step $t$, and $\mathcal{H}$ represents the activation function of LSTM units. If there are multiple LSTM layers, their output can be calculated layer-by-layer. Then, $h_t$ passes through FF layers. The activation function of the last layer is a softmax function which generates the probability distribution of the output sample $y_t$ conditioned on the previous and current input samples $\{x_1, x_2, \ldots, x_t\}$ as

$$p(y_t \mid x_1, x_2, \ldots, x_t) = \mathrm{FF}(h_t), \tag{3}$$

where function $\mathrm{FF}$ denotes the calculation of FF layers.

Fig. 2. The structure of SRNNs for BWE, where concentric circles represent LSTM layers and inverted trapezoids represent FF layers.

Given a training set with parallel input and output waveform sequences, the model parameters of the LSTM and the FF layers are estimated using the cross-entropy cost function. At generation time, each output sample $y_t$ is obtained by maximizing the conditional probability distribution (3). Our preliminary and informal listening test showed that this generation criterion can achieve better subjective performance than generating random samples from the distribution. The random sampling is necessary for the conventional WaveNet and SampleRNN models because of their autoregressive architecture. However, the model structure shown in Fig. 2 is not an autoregressive one. The input waveforms provide the necessary randomness to synthesize the output speech, especially the unvoiced segments.

In an SRNN, the generation of each output sample depends on all previous and current input samples. However, this plain LSTM-RNN architecture still has some deficiencies for waveform modeling and generation. First, sample-level modeling makes it difficult to model long-span dependencies between input and output speech signals due to the significantly increased sequence length compared with frame-level modeling. Second, SRNNs suffer from the inefficiency of waveform generation due to the point-by-point calculation at all layers and the dimension expansion at the embedding layer. Therefore, inspired by SampleRNN [34], a hierarchical RNN (HRNN) structure is proposed in the next subsection to alleviate these problems.

B. Hierarchical Recurrent Neural Networks

The structure of HRNNs for BWE is illustrated in Fig. 3. Similar to SRNNs mentioned in Section III-A, HRNNs are also composed of LSTM layers and FF layers. Different from the plain LSTM-RNN structure of SRNNs, these LSTM and FF layers in HRNNs form a hierarchical structure of multiple tiers and each tier operates at a specific temporal resolution. The bottom tier (i.e., Tier 1 in Fig. 3) deals with individual samples and outputs sample-level predictions. Each higher tier operates on a lower temporal resolution (i.e., dealing with more samples per time step). Each tier conditions on the tier above it except the top tier. This model structure is similar to SampleRNN [34]. The main difference is that the original SampleRNN model is an unconditional audio generator which employs the history of output waveforms as network input and generates output waveforms in an autoregressive way. In contrast, the HRNN model shown in Fig. 3 describes the mapping relationship between two waveform sequences directly without considering the autoregressive property of output waveforms. This HRNN structure is specifically designed for BWE because narrowband waveforms are used as inputs in this task. Removing autoregressive connections can help reduce the computation complexity and facilitate parallel computing at generation time. Although conditional SampleRNNs have been developed and used as neural vocoders to reconstruct speech waveforms from acoustic parameters [44], they still follow the autoregressive framework and are different from HRNNs.

Fig. 3. The structure of HRNNs for BWE, where concentric circles represent LSTM layers and inverted trapezoids represent FF layers.

Assume an HRNN has $K$ tiers in total (e.g., $K = 3$ in Fig. 3). Tier 1 works at sample-level and the other $K - 1$ tiers are frame-level tiers since they operate at a temporal resolution lower than samples.

1) Frame-level tiers: The $k$-th tier ($1 < k \le K$) operates on frames composed of $L^{(k)}$ samples. The range of time step at the $k$-th tier, $t^{(k)}$, is determined by $L^{(k)}$. Denoting the quantized input waveforms as $x = [x_1, x_2, \ldots, x_T]$ and assuming that $L$ represents the sequence length of $x$ after zero-padding so that $L$ can be divisible by $L^{(K)}$, we can get

$$t^{(k)} \in T^{(k)} = \left\{1, 2, \ldots, \frac{L}{L^{(k)}}\right\}, \quad 1 < k \le K. \tag{4}$$

Furthermore, the relationship of temporal resolution between the $m$-th tier and the $n$-th tier ($1 < m < n \le K$) can be described as

$$T^{(n)} = \left\{ t^{(n)} \,\middle|\, t^{(n)} = \left\lceil \frac{t^{(m)}}{L^{(n)}/L^{(m)}} \right\rceil,\; t^{(m)} \in T^{(m)} \right\}, \tag{5}$$

where $\lceil \cdot \rceil$ represents the operation of rounding up. It can be observed from (5) that one time step of the $n$-th tier corresponds to $L^{(n)}/L^{(m)}$ time steps of the $m$-th tier. The frame inputs $f_t^{(k)}$ at the $k$-th tier ($1 < k \le K$) and the $t$-th time step can be written by framing and concatenation operations as

$$\tilde{f}_t^{(k)} = [x_{(t-1)L^{(k)}+1}, \ldots, x_{tL^{(k)}}]^{\top}, \tag{6}$$

$$f_t^{(k)} = [\tilde{f}_t^{(k)\top}, \ldots, \tilde{f}_{t+c^{(k)}-1}^{(k)\top}]^{\top}, \tag{7}$$

where $t \in T^{(k)}$, $\tilde{f}_t^{(k)}$ denotes the $t$-th waveform frame at the $k$-th tier, and $c^{(k)}$ is the number of concatenated frames at the $k$-th tier. We have $c^{(3)} = c^{(2)} = 1$ in the model structure shown in Fig. 3.

As shown in Fig. 3, the frame-level tiers are composed of LSTM layers. For the top tier (i.e., $k = K$), the LSTM units update their hidden states $h_t^{(K)}$ based on the hidden states of the previous time step $h_{t-1}^{(K)}$ and the input at the current time step $f_t^{(K)}$. If there is only one LSTM layer in the $K$-th tier, the calculation process can be formulated as

$$h_t^{(K)} = \mathcal{H}(h_{t-1}^{(K)}, f_t^{(K)}), \quad t \in T^{(K)}. \tag{8}$$

If the top tier is composed of multiple LSTM-RNN layers, the hidden states can be calculated layer-by-layer iteratively.

Due to the different temporal resolution at different tiers, the top tier generates $r^{(K)} = L^{(K)}/L^{(K-1)}$ conditioning vectors for the $(K-1)$-th tier at each time step $t \in T^{(K)}$. This is implemented by producing a set of $r^{(K)}$ separate linear projections of $h_t^{(K)}$ at each time step. For the intermediate tiers (i.e., $1 < k < K$), the processing of generating conditioning vectors is the same as that of the top tier. Thus, we can describe the conditioning vectors uniformly as

$$d_{(t-1)r^{(k)}+j}^{(k)} = W_j^{(k)} h_t^{(k)}, \quad j = 1, 2, \ldots, r^{(k)}, \quad t \in T^{(k)}, \tag{9}$$

where $1 < k \le K$ and $r^{(k)} = L^{(k)}/L^{(k-1)}$.
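The indexing in (4), (5), (6) and (9) can be made concrete with a minimal Python sketch. It is an illustration under stated assumptions rather than the implementation used in this work: the frame sizes `L2` and `L3`, the hidden size, the function names, and the scaled-identity projection matrices are hypothetical, and `toy_hidden_state` is only a placeholder for the LSTM output $h_t$.

```python
import math

L2, L3 = 4, 16   # hypothetical frame sizes L^(2), L^(3); not the settings of the paper
HIDDEN = 3       # hypothetical hidden size

def zero_pad(x, top_frame_size):
    """Pad x with zeros so that its length L is divisible by L^(K)."""
    padded_len = math.ceil(len(x) / top_frame_size) * top_frame_size
    return x + [0] * (padded_len - len(x))

def frames(x, frame_size):
    """Eq. (6): non-overlapping frames of frame_size samples (c^(k) = 1)."""
    return [x[i:i + frame_size] for i in range(0, len(x), frame_size)]

def toy_hidden_state(frame):
    """Placeholder for the LSTM output h_t; this is NOT a recurrent layer."""
    return [sum(frame) / len(frame), max(frame), min(frame)]

def matvec(W, v):
    """Plain matrix-vector product on nested lists."""
    return [sum(w * a for w, a in zip(row, v)) for row in W]

def conditioning_vectors(hidden_states, projections):
    """Eq. (9): d_{(t-1)r+j} = W_j h_t for j = 1..r, returned in time order."""
    return [matvec(W_j, h) for h in hidden_states for W_j in projections]

def parent_step(t_lower, ratio):
    """Eq. (5): 1-based step of the higher tier that covers step t_lower."""
    return math.ceil(t_lower / ratio)

x = zero_pad(list(range(1, 41)), L3)   # 40 dummy samples, padded to L = 48
r3 = L3 // L2                          # r^(3) = L^(3) / L^(2) = 4
tier3_frames = frames(x, L3)           # |T^(3)| = 48 / 16 = 3, Eq. (4)
tier2_frames = frames(x, L2)           # |T^(2)| = 48 / 4 = 12, Eq. (4)
h3 = [toy_hidden_state(f) for f in tier3_frames]

# r^(3) separate projections W_1..W_r; here W_j = j * identity for readability.
W = [[[float(j + 1) if a == b else 0.0 for b in range(HIDDEN)]
      for a in range(HIDDEN)] for j in range(r3)]

d3 = conditioning_vectors(h3, W)       # one conditioning vector per Tier-2 step
```

With these toy values, Tier 3 runs for 3 time steps and emits 12 conditioning vectors, one for each of the 12 time steps of Tier 2, which is the upsampling role that the separate projections play in (9).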
The input vectors of the LSTM layers at intermediate tiers are different from those of the top tier. For the $k$-th tier ($1 < k < K$), the input vector $i_t^{(k)}$ at the $t$-th time step is composed of a linear combination of the frame inputs $f_t^{(k)}$ and the conditioning vectors $d_t^{(k+1)}$ given by the $(k+1)$-th tier as

$$i_t^{(k)} = W^{(k)} f_t^{(k)} + d_t^{(k+1)}, \quad t \in T^{(k)}. \tag{10}$$

Thus, the output of the LSTM layer at the $k$-th tier ($1 < k < K$) can be calculated as

$$h_t^{(k)} = \mathcal{H}(h_{t-1}^{(k)}, i_t^{(k)}), \quad t \in T^{(k)}. \tag{11}$$

2) Sample-level tier: The sample-level tier (i.e., Tier 1 in Fig. 3) gives the probability distribution of the output sample $y_t$ conditioned on the current input sample $x_t$ (i.e., $L^{(1)} = 1$) together with the conditioning vector $d_t^{(2)}$ passed from the above tier which encodes history information of the input sequence, where $t \in T^{(1)} = \{1, 2, \ldots, L/L^{(1)}\}$. Since $x_t$ and $y_t$ are individual samples, it is convenient to model the correlation among them using a memoryless structure such as FF layers. First, $x_t$ is mapped into a real-valued vector $e_t$ by an embedding layer. These embedding vectors form the input at each time step of the sample-level tier, i.e.,

$$f_t^{(1)} = [e_t^{\top}, \ldots, e_{t+c^{(1)}-1}^{\top}]^{\top}, \tag{12}$$

where $t \in T^{(1)}$ and $c^{(1)}$ is the number of concatenated sample embeddings at the sample-level tier. In the model structure shown in Fig. 3, $c^{(1)} = 1$. Then, the input of the FF layers is a linear combination of $f_t^{(1)}$ and $d_t^{(2)}$ as

$$i_t^{(1)} = W^{(1)} f_t^{(1)} + d_t^{(2)}, \quad t \in T^{(1)}. \tag{13}$$

Finally, we can obtain the conditional probability distribution of the output sample $y_t$ by passing $i_t^{(1)}$ through the FF layers. The activation function of the last FF layer is a softmax function. The output of FF layers describes the conditional distribution

$$p\left(y_t \mid x_1, x_2, \ldots, x_{(\lceil t/L^{(K)} \rceil + c^{(K)} - 1)L^{(K)}}\right) = \mathrm{FF}(i_t^{(1)}), \tag{14}$$

where $t \in T^{(1)}$.

It is worth mentioning that the structure shown in Fig. 3 is non-causal, which utilizes future input samples together with current and previous input samples to predict the current output sample (e.g., using $x_1, \ldots, x_{L^{(3)}}$ to predict $y_1$ in Fig. 3). Generally speaking, at most $c^{(K)} L^{(K)} - 1$ input samples after the current time step are necessary in order to predict the current output sample according to (14). This is also a difference between our HRNN model and SampleRNN, which has a causal and autoregressive structure.

Similar to SRNNs, the parameters of HRNNs are estimated using the cross-entropy cost function given a training set with parallel input and output sample sequences. At generation time, each $y_t$ is predicted using the conditional probability distribution in (14).

C. Conditional Hierarchical Recurrent Neural Networks

Some frame-level auxiliary features extracted from input narrowband waveforms, such as bottleneck (BN) features [36], have shown their effectiveness in improving the performance of vocoder-based BWE [32]. In order to combine such auxiliary inputs with the HRNN model introduced in Section III-B, a conditional HRNN structure is designed as shown in Fig. 4. Compared with HRNNs, conditional HRNNs add an additional tier named the conditional tier on the top. The input features of the conditional tier are frame-level auxiliary feature vectors extracted from input waveforms rather than waveform samples. Assume the total number of tiers in a conditional HRNN is $K$ (e.g., $K = 4$ in Fig. 4) and let $L^{(K)}$ denote the frame shift of auxiliary input features. Equations (4) and (5) in Section III-B still hold here. Similar to the introductions in Section III-B, the frame inputs at the conditional tier can be written as

$$c_t = [c_1^t, c_2^t, \ldots, c_d^t], \quad t \in T^{(K)}, \tag{15}$$

where $c_d^t$ represents the $d$-th dimension of the auxiliary feature vector at time $t$. Then the calculations of (8)-(13) for HRNNs are followed. Finally, the conditional probability distribution for generating $y_t$ can be written as

$$p\left(y_t \mid x_1, \ldots, x_{(\lceil t/L^{(K)} \rceil + c^{(K)} - 1)L^{(K)}}, c_1, c_2, \ldots, c_{\lceil t/L^{(K)} \rceil}\right) = \mathrm{FF}(i_t^{(1)}), \tag{16}$$

where $t \in T^{(1)}$ and $\{c_1, c_2, \ldots, c_{\lceil t/L^{(K)} \rceil}\}$ are additional conditions introduced by the auxiliary input features.

Fig. 4. The structure of conditional HRNNs for BWE, where concentric circles represent LSTM layers and inverted trapezoids represent FF layers.

D. BWE Using SRNNs and HRNNs

The flowchart of BWE using SRNNs or HRNNs is illustrated in Fig. 5. There are two mapping strategies. One is to map the narrowband waveforms towards their corresponding wideband counterparts (named WB strategy in the rest of this paper) and the other is to map the narrowband waveforms towards the waveforms of the high-frequency component of wideband speech (named HF strategy).

A database with wideband speech recordings is used for model training. At the training stage, the input narrowband waveforms are obtained by downsampling the wideband waveforms. To guarantee the length consistency between the input and output sequences, the narrowband waveforms are then upsampled to the sampling rate of the wideband speech with zero high-frequency components. The upsampled narrowband waveforms are used as the model input. The output waveforms are either the unfiltered wideband waveforms (WB strategy) or the high-frequency waveforms (HF strategy). The high-frequency waveforms are obtained by sending wideband speech into a high-pass filter and an amplifier for reducing quantization noise, as shown by the dotted lines in Fig. 5. Before the waveforms are used for model training, all the input and output waveform samples are discretized by 8-bit μ-law quantization. The model parameters of SRNNs or HRNNs are trained under the cross-entropy (CE) criterion, which optimizes the classification accuracy of discrete output samples on the training set.

Fig. 5. The flowchart of our proposed BWE methods.

At the extension stage, the upsampled and quantized narrowband waveforms are fed into the trained SRNNs or HRNNs to generate the probability distributions of output samples. Then each output sample is obtained by selecting the quantization level with maximum posterior probability. Later, the quantized output samples are decoded into continuous values using the inverse mapping of μ-law quantization. A deamplification process is conducted for the HF strategy in order to compensate for the effect of amplification at training time. Finally, the generated waveforms are high-pass filtered and added with the input narrowband waveforms to generate the final wideband waveforms.

Particularly for conditional HRNNs, BN features are used as auxiliary input in our implementation as shown by the gray lines in Fig. 5. BN features can be regarded as a compact representation of both linguistic and acoustic information [36]. Here, BN features are extracted by a DNN-based state classifier, which has a bottleneck layer with a smaller number of hidden units than that of other hidden layers. The inputs of the DNN are mel-frequency cepstral coefficients (MFCC) extracted from narrowband speech and the outputs are the posterior probabilities of HMM states. The DNN is trained under the cross-entropy (CE) criterion and is used as the BN feature extractor at extension time.

IV. EXPERIMENTS

A. Experimental Setup

The TIMIT corpus [45], which contained English speech from multiple speakers with 16 kHz sampling rate and 16-bit resolution, was adopted in our experiments. We chose 3696 and 1153 utterances to construct the training set and validation set respectively. Another 192 utterances from the speakers not included in the training set and validation set were used as the test set to evaluate the performance of different BWE methods. In our experiments, the narrowband speech waveforms sampled at 8 kHz were obtained by downsampling the wideband speech at 16 kHz.

Examples of reconstructed speech waveforms in our experiments can be found at http://home.ustc.edu.cn/~ay8067/IEEEtran/demo.html.

Five BWE systems were constructed for comparison in our experiments. The descriptions of these systems are as follows.

• VRNN: Vocoder-based BWE method using LSTM-RNNs as introduced in Section II-A. The DRNN-BN system in [32] was used here for comparison, which predicted the LMS of high-frequency components using a deep LSTM-RNN with auxiliary BN features. The backpropagation through time (BPTT) algorithm was used to train the LSTM-RNN model based on the minimum mean square error (MMSE) criterion. In this system, a DNN-based state classifier was built to extract BN features. Eleven frames of 39-dimensional narrowband MFCCs were used as the input of the DNN classifier and the posterior probabilities of 183 HMM states for 61 monophones were regarded as the output of the DNN classifier. The DNN classifier adopted 6 hidden layers where there were 100 hidden units at the BN layer and 1024 hidden units at other hidden layers. The BN layer was set as the fifth hidden layer so that the extractor could capture more linguistic information. This BN feature extractor was also used in the CHRNN system.

• DCNN: Waveform-based BWE method using stacked dilated CNNs as introduced in Section II-B. The CNN2-HF system in [35] was used here for comparison, which predicted high-frequency waveforms using non-causal CNNs and performed better than other configurations.

• SRNN: Waveform-based BWE method using sample-level RNNs as introduced in Section III-A. The built model had two LSTM layers and two FF layers. Both the LSTM layers and the FF layers had 1024 hidden units and the embedding size was 256. The model was trained by stochastic gradient descent with a mini-batch size of 64 to minimize the cross entropy between the predicted and real probability distributions. Zero-padding was applied to make all the sequences in a mini-batch have the same length and the cost values of the added zero samples were ignored when computing the gradients. An Adam optimizer [46] was used to update the parameters with an initial learning rate of 0.001. The truncated backpropagation through time (TBPTT) algorithm was employed to improve the efficiency of model training and the truncated length was set to 480.

• HRNN: Waveform-based BWE method using HRNNs as introduced in Section III-B. The HRNN was composed of 3 tiers with two FF layers in Tier 1 and one LSTM layer each in Tier 2 and Tier 3. Therefore, there were two LSTM layers and two FF layers in total, which was the same as the SRNN system. The values of $c^{(k)}$, $k \in \{1, 2, 3\}$, in (14) and (19) were set as $c^{(3)} = c^{(2)} = 2$, $c^{(1)} = L^{(2)}$ in our experiments after tuning on the validation set. Some other setups, such as the dimension of the hidden units and the training method, were the same as those of the SRNN system mentioned above. The frame size configurations of the HRNN model will be discussed in Section IV-B.

• CHRNN: Waveform-based BWE method using conditional HRNNs as introduced in Section III-C. The BN features extracted by the DNN state classifier used by the VRNN system were adopted as auxiliary conditions. The model was composed of 4 tiers. The top conditional tier had one LSTM layer with 1024 hidden units and the other three tiers were the same as the HRNN system. Some basic setups and the training method were the same as the HRNN system. The setup of the conditional tier will be introduced in detail in Section IV-E.

In our experiments, we first investigated the influence of frame sizes and mapping strategies (i.e., the WB and HF strategies introduced in Section III-D) on the performance of the HRNN system. Then, the comparison between different waveform-based BWE methods including the DCNN, SRNN and HRNN systems was carried out. Later, the effect of introducing BN features to HRNNs was studied by comparing the HRNN system and the CHRNN system. Finally, our proposed waveform-based BWE method was compared with the conventional vocoder-based one.

B. Effects of Frame Sizes on HRNN-Based BWE

As introduced in Section III-B, the frame sizes $L^{(k)}$ are key parameters that make an HRNN model different from the conventional sample-level RNN.

Fig. 6. Accuracy and efficiency comparison for HRNN-based BWE with different $(L^{(3)}, L^{(2)})$ configurations and using (a) WB and (b) HF mapping strategies.

TABLE I
AVERAGE PESQ SCORES WITH 95% CONFIDENCE INTERVALS ON THE TEST SET WHEN USING WB AND HF MAPPING STRATEGIES FOR HRNN-BASED BWE.

              Narrowband       HRNN-WB          HRNN-HF
PESQ score    3.63 ± 0.0636    3.53 ± 0.0438    3.75 ± 0.0456

C. Effects of Mapping Strategy on HRNN-Based BWE

It can be observed from Fig. 6 that the HF strategy achieved much lower classification accuracy than the WB strategy. It is reasonable since it is more difficult to predict the aperiodic and noise-like high-frequency waveforms than to predict wideband waveforms. Objective and subjective evaluations were conducted to investigate which strategy can achieve better performance for the HRNN-based BWE.

Since it is improper to compare the classification accuracy of these two strategies directly, the score of Perceptual Eval-
In this experiment, we uation of Speech Quality (PESQ) for wideband speech (ITU- (k) studied the effect of L on the performance of HRNN- T P.862.2) [47] was adopted as the objective measurement based BWE. The HRNN models with several configurations here. We utilized the clean wideband speech as reference and (3) (2) of (L ; L ) were trained and their accuracy and efficiency calculated the PESQ scores of the 192 utterances in the test were compared as shown in Fig. 6. Here, the classification set generated using WB and HF strategies (i.e., the HRNN- accuracy of predicting discrete waveform samples in the WB system and the HRNN-HF system) respectively. For validation set was used to measure the accuracy of different comparison, the PESQ scores of the upsampled narrowband models. The total time of generating 1153 utterances in utterances (i.e., with empty high-frequency components) were also calculated. The average PESQ scores and their 95% confi- the validation set with mini-batch size of 64 on a single dence intervals are shown in Table I. The differences between Tesla K40 GPU was used to measure the run-time efficiency. any two of the three systems were significant according to the Both the WB and HF mapping strategies were considered results of paired t-tests (p < 0:001). From Table I, we can see in this experiment. From the results shown in Fig. 6, we that the HF strategy achieved higher PESQ score than the WB can see that there existed conflict between the accuracy and strategy. The average PESQ of the HRNN-WB system was the efficiency of the trained HRNN models. Using smaller (3) (2) even lower than that of the upsampled narrowband speech. frame sizes of (L ; L ) improved the accuracy of sample This may be attributed to that the model in the HRNN-WB prediction while increased the computational complexity at the system aimed to reconstruct the whole wideband waveforms extension stage for both the WB and HF strategies. 
Finally, (3) (2) we chose (L ; L ) = (16; 4) as a trade-off and used this and was incapable of generating high-frequency components configuration for building the HRNN system in the following as accurately as the HRNN-HF system. experiments. A 3-point comparison category rating (CCR) [48] test A JOURNAL OF LT X CLASS FILES, VOL. 6, NO. 1, JANUARY 2007 8 TABLE II O BJECTIVE PERFORMANCE OF THE DCNN, SRNN AND HRNN SYSTEMS ON THE TEST SET. DCNN SRNN HRNN Accuracy (%) 7.180.336 7.400.387 7.520.388 PESQ score 3.620.0532 3.700.0477 3.750.0456 SNR (dB) 19.060.5983 18.950.6053 19.000.6099 SNR-V (dB) 26.140.7557 26.060.7648 26.210.7716 SNR-U (dB) 10.490.4094 10.320.4126 10.260.4124 LSD (dB) 8.460.122 8.610.136 8.300.127 LSD-V (dB) 7.710.172 8.090.203 8.020.194 LSD-U (dB) 9.340.124 9.190.124 8.570.107 Generation time (s) 3.97 19.39 3.61 used in Section IV-C were adopted as objective measurements. Besides, two extra metrics were adopted here, including signal- to-noise ratio (SNR) [40] which measured the distortion of waveforms and log spectral distance (LSD) [40] which Fig. 7. Average CCR scores of comparing five system pairs, including (1) HRNN-HF vs. HRNN-WB, (2) HRNN vs. DCNN, (3) HRNN vs. SRNN, (4) reflected the distortion in frequency domain. The SNR and CHRNN vs. HRNN, and (5) CHRNN vs. VRNN. The error bars represent LSD for voiced frames (denoted by SNR-V and LSD-V) 95% confidence intervals and the numerical values in parentheses represent and unvoiced frames (denoted by SNR-U and LSD-U) were the p-value of one-sample t-test for different system pairs. also calculated separately for each system. For the fairness of efficiency comparison, we set the mini-batch size as 1 for all was conducted on the Amazon Mechanical Turk (AMT) the three systems when generating utterances in the test set. 
crowdsourcing platform (https://www.mturk.com) to compare The time of generating 1 second speech (i.e., 16000 samples the subjective performance of the HRNN-WB and HRNN-HF for 16kHz speech) using a Tesla K40 GPU was recorded as the measurement of efficiency in this experiment. systems. The wideband waveforms of 20 utterances randomly selected from the test set were reconstructed by the HRNN- Table II shows the objective performance of the three WB and HRNN-HF systems. Each pair of generated wideband systems on the test set. The 95% confidence intervals were speech were evaluated in random order by 15 native English also calculated for all metrics except the generation time. The results of paired t-tests indicated that the differences listeners after rejecting improper listeners based on anti- between any two of the three systems on all metrics were cheating considerations [49]. The listeners were asked to judge significant (p < 0:01). For accuracy and PESQ score, the which utterance in each pair had better speech quality or DCNN system was not as good as the other two systems. there was no preference. Here, the HRNN-WB system was used as the reference system. The CCR scores of +1, -1, The HRNN system achieved the best performance on both and 0 denoted that the wideband utterance reconstructed by accuracy and PESQ score. For SNR, the HRNN system and the evaluated system, i.e., the HRNN-HF system, sounded the DCNN system achieved the best performance on voiced better than, worse than, or equal to the sample generated by segments and unvoiced segments respectively. For LSD, the the reference system in each pair. We calculated the average HRNN system achieved the lowest overall LSD and the lowest CCR score and its 95% confidence interval through all pairs of LSD of unvoiced segments. On the other hand, the DCNN utterances listened by all listeners. 
Besides, one-sample t-test system achieved the lowest LSD of voiced frames among the was also conducted to judge whether there was a significant three systems. Considering that LSDs were calculated using difference between the average CCR score and 0 (i.e., to only amplitude spectra while SNRs were influenced by both judge whether there was a significant difference between two amplitude and phase spectra of the reconstructed waveforms, it systems) by examining the p-value. The results are shown as can be inferred that the HRNN system was better at restoring the first system pair in Fig. 7, which suggests that the HRNN- the phase spectra of voiced frames than the DCNN system HF system outperformed the HRNN-WB system significantly. according to the SNR-V and LSD-V results of these two This is consistent with the results of comparing these two systems shown in Table II. In terms of the efficiency, the strategies when dilated CNNs were used to model waveforms generation time of the SRNN system was more than 5 times for the BWE task [35]. Therefore, the HF strategy was adopted longer than that of the HRNN system due to the sample- in the following experiments for building waveform-based by-sample calculation at all layers in the SRNN structure as BWE systems. discussed in Section III-A. Also, the efficiency of the DCNN system was slightly worse than that of the HRNN system. The results reveal that HRNNs can help improve both the accuracy D. Model Comparison for Waveform-Based BWE and efficiency of SRNNs by modeling long-span dependencies among sequences using a hierarchical structure. The performance of three waveform-based BWE systems, i.e., the DCNN, SRNN and HRNN systems, were compared The spectrograms extracted from clean wideband speech by objective and subjective evaluations. 
The accuracy and and the output of BWE using the DCNN, SRNN and HRNN efficiency metrics used in Section IV-B and the PESQ score systems for an example sentence in the test set are shown A JOURNAL OF LT X CLASS FILES, VOL. 6, NO. 1, JANUARY 2007 9 TABLE III O BJECTIVE PERFORMANCE OF THE HRNN AND CHRNN SYSTEMS ON THE TEST SET TOGETHER WITH THE p VALUES OF PAIRED t-TESTS. HRNN CHRNN p-value Accuracy (%) 7.520.388 7.460.385 <0.001 PESQ score 3.750.0456 3.790.0394 <0.001 SNR (dB) 19.000.6099 18.990.5946 0.322 SNR-V (dB) 26.210.7716 26.130.7539 <0.001 SNR-U (dB) 10.260.4124 10.340.4097 <0.001 LSD (dB) 8.300.127 8.270.123 0.301 LSD-V (dB) 8.020.194 7.890.185 <0.001 LSD-U (dB) 8.570.107 8.660.103 <0.01 Generation time (s) 3.61 4.17 – features was 100 and the frame size at the top conditional tier (4) was L = 160 because the frame shift of BN features was 10ms, corresponding to 160 samples for 16kHz speech. The objective measurements used in Section IV-D were adopted here to compare the HRNN and CHRNN systems. The results are shown in Table III. The CHRNN system outperformed the HRNN system on PESQ score while its prediction accuracy was not as good as the HRNN system. For SNR, these two systems achieved similar performance. The results of LSD show that the CHRNN system was better at reconstructing voiced frames and the HRNN system was on Fig. 8. The spectrograms of clean wideband speech and the output of BWE using five systems for an example sentence in the test set. the contrary. In terms of efficiency, the generation time of the CHRNN system was higher than that of the HRNN system due to the extra conditional tier. in Fig. 8. 
It can be observed that the high-frequency energy A 3-point CCR test was also conducted to evaluate the of some unvoiced segments generated by the DCNN system subjective performance of the CHRNN system by using the was much weaker than that of the natural speech and the HRNN system as the reference system and following the outputs of the SRNN and HRNN systems. Compared with the evaluation configurations introduced in Section IV-C. The SRNN and HRNN systems, the DCNN system was better at results are shown as the fourth system pairs in Fig. 7, which reconstructing the high-frequency harmonic structures of some reveal that utilizing BN features as additional conditions in voiced segments. These observations are in line with the LSD HRNN-based BWE can improve the subjective quality of results discussed earlier. reconstructed wideband speech significantly. Fig. 8 also shows Furthermore, two 3-point CCR tests were carried out to the spectrogram of the wideband speech generated by the evaluate the subjective performance of the HRNN system CHRNN system for an example sentence. Comparing the by using the DCNN system and the SRNN system as the spectrograms produced by the HRNN system and the CHRNN reference system respectively. The configurations of the tests system, we can observe that the high-frequency components were the same as the ones introduced in Section IV-C. The generated by the CHRNN system were stronger than the results are shown as the second and third system pairs in HRNN system. This may lead to better speech quality as Fig. 7. We can see that our proposed HRNN-based method shown in Fig. 7. generated speech with significantly better quality than the dilated CNN-based method. Compared with the SRNN system, the HRNN system was slightly better while the superiority was F. Comparison between Waveform-Based and Vocoder-Based insignificant at 0.05 significance level. 
However, the HRNN BWE Methods system was much more efficient than the SRNN system at Finally, we compared the performance of vocoder-based generation time as shown in Table II. and waveform-based BWE methods by conducting objective and subjective evaluations between the VRNN system and E. Effects of Additional Conditions on HRNN-Based BWE the CHRNN system since both systems adopted BN features We compared the HRNN system with the CHRNN system as auxiliary input. The objective results including PESQ, by objective and subjective evaluations to explore the effects SNR and LSD are shown in Table IV. The CHRNN system of additional conditions on HRNN-based BWE. As introduced achieved significantly better SNR than that of the VRNN in Section IV-A, the BN features were used as additional system, which suggested that our proposed waveform-based conditions in the CHRNN system since they can provide method can restore the phase spectra more accurately than the linguistic-related information besides the acoustic waveforms. conventional vocoder-based method. For PESQ and LSD, the The CHRNN system adopted the conditional HRNN structure CHRNN system was not as good as the VRNN system. This introduced in Section III-C with 4 tiers. The dimension of BN is reasonable considering that the VRNN system modeled and A JOURNAL OF LT X CLASS FILES, VOL. 6, NO. 1, JANUARY 2007 10 TABLE IV predicting current output sample. The maximal latencies of the O BJECTIVE PERFORMANCE OF THE VRNN AND CHRNN SYSTEMS ON THE VRNN system and the CHRNN system were both determined TEST SET TOGETHER W ITH THE p VALUES OF PAIRED t- TESTS. by the window size of STFT for extracting LMS and MFCC parameters, which was 25 ms in our implementation. The VRNN CHRNN p value PESQ score 3.870.0368 3.790.0394 <0.001 maximal latencies of the other three systems depended on their SNR (dB) 17.760.6123 18.990.5946 <0.001 structures. 
The SRNN system processed input waveforms and SNR-V (dB) 25.000.7333 26.130.7539 <0.001 generate output waveforms sample-by-sample without latency SNR-U (dB) 9.010.424 10.340.4097 <0.001 LSD (dB) 6.690.110 8.270.123 <0.001 according to (3). Because the non-causal CNN structure shown LSD-V (dB) 6.860.148 7.890.185 <0.001 in Fig. 1 was adopted by the DCNN system and its receptive LSD-U (dB) 6.450.0972 8.660.103 <0.001 field length was about 64ms [35], it made the highest latency among the five systems. The latency of the HRNN system was TABLE V relatively short because the number of concatenated frames M AXIM AL LATENCIES (ms) OF THE FIVE BWE SYSTEMS. T HE SAMPLING (3) and the frame size of the top tier were small (c = 2 and RATE OF W IDEBAND WAVEFORMS IS f = 16kHz . (3) L = 16). Maximal Latency Remarks 2) Run-time efficiency of waveform-based BWE WS: window size in ms of STFT VRNN WS = 25 One deficiency of the waveform-based BWE methods is that for extracting spectral parameters. they are very time-consuming at generation time. As shown N=2 DCNN = 32 N + 1: length of receptive field. in Table II and Table III, the HRNN system achieved the best SRNN 0 None run-time efficiency among the four waveform-based systems, (3) (3) (3) (3) c L 1 c , L : number of concatenated HRNN = 1:9375 frames, frame size at Tier 3. which still took 3.61 seconds to generate 1 second speech WS: window size in ms of STFT in our current implementation. Therefore, to accelerate the CHRNN WS = 25 for extracting spectral parameters. computation of HRNNs is an important task of our future work. As shown in Fig. 6, using longer frame sizes may help reduce the computational complexity of HRNNs. Another predicted LMS directly which were used in the calculation of possible way is to reduce the number of hidden units and PESQ and LSD. 
A 3-point CCR test was also conducted to other model parameters similar to the attempt of accelerating evaluate the subjective performance of the CHRNN system by WaveNet for speech synthesis [39]. using the VRNN system as the reference system and following the evaluation configuratioins introduced in Section IV-C. The V. C ONCLUSION results are shown as the fifth system pairs in Fig. 7. We can see that the CCR score was high than 0 significantly which In this paper, we have proposed a novel waveform modeling indicates that the CHRNN system can achieve significantly and generation method using hierarchical recurrent neural higher quality of reconstructed wideband speech than the networks (HRNNs) to fulfill the speech bandwidth extension VRNN system. (BWE) task. HRNNs adopt a hierarchy of recurrent modules Comparing the spectrograms produced by the VRNN system to capture long-span dependencies between input and output and the CHRNN system in Fig. 8, it can be observed waveform sequences. Compared with the plain sample-level that the CHRNN system performed better than the VRNN RNN and the stacked dilated CNN, the proposed HRNN model achieves better accuracy and efficiency of predicting system in generating the high-frequency harmonics for voiced high-frequency waveform samples. Besides, additional con- sounds. Besides, the high-frequency components generated ditions, such as the bottleneck features (BN) extracted from by the CHRNN system were less over-smoothed and more narrowband speech, can further improve subjective quality of natural than that of the VRNN system at unvoiced segments. reconstructed wideband speech. The experimental results show Furthermore, there was a discontinuity between the low- that our proposed HRNN-based method achieves higher sub- frequency and high-frequency spectra of the speech generated jective preference scores than the conventional vocoder-based the VRNN system, which was also found in other vocoder- method using LSTM-RNNs. 
To evaluate the performance of based BWE method [26]. As shown in Fig. 8, the waveform- our proposed methods using practical band-limited speech based systems alleviated this discontinuity effectively. These data, to improve the efficiency of waveform generation using experimental results indicate the superiority of modeling and HRNNs, and to utilize other types of additional conditions will generating speech waveforms directly over utilizing vocoders for feature extraction and waveform reconstruction on the be the tasks of our future work. BWE task. REFERENCES G. Analysis and Discussion [1] K. Nakamura, K. Hashimoto, K. Oura, Y. Nankaku, and K. Tokuda, “A mel-cepstral analysis technique restoring high frequency components 1) Maximal latency of different BWE systems from low-sampling-rate speech,” in Proc. Interspeech, 2014. Some application scenarios have strict requirement on the [2] A. Albahri, C. S. Rodriguez, and M. Lech, “Artificial bandwidth extension to improve automatic emotion recognition from narrow-band latency of BWE algorithm. We compared the maximal latency coded speech,” in Proc. ICSPCS, 2016, pp. 1–7. of the five BWE systems listed in Section IV-A and the [3] M. M. Goodarzi, F. Almasganj, J. Kabudian, Y. Shekofteh, and results are shown in Table V. Here, the latency refers to I. S. Rezaei, “Feature bandwidth extension for Persian conversational the duration of future input samples that are necessary for telephone speech recognition,” in Proc. ICEE, 2012, pp. 1220–1223. A JOURNAL OF LT X CLASS FILES, VOL. 6, NO. 1, JANUARY 2007 11 [4] S. Chennoukh, A. Gerrits, G. Miet, and R. Sluijter, “Speech enhancement [29] J. Abel, M. Strake, and T. Fingscheidt, “Artificial bandwidth extension via frequency bandwidth extension using line spectral frequencies,” in using deep neural networks for spectral envelope estimation,” in Proc. Proc. ICASSP, vol. 1, 2001, pp. 665–668. IWAENC, 2016, pp. 1–5. [30] Y. Gu and Z.-H. 
Ling, “Restoring high frequency spectral envelopes [5] F. Mustiere, ` M. Bouchard, and M. Bolic, ´ “Bandwidth extension for using neural networks for speech bandwidth extension,” in Proc. IJCNN, speech enhancement,” in Proc. CCECE, 2010, pp. 1–4. 2015, pp. 1–8. [6] J. Makhoul and M. Berouti, “High-frequency regeneration in speech [31] Y. Wang, S. Zhao, J. Li, and J. Kuang, “Speech bandwidth extension coding systems,” in Proc. ICASSP, vol. 4, 1979, pp. 428–431. using recurrent temporal restricted Boltzmann machines,” IEEE Signal [7] S. Vaseghi, E. Zavarehei, and Q. Yan, “Speech bandwidth extension: Processing Letters, vol. 23, no. 12, pp. 1877–1881, 2016. extrapolations of spectral envelop and harmonicity quality of excitation,” [32] Y. Gu, Z.-H. Ling, and L.-R. Dai, “Speech bandwidth extension using in Proc. ICASSP, vol. 3, 2006, pp. III–III. bottleneck features and deep recurrent neural networks.” in Proc. [8] H. Pulakka, U. Remes, K. Palomaki, ¨ M. Kurimo, and P. Alku, “Speech Interspeech, 2016, pp. 297–301. bandwidth extension using Gaussian mixture model-based estimation of [33] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, the highband mel spectrum,” in Proc. ICASSP, 2011, pp. 5100–5103. A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “WaveNet: [9] Y. Wang, S. Zhao, Y. Yu, and J. Kuang, “Speech bandwidth extension A generative model for raw audio,” arXiv preprint arXiv:1609.03499, based on GMM and clustering method,” in Proc. CSNT, 2015, pp. 437– [34] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, [10] Y. Ohtani, M. Tamura, M. Morita, and M. Akamine, “GMM-based A. Courville, and Y. Bengio, “SampleRNN: An unconditional end-to- bandwidth extension using sub-band basis spectrum model.” in Proc. end neural audio generation model,” arXiv preprint arXiv:1612.07837, Interspeech, 2014, pp. 2489–2493. [11] Y. Zhang and R. Hu, “Speech wideband extension based on Gaussian [35] Y. Gu and Z.-H. 
Ling, “Waveform modeling using stacked dilated mixture model,” Chinese Journal of Acoustics, no. 4, pp. 363–377, 2009. convolutional neural networks for speech bandwidth extension,” in Proc. [12] G.-B. Song and P. Martynovich, “A study of HMM-based bandwidth Interspeech, 2017, pp. 1123–1127. extension of speech signals,” Signal Processing, vol. 89, no. 10, pp. [36] D. Yu and M. L. Seltzer, “Improved bottleneck features using pretrained 2036–2044, 2009. deep neural networks.” in Proc. Interspeech, 2011, pp. 237–240. [13] Z. Yong and L. Yi, “Bandwidth extension of narrowband speech based [37] Z. Wu, C. Valentini-Botinhao, O. Watts, and S. King, “Deep neural on hidden Markov model,” in Proc. ICALIP, 2014, pp. 372–376. networks employing multi-task learning and stacked bottleneck features [14] P. Bauer and T. Fingscheidt, “An HMM-based artificial bandwidth for speech synthesis,” in Proc. ICASSP, 2015, pp. 4460–4464. extension evaluated by cross-language training and test,” in Proc. [38] J. B. Allen and L. R. Rabiner, “A unified approach to short-time Fourier ICASSP, 2008, pp. 4589–4592. analysis and synthesis,” Proceedings of the IEEE, vol. 65, no. 11, pp. [15] G. Chen and V. Parsa, “HMM-based frequency bandwidth extension for 1558–1564, 1977. speech enhancement using line spectral frequencies,” in Proc. ICASSP, [39] S. O. Arik, M. Chrzanowski, A. Coates, G. Diamos, A. Gibiansky, vol. 1, 2004, pp. I–709. Y. Kang, X. Li, J. Miller, J. Raiman, S. Sengupta et al., “Deep voice: [16] Z.-H. Ling, S.-Y. Kang, H. Zen, A. Senior, M. Schuster, X.-J. Qian, Real-time neural text-to-speech,” arXiv preprint arXiv:1702.07825, H. M. Meng, and L. Deng, “Deep learning for acoustic modeling 2017. in parametric speech generation: A systematic review of existing [40] A. Tamamori, T. Hayashi, K. Kobayashi, K. Takeda, and T. Toda, techniques and future trends,” IEEE Signal Processing Magazine, “Speaker-dependent WaveNet vocoder.” in Proc. Interspeech, 2017, pp. vol. 32, no. 3, pp. 
35–52, 2015. 1118–1122. [17] Z.-H. Ling, L. Deng, and D. Yu, “Modeling spectral envelopes using [41] Y.-J. Hu, C. Ding, L.-J. Liu, Z.-H. Ling, and L.-R. Dai, “The USTC restricted Boltzmann machines and deep belief networks for statistical system for blizzard challenge 2017.” in Proc. Blizzard Challenge parametric speech synthesis,” IEEE Transactions on Audio, Speech, and Workshop, 2017. Language Processing, vol. 21, no. 10, pp. 2129–2139, 2013. [42] I. Recommendation, “G. 711: Pulse code modulation (PCM) of voice frequencies,” International Telecommunication Union, 1988. [18] H. Zen, A. Senior, and M. Schuster, “Statistical parametric speech [43] Y. Fan, Y. Qian, F.-L. Xie, and F. K. Soong, “TTS synthesis synthesis using deep neural networks,” in Proc. ICASSP, 2013, pp. 7962– with bidirectional LSTM based recurrent neural networks.” in Proc. Interspeech, 2014, pp. 1964–1968. [19] L.-H. Chen, Z.-H. Ling, L.-J. Liu, and L.-R. Dai, “Voice conversion [44] J. Sotelo, S. Mehri, K. Kumar, J. F. Santos, K. Kastner, A. Courville, using deep neural networks with layer-wise generative training,” and Y. Bengio, “Char2wav: End-to-end speech synthesis,” in Proc. ICLR IEEE/ACM Transactions on Audio, Speech and Language Processing, Workshop Track, 2017. vol. 22, no. 12, pp. 1859–1872, 2014. [45] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett, [20] T. Nakashika, R. Takashima, T. Takiguchi, and Y. Ariki, “Voice “DARPA TIMIT acoustic-phonetic continous speech corpus CD-ROM. conversion in high-order eigen space using deep belief nets.” in Proc. NIST speech disc 1-1.1,” NASA STI/Recon technical report n, vol. 93, Interspeech, 2013, pp. 369–372. [21] X. Lu, Y. Tsao, S. Matsuda, and C. Hori, “Speech enhancement based on [46] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” deep denoising autoencoder.” in Proc. Interspeech, 2013, pp. 436–440. arXiv preprint arXiv:1412.6980, 2014. [22] Y. Xu, J. Du, L.-R. Dai, and C.-H. 
Lee, “A regression approach to speech [47] I. Recommendation, “P. 862.2: Wideband extension to recommendation enhancement based on deep neural networks,” IEEE/ACM Transactions P. 862 for the assessment of wideband telephone networks and speech on Audio, Speech and Language Processing, vol. 23, no. 1, pp. 7–19, codecs,” International Telecommunication Union, 2007. [48] A. O. Watson, “Assessing the quality of audio and video components [23] C. V. Botinhao, B. S. Carlos, L. P. Caloba, and M. R. Petraglia, in desktop multimedia conferencing,” Ph.D. dissertation, University of “Frequency extension of telephone narrowband speech signal using London, 2001. neural networks,” in Proc. CESA, vol. 2, 2006, pp. 1576–1579. [49] S. Buchholz and J. Latorre, “Crowdsourcing preference tests, and how [24] J. Kontio, L. Laaksonen, and P. Alku, “Neural network-based artificial to detect cheating,” in Proc. Interspeech, 2011, pp. 1118–1122. bandwidth expansion of speech,” IEEE transactions on audio, speech, and language processing, vol. 15, no. 3, pp. 873–881, 2007. [25] H. Pulakka and P. Alku, “Bandwidth extension of telephone speech using a neural network and a filter bank implementation for highband mel spectrum,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2170–2183, 2011. [26] K. Li and C.-H. Lee, “A deep neural network approach to speech bandwidth expansion,” in Proc. ICASSP, 2015, pp. 4395–4399. [27] B. Liu, J. Tao, Z. Wen, Y. Li, and D. Bukhari, “A novel method of artificial bandwidth extension using deep architecture.” in Proc. Interspeech, 2015, pp. 2598–2602. [28] Y. Wang, S. Zhao, W. Liu, M. Li, and J. Kuang, “Speech bandwidth expansion based on deep neural networks.” in Proc. Interspeech, 2015, pp. 2593–2597. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Electrical Engineering and Systems Science arXiv (Cornell University)
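As a supplementary illustration of the 8-bit μ-law quantization and the maximum-posterior decoding used in the BWE pipeline described in the paper, the following is a minimal Python sketch. It assumes waveform samples normalized to [-1, 1] and the standard continuous μ-law companding formula with μ = 255. The function names (`mu_law_encode`, `mu_law_decode`, `decode_argmax`) are hypothetical and are not taken from the authors' implementation, which may differ in rounding and scaling details.

```python
import math

MU = 255  # assumption: 8-bit mu-law, i.e. 256 quantization levels (0..255)


def mu_law_encode(x, mu=MU):
    """Map a sample in [-1, 1] to an integer quantization level in [0, mu]."""
    x = max(-1.0, min(1.0, x))  # clip to the valid input range
    # Logarithmic compression; the result stays in [-1, 1].
    compressed = math.copysign(math.log1p(mu * abs(x)) / math.log1p(mu), x)
    # Shift and scale to [0, mu], then round to the nearest level.
    return int(round((compressed + 1.0) / 2.0 * mu))


def mu_law_decode(level, mu=MU):
    """Inverse mapping: integer level in [0, mu] back to a value in [-1, 1]."""
    compressed = 2.0 * level / mu - 1.0
    magnitude = math.expm1(abs(compressed) * math.log1p(mu)) / mu
    return math.copysign(magnitude, compressed)


def decode_argmax(posteriors, mu=MU):
    """Select the level with maximum posterior probability and decode it."""
    level = max(range(len(posteriors)), key=posteriors.__getitem__)
    return mu_law_decode(level, mu)
```

In this sketch the network output for one sample would be a list of 256 posterior probabilities passed to `decode_argmax`. For the HF strategy, the decoded value would additionally be deamplified and high-pass filtered, which is not shown here.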

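The HRNN and DCNN latency entries of Table V can be checked with a few lines of arithmetic. The sketch below assumes the latency expressions $(c^{(3)} L^{(3)} - 1) / f_s$ and $N / (2 f_s)$, with $f_s$ given in kHz so that the result is in ms. The value N = 1024 is an assumption chosen only because it reproduces the 32 ms entry and matches a receptive field of about 64 ms; the function names are hypothetical.

```python
def hrnn_max_latency_ms(c_top, frame_size_top, fs_khz):
    """Future samples needed by the top tier, (c * L - 1), converted to ms."""
    return (c_top * frame_size_top - 1) / fs_khz


def dcnn_max_latency_ms(n, fs_khz):
    """A non-causal receptive field of n + 1 samples looks n / 2 samples ahead."""
    return (n / 2) / fs_khz


# Reported HRNN configuration: c(3) = 2, L(3) = 16, fs = 16 kHz.
hrnn_ms = hrnn_max_latency_ms(2, 16, 16.0)

# Assumed N = 1024, i.e. a receptive field of 1025 samples (about 64 ms).
dcnn_ms = dcnn_max_latency_ms(1024, 16.0)

print(hrnn_ms, dcnn_ms)
```

With these inputs the two functions give 1.9375 ms and 32 ms, which agree with the HRNN and DCNN rows of Table V.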


ISSN
2329-9290
eISSN
ARCH-3348
DOI
10.1109/TASLP.2018.2798811

Abstract

JOURNAL OF LATEX CLASS FILES, VOL. 6, NO. 1, JANUARY 2007

arXiv:1801.07910v1 [cs.SD] 24 Jan 2018

Waveform Modeling and Generation Using Hierarchical Recurrent Neural Networks for Speech Bandwidth Extension

Zhen-Hua Ling, Member, IEEE, Yang Ai, Yu Gu, and Li-Rong Dai

This work was partially funded by National Key Research and Development Project of China (Grant No. 2017YFB1002202) and the National Natural Science Foundation of China (Grants No. U1636201). Z.-H. Ling, Y. Ai, and L.-R. Dai are with the National Engineering Laboratory of Speech and Language Information Processing, University of Science and Technology of China, Hefei, 230027, China (e-mail: zhling@ustc.edu.cn, ay8067@mail.ustc.edu.cn, lrdai@ustc.edu.cn). Y. Gu is with Baidu Speech Department, Baidu Technology Park, Beijing, 100193, China (e-mail: guyu04@baidu.com). This work was done when he was a graduate student at the National Engineering Laboratory of Speech and Language Information Processing, University of Science and Technology of China.

Abstract: This paper presents a waveform modeling and generation method using hierarchical recurrent neural networks (HRNN) for speech bandwidth extension (BWE). Different from conventional BWE methods which predict spectral parameters for reconstructing wideband speech waveforms, this BWE method models and predicts waveform samples directly without using vocoders. Inspired by SampleRNN which is an unconditional neural audio generator, the HRNN model represents the distribution of each wideband or high-frequency waveform sample conditioned on the input narrowband waveform samples using a neural network composed of long short-term memory (LSTM) layers and feed-forward (FF) layers. The LSTM layers form a hierarchical structure and each layer operates at a specific temporal resolution to efficiently capture long-span dependencies between temporal sequences. Furthermore, additional conditions, such as the bottleneck (BN) features derived from narrowband speech using a deep neural network (DNN)-based state classifier, are employed as auxiliary input to further improve the quality of generated wideband speech. The experimental results of comparing several waveform modeling methods show that the HRNN-based method can achieve better speech quality and run-time efficiency than the dilated convolutional neural network (DCNN)-based method and the plain sample-level recurrent neural network (SRNN)-based method. Our proposed method also outperforms the conventional vocoder-based BWE method using LSTM-RNNs in terms of the subjective quality of the reconstructed wideband speech.

Index Terms: speech bandwidth extension, recurrent neural networks, dilated convolutional neural networks, bottleneck features

I. INTRODUCTION

Speech communication is important in people's daily life. However, due to the limitation of transmission channels and the restriction of speech acquisition equipment, the bandwidth of speech signal is usually limited to a narrowband of frequencies. For example, the bandwidth of speech signal in the public switched telephone network (PSTN) is less than 4 kHz. The absence of high-frequency components of speech signal usually leads to low naturalness and intelligibility, such as the difficulty of distinguishing fricatives and similar voices. Therefore, speech bandwidth extension (BWE), which aims to restore the missing high-frequency components of narrowband speech using the correlations that exist between the low and high-frequency components of the wideband speech signal, has attracted the attention of many researchers. BWE methods can not only be applied to real-time voice communication, but also benefit other speech signal processing areas such as text-to-speech (TTS) synthesis [1], speech recognition [2], [3], and speech enhancement [4], [5].

Many researchers have made a lot of efforts in the field of BWE. Some early studies adopted the source-filter model of speech production and attempted to restore high-frequency residual signals and spectral envelopes respectively from input narrowband signals. The high-frequency residual signals were usually estimated from the narrowband residual signals by spectral folding [6]. Estimating high-frequency spectral envelopes from narrowband signals is always a difficult task. To achieve this goal, simple methods, such as codebook mapping [7] and linear mapping [4], and statistical methods using Gaussian mixture models (GMMs) [8]–[11] and hidden Markov models (HMMs) [12]–[15], have been proposed. In statistical methods, acoustic models were built to represent the mapping relationship between narrowband spectral parameters and high-frequency spectral parameters. Although these statistical methods achieved better performance than simple mapping methods, the inadequate modeling ability of GMMs and HMMs may lead to over-smoothed spectral parameters, which constrains the quality of reconstructed speech signals [16].

In recent years, deep learning has become an emerging field in machine learning research. Deep learning techniques have been successfully applied to many signal processing tasks. In speech signal processing, neural networks with deep structures have been introduced to the speech generation tasks including speech synthesis [17], [18], voice conversion [19], [20], speech enhancement [21], [22], and so on. In the field of BWE, neural networks have also been adopted to predict either the spectral parameters representing vocal-tract filter properties [23]–[25] or the original log-magnitude spectra derived by short-time Fourier transform (STFT) [26], [27]. The studied model architectures included deep neural networks (DNN) [28]–[30], recurrent temporal restricted Boltzmann machines (RBM) [31], recurrent neural networks (RNN) with long short-term memory (LSTM) cells [32], and so on. These methods achieved better BWE performance than using conventional statistical models, like GMMs and HMMs, since deep-structured neural networks are more capable of modeling the complicated and nonlinear mapping relationship between input and output acoustic parameters.

However, all these existing methods are vocoder-based ones, which means vocoders are used to extract spectral parameters from narrowband waveforms and then to reconstruct waveforms from the predicted wideband or high-frequency spectral parameters. This may lead to two deficiencies. First, the parameterization process of vocoders usually degrades speech quality. For example, the spectral details are always lost in the reconstructed waveforms when low-dimensional spectral parameters, such as mel-cepstra or line spectral pairs (LSP), are adopted to represent spectral envelopes in vocoders. The spectral shapes of the noise components at voiced frames are always ignored when only F0 values and binary voiced/unvoiced flags are used to describe the excitation. Second, it is difficult to parameterize and to predict phase spectra due to the phase-wrapping issue. Thus, simple estimation methods, such as mirror inversion, are popularly used to predict the high-frequency phase spectra in existing methods [26], [32]. This also constrains the quality of the reconstructed wideband speech.

Recently, neural network-based speech waveform synthesizers, such as WaveNet [33] and SampleRNN [34], have been presented. In WaveNet [33], the distribution of each waveform sample conditioned on previous samples and additional conditions was represented using a neural network with dilated convolutional neural layers and residual architectures. SampleRNN [34] adopted recurrent neural layers with a hierarchical structure for unconditional audio generation. Inspired by WaveNet, a waveform modeling and generation method using stacked dilated CNNs for BWE has been proposed in our previous work [35], which achieved better subjective BWE performance than the vocoder-based approach utilizing LSTM-RNNs. On the other hand, the methods of applying RNNs to directly model and generate speech waveforms for BWE have not yet been investigated.

Therefore, this paper proposes a waveform modeling and generation method using RNNs for BWE. As discussed above, direct waveform modeling and generation can help avoid the spectral representation and phase modeling issues in vocoder-based BWE methods. Considering the sequence memory and modeling ability of RNNs and LSTM units, this paper adopts LSTM-RNNs to model and generate the wideband or high-frequency waveform samples directly given input narrowband waveforms. Inspired by SampleRNN [34], a hierarchical RNN (HRNN) structure is presented for the BWE task. There are multiple recurrent layers in an HRNN and each layer operates at a specific temporal resolution. Compared with plain sample-level deep RNNs, HRNNs are more capable and efficient at capturing long-span dependencies in temporal sequences. Furthermore, additional conditions, such as the bottleneck (BN) features [32], [36], [37] extracted from narrowband speech using a DNN-based state classifier, are introduced into HRNN modeling to further improve the performance of BWE.

The contributions of this paper are twofold. First, this paper makes the first successful attempt to model and generate speech waveforms directly at sample-level using RNNs for the BWE task. Second, various RNN architectures for waveform-based BWE, including plain sample-level LSTM-RNNs, HRNNs, and HRNNs with additional conditions, are implemented and evaluated in this paper. The experimental results of comparing several waveform modeling methods show that the HRNN-based method achieves better speech quality and run-time efficiency than the stacked dilated CNN-based method [35] and the plain sample-level RNN-based method. Our proposed method also outperforms the conventional vocoder-based BWE method using LSTM-RNNs in terms of the subjective quality of the reconstructed wideband speech.

This paper is organized as follows. In Section II, we briefly review previous BWE methods including vocoder-based ones and the dilated CNN-based one. In Section III, the details of our proposed method are presented. Section IV reports our experimental results, and conclusions are given in Section V.

II. PREVIOUS WORK

A. Vocoder-Based BWE Using Neural Networks

The vocoder-based BWE methods using DNNs or RNNs have been proposed in recent years [26], [32]. In these methods, spectral parameters such as logarithmic magnitude spectra (LMS) were first extracted by short time Fourier transform (STFT) [38]. Then, DNNs or LSTM-RNNs were trained under minimum mean square error (MMSE) criterion to establish a mapping relationship from the LMS of narrowband speech to the LMS of the high-frequency components of wideband speech. Some additional features extracted from narrowband speech, such as bottleneck features, can be used as auxiliary inputs to improve the performance of networks [32]. At the stage of reconstruction, the LMS of wideband speech were reconstructed by concatenating the LMS of input narrowband speech and the LMS of high-frequency components predicted by the trained DNN or LSTM-RNN. The phase spectra of wideband speech were usually generated by some simple mapping algorithms, such as mirror inversion [26]. Finally, inverse FFT (IFFT) and overlap-add algorithm were carried out to reconstruct the wideband waveforms from the predicted LMS and phase spectra.

The experimental results of previous work showed that LSTM-RNNs can achieve better performance than DNNs in the vocoder-based BWE [32]. Nevertheless, there are still some issues with the vocoder-based BWE approach as discussed in Section I, such as the quality degradation caused by the parameterization of vocoders and the inadequacy of restoring phase spectra.

B. Waveform-Based BWE Using Stacked Dilated CNNs

Recently, a novel waveform generation model named WaveNet was proposed [33] and has been successfully applied to the speech synthesis task [39]–[41]. This model utilizes stacked dilated CNNs to describe the autoregressive generation process of audio waveforms without using frequency analysis and vocoders. A stacked dilated CNN consists of many convolutional layers with different dilation factors. The length of its receptive field grows exponentially in terms of the network depth [33].

Motivated by this idea, a waveform modeling and generation method for BWE was proposed [35], which described the conditional distribution of the output wideband or high-frequency waveform sequence $\mathbf{y} = [y_1, y_2, \ldots, y_T]$ conditioned on the input narrowband waveform sequence $\mathbf{x} = [x_1, x_2, \ldots, x_T]$ using stacked dilated CNNs. Similar to WaveNet, the samples $x_t$ and $y_t$ were all discretized by 8-bit $\mu$-law quantization [42] and a softmax output layer was adopted. Residual and parameterized skip connections together with gated activation functions were also employed to facilitate the training of deep networks and to accelerate the convergence of model estimation. Different from WaveNet, this method modeled the mapping relationship between two waveform sequences, not the autoregressive generation process of output waveform sequence. Both causal and non-causal model structures were implemented and experimental results showed that the non-causal structure achieved better performance than the causal one [35]. The stacked dilated non-causal CNN, as illustrated in Fig. 1, described the conditional distribution as

$$p(\mathbf{y}|\mathbf{x}) = \prod_{t=1}^{T} p(y_t | x_{t-N/2}, x_{t-N/2+1}, \ldots, x_{t+N/2}), \tag{1}$$

where $N + 1$ is the length of receptive field.

Fig. 1. The structure of stacked dilated non-causal CNNs [35].

At the extension stage, given input narrowband speech, each output sample was obtained by selecting the quantization level with maximum posterior probability. Finally, the generated waveforms were processed by a high-pass filter and then added to the input narrowband waveforms to reconstruct the final wideband waveforms. Experimental results showed that this method achieved better subjective BWE performance than the vocoder-based method using LSTM-RNNs [35].

III. PROPOSED METHODS

Inspired by SampleRNN [34] which is an unconditional audio generator containing recurrent neural layers with a hierarchical structure, this paper proposes waveform modeling and generation methods using RNNs for BWE. In this section, we first introduce the plain sample-level RNNs (SRNN) for waveform modeling. Then the structures of hierarchical RNNs (HRNN) and conditional HRNNs are explained in detail. Finally, the flowchart of BWE using RNNs is introduced.

A. Sample-Level Recurrent Neural Networks

The LSTM-RNNs for speech generation are usually built at frame-level in order to model the acoustic parameters extracted by vocoders with a fixed frame shift [32], [43]. It is straightforward to model and generate speech waveforms at sample-level using a similar LSTM-RNN framework. The structure of sample-level recurrent neural networks (SRNNs) for BWE is shown in Fig. 2, which is composed of a cascade of LSTM layers and feed-forward (FF) layers. Both the input waveform samples $\mathbf{x} = [x_1, x_2, \ldots, x_T]$ and output waveform samples $\mathbf{y} = [y_1, y_2, \ldots, y_T]$ are quantized to discrete values by $\mu$-law. The embedding layer maps each discrete sample value $x_t$ to a real-valued vector $\mathbf{e}_t$. The LSTM layers model the sequence of embedding vectors in a recurrent manner. When there is only one LSTM layer, the calculation process can be formulated as

$$\mathbf{h}_t = \mathcal{H}(\mathbf{h}_{t-1}, \mathbf{e}_t), \tag{2}$$

where $\mathbf{h}_t$ is the output of LSTM layers at time step $t$, and $\mathcal{H}$ represents the activation function of LSTM units. If there are multiple LSTM layers, their output can be calculated layer-by-layer. Then, $\mathbf{h}_t$ passes through FF layers. The activation function of the last layer is a softmax function which generates the probability distribution of the output sample $y_t$ conditioned on the previous and current input samples $\{x_1, x_2, \ldots, x_t\}$ as

$$p(y_t | x_1, x_2, \ldots, x_t) = \mathrm{FF}(\mathbf{h}_t), \tag{3}$$

where function $\mathrm{FF}$ denotes the calculation of FF layers.

Fig. 2. The structure of SRNNs for BWE, where concentric circles represent LSTM layers and inverted trapezoids represent FF layers.

Given a training set with parallel input and output waveform sequences, the model parameters of the LSTM and the FF layers are estimated using cross-entropy cost function. At generation time, each output sample $y_t$ is obtained by maximizing the conditional probability distribution (3). Our preliminary and informal listening test showed that this generation criterion can achieve better subjective performance than generating random samples from the distribution. The random sampling is necessary for the conventional WaveNet and SampleRNN models because of their autoregressive architecture. However, the model structure shown in Fig. 2 is not an autoregressive one. The input waveforms provide the necessary randomness to synthesize the output speech, especially the unvoiced segments.

In an SRNN, the generation of each output sample depends on all previous and current input samples. However, this plain LSTM-RNN architecture still has some deficiencies for waveform modeling and generation. First, sample-level modeling makes it difficult to model long-span dependencies between input and output speech signals due to the significantly increased sequence length compared with frame-level modeling. Second, SRNNs suffer from the inefficiency of waveform generation due to the point-by-point calculation at all layers and the dimension expansion at the embedding layer. Therefore, inspired by SampleRNN [34], a hierarchical RNN (HRNN) structure is proposed in the next subsection to alleviate these problems.

B. Hierarchical Recurrent Neural Networks

The structure of HRNNs for BWE is illustrated in Fig. 3. Similar to SRNNs mentioned in Section III-A, HRNNs are also composed of LSTM layers and FF layers. Different from the plain LSTM-RNN structure of SRNNs, these LSTM and FF layers in HRNNs form a hierarchical structure of multiple tiers and each tier operates at a specific temporal resolution. The bottom tier (i.e., Tier 1 in Fig. 3) deals with individual samples and outputs sample-level predictions. Each higher tier operates on a lower temporal resolution (i.e., dealing with more samples per time step). Each tier conditions on the tier above it except the top tier. This model structure is similar to SampleRNN [34]. The main difference is that the original SampleRNN model is an unconditional audio generator which employs the history of output waveforms as network input and generates output waveforms in an autoregressive way. In contrast, the HRNN model shown in Fig. 3 describes the mapping relationship between two waveform sequences directly without considering the autoregressive property of output waveforms. This HRNN structure is specifically designed for BWE because narrowband waveforms are used as inputs in this task. Removing autoregressive connections can help reduce the computation complexity and facilitate parallel computing at generation time. Although conditional SampleRNNs have been developed and used as neural vocoders to reconstruct speech waveforms from acoustic parameters [44], they still follow the autoregressive framework and are different from HRNNs.

Fig. 3. The structure of HRNNs for BWE, where concentric circles represent LSTM layers and inverted trapezoids represent FF layers.

Assume an HRNN has $K$ tiers in total (e.g., $K = 3$ in Fig. 3). Tier 1 works at sample-level and the other $K - 1$ tiers are frame-level tiers since they operate at a temporal resolution lower than samples.

1) Frame-level tiers: The $k$-th tier ($1 < k \le K$) operates on frames composed of $L^{(k)}$ samples. The range of time step at the $k$-th tier, $t^{(k)}$, is determined by $L^{(k)}$. Denoting the quantized input waveforms as $\mathbf{x} = [x_1, x_2, \ldots, x_T]$ and assuming that $L$ represents the sequence length of $\mathbf{x}$ after zero-padding so that $L$ can be divisible by $L^{(K)}$, we can get

$$t^{(k)} \in T^{(k)} = \left\{1, 2, \ldots, \frac{L}{L^{(k)}}\right\}, \quad 1 < k \le K. \tag{4}$$

Furthermore, the relationship of temporal resolution between the $m$-th tier and the $n$-th tier ($1 < m < n \le K$) can be described as

$$T^{(n)} = \left\{ t^{(n)} \,\middle|\, t^{(n)} = \left\lceil \frac{t^{(m)}}{L^{(n)}/L^{(m)}} \right\rceil, \; t^{(m)} \in T^{(m)} \right\}, \tag{5}$$

where $\lceil \cdot \rceil$ represents the operation of rounding up. It can be observed from (5) that one time step of the $n$-th tier corresponds to $L^{(n)}/L^{(m)}$ time steps of the $m$-th tier. The frame inputs $\mathbf{f}_t^{(k)}$ at the $k$-th tier ($1 < k \le K$) and the $t$-th time step can be written by framing and concatenation operations as

$$\tilde{\mathbf{f}}_t^{(k)} = [x_{(t-1)L^{(k)}+1}, \ldots, x_{tL^{(k)}}]^\top, \tag{6}$$

$$\mathbf{f}_t^{(k)} = [\tilde{\mathbf{f}}_t^{(k)\top}, \ldots, \tilde{\mathbf{f}}_{t+c^{(k)}-1}^{(k)\top}]^\top, \tag{7}$$

where $t \in T^{(k)}$, $\tilde{\mathbf{f}}_t^{(k)}$ denotes the $t$-th waveform frame at the $k$-th tier, and $c^{(k)}$ is the number of concatenated frames at the $k$-th tier. We have $c^{(3)} = c^{(2)} = 1$ in the model structure shown in Fig. 3.

As shown in Fig. 3, the frame-level tiers are composed of LSTM layers. For the top tier (i.e., $k = K$), the LSTM units update their hidden states $\mathbf{h}_t^{(K)}$ based on the hidden states of previous time step $\mathbf{h}_{t-1}^{(K)}$ and the input at current time step $\mathbf{f}_t^{(K)}$. If there is only one LSTM layer in the $K$-th tier, the calculation process can be formulated as

$$\mathbf{h}_t^{(K)} = \mathcal{H}(\mathbf{h}_{t-1}^{(K)}, \mathbf{f}_t^{(K)}), \quad t \in T^{(K)}. \tag{8}$$

If the top tier is composed of multiple LSTM-RNN layers, the hidden states can be calculated layer-by-layer iteratively.

Due to the different temporal resolution at different tiers, the top tier generates $r^{(K)} = L^{(K)}/L^{(K-1)}$ conditioning vectors for the $(K-1)$-th tier at each time step $t \in T^{(K)}$. This is implemented by producing a set of $r^{(K)}$ separate linear projections of $\mathbf{h}_t^{(K)}$ at each time step. For the intermediate tiers (i.e., $1 < k < K$), the processing of generating conditioning vectors is the same as that of the top tier. Thus, we can describe the conditioning vectors uniformly as

$$\mathbf{d}_{(t-1)r^{(k)}+j}^{(k)} = \mathbf{W}_j^{(k)} \mathbf{h}_t^{(k)}, \quad j = 1, 2, \ldots, r^{(k)}, \quad t \in T^{(k)}, \tag{9}$$

where $1 < k \le K$ and $r^{(k)} = L^{(k)}/L^{(k-1)}$.

The input vectors of the LSTM layers at intermediate tiers are different from that of the top tier. For the $k$-th tier ($1 < k < K$), the input vector at the $t$-th time step is denoted by $\mathbf{i}_t^{(k)}$.
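As a rough illustration of the input preparation described above (8-bit $\mu$-law discretization, zero-padding to a multiple of $L^{(K)}$, and the framing of Eq. (6)), a minimal Python sketch might look as follows. The frame sizes ($L^{(2)} = 4$, $L^{(3)} = 16$), the toy sine input, the helper names, and the choice to pad with the quantization level of a zero-valued sample are all assumptions made for illustration; none of them is taken from the paper.

```python
import math

MU = 255  # 8-bit mu-law: 256 quantization levels (0..255)


def mu_law_encode(sample, mu=MU):
    """Map a float in [-1, 1] to an integer quantization level in [0, mu]."""
    sample = max(-1.0, min(1.0, sample))
    compressed = math.copysign(
        math.log1p(mu * abs(sample)) / math.log1p(mu), sample)
    return int(round((compressed + 1.0) / 2.0 * mu))


def mu_law_decode(level, mu=MU):
    """Approximate inverse of mu_law_encode (level -> float in [-1, 1])."""
    compressed = 2.0 * level / mu - 1.0
    return math.copysign(
        math.expm1(abs(compressed) * math.log1p(mu)) / mu, compressed)


def zero_pad(levels, top_frame_size, pad_level):
    """Pad so that the length L is divisible by the top-tier frame size L^(K)."""
    remainder = len(levels) % top_frame_size
    if remainder:
        levels = levels + [pad_level] * (top_frame_size - remainder)
    return levels


def frames(levels, frame_size):
    """Eq. (6): frame t (1-based) holds x_{(t-1)L^(k)+1}, ..., x_{t L^(k)}."""
    return [levels[i:i + frame_size]
            for i in range(0, len(levels), frame_size)]


# Hypothetical 3-tier configuration: L^(1) = 1, L^(2) = 4, L^(3) = 16.
frame_sizes = {1: 1, 2: 4, 3: 16}

# Toy input: 100 samples of a sine wave standing in for narrowband speech.
wave = [math.sin(2 * math.pi * 5 * n / 100) for n in range(100)]

# Assumption: padding uses the level that encodes a zero-valued sample.
x = zero_pad([mu_law_encode(s) for s in wave],
             frame_sizes[3], mu_law_encode(0.0))

# Frame-level inputs per tier (with c^(k) = 1) and the number of time steps
# |T^(k)| = L / L^(k) from Eq. (4).
tier_frames = {k: frames(x, frame_sizes[k]) for k in (2, 3)}

# r^(k) = L^(k) / L^(k-1): conditioning vectors produced per time step of
# tier k, as in Eq. (9).
ratios = {k: frame_sizes[k] // frame_sizes[k - 1] for k in (2, 3)}
```

With these assumed sizes, each time step of Tier 3 would hand four conditioning vectors to Tier 2, and each time step of Tier 2 would hand four to the sample-level tier.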
The input vector $\mathbf{i}_t^{(k)}$ of the $k$-th tier ($1 < k < K$) at the $t$-th time step is composed of a linear combination of the frame inputs $\mathbf{f}_t^{(k)}$ and the conditioning vectors $\mathbf{d}_t^{(k+1)}$ given by the $(k+1)$-th tier as

$$\mathbf{i}_t^{(k)} = \mathbf{W}^{(k)} \mathbf{f}_t^{(k)} + \mathbf{d}_t^{(k+1)}, \quad t \in T^{(k)}. \tag{10}$$

Thus, the output of the LSTM layer at the $k$-th tier ($1 < k < K$) can be calculated as

$$\mathbf{h}_t^{(k)} = \mathcal{H}(\mathbf{h}_{t-1}^{(k)}, \mathbf{i}_t^{(k)}), \quad t \in T^{(k)}. \tag{11}$$

2) Sample-level tier: The sample-level tier (i.e., Tier 1 in Fig. 3) gives the probability distribution of the output sample $y_t$ conditioned on the current input sample $x_t$ (i.e., $L^{(1)} = 1$) together with the conditioning vector $\mathbf{d}_t^{(2)}$ passed from the above tier which encodes history information of the input sequence, where $t \in T^{(1)} = \{1, 2, \ldots, L\}$. Since $x_t$ and $y_t$ are individual samples, it is convenient to model the correlation among them using a memoryless structure such as FF layers. First, $x_t$ is mapped into a real-valued vector $\mathbf{e}_t$ by an embedding layer. These embedding vectors form the input at each time step of the sample-level tier, i.e.,

$$\mathbf{f}_t^{(1)} = [\mathbf{e}_t^\top, \ldots, \mathbf{e}_{t+c^{(1)}-1}^\top]^\top, \tag{12}$$

where $t \in T^{(1)}$ and $c^{(1)}$ is the number of concatenated sample embeddings at the sample-level tier. In the model structure shown in Fig. 3, $c^{(1)} = 1$. Then, the input of the FF layers is a linear combination of $\mathbf{f}_t^{(1)}$ and $\mathbf{d}_t^{(2)}$ as

$$\mathbf{i}_t^{(1)} = \mathbf{W}^{(1)} \mathbf{f}_t^{(1)} + \mathbf{d}_t^{(2)}, \quad t \in T^{(1)}. \tag{13}$$

Finally, we can obtain the conditional probability distribution of the output sample $y_t$ by passing $\mathbf{i}_t^{(1)}$ through the FF layers. The activation function of the last FF layer is a softmax function. The output of FF layers describes the conditional distribution

$$p\left(y_t \,\middle|\, x_1, x_2, \ldots, x_{(\lceil t/L^{(K)} \rceil + c^{(K)} - 1)L^{(K)}}\right) = \mathrm{FF}(\mathbf{i}_t^{(1)}), \tag{14}$$

where $t \in T^{(1)}$.

It is worth mentioning that the structure shown in Fig. 3 is non-causal, which utilizes future input samples together with current and previous input samples to predict current output sample (e.g., using $x_1, \ldots, x_{L^{(3)}}$ to predict $y_1$ in Fig. 3). Generally speaking, at most $c^{(K)} L^{(K)} - 1$ input samples after the current time step are necessary in order to predict current output sample according to (14). This is also a difference between our HRNN model and SampleRNN, which has a causal and autoregressive structure.

Similar to SRNNs, the parameters of HRNNs are estimated using cross-entropy cost function given a training set with parallel input and output sample sequences. At generation time, each $y_t$ is predicted using the conditional probability distribution in (14).

C. Conditional Hierarchical Recurrent Neural Networks

Some frame-level auxiliary features extracted from input narrowband waveforms, such as bottleneck (BN) features [36], have shown their effectiveness in improving the performance of vocoder-based BWE [32]. In order to combine such auxiliary inputs with the HRNN model introduced in Section III-B, a conditional HRNN structure is designed as shown in Fig. 4.

Fig. 4. The structure of conditional HRNNs for BWE, where concentric circles represent LSTM layers and inverted trapezoids represent FF layers.

Compared with HRNNs, conditional HRNNs add an additional tier named conditional tier on the top. The input features of the conditional tier are frame-level auxiliary feature vectors extracted from input waveforms rather than waveform samples. Assume the total number of tiers in a conditional HRNN is $K$ (e.g., $K = 4$ in Fig. 4) and let $L^{(K)}$ denote the frame shift of auxiliary input features. The equations (4) and (5) in Section III-B still hold here. Similar to the introductions in Section III-B, the frame inputs at the conditional tier can be written as

$$\mathbf{c}_t = [c_1^t, c_2^t, \ldots, c_d^t], \quad t \in T^{(K)}, \tag{15}$$

where $c_d^t$ represents the $d$-th dimension of the auxiliary feature vector at time $t$. Then the calculations of (8)-(13) for HRNNs are followed. Finally, the conditional probability distribution for generating $y_t$ can be written as

$$p\left(y_t \,\middle|\, x_1, \ldots, x_{(\lceil t/L^{(K)} \rceil + c^{(K)} - 1)L^{(K)}}, \mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_{\lceil t/L^{(K)} \rceil}\right) = \mathrm{FF}(\mathbf{i}_t^{(1)}), \tag{16}$$

where $t \in T^{(1)}$ and $\{\mathbf{c}_1, \mathbf{c}_2, \ldots, \mathbf{c}_{\lceil t/L^{(K)} \rceil}\}$ are additional conditions introduced by the auxiliary input features.

D. BWE Using SRNNs and HRNNs

The flowchart of BWE using SRNNs or HRNNs is illustrated in Fig. 5. There are two mapping strategies. One is to map the narrowband waveforms towards their corresponding wideband counterparts (named WB strategy in the rest of this paper) and the other is to map the narrowband waveforms towards the waveforms of the high-frequency component of wideband speech (named HF strategy).

Fig. 5. The flowchart of our proposed BWE methods.

A database with wideband speech recordings is used for model training. At the training stage, the input narrowband waveforms are obtained by downsampling the wideband waveforms. To guarantee the length consistency between the input and output sequences, the narrowband waveforms are then upsampled to the sampling rate of the wideband speech with zero high-frequency components. The upsampled narrowband waveforms are used as the model input. The output waveforms are either the unfiltered wideband waveforms (WB strategy) or the high-frequency waveforms (HF strategy). The high-frequency waveforms are obtained by sending wideband speech into a high-pass filter and an amplifier for reducing quantization noise, as shown by the dotted lines in Fig. 5. Before the waveforms are used for model training, all the input and output waveform samples are discretized by 8-bit $\mu$-law quantization. The model parameters of SRNNs or HRNNs are trained under cross-entropy (CE) criterion which optimizes the classification accuracy of discrete output samples on the training set.

At the extension stage, the upsampled and quantized narrowband waveforms are fed into the trained SRNNs or HRNNs to generate the probability distributions of output samples. Then each output sample is obtained by selecting the quantization level with maximum posterior probability. Later, the quantized output samples are decoded into continuous values using the inverse mapping of $\mu$-law quantization. A deamplification process is conducted for the HF strategy in order to compensate for the effect of amplification at training time. Finally, the generated waveforms are high-pass filtered and added to the input narrowband waveforms to generate the final wideband waveforms.

Particularly for conditional HRNNs, BN features are used as auxiliary input in our implementation as shown by the gray lines in Fig. 5. BN features can be regarded as a compact representation of both linguistic and acoustic information [36]. Here, BN features are extracted by a DNN-based state classifier, which has a bottleneck layer with a smaller number of hidden units than that of other hidden layers. The inputs of the DNN are mel-frequency cepstral coefficients (MFCC) extracted from narrowband speech and the outputs are the posterior probability of HMM states. The DNN is trained under cross-entropy (CE) criterion and is used as the BN feature extractor at extension time.

IV. EXPERIMENTS

A. Experimental Setup

The TIMIT corpus [45] which contained English speech from multi-speakers with 16 kHz sampling rate and 16-bit resolution was adopted in our experiments. We chose 3696 and 1153 utterances to construct the training set and validation set respectively. Another 192 utterances from the speakers not included in the training set and validation set were used as the test set to evaluate the performance of different BWE methods. In our experiments, the narrowband speech waveforms sampled at 8 kHz were obtained by downsampling the wideband speech at 16 kHz.

Five BWE systems were constructed for comparison in our experiments. The descriptions of these systems are as follows.

VRNN: Vocoder-based BWE method using LSTM-RNNs as introduced in Section II-A. The DRNN-BN system in [32] was used here for comparison, which predicted the LMS of high-frequency components using a deep LSTM-RNN with auxiliary BN features. Backpropagation through time (BPTT) algorithm was used to train the LSTM-RNN model based on the minimum mean square error (MMSE) criterion. In this system, a DNN-based state classifier was built to extract BN features. Eleven frames of 39-dimensional narrowband MFCCs were used as the input of the DNN classifier and the posterior probabilities of 183 HMM states for 61 monophones were regarded as the output of the DNN classifier. The DNN classifier adopted 6 hidden layers where there were 100 hidden units at the BN layer and 1024 hidden units at other hidden layers. The BN layer was set as the fifth hidden layer so that the extractor could capture more linguistic information. This BN feature extractor was also used in the CHRNN system.

DCNN: Waveform-based BWE method using stacked dilated CNNs as introduced in Section II-B. The CNN2-HF system in [35] was used here for comparison, which predicted high-frequency waveforms using non-causal CNNs and performed better than other configurations.

SRNN: Waveform-based BWE method using sample-level RNNs as introduced in Section III-A. The built model had two LSTM layers and two FF layers. Both the LSTM layers and the FF layers had 1024 hidden units and the embedding size was 256. The model was trained by stochastic gradient descent with a mini-batch size of 64 to minimize the cross entropy between the predicted and real probability distribution. Zero-padding was applied to make all the sequences in a mini-batch have the same length and the cost values of the added zero samples were ignored when computing the gradients. An Adam optimizer [46] was used to update the parameters with an initial learning rate of 0.001. Truncated backpropagation through time (TBPTT) algorithm was employed to improve the efficiency of model training and the truncated length was set to 480.

HRNN: Waveform-based BWE method using HRNNs as introduced in Section III-B. The HRNN was composed of 3 tiers with two FF layers in Tier 1 and one LSTM layer each in Tier 2 and 3. Therefore, there were two LSTM layers and two FF layers in total, which was the same as the SRNN system. The number of $c^{(k)}$, $k = \{1, 2, 3\}$ in (14) and (19) were set as $c^{(3)} = c^{(2)} = 2$, $c^{(1)} = L^{(2)}$ in our experiments after tuning on the validation set. Some other setups, such as the dimension of the hidden units and the training method, were the same as that of the SRNN system mentioned above. The frame size configurations of the HRNN model will be discussed in Section IV-B.

CHRNN: Waveform-based BWE method using conditional HRNNs as introduced in Section III-C. The BN features extracted by the DNN state classifier used by the VRNN system were adopted as auxiliary conditions. The model was composed of 4 tiers. The top conditional tier had one LSTM layer with 1024 hidden units and the other three tiers were the same as the HRNN system. Some basic setups and the training method were the same as the HRNN system. The setup of the conditional tier will be introduced in detail in Section IV-E.

Examples of reconstructed speech waveforms in our experiments can be found at http://home.ustc.edu.cn/ ay8067/IEEEtran/demo.html.

In our experiments, we first investigated the influence of frame sizes and mapping strategies (i.e., the WB and HF strategies introduced in Section III-D) on the performance of the HRNN system. Then, the comparison between different waveform-based BWE methods including the DCNN, SRNN and HRNN systems was carried out. Later, the effect of introducing BN features to HRNNs was studied by comparing the HRNN system and the CHRNN system. Finally, our proposed waveform-based BWE method was compared with the conventional vocoder-based one.

B. Effects of Frame Sizes on HRNN-Based BWE

As introduced in Section III-B, the frame sizes $L^{(k)}$ are key parameters that make an HRNN model different from the conventional sample-level RNN. In this experiment, we studied the effect of $L^{(k)}$ on the performance of HRNN-based BWE. The HRNN models with several configurations of $(L^{(3)}, L^{(2)})$ were trained and their accuracy and efficiency were compared as shown in Fig. 6. Here, the classification accuracy of predicting discrete waveform samples in the validation set was used to measure the accuracy of different models. The total time of generating 1153 utterances in the validation set with mini-batch size of 64 on a single Tesla K40 GPU was used to measure the run-time efficiency. Both the WB and HF mapping strategies were considered in this experiment. From the results shown in Fig. 6, we can see that there existed a conflict between the accuracy and the efficiency of the trained HRNN models. Using smaller frame sizes of $(L^{(3)}, L^{(2)})$ improved the accuracy of sample prediction but increased the computational complexity at the extension stage for both the WB and HF strategies.

Fig. 6. Accuracy and efficiency comparison for HRNN-based BWE with different $(L^{(3)}, L^{(2)})$ configurations and using (a) WB and (b) HF mapping strategies.

TABLE I
AVERAGE PESQ SCORES WITH 95% CONFIDENCE INTERVALS ON THE TEST SET WHEN USING WB AND HF MAPPING STRATEGIES FOR HRNN-BASED BWE.

              Narrowband       HRNN-WB          HRNN-HF
PESQ score    3.63 ± 0.0636    3.53 ± 0.0438    3.75 ± 0.0456

C. Effects of Mapping Strategy on HRNN-Based BWE

It can be observed from Fig. 6 that the HF strategy achieved much lower classification accuracy than the WB strategy. It is reasonable since it is more difficult to predict the aperiodic and noise-like high-frequency waveforms than to predict wideband waveforms. Objective and subjective evaluations were conducted to investigate which strategy can achieve better performance for the HRNN-based BWE.

Since it is improper to compare the classification accuracy of these two strategies directly, the score of Perceptual Evaluation of Speech Quality (PESQ) for wideband speech (ITU-T P.862.2) [47] was adopted as the objective measurement here. We utilized the clean wideband speech as reference and calculated the PESQ scores of the 192 utterances in the test set generated using WB and HF strategies (i.e., the HRNN-WB system and the HRNN-HF system) respectively. For comparison, the PESQ scores of the upsampled narrowband utterances (i.e., with empty high-frequency components) were also calculated. The average PESQ scores and their 95% confidence intervals are shown in Table I. The differences between any two of the three systems were significant according to the results of paired t-tests ($p < 0.001$). From Table I, we can see that the HF strategy achieved higher PESQ score than the WB strategy. The average PESQ of the HRNN-WB system was even lower than that of the upsampled narrowband speech. This may be attributed to that the model in the HRNN-WB system aimed to reconstruct the whole wideband waveforms
Finally, (3) (2) we chose (L ; L ) = (16; 4) as a trade-off and used this and was incapable of generating high-frequency components configuration for building the HRNN system in the following as accurately as the HRNN-HF system. experiments. A 3-point comparison category rating (CCR) [48] test A JOURNAL OF LT X CLASS FILES, VOL. 6, NO. 1, JANUARY 2007 8 TABLE II O BJECTIVE PERFORMANCE OF THE DCNN, SRNN AND HRNN SYSTEMS ON THE TEST SET. DCNN SRNN HRNN Accuracy (%) 7.180.336 7.400.387 7.520.388 PESQ score 3.620.0532 3.700.0477 3.750.0456 SNR (dB) 19.060.5983 18.950.6053 19.000.6099 SNR-V (dB) 26.140.7557 26.060.7648 26.210.7716 SNR-U (dB) 10.490.4094 10.320.4126 10.260.4124 LSD (dB) 8.460.122 8.610.136 8.300.127 LSD-V (dB) 7.710.172 8.090.203 8.020.194 LSD-U (dB) 9.340.124 9.190.124 8.570.107 Generation time (s) 3.97 19.39 3.61 used in Section IV-C were adopted as objective measurements. Besides, two extra metrics were adopted here, including signal- to-noise ratio (SNR) [40] which measured the distortion of waveforms and log spectral distance (LSD) [40] which Fig. 7. Average CCR scores of comparing five system pairs, including (1) HRNN-HF vs. HRNN-WB, (2) HRNN vs. DCNN, (3) HRNN vs. SRNN, (4) reflected the distortion in frequency domain. The SNR and CHRNN vs. HRNN, and (5) CHRNN vs. VRNN. The error bars represent LSD for voiced frames (denoted by SNR-V and LSD-V) 95% confidence intervals and the numerical values in parentheses represent and unvoiced frames (denoted by SNR-U and LSD-U) were the p-value of one-sample t-test for different system pairs. also calculated separately for each system. For the fairness of efficiency comparison, we set the mini-batch size as 1 for all was conducted on the Amazon Mechanical Turk (AMT) the three systems when generating utterances in the test set. 
crowdsourcing platform (https://www.mturk.com) to compare The time of generating 1 second speech (i.e., 16000 samples the subjective performance of the HRNN-WB and HRNN-HF for 16kHz speech) using a Tesla K40 GPU was recorded as the measurement of efficiency in this experiment. systems. The wideband waveforms of 20 utterances randomly selected from the test set were reconstructed by the HRNN- Table II shows the objective performance of the three WB and HRNN-HF systems. Each pair of generated wideband systems on the test set. The 95% confidence intervals were speech were evaluated in random order by 15 native English also calculated for all metrics except the generation time. The results of paired t-tests indicated that the differences listeners after rejecting improper listeners based on anti- between any two of the three systems on all metrics were cheating considerations [49]. The listeners were asked to judge significant (p < 0:01). For accuracy and PESQ score, the which utterance in each pair had better speech quality or DCNN system was not as good as the other two systems. there was no preference. Here, the HRNN-WB system was used as the reference system. The CCR scores of +1, -1, The HRNN system achieved the best performance on both and 0 denoted that the wideband utterance reconstructed by accuracy and PESQ score. For SNR, the HRNN system and the evaluated system, i.e., the HRNN-HF system, sounded the DCNN system achieved the best performance on voiced better than, worse than, or equal to the sample generated by segments and unvoiced segments respectively. For LSD, the the reference system in each pair. We calculated the average HRNN system achieved the lowest overall LSD and the lowest CCR score and its 95% confidence interval through all pairs of LSD of unvoiced segments. On the other hand, the DCNN utterances listened by all listeners. 
Besides, one-sample t-test system achieved the lowest LSD of voiced frames among the was also conducted to judge whether there was a significant three systems. Considering that LSDs were calculated using difference between the average CCR score and 0 (i.e., to only amplitude spectra while SNRs were influenced by both judge whether there was a significant difference between two amplitude and phase spectra of the reconstructed waveforms, it systems) by examining the p-value. The results are shown as can be inferred that the HRNN system was better at restoring the first system pair in Fig. 7, which suggests that the HRNN- the phase spectra of voiced frames than the DCNN system HF system outperformed the HRNN-WB system significantly. according to the SNR-V and LSD-V results of these two This is consistent with the results of comparing these two systems shown in Table II. In terms of the efficiency, the strategies when dilated CNNs were used to model waveforms generation time of the SRNN system was more than 5 times for the BWE task [35]. Therefore, the HF strategy was adopted longer than that of the HRNN system due to the sample- in the following experiments for building waveform-based by-sample calculation at all layers in the SRNN structure as BWE systems. discussed in Section III-A. Also, the efficiency of the DCNN system was slightly worse than that of the HRNN system. The results reveal that HRNNs can help improve both the accuracy D. Model Comparison for Waveform-Based BWE and efficiency of SRNNs by modeling long-span dependencies among sequences using a hierarchical structure. The performance of three waveform-based BWE systems, i.e., the DCNN, SRNN and HRNN systems, were compared The spectrograms extracted from clean wideband speech by objective and subjective evaluations. 
The accuracy and and the output of BWE using the DCNN, SRNN and HRNN efficiency metrics used in Section IV-B and the PESQ score systems for an example sentence in the test set are shown A JOURNAL OF LT X CLASS FILES, VOL. 6, NO. 1, JANUARY 2007 9 TABLE III O BJECTIVE PERFORMANCE OF THE HRNN AND CHRNN SYSTEMS ON THE TEST SET TOGETHER WITH THE p VALUES OF PAIRED t-TESTS. HRNN CHRNN p-value Accuracy (%) 7.520.388 7.460.385 <0.001 PESQ score 3.750.0456 3.790.0394 <0.001 SNR (dB) 19.000.6099 18.990.5946 0.322 SNR-V (dB) 26.210.7716 26.130.7539 <0.001 SNR-U (dB) 10.260.4124 10.340.4097 <0.001 LSD (dB) 8.300.127 8.270.123 0.301 LSD-V (dB) 8.020.194 7.890.185 <0.001 LSD-U (dB) 8.570.107 8.660.103 <0.01 Generation time (s) 3.61 4.17 – features was 100 and the frame size at the top conditional tier (4) was L = 160 because the frame shift of BN features was 10ms, corresponding to 160 samples for 16kHz speech. The objective measurements used in Section IV-D were adopted here to compare the HRNN and CHRNN systems. The results are shown in Table III. The CHRNN system outperformed the HRNN system on PESQ score while its prediction accuracy was not as good as the HRNN system. For SNR, these two systems achieved similar performance. The results of LSD show that the CHRNN system was better at reconstructing voiced frames and the HRNN system was on Fig. 8. The spectrograms of clean wideband speech and the output of BWE using five systems for an example sentence in the test set. the contrary. In terms of efficiency, the generation time of the CHRNN system was higher than that of the HRNN system due to the extra conditional tier. in Fig. 8. 
It can be observed that the high-frequency energy A 3-point CCR test was also conducted to evaluate the of some unvoiced segments generated by the DCNN system subjective performance of the CHRNN system by using the was much weaker than that of the natural speech and the HRNN system as the reference system and following the outputs of the SRNN and HRNN systems. Compared with the evaluation configurations introduced in Section IV-C. The SRNN and HRNN systems, the DCNN system was better at results are shown as the fourth system pairs in Fig. 7, which reconstructing the high-frequency harmonic structures of some reveal that utilizing BN features as additional conditions in voiced segments. These observations are in line with the LSD HRNN-based BWE can improve the subjective quality of results discussed earlier. reconstructed wideband speech significantly. Fig. 8 also shows Furthermore, two 3-point CCR tests were carried out to the spectrogram of the wideband speech generated by the evaluate the subjective performance of the HRNN system CHRNN system for an example sentence. Comparing the by using the DCNN system and the SRNN system as the spectrograms produced by the HRNN system and the CHRNN reference system respectively. The configurations of the tests system, we can observe that the high-frequency components were the same as the ones introduced in Section IV-C. The generated by the CHRNN system were stronger than the results are shown as the second and third system pairs in HRNN system. This may lead to better speech quality as Fig. 7. We can see that our proposed HRNN-based method shown in Fig. 7. generated speech with significantly better quality than the dilated CNN-based method. Compared with the SRNN system, the HRNN system was slightly better while the superiority was F. Comparison between Waveform-Based and Vocoder-Based insignificant at 0.05 significance level. 
However, the HRNN BWE Methods system was much more efficient than the SRNN system at Finally, we compared the performance of vocoder-based generation time as shown in Table II. and waveform-based BWE methods by conducting objective and subjective evaluations between the VRNN system and E. Effects of Additional Conditions on HRNN-Based BWE the CHRNN system since both systems adopted BN features We compared the HRNN system with the CHRNN system as auxiliary input. The objective results including PESQ, by objective and subjective evaluations to explore the effects SNR and LSD are shown in Table IV. The CHRNN system of additional conditions on HRNN-based BWE. As introduced achieved significantly better SNR than that of the VRNN in Section IV-A, the BN features were used as additional system, which suggested that our proposed waveform-based conditions in the CHRNN system since they can provide method can restore the phase spectra more accurately than the linguistic-related information besides the acoustic waveforms. conventional vocoder-based method. For PESQ and LSD, the The CHRNN system adopted the conditional HRNN structure CHRNN system was not as good as the VRNN system. This introduced in Section III-C with 4 tiers. The dimension of BN is reasonable considering that the VRNN system modeled and A JOURNAL OF LT X CLASS FILES, VOL. 6, NO. 1, JANUARY 2007 10 TABLE IV predicting current output sample. The maximal latencies of the O BJECTIVE PERFORMANCE OF THE VRNN AND CHRNN SYSTEMS ON THE VRNN system and the CHRNN system were both determined TEST SET TOGETHER W ITH THE p VALUES OF PAIRED t- TESTS. by the window size of STFT for extracting LMS and MFCC parameters, which was 25 ms in our implementation. The VRNN CHRNN p value PESQ score 3.870.0368 3.790.0394 <0.001 maximal latencies of the other three systems depended on their SNR (dB) 17.760.6123 18.990.5946 <0.001 structures. 
The SRNN system processed input waveforms and SNR-V (dB) 25.000.7333 26.130.7539 <0.001 generate output waveforms sample-by-sample without latency SNR-U (dB) 9.010.424 10.340.4097 <0.001 LSD (dB) 6.690.110 8.270.123 <0.001 according to (3). Because the non-causal CNN structure shown LSD-V (dB) 6.860.148 7.890.185 <0.001 in Fig. 1 was adopted by the DCNN system and its receptive LSD-U (dB) 6.450.0972 8.660.103 <0.001 field length was about 64ms [35], it made the highest latency among the five systems. The latency of the HRNN system was TABLE V relatively short because the number of concatenated frames M AXIM AL LATENCIES (ms) OF THE FIVE BWE SYSTEMS. T HE SAMPLING (3) and the frame size of the top tier were small (c = 2 and RATE OF W IDEBAND WAVEFORMS IS f = 16kHz . (3) L = 16). Maximal Latency Remarks 2) Run-time efficiency of waveform-based BWE WS: window size in ms of STFT VRNN WS = 25 One deficiency of the waveform-based BWE methods is that for extracting spectral parameters. they are very time-consuming at generation time. As shown N=2 DCNN = 32 N + 1: length of receptive field. in Table II and Table III, the HRNN system achieved the best SRNN 0 None run-time efficiency among the four waveform-based systems, (3) (3) (3) (3) c L 1 c , L : number of concatenated HRNN = 1:9375 frames, frame size at Tier 3. which still took 3.61 seconds to generate 1 second speech WS: window size in ms of STFT in our current implementation. Therefore, to accelerate the CHRNN WS = 25 for extracting spectral parameters. computation of HRNNs is an important task of our future work. As shown in Fig. 6, using longer frame sizes may help reduce the computational complexity of HRNNs. Another predicted LMS directly which were used in the calculation of possible way is to reduce the number of hidden units and PESQ and LSD. 
A 3-point CCR test was also conducted to other model parameters similar to the attempt of accelerating evaluate the subjective performance of the CHRNN system by WaveNet for speech synthesis [39]. using the VRNN system as the reference system and following the evaluation configuratioins introduced in Section IV-C. The V. C ONCLUSION results are shown as the fifth system pairs in Fig. 7. We can see that the CCR score was high than 0 significantly which In this paper, we have proposed a novel waveform modeling indicates that the CHRNN system can achieve significantly and generation method using hierarchical recurrent neural higher quality of reconstructed wideband speech than the networks (HRNNs) to fulfill the speech bandwidth extension VRNN system. (BWE) task. HRNNs adopt a hierarchy of recurrent modules Comparing the spectrograms produced by the VRNN system to capture long-span dependencies between input and output and the CHRNN system in Fig. 8, it can be observed waveform sequences. Compared with the plain sample-level that the CHRNN system performed better than the VRNN RNN and the stacked dilated CNN, the proposed HRNN model achieves better accuracy and efficiency of predicting system in generating the high-frequency harmonics for voiced high-frequency waveform samples. Besides, additional con- sounds. Besides, the high-frequency components generated ditions, such as the bottleneck features (BN) extracted from by the CHRNN system were less over-smoothed and more narrowband speech, can further improve subjective quality of natural than that of the VRNN system at unvoiced segments. reconstructed wideband speech. The experimental results show Furthermore, there was a discontinuity between the low- that our proposed HRNN-based method achieves higher sub- frequency and high-frequency spectra of the speech generated jective preference scores than the conventional vocoder-based the VRNN system, which was also found in other vocoder- method using LSTM-RNNs. 
To evaluate the performance of based BWE method [26]. As shown in Fig. 8, the waveform- our proposed methods using practical band-limited speech based systems alleviated this discontinuity effectively. These data, to improve the efficiency of waveform generation using experimental results indicate the superiority of modeling and HRNNs, and to utilize other types of additional conditions will generating speech waveforms directly over utilizing vocoders for feature extraction and waveform reconstruction on the be the tasks of our future work. BWE task. REFERENCES G. Analysis and Discussion [1] K. Nakamura, K. Hashimoto, K. Oura, Y. Nankaku, and K. Tokuda, “A mel-cepstral analysis technique restoring high frequency components 1) Maximal latency of different BWE systems from low-sampling-rate speech,” in Proc. Interspeech, 2014. Some application scenarios have strict requirement on the [2] A. Albahri, C. S. Rodriguez, and M. Lech, “Artificial bandwidth extension to improve automatic emotion recognition from narrow-band latency of BWE algorithm. We compared the maximal latency coded speech,” in Proc. ICSPCS, 2016, pp. 1–7. of the five BWE systems listed in Section IV-A and the [3] M. M. Goodarzi, F. Almasganj, J. Kabudian, Y. Shekofteh, and results are shown in Table V. Here, the latency refers to I. S. Rezaei, “Feature bandwidth extension for Persian conversational the duration of future input samples that are necessary for telephone speech recognition,” in Proc. ICEE, 2012, pp. 1220–1223. A JOURNAL OF LT X CLASS FILES, VOL. 6, NO. 1, JANUARY 2007 11 [4] S. Chennoukh, A. Gerrits, G. Miet, and R. Sluijter, “Speech enhancement [29] J. Abel, M. Strake, and T. Fingscheidt, “Artificial bandwidth extension via frequency bandwidth extension using line spectral frequencies,” in using deep neural networks for spectral envelope estimation,” in Proc. Proc. ICASSP, vol. 1, 2001, pp. 665–668. IWAENC, 2016, pp. 1–5. [30] Y. Gu and Z.-H. 
Ling, “Restoring high frequency spectral envelopes [5] F. Mustiere, ` M. Bouchard, and M. Bolic, ´ “Bandwidth extension for using neural networks for speech bandwidth extension,” in Proc. IJCNN, speech enhancement,” in Proc. CCECE, 2010, pp. 1–4. 2015, pp. 1–8. [6] J. Makhoul and M. Berouti, “High-frequency regeneration in speech [31] Y. Wang, S. Zhao, J. Li, and J. Kuang, “Speech bandwidth extension coding systems,” in Proc. ICASSP, vol. 4, 1979, pp. 428–431. using recurrent temporal restricted Boltzmann machines,” IEEE Signal [7] S. Vaseghi, E. Zavarehei, and Q. Yan, “Speech bandwidth extension: Processing Letters, vol. 23, no. 12, pp. 1877–1881, 2016. extrapolations of spectral envelop and harmonicity quality of excitation,” [32] Y. Gu, Z.-H. Ling, and L.-R. Dai, “Speech bandwidth extension using in Proc. ICASSP, vol. 3, 2006, pp. III–III. bottleneck features and deep recurrent neural networks.” in Proc. [8] H. Pulakka, U. Remes, K. Palomaki, ¨ M. Kurimo, and P. Alku, “Speech Interspeech, 2016, pp. 297–301. bandwidth extension using Gaussian mixture model-based estimation of [33] A. v. d. Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, the highband mel spectrum,” in Proc. ICASSP, 2011, pp. 5100–5103. A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “WaveNet: [9] Y. Wang, S. Zhao, Y. Yu, and J. Kuang, “Speech bandwidth extension A generative model for raw audio,” arXiv preprint arXiv:1609.03499, based on GMM and clustering method,” in Proc. CSNT, 2015, pp. 437– [34] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, J. Sotelo, [10] Y. Ohtani, M. Tamura, M. Morita, and M. Akamine, “GMM-based A. Courville, and Y. Bengio, “SampleRNN: An unconditional end-to- bandwidth extension using sub-band basis spectrum model.” in Proc. end neural audio generation model,” arXiv preprint arXiv:1612.07837, Interspeech, 2014, pp. 2489–2493. [11] Y. Zhang and R. Hu, “Speech wideband extension based on Gaussian [35] Y. Gu and Z.-H. 
Ling, “Waveform modeling using stacked dilated mixture model,” Chinese Journal of Acoustics, no. 4, pp. 363–377, 2009. convolutional neural networks for speech bandwidth extension,” in Proc. [12] G.-B. Song and P. Martynovich, “A study of HMM-based bandwidth Interspeech, 2017, pp. 1123–1127. extension of speech signals,” Signal Processing, vol. 89, no. 10, pp. [36] D. Yu and M. L. Seltzer, “Improved bottleneck features using pretrained 2036–2044, 2009. deep neural networks.” in Proc. Interspeech, 2011, pp. 237–240. [13] Z. Yong and L. Yi, “Bandwidth extension of narrowband speech based [37] Z. Wu, C. Valentini-Botinhao, O. Watts, and S. King, “Deep neural on hidden Markov model,” in Proc. ICALIP, 2014, pp. 372–376. networks employing multi-task learning and stacked bottleneck features [14] P. Bauer and T. Fingscheidt, “An HMM-based artificial bandwidth for speech synthesis,” in Proc. ICASSP, 2015, pp. 4460–4464. extension evaluated by cross-language training and test,” in Proc. [38] J. B. Allen and L. R. Rabiner, “A unified approach to short-time Fourier ICASSP, 2008, pp. 4589–4592. analysis and synthesis,” Proceedings of the IEEE, vol. 65, no. 11, pp. [15] G. Chen and V. Parsa, “HMM-based frequency bandwidth extension for 1558–1564, 1977. speech enhancement using line spectral frequencies,” in Proc. ICASSP, [39] S. O. Arik, M. Chrzanowski, A. Coates, G. Diamos, A. Gibiansky, vol. 1, 2004, pp. I–709. Y. Kang, X. Li, J. Miller, J. Raiman, S. Sengupta et al., “Deep voice: [16] Z.-H. Ling, S.-Y. Kang, H. Zen, A. Senior, M. Schuster, X.-J. Qian, Real-time neural text-to-speech,” arXiv preprint arXiv:1702.07825, H. M. Meng, and L. Deng, “Deep learning for acoustic modeling 2017. in parametric speech generation: A systematic review of existing [40] A. Tamamori, T. Hayashi, K. Kobayashi, K. Takeda, and T. Toda, techniques and future trends,” IEEE Signal Processing Magazine, “Speaker-dependent WaveNet vocoder.” in Proc. Interspeech, 2017, pp. vol. 32, no. 3, pp. 
35–52, 2015. 1118–1122. [17] Z.-H. Ling, L. Deng, and D. Yu, “Modeling spectral envelopes using [41] Y.-J. Hu, C. Ding, L.-J. Liu, Z.-H. Ling, and L.-R. Dai, “The USTC restricted Boltzmann machines and deep belief networks for statistical system for blizzard challenge 2017.” in Proc. Blizzard Challenge parametric speech synthesis,” IEEE Transactions on Audio, Speech, and Workshop, 2017. Language Processing, vol. 21, no. 10, pp. 2129–2139, 2013. [42] I. Recommendation, “G. 711: Pulse code modulation (PCM) of voice frequencies,” International Telecommunication Union, 1988. [18] H. Zen, A. Senior, and M. Schuster, “Statistical parametric speech [43] Y. Fan, Y. Qian, F.-L. Xie, and F. K. Soong, “TTS synthesis synthesis using deep neural networks,” in Proc. ICASSP, 2013, pp. 7962– with bidirectional LSTM based recurrent neural networks.” in Proc. Interspeech, 2014, pp. 1964–1968. [19] L.-H. Chen, Z.-H. Ling, L.-J. Liu, and L.-R. Dai, “Voice conversion [44] J. Sotelo, S. Mehri, K. Kumar, J. F. Santos, K. Kastner, A. Courville, using deep neural networks with layer-wise generative training,” and Y. Bengio, “Char2wav: End-to-end speech synthesis,” in Proc. ICLR IEEE/ACM Transactions on Audio, Speech and Language Processing, Workshop Track, 2017. vol. 22, no. 12, pp. 1859–1872, 2014. [45] J. S. Garofolo, L. F. Lamel, W. M. Fisher, J. G. Fiscus, and D. S. Pallett, [20] T. Nakashika, R. Takashima, T. Takiguchi, and Y. Ariki, “Voice “DARPA TIMIT acoustic-phonetic continous speech corpus CD-ROM. conversion in high-order eigen space using deep belief nets.” in Proc. NIST speech disc 1-1.1,” NASA STI/Recon technical report n, vol. 93, Interspeech, 2013, pp. 369–372. [21] X. Lu, Y. Tsao, S. Matsuda, and C. Hori, “Speech enhancement based on [46] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” deep denoising autoencoder.” in Proc. Interspeech, 2013, pp. 436–440. arXiv preprint arXiv:1412.6980, 2014. [22] Y. Xu, J. Du, L.-R. Dai, and C.-H. 
Lee, “A regression approach to speech [47] I. Recommendation, “P. 862.2: Wideband extension to recommendation enhancement based on deep neural networks,” IEEE/ACM Transactions P. 862 for the assessment of wideband telephone networks and speech on Audio, Speech and Language Processing, vol. 23, no. 1, pp. 7–19, codecs,” International Telecommunication Union, 2007. [48] A. O. Watson, “Assessing the quality of audio and video components [23] C. V. Botinhao, B. S. Carlos, L. P. Caloba, and M. R. Petraglia, in desktop multimedia conferencing,” Ph.D. dissertation, University of “Frequency extension of telephone narrowband speech signal using London, 2001. neural networks,” in Proc. CESA, vol. 2, 2006, pp. 1576–1579. [49] S. Buchholz and J. Latorre, “Crowdsourcing preference tests, and how [24] J. Kontio, L. Laaksonen, and P. Alku, “Neural network-based artificial to detect cheating,” in Proc. Interspeech, 2011, pp. 1118–1122. bandwidth expansion of speech,” IEEE transactions on audio, speech, and language processing, vol. 15, no. 3, pp. 873–881, 2007. [25] H. Pulakka and P. Alku, “Bandwidth extension of telephone speech using a neural network and a filter bank implementation for highband mel spectrum,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 19, no. 7, pp. 2170–2183, 2011. [26] K. Li and C.-H. Lee, “A deep neural network approach to speech bandwidth expansion,” in Proc. ICASSP, 2015, pp. 4395–4399. [27] B. Liu, J. Tao, Z. Wen, Y. Li, and D. Bukhari, “A novel method of artificial bandwidth extension using deep architecture.” in Proc. Interspeech, 2015, pp. 2598–2602. [28] Y. Wang, S. Zhao, W. Liu, M. Li, and J. Kuang, “Speech bandwidth expansion based on deep neural networks.” in Proc. Interspeech, 2015, pp. 2593–2597.
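The decoding step of the extension stage (selecting the quantization level with maximum posterior probability for each output sample, then applying the inverse mapping of μ-law quantization) can be illustrated with a minimal sketch. It assumes the continuous μ-law companding formula with 256 levels (μ = 255) and waveform samples normalized to [-1, 1]; the level count, the function names, and the array layout are assumptions made for illustration and are not taken from the authors' implementation.

```python
import numpy as np

MU = 255  # assumption: 8-bit mu-law, i.e. 256 quantization levels


def mu_law_encode(x, mu=MU):
    """Map samples in [-1, 1] to integer quantization levels in [0, mu]."""
    x = np.clip(np.asarray(x, dtype=np.float64), -1.0, 1.0)
    # Logarithmic compression, still in [-1, 1].
    compressed = np.sign(x) * np.log1p(mu * np.abs(x)) / np.log1p(mu)
    # Shift to [0, mu] and round to the nearest level.
    return np.floor((compressed + 1.0) / 2.0 * mu + 0.5).astype(np.int64)


def mu_law_decode(levels, mu=MU):
    """Inverse mapping: integer levels in [0, mu] back to samples in [-1, 1]."""
    levels = np.asarray(levels, dtype=np.float64)
    compressed = 2.0 * levels / mu - 1.0
    return np.sign(compressed) * np.expm1(np.abs(compressed) * np.log1p(mu)) / mu


def decode_posteriors(posteriors, mu=MU):
    """Pick the maximum-posterior level per sample, then invert mu-law.

    posteriors: hypothetical array of shape (num_samples, mu + 1) holding the
    predicted probability of each quantization level for each output sample.
    """
    levels = np.argmax(posteriors, axis=-1)
    return mu_law_decode(levels, mu)
```

In this sketch the rounding in `mu_law_encode` places each level at the centre of its quantization cell, so encoding a decoded level returns the same level. For the HF strategy, the decoded samples would still need the deamplification and high-pass filtering described in the text, which are not shown here.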

Journal

Electrical Engineering and Systems Science, arXiv (Cornell University)

Published: Jan 24, 2018
