Get 20M+ Full-Text Papers For Less Than $1.50/day. Subscribe now for You or Your Team.

Learn More →

Sequence-to-Sequence Acoustic Modeling for Voice Conversion

Sequence-to-Sequence Acoustic Modeling for Voice Conversion PREPRINT MANUSCRIPT OF IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING c 2018 IEEE 1 Sequence-to-Sequence Acoustic Modeling for Voice Conversion Jing-Xuan Zhang, Zhen-Hua Ling, Member, IEEE, Li-Juan Liu, Yuan-Jiang, and Li-Rong Dai Abstract—In this paper, a neural network named Sequence-to- can be a joint density Gaussian mixture model (JD-GMM) [3], sequence ConvErsion NeTwork (SCENT) is presented for acoustic [6] or a deep neural network (DNN) [7], [8], both of which are modeling in voice conversion. At training stage, a SCENT model universal function approximators [9], [10]. At the conversion is estimated by aligning the feature sequences of source and target stage, a mapping function is derived from the built acoustic speakers implicitly using attention mechanism. At conversion model that converts the acoustic features of source speaker stage, acoustic features and durations of source utterances are converted simultaneously using the unified acoustic model. Mel- into those of target speaker. Finally, waveforms are recovered scale spectrograms are adopted as acoustic features which contain from the converted acoustic features using a vocoder. both excitation and vocal tract descriptions of speech signals. This conventional pipeline for voice conversion has its The bottleneck features extracted from source speech using limitations. First, most previous work focused on the conver- an automatic speech recognition (ASR) model are appended sion of spectral features and simply adjusted F trajectories as auxiliary input. A WaveNet vocoder conditioned on Mel- 0 spectrograms is built to reconstruct waveforms from the outputs linearly in the logarithm domain [7], [8], [11]–[15]. Besides, of the SCENT model. It is worth noting that our proposed the durations of converted utterances were kept the same as method can achieve appropriate duration conversion which is the ones of source utterances since the acoustic models were difficult in conventional methods. Experimental results show that built on a frame-by-frame basis. However, the production of our proposed method obtained better objective and subjective human speech is a highly dynamic process and the frame-by- performance than the baseline methods using Gaussian mixture models (GMM) and deep neural networks (DNN) as acoustic frame assumption constrains the modeling ability of mapping models. This proposed method also outperformed our previous functions [16]. work which achieved the top rank in Voice Conversion Challenge This paper proposes an acoustic modeling method for 2018. Ablation tests further confirmed the effectiveness of several voice conversion based on the sequence-to-sequence neural components in our proposed method. network framework [17], [18]. A Sequence-to-sequence Con- Index Terms—voice conversion, sequence-to-sequence, atten- vErsion NeTwork (SCENT) is designed to directly describe the tion, Mel-spectrogram. conditional probabilities of target acoustic feature sequences given source ones without explicit frame-to-frame alignment. I. INTRODUCTION The SCENT model follows the widely-used architecture of encoder-decoder with attention [19], [20]. The encoder net- OICE conversion aims to modify the speech signal of a work first transforms the input feature sequences into hidden source speaker to make it sound like being uttered by a representations which are suitable for the decoder to deal target speaker, while keeping the linguistic contents unchanged with. At each decoder time step, the attention module selects [1], [2]. The potential applications of this technique include encoder outputs softly by attention probabilities and produces entertainment, personalized text-to-speech, and so on [3], [4]. a context vector. Then, the decoder predicts output acoustic Building statistical acoustic models for feature mapping is features frame by frame using context vectors. Furthermore, a a popular approach to voice conversion nowadays. At the post-filtering network is designed to enhance the accuracy of training stage of the conventional voice conversion pipeline, the converted acoustic features. Finally, a speaker-dependent acoustic features are first extracted from the waveforms of WaveNet is utilized to recover time-domain waveforms from source and target utterances. Then, the features of parallel the predicted sequences of acoustic features. utterances are aligned frame by frame using alignment algo- In our proposed method, Mel-scale spectrograms are rithms, such as dynamic time wrapping (DTW) [5]. Next, an adopted as acoustic features, which do not rely on the acoustic model for conversion is trained using the acoustic source-filter assumption of speech production. Therefore, features of paired source-target frames. The acoustic model F and spectral features are converted jointly in a single This work was supported by National Key R&D Program of China (Grant model. Additional bottleneck features derived using an No. 2017YFB1002202), the National Nature Science Foundation of China automatic speech recognition (ASR) model are appended to (Grant No. 61871358) and the Key Science and Technology Project of Anhui Province (Grant No. 18030901016). the source Mel-spectrograms, which are expected to improve J.-X. Zhang, Z.-H. Ling and L.-R. Dai are with the National Engineering the pronunciation correctness of the converted speech. Laboratory for Speech and Language Information Processing, University Attention module learns the soft alignments between the of Science and Technology of China, Hefei, 230027, China (e-mail: nosisi@mail.ustc.edu.cn, zhling@ustc.edu.cn, lrdai@ustc.edu.cn). L.-J. Liu pairs of source-target feature sequence implicitly. Facilitated and Y. Jiang are with the iFLYTEK Co., Ltd., Hefei, 230088, China (e-mail: by attention module, our proposed method is capable of ljliu@iflytek.com, yuanjiang@iflytek.com). predicting target acoustic sequences with durations different This work was conducted when J.-X. Zhang was an intern at iFLYTEK Research. from source ones at conversion stage. arXiv:1810.06865v5 [cs.SD] 12 Jan 2020 PREPRINT MANUSCRIPT OF IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING c 2018 IEEE 2 Experimental results show that our proposed method verification task. Then, the built network was transferred to achieved better objective and subjective performance than a conditional Tacotron model [24] to generate speech for a the GMM-based and DNN-based baseline systems. This variety of speakers. Nachmani et al. [29] extended the Voice proposed method also outperformed our previous work which Loop model [30] to fit new voices by incorporating a fitting achieved the top rank in Voice Conversion Challenge 2018 network. Instead of using text as model input in these studies, [21]. It is worth noting that our proposed method can achieve we utilize a separate ASR model for extracting linguistic- appropriate duration conversion, which contributes to higher related features and the input of our model is only the speech similarity and is difficult in conventional methods. Ablation of source speakers. Also, instead of generating speech of studies were further conducted and the results confirmed unseen speakers, we focus on voice conversion for one pair of the effectiveness of several key components in our proposed speakers. It should be noticed that the techniques developed method. for voice cloning are potentially useful for extending our In this paper, we focus on one-to-one voice conversion, proposed method from one-to-one conversion to many-to- i.e., one model is trained for each speaker pair. It should many conversion, which will be a part of our future study. be noticed that our proposed method can also be adapted to other cases rather than one-to-one conversion. For example, C. Sequence-to-sequence learning for voice conversion the proposed method can be extended to multiple speaker pairs To the best of our knowledge, Ramos [31] made the by conditioning on codes of speaker identities, which can be first attempt to convert spectral features using a sequence-to- obtained from the outputs of a speaker encoder [22], [23]. sequence model with attention. However, as stated in Section The rest of this article is organized as follows. Section II 5.5 of Ramos’s thesis [31], the model was not capable of using reviews the related work on seq2seq modeling, voice cloning its own predictions to generate a real valued output prediction. and WaveNet vocoders. Section III introduces our proposed Kaneko et al. [32] proposed a CNN-based seq2seq spectral method for voice conversion. Details and results of experi- conversion method. Because of the lack of attention module ments are presented in Section IV. The article is concluded in in their method, the DTW algorithm was still utilized in order Section V. to obtain frame-level aligned feature sequences during training data preparation. Miyoshi et al. [33] proposed a method of II. RELATED WORK mapping context posterior probabilities using seq2seq models. A. Relationship with sequence-to-sequence learning for text- In their method, an RNN-based encoder-decoder converted the to-speech source posterior probability sequence to the target one for each Text-to-speech (TTS) methods based on seq2seq learning phone, and the phone durations of natural target speech were have emerged recently and attracted much attention [24]– necessary at conversion stage. [27]. Our work is inspired by the success of applying seq2seq Our work is most similar to Ramos’s one [31], where models to TTS. However, voice conversion is different from an utterance-level seq2seq with attention model is built for acoustic feature conversion. Different from previous methods, TTS in several aspects. First, the inputs of a voice conversion Mel-spectrograms are adopted as acoustic features in our model are frame-level acoustic features rather than phone- method. Thus, F and spectral features are transformed jointly. level or character-level linguistic features. Typically, linguistic Our method has the ability of modeling pairs of input and features are discrete, while acoustic features are continuous. In output utterance without dependency on DTW alignment. addition to linguistic information, acoustic features also con- During conversion, the durations of generated target acoustic tain speaker identity information which should be processed sequences are determined automatically and the probability of during voice conversion. Second, the input-output alignment completion is predicted at each decoder time step. in voice conversion task is different from that in TTS. Speech generation in TTS is a decompressing process and the alignment between text and acoustic frames is usually a one- D. Voice conversion using WaveNet to-many mapping. While the alignment can be either one-to- WaveNet [34], as a neural network-based waveform gener- many or many-to-one in voice conversion, depending on the ation model, has been successfully applied to TTS and voice characteristics of speaker pairs and the dynamic characteristics conversion areas [21], [35], [36]. Studies have shown that of acoustic sequences. Third, the training data available for WaveNet vocoders outperformed traditional vocoders such as voice conversion is typically smaller than that for TTS. WORLD [37] and STRAIGHT [38] in terms of the quality of reconstructed speech [21], [39], [40]. Voice conversion B. Relationship with voice cloning methods using WaveNet models have also been studied in Voice cloning is a task that learns the voice of unseen recent years. Kobayashi et al. [35] proposed a GMM-based speakers from a few speech samples for text-to-speech syn- voice conversion method with WaveNet-based waveform gen- thesis. Unlike voice conversion, voice cloning takes text as eration. Liu et al. [21] proposed building WaveNet vocoders model input. Arik et al. [23] evaluated two techniques of voice for voice conversion with limited data by model adaptation. cloning, i.e., speaker adaptation and speaker encoding, based Directly mapping source acoustic features into target speaker’s on Deep Voice 3 [26]. Jia et al. [28] proposed a transfer waveforms using WaveNet has also been proposed [36]. learning method for voice cloning. A speaker-discriminative In this paper, WaveNet vocoders are used to reconstruct embedding network was first trained to achieve a speaker the waveforms of target speakers. WaveNet vocoders accept PREPRINT MANUSCRIPT OF IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING c 2018 IEEE 3 Fig. 1. The conversion process of our proposed sequence-to-sequence voice conversion method. Mel-spectrograms as input conditions and are trained in a speaker-dependent way without using the adaptation technique described in [21]. III. PROPOSED METHOD A. Overall architecture Fig. 1 shows the diagram of our proposed method when converting an input utterance. The conversion process can be divided into two main stages. One is a Seq2seq ConvErsion NeTwork (SCENT) for acoustic feature prediction, the other is a WaveNet neural vocoder for waveform generation. Mel- spectrograms are adopted as acoustic features in this paper. Bottleneck features extracted by an ASR model from source speech are concatenated with acoustic features to form the input sequences of the SCENT model. The SCENT model converts input sequence into Mel-spectrograms of the target speaker. Then, the target speaker’s speech is synthesized by passing the predicted Mel-spectrograms through the WaveNet Fig. 2. The network structure of a SCENT model, where skip connections and vocoder. residual connections are ignored for clarity. The grey circles in the encoder represent LSTM units with layer normalization. T and T are the frame x h B. Feature extraction numbers of input sequence and hidden representations. The encoder in this figure has a downsampling rate M = 2. Therefore, we have T = 2T in this x h Mel-spectrograms are computed through a short-time figure. The auto-regressive inputs of the decoder are natural history contexts Fourier transform (STFT) on waveforms. The STFT at training time and are generated ones at conversion time. Single frame is predicted at each decoder time step (i.e., r = 1) in this figure. magnitudes are transformed to Mel-frequency scale using Mel-filterbanks followed by a logarithmic dynamic range compression. In order to extract bottleneck features, a left-to-right way, and a bi-directional post-filtering network recurrent neural network (RNN) based ASR model is trained which refines the generation results. Fig. 2 shows the network on a separate speech recognition dataset. For each input structure of a SCENT model. frame, bottleneck features, i.e., the activations of the last Let y = [y ; : : : ;y ] denote the output Mel-spectrogram hidden layer before the softmax output layer of the ASR 1 T sequence of the encoder-decoder network, where T is the model, are extracted. Such bottleneck features can provide y frame number of target speech. The encoder-decoder network additional linguistic-related descriptions which are expected models the mapping relationship between input and output fea- to benefit the conversion process. It should be noticed that ture sequences using conditional distributions of each output these bottleneck features are still automatically extracted frame y given previous output frames y = [y ; : : : ;y ] from the acoustic signals of source utterances and no text t <t 1 t1 and the input x as transcriptions are necessary. The Mel-spectrograms and bottleneck features at each frame are concatenated to form the input sequence x = [x ; : : : ;x ] of the SCENT model, 1 T p(yjx) = p(y jy ;x; W ; W ); (1) t <t enc dec where T is the frame number of source speech. t=1 C. Structure of SCENT where W and W are parameters of the encoder-decoder enc dec A SCENT model contains an encoder-decoder with attention network. As shown in Fig. 2, the encoder transforms the network which predicts acoustic feature in an uni-directional concatenated Mel-spectrograms and bottleneck features of PREPRINT MANUSCRIPT OF IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING c 2018 IEEE 4 source speech into a high-level and abstract representation the voice conversion task, we expect that the encoder network h = [h ; : : : ;h ] as should exclude speaker-dependent information of the source 1 T speech and extract hidden representation h which is high-level h = Encoder(x; W ): (2) enc and linguistic-related. Because one phone usually corresponds to tens of acoustic frames, it is reasonable to derive hidden T is the frame number of hidden representations and T < h h representation with lower sampling rate than the frame-level T because of the pyramid structure of encoder. The decoder input sequence. Furthermore, hidden representation with lower with attention mechanism utilizes h and produces a probability sampling rate makes the attention module easier to converge, distribution over output frames as since this leads to fewer encoding states for attention calcula- p(y jy ;x) = Decoder(y ;h; W ): (3) t <t <t dec tion at each decoding step. This pyramid structure also reduces the computational complexity by shortening the length of h The generation process of the decoder network is uni- for attention calculation and speeds up training and inference directional. In order to make use the bi-directional context significantly. information, a post-filtering network (i.e., PostNet) is further 2) Decoder with attention mechanism: The decoder is employed to enhance the accuracy of prediction. Let z = an auto-regressive RNN which predicts the output acoustic [z ; : : : ;z ] represent the PostNet output sequence, which 1 T features from the hidden representation h. Non-overlapping r is the final prediction of the SCENT model. In this paper, frames are predicted at each decoder step. This trick divides the frame rates of decoder outputs and PostNet outputs are the total decoding steps by r, which further reduces training the same, i.e. T = T . The distribution of feature sequence z z y and inference time [24]. In Fig. 2, the decoder is illustrated given the output of the encoder-decoder network y is modeled with r = 1 for clarity. The prediction of previous time step as y is first passed through a pre-processing network (i.e., t1 p(zjy) = PostNet(y; W ); (4) pos PreNet), which is a two-layer MLP with ReLU activation and where W denotes the parameters of the PostNet. pos dropout in our implementation. The MLP outputs are sent into Next, we will describe each part of SCENT in details. an LSTM layer with attention mechanism. A context vector c 1) Encoder: The encoder network is constructed based on is calculated at each decoder step using attention probabilities the pyramid bidirectional LSTM architecture [41], [42], which as processes the sequence with lower time resolution at higher c = h ; (8) layers. In a conventional deep bidirectional LSTM (BLSTM) t n n=1 architecture, the output at the n-th time step of the j-th layer is computed as 1 h where = [ ; :::; ] are attention probabilities, t is t t j j j1 decoder time step, and n is the index of encoder outputs. h = BLSTM(h ;h ): (5) n n1 n In our implementation, a hybrid attention mechanism is In a pyramid BLSTM (pBLSTM), the outputs at consecutive adopted which takes the alignment of previous decoder step steps of a lower layer are concatenated and fed into the (i.e., location-awareness) into account when computing the next layer to decrease the sampling rate of input sequence. attention probabilities. In order to extract location information, The general calculation of pBLSTM hidden units can be k filters with kernel size l are employed to convolve the kl formulated as alignment of previous time step. Let F 2 R represent the j j1 j1 convolution matrix, and q denote the query vector which is h = pBLSTM(h ; [h ; : : : ;h ]); (6) n n1 Mn Mn+M1 given by the output of attention LSTM. Then, the attention where M is ratio of downsampling. The technique of layer score e is computed as normalization [43] is applied to the encoder LSTM cells. f = F  ; (9) t t1 Then, a location code l = [l (0); : : : ; l (d 1)] [44] is n n n added to the top output layer of pBLSTMs to form the hidden n > > n e = q Wh + v tanh(Uf + b); (10) t t t representation h. Let d be the dimension of each h . The location code is composed of sine and cosine functions of where v, b, W and U are trainable parameters of the model. different frequencies as As we can see from Eq. (10), the calculation of the hybrid attention takes two parts into consideration. The first part of 2i/d l (2i) = sin(n 10000 ); (7) Eq. (10) measures the relationship between the query vector 2i/d l (2i + 1) = cos(n 10000 ); and different entries of encoder outputs. The second part of where n is the time step in sequence h and i 2 [0; : : : ; d/21] Eq. (10) is computed based on the alignment of previous is the dimension index. The base 10000 in Eq. (7) follows the decoder step and provides a constraint on current t1 configuration in the original paper [44] which proposed the attention probabilities. The convolution matrix is employed to filter for extracting useful features as shown in Eq. (9). location code. This location code is useful since it gives the t1 The features are further integrated into the calculation of model explicit information of which portion of the sequence attention scores as shown in Eq. (10). is currently processed. The effectiveness of the location code will be demonstrated by ablation tests in our experiments. Furthermore, the forward attention method proposed in our The pyramid structure of our encoder network results in previous work [45] is adopted to stabilize the attention align- shorter hidden representation than original input sequence. For ment and speed up the convergence of attention alignment. In PREPRINT MANUSCRIPT OF IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING c 2018 IEEE 5 the forward attention method, the attention probability is likelihood (ML) criterion based on Gaussian mixture model calculated as (GMM). For GMM-ML, the network outputs are adopted to parameterize a GMM following the framework of mixture n n i density networks (MDN) [46], [47]. e^ = exp(e ) exp(e ) ; (11) t t t More specifically, the likelihood function in GMM-ML i=1 takes the form of a GMM as n n n n1 ^ = e^ ( + ); (12) t t t1 t1 X p(yjx; W ; W ) = w (x)N (y; (x); (x)); (16) enc dec i i i n n i = ^ ^ : (13) i=1 t t t i=1 where m is the number of mixture components, and For initialization, we have w (x),  (x) and  (x) correspond to the mixture weight, i i i mean vector and covariance matrix of the i-th Gaussian = 1; (14) component given x. Here, the covariance matrices are = 0; for n = 2; : : : ; T : set to be diagonal. The concatenation of c , q and the t t The motivation of forward attention is to follow the mono- outputs of decoding LSTMs are projected to a vector tonic nature of alignments in human speech generation [45]. (2d +1)m Mel o(x; W ; W ) 2 R , where d is the enc dec Mel Therefore, a forward variable which only takes the monotonic dimension of Mel-spectrograms and the whole vector can be alignment paths into consideration is designed. This forward divided into all mixture components as variable is derived from the original attention probabilities e^ (w) (w) and it can be computed recursively as Eq. (12). Then, the o(x; W ; W ) =[o (x); : : : ; o (x); enc dec 1 m normalized forward variables are used to replace original t () > () > (17) o (x) ; : : : ;o (x) ; 1 m attention probabilities e^ for summarizing the encoder outputs () > () > > as shown in Eq. (8). In addition, a location code is also added o (x) ; : : : ;o (x) ] : 1 m to the auto-regressive input of the decoder at each time step. Then, the GMM parameters in Eq. (16) can be derived from The context vector c and query vector q are concatenated the vector o(x; W ; W ) as enc dec and fed into a stack of two-layer decoding LSTMs. The concatenation of c, q and the outputs of decoding LSTMs (w) (w) are linearly projected to produce the Mel-spectrogram output w (x) = exp o (x) exp o (x) ; (18) i j of the decoder network. In parallel, the concatenation of c j=1 and q are linearly projected to a scalar and passed through a () sigmoid activation to predict the completion probability p , end  (x) = log exp (o (x)) + 1 ; (19) which indicates whether the converted sequence reaches the last frame. () (x) = o (x); (20) 3) Post-filtering network: The PostNet refines the Mel- spectrograms predicted by the decoder using bi-directional where  (x) is a vector composed of the diagonal elements context information. The PostNet is a convolutional neural of  (x). For GMM-ML, L is defined as the negative log- i dec network (CNN) with a residual connection from network input likelihood (NLL) function, i.e., to the final output. The first layer of the PostNet is composed of 1-D convolution filter banks in order to extract rich context L = log p(yjx; W ; W ): (21) dec enc dec information. The outputs of the convolution banks are stacked Under both MSE and GMM-ML criteria, natural acoustic together and further passed through a two-layer 1-D CNN. histories of target speech are sent into the decoder at training The outputs of the final layer are added to the input Mel- time. The MSE criterion is actually a special case of GMM- spectrograms to produce the final results. ML which uses single mixture with fixed unit variance and predicted mean vector [48]. Theoretically, GMM-ML D. Loss function of SCENT is more flexible since it models more general probability We train the SCENT model by multi-task learning and the distributions and the MSE criterion usually leads to over- total loss is the weighted sum of three sub-losses as smoothed prediction because of the averaging effect [46]. When applying the GMM-ML criterion to L , the mean dec L = w L + w L + w L ; (15) dec dec post post end end vector of the component with maximum prior probability is used to generate the output sample at both training and testing where w , w and w are the weights of the three com- dec post end stages. At training time, the gradients from the PostNet are ponents. L and L denote the losses of Mel-spectrogram dec post only back-propagated through the sampled mean vectors given prediction given by the decoder and the PostNet respectively. L is a binary cross-entropy loss for evaluating the predicted by the decoder output layer. end completion probabilities. Only the MSE criterion is applied to L in our imple- post Two types of criteria are investigated for L . One is mentation. For calculating L , only the last decoder step of dec end the minimum square error (MSE) between the predicted and a natural target sequence is labelled as 1 (i.e., completed) and ground truth acoustic features. The other is the maximum the rest steps are labelled as 0 (i.e., incompleted). PREPRINT MANUSCRIPT OF IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING c 2018 IEEE 6 TABLE I ms with Hann windowing of 50 ms frame length and 1024- D ETAILS OF M ODEL CONFIGURATIONS. point Fourier transform. 512-dimensional bottleneck features were extracted using an ASR model every 40 ms and were pBLSTM, 2 layers and 256 cells LSTM Encoder then upsampled by repeating to match the frame rate of Mel- with layer normalization, M = 4 spectrograms. FC-256-ReLU-Dropout(0.5)! PreNet FC-256-ReLU-Dropout(0.5) The speaker-independent ASR model was trained using Attention LSTM, 1 layer and 256 cells; internal datasets of iFLYTEK company, which contained k = 10 and l = 32 for F in Eq. (9); recordings of about 10,000 hours for Mandarin and recordings SCENT Decoder v in Eq. (10) has dimension of 256; of about 3,000 hours for English. Our ASR model was Decoder LSTM, 2 layers and 256 cells an LSTM-HMM-based one. The LSTM was bidirectional Conv1D banks, k = [1; : : : ; 8], with 6 hidden layers and 1024 units in each direction. The Conv1D-k-256-BN-ReLU-Dropout(0.2)! PostNet classification targets of the LSTM model were clustered Conv1D-3-256-BN-ReLU-Dropout(0.2)! triphones, i.e., senones. For the Mandarin dataset, the phoneme Conv1D-3-256-BN-ReLU-Dropout(0.2) set included 26 initials and 140 tonal finals. We evaluated the 4 layers Conv1D-3-100-PReLU ConditionNet performance of the ASR model on the parallel dataset for with dilation d = [1; 2; 4; 8] WaveNet 30 layers dilated convolution layers voice conversion. The frame classification accuracies for the vocoder k mod 10 WaveNet with dilation d = 2 for female and male speakers were 72.3% and 78.4% respectively. k = [0; : : : ; 29]; 1024 softmax output For the English dataset, there were 62 phonemes and the frame FC represents fully connected. BN represents for batch normalization. classification accuracies for the female and male speakers were Conv1D-k-n represents 1-D convolution with kernel size k and channel 76.4% and 75.9% respectively. n. The details of our model configurations are listed in TA- BLE I. In our implementation, two frames were predicted at one decoding step (i.e., r = 2) and only the last frame was fed E. WaveNet-based vocoder back into the PreNet for the generation at next step. In the loss As shown in Fig. 1, a WaveNet-based vocoder is adopted to function for training the SCENT model, w was heuristically dec reconstruct time-domain waveforms given the predicted Mel- set as 1.0 or 0.01 if MSE or GMM-ML training criterion was spectrogram features. adopted for L . w and w were heuristically set as dec post end In our WaveNet model, the Mel-spectrogram features are 1.0 and 0.005 respectively. Zoneout [51] with probability of first passed through a ConditionNet consisting of stack of di- 0.2 were used at LSTM layers for regularization. Residual lated 1-D convolution layers with parametric ReLU activation connections were adopted for the LSTM layers of encoder (PReLU) [49]. The outputs of ConditionNet are upsampled to and decoder to speed up model convergence. We used Adam be consistent with the sampling rate of waveforms by simply [52] optimizer with learning rate of 10 for the first 20 repeating. Then, the sequence of condition vectors are fed epochs. After 50 epochs, the learning rate was exponentially into each dilated convolution block of the WaveNet to control decayed by 0.95 for each epoch. L regularization with weight the waveform generation. Our WaveNet model is trained only 10 was also applied. The batch size was 4. For WaveNet using the target speech data for building the SCENT model training, the -law companded waveforms were quantized and the adaptation technique [21] is not used in this paper. into 10 bits, i.e., 1024 levels. A speaker-dependent WaveNet vocoder was trained using each speaker’s waveforms with IV. E XPERIM ENTS random initialization and a learning rate of 10 until the loss converge. A. Experimental conditions Three kind of baseline methods were adopted for compar- Two datasets were used in our experiments. The first one ison in our experiments. 41-dimensional Mel-cepstral coeffi- contained 1060 parallel Mandarin Chinese utterances from cients (MCCs), 1-dimensional fundamental frequency (F ) and one male speaker (about 53 mins) and one female speaker 5-dimensional band aperiodicities (BAPs) were extracted every (about 72 mins). This dataset was separated into a training 5 ms by STRAIGHT [38] as acoustic features in our baseline set with 1000 utterances, a validation set with 30 utterances systems. The descriptions of these methods are as follows . and a test set with 30 utterances. For the second dataset, JD-GMM: Gaussian mixture models with full-covariance speech data of one male (rms, about 62 mins) and one female matrices were utilized for modeling the joint spectral (slt, about 52 mins) from the CMU ARCTIC database [50] feature vectors of source and target speakers. For each was adopted. This dataset contained 1132 parallel English speaker, static and delta spectral features were used. utterances, which were separated into a training set with 1000 The number of mixtures m was tuned on validation utterances, a validation set with 66 utterances and a test set set with m 2 [16; 32; 48; 64]. Maximum likelihood with 66 utterances. Our analytical experiments in Section IV-B parameters generation (MLPG) with global variance (GV) and Section IV-D only adopted the Mandarin dataset, and enhancement were used for spectral parameter generation. the main objective and subjective evaluations in Section IV-C F was converted by Gaussian normalization in the adopted both datasets. logarithm domain [53]. BAPs were not converted but The recordings of both dataset were sampled at 16kHz. 80- dimensional Mel-scale spectrograms were extracted every 10 Samples of audio are available at https://jxzhanggg.github.io/Seq2SeqVC. PREPRINT MANUSCRIPT OF IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING c 2018 IEEE 7 1.0 TABLE II O BJECTIVE EVALUATION RESULTS OF USING DIFFERENT LOSS FUNCTIONS FOR THE DECODER ON VALIDATION SET. 0.8 Female-to-Male Male-to-Female Settings MCD F RMSE MCD F RMSE 0 0 0.6 (dB) (Hz) (dB) (Hz) MSE 3.397 42.122 3.658 33.420 0.4 MX2 3.365 38.123 3.649 32.271 MX4 3.384 38.629 3.651 34.748 MX6 3.376 38.804 3.669 35.337 0.2 MX8 3.418 39.230 3.637 33.029 “MX2”, “MX4”, “MX6” and “MX8” represent using ML criterion with 0.0 2, 4, 6 and 8 GMM mixture components respectively. 0 50 100 150 200 250 decoder steps Fig. 3. Visualization of the attention alignment and the DTW path of an directly copied from the source, since previous research utterance pair in the validation set. The heat map shows the alignment showed that converting aperiodic component did not probabilities calculated by the attention module in our seq2seq model. The red dashed line shows the alignment path given by DTW, which is downsampled make a statistically significant difference to the quality to match the sample rates of encoder states and decoder time steps. of converted speech [54]. Waveforms were reconstructed by STRAIGHT vocoder from the converted acoustic features. objective performance of these loss functions by experiments DNN: The DNN-based voice conversion models were on both female-to-male and male-to-female conversions using implemented based on Merlin toolkit [55]. The static, the Mandarin dataset. delta and acceleration components of MCCs, F and The Mel-cepstral distortion (MCD) and root mean square BAPs were transformed jointly using a DNN. In addition error (RMSE) of F on validation set were adopted as metrics. to use the acoustic features of the source speaker as Because Mel-spectrograms were adopted as acoustic features, model input, we also concatenated the input acoustic it’s not straightforward to extract F and MCCs features features with the bottleneck features used in our proposed from the converted acoustic features. Therefore, F and 25- method. This approach was named bn-DNN in the rest dimensional MCCs features were extracted by STRAIGHT of this paper. The ReLU activation function was used from the reconstructed waveforms for evaluation. Then the at DNN hidden units. A grid search using validation extracted features were aligned to those of the reference set was adopted in order to pick up the optimal depth utterances in the validation set in order to compute MCD and d and width w of the DNN with d 2 [3; 4; 5; 6] and F RMSE values. The F RMSE was calculated only using the 0 0 w 2 [512; 1024; 2048]. MLPG and GV techniques were frames which were both voiced in the converted and reference used for acoustic parameter generation. Waveform was utterances. reconstructed by STRAIGHT vocoder from the converted TABLE II summarizes the objective evaluation results on acoustic features. validation set. From the table, we can see that the model using VCC2018: This baseline method followed the framework the GMM-ML criterion with 2 mixture components achieves of our previous work [21], which achieved the top rank on the best performance on validation set among all settings naturalness and similarity in Voice Conversion Challenge except the MCD of male-to-female conversion. A further 2018. A speaker-dependent acoustic feature predictor was examination shows that using the GMM-ML criterion with trained by adapting a pre-trained speaker-independent mixture components more than 2 may lead to the instability model using the data of the target speaker. The predictor of attention alignment. Some cases of attention failures, such was an LSTM model which predicted MCCs, F and as getting stuck in one frame, can be observed for MX6 and BAPs of the target speaker from bottleneck features MX8. We tried to re-optimize the weighting factors in Eq. (15) frame-by-frame. At the training stage, bottleneck features for the MX6 and the MX8. The experimental results showed were extracted from the target speaker as model inputs. that changing the coefficients for models with more mixtures At the conversion stage, bottleneck features were obtained could slightly improve the alignment quality while the overall from the speech of the source speaker and were sent into performances of the models were still worse than the MX2 the acoustic feature predictor of the target speaker for model. One possible reason is that larger mixture numbers conversion. In this method, a speaker-dependent WaveNet may increase the number of parameters and the difficulty of vocoder conditioned on MCCs, F and BAPs features was model training. Thus, the GMM-ML criterion with 2 mixtures built for waveform reconstruction. was adopted for L in following experiments. dec The SCENT network models pairs of source and target B. Comparison between different decoder loss functions utterance directly. During training, alignments of utterance As introduced in Section III-D, either MSE or GMM-ML pairs are learned by attention module implicitly. An example criterion was applied to define the loss function L of of the alignment between an utterance pair using the SCENT dec the decoder output in our implementation. We evaluated the model is shown in Fig. 3, where each column denotes the encoder states PREPRINT MANUSCRIPT OF IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING c 2018 IEEE 8 TABLE III TABLE IV O BJECTIVE EVALUATION RESULTS OF BASELINE AND PROPOSED O BJECTIVE EVALUATION RESULTS OF BASELINE AND PROPOSED METHODS ON TEST SET OF M ANDARIN DATASET. METHODS ON TEST SET OF ENGLISH CMU ARCTIC DATASET. Female-to-Male Male-to-Female Female-to-Male Male-to-Female Methods Methods MCD F RMSE MCD F RMSE MCD F RMSE MCD F RMSE 0 0 0 0 (dB) (Hz) (dB) (Hz) (dB) (Hz) (dB) (Hz) JD-GMM 3.892 55.241 4.307 46.625 JD-GMM 3.176 16.473 3.278 16.418 i-JD-GMM 3.936 55.939 4.328 48.286 i-JD-GMM 3.187 14.834 3.274 16.343 DNN 3.688 44.087 4.335 39.190 DNN 3.200 13.998 3.270 14.118 i-DNN 3.750 44.268 4.245 39.877 i-DNN 3.271 14.531 3.296 14.050 bn-DNN 3.618 42.385 4.078 35.883 bn-DNN 3.167 12.675 3.100 13.070 i-bn-DNN 3.725 42.961 4.088 35.019 i-bn-DNN 3.141 11.969 3.182 13.098 VCC2018 3.802 56.874 4.210 39.196 VCC2018 3.384 11.116 3.668 13.707 i-VCC2018 3.854 53.350 4.225 41.257 i-VCC2018 3.354 11.455 3.663 12.631 Proposed 3.556 41.748 3.802 33.374 Proposed 3.212 9.899 3.383 11.704 “i” represents the interpolation of source features for duration “i” represents the interpolation of source features for duration compensation. “bn” denotes appending bottleneck features as input. compensation. “bn” denotes appending bottleneck features as input. TABLE V attention probabilities corresponding to different encoder states THE AVERAGE ABSOLUTE DIFFERENCES BETWEEN THE DURATIONS OF for one decoder step. The DTW algorithm was also conducted THE CONVERTED AND TARGET UTTERANCES (DDUR) ON TEST SET. based on the input and output Mel-spectrogram sequences Conversion Baseline i-Baseline Proposed and the resulting path was plotted as the red dashed line Pairs (second) (second) (second) for comparison. From this figure, we can see that these two F-M (MA) 1.147 0.276 0.194 M-F (MA) 1.157 0.380 0.260 alignments matched well. Comparing with the DTW path F-M (EN) 0.560 0.282 0.227 which denotes hard and deterministic alignment, the attention M-F (EN) 0.556 0.240 0.147 alignment is soft and changes smoothly along consecutive “F-M” and “M-F” represent female-to-male and male-to-female decoder time steps. conversions. “MA” and “EN” represent the Mandarin and English dataset respectively. C. Comparison between baseline and proposed methods without interpolation. Appending bottleneck features as inputs 1) Objective evaluation: Objective evaluations were first was beneficial for improving the objective performance of the carried out to compare the MCD and F RMSE performance DNN-based method. Our proposed method outperformed all of our proposed method and the baseline methods introduced baseline methods, which obtained the lowest MCD and F in Section IV-A, including JD-GMM, DNN, bn-DNN and RMSE. VCC2018. In order to compensate the duration differences TABLE IV shows the results evaluated on the English between source and target speakers, we also tried to linearly CMU ARCTIC dataset. The proposed method achieved best interpolate the source feature sequences before sending them performance on F RMSE, while its performance on MCD into the conversion models according to the average ratio was not as good as some baseline methods. Considering between the training set durations of the two speakers. We that the MCD measurement may be inconsistent with human only interpolated the static part of the source features and the perception [6], [12], [56], some subjective evaluations were dynamic features were recalculated based on the interpolated further conducted and will be introduced later. static features. This led to four additional methods, named i-JD-GMM, i-DNN, i-bn-DNN and i-VCC2018, in our evalu- One advantage of our proposed method is that it can ations. The MCDs and F RMSEs were calculated following convert the duration of source speech using an unified acoustic the way introduced in Section IV-B. For fair comparison, model. In order to investigate the performance of duration F and MCCs were re-extracted by STRAIGHT from the conversion, the scatter diagrams of test utterance durations are converted waveforms for all methods when computing MCDs drawn in Fig. 4 and Fig. 5 for female-to-male and male-to- and F RMSEs. female conversions using the Mandarin dataset. For each test The proposed and baseline methods were evaluated on both utterance, the durations of speech converted using different the Mandarin dataset and the English CMU ARCTIC dataset. baseline methods were the same, i.e., the duration of the When using the English CMU ARCTIC dataset, the same source speech. For the baseline methods with source feature procedure of tuning the decoder output layer as described in interpolation, the same global interpolation ratio was shared by all baseline methods. Therefore, “i-Baseline” and “Baseline” Section IV-B was conducted and the GMM output layer with in these two figures stand for all baseline methods with and 2 mixtures was also chosen. without interpolation respectively. TABLE III shows the objective evaluation results of baseline and proposed methods on test set of the Mandarin dataset. We From these figures, we can see that the male speaker had can see that the MCDs and F RMSEs of baseline methods higher speaking rate and shorter utterance durations than the with interpolation were close to or slightly worse than those female speaker in the Mandarin dataset. The simple linear PREPRINT MANUSCRIPT OF IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING c 2018 IEEE 9 TABLE VI Baseline M EAN OPINION SCORES (MOS) WITH 95% CONFIDENCE INTERVALS ON i-Baseline NATURALNESS AND SIMILARITY OF BASELINE AND PROPOSED M ETHODS. Proposed Conversion i-JD-GMM i-bn-DNN i-VCC2018 Proposed Pairs N 2:08 0:16 2:09 0:12 3:29 0:10 3.70 0:09 F-M (MA) S 1:86 0:13 1:97 0:11 2:55 0:11 3.66 0:09 N 1:62 0:11 1:78 0:12 3:37 0:10 3.68 0:11 M-F (MA) S 1:55 0:09 1:82 0:11 2:29 0:11 3.80 0:09 N 2:90 0:13 2:97 0:13 3:70 0:10 3.93 0:10 F-M (EN) S 3:03 0:12 3:11 0:11 3:84 0:09 4.10 0:08 N 2:30 0:12 2:14 0:11 3:72 0:10 4.10 0:09 M-F (EN) S 2:58 0:11 2:49 0:11 3:70 0:10 4.05 0:09 “F-M” and “M-F” represent female-to-male and male-to-female conversions respectively. “MA” and “EN” represent the Mandarin and English dataset respectively. “N” and “S” denote naturalness and similarity. 0 1 2 3 4 5 6 7 Length of target utterances (second) utterances in the test set were randomly selected and converted Fig. 4. The scatter diagram of the durations of test utterances using our proposed method and three baseline methods, for female-to-male conversion using the Mandarin dataset. including i-JD-GMM, i-bn-DNN, and i-VCC2018. Baseline For the experiments conducted on the Mandarin dataset, i-Baseline ten native listeners participated in the evaluation. For the Proposed experiments conducted on the English CMU ARCTIC dataset, evaluations were conducted on the Amazon Mechanical Turk (AMT), a platform designed to facilitate crowdsourcing. At least twenty native English listeners took part in the evaluation. In both evaluations, the listeners were asked to use headphones and the samples were shown to them in random order. The listeners were asked to give a 5-scale opinion score (5: excellent, 4: good, 3: fair, 2: poor, 1: bad) on both similarity and naturalness for each converted utterance. The results of the subjective evaluations are presented in TABLE VI. From the table, we can see that the i-bn- 0 1 2 3 4 5 6 7 Length of target utterances (second) DNN method achieved similar naturalness and similarity to the i-JD-GMM method. This is consistent with previous Fig. 5. The scatter diagram of the durations of test utterances studies on DNN-based voice conversion methods [7], [8], for male-to-female conversion using the Mandarin dataset. [11]. It should be noticed that the i-bn-DNN method accepted additional bottleneck features as inputs, which may benefit the performance of this method. Compared with the i-bn- interpolation made the length of converted speech closer to DNN method, the i-VCC2018 method did not use acoustic the target. features as inputs. However, this method achieved the best Furthermore, the average absolute differences between the performance among the three baseline methods, especially on durations of the converted and target utterances (DDUR) are the naturalness of converted speech. One important reason is calculated using both Mandarin and English datasets and that the i-VCC2018 method adopted WaveNet vocoder instead are presented in TABLE V. Results show that our proposed of conventional STRAIGHT vocoder to reconstruct speech method can generate speech with lower duration errors than waveforms from the converted acoustic features. the baseline methods without duration modification or with Our proposed method outperformed the i-VCC2018 method global speaking rate compensation. on both naturalness and similarity, also on both Mandarin Fig. 6 plots the F contours and spectrograms of one test and English datasets. These experimental results proved the utterance converted using different methods and the natural effectiveness of our proposed method and the improvement target reference in the Mandarin dataset. From this figure, we brought by our proposed method was not limited to a specific can see that our proposed method can generate speech with language. One possible reason is that at the conversion stage more similar F contours to the natural reference than the other of the i-VCC2018 method, bottleneck features extracted from two baseline methods. Furthermore, our proposed method can source speech were fed to the acoustic predictor. While also modify the duration of source speech towards the natural the model was trained with the bottleneck features of the reference appropriately as shown in this figure. target speaker as inputs [21]. This inconsistency may degrade 2) Subjective evaluation: Subjective evaluations were con- the similarity of converted speech. Another reason can be ducted to compare the performance of our proposed method attributed to the duration conversion ability of our proposed with the baseline methods in terms of the naturalness and similarity of converted speech. In this evaluation, twenty https://www.mturk.com Length of converted utterances (second) Length of converted utterances (second) F (Hz) F (Hz) 0 0 F (Hz) F (Hz) 0 0 PREPRINT MANUSCRIPT OF IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING c 2018 IEEE 10 (a) bn-DNN (c) Proposed 8 360 8 360 6 6 240 240 4 4 120 120 2 2 0 0 (b) VCC2018 (d) Target 8 360 8 360 6 6 240 240 4 4 120 120 2 2 0 0 0 1 2 3 4 5 0 1 2 3 4 5 Time (second) Time (second) Fig. 6. The F contours and spectrograms of one test utterance converted using different methods and the natural target reference. The red dashed lines are F contours extracted by STRAIGHT from the converted waveforms. TABLE VII TABLE VIII O BJECTIVE EVALUATION RESULTS OF PROPOSED METHODS WITHOUT O BJECTIVE EVALUATION RESULTS OF PROPOSED METHODS WITH AND USING MEL- SPECTROGRAM S AND W ITHOUT USING BOTTLENECK WITHOUT THE ATTENTION MODULE. FEATURES AS INPUTS. Female-to-Male Male-to-Female Female-to-Male Male-to-Female Methods MCD F RMSE MCD F RMSE 0 0 Methods MCD F RMSE MCD F RMSE 0 0 (dB) (Hz) (dB) (Hz) (dB) (Hz) (dB) (Hz) Proposed 3.556 41.748 3.802 33.374 Proposed 3.556 41.748 3.802 33.374 w/o-att 3.635 47.620 3.969 37.948 w/o-Mel 3.623 43.443 3.803 35.463 i-w/o-att 3.770 50.310 3.914 37.034 w/o-bn 3.624 48.550 4.000 40.183 “w/o-att” and “i-w/o-att” represent models without attention module “w/o-Mel” and “w/o-bn” represent the models without using Mel- and without attention module but adjusting speaking rate globally by spectrograms and without using bottleneck features as inputs interpolation respectively. respectively. Mel-spectrograms. Listening to the converted audio samples method as introduced in the objective evaluations. Therefore, without using bottleneck features, we found they suffered from the prosody similarity and naturalness of our proposed method serious mispronunciation problem. The bottleneck features were better than simply adjusting speaking rate globally. extracted by an ASR model contain high-level and linguistic- related information. The experimental results indicate that they D. Ablation tests were essential for achieving stable voice conversion results in our proposed method. In order to further analyze the effectiveness of some key F contours and spectrograms of one test utterance con- components in our model, ablation tests on model inputs, verted by the proposed method and the proposed method attention module and location code were conducted. In this without bottleneck features are presented in Fig. 7 (a) and subsection, only the Mandarin dataset was adopted for evalu- Fig. 7 (b) respectively. Compared to the method without using ation. bottleneck features, the F contour of the utterance converted 1) Mel-spectrograms and bottleneck features: In order to by our proposed method is more similar to that of the natural investigate the necessity of using Mel-spectrograms and bot- reference in Fig. 7 (d). Also, a significant spectral distortion tleneck features, we removed each one of them and built can be observed at the 1  2s interval of the spectrogram SCENT models utilizing only source bottleneck features or generated by the “w/o-bn” method. Mel-spectrograms as inputs respectively. Objective evaluation results of MCD and F RMSE on test set are presented in 2) Attention module: The attention module in a SCENT TABLE VII. model helps to achieve the alignment between input and output From this table, we can see that Mel-spectrograms are feature sequences at the training stage and to predict target beneficial for the model to achieve more accurate prediction of durations at the conversion stage. In order to investigate how acoustic features. It also can be found that removing bottleneck the attention module contributed to the overall performance features led to higher F RMSE and MCD on test set, and its of our proposed method, we modified the SCENT model degradation on F RMSE was more serious than removing to a frame-by-frame transformation model without attention Frequency (kHz) Frequency (kHz) Frequency (kHz) Frequency (kHz) F (Hz) F (Hz) 0 0 F (Hz) F (Hz) 0 0 PREPRINT MANUSCRIPT OF IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING c 2018 IEEE 11 (a) Proposed (c) i-w/o-att 8 360 8 360 6 6 240 240 4 4 120 120 2 2 0 0 (b) w/o-bn (d) Target 8 360 8 360 6 6 240 240 4 4 120 120 2 2 0 0 0 1 2 3 4 5 6 0 1 2 3 4 5 6 Time (second) Time (second) Fig. 7. The F contours and spectrograms of one test utterance converted using different methods and the natural target reference. “w/o-bn” and “i-w/o-att” represent the proposed models without bottleneck features and without attention module but adjusting speaking rate globally by interpolation respectively. The red dashed lines are F contours extracted by STRAIGHT from the converted waveforms. TABLE IX TABLE X THE RESULTS OF PREFERENCE TESTS ON SIM ILARITY AMONG PROPOSED O BJECTIVE EVALUATION RESULTS OF PROPOSED METHODS WITH AND METHODS W ITH AND WITHOUT THE ATTENTION M ODULE. WITHOUT THE LOCATION CODE. w/o-att i-w/o-att Proposed N/P MCD F RMSE DDUR p Methods (%) (%) (%) (%) (dB) (Hz) (second) 33.0 58.5 - 8.5 1:31 10 Proposed 3.556 41.748 0.194 F-M F-M - 21.0 67.5 11.5 < 1 10 w/o-locc 3.590 41.783 0.205 Proposed 3.802 33.374 0.260 17.5 76.0 - 6.5 < 1 10 M-F M-F w/o-locc 3.822 35.561 0.307 - 24.0 66.5 9.5 < 1 10 “DDUR” represents average absolute difference between the dura- “p” represents p value of t-test. “N/P” denotes no preference. “F-M” tions of the converted and target utterances. “w/o-locc” represents and “M-F” represent female-to-male and male-to-female conversions models without location code. “F-M” and “M-F” represent female- respectively. to-male and male-to-female conversions respectively. mechanism for comparison. Once the attention module was removed, the LSTM layer with attention in the decoder became module significantly affected on the similarity, the preference a plain uni-directional LSTM. In order to get frame aligned tests focused on the similarity aspect of converted speech. sequence pairs for model training, the input sequences were Ten native listeners were involved in evaluation and the wrapped towards the target ones using DTW algorithm and experimental results are presented in TABLE IX. This table MCCs features. The other parts of the SCENT model were shows that the strategy of global speaking rate adjustment by kept unchanged. source interpolation can improve the similarity of converted Our experiments compared three methods, including the speech in both conversion pairs. The proposed method with proposed method, the proposed method without attention attention module outperformed the method without attention (w/o-att) and the proposed method without attention but but using source interpolation. These results further confirmed using source interpolation at conversion time (i-w/o-att). TA- the effectiveness of the attention module. BLE VIII shows the MCDs and F RMSEs of these three methods. We can see that the prediction errors increased in the 3) Location code: Ablation tests were conducted for inves- absence of the attention module. F contours and spectrograms 0 tigating how the location code affected the performance of the of one test utterance converted by proposed method and “i- model. In the experiments, the location code was removed and w/o-att” method are presented in Fig. 7 (a) and Fig. 7 (c) the models were built in the same conditions. MCD, F RMSE respectively. This figure again shows the effectiveness of the and DDUR were calculated and are presented in TABLE X. A attention module for generating speech with duration and F 0 slight raise of MCD and F RMSE after removing the location contour closer to that of the natural target speech. code can be observed from this table. Furthermore, the DDURs Furthermore, a group of preference tests were conducted to in female-to-male and male-to-female conversions increased compare the subjective performance of these three methods. by 5.4% and 15.3% respectively. These experimental results Because the duration conversion achieved by the attention demonstrated the positive effects of using the location code. Frequency (kHz) Frequency (kHz) Frequency (kHz) Frequency (kHz) PREPRINT MANUSCRIPT OF IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING c 2018 IEEE 12 E. Discussions importance of the attention module and the positive effect of the location code were also proved in our ablation studies. To As discussed in Section II, directly implementing seq2seq investigate the influence of training set size on the performance models at utterance level is difficult for the voice conversion of our proposed method and to reduce conversion errors by task. The input and output sequences in voice conversion are improving attention calculation will be our work in the future. composed of frame-level features and are relatively long thus it is a challenge for the attention mechanism to search for the correct hidden entries to pay attention to. Once there are REFERENCES abnormal skips or repetitions in the sequence of attention [1] D. G. Childers, B. Yegnanarayana, and K. Wu, “Voice conversion: probabilities, mistakes of converted speech may occur. Factors responsible for quality,” in IEEE International Conference on These difficulties are considered when designing the Acoustics, Speech and Signal Processing, 1985, pp. 748–751. [2] D. G. Childers, K. Wu, D. M. Hicks, and B. Yegnanarayana, “Voice SCENT model. In order to improve attention stability, conversion,” Speech Communication, vol. 8, no. 2, pp. 147–158, 1989. the techniques of forward attention and adding location [3] A. Kain, “Spectral voice conversion for text-to-speech synthesis,” features are used when calculating attention probabilities. in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 1, 1998, pp. 285–288. The bottleneck features can also provide linguistic-related [4] L. M. Arslan, “Speaker transformation algorithm using segmental information to help the attention-based alignment between codebooks (STASC),” Speech Communication, vol. 28, no. 3, pp. 211– input and output feature sequences. However, errors still can 226, 1999. [5] M. Muller, “Dynamic time warping,” Information retrieval for music not be completely avoided in the converted speech. Additional and motion, pp. 69–84, 2007. 100 non-parallel utterances of both speakers in the Mandarin [6] T. Toda, A. W. Black, and K. Tokuda, “Voice conversion based on dataset, which were out of the dataset used for previous maximum-likelihood estimation of spectral parameter trajectory,” IEEE Transactions on Audio Speech and Language Processing, vol. 15, no. 8, experiments, were adopted for error analysis. The utterances pp. 2222–2235, 2007. of the male speaker contained 2747 phonemes, while the [7] S. Desai, E. V. Raghavendra, B. Yegnanarayana, A. W. Black, and utterances of the female speaker had 2538 phonemes. We K. Prahallad, “Voice conversion using artificial neural networks,” in IEEE International Conference on Acoustics, Speech and Signal conducted male-to-female and female-to-male conversions Processing (ICASSP), April 2009, pp. 3893–3896. for these utterances and identified different categories of [8] S. Desai, A. W. Black, B. Yegnanarayana, and K. Prahallad, “Spectral conversion errors subjectively. In male-to-female conversion, mapping using artificial neural networks for voice conversion,” IEEE Transactions on Audio Speech and Language Processing, vol. 18, no. 5, there were 1 skipping phoneme error, 2 completion prediction pp. 954–964, 2010. errors, 34 phoneme pronunciation errors, 31 tone defects and [9] D. M. Titterington, A. F. M. Smith, and U. E. Makov, Statistical analysis 10 phoneme quality defects. In female-to-male conversion, of finite mixture distributions. Wiley, 1985. [10] K. Hornik, “Multilayer feedforward neural networks are universal there were 19 phoneme pronunciation errors, 20 tone defects approximators,” Neural Networks, vol. 2, 1989. and 17 phoneme quality defects. [11] R. H. Laskar, D. Chakrabarty, F. A. Talukdar, K. S. Rao, and K. Banerjee, Several reasons may lead to these errors. First, the proposed “Comparing ANN and GMM in a voice conversion framework,” Applied Soft Computing Journal, vol. 12, no. 11, pp. 3332–3342, 2012. model contains about 7.5 M trainable parameters thus is [12] L.-H. Chen, Z.-H. Ling, L.-J. Liu, and L.-R. Dai, “Voice conversion complex and needs to be trained in a data-driven way. using deep neural networks with layer-wise generative training,” Therefore, the insufficiency of training data may cause the IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 22, no. 12, pp. 1859–1872, 2014. model’s lack of generalization ability when dealing with [13] L. Sun, S. Kang, K. Li, and H. Meng, “Voice conversion using deep unseen utterances. Also, the extracted bottleneck features may bidirectional long short-term memory based recurrent neural networks,” also be misleading due to the accuracy limitation of the ASR in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 4869–4873. model. To further reduce conversion errors and to produce [14] T. Nakashika, T. Takiguchi, and Y. Ariki, “Voice conversion using RNN more reliable conversion results using seq2seq models will be pre-trained by recurrent temporal restricted Boltzmann machines,” IEEE an important task of our future work. Transactions on Audio, Speech, and Language Processing, vol. 23, no. 3, pp. 580–587, 2015. [15] J. Lai, B. Chen, T. Tan, S. Tong, and K. Yu, “Phone-aware LSTM- V. CONCLUSION RNN for voice conversion,” in IEEE International Conference on Signal Processing (ICSP), 2016, pp. 177–182. This paper presents SCENT, a sequence-to-sequence neural [16] S. H. Mohammadi and A. Kain, “An overview of voice conversion network, for acoustic modeling in voice conversion. Mel- systems,” Speech Communication, vol. 88, pp. 65–82, 2017. [17] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning spectrograms are used as acoustic features. Bottleneck features with neural networks,” Neural Information Processing Systems, pp. extracted by an ASR model are taken as additional linguistic- 3104–3112, 2014. related descriptions and are concatenated with the source [18] K. Cho, B. Van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using acoustic features as network inputs. Taking advantage of the RNN encoder–decoder for statistical machine translation,” Empirical attention mechanism, the SCENT model does not rely on the Methods in Natural Language Processing, pp. 1724–1734, 2014. preprocessing of DTW alignment and the duration conversion [19] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” International Conference on can be achieved simultaneously. Finally, the converted acoustic Learning Representations, 2015. features are passed through a WaveNet vocoder to reconstruct [20] T. Luong, H. Pham, and C. D. Manning, “Effective approaches to speech waveforms. Objective and subjective experimental attention-based neural machine translation,” Empirical Methods in Natural Language Processing, pp. 1412–1421, 2015. results demonstrated the superiority of our proposed method [21] L.-J. Liu, Z.-H. Ling, and L.-R. Dai, “WaveNet vocoder with limited compared with baseline methods, especially in durational training data for voice conversion,” in Annual Conference of the aspect. Ablation tests further proved the benefits of inputting International Speech Communication Association, INTERSPEECH, Mel-spectrograms and the necessity of bottleneck features. The 2018, pp. 1983–1987. PREPRINT MANUSCRIPT OF IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING c 2018 IEEE 13 [22] J.-C. Chou, C.-C. Yeh, H.-Y. Lee, and L.-S. Lee, “Multi-target voice [42] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A conversion without parallel data by adversarially learning disentangled neural network for large vocabulary conversational speech recognition,” audio representations,” in Annual Conference of the International Speech in IEEE International Conference on Acoustics, Speech and Signal Communication Association, INTERSPEECH, 2018, pp. 501–505. Processing (ICASSP), 2016, pp. 4960–4964. [23] S. O. Arik, J. Chen, K. Peng, W. Ping, and Y. Zhou, “Neural voice [43] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” stat, vol. cloning with a few samples,” in Advances in Neural Information 1050, p. 21, 2016. Processing Systems, 2018, pp. 10 040–10 050. [44] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances [24] Y. Wang, R. J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, in Neural Information Processing Systems, 2017, pp. 6000–6010. Z. Yang, Y. Xiao, Z. Chen, S. Bengio et al., “Tacotron: Towards end-to- [45] J.-X. Zhang, Z.-H. Ling, and L.-R. Dai, “Forward attention in end speech synthesis,” in Annual Conference of the International Speech sequence-to-sequence acoustic modeling for speech synthesis,” in IEEE Communication Association, INTERSPEECH, 2017, pp. 4006–4010. International Conference on Acoustics, Speech and Signal Processing [25] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, (ICASSP), 2018, pp. 4789–4793. Y. Zhang, Y. Wang, R. J. Skerry-Ryan et al., “Natural TTS synthesis [46] C. M. Bishop, “Mixture density networks,” Technical Report by conditioning WaveNet on mel spectrogram predictions,” in IEEE NCRG/4228, Aston University, Birmingham, UK, 1994. International Conference on Acoustics, Speech and Signal Processing [47] H. Zen and A. Senior, “Deep mixture density networks for (ICASSP), 2018, pp. 4779–4783. acoustic modeling in statistical parametric speech synthesis,” in IEEE [26] W. Ping, K. Peng, A. Gibiansky, S. O. Arik, A. Kannan, S. Narang, International Conference on Acoustics, Speech and Signal Processing J. Raiman, and J. P. Miller, “Deep Voice 3: 2000-speaker neural text-to- (ICASSP), May 2014, pp. 3844–3848. speech,” International Conference on Learning Representations, 2018. [48] M. B. Christopher, Pattern Recognition and Machine Learning. Spring- [27] H. Tachibana, K. Uenoyama, and S. Aihara, “Efficiently trainable text- Verlag New York, 2016. to-speech system based on deep convolutional networks with guided [49] B. Xu, N. Wang, T. Chen, and M. Li, “Empirical evaluation of recitified attention,” in IEEE International Conference on Acoustics, Speech and acitvations in convolutional network,” ICML Deep Learning Workshop, Signal Processing (ICASSP), 2018, pp. 4784–4788. Lille, France, 06-11 July, 2015. [28] Y. Jia, Y. Zhang, R. J. Weiss, Q. Wang, J. Shen, F. Ren, Z. Chen, [50] J. Kominek and A. W. Black, “CMU ARCTIC databases for P. Nguyen, R. Pang, I. L. Moreno et al., “Transfer learning from speaker speech synthesis,” http://festvox.org/cmu arctic/index.html, 2003, Lang. verification to multispeaker text-to-speech synthesis,” in Advances in Technol. Inst., Carnegie Mellon Univ., Pittsburgh, PA. Neural Information Processing Systems, 2018, pp. 4485–4495. [51] D. Krueger, T. Maharaj, J. Kramar, M. Pezeshki, N. Ballas, N. R. Ke, [29] E. Nachmani, A. Polyak, Y. Taigman, and L. Wolf, “Fitting new speakers A. Goyal, Y. Bengio, A. C. Courville, and C. Pal, “Zoneout: Regularizing based on a short untranscribed sample,” in International Conference on RNNs by randomly preserving hidden activations,” International Machine Learning, 2018, pp. 3683–3691. Conference on Learning Representations, 2017. [30] Y. Taigman, L. Wolf, A. Polyak, and E. Nachmani, “VoiceLoop: Voice [52] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” fitting and synthesis via a phonological loop,” International Conference Computer Science, 2014. on Learning Representations, 2018. [53] D. T. Chappell and J. H. L. Hansen, “Speaker-specific pitch contour [31] M. V. Ramos, “Voice conversion with deep learning,” Master’s Thesis, modeling and modification,” in IEEE International Conference on Instituto Superior Tecnico, ´ 10 2016. Acoustics, Speech and Signal Processing (ICASSP), vol. 2, 1998, pp. [32] T. Kaneko, H. Kameoka, K. Hiramatsu, and K. Kashino, “Sequence-to- 885–888. sequence voice conversion with similarity metric learned using gener- [54] Y. Ohtani, T. Toda, H. Saruwatari, and K. Shikano, “Maximum ative adversarial networks,” in Annual Conference of the International likelihood voice conversion based on GMM with STRAIGHT mixed Speech Communication Association, INTERSPEECH, 2017, pp. 1283– excitation,” in Proc. ICSLP, 2006, pp. 2266–2269. [55] Z. Wu, O. Watts, and S. King, “Merlin: An open source neural network [33] H. Miyoshi, Y. Saito, S. Takamichi, and H. Saruwatari, “Voice speech synthesis system,” in 9th ISCA Speech Synthesis Workshop conversion using sequence-to-sequence learning of context posterior (SSW9), 2016. probabilities,” in Annual Conference of the International Speech [56] Z.-H. Ling, L. Deng, and D. Yu, “Modeling spectral envelopes using Communication Association, INTERSPEECH, 2017, pp. 1268–1272. restricted Boltzmann machines and deep belief networks for statistical [34] A. V. Den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, parametric speech synthesis,” IEEE Transactions on Audio, Speech, and A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, Language Processing, vol. 21, no. 10, pp. 2129–2139, Oct 2013. “WaveNet: A generative model for raw audio,” in 9th ISCA Speech Synthesis Workshop (SSW9), 2016, pp. 125–125. [35] K. Kobayashi, T. Hayashi, A. Tamamori, and T. Toda, “Statistical voice conversion with WaveNet-based waveform generation,” in Annual Conference of the International Speech Communication Association, INTERSPEECH, 2017, pp. 1138–1142. [36] J. Niwa, T. Yoshimura, K. Hashimoto, K. Oura, Y. Nankaku, and K. Tokuda, “Statistical voice conversion based on WaveNet,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5289–5293. [37] M. Morise, F. Yokomori, and K. Ozawa, “WORLD: A vocoder-based high-quality speech synthesis system for real-time applications,” IEICE Transactions on Information and Systems, vol. 99, no. 7, pp. 1877–1884, [38] H. Kawahara, I. Masuda-Katsuse, and A. D. Cheveigne, ´ “Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency based F0 extraction: Possible role of a repetitive structure in sounds,” Speech Communication, vol. 27, no. 34, pp. 187–207, 1999. [39] X. Wang, J. Lorenzo-Trueba, S. Takaki, L. Juvela, and J. Yamagishi, “A comparison of recent waveform generation and acoustic modeling meth- ods for neural-network-based speech synthesis,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 4804–4808. [40] Y. Ai, H.-C. Wu, and Z.-H. Ling, “SampleRNN-based neural vocoder for statistical parametric speech synthesis,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5659–5663. [41] S. Hchreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Electrical Engineering and Systems Science arXiv (Cornell University)

Sequence-to-Sequence Acoustic Modeling for Voice Conversion

Loading next page...
 
/lp/arxiv-cornell-university/sequence-to-sequence-acoustic-modeling-for-voice-conversion-zWQ25LQqIQ

References (57)

ISSN
2329-9290
eISSN
ARCH-3348
DOI
10.1109/TASLP.2019.2892235
Publisher site
See Article on Publisher Site

Abstract

PREPRINT MANUSCRIPT OF IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING c 2018 IEEE 1 Sequence-to-Sequence Acoustic Modeling for Voice Conversion Jing-Xuan Zhang, Zhen-Hua Ling, Member, IEEE, Li-Juan Liu, Yuan-Jiang, and Li-Rong Dai Abstract—In this paper, a neural network named Sequence-to- can be a joint density Gaussian mixture model (JD-GMM) [3], sequence ConvErsion NeTwork (SCENT) is presented for acoustic [6] or a deep neural network (DNN) [7], [8], both of which are modeling in voice conversion. At training stage, a SCENT model universal function approximators [9], [10]. At the conversion is estimated by aligning the feature sequences of source and target stage, a mapping function is derived from the built acoustic speakers implicitly using attention mechanism. At conversion model that converts the acoustic features of source speaker stage, acoustic features and durations of source utterances are converted simultaneously using the unified acoustic model. Mel- into those of target speaker. Finally, waveforms are recovered scale spectrograms are adopted as acoustic features which contain from the converted acoustic features using a vocoder. both excitation and vocal tract descriptions of speech signals. This conventional pipeline for voice conversion has its The bottleneck features extracted from source speech using limitations. First, most previous work focused on the conver- an automatic speech recognition (ASR) model are appended sion of spectral features and simply adjusted F trajectories as auxiliary input. A WaveNet vocoder conditioned on Mel- 0 spectrograms is built to reconstruct waveforms from the outputs linearly in the logarithm domain [7], [8], [11]–[15]. Besides, of the SCENT model. It is worth noting that our proposed the durations of converted utterances were kept the same as method can achieve appropriate duration conversion which is the ones of source utterances since the acoustic models were difficult in conventional methods. Experimental results show that built on a frame-by-frame basis. However, the production of our proposed method obtained better objective and subjective human speech is a highly dynamic process and the frame-by- performance than the baseline methods using Gaussian mixture models (GMM) and deep neural networks (DNN) as acoustic frame assumption constrains the modeling ability of mapping models. This proposed method also outperformed our previous functions [16]. work which achieved the top rank in Voice Conversion Challenge This paper proposes an acoustic modeling method for 2018. Ablation tests further confirmed the effectiveness of several voice conversion based on the sequence-to-sequence neural components in our proposed method. network framework [17], [18]. A Sequence-to-sequence Con- Index Terms—voice conversion, sequence-to-sequence, atten- vErsion NeTwork (SCENT) is designed to directly describe the tion, Mel-spectrogram. conditional probabilities of target acoustic feature sequences given source ones without explicit frame-to-frame alignment. I. INTRODUCTION The SCENT model follows the widely-used architecture of encoder-decoder with attention [19], [20]. The encoder net- OICE conversion aims to modify the speech signal of a work first transforms the input feature sequences into hidden source speaker to make it sound like being uttered by a representations which are suitable for the decoder to deal target speaker, while keeping the linguistic contents unchanged with. At each decoder time step, the attention module selects [1], [2]. The potential applications of this technique include encoder outputs softly by attention probabilities and produces entertainment, personalized text-to-speech, and so on [3], [4]. a context vector. Then, the decoder predicts output acoustic Building statistical acoustic models for feature mapping is features frame by frame using context vectors. Furthermore, a a popular approach to voice conversion nowadays. At the post-filtering network is designed to enhance the accuracy of training stage of the conventional voice conversion pipeline, the converted acoustic features. Finally, a speaker-dependent acoustic features are first extracted from the waveforms of WaveNet is utilized to recover time-domain waveforms from source and target utterances. Then, the features of parallel the predicted sequences of acoustic features. utterances are aligned frame by frame using alignment algo- In our proposed method, Mel-scale spectrograms are rithms, such as dynamic time wrapping (DTW) [5]. Next, an adopted as acoustic features, which do not rely on the acoustic model for conversion is trained using the acoustic source-filter assumption of speech production. Therefore, features of paired source-target frames. The acoustic model F and spectral features are converted jointly in a single This work was supported by National Key R&D Program of China (Grant model. Additional bottleneck features derived using an No. 2017YFB1002202), the National Nature Science Foundation of China automatic speech recognition (ASR) model are appended to (Grant No. 61871358) and the Key Science and Technology Project of Anhui Province (Grant No. 18030901016). the source Mel-spectrograms, which are expected to improve J.-X. Zhang, Z.-H. Ling and L.-R. Dai are with the National Engineering the pronunciation correctness of the converted speech. Laboratory for Speech and Language Information Processing, University Attention module learns the soft alignments between the of Science and Technology of China, Hefei, 230027, China (e-mail: nosisi@mail.ustc.edu.cn, zhling@ustc.edu.cn, lrdai@ustc.edu.cn). L.-J. Liu pairs of source-target feature sequence implicitly. Facilitated and Y. Jiang are with the iFLYTEK Co., Ltd., Hefei, 230088, China (e-mail: by attention module, our proposed method is capable of ljliu@iflytek.com, yuanjiang@iflytek.com). predicting target acoustic sequences with durations different This work was conducted when J.-X. Zhang was an intern at iFLYTEK Research. from source ones at conversion stage. arXiv:1810.06865v5 [cs.SD] 12 Jan 2020 PREPRINT MANUSCRIPT OF IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING c 2018 IEEE 2 Experimental results show that our proposed method verification task. Then, the built network was transferred to achieved better objective and subjective performance than a conditional Tacotron model [24] to generate speech for a the GMM-based and DNN-based baseline systems. This variety of speakers. Nachmani et al. [29] extended the Voice proposed method also outperformed our previous work which Loop model [30] to fit new voices by incorporating a fitting achieved the top rank in Voice Conversion Challenge 2018 network. Instead of using text as model input in these studies, [21]. It is worth noting that our proposed method can achieve we utilize a separate ASR model for extracting linguistic- appropriate duration conversion, which contributes to higher related features and the input of our model is only the speech similarity and is difficult in conventional methods. Ablation of source speakers. Also, instead of generating speech of studies were further conducted and the results confirmed unseen speakers, we focus on voice conversion for one pair of the effectiveness of several key components in our proposed speakers. It should be noticed that the techniques developed method. for voice cloning are potentially useful for extending our In this paper, we focus on one-to-one voice conversion, proposed method from one-to-one conversion to many-to- i.e., one model is trained for each speaker pair. It should many conversion, which will be a part of our future study. be noticed that our proposed method can also be adapted to other cases rather than one-to-one conversion. For example, C. Sequence-to-sequence learning for voice conversion the proposed method can be extended to multiple speaker pairs To the best of our knowledge, Ramos [31] made the by conditioning on codes of speaker identities, which can be first attempt to convert spectral features using a sequence-to- obtained from the outputs of a speaker encoder [22], [23]. sequence model with attention. However, as stated in Section The rest of this article is organized as follows. Section II 5.5 of Ramos’s thesis [31], the model was not capable of using reviews the related work on seq2seq modeling, voice cloning its own predictions to generate a real valued output prediction. and WaveNet vocoders. Section III introduces our proposed Kaneko et al. [32] proposed a CNN-based seq2seq spectral method for voice conversion. Details and results of experi- conversion method. Because of the lack of attention module ments are presented in Section IV. The article is concluded in in their method, the DTW algorithm was still utilized in order Section V. to obtain frame-level aligned feature sequences during training data preparation. Miyoshi et al. [33] proposed a method of II. RELATED WORK mapping context posterior probabilities using seq2seq models. A. Relationship with sequence-to-sequence learning for text- In their method, an RNN-based encoder-decoder converted the to-speech source posterior probability sequence to the target one for each Text-to-speech (TTS) methods based on seq2seq learning phone, and the phone durations of natural target speech were have emerged recently and attracted much attention [24]– necessary at conversion stage. [27]. Our work is inspired by the success of applying seq2seq Our work is most similar to Ramos’s one [31], where models to TTS. However, voice conversion is different from an utterance-level seq2seq with attention model is built for acoustic feature conversion. Different from previous methods, TTS in several aspects. First, the inputs of a voice conversion Mel-spectrograms are adopted as acoustic features in our model are frame-level acoustic features rather than phone- method. Thus, F and spectral features are transformed jointly. level or character-level linguistic features. Typically, linguistic Our method has the ability of modeling pairs of input and features are discrete, while acoustic features are continuous. In output utterance without dependency on DTW alignment. addition to linguistic information, acoustic features also con- During conversion, the durations of generated target acoustic tain speaker identity information which should be processed sequences are determined automatically and the probability of during voice conversion. Second, the input-output alignment completion is predicted at each decoder time step. in voice conversion task is different from that in TTS. Speech generation in TTS is a decompressing process and the alignment between text and acoustic frames is usually a one- D. Voice conversion using WaveNet to-many mapping. While the alignment can be either one-to- WaveNet [34], as a neural network-based waveform gener- many or many-to-one in voice conversion, depending on the ation model, has been successfully applied to TTS and voice characteristics of speaker pairs and the dynamic characteristics conversion areas [21], [35], [36]. Studies have shown that of acoustic sequences. Third, the training data available for WaveNet vocoders outperformed traditional vocoders such as voice conversion is typically smaller than that for TTS. WORLD [37] and STRAIGHT [38] in terms of the quality of reconstructed speech [21], [39], [40]. Voice conversion B. Relationship with voice cloning methods using WaveNet models have also been studied in Voice cloning is a task that learns the voice of unseen recent years. Kobayashi et al. [35] proposed a GMM-based speakers from a few speech samples for text-to-speech syn- voice conversion method with WaveNet-based waveform gen- thesis. Unlike voice conversion, voice cloning takes text as eration. Liu et al. [21] proposed building WaveNet vocoders model input. Arik et al. [23] evaluated two techniques of voice for voice conversion with limited data by model adaptation. cloning, i.e., speaker adaptation and speaker encoding, based Directly mapping source acoustic features into target speaker’s on Deep Voice 3 [26]. Jia et al. [28] proposed a transfer waveforms using WaveNet has also been proposed [36]. learning method for voice cloning. A speaker-discriminative In this paper, WaveNet vocoders are used to reconstruct embedding network was first trained to achieve a speaker the waveforms of target speakers. WaveNet vocoders accept PREPRINT MANUSCRIPT OF IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING c 2018 IEEE 3 Fig. 1. The conversion process of our proposed sequence-to-sequence voice conversion method. Mel-spectrograms as input conditions and are trained in a speaker-dependent way without using the adaptation technique described in [21]. III. PROPOSED METHOD A. Overall architecture Fig. 1 shows the diagram of our proposed method when converting an input utterance. The conversion process can be divided into two main stages. One is a Seq2seq ConvErsion NeTwork (SCENT) for acoustic feature prediction, the other is a WaveNet neural vocoder for waveform generation. Mel- spectrograms are adopted as acoustic features in this paper. Bottleneck features extracted by an ASR model from source speech are concatenated with acoustic features to form the input sequences of the SCENT model. The SCENT model converts input sequence into Mel-spectrograms of the target speaker. Then, the target speaker’s speech is synthesized by passing the predicted Mel-spectrograms through the WaveNet Fig. 2. The network structure of a SCENT model, where skip connections and vocoder. residual connections are ignored for clarity. The grey circles in the encoder represent LSTM units with layer normalization. T and T are the frame x h B. Feature extraction numbers of input sequence and hidden representations. The encoder in this figure has a downsampling rate M = 2. Therefore, we have T = 2T in this x h Mel-spectrograms are computed through a short-time figure. The auto-regressive inputs of the decoder are natural history contexts Fourier transform (STFT) on waveforms. The STFT at training time and are generated ones at conversion time. Single frame is predicted at each decoder time step (i.e., r = 1) in this figure. magnitudes are transformed to Mel-frequency scale using Mel-filterbanks followed by a logarithmic dynamic range compression. In order to extract bottleneck features, a left-to-right way, and a bi-directional post-filtering network recurrent neural network (RNN) based ASR model is trained which refines the generation results. Fig. 2 shows the network on a separate speech recognition dataset. For each input structure of a SCENT model. frame, bottleneck features, i.e., the activations of the last Let y = [y ; : : : ;y ] denote the output Mel-spectrogram hidden layer before the softmax output layer of the ASR 1 T sequence of the encoder-decoder network, where T is the model, are extracted. Such bottleneck features can provide y frame number of target speech. The encoder-decoder network additional linguistic-related descriptions which are expected models the mapping relationship between input and output fea- to benefit the conversion process. It should be noticed that ture sequences using conditional distributions of each output these bottleneck features are still automatically extracted frame y given previous output frames y = [y ; : : : ;y ] from the acoustic signals of source utterances and no text t <t 1 t1 and the input x as transcriptions are necessary. The Mel-spectrograms and bottleneck features at each frame are concatenated to form the input sequence x = [x ; : : : ;x ] of the SCENT model, 1 T p(yjx) = p(y jy ;x; W ; W ); (1) t <t enc dec where T is the frame number of source speech. t=1 C. Structure of SCENT where W and W are parameters of the encoder-decoder enc dec A SCENT model contains an encoder-decoder with attention network. As shown in Fig. 2, the encoder transforms the network which predicts acoustic feature in an uni-directional concatenated Mel-spectrograms and bottleneck features of PREPRINT MANUSCRIPT OF IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING c 2018 IEEE 4 source speech into a high-level and abstract representation the voice conversion task, we expect that the encoder network h = [h ; : : : ;h ] as should exclude speaker-dependent information of the source 1 T speech and extract hidden representation h which is high-level h = Encoder(x; W ): (2) enc and linguistic-related. Because one phone usually corresponds to tens of acoustic frames, it is reasonable to derive hidden T is the frame number of hidden representations and T < h h representation with lower sampling rate than the frame-level T because of the pyramid structure of encoder. The decoder input sequence. Furthermore, hidden representation with lower with attention mechanism utilizes h and produces a probability sampling rate makes the attention module easier to converge, distribution over output frames as since this leads to fewer encoding states for attention calcula- p(y jy ;x) = Decoder(y ;h; W ): (3) t <t <t dec tion at each decoding step. This pyramid structure also reduces the computational complexity by shortening the length of h The generation process of the decoder network is uni- for attention calculation and speeds up training and inference directional. In order to make use the bi-directional context significantly. information, a post-filtering network (i.e., PostNet) is further 2) Decoder with attention mechanism: The decoder is employed to enhance the accuracy of prediction. Let z = an auto-regressive RNN which predicts the output acoustic [z ; : : : ;z ] represent the PostNet output sequence, which 1 T features from the hidden representation h. Non-overlapping r is the final prediction of the SCENT model. In this paper, frames are predicted at each decoder step. This trick divides the frame rates of decoder outputs and PostNet outputs are the total decoding steps by r, which further reduces training the same, i.e. T = T . The distribution of feature sequence z z y and inference time [24]. In Fig. 2, the decoder is illustrated given the output of the encoder-decoder network y is modeled with r = 1 for clarity. The prediction of previous time step as y is first passed through a pre-processing network (i.e., t1 p(zjy) = PostNet(y; W ); (4) pos PreNet), which is a two-layer MLP with ReLU activation and where W denotes the parameters of the PostNet. pos dropout in our implementation. The MLP outputs are sent into Next, we will describe each part of SCENT in details. an LSTM layer with attention mechanism. A context vector c 1) Encoder: The encoder network is constructed based on is calculated at each decoder step using attention probabilities the pyramid bidirectional LSTM architecture [41], [42], which as processes the sequence with lower time resolution at higher c = h ; (8) layers. In a conventional deep bidirectional LSTM (BLSTM) t n n=1 architecture, the output at the n-th time step of the j-th layer is computed as 1 h where = [ ; :::; ] are attention probabilities, t is t t j j j1 decoder time step, and n is the index of encoder outputs. h = BLSTM(h ;h ): (5) n n1 n In our implementation, a hybrid attention mechanism is In a pyramid BLSTM (pBLSTM), the outputs at consecutive adopted which takes the alignment of previous decoder step steps of a lower layer are concatenated and fed into the (i.e., location-awareness) into account when computing the next layer to decrease the sampling rate of input sequence. attention probabilities. In order to extract location information, The general calculation of pBLSTM hidden units can be k filters with kernel size l are employed to convolve the kl formulated as alignment of previous time step. Let F 2 R represent the j j1 j1 convolution matrix, and q denote the query vector which is h = pBLSTM(h ; [h ; : : : ;h ]); (6) n n1 Mn Mn+M1 given by the output of attention LSTM. Then, the attention where M is ratio of downsampling. The technique of layer score e is computed as normalization [43] is applied to the encoder LSTM cells. f = F  ; (9) t t1 Then, a location code l = [l (0); : : : ; l (d 1)] [44] is n n n added to the top output layer of pBLSTMs to form the hidden n > > n e = q Wh + v tanh(Uf + b); (10) t t t representation h. Let d be the dimension of each h . The location code is composed of sine and cosine functions of where v, b, W and U are trainable parameters of the model. different frequencies as As we can see from Eq. (10), the calculation of the hybrid attention takes two parts into consideration. The first part of 2i/d l (2i) = sin(n 10000 ); (7) Eq. (10) measures the relationship between the query vector 2i/d l (2i + 1) = cos(n 10000 ); and different entries of encoder outputs. The second part of where n is the time step in sequence h and i 2 [0; : : : ; d/21] Eq. (10) is computed based on the alignment of previous is the dimension index. The base 10000 in Eq. (7) follows the decoder step and provides a constraint on current t1 configuration in the original paper [44] which proposed the attention probabilities. The convolution matrix is employed to filter for extracting useful features as shown in Eq. (9). location code. This location code is useful since it gives the t1 The features are further integrated into the calculation of model explicit information of which portion of the sequence attention scores as shown in Eq. (10). is currently processed. The effectiveness of the location code will be demonstrated by ablation tests in our experiments. Furthermore, the forward attention method proposed in our The pyramid structure of our encoder network results in previous work [45] is adopted to stabilize the attention align- shorter hidden representation than original input sequence. For ment and speed up the convergence of attention alignment. In PREPRINT MANUSCRIPT OF IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING c 2018 IEEE 5 the forward attention method, the attention probability is likelihood (ML) criterion based on Gaussian mixture model calculated as (GMM). For GMM-ML, the network outputs are adopted to parameterize a GMM following the framework of mixture n n i density networks (MDN) [46], [47]. e^ = exp(e ) exp(e ) ; (11) t t t More specifically, the likelihood function in GMM-ML i=1 takes the form of a GMM as n n n n1 ^ = e^ ( + ); (12) t t t1 t1 X p(yjx; W ; W ) = w (x)N (y; (x); (x)); (16) enc dec i i i n n i = ^ ^ : (13) i=1 t t t i=1 where m is the number of mixture components, and For initialization, we have w (x),  (x) and  (x) correspond to the mixture weight, i i i mean vector and covariance matrix of the i-th Gaussian = 1; (14) component given x. Here, the covariance matrices are = 0; for n = 2; : : : ; T : set to be diagonal. The concatenation of c , q and the t t The motivation of forward attention is to follow the mono- outputs of decoding LSTMs are projected to a vector tonic nature of alignments in human speech generation [45]. (2d +1)m Mel o(x; W ; W ) 2 R , where d is the enc dec Mel Therefore, a forward variable which only takes the monotonic dimension of Mel-spectrograms and the whole vector can be alignment paths into consideration is designed. This forward divided into all mixture components as variable is derived from the original attention probabilities e^ (w) (w) and it can be computed recursively as Eq. (12). Then, the o(x; W ; W ) =[o (x); : : : ; o (x); enc dec 1 m normalized forward variables are used to replace original t () > () > (17) o (x) ; : : : ;o (x) ; 1 m attention probabilities e^ for summarizing the encoder outputs () > () > > as shown in Eq. (8). In addition, a location code is also added o (x) ; : : : ;o (x) ] : 1 m to the auto-regressive input of the decoder at each time step. Then, the GMM parameters in Eq. (16) can be derived from The context vector c and query vector q are concatenated the vector o(x; W ; W ) as enc dec and fed into a stack of two-layer decoding LSTMs. The concatenation of c, q and the outputs of decoding LSTMs (w) (w) are linearly projected to produce the Mel-spectrogram output w (x) = exp o (x) exp o (x) ; (18) i j of the decoder network. In parallel, the concatenation of c j=1 and q are linearly projected to a scalar and passed through a () sigmoid activation to predict the completion probability p , end  (x) = log exp (o (x)) + 1 ; (19) which indicates whether the converted sequence reaches the last frame. () (x) = o (x); (20) 3) Post-filtering network: The PostNet refines the Mel- spectrograms predicted by the decoder using bi-directional where  (x) is a vector composed of the diagonal elements context information. The PostNet is a convolutional neural of  (x). For GMM-ML, L is defined as the negative log- i dec network (CNN) with a residual connection from network input likelihood (NLL) function, i.e., to the final output. The first layer of the PostNet is composed of 1-D convolution filter banks in order to extract rich context L = log p(yjx; W ; W ): (21) dec enc dec information. The outputs of the convolution banks are stacked Under both MSE and GMM-ML criteria, natural acoustic together and further passed through a two-layer 1-D CNN. histories of target speech are sent into the decoder at training The outputs of the final layer are added to the input Mel- time. The MSE criterion is actually a special case of GMM- spectrograms to produce the final results. ML which uses single mixture with fixed unit variance and predicted mean vector [48]. Theoretically, GMM-ML D. Loss function of SCENT is more flexible since it models more general probability We train the SCENT model by multi-task learning and the distributions and the MSE criterion usually leads to over- total loss is the weighted sum of three sub-losses as smoothed prediction because of the averaging effect [46]. When applying the GMM-ML criterion to L , the mean dec L = w L + w L + w L ; (15) dec dec post post end end vector of the component with maximum prior probability is used to generate the output sample at both training and testing where w , w and w are the weights of the three com- dec post end stages. At training time, the gradients from the PostNet are ponents. L and L denote the losses of Mel-spectrogram dec post only back-propagated through the sampled mean vectors given prediction given by the decoder and the PostNet respectively. L is a binary cross-entropy loss for evaluating the predicted by the decoder output layer. end completion probabilities. Only the MSE criterion is applied to L in our imple- post Two types of criteria are investigated for L . One is mentation. For calculating L , only the last decoder step of dec end the minimum square error (MSE) between the predicted and a natural target sequence is labelled as 1 (i.e., completed) and ground truth acoustic features. The other is the maximum the rest steps are labelled as 0 (i.e., incompleted). PREPRINT MANUSCRIPT OF IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING c 2018 IEEE 6 TABLE I ms with Hann windowing of 50 ms frame length and 1024- D ETAILS OF M ODEL CONFIGURATIONS. point Fourier transform. 512-dimensional bottleneck features were extracted using an ASR model every 40 ms and were pBLSTM, 2 layers and 256 cells LSTM Encoder then upsampled by repeating to match the frame rate of Mel- with layer normalization, M = 4 spectrograms. FC-256-ReLU-Dropout(0.5)! PreNet FC-256-ReLU-Dropout(0.5) The speaker-independent ASR model was trained using Attention LSTM, 1 layer and 256 cells; internal datasets of iFLYTEK company, which contained k = 10 and l = 32 for F in Eq. (9); recordings of about 10,000 hours for Mandarin and recordings SCENT Decoder v in Eq. (10) has dimension of 256; of about 3,000 hours for English. Our ASR model was Decoder LSTM, 2 layers and 256 cells an LSTM-HMM-based one. The LSTM was bidirectional Conv1D banks, k = [1; : : : ; 8], with 6 hidden layers and 1024 units in each direction. The Conv1D-k-256-BN-ReLU-Dropout(0.2)! PostNet classification targets of the LSTM model were clustered Conv1D-3-256-BN-ReLU-Dropout(0.2)! triphones, i.e., senones. For the Mandarin dataset, the phoneme Conv1D-3-256-BN-ReLU-Dropout(0.2) set included 26 initials and 140 tonal finals. We evaluated the 4 layers Conv1D-3-100-PReLU ConditionNet performance of the ASR model on the parallel dataset for with dilation d = [1; 2; 4; 8] WaveNet 30 layers dilated convolution layers voice conversion. The frame classification accuracies for the vocoder k mod 10 WaveNet with dilation d = 2 for female and male speakers were 72.3% and 78.4% respectively. k = [0; : : : ; 29]; 1024 softmax output For the English dataset, there were 62 phonemes and the frame FC represents fully connected. BN represents for batch normalization. classification accuracies for the female and male speakers were Conv1D-k-n represents 1-D convolution with kernel size k and channel 76.4% and 75.9% respectively. n. The details of our model configurations are listed in TA- BLE I. In our implementation, two frames were predicted at one decoding step (i.e., r = 2) and only the last frame was fed E. WaveNet-based vocoder back into the PreNet for the generation at next step. In the loss As shown in Fig. 1, a WaveNet-based vocoder is adopted to function for training the SCENT model, w was heuristically dec reconstruct time-domain waveforms given the predicted Mel- set as 1.0 or 0.01 if MSE or GMM-ML training criterion was spectrogram features. adopted for L . w and w were heuristically set as dec post end In our WaveNet model, the Mel-spectrogram features are 1.0 and 0.005 respectively. Zoneout [51] with probability of first passed through a ConditionNet consisting of stack of di- 0.2 were used at LSTM layers for regularization. Residual lated 1-D convolution layers with parametric ReLU activation connections were adopted for the LSTM layers of encoder (PReLU) [49]. The outputs of ConditionNet are upsampled to and decoder to speed up model convergence. We used Adam be consistent with the sampling rate of waveforms by simply [52] optimizer with learning rate of 10 for the first 20 repeating. Then, the sequence of condition vectors are fed epochs. After 50 epochs, the learning rate was exponentially into each dilated convolution block of the WaveNet to control decayed by 0.95 for each epoch. L regularization with weight the waveform generation. Our WaveNet model is trained only 10 was also applied. The batch size was 4. For WaveNet using the target speech data for building the SCENT model training, the -law companded waveforms were quantized and the adaptation technique [21] is not used in this paper. into 10 bits, i.e., 1024 levels. A speaker-dependent WaveNet vocoder was trained using each speaker’s waveforms with IV. E XPERIM ENTS random initialization and a learning rate of 10 until the loss converge. A. Experimental conditions Three kind of baseline methods were adopted for compar- Two datasets were used in our experiments. The first one ison in our experiments. 41-dimensional Mel-cepstral coeffi- contained 1060 parallel Mandarin Chinese utterances from cients (MCCs), 1-dimensional fundamental frequency (F ) and one male speaker (about 53 mins) and one female speaker 5-dimensional band aperiodicities (BAPs) were extracted every (about 72 mins). This dataset was separated into a training 5 ms by STRAIGHT [38] as acoustic features in our baseline set with 1000 utterances, a validation set with 30 utterances systems. The descriptions of these methods are as follows . and a test set with 30 utterances. For the second dataset, JD-GMM: Gaussian mixture models with full-covariance speech data of one male (rms, about 62 mins) and one female matrices were utilized for modeling the joint spectral (slt, about 52 mins) from the CMU ARCTIC database [50] feature vectors of source and target speakers. For each was adopted. This dataset contained 1132 parallel English speaker, static and delta spectral features were used. utterances, which were separated into a training set with 1000 The number of mixtures m was tuned on validation utterances, a validation set with 66 utterances and a test set set with m 2 [16; 32; 48; 64]. Maximum likelihood with 66 utterances. Our analytical experiments in Section IV-B parameters generation (MLPG) with global variance (GV) and Section IV-D only adopted the Mandarin dataset, and enhancement were used for spectral parameter generation. the main objective and subjective evaluations in Section IV-C F was converted by Gaussian normalization in the adopted both datasets. logarithm domain [53]. BAPs were not converted but The recordings of both dataset were sampled at 16kHz. 80- dimensional Mel-scale spectrograms were extracted every 10 Samples of audio are available at https://jxzhanggg.github.io/Seq2SeqVC. PREPRINT MANUSCRIPT OF IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING c 2018 IEEE 7 1.0 TABLE II O BJECTIVE EVALUATION RESULTS OF USING DIFFERENT LOSS FUNCTIONS FOR THE DECODER ON VALIDATION SET. 0.8 Female-to-Male Male-to-Female Settings MCD F RMSE MCD F RMSE 0 0 0.6 (dB) (Hz) (dB) (Hz) MSE 3.397 42.122 3.658 33.420 0.4 MX2 3.365 38.123 3.649 32.271 MX4 3.384 38.629 3.651 34.748 MX6 3.376 38.804 3.669 35.337 0.2 MX8 3.418 39.230 3.637 33.029 “MX2”, “MX4”, “MX6” and “MX8” represent using ML criterion with 0.0 2, 4, 6 and 8 GMM mixture components respectively. 0 50 100 150 200 250 decoder steps Fig. 3. Visualization of the attention alignment and the DTW path of an directly copied from the source, since previous research utterance pair in the validation set. The heat map shows the alignment showed that converting aperiodic component did not probabilities calculated by the attention module in our seq2seq model. The red dashed line shows the alignment path given by DTW, which is downsampled make a statistically significant difference to the quality to match the sample rates of encoder states and decoder time steps. of converted speech [54]. Waveforms were reconstructed by STRAIGHT vocoder from the converted acoustic features. objective performance of these loss functions by experiments DNN: The DNN-based voice conversion models were on both female-to-male and male-to-female conversions using implemented based on Merlin toolkit [55]. The static, the Mandarin dataset. delta and acceleration components of MCCs, F and The Mel-cepstral distortion (MCD) and root mean square BAPs were transformed jointly using a DNN. In addition error (RMSE) of F on validation set were adopted as metrics. to use the acoustic features of the source speaker as Because Mel-spectrograms were adopted as acoustic features, model input, we also concatenated the input acoustic it’s not straightforward to extract F and MCCs features features with the bottleneck features used in our proposed from the converted acoustic features. Therefore, F and 25- method. This approach was named bn-DNN in the rest dimensional MCCs features were extracted by STRAIGHT of this paper. The ReLU activation function was used from the reconstructed waveforms for evaluation. Then the at DNN hidden units. A grid search using validation extracted features were aligned to those of the reference set was adopted in order to pick up the optimal depth utterances in the validation set in order to compute MCD and d and width w of the DNN with d 2 [3; 4; 5; 6] and F RMSE values. The F RMSE was calculated only using the 0 0 w 2 [512; 1024; 2048]. MLPG and GV techniques were frames which were both voiced in the converted and reference used for acoustic parameter generation. Waveform was utterances. reconstructed by STRAIGHT vocoder from the converted TABLE II summarizes the objective evaluation results on acoustic features. validation set. From the table, we can see that the model using VCC2018: This baseline method followed the framework the GMM-ML criterion with 2 mixture components achieves of our previous work [21], which achieved the top rank on the best performance on validation set among all settings naturalness and similarity in Voice Conversion Challenge except the MCD of male-to-female conversion. A further 2018. A speaker-dependent acoustic feature predictor was examination shows that using the GMM-ML criterion with trained by adapting a pre-trained speaker-independent mixture components more than 2 may lead to the instability model using the data of the target speaker. The predictor of attention alignment. Some cases of attention failures, such was an LSTM model which predicted MCCs, F and as getting stuck in one frame, can be observed for MX6 and BAPs of the target speaker from bottleneck features MX8. We tried to re-optimize the weighting factors in Eq. (15) frame-by-frame. At the training stage, bottleneck features for the MX6 and the MX8. The experimental results showed were extracted from the target speaker as model inputs. that changing the coefficients for models with more mixtures At the conversion stage, bottleneck features were obtained could slightly improve the alignment quality while the overall from the speech of the source speaker and were sent into performances of the models were still worse than the MX2 the acoustic feature predictor of the target speaker for model. One possible reason is that larger mixture numbers conversion. In this method, a speaker-dependent WaveNet may increase the number of parameters and the difficulty of vocoder conditioned on MCCs, F and BAPs features was model training. Thus, the GMM-ML criterion with 2 mixtures built for waveform reconstruction. was adopted for L in following experiments. dec The SCENT network models pairs of source and target B. Comparison between different decoder loss functions utterance directly. During training, alignments of utterance As introduced in Section III-D, either MSE or GMM-ML pairs are learned by attention module implicitly. An example criterion was applied to define the loss function L of of the alignment between an utterance pair using the SCENT dec the decoder output in our implementation. We evaluated the model is shown in Fig. 3, where each column denotes the encoder states PREPRINT MANUSCRIPT OF IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING c 2018 IEEE 8 TABLE III TABLE IV O BJECTIVE EVALUATION RESULTS OF BASELINE AND PROPOSED O BJECTIVE EVALUATION RESULTS OF BASELINE AND PROPOSED METHODS ON TEST SET OF M ANDARIN DATASET. METHODS ON TEST SET OF ENGLISH CMU ARCTIC DATASET. Female-to-Male Male-to-Female Female-to-Male Male-to-Female Methods Methods MCD F RMSE MCD F RMSE MCD F RMSE MCD F RMSE 0 0 0 0 (dB) (Hz) (dB) (Hz) (dB) (Hz) (dB) (Hz) JD-GMM 3.892 55.241 4.307 46.625 JD-GMM 3.176 16.473 3.278 16.418 i-JD-GMM 3.936 55.939 4.328 48.286 i-JD-GMM 3.187 14.834 3.274 16.343 DNN 3.688 44.087 4.335 39.190 DNN 3.200 13.998 3.270 14.118 i-DNN 3.750 44.268 4.245 39.877 i-DNN 3.271 14.531 3.296 14.050 bn-DNN 3.618 42.385 4.078 35.883 bn-DNN 3.167 12.675 3.100 13.070 i-bn-DNN 3.725 42.961 4.088 35.019 i-bn-DNN 3.141 11.969 3.182 13.098 VCC2018 3.802 56.874 4.210 39.196 VCC2018 3.384 11.116 3.668 13.707 i-VCC2018 3.854 53.350 4.225 41.257 i-VCC2018 3.354 11.455 3.663 12.631 Proposed 3.556 41.748 3.802 33.374 Proposed 3.212 9.899 3.383 11.704 “i” represents the interpolation of source features for duration “i” represents the interpolation of source features for duration compensation. “bn” denotes appending bottleneck features as input. compensation. “bn” denotes appending bottleneck features as input. TABLE V attention probabilities corresponding to different encoder states THE AVERAGE ABSOLUTE DIFFERENCES BETWEEN THE DURATIONS OF for one decoder step. The DTW algorithm was also conducted THE CONVERTED AND TARGET UTTERANCES (DDUR) ON TEST SET. based on the input and output Mel-spectrogram sequences Conversion Baseline i-Baseline Proposed and the resulting path was plotted as the red dashed line Pairs (second) (second) (second) for comparison. From this figure, we can see that these two F-M (MA) 1.147 0.276 0.194 M-F (MA) 1.157 0.380 0.260 alignments matched well. Comparing with the DTW path F-M (EN) 0.560 0.282 0.227 which denotes hard and deterministic alignment, the attention M-F (EN) 0.556 0.240 0.147 alignment is soft and changes smoothly along consecutive “F-M” and “M-F” represent female-to-male and male-to-female decoder time steps. conversions. “MA” and “EN” represent the Mandarin and English dataset respectively. C. Comparison between baseline and proposed methods without interpolation. Appending bottleneck features as inputs 1) Objective evaluation: Objective evaluations were first was beneficial for improving the objective performance of the carried out to compare the MCD and F RMSE performance DNN-based method. Our proposed method outperformed all of our proposed method and the baseline methods introduced baseline methods, which obtained the lowest MCD and F in Section IV-A, including JD-GMM, DNN, bn-DNN and RMSE. VCC2018. In order to compensate the duration differences TABLE IV shows the results evaluated on the English between source and target speakers, we also tried to linearly CMU ARCTIC dataset. The proposed method achieved best interpolate the source feature sequences before sending them performance on F RMSE, while its performance on MCD into the conversion models according to the average ratio was not as good as some baseline methods. Considering between the training set durations of the two speakers. We that the MCD measurement may be inconsistent with human only interpolated the static part of the source features and the perception [6], [12], [56], some subjective evaluations were dynamic features were recalculated based on the interpolated further conducted and will be introduced later. static features. This led to four additional methods, named i-JD-GMM, i-DNN, i-bn-DNN and i-VCC2018, in our evalu- One advantage of our proposed method is that it can ations. The MCDs and F RMSEs were calculated following convert the duration of source speech using an unified acoustic the way introduced in Section IV-B. For fair comparison, model. In order to investigate the performance of duration F and MCCs were re-extracted by STRAIGHT from the conversion, the scatter diagrams of test utterance durations are converted waveforms for all methods when computing MCDs drawn in Fig. 4 and Fig. 5 for female-to-male and male-to- and F RMSEs. female conversions using the Mandarin dataset. For each test The proposed and baseline methods were evaluated on both utterance, the durations of speech converted using different the Mandarin dataset and the English CMU ARCTIC dataset. baseline methods were the same, i.e., the duration of the When using the English CMU ARCTIC dataset, the same source speech. For the baseline methods with source feature procedure of tuning the decoder output layer as described in interpolation, the same global interpolation ratio was shared by all baseline methods. Therefore, “i-Baseline” and “Baseline” Section IV-B was conducted and the GMM output layer with in these two figures stand for all baseline methods with and 2 mixtures was also chosen. without interpolation respectively. TABLE III shows the objective evaluation results of baseline and proposed methods on test set of the Mandarin dataset. We From these figures, we can see that the male speaker had can see that the MCDs and F RMSEs of baseline methods higher speaking rate and shorter utterance durations than the with interpolation were close to or slightly worse than those female speaker in the Mandarin dataset. The simple linear PREPRINT MANUSCRIPT OF IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING c 2018 IEEE 9 TABLE VI Baseline M EAN OPINION SCORES (MOS) WITH 95% CONFIDENCE INTERVALS ON i-Baseline NATURALNESS AND SIMILARITY OF BASELINE AND PROPOSED M ETHODS. Proposed Conversion i-JD-GMM i-bn-DNN i-VCC2018 Proposed Pairs N 2:08 0:16 2:09 0:12 3:29 0:10 3.70 0:09 F-M (MA) S 1:86 0:13 1:97 0:11 2:55 0:11 3.66 0:09 N 1:62 0:11 1:78 0:12 3:37 0:10 3.68 0:11 M-F (MA) S 1:55 0:09 1:82 0:11 2:29 0:11 3.80 0:09 N 2:90 0:13 2:97 0:13 3:70 0:10 3.93 0:10 F-M (EN) S 3:03 0:12 3:11 0:11 3:84 0:09 4.10 0:08 N 2:30 0:12 2:14 0:11 3:72 0:10 4.10 0:09 M-F (EN) S 2:58 0:11 2:49 0:11 3:70 0:10 4.05 0:09 “F-M” and “M-F” represent female-to-male and male-to-female conversions respectively. “MA” and “EN” represent the Mandarin and English dataset respectively. “N” and “S” denote naturalness and similarity. 0 1 2 3 4 5 6 7 Length of target utterances (second) utterances in the test set were randomly selected and converted Fig. 4. The scatter diagram of the durations of test utterances using our proposed method and three baseline methods, for female-to-male conversion using the Mandarin dataset. including i-JD-GMM, i-bn-DNN, and i-VCC2018. Baseline For the experiments conducted on the Mandarin dataset, i-Baseline ten native listeners participated in the evaluation. For the Proposed experiments conducted on the English CMU ARCTIC dataset, evaluations were conducted on the Amazon Mechanical Turk (AMT), a platform designed to facilitate crowdsourcing. At least twenty native English listeners took part in the evaluation. In both evaluations, the listeners were asked to use headphones and the samples were shown to them in random order. The listeners were asked to give a 5-scale opinion score (5: excellent, 4: good, 3: fair, 2: poor, 1: bad) on both similarity and naturalness for each converted utterance. The results of the subjective evaluations are presented in TABLE VI. From the table, we can see that the i-bn- 0 1 2 3 4 5 6 7 Length of target utterances (second) DNN method achieved similar naturalness and similarity to the i-JD-GMM method. This is consistent with previous Fig. 5. The scatter diagram of the durations of test utterances studies on DNN-based voice conversion methods [7], [8], for male-to-female conversion using the Mandarin dataset. [11]. It should be noticed that the i-bn-DNN method accepted additional bottleneck features as inputs, which may benefit the performance of this method. Compared with the i-bn- interpolation made the length of converted speech closer to DNN method, the i-VCC2018 method did not use acoustic the target. features as inputs. However, this method achieved the best Furthermore, the average absolute differences between the performance among the three baseline methods, especially on durations of the converted and target utterances (DDUR) are the naturalness of converted speech. One important reason is calculated using both Mandarin and English datasets and that the i-VCC2018 method adopted WaveNet vocoder instead are presented in TABLE V. Results show that our proposed of conventional STRAIGHT vocoder to reconstruct speech method can generate speech with lower duration errors than waveforms from the converted acoustic features. the baseline methods without duration modification or with Our proposed method outperformed the i-VCC2018 method global speaking rate compensation. on both naturalness and similarity, also on both Mandarin Fig. 6 plots the F contours and spectrograms of one test and English datasets. These experimental results proved the utterance converted using different methods and the natural effectiveness of our proposed method and the improvement target reference in the Mandarin dataset. From this figure, we brought by our proposed method was not limited to a specific can see that our proposed method can generate speech with language. One possible reason is that at the conversion stage more similar F contours to the natural reference than the other of the i-VCC2018 method, bottleneck features extracted from two baseline methods. Furthermore, our proposed method can source speech were fed to the acoustic predictor. While also modify the duration of source speech towards the natural the model was trained with the bottleneck features of the reference appropriately as shown in this figure. target speaker as inputs [21]. This inconsistency may degrade 2) Subjective evaluation: Subjective evaluations were con- the similarity of converted speech. Another reason can be ducted to compare the performance of our proposed method attributed to the duration conversion ability of our proposed with the baseline methods in terms of the naturalness and similarity of converted speech. In this evaluation, twenty https://www.mturk.com Length of converted utterances (second) Length of converted utterances (second) F (Hz) F (Hz) 0 0 F (Hz) F (Hz) 0 0 PREPRINT MANUSCRIPT OF IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING c 2018 IEEE 10 (a) bn-DNN (c) Proposed 8 360 8 360 6 6 240 240 4 4 120 120 2 2 0 0 (b) VCC2018 (d) Target 8 360 8 360 6 6 240 240 4 4 120 120 2 2 0 0 0 1 2 3 4 5 0 1 2 3 4 5 Time (second) Time (second) Fig. 6. The F contours and spectrograms of one test utterance converted using different methods and the natural target reference. The red dashed lines are F contours extracted by STRAIGHT from the converted waveforms. TABLE VII TABLE VIII O BJECTIVE EVALUATION RESULTS OF PROPOSED METHODS WITHOUT O BJECTIVE EVALUATION RESULTS OF PROPOSED METHODS WITH AND USING MEL- SPECTROGRAM S AND W ITHOUT USING BOTTLENECK WITHOUT THE ATTENTION MODULE. FEATURES AS INPUTS. Female-to-Male Male-to-Female Female-to-Male Male-to-Female Methods MCD F RMSE MCD F RMSE 0 0 Methods MCD F RMSE MCD F RMSE 0 0 (dB) (Hz) (dB) (Hz) (dB) (Hz) (dB) (Hz) Proposed 3.556 41.748 3.802 33.374 Proposed 3.556 41.748 3.802 33.374 w/o-att 3.635 47.620 3.969 37.948 w/o-Mel 3.623 43.443 3.803 35.463 i-w/o-att 3.770 50.310 3.914 37.034 w/o-bn 3.624 48.550 4.000 40.183 “w/o-att” and “i-w/o-att” represent models without attention module “w/o-Mel” and “w/o-bn” represent the models without using Mel- and without attention module but adjusting speaking rate globally by spectrograms and without using bottleneck features as inputs interpolation respectively. respectively. Mel-spectrograms. Listening to the converted audio samples method as introduced in the objective evaluations. Therefore, without using bottleneck features, we found they suffered from the prosody similarity and naturalness of our proposed method serious mispronunciation problem. The bottleneck features were better than simply adjusting speaking rate globally. extracted by an ASR model contain high-level and linguistic- related information. The experimental results indicate that they D. Ablation tests were essential for achieving stable voice conversion results in our proposed method. In order to further analyze the effectiveness of some key F contours and spectrograms of one test utterance con- components in our model, ablation tests on model inputs, verted by the proposed method and the proposed method attention module and location code were conducted. In this without bottleneck features are presented in Fig. 7 (a) and subsection, only the Mandarin dataset was adopted for evalu- Fig. 7 (b) respectively. Compared to the method without using ation. bottleneck features, the F contour of the utterance converted 1) Mel-spectrograms and bottleneck features: In order to by our proposed method is more similar to that of the natural investigate the necessity of using Mel-spectrograms and bot- reference in Fig. 7 (d). Also, a significant spectral distortion tleneck features, we removed each one of them and built can be observed at the 1  2s interval of the spectrogram SCENT models utilizing only source bottleneck features or generated by the “w/o-bn” method. Mel-spectrograms as inputs respectively. Objective evaluation results of MCD and F RMSE on test set are presented in 2) Attention module: The attention module in a SCENT TABLE VII. model helps to achieve the alignment between input and output From this table, we can see that Mel-spectrograms are feature sequences at the training stage and to predict target beneficial for the model to achieve more accurate prediction of durations at the conversion stage. In order to investigate how acoustic features. It also can be found that removing bottleneck the attention module contributed to the overall performance features led to higher F RMSE and MCD on test set, and its of our proposed method, we modified the SCENT model degradation on F RMSE was more serious than removing to a frame-by-frame transformation model without attention Frequency (kHz) Frequency (kHz) Frequency (kHz) Frequency (kHz) F (Hz) F (Hz) 0 0 F (Hz) F (Hz) 0 0 PREPRINT MANUSCRIPT OF IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING c 2018 IEEE 11 (a) Proposed (c) i-w/o-att 8 360 8 360 6 6 240 240 4 4 120 120 2 2 0 0 (b) w/o-bn (d) Target 8 360 8 360 6 6 240 240 4 4 120 120 2 2 0 0 0 1 2 3 4 5 6 0 1 2 3 4 5 6 Time (second) Time (second) Fig. 7. The F contours and spectrograms of one test utterance converted using different methods and the natural target reference. “w/o-bn” and “i-w/o-att” represent the proposed models without bottleneck features and without attention module but adjusting speaking rate globally by interpolation respectively. The red dashed lines are F contours extracted by STRAIGHT from the converted waveforms. TABLE IX TABLE X THE RESULTS OF PREFERENCE TESTS ON SIM ILARITY AMONG PROPOSED O BJECTIVE EVALUATION RESULTS OF PROPOSED METHODS WITH AND METHODS W ITH AND WITHOUT THE ATTENTION M ODULE. WITHOUT THE LOCATION CODE. w/o-att i-w/o-att Proposed N/P MCD F RMSE DDUR p Methods (%) (%) (%) (%) (dB) (Hz) (second) 33.0 58.5 - 8.5 1:31 10 Proposed 3.556 41.748 0.194 F-M F-M - 21.0 67.5 11.5 < 1 10 w/o-locc 3.590 41.783 0.205 Proposed 3.802 33.374 0.260 17.5 76.0 - 6.5 < 1 10 M-F M-F w/o-locc 3.822 35.561 0.307 - 24.0 66.5 9.5 < 1 10 “DDUR” represents average absolute difference between the dura- “p” represents p value of t-test. “N/P” denotes no preference. “F-M” tions of the converted and target utterances. “w/o-locc” represents and “M-F” represent female-to-male and male-to-female conversions models without location code. “F-M” and “M-F” represent female- respectively. to-male and male-to-female conversions respectively. mechanism for comparison. Once the attention module was removed, the LSTM layer with attention in the decoder became module significantly affected on the similarity, the preference a plain uni-directional LSTM. In order to get frame aligned tests focused on the similarity aspect of converted speech. sequence pairs for model training, the input sequences were Ten native listeners were involved in evaluation and the wrapped towards the target ones using DTW algorithm and experimental results are presented in TABLE IX. This table MCCs features. The other parts of the SCENT model were shows that the strategy of global speaking rate adjustment by kept unchanged. source interpolation can improve the similarity of converted Our experiments compared three methods, including the speech in both conversion pairs. The proposed method with proposed method, the proposed method without attention attention module outperformed the method without attention (w/o-att) and the proposed method without attention but but using source interpolation. These results further confirmed using source interpolation at conversion time (i-w/o-att). TA- the effectiveness of the attention module. BLE VIII shows the MCDs and F RMSEs of these three methods. We can see that the prediction errors increased in the 3) Location code: Ablation tests were conducted for inves- absence of the attention module. F contours and spectrograms 0 tigating how the location code affected the performance of the of one test utterance converted by proposed method and “i- model. In the experiments, the location code was removed and w/o-att” method are presented in Fig. 7 (a) and Fig. 7 (c) the models were built in the same conditions. MCD, F RMSE respectively. This figure again shows the effectiveness of the and DDUR were calculated and are presented in TABLE X. A attention module for generating speech with duration and F 0 slight raise of MCD and F RMSE after removing the location contour closer to that of the natural target speech. code can be observed from this table. Furthermore, the DDURs Furthermore, a group of preference tests were conducted to in female-to-male and male-to-female conversions increased compare the subjective performance of these three methods. by 5.4% and 15.3% respectively. These experimental results Because the duration conversion achieved by the attention demonstrated the positive effects of using the location code. Frequency (kHz) Frequency (kHz) Frequency (kHz) Frequency (kHz) PREPRINT MANUSCRIPT OF IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING c 2018 IEEE 12 E. Discussions importance of the attention module and the positive effect of the location code were also proved in our ablation studies. To As discussed in Section II, directly implementing seq2seq investigate the influence of training set size on the performance models at utterance level is difficult for the voice conversion of our proposed method and to reduce conversion errors by task. The input and output sequences in voice conversion are improving attention calculation will be our work in the future. composed of frame-level features and are relatively long thus it is a challenge for the attention mechanism to search for the correct hidden entries to pay attention to. Once there are REFERENCES abnormal skips or repetitions in the sequence of attention [1] D. G. Childers, B. Yegnanarayana, and K. Wu, “Voice conversion: probabilities, mistakes of converted speech may occur. Factors responsible for quality,” in IEEE International Conference on These difficulties are considered when designing the Acoustics, Speech and Signal Processing, 1985, pp. 748–751. [2] D. G. Childers, K. Wu, D. M. Hicks, and B. Yegnanarayana, “Voice SCENT model. In order to improve attention stability, conversion,” Speech Communication, vol. 8, no. 2, pp. 147–158, 1989. the techniques of forward attention and adding location [3] A. Kain, “Spectral voice conversion for text-to-speech synthesis,” features are used when calculating attention probabilities. in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), vol. 1, 1998, pp. 285–288. The bottleneck features can also provide linguistic-related [4] L. M. Arslan, “Speaker transformation algorithm using segmental information to help the attention-based alignment between codebooks (STASC),” Speech Communication, vol. 28, no. 3, pp. 211– input and output feature sequences. However, errors still can 226, 1999. [5] M. Muller, “Dynamic time warping,” Information retrieval for music not be completely avoided in the converted speech. Additional and motion, pp. 69–84, 2007. 100 non-parallel utterances of both speakers in the Mandarin [6] T. Toda, A. W. Black, and K. Tokuda, “Voice conversion based on dataset, which were out of the dataset used for previous maximum-likelihood estimation of spectral parameter trajectory,” IEEE Transactions on Audio Speech and Language Processing, vol. 15, no. 8, experiments, were adopted for error analysis. The utterances pp. 2222–2235, 2007. of the male speaker contained 2747 phonemes, while the [7] S. Desai, E. V. Raghavendra, B. Yegnanarayana, A. W. Black, and utterances of the female speaker had 2538 phonemes. We K. Prahallad, “Voice conversion using artificial neural networks,” in IEEE International Conference on Acoustics, Speech and Signal conducted male-to-female and female-to-male conversions Processing (ICASSP), April 2009, pp. 3893–3896. for these utterances and identified different categories of [8] S. Desai, A. W. Black, B. Yegnanarayana, and K. Prahallad, “Spectral conversion errors subjectively. In male-to-female conversion, mapping using artificial neural networks for voice conversion,” IEEE Transactions on Audio Speech and Language Processing, vol. 18, no. 5, there were 1 skipping phoneme error, 2 completion prediction pp. 954–964, 2010. errors, 34 phoneme pronunciation errors, 31 tone defects and [9] D. M. Titterington, A. F. M. Smith, and U. E. Makov, Statistical analysis 10 phoneme quality defects. In female-to-male conversion, of finite mixture distributions. Wiley, 1985. [10] K. Hornik, “Multilayer feedforward neural networks are universal there were 19 phoneme pronunciation errors, 20 tone defects approximators,” Neural Networks, vol. 2, 1989. and 17 phoneme quality defects. [11] R. H. Laskar, D. Chakrabarty, F. A. Talukdar, K. S. Rao, and K. Banerjee, Several reasons may lead to these errors. First, the proposed “Comparing ANN and GMM in a voice conversion framework,” Applied Soft Computing Journal, vol. 12, no. 11, pp. 3332–3342, 2012. model contains about 7.5 M trainable parameters thus is [12] L.-H. Chen, Z.-H. Ling, L.-J. Liu, and L.-R. Dai, “Voice conversion complex and needs to be trained in a data-driven way. using deep neural networks with layer-wise generative training,” Therefore, the insufficiency of training data may cause the IEEE/ACM Transactions on Audio Speech and Language Processing, vol. 22, no. 12, pp. 1859–1872, 2014. model’s lack of generalization ability when dealing with [13] L. Sun, S. Kang, K. Li, and H. Meng, “Voice conversion using deep unseen utterances. Also, the extracted bottleneck features may bidirectional long short-term memory based recurrent neural networks,” also be misleading due to the accuracy limitation of the ASR in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015, pp. 4869–4873. model. To further reduce conversion errors and to produce [14] T. Nakashika, T. Takiguchi, and Y. Ariki, “Voice conversion using RNN more reliable conversion results using seq2seq models will be pre-trained by recurrent temporal restricted Boltzmann machines,” IEEE an important task of our future work. Transactions on Audio, Speech, and Language Processing, vol. 23, no. 3, pp. 580–587, 2015. [15] J. Lai, B. Chen, T. Tan, S. Tong, and K. Yu, “Phone-aware LSTM- V. CONCLUSION RNN for voice conversion,” in IEEE International Conference on Signal Processing (ICSP), 2016, pp. 177–182. This paper presents SCENT, a sequence-to-sequence neural [16] S. H. Mohammadi and A. Kain, “An overview of voice conversion network, for acoustic modeling in voice conversion. Mel- systems,” Speech Communication, vol. 88, pp. 65–82, 2017. [17] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning spectrograms are used as acoustic features. Bottleneck features with neural networks,” Neural Information Processing Systems, pp. extracted by an ASR model are taken as additional linguistic- 3104–3112, 2014. related descriptions and are concatenated with the source [18] K. Cho, B. Van Merrienboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using acoustic features as network inputs. Taking advantage of the RNN encoder–decoder for statistical machine translation,” Empirical attention mechanism, the SCENT model does not rely on the Methods in Natural Language Processing, pp. 1724–1734, 2014. preprocessing of DTW alignment and the duration conversion [19] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” International Conference on can be achieved simultaneously. Finally, the converted acoustic Learning Representations, 2015. features are passed through a WaveNet vocoder to reconstruct [20] T. Luong, H. Pham, and C. D. Manning, “Effective approaches to speech waveforms. Objective and subjective experimental attention-based neural machine translation,” Empirical Methods in Natural Language Processing, pp. 1412–1421, 2015. results demonstrated the superiority of our proposed method [21] L.-J. Liu, Z.-H. Ling, and L.-R. Dai, “WaveNet vocoder with limited compared with baseline methods, especially in durational training data for voice conversion,” in Annual Conference of the aspect. Ablation tests further proved the benefits of inputting International Speech Communication Association, INTERSPEECH, Mel-spectrograms and the necessity of bottleneck features. The 2018, pp. 1983–1987. PREPRINT MANUSCRIPT OF IEEE/ACM TRANSACTIONS ON AUDIO, SPEECH AND LANGUAGE PROCESSING c 2018 IEEE 13 [22] J.-C. Chou, C.-C. Yeh, H.-Y. Lee, and L.-S. Lee, “Multi-target voice [42] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A conversion without parallel data by adversarially learning disentangled neural network for large vocabulary conversational speech recognition,” audio representations,” in Annual Conference of the International Speech in IEEE International Conference on Acoustics, Speech and Signal Communication Association, INTERSPEECH, 2018, pp. 501–505. Processing (ICASSP), 2016, pp. 4960–4964. [23] S. O. Arik, J. Chen, K. Peng, W. Ping, and Y. Zhou, “Neural voice [43] J. L. Ba, J. R. Kiros, and G. E. Hinton, “Layer normalization,” stat, vol. cloning with a few samples,” in Advances in Neural Information 1050, p. 21, 2016. Processing Systems, 2018, pp. 10 040–10 050. [44] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Advances [24] Y. Wang, R. J. Skerry-Ryan, D. Stanton, Y. Wu, R. J. Weiss, N. Jaitly, in Neural Information Processing Systems, 2017, pp. 6000–6010. Z. Yang, Y. Xiao, Z. Chen, S. Bengio et al., “Tacotron: Towards end-to- [45] J.-X. Zhang, Z.-H. Ling, and L.-R. Dai, “Forward attention in end speech synthesis,” in Annual Conference of the International Speech sequence-to-sequence acoustic modeling for speech synthesis,” in IEEE Communication Association, INTERSPEECH, 2017, pp. 4006–4010. International Conference on Acoustics, Speech and Signal Processing [25] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, (ICASSP), 2018, pp. 4789–4793. Y. Zhang, Y. Wang, R. J. Skerry-Ryan et al., “Natural TTS synthesis [46] C. M. Bishop, “Mixture density networks,” Technical Report by conditioning WaveNet on mel spectrogram predictions,” in IEEE NCRG/4228, Aston University, Birmingham, UK, 1994. International Conference on Acoustics, Speech and Signal Processing [47] H. Zen and A. Senior, “Deep mixture density networks for (ICASSP), 2018, pp. 4779–4783. acoustic modeling in statistical parametric speech synthesis,” in IEEE [26] W. Ping, K. Peng, A. Gibiansky, S. O. Arik, A. Kannan, S. Narang, International Conference on Acoustics, Speech and Signal Processing J. Raiman, and J. P. Miller, “Deep Voice 3: 2000-speaker neural text-to- (ICASSP), May 2014, pp. 3844–3848. speech,” International Conference on Learning Representations, 2018. [48] M. B. Christopher, Pattern Recognition and Machine Learning. Spring- [27] H. Tachibana, K. Uenoyama, and S. Aihara, “Efficiently trainable text- Verlag New York, 2016. to-speech system based on deep convolutional networks with guided [49] B. Xu, N. Wang, T. Chen, and M. Li, “Empirical evaluation of recitified attention,” in IEEE International Conference on Acoustics, Speech and acitvations in convolutional network,” ICML Deep Learning Workshop, Signal Processing (ICASSP), 2018, pp. 4784–4788. Lille, France, 06-11 July, 2015. [28] Y. Jia, Y. Zhang, R. J. Weiss, Q. Wang, J. Shen, F. Ren, Z. Chen, [50] J. Kominek and A. W. Black, “CMU ARCTIC databases for P. Nguyen, R. Pang, I. L. Moreno et al., “Transfer learning from speaker speech synthesis,” http://festvox.org/cmu arctic/index.html, 2003, Lang. verification to multispeaker text-to-speech synthesis,” in Advances in Technol. Inst., Carnegie Mellon Univ., Pittsburgh, PA. Neural Information Processing Systems, 2018, pp. 4485–4495. [51] D. Krueger, T. Maharaj, J. Kramar, M. Pezeshki, N. Ballas, N. R. Ke, [29] E. Nachmani, A. Polyak, Y. Taigman, and L. Wolf, “Fitting new speakers A. Goyal, Y. Bengio, A. C. Courville, and C. Pal, “Zoneout: Regularizing based on a short untranscribed sample,” in International Conference on RNNs by randomly preserving hidden activations,” International Machine Learning, 2018, pp. 3683–3691. Conference on Learning Representations, 2017. [30] Y. Taigman, L. Wolf, A. Polyak, and E. Nachmani, “VoiceLoop: Voice [52] D. Kingma and J. Ba, “Adam: A method for stochastic optimization,” fitting and synthesis via a phonological loop,” International Conference Computer Science, 2014. on Learning Representations, 2018. [53] D. T. Chappell and J. H. L. Hansen, “Speaker-specific pitch contour [31] M. V. Ramos, “Voice conversion with deep learning,” Master’s Thesis, modeling and modification,” in IEEE International Conference on Instituto Superior Tecnico, ´ 10 2016. Acoustics, Speech and Signal Processing (ICASSP), vol. 2, 1998, pp. [32] T. Kaneko, H. Kameoka, K. Hiramatsu, and K. Kashino, “Sequence-to- 885–888. sequence voice conversion with similarity metric learned using gener- [54] Y. Ohtani, T. Toda, H. Saruwatari, and K. Shikano, “Maximum ative adversarial networks,” in Annual Conference of the International likelihood voice conversion based on GMM with STRAIGHT mixed Speech Communication Association, INTERSPEECH, 2017, pp. 1283– excitation,” in Proc. ICSLP, 2006, pp. 2266–2269. [55] Z. Wu, O. Watts, and S. King, “Merlin: An open source neural network [33] H. Miyoshi, Y. Saito, S. Takamichi, and H. Saruwatari, “Voice speech synthesis system,” in 9th ISCA Speech Synthesis Workshop conversion using sequence-to-sequence learning of context posterior (SSW9), 2016. probabilities,” in Annual Conference of the International Speech [56] Z.-H. Ling, L. Deng, and D. Yu, “Modeling spectral envelopes using Communication Association, INTERSPEECH, 2017, pp. 1268–1272. restricted Boltzmann machines and deep belief networks for statistical [34] A. V. Den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, parametric speech synthesis,” IEEE Transactions on Audio, Speech, and A. Graves, N. Kalchbrenner, A. W. Senior, and K. Kavukcuoglu, Language Processing, vol. 21, no. 10, pp. 2129–2139, Oct 2013. “WaveNet: A generative model for raw audio,” in 9th ISCA Speech Synthesis Workshop (SSW9), 2016, pp. 125–125. [35] K. Kobayashi, T. Hayashi, A. Tamamori, and T. Toda, “Statistical voice conversion with WaveNet-based waveform generation,” in Annual Conference of the International Speech Communication Association, INTERSPEECH, 2017, pp. 1138–1142. [36] J. Niwa, T. Yoshimura, K. Hashimoto, K. Oura, Y. Nankaku, and K. Tokuda, “Statistical voice conversion based on WaveNet,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5289–5293. [37] M. Morise, F. Yokomori, and K. Ozawa, “WORLD: A vocoder-based high-quality speech synthesis system for real-time applications,” IEICE Transactions on Information and Systems, vol. 99, no. 7, pp. 1877–1884, [38] H. Kawahara, I. Masuda-Katsuse, and A. D. Cheveigne, ´ “Restructuring speech representations using a pitch-adaptive time-frequency smoothing and an instantaneous-frequency based F0 extraction: Possible role of a repetitive structure in sounds,” Speech Communication, vol. 27, no. 34, pp. 187–207, 1999. [39] X. Wang, J. Lorenzo-Trueba, S. Takaki, L. Juvela, and J. Yamagishi, “A comparison of recent waveform generation and acoustic modeling meth- ods for neural-network-based speech synthesis,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 4804–4808. [40] Y. Ai, H.-C. Wu, and Z.-H. Ling, “SampleRNN-based neural vocoder for statistical parametric speech synthesis,” in IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018, pp. 5659–5663. [41] S. Hchreiter and J. Schmidhuber, “Long short-term memory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780, 1997.

Journal

Electrical Engineering and Systems SciencearXiv (Cornell University)

Published: Oct 16, 2018

There are no references for this article.