Improved training of end-to-end attention models for speech recognition

Improved training of end-to-end attention models for speech recognition 1,2,3 1 1 1,2 Albert Zeyer , Kazuki Irie , Ralf Schlu¨ter , Hermann Ney Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, 52062 Aachen, Germany, AppTek, USA, http://www.apptek.com/, NNAISENSE, Switzerland, https://nnaisense.com/ {zeyer, irie, schlueter, ney}@cs.rwth-aachen.de Abstract many other domains such as images [23]. Recent investiga- Sequence-to-sequence attention-based models on subword units tions have shown promising results by applying the same ap- allow simple open-vocabulary end-to-end speech recognition. proach for speech recognition [24–28]. In this work, we also In this work, we show that such models can achieve compet- investigate techniques to improve recurrent encoder-attention- itive results on the Switchboard 300h and LibriSpeech 1000h decoder based systems for speech recognition. We use long tasks. In particular, we report the state-of-the-art word error short-term memory (LSTM) neural networks [29] for the en- rates (WER) of 3.54% on the dev-clean and 3.82% on the test- coder and the decoder. Our model is similar to the architec- clean evaluation subsets of LibriSpeech. We introduce a new ture used in machine translation [30], except of encoder time pretraining scheme by starting with a high time reduction factor reduction. This generality of the model and the simplicity is and lowering it during training, which is crucial both for con- its strength. Although a valid argument against this model for vergence and final performance. In some experiments, we also speech recognition is that it is in fact too powerful because it use an auxiliary CTC loss function to help the convergence. In does not require monotonicity in its implicit alignments. There addition, we train long short-term memory (LSTM) language are attempts to restrict the attention to become monotonic in models on subword units. 
By shallow fusion, we report up to various ways [31–38]. In this work, our models are without 27% relative improvements in WER over the attention baseline these modifications and extensions. without a language model. Recently, alternative models for end-to-end modeling were also suggested, such as inverted HMMs [39], the recurrent Index Terms: attention, end-to-end, speech recognition transducer [40–42], or the recurrent neural aligner [43]. In many ways, these can all be interpreted in the same encoder- 1. Introduction decoder-attention framework, but these approaches often use Conventional speech recognition systems [1] with neural net- some variant of hard latent monotonic attention instead of soft work (NN) based acoustic models using the hybrid hidden attention. Markov models (HMM) / NN approach [2, 3] usually oper- Our models operate on subword units which are created ate on the phone level, given a phonetic pronunciation lexicon via byte-pair encoding (BPE) [44]. We introduce a pretrain- (from phones to words). They require a pretraining scheme ing scheme applied on the encoder, which grows the encoder in with HMM and Gaussian mixture models (GMM) as emission layer depth, as well as decreases the initial high encoder time re- probabilities to bootstrap good alignments of the HMM states. duction factor. To the best of our knowledge, we are the first to Context-independent phones are used initially because context- apply pretraining for encoder-attention-decoder models. We use dependent phones need a good clustering, which is usually cre- RETURNN [30,45] based on TensorFlow [46] for its computa- ated on good existing alignments (via a Classification And Re- tion. We have implemented our own flexible and efficient beam gression Tree (CART) clustering [4]). This boot-strapping pro- search decoder and efficient LSTM kernels in native CUDA. In cess is iterated a few times. 
Then a hybrid HMM / NN is trained addition, we train subword-level LSTM language models [47], with frame-wise cross entropy. Recognition with such a model which we integrate in the beam search by shallow fusion [48]. requires a sophisticated beam search decoder. Handling out-of- The source code is fully open , as well as all the setups of the vocabulary words is also not straightforward and increases the experiments in this paper . We report competitive results on the complexity. There was certain work to remove the GMM de- 300h-Switchboard and LibriSpeech [49]. In particular on Lib- pendency in the pretraining [5], or to be able to train without rispeech, our system achieves WERs of 3.54% on the dev-clean an existing alignment [6–8], or to avoid the lexicon [9], which and 3.82% on the test-clean evaluation subsets, which are the simplifies the pretraining procedure but still is not end-to-end. best results obtained on this task to the best of our knowledge. An end-to-end model in speech recognition generally de- notes a simple single model which can be trained from scratch, 2. Pretraining and usually directly operates on words, sub-words or character- Compared to machine translation, the input sequences are much s/graphemes. This removes the need for a pronunciation lexicon longer in speech recognition, relatively to the output sequence and the whole explicit modeling of phones, and it greatly sim- (e.g. with BPE 10K subword units, and audio feature frames plifies the decoding. every 10ms, more than 30 times longer on Switchboard on Connectionist temporal classification (CTC) [10] has been average). However, as the original input is continuous, some often used as an end-to-end model for speech recognition, often sort of downscaling in the time dimension works, such as con- on characters/graphemes [11–16] or on sub-words [17] but also catenation in the feature dimension of consecutive time-frames directly on words [18, 19]. 
The encoder-decoder framework with attention has become https://github.com/rwth-i6/returnn the standard approach for machine translation [20–22] and https://github.com/rwth-i6/returnn-experiments/tree/master/2018-asr-attention arXiv:1805.03294v1 [cs.CL] 8 May 2018 [7,24,42,50]. We use max-pooling in the time-dimension which 4. Sub-word units is simpler. The time reduction can be done directly on the fea- Characters/graphemes are probably the most generic and sim- tures or alternatively at multiple steps inside the encoder, e.g. ple output units for generating texts but it has been shown that after every encoder layer [24]. This is also what we do. This al- sub-word units can perform better [26] and they can be just as lows the encoder to better compress any necessary information. generic since the characters can be included in the set of sub- We observed that a high time reduction factor makes the word units. Using words as output units is also possible but training much simpler. In fact, without careful tuning, usually it does not allow to recognize out-of-vocabulary words and it the model will not converge without a high time reduction factor requires a large softmax output and thus is computational ex- (16 or 32), as it was also observed in the literature [24]. How- pensive. An inhomogeneous length distribution as well as an ever, we also observed that a low time reduction factor (e.g. 8) imbalance in the label occurence can also make training harder. can perform better after all, when pretrained with a high time In all the experiments, we use byte-pair encoding (BPE) reduction factor. [44] to create subword units, which are the output targets of the Also, it has been shown that deep LSTM models can benefit decoder. The beam search decoding will go over these BPE from layer-wise pretraining, by starting with 1 or 2 layers and units, and then select the best hypothesis. Therefore, our sys- adding more and more layers [1]. 
We apply the same pretrain- tem is open-vocabulary. At the end of decoding, the BPE units ing. are merged into words in order to obtain the best hypothesis To improve the convergence further, we disable label on word level. In addition, we add the special tokens from the smoothing during pretraining and only enable it after pretrain- transcriptions which denote noise, vocalized-noise and laughter ing. Also, we disable dropout during the first few pretraining in our BPE vocabulary set. Our recognizer can also potentially epochs in the encoder. recognize these special events. 3. Model 5. Language model combination We use a deep bidirectional LSTM encoder network, and LSTM We also improve the recognition accuracy of our recognizer us- decoder network. After every layer in the encoder, we option- ing external language models. We train LSTM language mod- ally do max-pooling in the time dimension to reduce the en- els [47] on the same BPE vocabulary set as the end-to-end coder length. I.e. for the input sequence x , we end up with the model, using RETURNN with TensorFlow. For Switchboard, encoder state the training set of 27M words concatenating Switchboard and T T Fisher parts of transcriptions was used. For LibriSpeech, we h = LSTM ◦ · · · ◦ max-pool ◦ LSTM (x ), #enc 1 1 1 1 use the 800M-word dataset officially available for training lan- where T = red·T for the time reduction factor red, and #enc guage models. It can be noted that in the case of Switchboard, is the number of encoder layers, with #enc ≥ 2. We use the there is some overlap between the training data for language MLP attention [20,21,31,32,51]. Our model closely follows the models and the transcription used to train the end-to-end model: machine translation model presented by Bahar et al. [51] and 3M out of 27M words are used to train the end-to-end system. Bahdanau et al. 
[20] and we use a variant of attention weight While for the LispriSpeech, 800M-word data is fully external to / fertility feedback [52], which is inverse in our case, to use the end-to-end models. Our experiments show that this differ- a multiplication instead of a division, for better numerical sta- ence in amount of external data directly affects the performance bility. More specifically, the attention energies e ∈ R for i,t improvements by the use of external language model. For both encoder time-step t and decoder step i are defined as tasks, we use a LSTM LM with one input projection layer size e = v tanh(W [s , h , β ]), i,t i t i,t of 512 dimension and two LSTM layers with 2048 nodes. We apply dropout at the input of all hidden layers with the rate of where v is a trainable vector, W a trainable matrix, s the cur- 0.2. The standard stochastic gradient descent with global gradi- rent decoder state, h the encoder state, and β is the attention t i,t ent clipping is used for optimization to train all LSTM LMs. weight feedback, defined as We integrate the external language model in the beam i−1 ⊤ search by shallow fusion [48]. The weight for the language β = σ(v h ) · α , i,t β t k,t model has been optimized by grid search on the development k=1 set WER. We found 0.23 and 0.36 to be optimal respectively where v is a trainable vector. Then the attention weights are for Switchboard and LibriSpeech (the weight on the attention defined as model is 1). α = softmax (e ) i t i For LibriSpeech, we also train Kneser-Ney smoothed n- gram count based language models [53] on the same BPE vo- and the attention context vector is given as cabulary set using SRILM toolkit [54]. The comparison of per- c = α h . i i,t t plexities can be found in Table 1. We also report WERs using the 4-gram count model by shallow fusion with a weight of 0.01, The decoder state is recurrent function implemented as for comparison to the performance of LSTM LM. 
s = LSTMCell(s , y , c ) i i−1 i−1 i−1 Table 1: Perplexities (PPL) on the concatenation of dev-clean and the final prediction probability for the output symbol y is and dev-other sets of LibriSpeech. All models have the same given as vocabulary of 10K BPE. p(y |y , x ) = softmax(MLP (s , y , c )). LM 3-gram 4-gram 5-gram LSTM i i−1 i i−1 i 1 readout PPL 104.6 88.2 85.1 65.9 In our case we use MLP = linear ◦ maxout ◦ linear. readout http://www.openslr.org/11/ Table 2: Comparisons on Switchboard 300h. The hybrid HM- 6. Experiments M/NN model is a 6 layer deep bidirectional LSTM. The attention All attention models and neural network language models were model has a 6 layer deep bidirectional LSTM encoder and a 1 trained and decoded with RETURNN. For both Switchboard layer LSTM decoder. CDp are (clustered) context-dependend and LibriSpeech, we first used the BPE vocabulary of 10K phones. Byte-pair encoding (BPE) are sub-word units. SWB subword units to tune the hyperparameters of the model, then and CH are from Hub5’00. added noise from external data. trained the models with 1K and 5K BPE units. We found 1K added the lexicon, i.e. also additional data. and 10K to be optimal for Switchboard and LibriSpeech respec- label WER[%] tively. We use label smoothing [55], dropout [56], Adam [57], model LM unit SWB CH Hub5’01 learning rate warmup [26], and automatic learning rate schedul- ing according to a cross-validation set (”Newbob”) [1]. LF MMI, 2016 [7] 4-gram CDp 9.6 19.3 hybrid 4-gram CDp 9.8 19.0 14.7 6.1. Pretraining hybrid LSTM CDp 8.3 17.3 12.9 In all cases we use layer-wise pretraining for the encoder, where CTC , 2014 [12] RNN chars 20.0 31.8 we start with two encoder layers and a single max-pool in be- CTC, 2015 [60] none chars 38.0 56.1 tween with factor 32. 
Then we add a LSTM layer and a max- CTC, 2015 [60] RNN chars 21.4 40.2 pool in between, and we reduce the first max-pool to factor 16 attention, 2016 [61] none chars 32.8 52.7 and the new one with factor 2 such that we always keep the same attention, 2016 [61] 5-gram chars 30.5 50.4 total encoder time reduction factor of 32. Only when we end up attention, 2016 [61] none words 26.8 48.2 at 6 layers, we remove some of the max-pooling ops to get a attention, 2016 [61] 3-gram words 25.8 46.0 final total time reduction factor of e.g. 8. Directly starting with CTC, 2017 [16] none chars 24.7 37.1 a time reduction factor of 8 with and with 2 layers did not work CTC, 2017 [16] n-gram chars 19.8 32.1 for us. Also directly starting with 6 layers and time reduction CTC , 2017 [16] word RNN chars 14.0 25.3 factor of 32 did not work for us. Similar experiments for trans- attention, 2017 [28] none chars 23.1 40.8 lation converged also without pretraining, however with much BPE 10K 13.5 27.1 19.9 none worse performance compared when layer-wise pretraining was attention BPE 1K 13.1 26.1 19.7 used [30]. With more careful tuning or more training data, it LSTM BPE 1K 11.8 25.7 18.1 might have worked without pretraining as it is seen in the liter- ature, however, that is not necessary with pretraining. the shallow fusion with LSTM LM brings from 17% to 27% rel- We were interested in the optimal final total time reduction ative improvements in terms of WER on different subsets. This factor, after the pretraining with time reduction factor 32. We improvement is much larger than in the case of Switchboard. tried factor 8, 16 and 32, and ended up with 20.4, 21.0 and 21.9 The amount of data is most likely the reason for this observa- WER% respectively, on the full Hub5’00 set (Switchboard + tion. For Librispeech, the external data of 800M words is used Callhome). 
Thus we continue to use a final reduction factor of to train the language models, which is 80 times larger than the 8 in all further experiments. Note that a lower factor requires 10M words corresponding to the transcription of 1000 hours of more memory and more computation for the global attention audio. In addition, this 10M transcription is not part of the lan- and was not feasible with our hardware and computational re- guage model training data. In case of Switchboard, the LM is sources. trained only on about 27M words, including 3M of transcription used to train the end-to-end system. Text data for conversational 6.2. Switchboard 300h speech is not as readily available as for read speech. The WER of 3.54% on the dev-clean and 3.82% on the test-clean subsets Switchboard consists of about 300 hours of training data. There are the best performance on this task to the best of our knowl- is also the additional Fisher training dataset, so combined it edge for systems trained only using LibriSpeech data. makes the total of about 2000h. In this work, we only use the 300h-Switchboard training data. We use 40-dimensional Gam- 6.4. Beam search prune error analysis matone features [58], and the feature extraction was done with RASR [59]. Results are shown in Table 2. We observe that Beam search is an approximation for the decision rule T N N T our attention model performs better on the easier Switchboard x → wˆ := arg max p(w |x ). 1 1 1 1 subset of the dev set Hub5’00, where it is the best end-to-end model we know. On the harder Callhome part, it also performs The approximation is the pruning we apply due to the beam well compared to other end-to-end models but the relative dif- size. Beam search decoding for hybrid models is very sophis- ference is not as high. ticated and uses a dynamic beam size based on the partial hy- pothesis scores which can become very large (on the order of 6.3. LibriSpeech 1000h thousands) [66]. 
The beam search for attention models works LibriSpeech training dataset consist of about 1000 hours of directly on the labels, i.e. on the BPE units in our case, and read audio books. The dev and test sets were split into sim- usually a static fixed very low beam size (e.g. 10) is used. It ple (”clean”) and harder (”other”) subsets [49]. We do 40-dim. has been shown that increasing the beam size much more does MFCC feature extraction on-the-fly in RETURNN, based on not help in increasing the overal performance. This indicates librosa [62]. We use CTC as an additional loss function ap- that we do not have a search problem but we wanted to ana- plied on top of the decoder to help the convergence, although lyze this in more detail. Specifically, we are interested in how this is not used in decoding [63]. We initially trained only us- much errors we are making due to the pruning for our attention ing the train-clean set and restricting it to sequences not longer models, and we can count that by calculating the search score than 75 characters in the orthography. Results are shown in of the real target sequence, and compare it to the search score Table 3. Our end-to-end system achieves competitive perfor- of the decoded sequence. If the decoded sequence has a higher mance even without using language models. We observed that Table 3: Comparisons on LibriSpeech 1000h. The attention edge, the WERs of 3.54% on the dev-clean and 3.82% on the model has a 6 layer deep bidirectional LSTM encoder and a 1 test-clean subsets are the best results reported on this task, when layer LSTM decoder. CDp are (clustered) context-dependend only the official LibriSpeech training data is used. phones. Byte-pair encoding (BPE) are sub-word units. Lattice- free (LF) maximum mutual information (MMI) [7] is a sequence 8. Acknowledgements criterion to train a hybrid HMM/NN model. 
Auto SeGmentation This work has received funding from the European Research Council (ASG) [64] can be seen as a variant of the CTC criterion and (ERC) under the European Union’s Horizon 2020 research and innova- tion programme (grant agreement No 694537, project ”SEQCLAS”). model. Policy learning is a sequence training method, applied The work reflects only the authors’ views and the ERC Executive here on a CTC model [15]. If not specified, the official 4-gram Agency is not responsible for any use that may be made of the infor- word LM is used. The remaining attention models are all our mation it contains. The GPU cluster used for the experiments was par- tially funded by Deutsche Forschungsgemeinschaft (DFG) Grant INST models. 222/1168-1. WER[%] label model LM dev test unit 9. References clean other clean other [1] A. Zeyer, P. Doetsch, P. Voigtlaender, R. Schlu¨ter, and H. Ney, hybrid, FFNN, 2015 [49] 4-gram CDp 4.90 12.98 5.51 13.97 “A comprehensive study of deep bidirectional LSTM RNNs for LF MMI, LSTM, 2016 [7] 4-gram CDp 4.28 acoustic modeling in speech recognition,” in ICASSP, New Or- CTC, 2015 [65] 4-gram chars 5.33 13.25 leans, LA, USA, Mar. 2017, pp. 2462–2466. ASG (CTC), 2017 [64] 4-gram chars 4.80 14.50 [2] H. Bourlard and N. Morgan, Connectionist speech recognition: a ASG (CTC), 2017 [64] none chars 6.70 20.80 hybrid approach. Springer, 1994, vol. 247. CTC, PL, 2017 [15] 4-gram chars 5.10 14.26 5.42 14.70 [3] A. J. Robinson, “An application of recurrent nets to phone proba- bility estimation,” Neural Networks, IEEE Transactions on, vol. 5, none BPE 4.87 14.37 4.87 15.39 no. 2, pp. 298–305, 1994. attention 4-gram BPE 4.79 14.31 4.82 15.30 [4] S. J. Young, J. J. Odell, and P. C. Woodland, “Tree-based state LSTM BPE 3.54 11.52 3.82 12.76 tying for high accuracy acoustic modelling,” in Proceedings of the workshop on Human Language Technology. Association for score than the real target sequence, we have not made a search Computational Linguistics, 1994, pp. 
307–312. error but it is a model error. We count the number of sequences [5] A. Senior, G. Heigold, M. Bacchiani, and H. Liao, “GMM-free where the decoded sequence has a lower score than the real tar- DNN acoustic model training,” in ICASSP, 2014. get sequence. We report our results in Table 4. We observe that [6] A. Zeyer, E. Beck, R. Schlu¨ter, and H. Ney, “CTC in the context of for our standard beam size 12, the number of search errors are generalized full-sum HMM training,” in Interspeech, Stockholm, well below 1%, and also the WER will not noticeably improve Sweden, Aug. 2017, pp. 944–948. with a larger beam size. Note that we only analyzed the search [7] D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, “Purely sequence-trained neu- errors regarding reaching the real target sequence. We did not ral networks for ASR based on lattice-free MMI,” in Interspeech, count search errors regarding reaching any sequence with lower 2016, pp. 2751–2755. WER. However, our results still suggest that we do not seem to [8] H. Sak, A. Senior, K. Rao, and F. Beaufays, “Fast and accurate have a search problem but a model problem. recurrent neural network acoustic models for speech recognition,” in Interspeech, 2015. Table 4: Beam search error analysis, performed on Lib- [9] S. Kanthak and H. Ney, “Context-dependent acoustic modeling riSpeech, without language model. We provide both the num- using graphemes for large vocabulary speech recognition,” in ber of reference-related search errors, relative to the number of ICASSP, Orlando, FL, USA, May 2002, pp. 845–848. sequences, and also the corresponding WER. [10] A. Graves, S. Ferna´ndez, F. Gomez, and J. Schmidhuber, “Con- nectionist temporal classification: labelling unsegmented se- search errors [%] (WER [%]) beam quence data with recurrent neural networks,” in ICML. ACM, dev test size 2006, pp. 369–376. clean other clean other [11] A. Graves and N. 
Jaitly, “Towards end-to-end speech recognition 4 1.52 (4.87) 1.68 (14.53) 1.07 (4.87) 1.70 (15.49) with recurrent neural networks,” in ICML, T. Jebara and E. P. Xing, Eds. JMLR Workshop and Conference Proceedings, 2014, 8 0.96 (4.88) 0.98 (14.40) 0.76 (4.87) 1.02 (15.39) pp. 1764–1772. 12 0.81 (4.87) 0.59 (14.37) 0.61 (4.86) 0.71 (15.39) [12] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, 16 0.70 (4.87) 0.52 (14.36) 0.50 (4.86) 0.58 (15.37) E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates et al., 32 0.26 (4.87) 0.14 (14.34) 0.19 (4.86) 0.20 (15.34) “DeepSpeech: Scaling up end-to-end speech recognition,” arXiv preprint arXiv:1412.5567, 2014. [13] Y. Miao, M. Gowayyed, and F. Metze, “Eesen: End-to-end speech 7. Conclusions recognition using deep rnn models and wfst-based decoding,” in ASRU. IEEE, 2015, pp. 167–174. We presentented an encoder-decoder-attention model for speech [14] R. Collobert, C. Puhrsch, and G. Synnaeve, “Wav2letter: an end- recognition operating on BPE subword units. We introduced to-end convnet-based speech recognition system,” arXiv preprint a new method for pretraining the encoder, which was crucial arXiv:1609.03193, 2016. for both convergence and the performance in terms of WER. [15] Y. Zhou, C. Xiong, and R. Socher, “Improving end-to- We further improved our recognition accuracy by a joint beam end speech recognition with policy learning,” arXiv preprint arXiv:1712.07101, 2017. search with a LSTM LM trained on the same subword vocab- [16] G. Zweig, C. Yu, J. Droppo, and A. Stolcke, “Advances in all- ulary. We carried out experiments on two standard datasets. neural speech recognition,” in ICASSP. IEEE, 2017, pp. 4805– On the 300h-Switchboard, we achieved competitve results com- pared to the previously reported end-to-end models, while the [17] H. Liu, Z. Zhu, X. Li, and S. Satheesh, “Gram-ctc: Automatic unit WERs are still higher than the conventional hybrid systems. 
On selection and target decomposition for sequence labelling,” arXiv the 1000h-LibriSpeech task, we obtained competitive results preprint arXiv:1703.00096, 2017. across different evaluation subsets. To the best of our knowl- [18] H. Soltau, H. Liao, and H. Sak, “Neural speech recognizer: Acoustic-to-word LSTM model for large vocabulary speech data and units for streaming end-to-end speech recognition with recognition,” in Proc. Interspeech, 2017, pp. 3707–3711. RNN-transducer,” in ASRU. IEEE, 2017, pp. 193–199. [19] K. Audhkhasi, B. Ramabhadran, G. Saon, M. Picheny, and D. Na- [41] E. Battenberg, J. Chen, R. Child, A. Coates, Y. Gaur, Y. Li, H. Liu, hamoo, “Direct acoustics-to-word models for english conversa- S. Satheesh, D. Seetapun, A. Sriram et al., “Exploring neural tional speech recognition,” in Proc. Interspeech, 2017, pp. 959– transducers for end-to-end speech recognition,” arXiv preprint 963. arXiv:1707.07413, 2017. [20] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine trans- [42] R. Prabhavalkar, K. Rao, T. N. Sainath, B. Li, L. Johnson, and lation by jointly learning to align and translate,” arXiv preprint N. Jaitly, “A comparison of sequence-to-sequence models for arXiv:1409.0473, 2014. speech recognition,” in Proc. Interspeech, 2017, pp. 939–943. [21] M.-T. Luong, H. Pham, and C. D. Manning, “Effective approaches [43] H. Sak, M. Shannon, K. Rao, and F. Beaufays, “Recurrent neural to attention-based neural machine translation,” arXiv preprint aligner: An encoder-decoder neural network model for sequence arXiv:1508.04025, 2015. to sequence mapping,” in Proc. of Interspeech, 2017. [22] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, [44] R. Sennrich, B. Haddow, and A. Birch, “Neural machine transla- tion of rare words with subword units,” in ACL, Berlin, Germany, M. Krikun, Y. Cao, Q. Gao, K. Macherey et al., “Google’s neural August 2016, pp. 1715–1725. 
machine translation system: Bridging the gap between human and machine translation,” arXiv preprint arXiv:1609.08144, 2016. [45] P. Doetsch, A. Zeyer, P. Voigtlaender, I. Kulikov, R. Schlu¨ter, [23] J. Ba, V. Mnih, and K. Kavukcuoglu, “Multiple object recognition and H. Ney, “RETURNN: the RWTH extensible training frame- with visual attention,” arXiv preprint arXiv:1412.7755, 2014. work for universal recurrent neural networks,” in ICASSP, New Orleans, LA, USA, Mar. 2017, pp. 5345–5349. [24] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational [46] TensorFlow Development Team, “TensorFlow: Large-scale speech recognition,” in ICASSP, 2016. machine learning on heterogeneous systems,” 2015, soft- ware available from tensorflow.org. [Online]. Available: [25] P. Doetsch, A. Zeyer, and H. Ney, “Bidirectional decoder net- https://www.tensorflow.org/ works for attention-based end-to-end offline handwriting recog- nition,” in International Conference on Frontiers in Handwriting [47] M. Sundermeyer, R. Schlu¨ter, and H. Ney, “LSTM neural net- Recognition, Shenzhen, China, Oct. 2016, pp. 361–366. works for language modeling.” in Interspeech, Portland, OR, USA, Sep. 2012, pp. 194–197. [26] C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, [48] C. Gulcehre, O. Firat, K. Xu, K. Cho, L. Barrault, H.-C. Lin, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, K. Gonina et al., “State- F. Bougares, H. Schwenk, and Y. Bengio, “On using mono- of-the-art speech recognition with sequence-to-sequence models,” lingual corpora in neural machine translation,” arXiv preprint arXiv preprint arXiv:1712.01769, 2017. arXiv:1503.03535, 2015. [27] E. Battenberg, J. Chen, R. Child, A. Coates, Y. Gaur, Y. Li, H. Liu, [49] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “Lib- S. Satheesh, A. Sriram, and Z. 
Zhu, “Exploring neural transduc- riSpeech: an ASR corpus based on public domain audio books,” ers for end-to-end speech recognition,” in ASRU, Okinawa, Japan, in ICASSP. IEEE, 2015, pp. 5206–5210. Dec. 2017, pp. 206–213. [28] S. Toshniwal, H. Tang, L. Lu, and K. Livescu, “Multitask learning [50] G. Pundak and T. N. Sainath, “Lower frame rate neural network with low-level auxiliary tasks for encoder-decoder based speech acoustic models.” in Interspeech, 2016, pp. 22–26. recognition,” in Proc. Interspeech, 2017, pp. 3532–3536. [51] P. Bahar, J. Rosendahl, N. Rossenbach, and H. Ney, “The RWTH [29] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Aachen machine translation systems for IWSLT 2017,” in Int. Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997. Workshop on Spoken Language Translation, Tokyo, Japan, Dec. 2017, pp. 29–34. [30] “RETURNN as a generic flexible neural toolkit with application [52] Z. Tu, Z. Lu, Y. Liu, X. Liu, and H. Li, “Modeling coverage for to translation and speech recognition,” anonymous authors, sub- neural machine translation,” in ACL, 2016. mitted to ACL, 2018. [53] R. Kneser and H. Ney, “Improved backing-off for m-gram lan- [31] J. Chorowski, D. Bahdanau, K. Cho, and Y. Bengio, “End-to-end guage modeling,” in ICASSP, Detroit, MI, USA, May 1995, pp. continuous speech recognition using attention-based recurrent nn: 181–184. first results,” arXiv preprint arXiv:1412.1602, 2014. [54] A. Stolcke, “SRILM-an extensible language modeling toolkit.” in [32] N. Jaitly, Q. V. Le, O. Vinyals, I. Sutskever, D. Sussillo, and Interspeech, Denver, CO, USA, Sep. 2002. S. Bengio, “An online sequence-to-sequence model using partial conditioning,” in Advances in Neural Information Processing Sys- [55] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Re- tems, 2016, pp. 5067–5075. thinking the inception architecture for computer vision,” in Pro- [33] R. Aharoni and Y. 
Goldberg, “Morphological inflection ceedings of the IEEE Conference on Computer Vision and Pattern generation with hard monotonic attention,” arXiv preprint Recognition, 2016, pp. 2818–2826. arXiv:1611.01487, 2016. [56] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural net- [34] C. Raffel, T. Luong, P. J. Liu, R. J. Weiss, and D. Eck, “Online and works from overfitting,” The Journal of Machine Learning Re- linear-time attention by enforcing monotonic alignments,” arXiv search, vol. 15, no. 1, pp. 1929–1958, 2014. preprint arXiv:1704.00784, 2017. [57] D. P. Kingma and J. Ba, “Adam: A method for stochastic opti- [35] C.-C. Chiu and C. Raffel, “Monotonic chunkwise attention,” mization,” arXiv preprint arXiv:1412.6980, 2014. arXiv preprint arXiv:1712.05382, 2017. [58] R. Schlu¨ter, I. Bezrukov, H. Wagner, and H. Ney, “Gamma- [36] A. Tjandra, S. Sakti, and S. Nakamura, “Local monotonic atten- tone features and feature combination for large vocabulary speech tion mechanism for end-to-end speech and language processing,” recognition,” in ICASSP, Honolulu, HI, USA, Apr. 2007, pp. 649– in Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), vol. 1, 2017, pp. 431–440. [59] S. Wiesler, A. Richard, P. Golik, R. Schlu¨ter, and H. Ney, “RAS- [37] R. Prabhavalkar, T. N. Sainath, B. Li, K. Rao, and N. Jaitly, R/NN: The RWTH neural network toolkit for speech recognition,” “An analysis of “attention” in sequence-to-sequence models,”,” in in ICASSP, Florence, Italy, May 2014, pp. 3313–3317. Proc. of Interspeech, 2017. [60] A. L. Maas, Z. Xie, D. Jurafsky, and A. Y. Ng, “Lexicon-free conversational speech recognition with neural networks,” in Proc. [38] J. Hou, S. Zhang, and L. Dai, “Gaussian prediction based attention NAACL, 2015. for online end-to-end speech recognition,” in Proc. Interspeech, 2017, pp. 3692–3696. [61] L. Lu, X. Zhang, and S. 
Renais, “On training the recurrent neural network encoder-decoder for large vocabulary end-to-end speech [39] P. Doetsch, M. Hannemann, R. Schlueter, and H. Ney, “Inverted recognition,” in ICASSP. IEEE, 2016, pp. 5060–5064. alignments for end-to-end automatic speech recognition,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, [62] librosa Development Team, “librosa 0.5.0,” Feb. 2017. [Online]. pp. 1265–1273, Dec. 2017. Available: https://doi.org/10.5281/zenodo.293021 [40] K. Rao, H. Sak, and R. Prabhavalkar, “Exploring architectures, [63] T. Hori, S. Watanabe, Y. Zhang, and W. Chan, “Advances in joint CTC-attention based end-to-end speech recognition with a deep CNN encoder and RNN-LM,” in Interspeech, 2017. [64] V. Liptchinsky, G. Synnaeve, and R. Collobert, “Letter- based speech recognition with gated convnets,” arXiv preprint arXiv:1712.09444, 2017. [65] D. Amodei, R. Anubhai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, J. Chen, M. Chrzanowski, A. Coates, G. Diamos et al., “Deep speech 2: End-to-end speech recognition in english and mandarin,” arXiv preprint arXiv:1512.02595, 2015. [66] D. Nolden, “Progress in decoding for large vocabulary continuous speech recognition,” Ph.D. dissertation, RWTH Aachen Univer- sity, Computer Science Department, RWTH Aachen University, Aachen, Germany, Apr. 2017.
Statistics arXiv (Cornell University)

Improved training of end-to-end attention models for speech recognition

Statistics, Volume 2018 (1805) – May 8, 2018



eISSN
ARCH-3347
DOI
10.21437/Interspeech.2018-1616


Albert Zeyer (1,2,3), Kazuki Irie (1), Ralf Schlüter (1), Hermann Ney (1,2)
(1) Human Language Technology and Pattern Recognition, Computer Science Department, RWTH Aachen University, 52062 Aachen, Germany
(2) AppTek, USA, http://www.apptek.com/
(3) NNAISENSE, Switzerland, https://nnaisense.com/
{zeyer, irie, schlueter, ney}@cs.rwth-aachen.de
arXiv:1805.03294v1 [cs.CL] 8 May 2018

Abstract

Sequence-to-sequence attention-based models on subword units allow simple open-vocabulary end-to-end speech recognition. In this work, we show that such models can achieve competitive results on the Switchboard 300h and LibriSpeech 1000h tasks. In particular, we report the state-of-the-art word error rates (WER) of 3.54% on the dev-clean and 3.82% on the test-clean evaluation subsets of LibriSpeech. We introduce a new pretraining scheme by starting with a high time reduction factor and lowering it during training, which is crucial both for convergence and final performance. In some experiments, we also use an auxiliary CTC loss function to help the convergence. In addition, we train long short-term memory (LSTM) language models on subword units. By shallow fusion, we report up to 27% relative improvements in WER over the attention baseline without a language model.

Index Terms: attention, end-to-end, speech recognition

1. Introduction

Conventional speech recognition systems [1] with neural network (NN) based acoustic models using the hybrid hidden Markov model (HMM) / NN approach [2, 3] usually operate on the phone level, given a phonetic pronunciation lexicon (from phones to words). They require a pretraining scheme with HMM and Gaussian mixture models (GMM) as emission probabilities to bootstrap good alignments of the HMM states. Context-independent phones are used initially because context-dependent phones need a good clustering, which is usually created on good existing alignments (via a Classification And Regression Tree (CART) clustering [4]). This bootstrapping process is iterated a few times. Then a hybrid HMM / NN is trained with frame-wise cross entropy. Recognition with such a model requires a sophisticated beam search decoder. Handling out-of-vocabulary words is also not straightforward and increases the complexity. There has been work to remove the GMM dependency in the pretraining [5], to train without an existing alignment [6–8], or to avoid the lexicon [9], which simplifies the pretraining procedure but still is not end-to-end.

An end-to-end model in speech recognition generally denotes a simple single model which can be trained from scratch, and usually directly operates on words, sub-words or characters/graphemes. This removes the need for a pronunciation lexicon and the whole explicit modeling of phones, and it greatly simplifies the decoding.

Connectionist temporal classification (CTC) [10] has often been used as an end-to-end model for speech recognition, often on characters/graphemes [11–16] or on sub-words [17] but also directly on words [18, 19].

The encoder-decoder framework with attention has become the standard approach for machine translation [20–22] and many other domains such as images [23]. Recent investigations have shown promising results by applying the same approach for speech recognition [24–28]. In this work, we also investigate techniques to improve recurrent encoder-attention-decoder based systems for speech recognition. We use long short-term memory (LSTM) neural networks [29] for the encoder and the decoder. Our model is similar to the architecture used in machine translation [30], except for the encoder time reduction. The generality and simplicity of the model are its strength, although a valid argument against this model for speech recognition is that it is in fact too powerful, because it does not require monotonicity in its implicit alignments. There are attempts to restrict the attention to become monotonic in various ways [31–38]. In this work, our models are without these modifications and extensions.

Recently, alternative models for end-to-end modeling were also suggested, such as inverted HMMs [39], the recurrent transducer [40–42], or the recurrent neural aligner [43]. In many ways, these can all be interpreted in the same encoder-decoder-attention framework, but these approaches often use some variant of hard latent monotonic attention instead of soft attention.

Our models operate on subword units which are created via byte-pair encoding (BPE) [44]. We introduce a pretraining scheme applied on the encoder, which grows the encoder in layer depth and decreases the initially high encoder time reduction factor. To the best of our knowledge, we are the first to apply pretraining for encoder-attention-decoder models. We use RETURNN [30, 45] based on TensorFlow [46] for its computation. We have implemented our own flexible and efficient beam search decoder and efficient LSTM kernels in native CUDA. In addition, we train subword-level LSTM language models [47], which we integrate in the beam search by shallow fusion [48]. The source code is fully open (https://github.com/rwth-i6/returnn), as well as all the setups of the experiments in this paper (https://github.com/rwth-i6/returnn-experiments/tree/master/2018-asr-attention). We report competitive results on the 300h-Switchboard and LibriSpeech [49] tasks. In particular on LibriSpeech, our system achieves WERs of 3.54% on the dev-clean and 3.82% on the test-clean evaluation subsets, which are the best results obtained on this task to the best of our knowledge.

2. Pretraining

Compared to machine translation, the input sequences are much longer in speech recognition, relative to the output sequence (e.g. with BPE 10K subword units, and audio feature frames every 10ms, more than 30 times longer on Switchboard on average). However, as the original input is continuous, some sort of downscaling in the time dimension works, such as concatenation in the feature dimension of consecutive time-frames [7, 24, 42, 50]. We use max-pooling in the time dimension, which is simpler. The time reduction can be done directly on the features or alternatively at multiple steps inside the encoder, e.g. after every encoder layer [24]. This is also what we do. This allows the encoder to better compress any necessary information.

We observed that a high time reduction factor makes the training much simpler. In fact, without careful tuning, the model usually will not converge without a high time reduction factor (16 or 32), as was also observed in the literature [24]. However, we also observed that a low time reduction factor (e.g. 8) can perform better after all, when pretrained with a high time reduction factor.

Also, it has been shown that deep LSTM models can benefit from layer-wise pretraining, by starting with 1 or 2 layers and adding more and more layers [1]. We apply the same pretraining.

To improve the convergence further, we disable label smoothing during pretraining and only enable it after pretraining. Also, we disable dropout in the encoder during the first few pretraining epochs.

3. Model

We use a deep bidirectional LSTM encoder network and an LSTM decoder network. After every layer in the encoder, we optionally do max-pooling in the time dimension to reduce the encoder length. I.e. for the input sequence $x_1^{T'}$, we end up with the encoder state

\[ h_1^T = \mathrm{LSTM}_{\#\mathrm{enc}} \circ \cdots \circ \text{max-pool}_1 \circ \mathrm{LSTM}_1 (x_1^{T'}), \]

where $T' = \mathrm{red} \cdot T$ for the time reduction factor $\mathrm{red}$, and $\#\mathrm{enc}$ is the number of encoder layers, with $\#\mathrm{enc} \ge 2$. We use the MLP attention [20, 21, 31, 32, 51]. Our model closely follows the machine translation model presented by Bahar et al. [51] and Bahdanau et al. [20], and we use a variant of attention weight / fertility feedback [52], which is inverse in our case, to use a multiplication instead of a division, for better numerical stability. More specifically, the attention energies $e_{i,t} \in \mathbb{R}$ for encoder time-step $t$ and decoder step $i$ are defined as

\[ e_{i,t} = v^\top \tanh(W [s_i, h_t, \beta_{i,t}]), \]

where $v$ is a trainable vector, $W$ a trainable matrix, $s_i$ the current decoder state, $h_t$ the encoder state, and $\beta_{i,t}$ is the attention weight feedback, defined as

\[ \beta_{i,t} = \sigma(v_\beta^\top h_t) \cdot \sum_{k=1}^{i-1} \alpha_{k,t}, \]

where $v_\beta$ is a trainable vector. Then the attention weights are defined as

\[ \alpha_i = \mathrm{softmax}_t(e_i) \]

and the attention context vector is given as

\[ c_i = \sum_t \alpha_{i,t} h_t. \]

The decoder state is a recurrent function implemented as

\[ s_i = \mathrm{LSTMCell}(s_{i-1}, y_{i-1}, c_{i-1}) \]

and the final prediction probability for the output symbol $y_i$ is given as

\[ p(y_i \mid y_{i-1}, x_1^{T'}) = \mathrm{softmax}(\mathrm{MLP}_{\mathrm{readout}}(s_i, y_{i-1}, c_i)). \]

In our case we use $\mathrm{MLP}_{\mathrm{readout}} = \mathrm{linear} \circ \mathrm{maxout} \circ \mathrm{linear}$.

4. Sub-word units

Characters/graphemes are probably the most generic and simple output units for generating texts, but it has been shown that sub-word units can perform better [26], and they can be just as generic since the characters can be included in the set of sub-word units. Using words as output units is also possible, but it does not allow recognizing out-of-vocabulary words, and it requires a large softmax output and thus is computationally expensive. An inhomogeneous length distribution as well as an imbalance in the label occurrence can also make training harder.

In all the experiments, we use byte-pair encoding (BPE) [44] to create subword units, which are the output targets of the decoder. The beam search decoding goes over these BPE units and then selects the best hypothesis. Therefore, our system is open-vocabulary. At the end of decoding, the BPE units are merged into words in order to obtain the best hypothesis on word level. In addition, we add the special tokens from the transcriptions which denote noise, vocalized-noise and laughter to our BPE vocabulary set. Our recognizer can thus potentially also recognize these special events.

5. Language model combination

We also improve the recognition accuracy of our recognizer using external language models. We train LSTM language models [47] on the same BPE vocabulary set as the end-to-end model, using RETURNN with TensorFlow. For Switchboard, the training set of 27M words concatenating the Switchboard and Fisher parts of the transcriptions was used. For LibriSpeech, we use the 800M-word dataset officially available for training language models (http://www.openslr.org/11/). It can be noted that in the case of Switchboard, there is some overlap between the training data for language models and the transcription used to train the end-to-end model: 3M out of 27M words are used to train the end-to-end system. For LibriSpeech, in contrast, the 800M-word data is fully external to the end-to-end models. Our experiments show that this difference in the amount of external data directly affects the performance improvements obtained by the use of an external language model. For both tasks, we use an LSTM LM with one input projection layer of size 512 and two LSTM layers with 2048 nodes. We apply dropout at the input of all hidden layers with a rate of 0.2. Standard stochastic gradient descent with global gradient clipping is used for optimization to train all LSTM LMs.

We integrate the external language model in the beam search by shallow fusion [48]. The weight for the language model has been optimized by grid search on the development set WER. We found 0.23 and 0.36 to be optimal for Switchboard and LibriSpeech respectively (the weight on the attention model is 1).

For LibriSpeech, we also train Kneser-Ney smoothed n-gram count based language models [53] on the same BPE vocabulary set using the SRILM toolkit [54]. The comparison of perplexities can be found in Table 1. We also report WERs using the 4-gram count model by shallow fusion with a weight of 0.01, for comparison to the performance of the LSTM LM.

Table 1: Perplexities (PPL) on the concatenation of dev-clean and dev-other sets of LibriSpeech. All models have the same vocabulary of 10K BPE.

| LM | 3-gram | 4-gram | 5-gram | LSTM |
|---|---|---|---|---|
| PPL | 104.6 | 88.2 | 85.1 | 65.9 |

6. Experiments

All attention models and neural network language models were trained and decoded with RETURNN. For both Switchboard and LibriSpeech, we first used the BPE vocabulary of 10K subword units to tune the hyperparameters of the model, then trained the models with 1K and 5K BPE units. We found 1K and 10K to be optimal for Switchboard and LibriSpeech respectively. We use label smoothing [55], dropout [56], Adam [57], learning rate warmup [26], and automatic learning rate scheduling according to a cross-validation set (“Newbob”) [1].

6.1. Pretraining

In all cases we use layer-wise pretraining for the encoder, where we start with two encoder layers and a single max-pool in between with factor 32. Then we add an LSTM layer and a max-pool in between, and we reduce the first max-pool to factor 16 and give the new one factor 2, such that we always keep the same total encoder time reduction factor of 32. Only once we arrive at 6 layers do we remove some of the max-pooling ops to get a final total time reduction factor of e.g. 8. Directly starting with a time reduction factor of 8 and with 2 layers did not work for us. Also, directly starting with 6 layers and a time reduction factor of 32 did not work for us. Similar experiments for translation also converged without pretraining, however with much worse performance compared to when layer-wise pretraining was used [30]. With more careful tuning or more training data, it might have worked without pretraining, as is seen in the literature; however, that is not necessary with pretraining.

We were interested in the optimal final total time reduction factor, after the pretraining with time reduction factor 32. We tried factors 8, 16 and 32, and ended up with 20.4, 21.0 and 21.9 WER% respectively, on the full Hub5'00 set (Switchboard + Callhome). Thus we continue to use a final reduction factor of 8 in all further experiments. Note that a lower factor requires more memory and more computation for the global attention and was not feasible with our hardware and computational resources.

6.2. Switchboard 300h

Switchboard consists of about 300 hours of training data. There is also the additional Fisher training dataset, so combined it makes a total of about 2000h. In this work, we only use the 300h-Switchboard training data. We use 40-dimensional Gammatone features [58], and the feature extraction was done with RASR [59]. Results are shown in Table 2. We observe that our attention model performs better on the easier Switchboard subset of the dev set Hub5'00, where it is the best end-to-end model we know of. On the harder Callhome part, it also performs well compared to other end-to-end models, but the relative difference is not as high.

Table 2: Comparisons on Switchboard 300h. The hybrid HMM/NN model is a 6 layer deep bidirectional LSTM. The attention model has a 6 layer deep bidirectional LSTM encoder and a 1 layer LSTM decoder. CDp are (clustered) context-dependent phones. Byte-pair encoding (BPE) units are sub-word units. SWB and CH are from Hub5'00. (a) added noise from external data. (b) added the lexicon, i.e. also additional data.

| model | LM | label unit | WER[%] SWB | WER[%] CH | WER[%] Hub5'01 |
|---|---|---|---|---|---|
| LF MMI, 2016 [7] | 4-gram | CDp | 9.6 | 19.3 | |
| hybrid | 4-gram | CDp | 9.8 | 19.0 | 14.7 |
| hybrid | LSTM | CDp | 8.3 | 17.3 | 12.9 |
| CTC (a), 2014 [12] | RNN | chars | 20.0 | 31.8 | |
| CTC, 2015 [60] | none | chars | 38.0 | 56.1 | |
| CTC, 2015 [60] | RNN | chars | 21.4 | 40.2 | |
| attention, 2016 [61] | none | chars | 32.8 | 52.7 | |
| attention, 2016 [61] | 5-gram | chars | 30.5 | 50.4 | |
| attention, 2016 [61] | none | words | 26.8 | 48.2 | |
| attention, 2016 [61] | 3-gram | words | 25.8 | 46.0 | |
| CTC, 2017 [16] | none | chars | 24.7 | 37.1 | |
| CTC, 2017 [16] | n-gram | chars | 19.8 | 32.1 | |
| CTC (b), 2017 [16] | word RNN | chars | 14.0 | 25.3 | |
| attention, 2017 [28] | none | chars | 23.1 | 40.8 | |
| attention | none | BPE 10K | 13.5 | 27.1 | 19.9 |
| attention | none | BPE 1K | 13.1 | 26.1 | 19.7 |
| attention | LSTM | BPE 1K | 11.8 | 25.7 | 18.1 |

6.3. LibriSpeech 1000h

The LibriSpeech training dataset consists of about 1000 hours of read audio books. The dev and test sets were split into simple (“clean”) and harder (“other”) subsets [49]. We do 40-dim. MFCC feature extraction on-the-fly in RETURNN, based on librosa [62]. We use CTC as an additional loss function applied on top of the decoder to help the convergence, although this is not used in decoding [63]. We initially trained using only the train-clean set and restricting it to sequences not longer than 75 characters in the orthography. Results are shown in Table 3. Our end-to-end system achieves competitive performance even without using language models. We observed that the shallow fusion with the LSTM LM brings from 17% to 27% relative improvements in terms of WER on the different subsets. This improvement is much larger than in the case of Switchboard. The amount of data is most likely the reason for this observation. For LibriSpeech, external data of 800M words is used to train the language models, which is 80 times larger than the 10M words corresponding to the transcription of the 1000 hours of audio. In addition, this 10M-word transcription is not part of the language model training data. In the case of Switchboard, the LM is trained only on about 27M words, including the 3M words of transcription used to train the end-to-end system. Text data for conversational speech is not as readily available as for read speech. The WERs of 3.54% on the dev-clean and 3.82% on the test-clean subsets are, to the best of our knowledge, the best performance on this task for systems trained using only LibriSpeech data.

Table 3: Comparisons on LibriSpeech 1000h. The attention model has a 6 layer deep bidirectional LSTM encoder and a 1 layer LSTM decoder. CDp are (clustered) context-dependent phones. Byte-pair encoding (BPE) units are sub-word units. Lattice-free (LF) maximum mutual information (MMI) [7] is a sequence criterion to train a hybrid HMM/NN model. Auto SeGmentation (ASG) [64] can be seen as a variant of the CTC criterion and model. Policy learning (PL) is a sequence training method, applied here on a CTC model [15]. If not specified, the official 4-gram word LM is used. The remaining attention models are all our models.

| model | LM | label unit | WER[%] dev clean | WER[%] dev other | WER[%] test clean | WER[%] test other |
|---|---|---|---|---|---|---|
| hybrid, FFNN, 2015 [49] | 4-gram | CDp | 4.90 | 12.98 | 5.51 | 13.97 |
| LF MMI, LSTM, 2016 [7] | 4-gram | CDp | | | 4.28 | |
| CTC, 2015 [65] | 4-gram | chars | | | 5.33 | 13.25 |
| ASG (CTC), 2017 [64] | 4-gram | chars | | | 4.80 | 14.50 |
| ASG (CTC), 2017 [64] | none | chars | | | 6.70 | 20.80 |
| CTC, PL, 2017 [15] | 4-gram | chars | 5.10 | 14.26 | 5.42 | 14.70 |
| attention | none | BPE | 4.87 | 14.37 | 4.87 | 15.39 |
| attention | 4-gram | BPE | 4.79 | 14.31 | 4.82 | 15.30 |
| attention | LSTM | BPE | 3.54 | 11.52 | 3.82 | 12.76 |

6.4. Beam search prune error analysis

Beam search is an approximation for the decision rule

\[ x_1^{T'} \to \hat{w}_1^N := \arg\max \, p(w_1^N \mid x_1^{T'}). \]

The approximation is the pruning we apply due to the beam size. Beam search decoding for hybrid models is very sophisticated and uses a dynamic beam size based on the partial hypothesis scores, which can become very large (on the order of thousands) [66]. The beam search for attention models works directly on the labels, i.e. on the BPE units in our case, and usually a static, fixed, very low beam size (e.g. 10) is used. It has been shown that increasing the beam size much more does not help in increasing the overall performance. This indicates that we do not have a search problem, but we wanted to analyze this in more detail. Specifically, we are interested in how many errors we are making due to the pruning for our attention models, and we can count that by calculating the search score of the real target sequence and comparing it to the search score of the decoded sequence. If the decoded sequence has a higher score than the real target sequence, we have not made a search error; it is a model error. We count the number of sequences where the decoded sequence has a lower score than the real target sequence. We report our results in Table 4. We observe that for our standard beam size 12, the number of search errors is well below 1%, and also that the WER will not noticeably improve with a larger beam size. Note that we only analyzed the search errors regarding reaching the real target sequence. We did not count search errors regarding reaching any sequence with lower WER. However, our results still suggest that we do not seem to have a search problem but a model problem.

Table 4: Beam search error analysis, performed on LibriSpeech, without language model. We provide both the number of reference-related search errors, relative to the number of sequences, and also the corresponding WER. Each cell reads: search errors [%] (WER [%]).

| beam size | dev clean | dev other | test clean | test other |
|---|---|---|---|---|
| 4 | 1.52 (4.87) | 1.68 (14.53) | 1.07 (4.87) | 1.70 (15.49) |
| 8 | 0.96 (4.88) | 0.98 (14.40) | 0.76 (4.87) | 1.02 (15.39) |
| 12 | 0.81 (4.87) | 0.59 (14.37) | 0.61 (4.86) | 0.71 (15.39) |
| 16 | 0.70 (4.87) | 0.52 (14.36) | 0.50 (4.86) | 0.58 (15.37) |
| 32 | 0.26 (4.87) | 0.14 (14.34) | 0.19 (4.86) | 0.20 (15.34) |

7. Conclusions

We presented an encoder-decoder-attention model for speech recognition operating on BPE subword units. We introduced a new method for pretraining the encoder, which was crucial for both convergence and the performance in terms of WER. We further improved our recognition accuracy by a joint beam search with an LSTM LM trained on the same subword vocabulary. We carried out experiments on two standard datasets. On the 300h-Switchboard task, we achieved competitive results compared to the previously reported end-to-end models, while the WERs are still higher than those of the conventional hybrid systems. On the 1000h-LibriSpeech task, we obtained competitive results across different evaluation subsets. To the best of our knowledge, the WERs of 3.54% on the dev-clean and 3.82% on the test-clean subsets are the best results reported on this task when only the official LibriSpeech training data is used.

8. Acknowledgements

This work has received funding from the European Research Council (ERC) under the European Union's Horizon 2020 research and innovation programme (grant agreement No 694537, project “SEQCLAS”). The work reflects only the authors' views and the ERC Executive Agency is not responsible for any use that may be made of the information it contains. The GPU cluster used for the experiments was partially funded by Deutsche Forschungsgemeinschaft (DFG) Grant INST 222/1168-1.

9. References

[1] A. Zeyer, P. Doetsch, P. Voigtlaender, R. Schlüter, and H. Ney, “A comprehensive study of deep bidirectional LSTM RNNs for acoustic modeling in speech recognition,” in ICASSP, New Orleans, LA, USA, Mar. 2017, pp. 2462–2466.
[2] H. Bourlard and N. Morgan, Connectionist speech recognition: a hybrid approach. Springer, 1994, vol. 247.
[3] A. J. Robinson, “An application of recurrent nets to phone probability estimation,” Neural Networks, IEEE Transactions on, vol. 5, no. 2, pp. 298–305, 1994.
[4] S. J. Young, J. J. Odell, and P. C. Woodland, “Tree-based state tying for high accuracy acoustic modelling,” in Proceedings of the workshop on Human Language Technology. Association for Computational Linguistics, 1994, pp. 307–312.
[5] A. Senior, G. Heigold, M. Bacchiani, and H. Liao, “GMM-free DNN acoustic model training,” in ICASSP, 2014.
[6] A. Zeyer, E. Beck, R. Schlüter, and H. Ney, “CTC in the context of generalized full-sum HMM training,” in Interspeech, Stockholm, Sweden, Aug. 2017, pp. 944–948.
[7] D. Povey, V. Peddinti, D. Galvez, P. Ghahremani, V. Manohar, X. Na, Y. Wang, and S. Khudanpur, “Purely sequence-trained neural networks for ASR based on lattice-free MMI,” in Interspeech, 2016, pp. 2751–2755.
[8] H. Sak, A. Senior, K. Rao, and F. Beaufays, “Fast and accurate recurrent neural network acoustic models for speech recognition,” in Interspeech, 2015.
[9] S. Kanthak and H. Ney, “Context-dependent acoustic modeling using graphemes for large vocabulary speech recognition,” in ICASSP, Orlando, FL, USA, May 2002, pp. 845–848.
[10] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in ICML. ACM, 2006, pp. 369–376.
[11] A. Graves and N. Jaitly, “Towards end-to-end speech recognition with recurrent neural networks,” in ICML, T. Jebara and E. P. Xing, Eds. JMLR Workshop and Conference Proceedings, 2014, pp. 1764–1772.
[12] A. Hannun, C. Case, J. Casper, B. Catanzaro, G. Diamos, E. Elsen, R. Prenger, S. Satheesh, S. Sengupta, A. Coates et al., “DeepSpeech: Scaling up end-to-end speech recognition,” arXiv preprint arXiv:1412.5567, 2014.
[13] Y. Miao, M. Gowayyed, and F. Metze, “Eesen: End-to-end speech recognition using deep rnn models and wfst-based decoding,” in ASRU. IEEE, 2015, pp. 167–174.
[14] R. Collobert, C. Puhrsch, and G. Synnaeve, “Wav2letter: an end-to-end convnet-based speech recognition system,” arXiv preprint arXiv:1609.03193, 2016.
[15] Y. Zhou, C. Xiong, and R. Socher, “Improving end-to-end speech recognition with policy learning,” arXiv preprint arXiv:1712.07101, 2017.
[16] G. Zweig, C. Yu, J. Droppo, and A. Stolcke, “Advances in all-neural speech recognition,” in ICASSP. IEEE, 2017, pp. 4805–
[17] H. Liu, Z. Zhu, X. Li, and S. Satheesh, “Gram-ctc: Automatic unit selection and target decomposition for sequence labelling,” arXiv preprint arXiv:1703.00096, 2017.
[18] H. Soltau, H. Liao, and H. Sak, “Neural speech recognizer: Acoustic-to-word LSTM model for large vocabulary speech recognition,” in Proc. Interspeech, 2017, pp. 3707–3711.
[19] K. Audhkhasi, B. Ramabhadran, G. Saon, M. Picheny, and D. Nahamoo, “Direct acoustics-to-word models for english conversational speech recognition,” in Proc. Interspeech, 2017, pp. 959–963.
[20] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by jointly learning to align and translate,” arXiv preprint arXiv:1409.0473, 2014.
[21] M.-T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation,” arXiv preprint arXiv:1508.04025, 2015.
[22] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, M. Krikun, Y. Cao, Q. Gao, K. Macherey et al., “Google’s neural machine translation system: Bridging the gap between human and machine translation,” arXiv preprint arXiv:1609.08144, 2016.
[23] J. Ba, V. Mnih, and K. Kavukcuoglu, “Multiple object recognition with visual attention,” arXiv preprint arXiv:1412.7755, 2014.
[24] W. Chan, N. Jaitly, Q. V. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in ICASSP, 2016.
[25] P. Doetsch, A. Zeyer, and H. Ney, “Bidirectional decoder networks for attention-based end-to-end offline handwriting recognition,” in International Conference on Frontiers in Handwriting Recognition, Shenzhen, China, Oct. 2016, pp. 361–366.
[26] C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, K. Gonina et al., “State-of-the-art speech recognition with sequence-to-sequence models,” arXiv preprint arXiv:1712.01769, 2017.
[27] E. Battenberg, J. Chen, R. Child, A. Coates, Y. Gaur, Y. Li, H. Liu, S. Satheesh, A. Sriram, and Z. Zhu, “Exploring neural transducers for end-to-end speech recognition,” in ASRU, Okinawa, Japan, Dec. 2017, pp. 206–213.
[28] S. Toshniwal, H. Tang, L. Lu, and K. Livescu, “Multitask learning with low-level auxiliary tasks for encoder-decoder based speech recognition,” in Proc. Interspeech, 2017, pp. 3532–3536.
[29] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural computation, vol. 9, no. 8, pp. 1735–1780, 1997.
[30] “RETURNN as a generic flexible neural toolkit with application to translation and speech recognition,” anonymous authors, submitted to ACL, 2018.
[31] J. Chorowski, D. Bahdanau, K. Cho, and Y. Bengio, “End-to-end continuous speech recognition using attention-based recurrent nn: first results,” arXiv preprint arXiv:1412.1602, 2014.
[32] N. Jaitly, Q. V. Le, O. Vinyals, I. Sutskever, D. Sussillo, and S. Bengio, “An online sequence-to-sequence model using partial conditioning,” in Advances in Neural Information Processing Systems, 2016, pp. 5067–5075.
[33] R. Aharoni and Y. Goldberg, “Morphological inflection generation with hard monotonic attention,” arXiv preprint arXiv:1611.01487, 2016.
[34] C. Raffel, T. Luong, P. J. Liu, R. J. Weiss, and D. Eck, “Online and linear-time attention by enforcing monotonic alignments,” arXiv preprint arXiv:1704.00784, 2017.
[35] C.-C. Chiu and C. Raffel, “Monotonic chunkwise attention,” arXiv preprint arXiv:1712.05382, 2017.
[36] A. Tjandra, S. Sakti, and S. Nakamura, “Local monotonic attention mechanism for end-to-end speech and language processing,” in Proceedings of the Eighth International Joint Conference on Natural Language Processing (Volume 1: Long Papers), vol. 1, 2017, pp. 431–440.
[37] R. Prabhavalkar, T. N. Sainath, B. Li, K. Rao, and N. Jaitly, “An analysis of ‘attention’ in sequence-to-sequence models,” in Proc. of Interspeech, 2017.
[38] J. Hou, S. Zhang, and L. Dai, “Gaussian prediction based attention for online end-to-end speech recognition,” in Proc. Interspeech, 2017, pp. 3692–3696.
[39] P. Doetsch, M. Hannemann, R. Schlueter, and H. Ney, “Inverted alignments for end-to-end automatic speech recognition,” IEEE Journal of Selected Topics in Signal Processing, vol. 11, no. 8, pp. 1265–1273, Dec. 2017.
[40] K. Rao, H. Sak, and R. Prabhavalkar, “Exploring architectures, data and units for streaming end-to-end speech recognition with RNN-transducer,” in ASRU. IEEE, 2017, pp. 193–199.
[41] E. Battenberg, J. Chen, R. Child, A. Coates, Y. Gaur, Y. Li, H. Liu, S. Satheesh, D. Seetapun, A. Sriram et al., “Exploring neural transducers for end-to-end speech recognition,” arXiv preprint arXiv:1707.07413, 2017.
[42] R. Prabhavalkar, K. Rao, T. N. Sainath, B. Li, L. Johnson, and N. Jaitly, “A comparison of sequence-to-sequence models for speech recognition,” in Proc. Interspeech, 2017, pp. 939–943.
[43] H. Sak, M. Shannon, K. Rao, and F. Beaufays, “Recurrent neural aligner: An encoder-decoder neural network model for sequence to sequence mapping,” in Proc. of Interspeech, 2017.
[44] R. Sennrich, B. Haddow, and A. Birch, “Neural machine translation of rare words with subword units,” in ACL, Berlin, Germany, August 2016, pp. 1715–1725.
[45] P. Doetsch, A. Zeyer, P. Voigtlaender, I. Kulikov, R. Schlüter, and H. Ney, “RETURNN: the RWTH extensible training framework for universal recurrent neural networks,” in ICASSP, New Orleans, LA, USA, Mar. 2017, pp. 5345–5349.
[46] TensorFlow Development Team, “TensorFlow: Large-scale machine learning on heterogeneous systems,” 2015, software available from tensorflow.org. [Online]. Available: https://www.tensorflow.org/
[47] M. Sundermeyer, R. Schlüter, and H. Ney, “LSTM neural networks for language modeling,” in Interspeech, Portland, OR, USA, Sep. 2012, pp. 194–197.
[48] C. Gulcehre, O. Firat, K. Xu, K. Cho, L. Barrault, H.-C. Lin, F. Bougares, H. Schwenk, and Y. Bengio, “On using monolingual corpora in neural machine translation,” arXiv preprint arXiv:1503.03535, 2015.
[49] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “LibriSpeech: an ASR corpus based on public domain audio books,” in ICASSP. IEEE, 2015, pp. 5206–5210.
[50] G. Pundak and T. N. Sainath, “Lower frame rate neural network acoustic models,” in Interspeech, 2016, pp. 22–26.
[51] P. Bahar, J. Rosendahl, N. Rossenbach, and H. Ney, “The RWTH Aachen machine translation systems for IWSLT 2017,” in Int. Workshop on Spoken Language Translation, Tokyo, Japan, Dec. 2017, pp. 29–34.
[52] Z. Tu, Z. Lu, Y. Liu, X. Liu, and H. Li, “Modeling coverage for neural machine translation,” in ACL, 2016.
[53] R. Kneser and H. Ney, “Improved backing-off for m-gram language modeling,” in ICASSP, Detroit, MI, USA, May 1995, pp. 181–184.
[54] A. Stolcke, “SRILM-an extensible language modeling toolkit,” in Interspeech, Denver, CO, USA, Sep. 2002.
[55] C. Szegedy, V. Vanhoucke, S. Ioffe, J. Shlens, and Z. Wojna, “Rethinking the inception architecture for computer vision,” in Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition, 2016, pp. 2818–2826.
[56] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” The Journal of Machine Learning Research, vol. 15, no. 1, pp. 1929–1958, 2014.
[57] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” arXiv preprint arXiv:1412.6980, 2014.
[58] R. Schlüter, I. Bezrukov, H. Wagner, and H. Ney, “Gammatone features and feature combination for large vocabulary speech recognition,” in ICASSP, Honolulu, HI, USA, Apr. 2007, pp. 649–
[59] S. Wiesler, A. Richard, P. Golik, R. Schlüter, and H. Ney, “RASR/NN: The RWTH neural network toolkit for speech recognition,” in ICASSP, Florence, Italy, May 2014, pp. 3313–3317.
[60] A. L. Maas, Z. Xie, D. Jurafsky, and A. Y. Ng, “Lexicon-free conversational speech recognition with neural networks,” in Proc. NAACL, 2015.
[61] L. Lu, X. Zhang, and S. Renals, “On training the recurrent neural network encoder-decoder for large vocabulary end-to-end speech recognition,” in ICASSP. IEEE, 2016, pp. 5060–5064.
[62] librosa Development Team, “librosa 0.5.0,” Feb. 2017. [Online]. Available: https://doi.org/10.5281/zenodo.293021
[63] T. Hori, S. Watanabe, Y. Zhang, and W. Chan, “Advances in joint CTC-attention based end-to-end speech recognition with a deep CNN encoder and RNN-LM,” in Interspeech, 2017.
[64] V. Liptchinsky, G. Synnaeve, and R. Collobert, “Letter-based speech recognition with gated convnets,” arXiv preprint arXiv:1712.09444, 2017.
[65] D. Amodei, R. Anubhai, E. Battenberg, C. Case, J. Casper, B. Catanzaro, J. Chen, M. Chrzanowski, A. Coates, G. Diamos et al., “Deep speech 2: End-to-end speech recognition in english and mandarin,” arXiv preprint arXiv:1512.02595, 2015.
[66] D. Nolden, “Progress in decoding for large vocabulary continuous speech recognition,” Ph.D. dissertation, RWTH Aachen University, Computer Science Department, RWTH Aachen University, Aachen, Germany, Apr. 2017.

Journal

Statistics, arXiv (Cornell University)

Published: May 8, 2018
