Gated Recurrent Context: Softmax-free Attention for Online Encoder-Decoder Speech Recognition

Hyeonseung Lee, Woo Hyun Kang, Sung Jun Cheon, Hyeongju Kim, and Nam Soo Kim, Senior Member, IEEE

The authors are with the Institute of New Media and Communications, Department of Electrical and Computer Engineering, Seoul National University, Seoul, Republic of Korea (e-mail: hslee@hi.snu.ac.kr; whkang@hi.snu.ac.kr; sjcheon@hi.snu.ac.kr; hjkim@hi.snu.ac.kr; nkim@snu.ac.kr).

© 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, including reprinting/republishing this material for advertising or promotional purposes, creating new collective works, for resale or redistribution to servers or lists, or reuse of any copyrighted component of this work in other works.

Recently, attention-based encoder-decoder (AED) models have shown state-of-the-art performance in automatic speech recognition (ASR). As the original AED models with global attentions are not capable of online inference, various online attention schemes have been developed to reduce ASR latency for better user experience. However, a common limitation of the conventional softmax-based online attention approaches is that they introduce an additional hyperparameter related to the length of the attention window, requiring multiple trials of model training for tuning the hyperparameter. In order to deal with this problem, we propose a novel softmax-free attention method and its modified formulation for online attention, which does not need any additional hyperparameter at the training phase. Through a number of ASR experiments, we demonstrate the tradeoff between the latency and performance of the proposed online attention technique can be controlled by merely adjusting a threshold at the test phase. Furthermore, the proposed methods showed competitive performance to the conventional global and online attentions in terms of word-error-rates (WERs).

Index Terms—Automatic Speech Recognition, Online speech recognition, Attention-based encoder-decoder model

I. INTRODUCTION

In the last few years, the performance of deep learning-based end-to-end automatic speech recognition (ASR) systems has improved significantly through numerous studies mostly on the architecture designs and training schemes of neural networks (NNs). Among many end-to-end ASR systems, attention-based encoder-decoder (AED) models [1], [2] have shown better performance than the others, such as the connectionist temporal classification (CTC) [3] and recurrent neural network transducer (RNN-T) [4], and even outperformed the conventional DNN-hidden Markov model (HMM) hybrid systems in case a large training set of transcribed speech is available [5]. Such successful results of AED models come from the tightly integrated language modeling capability of the label-synchronous decoder (i.e., the decoder network operates once per output text token in an autoregressive manner), supported by the attention mechanism that provides proper acoustic information at each step [6].

A major drawback of the conventional AED models is that they cannot infer the ASR output in an online fashion, which degrades the user experience due to the large latency [7]. This problem is mainly caused by the following aspects of the AED models. Firstly, the encoders of most high-performance AED models make use of layers with global receptive fields, such as bidirectional long short-term memory (BiLSTM) or self-attention layer. More importantly, a conventional global attention mechanism (e.g., Bahdanau attention) considers the entire utterance to obtain the attention context vector at every step.

The former issue can be solved by replacing the global-receptive encoder with an online encoder, where an encoded representation for a particular frame depends on only a limited number of future frames. The online encoder can be built straightforwardly by employing layers with finite future receptive field such as latency-controlled BiLSTM (LC-BiLSTM) [8], temporal convolution layers, and masked self-attention layers. However, reformulating the global attention methods for an online purpose is still a challenging problem.

Conventional techniques for online attention are usually two-step approaches where the window (i.e., chunk) for the current attention is determined first at each decoder step, then the attention weights are calculated using the softmax function defined over the window. Existing online attentions mainly differ in how they determine the window. Neural transducers [9], [10] divide an encoded sequence into multiple chunks with a fixed length, and the attention-decoder produces an output sequence for each input chunk. In the windowed attention techniques [11], [12], [13], the position of each fixed-size window is decided by a position prediction algorithm. The window position is monotonically increasing in time, and some approaches employ a trainable position prediction model with a fixed Gaussian window. In MoChA-based approaches [14], [15], [16], a fixed-size chunk is obtained using a monotonic endpoint prediction model, which is jointly trained considering all possible chunk endpoints.

A common limitation of the aforementioned approaches is that the fixed-length of the window needs to be tuned according to the training data. Merely choosing a large window of a constant size causes a large latency while setting the window size too small results in degraded performance. Therefore multiple trials of the model training are required to find a proper value of the window length, consuming excessive computational resources.
Moreover, the trained model is not guaranteed to perform well on an unseen test set, since the window size is fixed for all datasets.

Digital Object Identifier 10.1109/TASLP.2021.3049344. arXiv:2007.05214v3 [eess.AS] 14 Jan 2021.

Although a few variants of MoChA utilize an adaptive window length to remove the need for tuning the window size, such variants induce other problems. MAtChA [14] regards the previous endpoint as the beginning of the current chunk. Occasionally, the window can be too short to contain enough speech content when two consecutive endpoints are too close, which may degrade the performance. AMoChA [17] employs an auxiliary model that predicts the chunk size but also introduces an additional loss term for the prediction model. As the coefficient for the new loss needs tuning, AMoChA still requires repeated training sessions. Besides, several recent approaches [18], [19] utilize strictly monotonic windows. But these methods have a limitation in that the decoder state is not used for determining the window, which means such algorithms might not fully exploit the advantage of AED models, i.e., the inherent capability of autoregressive language modeling.

The aforementioned inefficiency in training the conventional online attentions is essentially caused by the fact that the softmax function needs a predetermined attention window to obtain the attention weights, which results in a repetitive tuning process of the window-related hyperparameter. Although several recent studies [20], [21] investigate softmax-free formulations of attention, they focus on reducing computations by replacing the softmax with other kernels and do not suggest a solution for online encoder-decoder attention. To overcome this limitation, we propose a novel softmax-free global attention method called gated recurrent context (GRC), inspired by the gate-based update in gated recurrent unit (GRU) [22]. Whereas conventional attentions are based on a kernel smoother (e.g. softmax function) [23], [20], GRC obtains an attention context vector by recursively aggregating the encoded vectors in a time-synchronous manner, using update gates. GRC can be reformulated for the purpose of online attention, which we refer to as decreasing GRC (DecGRC), where the update gates are constrained to be decreasing over time. DecGRC is window-free and capable of deciding the attention-endpoint by thresholding the update gate values at the inference phase. DecGRC as well as GRC introduces no hyperparameter to be tuned at the training phase.

The main contributions of this paper can be summarized as follows:
• We propose a novel softmax-free attention method called Gated Recurrent Context (GRC), which obtains an attention context vector using a time-synchronous recursive updating rule rather than a kernel smoother-based formulation.
• We present a window-free online attention method, Decreasing GRC (DecGRC), a constrained variant of GRC. DecGRC does not need any new hyperparameter to be tuned at the training phase. At test time, the tradeoff between performance and latency can be adjusted using a simple thresholding technique.
• We experimentally show that GRC and DecGRC perform competitively with the conventional global and online attention methods on the LibriSpeech test set.

The remainder of this paper is organized as follows. In Section II, the general framework of attention-based encoder-decoder ASR is formally described, followed by conventional online attention methods and their common limitation. Section III proposes formulations of both GRC and DecGRC and the algorithm for online inference. The experimental results with various attention methods are given in Section IV. Conclusions are presented in Section V.

II. BACKGROUNDS

A. Attention-based Encoder-Decoder for ASR

An attention-based encoder-decoder model consists of two sub-modules Encoder(·) and AttentionDecoder(·), and it predicts the posterior probability of the output transcription given the input speech features as follows:

$$\mathbf{h} = \mathrm{Encoder}(\mathbf{x}),$$
$$P(\mathbf{y} \mid \mathbf{x}) = \mathrm{AttentionDecoder}(\mathbf{h}, \mathbf{y})$$

where $\mathbf{x} = [\mathbf{x}_1, \mathbf{x}_2, \ldots, \mathbf{x}_{T_{in}}]$ and $\mathbf{h} = [\mathbf{h}_1, \mathbf{h}_2, \ldots, \mathbf{h}_T]$ are sequences of input speech features and encoded vectors respectively, and $\mathbf{y} = [y_1, y_2, \ldots, y_U]$ is a sequence of output text units. Either the start or end of the text is considered as one of the text units.

In general, Encoder(·) reduces its output length $T$ to be smaller than the input length $T_{in}$, cutting down the memory and computational footprint. A global Encoder(·) is implemented with NN layers having powerful sequence modeling capacity, e.g., BiLSTM or self-attention layers with subsampling layers. On the other hand, an online Encoder(·) must only consist of layers with finite future receptive field.

AttentionDecoder(·) operates at each output step recursively, emitting an estimated posterior probability over all possible text units given the outputs produced at the previous step. This procedure can be summarized as follows:

$$\mathbf{s}_u = \mathrm{RecurrentState}(\mathbf{s}_{u-1}, y_{u-1}, \mathbf{c}_{u-1}),$$
$$\mathbf{c}_u = \mathrm{AttentionContext}(\mathbf{s}_u, \mathbf{h}), \tag{1}$$
$$P(y_u \mid y_{<u}, \mathbf{x}) = \mathrm{ReadOut}(\mathbf{s}_u, y_{u-1}, \mathbf{c}_u)$$

where $\mathbf{c}_u$ denotes the $u$-th attention context vector and $\mathbf{s}_u$ is the $u$-th decoder state. RecurrentState(·) consists of unidirectional layers, e.g., unidirectional LSTM and masked self-attention layers. ReadOut(·) usually contains a small NN followed by a softmax activation function.

The most popular choice for AttentionContext(·) is the global soft attention (GSA) [2], [24] that includes the softmax function given as follows:

$$\mathbf{c}_u = \sum_{t=1}^{T} \alpha_{u,t} \mathbf{h}_t, \tag{2}$$
$$\alpha_{u,t} = \frac{\exp(e_{u,t})}{\sum_{j=1}^{T} \exp(e_{u,j})}, \tag{3}$$
$$e_{u,t} = \mathrm{Score}(\mathbf{s}_u, \mathbf{h}_t, \alpha_{<u,t}) \tag{4}$$

in which $\alpha_{u,t}$ is an attention weight on the $t$-th encoded vector $\mathbf{h}_t$ at the $u$-th decoder step, and $e_{u,t}$ is a score indicating the relevance of $\mathbf{h}_t$ to the $u$-th decoder state. Common choices for the Score(·) function are additive scores [2], [25] and dot-product scores [24], [26]. Additive scores often utilize additional information $\alpha_{<u,t}$ to decide the current attention weights based on the past attention locations. In this paper, an additive score with attention weight feedback [25] is employed for all the experiments at Sec. IV:

$$e_{u,t} = \mathbf{v}^{\top} \tanh(\mathbf{W}[\mathbf{s}_u; \mathbf{h}_t; \beta_{u,t}] + \boldsymbol{\psi}), \tag{5}$$
$$\beta_{u,t} = \sigma(\mathbf{v}_{\beta}^{\top} \mathbf{h}_t) \cdot \sum_{k=1}^{u-1} \alpha_{k,t}$$

where the notation $[\cdot\,; \cdot]$ means concatenation of vectors, $\mathbf{v}$ and $\mathbf{v}_{\beta}$ are trainable vectors, $\mathbf{W}$ and $\boldsymbol{\psi}$ are a trainable weight and a trainable bias, and $\beta_{u,t}$ is an attention weight feedback.

The whole system is trained to maximize the log posterior probability on a training dataset $D = \{(\mathbf{x}^{(n)}, \mathbf{y}^{(n)})\}_{n=1}^{N}$,

$$\max_{\theta} \; \mathbb{E}_{(\mathbf{x},\mathbf{y}) \sim D} \left[ \sum_{u=1}^{|\mathbf{y}|} \log P(y_u \mid y_{<u}, \mathbf{x}; \theta) \right]$$

where $\theta$ denotes the set of all trainable parameters, and $|\mathbf{y}|$ is the text sequence length of the sampled data. Inference can be performed by searching the most likely text sequence:

$$\hat{\mathbf{y}} = \arg\max_{\mathbf{y}} \log P(\mathbf{y} \mid \mathbf{x}; \theta).$$

B. Online Attention

To achieve online attention, the context vector $\mathbf{c}_u$ in Eq. (2) must have local dependency on the encoded vectors $\mathbf{h}$. Windowed attention and MoChA are widely-used, high-performing online attention methods, for which only the AttentionContext(·) function in Eq. (1) is modified in the general framework. Pictorial descriptions of all the online attention methods in this paper are provided in Fig. 1.

[Figure 1: panels (a) Global soft attention, (b) Windowed attention, (c) MoChA, (d) DecGRC (proposed); axes: encoder step and decoder step.]
Fig. 1. Pictorial descriptions of various attention methods. For online attention methods, the endpoint and the start-point (if it exists) are respectively marked with cyan and orange bold outline, at each decoder step. Windowed attention and MoChA, two widely-used online attention methods, decide either the start-point or endpoint for each decoder step, and then calculate attention weights within a fixed-size window. Unlike these conventional methods, DecGRC does not utilize a fixed-size window and scans from the beginning of the utterance to find the endpoint for each decoder step. The endpoint decision algorithm of DecGRC is independent of the former endpoints. Thus the endpoint may not be monotonically increasing over time-steps, as depicted in (d). The detailed DecGRC inference algorithm is described in Alg. 1.

1) Windowed attention

Among various formulations of windowed attention, a simple heuristic using argmax for window boundary prediction [13] has shown the best performance. This method can be described as follows:

$$p_1 = 0, \quad p_u = \arg\max_{1 \le t \le T} (\alpha_{u-1,t}), \tag{6}$$
$$\mathbf{c}_u = \sum_{t=p_u}^{p_u+w-1} \alpha_{u,t} \mathbf{h}_t,$$
$$\alpha_{u,t} = \frac{\exp(e_{u,t})}{\sum_{j=p_u}^{p_u+w-1} \exp(e_{u,j})} \tag{7}$$

where $p_u$ is the start point of the attention window at the $u$-th step, and $w$ is the window size. The windowed attention is online, as the attention context $\mathbf{c}_u$ derived through Eqs. (6)-(7) does not depend on the entire encoded vector sequence $\mathbf{h}$. The tradeoff between performance and latency of windowed attention relies on the window length $w$.

2) MoChA

In MoChA [14], an attention window endpoint is first decided, followed by attention weights calculation within a fixed-size window as follows:

$$\mathbf{c}_u = \sum_{t=\tau_u-w+1}^{\tau_u} \alpha_{u,t} \mathbf{h}_t, \tag{8}$$
$$\alpha_{u,t} = \frac{\exp(e_{u,t})}{\sum_{j=\tau_u-w+1}^{\tau_u} \exp(e_{u,j})}, \tag{9}$$
$$\tau_u = \mathrm{MonotonicEndpoint}(\tilde{e}_{u,\ge\tau_{u-1}}),$$
$$\tilde{e}_{u,t} = \mathrm{MonotonicScore}(\mathbf{s}_u, \mathbf{h}_t, \alpha_{<u,t}) + b \tag{10}$$

where MonotonicScore(·) is a similarity function, $b$ is a trainable bias parameter, $\tilde{e}_{u,t}$ is the monotonic score, MonotonicEndpoint(·) is a window end-decision algorithm based on thresholding, and $\alpha_{u,t}$ is an attention weight within the window. Note that Eqs. (8)-(10) are substitutes for Eqs. (2)-(3) in GSA. The performance and latency of MoChA are also known to depend on the chunk size $w$.

Optimizing an AED model directly with these formulations is impossible. The MonotonicEndpoint(·) function makes a hard decision for an endpoint $\tau_u$ that is not differentiable, which means $\tau_u$ cannot be trained with the backpropagation framework. To solve this problem, an expectation-based formulation is exploited for training [14]:

$$\alpha_{u,t} = \sum_{k=t}^{t+w-1} \frac{\gamma_{u,k} \exp(e_{u,t})}{\sum_{l=k-w+1}^{k} \exp(e_{u,l})},$$
$$\gamma_{u,t} = p_{u,t} \left( (1 - p_{u,t-1}) \frac{\gamma_{u,t-1}}{p_{u,t-1}} + \gamma_{u-1,t} \right), \tag{11}$$
$$p_{u,t} = \sigma(\tilde{e}_{u,t})$$

where $p_{u,t}$ is a stopping probability at the $t$-th time step and $\gamma_{u,t}$ is an accumulated selection probability that the window endpoint is $t$.

3) A limitation of the conventional methods

As mentioned in Sec. I, the softmax function in the conventional online attentions (e.g., Eqs. (7)-(9)) requires a predetermined attention window, which induces a limitation in training efficiency since multiple trials of training are inevitable for tuning either the window length or the coefficient of an additional loss term. To overcome this limitation, in the next section, we propose a novel softmax-free global attention approach and its online version which is free from the tuning of hyperparameters in training.

III. PROPOSED METHODS

A. Gated Recurrent Context (GRC)

We propose a novel softmax-free global attention method called GRC, which recursively aggregates the information of the encoded sequence into an attention context vector in a time-synchronous manner. Specifically, the following formulas are employed in place of Eqs. (2)-(4):

$$\mathbf{c}_u = \mathbf{d}_{u,T}, \tag{12}$$
$$\mathbf{d}_{u,t} = (1 - z_{u,t}) \mathbf{d}_{u,t-1} + z_{u,t} \mathbf{h}_t, \tag{13}$$
$$z_{u,1} = 1, \quad z_{u,t} = \sigma(e_{u,t}) = \frac{1}{1 + \exp(-e_{u,t})}, \tag{14}$$
$$e_{u,t} = \mathrm{Score}(\mathbf{s}_u, \mathbf{h}_t, \alpha_{<u,t}) + b \tag{15}$$

where $z_{u,t}$ and $\mathbf{d}_{u,t}$ are the update gate and the intermediate attention context vector for the $t$-th time step at the $u$-th decoder step, respectively. GRC computes an intermediate value for the final context vector recursively in time, inspired by GRU [22]. Note that Eqs. (12)-(15) of GRC do not utilize the softmax function at all, unlike the conventional attentions. Nevertheless, GRC can be interpreted as a global attention method, since it calculates a weighted average of the encoded sequence over the whole time period, as explained in Sec. III-A1.

1) Relation to GSA

The update gate sequence $z_{u,1:T}$ of GRC in Eq. (14) and the attention weight sequence $\alpha_{u,1:T}$ of GSA in Eq. (3) have one-to-one correspondence (i.e., intuitively, $z_{u,1:T}$ and $\alpha_{u,1:T}$ are always interchangeable without changing the value of the attention context vector $\mathbf{c}_u$) according to the following theorem:

Theorem 1 (GRC-GSA duality). For arbitrary $n \in \mathbb{N}$, let $Z^n = \{\mathbf{x} \in \mathbb{R}^n \mid x_1 = 1,\ 0 \le x_j \le 1 \text{ for } j = 2, 3, \ldots, n\}$ and $A^n = \{\mathbf{x} \in \mathbb{R}^n \mid \sum_{j=1}^{n} x_j = 1,\ 0 \le x_j \le 1 \text{ for } j = 1, 2, \ldots, n\}$. There exists a bijective function $\phi : Z^T \to A^T$ s.t. for any $\mathbf{h} = [\mathbf{h}_1, \mathbf{h}_2, \ldots, \mathbf{h}_T]$ and $\mathbf{z}_u = [z_{u,1}, z_{u,2}, \ldots, z_{u,T}]$, the following holds:

$$\mathbf{d}_{u,T} = \sum_{t=1}^{T} \phi(\mathbf{z}_u)_t \mathbf{h}_t$$

where $\phi(\mathbf{z}_u)_t$ denotes the $t$-th element of $\phi(\mathbf{z}_u)$, and $\mathbf{d}_{u,T}$ is obtained from $\mathbf{z}_u$ and $\mathbf{h}$ according to Eq. (13).

Proof. Using the recursive Eq. (13),

$$\mathbf{d}_{u,T} = (1 - z_{u,T}) \mathbf{d}_{u,T-1} + z_{u,T} \mathbf{h}_T$$
$$= (1 - z_{u,T})(1 - z_{u,T-1}) \mathbf{d}_{u,T-2} + (1 - z_{u,T}) z_{u,T-1} \mathbf{h}_{T-1} + z_{u,T} \mathbf{h}_T$$
$$= \ldots$$
$$= \sum_{t=1}^{T} \Big( \prod_{j=t+1}^{T} (1 - z_{u,j}) \Big) z_{u,t} \mathbf{h}_t.$$

Therefore the function $\phi$ that satisfies the equation in Thm. 1 is given by

$$\phi(\mathbf{z}_u)_t := z_{u,t} \prod_{j=t+1}^{T} (1 - z_{u,j}) \quad \text{for } t = 1, 2, \ldots, T. \tag{16}$$

Given that $\mathbf{z}_u \in Z^T$, the output $\phi(\mathbf{z}_u)$ is an element of $A^T$ because it is trivial to show that $0 \le \phi(\mathbf{z}_u)_t \le 1$ for $t = 1, 2, \ldots, T$, and also $\sum_{t=1}^{T} \phi(\mathbf{z}_u)_t = 1$ holds as follows (using $z_{u,1} = 1$ in the second line):

$$\sum_{t=1}^{T} \phi(\mathbf{z}_u)_t = \sum_{t=1}^{T} z_{u,t} \prod_{j=t+1}^{T} (1 - z_{u,j})$$
$$= \sum_{t=2}^{T} z_{u,t} \prod_{j=t+1}^{T} (1 - z_{u,j}) + \prod_{j=2}^{T} (1 - z_{u,j})$$
$$= \sum_{t=3}^{T} z_{u,t} \prod_{j=t+1}^{T} (1 - z_{u,j}) + \prod_{j=3}^{T} (1 - z_{u,j})$$
$$= \ldots$$
$$= z_{u,T} + (1 - z_{u,T}) = 1.$$

The function $\phi$ is bijective since the inverse mapping of $\phi$ exists as follows:

$$z_{u,T} = \phi(\mathbf{z}_u)_T,$$
$$z_{u,T-1} = \begin{cases} \dfrac{\phi(\mathbf{z}_u)_{T-1}}{1 - z_{u,T}} = \dfrac{\phi(\mathbf{z}_u)_{T-1}}{1 - \phi(\mathbf{z}_u)_T} & \text{if } \phi(\mathbf{z}_u)_T < 1, \\ 0 & \text{otherwise,} \end{cases}$$
$$z_{u,T-2} = \begin{cases} \dfrac{\phi(\mathbf{z}_u)_{T-2}}{1 - \phi(\mathbf{z}_u)_T - \phi(\mathbf{z}_u)_{T-1}} & \text{if } \sum_{j=T-1}^{T} \phi(\mathbf{z}_u)_j < 1, \\ 0 & \text{otherwise,} \end{cases}$$
$$\vdots$$
$$\Rightarrow\ z_{u,t} = \begin{cases} \dfrac{\phi(\mathbf{z}_u)_t}{1 - \sum_{j=t+1}^{T} \phi(\mathbf{z}_u)_j} & \text{if } \sum_{j=t+1}^{T} \phi(\mathbf{z}_u)_j < 1, \\ 0 & \text{otherwise,} \end{cases}$$

for $t = 1, 2, \ldots, T$. It is also trivial to show that $z_{u,1} = 1$ and $0 \le z_{u,t} \le 1$ for $t = 2, \ldots, T$, given that $\phi(\mathbf{z}_u) \in A^T$. Therefore, $\mathbf{z}_u \in Z^T$.

Note that $\phi(\mathbf{z}_u)_t$ in Eq. (16) corresponds to the attention weight $\alpha_{u,t}$ in Eq. (3) of GSA. By Thm. 1, the attention context vector $\mathbf{c}_u$ of GRC is capable of expressing all possible weighted averages of the encoded representations over time, as in the GSA. Thus the range of $\mathbf{c}_u$ in GRC or GSA is the same. Nonetheless, we empirically showed that GRC performs comparably to or even better than GSA, and the experimental results are given in Sec. IV.

2) Relation to sMoChA

The sMoChA [15] is a variant of MoChA where Eq. (11) is replaced by the following formula:

$$\gamma_{u,t} = p_{u,t} \prod_{j=1}^{t-1} (1 - p_{u,j}) \tag{17}$$

which makes the optimization process more stable. Eq. (17) closely resembles the function $\phi$ in Eq. (16), which provides some evidence for the stability of GRC training. Despite this fact, sMoChA is an algorithm independent of GRC, as Eq. (17) is merely used as the selection probability component in the whole training formulas and not even used for inference.

Algorithm 1 Online inference using DecGRC.
Input: encoded vectors $\mathbf{h}$ of length $T$, threshold $\nu$
State: $\mathbf{s}_0 = \mathbf{0}$, $u = 1$, $y_0$ = StartOfSequence
while $y_{u-1}$ != EndOfSequence do
    $\mathbf{d}_u = \mathbf{h}_1$
    for $t = 2$ to $T$ do
        $e_{u,t} = \mathrm{Score}(\mathbf{s}_u, \mathbf{h}_t, \alpha_{<u,t}) + b$
        $z_{u,t} = 1 / (1 + \sum_{j=1}^{t} \exp(-e_{u,j}))$
        $\mathbf{d}_u = (1 - z_{u,t}) \mathbf{d}_u + z_{u,t} \mathbf{h}_t$
        if $z_{u,t} < \nu$ then
            break
        end if
    end for
    $\mathbf{s}_u = \mathrm{RecurrentState}(\mathbf{s}_{u-1}, y_{u-1}, \mathbf{d}_u)$
    $P(y_u \mid y_{<u}, \mathbf{h}) = \mathrm{ReadOut}(\mathbf{s}_u, y_{u-1}, \mathbf{c}_u)$  // softmax
    $y_u = \mathrm{Decide}\ P(y_u \mid y_{<u}, \mathbf{h})$  // choose a text unit in the vocabulary (e.g. argmax for greedy search)
    $u = u + 1$
end while
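As a numerical illustration of Thm. 1, the following NumPy sketch checks on random data that the recursion of Eq. (13) gives the same context vector as the weighted average with the weights of Eq. (16), that those weights sum to one, and that the inverse mapping recovers the update gates. The function names, the random test data, and the small tolerance guarding the division are assumptions made for this example; they are not taken from the authors' implementation.

```python
import numpy as np

def grc_context(z, h):
    """Run the recursion of Eq. (13); z[0] = 1 makes the first step d = h[0]."""
    d = np.zeros_like(h[0])
    for z_t, h_t in zip(z, h):
        d = (1.0 - z_t) * d + z_t * h_t
    return d

def gates_to_weights(z):
    """Eq. (16): phi_t = z_t * prod_{j > t} (1 - z_j)."""
    # rev[t] = prod_{j >= t} (1 - z_j); shifting by one gives the product over j > t,
    # with the empty product for the last frame set to 1.
    rev = np.cumprod((1.0 - z)[::-1])[::-1]
    tail = np.concatenate([rev[1:], [1.0]])
    return z * tail

def weights_to_gates(phi, eps=1e-12):
    """Inverse mapping: z_t = phi_t / (1 - sum_{j > t} phi_j), or 0 if the denominator vanishes."""
    rev = np.cumsum(phi[::-1])[::-1]          # rev[t] = sum_{j >= t} phi_j
    later = np.concatenate([rev[1:], [0.0]])  # sum over j > t
    denom = 1.0 - later
    safe = np.where(denom > eps, denom, 1.0)  # avoid dividing by ~0
    return np.where(denom > eps, phi / safe, 0.0)

# Hypothetical toy data: T encoded vectors of dimension dim and random gates.
rng = np.random.default_rng(0)
T, dim = 6, 4
h = rng.normal(size=(T, dim))
z = rng.uniform(0.05, 0.95, size=T)
z[0] = 1.0                                    # z_{u,1} = 1 as in Eq. (14)

phi = gates_to_weights(z)
d_recursive = grc_context(z, h)               # Eq. (13)
d_weighted = phi @ h                          # weighted average with Eq. (16)
z_recovered = weights_to_gates(phi)
```

The weighted-average form `phi @ h` is also the time-parallel way of computing the same context, since it needs no loop over frames.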
B. Decreasing GRC (DecGRC)

By Thm. 1, the final context vector $\mathbf{d}_{u,T}$ of GRC in Eq. (12) can be interpreted as a weighted average of encoded vectors $\mathbf{h}_{1:T}$. Thus GRC can be regarded as a kind of global attention method. Furthermore, not only the final context vector $\mathbf{d}_{u,T}$ of GRC but also an intermediate context $\mathbf{d}_{u,t}$ is a weighted average of the encoded vectors $\mathbf{h}_{1:T}$ according to the following corollary:

Corollary 1.1 (Weighted average). Let $Z^n$ and $A^n$ be the sets defined in Thm. 1. For any $\tau \in \{1, 2, \ldots, T\}$, $\mathbf{z}_u \in Z^T$, and $\mathbf{h} = [\mathbf{h}_1, \mathbf{h}_2, \ldots, \mathbf{h}_T]$, there exists a function $\mathbf{a}_{\tau} : Z^T \to A^T$ that satisfies the following equation:

$$\mathbf{d}_{u,\tau} = \sum_{t=1}^{T} \mathbf{a}_{\tau}(\mathbf{z}_u)_t \mathbf{h}_t$$

where $\mathbf{a}_{\tau}(\mathbf{z}_u)_t$ denotes the $t$-th element of $\mathbf{a}_{\tau}(\mathbf{z}_u)$, and $\mathbf{d}_{u,\tau}$ is obtained from $\mathbf{z}_u$ and $\mathbf{h}$ according to Eq. (13).

Proof. By substituting every $T$ in the proof of Thm. 1 with $\tau$, there exists a bijective function $\phi_{\tau} : Z^{\tau} \to A^{\tau}$ given by

$$\phi_{\tau}(\mathbf{z}_u)_t := z_{u,t} \prod_{j=t+1}^{\tau} (1 - z_{u,j}) \quad \text{for } t = 1, 2, \ldots, \tau,$$

such that

$$\mathbf{d}_{u,\tau} = \sum_{t=1}^{\tau} \phi_{\tau}(\mathbf{z}_u)_t \mathbf{h}_t.$$

It is trivial to show that the following function $\mathbf{a}_{\tau} : Z^T \to A^T$ satisfies the equation in Coroll. 1.1:

$$\mathbf{a}_{\tau}(\mathbf{z}_u)_t = \begin{cases} \phi_{\tau}(\mathbf{z}_u)_t & \text{if } t \le \tau, \\ 0 & \text{otherwise,} \end{cases} \quad \text{for } t = 1, 2, \ldots, T.$$

Global attention methods including GSA and GRC cannot compute the attention weights without the entire sequence of the encoded vectors $\mathbf{h}$. However, considering that the attention techniques are methods for calculating the weighted average of the encoded vectors, Coroll. 1.1 enables us to treat an intermediate context $\mathbf{d}_{u,t}$ as a substitute for the attention context vector $\mathbf{c}_u$ in Eq. (12) of GRC even when the whole encoded sequence is not provided.

Inspired by this, we further propose a novel online attention algorithm, namely DecGRC. DecGRC is a modified version of GRC, replacing Eq. (14) with

$$z_{u,1} = 1, \quad z_{u,t} = \frac{1}{1 + \sum_{j=1}^{t} \exp(-e_{u,j})}.$$

Note that the update gate is constrained to be monotonically decreasing over time. At the training phase, DecGRC is trained in the same way as GRC, using an entire utterance to obtain a final context $\mathbf{d}_{u,T}$ according to Eqs. (12)-(13). At the inference phase, for each decoder step $u$, DecGRC decides an endpoint $t_{end}$ so that only encoded vectors before the endpoint can contribute to the online context vector

$$\mathbf{c}_u = \mathbf{d}_{u,t_{end}},$$

which is used in place of the GRC context vector in Eq. (12).

Assume that there exists an endpoint index $t_{end}$ with which $z_{u,t_{end}}$ has a very small value (e.g. less than 0.001). Considering that $z_{u,t} < z_{u,t_{end}}$ holds for all $t > t_{end}$, the difference between $\mathbf{d}_{u,t_{end}}$ and $\mathbf{d}_{u,T}$ is small, as the numerical change $|\mathbf{d}_{u,t} - \mathbf{d}_{u,t-1}|$ for $t > t_{end}$ induced by the recursion rule in Eq. (13) is negligible if $z_{u,t_{end}}$ is small enough. Intuitively, intermediate context vectors roughly converge after the endpoint.

DecGRC can operate as an online attention method if such an endpoint index $t_{end}$ exists at each decoder step and the index can be decided by the model. We experimentally observed that DecGRC models adequately learn the alignment between encoded vectors and text output units, and the intermediate context nearly converges after the aligned time index at each decoder step. Nevertheless, the performance of DecGRC can be degraded due to the mismatch between training and inference, especially when the endpoints are decided to be too early. Relevant experimental results are given in Sec. IV-D.

Accordingly, with an online encoder, online inference can be implemented via a well-trained DecGRC model. We describe the online inference technique in Alg. 1, where the endpoint index is decided simply by thresholding the update gate values.
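The inner loop of Alg. 1 for a single decoder step can be sketched in NumPy as below. This is a minimal illustration rather than the authors' code: `score_fn` is a hypothetical stand-in for Score(·) + b, the toy scores are invented, and the gate is assumed to take the form $z_{u,t} = 1/(1 + \sum_{j \le t} \exp(-e_{u,j}))$, including the score of the first frame in the running sum.

```python
import numpy as np

def decgrc_online_context(h, score_fn, threshold):
    """Scan encoded vectors left to right for one decoder step.

    Returns the online context d_{u, t_end} and the number of frames consumed.
    `score_fn(t, h_t)` is a hypothetical stand-in for Score(s_u, h_t, ...) + b.
    """
    d = h[0].copy()                              # z_{u,1} = 1, so d_{u,1} = h_1
    denom = 1.0 + np.exp(-score_fn(0, h[0]))     # running 1 + sum_j exp(-e_{u,j})
    for t in range(1, len(h)):
        denom += np.exp(-score_fn(t, h[t]))
        z = 1.0 / denom                          # gate can only decrease over time
        d = (1.0 - z) * d + z * h[t]             # recursion of Eq. (13)
        if z < threshold:                        # endpoint reached: stop reading frames
            return d, t + 1
    return d, len(h)

# Hypothetical toy setup: scores are high for the first 10 frames and low afterwards.
rng = np.random.default_rng(0)
T, dim = 50, 4
h = rng.normal(size=(T, dim))
toy_score = lambda t, h_t: 4.0 if t < 10 else -3.0

context, frames_used = decgrc_online_context(h, toy_score, threshold=0.01)
_, frames_used_loose = decgrc_online_context(h, toy_score, threshold=0.05)
```

Raising the threshold makes the scan stop earlier, which is the test-time handle on the latency and accuracy tradeoff described above; no retraining is involved in changing it.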
C. Computational efficiency of proposed methods

GRC or DecGRC adds a negligible amount of memory footprint, since only one trainable parameter $b$ in Eq. (15) is added to the standard GSA-based AED model. The computational amount of an attention method is dominated by the score function calculation, as it requires matrix multiplications. For example in GSA, a fixed-dimensional matrix-vector product is needed to obtain $e_{u,t}$ in Eq. (5) for each $u$ and $t$, which results in $O(TU)$ floating point operations for processing an utterance. Although the softmax operation in Eq. (3) and the weighted average operation in Eq. (2) also require $O(TU)$ operations in total, these are negligible compared to the score function calculation since they do not involve matrix-vector multiplications. As a result, the total computational complexity of GSA is $O(TU)$.

Similarly, both GRC and DecGRC require the score function calculation in Eq. (15), having computational complexity of $O(TU)$. However, in practice, a speech sequence is linearly aligned with the text sequence on average. As Alg. 1 only considers encoded vectors before endpoint indices, the total number of steps in the for loop is typically slightly larger than $TU/2$, if the threshold $\nu$ is set to an appropriate value. Therefore, DecGRC is computationally more efficient than the global attentions such as GRC and GSA at the inference phase.

The recursive updating in Eq. (13) induces a negligible amount of computation compared to the whole training or inference process. There is still room for faster computation by enabling parallel computation in time. The parallel computation can be implemented by utilizing Eq. (2) where $\alpha_{u,t}$ is replaced with $\phi(\mathbf{z}_u)_t$ in Eq. (16), instead of Eqs. (12)-(13). Note that GRC and DecGRC are not the best choices among attention methods in terms of computational complexity. Among the global attention methods, the linearized attention [21] features a very low computational complexity of $O(T + U)$ when used as encoder-decoder attention, which is much smaller than $O(TU)$ of GRC. The computational complexity of an online attention method MoChA [14] is $O(wU)$ where $w$ is the window-size, which is typically far less than $O(TU)$ of DecGRC. Notwithstanding, the encoder-decoder attention's computational amount is minor compared to the other layers in the encoder and the decoder.

The most important fact is that both proposed methods introduce no hyperparameter at the training phase. Thus the proposed methods do not need to repeat training to find a proper value of such a hyperparameter. Though the DecGRC inference in Alg. 1 introduces a new hyperparameter (i.e., threshold $\nu$) at test phase, the threshold searching on development sets does not take a long time, because the development sets are small compared to the training set. Hence the total time spent to prepare an ASR system can be saved. Furthermore, the tradeoff between latency and performance can be adjusted by resetting the threshold value $\nu$ at inference phase, unlike the conventional online attention methods [9], [13], [14]. In these existing methods, the inference algorithms' decision rules on the attention endpoints are determined at the training phase, and remain unchanged at the test stage. The experiments on DecGRC with different thresholds are demonstrated in Sec. IV-E.

IV. EXPERIMENTS

A. Configurations

All experiments were conducted on LibriSpeech dataset (1), which contains 16 kHz read English speech with transcription. The dataset consists of 960 hours of a training set from 2,338 speakers, 10.8 hours of a dev set from 80 speakers, and 10.4 hours of a test set from 66 speakers, with no overlapping speakers between different sets. Both dev and test sets are split in half into clean and other subsets, depending upon the ASR difficulty of each speaker. We randomly chose 1,500 utterances from dev set as a validation set.

All experiments shared the same network architecture and training scheme of a recipe of RETURNN toolkit [27], [28], except the attention methods (2). Input features were 40-dimensional mel-frequency cepstral coefficients (MFCCs) extracted with Hanning window of 25 ms length and 10 ms hop size, followed by global mean-variance normalization. Output text units were 10,025 byte-pair encoding (BPE) units extracted from transcription of LibriSpeech training set. The Encoder(·) consisted of 6 BiLSTM layers of 1,024 units for each direction, and max-pooling layers of stride 3 and 2 were applied after the first two BiLSTM layers respectively. For the online Encoder(·), 6 LC-BiLSTM layers were employed in place of the BiLSTM layers, where the future context sizes were set to 36, 12, 6, 6, 6, and 6 for each layer from bottom to top and the chunk sizes were the same as the future context sizes. Both Score(·) and MonotonicScore(·) functions were implemented using the formulation in Eq. (5) and a 1,024-dimensional attention key. RecurrentState(·) was implemented with a unidirectional LSTM layer with 1,000 units. ReadOut(·) consisted of a max-out layer with 2,500 units, followed by a softmax output layer with 10,025 units. Every model contains a total of 188 M parameters both for BiLSTM and LC-BiLSTM encoder architecture, except that every MoChA-based model has 191 M parameters.

(1) The LibriSpeech dataset can be downloaded from http://www.openslr.org/
(2) The scripts for all experiments are available at https://github.com/GRC-anonymous/GRC.

Weight parameters were initialized with Glorot uniform method [29], and biases were initially set to zero. Optimization techniques were utilized during the training: teacher forcing, Adam optimizer, learning rate scheduling, curriculum learning, and the layer-wise pre-training scheme. Briefly, the models were trained for 13.5 epochs using a learning rate of $8 \times 10^{-4}$ with a linear warm-up starting from $3 \times 10^{-4}$ and the Newbob decay rule [30]. Only the first two layers of the Encoder(·) with half-width (i.e., 512 units for each direction) were used at the beginning of training. Then once every 0.25 epoch from 0.75 epoch until 1.5 epoch, a new layer was inserted on the top of the encoder and 1/8 original width (i.e., 128 units for each direction) of new units were added to each layer. Finally, the width and the number of layers increased to the original size at 1.5 epoch. The CTC multi-task learning [31] with a lambda of 0.5 was employed to stabilize the learning, where CTC loss is measured with another 10,025-unit softmax layer on the top of Encoder(·). For the models which began the learning from parameters of a pre-trained model, the layer-wise pre-training was skipped. Every model was regularized by applying dropout rate 0.3 to Encoder(·) layers and the softmax layer and employing label smoothing of 0.1. For each epoch of the training, both cross-entropy (CE) losses and output error rates were measured 20 times on the validation set with teacher forcing. During the inference phase, the model with the lowest WER on the dev-other set among all checkpoints was selected as the final model, and performed beam search once on the dev and test sets with a beam size of 12.

We trained MoChA models for 17.5 epochs with five times longer layer-wise pre-training to make them converge. A small learning rate of 1e-5 was used for training windowed attention models as in [13]. Though the numbers of total epochs for different experiments were not the same, each model was optimized to converge and showed negligible improvements after that.

B. Performance comparison between attentions

All experimental results are summarized in Tbl. I. For each experiment, we performed two trials of training with the same configuration and chose a model with the lowest word-error-rate (WER), a word-level Levenshtein distance divided by the number of ground-truth words, on dev-other set.

TABLE I
WORD ERROR RATES (WERS) COMPARISON BETWEEN ATTENTION METHODS ON LIBRISPEECH DATASET.

Exp. ID | Attention method | Param. init. from | Is attention online? | Is encoder online? | Can infer online? | WER [%] dev-clean | dev-other | test-clean | test-other
E1 | GSA | - | No | No (BiLSTM) | No | 4.77 | 14.11 | 4.92 | 15.15
E2 | GRC | - | No | No (BiLSTM) | No | 4.84 | 14.06 | 4.88 | 14.59
E3 | Windowed att. (w=11) | E1 | Yes | No (BiLSTM) | No | 12.50 | 23.79 | 15.27 | 25.81
E4 | Windowed att. (w=20) | E1 | Yes | No (BiLSTM) | No | 5.78 | 14.82 | 5.71 | 15.90
E5 | MoChA (w=2) | - | Yes | No (BiLSTM) | No | 6.49 | 17.11 | 6.17 | 18.18
E6 | MoChA (w=8) | - | Yes | No (BiLSTM) | No | 4.74 | 14.20 | 4.95 | 15.32
E7 | DecGRC (ν=0.01) | - | Yes | No (BiLSTM) | No | 4.91 | 14.85 | 5.10 | 15.85
E8 | DecGRC (ν=0.01) | E2 | Yes | No (BiLSTM) | No | 4.97 | 14.02 | 4.83 | 14.90
E9 | GSA | - | No | Yes (LC-BiLSTM) | No | 5.54 | 15.49 | 5.51 | 16.91
E10 | GSA | E1 | No | Yes (LC-BiLSTM) | No | 5.28 | 15.44 | 5.17 | 16.40
E11 | GRC | - | No | Yes (LC-BiLSTM) | No | 6.09 | 16.05 | 6.18 | 16.47
E12 | GRC | E2 | No | Yes (LC-BiLSTM) | No | 5.48 | 15.14 | 5.55 | 15.88
E13 | Windowed att. (w=11) | E10 | Yes | Yes (LC-BiLSTM) | Yes | 12.82 | 24.10 | 15.14 | 26.94
E14 | Windowed att. (w=20) | E10 | Yes | Yes (LC-BiLSTM) | Yes | 5.62 | 15.86 | 5.56 | 16.96
E15 | MoChA (w=2) | E5 | Yes | Yes (LC-BiLSTM) | Yes | 6.48 | 18.35 | 6.55 | 19.33
E16 | MoChA (w=8) | E6 | Yes | Yes (LC-BiLSTM) | Yes | 5.11 | 15.10 | 5.15 | 16.45
E17 | DecGRC (ν=0.01) | E8 | Yes | Yes (LC-BiLSTM) | Yes | 5.77 | 16.24 | 5.87 | 17.04
E18 | DecGRC (ν=0.08) | E12 | Yes | Yes (LC-BiLSTM) | Yes | 5.79 | 15.67 | 6.04 | 16.34

In E1 to E2 and E9 to E12, GRC showed better performance than the other attention methods on test-other set, showing 3.7% and 3.2% relative error-reduction rate (RERR) compared to GSA when evaluated on BiLSTM and LC-BiLSTM encoder, respectively.

In E3 to E6 and E13 to E16, performances of the conventional online attentions, i.e., windowed attention and MoChA, were shown to be highly dependent on the choice of window size hyperparameter $w$. On the other hand, DecGRC is trained without any additional hyperparameter and only involves a threshold $\nu$ at the inference phase.

In E3 to E8 and E13 to E18, DecGRC outperformed the conventional online attention techniques on BiLSTM encoder. With LC-BiLSTM encoder, the performance of DecGRC on test-other set surpassed the conventional attentions including GSA, while the scores on test-clean set were worse than the competitors. The overall performance of GRC and DecGRC is degraded on LC-BiLSTM compared to their favorable performance on BiLSTM, which was conjectured to be caused by the following aspect of the proposed methods: $\phi(\mathbf{z}_u)_t$ in Eq. (16) has a dependency on update-gate values of the future time-steps. Therefore using a short future receptive field of LC-BiLSTM may have contributed to the degradation.

C. Optimization speed

The cross-entropy loss curves on training and dev set in E1, E2, E6, and E7 are depicted in Fig. 2. The model based on each attention method was trained from scratch until convergence, with a few spikes in its training loss curve. These spikes in the loss curve are caused by the layer-wise pre-training algorithm described in Sec. IV-A. Every time a new layer and units are inserted to the encoder, the training loss temporarily shows a rapid increase, because the newly inserted network parameters are not trained yet.

[Figure 2: train and dev curves for GSA, GRC, MoChA, and DecGRC; horizontal axis: epoch.]
Fig. 2. Cross-entropy loss curves of various attention methods. All the models were trained from scratch (w/ BiLSTM encoder).

[Figure 3: horizontal axis: encoder time index; bottom panel: update gate $z_{u,t}$.]
Fig. 3. An input spectrogram, attention plots with the output BPE sequence of GSA (E1), GRC (E2), and DecGRC (E8), and the update gates of the DecGRC, from top to bottom.

Overall, GRC and DecGRC showed faster from-scratch training speed than MoChA, but slower than GSA. DecGRC converged slightly later than GRC. MoChA showed the slowest optimization speed, which was partly due to the 5 times longer layer-wise pre-training scheduling than the others. Such long pre-training was employed to stabilize the training of MoChA, whereas both GRC and DecGRC successfully converged with the standard pre-training. Note that the longer pre-training of MoChA was adopted because it had failed to converge with a short pre-training in our initial experiments. The relatively
All results were obtained with BiLSTM encoder stable learning of the proposed methods over MoChA can be on an utterance 8254-84205-0009 in dev-other set. The update gates were obtained with teacher forcing, and the attention plots were results of the beam explained in relation to sMoChA, as described in Sec. III-A2; search w/ beam size 12. “ ” was inserted after a BPE unit end if it was not the sMoChA stabilized the training of MoChA by utilizing a a word-end. modified selection probability formula, which is actually almost similar to the attention weight (z ) of GRC in Eq. (16). larger than or equal to 6, while it showed similar performance for shorter median lengths. D. Attention analysis Attention weights of DecGRC tended to be much smoother GRC and DecGRC accurately learned alignments between (i.e., focused on longer time) than GRC and GSA. Such encoded representations and output text units, as illustrated smoothness was hypothesized to be caused by the decreasing in Fig. 3. An interesting characteristic of GRC was observed update gates, which made the model trained to be cautious for that it tended to put much weight on the latter time indices of a sharp descent of update gate values, as it is irreversible attention weights, compared to GSA. This can be regarded as in DecGRC. In addition, DecGRC did not attend on the an innate behavior of GRC, as the attention weight (z ) in first time index, unlike GSA and GRC. It is an intrinsic Thm. 1 is designed to weigh the latter indices when the update property of DecGRC, as the earliest update gates have values gates z have similar value over several consecutive time- close to 1 and therefore difficult to carry information to later u;t indices. The latter-time-weighing attribute could be especially time. As the initial frames of an utterance usually contain effective for a long text unit (e.g., a BPE unit “swinging” in helpful information such as background noise, this might cause Fig. 
3), as a long BPE unit often ends with a suffix that might DecGRC to be degraded compared to the global attentions. be crucial to distinguish words (e.g., “-ing”, “-n’t”, or “-est” in The last two plots in Fig. 3 show that the update gate values English). A piece of statistical evidence is presented in Fig. 4; of DecGRC mostly changed near the attention region. As GRC outperformed GSA when the median length of BPE was the update gates rapidly decreased after the attention region, CE loss Decoder step index DecGRC GRC GSA Input DecGRC (ν =0.01) 30-33 27-30 24-27 21-24 18-21 15-18 12-15 9-12 6-9 3-6 0-3 0:3 WindowedAtt GSA GRC MoChA 0:2 DecGRC 0:1 2 3 4 5 6 Median length of BPE units in an utterance Fig. 4. WER for each utterance-wise median length of the BPE units (w/ BiLSTM encoder). WERs for GSA (E1) and GRC (E2) were measured on the test set (i.e., both test-clean and test-other). Utterance length range (s) Fig. 5. Word error rates (WERs) of windowed attention (E14), MoChA (E16), tight attention endpoints could easily be found by setting the and DecGRC (E18) online models on LibriSpeech test-other dataset for various ranges of utterance lengths, evaluated with LC-BiLSTM encoder. DecGRC threshold value approximately in a range of [0.001, 0.2]. For model is evaluated with a threshold value of 0:08. instance, with an inference threshold  = 0:01 in Fig. 3, the total number of steps in the for loop in Alg. 1 was 459, which was approximately 54% of TU = 13 65 = 845. It implies DecGRC that insignificant time indices were properly ignored during 0.4 the inference. In Fig. 5, WERs of online attention models are evaluated for various ranges of utterance lengths with LC-BiLSTM encoder. DecGRC models showed better performance than conventional 0.25 online attention methods for utterances shorter than 21 seconds, while its performance severely degenerated for utterances 0.001 0.01 0.05 longer than 21 seconds. 
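The step count reported above for the for loop in Alg. 1 reflects early stopping once the update gate falls below the threshold $\nu$. The following Python sketch illustrates only that idea and is not the authors' implementation. It assumes a GRU-style recursion in which the running context is updated as (1 - z_t) * context + z_t * h_t, that the update gates are precomputed and non-increasing over time, and that scanning stops at the first gate below the threshold. The function name and the toy inputs are hypothetical.

```python
def thresholded_recurrent_context(encoded, update_gates, threshold):
    """Aggregate encoded vectors left to right, stopping at the endpoint.

    encoded:      list of T encoded vectors (each a list of D floats).
    update_gates: list of T gate values in [0, 1], assumed non-increasing.
    threshold:    inference threshold (nu); the scan stops at the first
                  gate value below it.

    Returns the context vector and the number of time indices processed.
    """
    context = [0.0] * len(encoded[0])
    steps = 0
    for h_t, z_t in zip(encoded, update_gates):
        if z_t < threshold:
            break  # endpoint detected: later time indices are ignored
        # Assumed GRU-style update: a gate near 1 overwrites the context.
        context = [(1.0 - z_t) * c + z_t * h for c, h in zip(context, h_t)]
        steps += 1
    return context, steps


# Toy example: 5 encoded vectors of dimension 2 and decreasing gates.
encoded = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0], [5.0, 5.0], [9.0, 9.0]]
update_gates = [1.0, 0.5, 0.5, 0.005, 0.001]

context, steps = thresholded_recurrent_context(encoded, update_gates, 0.01)
print(context, steps)
```

In this toy run the scan stops after 3 of the 5 time indices; with a threshold of 0 all 5 indices would be processed. Under these assumptions, the sketch also shows why every decoder step has to start from the first time index.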
We conjecture that the performance degeneration of DecGRC for long utterances is fundamentally due to its formulation. According to the recursion rule in Eq. (13), for each decoder step, DecGRC always starts from the first time index of the encoded vectors and processes the sequence until the endpoint is detected, whereas most conventional online attention methods compute the attention weights within a fixed-size window. This indicates that DecGRC is more likely than existing online attentions to produce a wrong attention context vector for long utterances, as observed in Fig. 5. The overall performance of DecGRC was still better than the others since utterances longer than 21 seconds account for only about 0.5% of the LibriSpeech test-other set. Notwithstanding, this degradation of DecGRC on long input sequences needs to be fixed for better performance, which we leave to future research.

E. Ablation study on DecGRC inference threshold

We evaluated WERs and latencies of the proposed online DecGRC model (E18) for different threshold values, and the results are plotted in Fig. 6. For the latency measure, we employed the average lagging (AL) metric [32], which is frequently used to measure the latency of an online sequence-to-sequence model when a ground-truth label of input-output time alignment is not given. The AL of an online ASR model on an utterance is obtained as follows [32]:

$$\mathrm{AL}_g(x, y) = \frac{1}{\tau(|x|)} \sum_{u=1}^{\tau(|x|)} \left\{ g(u) - \frac{|x|}{|y|}(u - 1) \right\}, \qquad (18)$$

$$\tau(|x|) = \min \{ u \mid g(u) = |x| \},$$

where $x$ and $y$ are the acoustic input sequence and the output text sequence respectively, and $g(u)$ is a monotonic non-decreasing function of $u$ that denotes the number of acoustic input frames processed by the encoder when deciding the $u$-th target text token. For intuitive notation, we report in Fig. 6 the AL value calculated according to Eq. (18) multiplied by the time unit of the acoustic input (i.e., 10 ms).

Fig. 6. Ablation study about the inference threshold of the proposed online model DecGRC (E18) on the LibriSpeech dev-clean dataset. The latency measure (average lagging) and WERs were measured with varying inference threshold $\nu$, which is denoted for each node with blue text. [Plot omitted: word error rates (%) versus average lagging (s), with nodes labeled by threshold values from 0.001 to 0.4.]

In Fig. 6, the tradeoff between latency and WER was observed to be adjustable when the threshold $\nu$ is in the range of [0.1, 1.0]. Setting the threshold to a value larger than 0.25 was found to be detrimental to the performance, with larger thresholds giving higher WERs. This means that some encoded vectors in the correct attention region were ignored due to the high threshold, as shown in the last two plots of Fig. 3. Notably, the best performance was obtained with $\nu$ between 0.05 and 0.1, not $\nu = 0$. This may be attributed to the fact that the thresholding not only reduced the latency, but also eliminated undesirable updates after the correct attention region. With thresholds higher than the best-performing threshold, the latency could be further reduced at the cost of a performance penalty, and vice versa.

After training ends, a DecGRC model needs an extra search to find a threshold that provides the best tradeoff between latency and performance. Nevertheless, the threshold search time is insignificant compared to the training time. Since the beam search inference on the dev set took less than 15 minutes using a single GPU, the time spent on tuning the threshold was no more than 2.5 hours, which is much shorter than the model training time; a single epoch of training took 9 hours on average, and the total time for training a model from scratch was more than 5 days.

V. CONCLUSION

We proposed a novel softmax-free global attention method called GRC, and its variant for online attention, namely DecGRC. Unlike the conventional online attentions, DecGRC introduces no additional hyperparameter to be tuned at the training phase. Thus DecGRC does not require multiple trials of training, saving time for model preparation. Moreover, at the inference of DecGRC, the tradeoff between ASR latency and performance can be controlled by adapting the scalar threshold which is related to the attention endpoint decision, whereas the conventional online attentions are not capable of changing the endpoint decision rule at the test phase. Both GRC and DecGRC showed comparable ASR performance to the conventional global attentions.

For further research, the proposed attention methods will be investigated in various applications which leverage AED models. We are particularly interested in applying DecGRC to simultaneous machine translation [33] and real-time scene text recognition [34], where the latency can be reduced by exploiting an online attention method.

REFERENCES

[1] W. Chan, N. Jaitly, Q. Le, and O. Vinyals, “Listen, attend and spell: A neural network for large vocabulary conversational speech recognition,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 4960–4964.
[2] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio, “End-to-end attention-based large vocabulary speech recognition,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2016, pp. 4945–4949.
[3] A. Graves, S. Fernández, F. Gomez, and J. Schmidhuber, “Connectionist temporal classification: labelling unsegmented sequence data with recurrent neural networks,” in Proceedings of the International Conference on Machine learning (ICML), 2006, pp. 369–376.
[4] A. Graves, “Sequence transduction with recurrent neural networks,” in Representation Learning Workshop in International Conference on Machine Learning (ICML), 2012.
[5] C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, A. Kannan, R. J. Weiss, K. Rao, E. Gonina et al., “State-of-the-art speech recognition with sequence-to-sequence models,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 4774–4778.
[6] A. Garg, D. Gowda, A. Kumar, K. Kim, M. Kumar, and C. Kim, “Improved multi-stage training of online attention-based encoder-decoder models,” arXiv preprint arXiv:1912.12384, 2019.
[7] T. N. Sainath, R. Pang, D. Rybach, Y. He, R. Prabhavalkar, W. Li, M. Visontai, Q. Liang, T. Strohman, Y. Wu et al., “Two-pass end-to-end speech recognition,” in Proceedings of Interspeech, 2019, pp. 2773–2778.
[8] Y. Zhang, G. Chen, D. Yu, K. Yaco, S. Khudanpur, and J. Glass, “Highway long short-term memory rnns for distant speech recognition,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2016, pp. 5755–5759.
[9] N. Jaitly, D. Sussillo, Q. V. Le, O. Vinyals, I. Sutskever, and S. Bengio, “A neural transducer,” arXiv preprint arXiv:1511.04868, 2015.
[10] T. N. Sainath, C.-C. Chiu, R. Prabhavalkar, A. Kannan, Y. Wu, P. Nguyen, and Z. Chen, “Improving the performance of online neural transducer models,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5864–5868.
[11] J. Hou, S. Zhang, and L.-R. Dai, “Gaussian prediction based attention for online end-to-end speech recognition,” in Proceedings of Interspeech, 2017, pp. 3692–3696.
[12] A. Tjandra, S. Sakti, and S. Nakamura, “Local monotonic attention mechanism for end-to-end speech and language processing,” in Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP), vol. 1, 2017, pp. 431–440.
[13] A. Merboldt, A. Zeyer, R. Schlüter, and H. Ney, “An analysis of local monotonic attention variants,” in Proceedings of Interspeech, 2019, pp. 1398–1402.
[14] C.-C. Chiu and C. Raffel, “Monotonic chunkwise attention,” in Proceedings of International Conference on Learning Representations (ICLR).
[15] H. Miao, G. Cheng, P. Zhang, T. Li, and Y. Yan, “Online hybrid ctc/attention architecture for end-to-end speech recognition,” in Proceedings of Interspeech 2019, 2019, pp. 2623–2627.
[16] E. Tsunoo, Y. Kashiwagi, T. Kumakura, and S. Watanabe, “Towards online end-to-end transformer automatic speech recognition,” arXiv preprint arXiv:1910.11871, 2019.
[17] R. Fan, P. Zhou, W. Chen, J. Jia, and G. Liu, “An online attention-based model for speech recognition,” in Proceedings of Interspeech, 2019, pp. 4390–4394.
[18] N. Moritz, T. Hori, and J. Le Roux, “Triggered attention for end-to-end speech recognition,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5666–5670.
[19] L. Dong and B. Xu, “Cif: Continuous integrate-and-fire for end-to-end speech recognition,” arXiv preprint arXiv:1905.11235, 2019.
[20] Y.-H. H. Tsai, S. Bai, M. Yamada, L.-P. Morency, and R. Salakhutdinov, “Transformer dissection: An unified understanding for transformer’s attention via the lens of kernel,” in Proceedings of Conference on Empirical Methods in Natural Language Processing and International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 4343–4352.
[21] A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret, “Transformers are rnns: Fast autoregressive transformers with linear attention,” in Proceedings of the International Conference on Machine learning (ICML).
[22] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, “Learning phrase representations using rnn encoder-decoder for statistical machine translation,” in Proceedings of Conference on Empirical Methods in Natural Language Processing (EMNLP), 2014, pp. 1724–1734.
[23] L. Wasserman, All of nonparametric statistics. New York: Springer Science & Business Media, 2006.
[24] M.-T. Luong, H. Pham, and C. D. Manning, “Effective approaches to attention-based neural machine translation,” in Proceedings of Conference on Empirical Methods in Natural Language Processing (EMNLP), 2015, pp. 1412–1421.
[25] A. Zeyer, K. Irie, R. Schlüter, and H. Ney, “Improved training of end-to-end attention models for speech recognition,” in Proceedings of Interspeech, 2018, pp. 7–11.
[26] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proceedings of Advances in Neural Information Processing Systems (NeurIPS), 2017, pp. 5998–6008.
[27] A. Zeyer, T. Alkhouli, and H. Ney, “Returnn as a generic flexible neural toolkit with application to translation and speech recognition,” in Annual Meeting of the Association for Computational Linguistics (ACL), 2018.
[28] A. Zeyer, A. Merboldt, R. Schlüter, and H. Ney, “A comprehensive analysis on attention models,” in Interpretability and Robustness in Audio, Speech, and Language (IRASL) Workshop in Conference on Neural Information Processing Systems (NeurIPS), Montreal, Canada, 2018.
[29] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep feedforward neural networks,” in Proceedings of International Conference on Artificial Intelligence and Statistics (AISTATS), 2010, pp. 249–256.
[30] A. Zeyer, P. Doetsch, P. Voigtlaender, R. Schlüter, and H. Ney, “A comprehensive study of deep bidirectional lstm rnns for acoustic modeling in speech recognition,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 2462–2466.
[31] S. Kim, T. Hori, and S. Watanabe, “Joint ctc-attention based end-to-end speech recognition using multi-task learning,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 4835–4839.
[32] M. Ma, L. Huang, H. Xiong, R. Zheng, K. Liu, B. Zheng, C. Zhang, Z. He, H. Liu, X. Li et al., “Stacl: Simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework.”
[33] N. Arivazhagan, C. Cherry, W. Macherey, C.-C. Chiu, S. Yavuz, R. Pang, W. Li, and C. Raffel, “Monotonic infinite lookback attention for simultaneous machine translation,” in Annual Meeting of the Association for Computational Linguistics (ACL), 2019.
[34] Z. Liu, Y. Li, F. Ren, W. L. Goh, and H. Yu, “Squeezedtext: A real-time scene text recognition by binary convolutional encoder-decoder network,” in Proceedings of the Association for the Advancement of Artificial Intelligence (AAAI) Conference on Artificial Intelligence, 2018, pp. 7194–7201.

Computing Research Repository arXiv (Cornell University)
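As a supplement to the average lagging (AL) metric of Eq. (18) in Sec. IV-E, the short Python sketch below evaluates it for a single utterance. It is an illustration under stated assumptions rather than the evaluation script used in the experiments: the function name, the argument names, and the toy values of g(u) are hypothetical, and the 10 ms frame shift is the time unit quoted in the text.

```python
def average_lagging(frames_seen, num_frames, frame_shift_s=0.01):
    """Average lagging of one utterance, in seconds.

    frames_seen:   list where frames_seen[u - 1] = g(u), the number of input
                   frames processed when the u-th output token is decided
                   (assumed monotonic non-decreasing).
    num_frames:    |x|, the total number of input frames.
    frame_shift_s: duration of one input frame in seconds (assumed 10 ms).
    """
    num_tokens = len(frames_seen)  # |y|

    # tau(|x|): first decoder step whose decision used the whole input.
    tau = None
    for u, g_u in enumerate(frames_seen, start=1):
        if g_u == num_frames:
            tau = u
            break
    if tau is None:
        raise ValueError("g(u) never reaches the full input length")

    # An ideal zero-lag system would advance |x| / |y| frames per token.
    frames_per_token = num_frames / num_tokens
    lag_frames = sum(
        frames_seen[u - 1] - frames_per_token * (u - 1)
        for u in range(1, tau + 1)
    ) / tau
    return lag_frames * frame_shift_s


# Toy example: 60 input frames, 4 output tokens.
g = [20, 40, 60, 60]
al_seconds = average_lagging(g, num_frames=60)
print(al_seconds)
```

In the toy example, tau is 3 because the third token is the first one decided after all 60 frames were seen, so only the first three steps enter the average. Steps after tau are excluded by the definition, which keeps tokens emitted after the input has ended from lowering the measured latency.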

Gated Recurrent Context: Softmax-free Attention for Online Encoder-Decoder Speech Recognition


References (34)

ISSN
2329-9290
eISSN
ARCH-3344
DOI
10.1109/TASLP.2021.3049344

Abstract

Gated Recurrent Context: Softmax-free Attention for Online Encoder-Decoder Speech Recognition Hyeonseung Lee, Woo Hyun Kang, Sung Jun Cheon, Hyeongju Kim, and Nam Soo Kim, Senior Member, IEEE Recently, attention-based encoder-decoder (AED) models have by the attention mechanism that provides proper acoustic shown state-of-the-art performance in automatic speech recogni- information at each step [6]. tion (ASR). As the original AED models with global attentions A major drawback of the conventional AED models is that are not capable of online inference, various online attention they cannot infer the ASR output in an online fashion, which schemes have been developed to reduce ASR latency for better degrades the user experience due to the large latency [7]. This user experience. However, a common limitation of the conven- tional softmax-based online attention approaches is that they problem is mainly caused by the following aspects of the introduce an additional hyperparameter related to the length AED models. Firstly, the encoders of most high-performance of the attention window, requiring multiple trials of model AED models make use of layers with global receptive fields, training for tuning the hyperparameter. In order to deal with such as bidirectional long short-term memory (BiLSTM) or this problem, we propose a novel softmax-free attention method self-attention layer. More importantly, a conventional global and its modified formulation for online attention, which does not need any additional hyperparameter at the training phase. attention mechanism (e.g., Bahdanau attention) considers the Through a number of ASR experiments, we demonstrate the entire utterance to obtain the attention context vector at tradeoff between the latency and performance of the proposed every step. 
The former issue can be solved by replacing the online attention technique can be controlled by merely adjusting global-receptive encoder with an online encoder, where an a threshold at the test phase. Furthermore, the proposed methods encoded representation for a particular frame depends on only showed competitive performance to the conventional global and online attentions in terms of word-error-rates (WERs). a limited number of future frames. The online encoder can be built straightforwardly by employing layers with finite future receptive field such as latency-controlled BiLSTM (LC- Index Terms—Automatic Speech Recognition, Online speech recognition, Attention-based encoder-decoder model BiLSTM) [8], temporal convolution layers, and masked self- attention layers. However, reformulating the global attention methods for an online purpose is still a challenging problem. I. I NTRODUCTION Conventional techniques for online attentionare usually two- step approaches where the window (i.e., chunk) for the current N the last few years, the performance of deep learning- attention is determined first at each decoder step, then the based end-to-end automatic speech recognition (ASR) attention weights are calculated using the softmax function systems has improved significantly through numerous studies defined over the window. Existing online attentions mainly mostly on the architecture designs and training schemes differ in how they determine the window. Neural transducers [9], of neural networks (NNs). Among many end-to-end ASR [10] divide an encoded sequence into multiple chunks with systems, attention-based encoder-decoder (AED) models [1], a fixed length, and the attention-decoder produces an output [2] have shown better performance than the others, such as the sequence for each input chunk. 
In the windowed attention connectionist temporal classification (CTC) [3] and recurrent techniques [11], [12], [13], the position of each fixed-size neural network transducer (RNN-T) [4], and even outperformed window is decided by a position prediction algorithm. The the conventional DNN-hidden Markov model (HMM) hybrid window position is monotonically increasing in time, and some systems in case a large training set of transcribed speech is approaches employ a trainable position prediction model with available [5]. Such successful results of AED models come from a fixed Gaussian window. In MoChA-based approaches [14], the tightly integrated language modeling capability of the label- [15], [16], a fixed-size chunk is obtained using a monotonic synchronous decoder (i.e., the decoder network operates once endpoint prediction model, which is jointly trained considering per output text token in an autoregressive manner), supported all possible chunk endpoints. A common limitation of the aforementioned approaches is that the fixed-length of the window needs to be tuned according The authors are with the Institute of New Media and Communications, Department of Electrical and Computer Engineering, Seoul National University, to the training data. Merely choosing a large window of a Seoul, Republic of Korea (e-mail: hslee@hi.snu.ac.kr; whkang@hi.snu.ac.kr; constant size causes a large latency while setting the window sjcheon@hi.snu.ac.kr; hjkim@hi.snu.ac.kr; nkim@snu.ac.kr). size too small results in degraded performance. Therefore © 2021 IEEE. Personal use of this material is permitted. Permission from IEEE must be obtained for all other uses, in any current or future media, multiple trials of the model training are required to find a proper including reprinting/republishing this material for advertising or promotional value of the window length, consuming excessive computational purposes, creating new collective works, for resale or redistribution to servers resources. 
Moreover, the trained model does not guarantee to or lists, or reuse of any copyrighted component of this work in other works. Digital Object Identifier 10.1109/TASLP.2021.3049344 perform well on an unseen test set, since the window size is arXiv:2007.05214v3 [eess.AS] 14 Jan 2021 2 fixed for all datasets. decoder ASR is formally described, followed by conventional Although a few variants of MoChA utilize an adaptive online attention methods and their common limitation. Sec- window length to remove the need for tuning the window tion III proposes formulations of both GRC and DecGRC size, such variants induce other problems. MAtChA [14] and the algorithm for online inference. The experimental regards the previous endpoint as the beginning of the current results with various attention methods are given in Section IV. chunk. Occasionally, the window can be too short to contain Conclusions are presented in Section V. enough speech content when two consecutive endpoints are too close, which may degrade the performance. AMoChA [17] II. BACKGROUNDS employs an auxiliary model that predicts the chunk size but A. Attention-based Encoder-Decoder for ASR also introduces an additional loss term for the prediction model. As the coefficient for the new loss needs tuning, AMoChA An attention-based encoder-decoder model consists of two still requires repeated training sessions. Besides, several recent sub-modules Encoder() and AttentionDecoder(), and it approaches [18], [19] utilize strictly monotonic windows. But predicts the posterior probability of the output transcription these methods have a limitation in that the decoder state is not given the input speech features as follows: used for determining the window, which means such algorithms h = Encoder(x); might not fully exploit the advantage of AED models, i.e., the inherent capability of autoregressive language modeling. 
P (yjx) = AttentionDecoder(h; y) The aforementioned inefficiency in training the conventional where x = [x ; x ; :::; x ] and h = [h ; h ; :::; h ] are online attentions is essentially caused by the fact that the 1 2 T 1 2 T in sequences of input speech features and encoded vectors softmax function needs a predetermined attention window respectively, and y = [y ; y ; :::; y ] is a sequence of output to obtain the attention weights, which results in repetitive 1 2 U text units. Either the start or end of the text is considered as tuning process of the window-related hyperparameter. Although one of the text units. several recent studies [20], [21] investigate softmax-free formu- In general, Encoder() reduces its output length T to be lation of attention, they focuses on reducing computations by smaller than the input length T , cutting down the memory and replacing the softmax with other kernels and do not suggest a in computational footprint. A global Encoder() is implemented solution for online encoder-decoder attention. To overcome this with NN layers having powerful sequence modeling capacity, limitation, we propose a novel softmax-free global attention e.g., BiLSTM or self-attention layers with subsampling layers. method called gated recurrent context (GRC), inspired by the On the other hand, an online Encoder() must only consist of gate-based update in gated recurrent unit (GRU) [22]. Whereas layers with finite future receptive field. conventional attentions are based on a kernel smoother (e.g. AttentionDecoder() operates at each output step recur- softmax function) [23], [20], GRC obtains an attention context sively, emitting an estimated posterior probability over all vector by recursively aggregating the encoded vectors in a possible text units given the outputs produced at the previous time-synchronous manner, using update gates. GRC can be step. 
This procedure can be summarized as follows: reformulated for the purpose of online attention, which we refer to decreasing GRC (DecGRC), where the update gates are s = RecurrentState(s ; y ; c ); u u1 u1 u1 constrained to be decreasing over time. DecGRC is window-free and capable of deciding the attention-endpoint by thresholding c = AttentionContext(s ; h); (1) u u the update gate values at the inference phase. DecGRC as P (y jy ; x) = ReadOut(s ; y ; c ) u <u u u1 u well as GRC introduces no hyperparameter to be tuned at the training phase. where c denotes the u-th attention context vector and s u u The main contributions of this paper can be summarized as is the u-th decoder state. RecurrentState() consists of follows: unidirectional layers, e.g., unidirectional LSTM and masked We propose a novel softmax-free attention method called self-attention layers. ReadOut() usually contains a small NN Gated Recurrent Context (GRC), which obtains an followed by a softmax activation function. attention context vector using a time-synchronous recur- The most popular choice for AttentionContext() is the sive updating rule rather than a kernel smoother-based global soft attention (GSA) [2], [24] that includes the softmax formulation. function given as follows: We present a window-free online attention method, De- creasing GRC (DecGRC), a constrained variant of GRC. c = h ; (2) u u;t t DecGRC does not need any new hyperparameter to be t=1 tuned at the training phase. At test time, the tradeoff exp(e ) u;t between performance and latency can be adjusted using = ; (3) u;t P exp(e ) a simple thresholding technique. u;j j=1 We experimentally show that GRC and DecGRC perform e = Score(s ; h ; ) (4) u;t u t <u;t competitive to the conventional global and online attention methods on the LibriSpeech test set. in which is an attention weight on the t-th encoded vector u;t The remainder of this paper is organized as follows. 
$h_t$ at the $u$-th decoder step, and $e_{u,t}$ is a score indicating the relevance of $h_t$ to the $u$-th decoder state. Common choices for the Score(·) function are additive scores [2], [25] and dot-product scores [24], [26]. Additive scores often utilize additional information $\alpha_{<u,t}$ to decide the current attention weights based on the past attention locations. In this paper, an additive score with attention weight feedback [25] is employed for all the experiments at Sec. IV:

$$e_{u,t} = v^{\top} \tanh(W [s_u; h_t; \beta_{u,t}] + \omega), \tag{5}$$
$$\beta_{u,t} = \sigma(v_{\beta}^{\top} h_t) \sum_{k=1}^{u-1} \alpha_{k,t}$$

where the notation $[\cdot\,;\cdot]$ means concatenation of vectors, $v$ and $v_{\beta}$ are trainable vectors, $W$ and $\omega$ are a trainable weight and a trainable bias, and $\beta_{u,t}$ is an attention weight feedback.

The whole system is trained to maximize the log posterior probability on a training dataset $D = \{(x^{(n)}, y^{(n)})\}_{n=1}^{N}$,

$$\max_{\theta} \; \mathbb{E}_{(x,y) \sim D} \left[ \sum_{u=1}^{|y|} \log P(y_u \mid y_{<u}, x; \theta) \right]$$

where $\theta$ denotes the set of all trainable parameters, and $|y|$ is the text sequence length of the sampled data. Inference can be performed by searching the most likely text sequence:

$$\hat{y} = \arg\max_{y} \log P(y \mid x; \theta).$$

In Section II, the general framework of attention-based encoder-

Fig. 1. Pictorial descriptions of various attention methods: (a) Global soft attention (global window), (b) Windowed attention (fixed-size window), (c) MoChA (fixed-size window), (d) DecGRC (proposed). Axes: encoder step and decoder step. For online attention methods, the endpoint and the start-point (if it exists) are respectively marked with cyan and orange bold outline, at each decoder step. Windowed attention and MoChA, two widely-used online attention methods, decide either the start-point or endpoint for each decoder step, and then calculate attention weights within a fixed-size window. Unlike these conventional methods, DecGRC does not utilize a fixed-size window and scans from the beginning of the utterance to find the endpoint for each decoder step. The endpoint decision algorithm of DecGRC is independent of the former endpoints. Thus the endpoint may not be monotonically increasing over time-steps, as depicted in (d). The detailed DecGRC inference algorithm is described in Alg. 1.

B. Online Attention

To achieve online attention, the context vector $c_u$ in Eq. (2) must have local dependency on the encoded vectors $h$. Windowed attention and MoChA are widely-used online attention methods that show high performance, for which only the AttentionContext(·) function in Eq. (1) is modified in the general framework. Pictorial descriptions of all the online attention methods in this paper are provided in Fig. 1.

1) Windowed attention

Among various formulations of windowed attention, a simple heuristic using argmax for window boundary prediction [13] has shown the best performance. This method can be described as follows:

$$p_1 = 0, \quad p_u = \arg\max_{1 \le t \le T} \alpha_{u-1,t}, \tag{6}$$
$$c_u = \sum_{t=p_u}^{p_u+w-1} \alpha_{u,t} h_t, \quad \alpha_{u,t} = \frac{\exp(e_{u,t})}{\sum_{j=p_u}^{p_u+w-1} \exp(e_{u,j})} \tag{7}$$

where $p_u$ is the start point of the attention window at the $u$-th step, and $w$ is the window size. The windowed attention is online, as the attention context $c_u$ derived through Eqs. (6)-(7) does not depend on the entire encoded vector sequence $h$. The tradeoff between performance and latency of windowed attention relies on the window length $w$.

2) MoChA

In MoChA [14], an attention window endpoint is first decided, followed by attention weights calculation within a fixed-size window as follows:

$$c_u = \sum_{t=\tau_u-w+1}^{\tau_u} \alpha_{u,t} h_t, \tag{8}$$
$$\alpha_{u,t} = \frac{\exp(e_{u,t})}{\sum_{j=\tau_u-w+1}^{\tau_u} \exp(e_{u,j})}, \tag{9}$$
$$\tau_u = \mathrm{MonotonicEndpoint}(\tilde{e}_{u,\ge\tau_{u-1}}),$$
$$\tilde{e}_{u,t} = \mathrm{MonotonicScore}(s_u, h_t, \alpha_{<u,t}) + b \tag{10}$$

where MonotonicScore(·) is a similarity function, $b$ is a trainable bias parameter, $\tilde{e}_{u,t}$ is the monotonic score, MonotonicEndpoint(·) is a window end-decision algorithm based on thresholding, and $\alpha_{u,t}$ is an attention weight within the window. Note that Eqs. (8)-(10) are substitutes for Eqs. (2)-(3) in GSA. The performance and latency of MoChA are also known to depend on the chunk size $w$.

Directly optimizing an AED model using these formulations is impossible. The MonotonicEndpoint(·) function makes a hard-decision for an endpoint $\tau_u$ that is not differentiable, which means $\tau_u$ cannot be trained with the backpropagation framework. To solve this problem, an expectation-based formulation is exploited for training [14]:

$$\alpha_{u,t} = \sum_{k=t}^{t+w-1} \frac{\gamma_{u,k} \exp(e_{u,t})}{\sum_{l=k-w+1}^{k} \exp(e_{u,l})},$$
$$\gamma_{u,t} = p_{u,t} \left( (1 - p_{u,t-1}) \frac{\gamma_{u,t-1}}{p_{u,t-1}} + \gamma_{u-1,t} \right), \tag{11}$$
$$p_{u,t} = \sigma(\tilde{e}_{u,t})$$

where $p_{u,t}$ is a stopping probability at the $t$-th time step and $\gamma_{u,t}$ is an accumulated selection probability that the window endpoint is $t$.

3) A limitation of the conventional methods

As mentioned in Sec. I, the softmax function in the conventional online attentions (e.g., Eqs. (7)-(9)) requires a predetermined attention window, which induces a limitation in training efficiency since multiple trials of training are inevitable for tuning either the window length or the coefficient of an additional loss term. To overcome this limitation, in the next section, we propose a novel softmax-free global attention approach and its online version which is free from the tuning of hyperparameters in training.

III. PROPOSED METHODS

A. Gated Recurrent Context (GRC)

We propose a novel softmax-free global attention method called GRC, which recursively aggregates the information of the encoded sequence into an attention context vector in a time-synchronous manner. Specifically, the following formulas are employed in place of Eqs. (2)-(4):

$$c_u = d_{u,T}, \tag{12}$$
$$d_{u,t} = (1 - z_{u,t}) d_{u,t-1} + z_{u,t} h_t, \tag{13}$$
$$z_{u,1} = 1, \quad z_{u,t} = \sigma(e_{u,t}) = \frac{1}{1 + \exp(-e_{u,t})}, \tag{14}$$
$$e_{u,t} = \mathrm{Score}(s_u, h_t, \alpha_{<u,t}) + b \tag{15}$$

where $z_{u,t}$ and $d_{u,t}$ are the update gate and the intermediate attention context vector for the $t$-th time step at the $u$-th decoder step, respectively. GRC computes an intermediate value for the final context vector recursively in time, inspired by GRU [22]. Note that Eqs. (12)-(15) of GRC do not utilize the softmax function at all, unlike the conventional attentions. Nevertheless, GRC can be interpreted as a global attention method, since it calculates a weighted average of the encoded sequence over the whole time period, as explained in Sec. III-A1.

1) Relation to GSA

The update gates sequence $z_{u,1:T}$ of GRC in Eq. (14) and the attention weights sequence $\alpha_{u,1:T}$ of GSA in Eq. (3) have one-to-one correspondence (i.e., intuitively, $z_{u,1:T}$ and $\alpha_{u,1:T}$ are always interchangeable without changing the value of attention context vector $c_u$) according to the following theorem:

Theorem 1 (GRC-GSA duality). For arbitrary $n \in \mathbb{N}$, let $Z^n = \{x \in \mathbb{R}^n \mid x_1 = 1,\ 0 \le x_j \le 1 \text{ for } j = 2, 3, \ldots, n\}$ and $A^n = \{x \in \mathbb{R}^n \mid \sum_{j=1}^{n} x_j = 1,\ 0 \le x_j \le 1 \text{ for } j = 1, 2, \ldots, n\}$. There exists a bijective function $\phi: Z^T \to A^T$ s.t. for any $h = [h_1, h_2, \ldots, h_T]$ and $z_u = [z_{u,1}, z_{u,2}, \ldots, z_{u,T}]$, the following holds:

$$d_{u,T} = \sum_{t=1}^{T} \phi(z_u)_t h_t$$

where $\phi(z_u)_t$ denotes the $t$-th element of $\phi(z_u)$, and $d_{u,T}$ is obtained from $z_u$ and $h$ according to Eq. (13).

Proof. Using the recursive Eq. (13),

$$d_{u,T} = (1 - z_{u,T}) d_{u,T-1} + z_{u,T} h_T$$
$$= (1 - z_{u,T})(1 - z_{u,T-1}) d_{u,T-2} + (1 - z_{u,T}) z_{u,T-1} h_{T-1} + z_{u,T} h_T$$
$$= \ldots$$
$$= \sum_{t=1}^{T} \left( \prod_{j=t+1}^{T} (1 - z_{u,j}) \right) z_{u,t} h_t.$$

Therefore the function $\phi$ that satisfies the equation in Thm. 1 is given by

$$\phi(z_u)_t := z_{u,t} \prod_{j=t+1}^{T} (1 - z_{u,j}) \quad \text{for } t = 1, 2, \ldots, T. \tag{16}$$

Given that $z_u \in Z^T$, the output $\phi(z_u)$ is an element of $A^T$ because it is trivial to show that $0 \le \phi(z_u)_t \le 1$ for $t = 1, 2, \ldots, T$, and also $\sum_{t=1}^{T} \phi(z_u)_t = 1$ holds as follows:

$$\sum_{t=1}^{T} \phi(z_u)_t = \sum_{t=1}^{T} z_{u,t} \prod_{j=t+1}^{T} (1 - z_{u,j})$$
$$= \sum_{t=2}^{T} z_{u,t} \prod_{j=t+1}^{T} (1 - z_{u,j}) + \prod_{j=2}^{T} (1 - z_{u,j})$$
$$= \sum_{t=3}^{T} z_{u,t} \prod_{j=t+1}^{T} (1 - z_{u,j}) + \prod_{j=3}^{T} (1 - z_{u,j})$$
$$= \ldots$$
$$= z_{u,T} + (1 - z_{u,T}) = 1.$$

The $\phi(z_u)$ is a bijective function since the inverse mapping of $\phi$ exists as follows:

$$z_{u,T} = \phi(z_u)_T,$$
$$z_{u,T-1} = \begin{cases} \dfrac{\phi(z_u)_{T-1}}{1 - z_{u,T}} = \dfrac{\phi(z_u)_{T-1}}{1 - \phi(z_u)_T} & \text{if } \phi(z_u)_T < 1, \\ 0 & \text{otherwise,} \end{cases}$$
$$z_{u,T-2} = \begin{cases} \dfrac{\phi(z_u)_{T-2}}{1 - \phi(z_u)_T - \phi(z_u)_{T-1}} & \text{if } \sum_{j=T-1}^{T} \phi(z_u)_j < 1, \\ 0 & \text{otherwise,} \end{cases}$$
$$\vdots$$
$$\Rightarrow \; z_{u,t} = \begin{cases} \dfrac{\phi(z_u)_t}{1 - \sum_{j=t+1}^{T} \phi(z_u)_j} & \text{if } \sum_{j=t+1}^{T} \phi(z_u)_j < 1, \\ 0 & \text{otherwise,} \end{cases}$$

for $t = 1, 2, \ldots, T$. It is also trivial to show that $z_{u,1} = 1$ and $0 \le z_{u,t} \le 1$ for $t = 2, \ldots, T$, given that $\phi(z_u) \in A^T$. Therefore, $z_u \in Z^T$. ∎

Note that $\phi(z_u)_t$ in Eq. (16) corresponds to the attention weight $\alpha_{u,t}$ in Eq. (3) of GSA. By Thm. 1, the attention context vector $c_u$ of GRC is capable of expressing all possible weighted averages of the encoded representations over time, as in the GSA. Thus the range of $c_u$ in GRC or GSA is the same. Nonetheless, we empirically showed that GRC performs comparably to or even better than GSA, and the experimental results are given in Sec. IV.

2) Relation to sMoChA

The sMoChA [15] is a variant of MoChA where Eq. (11) is replaced by the following formula:

$$\gamma_{u,t} = p_{u,t} \prod_{j=1}^{t-1} (1 - p_{u,j}) \tag{17}$$

which enables the optimization process to be more stabilized. Eq. (17) is almost similar to the function $\phi$ in Eq. (16), and implies evidence on the stability of GRC training. Despite this fact, sMoChA is an algorithm independent of GRC, as Eq. (17) is merely used as the selection probability component in the whole training formulas and not even used for inference.

Global attention methods including GSA and GRC cannot

Algorithm 1 Online inference using DecGRC.
Input: encoded vectors $h$ of length $T$, threshold $\nu$
State: $s_0 = 0$, $u = 1$, $y_0$ = StartOfSequence
while $y_{u-1} \ne$ EndOfSequence do
    $d_u = h_1$
    for $t = 2$ to $T$ do
        $e_{u,t} = \mathrm{Score}(s_u, h_t, \alpha_{<u,t}) + b$
        $z_{u,t} = 1 / (1 + \sum_{j=1}^{t} \exp(-e_{u,j}))$
        $d_u = (1 - z_{u,t}) d_u + z_{u,t} h_t$
        if $z_{u,t} < \nu$ then
            break
        end if
    end for
    $s_u = \mathrm{RecurrentState}(s_{u-1}, y_{u-1}, d_u)$
    $P(y_u \mid y_{<u}, h) = \mathrm{ReadOut}(s_u, y_{u-1}, c_u)$    // softmax
    $y_u = \mathrm{Decide}(P(y_u \mid y_{<u}, h))$    // choose a text unit in the vocabulary (e.g. argmax for greedy search)
    $u = u + 1$
end while
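The recursion of Eq. (13), its equivalent attention weights in Eq. (16), and the thresholded inner loop of Alg. 1 can be tried out on toy numbers. The following is only a minimal sketch in plain Python, not the authors' implementation: the function names are hypothetical, the scores $e_{u,t}$ are assumed to be precomputed for a single decoder step, and the DecGRC gate sum is assumed to start at $j = 1$ (so a score for the first frame is needed).

```python
import math


def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))


def grc_gates(scores):
    """GRC update gates (Eq. (14)): z_1 = 1 and z_t = sigmoid(e_t) for t >= 2."""
    return [1.0] + [sigmoid(e) for e in scores[1:]]


def decgrc_gates(scores):
    """DecGRC update gates: z_1 = 1 and z_t = 1 / (1 + sum_{j<=t} exp(-e_j)).

    Assumption: the sum starts at j = 1, so the first-frame score is used too.
    """
    gates = [1.0]
    acc = math.exp(-scores[0])
    for e in scores[1:]:
        acc += math.exp(-e)
        gates.append(1.0 / (1.0 + acc))
    return gates


def recurrent_context(gates, encoded, threshold=None):
    """Run d_t = (1 - z_t) d_{t-1} + z_t h_t (Eq. (13)).

    With a threshold, stop right after the first update whose gate is below it,
    mirroring the inner loop of Alg. 1. Returns (context, frames consumed).
    """
    context = list(encoded[0])  # z_1 = 1, hence d_1 = h_1
    frames = 1
    for z, h_t in zip(gates[1:], encoded[1:]):
        context = [(1.0 - z) * d + z * h for d, h in zip(context, h_t)]
        frames += 1
        if threshold is not None and z < threshold:
            break
    return context, frames


def equivalent_weights(gates):
    """Attention weights of Eq. (16): phi_t = z_t * prod_{j>t} (1 - z_j)."""
    weights = [0.0] * len(gates)
    tail = 1.0  # running product of (1 - z_j) over j > t
    for t in range(len(gates) - 1, -1, -1):
        weights[t] = gates[t] * tail
        tail *= 1.0 - gates[t]
    return weights


def weighted_average(weights, encoded):
    dim = len(encoded[0])
    return [sum(w * h[i] for w, h in zip(weights, encoded)) for i in range(dim)]


if __name__ == "__main__":
    encoded = [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0], [4.0, -1.0]]  # toy h_1..h_4
    scores = [0.0, 1.5, -0.3, -4.0]                               # toy e_{u,1..4}
    z = grc_gates(scores)
    print(recurrent_context(z, encoded)[0])                  # recursive context
    print(weighted_average(equivalent_weights(z), encoded))  # same value via Eq. (16)
    print(recurrent_context(decgrc_gates(scores), encoded, threshold=0.3))
```

In this sketch the two printed GRC contexts agree up to rounding, which is the content of Thm. 1, and raising the threshold makes the DecGRC loop stop after fewer frames.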
B. Decreasing GRC (DecGRC)

By Thm. 1, the final context vector $d_{u,T}$ of GRC in Eq. (12) can be interpreted as a weighted average of encoded vectors $h_{1:T}$. Thus GRC can be regarded as a kind of global attention method. Furthermore, not only the final context vector $d_{u,T}$ of GRC but also an intermediate context $d_{u,t}$ is a weighted average of the encoded vectors $h_{1:T}$ according to the following corollary:

Corollary 1.1 (Weighted average). Let $Z^n$ and $A^n$ be the sets defined in Thm. 1. For any $\tau \in \{1, 2, \ldots, T\}$, $z_u \in Z^T$, and $h = [h_1, h_2, \ldots, h_T]$, there exists a function $a_\tau: Z^T \to A^T$ that satisfies the following equation:

$$d_{u,\tau} = \sum_{t=1}^{T} a_\tau(z_u)_t h_t$$

where $a_\tau(z_u)_t$ denotes the $t$-th element of $a_\tau(z_u)$, and $d_{u,\tau}$ is obtained from $z_u$ and $h$ according to Eq. (13).

Proof. By substituting every $T$ in the proof of Thm. 1 with $\tau$, there exists a bijective function $\phi_\tau: Z^\tau \to A^\tau$ given by

$$\phi_\tau(z_u)_t := z_{u,t} \prod_{j=t+1}^{\tau} (1 - z_{u,j}) \quad \text{for } t = 1, 2, \ldots, \tau,$$

such that

$$d_{u,\tau} = \sum_{t=1}^{\tau} \phi_\tau(z_u)_t h_t.$$

It is trivial to show that the following function $a_\tau: Z^T \to A^T$ satisfies the equation in Coroll. 1.1:

$$a_\tau(z_u)_t = \begin{cases} \phi_\tau(z_u)_t & \text{if } t \le \tau, \\ 0 & \text{otherwise,} \end{cases} \quad \text{for } t = 1, 2, \ldots, T.$$ ∎

compute the attention weights without the entire sequence of the encoded vectors $h$. However, considering that the attention techniques are methods for calculating the weighted average of the encoded vectors, Coroll. 1.1 enables us to treat an intermediate context $d_{u,t}$ as a substitute for the attention context vector $c_u$ in Eq. (12) of GRC even when the whole encoded sequence is not provided.

Inspired by this, we further propose a novel online attention algorithm, namely DecGRC. DecGRC is a modified version of GRC, replacing Eq. (14) with

$$z_{u,1} = 1, \quad z_{u,t} = \frac{1}{1 + \sum_{j=1}^{t} \exp(-e_{u,j})}.$$

Note that the update gate is constrained to be monotonically decreasing over time. At the training phase, DecGRC is trained in the same way as GRC, using an entire utterance to obtain a final context $d_{u,T}$ according to Eqs. (12)-(13). At the inference phase, for each decoder step $u$, DecGRC decides an endpoint $t_{end}$ so that only encoded vectors before the endpoint can contribute to the online context vector

$$c_u = d_{u,t_{end}},$$

which is used in place of the GRC context vector in Eq. (12).

Assume that there exists an endpoint index $t_{end}$ with which $z_{u,t_{end}}$ has a very small value (e.g. less than 0.001). Considering that $z_{u,t} < z_{u,t_{end}}$ holds for all $t > t_{end}$, the difference between $d_{u,t_{end}}$ and $d_{u,T}$ is small, as the numerical change $|d_{u,t} - d_{u,t-1}|$ for $t > t_{end}$ induced by the recursion rule in Eq. (13) is negligible if $z_{u,t_{end}}$ is small enough. Intuitively, intermediate context vectors roughly converge after the endpoint.

DecGRC can operate as an online attention method if such an endpoint index $t_{end}$ exists at each decoder step and the index can be decided by the model. We experimentally observed that DecGRC models adequately learn the alignment between encoded vectors and text output units, and the intermediate context nearly converges after the aligned time index at each decoder step. Nevertheless, the performance of DecGRC can be degraded due to the mismatch between training and inference, especially when the endpoints are decided to be too early. Relevant experimental results are given in Sec. IV-D.

Accordingly, with an online encoder, online inference can be implemented via a well-trained DecGRC model. We describe the online inference technique in Alg. 1, where the endpoint index is decided simply by thresholding the update gate values.

C. Computational efficiency of proposed methods

GRC or DecGRC adds a negligible amount of memory footprint, since only one trainable parameter $b$ in Eq. (15) is added to the standard GSA-based AED model. The computational amount of an attention method is dominated by the score function calculation, as it requires matrix multiplications. For example in GSA, a fixed-dimensional matrix-vector product is needed to obtain $e_{u,t}$ in Eq. (5) for each $u$ and $t$, which results in $\Theta(TU)$ floating point operations for processing an utterance. Although the softmax operation in Eq. (3) and the weighted average operation in Eq. (2) also require $\Theta(TU)$ operations in total, these are negligible compared to the score function calculation since they do not involve matrix-vector multiplications. As a result, the total computational complexity of GSA is $\Theta(TU)$.

Similarly, both GRC and DecGRC require the score function calculation in Eq. (15), having computational complexity of $O(TU)$. However, in practice, a speech sequence is linearly aligned with the text sequence on average. As Alg. 1 only considers encoded vectors before the endpoint indices, the total number of steps in the for loop is typically slightly larger than $TU/2$, if the threshold $\nu$ is set to an appropriate value. Therefore, DecGRC is computationally more efficient than the global attentions such as GRC and GSA at the inference phase.

The recursive updating in Eq. (13) induces a negligible amount of computation compared to the whole training or inference process. There still exists room for faster computation by enabling parallel computation in time. The parallel computation can be implemented by utilizing Eq. (2) where $\alpha_{u,t}$ is replaced with $\phi(z_u)_t$ in Eq. (16), instead of Eqs. (12)-(13). Note that GRC and DecGRC are not the best choices among attention methods in terms of computational complexity. Among the global attention methods, the linearized attention [21] features a very low computational complexity of $\Theta(T + U)$ when used as encoder-decoder attention, which is much smaller than $\Theta(TU)$ of GRC. The computational complexity of an online attention method MoChA [14] is $\Theta(wU)$ where $w$ is the window-size, which is typically far less than $O(TU)$ of DecGRC. Notwithstanding, the encoder-decoder attention's computational amount is minor compared to the other layers in the encoder and the decoder.

The most important fact is that both proposed methods introduce no hyperparameter at the training phase. Thus the proposed methods do not need to repeat training to find a proper value of such a hyperparameter. Though the DecGRC inference in Alg. 1 introduces a new hyperparameter (i.e., threshold $\nu$) at test phase, the threshold searching on development sets does not take a long time, because the size of the development sets is minor compared to the training set. Hence the total time spent to prepare an ASR system can be saved. Furthermore, the tradeoff between latency and performance can be adjusted by resetting the threshold value $\nu$ at inference phase, unlike the conventional online attention methods [9], [13], [14]. In these existing methods, the inference algorithms' decision rules on the attention endpoints are determined at the training phase, and remain unchanged at the test stage. The experiments on DecGRC with different thresholds are demonstrated in Sec. IV-E.

IV. EXPERIMENTS

A. Configurations

All experiments were conducted on LibriSpeech dataset¹, which contains 16 kHz read English speech with transcription. The dataset consists of 960 hours of a training set from 2,338 speakers, 10.8 hours of a dev set from 80 speakers, and 10.4 hours of a test set from 66 speakers, with no overlapping speakers between different sets. Both dev and test sets are split in half into clean and other subsets, depending upon the ASR difficulty of each speaker. We randomly chose 1,500 utterances from dev set as a validation set.

All experiments shared the same network architecture and training scheme of a recipe of RETURNN toolkit [27], [28], except the attention methods. Input features were 40-dimensional mel-frequency cepstral coefficients (MFCCs) extracted with Hanning window of 25 ms length and 10 ms hop size, followed by global mean-variance normalization. Output text units were 10,025 byte-pair encoding (BPE) units extracted from transcription of LibriSpeech training set. The Encoder(·) consisted of 6 BiLSTM layers of 1,024 units for each direction, and max-pooling layers of stride 3 and 2 were applied after the first two BiLSTM layers respectively. For the online Encoder(·), 6 LC-BiLSTM layers were employed in place of the BiLSTM layers, where the future context sizes were set to 36, 12, 6, 6, 6, and 6 for each layer from bottom to top and the chunk sizes were same as the future context sizes. Both Score(·) and MonotonicScore(·) functions were implemented using the formulation in Eq. (5) and 1,024-dimensional attention key. RecurrentState(·) was implemented with a unidirectional LSTM layer with 1,000 units. ReadOut(·) consisted of a max-out layer with 2,500 units, followed by a softmax output layer with 10,025 units. Every model contains a total of 188 M parameters both for BiLSTM and LC-BiLSTM encoder architecture, except that every MoChA-based model has 191 M parameters.

Weight parameters were initialized with Glorot uniform method [29], and biases were initially set to zero. Optimization techniques were utilized during the training: teacher forcing, Adam optimizer, learning rate scheduling, curriculum learning, and the layer-wise pre-training scheme. Briefly, the models were trained for 13.5 epochs using a learning rate of $8 \times 10^{-4}$ with a linear warm-up starting from $3 \times 10^{-4}$ and the Newbob decay rule [30]. Only the first two layers of the Encoder(·) with half-width (i.e., 512 units for each direction) were used at the beginning of training. Then once every 0.25 epoch from 0.75 epoch until 1.5 epoch, a new layer was inserted on the top of the encoder and 1/8 original width (i.e., 128 units for each direction) of new units were added to each layer. Finally, the width and the number of layers increased to the original size at 1.5 epoch. The CTC multi-task learning [31] with a lambda of 0.5 was employed to stabilize the learning, where CTC loss is measured with another 10,025-units softmax layer on the top of Encoder(·). For the models which began the learning from parameters of a pre-trained model, the layer-wise pre-training was skipped. Every model was regularized by applying dropout rate 0.3 to Encoder(·) layers and the softmax layer and employing label smoothing of 0.1. For each epoch of the training, both cross-entropy (CE) losses and output error rates were measured 20 times on the validation set with teacher forcing. During the inference phase, the model with the lowest WER on the dev-other set among all checkpoints was selected as the final model, and performed beam search once on the dev and test sets with a beam size of 12.

We trained MoChA models for 17.5 epochs with five times longer layer-wise pre-training to make them converge. A small learning rate of 1e-5 was used for training windowed attention models as in [13]. Though the numbers of total epochs for different experiments were not the same, each model was optimized to converge and showed negligible improvements after that.

¹ The LibriSpeech dataset can be downloaded from http://www.openslr.org/
² The scripts for all experiments are available at https://github.com/GRC-anonymous/GRC.

TABLE I. WORD ERROR RATES (WERs) COMPARISON BETWEEN ATTENTION METHODS ON LIBRISPEECH DATASET.

| Exp. ID | Attention method | Param. init. from | Is attention online? | Is encoder online? | Can infer online? | WER [%] dev-clean | dev-other | test-clean | test-other |
| E1 | GSA | - | No | No (BiLSTM) | No | 4.77 | 14.11 | 4.92 | 15.15 |
| E2 | GRC | - | No | No (BiLSTM) | No | 4.84 | 14.06 | 4.88 | 14.59 |
| E3 | Windowed att. (w=11) | E1 | Yes | No (BiLSTM) | No | 12.50 | 23.79 | 15.27 | 25.81 |
| E4 | Windowed att. (w=20) | E1 | Yes | No (BiLSTM) | No | 5.78 | 14.82 | 5.71 | 15.90 |
| E5 | MoChA (w=2) | - | Yes | No (BiLSTM) | No | 6.49 | 17.11 | 6.17 | 18.18 |
| E6 | MoChA (w=8) | - | Yes | No (BiLSTM) | No | 4.74 | 14.20 | 4.95 | 15.32 |
| E7 | DecGRC (ν=0.01) | - | Yes | No (BiLSTM) | No | 4.91 | 14.85 | 5.10 | 15.85 |
| E8 | DecGRC (ν=0.01) | E2 | Yes | No (BiLSTM) | No | 4.97 | 14.02 | 4.83 | 14.90 |
| E9 | GSA | - | No | Yes (LC-BiLSTM) | No | 5.54 | 15.49 | 5.51 | 16.91 |
| E10 | GSA | E1 | No | Yes (LC-BiLSTM) | No | 5.28 | 15.44 | 5.17 | 16.40 |
| E11 | GRC | - | No | Yes (LC-BiLSTM) | No | 6.09 | 16.05 | 6.18 | 16.47 |
| E12 | GRC | E2 | No | Yes (LC-BiLSTM) | No | 5.48 | 15.14 | 5.55 | 15.88 |
| E13 | Windowed att. (w=11) | E10 | Yes | Yes (LC-BiLSTM) | Yes | 12.82 | 24.10 | 15.14 | 26.94 |
| E14 | Windowed att. (w=20) | E10 | Yes | Yes (LC-BiLSTM) | Yes | 5.62 | 15.86 | 5.56 | 16.96 |
| E15 | MoChA (w=2) | E5 | Yes | Yes (LC-BiLSTM) | Yes | 6.48 | 18.35 | 6.55 | 19.33 |
| E16 | MoChA (w=8) | E6 | Yes | Yes (LC-BiLSTM) | Yes | 5.11 | 15.10 | 5.15 | 16.45 |
| E17 | DecGRC (ν=0.01) | E8 | Yes | Yes (LC-BiLSTM) | Yes | 5.77 | 16.24 | 5.87 | 17.04 |
| E18 | DecGRC (ν=0.08) | E12 | Yes | Yes (LC-BiLSTM) | Yes | 5.79 | 15.67 | 6.04 | 16.34 |

B. Performance comparison between attentions

All experimental results are summarized in Tbl. I. For each experiment, we performed two trials of training with the same configuration and chose a model with the lowest word-error-rate (WER), a word-level Levenshtein distance divided by the number of ground-truth words, on dev-other set.

In E1 to E2 and E9 to E12, GRC showed better performance than the other attention methods on test-other set, showing 3.7% and 3.2% relative error-reduction rate (RERR) compared to GSA when evaluated on BiLSTM and LC-BiLSTM encoder, respectively.

In E3 to E6 and E13 to E16, performances of the conventional online attentions, i.e., windowed attention and MoChA, were shown to be highly dependent on the choice of window size hyperparameter $w$. On the other hand, DecGRC is trained without any additional hyperparameter and only involves a threshold $\nu$ at the inference phase.

In E3 to E8 and E13 to E18, DecGRC outperformed the conventional online attention techniques on BiLSTM encoder. With LC-BiLSTM encoder, the performance of DecGRC on test-other set surpassed the conventional attentions including GSA, while the scores on test-clean set were worse than the competitors. The overall performance of GRC and DecGRC is degraded on LC-BiLSTM compared to their preferable performance on BiLSTM, which was conjectured to be caused by the following aspect of the proposed methods; $\phi(z_u)_t$ in Eq. (16) has a dependency on update-gate values of the future time-steps. Therefore using a short future receptive field of LC-BiLSTM may have contributed to the degradation.

C. Optimization speed

The cross-entropy loss curves on training and dev set in E1, E2, E6, and E7 are depicted in Fig. 2. The model based on each attention method was trained from scratch until convergence, with a few spikes in its training loss curve. These spikes in the loss curve are caused by the layer-wise pre-training algorithm described in Sec. IV-A. Every time a new layer and units are inserted to the encoder, the training loss temporarily shows rapid increase, because the newly inserted network parameters are not trained yet.

Fig. 2. Cross-entropy loss curves of various attention methods (GSA, GRC, MoChA, and DecGRC; train and dev). All the models were trained from scratch (w/ BiLSTM encoder). Axes: epoch (horizontal), CE loss (vertical).

Overall, GRC and DecGRC showed faster from-scratch training speed than MoChA, but slower than GSA. DecGRC converged slightly later than GRC. MoChA showed the slowest optimization speed, which was partly due to the 5 times longer layer-wise pre-training scheduling than the others. Such long pre-training was employed to stabilize the training of MoChA, whereas both GRC and DecGRC successfully converged with the standard pre-training. Note that the longer pre-training of MoChA was adopted because it had failed to converge with a short pre-training in our initial experiments. The relatively stable learning of the proposed methods over MoChA can be explained in relation to sMoChA, as described in Sec. III-A2; the sMoChA stabilized the training of MoChA by utilizing a modified selection probability formula, which is actually almost similar to the attention weight $\phi(z_u)_t$ of GRC in Eq. (16).

Fig. 3. An input spectrogram, attention plots with the output BPE sequence of GSA (E1), GRC (E2), and DecGRC (E8), and the update gates of the DecGRC, from top to bottom. All results were obtained with BiLSTM encoder on an utterance 8254-84205-0009 in dev-other set. The update gates were obtained with teacher forcing, and the attention plots were results of the beam search w/ beam size 12. "__" was inserted after a BPE unit end if it was not a word-end. Axes: encoder time index (horizontal), decoder step index (vertical).

D. Attention analysis

GRC and DecGRC accurately learned alignments between encoded representations and output text units, as illustrated in Fig. 3. An interesting characteristic of GRC was observed that it tended to put much weight on the latter time indices of attention weights, compared to GSA. This can be regarded as an innate behavior of GRC, as the attention weight $\phi(z_u)_t$ in Thm. 1 is designed to weigh the latter indices when the update gates $z_{u,t}$ have similar value over several consecutive time-indices. The latter-time-weighing attribute could be especially effective for a long text unit (e.g., a BPE unit "swinging" in Fig. 3), as a long BPE unit often ends with a suffix that might be crucial to distinguish words (e.g., "-ing", "-n't", or "-est" in English). A piece of statistical evidence is presented in Fig. 4; GRC outperformed GSA when the median length of BPE was larger than or equal to 6, while it showed similar performance for shorter median lengths.

Fig. 4. WER for each utterance-wise median length of the BPE units (w/ BiLSTM encoder). WERs for GSA (E1) and GRC (E2) were measured on the test set (i.e., both test-clean and test-other). Horizontal axis: median length of BPE units in an utterance.

Attention weights of DecGRC tended to be much smoother (i.e., focused on longer time) than GRC and GSA. Such smoothness was hypothesized to be caused by the decreasing update gates, which made the model trained to be cautious for a sharp descent of update gate values, as it is irreversible in DecGRC. In addition, DecGRC did not attend on the first time index, unlike GSA and GRC. It is an intrinsic property of DecGRC, as the earliest update gates have values close to 1 and it is therefore difficult to carry information to later time. As the initial frames of an utterance usually contain helpful information such as background noise, this might cause DecGRC to be degraded compared to the global attentions.

The last two plots in Fig. 3 show that the update gate values of DecGRC mostly changed near the attention region. As the update gates rapidly decreased after the attention region, tight attention endpoints could easily be found by setting the threshold value approximately in a range of [0.001, 0.2]. For instance, with an inference threshold $\nu = 0.01$ in Fig. 3, the total number of steps in the for loop in Alg. 1 was 459, which was approximately 54% of $TU = 13 \times 65 = 845$. It implies that insignificant time indices were properly ignored during the inference.

In Fig. 5, WERs of online attention models are evaluated for various ranges of utterance lengths with LC-BiLSTM encoder. DecGRC models showed better performance than conventional online attention methods for utterances shorter than 21 seconds, while its performance severely degenerated for utterances longer than 21 seconds.

Fig. 5. Word error rates (WERs) of windowed attention (E14), MoChA (E16), and DecGRC (E18) online models on LibriSpeech test-other dataset for various ranges of utterance lengths, evaluated with LC-BiLSTM encoder. DecGRC model is evaluated with a threshold value of 0.08. Horizontal axis: utterance length range (s), in 3-second bins from 0-3 to 30-33.
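Since WER and RERR underlie Tbl. I and Figs. 4-5, a small worked example may help. The sketch below follows the definition in Sec. IV-B (word-level Levenshtein distance divided by the number of ground-truth words); the function names are hypothetical, the sentence pair is only an illustration, and the RERR pairs (E1 vs. E2, E10 vs. E12 on test-other) are an assumption that happens to reproduce the quoted 3.7% and 3.2%.

```python
def word_errors(reference, hypothesis):
    """Return (word-level Levenshtein distance, number of reference words)."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))  # distances against an empty reference
    for i, ref_word in enumerate(ref, start=1):
        cur = [i]
        for j, hyp_word in enumerate(hyp, start=1):
            cur.append(min(prev[j] + 1,                            # deletion
                           cur[j - 1] + 1,                         # insertion
                           prev[j - 1] + (ref_word != hyp_word)))  # substitution or match
        prev = cur
    return prev[-1], len(ref)


def corpus_wer(pairs):
    """WER in percent over (reference, hypothesis) pairs."""
    errors = words = 0
    for reference, hypothesis in pairs:
        e, n = word_errors(reference, hypothesis)
        errors += e
        words += n
    return 100.0 * errors / words


def rerr(baseline_wer, new_wer):
    """Relative error-reduction rate in percent."""
    return 100.0 * (baseline_wer - new_wer) / baseline_wer


if __name__ == "__main__":
    ref = "we went at a good swinging gallop and what about you"
    hyp = "we went at a good swing gallop and what about you"
    print(corpus_wer([(ref, hyp)]))        # one substitution out of 11 words
    print(round(rerr(15.15, 14.59), 1))    # GSA (E1) vs. GRC (E2), test-other
    print(round(rerr(16.40, 15.88), 1))    # GSA (E10) vs. GRC (E12), test-other
```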
We conjectured the performance degeneration of DecGRC for long utterances is fundamentally due to its formulation. According to the recursion rule in Eq. (13), for each decoder step, DecGRC always starts from the first time-index of encoded vectors and processes through the whole sequence until the endpoint is detected, whereas most conventional online attention methods compute the attention weights within a fixed-size window. This indicates that DecGRC has a larger possibility of producing a wrong attention context vector than existing online attentions for long utterances, as observed in Fig. 5. The overall performance of DecGRC was better than the others since the utterances longer than 21 seconds are only about 0.5% of the LibriSpeech test-other set. Notwithstanding, this WER degradation of DecGRC on long input sequences needs to be fixed for better performance, which we would solve in future research.

E. Ablation study on DecGRC inference threshold

We evaluated WERs and latencies of the proposed online DecGRC model (E18) for different threshold values, and the results are plotted in Fig. 6. For the latency measure, we employed average lagging (AL) metric [32], which is frequently used to measure the latency of an online sequence-to-sequence model when ground-truth label of input-output time alignment is not given. The AL of an online ASR model on an utterance is obtained as follows [32]:

$$\mathrm{AL}(x, y) = \frac{1}{\tau(|x|)} \sum_{u=1}^{\tau(|x|)} \left\{ g(u) - \frac{|x|}{|y|} (u - 1) \right\}, \tag{18}$$
$$\tau(|x|) = \min \{ u \mid g(u) = |x| \}$$

where $x$ and $y$ are acoustic input sequence and output text sequence respectively, and $g(u)$ is a monotonic non-decreasing function of $u$ that denotes the number of acoustic input frames processed by the encoder when deciding the $u$-th target text token. For intuitive notation, we reported the AL value calculated according to Eq. (18) multiplied by the time unit of acoustic input (i.e., 10 ms) in Fig. 6.

Fig. 6. Ablation study about the inference threshold of the proposed online model DecGRC (E18) on LibriSpeech dev-clean dataset. The latency measure (average lagging) and WERs were measured with varying inference threshold $\nu$, which is denoted for each node with blue text. Axes: average lagging (s) (horizontal), word error rates (%) (vertical).

In Fig. 6, the tradeoff between latency and WER was observed to be adjustable when the threshold $\nu$ is in the range of [0.1, 1.0]. Setting the threshold to a value larger than 0.25 was found to be detrimental to the performance, with larger thresholds giving higher WERs. It means that some encoded vectors in the correct attention region were ignored due to the high threshold, as shown in the last two plots of Fig. 3. Impressively, the best performance was obtained with $\nu$ between 0.05 and 0.1, not $\nu = 0$. This may be attributed to the fact that the thresholding not only reduced the latency, but also eliminated undesirable updates after the correct attention region. With thresholds higher than the best-performing threshold, the latency could be further reduced by taking the performance penalty, and vice versa.

After training ends, a DecGRC model needs extra searching to find a threshold that provides the best tradeoff between latency and performance. Nevertheless, the threshold searching time is insignificant compared to the training time. The beam search inference on the dev set took less than 15 minutes using a single GPU, and the time spent for the tuning process of the threshold was no more than 2.5 hours, which is much shorter than the model training time; a single epoch of training took 9 hours on average, and the total time for training a model from scratch was more than 5 days.

V. CONCLUSION

We proposed a novel softmax-free global attention method called GRC, and its variant for online attention, namely DecGRC. Unlike the conventional online attentions, DecGRC introduces no additional hyperparameter to be tuned at the training phase. Thus DecGRC does not require multiple trials of training, saving time for model preparation. Moreover, at the inference of DecGRC, the tradeoff between ASR latency and performance can be controlled by adapting the scalar threshold which is related to the attention endpoint decision, whereas the conventional online attentions are not capable of changing the endpoint decision rule at test phase. Both GRC and DecGRC showed comparable ASR performance to the conventional global attentions.

For further research, the proposed attention methods will be investigated in various applications which leverage AED models. We are particularly interested in applying DecGRC to simultaneous machine translation [33] and real-time scene text recognition [34], where the latency can be reduced by exploiting an online attention method.

REFERENCES

[9] N. Jaitly, D. Sussillo, Q. V. Le, O. Vinyals, I. Sutskever, and S. Bengio, "A neural transducer," arXiv preprint arXiv:1511.04868, 2015.
[10] T. N. Sainath, C.-C. Chiu, R. Prabhavalkar, A. Kannan, Y. Wu, P. Nguyen, and Z. Chen, "Improving the performance of online neural transducer models," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2018, pp. 5864–5868.
[11] J. Hou, S. Zhang, and L.-R. Dai, "Gaussian prediction based attention for online end-to-end speech recognition," in Proceedings of Interspeech, 2017, pp. 3692–3696.
[12] A. Tjandra, S. Sakti, and S. Nakamura, "Local monotonic attention mechanism for end-to-end speech and language processing," in Proceedings of the International Joint Conference on Natural Language Processing (IJCNLP), vol. 1, 2017, pp. 431–440.
[13] A. Merboldt, A. Zeyer, R. Schlüter, and H. Ney, "An analysis of local monotonic attention variants," in Proceedings of Interspeech, 2019, pp. 1398–1402.
[14] C.-C. Chiu and C. Raffel, "Monotonic chunkwise attention," in Proceedings of International Conference on Learning Representations (ICLR).
[15] H. Miao, G. Cheng, P. Zhang, T. Li, and Y. Yan, "Online hybrid ctc/attention architecture for end-to-end speech recognition," in Proceedings of Interspeech 2019, 2019, pp. 2623–2627.
[16] E. Tsunoo, Y. Kashiwagi, T. Kumakura, and S. Watanabe, "Towards online end-to-end transformer automatic speech recognition," arXiv preprint arXiv:1910.11871, 2019.
[17] R. Fan, P. Zhou, W. Chen, J. Jia, and G. Liu, "An online attention-based model for speech recognition," in Proceedings of Interspeech, 2019, pp. 4390–4394.
[18] N. Moritz, T. Hori, and J. Le Roux, "Triggered attention for end-to-end speech recognition," in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2019, pp. 5666–5670.
[19] L. Dong and B. Xu, "Cif: Continuous integrate-and-fire for end-to-end speech recognition," arXiv preprint arXiv:1905.11235, 2019.
[20] Y.-H. H. Tsai, S. Bai, M. Yamada, L.-P. Morency, and R. Salakhutdinov, "Transformer dissection: An unified understanding for transformer's attention via the lens of kernel," in Proceedings of Conference on Empirical Methods in Natural Language Processing and International Joint Conference on Natural Language Processing (EMNLP-IJCNLP), 2019, pp. 4343–4352.
[21] A. Katharopoulos, A. Vyas, N. Pappas, and F. Fleuret, "Transformers are rnns: Fast autoregressive transformers with linear attention," in Proceedings of the International Conference on Machine learning (ICML).
[22] K. Cho, B. Van Merriënboer, C. Gulcehre, D. Bahdanau, F. Bougares, H. Schwenk, and Y. Bengio, "Learning phrase representations using rnn encoder-decoder for statistical machine translation," in Proceedings
[1] W. Chan, N. Jaitly, Q. Le, and O.
Vinyals, “Listen, attend and spell: A of Conference on Empricial Methods in Natural Language Processing neural network for large vocabulary conversational speech recognition,” (EMNLP), 2014, pp. 1724–1734. in Proceedings of IEEE International Conference on Acoustics, Speech [23] L. Wasserman, All of nonparametric statistics. New York: Springer and Signal Processing (ICASSP), 2016, pp. 4960–4964. Science & Business Media, 2006. [2] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio, [24] M.-T. Luong, H. Pham, and C. D. Manning, “Effective approaches to “End-to-end attention-based large vocabulary speech recognition,” in attention-based neural machine translation,” in Proceedings of Conference Proceedings of IEEE International Conference on Acoustics, Speech and on Empricial Methods in Natural Language Processing (EMNLP), 2015, Signal Processing (ICASSP), 2016, pp. 4945–4949. pp. 1412–1421. [3] A. Graves, S. Fernandez, ´ F. Gomez, and J. Schmidhuber, “Connectionist [25] A. Zeyer, K. Irie, R. Schluter ¨ , and H. Ney, “Improved training of temporal classification: labelling unsegmented sequence data with recur- end-to-end attention models for speech recognition,” in Proceedings rent neural networks,” in Proceedings of the International Conference of Interspeech, 2018, pp. 7–11. on Machine learning (ICML), 2006, pp. 369–376. [26] A. Vaswani, N. Shazeer, N. Parmar, J. Uszkoreit, L. Jones, A. N. Gomez, [4] A. Graves, “Sequence transduction with recurrent neural networks,” in Ł. Kaiser, and I. Polosukhin, “Attention is all you need,” in Proceedings Representation Learning Workshop in International Coneference on of Advances in Neural Information Processing Systems (NeurIPS), 2017, Machine Learning (ICML), 2012. pp. 5998–6008. [5] C.-C. Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, [27] A. Zeyer, T. Alkhouli, and H. Ney, “Returnn as a generic flexible neural A. Kannan, R. J. Weiss, K. Rao, E. 
Gonina et al., “State-of-the-art speech toolkit with application to translation and speech recognition,” in Annual recognition with sequence-to-sequence models,” in Proceedings of IEEE Meeting of the Association for Computational Linguistics (ACL), 2018. International Conference on Acoustics, Speech and Signal Processing [28] A. Zeyer, A. Merboldt, R. Schluter ¨ , and H. Ney, “A comprehensive (ICASSP). IEEE, 2018, pp. 4774–4778. analysis on attention models,” in Interpretability and Robustness in [6] A. Garg, D. Gowda, A. Kumar, K. Kim, M. Kumar, and C. Kim, Audio, Speech, and Language (IRASL) Workshop in Conference on Neural “Improved multi-stage training of online attention-based encoder-decoder Information Processing Systems (NeurIPS), Montreal, Canada, 2018. models,” arXiv preprint arXiv:1912.12384, 2019. [29] X. Glorot and Y. Bengio, “Understanding the difficulty of training deep [7] T. N. Sainath, R. Pang, D. Rybach, Y. He, R. Prabhavalkar, W. Li, feedforward neural networks,” in Proceedings of International Conference M. Visontai, Q. Liang, T. Strohman, Y. Wu et al., “Two-pass end-to-end on Artificial Intelligence and Statistics (AISTATS), 2010, pp. 249–256. speech recognition,” in Proceedings of Interspeech, 2019, pp. 2773–2778. [30] A. Zeyer, P. Doetsch, P. Voigtlaender, R. Schluter, and H. Ney, “A [8] Y. Zhang, G. Chen, D. Yu, K. Yaco, S. Khudanpur, and J. Glass, comprehensive study of deep bidirectional lstm rnns for acoustic modeling “Highway long short-term memory rnns for distant speech recognition,” in speech recognition,” in Proceedings of IEEE International Conference in Proceedings of IEEE International Conference on Acoustics, Speech on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, and Signal Processing (ICASSP). IEEE, 2016, pp. 5755–5759. pp. 2462–2466. 11 [31] S. Kim, T. Hori, and S. 
Watanabe, “Joint ctc-attention based end-to-end speech recognition using multi-task learning,” in Proceedings of IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP). IEEE, 2017, pp. 4835–4839. [32] M. Ma, L. Huang, H. Xiong, R. Zheng, K. Liu, B. Zheng, C. Zhang, Z. He, H. Liu, X. Li et al., “Stacl: Simultaneous translation with implicit anticipation and controllable latency using prefix-to-prefix framework,” [33] N. Arivazhagan, C. Cherry, W. Macherey, C.-C. Chiu, S. Yavuz, R. Pang, W. Li, and C. Raffel, “Monotonic infinite lookback attention for simultaneous machine translation,” in Annual Meeting of the Association for Computational Linguistics (ACL), 2019. [34] Z. Liu, Y. Li, F. Ren, W. L. Goh, and H. Yu, “Squeezedtext: A real- time scene text recognition by binary convolutional encoder-decoder network,” in Proceedings of the Association for the Advancement of Artificial Intelligence (AAAI) Conference on Artificial Intelligence, 2018, pp. 7194–7201.
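As a supplementary illustration of the average lagging measure in Eq. (18), the following is a minimal sketch of how the value could be computed for one utterance. The function name `average_lagging`, its argument names, and the fallback used when g(u) never reaches |x| are assumptions made for this example; they are not taken from the paper or from any toolkit.

```python
def average_lagging(g, num_input_frames, frame_shift_s=0.01):
    """Average lagging of one utterance, following Eq. (18).

    g[u - 1] is g(u): the number of acoustic input frames processed when
    the u-th output token is decided (monotonic non-decreasing).
    num_input_frames is |x|; len(g) is taken as |y| (an assumption of
    this sketch: one entry of g per output token).
    frame_shift_s converts frames to seconds (10 ms per frame here).
    """
    num_tokens = len(g)
    if num_tokens == 0 or num_input_frames <= 0:
        raise ValueError("need at least one output token and one input frame")

    # tau(|x|): first decoder step (1-indexed) at which the whole input has
    # been processed. Assumption: if that never happens, use the last step.
    tau = next(
        (u for u, frames in enumerate(g, start=1) if frames >= num_input_frames),
        num_tokens,
    )

    ratio = num_input_frames / num_tokens  # |x| / |y|
    lag_frames = sum(g[u - 1] - ratio * (u - 1) for u in range(1, tau + 1)) / tau
    return lag_frames * frame_shift_s


# A fully offline model reads all frames before the first token,
# so its lag equals the utterance length.
offline = average_lagging([400, 400, 400, 400], num_input_frames=400)
```

In this sketch an offline model yields a lag equal to the full utterance duration, while a model that reads only as many frames as it needs per token yields a smaller value, which is the behaviour the latency axis of Fig. 6 is meant to capture.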
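The effect of the inference threshold discussed in Section E can be pictured with a toy gated recurrence. The sketch below is not Eq. (13), which is not reproduced in this section: the update rule `context = (1 - z) * context + z * h`, the per-frame gate values, and the stopping rule "stop at the first frame whose gate falls below the threshold" are all assumptions chosen only to show why a larger threshold ends the scan earlier and can skip later updates.

```python
def gated_context_with_endpoint(encoded, gates, threshold):
    """Toy gated recurrent context with threshold-based endpoint detection.

    encoded:   list of encoded vectors (each a list of floats), one per frame
    gates:     assumed update-gate value in [0, 1] for each frame
    threshold: scalar inference threshold; 0 disables early stopping
    Returns (context_vector, frames_read).
    """
    if len(encoded) != len(gates):
        raise ValueError("encoded and gates must have the same length")

    context = [0.0] * len(encoded[0])
    frames_read = 0
    # Always start from the first frame, as described for DecGRC.
    for h_t, z_t in zip(encoded, gates):
        frames_read += 1
        if z_t < threshold:
            # Assumed endpoint rule: stop scanning, skip this update.
            break
        context = [(1.0 - z_t) * c + z_t * h for c, h in zip(context, h_t)]
    return context, frames_read


frames = [[1.0], [3.0], [5.0], [7.0]]
gate_values = [1.0, 0.5, 0.1, 0.05]
ctx_stop, read_stop = gated_context_with_endpoint(frames, gate_values, 0.2)
ctx_full, read_full = gated_context_with_endpoint(frames, gate_values, 0.0)
```

With a threshold of 0 the loop never stops early and every frame updates the context, which corresponds to the highest-latency setting; raising the threshold reduces the number of frames read but may drop frames that still belong to the correct attention region.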
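The post-training threshold search described in Section E can be sketched as a simple selection over dev-set measurements. The selection rule used here (lowest average lagging among thresholds whose WER is within a tolerance of the best WER) and the names `pick_threshold` and `wer_tolerance` are assumptions for illustration; the paper states only that a search is performed, not how the final threshold is chosen.

```python
def pick_threshold(measurements, wer_tolerance=0.0):
    """Choose an inference threshold from dev-set measurements.

    measurements:  iterable of (threshold, average_lagging_s, wer_percent),
                   one tuple per decoding pass over the dev set
    wer_tolerance: how many absolute WER points above the best WER are
                   acceptable in exchange for lower latency
    Returns the selected (threshold, average_lagging_s, wer_percent) tuple.
    """
    rows = list(measurements)
    if not rows:
        raise ValueError("no measurements given")

    best_wer = min(wer for _, _, wer in rows)
    admissible = [row for row in rows if row[2] <= best_wer + wer_tolerance]
    # Among admissible thresholds prefer lower latency, then lower WER.
    return min(admissible, key=lambda row: (row[1], row[2]))
```

Each tuple passed to the function corresponds to one beam search pass over the dev set, so the cost of the search grows linearly with the number of candidate thresholds.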

Journal

Computing Research Repository, arXiv (Cornell University)

Published: Jul 10, 2020
