Get 20M+ Full-Text Papers For Less Than $1.50/day. Subscribe now for You or Your Team.

Learn More →

Temporarily-Aware Context Modelling using Generative Adversarial Networks for Speech Activity Detection

Temporarily-Aware Context Modelling using Generative Adversarial Networks for Speech Activity... IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 14, NO. 8, AUGUST 2015 1 Temporarily-Aware Context Modelling using Generative Adversarial Networks for Speech Activity Detection Tharindu Fernando, Student Member, IEEE, Sridha Sridharan, Life Senior Member, IEEE, Mitchell McLaren, Darshana Priyasad, Member, IEEE, Simon Denman, Member, IEEE, and Clinton Fookes, Senior Member, IEEE. Abstract—This paper presents a novel framework for Speech that significant improvements in the accuracy of Automatic Activity Detection (SAD). Inspired by the recent success of multi- Speech Recognition (ASR) can be obtained by combining the task learning approaches in the speech processing domain, we ASR task with context recognition and gender classification propose a novel joint learning framework for SAD. We utilise as auxiliary tasks, as opposed to performing ASR alone. generative adversarial networks to automatically learn a loss Furthermore, the evaluations in [5], [6] suggested that methods function for joint prediction of the frame-wise speech/ non- speech classifications together with the next audio segment. In learned using the multi-task learning paradigm are not only order to exploit the temporal relationships within the input robust when evaluated in cross database scenarios, but also signal, we propose a temporal discriminator which aims to learn powerful and more discriminative features to facilitate ensure that the predicted signal is temporally consistent. We both tasks. evaluate the proposed framework on multiple public benchmarks, including NIST OpenSAT’ 17, AMI Meeting and HAVIC, where Inspired by these findings, we exploit the power of Gen- we demonstrate its capability to outperform state-of-the-art erative Adversarial Networks (GAN) [7], [8] to accurately SAD approaches. Furthermore, our cross-database evaluations perform speech/non-speech classification together with an demonstrate the robustness of the proposed approach across auxiliary task. In choosing the appropriate auxiliary task for different languages, accents, and acoustic environments. SAD we draw inspiration from a conclusion in the field of Index Terms—Speech Activity Detection, Generative Adversar- neuroscience that humans recognise speech in noisy conditions ial Networks, Context Modelling. through the awareness of the next segment of speech which is most likely to be heard [9], [10]. We therefore chose the I. INTRODUCTION prediction of the next audio segment as the auxiliary task as PEECH Activity Detection (SAD) plays a pivotal role in it also complements the primary SAD task via learning the many speech processing systems. Despite the consistent context of the input audio embedding. Through the prediction progress attained in this subject, the problem is far from of next audio segment our model tries to learn a contextual being solved as evidenced by evaluation results across the mapping between the input audio segments and the next vast variety of acoustic conditions featured in challenging segment which is likely to be heard. benchmarks such as HAVIC [1] and NIST OpenSAT’ 17 [2]. Even though the final speech activity decision is agnostic to Our work is inspired by recent observations in speech the actual content of speech, there are reasons to conjecture processing where multi-task learning approaches have shown that the SAD accuracy could be improved by making use to outperform single task learning methods in numerous areas, the semantic information of speech. It is known that humans including, speech synthesis [3], speech recognition [4], speech make use the semantic information to understand speech that enhancement [5], and speech emotion recognition [6]. For is affected significantly by noise [9], [10]. In [11] the authors instance, the seminal work by Pironkov et. al [4] demonstrated demonstrated that our inferior-frontal cortex predicts what someone is likely to hear next even before the actual sound T. Fernando, S. Sridharan, D. Priyasad, S. Denman, C. Fookes are with Speech Research Lab, SAIVT, Queensland University of Technology, reaches the superior temporal gyrus, allowing us to separate Australia. M. McLaren is with Speech Technology and Research Laboratory of SRI International. noise from what is actually spoken. One of our aims in this paper is to investigate how and to what extent we could E-mail: t.warnakulasuriya@qut.edu.au Manuscript received improve the performance of SAD if we were to use semantic arXiv:2004.01546v1 [eess.AS] 2 Apr 2020 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 14, NO. 8, AUGUST 2015 2 information to predict the next speech segment. Current SAD ing better discriminative features for supervised classification methods simply classify whether a sample is speech or non- [14]–[20]. For instance in [17] the authors suggest a com- speech, without paying attention to the temporal context. bination of MFCCs and Gabor features. In [21] the authors Even though the current state-of-the-art SAD systems extract suggest the use of source and filter based features and perform features from a sliding window surrounding the event frame a score level fusion. In [22] the authors propose the use of of interest, they consider the frame as an isolated event and bottleneck features for predicting the speech and non-speech do not consider the entire sequence when detecting the speech posteriors. In [23] the authors fuse six SAD systems, two activity. We show in this paper that through the prediction of supervised and four unsupervised, for the NIST-Open-SAD- the next audio segment by exploiting the task-specific loss- 2015 challenge. Supervised systems utilise labelled speech function learning capability of the GAN framework, we can and non-speech segments for training the SAD while the improve SAD accuracy by a significant amount. unsupervised methods utilise a fixed or adaptive threshold for The proposed architecture is shown in Fig. 1. The model the SAD task. The work of Hwang et. al [24] proposed the utilises audio, Mel-Frequency Cepstral Coefficients (MFCC) utilisation of an ensemble of deep neural networks trained on and Deltas of MFCC as the inputs and encodes these inputs different noise types for supervised SAD. In a different line of into an encoded representation, C . The generators receive this work, [25] proposed a semi-supervised learning approach for input embedding, C , and a noise vector, z, as the inputs. We GMM training, using power normalized cepstral coefficients, perceptual linear prediction coefficients, and frequency domain utilise two generators, G , for synthesising the frame-wise speech/ non-speech classifications and G , which synthesises linear prediction as features in addition to MFCCs. the audio signal for the next time window. It should be noted However, none of the above stated deep learning systems that in Fig. 1 the generators, G and G are denoted as two have explicitly modelled the temporal relationship between separate LSTM blocks, each with two cells of LSTMs.The audio frames in the input signal when performing SAD. static discriminator, D , receives the current input embeddings One of the earliest attempts to leverage temporal modelling and either the synthesised or ground truth speech classification in SAD was based on Recurrent Neural Networks (RNNs) [26] sequences and tries to discriminate between the two. The where the authors demonstrate a reduction of 26% in the false temporal discriminator, D , also receives the current input alarm rate compared to their Gaussian Mixture Model (GMM) embeddings and either the synthesised or ground truth future baseline which doesn’t use any temporal modelling. In [27] the audio segments and learns to classify them, considering the authors build upon this work where they augment the Long temporal consistency of those signals. Short Term Memory (LSTM) cell architecture. They propose The main contributions of the proposed work are sum- a coordinated-gate LSTM structure and a methodology to marised as follows: directly optimise the SAD loss using the Frame Error Rate (FER). Most recently, the Adaptive Context Attention Model We introduce a Temporarily-Aware GAN (TA-GAN) (ACAM) [28] model extended the LSTM based temporal learning framework for speech activity detection. modelling scheme using an attention strategy to learn the We demonstrate how a custom loss function for speech context of the speech signal for noise robustness in the SAD activity detection can be automatically learned through system. In a different line of work, an audiovisual SAD system the GAN learning process. is proposed in [29] in order to improve the robustness of the We propose a novel temporal discriminator which encour- framework. ages the generator to synthesise future speech segments in accordance with the current context. We perform extensive evaluations on the proposed frame- III. THE P ROPOSED APPROACH work using multiple public benchmarks and demonstrate We are inspired by the tremendous success of DNN based performance beyond that of current state-of-the-art sys- multi-task learning frameworks [4]–[6] in speech processing tems. which demonstrate greater robustness compared to single task learning methods. Motivated by these findings we investigate II. RELATED WORK ON SUPERVISED SPEECH ACTIVITY the utility of multi-task learning for SAD. To the best of our DETECTION knowledge, the work in this study is the first to consider multi- In supervised SAD, machine learning algorithms are trained task learning for SAD. Specifically, we attain joint predictions on annotated audio data to discriminate speech from non- of the frame-wise speech/ non-speech classification along speech segments. Several prior works have focused on find- with the next audio segment through the proposed multi-task IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 14, NO. 8, AUGUST 2015 3 Ground Truth Speech Current audio (W ) (θ ) (Δ ) Classifi cation (η ) 1: T Ground truth futureaudio 1:T 1:T 1:T Speech (W ) T+ 1:T+ T {z} Non-Speech LSTM (encoder) Input embedding(C) LSTM LSTM LSTM LSTM Predicted Speech Classifi cations Predicted futureaudio (η̂ ) 1:T (W ) T+ 1:T+ T Speech Non-Speech LSTM LSTM LSTM LSTM FC(1) FC(1) softmax softmax [true/ fake] [true/ fake] Static Discriminator Temporal Discriminator Fig. 1. Proposed TA-GAN framework: Given the current time  , the model input is a segment containing the T audio frames and the features extracted from this segment (where w is the raw audio of frame t,  denotes the MFCC feature [12] and  denotes the MFCC deltas [13] for the same frame) directly t t t proceeding  and we term this the current segment. The encoder receives audio, MFCC and Deltas of MFCC inputs and embeds this information in an input embedding, C . Using this embedding and a random noise vector z, the classification sequence generator, G , synthesises a frame-wise speech classification sequence for the current time window while the same C and a random noise vector z are used by the audio generator, G , to synthesise the audio signal for the next T frames directly following  which we term the future segment. We utilise two discriminators. The static discriminator, D , receives the current input embeddings and either the synthesised or ground truth speech classification sequences and tries to discriminate between the two. The temporal discriminator, D , receives the current input embeddings and either the synthesised or ground truth future audio segments and learns to classify them, considering the temporal relationships between audio frames within those signals. learning framework. As there doesn’t exist an optimal, off the at hand rather than simply adding together the loss functions shelf loss function for the joint task that we are attaining, we for individual tasks. For instance, in [33] the authors illustrate utilise the GAN learning framework to automatically learn a the utility of GANs for video based action prediction while loss function for these tasks. synthesising future frame representations, and the authors in [34] showed that this process is highly beneficial for mitigating We exploit the task-specific loss-function learning capability the errors due to variation of view angles in gait recognition of the GAN framework to automatically learn a custom loss through view synthesis. function [30]–[33] that facilitates these two tasks . The merit of this approach is that it allows us to learn a highly non- For benefit of the readers who may be unfamiliar with linear loss, in contrast to a linear loss like cross entropy, to GAN we provide a brief introduction. Generative adversarial optimally capture the underlying semantics of the process. This networks fall within the family of generative models. The Gen- custom loss function learning capability of GANs is highly erator (G) learns a mapping from a random noise vector z to beneficial in the multi-task learning setting, as it allows us to an output y; G : z ! y [35]. An extension to this basic model learn a custom loss function that accounts for all the tasks is proposed in [7] where the authors propose a conditional IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 14, NO. 8, AUGUST 2015 4 GAN, which learns a mapping from an observed input x and minimises the sum of the squared differences between the random noise vector z, to output y, G : fz; xg ! y. This synthesised output and ground truth data [47] while the L extension allows the model to learn a conditional mapping loss function minimises the sum of the absolute differences between the current input and the output. between the synthesised output and ground truth data [48]. In GANs partake in a two-player adversarial game where the [39] the authors demonstrate that L loss is more effective in generator, G, tries to fool the discriminator, D, with synthe- penalising discontinuities between nearby frames compared to sised outputs while D tries to identify them. This objective, the L loss. Motivated by these findings we utilise L as our 1 2 in terms of the conditional GAN, can be written as, regularisation mechanism. min max E [log(D(x; y))]+E [log(1(D(x; G(x; z))))]; x;y x;z IV. A RCHITECTURE G D (1) The proposed architecture is inspired by the success of where D tries to maximise this objective while G tries to multi-task learning over single task learning methods in nu- minimise it. Hence there exists a dual between G and D, merous speech related areas [4]–[6]. We design our auxiliary through which the GAN framework learns a custom loss task of predicting the next audio segment to facilitate our function for the task at hand. It should be noted that we primary goal of speech/non-speech classification via capturing do not explicitly define the loss of G. The discriminator, D, broader context of the input audio segment than just relying is the loss function for the G, which is a neural network on the input itself. Rather than using hand engineered loss approximating the loss. Therefore, a custom loss function is function for the two tasks we utilise GAN framework to learned through the adversarial learning process. For further automatically learn a custom loss function that facilitates both information regarding the GAN learning process we refer the tasks. readers to [7], [35]. The proposed approach is shown in Fig. 1. Inputs are GANs are extensively applied for tasks such as image-to- E processed by the encoder, f , which embeds this information image synthesis [7], [36]–[38], video synthesis [39], [40] and into a vector. We implemented the encoding function f using speech enhancement [8], [41], [42], but seldom for SAD. To a single LSTM cell. Using this embedding, the generator, the best of our knowledge, no prior work has applied GANs G , synthesises a speech activity classification sequence while for the SAD task. Most GAN related works have focused on G synthesises the future audio signal (see Sec. IV-A). We using static inputs such as images [7], [36]–[38], while only utilise two discriminators, a static discriminator (see Sec. a few have addressed temporal changes in the input data. In IV-B) and a temporal discriminator (see Sec. IV-C) where [43], [44] the authors address this by directly incorporating the former considers individual elements in the sequence the time axis in the input and output. For instance, in [43] when performing the adversarial classification, and the latter the authors propose a temporal generator while Yu et. al [44] preserves the temporal relationships between audio frames of propose a sequence generator that learns a stochastic policy. the outputs. The overall objective of the combined model is However, neither of these works have considered a framework presented in Sec. IV-D. that processes individual frames while also considering the Motivated by [12] we consider a combination of input temporal relationships between them. features. Let the input, X , be, Xie et. al [45] address this issue through a dual discriminator X = [(w ; w ; : : : ; w ); ( ;  ; : : : ;  ); ( ;  ; : : : ;  )]; 1 2 T 1 2 T 1 2 T architecture. However, they have engineered the temporal loss (2) to consider the velocity of consecutive frames, and hence this where w is the raw audio of the frame t,  denotes the t t cannot be directly applied for speech processing. MFCC feature [12] and  denotes the MFCC deltas [13] for In our work we exploit the merits of the GAN learning t the same frame. framework to automatically learn a loss function for synthe- sising highly indistinguishable data and synthesise both the A. Generators speech activity classifications for the set of individual input Given an input X , we first pass it through an encoding frames as well as the input signal in the next time frame. This function, f , which generates an embedding such that, allows us to learn the context of the input audio segment. In the context of computer vision, L and L losses have 1 2 C = f (X ); (3) been extensively coupled with the adversarial GAN loss to alleviate the static pixel-wise loss between the synthesised where C = [c ; c ; : : : ; c ; : : : ; c ]. Using this input embed- 1 2 t T output and the ground truth data [7], [46]. The L loss ding, C , and a noise vector, z, the generator, G , synthesises 2 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 14, NO. 8, AUGUST 2015 5 a speech classification sequence,  ^ = [ ^ ;  ^ ; : : : ;  ^ ; : : : ;  ^ ], sub-sequences. This objective can be written as, 1 2 t T classifying each frame in X , while G synthesises the future w w V = min max E[log(D (c ; w ))] audio signal, w ^ = [w ^ ; w ^ ; : : : ; w ^ ], for the next time T +1 T +2 T +T 1:t t+1:t+t w w G D t=1 window. This can be written as, w w + E[log(1 (D (c ; G (c ; z))))] (7) 1:t 1:t ^ = G (C; z); (4) t=1 and w w 2 + jjw G (c ; z)jj ; w t+1:t+t 1:t w ^ = G (C; z): (5) t=1 E w Predicting the raw signal, rather than MFCC’s or other where w = (w : : : w ), c = f (X ) and  is a hyper- 1:t 1 t 1:t 1:t parameter controlling the contribution from the L loss. features, allows us to enforce temporal constraints in the 2 We would like to emphasise the fact that utilising the above discriminator and preserve the original characteristics of the formulation, the static discriminator provides frame-wise true/fake input signal. decisions while the temporal discriminator provides decisions for time-windows of different frame lengths. B. Static Discriminator D. Complete Model We combine the objectives in Equations 6 and 7 to obtain the The static discriminator, D , receives the current input em- objective for the proposed TA-GAN, beddings and the ground truth speech classification sequence, , and learns to classify it as real while G tries to synthesise V = V + V : (8) a classification sequence,  ^, which is not easily distinguishable It can be seen that for the individual losses V and V there exist from the real sequences. This objective can be written as, contributions from the adversarial losses which occur due to the dual between G and D , and G and D . As shown in Equations 6 V = min max E[log(D (c ;  ))] t t G D and 7, the generators G and G try to minimise these loss values t=1 while discriminators D and D try to maximise them. Hence it T T X X + E[log(1 (D (c ; G (c ; z))))] +  jj G (c ; z)jj ; can be concluded that the overall loss, V , of the proposed TA- t t t t t=1 t=1 GAN is automatically learned through the proposed framework by (6) considering the task at hand. where we add an additional L loss to regularise the process and  is a hyper-parameter controlling the contribution from V. EVALUATIONS the L loss. A. Datasets The proposed Temporarily-Aware GAN (TA-GAN) framework is evaluated on four popular SAD benchmarks, namely, HAVIC [1], C. Temporal Discriminator AMI Meeting corpus [50], NIST OpenKWS’13 [51], and NIST OpenSAT’ 17 [2]. The details of the datasets and the evaluation The objective in Eq. 6 is shown to be highly effective protocols are summarised below. for generating realistic static outputs considering the elements 1) HAVIC: HAVIC (the Heterogeneous Audio Visual Internet of the sequence individually [49]. However, it discards the Collection) Pilot Transcription [1] is comprised of approximately 72 temporal coherence as the generator and the discriminator hours of user-generated videos with transcripts based on the English consider each frame individually [49]. Even though this be- speech audio extracted from the videos. The transcription files contain the type of the audio segment annotated for speech, music, noise and haviour is acceptable when considering the frame-wise speech singing segments [52]. We choose music and noise segments as non- classification sequence, it is suboptimal when considering the speech and rest of the segments as speech. Due to the unavailability future audio output. Inspired by [44], [45], [49] we introduce a of standard training/ testing splits we randomly split 70% of the data temporal discriminator, D , which also preserves the temporal for training, 20% for testing and 10% for validation. As the evaluation relationships between audio frames of the output. We consider metric we measure NIST OpenSAD Detection Cost Function (DCF), different sub sequences of the generated sequences  ^ and DCF = 0:75 P + 0:25 P ; (9) miss fa w ^, and generate the true/ fake classification through the discriminator considering these sub-sequences. Hence it forces where P denotes miss probability and P denotes the probabil- miss fa the discriminator to consider the temporal accordance of these ity of false alarms. IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 14, NO. 8, AUGUST 2015 6 w  w 2) AMI Meeting corpus: This dataset consists of 100 hours of generators G and G . However the two discriminators D and D recordings collected across three different meeting rooms. It offers a are updated individually. Hyperparameters  and  are evaluated challenging SAD setting as audio data is from both non-native and experimentally by changing the respective hyper-parameter while native English speakers. Similar to [27] we use the Frame Error Rate holding the rest of the parameters constant, and are set to 30 and (FER) metric to evaluate the performance. Training testing splits are 25 respectively. Changes in FER against  and  are shown in as defined in [50]. Fig. 2. The implementation of the proposed TA-GAN is completed 3) NIST OpenSAT’ 17: We also utilise the public safety with Keras [57] and Theano [58]. communications (PSC) corpus from NIST OpenSAT 2017 for our evaluations [2], which is a standard split in NIST OpenSAT 2017 and is constructed using the audio data from Sofa Super Store Fire (SSSF) dispatcher that occurred on June 18, 2007 in Charleston, South Carolina. This data consisted of audio logs in English from real fire-response operational data and is rich in naturalistic distor- tions including land-mobile-radio transmission effects, speech under cognitive and physical stress, speaking with significant background noise (Lombard effect), varying background-noise types and levels, and varying background decibel levels, [2], [53], [54]. The data is 1 3 provided as 16-bit at 8 kHz sampling rate . Due to the unavailability of ground truth evaluation labels we use the six audio recordings 00 20100 40200 60300 800 40 100500 1200 in the development data which constitute approximately 30 minutes Length ofmemory worth audio recordings. Due to this limited size we utilise this dataset (a)  vs FER only under cross database evaluations (see Sec. V-D) where we use this dataset only for testing (i.e it is not used for training the models). Following [53] we measure the DCF metric which is evaluated using Eq. 9. 18 4) NIST OpenKWS’13: To demonstrate the robustness of TA- GAN for different languages we evaluate the performance using Vietnamese, Pashto, Turkish and Tagalog corpuses from the IARPA 12 91 Babel dataset [51] . We evaluate the system using the FER metric as in [27]. B. Implementation Details We use a sliding window [8] to sample 1 second segments 0 200 400 600 800 1000 50 1200 60 0 10 20 30 40 from the raw audio every 500ms (with 50% overlap). We extract Length ofmemory MFCC features with 13 cepstral coefficients and the delta features (b)  vs FER considering the immediately preceding 2 frames and the next 2 frames using a frame size of 25 ms, sampled at a frame rate of 100 fps. Fig. 2. Evaluation of hyper-parameters using the validation set of AMI Similar to [55] inputs are normalised to a range 0-1, and no other Meeting corpus. We set  = 30 and  = 25. speech-specific preprocessing is performed. At test time we slide the window, without overlap, over the whole test utterance and generate the relevant speech classification sequence using G . It should be C. Results noted that similar to [27] we generate speech/ non- speech predictions for each frame within the 1 second segment. Hence at test time there TABLE I EVALUATIONS ON THE HAVIC DATASET [1]. DCF DENOTES NIST is only a 1 second framing delay at the beginning, after which the O PENSAD DETECTION COST F UNCTION (DCF) AS DEFINED IN EQ. 9. window can be shifted in small increments to produce predictions in real time. Method DCF We implemented the encoding function, f , using a single LSTM MLP - Gelley et. al [27] 8.10 cell, and the two generators, G and G , are implemented with Basic RNN - Gelley et. al [27] 6.38 two separate LSTM blocks, each with two cells of LSTMs. For CG-LSTM - Gelley et. al [27] 5.10 all LSTMs the hidden state size is set to 300 units. For training, ACAM -Kim et al - [28] 4.95 we use the Adam [56] optimiser, a learning rate of 0.005, and TA-GAN 2.53 500 epochs with a batch size of 600, alternating between epochs of D and G. We train the input encoder jointly with the two Evaluations on the HAVIC dataset are presented in Tab. I, and AMI Meeting corpus and NIST OpenKWS’13 corpora are presented We obtained the data from https://catalog.ldc.upenn.edu/ and the LDC in Tab. II. For better comparisons we provide evaluations for the Catalog ID is LDC2017E12 CG-LSTM and Basic RNN and MLP methods in [27], and the Vietnamese IARPA-babel107b-v0.7, Pashto IARPA-babel104b- v0.4b,Turkish IARPA-babel105b-v0.5, Tagalog IARPA-babel106-v0.2g Adaptive Context Attention Model (ACAM) proposed in [28]. These Average altitude error(feet) Average altitude error(feet) FER FER IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 14, NO. 8, AUGUST 2015 7 TABLE II cross database evaluations with the evaluations presented within EVALUATIONS ON THE AMI M EETING [50] AND O PENKWS’13 CORPUS brackets, when the models are trained and tested on the same dataset, [51]. FER DENOTES F RAM E ERROR RATE AS DEFINED IN [27]. we only observe a slight reduction in the performance of the proposed approach when it is not tuned on the training set of the specific FER Method AMI Meeting OpenKWS’13 dataset. However, the performance reductions in the baselines are MLP - Gelley et. al [27] 6.84 6.29 quite substantial. Basic RNN - Gelley et. al [27] 6.55 6.24 CG-LSTM - Gelley et. al [27] 5.93 5.76 TABLE III ACAM - Kim. et al [28] 5.89 5.66 CROSS DATABASE EVALUATIONS USING NIST OPENSAT’ 17 [2], HAVIC [1], AMI MEETING [50] AND O PENKWS’13 CORPUS [51]. FOR HAVIC, TA- GAN 2.80 2.75 AMI M EETING AND O PENKWS’13 DATASETS W ITHIN BRACKETS WE REPORT THE ERROR RATES WHEN THE M ODEL IS TRAINED AND TESTED ON THE DATABASE INDICATED IN THE “T ESTED O N” COLUMN. two models which were proposed very recently have been able to Error Rate (DCF / FER) attain state-of-the-art results under a supervised SAD setting in the Trained on Tested on CG-LSTM [27] ACAM [28] TA- GAN datasets that we consider. The work of Gelley et. al [27] utilises RNNs NIST OpenSAT’ 17 5.36 4.78 2.53 AMI Meeting HAVIC 7.63 (5.10) 7.08 (4.95) 4.53 (2.53) for modelling the temporal relationships within the input signal and OpenKWS’13 8.17 (5.76) 7.73 (5.66) 4.15 (2.75) demonstrates that directly optimising the SAD loss using the Frame NIST OpenSAT’ 17 5.30 4.42 2.14 Error Rate (FER) produces better results. In [28] Kim. et al exploit an HAVIC OpenKWS’13 7.93 (5.76) 7.65 (5.66) 4.01 (2.75) AMI Meeting 7.87 (5.93) 7.39 (5.89) 4.23 (2.80) attention strategy for learning the context of the speech signal using NIST OpenSAT’ 17 5.51 5.02 3.14 LSTMs for noise robustness in the SAD system. The comparative OpenKWS’13 HAVIC 7.86 (5.10) 7.12 (4.95) 4.81 (2.53) evaluations with these baselines demonstrates the utility of GAN AMI Meeting 8.51 (5.93) 7.71 (5.89) 4.60 (2.80) based learning for the SAD system. In addition to utilising LSTMs for temporal modelling and an attention mechanism for input embedding, We note that the baseline ACAM [28] model utilises a window of 39 frames as the input, w, while the proposed TA-GAN model the proposed model automatically learns a loss function for the utilises 100 frames as the input window. Due to this difference, SAD task. Hence, in contrast to [27], [28], the proposed TA-GAN their performance is not directly comparable. However, in Sec. V-F method has been able to learn a more robust input embedding which we show a further evaluation using different (smaller) window sizes better discriminates the speech segments compared to its counterparts. which illustrates that the proposed TA- GAN model is capable of Furthermore, when comparing the MLP - Gelley et. al [27] system outperforming the baseline models even with smaller input window with Basic RNN - Gelley et. al [27], CG-LSTM - Gelley et. al [27], sizes. recurrent neural network based temporal modelling has been able to further improve the performance over an MLP network. We would like to note that these systems directly optimise the FER loss. In E. Ablation Experiments contrast, using the task-specific loss function learning framework of To better understand the crucial components and sensitivities of GANs and the augmented multi-task learning approach, the proposed the proposed TA-GAN framework, we conduct a series of ablation method has been able to outperform the state-of-the-art methods. experiments. In this experiment, we use the AMI Meeting [50] dataset and compare the TA-GAN model with a series of counterparts defined D. Cross Database Evaluation as follows: To demonstrate the robustness of the proposed method across 1) G (w): Removes the GAN learning framework and G is different languages, accents, and acoustics, we perform a cross- learnt through binary cross entropy loss. This receives only database evaluation where we train the model using the training data the audio input. of one dataset and test that model on the test sets of the rest of the 2) G (w +  + ): Receives audio, MFCC and delta inputs. datasets. 3) G + G (w +  + ): Similar to 2) but additionally predicts The evaluations are presented in Tab. III. Note that when tested on the future audio segment, which is trained using mean square NIST OpenSAT’ 17 [2] and HAVIC datasets [1] we report the NIST error. OpenSAD Detection Cost Function (DCF) whereas for AMI Meeting 4) GAN (w + + )=L : uses the GAN learning framework but and OpenKWS’13 corpus we report the Frame Error Rate (FER). synthesises only the classification sequence. Receives audio, To better demonstrate the merits of the proposed method we train MFCC and delta inputs. Doesn’t utilise L regularisation in the CG-LSTM baseline model defined in [27] and ACAM baseline Eq. 6 model of [28]. For better comparisons for AMI Meeting, HAVIC 5) GAN (w +  + ): Same as above method but with L and OpenKWS’13 datasets, within brackets we report the error rates regularisation. when the model is trained and tested on the database indicated in 6) TAGAN (w): Proposed model that receives only the audio the “Tested On” column. Due to the limited dataset size we do not input and predicts the future audio segment. attempt to train models using the NIST OpenSAT’ 17 dataset. 7) TA GAN (): Proposed model which receives the MFCC When analysing the results it is clear that the proposed GAN based as input and predicts future MFCC distribution. learning framework better captures the discriminative features and is 8) TA GAN (): Receives Deltas of MFCC as input and more robust under cross domain scenarios, better segregating speech predicts future deltas. from non-speech embeddings. This allows the proposed method to 9) TA GAN (w + ): Receives both audio and MFCC inputs achieve superior results compared to the baselines. Comparing the and predicts their future distributions. IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 14, NO. 8, AUGUST 2015 8 10) TA GAN (w + ): Receives both audio and delta inputs SAD task. However, when MFCC features are fused with both audio and predicts their future distributions. and  features we observe improved performance, highlighting that 11) TA GAN ( + ): Receives both MFCC and delta inputs the complementary attributes present in those streams have the ability and predicts their future distributions. to better discriminate speech segments from their counterparts. 12) TAGAN (w + + )=L : Receives audio, MFCC and delta inputs and predicts their future distributions. Doesn’t utilise L F. Impact of input window size regularisation in Eq. 6 and Eq. 7. _ In order to illustrate the impact of the input window size for SAD 13) TA GAN (D ): Replaced the temporal discriminator with accuracy , we perform an additional evaluation on the proposed TA- static discriminator as per Eq. 6, hence, this model contains GAN model using different window sizes: 20, 40, 60, 80, 100, and two static discriminators. 120 frames. In this experiment, we use the AMI Meeting [50] dataset. TABLE IV TABLE V A BLATION MODEL EVALUATIONS ON AMI MEETING [50] DATASET. EVALUATING THE EFFECT OF DIFFERENT W INDOW SIZES ON THE TA-GAN MODEL USING AMI MEETING [50] DATASET. ID Method FER 1) G (w) 9.10 Window Size (in frames) FER 2) G (w +  + ) 7.20 20 5.24 3) (G + G )(w +  + ) 7.12 4) GAN (w +  + )=L 4.73 40 4.41 5) GAN (w +  + ) 4.15 60 3.54 6) TA GAN (w) 3.99 80 3.11 7) TA GAN () 3.72 100 2.80 8) TA GAN () 4.03 120 2.84 9) TA GAN (w + ) 3.60 10) TA GAN (w + ) 3.98 11) TA GAN ( + ) 3.65 Considering these evaluations it is clear that a considerable re- 12) TA GAN (w +  + )=L 3.54 duction in the FER can be achieved when increasing the window 13) TA GAN (D ) 3.31 size from 20 to 100 frames, but no significant gain is observed by Proposed TA-GAN 2.80 increasing it beyond 100 frames. We believe utilising a large window size is essential in the proposed method in order to properly model With the ablation evaluations presented in Tab. IV we can see the context within the given window. Furthermore, comparing these the importance of multi-task learning, the merits of using GAN results to those obtained by ACAM [28] for the AMI Meeting using based automatic loss function learning and the importance of utilised a window size of 39 frames, we observe that the proposed TA-GAN features. model with a smaller window size (i.e 20 frames) has been able to When comparing both non-GAN and GAN based single task SAD achieve better performance than ACAM [28]. methods (ablation model 1-2 and 4-5) with their respective multi-task counterparts (i.e ablation models 3 and 13) we observe a significant G. Qualitative Results contribution for the SAD task through the multi-task learning strategy. Furthermore, when comparing non-GAN based models (1-3) with We randomly selected 100 examples from the AMI Meeting [50] GAN based models (4-13), we observe a significant performance test set and plotted the inputs embeddings, c, for each frame in those boost denoting the merits of task-specific loss function learning. We examples. The model trained on the AMI Meeting training set is used would like to emphasise the fact that this performance increase is generate these embeddings. These embeddings are coloured based on observed for both single-task as well as multi-task models, although the ground truth speech/ non-speech labels. Note that for each frame we observe a further substantial improvement with regards to multi- the encoder generates a 300 dimensional embedding vector. Hence task methods. in order to plot the results in 2D we applied PCA [59] to reduce In addition we observe that the temporal discriminator has been the dimensionality. In Fig. 3 (a) we visualise the input embeddings able to further improve this learning process (see model 13 and learnt through the proposed TA-GAN model. It is clear that the TA-GAN (proposed)). Even though we do not observe a direct model has been successful in learning an embedding space which relationship between the temporal discriminator, which is used for better segregates speech from non-speech than the alternate ablation real/fake validation of the predicted future audio segments and the models, at least in terms of the two directions that capture most SAD task, we notice a significant contribution from this module. This variation determined via PCA. In Fig, 3 (b) and (c) we perform illustrates that via analysing the temporal relationships between audio the same visualisation for two of the ablation models (3 and 5). frames the discriminator gains the ability to guide the generator to Considering Fig. 3 (b) we see that the model fails to learn such a generate realistic outputs. Hence it enforces the input embeddings discriminative embedding space. Furthermore, we would like to point to better identify the temporal context of the inputs, denoting the out that in the proposed model the same input embedding is used to utility of multi-task learning and the importance of the future audio predict the future audio signal as well. The clear separation of the segment prediction task as the auxiliary task of the proposed multi- two classes (i.e speech and non-speech) verifies our hypothesis that task learning framework. jointly predicting the future audio signal for the next time window When comparing different feature combinations present in Tab. IV can improve SAD performance. To further demonstrate this ability in we observe that MFCC features contain more salient attributes for the Fig 3 (c) we visualise the input embeddings learnt through Ablation IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 14, NO. 8, AUGUST 2015 9 model 5 for the same set of examples, where the GAN based model  1 second length classification sequence predictions. It should be only predicts the classification sequence without modelling the future noted that the proposed TA-GAN model has a larger time complexity audio signal. It is clear that the automatic loss function learning due to the joint prediction of both future audio and classification process has contributed to learning discriminative embeddings but sequences. In terms of number of trainable parameters the proposed we observe some areas with overlaps between the speech and non- method contains 48K trainable parameters while the basic RNN and speech embeddings in contrast to the clear segregation in Fig. 3 (a). CG-LSTM methods of [27] have 6K trainable parameters. This further emphasises the importance of the joint learning of both tasks to better capture the discriminative features. VI. CONCLUSION In this paper, we propose a novel multi-task learning framework for Speech Non-Speech speech activity detection, by properly analysing the context of the in- put embeddings and their temporal accordance. We contribute a novel 2 data-driven method to capture salient information from the observed 1 audio segment by jointly predicting the speech activity classification sequence and the audio for the next time frame. Additionally, we introduce a temporal discriminator to enforce these relationships in the synthesised data. Our quantitative evaluations using multiple supervised SAD bench- marks, including NIST OpenSAT’ 17 [2], AMI Meeting [50] 4 2 0 2 4 OpenKWS’13 [51] and HAVIC [1] demonstrated the utility of the (a) TA-GAN proposed multi-task learning framework compared to the single Speech task based supervised SAD baselines. Furthermore, through ablation Non-Speech model evaluations presented Sec. V-E we demonstrate that the auto- matic learning of a loss function specifically considering the task at hand, as opposed to using hand engineered losses, has significantly contributed to the superior performance attained in the proposed multi-task learning framework. In addition, in Tab. IV we provide comparisons regarding systems with and without using the proposed temporal discriminator. The evaluation of the temporal discriminator, which enforces the tempo- 4 2 0 2 4 ral relationships between audio frames of the synthesised outputs, (b) Ablation model 3 (G +G (w + + )) demonstrates the utility of incorporating this intelligence in the discriminator, which guides the generator to generate realistic outputs. Speech Speech Non-Speech Non-Speech With empirical evaluations we illustrate that the future audio segment prediction auxiliary task contributes to augment the performance of the SAD task, demonstrating the utility of multi-task learning and the importance of the future audio segment prediction task for learning the context of the input embeddings. To better demonstrate the robustness of the proposed framework we conducted a cross-database evaluation where we train the model using a seperate dataset and tested on another dataset. This experi- 4 2 0 2 4 ment revealed that the proposed multi-task learning framework learns (c) Ablation model 5 (GAN (w +  + )) better discriminative features which are more robust across multiple datasets, compared to the current state-of-the-art supervised SAD Fig. 3. Visualisation of input embeddings, C , for proposed TA-GAN and models. We would like to emphasise that these evaluated datasets Ablation models 3 (G + G (w +  + )) and 5 (GAN (w +  + )) are of different languages, accents, and acoustics and the proposed method exhibits 37-52% relative gain over the best alternate approach (ACAM [28]) when evaluated with NIST OpenSAT’ 17 [2]. H. Time Complexity In order to demonstrate that the proposed TA-GAN approach is ACKNOWLEDGMENT suitable for real-time use, we benchmarked the time complexity of This research was supported by an Australian Research Council TA-GAN on the test set of AMI Meeting corpus dataset on a single (ARC) Discovery grant DP140100793. core of an Intel Xeon E5-2680 2.50GHz CPU and the TA-GAN model runs at 5.35  faster than real time. The proposed system was REFERENCES able to generate 100 predictions (i.e, 100 seconds of audio) where the output is both 100  1 second length classification sequence [1] LDC, “Havic pilot transcription,” 2016. [Online]. Available: https: predictions and 100  1 second length future audio predictions, in //catalog.ldc.upenn.edu/LDC2016V01 18.70 seconds. In a similar setting both basic RNN and CG-LSTM [2] NIST, “Nist pilot speech analytic technologies evaluation, opensat,” methods of [27] take approximately 8.56 seconds to generate 100 2017. IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 14, NO. 8, AUGUST 2015 10 [3] N. Chen, Y. Qian, and K. Yu, “Multi-task learning for text-dependent [22] L. Ferrer, M. Graciarena, and V. Mitra, “A phonetically aware system for speaker verification,” in Sixteenth annual conference of the international speech activity detection,” in Acoustics, Speech and Signal Processing speech communication association, 2015. (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. [4] G. Pironkov, S. Dupont, and T. Dutoit, “Multi-task learning for speech 5710–5714. recognition: an overview,” in Proceedings of the 24th European Sympo- [23] T. Kinnunen, A. Sholokhov, E. Khoury, D. Thomsen, M. Sahidullah, and sium on Artificial Neural Networks (ESANN), vol. 192, 2016. Z.-H. Tan, “Happy team entry to nist opensad challenge: A fusion of [5] Z. Chen, S. Watanabe, H. Erdogan, and J. R. Hershey, “Speech en- short-term unsupervised and segment i-vector based speech activity de- tectors,” Proceedings of the 17th Annual Conference of the International hancement and recognition using multi-task learning of long short-term memory recurrent neural networks,” in Sixteenth Annual Conference of Speech Communication Association, 2992-2996, 2016. the International Speech Communication Association, 2015. [24] I. Hwang, H.-M. Park, and J.-H. Chang, “Ensemble of deep neural [6] N. K. Kim, J. Lee, H. K. Ha, G. W. Lee, J. H. Lee, and H. K. Kim, networks using acoustic environment classification for statistical model- “Speech emotion recognition based on multi-task learning using a con- based voice activity detection,” Computer Speech & Language, vol. 38, pp. 1–12, 2016. volutional neural network,” in 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). [25] A. Sholokhov, M. Sahidullah, and T. Kinnunen, “Semi-supervised IEEE, 2017, pp. 704–707. speech activity detection with an application to automatic speaker [7] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation verification,” Computer Speech & Language, vol. 47, pp. 132–156, 2018. with conditional adversarial networks,” arXiv preprint, 2017. [26] T. Hughes and K. Mierle, “Recurrent neural networks for voice activity [8] S. Pascual, A. Bonafonte, and J. Serra, “Segan: Speech enhancement detection,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 7378–7382. generative adversarial network,” InterSpeech, 2017. [27] G. Gelly and J.-L. Gauvain, “Optimization of rnn based speech activity [9] M. K. Leonard, K. E. Bouchard, C. Tang, and E. F. Chang, “Dynamic encoding of speech sequence probability in human temporal cortex,” detection,” IEEE/ACM Transactions on Audio, Speech, and Language Journal of Neuroscience, vol. 35, no. 18, pp. 7203–7214, 2015. Processing, vol. 26, no. 3, pp. 646–656, 2018. [10] A. Heinrich, R. P. Carlyon, M. H. Davis, and I. S. Johnsrude, “Illusory [28] J. Kim and M. Hahn, “Voice activity detection using an adaptive context vowels resulting from perceptual continuity: a functional magnetic attention model,” IEEE Signal Processing Letters, vol. 25, no. 8, pp. 1181–1185, 2018. resonance imaging study,” Journal of cognitive neuroscience, vol. 20, no. 10, pp. 1737–1752, 2008. [29] F. Tao and C. Busso, “End-to-end audiovisual speech activity detection [11] C. Darwin, “Listening to speech in the presence of other sounds,” with bimodal recurrent neural models,” Speech Communication, vol. Philosophical Transactions of the Royal Society of London B: Biological 113, pp. 25–35, 2019. Sciences, vol. 363, no. 1493, pp. 1011–1021, 2008. [30] T. Fernando, S. Denman, S. Sridharan, and C. Fookes, “Memory augmented deep generative models for forecasting the next shot location [12] I. McCowan, D. B. Dean, M. L. McLaren, R. J. Vogt, and S. Sridharan, in tennis,” IEEE Transactions on Knowledge and Data Engineering, “The delta-phase spectrum with application to voice activity detection and speaker recognition,” IEEE Transactions on Audio, Speech, and 2019. Language Processing, vol. 19, no. 7, pp. 2026–2038, 2011. [31] A. Wang, “Application of generative adversarial network on image [13] S. Zhang, S. Zhang, T. Huang, and W. Gao, “Speech emotion recognition style transformation and image processing,” Ph.D. dissertation, UCLA Electronic Theses and Dissertations, 2018. using deep convolutional neural network and discriminant temporal [32] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther, pyramid matching,” IEEE Transactions on Multimedia, vol. 20, no. 6, pp. 1576–1590, 2018. “Autoencoding beyond pixels using a learned similarity metric,” arXiv [14] T. Ng, B. Zhang, L. Nguyen, S. Matsoukas, X. Zhou, N. Mesgarani, preprint arXiv:1512.09300, 2015. K. Vesely, ` and P. Matejka, ˇ “Developing a speech activity detection [33] H. Gammulle, S. Denman, S. Sridharan, and C. Fookes, “Multi-level system for the darpa rats program,” in Thirteenth Annual Conference sequence gan for group activity recognition,” in Asian Conference on Computer Vision. Springer, 2018, pp. 331–346. of the International Speech Communication Association, 2012. [15] G. Saon, S. Thomas, H. Soltau, S. Ganapathy, and B. Kingsbury, “The [34] Y. He, J. Zhang, H. Shan, and L. Wang, “Multi-task gans for view- ibm speech activity detection system for the darpa rats program.” in specific feature learning in gait recognition,” IEEE Transactions on Interspeech, 2013, pp. 3497–3501. Information Forensics and Security, vol. 14, no. 1, pp. 102–113, 2018. [16] S. Thomas, G. Saon, M. Van Segbroeck, and S. S. Narayanan, “Im- [35] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, provements to the ibm speech activity detection system for the darpa S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in rats program,” in Acoustics, Speech and Signal Processing (ICASSP), Advances in neural information processing systems, 2014, pp. 2672– 2015 IEEE International Conference on. IEEE, 2015, pp. 4500–4504. 2680. [17] M. Graciarena, A. Alwan, D. Ellis, H. Franco, L. Ferrer, J. H. Hansen, [36] Z. Yi, H. R. Zhang, P. Tan, and M. Gong, “Dualgan: Unsupervised dual A. Janin, B. S. Lee, Y. Lei, V. Mitra et al., “All for one: feature learning for image-to-image translation.” in ICCV, 2017, pp. 2868–2876. combination for highly channel-degraded speech activity detection.” in [37] M.-Y. Liu, T. Breuel, and J. Kautz, “Unsupervised image-to-image INTERSPEECH. Citeseer, 2013, pp. 709–713. translation networks,” in Advances in Neural Information Processing [18] M.-W. Mak and H.-B. Yu, “A study of voice activity detection techniques Systems, 2017, pp. 700–708. for nist speaker recognition evaluations,” Computer Speech & Language, [38] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro, vol. 28, no. 1, pp. 295–313, 2014. “High-resolution image synthesis and semantic manipulation with con- [19] J. W. Shin, J.-H. Chang, and N. S. Kim, “Voice activity detection ditional gans,” computer vision and pattern recognition, 2017. based on statistical models and machine learning approaches,” Computer [39] D. Berthelot, T. Schumm, and L. Metz, “Began: boundary equilib- Speech & Language, vol. 24, no. 3, pp. 515–530, 2010. rium generative adversarial networks,” arXiv preprint arXiv:1703.10717, [20] S. S. Kumar and K. S. Rao, “Voice/non-voice detection using phase of 2017. zero frequency filtered speech signal,” Speech Communication, vol. 81, [40] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, G. Liu, A. Tao, J. Kautz, and B. Catan- pp. 90–103, 2016. zaro, “Video-to-video synthesis,” arXiv preprint arXiv:1808.06601, [21] T. Drugman, Y. Stylianou, Y. Kida, and M. Akamine, “Voice activity 2018. detection: Merging source and filter-based information,” IEEE Signal [41] C. Donahue, B. Li, and R. Prabhavalkar, “Exploring speech enhancement Processing Letters, vol. 23, no. 2, pp. 252–256, 2016. with generative adversarial networks for robust speech recognition,” in IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 14, NO. 8, AUGUST 2015 11 2018 IEEE International Conference on Acoustics, Speech and Signal Tharindu Fernando received his BSc (special de- Processing (ICASSP). IEEE, 2018, pp. 5024–5028. gree in computer science) from the University of [42] M. H. Soni, N. Shah, and H. A. Patil, “Time-frequency masking-based Peradeniya, Sri Lanka and his PhD from Queens- speech enhancement using generative adversarial network,” 2018 IEEE land University of Technology (QUT), Australia, International Conference on Acoustics, Speech and Signal Processing respectively. He is currently a Postdoctoral Research (ICASSP), 2018. Fellow in the SAIVT Research Program of School [43] M. Saito, E. Matsumoto, and S. Saito, “Temporal generative adversarial Electrical Engineering and Computer Science at nets with singular value clipping,” in IEEE International Conference on QUT. His research interests focus mainly on human Computer Vision (ICCV), vol. 2, no. 3, 2017, p. 5. behaviour analysis and prediction. [44] L. Yu, W. Zhang, J. Wang, and Y. Yu, “Seqgan: Sequence generative adversarial nets with policy gradient.” in AAAI, 2017, pp. 2852–2858. [45] Y. Xie, E. Franz, M. Chu, and N. Thuerey, “tempogan: A temporally coherent, volumetric gan for super-resolution fluid flow,” ACM Trans- actions on Graphics, Vol. 37, No. 4, Article 95, 2018. Sridha Sridharan has a BSc (Electrical Engineer- [46] C. Dong, C. C. Loy, K. He, and X. Tang, “Image super-resolution using ing) degree and obtained a MSc (Communication deep convolutional networks,” IEEE transactions on pattern analysis Engineering) degree from the University of Manch- and machine intelligence, vol. 38, no. 2, pp. 295–307, 2016. ester, UK and a PhD degree from University of New [47] J. Steinier, Y. Termonia, and J. Deltour, “Smoothing and differentiation South Wales, Australia. He is currently with the of data by simplified least square procedure,” Analytical Chemistry, Queensland University of Technology (QUT) where vol. 44, no. 11, pp. 1906–1909, 1972. he is a Professor in the School Electrical Engineering [48] P. Bloomfield and W. L. Steiger, Least absolute deviations: Theory, and Computer Science. Professor Sridharan is the applications and algorithms. Springer, 1984. Leader of the Research Program in Speech, Audio, [49] A. Bansal, S. Ma, D. Ramanan, and Y. Sheikh, “Recycle-gan: Unsu- Image and Video Technologies (SAIVT) at QUT, with strong focus in the pervised video retargeting,” European Conference on Computer Vision, areas of computer vision, pattern recognition and machine learning. He has published over 600 papers consisting of publications in journals and [50] J. Carletta, “Announcing the ami meeting corpus,” The ELRA Newsletter, in refereed international conferences in the areas of Image and Speech vol. 11, no. 1, pp. 3–5, 2006. technologies during the period 1990-2019. During this period he has also [51] M. Harper, “Iarpa babel program,” 2014. graduated 75 PhD students in the areas of Image and Speech technologies. [52] I. Himawan, M. H. Rahman, S. Sridharan, C. Fookes, and A. Kanaga- Prof Sridharan has also received a number of research grants from various sundaram, “Investigating deep neural networks for speaker diarization funding bodies including Commonwealth competitive funding schemes such in the dihard challenge,” in 2018 IEEE Spoken Language Technology as the Australian Research Council (ARC) and the National Security Science Workshop (SLT). IEEE, 2018, pp. 1029–1035. and Technology (NSST) unit. Several of his research outcomes have been [53] H. Dubey, A. Sangwan, and J. H. Hansen, “Leveraging frequency- commercialised. dependent kernel and dip-based clustering for robust speech activity detection in naturalistic audio streams,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 11, pp. 2056– 2071, 2018. [54] F. Byers, F. Byers, and O. Sadjadi, 2017 Pilot Open Speech Analytic Technologies Evaluation (2017 NIST Pilot OpenSAT): Post Evaluation Mitchell McLaren , Ph.D., is a senior computer Summary. US Department of Commerce, National Institute of Standards scientist in SRI International’s Speech Technology and Technology, 2019. and Research (STAR) Laboratory. His research in- [55] A. Nagrani, J. S. Chung, and A. Zisserman, “Voxceleb: a large-scale terests include speaker and language identification, speaker identification dataset,” arXiv preprint arXiv:1706.08612, 2017. as well as other biometrics such as face recognition. [56] D. Kinga and J. B. Adam, “A method for stochastic optimization,” in Prior to joining SRI in 2012, Mitchell was a post- ICLR, vol. 5, 2015. doctoral researcher and the University of Nijmegen, [57] F. Chollet et al., “Keras (2015),” 2017. The Netherlands where he focused on speaker and [58] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Des- face identification on the Bayesian Biometrics for jardins, J. Turian, D. Warde-Farley, and Y. Bengio, “Theano: A cpu and Forensics (BBfor2) project, funded by Marie Curie Action. His Ph.D. in gpu math compiler in python,” in Proc. 9th Python in Science Conf, speaker identification is from the Queensland University of Technology vol. 1, 2010. (QUT), Brisbane, Australia. [59] S. Wold, K. Esbensen, and P. Geladi, “Principal component analysis,” Chemometrics and intelligent laboratory systems, vol. 2, no. 1-3, pp. 37–52, 1987. Darshana Priyasad is a PhD student at Queensland University of Technology, Australia. He received his Bachelor of Science in Engineering, specialised in Integrated Computer Engineering with first class honours from the University of Moratuwa, Sri Lanka. His research interests include deep learning, computer and machine vision. IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 14, NO. 8, AUGUST 2015 12 Simon Denman received a BEng (Electrical), BIT, and PhD in the area of object tracking from the Queensland University of Technology (QUT) in Brisbane, Australia. He is currently a Senior Re- search Fellow with the Speech, Audio, Image and Video Technology Laboratory at QUT. His active areas of research include intelligent surveillance, video analytics, and video-based recognition. Clinton Fookes (SM’06) received his B.Eng. (Aerospace/Avionics), MBA, and Ph.D. degrees from the Queensland University of Technology (QUT), Australia. He is currently a Professor and Head of Discipline for Vision and Signal Process- ing within the Science and Engineering Faculty at QUT. He actively researchers across computer vision, machine learning, and pattern recognition areas. He serves on the editorial board for the IEEE Transactions on Information Forensics & Security. He is a Senior Member of the IEEE, an Australian Institute of Policy and Science Young Tall Poppy, an Australian Museum Eureka Prize winner, and a Senior Fulbright Scholar. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Statistics arXiv (Cornell University)

Temporarily-Aware Context Modelling using Generative Adversarial Networks for Speech Activity Detection

Loading next page...
 
/lp/arxiv-cornell-university/temporarily-aware-context-modelling-using-generative-adversarial-A2HiB3dRRf

References (69)

ISSN
2329-9290
eISSN
ARCH-3347
DOI
10.1109/TASLP.2020.2982297
Publisher site
See Article on Publisher Site

Abstract

IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 14, NO. 8, AUGUST 2015 1 Temporarily-Aware Context Modelling using Generative Adversarial Networks for Speech Activity Detection Tharindu Fernando, Student Member, IEEE, Sridha Sridharan, Life Senior Member, IEEE, Mitchell McLaren, Darshana Priyasad, Member, IEEE, Simon Denman, Member, IEEE, and Clinton Fookes, Senior Member, IEEE. Abstract—This paper presents a novel framework for Speech that significant improvements in the accuracy of Automatic Activity Detection (SAD). Inspired by the recent success of multi- Speech Recognition (ASR) can be obtained by combining the task learning approaches in the speech processing domain, we ASR task with context recognition and gender classification propose a novel joint learning framework for SAD. We utilise as auxiliary tasks, as opposed to performing ASR alone. generative adversarial networks to automatically learn a loss Furthermore, the evaluations in [5], [6] suggested that methods function for joint prediction of the frame-wise speech/ non- speech classifications together with the next audio segment. In learned using the multi-task learning paradigm are not only order to exploit the temporal relationships within the input robust when evaluated in cross database scenarios, but also signal, we propose a temporal discriminator which aims to learn powerful and more discriminative features to facilitate ensure that the predicted signal is temporally consistent. We both tasks. evaluate the proposed framework on multiple public benchmarks, including NIST OpenSAT’ 17, AMI Meeting and HAVIC, where Inspired by these findings, we exploit the power of Gen- we demonstrate its capability to outperform state-of-the-art erative Adversarial Networks (GAN) [7], [8] to accurately SAD approaches. Furthermore, our cross-database evaluations perform speech/non-speech classification together with an demonstrate the robustness of the proposed approach across auxiliary task. In choosing the appropriate auxiliary task for different languages, accents, and acoustic environments. SAD we draw inspiration from a conclusion in the field of Index Terms—Speech Activity Detection, Generative Adversar- neuroscience that humans recognise speech in noisy conditions ial Networks, Context Modelling. through the awareness of the next segment of speech which is most likely to be heard [9], [10]. We therefore chose the I. INTRODUCTION prediction of the next audio segment as the auxiliary task as PEECH Activity Detection (SAD) plays a pivotal role in it also complements the primary SAD task via learning the many speech processing systems. Despite the consistent context of the input audio embedding. Through the prediction progress attained in this subject, the problem is far from of next audio segment our model tries to learn a contextual being solved as evidenced by evaluation results across the mapping between the input audio segments and the next vast variety of acoustic conditions featured in challenging segment which is likely to be heard. benchmarks such as HAVIC [1] and NIST OpenSAT’ 17 [2]. Even though the final speech activity decision is agnostic to Our work is inspired by recent observations in speech the actual content of speech, there are reasons to conjecture processing where multi-task learning approaches have shown that the SAD accuracy could be improved by making use to outperform single task learning methods in numerous areas, the semantic information of speech. It is known that humans including, speech synthesis [3], speech recognition [4], speech make use the semantic information to understand speech that enhancement [5], and speech emotion recognition [6]. For is affected significantly by noise [9], [10]. In [11] the authors instance, the seminal work by Pironkov et. al [4] demonstrated demonstrated that our inferior-frontal cortex predicts what someone is likely to hear next even before the actual sound T. Fernando, S. Sridharan, D. Priyasad, S. Denman, C. Fookes are with Speech Research Lab, SAIVT, Queensland University of Technology, reaches the superior temporal gyrus, allowing us to separate Australia. M. McLaren is with Speech Technology and Research Laboratory of SRI International. noise from what is actually spoken. One of our aims in this paper is to investigate how and to what extent we could E-mail: t.warnakulasuriya@qut.edu.au Manuscript received improve the performance of SAD if we were to use semantic arXiv:2004.01546v1 [eess.AS] 2 Apr 2020 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 14, NO. 8, AUGUST 2015 2 information to predict the next speech segment. Current SAD ing better discriminative features for supervised classification methods simply classify whether a sample is speech or non- [14]–[20]. For instance in [17] the authors suggest a com- speech, without paying attention to the temporal context. bination of MFCCs and Gabor features. In [21] the authors Even though the current state-of-the-art SAD systems extract suggest the use of source and filter based features and perform features from a sliding window surrounding the event frame a score level fusion. In [22] the authors propose the use of of interest, they consider the frame as an isolated event and bottleneck features for predicting the speech and non-speech do not consider the entire sequence when detecting the speech posteriors. In [23] the authors fuse six SAD systems, two activity. We show in this paper that through the prediction of supervised and four unsupervised, for the NIST-Open-SAD- the next audio segment by exploiting the task-specific loss- 2015 challenge. Supervised systems utilise labelled speech function learning capability of the GAN framework, we can and non-speech segments for training the SAD while the improve SAD accuracy by a significant amount. unsupervised methods utilise a fixed or adaptive threshold for The proposed architecture is shown in Fig. 1. The model the SAD task. The work of Hwang et. al [24] proposed the utilises audio, Mel-Frequency Cepstral Coefficients (MFCC) utilisation of an ensemble of deep neural networks trained on and Deltas of MFCC as the inputs and encodes these inputs different noise types for supervised SAD. In a different line of into an encoded representation, C . The generators receive this work, [25] proposed a semi-supervised learning approach for input embedding, C , and a noise vector, z, as the inputs. We GMM training, using power normalized cepstral coefficients, perceptual linear prediction coefficients, and frequency domain utilise two generators, G , for synthesising the frame-wise speech/ non-speech classifications and G , which synthesises linear prediction as features in addition to MFCCs. the audio signal for the next time window. It should be noted However, none of the above stated deep learning systems that in Fig. 1 the generators, G and G are denoted as two have explicitly modelled the temporal relationship between separate LSTM blocks, each with two cells of LSTMs.The audio frames in the input signal when performing SAD. static discriminator, D , receives the current input embeddings One of the earliest attempts to leverage temporal modelling and either the synthesised or ground truth speech classification in SAD was based on Recurrent Neural Networks (RNNs) [26] sequences and tries to discriminate between the two. The where the authors demonstrate a reduction of 26% in the false temporal discriminator, D , also receives the current input alarm rate compared to their Gaussian Mixture Model (GMM) embeddings and either the synthesised or ground truth future baseline which doesn’t use any temporal modelling. In [27] the audio segments and learns to classify them, considering the authors build upon this work where they augment the Long temporal consistency of those signals. Short Term Memory (LSTM) cell architecture. They propose The main contributions of the proposed work are sum- a coordinated-gate LSTM structure and a methodology to marised as follows: directly optimise the SAD loss using the Frame Error Rate (FER). Most recently, the Adaptive Context Attention Model We introduce a Temporarily-Aware GAN (TA-GAN) (ACAM) [28] model extended the LSTM based temporal learning framework for speech activity detection. modelling scheme using an attention strategy to learn the We demonstrate how a custom loss function for speech context of the speech signal for noise robustness in the SAD activity detection can be automatically learned through system. In a different line of work, an audiovisual SAD system the GAN learning process. is proposed in [29] in order to improve the robustness of the We propose a novel temporal discriminator which encour- framework. ages the generator to synthesise future speech segments in accordance with the current context. We perform extensive evaluations on the proposed frame- III. THE P ROPOSED APPROACH work using multiple public benchmarks and demonstrate We are inspired by the tremendous success of DNN based performance beyond that of current state-of-the-art sys- multi-task learning frameworks [4]–[6] in speech processing tems. which demonstrate greater robustness compared to single task learning methods. Motivated by these findings we investigate II. RELATED WORK ON SUPERVISED SPEECH ACTIVITY the utility of multi-task learning for SAD. To the best of our DETECTION knowledge, the work in this study is the first to consider multi- In supervised SAD, machine learning algorithms are trained task learning for SAD. Specifically, we attain joint predictions on annotated audio data to discriminate speech from non- of the frame-wise speech/ non-speech classification along speech segments. Several prior works have focused on find- with the next audio segment through the proposed multi-task IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 14, NO. 8, AUGUST 2015 3 Ground Truth Speech Current audio (W ) (θ ) (Δ ) Classifi cation (η ) 1: T Ground truth futureaudio 1:T 1:T 1:T Speech (W ) T+ 1:T+ T {z} Non-Speech LSTM (encoder) Input embedding(C) LSTM LSTM LSTM LSTM Predicted Speech Classifi cations Predicted futureaudio (η̂ ) 1:T (W ) T+ 1:T+ T Speech Non-Speech LSTM LSTM LSTM LSTM FC(1) FC(1) softmax softmax [true/ fake] [true/ fake] Static Discriminator Temporal Discriminator Fig. 1. Proposed TA-GAN framework: Given the current time  , the model input is a segment containing the T audio frames and the features extracted from this segment (where w is the raw audio of frame t,  denotes the MFCC feature [12] and  denotes the MFCC deltas [13] for the same frame) directly t t t proceeding  and we term this the current segment. The encoder receives audio, MFCC and Deltas of MFCC inputs and embeds this information in an input embedding, C . Using this embedding and a random noise vector z, the classification sequence generator, G , synthesises a frame-wise speech classification sequence for the current time window while the same C and a random noise vector z are used by the audio generator, G , to synthesise the audio signal for the next T frames directly following  which we term the future segment. We utilise two discriminators. The static discriminator, D , receives the current input embeddings and either the synthesised or ground truth speech classification sequences and tries to discriminate between the two. The temporal discriminator, D , receives the current input embeddings and either the synthesised or ground truth future audio segments and learns to classify them, considering the temporal relationships between audio frames within those signals. learning framework. As there doesn’t exist an optimal, off the at hand rather than simply adding together the loss functions shelf loss function for the joint task that we are attaining, we for individual tasks. For instance, in [33] the authors illustrate utilise the GAN learning framework to automatically learn a the utility of GANs for video based action prediction while loss function for these tasks. synthesising future frame representations, and the authors in [34] showed that this process is highly beneficial for mitigating We exploit the task-specific loss-function learning capability the errors due to variation of view angles in gait recognition of the GAN framework to automatically learn a custom loss through view synthesis. function [30]–[33] that facilitates these two tasks . The merit of this approach is that it allows us to learn a highly non- For benefit of the readers who may be unfamiliar with linear loss, in contrast to a linear loss like cross entropy, to GAN we provide a brief introduction. Generative adversarial optimally capture the underlying semantics of the process. This networks fall within the family of generative models. The Gen- custom loss function learning capability of GANs is highly erator (G) learns a mapping from a random noise vector z to beneficial in the multi-task learning setting, as it allows us to an output y; G : z ! y [35]. An extension to this basic model learn a custom loss function that accounts for all the tasks is proposed in [7] where the authors propose a conditional IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 14, NO. 8, AUGUST 2015 4 GAN, which learns a mapping from an observed input x and minimises the sum of the squared differences between the random noise vector z, to output y, G : fz; xg ! y. This synthesised output and ground truth data [47] while the L extension allows the model to learn a conditional mapping loss function minimises the sum of the absolute differences between the current input and the output. between the synthesised output and ground truth data [48]. In GANs partake in a two-player adversarial game where the [39] the authors demonstrate that L loss is more effective in generator, G, tries to fool the discriminator, D, with synthe- penalising discontinuities between nearby frames compared to sised outputs while D tries to identify them. This objective, the L loss. Motivated by these findings we utilise L as our 1 2 in terms of the conditional GAN, can be written as, regularisation mechanism. min max E [log(D(x; y))]+E [log(1(D(x; G(x; z))))]; x;y x;z IV. A RCHITECTURE G D (1) The proposed architecture is inspired by the success of where D tries to maximise this objective while G tries to multi-task learning over single task learning methods in nu- minimise it. Hence there exists a dual between G and D, merous speech related areas [4]–[6]. We design our auxiliary through which the GAN framework learns a custom loss task of predicting the next audio segment to facilitate our function for the task at hand. It should be noted that we primary goal of speech/non-speech classification via capturing do not explicitly define the loss of G. The discriminator, D, broader context of the input audio segment than just relying is the loss function for the G, which is a neural network on the input itself. Rather than using hand engineered loss approximating the loss. Therefore, a custom loss function is function for the two tasks we utilise GAN framework to learned through the adversarial learning process. For further automatically learn a custom loss function that facilitates both information regarding the GAN learning process we refer the tasks. readers to [7], [35]. The proposed approach is shown in Fig. 1. Inputs are GANs are extensively applied for tasks such as image-to- E processed by the encoder, f , which embeds this information image synthesis [7], [36]–[38], video synthesis [39], [40] and into a vector. We implemented the encoding function f using speech enhancement [8], [41], [42], but seldom for SAD. To a single LSTM cell. Using this embedding, the generator, the best of our knowledge, no prior work has applied GANs G , synthesises a speech activity classification sequence while for the SAD task. Most GAN related works have focused on G synthesises the future audio signal (see Sec. IV-A). We using static inputs such as images [7], [36]–[38], while only utilise two discriminators, a static discriminator (see Sec. a few have addressed temporal changes in the input data. In IV-B) and a temporal discriminator (see Sec. IV-C) where [43], [44] the authors address this by directly incorporating the former considers individual elements in the sequence the time axis in the input and output. For instance, in [43] when performing the adversarial classification, and the latter the authors propose a temporal generator while Yu et. al [44] preserves the temporal relationships between audio frames of propose a sequence generator that learns a stochastic policy. the outputs. The overall objective of the combined model is However, neither of these works have considered a framework presented in Sec. IV-D. that processes individual frames while also considering the Motivated by [12] we consider a combination of input temporal relationships between them. features. Let the input, X , be, Xie et. al [45] address this issue through a dual discriminator X = [(w ; w ; : : : ; w ); ( ;  ; : : : ;  ); ( ;  ; : : : ;  )]; 1 2 T 1 2 T 1 2 T architecture. However, they have engineered the temporal loss (2) to consider the velocity of consecutive frames, and hence this where w is the raw audio of the frame t,  denotes the t t cannot be directly applied for speech processing. MFCC feature [12] and  denotes the MFCC deltas [13] for In our work we exploit the merits of the GAN learning t the same frame. framework to automatically learn a loss function for synthe- sising highly indistinguishable data and synthesise both the A. Generators speech activity classifications for the set of individual input Given an input X , we first pass it through an encoding frames as well as the input signal in the next time frame. This function, f , which generates an embedding such that, allows us to learn the context of the input audio segment. In the context of computer vision, L and L losses have 1 2 C = f (X ); (3) been extensively coupled with the adversarial GAN loss to alleviate the static pixel-wise loss between the synthesised where C = [c ; c ; : : : ; c ; : : : ; c ]. Using this input embed- 1 2 t T output and the ground truth data [7], [46]. The L loss ding, C , and a noise vector, z, the generator, G , synthesises 2 IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 14, NO. 8, AUGUST 2015 5 a speech classification sequence,  ^ = [ ^ ;  ^ ; : : : ;  ^ ; : : : ;  ^ ], sub-sequences. This objective can be written as, 1 2 t T classifying each frame in X , while G synthesises the future w w V = min max E[log(D (c ; w ))] audio signal, w ^ = [w ^ ; w ^ ; : : : ; w ^ ], for the next time T +1 T +2 T +T 1:t t+1:t+t w w G D t=1 window. This can be written as, w w + E[log(1 (D (c ; G (c ; z))))] (7) 1:t 1:t ^ = G (C; z); (4) t=1 and w w 2 + jjw G (c ; z)jj ; w t+1:t+t 1:t w ^ = G (C; z): (5) t=1 E w Predicting the raw signal, rather than MFCC’s or other where w = (w : : : w ), c = f (X ) and  is a hyper- 1:t 1 t 1:t 1:t parameter controlling the contribution from the L loss. features, allows us to enforce temporal constraints in the 2 We would like to emphasise the fact that utilising the above discriminator and preserve the original characteristics of the formulation, the static discriminator provides frame-wise true/fake input signal. decisions while the temporal discriminator provides decisions for time-windows of different frame lengths. B. Static Discriminator D. Complete Model We combine the objectives in Equations 6 and 7 to obtain the The static discriminator, D , receives the current input em- objective for the proposed TA-GAN, beddings and the ground truth speech classification sequence, , and learns to classify it as real while G tries to synthesise V = V + V : (8) a classification sequence,  ^, which is not easily distinguishable It can be seen that for the individual losses V and V there exist from the real sequences. This objective can be written as, contributions from the adversarial losses which occur due to the dual between G and D , and G and D . As shown in Equations 6 V = min max E[log(D (c ;  ))] t t G D and 7, the generators G and G try to minimise these loss values t=1 while discriminators D and D try to maximise them. Hence it T T X X + E[log(1 (D (c ; G (c ; z))))] +  jj G (c ; z)jj ; can be concluded that the overall loss, V , of the proposed TA- t t t t t=1 t=1 GAN is automatically learned through the proposed framework by (6) considering the task at hand. where we add an additional L loss to regularise the process and  is a hyper-parameter controlling the contribution from V. EVALUATIONS the L loss. A. Datasets The proposed Temporarily-Aware GAN (TA-GAN) framework is evaluated on four popular SAD benchmarks, namely, HAVIC [1], C. Temporal Discriminator AMI Meeting corpus [50], NIST OpenKWS’13 [51], and NIST OpenSAT’ 17 [2]. The details of the datasets and the evaluation The objective in Eq. 6 is shown to be highly effective protocols are summarised below. for generating realistic static outputs considering the elements 1) HAVIC: HAVIC (the Heterogeneous Audio Visual Internet of the sequence individually [49]. However, it discards the Collection) Pilot Transcription [1] is comprised of approximately 72 temporal coherence as the generator and the discriminator hours of user-generated videos with transcripts based on the English consider each frame individually [49]. Even though this be- speech audio extracted from the videos. The transcription files contain the type of the audio segment annotated for speech, music, noise and haviour is acceptable when considering the frame-wise speech singing segments [52]. We choose music and noise segments as non- classification sequence, it is suboptimal when considering the speech and rest of the segments as speech. Due to the unavailability future audio output. Inspired by [44], [45], [49] we introduce a of standard training/ testing splits we randomly split 70% of the data temporal discriminator, D , which also preserves the temporal for training, 20% for testing and 10% for validation. As the evaluation relationships between audio frames of the output. We consider metric we measure NIST OpenSAD Detection Cost Function (DCF), different sub sequences of the generated sequences  ^ and DCF = 0:75 P + 0:25 P ; (9) miss fa w ^, and generate the true/ fake classification through the discriminator considering these sub-sequences. Hence it forces where P denotes miss probability and P denotes the probabil- miss fa the discriminator to consider the temporal accordance of these ity of false alarms. IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 14, NO. 8, AUGUST 2015 6 w  w 2) AMI Meeting corpus: This dataset consists of 100 hours of generators G and G . However the two discriminators D and D recordings collected across three different meeting rooms. It offers a are updated individually. Hyperparameters  and  are evaluated challenging SAD setting as audio data is from both non-native and experimentally by changing the respective hyper-parameter while native English speakers. Similar to [27] we use the Frame Error Rate holding the rest of the parameters constant, and are set to 30 and (FER) metric to evaluate the performance. Training testing splits are 25 respectively. Changes in FER against  and  are shown in as defined in [50]. Fig. 2. The implementation of the proposed TA-GAN is completed 3) NIST OpenSAT’ 17: We also utilise the public safety with Keras [57] and Theano [58]. communications (PSC) corpus from NIST OpenSAT 2017 for our evaluations [2], which is a standard split in NIST OpenSAT 2017 and is constructed using the audio data from Sofa Super Store Fire (SSSF) dispatcher that occurred on June 18, 2007 in Charleston, South Carolina. This data consisted of audio logs in English from real fire-response operational data and is rich in naturalistic distor- tions including land-mobile-radio transmission effects, speech under cognitive and physical stress, speaking with significant background noise (Lombard effect), varying background-noise types and levels, and varying background decibel levels, [2], [53], [54]. The data is 1 3 provided as 16-bit at 8 kHz sampling rate . Due to the unavailability of ground truth evaluation labels we use the six audio recordings 00 20100 40200 60300 800 40 100500 1200 in the development data which constitute approximately 30 minutes Length ofmemory worth audio recordings. Due to this limited size we utilise this dataset (a)  vs FER only under cross database evaluations (see Sec. V-D) where we use this dataset only for testing (i.e it is not used for training the models). Following [53] we measure the DCF metric which is evaluated using Eq. 9. 18 4) NIST OpenKWS’13: To demonstrate the robustness of TA- GAN for different languages we evaluate the performance using Vietnamese, Pashto, Turkish and Tagalog corpuses from the IARPA 12 91 Babel dataset [51] . We evaluate the system using the FER metric as in [27]. B. Implementation Details We use a sliding window [8] to sample 1 second segments 0 200 400 600 800 1000 50 1200 60 0 10 20 30 40 from the raw audio every 500ms (with 50% overlap). We extract Length ofmemory MFCC features with 13 cepstral coefficients and the delta features (b)  vs FER considering the immediately preceding 2 frames and the next 2 frames using a frame size of 25 ms, sampled at a frame rate of 100 fps. Fig. 2. Evaluation of hyper-parameters using the validation set of AMI Similar to [55] inputs are normalised to a range 0-1, and no other Meeting corpus. We set  = 30 and  = 25. speech-specific preprocessing is performed. At test time we slide the window, without overlap, over the whole test utterance and generate the relevant speech classification sequence using G . It should be C. Results noted that similar to [27] we generate speech/ non- speech predictions for each frame within the 1 second segment. Hence at test time there TABLE I EVALUATIONS ON THE HAVIC DATASET [1]. DCF DENOTES NIST is only a 1 second framing delay at the beginning, after which the O PENSAD DETECTION COST F UNCTION (DCF) AS DEFINED IN EQ. 9. window can be shifted in small increments to produce predictions in real time. Method DCF We implemented the encoding function, f , using a single LSTM MLP - Gelley et. al [27] 8.10 cell, and the two generators, G and G , are implemented with Basic RNN - Gelley et. al [27] 6.38 two separate LSTM blocks, each with two cells of LSTMs. For CG-LSTM - Gelley et. al [27] 5.10 all LSTMs the hidden state size is set to 300 units. For training, ACAM -Kim et al - [28] 4.95 we use the Adam [56] optimiser, a learning rate of 0.005, and TA-GAN 2.53 500 epochs with a batch size of 600, alternating between epochs of D and G. We train the input encoder jointly with the two Evaluations on the HAVIC dataset are presented in Tab. I, and AMI Meeting corpus and NIST OpenKWS’13 corpora are presented We obtained the data from https://catalog.ldc.upenn.edu/ and the LDC in Tab. II. For better comparisons we provide evaluations for the Catalog ID is LDC2017E12 CG-LSTM and Basic RNN and MLP methods in [27], and the Vietnamese IARPA-babel107b-v0.7, Pashto IARPA-babel104b- v0.4b,Turkish IARPA-babel105b-v0.5, Tagalog IARPA-babel106-v0.2g Adaptive Context Attention Model (ACAM) proposed in [28]. These Average altitude error(feet) Average altitude error(feet) FER FER IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 14, NO. 8, AUGUST 2015 7 TABLE II cross database evaluations with the evaluations presented within EVALUATIONS ON THE AMI M EETING [50] AND O PENKWS’13 CORPUS brackets, when the models are trained and tested on the same dataset, [51]. FER DENOTES F RAM E ERROR RATE AS DEFINED IN [27]. we only observe a slight reduction in the performance of the proposed approach when it is not tuned on the training set of the specific FER Method AMI Meeting OpenKWS’13 dataset. However, the performance reductions in the baselines are MLP - Gelley et. al [27] 6.84 6.29 quite substantial. Basic RNN - Gelley et. al [27] 6.55 6.24 CG-LSTM - Gelley et. al [27] 5.93 5.76 TABLE III ACAM - Kim. et al [28] 5.89 5.66 CROSS DATABASE EVALUATIONS USING NIST OPENSAT’ 17 [2], HAVIC [1], AMI MEETING [50] AND O PENKWS’13 CORPUS [51]. FOR HAVIC, TA- GAN 2.80 2.75 AMI M EETING AND O PENKWS’13 DATASETS W ITHIN BRACKETS WE REPORT THE ERROR RATES WHEN THE M ODEL IS TRAINED AND TESTED ON THE DATABASE INDICATED IN THE “T ESTED O N” COLUMN. two models which were proposed very recently have been able to Error Rate (DCF / FER) attain state-of-the-art results under a supervised SAD setting in the Trained on Tested on CG-LSTM [27] ACAM [28] TA- GAN datasets that we consider. The work of Gelley et. al [27] utilises RNNs NIST OpenSAT’ 17 5.36 4.78 2.53 AMI Meeting HAVIC 7.63 (5.10) 7.08 (4.95) 4.53 (2.53) for modelling the temporal relationships within the input signal and OpenKWS’13 8.17 (5.76) 7.73 (5.66) 4.15 (2.75) demonstrates that directly optimising the SAD loss using the Frame NIST OpenSAT’ 17 5.30 4.42 2.14 Error Rate (FER) produces better results. In [28] Kim. et al exploit an HAVIC OpenKWS’13 7.93 (5.76) 7.65 (5.66) 4.01 (2.75) AMI Meeting 7.87 (5.93) 7.39 (5.89) 4.23 (2.80) attention strategy for learning the context of the speech signal using NIST OpenSAT’ 17 5.51 5.02 3.14 LSTMs for noise robustness in the SAD system. The comparative OpenKWS’13 HAVIC 7.86 (5.10) 7.12 (4.95) 4.81 (2.53) evaluations with these baselines demonstrates the utility of GAN AMI Meeting 8.51 (5.93) 7.71 (5.89) 4.60 (2.80) based learning for the SAD system. In addition to utilising LSTMs for temporal modelling and an attention mechanism for input embedding, We note that the baseline ACAM [28] model utilises a window of 39 frames as the input, w, while the proposed TA-GAN model the proposed model automatically learns a loss function for the utilises 100 frames as the input window. Due to this difference, SAD task. Hence, in contrast to [27], [28], the proposed TA-GAN their performance is not directly comparable. However, in Sec. V-F method has been able to learn a more robust input embedding which we show a further evaluation using different (smaller) window sizes better discriminates the speech segments compared to its counterparts. which illustrates that the proposed TA- GAN model is capable of Furthermore, when comparing the MLP - Gelley et. al [27] system outperforming the baseline models even with smaller input window with Basic RNN - Gelley et. al [27], CG-LSTM - Gelley et. al [27], sizes. recurrent neural network based temporal modelling has been able to further improve the performance over an MLP network. We would like to note that these systems directly optimise the FER loss. In E. Ablation Experiments contrast, using the task-specific loss function learning framework of To better understand the crucial components and sensitivities of GANs and the augmented multi-task learning approach, the proposed the proposed TA-GAN framework, we conduct a series of ablation method has been able to outperform the state-of-the-art methods. experiments. In this experiment, we use the AMI Meeting [50] dataset and compare the TA-GAN model with a series of counterparts defined D. Cross Database Evaluation as follows: To demonstrate the robustness of the proposed method across 1) G (w): Removes the GAN learning framework and G is different languages, accents, and acoustics, we perform a cross- learnt through binary cross entropy loss. This receives only database evaluation where we train the model using the training data the audio input. of one dataset and test that model on the test sets of the rest of the 2) G (w +  + ): Receives audio, MFCC and delta inputs. datasets. 3) G + G (w +  + ): Similar to 2) but additionally predicts The evaluations are presented in Tab. III. Note that when tested on the future audio segment, which is trained using mean square NIST OpenSAT’ 17 [2] and HAVIC datasets [1] we report the NIST error. OpenSAD Detection Cost Function (DCF) whereas for AMI Meeting 4) GAN (w + + )=L : uses the GAN learning framework but and OpenKWS’13 corpus we report the Frame Error Rate (FER). synthesises only the classification sequence. Receives audio, To better demonstrate the merits of the proposed method we train MFCC and delta inputs. Doesn’t utilise L regularisation in the CG-LSTM baseline model defined in [27] and ACAM baseline Eq. 6 model of [28]. For better comparisons for AMI Meeting, HAVIC 5) GAN (w +  + ): Same as above method but with L and OpenKWS’13 datasets, within brackets we report the error rates regularisation. when the model is trained and tested on the database indicated in 6) TAGAN (w): Proposed model that receives only the audio the “Tested On” column. Due to the limited dataset size we do not input and predicts the future audio segment. attempt to train models using the NIST OpenSAT’ 17 dataset. 7) TA GAN (): Proposed model which receives the MFCC When analysing the results it is clear that the proposed GAN based as input and predicts future MFCC distribution. learning framework better captures the discriminative features and is 8) TA GAN (): Receives Deltas of MFCC as input and more robust under cross domain scenarios, better segregating speech predicts future deltas. from non-speech embeddings. This allows the proposed method to 9) TA GAN (w + ): Receives both audio and MFCC inputs achieve superior results compared to the baselines. Comparing the and predicts their future distributions. IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 14, NO. 8, AUGUST 2015 8 10) TA GAN (w + ): Receives both audio and delta inputs SAD task. However, when MFCC features are fused with both audio and predicts their future distributions. and  features we observe improved performance, highlighting that 11) TA GAN ( + ): Receives both MFCC and delta inputs the complementary attributes present in those streams have the ability and predicts their future distributions. to better discriminate speech segments from their counterparts. 12) TAGAN (w + + )=L : Receives audio, MFCC and delta inputs and predicts their future distributions. Doesn’t utilise L F. Impact of input window size regularisation in Eq. 6 and Eq. 7. _ In order to illustrate the impact of the input window size for SAD 13) TA GAN (D ): Replaced the temporal discriminator with accuracy , we perform an additional evaluation on the proposed TA- static discriminator as per Eq. 6, hence, this model contains GAN model using different window sizes: 20, 40, 60, 80, 100, and two static discriminators. 120 frames. In this experiment, we use the AMI Meeting [50] dataset. TABLE IV TABLE V A BLATION MODEL EVALUATIONS ON AMI MEETING [50] DATASET. EVALUATING THE EFFECT OF DIFFERENT W INDOW SIZES ON THE TA-GAN MODEL USING AMI MEETING [50] DATASET. ID Method FER 1) G (w) 9.10 Window Size (in frames) FER 2) G (w +  + ) 7.20 20 5.24 3) (G + G )(w +  + ) 7.12 4) GAN (w +  + )=L 4.73 40 4.41 5) GAN (w +  + ) 4.15 60 3.54 6) TA GAN (w) 3.99 80 3.11 7) TA GAN () 3.72 100 2.80 8) TA GAN () 4.03 120 2.84 9) TA GAN (w + ) 3.60 10) TA GAN (w + ) 3.98 11) TA GAN ( + ) 3.65 Considering these evaluations it is clear that a considerable re- 12) TA GAN (w +  + )=L 3.54 duction in the FER can be achieved when increasing the window 13) TA GAN (D ) 3.31 size from 20 to 100 frames, but no significant gain is observed by Proposed TA-GAN 2.80 increasing it beyond 100 frames. We believe utilising a large window size is essential in the proposed method in order to properly model With the ablation evaluations presented in Tab. IV we can see the context within the given window. Furthermore, comparing these the importance of multi-task learning, the merits of using GAN results to those obtained by ACAM [28] for the AMI Meeting using based automatic loss function learning and the importance of utilised a window size of 39 frames, we observe that the proposed TA-GAN features. model with a smaller window size (i.e 20 frames) has been able to When comparing both non-GAN and GAN based single task SAD achieve better performance than ACAM [28]. methods (ablation model 1-2 and 4-5) with their respective multi-task counterparts (i.e ablation models 3 and 13) we observe a significant G. Qualitative Results contribution for the SAD task through the multi-task learning strategy. Furthermore, when comparing non-GAN based models (1-3) with We randomly selected 100 examples from the AMI Meeting [50] GAN based models (4-13), we observe a significant performance test set and plotted the inputs embeddings, c, for each frame in those boost denoting the merits of task-specific loss function learning. We examples. The model trained on the AMI Meeting training set is used would like to emphasise the fact that this performance increase is generate these embeddings. These embeddings are coloured based on observed for both single-task as well as multi-task models, although the ground truth speech/ non-speech labels. Note that for each frame we observe a further substantial improvement with regards to multi- the encoder generates a 300 dimensional embedding vector. Hence task methods. in order to plot the results in 2D we applied PCA [59] to reduce In addition we observe that the temporal discriminator has been the dimensionality. In Fig. 3 (a) we visualise the input embeddings able to further improve this learning process (see model 13 and learnt through the proposed TA-GAN model. It is clear that the TA-GAN (proposed)). Even though we do not observe a direct model has been successful in learning an embedding space which relationship between the temporal discriminator, which is used for better segregates speech from non-speech than the alternate ablation real/fake validation of the predicted future audio segments and the models, at least in terms of the two directions that capture most SAD task, we notice a significant contribution from this module. This variation determined via PCA. In Fig, 3 (b) and (c) we perform illustrates that via analysing the temporal relationships between audio the same visualisation for two of the ablation models (3 and 5). frames the discriminator gains the ability to guide the generator to Considering Fig. 3 (b) we see that the model fails to learn such a generate realistic outputs. Hence it enforces the input embeddings discriminative embedding space. Furthermore, we would like to point to better identify the temporal context of the inputs, denoting the out that in the proposed model the same input embedding is used to utility of multi-task learning and the importance of the future audio predict the future audio signal as well. The clear separation of the segment prediction task as the auxiliary task of the proposed multi- two classes (i.e speech and non-speech) verifies our hypothesis that task learning framework. jointly predicting the future audio signal for the next time window When comparing different feature combinations present in Tab. IV can improve SAD performance. To further demonstrate this ability in we observe that MFCC features contain more salient attributes for the Fig 3 (c) we visualise the input embeddings learnt through Ablation IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 14, NO. 8, AUGUST 2015 9 model 5 for the same set of examples, where the GAN based model  1 second length classification sequence predictions. It should be only predicts the classification sequence without modelling the future noted that the proposed TA-GAN model has a larger time complexity audio signal. It is clear that the automatic loss function learning due to the joint prediction of both future audio and classification process has contributed to learning discriminative embeddings but sequences. In terms of number of trainable parameters the proposed we observe some areas with overlaps between the speech and non- method contains 48K trainable parameters while the basic RNN and speech embeddings in contrast to the clear segregation in Fig. 3 (a). CG-LSTM methods of [27] have 6K trainable parameters. This further emphasises the importance of the joint learning of both tasks to better capture the discriminative features. VI. CONCLUSION In this paper, we propose a novel multi-task learning framework for Speech Non-Speech speech activity detection, by properly analysing the context of the in- put embeddings and their temporal accordance. We contribute a novel 2 data-driven method to capture salient information from the observed 1 audio segment by jointly predicting the speech activity classification sequence and the audio for the next time frame. Additionally, we introduce a temporal discriminator to enforce these relationships in the synthesised data. Our quantitative evaluations using multiple supervised SAD bench- marks, including NIST OpenSAT’ 17 [2], AMI Meeting [50] 4 2 0 2 4 OpenKWS’13 [51] and HAVIC [1] demonstrated the utility of the (a) TA-GAN proposed multi-task learning framework compared to the single Speech task based supervised SAD baselines. Furthermore, through ablation Non-Speech model evaluations presented Sec. V-E we demonstrate that the auto- matic learning of a loss function specifically considering the task at hand, as opposed to using hand engineered losses, has significantly contributed to the superior performance attained in the proposed multi-task learning framework. In addition, in Tab. IV we provide comparisons regarding systems with and without using the proposed temporal discriminator. The evaluation of the temporal discriminator, which enforces the tempo- 4 2 0 2 4 ral relationships between audio frames of the synthesised outputs, (b) Ablation model 3 (G +G (w + + )) demonstrates the utility of incorporating this intelligence in the discriminator, which guides the generator to generate realistic outputs. Speech Speech Non-Speech Non-Speech With empirical evaluations we illustrate that the future audio segment prediction auxiliary task contributes to augment the performance of the SAD task, demonstrating the utility of multi-task learning and the importance of the future audio segment prediction task for learning the context of the input embeddings. To better demonstrate the robustness of the proposed framework we conducted a cross-database evaluation where we train the model using a seperate dataset and tested on another dataset. This experi- 4 2 0 2 4 ment revealed that the proposed multi-task learning framework learns (c) Ablation model 5 (GAN (w +  + )) better discriminative features which are more robust across multiple datasets, compared to the current state-of-the-art supervised SAD Fig. 3. Visualisation of input embeddings, C , for proposed TA-GAN and models. We would like to emphasise that these evaluated datasets Ablation models 3 (G + G (w +  + )) and 5 (GAN (w +  + )) are of different languages, accents, and acoustics and the proposed method exhibits 37-52% relative gain over the best alternate approach (ACAM [28]) when evaluated with NIST OpenSAT’ 17 [2]. H. Time Complexity In order to demonstrate that the proposed TA-GAN approach is ACKNOWLEDGMENT suitable for real-time use, we benchmarked the time complexity of This research was supported by an Australian Research Council TA-GAN on the test set of AMI Meeting corpus dataset on a single (ARC) Discovery grant DP140100793. core of an Intel Xeon E5-2680 2.50GHz CPU and the TA-GAN model runs at 5.35  faster than real time. The proposed system was REFERENCES able to generate 100 predictions (i.e, 100 seconds of audio) where the output is both 100  1 second length classification sequence [1] LDC, “Havic pilot transcription,” 2016. [Online]. Available: https: predictions and 100  1 second length future audio predictions, in //catalog.ldc.upenn.edu/LDC2016V01 18.70 seconds. In a similar setting both basic RNN and CG-LSTM [2] NIST, “Nist pilot speech analytic technologies evaluation, opensat,” methods of [27] take approximately 8.56 seconds to generate 100 2017. IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 14, NO. 8, AUGUST 2015 10 [3] N. Chen, Y. Qian, and K. Yu, “Multi-task learning for text-dependent [22] L. Ferrer, M. Graciarena, and V. Mitra, “A phonetically aware system for speaker verification,” in Sixteenth annual conference of the international speech activity detection,” in Acoustics, Speech and Signal Processing speech communication association, 2015. (ICASSP), 2016 IEEE International Conference on. IEEE, 2016, pp. [4] G. Pironkov, S. Dupont, and T. Dutoit, “Multi-task learning for speech 5710–5714. recognition: an overview,” in Proceedings of the 24th European Sympo- [23] T. Kinnunen, A. Sholokhov, E. Khoury, D. Thomsen, M. Sahidullah, and sium on Artificial Neural Networks (ESANN), vol. 192, 2016. Z.-H. Tan, “Happy team entry to nist opensad challenge: A fusion of [5] Z. Chen, S. Watanabe, H. Erdogan, and J. R. Hershey, “Speech en- short-term unsupervised and segment i-vector based speech activity de- tectors,” Proceedings of the 17th Annual Conference of the International hancement and recognition using multi-task learning of long short-term memory recurrent neural networks,” in Sixteenth Annual Conference of Speech Communication Association, 2992-2996, 2016. the International Speech Communication Association, 2015. [24] I. Hwang, H.-M. Park, and J.-H. Chang, “Ensemble of deep neural [6] N. K. Kim, J. Lee, H. K. Ha, G. W. Lee, J. H. Lee, and H. K. Kim, networks using acoustic environment classification for statistical model- “Speech emotion recognition based on multi-task learning using a con- based voice activity detection,” Computer Speech & Language, vol. 38, pp. 1–12, 2016. volutional neural network,” in 2017 Asia-Pacific Signal and Information Processing Association Annual Summit and Conference (APSIPA ASC). [25] A. Sholokhov, M. Sahidullah, and T. Kinnunen, “Semi-supervised IEEE, 2017, pp. 704–707. speech activity detection with an application to automatic speaker [7] P. Isola, J.-Y. Zhu, T. Zhou, and A. A. Efros, “Image-to-image translation verification,” Computer Speech & Language, vol. 47, pp. 132–156, 2018. with conditional adversarial networks,” arXiv preprint, 2017. [26] T. Hughes and K. Mierle, “Recurrent neural networks for voice activity [8] S. Pascual, A. Bonafonte, and J. Serra, “Segan: Speech enhancement detection,” in Acoustics, Speech and Signal Processing (ICASSP), 2013 IEEE International Conference on. IEEE, 2013, pp. 7378–7382. generative adversarial network,” InterSpeech, 2017. [27] G. Gelly and J.-L. Gauvain, “Optimization of rnn based speech activity [9] M. K. Leonard, K. E. Bouchard, C. Tang, and E. F. Chang, “Dynamic encoding of speech sequence probability in human temporal cortex,” detection,” IEEE/ACM Transactions on Audio, Speech, and Language Journal of Neuroscience, vol. 35, no. 18, pp. 7203–7214, 2015. Processing, vol. 26, no. 3, pp. 646–656, 2018. [10] A. Heinrich, R. P. Carlyon, M. H. Davis, and I. S. Johnsrude, “Illusory [28] J. Kim and M. Hahn, “Voice activity detection using an adaptive context vowels resulting from perceptual continuity: a functional magnetic attention model,” IEEE Signal Processing Letters, vol. 25, no. 8, pp. 1181–1185, 2018. resonance imaging study,” Journal of cognitive neuroscience, vol. 20, no. 10, pp. 1737–1752, 2008. [29] F. Tao and C. Busso, “End-to-end audiovisual speech activity detection [11] C. Darwin, “Listening to speech in the presence of other sounds,” with bimodal recurrent neural models,” Speech Communication, vol. Philosophical Transactions of the Royal Society of London B: Biological 113, pp. 25–35, 2019. Sciences, vol. 363, no. 1493, pp. 1011–1021, 2008. [30] T. Fernando, S. Denman, S. Sridharan, and C. Fookes, “Memory augmented deep generative models for forecasting the next shot location [12] I. McCowan, D. B. Dean, M. L. McLaren, R. J. Vogt, and S. Sridharan, in tennis,” IEEE Transactions on Knowledge and Data Engineering, “The delta-phase spectrum with application to voice activity detection and speaker recognition,” IEEE Transactions on Audio, Speech, and 2019. Language Processing, vol. 19, no. 7, pp. 2026–2038, 2011. [31] A. Wang, “Application of generative adversarial network on image [13] S. Zhang, S. Zhang, T. Huang, and W. Gao, “Speech emotion recognition style transformation and image processing,” Ph.D. dissertation, UCLA Electronic Theses and Dissertations, 2018. using deep convolutional neural network and discriminant temporal [32] A. B. L. Larsen, S. K. Sønderby, H. Larochelle, and O. Winther, pyramid matching,” IEEE Transactions on Multimedia, vol. 20, no. 6, pp. 1576–1590, 2018. “Autoencoding beyond pixels using a learned similarity metric,” arXiv [14] T. Ng, B. Zhang, L. Nguyen, S. Matsoukas, X. Zhou, N. Mesgarani, preprint arXiv:1512.09300, 2015. K. Vesely, ` and P. Matejka, ˇ “Developing a speech activity detection [33] H. Gammulle, S. Denman, S. Sridharan, and C. Fookes, “Multi-level system for the darpa rats program,” in Thirteenth Annual Conference sequence gan for group activity recognition,” in Asian Conference on Computer Vision. Springer, 2018, pp. 331–346. of the International Speech Communication Association, 2012. [15] G. Saon, S. Thomas, H. Soltau, S. Ganapathy, and B. Kingsbury, “The [34] Y. He, J. Zhang, H. Shan, and L. Wang, “Multi-task gans for view- ibm speech activity detection system for the darpa rats program.” in specific feature learning in gait recognition,” IEEE Transactions on Interspeech, 2013, pp. 3497–3501. Information Forensics and Security, vol. 14, no. 1, pp. 102–113, 2018. [16] S. Thomas, G. Saon, M. Van Segbroeck, and S. S. Narayanan, “Im- [35] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, provements to the ibm speech activity detection system for the darpa S. Ozair, A. Courville, and Y. Bengio, “Generative adversarial nets,” in rats program,” in Acoustics, Speech and Signal Processing (ICASSP), Advances in neural information processing systems, 2014, pp. 2672– 2015 IEEE International Conference on. IEEE, 2015, pp. 4500–4504. 2680. [17] M. Graciarena, A. Alwan, D. Ellis, H. Franco, L. Ferrer, J. H. Hansen, [36] Z. Yi, H. R. Zhang, P. Tan, and M. Gong, “Dualgan: Unsupervised dual A. Janin, B. S. Lee, Y. Lei, V. Mitra et al., “All for one: feature learning for image-to-image translation.” in ICCV, 2017, pp. 2868–2876. combination for highly channel-degraded speech activity detection.” in [37] M.-Y. Liu, T. Breuel, and J. Kautz, “Unsupervised image-to-image INTERSPEECH. Citeseer, 2013, pp. 709–713. translation networks,” in Advances in Neural Information Processing [18] M.-W. Mak and H.-B. Yu, “A study of voice activity detection techniques Systems, 2017, pp. 700–708. for nist speaker recognition evaluations,” Computer Speech & Language, [38] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, A. Tao, J. Kautz, and B. Catanzaro, vol. 28, no. 1, pp. 295–313, 2014. “High-resolution image synthesis and semantic manipulation with con- [19] J. W. Shin, J.-H. Chang, and N. S. Kim, “Voice activity detection ditional gans,” computer vision and pattern recognition, 2017. based on statistical models and machine learning approaches,” Computer [39] D. Berthelot, T. Schumm, and L. Metz, “Began: boundary equilib- Speech & Language, vol. 24, no. 3, pp. 515–530, 2010. rium generative adversarial networks,” arXiv preprint arXiv:1703.10717, [20] S. S. Kumar and K. S. Rao, “Voice/non-voice detection using phase of 2017. zero frequency filtered speech signal,” Speech Communication, vol. 81, [40] T.-C. Wang, M.-Y. Liu, J.-Y. Zhu, G. Liu, A. Tao, J. Kautz, and B. Catan- pp. 90–103, 2016. zaro, “Video-to-video synthesis,” arXiv preprint arXiv:1808.06601, [21] T. Drugman, Y. Stylianou, Y. Kida, and M. Akamine, “Voice activity 2018. detection: Merging source and filter-based information,” IEEE Signal [41] C. Donahue, B. Li, and R. Prabhavalkar, “Exploring speech enhancement Processing Letters, vol. 23, no. 2, pp. 252–256, 2016. with generative adversarial networks for robust speech recognition,” in IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 14, NO. 8, AUGUST 2015 11 2018 IEEE International Conference on Acoustics, Speech and Signal Tharindu Fernando received his BSc (special de- Processing (ICASSP). IEEE, 2018, pp. 5024–5028. gree in computer science) from the University of [42] M. H. Soni, N. Shah, and H. A. Patil, “Time-frequency masking-based Peradeniya, Sri Lanka and his PhD from Queens- speech enhancement using generative adversarial network,” 2018 IEEE land University of Technology (QUT), Australia, International Conference on Acoustics, Speech and Signal Processing respectively. He is currently a Postdoctoral Research (ICASSP), 2018. Fellow in the SAIVT Research Program of School [43] M. Saito, E. Matsumoto, and S. Saito, “Temporal generative adversarial Electrical Engineering and Computer Science at nets with singular value clipping,” in IEEE International Conference on QUT. His research interests focus mainly on human Computer Vision (ICCV), vol. 2, no. 3, 2017, p. 5. behaviour analysis and prediction. [44] L. Yu, W. Zhang, J. Wang, and Y. Yu, “Seqgan: Sequence generative adversarial nets with policy gradient.” in AAAI, 2017, pp. 2852–2858. [45] Y. Xie, E. Franz, M. Chu, and N. Thuerey, “tempogan: A temporally coherent, volumetric gan for super-resolution fluid flow,” ACM Trans- actions on Graphics, Vol. 37, No. 4, Article 95, 2018. Sridha Sridharan has a BSc (Electrical Engineer- [46] C. Dong, C. C. Loy, K. He, and X. Tang, “Image super-resolution using ing) degree and obtained a MSc (Communication deep convolutional networks,” IEEE transactions on pattern analysis Engineering) degree from the University of Manch- and machine intelligence, vol. 38, no. 2, pp. 295–307, 2016. ester, UK and a PhD degree from University of New [47] J. Steinier, Y. Termonia, and J. Deltour, “Smoothing and differentiation South Wales, Australia. He is currently with the of data by simplified least square procedure,” Analytical Chemistry, Queensland University of Technology (QUT) where vol. 44, no. 11, pp. 1906–1909, 1972. he is a Professor in the School Electrical Engineering [48] P. Bloomfield and W. L. Steiger, Least absolute deviations: Theory, and Computer Science. Professor Sridharan is the applications and algorithms. Springer, 1984. Leader of the Research Program in Speech, Audio, [49] A. Bansal, S. Ma, D. Ramanan, and Y. Sheikh, “Recycle-gan: Unsu- Image and Video Technologies (SAIVT) at QUT, with strong focus in the pervised video retargeting,” European Conference on Computer Vision, areas of computer vision, pattern recognition and machine learning. He has published over 600 papers consisting of publications in journals and [50] J. Carletta, “Announcing the ami meeting corpus,” The ELRA Newsletter, in refereed international conferences in the areas of Image and Speech vol. 11, no. 1, pp. 3–5, 2006. technologies during the period 1990-2019. During this period he has also [51] M. Harper, “Iarpa babel program,” 2014. graduated 75 PhD students in the areas of Image and Speech technologies. [52] I. Himawan, M. H. Rahman, S. Sridharan, C. Fookes, and A. Kanaga- Prof Sridharan has also received a number of research grants from various sundaram, “Investigating deep neural networks for speaker diarization funding bodies including Commonwealth competitive funding schemes such in the dihard challenge,” in 2018 IEEE Spoken Language Technology as the Australian Research Council (ARC) and the National Security Science Workshop (SLT). IEEE, 2018, pp. 1029–1035. and Technology (NSST) unit. Several of his research outcomes have been [53] H. Dubey, A. Sangwan, and J. H. Hansen, “Leveraging frequency- commercialised. dependent kernel and dip-based clustering for robust speech activity detection in naturalistic audio streams,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 26, no. 11, pp. 2056– 2071, 2018. [54] F. Byers, F. Byers, and O. Sadjadi, 2017 Pilot Open Speech Analytic Technologies Evaluation (2017 NIST Pilot OpenSAT): Post Evaluation Mitchell McLaren , Ph.D., is a senior computer Summary. US Department of Commerce, National Institute of Standards scientist in SRI International’s Speech Technology and Technology, 2019. and Research (STAR) Laboratory. His research in- [55] A. Nagrani, J. S. Chung, and A. Zisserman, “Voxceleb: a large-scale terests include speaker and language identification, speaker identification dataset,” arXiv preprint arXiv:1706.08612, 2017. as well as other biometrics such as face recognition. [56] D. Kinga and J. B. Adam, “A method for stochastic optimization,” in Prior to joining SRI in 2012, Mitchell was a post- ICLR, vol. 5, 2015. doctoral researcher and the University of Nijmegen, [57] F. Chollet et al., “Keras (2015),” 2017. The Netherlands where he focused on speaker and [58] J. Bergstra, O. Breuleux, F. Bastien, P. Lamblin, R. Pascanu, G. Des- face identification on the Bayesian Biometrics for jardins, J. Turian, D. Warde-Farley, and Y. Bengio, “Theano: A cpu and Forensics (BBfor2) project, funded by Marie Curie Action. His Ph.D. in gpu math compiler in python,” in Proc. 9th Python in Science Conf, speaker identification is from the Queensland University of Technology vol. 1, 2010. (QUT), Brisbane, Australia. [59] S. Wold, K. Esbensen, and P. Geladi, “Principal component analysis,” Chemometrics and intelligent laboratory systems, vol. 2, no. 1-3, pp. 37–52, 1987. Darshana Priyasad is a PhD student at Queensland University of Technology, Australia. He received his Bachelor of Science in Engineering, specialised in Integrated Computer Engineering with first class honours from the University of Moratuwa, Sri Lanka. His research interests include deep learning, computer and machine vision. IEEE TRANSACTIONS ON AUDIO, SPEECH, AND LANGUAGE PROCESSING, VOL. 14, NO. 8, AUGUST 2015 12 Simon Denman received a BEng (Electrical), BIT, and PhD in the area of object tracking from the Queensland University of Technology (QUT) in Brisbane, Australia. He is currently a Senior Re- search Fellow with the Speech, Audio, Image and Video Technology Laboratory at QUT. His active areas of research include intelligent surveillance, video analytics, and video-based recognition. Clinton Fookes (SM’06) received his B.Eng. (Aerospace/Avionics), MBA, and Ph.D. degrees from the Queensland University of Technology (QUT), Australia. He is currently a Professor and Head of Discipline for Vision and Signal Process- ing within the Science and Engineering Faculty at QUT. He actively researchers across computer vision, machine learning, and pattern recognition areas. He serves on the editorial board for the IEEE Transactions on Information Forensics & Security. He is a Senior Member of the IEEE, an Australian Institute of Policy and Science Young Tall Poppy, an Australian Museum Eureka Prize winner, and a Senior Fulbright Scholar.

Journal

StatisticsarXiv (Cornell University)

Published: Apr 2, 2020

There are no references for this article.