Sound Event Detection and Time-Frequency Segmentation from Weakly Labelled Data

Qiuqiang Kong*, Yong Xu*, Iwona Sobieraj, Wenwu Wang, Mark D. Plumbley, Fellow, IEEE

* The first two authors contributed equally to this work.

arXiv:1804.04715v3 [cs.SD] 2 Mar 2019

Abstract: Sound event detection (SED) aims to detect when and recognize what sound events happen in an audio clip. Many supervised SED algorithms rely on strongly labelled data, which contains the onset and offset annotations of sound events. However, many audio tagging datasets are weakly labelled, that is, only the presence of the sound events is known, without their onset and offset annotations. In this paper, we propose a time-frequency (T-F) segmentation framework trained on weakly labelled data to tackle the sound event detection and separation problem. In training, a segmentation mapping is applied to a T-F representation of an audio clip, such as a log mel spectrogram, to obtain T-F segmentation masks of sound events. The T-F segmentation masks can be used for separating the sound events from the background scenes in the time-frequency domain. Then a classification mapping is applied to the T-F segmentation masks to estimate the presence probabilities of the sound events. We model the segmentation mapping using a convolutional neural network and the classification mapping using global weighted rank pooling (GWRP). In SED, predicted onset and offset times can be obtained from the T-F segmentation masks. As a byproduct, separated waveforms of sound events can be obtained from the T-F segmentation masks. We remixed the DCASE 2018 Task 1 acoustic scene data with the DCASE 2018 Task 2 sound events data. When mixing under 0 dB, the proposed method achieved F1 scores of 0.534, 0.398 and 0.167 in audio tagging, frame-wise SED and event-wise SED, outperforming the fully connected deep neural network baseline of 0.331, 0.237 and 0.120, respectively. In T-F segmentation, we achieved an F1 score of 0.218, where previous methods were not able to do T-F segmentation.

Index Terms: Sound event detection, time-frequency segmentation, weakly labelled data, convolutional neural network.

I. INTRODUCTION

Sound event detection (SED) aims to detect what sound events happen in an audio recording and when they occur. SED has many applications in everyday life. For example, SED can be used to monitor “baby cry” sound at home [1], and to detect “typing keyboard”, “door slamming”, “ringing of phones”, “smoke alarms” and “sirens” in the office [2, 3]. For public security, SED can be used to detect “gunshot” and “scream” sounds [4]. SED is not only complementary to video or image based event detection [5]–[7], but also has several advantages over these two modalities. First, sound does not require illumination, so it can be used in dark environments. Second, sound can penetrate or move around some obstacles, while objects in video and image are often occluded. Third, some abnormal events such as fire alarms are audio only, so they can only be detected by sound. Furthermore, storing and processing sound often consumes less computation resources than video [8], and as a result, longer sound sequences can be stored in a device and faster processing can be obtained using equal computation resources.

Many SED algorithms rely on strongly labelled data [10]–[12], where the onset and offset times of sound events have been annotated. The segments between the onset and offset labels are used as target events for training, while those outside the onset and offset annotations are used as non-target events [11, 12]. However, collecting strongly labelled data is time consuming because annotating the onset and offset times of sound events takes more time than annotating audio clips for classification, so the sizes of strongly labelled datasets are often limited to minutes or a few hours [12, 13]. At the same time there are large amounts of weakly labelled data (WLD) available, where only the presence of the sound events is labelled, without any onset and offset annotations [14, 15] or the sequence of the sound events. Fig. 1 shows the waveform of an audio clip containing three non-overlapping sound events, the log mel spectrogram of the audio clip, the ideal ratio mask (IRM) [9] of the sound events, the strongly labelled onset and offset annotations and the weak labels. In this paper we will focus on non-overlapping sound events as a starting point.

Fig. 1. From top to bottom: waveform of an audio clip containing three sound events, “tambourine”, “scissors” and “computer keyboard”; log mel spectrogram of the audio clip; ideal ratio mask (IRM) [9] of the sound events; strongly labelled onset and offset annotations of the sound events; weak labels. “Silence” is abbreviated as “sil.”. The signal-to-noise ratio of this audio clip is 0 dB.

In the real world, sound events usually happen in real scenes such as a metro station or an urban park. State-of-the-art SED algorithms only detect the onset and the offset of sound events in the time domain but do not separate them from the background in the T-F domain. The separation of sound events in the T-F domain can be useful for enhancing and recognizing sound events in audio scenes under low signal-to-noise ratio (SNR). In this paper, we propose a T-F segmentation and sound event detection framework trained using weakly labelled data. This is done by learning T-F segmentation masks implicitly in training with only the clip-level audio tags. It means that T-F masks are not known even for the training set: they are predicted as intermediate results. T-F segmentation masks are equivalent to the ideal ratio masks (IRM) [9]. An IRM is the ratio of the spectrogram of a sound event to the spectrogram of the mixed audio. T-F segmentation masks can be used for SED and sound event separation. In training, a segmentation mapping is applied to the T-F representation of an audio clip, such as a log mel spectrogram, to obtain T-F segmentation masks for sound events. Then a classification mapping is applied to the T-F segmentation masks to output the presence probabilities of sound events. In T-F segmentation, with a T-F representation of an audio clip as input, the trained segmentation mapping is used to obtain the T-F segmentation masks. In SED, onset and offset times can be obtained from the T-F segmentation masks. As a byproduct, separated waveforms of sound events can be obtained from the T-F segmentation masks. This work is an extension of the joint separation-classification model for SED of weakly labelled data [16].

The paper is organized as follows. Section II introduces the previous work in SED with WLD. Section III describes the proposed T-F segmentation, sound event detection and separation framework. Section IV describes the implementation details of the proposed framework. Section V shows experimental results. Section VI concludes and forecasts future work.

II. WEAKLY SUPERVISED SOUND EVENT DETECTION

Compared to the conventional SED task, where strongly labelled onset and offset annotations for the training set are given, the weakly supervised SED task contains only clip-level labels. That is, only the presence of sound events is known in an audio clip, without knowing the temporal locations of the events. Several approaches for weakly supervised SED have recently been proposed, including multiple instance learning and convolutional neural networks.

A. Multi-instance learning method

One solution to the WLD problem is based on multiple instance learning (MIL) [14, 17]. MIL was first proposed in 1997 for drug activity detection [18]. In MIL for SED, an audio clip is labelled positive for a specified sound event if that sound event occurs at least once in the audio clip, and labelled negative if that sound event does not occur in the audio clip. For strongly labelled data, the dataset consists of training pairs $\{x, y\}$, where $x$ is the feature of a frame in an audio clip, $y \in \{0, 1\}^K$ is the strong label of the frame, and $K$ denotes the number of sound classes. For weakly labelled data, the features of all frames in an audio clip constitute a bag $B = \{x_t\}_{t=1}^{T}$, where $T$ is the number of frames in the audio clip. The multiple instance assumption states that the weak labels of a bag are $y = \max \{y_t\}_{t=1}^{T}$, where $y_t$ is the strong label of the feature $x_t$. The weakly labelled data consists of the training pairs $\{B, y\}$.

The problem of SED from WLD can now be cast as learning a classifier to predict the labels of the frames $\{y_t\}_{t=1}^{T}$ of a bag $B = \{x_t\}_{t=1}^{T}$. For the general WLD problem, an MIL framework based on a neural network was proposed in [14, 19]. In [14, 20] a support vector machine (SVM) was used to solve MIL as a maximum margin problem. A negative mining method was proposed in [21] that selects negative examples according to an intra-class variance criterion. A concept ranking according to negative exemplars (CRANE) algorithm was proposed in [22]. However, an MIL method tends to underestimate the number of positive instances in an audio clip [23]. Furthermore, the MIL method cannot predict the T-F segmentations from the WLD [14].

B. Convolutional neural networks for audio tagging and weakly supervised sound event detection

Convolutional neural networks (CNNs) have been successfully used in many areas including image classification [24], object detection [6], image segmentation [25], speech recognition [26, 27] and audio classification [28]. In this section we briefly introduce previous work using convolutional neural networks for audio tagging [28] and weakly supervised SED.

Audio tagging [12, 28, 29] aims to predict the presence of sound events in an audio clip. In [30], a mel spectrogram of an audio clip is presented to a CNN, where the filters of each convolutional layer capture local patterns of a spectrogram. After a global pooling layer such as global max pooling [28], global average pooling [31], global weighted rank pooling [23], global attention pooling [32, 33] or other poolings [34, 35], fully connected layers are applied to predict the presence probabilities of audio classes. Fig. 2 shows the framework of audio tagging with a convolutional neural network. However, this CNN only predicts the presence probabilities of the sound events in an audio clip, but not the onset and offset times of the sound events.

Fig. 2. Audio tagging with convolutional neural network. Input log mel spectrogram is presented to a convolutional neural network including convolutional layers, a global pooling layer and fully connected layers to predict the presence probabilities of audio tags.

In [36, 37], a time-distributed CNN with a global max-pooling strategy was proposed to approximate the MIL method to predict the temporal locations of each event. However, the global max-pooling will encourage the model to attend to the most dominant T-F unit contributing to the presence of the sound event and ignore all of the other T-F units. That is, the happening time of the sound events is underestimated. A method for localizing the sound events in an audio clip by splitting the input into several segments based on the CNNs was presented in [38]. It splits an audio clip into several segments with the assumption that parts of the segments correspond to the clip-level labels. This assumption may be unreasonable because some sound events may only occur at certain frames. Recently, an attention-based global pooling strategy using CNNs was proposed to predict the temporal locations [39] for SED using WLD. However, attention-based global pooling can only predict the time domain segmentation, but not the T-F segmentation, which is first addressed in this paper.

III. TIME-FREQUENCY SEGMENTATION, SOUND EVENT DETECTION AND SEPARATION FROM WEAKLY LABELLED DATA

In this section, we present a T-F segmentation, sound event detection and separation framework trained on weakly labelled audio data. Unlike the CNN method for audio tagging, we design a CNN to learn T-F segmentation masks of sound events from the weakly labelled data.

A. Training from weakly labelled data

We use only weakly labelled audio data to train the proposed model. The training stage is shown in Fig. 3. To begin with, the waveform of an audio clip $x$ is converted to an input time-frequency (T-F) representation $X(t, f)$, for example, a spectrogram or log mel spectrogram. To simplify the notation, we abbreviate $X(t, f)$ as $X$. The first part of the training stage is a segmentation mapping $g_1 : X \mapsto h$ which maps the input T-F representation to the T-F segmentation masks $h = [h_1, \ldots, h_K]$, where $K$ is the number of T-F segmentation masks and is equal to the number of sound events. Symbol $h_k$ is the abbreviation of $h_k(t, f)$, which is the T-F segmentation mask of the $k$-th event. Ideally, each T-F segmentation mask $h_k$ is an ideal ratio mask [9] of the $k$-th sound event.

Fig. 3. Training stage using weakly labelled data. A segmentation mapping $g_1$ maps from an input T-F representation to the segmentation masks. A classification mapping $g_2$ maps each segmentation mask to the presence probabilities of the corresponding audio tag.

The second part of the training stage is a classification mapping $g_2 : h_k \mapsto p_k$, $k = 1, \ldots, K$, where $g_2$ maps each T-F segmentation mask to the presence probability of the $k$-th event, denoted as $p_k$. Then the binary crossentropy between the predictions $p_k$, $k = 1, \ldots, K$ and the targets $y_k$, $k = 1, \ldots, K$ is calculated as the loss function:

$$l(p_k, y_k) = -\sum_{k=1}^{K} y_k \log p_k = -\sum_{k=1}^{K} y_k \log g_2\big(g_1(X)_k\big), \tag{1}$$

where $y_k \in \{0, 1\}$, $k = 1, \ldots, K$ is the binary representation of the weak labels. Both $g_1$ and $g_2$ can be modeled by neural networks. The parameters of $g_1$ and $g_2$ can be trained end-to-end from the input T-F representation to the weak labels of an audio clip.

B. Time-frequency segmentation

In the inference step, the input T-F representation of an audio clip is presented to the segmentation mapping $g_1$ to obtain the T-F segmentation masks $h_k$, $k = 1, \ldots, K$. The T-F segmentation masks indicate which T-F units in the T-F representation contribute to the presence of the sound events (top right of Fig. 4). The learned T-F segmentation masks are affected by the classification mapping $g_2$ and will be discussed in Section IV.

Fig. 4. Inference stage. An input T-F representation is presented to the segmentation mapping $g_1$ to obtain the T-F segmentation masks. By averaging out the frequency axis of the T-F segmentation masks and post processing, event-wise predictions of sound events can be obtained.

C. Sound event detection

As T-F segmentation masks $h_k$, $k = 1, \ldots, K$ contain the information about where sound events happen in the T-F domain, the simplest way to obtain the sound event detection score $v_k(t)$ in the time domain is to average out the frequency axis of the T-F segmentation masks (bottom right of Fig. 4):

$$v_k(t) = \frac{1}{F} \sum_{f=1}^{F} h_k(t, f), \tag{2}$$

where $F$ is the number of frequency bins of the segmentation mask $h_k$. Then $v_k(t)$ is the score of the frame-wise prediction of the sound events. We describe how to convert the frame-wise scores to event-wise sound events in Section IV-C.

D. Sound event separation

As a byproduct, the T-F segmentation masks can be used to separate sound events from the mixture in the T-F domain. In addition, by applying an inverse Fourier transform on the separated T-F representation of each sound event, separated waveforms of the sound events can be obtained. Separating sound events from the mixture of sound events and background under a low SNR can improve the recognition of sound events in future work. Fig. 5 shows the pipeline of sound event separation. An audio clip $x$ is presented to the segmentation mapping $g_1$ to obtain T-F segmentation masks. Meanwhile, the complex spectrum $\tilde{X}$ of the audio clip is calculated. We use the tilde to distinguish the complex spectrum $\tilde{X}$ from the input T-F representation $X$, because $X$ might not be a spectrum; it may be, for example, a log mel spectrogram. We interpolate the segmentation masks of the input T-F representation $h_k$, $k = 1, \ldots, K$ to $\tilde{h}_k$, $k = 1, \ldots, K$, representing the T-F segmentation masks of the complex spectrum. The reason for performing this interpolation is that $\tilde{h}_k$ may have a size different from $h_k$; for example, a log mel spectrogram has fewer frequency bins than a linear spectrum. Then we multiply the upsampled T-F segmentation masks $\tilde{h}_k$ with the magnitude of the spectrum to obtain the segmented spectrogram of the $k$-th event:

$$\tilde{Y}_k = \tilde{h}_k \odot |\tilde{X}|, \quad k = 1, \ldots, K, \tag{3}$$

where $\odot$ represents the element-wise multiplication and $\tilde{Y}_k$ represents the segmented spectrogram of the $k$-th event. Finally, an inverse Fourier transform with overlap add [40] is applied on each segmented spectrogram with the phase from $\tilde{X}$ to obtain the separated waveforms $\hat{s}_k$, $k = 1, \ldots, K$:

$$\hat{s}_k = \mathrm{IFFT}\left(\tilde{Y}_k \odot e^{j \angle \tilde{X}}\right). \tag{4}$$

Fig. 5. Sound event separation stage. An input T-F representation is presented to the segmentation mapping $g_1$ to obtain the T-F segmentation masks. The upsampled segmentation masks are multiplied with the magnitude spectrum of the input audio to obtain the segmented spectrogram of each sound event. Separated sound events are obtained by applying an inverse Fourier transform to the segmented spectrogram.

We summarize the training, time-frequency segmentation, sound event detection and separation framework in Fig. 6. The training stage, sound event detection stage and sound event separation stage are shown in the left, middle and right column of Fig. 6, respectively.

Fig. 6. Framework of T-F segmentation, sound event detection and sound event separation from WLD. From left to right: Training from WLD; Sound event detection; Sound event separation.

IV. PROPOSED SEGMENTATION MAPPING AND CLASSIFICATION MAPPING

In this section, we describe the implementation details of the segmentation mapping $g_1$ and the classification mapping $g_2$ proposed in Section III.

A. Segmentation mapping

Segmentation mapping $g_1$ takes a T-F representation of an audio clip as input and outputs segmentation masks of each sound event. We use the log mel spectrogram as the input T-F representation, which has been shown to perform well in audio classification [28, 39, 41]. Ideally, the outputs of $g_1$ are ideal ratio masks (IRMs) [42] of sound events in the T-F domain. The segmentation mapping $g_1$ is modeled by a CNN. Each convolutional layer consists of a linear convolution, a batch normalization (BN) [43] and a ReLU [44] nonlinearity as in [43]. The BN inserted between the convolution and the nonlinearity can stabilize and speed up the training [43]. We do not apply downsampling layers after convolutional layers because we want the T-F segmentation masks to retain the resolution of the input. The T-F segmentation masks are obtained from the activations of the last CNN layer using a sigmoid non-linearity, which constrains the values of the T-F segmentation masks to lie between 0 and 1 so that they are valid values of an IRM. The configuration details of the CNN will be described in Section V-D.

The idea of learning the T-F segmentation masks explicitly is inspired by work on weakly labelled image localization [45] and image segmentation [46, 47]. In weakly labelled image localization, saliency maps are learned indicating the locations of the objects in an image [45]. Similarly, the T-F segmentation masks in our work resemble the saliency maps of an image [45], where T-F segmentation masks indicate at what time and frequency a sound event occurs in a T-F representation.

B. Classification mapping

As described in Section III, the classification mapping $g_2$ maps each segmentation mask $h_k$ to the presence probability of its corresponding sound event. Modeling the classification mapping in different ways will lead to different representations of the segmentation masks (Fig. 7). We explored global max pooling [28], global average pooling [31] and global weighted rank pooling [23] for modeling the classification mapping $g_2$.

1) Global max pooling: Global max pooling (GMP) applied on feature maps has been used in audio tagging [28]. GMP on each T-F segmentation mask $h_k$ is depicted as:

$$F(h_k) = \max_{t, f} h_k(t, f). \tag{5}$$

GMP is based on the assumption that an audio clip contains a sound event if at least one T-F unit of the T-F input representation contains a sound event. GMP is invariant to the location of a sound event in the T-F domain because, whenever a sound event occurs, GMP will only select the maximum value of a T-F segmentation mask, which is robust to the time or frequency shifts of the sound event. However, in the training stage, back propagation will only pass through the maximum value, so only a small part of the data in the T-F domain is used to update the parameters in the neural network. Because of the maximum selection strategy, GMP encourages only one point in a T-F segmentation mask to be positive, so GMP will underestimate [23] the sound events in the T-F representation. Examples of T-F segmentation masks learned using GMP are shown in Fig. 7(c).

2) Global average pooling: Global average pooling (GAP) was first applied in image classification [31]. GAP on each T-F segmentation mask $h_k$ is depicted as:

$$F(h_k) = \frac{1}{TF} \sum_{t=1}^{T} \sum_{f=1}^{F} h_k(t, f). \tag{6}$$

GAP corresponds to the collective assumption in MIL [48], which states that all T-F units in a T-F segmentation mask contribute equally to the label of an audio clip. That is, all T-F units in a T-F segmentation mask are assumed to contain the labelled sound events. However, some sound events only last a short time, so GAP usually overestimates the sound events [31]. Examples of T-F segmentation masks learned using GAP are shown in Fig. 7(d).

3) Global weighted rank pooling: To overcome the limitations of GMP and GAP, which underestimate and overestimate the sound events in the T-F segmentation masks, global weighted rank pooling (GWRP) is proposed in [23]. GWRP can be seen as a generalization of GMP and GAP. The idea of GWRP is to put a descending weight on the values of a T-F segmentation mask sorted in a descending order. Let an index set $I = \{i_1, \ldots, i_M\}$ define the descending order of the values within a T-F segmentation mask $h_k$, i.e. $(h_k)_{i_1} \geq (h_k)_{i_2} \geq \ldots \geq (h_k)_{i_M}$, where $M = T \times F$ is the number of T-F units in a T-F segmentation mask. Then the GWRP is defined as:

$$F(h_k) = \frac{1}{Z(r)} \sum_{j=1}^{M} r^{j-1} (h_k)_{i_j}. \tag{7}$$
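The three classification mappings of Eqs. (5)-(7) can be compared on a toy mask. The following is a minimal pure-Python sketch, not the authors' implementation: the function names and the 2 × 2 example mask are illustrative assumptions, and a mask is represented as a list of rows (time steps) of values in [0, 1]. In the GWRP function, r is the decay hyperparameter with 0 ≤ r ≤ 1 and the weights r^(j-1) are divided by their sum Z(r).

```python
def _flatten(mask):
    """Flatten a T x F mask, given as a list of rows, into one list of values."""
    return [value for row in mask for value in row]


def global_max_pooling(mask):
    """Eq. (5): keep only the largest T-F unit of the mask."""
    return max(_flatten(mask))


def global_average_pooling(mask):
    """Eq. (6): average over all T x F units of the mask."""
    values = _flatten(mask)
    return sum(values) / len(values)


def global_weighted_rank_pooling(mask, r):
    """Eq. (7): weighted sum of the sorted values with weights r**(j - 1) / Z(r)."""
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must lie in [0, 1]")
    values = sorted(_flatten(mask), reverse=True)    # descending order i_1, ..., i_M
    weights = [r ** j for j in range(len(values))]   # r**0, r**1, ..., r**(M - 1)
    z = sum(weights)                                 # normalization term Z(r)
    return sum(w * v for w, v in zip(weights, values)) / z


# Hypothetical 2 x 2 segmentation mask (2 time steps, 2 frequency bins).
toy_mask = [[0.1, 0.9],
            [0.4, 0.2]]

gmp = global_max_pooling(toy_mask)
gap = global_average_pooling(toy_mask)
gwrp = global_weighted_rank_pooling(toy_mask, r=0.5)
```

For this toy mask the GWRP value lies between the GAP and GMP values. Setting r = 0 leaves only the largest value with a non-zero weight, and setting r = 1 weights every T-F unit equally, so the sketch reproduces the two limiting cases of Eq. (7).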
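Once a classification mapping has pooled each mask into a presence probability, the clip-level loss of Eq. (1) only needs the weak labels. The sketch below is a hedged illustration rather than the authors' code. It reads Eq. (1) as the negative sum of y_k log p_k over the K classes. The source describes the loss as a binary crossentropy, so the conventional (1 - y_k) log(1 - p_k) term is offered behind an optional flag. That flag, the clamping constant `eps` and all names are assumptions of this sketch.

```python
import math


def weak_label_loss(probabilities, weak_labels, include_negative_term=False, eps=1e-7):
    """Clip-level loss between pooled probabilities p_k and weak labels y_k."""
    if len(probabilities) != len(weak_labels):
        raise ValueError("one probability and one label are needed per class")
    loss = 0.0
    for p, y in zip(probabilities, weak_labels):
        p = min(max(p, eps), 1.0 - eps)   # assumed clamp that keeps log() finite
        loss -= y * math.log(p)           # the y_k log p_k term of Eq. (1)
        if include_negative_term:
            # Assumption: conventional binary cross-entropy term, not printed in Eq. (1).
            loss -= (1 - y) * math.log(1.0 - p)
    return loss


pooled = [0.9, 0.2, 0.6]   # hypothetical outputs of the classification mapping
labels = [1, 0, 1]         # weak labels: the first and third classes are present

loss_eq1 = weak_label_loss(pooled, labels)
loss_bce = weak_label_loss(pooled, labels, include_negative_term=True)
```

Raising the predicted probability of a class that is present lowers the loss, which is the signal that reaches the segmentation mapping through the pooling step during end-to-end training.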
Fig. 7. (a) Spectrogram of an audio clip containing “scissors”, “computer keyboard” and “tambourine” (plotted in log scale); (b) Log mel spectrogram of the audio clip; (c) Upsampled T-F segmentation masks h of sound events learned using global max pooling (GMP). Only a few T-F units have high value and the other parts of the T-F segmentation masks are dark; (d) Upsampled T-F segmentation masks h of sound events learned using global average pooling (GAP); (e) Upsampled T-F segmentation masks h of sound events learned using global weighted rank pooling (GWRP); (f) Ideal ratio mask (IRM) of sound events. Only 6 out of 41 T-F segmentation masks are plotted due to the limited space.

In Eq. (7), $0 \leq r \leq 1$ is a hyperparameter and $Z(r) = \sum_{j=1}^{M} r^{j-1}$ is a normalization term. When $r = 0$, GWRP becomes GMP, and when $r = 1$, GWRP becomes GAP. The hyperparameter $r$ can vary depending on the frequency of occurrence of the sound events. GWRP attends more to the T-F units of high values in a T-F segmentation mask and less to those of low values. The T-F segmentation masks learned using GWRP are shown in Fig. 7(e). The ideal ratio masks (IRMs) of the sound events are plotted in Fig. 7(f) for comparison with GMP, GAP and GWRP.

C. Post-processing for sound event detection

In Section III-C we mentioned that the frame-wise scores $v_k(t)$ can be obtained from the T-F segmentation masks using Equation (2). To reduce the number of false alarms, for an audio clip, we only apply sound event detection on the sound classes with positive audio tagging predictions. Then we apply thresholds on the frame-wise predictions $v_k(t)$ to obtain the event-wise predictions. We apply a high threshold of 0.2 to detect the presence of sound events and then extend the boundary on both the onset and offset sides until the frame-wise scores drop below a low threshold of 0.1. This two-step threshold method produces smooth predictions of sound events. As the duration of sound events in DCASE 2018 Task 2 varies from 300 ms to 30 s, we remove the detected sound events that are shorter than 320 ms (10 frames) to reduce false alarms and join the sound events whose silence gap is shorter than 320 ms (10 frames).

V. EXPERIMENTS

A. Dataset

We mix the DCASE 2018 Task 1 acoustic scene dataset [49] with the DCASE 2018 Task 2 general-purpose Freesound dataset [50] under different signal-to-noise ratios (SNRs) to evaluate the proposed methods. The reason for this choice is that DCASE 2018 Task 1 provides background sounds recorded from a variety of real world scenes, whereas DCASE 2018 Task 2 provides a variety of foreground sound events. The DCASE 2018 Task 1 contains 8640 10-second audio clips in the development set of subtask A. The audio clips are recorded from 10 different scenes such as “airport”, “metro station” and “urban park”. The DCASE 2018 Task 2 contains 3710 manually verified sound events ranging in length from 300 ms to 30 s depending on the audio classes. There are 41 classes of sound events such as “flute”, “applause” and “cough”. We only use these manually verified audio clips from the DCASE 2018 Task 2 as sound events because the remaining audio clips are unverified and may contain noisy labels. We truncated the sound events to up to 2 seconds and mixed them with the 10-second audio clips from the DCASE 2018 Task 1 acoustic scene dataset. The mixed audio clips are single channel with a sampling rate of 32 kHz. Each mixed audio clip contains three non-overlapped sound events. We mixed the sound events with the acoustic scenes at SNRs of 20 dB, 10 dB and 0 dB. For each SNR, the 8000 mixed audio clips are divided into 4 cross-validation folds. Fig. 7(b) shows the log mel spectrogram of a mixed 10-second audio clip. The source code of our work is released at https://github.com/qiuqiangkong/sed_time_freq_segmentation.

B. Evaluation metrics

We use F-score [51], area under the curve (AUC) [52] and mean average precision (mAP) [6] in the evaluation of the audio tagging, the frame-wise SED and the T-F segmentation. We also use error rate (ER) for evaluating the event-wise SED.

1) Basic statistics: True positive (TP): Both the reference and the system prediction indicate an event to be active. False negative (FN): The reference indicates an event to be active but the system prediction indicates an event to be inactive. False positive (FP): The system prediction indicates an event to be active but the reference indicates it is not [51].

2) Precision, recall and F-score: Precision (P) and recall (R) are defined as [51]:

$$P = \frac{TP}{TP + FP}, \quad R = \frac{TP}{TP + FN}. \tag{8}$$

Bigger P and R indicate better performance. F-score is calculated based on P and R [51]:

$$F = \frac{2 P \cdot R}{P + R} = \frac{TP}{TP + (FN + FP)/2}. \tag{9}$$

Bigger F-score indicates better performance.

3) Area under the curve (AUC): A receiver operating characteristic (ROC) curve [52] plots true positive rate (TPR) versus false positive rate (FPR). The area under the curve (AUC) score is the area under this ROC curve, which summarizes the ROC curve in a single number. Using the AUC does not require manual selection of a threshold. Bigger AUC indicates better performance. A random guess has an AUC of 0.5.

4) Average precision: Average precision (AP) is the average of the precision at different recall values. Similar to AUC, AP does not rely on the threshold. Unlike AUC, AP does not count the true negatives and is widely used as a criterion for imbalanced datasets such as those in object detection [6].

5) Error rate: Error rate (ER) is an event-wise evaluation metric. ER measures the amount of errors in terms of insertions (I), deletions (D) and substitutions (S) [51]. For an audio clip, the insertions, deletions and substitutions are defined as:

$$S = \min(FN, FP), \quad D = \max(0, FN - FP), \quad I = \max(0, FP - FN), \tag{10}$$

where FN and FP are event-wise statistics in an audio clip. Lower ER, S, D and I indicate better performance. When evaluating the event based criterion, we allow some degree of misalignment between a reference and a system output for counting a true positive [12, 51, 53]. Following the default configuration of [51], we adopt an onset collar of 200 ms and an offset collar of 200 ms / 50% to count the true positive of a detection. We used the toolbox [51] for evaluating the performance of the event-based SED.

C. Feature extraction

We apply a fast Fourier transform (FFT) with a window size of 2048 and an overlap of 1024 between neighbouring windows to extract the spectrogram of audio clips. This configuration, which follows [54], offers a good resolution in both the time and frequency domains. Then mel filter banks with 64 bands are applied on the spectrogram, followed by a logarithm operation, to obtain the log mel spectrogram as the input T-F representation feature. Log mel spectrogram has been widely used in audio classification [28, 54].

D. Model

In this subsection we give a detailed description of the configuration of the segmentation mapping in Section IV-A and the classification mapping in Section IV-B. We apply a “VGG-like” convolutional neural network [56] with 8 convolutional layers on the input log mel spectrogram [54]. Each convolutional layer consists of a linear convolution with a filter size of 3 × 3 followed by a batch normalization layer [43] and a ReLU activation function [44]. We use 4 convolutional blocks following the baseline system of DCASE 2018 [54]. The numbers of feature maps of the convolutional layers in the four blocks are 32, 64, 128 and 128, respectively. This configuration is chosen so that the model fits on a single GPU card with 12 GB of RAM. Then a 1 × 1 convolutional layer with sigmoid non-linearity is applied to convert the feature maps to the T-F segmentation masks of sound events. Then a global pooling is used to summarize each T-F segmentation mask to a scalar representing the presence probability of the sound events in an audio clip. We summarize the configuration of the neural network in Table I. In training we use a mini-batch size of 24 to fully utilize the single GPU card with 12 GB of RAM. The Adam optimizer [57] with a learning rate of 0.001 is used for its fast convergence.

TABLE I
CONFIGURATION OF CNN.

Layers                          Output size (feature maps × time steps × mel bins)
Input log mel spectrogram       1 × 311 × 64
{3 × 3, 32, BN, ReLU} × 2       32 × 311 × 64
{3 × 3, 64, BN, ReLU} × 2       64 × 311 × 64
{3 × 3, 128, BN, ReLU} × 2      128 × 311 × 64
{3 × 3, 128, BN, ReLU} × 2      128 × 311 × 64
1 × 1, 41, sigmoid              41 × 311 × 64
Global pooling (GP)             41

TABLE II
F1-SCORE, AUC AND MAP OF AUDIO TAGGING AT DIFFERENT SNRS.

                    20 dB                  10 dB                  0 dB
Algorithms          F1     AUC    mAP     F1     AUC    mAP     F1     AUC    mAP
DNN [55]            0.439  0.885  0.468   0.396  0.861  0.402   0.331  0.810  0.314
WLD CNN [37]        0.498  0.777  0.498   0.524  0.794  0.526   0.528  0.815  0.535
FrameCNN [34]       0.581  0.899  0.587   0.543  0.883  0.526   0.484  0.850  0.439
Attention [39]      0.714  0.922  0.755   0.690  0.907  0.729   0.612  0.875  0.643
GMP                 0.435  0.818  0.475   0.406  0.801  0.440   0.373  0.773  0.389
GAP                 0.529  0.934  0.623   0.467  0.914  0.555   0.385  0.877  0.442
GWRP                0.635  0.955  0.753   0.604  0.942  0.696   0.534  0.915  0.596

E. Audio tagging

We compare our method with a fully connected neural network [55], a CNN trained on weakly labelled data [37], FrameCNN [34] and the attention model [39]. We apply GMP, GAP and GWRP as global pooling in our model. Table II shows that for an SNR of 20 dB, the attention model [39] achieves the best F1-score of 0.714 and mAP of 0.755, followed by GWRP with 0.635 and 0.753, respectively. On the other hand, GWRP achieves the best AUC of 0.955. Comparing the performance under different SNRs, the F1-score and mAP drop by approximately 0.1 in absolute value when the SNR changes from 20 dB to 0 dB. AUC drops approximately 0.04 in absolute

TABLE III
F1-SCORE OF AUDIO TAGGING AT 0 DB SNR.

Class            DNN [55]  WLD CNN [37]  FrameCNN [34]  Attention [39]  GMP    GAP    GWRP
Acous. guitar    0.286     0.633         0.416          0.548           0.458  0.547  0.552
Applause         0.873     0.896         0.878          0.893           0.522  0.817  0.825
Bark             0.332     0.719         0.719          0.761           0.335  0.409  0.654
Bass drum        0.041     0.547         0.166          0.632           0.183  0.070  0.204
Burping          0.344     0.794         0.557          0.866           0.400  0.484  0.578
Bus              0.367     0.248         0.385          0.335           0.087  0.205  0.342
Cello            0.489     0.610         0.529          0.616           0.299  0.435  0.416
Chime            0.546     0.589         0.562          0.607           0.468  0.501  0.628
Clarinet         0.423     0.504         0.448          0.568           0.424  0.354  0.424
Keyboard         0.283     0.390         0.507          0.497           0.422  0.504  0.573
Cough            0.075     0.513         0.484          0.565           0.151  0.347  0.543
Cowbell          0.133     0.889         0.668          0.924           0.774  0.314  0.579
Double bass      0.197     0.436         0.314          0.477           0.281  0.181  0.333
Drawer           0.083     0.136         0.181          0.160           0.076  0.164  0.320
Elec. piano      0.304     0.435         0.392          0.546           0.279  0.218  0.421
Fart             0.267     0.384         0.304          0.598           0.284  0.407  0.618
Finger snap      0.389     0.672         0.556          0.823           0.176  0.399  0.473
Fireworks        0.285     0.375         0.474          0.463           0.271  0.346  0.558
Flute            0.350     0.270         0.385          0.565           0.315  0.343  0.427
Glockenspiel     0.464     0.692         0.488          0.901           0.844  0.496  0.726
Gong             0.310     0.513         0.465          0.617           0.434  0.305  0.550
Gunshot          0.297     0.538         0.424          0.607           0.398  0.438  0.523
Harmonica        0.672     0.742         0.723          0.759           0.322  0.681  0.714
Hi-hat           0.547     0.910         0.688          0.938           0.796  0.641  0.798
Keys             0.418     0.643         0.660          0.744           0.141  0.392  0.606
Knock            0.276     0.649         0.553          0.738           0.483  0.402  0.524
Laughter         0.192     0.361         0.390          0.444           0.311  0.480  0.563
Meow             0.075     0.359         0.355          0.499           0.275  0.203  0.547
Microwave        0.121     0.263         0.400          0.441           0.207  0.172  0.353
Oboe             0.408     0.589         0.490          0.560           0.442  0.372  0.487
Saxophone        0.500     0.636         0.528          0.678           0.474  0.408  0.534
Scissors         0.411     0.558         0.497          0.660           0.173  0.404  0.452
Shatter          0.336     0.410         0.481          0.693           0.251  0.392  0.653
Snare drum       0.368     0.599         0.624          0.709           0.465  0.335  0.585
Squeak           0.097     0.052         0.193          0.113           0.031  0.161  0.260
Tambourine       0.299     0.593         0.733          0.957           0.891  0.412  0.857
Tearing          0.254     0.436         0.449          0.593           0.504  0.348  0.583
Telephone        0.270     0.324         0.346          0.434           0.329  0.341  0.508
Trumpet          0.528     0.642         0.526          0.368           0.585  0.579  0.639
Violin           0.379     0.755         0.475          0.784           0.567  0.349  0.516
Writing          0.293     0.349         0.431          0.400           0.175  0.408  0.452
Avg.             0.331     0.528         0.484          0.612           0.373  0.385  0.534

TABLE IV
F1-SCORE OF FRAME-WISE SED AT 0 DB SNR.

Acous. Appla- Bark Bass Burp- Bus Cello Chime Clari- Keybo- Cough Cow- Double Drawer Elec.
Fart Finger Fire- Flute Glock- Gong guitar use drum ing net ard bell bass piano snap works enspiel DNN [55] 0.191 0.746 0.239 0.009 0.317 0.306 0.373 0.495 0.295 0.202 0.036 0.050 0.123 0.038 0.233 0.207 0.156 0.195 0.214 0.291 0.212 WLD CNN [37] 0.113 0.466 0.159 0.052 0.292 0.044 0.318 0.298 0.223 0.100 0.142 0.111 0.097 0.020 0.078 0.078 0.085 0.085 0.042 0.095 0.037 FrameCNN [34] 0.294 0.741 0.585 0.07 0.411 0.299 0.441 0.480 0.342 0.421 0.370 0.283 0.178 0.102 0.310 0.239 0.236 0.325 0.246 0.315 0.308 Attention [39] 0.062 0.422 0.069 0.020 0.189 0.024 0.242 0.263 0.210 0.019 0.059 0.051 0.045 0.003 0.068 0.050 0.076 0.031 0.159 0.026 0.088 GMP 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 GAP 0.410 0.661 0.338 0.033 0.341 0.139 0.240 0.429 0.195 0.426 0.269 0.121 0.088 0.108 0.170 0.297 0.102 0.229 0.173 0.214 0.200 GWRP 0.453 0.704 0.507 0.072 0.456 0.188 0.326 0.575 0.341 0.457 0.402 0.222 0.193 0.172 0.351 0.498 0.247 0.355 0.316 0.596 0.418 Gunshot Harmo- Hi- Keys Knock Laugh- Meow Micro- Oboe Saxo- Sciss- Shatter Snare Squeak Tambo- Tear- Tele- Trumpet Violin Writ- Avg. 
nica hat ter wave phone ors drum urine ing phone ing DNN [55] 0.155 0.594 0.510 0.367 0.16 0.111 0.022 0.095 0.314 0.317 0.277 0.254 0.290 0.045 0.166 0.144 0.190 0.411 0.166 0.212 0.237 WLD CNN [37] 0.093 0.333 0.135 0.160 0.149 0.086 0.056 0.058 0.132 0.234 0.150 0.075 0.141 0.003 0.195 0.055 0.123 0.287 0.258 0.067 0.140 FrameCNN [34] 0.259 0.595 0.639 0.495 0.354 0.271 0.228 0.284 0.399 0.329 0.379 0.364 0.453 0.111 0.443 0.277 0.237 0.407 0.228 0.299 0.343 Attention [39] 0.029 0.143 0.107 0.096 0.101 0.051 0.034 0.018 0.137 0.353 0.078 0.038 0.054 0.005 0.188 0.046 0.148 0.08 0.156 0.056 0.100 GMP 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 GAP 0.231 0.528 0.553 0.332 0.233 0.294 0.133 0.121 0.237 0.167 0.146 0.319 0.265 0.114 0.191 0.274 0.172 0.437 0.105 0.313 0.252 GWRP 0.362 0.649 0.696 0.539 0.354 0.429 0.400 0.182 0.404 0.440 0.384 0.471 0.373 0.173 0.591 0.378 0.420 0.528 0.331 0.360 0.398 TABLE V “tambourine” have higher classification accuracy while some F1- SCORE, AUC AND MAP OF FRAME- WISE SED AT DIFFERENT SNRS. sound events such as “microwave” and “squeak” are difficult to recognize. On average, the attention model [39] achieves 20 dB 10 dB 0 dB the best F1-score of 0.612 followed by GWRP of 0.534. Algorithms F1 AUC mAP F1 AUC mAP F1 AUC mAP DNN [55] 0.360 0.722 0.269 0.306 0.702 0.224 0.237 0.666 0.169 F. Frame-wise sound event detection WLD CNN [37] 0.168 0.669 0.179 0.182 0.688 0.201 0.140 0.701 0.166 FrameCNN [34] 0.440 0.808 0.369 0.399 0.787 0.329 0.343 0.756 0.275 Attention [39] 0.163 0.827 0.317 0.137 0.807 0.278 0.100 0.773 0.221 Table IV shows the F1-score of the frame-wise SED for GMP 0.000 0.676 0.090 0.000 0.658 0.076 0.000 0.649 0.072 all sound classes under SNR of 0 dB. 
GWRP achieves the GAP 0.398 0.790 0.400 0.334 0.753 0.328 0.252 0.712 0.245 GWRP 0.511 0.886 0.508 0.472 0.871 0.453 0.398 0.829 0.360 best averaged F1-score of 0.398, followed by the FrameCNN model [34] of 0.343. Some classes such as “applause” and “hi-hat” have higher F1-score by the frame-wise SED, while TABLE VI F1- SCORE, AUC AND MAP OF EVENT- WISE SED AT DIFFERENT SNRS. some classes such as “drawer” and squeak” have lower F1- score by the frame-wise SED. Table V shows the frame-wise 20 dB 10 dB 0 dB SED results under different SNRs. GWRP achieves the best Algorithms F1 ER D I F1 ER D I F1 ER D I F1-score, AUC and mAP of 0.511, 0.886 and 0.508 under DNN [55] 0.226 1.91 0.75 1.16 0.178 2.29 0.79 1.50 0.120 2.80 0.84 1.96 20 dB SNR. The FrameCNN model [34] achieves a second WLD CNN [37] 0.010 1.16 0.99 0.17 0.011 1.15 0.99 0.17 0.018 1.12 0.99 0.13 FrameCNN [34] 0.166 2.38 0.79 1.58 0.151 2.49 0.81 1.68 0.141 2.70 0.81 1.88 place with an F1-score of 0.440. GAP overestimates the sound Attention [39] 0.028 1.10 0.96 0.14 0.021 1.10 0.97 0.13 0.011 1.09 0.98 0.10 GMP 0.000 1.00 1.00 0.00 0.000 1.00 1.00 0.00 0.000 1.00 1.00 0.00 events which is shown in the visualization of the upsampled T- GAP 0.173 2.71 0.78 1.93 0.139 2.95 0.82 2.13 0.098 3.52 0.86 2.66 F segmentation masks (Fig. 7). GAP does not perform better GWRP 0.254 2.12 0.66 1.45 0.227 2.30 0.69 1.61 0.167 2.55 0.76 1.78 than GWRP. GMP underestimates the sound events (Fig. 7) and performs worst in frame-wise SED. In GWRP, the F1- value for SNR changed from 20 dB to 0 dB. This result shows score drops from 0.511 to 0.472 to 0.398 under SNRs of 20 that there is a large variance in audio tagging under low SNR. dB, 10 dB and 0 dB. Fig. 8 shows the frame-wise scores Table III shows the audio tagging results of all sound events of sound events obtained from equation (2) under SNR of under 0 dB SNR. Some sound events such as “hi-hat” and 0 dB. 
Frame-wise scores obtained by using GWRP looks 9 TABLE VII F1- SCORE OF EVENT- WISE SED AT 0 DB SNR. Acous. Appla- Bark Bass Burp- Bus Cello Chime Clari- Keybo- Cough Cow- Double Drawer Elec. Fart Finger Fire- Flute Glock- Gong guitar use drum ing net ard bell bass piano snap works enspiel DNN [55] 0.132 0.287 0.083 0.002 0.176 0.233 0.125 0.389 0.041 0.141 0.033 0.007 0.068 0.036 0.141 0.113 0.035 0.113 0.036 0.079 0.159 WLD CNN [37] 0.020 0.013 0.001 0.001 0.110 0.001 0.036 0.067 0.025 0.005 0.001 0.001 0.003 0.001 0.001 0.001 0.001 0.001 0.001 0.001 0.001 FrameCNN [34] 0.098 0.510 0.187 0.012 0.090 0.180 0.287 0.186 0.157 0.194 0.144 0.005 0.04 0.091 0.168 0.163 0.042 0.133 0.081 0.265 0.098 Attention [39] 0.000 0.051 0.000 0.000 0.052 0.000 0.020 0.018 0.000 0.000 0.000 0.000 0.009 0.000 0.000 0.000 0.000 0.000 0.019 0.000 0.000 GMP 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 GAP 0.060 0.416 0.141 0.000 0.150 0.035 0.123 0.178 0.085 0.296 0.107 0.016 0.033 0.070 0.025 0.197 0.003 0.078 0.054 0.000 0.032 GWRP 0.131 0.225 0.315 0.002 0.352 0.030 0.086 0.363 0.086 0.211 0.228 0.010 0.089 0.111 0.153 0.312 0.068 0.144 0.060 0.004 0.281 Gunshot Harmo- Hi- Keys Knock Laugh- Meow Micro- Oboe Saxo- Sciss- Shatter Snare Squeak Tambo- Tear- Tele- Trumpet Violin Writ- Avg. 
nica hat ter wave phone ors drum urine ing phone ing DNN [55] 0.073 0.455 0.205 0.262 0.095 0.054 0.024 0.047 0.135 0.107 0.128 0.174 0.106 0.020 0.057 0.088 0.100 0.140 0.031 0.173 0.120 WLD CNN [37] 0.001 0.153 0.001 0.003 0.008 0.003 0.001 0.001 0.007 0.043 0.005 0.001 0.011 0.001 0.001 0.001 0.073 0.077 0.063 0.001 0.018 FrameCNN [34] 0.044 0.226 0.409 0.142 0.071 0.113 0.140 0.120 0.223 0.077 0.140 0.134 0.132 0.042 0.031 0.104 0.071 0.241 0.052 0.124 0.141 Attention [39] 0.000 0.000 0.000 0.000 0.106 0.000 0.000 0.000 0.000 0.188 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.011 GMP 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 0.000 GAP 0.035 0.440 0.002 0.103 0.095 0.152 0.062 0.064 0.153 0.024 0.041 0.024 0.075 0.078 0.003 0.078 0.038 0.257 0.013 0.184 0.098 GWRP 0.078 0.519 0.031 0.356 0.206 0.205 0.269 0.118 0.252 0.165 0.167 0.130 0.065 0.067 0.065 0.146 0.238 0.175 0.105 0.243 0.167 TABLE VIII F1- SCORE OF TIME- FREQUENCY SEGMENTATION AT 0 DB SNR. Acous. Appla- Bark Bass Burp- Bus Cello Chime Clari- Keybo- Cough Cow- Double Drawer Elec. Fart Finger Fire- Flute Glock- Gong guitar use drum ing net ard bell bass piano snap works enspiel GMP 0.000 0.001 0.001 0.000 0.002 0.000 0.003 0.002 0.002 0.002 0.000 0.005 0.001 0.000 0.001 0.000 0.000 0.001 0.001 0.002 0.002 GAP 0.128 0.391 0.106 0.009 0.155 0.073 0.124 0.187 0.057 0.201 0.143 0.038 0.044 0.068 0.067 0.126 0.029 0.119 0.052 0.081 0.116 GWRP 0.222 0.519 0.226 0.030 0.291 0.095 0.213 0.313 0.114 0.303 0.241 0.125 0.086 0.100 0.127 0.256 0.092 0.204 0.104 0.212 0.237 Gunshot Harmo- Hi- Keys Knock Laugh- Meow Micro- Oboe Saxo- Sciss- Shatter Snare Squeak Tambo- Tear- Tele- Trumpet Violin Writ- Avg. 
nica hat ter wave phone ors drum urine ing phone ing GMP 0.001 0.002 0.001 0.002 0.001 0.001 0.000 0.000 0.001 0.001 0.000 0.000 0.002 0.000 0.001 0.001 0.001 0.002 0.003 0.001 0.001 GAP 0.139 0.264 0.212 0.139 0.074 0.135 0.085 0.055 0.077 0.120 0.085 0.144 0.108 0.082 0.057 0.140 0.059 0.166 0.074 0.130 0.114 GWRP 0.283 0.379 0.497 0.311 0.190 0.249 0.185 0.085 0.140 0.257 0.213 0.272 0.196 0.108 0.327 0.237 0.138 0.313 0.222 0.215 0.218 TABLE IX F1- SCORE, AUC AND MAP OF TIME-FREQUENCY SEGMENTATION AT DIFFERENT SNR S. 20 dB 10 dB 0 dB Algorithms F1 AUC mAP F1 AUC mAP F1 AUC mAP GMP 0.001 0.347 0.008 0.001 0.345 0.007 0.001 0.362 0.005 GAP 0.215 0.889 0.230 0.168 0.880 0.187 0.114 0.861 0.143 GWRP 0.324 0.849 0.268 0.280 0.845 0.227 0.218 0.836 0.175 closer to the ground truth than obtained using GMP and GAP. Compared with event-wise SED, frame-wise SED does not depend on post-processing. G. Event-wise sound event detection Although frame-wise SED does not depend on post- processing so is a more objective criterion, it makes more sense to have event-wise predictions. The event-wise pre- dictions are obtained from frame-wise predictions following Section IV-C. Table VI shows that the GWRP achieves the best F1-score of 0.254 in event-wise SED. Although GMP seems to achieve the lowest ER of 1.00, GMP deletes all the Fig. 8. Frame-wise predictions using GMP, GAP, GWRP with SNR at 0 dB. events and has a deletion error of 1.00 and an insertion of 0. The ground truth annotation is shown in the bottom right. On the other hand, GWRP has the lowest deletion error of 0.66 and has an insertion error of 1.45. The F1-scores drop H. Time-frequency segmentation from 0.254 to 0.227 to 0.167 under SNRs of 20 dB, 10 dB and 0 dB. Table VII shows the the F1-score of event-wise Table VIII shows the T-F segmentation results of all sound SED of all sound classes. Some sound classes such as “barks”, classes under 0 dB. 
As the T-F segmentation can not be “harmonica” have higher detection F1-score. GWRP achieves obtained by previous works including the fully connected the best averaged F1-score of 0.167. neural network [55], the CNN trained on weakly labelled data 10 [37], the FrameCNN [34] and the attention model [39], we [6] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In only report the T-F segmentation results with our proposed Proceedings of the IEEE Conference on Computer Vision and Pattern methods. GWRP achieves the best F1-score of 0.218 on Recognition (CVPR), pages 580–587, 2014. average. Table IX shows the T-F segmentation results under [7] A. Borji, M. Cheng, H. Jiang, and J. Li. Salient object detection: A benchmark. IEEE Transactions on Image Processing, 24(12):5706– different SNRs. Table IX shows that GWRP achieves the best 5722, 2015. F1-score, AUC and mAP of 0.324, 0.849 and 0.268 under 20 [8] Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George dB SNR, respectively. GMP underestimates the T-F segmenta- Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. YouTube-8M: A large-scale video classification benchmark. arXiv tion masks and performs the worst in T-F segmentation. GAP preprint arXiv:1609.08675, 2016. overestimates the T-F segmentation masks and performs worse [9] A. Narayanan and D. Wang. Ideal ratio mask estimation using deep than GWRP in F1-score. The T-F segmentation masks learned neural networks for robust speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal by GWRP (Fig. 7(e)) looks closer to the IRM than the T-F Processing (ICASSP), pages 7092–7096, 2013. segmentation masks learned by using GMP and GAP. [10] D. Stowell, D. Giannoulis, E. Benetos, M. Lagrange, and M. D. Plumbley. Detection and classification of acoustic scenes and events. VI. 
CONCLUSION IEEE Transactions on Multimedia, 17(10):1733–1746, 2015. [11] G. Parascandolo, H. Huttunen, and T. Virtanen. Recurrent neural This paper proposes a time-frequency (T-F) segmentation, networks for polyphonic sound event detection in real life recordings. In sound event detection and separation framework trained on Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6440–6444, 2016. weakly labelled data. In training, a segmentation mapping and [12] A. Mesaros, T. Heittola, and T. Virtanen. TUT database for acoustic a classification mapping are trained jointly using the weakly scene classification and sound event detection. In Proceedings of the labelled data. In T-F segmentation, we use the trained seg- 24th European Signal Processing Conference (EUSIPCO), pages 1128– 1132, 2016. mentation mapping to calculate the T-F segmentation masks. [13] D. Giannoulis, E. Benetos, D. Stowell, M. Rossignol, M. Lagrange, and Detected sound events can then be obtained from the T-F M. D. Plumbley. Detection and classification of acoustic scenes and segmentation masks. As a byproduct, separated waveforms events: an IEEE AASP challenge. In Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), of sound events can be obtained from the T-F segmentation masks. Experiments show that the global weighted rank pool- [14] A. Kumar and B. Raj. Audio event detection using weakly labeled ing (GWRP) outperforms the global max pooling, the global data. In Proceedings of the 2016 ACM on Multimedia Conference, pages 1038–1047, 2016. average pooling and previously proposed systems in both of [15] S. Adavanne and T. Virtanen. Sound event detection using weakly T-F segmentation and sound event detection. The limitation labeled dataset with stacked convolutional and recurrent neural network. 
of this approach is that the T-F segmentation masks are not Technical report, DCASE2017 Challenge, September 2017. perfectly matching the ideal ratio mask (IRM) of the sound [16] Qiuqiang Kong, Yong Xu, Wenwu Wang, and Mark D Plumbley. A joint separation-classification model for sound event detection of weakly events. In future, we will improve the T-F segmentation masks labelled data. In Proceedings of the IEEE International Conference to match the IRM for event separation. on Acoustics, Speech and Signal Processing (ICASSP), pages 321–325, [17] O. Maron and T. Lozano-Pérez. A framework for multiple-instance ACKNOWLEDGMENT learning. In Proceedings of the Advances in Neural Information This research was supported by EPSRC grant Processing Systems (NIPS), volume 10, pages 570–576, 1998. EP/N014111/1 “Making Sense of Sounds” and a Research [18] Thomas G Dietterich, Richard H Lathrop, and Tomás Lozano-Pérez. Solving the multiple instance problem with axis-parallel rectangles. Scholarship from the China Scholarship Council (CSC) No. Artificial intelligence, 89(1-2):31–71, 1997. 201406150082. Iwona Sobieraj is sponsored by the European [19] Z. Zhou and M. Zhang. Neural networks for multi-instance learning. In Union’s H2020 Framework Programme (H2020-MSCA-ITN- Proceedings of the International Conference on Intelligent Information Technology (ICIIT), pages 455–459, 2002. 2014) under grant agreement No. 642685 MacSeNet. The [20] S. Andrews, I. Tsochantaridis, and T. Hofmann. Support vector machines authors thank Dominic Ward for helping to improve the paper for multiple-instance learning. In Proceedings of the Advances in Neural in the early stage. The authors thank all anonymous reviewers Information Processing Systems (NIPS), volume 15, pages 577–584, for their effort and suggestions to improve this paper. [21] P. Siva, C. Russell, and T. Xiang. In defence of negative mining for annotating weakly labelled data. 
In Proceedings of the European REFERENCES Conference on Computer Vision (ECCV), pages 594–608, 2012. [22] K. Tang, R. Sukthankar, J. Yagnik, and L. Fei-Fei. Discriminative [1] J. Saraswathy, M. Hariharan, S. Yaacob, and W. Khairunizam. Automatic segment annotation in weakly labeled video. In Proceedings of the classification of infant cry: A review. In Proceedings of the International IEEE Conference on Computer Vision and Pattern Recognition (CVPR), Conference on Biomedical Engineering (ICoBE), pages 543–548, 2012. pages 2483–2490, 2013. [2] A. Harma, M. F. McKinney, and J. Skowronek. Automatic surveillance [23] A. Kolesnikov and C. H. Lampert. Seed, expand and constrain: Three of the acoustic activity in our living environment. In Proceedins of the principles for weakly-supervised image segmentation. In Proceedings of IEEE International Conference on Multimedia and Expo (ICME), pages the European Conference on Computer Vision (ECCV), pages 695–711, 634–637, 2005. [3] D. P. W. Ellis. Detecting alarm sounds. In Proceedings of the Consistent [24] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification & Reliable Acoustic Cues for Sound Analysis Workshop (CRAC ’01), with deep convolutional neural networks. In Proceedings of the Ad- pages 59–62, 2001. vances in Neural Information Processing Systems (NIPS), volume 25, [4] G. Valenzise, L. Gerosa, M. Tagliasacchi, F. Antonacci, and A. Sarti. pages 1097–1105, 2012. Scream and gunshot detection and localization for audio-surveillance [25] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks systems. In Proceedings of the IEEE Conference on Advanced Video for semantic segmentation. In Proceedings of the IEEE Conference on and Signal Based Surveillance (AVSS), pages 21–26, 2007. [5] Z. Xu, Y. Yang, and A. G. Hauptmann. A discriminative CNN Computer Vision and Pattern Recognition (CVPR), pages 3431–3440, video representation for event detection. In Proceedings of the IEEE 2015. 
Conference on Computer Vision and Pattern Recognition (CVPR), pages [26] G. E. Dahl, D. Yu, L. Deng, and A. Acero. Context-dependent 1798–1807, 2015. pre-trained deep neural networks for large-vocabulary speech recogni- 11 tion. IEEE Transactions on Audio, Speech, and Language Processing, Proceedings of the International Conference on Learning Representa- 20(1):30–42, 2012. tions (ICLR), 2014. [27] O. Abdel-Hamid, A. Mohamed, H. Jiang, L. Deng, G. Penn, and D Yu. [47] D. Pathak, P. Krahenbuhl, and T. Darrell. Constrained convolutional Convolutional neural networks for speech recognition. IEEE/ACM neural networks for weakly supervised segmentation. In Proceedings of Transactions on Audio, Speech, and Language Processing, 22(10):1533– the IEEE International Conference on Computer Vision (ICCV), pages 1545, 2014. 1796–1804, 2015. [48] Jaume Amores. Multiple instance classification: Review, taxonomy and [28] K. Choi, G. Fazekas, and M. Sandler. Automatic tagging using deep comparative study. Artificial Intelligence, 201:81–105, 2013. convolutional neural networks. In Proceedings of the 17th International [49] Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. A multi- Conference on Music Information Retrieval (ISMIR), pages 805–811, device dataset for urban acoustic scene classification. arXiv preprint arXiv:1807.09840, 2018. [29] P. Foster, S. Sigtia, S. Krstulovic, J. Barker, and M. D. Plumbley. [50] Eduardo Fonseca, Manoj Plakal, Frederic Font, Daniel P. W. Ellis, CHiME-home: A dataset for sound source recognition in a domestic Xavier Favory, Jordi Pons, and Xavier Serra. General-purpose tagging environment. In IEEE Workshop on Applications of Signal Processing of freesound audio with audioset labels: Task description, dataset, and to Audio and Acoustics (WASPAA), 2015. baseline. arXiv preprint arXiv:1807.09902, 2018. [30] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional [51] A. Mesaros, T. Heittola, and T. Virtanen. 
Metrics for polyphonic sound networks. In Proceedings of the European Conference on Computer event detection. Applied Sciences, 6(6):162, 2016. Vision (ECCV), pages 818–833, 2014. [52] J. A. Hanley and B. J. McNeil. The meaning and use of the area under a [31] M. Lin, Q. Chen, and S. Yan. Network in network. In Proceedings of the receiver operating characteristic (ROC) curve. Radiology, 143(1):29–36, International Conference on Learning Representations (ICLR), 2014. [32] Q. Kong, Y. Xu, W. Wang, and M. D Plumbley. Audio set classification [53] A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, E. Vincent, with attention model: A probabilistic perspective. In Proceedings of the B. Raj, and T. Virtanen. DCASE2017 challenge setup: Tasks, datasets International Conference on Acoustics, Speech and Signal Processing and baseline system. In Proceedings of the Detection and Classification (ICASSP), pages 316–320, 2017. of Acoustic Scenes and Events (DCASE) Workshop, pages 85–92, 2017. [33] Brian McFee, Justin Salamon, and Juan Pablo Bello. Adaptive pooling [54] Qiuqiang Kong, Turab Iqbal, Yong Xu, Wenwu Wang, and Mark D operators for weakly labeled sound event detection. arXiv preprint Plumbley. DCASE 2018 Challenge baseline with convolutional neural arXiv:1804.10070, 2018. networks. arXiv preprint arXiv:1808.00773, 2018. [34] S. Chou, J. Jang, and Y. Yang. FrameCNN: A weakly-supervised learn- [55] Q. Kong, I. Sobieraj, W. Wang, and M. D. Plumbley. Deep neural ing framework for frame-wise acoustic event detection and classification. network baseline for DCASE Challenge 2016. Proceedings of the Technical report, DCASE2017 Challenge, September 2017. Detection and Classification of Acoustic Scenes and Events (DCASE) [35] Ting-Wei Su, Jen-Yu Liu, and Yi-Hsuan Yang. Weakly-supervised Workshop, 2016. audio event detection using event-specific gaussian filters and fully [56] Karen Simonyan and Andrew Zisserman. Very deep convolutional convolutional networks. 
In Proceedings of the IEEE International networks for large-scale image recognition. In Proceedings of the Conference on Acoustics, Speech and Signal Processing (ICASSP), pages International Conference on Learning Representations (ICLR), 2014. 791–795, 2017. [57] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In [36] Shao-Yen Tseng, Juncheng Li, Yun Wang, Joseph Szurley, Florian Proceedings of the International Conference on Learning Representa- Metze, and Samarjit Das. Multiple instance deep learning for weakly tions (ICLR), 2015. supervised audio event detection. arXiv preprint arXiv:1712.09673, [37] Anurag Kumar and Bhiksha Raj. Deep CNN framework for audio event recognition using weakly labeled web data. arXiv preprint arXiv:1707.02530, 2017. [38] Donmoon Lee, Subin Lee, Yoonchang Han, and Kyogu Lee. Ensemble of convolutional neural networks for weakly-supervised sound event detection using multiple scale input. In Proceedings of the Detection Qiuqiang Kong (S’17) received the B.Sc. and the and Classification of Acoustic Scenes and Events (DCASE) Workshop, M.E. degree in South China University of Techology, pages 74–79, 2017. Guangzhou, China, in 2012 and 2015, respectively. [39] Y. Xu, Q. Kong, W. Wang, and M. D. Plumbley. Large-scale weakly He is currently pursuing a PhD degree in University supervised audio classification using gated convolutional neural network. of Surrey, Guildford, UK. His research interest in- Proceedings of the IEEE International Conference on Acoustics, Speech cludes audio signal processing and machine learning. and Signal Processing (ICASSP), pages 121–125, 2017. [40] S. A. Raki, S. Makino, H. Sawada, and R. Mukai. Reducing musical noise by a fine-shift overlap-add method applied to source separation using a time-frequency mask. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 3, pages 81–84, 2005. [41] S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. 
Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, et al. CNN architectures for large-scale audio classification. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 131–135, 2017. Yong Xu (M’17) received the Ph.D. degree from [42] M. H. Radfar and R. M. Dansereau. Single-channel speech separation the University of Science and Technology of China using soft mask filtering. IEEE Transactions on Audio, Speech, and (USTC), Hefei, China, in 2015, on the topic of Language Processing, 15(8):2299–2310, 2007. DNN-based speech enhancement and recognition. [43] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network Currently, he is a senior research scientist in Tencent training by reducing internal covariate shift. In Proceedings of the 32nd AI lab, Bellevue, USA. He once worked at the Uni- International Conference on Machine Learning (ICML), pages 448–456, versity of Surrey, U.K. as a Research Fellow from 2016 to 2018 working on sound event detection. He [44] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltz- visited Prof. Chin-Hui Lee’s lab in Georgia Institute mann machines. In Proceedings of the 27th International Conference of Technology, USA from Sept. 2014 to May 2015. on Machine Learning (ICML), pages 807–814, 2010. He once also worked in IFLYTEK company from [45] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning 2015 to 2016 to develop far-field ASR technologies. His research interests deep features for discriminative localization. In Proceedings of the IEEE include deep learning, speech enhancement and recognition, sound event Conference on Computer Vision and Pattern Recognition (CVPR), pages detection, etc. He received 2018 IEEE SPS best paper award. 2921–2929, 2016. [46] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. 
In 12 Iwona Sobieraj received the B.A. and the M.E. degreed from Warsaw University of Technology, Poland, in 2010 and 2011, respectively. She joined Samsung Electronics R&D, Warsaw, Poland in 2012. Since 2015 she is pursuing a PhD degree at the Uni- versity of Surrey, Guildford, UK. Her main research interest include environmental audio analysis, non- negative matrix factorization and deep learning. Wenwu Wang (M’02-SM’11) was born in Anhui, China. He received the B.Sc. degree in 1997, the M.E. degree in 2000, and the Ph.D. degree in 2002, all from Harbin Engineering University, China. He then worked in King’s College London, Cardiff University, Tao Group Ltd. (now Antix Labs Ltd.), and Creative Labs, before joining University of Surrey, UK, in May 2007, where he is currently a Reader in Signal Processing, and a Co-Director of the Machine Audition Lab within the Centre for Vision Speech and Signal Processing. He has been a Guest Professor at Qingdao University of Science and Technology, China, since 2018. His current research interests include blind signal processing, sparse signal processing, audio-visual signal processing, machine learning and perception, machine audition (listening), and statistical anomaly detection. He has (co)-authored over 200 publications in these areas. He served as an Associate Editor for IEEE Transactions on Signal Processing from 2014 to 2018. He is also Publication Co-Chair for ICASSP 2019, Brighton, UK. Mark D. Plumbley (S’88-M’90-SM’12-F’15) re- ceived the B.A.(Hons.) degree in electrical sciences and the Ph.D. degree in neural networks from Uni- versity of Cambridge, Cambridge, U.K., in 1984 and 1991, respectively. Following his PhD, he became a Lecturer at King’s College London, before moving to Queen Mary University of London in 2002. He subsequently became Professor and Director of the Centre for Digital Music, before joining the Uni- versity of Surrey in 2015 as Professor of Signal Processing. 
He is known for his work on analysis and processing of audio and music, using a wide range of signal processing techniques, including matrix factorization, sparse representations, and deep learning. He is a co-editor of the recent book on Computational Analysis of Sound Scenes and Events, and Co-Chair of the recent DCASE 2018 Workshop on Detection and Classifications of Acoustic Scenes and Events. He is a Member of the IEEE Signal Processing Society Technical Committee on Signal Processing Theory and Methods, and a Fellow of the IET and IEEE. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Electrical Engineering and Systems Science arXiv (Cornell University)
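As a small worked illustration of equation (10), the following Python sketch computes the per-clip substitutions, deletions and insertions from event-wise false negative and false positive counts. The function names are hypothetical, and the aggregation into a single error rate (total errors divided by the total number of reference events) is an assumption based on the usual definition in [51]; it is not spelled out in this section.

```python
from typing import Iterable, Tuple


def error_counts(fn: int, fp: int) -> Tuple[int, int, int]:
    """Return (S, D, I) for one audio clip, following equation (10)."""
    substitutions = min(fn, fp)      # a miss paired with a false alarm
    deletions = max(0, fn - fp)      # misses left over after pairing
    insertions = max(0, fp - fn)     # false alarms left over after pairing
    return substitutions, deletions, insertions


def error_rate(clips: Iterable[Tuple[int, int, int]]) -> float:
    """Aggregate ER over clips given (fn, fp, n_reference_events) triples.

    Assumption: ER = (sum S + sum D + sum I) / (sum of reference events).
    """
    total_errors = 0
    total_reference = 0
    for fn, fp, n_ref in clips:
        total_errors += sum(error_counts(fn, fp))
        total_reference += n_ref
    if total_reference == 0:
        raise ValueError("at least one reference event is required")
    return total_errors / total_reference
```

Under this sketch, a system that outputs nothing has FP = 0 in every clip, so every reference event counts as a deletion and ER equals 1.00, which matches the GMP row of Table VI.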

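The collar rule used for event-wise true positives (an onset collar of 200 ms and an offset collar of 200 ms / 50%) can be sketched as below. This is a minimal illustration, not the toolbox of [51] itself: the function name and the (onset, offset, label) tuple layout are hypothetical, and taking the offset tolerance as the larger of 200 ms and 50% of the reference event length is an assumption about how the two offset values are combined.

```python
from typing import Tuple

Event = Tuple[float, float, str]  # (onset in seconds, offset in seconds, label)


def is_true_positive(
    reference: Event,
    estimate: Event,
    onset_collar: float = 0.2,
    offset_collar: float = 0.2,
    offset_ratio: float = 0.5,
) -> bool:
    """Check whether an estimated event matches a reference event."""
    ref_on, ref_off, ref_label = reference
    est_on, est_off, est_label = estimate
    if ref_label != est_label:
        return False
    onset_ok = abs(ref_on - est_on) <= onset_collar
    # Assumption: the offset tolerance is the larger of the fixed collar
    # and a fraction of the reference event length.
    offset_tolerance = max(offset_collar, offset_ratio * (ref_off - ref_on))
    offset_ok = abs(ref_off - est_off) <= offset_tolerance
    return onset_ok and offset_ok
```

With this rule, long events tolerate a proportionally larger offset error, while short events fall back to the fixed 200 ms collar.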

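To make the difference between the three global pooling functions compared in the experiments more concrete, here is a minimal pure-Python sketch of global weighted rank pooling applied to one T-F segmentation mask. It assumes the usual rank-weighted definition from [23], in which the sorted mask values are weighted by powers of a decay parameter r; the function name, the nested-list mask layout and the value r = 0.5 in the usage note are illustrative assumptions, not settings taken from this section.

```python
from typing import Sequence


def global_weighted_rank_pooling(mask: Sequence[Sequence[float]], r: float) -> float:
    """Pool a T-F mask (rows of values in [0, 1]) into one presence score.

    Values are sorted in descending order and the j-th largest value gets
    weight r**j (j = 0, 1, ...), normalised by the sum of the weights.
    r = 0 keeps only the maximum (GMP); r = 1 weights all values equally (GAP).
    """
    if not 0.0 <= r <= 1.0:
        raise ValueError("r must lie in [0, 1]")
    values = sorted((v for row in mask for v in row), reverse=True)
    if not values:
        raise ValueError("mask must contain at least one value")
    weights = [r ** j for j in range(len(values))]  # 0 ** 0 == 1 in Python
    return sum(w * v for w, v in zip(weights, values)) / sum(weights)
```

For the toy mask [[0.9, 0.1], [0.2, 0.0]], r = 0 gives the maximum 0.9, r = 1 gives the mean 0.3, and an intermediate value such as r = 0.5 gives a score between the two. This is one way to see why GMP can underestimate and GAP can overestimate the extent of a sound event, as discussed for Fig. 7.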


ISSN
2329-9290
eISSN
ARCH-3348
DOI
10.1109/TASLP.2019.2895254

The separation of sound events in the T-F audio clip is labelled positive for a specified sound event if domain can be useful for enhancing and recognizing sound that sound event occurs at least one time in the audio clip, events in audio scenes under low signal-to-noise ratio (SNR). and labelled negative if that sound event does not occur in the In this paper, we propose a T-F segmentation and sound event audio clip. For strongly labelled data, the dataset consists of detection framework trained using weakly labelled data. This training pairs fx; yg where x is the feature of a frame in an is done by learning T-F segmentation masks implicitly in train- audio clip and y 2 f0; 1g is the strong label of the frame, ing with only the clip-level audio tags. It means that T-F masks where K denotes the number of sound classes. For weakly are not known even for the training set: they are predicted as labelled data, features of all frames in an audio clip constitute intermediate results. T-F segmentation masks are equivalent a bag B = fx g where T is the number of frames in the t=1 to the ideal ratio masks (IRM) [9]. An IRM is the ratio of audio clip. Multiple instance assumption states that the weak the spectrogram of a sound event to the spectrogram of the labels of a bag are y = maxfy g , where y is the strong t t t=1 mixed audio. T-F segmentation masks can be used for SED and label of the feature x . The weakly labelled data consists of sound event separation. In training, a segmentation mapping is the training pairs fB; yg. applied to the T-F representation such as log mel spectrogram The problem of SED from WLD now can be cast as learning of an audio clip to obtain T-F segmentation masks for sound a classifier to predict the labels of the frames fy g of t=1 events. Then a classification mapping is applied to the T-F a bag B = fx g . 
For the general WLD problem, an t=1 segmentation masks to output the presence probabilities of MIL framework based on a neural network was proposed sound events. In T-F segmentation, with a T-F representation in [14, 19]. In [14, 20] a support vector machine (SVM) of an audio clip as input, the trained segmentation mapping was used to solve MIL as a maximum margin problem. A is used to obtain the T-F segmentation masks. In SED, onset negative mining method was proposed in [21] that selects and offset times can be obtained from the T-F segmentation negative examples according to intra-class variance criterion. masks. As a byproduct, separated waveforms of sound events A concept ranking according to negative exemplars (CRANE) can be obtained from the T-F segmentation masks. This work algorithm was proposed in [22]. However, an MIL method is an extension of the joint separation-classification model for tends to underestimate the number of positive instances in an SED of weakly labelled data [16]. audio clip [23]. Furthermore, the MIL method cannot predict The paper is organized as follows. Section II introduces the T-F segmentations from the WLD [14]. previous work in SED with WLD. Section III describes the proposed T-F segmentation, sound event detection and separation framework. Section IV describes the implemen- B. Convolutional neural networks for audio tagging and tation details of the proposed framework. Section V shows weakly supervised sound event detection experimental results. Section VI concludes and forecasts future Convolutional neural networks (CNNs) have been success- work. fully used in many areas including image classification [24], object detection [6], image segmentation [25], speech recog- nition [26, 27] and audio classification [28]. In this section II. 
WEAKLY SUPERVISED SOUND EVENT DETECTION we briefly introduce previous work using convolutional neural Compared to the conventional SED task, where strongly network for audio tagging [28] and weakly supervised SED. labelled onset and offset annotations for the training set are Audio tagging [12, 28, 29] aims to predict the presence of given, the weakly supervised SED task contains only clip-level sound events in an audio clip. In [30], a mel spectrogram of an labels. That is, only the presence of sound events is known in audio clip is presented to a CNN, where the filters of each con- an audio clip, without knowing the temporal locations of the volutional layer capture local patterns of a spectrogram. After events. Several approaches for weakly supervised SED have a global pooling layer such as global max pooling [28], global 3 Fig. 3. Training stage using weakly labelled data. A segmentation mapping g maps from an input T-F representation to the segmentation masks. A classification mapping g maps each segmentation mask to the presence probabilities of the corresponding audio tag. average pooling [31], global weighted rank pooling [23], segmentation masks h = [h ; :::; h ], where K is the number 1 K global attention pooling [32, 33] or other poolings [34, 35], of T-F segmentation masks and is equal to the number of sound fully connected layers are applied to predict the presence events. Symbol h is the abbreviation of h (t; f) which is the k k probabilities of audio classes. Fig. 2 shows the framework T-F segmentation mask of the k-th event. Ideally, each T-F of audio tagging with convolutional neural network. However, segmentation mask h is an ideal ratio mask [9] of the k-th this CNN only predicts the presence probabilities of a sound sound event. events in an audio clip, but not the onset and offset times of The second part of the training stage is a classification the sound events. mapping g : h 7! 
p ; k = 1; :::; K where g maps each 2 k k 2 In [36, 37], a time-distributed CNN with a global max- T-F segmentation mask to the presence probability of the k-th pooling strategy was proposed to approximate the MIL method event, denoted as p . Then the binary crossentropy between the to predict the temporal locations of each event. However, the predictions p ; k = 1; :::; K and the targets y ; k = 1; :::; K is k k global max-pooling will encourage the model to attend to the calculated as the loss function: most dominant T-F unit contributing to the presence of the sound event and ignore all of other T-F units. That is, the hap- l (p ; y ) = y log p k k k k pening time of the sound events is underestimated. A method k=1 (1) for localizing the sound events in an audio clip by splitting the input into several segments based on the CNNs was presented = y log g (g (X) ); k 2 1 k in [38]. It splits an audio clip into several segments with the k=1 assumption that parts of the segments correspond to the clip- where y 2 f0; 1g; k = 1; :::; K is the binary representation level labels. This assumption may be unreasonable due to of the weak labels. Both g and g can be modeled by neural 1 2 the fact that some sound events may only occur at certain networks. The parameters of g and g can be trained end-to- 1 2 frames. Recently, an attention-based global pooling strategy end from the input T-F representation to the weak labels of using CNNs was proposed to predict the temporal locations an audio clip. [39] for SED using WLD. However, attention-based global pooling can only predict the time domain segmentation, but B. Time-frequency segmentation not the T-F segmentation which will be firstly addressed in In inference step, the input T-F representation of an audio this paper. clip is presented to the segmentation mapping g to obtain the T-F segmentation masks h ; k = 1; :::; K . The T-F segmenta- III. 
TIME- FREQUENCY SEGMENTATION, SOUND EVENT tion masks indicate which T-F units in the T-F representation DETECTION AND SEPARATION FROM W EAKLY LABELLED contribute to the presence of the sound events (top right of DATA Fig. 4). The learned T-F segmentation masks are affected by In this section, we present a T-F segmentation, sound event the classification mapping g and will be discussed in Section detection and separation framework trained on weakly labelled IV. audio data. Unlike the CNN method for audio tagging, we design a CNN to learn T-F segmentation masks of sound C. Sound event detection events from the weakly labelled data. As T-F segmentation masks h ; k = 1; :::; K contain the information about where sound events happen in the T-F A. Training from weakly labelled data domain, the simplest way to obtain the sound event detection We use only weakly labelled audio data to train the proposed score v (t) in the time domain is to average out the frequency model. The training stage is shown in Fig. 3. To begin with, axis of the T-F segmentation masks (bottom right of Fig. 4): the waveform of an audio clip x is converted to an input time- frequency (T-F) representation X(t; f), for example, spectro- v (t) = h (t; f); (2) k k gram or log mel spectrogram. To simplify the notation, we f=1 abbreviate X(t; f) as X . The first part of the training stage is a segmentation mapping where F is the number of frequency bins of the segmentation g : X 7! h which maps the input T-F representation to the T-F mask h . Then v (t) is the score of the frame-wise prediction 1 k k 4 nally, an inverse Fourier transform with overlap add [40] is applied on each segmented spectrogram with the phase from X to obtain the separated waveforms sb ; k = 1; :::; K : j\X sb = IFFT Y  e : (4) k k We summarize the training, time-frequency segmentation, sound event detection and separation framework in Fig. 6. 
The training stage, sound event detection stage and sound event separation stage are shown in the left, middle and right column of Fig. 6, respectively. IV. PROPOSED SEGMENTATION MAPPING AND CLASSIFICATION MAPPING In this section, we describe the implementation details of the segmentation mapping g and the classification mapping g proposed in Section III. A. Segmentation mapping Segmentation mapping g takes a T-F representation of an Fig. 4. Inference stage. An input T-F representation is presented to the audio clip as input and outputs segmentation masks of each segmentation mapping g to obtain the T-F segmentation masks. By averaging out the frequency axis of the T-F segmentation masks and post processing, sound event. We use log mel spectrogram as the input T- event-wise predictions of sound events can be obtained. F representation, which has been shown to perform well in audio classification [28, 39, 41]. Ideally, the outputs of g are ideal ratio masks (IRMs) [42] of sound events in the T-F of the sound events. We describe how to convert the frame- domain. The segmentation mapping g is modeled by a CNN. wise scores to event-wise sound events in Section IV-C. Each convolutional layer consists of a linear convolution, a batch normalization (BN) [43] and a ReLU [44] nonlinearity D. Sound event separation as in [43]. The BN inserted between the convolution and the As a byproduct, the T-F segmentation masks can be used nonlinearity can stabilize and speed up the training [43]. We to separate sound events from the mixture in the T-F domain. do not apply downsampling layers after convolutional layers In addition, by applying an inverse Fourier transform on the because we want to retain the resolution of the input T-F separated T-F representation of each sound event, separated segmentation masks. The T-F segmentation masks are obtained waveforms of the sound events can be obtained. 
Separating from the activations of the last CNN layer using a sigmoid non- sound events from the mixture of sound events and background linearity to constrain the values of the T-F segmentation masks under a low SNR can improve the recognition of sound events to be between 0 and 1 to be a valid value of an IRM. The in future work. Fig. 5 shows the pipeline of sound event configuration details of the CNN will be described in Section separation. An audio clip x is presented to the segmentation V-D. mapping g to obtain T-F segmentation masks. Meanwhile, The idea of learning the T-F segmentation masks explicitly the complex spectrum X of the audio clip is calculated. We is inspired by work on weakly labelled image localization [45] use the tilde on X to distinguish the complex spectrum X and image segmentation [46, 47]. In weakly labelled image from the input T-F representation X because X might not localization, saliency maps are learned indicating the locations be a spectrum, such as log mel spectrogram. We interpo- of the objects in an image [45]. Similarly, the T-F segmentation late the segmentation masks of the input T-F representation masks in our work resemble the saliency maps of an image h ; k = 1; :::; K to h ; k = 1; :::; K representing the T-F k k [45], where T-F segmentation masks indicate what time and segmentation masks of the complex spectrum. The reason frequency a sound event occurs in a T-F representation. for performing this interpolation is that h may have a size different from h , for example, a log mel spectrogram has fewer frequency bins than linear spectrum in the frequency B. Classification mapping domain. Then we multiply the upsampled T-F segmentation As described in Section III, the classification mapping g masks h with the magnitude of the spectrum to obtain the maps each segmentation mask h to the presence probability segmented spectrogram of the k-th event: of its corresponding sound event. 
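The masking step of Equation (3), together with the phase reuse that precedes the inverse transform in Equation (4), can be sketched in NumPy as follows. This is a minimal illustration rather than the authors' implementation: the helper names `upsample_mask` and `masked_spectrum` are hypothetical, linear interpolation along the frequency axis is an assumption (the text does not say which interpolation is used), and the inverse Fourier transform with overlap add is left out.

```python
import numpy as np


def upsample_mask(mask, n_bins):
    """Interpolate a (frames, mel_bins) mask to (frames, n_bins).

    Assumption: plain linear interpolation along the frequency axis.
    """
    src = np.linspace(0.0, 1.0, mask.shape[1])
    dst = np.linspace(0.0, 1.0, n_bins)
    return np.stack([np.interp(dst, src, row) for row in mask])


def masked_spectrum(mask, complex_spec):
    """Apply Eq. (3), then reattach the mixture phase as in Eq. (4)."""
    magnitude = np.abs(complex_spec)
    phase = np.angle(complex_spec)
    return (mask * magnitude) * np.exp(1j * phase)


# Toy complex spectrum with 4 frames and 9 linear-frequency bins.
rng = np.random.default_rng(0)
spec = rng.standard_normal((4, 9)) + 1j * rng.standard_normal((4, 9))

# An all-ones mask defined on 3 "mel" bins, upsampled to 9 bins.
ones = upsample_mask(np.ones((4, 3)), n_bins=9)
separated = masked_spectrum(ones, spec)
```

With an all-ones mask the mixture spectrum is returned unchanged, and an all-zeros mask removes everything, which is a quick sanity check on the masking logic.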
Modeling the classification e e e mapping in different ways will lead to different representation Y = h X ; k = 1; :::; K; (3) k k of the segmentation masks (Fig. 7). We explored global max where represents the element-wise multiplication and Y pooling [28], global average pooling [31] and global rank represents the segmented spectrogram of the k-th event. Fi- pooling [23] for modeling the classification mappings g . 2 5 Fig. 5. Sound event separation stage. An input T-F representation is presented to the segmentation mapping g to obtain the T-F segmentation masks. The upsampled segmentation masks are multiplied with the magnitude spectrum of the input audio to obtain the segmented spectrogram of each sound event. Separated sound events are obtained by applying an inverse Fourier transform to the segmented spectrogram. to update the parameters in the neural network. Because of Waveform the maximum selection strategy, GMP encourages only one point in a T-F segmentation mask to be positive, so GMP will T-F representation Complex spectrogram underestimate [23] the sound events in the T-F representation. Examples of T-F segmentation masks learned using GMP are g magnitude phase shown in Fig. 7(c). T-F Segmentation 2) Global average pooling: Global average pooling (GAP) Upsampling masks was first applied in image classification [31]. GAP on each A Av ve er ra ag ge e a al lo on ng g T-F segmentation mask h is depicted as: frequency axis Separated spectrogram (magnitude) T F XX Score of sound events 2 along time axis F (h ) = h (t; f): (6) k k TF Inverse Fourier Post transform processing GAP corresponds to the collective assumption in MIL [48], Detected sound Audio tags Separated waveforms which states that all T-F units in a T-F segmentation mask events contribute equally to the label of an audio clip. That is, all T- Training from WLD Sound event detection Sound event separation F units in a T-F segmentation mask are assumed to contain the labelled sound events. 
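Global weighted rank pooling, Equation (7), and its two limiting cases can be checked numerically with a short NumPy sketch. The function name `gwrp` and the toy mask are illustrative assumptions, not taken from the released source code.

```python
import numpy as np


def gwrp(mask, r):
    """Global weighted rank pooling of one T-F mask, following Eq. (7).

    Values are sorted in descending order and weighted by r**(j-1);
    the weights are normalised by their sum Z(r).
    """
    values = np.sort(mask.ravel())[::-1]          # descending order
    weights = float(r) ** np.arange(values.size)  # r^0, r^1, ..., r^(M-1)
    return float(np.sum(weights * values) / np.sum(weights))


mask = np.array([[0.9, 0.1],
                 [0.2, 0.4]])

as_gmp = gwrp(mask, 0.0)   # only the largest value keeps a non-zero weight
as_gap = gwrp(mask, 1.0)   # all values weighted equally
between = gwrp(mask, 0.5)  # lies between the mean and the maximum
```

For this mask, r = 0 gives the maximum 0.9 and r = 1 gives the mean 0.4, matching the statement that GWRP reduces to GMP and GAP at the two ends of the range of r.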
However, some sound events only last Fig. 6. Framework of T-F segmentation, sound event detection and sound event separation from WLD. From left to right: Training from WLD; Sound a short time, so GAP usually overestimates the sound events event detection; Sound event separation. [31]. Examples of T-F segmentation masks learned using GAP are shown in Fig 7(d). 3) Global weighted rank pooling: To overcome the lim- 1) Global max pooling: Global max pooling (GMP) ap- itations of GMP and GAP, which underestimate and over- plied on feature maps has been used in audio tagging [28]. estimate the sound events in the T-F segmentation masks, GMP on each T-F segmentation mask map h is depicted as: global weighted rank pooling (GWRP) is proposed in [23]. GWRP can be seen as a generalization of GMP and GAP. F (h ) = max h (t; f): (5) k k t;f The idea of GWRP is to put a descending weight on the values of a T-F segmentation mask sorted in a descending GMP is based on the assumption that an audio clip contains order. Let an index set I = fi ; :::i g define the descending a sound event if at least one T-F unit of the T-F input 1 M order of the values within a T-F segmentation mask h , i.e. representation contains a sound event. GMP is invariant to the k (h )  (h )  :::  (h ) , where M = T  F is the location of sound event in the T-F domain because whenever a k i k i k i 1 2 n number of T-F units in a T-F segmentation mask. Then the sound event occurs, GMP will only select the maximum value GWRP is defined as: of a T-F segmentation mask which is robust to the time or frequency shifts of the sound event. However, in the training stage, back propagation will only pass through the maximum j1 F (h ) = r (h ) ; (7) k k i Z(r) value, so only a small part of data in the T-F domain are used j=1 6 Fig. 7. 
(a) Spectrogram of an audio clip containing “scissors”, “computer keyboard” and “tambourine” (plotted in log scale); (b) Log mel spectrogram of the audio clip; (c) Upsampled T-F segmentation masks h of sound events learned using global max pooling (GMP). Only a few T-F units have high value and the other parts of the T-F segmentation masks are dark; (d) Upsampled T-F segmentation masks h of sound events learned using global average pooling (GAP); (e) Upsampled T-F segmentation masks h of sound events learned using global weighted rank pooling (GWRP); (f) Ideal ratio mask (IRM) of sound events. Only 6 out of 41 T-F segmentation masks are plotted due to the limited space. j1 where 0  r  1 is a hyper parameter and Z(r) = r is that DCASE 2018 Task 1 provides background sounds j=1 is a normalization term. When r = 0 GWRP becomes GMP recorded from a variety of real world scenes whereas the and when r = 1 GWRP becomes GAP. The hyperparameter DCASE 2018 Task 2 provides a variety of foreground sound r can vary depending on the frequency of occurrence of the events. The DCASE 2018 Task 1 contains 8640 10-second sound events. GWAP attends more to the T-F units of high audio clips in the development set of subtask A. The audio values in a T-F segmentation mask and less to those of low clips are recorded from 10 different scenes such as “airport”, values in a T-F segmentation mask. The T-F segmentation “metro station” and “urban park”. The DCASE 2018 Task 2 masks learned using GWMP is shown in Fig. 7(e). The ideal contains 3710 manually verified sound events ranging in length binary masks (IBMs) of the sound events are plotted in Fig. from 300 ms to 30 s depending on the audio classes. There 7(f) for comparison with the GMP, GAP and GWRP. are 41 classes of sound events such as “flute”, “applause” and “cough”. We only use these manually verified audio clips from the DCASE 2018 Task 2 as sound events because the C. 
Post-processing for sound event detection remaining audio clips are unverified and may contain noisy In Section III-C we mentioned that the frame-wise scores labels. We truncated the sound events to up to 2 seconds and v (t) can be obtained from the T-F segmentation masks using mix them with the 10-second audio clips from the DCASE Equation (2). To reduce the number of false alarms, for an 2018 Task 1 acoustic scene dataset. The mixed audio clips are audio clip, we only apply sound event detection on the sound single channel with a sampling rate of 32 kHz. Each mixed classes with positive audio tagging predictions. Then we apply audio clip contains three non-overlapped sound events. We thresholds on the frame-wise predictions v (t) to obtain the mixed the sound events with the acoustic scenes for SNRs at event-wise predictions. We apply a high threshold of 0.2 20dB, 10dB and 0dB. For each SNR, the 8000 mixed audio to detect the presence of sound events and then extend the clips are divided into 4 cross-validation folds. Fig. 7(b) shows boundary of both onset and offset sides until the frame-wise the log mel spectrogram of a mixed 10-second audio clip. The scores drop below threshold of 0.1. This two-step threshold source code of our work is released . method will produce smooth predictions of sound events. As the duration of sound events in DCASE 2018 Task 2 varies B. Evaluation metrics from 300 ms to 30 s, we remove the detected sound events We use F-score [51], area under the curve (AUC) [52] and that are shorter than 320 ms (10 frames) to reduce false alarms mean average precision (mAP) [6] in the evaluation of the and join the sound events whose silence gap is shorter than audio tagging, the frame-wise SED and the T-F segmentation. 320 ms (10 frames). We also use error rate (ER) for evaluating the event-wise SED. 1) Basic statistics: True positive (TP): Both the reference V. EXPERIMENTS and the system prediction indicate an event to be active. 
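The two-step threshold post-processing of Section IV-C can be sketched as follows. The function name `detect_events` is hypothetical, and two details are assumptions because the text does not fix them: short events are removed before gaps are joined, and comparisons against the thresholds are strict.

```python
import numpy as np


def detect_events(scores, high=0.2, low=0.1, min_frames=10):
    """Turn frame-wise scores v_k(t) into (onset, offset) frame pairs.

    Offsets are exclusive. A frame above `high` seeds an event, which is
    then extended on both sides while the score stays above `low`.
    """
    scores = np.asarray(scores, dtype=float)
    n = len(scores)
    events = []
    t = 0
    while t < n:
        if scores[t] > high:
            onset = t
            while onset > 0 and scores[onset - 1] > low:
                onset -= 1
            offset = t + 1
            while offset < n and scores[offset] > low:
                offset += 1
            events.append((onset, offset))
            t = offset
        else:
            t += 1

    # Assumed order: drop short events first, then join close neighbours.
    events = [(a, b) for a, b in events if b - a >= min_frames]
    joined = []
    for onset, offset in events:
        if joined and onset - joined[-1][1] < min_frames:
            joined[-1] = (joined[-1][0], offset)
        else:
            joined.append((onset, offset))
    return joined


scores = np.zeros(60)
scores[10:30] = 0.15   # above the low threshold only
scores[15:25] = 0.5    # core of the event, above the high threshold
scores[50:53] = 0.5    # 3-frame blip, shorter than 10 frames
events = detect_events(scores)
```

In this toy example the event is seeded at frame 15 and extended to frames 10 through 29, while the 3-frame blip is discarded as a false alarm.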
False negative (FN): The reference indicates an event to be active A. Dataset but the system prediction indicates an event to be inactive. We mix the DCASE 2018 Task 1 acoustic scene dataset False positive (FP): The system prediction indicates an event [49] with the DCASE 2018 Task 2 general-purpose Freesound to be active but the reference indicates it is not [51]. dataset [50] under different signal-to-noise ratios (SNRs) to evaluate the proposed methods. The reason for this choice https://github.com/qiuqiangkong/sed_time_freq_segmentation 7 2) Precision, recall and F-score: Precision (P) and recall TABLE I C ONFIGURATION OF CNN. (R) are defined as [51]: TP TP P = ; R = : (8) TP + FP TP + FN Output size Layers (feature maps  time steps  mel bins) Bigger P and R indicates better performance. F-score is Input log mel spectrogram 1 311 64 calculated based on P and R [51]: f3 3; 32; BN; ReLUg 2 32 311 64 2P  R TP F = = : (9) f3 3; 64; BN; ReLUg 2 64 311 64 P + R TP + (FN + FP )=2 f3 3; 128; BN; ReLUg 2 128 311 64 Bigger F-score indicates better performance. f3 3; 128; BN; ReLUg 2 128 311 64 3) Area under the curve (AUC): A receiver operating 1 1; 41; sigmoid 41 311 64 characteristic (ROC) curve [52] plots true positive rate (TPR) versus false positive rate (FPR). Area under the curve (AUC) Global pooling (GP) 41 score is the area under this ROC curve which summarizes the ROC curve to a single number. Using the AUC does not TABLE II require manual selection of a threshold. Bigger AUC indicates F1- SCORE, AUC AND MAP OF AUDIO TAGGING AT DIFFERENT SNRS. better performance. A random guess has an AUC of 0.5. 4) Average precision: Average precision (AP) is the aver- 20 dB 10 dB 0 dB age of the precision at different recall values. Similar to AUC, Algorithms F1 AUC mAP F1 AUC mAP F1 AUC mAP AP does not rely on the threshold. 
Different to AUC, AP does DNN [55] 0.439 0.885 0.468 0.396 0.861 0.402 0.331 0.810 0.314 WLD CNN [37] 0.498 0.777 0.498 0.524 0.794 0.526 0.528 0.815 0.535 not count the true negatives and is widely used as a criterion FrameCNN [34] 0.581 0.899 0.587 0.543 0.883 0.526 0.484 0.850 0.439 in imbalanced dataset such as object detection [6]. Attention [39] 0.714 0.922 0.755 0.690 0.907 0.729 0.612 0.875 0.643 GMP 0.435 0.818 0.475 0.406 0.801 0.440 0.373 0.773 0.389 5) Error rate: Error rate (ER) is an event-wise evaluation GAP 0.529 0.934 0.623 0.467 0.914 0.555 0.385 0.877 0.442 metric. ER measures the amount of errors in terms of inser- GWRP 0.635 0.955 0.753 0.604 0.942 0.696 0.534 0.915 0.596 tions (I), deletions (D) and substitutions (S) [51]. For an audio clip, the insertions, deletions and substitutions are defined as: “VGG-like” convolutional neural network [56] with 8 convo- S = min(FN; FP ); lutional blocks on the input log mel spectrogram [54]. Each D = max(0; FN FP ); (10) convolutional layer consists of a linear convolution with a filter size of 33 followed by a batch normalization layer [43] and I = max(0; FP FN); a ReLU activation function [44]. We use 4 convolution blocks where FN, FP, FN are event-wise statistics in an audio clip. following the baseline system of DCASE 2018 [54]. The Lower ER, S, D and I indicate the better performance. When number of feature maps of the convolutional layers are 32, 64, evaluating the event based criterion, we allow some degree 128 and 128, respectively. This configuration is to fit the model of misalignment between a reference and a system output for to a single GPU card with 12 GB RAM sufficiently. Then a counting a true positive [12, 51, 53]. 
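Equations (8) to (10) translate directly into a few lines of Python. The sketch below uses hypothetical function names and made-up counts purely to show the arithmetic; it does not reproduce the toolbox of [51], and it stops at the S, D and I counts because the text does not spell out how they are combined into ER.

```python
def precision_recall_f(tp, fp, fn):
    """Precision, recall and F-score from Eqs. (8) and (9)."""
    p = tp / (tp + fp) if tp + fp else 0.0
    r = tp / (tp + fn) if tp + fn else 0.0
    f = 2 * p * r / (p + r) if p + r else 0.0
    return p, r, f


def error_counts(fn, fp):
    """Substitutions, deletions and insertions from Eq. (10)."""
    s = min(fn, fp)
    d = max(0, fn - fp)
    i = max(0, fp - fn)
    return s, d, i


# Made-up counts for illustration only.
p, r, f = precision_recall_f(tp=6, fp=2, fn=4)
s, d, i = error_counts(fn=4, fp=2)
```

With these counts, P = 0.75, R = 0.6 and F = 6 / (6 + 3), which agrees with both forms of Equation (9); the two false positives are paired with two of the four false negatives as substitutions, leaving two deletions and no insertions.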
Following the default 11 convolutional layer with sigmoid non-linearity is applied configuration of [51], we adopt an onset collar of 200 ms and to convert the feature maps to the T-F segmentation masks of an offset collar of 200 ms / 50% to count the true positive sound events. Then a global pooling is used to summarize each of a detection. We used the toolbox [51] for evaluating the T-F segmentation mask to a scalar representing the presence performance of the event-based SED. probability of the sound events in an audio clip. We summarize the configuration of the neural network in Table I. In training we use a mini-batch size of 24 to fully utilize the single card C. Feature extraction GPU with 12 GB RAM. The Adam optimizer [57] with a We apply a fast Fourier transform (FFT) with a window learning rate 0.001 is used for its fast convergence. size of 2048 and an overlap of 1024 between neighbouring windows to extract the spectrogram of audio clips. This E. Audio tagging configuration that follows [54] offers a good resolution in both We compare our method with fully connected neural time and frequency domain. Then mel filter banks with 64 network [55], CNN trained on weakly labelled data [37], bands are applied on the spectrogram followed by logarithm FrameCNN [34] and the attention model [39]. We apply GMP, operation to obtain log mel spectrogram as the input T-F GAP and GWRP as global pooling in our model. Table II representation feature. Log mel spectrogram has been widely shows that for SNR at 20 dB, the attention model [39] achieves used in audio classification [28, 54]. the best F1-score of 0.714 and mAP of 0.755 followed by the GWRP of 0.635 and 0.753, respectively. On the other D. Model hand, GWRP achieves the best AUC of 0.955. 
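As a consistency check on the feature extraction settings, the frame count and frame duration can be worked out from the window size, overlap and sampling rate. The sketch assumes that no padding is applied at the clip boundaries, which the text does not state; the helper name `num_frames` is hypothetical.

```python
def num_frames(num_samples, window, hop):
    """Number of full analysis windows that fit, assuming no padding."""
    return (num_samples - window) // hop + 1


sample_rate = 32000           # 32 kHz mixed clips
window, overlap = 2048, 1024  # FFT window size and overlap
hop = window - overlap        # 1024 samples between frames

frames = num_frames(10 * sample_rate, window, hop)  # 10-second clip
frame_ms = 1000 * hop / sample_rate                 # duration of one hop
ten_frames_ms = 10 * frame_ms
```

Under this assumption a 10-second clip yields 311 frames, which matches the 311 time steps in Table I, and one hop lasts 32 ms, so 10 frames correspond to the 320 ms used in the post-processing.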
In this subsection we give a detailed description of the configuration of the segmentation mapping in Section IV-A and the classification mapping in Section IV-B. We apply a

Comparing the performance under different SNRs, the F1-score and mAP drop by approximately 0.1 in absolute value when the SNR changes from 20 dB to 0 dB. The AUC drops by approximately 0.04 in absolute value over the same range. This result shows that there is a large variance in audio tagging under low SNR. Table III shows the audio tagging results of all sound events under 0 dB SNR. Some sound events such as “hi-hat” and “tambourine” have higher classification accuracy, while some sound events such as “microwave” and “squeak” are difficult to recognize. On average, the attention model [39] achieves the best F1-score of 0.612, followed by GWRP with 0.534.

TABLE III
F1-SCORE OF AUDIO TAGGING AT 0 DB SNR

Algorithm       Acous. guitar Applause      Bark          Bass drum     Burping       Bus           Cello
DNN [55]        0.286         0.873         0.332         0.041         0.344         0.367         0.489
WLD CNN [37]    0.633         0.896         0.719         0.547         0.794         0.248         0.610
FrameCNN [34]   0.416         0.878         0.719         0.166         0.557         0.385         0.529
Attention [39]  0.548         0.893         0.761         0.632         0.866         0.335         0.616
GMP             0.458         0.522         0.335         0.183         0.400         0.087         0.299
GAP             0.547         0.817         0.409         0.070         0.484         0.205         0.435
GWRP            0.552         0.825         0.654         0.204         0.578         0.342         0.416

Algorithm       Chime         Clarinet      Keyboard      Cough         Cowbell       Double bass   Drawer
DNN [55]        0.546         0.423         0.283         0.075         0.133         0.197         0.083
WLD CNN [37]    0.589         0.504         0.390         0.513         0.889         0.436         0.136
FrameCNN [34]   0.562         0.448         0.507         0.484         0.668         0.314         0.181
Attention [39]  0.607         0.568         0.497         0.565         0.924         0.477         0.160
GMP             0.468         0.424         0.422         0.151         0.774         0.281         0.076
GAP             0.501         0.354         0.504         0.347         0.314         0.181         0.164
GWRP            0.628         0.424         0.573         0.543         0.579         0.333         0.320

Algorithm       Elec. piano   Fart          Finger snap   Fireworks     Flute         Glockenspiel  Gong
DNN [55]        0.304         0.267         0.389         0.285         0.350         0.464         0.310
WLD CNN [37]    0.435         0.384         0.672         0.375         0.270         0.692         0.513
FrameCNN [34]   0.392         0.304         0.556         0.474         0.385         0.488         0.465
Attention [39]  0.546         0.598         0.823         0.463         0.565         0.901         0.617
GMP             0.279         0.284         0.176         0.271         0.315         0.844         0.434
GAP             0.218         0.407         0.399         0.346         0.343         0.496         0.305
GWRP            0.421         0.618         0.473         0.558         0.427         0.726         0.550

Algorithm       Gunshot       Harmonica     Hi-hat        Keys          Knock         Laughter      Meow
DNN [55]        0.297         0.672         0.547         0.418         0.276         0.192         0.075
WLD CNN [37]    0.538         0.742         0.910         0.643         0.649         0.361         0.359
FrameCNN [34]   0.424         0.723         0.688         0.660         0.553         0.390         0.355
Attention [39]  0.607         0.759         0.938         0.744         0.738         0.444         0.499
GMP             0.398         0.322         0.796         0.141         0.483         0.311         0.275
GAP             0.438         0.681         0.641         0.392         0.402         0.480         0.203
GWRP            0.523         0.714         0.798         0.606         0.524         0.563         0.547

Algorithm       Microwave     Oboe          Saxophone     Scissors      Shatter       Snare drum    Squeak
DNN [55]        0.121         0.408         0.500         0.411         0.336         0.368         0.097
WLD CNN [37]    0.263         0.589         0.636         0.558         0.410         0.599         0.052
FrameCNN [34]   0.400         0.490         0.528         0.497         0.481         0.624         0.193
Attention [39]  0.441         0.560         0.678         0.660         0.693         0.709         0.113
GMP             0.207         0.442         0.474         0.173         0.251         0.465         0.031
GAP             0.172         0.372         0.408         0.404         0.392         0.335         0.161
GWRP            0.353         0.487         0.534         0.452         0.653         0.585         0.260

Algorithm       Tambourine    Tearing       Telephone     Trumpet       Violin        Writing       Avg.
DNN [55]        0.299         0.254         0.270         0.528         0.379         0.293         0.331
WLD CNN [37]    0.593         0.436         0.324         0.642         0.755         0.349         0.528
FrameCNN [34]   0.733         0.449         0.346         0.526         0.475         0.431         0.484
Attention [39]  0.957         0.593         0.434         0.368         0.784         0.400         0.612
GMP             0.891         0.504         0.329         0.585         0.567         0.175         0.373
GAP             0.412         0.348         0.341         0.579         0.349         0.408         0.385
GWRP            0.857         0.583         0.508         0.639         0.516         0.452         0.534

F. Frame-wise sound event detection

Table IV shows the F1-score of frame-wise SED for all sound classes under an SNR of 0 dB. GWRP achieves the best averaged F1-score of 0.398, followed by the FrameCNN model [34] with 0.343. Some classes such as “applause” and “hi-hat” have higher frame-wise F1-scores, while some classes such as “drawer” and “squeak” have lower ones. Table V shows the frame-wise SED results under different SNRs. GWRP achieves the best F1-score, AUC and mAP of 0.511, 0.886 and 0.508 under 20 dB SNR. The FrameCNN model [34] takes second place with an F1-score of 0.440. GAP overestimates the sound events, as shown in the visualization of the upsampled T-F segmentation masks (Fig. 7), and does not perform better than GWRP. GMP underestimates the sound events (Fig. 7) and performs worst in frame-wise SED. With GWRP, the F1-score drops from 0.511 to 0.472 to 0.398 under SNRs of 20 dB, 10 dB and 0 dB. Fig. 8 shows the frame-wise scores of sound events obtained from equation (2) under an SNR of 0 dB. The frame-wise scores obtained by using GWRP look closer to the ground truth than those obtained using GMP and GAP. Compared with event-wise SED, frame-wise SED does not depend on post-processing.

TABLE IV
F1-SCORE OF FRAME-WISE SED AT 0 DB SNR

Algorithm       Acous. guitar Applause      Bark          Bass drum     Burping       Bus           Cello
DNN [55]        0.191         0.746         0.239         0.009         0.317         0.306         0.373
WLD CNN [37]    0.113         0.466         0.159         0.052         0.292         0.044         0.318
FrameCNN [34]   0.294         0.741         0.585         0.07          0.411         0.299         0.441
Attention [39]  0.062         0.422         0.069         0.020         0.189         0.024         0.242
GMP             0.000         0.000         0.000         0.000         0.000         0.000         0.000
GAP             0.410         0.661         0.338         0.033         0.341         0.139         0.240
GWRP            0.453         0.704         0.507         0.072         0.456         0.188         0.326

Algorithm       Chime         Clarinet      Keyboard      Cough         Cowbell       Double bass   Drawer
DNN [55]        0.495         0.295         0.202         0.036         0.050         0.123         0.038
WLD CNN [37]    0.298         0.223         0.100         0.142         0.111         0.097         0.020
FrameCNN [34]   0.480         0.342         0.421         0.370         0.283         0.178         0.102
Attention [39]  0.263         0.210         0.019         0.059         0.051         0.045         0.003
GMP             0.000         0.000         0.000         0.000         0.000         0.000         0.000
GAP             0.429         0.195         0.426         0.269         0.121         0.088         0.108
GWRP            0.575         0.341         0.457         0.402         0.222         0.193         0.172

Algorithm       Elec. piano   Fart          Finger snap   Fireworks     Flute         Glockenspiel  Gong
DNN [55]        0.233         0.207         0.156         0.195         0.214         0.291         0.212
WLD CNN [37]    0.078         0.078         0.085         0.085         0.042         0.095         0.037
FrameCNN [34]   0.310         0.239         0.236         0.325         0.246         0.315         0.308
Attention [39]  0.068         0.050         0.076         0.031         0.159         0.026         0.088
GMP             0.000         0.000         0.000         0.000         0.000         0.000         0.000
GAP             0.170         0.297         0.102         0.229         0.173         0.214         0.200
GWRP            0.351         0.498         0.247         0.355         0.316         0.596         0.418

Algorithm       Gunshot       Harmonica     Hi-hat        Keys          Knock         Laughter      Meow
DNN [55]        0.155         0.594         0.510         0.367         0.16          0.111         0.022
WLD CNN [37]    0.093         0.333         0.135         0.160         0.149         0.086         0.056
FrameCNN [34]   0.259         0.595         0.639         0.495         0.354         0.271         0.228
Attention [39]  0.029         0.143         0.107         0.096         0.101         0.051         0.034
GMP             0.000         0.000         0.000         0.000         0.000         0.000         0.000
GAP             0.231         0.528         0.553         0.332         0.233         0.294         0.133
GWRP            0.362         0.649         0.696         0.539         0.354         0.429         0.400

Algorithm       Microwave     Oboe          Saxophone     Scissors      Shatter       Snare drum    Squeak
DNN [55]        0.095         0.314         0.317         0.277         0.254         0.290         0.045
WLD CNN [37]    0.058         0.132         0.234         0.150         0.075         0.141         0.003
FrameCNN [34]   0.284         0.399         0.329         0.379         0.364         0.453         0.111
Attention [39]  0.018         0.137         0.353         0.078         0.038         0.054         0.005
GMP             0.000         0.000         0.000         0.000         0.000         0.000         0.000
GAP             0.121         0.237         0.167         0.146         0.319         0.265         0.114
GWRP            0.182         0.404         0.440         0.384         0.471         0.373         0.173

Algorithm       Tambourine    Tearing       Telephone     Trumpet       Violin        Writing       Avg.
DNN [55]        0.166         0.144         0.190         0.411         0.166         0.212         0.237
WLD CNN [37]    0.195         0.055         0.123         0.287         0.258         0.067         0.140
FrameCNN [34]   0.443         0.277         0.237         0.407         0.228         0.299         0.343
Attention [39]  0.188         0.046         0.148         0.08          0.156         0.056         0.100
GMP             0.000         0.000         0.000         0.000         0.000         0.000         0.000
GAP             0.191         0.274         0.172         0.437         0.105         0.313         0.252
GWRP            0.591         0.378         0.420         0.528         0.331         0.360         0.398

TABLE V
F1-SCORE, AUC AND MAP OF FRAME-WISE SED AT DIFFERENT SNRS

                20 dB                  10 dB                  0 dB
Algorithm       F1     AUC    mAP      F1     AUC    mAP      F1     AUC    mAP
DNN [55]        0.360  0.722  0.269    0.306  0.702  0.224    0.237  0.666  0.169
WLD CNN [37]    0.168  0.669  0.179    0.182  0.688  0.201    0.140  0.701  0.166
FrameCNN [34]   0.440  0.808  0.369    0.399  0.787  0.329    0.343  0.756  0.275
Attention [39]  0.163  0.827  0.317    0.137  0.807  0.278    0.100  0.773  0.221
GMP             0.000  0.676  0.090    0.000  0.658  0.076    0.000  0.649  0.072
GAP             0.398  0.790  0.400    0.334  0.753  0.328    0.252  0.712  0.245
GWRP            0.511  0.886  0.508    0.472  0.871  0.453    0.398  0.829  0.360

Fig. 8. Frame-wise predictions using GMP, GAP, GWRP with SNR at 0 dB. The ground truth annotation is shown in the bottom right.

G. Event-wise sound event detection

Although frame-wise SED does not depend on post-processing and is therefore a more objective criterion, it makes more sense to have event-wise predictions. The event-wise predictions are obtained from the frame-wise predictions following Section IV-C. Table VI shows that GWRP achieves the best F1-score of 0.254 in event-wise SED under 20 dB SNR. Although GMP seems to achieve the lowest ER of 1.00, GMP deletes all the events and has a deletion error of 1.00 and an insertion error of 0. On the other hand, GWRP has the lowest deletion error of 0.66 and an insertion error of 1.45. The F1-scores drop from 0.254 to 0.227 to 0.167 under SNRs of 20 dB, 10 dB and 0 dB. Table VII shows the F1-score of event-wise SED for all sound classes. Some sound classes such as “bark” and “harmonica” have higher detection F1-scores. GWRP achieves the best averaged F1-score of 0.167.

TABLE VI
F1-SCORE, ERROR RATE (ER), DELETION (D) AND INSERTION (I) OF EVENT-WISE SED AT DIFFERENT SNRS

                20 dB                        10 dB                        0 dB
Algorithm       F1     ER     D      I       F1     ER     D      I       F1     ER     D      I
DNN [55]        0.226  1.91   0.75   1.16    0.178  2.29   0.79   1.50    0.120  2.80   0.84   1.96
WLD CNN [37]    0.010  1.16   0.99   0.17    0.011  1.15   0.99   0.17    0.018  1.12   0.99   0.13
FrameCNN [34]   0.166  2.38   0.79   1.58    0.151  2.49   0.81   1.68    0.141  2.70   0.81   1.88
Attention [39]  0.028  1.10   0.96   0.14    0.021  1.10   0.97   0.13    0.011  1.09   0.98   0.10
GMP             0.000  1.00   1.00   0.00    0.000  1.00   1.00   0.00    0.000  1.00   1.00   0.00
GAP             0.173  2.71   0.78   1.93    0.139  2.95   0.82   2.13    0.098  3.52   0.86   2.66
GWRP            0.254  2.12   0.66   1.45    0.227  2.30   0.69   1.61    0.167  2.55   0.76   1.78

TABLE VII
F1-SCORE OF EVENT-WISE SED AT 0 DB SNR

Algorithm       Acous. guitar Applause      Bark          Bass drum     Burping       Bus           Cello
DNN [55]        0.132         0.287         0.083         0.002         0.176         0.233         0.125
WLD CNN [37]    0.020         0.013         0.001         0.001         0.110         0.001         0.036
FrameCNN [34]   0.098         0.510         0.187         0.012         0.090         0.180         0.287
Attention [39]  0.000         0.051         0.000         0.000         0.052         0.000         0.020
GMP             0.000         0.000         0.000         0.000         0.000         0.000         0.000
GAP             0.060         0.416         0.141         0.000         0.150         0.035         0.123
GWRP            0.131         0.225         0.315         0.002         0.352         0.030         0.086

Algorithm       Chime         Clarinet      Keyboard      Cough         Cowbell       Double bass   Drawer
DNN [55]        0.389         0.041         0.141         0.033         0.007         0.068         0.036
WLD CNN [37]    0.067         0.025         0.005         0.001         0.001         0.003         0.001
FrameCNN [34]   0.186         0.157         0.194         0.144         0.005         0.04          0.091
Attention [39]  0.018         0.000         0.000         0.000         0.000         0.009         0.000
GMP             0.000         0.000         0.000         0.000         0.000         0.000         0.000
GAP             0.178         0.085         0.296         0.107         0.016         0.033         0.070
GWRP            0.363         0.086         0.211         0.228         0.010         0.089         0.111

Algorithm       Elec. piano   Fart          Finger snap   Fireworks     Flute         Glockenspiel  Gong
DNN [55]        0.141         0.113         0.035         0.113         0.036         0.079         0.159
WLD CNN [37]    0.001         0.001         0.001         0.001         0.001         0.001         0.001
FrameCNN [34]   0.168         0.163         0.042         0.133         0.081         0.265         0.098
Attention [39]  0.000         0.000         0.000         0.000         0.019         0.000         0.000
GMP             0.000         0.000         0.000         0.000         0.000         0.000         0.000
GAP             0.025         0.197         0.003         0.078         0.054         0.000         0.032
GWRP            0.153         0.312         0.068         0.144         0.060         0.004         0.281

Algorithm       Gunshot       Harmonica     Hi-hat        Keys          Knock         Laughter      Meow
DNN [55]        0.073         0.455         0.205         0.262         0.095         0.054         0.024
WLD CNN [37]    0.001         0.153         0.001         0.003         0.008         0.003         0.001
FrameCNN [34]   0.044         0.226         0.409         0.142         0.071         0.113         0.140
Attention [39]  0.000         0.000         0.000         0.000         0.106         0.000         0.000
GMP             0.000         0.000         0.000         0.000         0.000         0.000         0.000
GAP             0.035         0.440         0.002         0.103         0.095         0.152         0.062
GWRP            0.078         0.519         0.031         0.356         0.206         0.205         0.269

Algorithm       Microwave     Oboe          Saxophone     Scissors      Shatter       Snare drum    Squeak
DNN [55]        0.047         0.135         0.107         0.128         0.174         0.106         0.020
WLD CNN [37]    0.001         0.007         0.043         0.005         0.001         0.011         0.001
FrameCNN [34]   0.120         0.223         0.077         0.140         0.134         0.132         0.042
Attention [39]  0.000         0.000         0.188         0.000         0.000         0.000         0.000
GMP             0.000         0.000         0.000         0.000         0.000         0.000         0.000
GAP             0.064         0.153         0.024         0.041         0.024         0.075         0.078
GWRP            0.118         0.252         0.165         0.167         0.130         0.065         0.067

Algorithm       Tambourine    Tearing       Telephone     Trumpet       Violin        Writing       Avg.
DNN [55]        0.057         0.088         0.100         0.140         0.031         0.173         0.120
WLD CNN [37]    0.001         0.001         0.073         0.077         0.063         0.001         0.018
FrameCNN [34]   0.031         0.104         0.071         0.241         0.052         0.124         0.141
Attention [39]  0.000         0.000         0.000         0.000         0.000         0.000         0.011
GMP             0.000         0.000         0.000         0.000         0.000         0.000         0.000
GAP             0.003         0.078         0.038         0.257         0.013         0.184         0.098
GWRP            0.065         0.146         0.238         0.175         0.105         0.243         0.167

H. Time-frequency segmentation

Table VIII shows the T-F segmentation results of all sound classes under 0 dB SNR. As T-F segmentation cannot be obtained by previous works including the fully connected neural network [55], the CNN trained on weakly labelled data [37], the FrameCNN [34] and the attention model [39], we only report the T-F segmentation results with our proposed methods. GWRP achieves the best F1-score of 0.218 on average. Table IX shows the T-F segmentation results under different SNRs. Under 20 dB SNR, GWRP achieves the best F1-score and mAP of 0.324 and 0.268, respectively, with an AUC of 0.849; GAP attains a higher AUC of 0.889. GMP underestimates the T-F segmentation masks and performs the worst in T-F segmentation. GAP overestimates the T-F segmentation masks and performs worse than GWRP in F1-score. The T-F segmentation masks learned by GWRP (Fig. 7(e)) look closer to the IRM than the T-F segmentation masks learned by using GMP and GAP.

TABLE VIII
F1-SCORE OF TIME-FREQUENCY SEGMENTATION AT 0 DB SNR

Algorithm       Acous. guitar Applause      Bark          Bass drum     Burping       Bus           Cello
GMP             0.000         0.001         0.001         0.000         0.002         0.000         0.003
GAP             0.128         0.391         0.106         0.009         0.155         0.073         0.124
GWRP            0.222         0.519         0.226         0.030         0.291         0.095         0.213

Algorithm       Chime         Clarinet      Keyboard      Cough         Cowbell       Double bass   Drawer
GMP             0.002         0.002         0.002         0.000         0.005         0.001         0.000
GAP             0.187         0.057         0.201         0.143         0.038         0.044         0.068
GWRP            0.313         0.114         0.303         0.241         0.125         0.086         0.100

Algorithm       Elec. piano   Fart          Finger snap   Fireworks     Flute         Glockenspiel  Gong
GMP             0.001         0.000         0.000         0.001         0.001         0.002         0.002
GAP             0.067         0.126         0.029         0.119         0.052         0.081         0.116
GWRP            0.127         0.256         0.092         0.204         0.104         0.212         0.237

Algorithm       Gunshot       Harmonica     Hi-hat        Keys          Knock         Laughter      Meow
GMP             0.001         0.002         0.001         0.002         0.001         0.001         0.000
GAP             0.139         0.264         0.212         0.139         0.074         0.135         0.085
GWRP            0.283         0.379         0.497         0.311         0.190         0.249         0.185

Algorithm       Microwave     Oboe          Saxophone     Scissors      Shatter       Snare drum    Squeak
GMP             0.000         0.001         0.001         0.000         0.000         0.002         0.000
GAP             0.055         0.077         0.120         0.085         0.144         0.108         0.082
GWRP            0.085         0.140         0.257         0.213         0.272         0.196         0.108

Algorithm       Tambourine    Tearing       Telephone     Trumpet       Violin        Writing       Avg.
GMP             0.001         0.001         0.001         0.002         0.003         0.001         0.001
GAP             0.057         0.140         0.059         0.166         0.074         0.130         0.114
GWRP            0.327         0.237         0.138         0.313         0.222         0.215         0.218

TABLE IX
F1-SCORE, AUC AND MAP OF TIME-FREQUENCY SEGMENTATION AT DIFFERENT SNRS

                20 dB                  10 dB                  0 dB
Algorithm       F1     AUC    mAP      F1     AUC    mAP      F1     AUC    mAP
GMP             0.001  0.347  0.008    0.001  0.345  0.007    0.001  0.362  0.005
GAP             0.215  0.889  0.230    0.168  0.880  0.187    0.114  0.861  0.143
GWRP            0.324  0.849  0.268    0.280  0.845  0.227    0.218  0.836  0.175

VI. CONCLUSION

This paper proposes a time-frequency (T-F) segmentation, sound event detection and separation framework trained on weakly labelled data. In training, a segmentation mapping and a classification mapping are trained jointly using the weakly labelled data. In T-F segmentation, we use the trained segmentation mapping to calculate the T-F segmentation masks. Detected sound events can then be obtained from the T-F segmentation masks. As a byproduct, separated waveforms of sound events can be obtained from the T-F segmentation masks. Experiments show that the global weighted rank pooling (GWRP) outperforms the global max pooling, the global average pooling and previously proposed systems in both T-F segmentation and sound event detection. The limitation of this approach is that the T-F segmentation masks do not perfectly match the ideal ratio mask (IRM) of the sound events. In future, we will improve the T-F segmentation masks to match the IRM for event separation.

ACKNOWLEDGMENT

This research was supported by EPSRC grant EP/N014111/1 “Making Sense of Sounds” and a Research Scholarship from the China Scholarship Council (CSC) No. 201406150082. Iwona Sobieraj is sponsored by the European Union’s H2020 Framework Programme (H2020-MSCA-ITN-2014) under grant agreement No. 642685 MacSeNet. The authors thank Dominic Ward for helping to improve the paper in the early stage. The authors thank all anonymous reviewers for their effort and suggestions to improve this paper.

REFERENCES

[1] J. Saraswathy, M. Hariharan, S. Yaacob, and W. Khairunizam. Automatic classification of infant cry: A review. In Proceedings of the International Conference on Biomedical Engineering (ICoBE), pages 543–548, 2012.
[2] A. Harma, M. F. McKinney, and J. Skowronek. Automatic surveillance of the acoustic activity in our living environment. In Proceedings of the IEEE International Conference on Multimedia and Expo (ICME), pages 634–637, 2005.
[3] D. P. W. Ellis. Detecting alarm sounds. In Proceedings of the Consistent & Reliable Acoustic Cues for Sound Analysis Workshop (CRAC ’01), pages 59–62, 2001.
[4] G. Valenzise, L. Gerosa, M. Tagliasacchi, F. Antonacci, and A. Sarti. Scream and gunshot detection and localization for audio-surveillance systems. In Proceedings of the IEEE Conference on Advanced Video and Signal Based Surveillance (AVSS), pages 21–26, 2007.
[5] Z. Xu, Y. Yang, and A. G. Hauptmann. A discriminative CNN video representation for event detection. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 1798–1807, 2015.
[6] R. Girshick, J. Donahue, T. Darrell, and J. Malik. Rich feature hierarchies for accurate object detection and semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 580–587, 2014.
[7] A. Borji, M. Cheng, H. Jiang, and J. Li. Salient object detection: A benchmark. IEEE Transactions on Image Processing, 24(12):5706–5722, 2015.
[8] Sami Abu-El-Haija, Nisarg Kothari, Joonseok Lee, Paul Natsev, George Toderici, Balakrishnan Varadarajan, and Sudheendra Vijayanarasimhan. YouTube-8M: A large-scale video classification benchmark. arXiv preprint arXiv:1609.08675, 2016.
[9] A. Narayanan and D. Wang. Ideal ratio mask estimation using deep neural networks for robust speech recognition. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 7092–7096, 2013.
[10] D. Stowell, D. Giannoulis, E. Benetos, M. Lagrange, and M. D. Plumbley. Detection and classification of acoustic scenes and events. IEEE Transactions on Multimedia, 17(10):1733–1746, 2015.
[11] G. Parascandolo, H. Huttunen, and T. Virtanen. Recurrent neural networks for polyphonic sound event detection in real life recordings. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 6440–6444, 2016.
[12] A. Mesaros, T. Heittola, and T. Virtanen. TUT database for acoustic scene classification and sound event detection. In Proceedings of the 24th European Signal Processing Conference (EUSIPCO), pages 1128–1132, 2016.
[13] D. Giannoulis, E. Benetos, D. Stowell, M. Rossignol, M. Lagrange, and M. D. Plumbley. Detection and classification of acoustic scenes and events: an IEEE AASP challenge. In Proceedings of the IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA),
[14] A. Kumar and B. Raj. Audio event detection using weakly labeled data. In Proceedings of the 2016 ACM on Multimedia Conference, pages 1038–1047, 2016.
[15] S. Adavanne and T. Virtanen. Sound event detection using weakly labeled dataset with stacked convolutional and recurrent neural network. Technical report, DCASE2017 Challenge, September 2017.
[16] Qiuqiang Kong, Yong Xu, Wenwu Wang, and Mark D Plumbley. A joint separation-classification model for sound event detection of weakly labelled data. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 321–325,
[17] O. Maron and T. Lozano-Pérez. A framework for multiple-instance learning. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), volume 10, pages 570–576, 1998.
[18] Thomas G Dietterich, Richard H Lathrop, and Tomás Lozano-Pérez. Solving the multiple instance problem with axis-parallel rectangles. Artificial intelligence, 89(1-2):31–71, 1997.
[19] Z. Zhou and M. Zhang. Neural networks for multi-instance learning. In Proceedings of the International Conference on Intelligent Information Technology (ICIIT), pages 455–459, 2002.
[20] S. Andrews, I. Tsochantaridis, and T. Hofmann. Support vector machines for multiple-instance learning. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), volume 15, pages 577–584,
[21] P. Siva, C. Russell, and T. Xiang. In defence of negative mining for annotating weakly labelled data. In Proceedings of the European Conference on Computer Vision (ECCV), pages 594–608, 2012.
[22] K. Tang, R. Sukthankar, J. Yagnik, and L. Fei-Fei. Discriminative segment annotation in weakly labeled video. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2483–2490, 2013.
[23] A. Kolesnikov and C. H. Lampert. Seed, expand and constrain: Three principles for weakly-supervised image segmentation. In Proceedings of the European Conference on Computer Vision (ECCV), pages 695–711,
[24] A. Krizhevsky, I. Sutskever, and G. E. Hinton. ImageNet classification with deep convolutional neural networks. In Proceedings of the Advances in Neural Information Processing Systems (NIPS), volume 25, pages 1097–1105, 2012.
[25] J. Long, E. Shelhamer, and T. Darrell. Fully convolutional networks for semantic segmentation. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 3431–3440, 2015.
[26] G. E. Dahl, D. Yu, L. Deng, and A. Acero. Context-dependent pre-trained deep neural networks for large-vocabulary speech recognition. IEEE Transactions on Audio, Speech, and Language Processing, 20(1):30–42, 2012.
[27] O. Abdel-Hamid, A. Mohamed, H. Jiang, L. Deng, G. Penn, and D Yu. Convolutional neural networks for speech recognition. IEEE/ACM Transactions on Audio, Speech, and Language Processing, 22(10):1533–1545, 2014.
[28] K. Choi, G. Fazekas, and M. Sandler. Automatic tagging using deep convolutional neural networks. In Proceedings of the 17th International Conference on Music Information Retrieval (ISMIR), pages 805–811,
[29] P. Foster, S. Sigtia, S. Krstulovic, J. Barker, and M. D. Plumbley. CHiME-home: A dataset for sound source recognition in a domestic environment. In IEEE Workshop on Applications of Signal Processing to Audio and Acoustics (WASPAA), 2015.
[30] M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In Proceedings of the European Conference on Computer Vision (ECCV), pages 818–833, 2014.
[31] M. Lin, Q. Chen, and S. Yan. Network in network. In Proceedings of the International Conference on Learning Representations (ICLR), 2014.
[32] Q. Kong, Y. Xu, W. Wang, and M. D Plumbley. Audio set classification with attention model: A probabilistic perspective. In Proceedings of the International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 316–320, 2017.
[33] Brian McFee, Justin Salamon, and Juan Pablo Bello. Adaptive pooling operators for weakly labeled sound event detection. arXiv preprint arXiv:1804.10070, 2018.
[34] S. Chou, J. Jang, and Y. Yang. FrameCNN: A weakly-supervised learning framework for frame-wise acoustic event detection and classification. Technical report, DCASE2017 Challenge, September 2017.
[35] Ting-Wei Su, Jen-Yu Liu, and Yi-Hsuan Yang. Weakly-supervised audio event detection using event-specific gaussian filters and fully convolutional networks. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 791–795, 2017.
[36] Shao-Yen Tseng, Juncheng Li, Yun Wang, Joseph Szurley, Florian Metze, and Samarjit Das. Multiple instance deep learning for weakly supervised audio event detection. arXiv preprint arXiv:1712.09673,
[37] Anurag Kumar and Bhiksha Raj. Deep CNN framework for audio event recognition using weakly labeled web data. arXiv preprint arXiv:1707.02530, 2017.
[38] Donmoon Lee, Subin Lee, Yoonchang Han, and Kyogu Lee. Ensemble of convolutional neural networks for weakly-supervised sound event detection using multiple scale input. In Proceedings of the Detection and Classification of Acoustic Scenes and Events (DCASE) Workshop, pages 74–79, 2017.
[39] Y. Xu, Q. Kong, W. Wang, and M. D. Plumbley. Large-scale weakly supervised audio classification using gated convolutional neural network. Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 121–125, 2017.
[40] S. A. Raki, S. Makino, H. Sawada, and R. Mukai. Reducing musical noise by a fine-shift overlap-add method applied to source separation using a time-frequency mask. In Proceedings of the IEEE International Conference on Acoustics, Speech, and Signal Processing (ICASSP), volume 3, pages 81–84, 2005.
[41] S. Hershey, S. Chaudhuri, D. P. W. Ellis, J. F. Gemmeke, A. Jansen, R. C. Moore, M. Plakal, D. Platt, R. A. Saurous, B. Seybold, et al. CNN architectures for large-scale audio classification. In Proceedings of the IEEE International Conference on Acoustics, Speech and Signal Processing (ICASSP), pages 131–135, 2017.
[42] M. H. Radfar and R. M. Dansereau. Single-channel speech separation using soft mask filtering. IEEE Transactions on Audio, Speech, and Language Processing, 15(8):2299–2310, 2007.
[43] S. Ioffe and C. Szegedy. Batch normalization: Accelerating deep network training by reducing internal covariate shift. In Proceedings of the 32nd International Conference on Machine Learning (ICML), pages 448–456,
[44] V. Nair and G. E. Hinton. Rectified linear units improve restricted Boltzmann machines. In Proceedings of the 27th International Conference on Machine Learning (ICML), pages 807–814, 2010.
[45] B. Zhou, A. Khosla, A. Lapedriza, A. Oliva, and A. Torralba. Learning deep features for discriminative localization. In Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (CVPR), pages 2921–2929, 2016.
[46] K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks: Visualising image classification models and saliency maps. In Proceedings of the International Conference on Learning Representations (ICLR), 2014.
[47] D. Pathak, P. Krahenbuhl, and T. Darrell. Constrained convolutional neural networks for weakly supervised segmentation. In Proceedings of the IEEE International Conference on Computer Vision (ICCV), pages 1796–1804, 2015.
[48] Jaume Amores. Multiple instance classification: Review, taxonomy and comparative study. Artificial Intelligence, 201:81–105, 2013.
[49] Annamaria Mesaros, Toni Heittola, and Tuomas Virtanen. A multi-device dataset for urban acoustic scene classification. arXiv preprint arXiv:1807.09840, 2018.
[50] Eduardo Fonseca, Manoj Plakal, Frederic Font, Daniel P. W. Ellis, Xavier Favory, Jordi Pons, and Xavier Serra. General-purpose tagging of freesound audio with audioset labels: Task description, dataset, and baseline. arXiv preprint arXiv:1807.09902, 2018.
[51] A. Mesaros, T. Heittola, and T. Virtanen. Metrics for polyphonic sound event detection. Applied Sciences, 6(6):162, 2016.
[52] J. A. Hanley and B. J. McNeil. The meaning and use of the area under a receiver operating characteristic (ROC) curve. Radiology, 143(1):29–36,
[53] A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, E. Vincent, B. Raj, and T. Virtanen. DCASE2017 challenge setup: Tasks, datasets and baseline system. In Proceedings of the Detection and Classification of Acoustic Scenes and Events (DCASE) Workshop, pages 85–92, 2017.
[54] Qiuqiang Kong, Turab Iqbal, Yong Xu, Wenwu Wang, and Mark D Plumbley. DCASE 2018 Challenge baseline with convolutional neural networks. arXiv preprint arXiv:1808.00773, 2018.
[55] Q. Kong, I. Sobieraj, W. Wang, and M. D. Plumbley. Deep neural network baseline for DCASE Challenge 2016. Proceedings of the Detection and Classification of Acoustic Scenes and Events (DCASE) Workshop, 2016.
[56] Karen Simonyan and Andrew Zisserman. Very deep convolutional networks for large-scale image recognition. In Proceedings of the International Conference on Learning Representations (ICLR), 2014.
[57] D. Kingma and J. Ba. Adam: A method for stochastic optimization. In Proceedings of the International Conference on Learning Representations (ICLR), 2015.

Qiuqiang Kong (S’17) received the B.Sc. and M.E. degrees from South China University of Technology, Guangzhou, China, in 2012 and 2015, respectively. He is currently pursuing a PhD degree at the University of Surrey, Guildford, UK. His research interests include audio signal processing and machine learning.

Yong Xu (M’17) received the Ph.D. degree from the University of Science and Technology of China (USTC), Hefei, China, in 2015, on the topic of DNN-based speech enhancement and recognition. Currently, he is a senior research scientist at Tencent AI Lab, Bellevue, USA. He worked at the University of Surrey, U.K., as a Research Fellow from 2016 to 2018, working on sound event detection. He visited Prof. Chin-Hui Lee’s lab at Georgia Institute of Technology, USA, from Sept. 2014 to May 2015. He also worked at IFLYTEK from 2015 to 2016 to develop far-field ASR technologies. His research interests include deep learning, speech enhancement and recognition, sound event detection, etc. He received the 2018 IEEE SPS best paper award.

Iwona Sobieraj received the B.A. and M.E. degrees from Warsaw University of Technology, Poland, in 2010 and 2011, respectively. She joined Samsung Electronics R&D, Warsaw, Poland, in 2012. Since 2015 she has been pursuing a PhD degree at the University of Surrey, Guildford, UK. Her main research interests include environmental audio analysis, non-negative matrix factorization and deep learning.

Wenwu Wang (M’02-SM’11) was born in Anhui, China. He received the B.Sc. degree in 1997, the M.E. degree in 2000, and the Ph.D. degree in 2002, all from Harbin Engineering University, China. He then worked at King’s College London, Cardiff University, Tao Group Ltd. (now Antix Labs Ltd.), and Creative Labs, before joining the University of Surrey, UK, in May 2007, where he is currently a Reader in Signal Processing, and a Co-Director of the Machine Audition Lab within the Centre for Vision Speech and Signal Processing. He has been a Guest Professor at Qingdao University of Science and Technology, China, since 2018. His current research interests include blind signal processing, sparse signal processing, audio-visual signal processing, machine learning and perception, machine audition (listening), and statistical anomaly detection. He has (co)-authored over 200 publications in these areas. He served as an Associate Editor for IEEE Transactions on Signal Processing from 2014 to 2018. He is also Publication Co-Chair for ICASSP 2019, Brighton, UK.

Mark D. Plumbley (S’88-M’90-SM’12-F’15) received the B.A.(Hons.) degree in electrical sciences and the Ph.D. degree in neural networks from the University of Cambridge, Cambridge, U.K., in 1984 and 1991, respectively. Following his PhD, he became a Lecturer at King’s College London, before moving to Queen Mary University of London in 2002. He subsequently became Professor and Director of the Centre for Digital Music, before joining the University of Surrey in 2015 as Professor of Signal Processing. He is known for his work on analysis and processing of audio and music, using a wide range of signal processing techniques, including matrix factorization, sparse representations, and deep learning. He is a co-editor of the recent book on Computational Analysis of Sound Scenes and Events, and Co-Chair of the recent DCASE 2018 Workshop on Detection and Classification of Acoustic Scenes and Events. He is a Member of the IEEE Signal Processing Society Technical Committee on Signal Processing Theory and Methods, and a Fellow of the IET and IEEE.
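A note on the event-wise metrics of Table VI, offered as a small consistency sketch and not as the authors' evaluation code. It assumes that the error rate is the sum of the deletion and insertion rates (ER = D + I, with no substitution term); this reading is inferred from the printed values and is not stated in the text. The helper names `max_er_gap`, `lowest_er` and `best_f1` are hypothetical and introduced only for illustration; the numbers are copied from Table VI.

```python
# Table VI as printed: system -> SNR in dB -> (F1, ER, D, I).
TABLE_VI = {
    "DNN [55]":       {20: (0.226, 1.91, 0.75, 1.16), 10: (0.178, 2.29, 0.79, 1.50), 0: (0.120, 2.80, 0.84, 1.96)},
    "WLD CNN [37]":   {20: (0.010, 1.16, 0.99, 0.17), 10: (0.011, 1.15, 0.99, 0.17), 0: (0.018, 1.12, 0.99, 0.13)},
    "FrameCNN [34]":  {20: (0.166, 2.38, 0.79, 1.58), 10: (0.151, 2.49, 0.81, 1.68), 0: (0.141, 2.70, 0.81, 1.88)},
    "Attention [39]": {20: (0.028, 1.10, 0.96, 0.14), 10: (0.021, 1.10, 0.97, 0.13), 0: (0.011, 1.09, 0.98, 0.10)},
    "GMP":            {20: (0.000, 1.00, 1.00, 0.00), 10: (0.000, 1.00, 1.00, 0.00), 0: (0.000, 1.00, 1.00, 0.00)},
    "GAP":            {20: (0.173, 2.71, 0.78, 1.93), 10: (0.139, 2.95, 0.82, 2.13), 0: (0.098, 3.52, 0.86, 2.66)},
    "GWRP":           {20: (0.254, 2.12, 0.66, 1.45), 10: (0.227, 2.30, 0.69, 1.61), 0: (0.167, 2.55, 0.76, 1.78)},
}


def max_er_gap(table):
    """Largest |ER - (D + I)| over all systems and SNRs (assumes ER = D + I)."""
    return max(
        abs(er - (deletion + insertion))
        for rows in table.values()
        for (_f1, er, deletion, insertion) in rows.values()
    )


def lowest_er(table, snr):
    """Name of the system with the smallest printed ER at the given SNR."""
    return min(table, key=lambda name: table[name][snr][1])


def best_f1(table, snr):
    """Name of the system with the largest printed F1-score at the given SNR."""
    return max(table, key=lambda name: table[name][snr][0])


if __name__ == "__main__":
    print("max |ER - (D + I)|:", round(max_er_gap(TABLE_VI), 3))
    for snr in (20, 10, 0):
        print(snr, "dB: lowest ER =", lowest_er(TABLE_VI, snr),
              "| best F1 =", best_f1(TABLE_VI, snr))
```

Under that assumption the printed ER values agree with D + I to within rounding. It also makes the remark about GMP concrete: an ER of 1.00 with D = 1.00 and I = 0.00 is what a system that outputs no events at all would score, which is why the lowest ER in the table coincides with an F1-score of 0 and why F1 is the more informative column here.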

Journal

Electrical Engineering and Systems Science, arXiv (Cornell University)

Published: Apr 12, 2018
