Unsupervised Detection of Anomalous Sound based on Deep Learning and the Neyman-Pearson Lemma

Yuma Koizumi, Member, IEEE, Shoichiro Saito, Member, IEEE, Hisashi Uematsu, Non-Member, Yuta Kawachi, Non-Member, and Noboru Harada, Senior Member, IEEE

All authors are with the NTT Media Intelligence Laboratories, NTT Corporation, Tokyo, Japan (e-mail: koizumi.yuma@ieee.org, {saito.shoichiro, uematsu.hisashi, kawachi.yuta, noboru.harada}@lab.ntt.co.jp). A preliminary version of this work is published in [8]. Copyright (c) 2018 IEEE. This article is the "accepted" version. Digital Object Identifier: 10.1109/TASLP.2018.2877258. arXiv:1810.09133v1 [stat.ML] 22 Oct 2018.

Abstract: This paper proposes a novel optimization principle and its implementation for unsupervised anomaly detection in sound (ADS) using an autoencoder (AE). The goal of unsupervised-ADS is to detect unknown anomalous sounds without training data of anomalous sounds. Use of an AE as a normal model is a state-of-the-art technique for unsupervised-ADS. To decrease the false positive rate (FPR), the AE is trained to minimize the reconstruction error of normal sounds, and the anomaly score is calculated as the reconstruction error of the observed sound. Unfortunately, since this training procedure does not take into account the anomaly score for anomalous sounds, the true positive rate (TPR) does not necessarily increase. In this study, we define an objective function based on the Neyman-Pearson lemma by considering ADS as a statistical hypothesis test. The proposed objective function trains the AE to maximize the TPR under an arbitrary low FPR condition. To calculate the TPR in the objective function, we consider that the set of anomalous sounds is the complementary set of normal sounds and simulate anomalous sounds by using a rejection sampling algorithm. Through experiments using synthetic data, we found that the proposed method improved the performance measures of ADS under low FPR conditions. In addition, we confirmed that the proposed method could detect anomalous sounds in real environments.

Index Terms: Anomaly detection in sound, Neyman-Pearson lemma, deep learning, and autoencoder.

I. Introduction

Anomaly detection in sound (ADS) has received much attention. Since anomalous sounds might indicate symptoms of mistakes or malicious activities, their prompt detection can possibly prevent such problems. In particular, ADS has been used for various purposes including audio surveillance [1]–[4], animal husbandry [5], [6], product inspection, and predictive maintenance [7], [8]. For the last application, since anomalous sounds might indicate a fault in a piece of machinery, prompt detection of anomalies would decrease the number of defective products and/or prevent propagation of damage. In this study, we investigated ADS for industrial equipment by focusing on machine-operating sounds.

ADS tasks can be broadly divided into supervised-ADS and unsupervised-ADS. The difference between the two categories is in the definition of anomalies. Supervised-ADS is the task of detecting "defined" anomalous sounds such as gunshots or screams [2], and it is a kind of rare sound event detection (SED) [9]–[11]. Since the anomalies are defined, we can collect a dataset of the target anomalous sounds even though the anomalies are rarer than normal sounds. Thus, the ADS system can be trained using a supervised method of the kind used in various SED tasks of the "Detection and Classification of Acoustic Scenes and Events challenge" (DCASE), such as audio scene classification [12], [13], sound event detection [14], [15], and audio tagging [16]. On the other hand, unsupervised-ADS [17]–[19] is the task of detecting "unknown" anomalous sounds that have not been observed. In the case of real-world factories, from the viewpoint of development cost, it is impractical to deliberately damage the expensive target machine. In addition, actual anomalous sounds occur rarely and have high variability. Therefore, it is impossible to collect an exhaustive set of anomalous sounds, and we need to detect anomalous sounds for which training data does not exist. For this reason, the task is often tackled as a one-class unsupervised classification problem [17]–[19]. This point is one of the major differences in premise between the DCASE tasks and ADS for industrial equipment. Thus, in this study, we aim to detect unknown anomalous sounds based on an unsupervised approach.

In unsupervised anomaly detection, "anomaly" is defined as the patterns in data that do not conform to expected "normal" behavior [19]. Namely, the universal set consists of only the normal and the anomaly, and the anomaly is the complement to the normal set. More intuitively, the universal set is various machine sounds including many types of machines, the normal set is one specific type of machine sound, and the anomaly set is all other types of machine sounds. Therefore, a typical approach to unsupervised-ADS is the use of an outlier-detection technique. Here, the deviation between a normal model and an observed sound is calculated; the deviation is often called the "anomaly score". The normal model represents the notion of normal behavior and is trained from training data of normal sounds. The observed sound is identified as an anomalous one when the anomaly score is higher than a pre-defined threshold value. Namely, the anomalous sounds are defined as the sounds that do not exist in the training data of normal sounds.

To train the normal model, it is necessary to define the optimality of the anomaly score. One of the popular performance measurements of ADS is to measure both the true positive rate (TPR) and false positive rate (FPR). The TPR is the proportion of anomalies that are correctly identified, and the FPR is the proportion of normal sounds that are incorrectly identified as anomalies. To improve the performance of ADS, we need to increase TPR and decrease FPR simultaneously. However, these metrics are related to the threshold value and have a trade-off relationship, as shown in Fig. 1. When the PDFs of the anomaly scores of normal and anomalous sounds overlap, false detections cannot be avoided regardless of the threshold. Thus, to increase TPR and decrease FPR simultaneously, we need to train the normal model to reduce the overlap area. More intuitively, it is essential to provide small anomaly scores for normal sounds and large anomaly scores for anomalous sounds. In addition, if an ADS system gives false alerts frequently, we cannot trust it, just as "the boy who cried wolf" cannot be trusted. Therefore, it is especially important to increase TPR under a low FPR condition in a practical situation.

Fig. 1. Trade-off relationship between anomaly score, true positive rate (TPR), and false positive rate (FPR). The top figure shows PDFs of anomaly scores [nat] for normal sounds (blue line) and anomalous sounds (red dashed line). The bottom figures show the FPR and TPR with respect to the detection threshold [nat]. When these PDFs overlap, a small threshold leads to a large TPR and FPR, and a large threshold leads to a small TPR and FPR.

The early studies used various statistical models to calculate the anomaly score, such as the Gaussian mixture model (GMM) [3], [8] and support vector machine (SVM) [4]. The recent literature calculates the anomaly score through the use of deep neural networks (DNNs) such as the autoencoder (AE) [20]–[23] and variational AE (VAE) [24], [25]. In the case of the AE, the network is trained to minimize the reconstruction error of the normal training data, and the anomaly score is calculated as the reconstruction error of the observed sound. Thus, the AE provides small anomaly scores for normal sounds. However, it gives no guarantee of increasing anomaly scores for anomalous sounds. Indeed, if the AE generalizes, the anomalous sounds will also be reconstructed and the anomaly score of anomalous sounds will be small. Therefore, to increase TPR and decrease FPR simultaneously, the objective function should be modified.

Another strategy for unsupervised-ADS is the use of a generative adversarial network (GAN) [26], [27]. GANs have been used to detect anomalies in medical images [28]. In this strategy, a generator simulates "fake" normal data, and a discriminator identifies whether the input data is real normal data or not. Therefore, the discriminator can be trained to increase TPR for fake normal data and decrease FPR for true normal data simultaneously. However, since the generator is trained to make normal data, if it perfectly generates normal sounds, the anomaly score of normal sounds and the FPR will increase. Therefore, it is necessary to build an algorithm to simulate "non-normal" sounds.

In this study, we propose a novel optimization principle and its implementation for ADS using an AE. By considering outlier-detection-based ADS as a statistical hypothesis test, we define optimality as an objective function based on the Neyman-Pearson lemma [29]. The objective function works to increase TPR under an arbitrary low FPR condition. A problem in calculating TPR is the simulation of anomalous sound data. Here, we explicitly define the set of anomalous sounds to be the complement to the set of normal sounds and simulate anomalous sounds by using a rejection sampling algorithm.

A preliminary version of this work is presented in [8]. The previous study utilized a DNN as a feature extractor, and the anomaly score was calculated using the negative log-likelihood of a GMM trained from normal data. Thus, although the DNN was trained to maximize the objective function based on the Neyman-Pearson lemma, the normal model was not guaranteed to increase TPR and decrease FPR. In this study, end-to-end training is achieved by using an AE as the normal model, and both the feature extractor and the normal model are trained to increase TPR and decrease FPR.

The rest of this paper is organized as follows. Section II briefly introduces outlier-detection-based ADS and its implementation using an AE. Section III describes the proposed training method and the details of the implementation. After reporting the results of objective experiments using synthetic data and verification experiments in real environments in Section IV, we conclude this paper in Section V. The mathematical symbols are listed in Appendix A.

II. Conventional method

A. Identification of anomalous sound based on outlier detection

ADS is an identification problem of determining whether the sound emitted from a target is a normal sound or an anomalous one. In this section, we briefly introduce the procedure of unsupervised-ADS.

First, an anomaly score $\mathcal{A}(x_\tau, \Theta)$ is calculated using a normal model. Here, $x_\tau \in \mathbb{R}^Q$ is an input vector calculated from the observed sound, indexed on $\tau \in \{1, 2, \ldots, T\}$ for time, and $\Theta$ is the set of parameters of the normal model. In many of the previous studies, $x_\tau$ was composed of hand-crafted acoustic features such as mel-frequency cepstrum coefficients (MFCCs) [1]–[3], and the normal model was often constructed with a PDF of normal sounds. Accordingly, the anomaly score can be calculated as

$$ \mathcal{A}(x_\tau, \Theta) = -\ln p(x_\tau \mid \Theta, y = 0), \tag{1} $$

where $y$ denotes the state, $y = 0$ is normal, and $y \neq 0$ is not normal, i.e., anomalous.
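
As a rough illustration of the anomaly score in (1), the following sketch evaluates the negative log-likelihood of input vectors under a diagonal-covariance GMM. It is only a minimal sketch: the function name `gmm_anomaly_score` and the argument names `weights`, `means`, and `variances` are assumptions made for this example and do not come from the paper, and the GMM parameters are taken as given rather than trained.

```python
import numpy as np


def gmm_anomaly_score(x, weights, means, variances):
    """Return -ln p(x) under a diagonal-covariance GMM, one score per row of x.

    Hypothetical helper for illustration only (not the authors' code).
    x: (T, Q) input vectors; weights: (K,); means, variances: (K, Q).
    """
    x = np.atleast_2d(x)
    diff = x[:, None, :] - means[None, :, :]                      # (T, K, Q)
    # Log normalisation constant of each diagonal Gaussian component.
    log_norm = -0.5 * np.sum(np.log(2.0 * np.pi * variances), axis=1)
    # Log density of every input vector under every component.
    log_comp = log_norm[None, :] - 0.5 * np.sum(
        diff ** 2 / variances[None, :, :], axis=2
    )                                                             # (T, K)
    log_mix = np.log(weights)[None, :] + log_comp
    # Log-sum-exp over components, shifted by the maximum for stability.
    m = log_mix.max(axis=1, keepdims=True)
    log_p = m[:, 0] + np.log(np.exp(log_mix - m).sum(axis=1))
    return -log_p


# Toy normal model: a single standard Gaussian in two dimensions.
scores = gmm_anomaly_score(
    np.array([[0.0, 0.0], [3.0, 3.0]]),
    weights=np.array([1.0]),
    means=np.zeros((1, 2)),
    variances=np.ones((1, 2)),
)
print(scores)  # the vector far from the mean receives the larger score
```

With this toy model the score at the mean equals $\ln 2\pi$, and it grows by half the squared distance from the mean, so vectors that the normal model explains poorly receive larger anomaly scores.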
$p(x \mid \Theta, y = 0)$ is a normal model such as a GMM [8]. $x_\tau$ is determined to be anomalous when the anomaly score exceeds a pre-defined threshold value $\phi$:

$$ \mathcal{H}(x_\tau, \Theta, \phi) = \begin{cases} 0 \ (\text{Normal}) & \mathcal{A}(x_\tau, \Theta) \le \phi \\ 1 \ (\text{Anomaly}) & \mathcal{A}(x_\tau, \Theta) > \phi \end{cases} \tag{2} $$

One of the performance measures of ADS consists of the pair of TPR and FPR. The TPR and FPR can be calculated as expectations of $\mathcal{H}(x, \Theta, \phi)$ with respect to anomalous and normal sounds, respectively:

$$ \mathrm{TPR}(\Theta, \phi) = \mathbb{E}_{x \mid y \neq 0}\left[ \mathcal{H}(x, \Theta, \phi) \right], \tag{3} $$

$$ \mathrm{FPR}(\Theta, \phi) = \mathbb{E}_{x \mid y = 0}\left[ \mathcal{H}(x, \Theta, \phi) \right], \tag{4} $$

where $\mathbb{E}_x[\cdot]$ denotes the expectation with respect to $x$. These metrics are related to $\phi$ and have a trade-off relationship as shown in Fig. 1. The top figure shows the PDFs of anomaly scores for normal sounds $p(\mathcal{A}(x_\tau, \Theta) \mid y = 0)$ and anomalous sounds $p(\mathcal{A}(x_\tau, \Theta) \mid y \neq 0)$. The bottom figures show the FPR and TPR with respect to $\phi$. When these PDFs overlap, false detections, i.e., false positives and/or false negatives, cannot be avoided regardless of $\phi$. In addition, the false detections increase as the overlap area gets wider. Therefore, to increase TPR and decrease FPR simultaneously, it is necessary to train $\Theta$ so that the anomaly score is small for normal sounds and large for anomalous sounds. More precisely, we need to train $\Theta$ to reduce the overlap area.

B. Unsupervised-ADS using an autoencoder

Recently, deep learning has been used to construct a normal model. Several studies on deep-learning-based unsupervised-ADS have used an autoencoder (AE) [20]–[23]. This section briefly describes unsupervised-ADS using an AE (see Fig. 2).

Fig. 2. Anomaly detection procedure using autoencoder. The input vector is compressed and reconstructed by two networks $E$ and $D$, respectively. Since $E$ and $D$ are trained to minimize the reconstruction error of normal sounds, the reconstruction error would be small if $x$ is normal. Thus, the anomaly score is calculated as a reconstruction error, and when the error exceeds a pre-defined threshold $\phi$, the observation is identified as anomalous.

The goal of using an AE is to learn an efficient representation of the input vector by using two neural networks $E$ and $D$, which are called the encoder and decoder, respectively. First, the input vector $x$ is converted into a latent vector $z \in \mathbb{R}^R$ by $E$. Then, an input vector is reconstructed from $z$ by $D$. These processes are expressed as

$$ z = E(x \mid \Theta_E), \tag{5} $$

$$ \hat{x} = D(z \mid \Theta_D). \tag{6} $$

The parameters of both neural networks $\Theta = \{\Theta_E, \Theta_D\}$ are trained to minimize the reconstruction error:

$$ J^{\mathrm{AE}}(\Theta_E, \Theta_D) = \mathbb{E}_x\left[ \left\| x - D(E(x \mid \Theta_E) \mid \Theta_D) \right\|_2^2 \right]. \tag{7} $$

In ADS using an AE, the anomaly score is the reconstruction error of the observed sound, which is calculated as

$$ \mathcal{A}(x_\tau, \Theta) := \left\| x_\tau - D(E(x_\tau \mid \Theta_E) \mid \Theta_D) \right\|_2^2. \tag{8} $$

To train the normal model to provide small anomaly scores for normal sounds, the AE is trained to minimize the average reconstruction error of normal sounds,

$$ J^{\mathrm{AE}}(\Theta_E, \Theta_D) = \frac{1}{N^{(u)}} \sum_{n=1}^{N^{(u)}} \mathcal{A}(x_n^{(u)}, \Theta), \tag{9} $$

where $x_n^{(u)}$ is the $n$-th training data of normal sound and $N^{(u)}$ is the number of training samples of normal sound. This objective function works to decrease the anomaly score of normal sounds. However, there is no guarantee of increasing anomaly scores for anomalous sounds. Indeed, if the AE generalizes, the anomalous sounds will also be reconstructed and the anomaly score of anomalous sounds will also be small. Therefore, (9) does not ensure that false detections are reduced and the accuracy of ADS is improved; thus, it would be better to modify the objective function.

III. Proposed method

We will begin by defining an objective function that builds upon the Neyman-Pearson lemma in Sec. III-A. Then, we will describe the rejection sampling algorithm for simulating anomalous sounds used for calculating TPR in Sec. III-B. After that, the overall training and detection procedures of the proposed method will be summarized in Sec. III-C and Sec. III-D. As a modified implementation of the proposed method, we extend it to maximization of the area under the receiver operating characteristic curve (AUC) in Sec. III-E.

A. Objective function for anomaly detection based on the Neyman-Pearson lemma

From (1) and (2), an anomalous sound satisfies the following inequality:

$$ p(x_\tau \mid \Theta, y = 0) < \exp(-\phi). \tag{10} $$

Since $\phi$ is assumed to be sufficiently large to avoid false positives, an anomalous sound can be defined as "a sound which cannot be regarded as a sample of the normal model." Thus, we can regard outlier-detection-based ADS as a statistical hypothesis test. In other words, the observed sound is identified as anomalous when the following null hypothesis is rejected.

Null hypothesis: $x_\tau$ is a sample of the normal model $p(x_\tau \mid \Theta, y = 0)$.

The Neyman-Pearson lemma [29] states the condition for $\mathcal{A}(x, \Theta)$ that achieves the most powerful test between two simple hypotheses. According to it, the most powerful test has the greatest detection power among all possible tests of a given FPR [30]. More simply, the most powerful test maximizes the TPR under the constraint that the FPR equals $\rho$, i.e.,

maximize $\mathrm{TPR}(\Theta, \phi)$, subject to $\mathrm{FPR}(\Theta, \phi) = \rho$.

Since the FPR can be controlled by manipulating $\phi$, we define $\phi_\rho$ as the $\phi$ satisfying $\mathrm{FPR}(\Theta, \phi_\rho) = \rho$. Accordingly, the objective function to obtain the most powerful test function can be defined as the one that maximizes $\mathrm{TPR}(\Theta, \phi_\rho)$ with respect to $\Theta$. However, since the FPR is also a function of $\Theta$, it may become large when focusing only on TPR. To maximize the TPR and minimize the FPR simultaneously, we train $\Theta$ to maximize the following objective function,

$$ J^{\mathrm{NP}}(\Theta) = \mathrm{TPR}(\Theta, \phi_\rho) - \mathrm{FPR}(\Theta, \phi_\rho), \tag{11} $$

where the superscript "NP" is an abbreviation of "Neyman-Pearson". Since the proposed objective function directly increases TPR and decreases FPR, $\Theta$ can be trained to provide a small anomaly score for normal sounds and a large anomaly score for anomalous sounds.

There are two problems when it comes to training $\Theta$ to maximize (11). The first problem is the calculation of TPR. The TPR and FPR are the expectations of $\mathcal{H}(x, \Theta, \phi)$, and in most practical cases, the expectation is approximated as an average over the training data. Thus, to calculate TPR and FPR, we need to collect enough normal and anomalous sound data for the average to be an accurate approximation of the expectation. However, since anomalous sounds occur rarely and have high variability, this condition is difficult to satisfy. In Sec. III-B, to calculate TPR, we consider "anomaly" to mean "not normal" and simulate anomalous sounds by using a sampling algorithm. The second problem is the determination of the threshold $\phi_\rho$. In a parametric hypothesis test such as a t-test, the threshold at which FPR equals $\rho$ can be analytically calculated. However, a DNN is a non-parametric statistical model; thus, the threshold $\phi_\rho$ cannot be analytically calculated. In Sec. III-C, we numerically calculate $\phi_\rho$ as the $\lfloor \rho M \rfloor$-th value of the sorted anomaly scores of $M$ normal sounds, where $\lfloor \cdot \rfloor$ is the flooring function.

B. Anomalous sound simulation using an autoencoder

In accordance with (10), anomalous sounds emitted from the target machine are different from normal ones. Thus, we consider the set of normal sounds to be a subset of various machine sounds, and the set of anomalous sounds to be its complement. Then, we use rejection sampling to simulate anomalous sounds; namely, a sound is sampled from the PDF of various machine sounds, and it is accepted as an anomalous sound when its anomaly score is high. However, since the PDF of various machine sounds in the input vector domain $p(x)$ may have a complex form, the PDF cannot be written in an analytical form and the sampling algorithm would become complex. Inspired by the strategy of the VAE, we can avoid this problem by training $E$ so that the PDF of various latent vectors $p(z)$ is mapped to a PDF whose samples can be generated by a pseudorandom number generator from a uniform distribution and its variable conversion. Then, the latent vectors of anomalous sounds $z^{(a)}$ are sampled using the rejection sampling algorithm, and the input vectors of anomalous sounds $x^{(a)}$ are reconstructed using a third neural network $G$,

$$ x^{(a)} = G(z^{(a)} \mid \Theta_G), \tag{12} $$

where $\Theta_G$ is the parameter of $G$. Hereafter, we call $G$ the generator. Although there is no constraint on the architecture of $G$, we will use the same architecture for $D$ and $G$. In addition, to simply generate and reject a candidate latent vector, we use two constraints to train $\Theta_E$ and $\Theta_G$, and model the PDF of normal latent vectors using the GMM as

$$ p(z \mid \Psi, y = 0) = \sum_{k=1}^{K} w_k \, \mathcal{N}(z \mid \mu_k, \Sigma_k), \tag{13} $$

where $\Psi = \{w_k, \mu_k, \Sigma_k \mid k = 1, \ldots, K\}$, $K$ is the number of mixtures, and $w_k$, $\mu_k$, and $\Sigma_k$ are respectively the weight, mean vector, and covariance matrix of the $k$-th Gaussian. The concepts of these PDFs are shown in Fig. 3, and the procedure of anomalous sound simulation is summarized in Algorithm 1 and Fig. 4.

Fig. 3. Concept of PDFs of normal, various, and anomalous sounds using two neural networks. The PDF of normal sounds (i.e., meshed area) is a subset of the PDF of various sounds (i.e., gray area), and the PDF of anomalous sounds is expressed as the complement of the PDF of normal sounds (i.e., inside the gray area and outside the meshed area). $x$ is mapped to $z$ by $E$, and $z$ is reconstructed to $\tilde{x}$ by $G$. Here, $E$ and $G$ are trained to satisfy $p(z) = \mathcal{N}(z \mid 0_R, I_R)$ and $x = \tilde{x}$, respectively. The PDF of the latent vector of normal sounds is modeled using a GMM $p(z \mid \Psi, y = 0)$ given by (13).

Algorithm 1 Simulation algorithm of anomalous sound in latent vector space.
  Input: Generator $G$, GMM $p(z \mid \Psi, y = 0)$, and $\phi_z$
  $\ell \leftarrow -\infty$
  while $\ell \le \phi_z$ do
    Draw $\tilde{z}$ from $\mathcal{N}(z \mid 0_R, I_R)$
    Evaluate $\ell \leftarrow -\ln p(\tilde{z} \mid \Psi, y = 0)$
  end while
  $z^{(a)} \leftarrow \tilde{z}$
  Generate anomalous sound by $x^{(a)} = G(z^{(a)} \mid \Theta_G)$
  Output: $x^{(a)}$

Fig. 4. Procedure of anomalous sound simulation using autoencoder.

First, we describe the two constraints for training $\Theta_E$ and $\Theta_G$. For algorithmic efficiency, samples of $p(z)$ should be generated with a low computational cost. As an implementation of $p(z)$, we use the normalized Gaussian distribution, because its samples can be generated by a pseudorandom number generator such as the Mersenne Twister. Thus, for training $\Theta_E$ and $\Theta_G$, we use the first constraint so that $z$ of the various machine sounds follows a normalized Gaussian distribution. To satisfy the first constraint, we train $\Theta_E$ to minimize the following Kullback-Leibler divergence (KLD):

$$ J^{\mathrm{KL}}(\Theta_E) = D_{\mathrm{KL}}\left( \mathcal{N}(z \mid 0_R, I_R) \,\|\, \mathcal{N}(z \mid \mu_V, \Sigma_V) \right) = \frac{1}{2}\left[ \ln |\Sigma_V| + \mathrm{tr}\left\{ \Sigma_V^{-1} \right\} + \mu_V^\top \Sigma_V^{-1} \mu_V - R \right], \tag{14} $$

where the superscript "KL" is an abbreviation of "Kullback-Leibler", $\mathrm{tr}\{\cdot\}$ denotes the trace of a matrix, $\top$ denotes transposition, $0_R$ and $I_R$ are respectively the zero vector and unit matrix with dimension $R$, and $\mu_V$ and $\Sigma_V$ are respectively the mean vector and covariance matrix calculated from $z$ of the various machine sounds. To generate anomalous sounds from (12), $G$ needs to reconstruct various machine sounds, as $x^{(v)} = G(E(x^{(v)} \mid \Theta_E) \mid \Theta_G)$. Thus, as a second constraint, we train $\Theta_E$ and $\Theta_G$ to minimize the reconstruction error (7) calculated on the various machine sounds.

Next, we describe the GMM that models the PDF of the normal latent vectors. To reject a candidate $\tilde{z}$ which seems to be $z$ of a normal sound, we need to calculate the probability that the candidate is a normal one. To calculate the probability, we need to model $p(z \mid y = 0)$. Since there is no constraint on the form of $p(z \mid y = 0)$ in the training procedure of $\Theta_E$, $p(z \mid y = 0)$ might have a complex form. For simplicity, we use a GMM expressed as (13).

C. Detailed description of training procedure

Here, we describe the details of the training procedure shown in Fig. 5. The training procedure consists of three steps. Hereafter, we call the proposed method using this training procedure NP-PROP. The algorithm inputs are training data constructed from normal sounds and various machine sounds, and the outputs are $\Theta_E$ and $\Theta_D$. Moreover, $x_n^{(v)}$ and $x_n^{(u)}$ respectively denote the $n$-th training samples of minibatches of various and normal machine sounds, and $M$ is the number of samples included in a minibatch.

Fig. 5. Training procedure of the proposed method.

First, $\Theta_E$ and $\Theta_G$ are trained to simulate anomalous sounds. A minibatch of various machine sounds is randomly selected from the training dataset of various machine sounds. Next, its latent vectors are calculated as $z_n^{(v)} = E(x_n^{(v)} \mid \Theta_E)$. Then, the parameters of the Gaussian distribution of the minibatch are calculated as

$$ \mu_V = \frac{1}{M} \sum_{n=1}^{M} z_n^{(v)}, \tag{15} $$

$$ \Sigma_V = \frac{1}{M} \sum_{n=1}^{M} \left( z_n^{(v)} - \mu_V \right) \left( z_n^{(v)} - \mu_V \right)^\top. \tag{16} $$

Finally, to minimize the KLD and the reconstruction error of various sounds, the objective function is calculated as

$$ J^{\mathrm{KR}}(\Theta_E, \Theta_G) = J^{\mathrm{KL}}(\Theta_E) + \frac{1}{M} \sum_{n=1}^{M} \left\| x_n^{(v)} - G\left( E\left( x_n^{(v)} \mid \Theta_E \right) \mid \Theta_G \right) \right\|_2^2, \tag{17} $$

where the superscript "KR" is an abbreviation of "KLD and reconstruction", and $\Theta_E$ and $\Theta_G$ are updated by gradient descent to minimize $J^{\mathrm{KR}}(\Theta_E, \Theta_G)$:

$$ \Theta_E \leftarrow \Theta_E - \epsilon \nabla_{\Theta_E} J^{\mathrm{KR}}(\Theta_E, \Theta_G), \tag{18} $$

$$ \Theta_G \leftarrow \Theta_G - \epsilon \nabla_{\Theta_G} J^{\mathrm{KR}}(\Theta_E, \Theta_G), \tag{19} $$

where $\epsilon$ is the step size.

Second, $\Theta_E$ and $\Theta_D$ are trained to maximize the objective function. A minibatch of normal sounds $x^{(u)}$ is randomly selected from the training dataset of normal sounds, and a minibatch of anomalous sounds $x^{(a)}$ is simulated using Algorithm 1. Here, since a DNN is not a parametric PDF, the threshold $\phi_\rho$ that satisfies $\mathrm{FPR}(\Theta, \phi_\rho) = \rho$ cannot be analytically calculated. Thus, in this study, we approximately calculate $\phi_\rho$ by sorting the anomaly scores of normal sounds in the minibatch $x^{(u)}$. First, $\mathcal{A}(x^{(u)}, \Theta)$ and $-\ln p(z^{(u)} \mid \Psi, y = 0)$ are calculated, and $\phi_\rho$ and $\phi_z$ are set as the $\lfloor \rho M \rfloor$-th value of the sorted $\mathcal{A}(x^{(u)}, \Theta)$ and $-\ln p(z^{(u)} \mid \Psi, y = 0)$ in descending order, respectively. Then, the TPR and FPR are approximately evaluated as

$$ \mathrm{TPR}(\Theta, \phi_\rho) \approx \frac{1}{M} \sum_{n=1}^{M} \mathrm{sigmoid}\left( \mathcal{A}\left( x_n^{(a)}, \Theta \right) - \phi_\rho \right), \tag{20} $$

$$ \mathrm{FPR}(\Theta, \phi_\rho) \approx \frac{1}{M} \sum_{n=1}^{M} \mathrm{sigmoid}\left( \mathcal{A}\left( x_n^{(u)}, \Theta \right) - \phi_\rho \right), \tag{21} $$

where the binary decision function $\mathcal{H}$ is approximated by a sigmoid function, allowing the gradient to be analytically calculated. Finally, $\Theta_E$ and $\Theta_D$ are updated to increase $J^{\mathrm{NP}}(\Theta)$ by gradient ascent:

$$ \Theta_E \leftarrow \Theta_E + \epsilon \nabla_{\Theta_E} J^{\mathrm{NP}}(\Theta), \tag{22} $$

$$ \Theta_D \leftarrow \Theta_D + \epsilon \nabla_{\Theta_D} J^{\mathrm{NP}}(\Theta). \tag{23} $$

Third, to update the PDF of the latent vectors of normal sounds $p(z \mid \Psi, y = 0)$, each time (18)–(23) have been repeated a certain number of times, $\Psi$ is updated using the expectation-maximization (EM) algorithm for the GMM using all training data of normal sounds. The above algorithm is run for a pre-defined number of epochs.

D. Detailed description of detection procedure

After training $\Theta_E$ and $\Theta_D$, we can identify whether the observed sound is a normal one or not. First, the input vector $x_\tau$, $\tau \in \{1, \ldots, T\}$, is calculated from the observed sound. Then, the anomaly score is calculated as (8). Finally, a decision score, $V = \frac{1}{T} \sum_{\tau=1}^{T} \mathcal{H}(x_\tau, \Theta, \phi_\rho)$, is calculated, and when $V$ exceeds a pre-defined value $\phi_V$, the observed sound is determined to be anomalous. In this study, we used $\phi_V = 0$, meaning that, if the anomaly score exceeds the threshold even for one frame, the observed sound is determined to be anomalous.

E. Modified implementation as an AUC maximization

The receiver operating characteristic (ROC) curve and the AUC are widely used performance measures for imbalanced data classification and/or anomaly detection. The AUC is calculated as

$$ \mathrm{AUC}(\Theta) = \mathbb{E}_{x \mid y = 0}\Big[ \underbrace{\mathbb{E}_{x' \mid y \neq 0}\left[ \mathcal{H}\left( x', \Theta, \mathcal{A}(x, \Theta) \right) \right]}_{\mathrm{TPR}(\Theta, \mathcal{A}(x, \Theta))} \Big] \tag{24} $$

$$ \approx \frac{1}{M} \sum_{n=1}^{M} \mathrm{TPR}\left( \Theta, \mathcal{A}\left( x_n^{(u)}, \Theta \right) \right). \tag{25} $$

As we can see in (25), anomalous sound data are needed to calculate the AUC. Although the AUC has been used as an objective function in imbalanced data classification [31]–[33], it has not been applied to unsupervised-ADS so far. Fortunately, since the proposed rejection sampling can simulate anomalous sound data, AUC maximization can be used as an objective function of ADS. Instead of $J^{\mathrm{NP}}(\Theta)$, the following objective function can be used in the training procedure:

$$ J^{\mathrm{AUC}}(\Theta) = \frac{1}{M} \sum_{n=1}^{M} \left[ \mathrm{TPR}\left( \Theta, \mathcal{A}\left( x_n^{(u)}, \Theta \right) \right) - \mathrm{FPR}\left( \Theta, \mathcal{A}\left( x_n^{(u)}, \Theta \right) \right) \right]. \tag{26} $$

Hereafter, we call the proposed method using $J^{\mathrm{AUC}}(\Theta)$ instead of $J^{\mathrm{NP}}(\Theta)$ AUC-PROP.

IV. Experiments

We conducted experiments to evaluate the performance of the proposed method. First, we conducted an objective experiment using synthetic anomalous sounds (Sec. IV-B). To generate a large enough anomalous dataset for the ADS accuracy evaluation, we used collision and sustained sounds from datasets for detection and classification of acoustic scenes and events 2016 (DCASE-2016 [36]). To show the effectiveness of the method in real environments, we conducted verification experiments in three real environments (Sec. IV-C).

A. Experimental conditions

1) Compared methods: The proposed methods described in Sec. III-C (NP-PROP) and Sec. III-E (AUC-PROP) were compared with three state-of-the-art ADS methods:

AE [20]: ADS using the autoencoder described in Sec. II-B. The encoder and decoder were trained to minimize (9).

VAE [24]: $E$ and $D$ were implemented using a VAE. The encoder estimated the mean and variance parameters of the Gaussian distribution in the latent space. Then, the latent vectors were sampled from the Gaussian distribution whose parameters were estimated by the encoder. Then, the decoder reconstructed the input vector from the sampled latent vectors. Finally, the reconstruction error was calculated and used as the anomaly score.

VAEGAN [27]: To investigate the effectiveness of the anomalous sound simulation, VAEGAN [27] was used to simulate fake normal data. The generators (i.e., VAE) were used to simulate fake normal sounds. The output of the discriminator without the sigmoid activation was used as the anomaly score.

We also used our previous work [8] (CONV-PROP) for comparison. This method uses a VAE to extract latent vectors as acoustic features. A GMM is used for the normal model, and the encoder and decoder are trained to maximize (11).

2) DNN architecture and setup: We tested two types of network architecture as shown in Fig. 6. The first architecture, "FNN", consisted of fully connected DNNs with three hidden layers and 512 hidden units. The rectified linear unit (ReLU) was used as the activation function of the hidden layers. The input vector $x_\tau$ was defined as

$$ x_\tau := \left( \ln\left[ \mathrm{Mel}\left[ \mathrm{Abs}\left[ X_{\tau - C} \right] \right] \right], \ldots, \ln\left[ \mathrm{Mel}\left[ \mathrm{Abs}\left[ X_{\tau + C} \right] \right] \right] \right), $$

$$ X_\tau := \left( X_{1,\tau}, \ldots, X_{\Omega,\tau} \right), $$

where $X_{\omega,\tau}$ is the discrete Fourier transform (DFT) spectrum of the observed sound, $\omega \in \{1, \ldots, \Omega\}$ denotes the frequency index, $C$ (= 5) is the context window size, and $\mathrm{Mel}[\cdot]$ and $\mathrm{Abs}[\cdot]$ denote 40-dimensional Mel matrix multiplication and the element-wise absolute value. Thus, the dimension of $x_\tau$ was $Q = 40 \times (2C + 1) = 440$. The second architecture, "1D-CRNN", consisted of a one-dimensional convolution neural network (1D-CNN) layer and a long short-term memory (LSTM) layer; it worked well in supervised anomaly detection (rare SED) in DCASE 2017 [10]. In order to detect anomalous sounds in real time, we changed the backward LSTM to a forward one. In addition, to avoid overfitting, we used only one forward LSTM layer instead of two backward LSTM layers. The input vector $x_\tau$ was a 40-dimensional log mel-band energy:

$$ x_\tau := \ln\left( \mathrm{Mel}\left[ \mathrm{Abs}\left[ X_\tau \right] \right] \right). $$

The dimension of $x_\tau$ was $Q = 40$. For each architecture, the dimension of the latent vector $z$ was $R = 40$. All input vectors were mean-and-variance normalized using the training data statistics.

Fig. 6. Network architectures (FNN and 1D-CRNN) of the encoder, decoder, and generator used for NP-PROP and AUC-PROP. The encoder and decoder of AE have the same architecture. In VAE, VAEGAN, and CONV-PROP, the encoder has two output layers for the mean and variance vectors. In VAEGAN, the architecture of the discriminator is the same as that of the encoder, but the output dimension of the fully connected layer is 1.

As an implementation of the gradient method, the Adam method [34] was used instead of the gradient descent/ascent shown in (18)–(23). To avoid overfitting, $L_2$ normalization [35] with a regularization penalty of $10^{\ldots}$ was used. The minibatch size for all methods was $M = 512$. All models were trained for 500 epochs. In all methods, the average value of the loss was calculated on the training set at every epoch, and when the loss did not decrease for five consecutive epochs, the step size was halved.

3) Other conditions: All sounds were recorded at a sampling rate of 16 kHz. The frame size of the DFT was 512, and the frame was shifted every 256 samples. For $p(z \mid \Psi, y = 0)$, the number of Gaussian mixtures was $K = 16$, and a diagonal covariance matrix was used to prevent the problem from being ill-conditioned. The GMM was updated by the EM algorithm each time (18)–(23) had been iterated 30 times. All the above-mentioned conditions are summarized in Table I.

TABLE I
Experimental conditions

Parameters for signal processing
  Sampling rate                                  16.0 kHz
  FFT length                                     512 pts
  FFT shift length                               256 pts
  Number of mel-filterbanks                      40
Other parameters
  Context window size C                          5
  Dimension of input vector Q for FNN            440
  Dimension of input vector Q for 1D-CRNN        40
  Dimension of acoustic feature vector R         40
  GMM update per gradient method                 30
  Number of mixtures K                           16
  Minibatch size M                               512
  FPR parameter $\rho$                           0.2
  Step size $\epsilon$                           $10^{\ldots}$
  $L_2$ normalization parameter                  $10^{\ldots}$

B. Objective experiments on synthetic data

1) Dataset: Sounds emitted from a condensing unit of an air conditioner operating in a real environment were used as the normal sounds. In addition, various machine sounds were recorded from other machines, including a compressor, engine, compression pump, and an electric drill, as well as environmental noise of factories. The normal and various machine sound data totaled 4 and 20 hours (= 4 hours normal + 16 hours other machines), respectively. These sounds were recorded at a 16-kHz sampling rate. In order to improve the robustness to different loudness levels and ratios of the normal and anomalous sounds, the various machine sounds in the training dataset were augmented by multiplication with five amplitude gains. These gains were calculated so that the maximum amplitudes of the various sounds become 1.0, 0.5, 0.25, 0.125, and 0.063.

Fig. 7. Evaluation results of FNN. (Panels: ANR of -15 dB, -20 dB, and -25 dB; categories: Collision, Sustain, Mix; legend: AE, VAE, VAEGAN, NP-CONV, NP-PROP, AUC-PROP.)

Fig. 8. Evaluation results of 1D-CRNN. (Panels: ANR of -15 dB, -20 dB, and -25 dB; categories: Collision, Sustain, Mix; legend: AE, VAE, VAEGAN, NP-CONV, NP-PROP, AUC-PROP.)

including anomalous sounds, synthetic anomalous data were used in this evaluation. In particular, we used the training datasets for a task of DCASE-2016 [36] as anomalous sounds. Although these sounds are "normal" sounds in an office, in unsupervised-ADS, the unknown sounds are categorized as "anomalous". Thus, we consider that this evaluation can at least evaluate the detection performance for unknown sounds. Since the anomalous sounds of machines are roughly categorized into collision sounds (e.g., the sound of a metal
part falling on the floor) and sustained sounds (e.g., frictional sound caused by scratched bearings), we selected 80 collision Since it is dicult to collect a massive amount of test data pAUC TPR AUC pAUC TPR AUC 9 NP-PROP AE NP-PROP AUC-PROP VAE AUC-PROP ANR: -15 dB ANR: -20 dB ANR: -25 dB 1 1 1 AE AE 0.5 0.5 NP-PROP 0.5NP-PROP AUC-PROP AUC-PROP 0  0  0 0 0.5 1 0 0.5 1 0 0.5 1 1 FPR 1 FPR 1 FPR AE AE NP-PROP NP-PROP 0.5 0.5 0.5 AUC-PROP AUC-PROP 0 0 0 0 0.5 1 0 0.5 1 0 0.5 1 FPR FPR FPR Fig. 9. ROC curves of AE, NP-PROP and AUC-PROP for each ANR condition evaluated on Mix dataset. sounds, including (slamming doors , knocking at doors , 1. The parameters were  = 0:05 and p = 0:1. We evaluated keys put on a table, keystrokes on a keyboard), and 60 these metrics for three di erent evaluation sets: 80 collision sustained sounds (drawers being opened, pages being turned, sounds (Collision), 60 sustained sounds (Sustain), and the sum and phones ringing), from this dataset [37]. To synthesize of these 80 + 60 = 140 sounds (Mix). the test data, the anomalous sounds were mixed with normal The results for each score, sound category, and ANR on sounds at anomaly-to-normal power ratios (ANRs ) of -15, FNN and 1D-CRNN are shown in Fig. 7 and Fig. 8. Overall, -20 and -25 dB using the following procedure: the performances of AE, NP-PROP and AUC-PROP were better than those of VAE and VAEGAN. In detail, AE achieved high 1) select an anomalous sound and randomly cut a normal so scores for all measurements, AUC-PROP achieved high scores that has the same signal length of the selected anomalous for AUC and pAUC, and NP-PROP achieved high scores for sound. TPR and pAUC. In addition, for all conditions, the TPR 2) for the cut normal and anomalous sounds, calculate and pAUC scores of NP-PROP were higher than those of AE. 
the frame-wise log power of each of 512 points with To discuss the di erence between the objective functions of a 256 point shift on a dB scale, namely P = AE, NP-PROP and AUC-PROP, we show the ROC curves in 20 log X : !; !=1 Fig. 9. Since the di erences between the results of Collision, 3) select the median of P as the representative power of Sustained, and Mix were small, we plotted only those of the each sound as. Mix dataset. From these ROC curves, we can see that the TPRs 4) manipulate the power of the anomalous sound so that of NP-PROP under the low FPR conditions were significantly the ANR has the desired value. higher than those of other methods. This might be because the 5) used the cut normal sound as the test data of normal objective function of NP-PROP works to increase TPR under sound, and generate the test data of the anomalous the low FPR condition. In addition, although AUC-PROP’s sound by mixing the anomalous sound with the quarried TPRs under the low FPR condition were lower than those normal sound. of NP-PROP, the TPRs under the moderate and high FPR In total, we used 140 normal and anomalous sound samples for conditions were higher than those of the other methods. This each ANR condition. The training dataset of normal sounds might be because the objective function of AUC-PROP works and the MATLAB code to generate the test dataset are freely to increase TPR for all FPR conditions. Since the individual available on the website . results and objective function tend to coincide, we consider 2) Results: To evaluate the performance of ADS, we used that the training of each neural network succeeded. In addition, the AUC, TPR, and partial AUC ( pAUC) [38]. The AUC is TPR under the low FPR conditions is especially important a traditional performance measure of anomaly detection. 
The when the ADS is used in real environments, because if an other two measurements evaluated the performance under low ADS system frequently gives false alert, we cannot trust it. FPR conditions. TPR is the TPR under the condition that Therefore, unsupervised-ADS using an AE trained using (11) FPR equals . The pAUC is an AUC calculated with FPRs would be e ective in real situations. ranging from 0 to p with respect to the maximum value of In addition, regarding the FNN results, VAE scored lower than AE, and VAEGAN scored lower than all the other methods. ANR is a measure comparing the level of an anomalous sound to the These results suggest that when calculating the anomaly score level of a normal sound. This definition is the same as the signal-to-noise using a simple network architecture like FNN, a simple ratio (SNR) when the signal is an anomalous sound and the noise is a normal reconstruction error would be better than complex calculation sound. https://archive.org/details/ADSdataset procedures such as VAE and VAEGAN. 
Moreover, the scores of NP-CONV were lower than those of the DNN-based methods. In our previous study [8], we used a DNN as a feature extractor and constructed the normal model by using a GMM. These results suggest that using a DNN for the normal model would be better than using a GMM.

C. Verification experiment in a real environment

We conducted three verification experiments to test whether anomalous sounds in real environments can be detected. The target equipment and experimental conditions were as follows:

• Stereolithography 3D-printer: We collected an actual collision-type anomalous sound. Two hours' worth of normal sounds were collected as training data. The anomalous sound was caused by a collision of the sweeper and the formed object. The 3D-printer stopped 5 minutes after this anomalous sound occurred.
• Air blower pump: We collected an actual collision-type anomalous sound. Twenty minutes' worth of normal sounds were collected as training data. The anomalous sound was caused by blockage by a foreign object stuck in the air blower duct. This anomaly does not lead to immediate machine failure; however, it should be addressed.
• Water pump: We collected an actual sustained-type anomalous sound. Three hours' worth of normal sounds were collected as training data. Above 4 kHz, the anomalous sound has a larger amplitude than the normal sounds, which was due to wear of the bearings. An expert conducting a periodic inspection diagnosed that the bearings needed to be replaced.

All anomalous and normal sounds were recorded at a 16-kHz sampling rate. The other conditions were the same as in the objective experiment. The FNN architecture was used for the anomaly score calculation.

Fig. 10. Anomaly detection results for sound emitted from (a) 3D-printer (collision, left), (b) air blower pump (collision, center), and (c) water pump (sustained, right). The top figure shows the spectrogram (frequency in kHz versus time in seconds), and the bottom figures show the anomaly score (black solid line) and threshold $\phi_{0.001}$ (red dashed line) of each method (AE, VAE, VAEGAN, NP-PROP, and AUC-PROP). Anomalous sounds are enclosed in white dotted boxes, and false-positive detections are circled in purple. Since the spectrum changes due to the anomalous sounds of the 3D-printer and water pump are difficult to see, their anomalous sounds are enlarged. In addition, since the anomalous sound of the water pump is a sustained sound, 60 seconds of normal sounds and 60 seconds of anomalous sound are concatenated for comparison.

Figure 10 shows the spectrogram (top) and anomaly scores of each method (bottom). The red dashed line in each of the bottom figures is the threshold $\phi_{0.001}$, which is defined such that the FPR of the training data was 0.1%. Anomalous sounds are enclosed in white dotted boxes in the spectrograms, and the false-positive detections are circled in purple in the anomaly score graphs. Since the anomalous sound of the water pump is a sustained sound, for ease of comparison, 60 seconds of normal sounds and 60 seconds of anomalous sound are concatenated in each figure. In addition, the anomalous sounds are enlarged, since the spectrum changes due to the anomalous sounds of the 3D-printer and water pump are difficult to see.

All of the results for NP-PROP and AUC-PROP indicate that anomalous sounds were clearly detected; the anomaly scores of the anomalous sounds evidently exceeded the threshold, while those of the normal sounds were below the threshold. Meanwhile, in the results of AE and VAE, although the anomaly scores of all anomalous sounds exceeded the threshold, false positives were also observed in the results for the water pump. In addition, although AE's anomaly score of the 3D-printer and VAE's anomaly score of the air blower pump exceeded the threshold, the excess margin of the anomaly score is small, and it is difficult to use a higher threshold for reducing FPR. This problem might arise because the objective functions do not work to increase anomaly scores for anomalous sounds, and thus the encoder and decoder reconstructed not only normal sounds but also anomalous sounds. In VAEGAN, the anomaly scores of the 3D-printer and the water pump exceeded the threshold, whereas those of the air blower pump did not exceed the threshold. The reason might be that when the generator precisely generates “fake” normal sounds, the normal model is trained to increase the anomaly scores of normal sounds. Therefore, the threshold of the air blower pump, which is defined so that the FPR on the normal training data becomes 0.001, takes a very high value. These verification experiments suggest that the proposed method is effective at identifying anomalous sounds under practical conditions.

V. Conclusions

This paper proposed a novel training method for unsupervised-ADS using an AE for detecting unknown anomalous sound. The contributions of this research are as follows: 1) By considering outlier-detection-based ADS as a statistical hypothesis test, we defined an objective function that builds upon the Neyman-Pearson lemma [29]. The objective function increases the TPR under a low FPR condition, which is often used in practice. 2) By considering the set of anomalous sounds to be the complement to the set of normal sounds, we formulated a rejection sampling algorithm to simulate anomalous sounds. Experimental results showed that these contributions enabled us to construct an ADS system that accurately detects unknown anomalous sounds in three real environments.

In future, we will tackle the following remaining issues of ADS systems in real environments:

1) Extension to a supervised approach to detect both known and unknown anomalous sounds: while operating an ADS system in a real environment, we may occasionally obtain partial samples of anomalous sounds. While it might be better to use the collected anomalous sounds in training, the cross-entropy loss would not be the best way to detect both known and unknown anomalous sounds [39]. In addition, if we calculate the TPR in $J^{\mathrm{NP}}$ and/or $J^{\mathrm{AUC}}$ using only a part of the anomalous sounds, this training does not guarantee the performance for unknown anomalous sounds. Thus, we should develop a supervised-ADS method that can also detect unknown anomalous sounds; a preliminary study on this has been published in [25].

2) Incorporating machine or context-specific knowledge: to simplify the experiments, we used the simple detection rule described in Sec. III-D. However, for the anomaly alert, it would be better to use machine/context-specific rules, such as modifying or smoothing the detection result from the raw anomaly score. Thus, it will be necessary to develop rules or a trainable post-processing block to modify the anomaly score.

References

[1] C. Clavel, T. Ehrette, and G. Richard, “Events Detection for an Audio-Based Surveillance System,” in Proc. of ICME, 2005.
[2] G. Valenzise, L. Gerosa, M. Tagliasacchi, F. Antonacci, and A. Sarti, “Scream and Gunshot Detection and Localization for Audio-Surveillance Systems,” in Proc. of AVSS, 2007.
[3] S. Ntalampiras, I. Potamitis, and N. Fakotakis, “Probabilistic Novelty Detection for Acoustic Surveillance Under Real-World Conditions,” IEEE Trans. on Multimedia, pp. 713–719, 2011.
[4] P. Foggia, N. Petkov, A. Saggese, N. Strisciuglio, and M. Vento, “Audio Surveillance of Roads: A System for Detecting Anomalous Sounds,” IEEE Trans. ITS, pp. 279–288, 2016.
[5] P. Coucke, B. De. Ketelaere, and J. De. Baerdemaeker, “Experimental analysis of the dynamic, mechanical behavior of a chicken egg,” Journal of Sound and Vibration, Vol. 266, pp. 711–721, 2003.
[6] Y. Chung, S. Oh, J. Lee, D. Park, H. H. Chang, and S. Kim, “Automatic Detection and Recognition of Pig Wasting Diseases Using Sound Data in Audio Surveillance Systems,” Sensors, pp. 12929–12942, 2013.
[7] A. Yamashita, T. Hara, and T. Kaneko, “Inspection of Visible and Invisible Features of Objects with Image and Sound Signal Processing,” in Proceedings of the 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS2006), pp. 3837–3842, 2006.
[8] Y. Koizumi, S. Saito, H. Uematsu, and N. Harada, “Optimizing Acoustic Feature Extractor for Anomalous Sound Detection Based on Neyman-Pearson Lemma,” in Proc. of EUSIPCO, 2017.
[9] A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, E. Vincent, B. Raj, and T. Virtanen, “DCASE 2017 challenge setup: tasks, datasets and baseline system,” in Proc. of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), pp. 85–92, 2017.
[10] H. Lim, J. Park, and Y. Han, “Rare Sound Event Detection Using 1D Convolutional Recurrent Neural Networks,” in Proc. of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), 2017.
[11] E. Cakir and T. Virtanen, “Convolutional Recurrent Neural Networks for Rare Sound Event Detection,” in Proc. of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), 2017.
[12] H. Eghbal-Zadeh, B. Lehner, M. Dorfer, and G. Widmer, “CP-JKU Submissions for DCASE-2016: a Hybrid Approach Using Binaural I-Vectors and Deep Convolutional Neural Networks,” in Proc. of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016), 2016.
[13] S. Mun, S. Park, D. K. Han, and H. Ko, “Generative Adversarial Network Based Acoustic Scene Training Set Augmentation and Selection Using SVM Hyperplane,” in Proc. of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), 2017.
[14] S. Adavanne, G. Parascandolo, P. Pertila, T. Heittola, and T. Virtanen, “Sound Event Detection in Multichannel Audio Using Spatial and Harmonic Features,” in Proc. of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016), 2016.
[15] S. Adavanne and T. Virtanen, “A Report on Sound Event Detection with Different Binaural Features,” in Proc. of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), 2017.
[16] T. Lidy and A. Schindler, “CQT-Based Convolutional Neural Networks for Audio Scene Classification and Domestic Audio Tagging,” in Proc. of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016), 2016.
[17] V. J. Hodge and J. Austin, “A Survey of Outlier Detection Methodologies,” Artificial Intelligence Review, pp. 85–126, 2004.
[18] A. Patcha and J. M. Park, “An overview of anomaly detection techniques: Existing solutions and latest technological trends,” Journal Computer Networks, pp. 3448–3470, 2007.
[19] V. Chandola, A. Banerjee, and V. Kumar, “Anomaly detection: A survey,” ACM Computing Surveys, 2009.
[20] E. Marchi, F. Vesperini, F. Eyben, S. Squartini, and B. Schuller, “A Novel Approach for Automatic Acoustic Novelty Detection using a Denoising Autoencoder with Bidirectional LSTM Neural Networks,” in Proc. of ICASSP, 2015.
[21] T. Tagawa, Y. Tadokoro, and T. Yairi, “Structured Denoising Autoencoder for Fault Detection and Analysis,” Proceedings of Machine Learning Research, pp. 96–111, 2015.
[22] E. Marchi, F. Vesperini, F. Weninger, F. Eyben, S. Squartini, and B. Schuller, “Non-linear prediction with LSTM recurrent neural networks for acoustic novelty detection,” in Proc. of IJCNN, 2015.
[23] Y. Kawaguchi and T. Endo, “How can we detect anomalies from subsampled audio signals?,” in Proc. of MLSP, 2017.
[24] J. An and S. Cho, “Variational Autoencoder based Anomaly Detection using Reconstruction Probability,” Technical Report, SNU Data Mining Center, pp. 1–18, 2015.
[25] Y. Kawachi, Y. Koizumi, and N. Harada, “Complementary Set Variational Autoencoder for Supervised Anomaly Detection,” in Proc. of ICASSP, 2018.
[26] I. J. Goodfellow, J. P. Abadie, M. Mirza, B. Xu, D. W. Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative Adversarial Networks,” in Proc. of NIPS, 2014.
[27] A. B. L. Larsen, S. K. Sonderby, H. Larochelle, and O. Winther, “Autoencoding beyond pixels using a learned similarity metric,” in Proc. of ICML, 2016.
[28] T. Schlegl, P. Seebock, S. M. Waldstein, U. S. Erfurth, and G. Langs, “Unsupervised Anomaly Detection with Generative Adversarial Networks to Guide Marker Discovery,” in Proc. of IPMI, 2017.
[29] J. Neyman and E. S. Pearson, “On the Problem of the Most Efficient Tests of Statistical Hypotheses,” Phil. Trans. of the Royal Society, 1933.
[30] G. Casella and R. L. Berger, “Statistical Inference, section 8.3.2 Most Powerful Test,” Duxbury Pr, pp. 387–393, 2001.
[31] A. P. Bradley, “The use of the area under the ROC curve in the evaluation of machine learning algorithms,” Pattern Recognition, vol. 30, no. 7, pp. 1145–1159, 1997.
[32] A. Herschtal and B. Raskutti, “Optimising Area Under the ROC Curve Using Gradient Descent,” in Proc. of ICML, 2004.
[33] A. Fujino and N. Ueda, “A Semi-supervised AUC Optimization Method with Generative Models,” in Proc. of ICDM, 2016.
[34] D. P. Kingma and J. L. Ba, “Adam: A Method for Stochastic Optimization,” in Proc. of ICLR, 2015.
[35] A. Krogh and J. A. Hertz, “A Simple Weight Decay Can Improve Generalization,” in Proc. of NIPS, 1992.
[36] http://www.cs.tut.fi/sgn/arg/dcase2016/
[37] http://www.cs.tut.fi/sgn/arg/dcase2016/download
[38] S. D. Walter, “The partial area under the summary ROC curve,” Statistics in Medicine, pp. 2025–2040, 2005.
[39] N. Gornitz, M. Kloft, K. Rieck, and U. Brefeld, “Toward Supervised Anomaly Detection,” Journal of Artificial Intelligence Research, pp. 235–262, 2013.

Appendix

A. List of Symbols

1. Functions
  $J$                      Objective function
  $A$                      Anomaly score
  $H$                      Binary decision
  $E$                      Encoder of autoencoder
  $D$                      Decoder of autoencoder
  $G$                      Generator
  $\mathcal{N}$            Gaussian distribution
  $\mathbb{E}[\cdot]_x$    Expectation with respect to $x$
  $\nabla_x(\cdot)$        Gradient with respect to $x$
  $\mathrm{tr}(\cdot)$     Trace of matrix
  $D(A \| B)$              Kullback-Leibler divergence between $A$ and $B$
  $\|\cdot\|_2$            $L_2$ norm
  $\lfloor\cdot\rfloor$    Flooring function

2. Parameters
  Parameters of normal model
  Parameters of encoder
  Parameters of decoder
  Parameters of generator
  Parameters of Gaussian mixture model

3. Variables
  $x$                      Input vector
  $y$                      State variable
  $z$                      Latent vector
  $\phi$                   Threshold for anomaly score
  $\rho$                   Desired false positive rate
  $\mu$                    Mean vector
  $\Sigma$                 Covariance matrix
  $w$                      Mixing weight of Gaussian mixture model
  $K$                      Number of Gaussian mixtures
  $T$                      Number of time frames of observation
  $N$                      Number of training samples
  $M$                      Minibatch size
  $Q$                      Dimension of input vector
  $R$                      Dimension of latent vector
  Step size for gradient method
  $C$                      Context window size
  $\ell$                   Temporary variable of anomaly score
  $V$                      Anomaly decision score for one audio clip

4. Notations
  $\tau$                   Time-frame index of observation
  $n$                      Index of training sample
  $k$                      Index of Gaussian distribution
  $(\cdot)^{\top}$         Transpose of matrix or vector
  $(\cdot)^{(u)}$          Variable of normal sound
  $(\cdot)^{(a)}$          Variable of anomalous sound
  $(\cdot)^{(v)}$          Variable of various sound

Yuma Koizumi (M '15) received the B.S. and M.S. degrees from Hosei University, Tokyo, in 2012 and 2014, and the Ph.D. degree from the University of Electro-Communications, Tokyo, in 2017. Since joining the Nippon Telegraph and Telephone Corporation (NTT) in 2014, he has been researching acoustic signal processing and machine learning, including basic research of sound source enhancement and unsupervised/supervised anomaly detection in sounds. He was awarded the FUNAI Best Paper Award and the IPSJ Yamashita SIG Research Award from the Information Processing Society of Japan (IPSJ) in 2013 and 2014, respectively, and the Awaya Prize from the Acoustical Society of Japan (ASJ) in 2017. He is a member of the ASJ and the Institute of Electronics, Information and Communication Engineers (IEICE).

Shoichiro Saito (SM '06-M '07) received the B.E. and M.E. degrees from the University of Tokyo in 2005 and 2007. Since joining NTT in 2007, he has been engaged in research and development of acoustic signal processing systems including acoustic echo cancellers, hands-free telecommunication, and anomaly detection in sound. He is currently a Senior Research Engineer of Audio, Speech, and Language Media Laboratory, NTT Media Intelligence Laboratories. He is a member of the IEICE and the ASJ.

Hisashi Uematsu received the B.E., M.E., and Ph.D. degrees in Information Science from Tohoku University, Miyagi, in 1991, 1993, and 1996. He joined NTT in 1996 and has been engaged in research on psycho-acoustics (human auditory mechanisms) and digital signal processing. He is currently a Senior Research Engineer of Cross-Modal Computing Project, NTT Media Intelligence Laboratories. He was awarded the Awaya Prize from the ASJ in 2001. He is a member of the ASJ.

Yuta Kawachi received the B.E. and M.E. degrees from Waseda University, Tokyo, in 2012 and 2014. Since joining NTT in 2014, he has been researching acoustic signal processing and machine learning. He is a member of the ASJ.

Noboru Harada (M '99-SM '18) received the B.S. and M.S. degrees from the Department of Computer Science and Systems Engineering of Kyushu Institute of Technology in 1995 and 1997, respectively. He received the Ph.D. degree from the Graduate School of Systems and Information Engineering, University of Tsukuba in 2017. Since joining NTT in 1997, he has been researching speech and audio signal processing such as high efficiency coding and lossless compression. His current research interests include acoustic signal processing and machine learning for acoustic event detection, including anomaly detection in sound. He received the Technical Development Award from the ASJ in 2016, Industrial Standardization Encouragement Awards from the Ministry of Economy, Trade and Industry (METI) of Japan in 2011, and the Telecom System Technology Paper Encouragement Award from the Telecommunications Advancement Foundation (TAF) of Japan in 2007. He is a member of the ASJ, the IEICE, and the IPSJ.

Electrical Engineering and Systems Science, arXiv (Cornell University)

Unsupervised Detection of Anomalous Sound based on Deep Learning and the Neyman-Pearson Lemma


References (40)

ISSN
2329-9290
eISSN
ARCH-3348
DOI
10.1109/TASLP.2018.2877258

Abstract

Unsupervised Detection of Anomalous Sound based on Deep Learning and the Neyman-Pearson Lemma 1 1 1 Yuma Koizumi Member, IEEE, Shoichiro Saito Member, IEEE, Hisashi Uematsu Non-Member, 1 1 Yuta Kawachi Non-Member, and Noboru Harada Senior Member, IEEE Abstract—This paper proposes a novel optimization princi- (SED) [9]–[11]. Since the anomalies are defined, we can ple and its implementation for unsupervised anomaly detec- collect a dataset of the target anomalous sounds even though tion in sound (ADS) using an autoencoder (AE). The goal the anomalies are rarer than normal sounds. Thus, the ADS of unsupervised-ADS is to detect unknown anomalous sound system can be trained using a supervised method that is used without training data of anomalous sound. Use of an AE as in various SED tasks of the “Detection and Classification of a normal model is a state-of-the-art technique for unsupervised- ADS. To decrease the false positive rate (FPR), the AE is trained Acoustic Scenes and Events challenge” (DCASE) such as au- to minimize the reconstruction error of normal sounds and the dio scene classification [12], [13], sound event detection [14], anomaly score is calculated as the reconstruction error of the [15], and audio tagging [16]. On the other hand, unsupervised- observed sound. Unfortunately, since this training procedure does ADS [17]–[19] is the task of detecting “unknown” anomalous not take into account the anomaly score for anomalous sounds, sounds that have not been observed. In the case of real- the true positive rate (TPR) does not necessarily increase. In this study, we define an objective function based on the Neyman- world factories, from the view of the development cost, it Pearson lemma by considering ADS as a statistical hypothesis is impracticable to deliberately be damaged the expensive test. The proposed objective function trains the AE to maximize target machine. In addition, actual anomalous sounds occur the TPR under an arbitrary low FPR condition. 
To calculate rarely and have high variability. Therefore, it is impossible the TPR in the objective function, we consider that the set of to collect an exhaustive set of anomalous sounds and need anomalous sounds is the complementary set of normal sounds and simulate anomalous sounds by using a rejection sampling to detect anomalous sounds for which training data does not algorithm. Through experiments using synthetic data, we found exist. From this reason, the task is often tackled as the one- that the proposed method improved the performance measures class unsupervised classification problem [17]–[19]. This point of ADS under low FPR conditions. In addition, we confirmed is one of the major di erences in premise between the DCASE that the proposed method could detect anomalous sounds in real tasks and ADS for industrial equipment. Thus, in this study, environments. we aim to detect unknown anomalous sounds based on an Index Terms—Anomaly detection in sound, Neyman-Pearson unsupervised approach. lemma, deep learning, and autoencoder. In unsupervised anomaly detection, “anomaly” is defined as the patterns in data that do not conform to expected “normal” I. Introduction behavior [19]. Namely, the universal set consists of only the NOMALY detection in sound (ADS) has received much normal and the anomaly, and the anomaly is the complement attention. Since anomalous sounds might indicate symp- to the normal set. More intuitively, the universal set is various toms of mistakes or malicious activities, their prompt detection machine sounds including many types of machines, the normal can possibly prevent such problems. In particular, ADS has set is one specific type of various machine sound, and the been used for various purposes including audio surveillance anomaly set is all other types of machine sounds. Therefore, a [1]–[4], animal husbandry [5], [6], product inspection, and typical way of unsupervised-ADS is the use of the outlier- predictive maintenance [7], [8]. 
For the last application, since detection technique. Here, the deviation between a normal anomalous sounds might indicate a fault in a piece of machin- model and an observed sound is calculated; the deviation is ery, prompt detection of anomalies would decrease the number often called the “anomaly score”. The normal model indicates of defective product and/or prevent propagation of damage. In the notion of normal behavior which is trained from training this study, we investigated ADS for industrial equipment by data of normal sounds. The observed sound is identified as focusing on machine-operating sounds. an anomalous one when the anomaly score is higher than a ADS tasks can be broadly divided into supervised-ADS and pre-defined threshold value. Namely, the anomalous sounds unsupervised-ADS. The di erence between the two categories are defined as the sounds that do not exist in training data of is in the definition of anomalies. Supervised-ADS is the task normal sounds. of detecting “defined” anomalous sounds such as gunshots or To train the normal model, it is necessary to define the opti- screams [2], and it is a kind of rare sound event detection mality of the anomaly score. One of the popular performance measurements of ADS is to measure both the true positive rate All authors are with the NTT Media Intelligence Laboratories, NTT Corporation, Tokyo, Japan (e-mail: koizumi.yuma@ieee.org, fsaito.shoichiro, (TPR) and false positive rate (FPR). The TPR is the proportion uematsu.hisashi, kawachi.yuta, noboru.haradag@lab.ntt.co.jp). A preliminary of anomalies that are correctly identified, and the FPR is the version of this work is published in [8]. proportion of normal sounds that are incorrectly identified as Copyright (c) 2018 IEEE. This article is the “accepted” version. Digital Object Identifier: 10.1109/TASLP.2018.2877258 anomalies. To improve the performance of ADS, we need arXiv:1810.09133v1 [stat.ML] 22 Oct 2018 2 normal data simultaneously. 
However, since the generator is trained to make normal data, if it perfectly generates normal sounds, the anomaly score of normal sounds and the FPR will increase. Therefore, it is necessary to build an algorithm to simulate “non-normal” sounds.

In this study, we propose a novel optimization principle and its implementation for ADS using an AE. By considering an outlier-detection-based ADS as a statistical hypothesis test, we define optimality as an objective function based on the Neyman-Pearson lemma [29]. The objective function works to increase TPR under an arbitrary low FPR condition. A problem in calculating TPR is the simulation of anomalous sound data. Here, we explicitly define the set of anomalous sounds to be the complement to the set of normal sounds and simulate anomalous sounds by using a rejection sampling algorithm.

A preliminary version of this work is presented in [8]. The previous study utilized a DNN as a feature extractor, and the anomaly score was calculated using the negative-log-likelihood of a GMM trained from normal data. Thus, although the DNN was trained to maximize the objective function based on the Neyman-Pearson lemma, the normal model was not guaranteed to increase TPR and decrease FPR. In this study,

Fig. 1. Trade-off relationship between anomaly score, true positive rate (TPR) and false positive rate (FPR). The top figure shows PDFs of anomaly scores for normal sounds (blue line) and anomalous sounds (red dashed line), plotted against the anomaly score [nat]. The bottom figures show the FPR and TPR with respect to the detection threshold [nat]. When these PDFs overlap, a small threshold leads to a large TPR and FPR, and a large threshold leads to a small TPR and FPR.

to increase TPR and decrease FPR simultaneously. However, these metrics are related to the threshold value and have a trade-off relationship, as shown in Fig. 1.
When the PDFs of the anomaly scores of normal and anomalous sounds overlap, false detections cannot be avoided regardless of the threshold. Thus, to increase TPR and decrease FPR simultaneously, we need to train the normal model to reduce the overlap area. More intuitively, it is essential to provide small anomaly scores for normal sounds and large anomaly scores for anomalous sounds. In addition, if an ADS system gives a false alert frequently, we cannot trust it, just as “the boy who cried wolf” cannot be trusted. Therefore, it is especially important to increase TPR under a low FPR condition in a practical situation.

The early studies used various statistical models to calculate the anomaly score, such as the Gaussian mixture model (GMM) [3], [8] and support vector machine (SVM) [4]. The recent literature calculates the anomaly score through the use of deep neural networks (DNN) such as the autoencoder (AE) [20]–[23] and variational AE (VAE) [24], [25]. In the case of the AE, one is trained to minimize the reconstruction error of the normal training data, and the anomaly score is calculated

end-to-end training is achieved by using an AE as the normal model, and both the feature extractor and the normal model are trained to increase TPR and decrease FPR.

The rest of this paper is organized as follows. Section II briefly introduces outlier-detection-based ADS and its implementation using an AE. Section III describes the proposed training method and the details of the implementation. After reporting the results of objective experiments using synthetic data and verification experiments in real environments in Section IV, we conclude this paper in Section V. The mathematical symbols are listed in Appendix A.

II. Conventional method

A. Identification of anomalous sound based on outlier detection

ADS is an identification problem of determining whether the sound emitted from a target is a normal sound or an anomalous one.
In this section, we briefly introduce the procedure of unsupervised-ADS.

First, an anomaly score $A(x_\tau; \Theta)$ is calculated using a normal model. Here, $x_\tau \in \mathbb{R}^Q$ is an input vector calculated from the observed sound indexed on $\tau \in \{1, 2, ..., T\}$ for time, and $\Theta$ is the set of parameters of the normal model. In many of the previous studies, $x_\tau$ was composed of hand-crafted acoustic features such as mel-frequency cepstrum coefficients (MFCCs) [1]–[3], and the normal model was often constructed with a PDF of normal sounds. Accordingly, the anomaly score can be calculated as

$$A(x_\tau; \Theta) = -\ln p(x_\tau \mid \Theta, y = 0), \qquad (1)$$

where $y$ denotes the state, $y = 0$ is normal, and $y \neq 0$ is not normal, i.e. anomalous. $p(x \mid \Theta, y = 0)$ is a normal model such

as the reconstruction error of the observed sound. Thus, the AE provides small anomaly scores for normal sounds. However, it gives no guarantee of increasing anomaly scores for anomalous sounds. Indeed, if the AE is generalized, the anomalous sounds will also be reconstructed and the anomaly score of anomalous sounds will be small. Therefore, to increase TPR and decrease FPR simultaneously, the objective function should be modified.

Another strategy for unsupervised-ADS is the use of a generative adversarial network (GAN) [26], [27]. GANs have been used to detect anomalies in medical images [28]. In this strategy, a generator simulates “fake” normal data, and a discriminator identifies whether the input data is real normal data or not. Therefore, the discriminator can be trained to increase TPR for fake normal data and decrease FPR for true

In ADS using an AE, the anomaly score is the reconstruction error of the observed sound, which is calculated as

$$A(x_\tau; \Theta) := \| x_\tau - D(E(x_\tau \mid \Theta_E) \mid \Theta_D) \|_2. \qquad (8)$$

To train the normal model to provide small anomaly scores

Fig. 2. Anomaly detection procedure using autoencoder.
The input vector is compressed and reconstructed by two networks E and D, respectively. Since E and D are trained to minimize the reconstruction error of normal sounds, the reconstruction error would be small if x is normal. Thus, the anomaly score is calculated as a reconstruction error, and when the error exceeds a pre-defined threshold $\phi$, the observation is identified as anomalous.

for normal sounds, the AE is trained to minimize the average reconstruction error of normal sound,

$$J^{\mathrm{AE}}(\Theta_E, \Theta_D) = \frac{1}{N^{(u)}} \sum_{n=1}^{N^{(u)}} A(x_n^{(u)}; \Theta), \qquad (9)$$

where $x_n^{(u)}$ is the n-th training data of normal sound and $N^{(u)}$ is the number of training samples of normal sound. This objective function works to decrease the anomaly score of normal sounds. However, there is no guarantee of increasing anomaly scores for anomalous sounds. Indeed, if the AE is generalized, the anomalous sounds will also be reconstructed and the anomaly score of anomalous sounds will also be small. Therefore, (9) does not ensure that false detections are reduced and the accuracy of ADS is improved; thus, it would be better to modify the objective function.

III. Proposed method

We will begin by defining an objective function that builds upon the Neyman-Pearson lemma in Sec. III-A. Then, we will describe the rejection sampling algorithm for simulating anomalous sound used for calculating TPR in Sec. III-B.

as a GMM [8]. $x_\tau$ is determined to be anomalous when the anomaly score exceeds a pre-defined threshold value $\phi$:

$$H(x_\tau; \Theta, \phi) = \begin{cases} 0 \ (\text{Normal}) & A(x_\tau; \Theta) \le \phi \\ 1 \ (\text{Anomaly}) & A(x_\tau; \Theta) > \phi \end{cases} \qquad (2)$$

One of the performance measures of ADS consists of the pair of TPR and FPR. The TPR and FPR can be calculated as expectations of $H(x; \Theta, \phi)$ with respect to anomalous and normal sounds, respectively:

$$\mathrm{TPR}(\Theta, \phi) = \mathbb{E}_{x \mid y \neq 0}\left[ H(x; \Theta, \phi) \right], \qquad (3)$$
$$\mathrm{FPR}(\Theta, \phi) = \mathbb{E}_{x \mid y = 0}\left[ H(x; \Theta, \phi) \right], \qquad (4)$$

where $\mathbb{E}_x[\cdot]$ denotes the expectation with respect to x. These metrics are related to $\phi$ and have a trade-off relationship as shown in Fig. 1.
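As a hedged illustration of the decision rule (2) and the rates (3) and (4), the expectations can be approximated by averaging the binary decision over finite sets of anomaly scores. The sketch below is not taken from the paper: the function names `decide` and `empirical_rate` and the toy score values are hypothetical, chosen only to show that raising the threshold lowers both rates.

```python
def decide(score, threshold):
    """Binary decision of (2): 1 (anomaly) if the score exceeds the threshold."""
    return 1 if score > threshold else 0


def empirical_rate(scores, threshold):
    """Average of the decision over a finite set of scores.

    Applied to anomalous scores this approximates the TPR of (3);
    applied to normal scores it approximates the FPR of (4).
    """
    return sum(decide(s, threshold) for s in scores) / len(scores)


# Hypothetical anomaly scores, for illustration only.
normal_scores = [0.5, 1.0, 1.5, 2.0, 2.5]
anomalous_scores = [1.8, 2.2, 3.0, 3.5, 4.0]

for threshold in (1.0, 2.0, 3.0):
    tpr = empirical_rate(anomalous_scores, threshold)
    fpr = empirical_rate(normal_scores, threshold)
    print(f"threshold={threshold:.1f}  TPR={tpr:.1f}  FPR={fpr:.1f}")
```

Because the two toy score sets overlap, no threshold in this example reaches a TPR of 1 with an FPR of 0, which is the trade-off the text describes.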
The top figure shows the PDFs of anomaly scores for normal sounds $p(A(x; \Theta) \mid y = 0)$ and anomalous sounds $p(A(x; \Theta) \mid y \neq 0)$. The bottom figures show the FPR and TPR with respect to $\phi$. When these PDFs overlap, false detections, i.e. false-positive and/or false-negative, cannot be avoided regardless of $\phi$. In addition, the false detections increase as the overlap area gets wider. Therefore, to increase TPR and decrease FPR simultaneously, it is necessary to train $\Theta$ so that the anomaly score is small for normal sounds and large for anomalous sounds. More precisely, we need to train $\Theta$ to reduce the overlap area.

B. Unsupervised-ADS using an autoencoder

Recently, deep learning has been used to construct a normal model. Several studies on deep-learning-based unsupervised-ADS have used an autoencoder (AE) [20]–[23]. This section briefly describes unsupervised-ADS using an AE (see Fig. 2).

The goal of using an AE is to learn an efficient representation of the input vector by using two neural networks E and D, which are called the encoder and decoder, respectively.

After that, the overall training and detection procedure of the proposed method will be summarized in Sec. III-C and Sec. III-D. As a modified implementation of the proposed method, we extend the proposed method to maximization of the area under the receiver operating characteristic curve (AUC) in Sec. III-E.

A. Objective function for anomaly detection based on the Neyman-Pearson lemma

From (1) and (2), an anomalous sound satisfies the following inequality:

$$p(x \mid \Theta, y = 0) < \exp(-\phi). \qquad (10)$$

Since $\phi$ is assumed to be sufficiently large to avoid false positives, an anomalous sound can be defined as “a sound which cannot be regarded as a sample of the normal model.” Thus, we can regard outlier-detection-based ADS as a statistical hypothesis test. In other words, the observed sound is identified as anomalous when the following null hypothesis is rejected.
Null hypothesis: x is a sample of the normal model $p(x \mid \Theta, y = 0)$.

The Neyman-Pearson lemma [29] states the condition for $A(x; \Theta)$ that achieves the most powerful test between two simple hypotheses. According to it, the most powerful test has the greatest detection power among all possible tests of a given FPR [30]. More simply, the most powerful test maximizes the TPR under the constraint that the FPR equals $\rho$, i.e.,

maximize $\mathrm{TPR}(\Theta, \phi)$, subject to $\mathrm{FPR}(\Theta, \phi) = \rho$.

First, the input vector x is converted into a latent vector $z \in \mathbb{R}^R$ by E. Then, an input vector is reconstructed from z by D. These processes are expressed as

$$z = E(x \mid \Theta_E), \qquad (5)$$
$$\hat{x} = D(z \mid \Theta_D). \qquad (6)$$

The parameters of both neural networks $\Theta = \{\Theta_E, \Theta_D\}$ are trained to minimize the reconstruction error:

$$J^{\mathrm{AE}}(\Theta_E, \Theta_D) = \mathbb{E}_x\left[ \| x - D(E(x \mid \Theta_E) \mid \Theta_D) \|^2 \right]. \qquad (7)$$

Algorithm 1 Simulation algorithm of anomalous sound in latent vector space.
1: Input: Generator G, GMM $p(z \mid \Theta_{\mathrm{GMM}}, y = 0)$ and $\phi_z$
2: $\ell \leftarrow -\infty$
3: while $\ell \le \phi_z$ do
4:   Draw $\tilde{z}$ from $\mathcal{N}(z \mid 0_R, I_R)$
5:   Evaluate $\ell \leftarrow -\ln p(\tilde{z} \mid \Theta_{\mathrm{GMM}}, y = 0)$
6: end while
7: $z^{(a)} \leftarrow \tilde{z}$
8: Generate anomalous sound by $x^{(a)} = G(z^{(a)} \mid \Theta_G)$
9: Output: $x^{(a)}$

consider the set of normal sounds to be a subset of various machine sounds, and the set of anomalous sounds to be its complement. Then, we use rejection sampling to simulate

Fig. 3. Concept of PDFs of normal, various, and anomalous sounds using two neural networks (a non-linear mapping between the input vector space and the latent vector space). The PDF of normal sounds (i.e. meshed area) is a subset of the PDF of various sounds (i.e. gray area), and the PDF of anomalous sounds is expressed as the complement of the PDF of normal sounds (i.e. inside the gray area and outside the meshed area). x is mapped to z by E, and z is reconstructed to $\tilde{x}$ by G. Here, E and G are trained to satisfy $p(z) = \mathcal{N}(z \mid 0_R, I_R)$ and $x = \tilde{x}$, respectively.
The PDF of the latent vector of normal sounds is modeled using a GMM $p(z \mid \Theta_{\mathrm{GMM}}, y = 0)$ given by (13).

anomalous sounds; namely, a sound is sampled from various machine-sound PDFs, and it is accepted as an anomalous sound when its anomaly score is high. However, since the PDF of various machine sounds in the input vector domain $p(x)$ may have a complex form, the PDF cannot be written in an analytical form and the sampling algorithm would become complex. Inspired by the strategy of VAE, we can avoid this problem by training E so that the PDF of various latent vectors $p(z)$ is mapped to a PDF whose samples can be generated by a pseudorandom number generator from a uniform distribution and its variable conversion. Then, the latent vectors of anomalous sounds $z^{(a)}$ are sampled using the rejection sampling algorithm, and the input vectors of anomalous sounds $x^{(a)}$ are reconstructed using a third neural network G,

$$x^{(a)} = G(z^{(a)} \mid \Theta_G), \qquad (12)$$

where $\Theta_G$ is the parameter of G. Hereafter, we call G the generator. Although there is no constraint on the architecture of G, we will use the same architecture for D and G. In

Since the FPR can be controlled by manipulating $\phi$, we define $\phi_\rho$ as satisfying $\mathrm{FPR}(\Theta, \phi_\rho) = \rho$. Accordingly, the objective function to obtain the most powerful test function can be defined as the one that maximizes $\mathrm{TPR}(\Theta, \phi_\rho)$ with respect to $\Theta$. However, since the FPR is also a function of $\Theta$, it may become large when focusing only on TPR. To maximize the TPR and minimize the FPR simultaneously, we train $\Theta$ to maximize the following objective function,

$$J^{\mathrm{NP}}(\Theta) = \mathrm{TPR}(\Theta, \phi_\rho) - \mathrm{FPR}(\Theta, \phi_\rho), \qquad (11)$$

where the superscript “NP” is an abbreviation of “Neyman-Pearson”. Since the proposed objective function directly increases TPR and decreases FPR, $\Theta$ can be trained to provide a small anomaly score for normal sounds and a large anomaly score for anomalous sounds.

There are two problems when it comes to training $\Theta$ to maximize (11). The first problem is the calculation of TPR.
The TPR and FPR are the expectations of $H(x; \Theta, \phi)$, and in most practical cases, the expectation is approximated as an average over the training data. Thus, to calculate TPR and FPR, we need to collect enough normal and anomalous sound data for the average to be an accurate approximation of the expectation. However, since anomalous sounds occur rarely and have high variability, this condition is difficult to satisfy. In Section III-B, to calculate TPR, we consider “anomaly” to mean “not normal” and simulate anomalous sounds by using a sampling algorithm. The second problem is the determination of the threshold $\phi_\rho$. In a parametric hypothesis test such as a t-test, the threshold at which FPR equals $\rho$ can be analytically calculated. However, a DNN is a non-parametric statistical model; thus, the threshold $\phi_\rho$ cannot be analytically calculated. In Section III-C, we numerically calculate $\phi_\rho$ as the $\lfloor \rho M \rfloor$-th value of the sorted anomaly scores of M normal sounds, where $\lfloor \cdot \rfloor$ is the flooring function.

B. Anomalous sound simulation using an autoencoder

addition, to simply generate and reject a candidate latent vector, we use two constraints to train $\Theta_E$ and $\Theta_G$, and model the PDF of normal latent vectors using the GMM as

$$p(z \mid \Theta_{\mathrm{GMM}}, y = 0) = \sum_{k=1}^{K} w_k \, \mathcal{N}(z \mid \mu_k, \Sigma_k), \qquad (13)$$

where $\Theta_{\mathrm{GMM}} = \{w_k, \mu_k, \Sigma_k \mid k = 1, ..., K\}$, K is the number of mixtures, and $w_k$, $\mu_k$, and $\Sigma_k$ are respectively the weight, mean vector, and covariance matrix of the k-th Gaussian. The concepts of these PDFs are shown in Fig. 3, and the procedure of anomalous sound simulation is summarized in Algorithm 1 and Fig. 4.

First, we describe the two constraints for training $\Theta_E$ and $\Theta_G$. For algorithmic efficiency, samples of $p(z)$ should be generated with a low computational cost. As an implementation of $p(z)$, we use the normalized Gaussian distribution, because its samples can be generated by a pseudorandom number generator such as the Mersenne-Twister.
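The latent-space rejection sampling of Algorithm 1 can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes a diagonal-covariance GMM, it assumes a candidate is accepted when its negative log-likelihood under the GMM exceeds the latent threshold, and the names `gmm_neg_log_likelihood`, `sample_anomalous_latent`, and `max_tries` are hypothetical. The final generator step $x^{(a)} = G(z^{(a)} \mid \Theta_G)$ is omitted because it needs a trained network. Python's `random` module uses the Mersenne Twister as its core generator, which matches the choice mentioned in the text.

```python
import math
import random


def gmm_neg_log_likelihood(z, weights, means, variances):
    """Negative log-likelihood of z under a diagonal-covariance GMM."""
    log_terms = []
    for w, mu, var in zip(weights, means, variances):
        log_norm = -0.5 * sum(math.log(2.0 * math.pi * v) for v in var)
        quad = -0.5 * sum((zi - mi) ** 2 / v for zi, mi, v in zip(z, mu, var))
        log_terms.append(math.log(w) + log_norm + quad)
    top = max(log_terms)  # log-sum-exp for numerical stability
    return -(top + math.log(sum(math.exp(t - top) for t in log_terms)))


def sample_anomalous_latent(weights, means, variances, threshold, rng,
                            max_tries=100000):
    """Draw z from N(0, I) until it looks unlikely under the normal GMM.

    max_tries is an illustrative safeguard that is not part of Algorithm 1.
    """
    dim = len(means[0])
    for _ in range(max_tries):
        candidate = [rng.gauss(0.0, 1.0) for _ in range(dim)]
        score = gmm_neg_log_likelihood(candidate, weights, means, variances)
        if score > threshold:  # reject candidates that resemble normal data
            return candidate
    raise RuntimeError("no candidate accepted; the threshold may be too high")


# Hypothetical normal model: one tight Gaussian around the origin in 2-D.
toy_weights = [1.0]
toy_means = [[0.0, 0.0]]
toy_variances = [[0.01, 0.01]]
z_anomalous = sample_anomalous_latent(
    toy_weights, toy_means, toy_variances, threshold=0.0, rng=random.Random(0)
)
```

With this toy model, candidates close to the origin are rejected because they resemble normal latent vectors, and the accepted sample lies outside that tight cluster.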
In accordance with (10), anomalous sounds emitted from the target machine are different from normal ones. Thus, we

Fig. 4. Procedure of anomalous sound simulation using autoencoder.

parameters of the Gaussian distribution of the minibatch are calculated as

$$\mu_V = \frac{1}{M} \sum_{n=1}^{M} z_n^{(v)}, \qquad (15)$$
$$\Sigma_V = \frac{1}{M} \sum_{n=1}^{M} \left( z_n^{(v)} - \mu_V \right) \left( z_n^{(v)} - \mu_V \right)^{\top}. \qquad (16)$$

Finally, to minimize the KLD and the reconstruction error of various sounds, the objective function is calculated as

$$J^{\mathrm{KR}}(\Theta) = J^{\mathrm{KL}}(\Theta_E) + \frac{1}{M} \sum_{n=1}^{M} \left\| x_n^{(v)} - G\left( E\left( x_n^{(v)} \mid \Theta_E \right) \mid \Theta_G \right) \right\|^2, \qquad (17)$$

where the superscript “KR” is an abbreviation of “KLD and reconstruction”, and $\Theta_E$ and $\Theta_G$ are updated by gradient

Thus, for training $\Theta_E$ and $\Theta_G$, we use the first constraint so that z of the various machine sounds follows a normalized Gaussian distribution. To satisfy the first constraint, we train $\Theta_E$ to minimize the following Kullback-Leibler divergence (KLD):

$$J^{\mathrm{KL}}(\Theta_E) = D_{\mathrm{KL}}\left( \mathcal{N}(z \mid 0_R, I_R) \,\|\, \mathcal{N}(z \mid \mu_V, \Sigma_V) \right) = \frac{1}{2} \left[ \ln |\Sigma_V| + \mathrm{tr}\{ \Sigma_V^{-1} \} + \mu_V^{\top} \Sigma_V^{-1} \mu_V - R \right], \qquad (14)$$

where the superscript “KL” is an abbreviation of “Kullback-Leibler”, $\mathrm{tr}\{\cdot\}$ denotes the trace of a matrix, $\top$ denotes transposition, $0_R$ and $I_R$ are respectively the zero vector and unit matrix with dimension R, and $\mu_V$ and $\Sigma_V$ are respectively the mean vector and covariance matrix calculated from z of the various machine sounds. To generate anomalous sounds from (12), G needs to reconstruct various machine sounds, as $x^{(v)} = G(E(x^{(v)} \mid \Theta_E) \mid \Theta_G)$. Thus, as a second constraint, we train $\Theta_E$ and $\Theta_G$ to minimize the reconstruction error (7) calculated on the various machine sounds.

Next, we describe the GMM that models the PDF of the normal latent vectors. To reject a candidate $\tilde{z}$ which seems to be z of a normal sound, we need to calculate the probability that the candidate is a normal one. To calculate the probability, we need to model $p(z \mid y = 0)$.
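As a check on the closed form of the KLD in (14), it can be obtained from the standard expression for the divergence between two R-dimensional Gaussians. The derivation below is a sketch written for this purpose; the generic symbols $\mu_0, \Sigma_0, \mu_1, \Sigma_1$ are introduced here for illustration and do not appear in the original text.

```latex
\begin{align}
D_{\mathrm{KL}}\bigl(\mathcal{N}(\mu_0,\Sigma_0)\,\|\,\mathcal{N}(\mu_1,\Sigma_1)\bigr)
  &= \frac{1}{2}\Bigl[\ln\frac{|\Sigma_1|}{|\Sigma_0|}
     + \mathrm{tr}\{\Sigma_1^{-1}\Sigma_0\}
     + (\mu_1-\mu_0)^{\top}\Sigma_1^{-1}(\mu_1-\mu_0) - R\Bigr] \\
\intertext{Substituting $\mu_0 = 0_R$, $\Sigma_0 = I_R$ (so $|\Sigma_0| = 1$), $\mu_1 = \mu_V$, and $\Sigma_1 = \Sigma_V$ gives}
D_{\mathrm{KL}}\bigl(\mathcal{N}(0_R, I_R)\,\|\,\mathcal{N}(\mu_V,\Sigma_V)\bigr)
  &= \frac{1}{2}\Bigl[\ln|\Sigma_V|
     + \mathrm{tr}\{\Sigma_V^{-1}\}
     + \mu_V^{\top}\Sigma_V^{-1}\mu_V - R\Bigr].
\end{align}
```

The divergence is zero exactly when $\mu_V = 0_R$ and $\Sigma_V = I_R$, which is the first constraint: the latent vectors of the various machine sounds should follow the normalized Gaussian distribution.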
Since there is no constraint on the form of $p(z \mid y = 0)$ in the training procedure of $\Theta_E$, $p(z \mid y = 0)$ might have a complex form. For simplicity, we use a GMM expressed as (13).

C. Detailed description of training procedure

Here, we describe the details of the training procedure shown in Fig. 5. The training procedure consists of three steps. Hereafter, we call the proposed method using this training procedure NP-PROP. The algorithm inputs are training data constructed from normal sounds and various machine sounds, and the outputs are $\Theta_E$ and $\Theta_D$. Moreover, $x_n^{(v)}$ and $x_n^{(u)}$ respectively denote the n-th training samples of minibatches of various and normal machine sounds, and M is the number of samples included in a minibatch.

First, $\Theta_E$ and $\Theta_G$ are trained to simulate anomalous sounds. A minibatch of various machine sounds is randomly selected from the training dataset of various machine sounds. Next, its latent vectors are calculated as $z_n^{(v)} \leftarrow E(x_n^{(v)} \mid \Theta_E)$. Then, the

descent to minimize $J^{\mathrm{KR}}(\Theta)$:

$$\Theta_E \leftarrow \Theta_E - \lambda \nabla_{\Theta_E} J^{\mathrm{KR}}(\Theta), \qquad (18)$$
$$\Theta_G \leftarrow \Theta_G - \lambda \nabla_{\Theta_G} J^{\mathrm{KR}}(\Theta), \qquad (19)$$

where $\lambda$ is the step size.

Second, $\Theta_E$ and $\Theta_D$ are trained to maximize the objective function. A minibatch of normal sounds $x^{(u)}$ is randomly selected from the training dataset of normal sounds, and a minibatch of anomalous sounds $x^{(a)}$ is simulated using Algorithm 1. Here, since a DNN is not a parametric PDF, the threshold $\phi_\rho$ that satisfies $\mathrm{FPR}(\Theta, \phi_\rho) = \rho$ cannot be analytically calculated. Thus, in this study, we approximately calculate $\phi_\rho$ by sorting the anomaly scores of normal sounds in the minibatch $x^{(u)}$. First, $A(x^{(u)}; \Theta)$ and $-\ln p(z^{(u)} \mid \Theta_{\mathrm{GMM}}, y = 0)$ are calculated, and $\phi_\rho$ and $\phi_z$ are set as the $\lfloor \rho M \rfloor$-th value of the sorted $A(x^{(u)}; \Theta)$ and $-\ln p(z^{(u)} \mid \Theta_{\mathrm{GMM}}, y = 0)$ in descending order, respectively. Then, the TPR and FPR are approximately
Fig. 5. Training procedure of the proposed method.

data classification and/or anomaly detection. The AUC is calculated as

$$\mathrm{AUC}(\Theta) = \mathbb{E}_{x \mid y = 0}\Bigl[ \underbrace{\mathbb{E}_{x' \mid y \neq 0}\left[ H(x'; \Theta, A(x; \Theta)) \right]}_{\mathrm{TPR}(\Theta, A(x; \Theta))} \Bigr] \qquad (24)$$
$$\approx \frac{1}{M} \sum_{n=1}^{M} \mathrm{TPR}\left( \Theta, A\left( x_n^{(u)}; \Theta \right) \right). \qquad (25)$$

As we can see in (25), anomalous sound data are needed to calculate the AUC. Although the AUC has been used as an objective function in imbalanced data classification [31]–[33], it has not been applied to unsupervised-ADS so far. Fortunately, since the proposed rejection sampling can simulate anomalous sound data, AUC maximization can be used as an objective function of ADS. Instead of $J^{\mathrm{NP}}(\Theta)$, the following objective function can be used in the training procedure:

$$J^{\mathrm{AUC}}(\Theta) = \frac{1}{M} \sum_{n=1}^{M} \left[ \mathrm{TPR}\left( \Theta, A\left( x_n^{(u)}; \Theta \right) \right) - \mathrm{FPR}\left( \Theta, A\left( x_n^{(u)}; \Theta \right) \right) \right]. \qquad (26)$$

Hereafter, we call the proposed method using $J^{\mathrm{AUC}}(\Theta)$ instead of $J^{\mathrm{NP}}(\Theta)$ AUC-PROP.

IV. Experiments

We conducted experiments to evaluate the performance

evaluated as

$$\mathrm{TPR}(\Theta, \phi_\rho) \approx \frac{1}{M} \sum_{n=1}^{M} \mathrm{sigmoid}\left( A\left( x_n^{(a)}; \Theta \right) - \phi_\rho \right), \qquad (20)$$
$$\mathrm{FPR}(\Theta, \phi_\rho) \approx \frac{1}{M} \sum_{n=1}^{M} \mathrm{sigmoid}\left( A\left( x_n^{(u)}; \Theta \right) - \phi_\rho \right), \qquad (21)$$

where the binary decision function H is approximated by a sigmoid function, allowing the gradient to be analytically calculated. Finally, $\Theta_E$ and $\Theta_D$ are updated to increase $J^{\mathrm{NP}}(\Theta)$ by gradient ascent:

$$\Theta_E \leftarrow \Theta_E + \lambda \nabla_{\Theta_E} J^{\mathrm{NP}}(\Theta), \qquad (22)$$
$$\Theta_D \leftarrow \Theta_D + \lambda \nabla_{\Theta_D} J^{\mathrm{NP}}(\Theta). \qquad (23)$$

Third, to update the PDF of the latent vectors of normal sounds $p(z \mid \Theta_{\mathrm{GMM}}, y = 0)$, when (18)–(23) have been repeated a certain number of times, $\Theta_{\mathrm{GMM}}$ is updated using the expectation-maximization (EM) algorithm for GMM using all training data of normal sounds. The above algorithm is run for a pre-defined number of epochs.

D. Detailed description of detection procedure

After training $\Theta_E$ and $\Theta_D$, we can identify whether the observed sound is a normal one or not.
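The threshold estimate and the sigmoid-relaxed rates of (20) and (21) can be sketched on plain lists of anomaly scores. This is an illustrative sketch, not the authors' training code: the names `threshold_at_fpr` and `np_objective` are hypothetical, the rank is clamped to at least 1 as an added assumption for small minibatches, and a real implementation would compute the same quantities inside an automatic-differentiation framework so that the gradients of (22) and (23) are available.

```python
import math


def sigmoid(x):
    """Logistic function used as a smooth stand-in for the binary decision H."""
    return 1.0 / (1.0 + math.exp(-x))


def threshold_at_fpr(normal_scores, rho):
    """Return the floor(rho * M)-th largest anomaly score of the normal minibatch."""
    ranked = sorted(normal_scores, reverse=True)
    # Assumption: clamp the rank to at least 1 so that a small rho * M
    # still selects the largest score instead of an invalid index.
    rank = max(math.floor(rho * len(ranked)), 1)
    return ranked[rank - 1]


def np_objective(normal_scores, anomalous_scores, rho):
    """Soft TPR minus soft FPR at the threshold estimated from normal scores."""
    phi = threshold_at_fpr(normal_scores, rho)
    soft_tpr = sum(sigmoid(a - phi) for a in anomalous_scores) / len(anomalous_scores)
    soft_fpr = sum(sigmoid(u - phi) for u in normal_scores) / len(normal_scores)
    return soft_tpr - soft_fpr


# Hypothetical minibatch scores, for illustration only.
toy_normal = [float(v) for v in range(1, 11)]
separated = np_objective(toy_normal, [20.0, 21.0, 22.0], rho=0.2)
overlapping = np_objective(toy_normal, [1.0, 2.0, 3.0], rho=0.2)
```

In this toy setting the objective is larger when the simulated anomalous scores sit well above the normal ones, which is the behaviour the gradient ascent step is meant to encourage.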
First, the input vector $x_\tau$, $\tau \in \{1, ..., T\}$, is calculated from the observed sound. Then, the anomaly score is calculated as (8). Finally, a decision score, $V = \frac{1}{T} \sum_{\tau=1}^{T} H(x_\tau; \Theta, \phi_\rho)$, is calculated, and when V exceeds a pre-defined value $\phi_V$, the observed sound is determined to be anomalous. In this study, we used $\phi_V = 0$, meaning that, if the anomaly score exceeds the threshold even for one frame, the observed sound is determined to be anomalous.

E. Modified implementation as an AUC maximization

The receiver operating characteristic (ROC) curve and the AUC are widely used performance measures for imbalanced

of the proposed method. First, we conducted an objective experiment using synthetic anomalous sounds (Sec. IV-B). To generate a large enough anomalous dataset for the ADS accuracy evaluation, we used collision and sustained sounds from datasets for detection and classification of acoustic scenes and events 2016 (DCASE-2016 [36]). To show the effectiveness of the method in real environments, we conducted verification experiments in three real environments (Sec. IV-C).

A. Experimental conditions

1) Compared methods: The proposed methods described in Sec. III-C (NP-PROP) and Sec. III-E (AUC-PROP) were compared with three state-of-the-art ADS methods:

TABLE I
Experimental conditions

Parameters for signal processing
  Sampling rate                                  16.0 kHz
  FFT length                                     512 pts
  FFT shift length                               256 pts
  Number of mel-filterbanks                      40
Other parameters
  Context window size C                          5
  Dimension of input vector Q for FNN            440
  Dimension of input vector Q for 1D-CRNN        40
  Dimension of acoustic feature vector R         40
  GMM update per gradient method                 30
  Number of mixtures K                           16
  Minibatch size M                               512
  FPR parameter $\rho$                           0.2
  Step size $\lambda$                            10
  L2 normalization parameter                     10

Fig. 6. Network architectures of encoder, decoder and generator used for NP-PROP and AUC-PROP. The encoder and decoder of AE have the same architecture. In VAE, VAEGAN and CONV-PROP, the encoder has two output layers for the mean and variance vector. In VAEGAN, the architecture of the
In VAEGAN, the architecture of the Q = 40(2C +1) = 440. The second architecture,“1D-CRNN”, discriminator is the same as that of the encoder, but the output dimension of the fully connected layer is 1. consisted in a one-dimensional convolution neural network (1D-CNN) layer and a long short-term memory (LSTM) layer; it worked well in supervised anomaly detection (race SED) in AE [20]: ADS using the autoencoder described in Sec DCASE 2017 [10]. In order to detect anomalous sounds in real II-B. The encoder and decoder were trained to minimize time, we changed the backward LSTM to a forward one. In (9). addition, to avoid overfitting, we used only one forward LSTM VAE [24]: E and D were implemented using VAE. The layer instead of two backward LSTM layers. The input vector encoder estimated the mean and variance parameters of x was a 40-dimensional log mel-band energy: the Gaussian distribution in the latent space. Then, the x := ln (Mel [Abs [X ]]) : latent vectors were sampled from the Gaussian distribu- tion whose parameters were estimated by the encoder. The dimension of x was Q = 40. For each architecture, the Then, the decoder reconstructed the input vector from the dimension of the latent vector z was R = 40. All input vectors sampled latent vectors. Finally, the reconstruction error were mean-and-variance normalized using the training data was calculated and used as the anomaly score. statistics. VAEGAN [27]: To investigate the e ectiveness of the As an implementation for the gradient method, the Adam anomalous sound simulation, VAEGAN [27] was used method [34] was used instead of the gradient descent/ascent to simulate fake normal data. The generators (i.e. VAE) shown in (18)–(23). To avoid overfitting, L normalization [35] were used to simulate fake normal sounds. The output of with a regularization penalty of 10 was used. The minibatch the discriminator without the sigmoid activation was used size for all methods was M = 512. 
All models were trained for as the anomaly score. 500 epochs. In all methods, the average value of the loss was We also used our previous work [8] (CONV-PROP) for com- calculated on the training set at every epoch, and when the parison. This method uses a VAE to extract latent vectors as loss did not decrease for five consecutive epochs, the stepsize acoustic features. A GMM is used for the normal model, and was decreased by half. the encoder and decoder are trained to maximize (11). 3) Other conditions: All sounds were recorded at a sam- 2) DNN architecture and setup: We tested two types of pling rate of 16 kHz. The frame size of the DFT was 512, and network architecture as shown in Fig. 6. The first architecture, the frame was shifted every 256 samples. For p(z j ; y = 0), “FNN”, consisted of fully connected DNNs with three hidden the number of Gaussian mixtures was K = 16 and a diagonal layers and 512 hidden units. The rectified linear unit (ReLU) covariance matrix was used to prevent the problem from being was used as the activation functions of the hidden layers. The ill-conditioned. The EM algorithm for the GMM involved iter- input vector x was defined as ating (18)–(23) 30 times. All the above-mentioned conditions are summarized in Table I. x := (ln [Mel [Abs [X ]]] ; :::; ln [Mel [Abs [X ]]]) ; C +C X := X ; :::; X ; 1; B. Objective experiments on synthetic data where X is the discrete Fourier transform (DFT) spectrum 1) Dataset: Sounds emitted from a condensing unit of an !; of the observed sound, ! 2 f1; :::; g denotes the frequency air conditioner operating in a real environment were used index, C(= 5) is the context window size, and Mel[] and as the normal sounds. In addition, various machine sounds Abs[] denote 40-dimensional Mel matrix multiplication and were recorded from other machines, including a compressor, the element-wise absolute value. 
Thus, the dimension of x was engine, compression pump, and an electric drill, as well as FNN 1D-CRNN Input Input 1D Conv (  ) Fully connect  ReLU ReLU  Max pooling Fully connect  ReLU LSTM Fully connect Fully connect Output Output Input Input Fully connect  ReLU Fully connect  ReLU LSTM Fully connect  ReLU Reshape Fully connect 1D Deconv (  ) Output Output 8 AE VAE NP-CONV NP-PROP AUC-PROP VAEGAN ANR: -15 dB ANR: -20 dB ANR: -25 dB 1 1 1 AE AE VAE VAE 0.8 0.8 0.8 VAEGAN VAEGAN NP-CONV NP-CONV NP-PROP NP-PROP 0.6 0.6 0.6 AUC-PROP AUC-PROP 1 2 3 1 2 3 1 2 3 1 1 1 AE AE VAE VAE VAEGAN VAEGAN 0.5 0.5 0.5 NP-CONV NP-CONV NP-PROP NP-PROP AUC-PROP AUC-PROP 0 0 0 1 2 3 1 2 3 1 2 3 1 1 AE AE VAE VAE VAEGAN VAEGAN 0.5 0.5 0.5 NP-CONV NP-CONV NP-PROP NP-PROP AUC-PROP AUC-PROP 0 0 1 2 3 1 2 3 1 2 3 Collision Sustain Mix Collision Sustain Mix Collision Sustain Mix Fig. 7. Evaluation results of FNN. AE VAE NP-PROP VAEGAN NP-CONV AUC-PROP ANR: -15 dB ANR: -20 dB ANR: -25 dB 1 1 1 AE AE 0.9 0.9 0.9 VAE VAE 0.8 0.8 VAEGAN 0.8 VAEGAN NP-CONV NP-CONV 0.7 0.7 NP-PROP 0.7 NP-PROP AUC-PROP AUC-PROP 0.6 0.6 0.6 1 2 3 1 2 3 1 2 3 1 1 1 AE AE VAE VAE VAEGAN VAEGAN 0.5 0.5 0.5 NP-CONV NP-CONV NP-PROP NP-PROP AUC-PROP AUC-PROP 0  0 0 1 2 3 1 2 3 1 2 3 1 1 0.8 0.8 AE 0.8 AE VAE VAE 0.6 0.6 VAEGAN 0.6 VAEGAN NP-CONV NP-CONV 0.4 0.4 0.4 NP-PROP NP-PROP AUC-PROP AUC-PROP 0.2 0.2 0.2 1 2 3 1 2 3 1 2 3 Collision Sustain Mix Collision Sustain Mix Collision Sustain Mix Fig. 8. Evaluation results of 1D-CRNN. environmental noise of factories. The normal and various including anomalous sounds, synthetic anomalous data were machine sound data totaled 4 and 20 hours (= 4 hours normal used in this evaluation. In particular, we used the training + 16 hours other machines), respectively. These sounds were datasets for task of DCASE-2016 [36] as anomalous sounds. recorded at a 16-kHz sampling rate. 
In order to improve the robustness for different loudness levels and ratios of the normal and anomalous sound, the various machine sounds in the training dataset were augmented by multiplication with five amplitude gains. These gains were calculated so that the maximum amplitude of the various sounds became 1.0, 0.5, 0.25, 0.125, and 0.063.

Since it is difficult to collect a massive amount of test data

Fig. 9. ROC curves (TPR versus FPR) of AE, NP-PROP and AUC-PROP for FNN and 1D-CRNN under each ANR condition (-15, -20, and -25 dB), evaluated on the Mix dataset.

1. The parameters were $\rho = 0.05$ and $p = 0.1$. We evaluated these metrics for three different evaluation sets: 80 collision sounds (Collision), 60 sustained sounds (Sustain), and the sum of these 80 + 60 = 140 sounds (Mix).

The results for each score, sound category, and ANR on FNN and 1D-CRNN are shown in Fig. 7 and Fig. 8.

Although these sounds are “normal” sounds in an office, in unsupervised-ADS, the unknown sounds are categorized as “anomalous”. Thus, we consider that this evaluation can at least evaluate the detection performance for unknown sounds. Since the anomalous sounds of machines are roughly categorized into collision sounds (e.g., the sound of a metal part falling on the floor) and sustained sounds (e.g., frictional sound caused by scratched bearings), we selected 80 collision sounds (slamming doors, knocking at doors, keys put on a table, and keystrokes on a keyboard) and 60 sustained sounds (drawers being opened, pages being turned, and phones ringing) from this dataset [37]. To synthesize the test data, the anomalous sounds were mixed with normal sounds at anomaly-to-normal power ratios (ANRs) of -15,
-20 and -25 dB using the following procedure:

1) select an anomalous sound and randomly cut a segment of a normal sound that has the same signal length as the selected anomalous sound.
2) for the cut normal and anomalous sounds, calculate the frame-wise log power of each frame of 512 points with a 256-point shift on a dB scale, namely $P_\tau = 20 \log_{10} \sum_{\omega=1}^{\Omega} |X_{\omega,\tau}|$.
3) select the median of $P_\tau$ as the representative power of each sound.
4) manipulate the power of the anomalous sound so that the ANR has the desired value.
5) use the cut normal sound as the test data of normal sound, and generate the test data of the anomalous sound by mixing the anomalous sound with the cut normal sound.

In total, we used 140 normal and anomalous sound samples for each ANR condition. The training dataset of normal sounds and the MATLAB code to generate the test dataset are freely

Overall, the performances of AE, NP-PROP and AUC-PROP were better than those of VAE and VAEGAN. In detail, AE achieved high scores for all measurements, AUC-PROP achieved high scores for AUC and pAUC, and NP-PROP achieved high scores for TPR and pAUC. In addition, for all conditions, the TPR and pAUC scores of NP-PROP were higher than those of AE. To discuss the difference between the objective functions of AE, NP-PROP and AUC-PROP, we show the ROC curves in Fig. 9. Since the differences between the results of Collision, Sustain, and Mix were small, we plotted only those of the Mix dataset. From these ROC curves, we can see that the TPRs of NP-PROP under the low FPR conditions were significantly higher than those of the other methods. This might be because the objective function of NP-PROP works to increase TPR under the low FPR condition. In addition, although AUC-PROP’s TPRs under the low FPR condition were lower than those of NP-PROP, the TPRs under the moderate and high FPR conditions were higher than those of the other methods. This might be because the objective function of AUC-PROP works to increase TPR for all FPR conditions.
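The power adjustment in the mixing procedure (representative power as a median of frame-wise dB values, then scaling the anomalous sound to reach the desired ANR) can be sketched as follows. This is a rough illustration rather than the authors' MATLAB code: the names `representative_power_db` and `anr_gain` are hypothetical, and the frame power is computed from time-domain absolute amplitudes as a stand-in for the sum of DFT magnitudes, which is an assumption made to keep the example free of an FFT dependency.

```python
import math
import statistics


def representative_power_db(samples, frame_len=512, shift=256):
    """Median of frame-wise log power in dB.

    Assumption: the time-domain sum of absolute amplitudes replaces the
    sum of DFT magnitudes used in the text.
    """
    powers = []
    for start in range(0, len(samples) - frame_len + 1, shift):
        frame = samples[start:start + frame_len]
        amplitude = sum(abs(s) for s in frame)
        powers.append(20.0 * math.log10(amplitude + 1e-12))
    return statistics.median(powers)


def anr_gain(anomalous_power_db, normal_power_db, target_anr_db):
    """Amplitude gain for the anomalous sound that yields the target ANR."""
    current_anr_db = anomalous_power_db - normal_power_db
    return 10.0 ** ((target_anr_db - current_anr_db) / 20.0)


# Hypothetical constant-amplitude signals, for illustration only.
toy_normal_sound = [1.0] * 2048
toy_anomalous_sound = [0.5] * 2048
gain = anr_gain(
    representative_power_db(toy_anomalous_sound),
    representative_power_db(toy_normal_sound),
    target_anr_db=-20.0,
)
scaled_anomalous = [gain * s for s in toy_anomalous_sound]
mixed = [n + a for n, a in zip(toy_normal_sound, scaled_anomalous)]
```

Because the power is defined with 20 log10 of an amplitude sum, multiplying the anomalous waveform by the gain shifts its representative power by exactly the required number of decibels before it is added to the normal segment.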
Since the individual available on the website . results and objective function tend to coincide, we consider 2) Results: To evaluate the performance of ADS, we used that the training of each neural network succeeded. In addition, the AUC, TPR, and partial AUC ( pAUC) [38]. The AUC is TPR under the low FPR conditions is especially important a traditional performance measure of anomaly detection. The when the ADS is used in real environments, because if an other two measurements evaluated the performance under low ADS system frequently gives false alert, we cannot trust it. FPR conditions. TPR is the TPR under the condition that Therefore, unsupervised-ADS using an AE trained using (11) FPR equals . The pAUC is an AUC calculated with FPRs would be e ective in real situations. ranging from 0 to p with respect to the maximum value of In addition, regarding the FNN results, VAE scored lower than AE, and VAEGAN scored lower than all the other methods. ANR is a measure comparing the level of an anomalous sound to the These results suggest that when calculating the anomaly score level of a normal sound. This definition is the same as the signal-to-noise using a simple network architecture like FNN, a simple ratio (SNR) when the signal is an anomalous sound and the noise is a normal reconstruction error would be better than complex calculation sound. https://archive.org/details/ADSdataset procedures such as VAE and VAEGAN. 
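The five-step synthesis of the test data described above can be sketched as follows. This is a minimal illustration and not the authors' MATLAB code: the function names (`framewise_log_power`, `representative_power`, `mix_at_anr`) are hypothetical, and the frame power is assumed here to be the time-domain mean square of each frame, which may differ from the spectral quantity used in step 2.

```python
import math
import random
import statistics

FRAME_LEN = 512    # frame length in samples (step 2)
FRAME_SHIFT = 256  # frame shift in samples (step 2)


def framewise_log_power(signal, eps=1e-12):
    """Log power in dB of each 512-point frame taken with a 256-point shift.

    Assumption: power is the time-domain mean square of the frame.
    """
    powers = []
    for start in range(0, len(signal) - FRAME_LEN + 1, FRAME_SHIFT):
        frame = signal[start:start + FRAME_LEN]
        mean_square = sum(s * s for s in frame) / FRAME_LEN
        powers.append(10.0 * math.log10(mean_square + eps))
    return powers


def representative_power(signal):
    """Step 3: the median frame-wise log power represents the whole sound."""
    return statistics.median(framewise_log_power(signal))


def mix_at_anr(anomalous, normal, anr_db, rng):
    """Steps 1, 4, and 5: cut, rescale, and mix.

    Returns (normal_cut, scaled_anomalous, mixed).
    """
    # Step 1: randomly cut a normal segment as long as the anomalous sound.
    start = rng.randrange(len(normal) - len(anomalous) + 1)
    normal_cut = normal[start:start + len(anomalous)]
    # Step 4: an amplitude gain g shifts every frame's log power by
    # 20*log10(g) dB, so the median shifts by the same amount.
    current_anr = representative_power(anomalous) - representative_power(normal_cut)
    gain = 10.0 ** ((anr_db - current_anr) / 20.0)
    scaled = [gain * s for s in anomalous]
    # Step 5: the anomalous test sample is the sum of both signals.
    mixed = [n + a for n, a in zip(normal_cut, scaled)]
    return normal_cut, scaled, mixed


# Usage with synthetic noise standing in for real recordings.
rng = random.Random(0)
normal = [rng.gauss(0.0, 1.0) for _ in range(8000)]
anomalous = [rng.gauss(0.0, 1.0) for _ in range(2048)]
normal_cut, scaled, mixed = mix_at_anr(anomalous, normal, -15.0, rng)
print(representative_power(scaled) - representative_power(normal_cut))
```

Using the median rather than the mean as the representative power keeps a short, loud collision from dominating the level estimate of an otherwise quiet clip.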
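The three evaluation measures described above (AUC, TPR at a desired FPR, and pAUC up to p) can be computed from anomaly scores roughly as follows. This is a hedged sketch: the function names and the parameter name `rho` for the desired FPR are hypothetical, a sample is assumed to be flagged when its score is at or above the threshold, the ROC curve is integrated with the trapezoidal rule, and the pAUC is assumed to be divided by p so that its maximum is 1.

```python
def roc_points(normal_scores, anomalous_scores):
    """Empirical ROC points (FPR, TPR), ordered by increasing FPR."""
    thresholds = sorted(set(normal_scores) | set(anomalous_scores), reverse=True)
    points = [(0.0, 0.0)]
    for threshold in thresholds:
        fpr = sum(s >= threshold for s in normal_scores) / len(normal_scores)
        tpr = sum(s >= threshold for s in anomalous_scores) / len(anomalous_scores)
        points.append((fpr, tpr))
    return points


def area_under(points, max_fpr=1.0):
    """Trapezoidal area under the ROC curve for FPR in [0, max_fpr]."""
    area = 0.0
    for (x0, y0), (x1, y1) in zip(points, points[1:]):
        if x0 >= max_fpr:
            break
        if x1 > max_fpr:
            # Clip the last segment at max_fpr by linear interpolation.
            y1 = y0 + (y1 - y0) * (max_fpr - x0) / (x1 - x0)
            x1 = max_fpr
        area += (x1 - x0) * (y0 + y1) / 2.0
    return area


def tpr_at_fpr(points, rho):
    """Largest TPR reachable without the FPR exceeding rho."""
    return max(tpr for fpr, tpr in points if fpr <= rho)


def evaluate(normal_scores, anomalous_scores, rho=0.05, p=0.1):
    """Return AUC, TPR at FPR <= rho, and pAUC over [0, p] scaled to [0, 1]."""
    points = roc_points(normal_scores, anomalous_scores)
    return {
        "AUC": area_under(points),
        "TPR": tpr_at_fpr(points, rho),
        "pAUC": area_under(points, p) / p,
    }


# Usage with toy anomaly scores (not data from the paper).
result = evaluate([0.1, 0.2, 0.3, 0.4], [0.35, 0.5, 0.6, 0.7])
print(result)
```

The default values 0.05 and 0.1 follow the settings reported in the text. Both the TPR and the pAUC look only at the low-FPR end of the curve, which is why they separate methods that have similar AUC values.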
Moreover, the scores of NP-CONV were lower than those of the DNN-based methods. In our previous study [8], we used a DNN as a feature extractor and constructed the normal model by using a GMM. These results suggest that using a DNN for the normal model would be better than using a GMM.

C. Verification experiment in a real environment

We conducted three verification experiments to test whether anomalous sounds in real environments can be detected. The target equipment and experimental conditions were as follows:

- Stereolithography 3D-printer: We collected an actual collision-type anomalous sound. Two hours' worth of normal sounds were collected as training data. The anomalous sound was caused by collision of the sweeper and the formed object. The 3D-printer stopped 5 minutes after this anomalous sound occurred.
- Air blower pump: We collected an actual collision-type anomalous sound. Twenty minutes' worth of normal sounds were collected as training data. The anomalous sound was caused by blockage by a foreign object stuck in the air blower duct. This anomaly does not lead to immediate machine failure; however, it should be addressed.
- Water pump: We collected an actual sustained-type anomalous sound. Three hours' worth of normal sounds were collected as training data. Above 4 kHz, the anomalous sound has a larger amplitude than that of the normal sounds, which was due to wearing of the bearings. An expert conducting a periodic inspection diagnosed that the bearings needed to be replaced.

All anomalous and normal sounds were recorded at a 16-kHz sampling rate. The other conditions were the same as in the objective experiment. The FNN architecture was used for the anomaly score calculation.

[Fig. 10. Anomaly detection results for sound emitted from (a) 3D-printer (collision, left), (b) air blower pump (collision, center), and (c) water pump (sustained, right). The top figure shows the spectrogram (frequency in kHz against time in s), and the bottom figures show the anomaly score (black solid line) and the threshold for an FPR of 0.001 (red dashed line) of AE, VAE, VAEGAN, NP-PROP, and AUC-PROP. Anomalous sounds are enclosed in white dotted boxes, and false-positive detections are circled in purple. Since the spectrum changes due to the anomalous sounds of the 3D-printer and water pump are difficult to see, their anomalous sounds are enlarged. In addition, since the anomalous sound of the water pump is a sustained sound, 60 seconds of normal sounds and 60 seconds of anomalous sound are concatenated for comparison.]

Figure 10 shows the spectrogram (top) and anomaly scores of each method (bottom). The red dashed line in each of the bottom figures is the threshold, which is defined such that the FPR of the training data was 0.1%. Anomalous sounds are enclosed in white dotted boxes in the spectrograms, and the false-positive detections are circled in purple in the anomaly score graphs. Since the anomalous sound of the water pump is a sustained sound, for ease of comparison, 60 seconds of normal sounds and 60 seconds of anomalous sound are concatenated in each figure. In addition, the anomalous sounds are enlarged, since the spectrum changes due to the anomalous sounds of the 3D-printer and water pump are difficult to see.

All of the results for NP-PROP and AUC-PROP indicate that anomalous sounds were clearly detected; the anomaly scores of the anomalous sounds evidently exceeded the threshold, while those of the normal sounds were below the threshold. Meanwhile, in the results of AE and VAE, although the anomaly scores of all anomalous sounds exceeded the threshold, false positives were also observed in the results for the water pump. In addition, although AE's anomaly score of the 3D-printer and VAE's anomaly score of the air blower pump exceeded the threshold, the excess margin of the anomaly score is small, and it is difficult to use a higher threshold for reducing the FPR. This problem might be because the objective functions do not work to increase anomaly scores for anomalous sounds, and thus the encoder and decoder reconstructed not only normal sounds but also anomalous sounds. In VAEGAN, the anomaly scores of the 3D-printer and the water pump exceeded the threshold, whereas those of the air blower pump did not exceed the threshold. The reason might be that when the generator precisely generates “fake” normal sounds, the normal model is trained to increase the anomaly scores of normal sounds. Therefore, the threshold of the air blower pump, which is defined such that the FPR of normal training data becomes 0.001, takes a very high value. These verification experiments suggest that the proposed method is effective at identifying anomalous sounds under practical conditions.

V. Conclusions

This paper proposed a novel training method for unsupervised-ADS using an AE for detecting unknown anomalous sound. The contributions of this research are as follows: 1) By considering outlier-detection-based ADS as a statistical hypothesis test, we defined an objective function that builds upon the Neyman-Pearson lemma [29]. The objective function increases the TPR under a low FPR condition, which is often used in practice. 2) By considering the set of anomalous sounds to be the complement of the set of normal sounds, we formulated a rejection sampling algorithm to simulate anomalous sounds. Experimental results showed that these contributions enabled us to construct an ADS system that accurately detects unknown anomalous sounds in three real environments.

In the future, we will tackle the following remaining issues of ADS systems in real environments:

1) Extension to a supervised approach to detect both known and unknown anomalous sounds: While operating an ADS system in a real environment, we may occasionally obtain partial samples of anomalous sounds. While it might be better to use the collected anomalous sounds in training, the cross-entropy loss would not be the best way to detect both known and unknown anomalous sounds [39]. In addition, if we calculate the TPR in $J^{\mathrm{NP}}$ and/or $J^{\mathrm{AUC}}$ using only a part of the anomalous sounds, this training does not guarantee the performance for unknown anomalous sounds. Thus, we should develop a supervised-ADS method that can also detect unknown anomalous sounds; a preliminary study on this has been published in [25].

2) Incorporating machine or context-specific knowledge: To simplify the experiments, we used the simple detection rule described in Sec. III-D. However, for the anomaly alert, it would be better to use machine/context-specific rules, such as modifying or smoothing the detection result from the raw anomaly score. Thus, it will be necessary to develop rules or a trainable post-processing block to modify the anomaly score.

References

[1] C. Clavel, T. Ehrette, and G. Richard, “Events Detection for an Audio-Based Surveillance System,” in Proc. of ICME, 2005.
[2] G. Valenzise, L. Gerosa, M. Tagliasacchi, F. Antonacci, and A. Sarti, “Scream and Gunshot Detection and Localization for Audio-Surveillance Systems,” in Proc. of AVSS, 2007.
[3] S. Ntalampiras, I. Potamitis, and N. Fakotakis, “Probabilistic Novelty Detection for Acoustic Surveillance Under Real-World Conditions,” IEEE Trans. on Multimedia, pp. 713–719, 2011.
[4] P. Foggia, N. Petkov, A. Saggese, N. Strisciuglio, and M. Vento, “Audio Surveillance of Roads: A System for Detecting Anomalous Sounds,” IEEE Trans. ITS, pp. 279–288, 2016.
[5] P. Coucke, B. De. Ketelaere, and J. De. Baerdemaeker, “Experimental analysis of the dynamic, mechanical behavior of a chicken egg,” Journal of Sound and Vibration, vol. 266, pp. 711–721, 2003.
[6] Y. Chung, S. Oh, J. Lee, D. Park, H. H. Chang, and S. Kim, “Automatic Detection and Recognition of Pig Wasting Diseases Using Sound Data in Audio Surveillance Systems,” Sensors, pp. 12929–12942, 2013.
[7] A. Yamashita, T. Hara, and T. Kaneko, “Inspection of Visible and Invisible Features of Objects with Image and Sound Signal Processing,” in Proceedings of the 2006 IEEE/RSJ International Conference on Intelligent Robots and Systems (IROS2006), pp. 3837–3842, 2006.
[8] Y. Koizumi, S. Saito, H. Uematsu, and N. Harada, “Optimizing Acoustic Feature Extractor for Anomalous Sound Detection Based on Neyman-Pearson Lemma,” in Proc. of EUSIPCO, 2017.
[9] A. Mesaros, T. Heittola, A. Diment, B. Elizalde, A. Shah, E. Vincent, B. Raj, and T. Virtanen, “DCASE 2017 challenge setup: tasks, datasets and baseline system,” in Proc. of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), pp. 85–92, 2017.
[10] H. Lim, J. Park, and Y. Han, “Rare Sound Event Detection Using 1D Convolutional Recurrent Neural Networks,” in Proc. of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), 2017.
[11] E. Cakir and T. Virtanen, “Convolutional Recurrent Neural Networks for Rare Sound Event Detection,” in Proc. of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), 2017.
[12] H. Eghbal-Zadeh, B. Lehner, M. Dorfer, and G. Widmer, “CP-JKU Submissions for DCASE-2016: a Hybrid Approach Using Binaural I-Vectors and Deep Convolutional Neural Networks,” in Proc. of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016), 2016.
[13] S. Mun, S. Park, D. K. Han, and H. Ko, “Generative Adversarial Network Based Acoustic Scene Training Set Augmentation and Selection Using Svm Hyperplane,” in Proc. of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), 2017.
[14] S. Adavanne, G. Parascandolo, P. Pertila, T. Heittola, and T. Virtanen, “Sound Event Detection in Multichannel Audio Using Spatial and Harmonic Features,” in Proc. of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016), 2016.
[15] S. Adavanne and T. Virtanen, “A Report on Sound Event Detection with Different Binaural Features,” in Proc. of the Detection and Classification of Acoustic Scenes and Events 2017 Workshop (DCASE2017), 2017.
[16] T. Lidy and A. Schindler, “CQT-Based Convolutional Neural Networks for Audio Scene Classification and Domestic Audio Tagging,” in Proc. of the Detection and Classification of Acoustic Scenes and Events 2016 Workshop (DCASE2016), 2016.
[17] V. J. Hodge and J. Austin, “A Survey of Outlier Detection Methodologies,” Artificial Intelligence Review, pp. 85–126, 2004.
[18] A. Patcha and J. M. Park, “An overview of anomaly detection techniques: Existing solutions and latest technological trends,” Journal Computer Networks, pp. 3448–3470, 2007.
[19] V. Chandola, A. Banerjee, and V. Kumar, “Anomaly detection: A survey,” ACM Computing Surveys, 2009.
[20] E. Marchi, F. Vesperini, F. Eyben, S. Squartini, and B. Schuller, “A Novel Approach for Automatic Acoustic Novelty Detection using a Denoising Autoencoder with Bidirectional LSTM Neural Networks,” in Proc. of ICASSP, 2015.
[21] T. Tagawa, Y. Tadokoro, and T. Yairi, “Structured Denoising Autoencoder for Fault Detection and Analysis,” Proceedings of Machine Learning Research, pp. 96–111, 2015.
[22] E. Marchi, F. Vesperini, F. Weninger, F. Eyben, S. Squartini, and B. Schuller, “Non-linear prediction with LSTM recurrent neural networks for acoustic novelty detection,” in Proc. of IJCNN, 2015.
[23] Y. Kawaguchi and T. Endo, “How can we detect anomalies from subsampled audio signals?,” in Proc. of MLSP, 2017.
[24] J. An and S. Cho, “Variational Autoencoder based Anomaly Detection using Reconstruction Probability,” Technical Report, SNU Data Mining Center, pp. 1–18, 2015.
[25] Y. Kawachi, Y. Koizumi, and N. Harada, “Complementary Set Variational Autoencoder for Supervised Anomaly Detection,” in Proc. of ICASSP, 2018.
[26] I. J. Goodfellow, J. P. Abadie, M. Mirza, B. Xu, D. W. Farley, S. Ozair, A. Courville, and Y. Bengio, “Generative Adversarial Networks,” in Proc. of NIPS, 2014.
[27] A. B. L. Larsen, S. K. Sonderby, H. Larochelle, and O. Winther, “Autoencoding beyond pixels using a learned similarity metric,” in Proc. of ICML, 2016.
[28] T. Schlegl, P. Seebock, S. M. Waldstein, U. S. Erfurth, and G. Langs, “Unsupervised Anomaly Detection with Generative Adversarial Networks to Guide Marker Discovery,” in Proc. of IPMI, 2017.
[29] J. Neyman and E. S. Pearson, “On the Problem of the Most Efficient Tests of Statistical Hypotheses,” Phil. Trans. of the Royal Society, 1933.
[30] G. Casella and R. L. Berger, “Statistical Inference, section 8.3.2 Most Powerful Test,” Duxbury Pr, pp. 387–393, 2001.
[31] A. P. Bradley, “The use of the area under the ROC curve in the evaluation of machine learning algorithms,” Pattern Recognition, vol. 30, no. 7, pp. 1145–1159, 1997.
[32] A. Herschtal and B. Raskutti, “Optimising Area Under the ROC Curve Using Gradient Descent,” in Proc. of ICML, 2004.
[33] A. Fujino and N. Ueda, “A Semi-supervised AUC Optimization Method with Generative Models,” in Proc. of ICDM, 2016.
[34] D. P. Kingma and J. L. Ba, “Adam: A Method for Stochastic Optimization,” in Proc. of ICLR, 2015.
[35] A. Krogh and J. A. Hertz, “A Simple Weight Decay Can Improve Generalization,” in Proc. of NIPS, 1992.
[36] http://www.cs.tut.fi/sgn/arg/dcase2016/
[37] http://www.cs.tut.fi/sgn/arg/dcase2016/download
[38] S. D. Walter, “The partial area under the summary ROC curve,” Statistics in Medicine, pp. 2025–2040, 2005.
[39] N. Gornitz, M. Kloft, K. Rieck, and U. Brefeld, “Toward Supervised Anomaly Detection,” Journal of Artificial Intelligence Research, pp. 235–262, 2013.

Appendix

A. List of Symbols

1. Functions
$J$: Objective function
$A$: Anomaly score
$H$: Binary decision
$E$: Encoder of autoencoder
$D$: Decoder of autoencoder
$G$: Generator
$\mathcal{N}$: Gaussian distribution
$\mathbb{E}[\cdot]_x$: Expectation with respect to x
$\nabla_x(\cdot)$: Gradient with respect to x
$\mathrm{tr}(\cdot)$: Trace of matrix
$D(A \| B)$: Kullback-Leibler divergence between A and B
$\| \cdot \|_2$: L2 norm
$\lfloor \cdot \rfloor$: Flooring function

2. Parameters
Parameters of normal model
Parameters of encoder
Parameters of decoder
Parameters of generator
Parameters of Gaussian mixture model

3. Variables
$x$: Input vector
$y$: State variable
$z$: Latent vector
Threshold for anomaly score
Desired false positive rate
Mean vector
Covariance matrix
$w$: Mixing weight of Gaussian mixture model
$K$: Number of Gaussian mixtures
$T$: Number of time frames of observation
$N$: Number of training samples
$M$: Minibatch size
$Q$: Dimension of input vector
$R$: Dimension of latent vector
Step size for gradient method
$C$: Context window size
$\ell$: Temporary variable of anomaly score
$V$: Anomaly decision score for one audio clip

4. Notations
Time-frame index of observation
$n$: Index of training sample
$k$: Index of Gaussian distribution
$(\cdot)^{\top}$: Transpose of matrix or vector
$(\cdot)^{(u)}$: Variable of normal sound
$(\cdot)^{(a)}$: Variable of anomalous sound
$(\cdot)^{(v)}$: Variable of various sound

Yuma Koizumi (M ’15) received the B.S. and M.S. degrees from Hosei University, Tokyo, in 2012 and 2014, and the Ph.D. degree from the University of Electro-Communications, Tokyo, in 2017. Since joining the Nippon Telegraph and Telephone Corporation (NTT) in 2014, he has been researching acoustic signal processing and machine learning, including basic research of sound source enhancement and unsupervised/supervised anomaly detection in sounds. He was awarded the FUNAI Best Paper Award and the IPSJ Yamashita SIG Research Award from the Information Processing Society of Japan (IPSJ) in 2013 and 2014, respectively, and the Awaya Prize from the Acoustical Society of Japan (ASJ) in 2017. He is a member of the ASJ and the Institute of Electronics, Information and Communication Engineers (IEICE).

Shoichiro Saito (SM ’06-M ’07) received the B.E. and M.E. degrees from the University of Tokyo in 2005 and 2007. Since joining NTT in 2007, he has been engaged in research and development of acoustic signal processing systems, including acoustic echo cancellers, hands-free telecommunication, and anomaly detection in sound. He is currently a Senior Research Engineer of Audio, Speech, and Language Media Laboratory, NTT Media Intelligence Laboratories. He is a member of the IEICE and the ASJ.

Hisashi Uematsu received the B.E., M.E., and Ph.D. degrees in Information Science from Tohoku University, Miyagi, in 1991, 1993, and 1996. He joined NTT in 1996 and has been engaged in research on psycho-acoustics (human auditory mechanisms) and digital signal processing. He is currently a Senior Research Engineer of Cross-Modal Computing Project, NTT Media Intelligence Laboratories. He was awarded the Awaya Prize from the ASJ in 2001. He is a member of the ASJ.

Yuta Kawachi received the B.E. and M.E. degrees from Waseda University, Tokyo, in 2012 and 2014. Since joining NTT in 2014, he has been researching acoustic signal processing and machine learning. He is a member of the ASJ.

Noboru Harada (M ’99-SM ’18) received the B.S. and M.S. degrees from the Department of Computer Science and Systems Engineering of Kyushu Institute of Technology in 1995 and 1997, respectively. He received the Ph.D. degree from the Graduate School of Systems and Information Engineering, University of Tsukuba in 2017. Since joining NTT in 1997, he has been researching speech and audio signal processing such as high efficiency coding and lossless compression. His current research interests include acoustic signal processing and machine learning for acoustic event detection, including anomaly detection in sound. He received the Technical Development Award from the ASJ in 2016, Industrial Standardization Encouragement Awards from the Ministry of Economy Trade and Industry (METI) of Japan in 2011, and the Telecom System Technology Paper Encouragement Award from the Telecommunications Advancement Foundation (TAF) of Japan in 2007. He is a member of the ASJ, the IEICE, and the IPSJ.

Journal

Electrical Engineering and Systems Science, arXiv (Cornell University)

Published: Oct 22, 2018
