Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

Exploiting Cross-Lingual Speaker and Phonetic Diversity for Unsupervised Subword Modeling

Exploiting Cross-Lingual Speaker and Phonetic Diversity for Unsupervised Subword Modeling Exploiting Cross-Lingual Speaker and Phonetic Diversity for Unsupervised Subword Modeling Siyuan Feng and Tan Lee Abstract—This research addresses the problem of acoustic Unsupervised speech modeling is the task of building sub- modeling of low-resource languages for which transcribed train- word or word-level AMs, when only untranscribed speech ing data is absent. The goal is to learn robust frame-level feature are available for training [7]–[9]. This is also known as representations that can be used to identify and distinguish the zero-resource problem, which has attracted increasing subword-level speech units. The proposed feature representations research interest in recent years. The Zero Resource Speech comprise various types of multilingual bottleneck features (BNFs) that are obtained via multi-task learning of deep neural networks Challenge 2015 (ZeroSpeech 2015) [9] and 2017 (ZeroSpeech (MTL-DNN). One of the key problems is how to acquire high- 2017) [6] precisely focused on unsupervised speech modeling. quality frame labels for untranscribed training data to facilitate ZeroSpeech 2017 was organized to tackle two sub-problems, supervised DNN training. It is shown that learning of robust namely unsupervised subword modeling (Track 1) and spoken BNF representations can be achieved by effectively leveraging term discovery (STD) (Track 2). The present study addresses transcribed speech data and well-trained automatic speech recog- nition (ASR) systems from one or more out-of-domain (resource- the Track 1 problem and aims to learn frame-level feature rich) languages. Out-of-domain ASR systems can be applied to representation that is effective in identifying and discriminat- perform speaker adaptation with untranscribed training data of ing subword-level units and robust to irrelevant factors, e.g., the target language, and to decode the training speech into frame- speaker and/or channel variation, emotion, etc. Robust feature level labels for DNN training. It is also found that better frame representations obtained by learning from data have been labels can be generated by considering temporal dependency in speech when performing frame clustering. The proposed found to be preferable to conventional spectral features like methods of feature learning are evaluated on the standard task Mel-frequency cepstral coefficients (MFCCs) for downstream of unsupervised subword modeling in Track 1 of the ZeroSpeech applications [10], [11]. 2017 Challenge. The best performance achieved by our system DNN models are commonly adopted in frame-level feature is 9:7% in terms of across-speaker triphone minimal-pair ABX learning for unsupervised subword modeling. A DNN model error rate, which is comparable to the best systems reported recently. Lastly, our investigation reveals that the closeness is typically trained using available speech data. The learned between target languages and out-of-domain languages and the features are obtained either from a designated low-dimension amount of available training data for individual target languages hidden layer of the DNN, known as the bottleneck features could have significant impact on the goodness of learned features. (BNFs) [12], or from the softmax output layer, known as Index Terms—zero resource, unsupervised learning, robust the posterior features or posteriorgram [13]. To facilitate features, speaker adaptation, multi-task learning supervised training of the DNN, target labels of training speech are needed. In zero-resource scenarios, the key problem is how to generate informative frame-level labels in the absence of I. I NTRODUCTION speech transcription. One of the possible approaches is based on unsupervised clustering of training speech. The frame-level TATE-OF-THE-ART automatic speech recognition (ASR) cluster indices can be used as target labels for DNN training systems have demonstrated fairly impressive performance [11]–[13]. Another approach seeks to use pre-trained out- in terms of word accuracy [1], [2]. This is mainly attributed of-domain ASR systems to tokenize untranscribed in-domain to the advances of deep neural network (DNN) based acoustic speech and hence each frame is assigned with an ASR senone models (AMs) and language models (LMs) [3], [4]. Typically label [5], [14]. Fully unsupervised [13] or weakly supervised a well-trained DNN-based AM requires hundreds to thousands [15]–[17] methods for DNN training were also reported in the of hours of transcribed speech. As a matter of fact, high- research on acoustic modeling for low-resource languages. performance ASR systems are available only for major lan- The present study adopts the general framework of su- guages [5]. Even for resource-rich languages, preparing tran- pervised DNN training for the purpose of extracting BNF scriptions for available training data is a time-consuming task as the learned feature representation. We attempt to improve that involves considerable human effort. For many languages the efficacy and performance of learned features along two in the world, very little or no transcribed speech is available directions. First, advanced unsupervised acoustic modeling [6], and conventional acoustic modeling techniques are simply techniques are explored to generate initial frame-level labels not applicable. for supervised DNN training. Second, speaker adaptation S. Feng and T. Lee are with the Department of Electronic Engineering, techniques are applied to make input speech features more The Chinese University of Hong Kong, Hong Kong SAR, China (e-mail: robust to speaker variation. siyuanfeng@link.cuhk.edu.hk; tanlee@ee.cuhk.edu.hk). Dirichlet process Gaussian mixture model (DPGMM) is This research is partially supported by a GRF project grant (Ref: CUHK 14227216) from Hong Kong Research Grants Council. commonly used for clustering of unlabelled speech frames arXiv:1908.03538v2 [eess.AS] 29 Sep 2019 2 [18]. It demonstrated superior performance on the tasks in clusions. ZeroSpeech Challenges [19], [20]. However, DPGMM clus- tering, as well as other conventional clustering algorithms II. R ELATED WORKS like k-means [21] and GMM [13], assumes that neighboring A. Deep learning approaches to unsupervised subword mod- speech frames are independent of each other. This is obviously eling not in accordance with the nature of speech. To address A variety of DNN models have been investigated towards this limitation, a full-fledged Gaussian mixture model-hidden unsupervised subword modeling. They include multi-layer Markov model (GMM-HMM) AM is trained to capture con- perceptron (MLP) [12], auto-encoder (AE) [13], correspon- textual information in speech. The transcriptions required for dence AE (cAE), denoising AE (dAE) [27], variational AE GMM-HMM training are initialized via DPGMM clustering. (VAE) [28] and siamese network [29]. In terms of training Following the terminology in [22], this model is referred to as strategies, these models can be classified into three categories, DPGMM-HMM. We use the DPGMM-HMM AM to generate namely, supervised (MLP), unsupervised (AE, VAE, dAE) frame-level labels to support BNF representation learning. In and weakly/pair-wise supervised (cAE, siamese network). [22], a similar approach was adopted for learning feature- Supervised DNN training requires frame-level labels for all space maximum likelihood linear regression (fMLLR) and training speech, which could be obtained either via a clustering posteriorgram features. process or exploiting out-of-domain resources. In [11], [12], In unsupervised subword modeling, the outcome of frame DPGMM clustering was performed on conventional short-time clustering ideally comprises a set of clusters that correspond spectral features of target speech, followed by multilingual to phoneme-related speech units. The underlying assumption DNN training to obtain the BNF representation. In [13], is that speech frames identified as the same phoneme should GMM-universal background model (GMM-UBM) was used have homogeneous acoustic properties. In practice, speaker to generate frame labels. A DNN was trained using these and environment variations would inevitably impact the re- labels to generate BNF or posteriorgram representation. In liability of frame clustering results. Our preliminary experi- [5], [14], language-mismatched ASR systems were utilized to ments showed that applying DPGMM typically results in an decode the target speech, and frame labels were generated excessive number of fine-grained clusters. Similar observations from the ASR decoding lattices. In [30], BNF representation were reported in [23], [24]. These over-fragmented clusters was generated by applying multi-task learning with both in- may adversely affect the effectiveness of unsupervised speech domain and out-of-domain data [25]. The frame labels for modeling. In this work we develop and apply a new algorithm out-of-domain data were obtained by HMM forced alignment, to filter out infrequent labels in DPGMM clustering results, while the labels for in-domain data were from DPGMM and experimentally validate its effectiveness. clustering [12]. In [5], [14], [31], a DNN AM was trained In addition to the DPGMM-HMM labels, a different type with transcribed data of an out-of-domain language, and used of frame labels can be obtained using one or more out-of- to extract BNFs or posteriorgrams from target speech. domain ASR systems [5], [14]. While the DPGMM-HMM Unsupervised DNN training does not require any kind frame labels incorporate statistical information of the acoustic of target labels. For example, an AE model generates non- properties of target speech, the ASR senone labels leverage the linear embeddings of input speech and meanwhile learn to phonetic information acquired from out-of-domain languages. reconstruct the same speech from the embeddings. Recently, We propose to exploit their complementarity in DNN based weakly-supervised model training is studied extensively [15]– feature learning by applying the multi-task learning strategy [17]. In the cAE model [27], a pair of speech segments that [25]. contain the same linguistic unit (word or subword) are used Numerous studies have demonstrated the benefit of applying as the input and output for training, with the objective of speaker adaptation on input features for unsupervised subword minimizing the reconstruction error. In a siamese network, modeling [12], [26]. In the present study, we propose to exploit the input comprises two speech segments. The network is cross-lingual speech data in fMLLR-based speaker adaptation. trained to determine whether the segments are from the same Specifically, transcribed speech from a resource-rich language linguistic unit or not. These models were shown to achieve is used to train an out-of-domain ASR system. This sys- better performance than unsupervised models [27]. However, tem is then applied to the zero-resource target languages for zero-resource languages, such pair-wise knowledge may for estimating linear discriminant analysis (LDA), maximum not be directly available. likelihood linear transform (MLLT) and fMLLR transforms on conventional spectral features. We advocate that this approach B. Unsupervised subword modeling without using DNN is effective and practically desirable as transcribed speech data of resource-rich languages are relatively easy to access. There were numerous studies on unsupervised subword The remainder of this paper is organized as follows. Sec- modeling without involving deep learning models. In these tion II provides a review of related works on unsupervised studies, clustering of short-time frame features is an important subword modeling with untranscribed speech. In Section III, first step. After frame clustering, each cluster is represented by we provide detailed description on the proposed approaches a learned probability distribution, and the cluster posteriorgram to feature learning. Section IV introduces experimental design can be regarded as the learned representation for subword on ZeroSpeech 2017 development data. Section V discusses modeling. Frame clustering could be done straightforwardly and analyzes experimental results. Section VI gives the con- by applying k-means [21], GMM [32] and DPGMM [19] 3 algorithms. In [19], DPGMM clustering was applied to a zero- processed by an out-of-domain ASR system, where VTLN, resource target language. An extension of this approach was LDA, MLLT and fMLLR transforms are estimated sequen- reported in [20], where clustering was performed with fMLLR- tially. The DPGMM clustering algorithm is applied to the based speaker-adapted features. In [32], GMM posteriorgram fMLLR features of target speech. The resulted frame labels are and HMM posteriorgram were compared, where the HMM post-processed by a label filtering algorithm and then used for was trained based on GMM-UBM clustering results. context-dependent GMM-HMM (CD-GMM-HMM) acoustic To better retain temporal dependency in speech, frame modeling. The trained AMs forced align target speech to clustering can be embodied in segment level. Initial segmen- generate DPGMM-HMM alignments. Subsequently, an MTL- tation of speech utterances could be obtained by hierarchical DNN is trained to generate BNFs for subword modeling. agglomerative clustering [33], or using language-mismatched The training tasks of MTL include DPGMM-HMM alignment prediction and language-mismatched label prediction of mul- phone recognizers [34], [35]. Subsequently a fixed-length feature vector is derived to represent each speech segment. tiple target languages. The language-mismatched labels are Clustering of segment-level feature vectors was tackled using generated by multiple out-of-domain ASR systems. a range of algorithms, including vector quantization (VQ) [36], The proposed system design emphasizes on leveraging segmental GMM (SGMM) [37], spectral clustering [38] and speech data resources from out-of-domain languages [5], [14]. graph clustering [39]. In [40], segmentation and clustering This is realized in the following aspects: were integrated as a jointly optimized process. Use out-of-domain data to perform fMLLR speaker adap- The present study is on one hand largely based on DNN tation on target speech. modeling of speech, and on the other hand incorporates the Use out-of-domain ASR systems to generate frame labels ideas of frame clustering (as the initial tokenization) [19], to facilitate multi-task DNN training. fMLLR-based speaker adaptation [20], and use of HMM to Use an out-of-domain DNN AM to extract BNFs. capture temporal dependency [32]. A. Speaker adaptation with out-of-domain data C. Optimizing DPGMM clustering For resource-rich languages, a large amount of transcribed DPGMM clustering has been shown to be a preferred and speaker-annotated speech data are readily available. We method of frame labeling for unsupervised subword modeling propose to utilize these out-of-domain data to model speaker [19], [20]. Nevertheless, one shortcoming of DPGMM is variation in untranscribed speech of the target speech. A that it tends to produce over-fragmented speech units [23], conventional CD-GMM-HMM AM is trained using the out- [24]. Different approaches have been proposed to tackle this of-domain data. Based on this model, VTLN, LDA, MLLT problem. In [23], DPGMMs were replaced by the Dirichlet and fMLLR transforms can be estimated. Subsequently, CD- process mixture of mixtures model (DPMoMM) to enable GMM-HMM AMs with speaker adaptive training (CD-GMM- multi-modal cluster inference. In [24], small-sized clusters HMM-SAT) are trained and used to estimate fMLLR trans- were merged based on low functional load [41], [42]. In our forms for target speech utterances. It must be noted that the work, this problem is tackled by a label filtering algorithm. estimated fMLLR features of target speech could be directly DPGMM for frame labeling could be optimized at input used for subword modeling. They are expected to provide feature level. Conventional spectral features like MFCC [19] a better baseline than the conventional spectral features like and perceptual linear prediction (PLP) [26] were commonly MFCCs or PLPs. used as the initial representations of target speech. Albeit straightforward, these features are considered sub-optimal for B. Frame labeling unsupervised subword modeling, as they contain a lot of irrel- evant information such as speaker identity and emotion. Heck 1) DPGMM clustering: DPGMM is a non-parametric et al. [26], [43] found that fMLLR transforms can noticeably Bayesian extension to GMM, where a Dirichlet process prior suppress speaker-related feature variation, and advocated the replaces the vanilla GMM. One advantage of DPGMM clus- importance of speaker adaptation in the concerned task. To tering is that the cluster number does not need to be pre- enable supervised estimation of fMLLRs, clustering results on defined. Let us consider M zero-resource target languages. For spectral features were taken as pseudo transcriptions. Chen et an utterance from the i-th language, the frame-level features i i i al. [12] showed that vocal tract length normalization (VTLN) are denoted as fx ;x ;: : :;x g, where L is the number 1 2 on top of spectral features contribute to generating more robust of frames in the utterance. By applying DPGMM clustering, DPGMM frame labels. In our study, fMLLR features are K clusters are obtained and represented with k Gaussian i i i estimated by exploiting an out-of-domain ASR system. components. The frame labels fl ; l ; : : : ; l g are given as, 1 2 L i i III. PROPOSED S YSTEM l = arg max Prob(kjx ); (1) t t 1kK The proposed system framework for unsupervised subword i i modeling of zero-resource languages is illustrated as in Fig. 1. where Prob(kjx ) denotes the posterior probability of x with t t It comprises three modules, namely, speaker-adapted feature respect to the k-th Gaussian component. The inference of extraction, unsupervised acoustic modeling, and multi-task DPGMM parameters can be performed using the algorithm BNF learning. Speech frames of the target language are first as described in [18]. 4 Unsuperivsed acoustic modeling Out-of-domain languages Speaker adapted DPGMM Label GMM-HMM feature extraction fMLLRs CA ASR clustering filtering training MFCCs Multiple ASRs Decode and Zero-resource align languages Forced-alignment Labels Labels MTL-DNN BNF extraction Evaluation training Fig. 1. The proposed framework of unsupervised subword modeling. Let P be the percentage of frame labels that we aim to Descending order retain. These frames are from K “dominant” clusters, where cut  90  80  100  50  60  20 c c c c c c 1 2 3 4 5 6 K c ^ k=1 K = arg min  P: (4) cut 0 N c c c c c c 3 1 2 5 4 6 O denotes the collection of all frame labels that are re- ˆ ˆ ˆ ˆ ˆ ˆ c c c c c c moved, i.e., 1 2 3 4 5 6 n o m (3)1 m (1)2 m ( 2)3 m (5)4 m ( 4)5 m ( 6)6 O = l : l 2 F; i 2 f1; 2; : : : ; Ng ; (5) i i where n o Fig. 2. Example of cluster size sorting. F = m(K + 1); : : : ; m(K ) : (6) cut F contains indices of K K clusters that are the cut 2) Out-of-domain ASR decoding: Given a speech utterance least frequent to occur. Frames assigned to these clusters are in the target language, an out-of-domain ASR system can considered as outliers. be applied to generate a sequence of phone-level or state- In the extreme case when P is set to 1, F and O will level labels [14]. The idea can be naturally extended to using be empty sets. The smaller the value of P , the larger the multiple out-of-domain ASR systems and desirably providing proportion of filtered frame labels. The label filtering algorithm a wide coverage of phonetic diversity. The outcome of ASR is summarized as in Algorithm 1. decoding depends on the relative weighting of AM and LM. In our work, the LM is assigned a very small weight, such that Algorithm 1 DPGMM label filtering algorithm the acquired frame labels mainly reflect acoustic properties of Input: l ; l ; : : : ; l , P 1 2 N the target speech being modeled. Output: O 1: Calculate c by Equation (2). C. DPGMM label filtering 2: Sort fc ; c ; : : : ; c g in descending order. 1 2 K For a specific target language, let us assume that K 3: Calculate m(k) by Equation (3). Gaussian components (clusters) are obtained by DPGMM 4: Calculate K by Equation (4) and P . cut clustering. The frame labels are denoted as l ; l ; : : : ; l for 5: Select a subset of l ; l ; : : : ; l asO, by Equation (5)&(6). 1 2 N 1 2 N an N -frame utterance. Let c be the number of frames labeled . Frame labels that are removed. as cluster k, i.e., c = 1(l = k); k 2 f1; 2; : : : ; Kg; (2) k i D. DPGMM-HMM acoustic modeling i=1 Each DPGMM cluster can be regarded as a pseudo phone. where 1() is the indicator function. The sequence of DPGMM frame labels (after filtering) can The elements in fc ; c ; : : : ; c g are sorted in descending 1 2 K be converted into a pseudo transcription by collapsing neigh- order into fc ^ ; c ^ ; : : : ; c ^ jc ^  c ^  : : :  c ^ g. m() denotes 1 2 K 1 2 K boring duplicated labels, e.g., “1,3,3,3,7,10,10” ! “1,3,7,10”. the index mapping function, i.e., Based on the pseudo transcription, HMM acoustic model- ing is done by following the standard supervised training c ^ = c : (3) k m(k) pipeline, i.e., proceeding from monophone model training with Fig. 2 gives an example of cluster size sorting. uniform time alignment to context-dependent GMM-HMM 5 (CD-GMM-HMM). The trained AM is used to produce time alignment information for DNN-based subword discriminative Alignments for Alignments for OOD Lang OOD Lang modeling (will be discussed in Section III-E). To be distin- Lang 1 Lang M label 1 label N …... …... guished from the DPGMM frame labels, the frame labels obtained from the HMM forced alignment are referred to as DPGMM-HMM labels. BNF for subword Although the DPGMM labels could be directly used for discriminability task supervised DNN acoustic modeling [12], [14], we expect that Tasks for MUBNF DPGMM-HMM labels are more reasonable as they are derived Tasks for OSBNF with consideration on contextual dependency of speech. Tasks for LI-BNF Lang 1 Lang M …... fMLLRs fMLLRs E. Multi-task learning for BNFs The bottleneck feature (BNF) is a type of representation Fig. 3. MTL-DNN for extracting LI-BNF, MUBNF and OSBNF. The term “OOD” stands for out-of-domain. obtained from a designated low-dimension hidden layer of a DNN. In ASR applications, BNFs have been shown to provide a compact and phonetically-discriminative representation of IV. E XPERIM ENTAL SETUP input speech, and be effective in suppressing linguistically- A. Dataset and evaluation metric irrelevant variations [44]. In the context of zero-resource Experiments are carried out with the development data speech modeling, BNFs have also been widely investigated of ZeroSpeech 2017 Track 1 [6]. The data covers three [5], [12], [14], [17]. target languages, namely English, French and Mandarin. For The proposed MTL-DNN is depicted in Fig 3. The DNN each language, there are separate training set and test set of training involves a total of M + N tasks, which involves untranscribed speech. Speaker identity information is provided M zero-resource target languages and N out-of-domain ASR for the train sets but not available for the test sets. The test systems. Each of the tasks is represented by a task-specific data are organized into subsets of different utterance lengths: 1 softmax output layer in the DNN. The hidden layers, including second, 10 second and 120 second. Detailed information about a low-dimension linear BN layer, are shared across all tasks. the dataset are given as in Table I. For the zero-resource language tasks, state-level or phone-level DPGMM-HMM labels are used as target labels. The decoding TABLE I output from each of the out-of-domain ASR systems provides D EVELOPM ENT DATA OF ZEROS PEECH 2017 T RACK 1 one set of frame-level labels for MTL. For the MTL-DNN trained only on the M target language Training Test tasks, the extracted BNFs are referred to as multilingual Duration (hours) # speakers Duration (hours) unsupervised BNFs (MUBNFs). When out-of-domain ASR English 45 60 27 tasks are added, the BNFs are named language-independent French 24 18 18 BNFs (LI-BNFs). In the case that only the out-of-domain ASR Mandarin 2:5 8 25 tasks are involved, the extracted BNFs are referred to as out- of-domain supervised BNFs (OSBNFs). The evaluation metric adopted for ZeroSpeech 2017 Track The DPGMM-HMM labels are obtained through statistical 1 task is the ABX subword discriminability. Inspired by the modeling of target speech. The ASR senone labels leverage the match-to-sample task in human psychophysics, it is a simple phonetic knowledge acquired from out-of-domain languages. method to measure the discriminability between two categories It is expected that they would contribute complementarily in of speech units [9]. The basic ABX task is to decide whether feature learning. Learning from speech of multiple languages X belongs to x or y, if A belongs to x and B belongs would result in a language-independent BNF representation to y, where A, B and X are three data samples, x and y that is more generalizable to unknown languages. are the two pattern categories concerned. The performance For the shared-hidden-layer structure in the MTL-DNN, evaluation in ZeroSpeech 2017 is carried out on the triphone multi-layer perceptron (MLP) is commonly used [12]–[14], minimal-pair task. A triphone minimal pair comprises two [31]. In this study, in addition to MLP, we investigate the use triphone sequences, which have different center phones and of long short-term memory (LSTM) [45] and bi-directional identical context phones, for examples, “beg”-“bag”, “api”- LSTM (BLSTM) [46], which were shown to perform better “ati”. Discriminating triphone minimal pairs is a non-trivial than MLP in conventional supervised acoustic modeling. task. The performance of a feature representation on the On the other hand, BNF representation can also be obtained triphone minimal-pair ABX task is considered a good indicator from the DNN AM pre-trained for a resource-rich language of its efficacy in speech modeling [48]. [5]. This is considered as a transfer learning approach [47]. Let x and y denote a pair of triphone categories. Consider This transfer learning BNF (TLBNF) is expected to further three speech segments A, B and X , where A and X belong to enrich the feature representation and will be jointly used with category x and Y belongs to y. The ABX discriminability of MUBNF, OSBNF and LI-BNF for subword modeling. x from y is measured in terms of the ABX error rate (x; y), 6 which is defined as the probability that the distance of A from trained with transcriptions of CUSENT training data is used X is greater than that of B from X , i.e., during decoding. The LM is trained with SRILM [51]. The other three out-of-domain ASR systems are all phone X X X recognizers developed by Brno University of Technology [52]. (x; y) = jS(x)j(jS(x)j 1)jS(y)j The recognizers adopt a 3-layer MLP structure, in which the A2S(x) B2S(y) X2S(x)nfAg first two are sigmoid layers and the third is a softmax layer. (1 + 1 ); d(A;X)>d(B;X) d(A;X)=d(B;X) They were trained with the SpeechDat-E databases [53]. The (7) numbers of modeled phones in Czech, Hungarian and Russian are 45; 61 and 52, respectively. The training data sizes are 9:7, where S(x) and S(y) denote the sets of features that represent 7:9 and 14:0 hours, respectively. The cross-entropy criterion triphone categories x and y, respectively. d(;) denotes the was used for MLP training. dissimilarity between two speech segments, which is computed by dynamic time warping (DTW) in our study. The frame- C. Speaker adaptation of target speech level dissimilarity measure used for DTW scoring is the The Cantonese ASR system is used to perform fMLLR- cosine distance. Note that (x; y) is asymmetric to x and based speaker adaptation of target speech on the 39-dimension y. A symmetric form can be defined by taking average of MFCC features in a two-pass procedure. In the first pass, (x; y) and (y; x). The overall ABX error rate is obtained input speech utterances are decoded in a speaker-independent by averaging over all triphone categories and speakers in manner, using unadapted features, from which initial fMLLR the test set. A high ABX error rate means that the feature transforms are estimated. In the second pass, input speech are representation is not discriminative, and vice versa. Intuitively, decoded with initial fMLLRs in a speaker-adaptive manner. the error rate should be no larger than 50%, as by random After the decoding, final fMLLR transforms for target speech decision, the expectation of ABX error rate is 50%. utterances are estimated. The dimension of fMLLR features is Two evaluation conditions were defined in ZeroSpeech 2017, namely within-speaker and across-speaker. In both con- ditions, the segments A and B to be evaluated are generated by the same speaker. In the within-speaker condition, segment D. DPGMM frame clustering and label filtering X is generated by the same speaker as A and B; In the across- Speech frames for different languages are clustered sepa- speaker condition, X is generated by a speaker different from rately by the DPGMM algorithm based on the 40-dimension A and B. fMLLR features. The implementation of DPGMM clustering is performed using an open-source tool developed by Chang et al. [18]. For the three target languages, namely English, B. Out-of-domain ASR systems French and Mandarin, the numbers of iterations of clustering Four out-of-domain ASR systems are utilized and investi- were 120; 200 and 3000 respectively. The numbers of itera- gated in our experiments. They cover the languages of Can- tions for English and French are determined by preliminary tonese (CA), Czech (CZ), Hungarian (HU) and Russian (RU). experiments. Specifically, the iterations for English ranging in The Cantonese ASR is trained with the CUSENT database f40; 80; : : : ; 680g and for French ranging in f40; 80; : : : ; 400g [49]. The database contains 20; 378 training utterances from were tested. The optimal numbers of iterations were 120 and 34 male and 34 female speakers, with a total of 19:3 hours 200 respectively. For Mandarin, the number of iterations was of speech. The Kaldi toolkit [50] is used to train two versions empirically determined. The resulted numbers of DPGMM of AMs: CD-GMM-HMM-SAT and DNN-HMM. DNN-HMM clusters for English, French and Mandarin are 1118; 1345 and training labels are acquired from CD-GMM-HMM-SAT time 596, respectively. Each frame is assigned a cluster label. Fig. alignment. The input features for CD-GMM-HMM-SAT are 4 shows the results of clustering in the form of cumulative 40-dimension fMLLRs, and the input features for DNN-HMM distribution function (CDF) for the three target languages. are fMLLRs with 5 splicing. The fMLLR features are The clusters are sorted according to their cluster size in estimated during CD-GMM-HMM-SAT training. Specifically, descending order. In other words, each point (K ; Q ) on the i i VTLN is estimated towards 39-dimension MFCCs++. CDF represents the proportion of frame labels Q that the The resulted features with 3 splicing are used to estimate largest K clusters cover. 40-dimension LDA and MLLT. Finally, fMLLR transforms For label filtering, we evaluated different thresholds on the are estimated. MFCC features are computed using a 25-ms percentage of preserved labels, with the value of P ranging Hamming window and a 10-ms frame shift. Per-utterance from 0:6 to 0:95, with the step size of 0:05. After filtering, cepstral mean variance normalization (CMVN) is applied to the frame-level label sequences are converted into pseudo MFCCs. The DNN-HMM model for Cantonese is a 7-layer transcriptions, for the training of DPGMM-HMM AMs (in MLP, with layer configuration 440-1024  5-40-1024-2462. Section IV-E). The dimension of the output layer is determined by the DPGMM clustering was also tested with MFCCs as input number of CD-HMM states modeled by CD-GMM-HMM- features. The numbers of iterations for MFCC clustering SAT. Hidden layers are activated with sigmoid function, except are 200; 240 and 3000 for English, French and Mandarin for the 40-dimension linear BN layer. The network is trained respectively, and the resulted numbers of DPGMM clusters to optimize the cross-entropy criterion. A syllable trigram LM are 1554; 1541 and 381. 7 for (HMM-)MUBNF, OSBNF and (HMM-)LI-BNF are listed Cumulative distribution function in Table II. TABLE II 0.8 C ONFIGURATIONS FOR (HMM-)MUBNF, OSBNF AND (HMM-)LI-BNF English REPRESENTATIONS French 0.6 Mandarin Task label from DPGMM DPGMM-HMM CA CZ HU RU 0.4 Train set EN FR MA EN FR MA Pooling EN, FR and MA MUBNF X X X 0.2 OSBNF1 X OSBNF2 X X X X 0 200 400 600 800 1000 1200 1400 LI-BNF1 X X X X DPGMM cluster number LI-BNF2 X X X X X X X HMM-MUBNF X X X HMM-LI-BNF1 X X X X Fig. 4. Clustering results in the form of cumulative distribution function for HMM-LI-BNF2 X X X X X X X the three target languages. Clusters are sorted according to cluster size in descending order. The MTL-DNN is implemented in three different model structures: MLP, LSTM and BLSTM. The input features are E. DPGMM-HMM and MTL-DNN training 40-dimension fMLLRs spliced with context size 5. The dimensions of shared hidden layers in the MLP are 440- DPGMM-HMM AMs are trained from scratch with pseudo 1024  5-40-1024. Sigmoid activation is used in all hidden transcriptions. Different from the conventional 3-state HMM layers except that the 40 neurons in the BN layer use linear topology, during DPGMM-HMM training we set 1-state HMM activation functions. The learning rate for MLP training is set for each pseudo phone. This prevents the problem of unsuc- at 0:008 at the beginning, and halved when no improvement cessful forced alignments, as the numbers of pseudo phones is observed on a cross-validation set. The mini-batch size is for target languages are significantly larger than the number of 256. The LSTM model comprises 2 LSTM layers with 320- phones for a typical language. The input features for DPGMM- dimension cell activation vectors, and 1024-dimension outputs. HMM are 40-dimension fMLLRs estimated by the Cantonese A 40-dimension BN layer followed by a 1024-dimension ASR. The training procedure follows the standard pipeline as fully-connected (FC) layer is set on top of LSTMs. For the in Kaldi s5 recipe , i.e., starting from CI-GMM-HMM to CD- BLSTM model, there are 2 pairs of forward and backward GMM-HMM, followed by VTLN and fMLLR-based SAT . LSTM layers. Each bi-directional layer has 320-dimension After training, the numbers of CD-HMM states for English, cell activation vectors and 512-dimension outputs. A BN layer French and Mandarin are 2818; 2856 and 2688, respectively. followed by an FC layer is set on top of BLSTMs, with the The MTL-DNN model is trained with all the three target same configuration as in the LSTM. The activation function zero-resource languages, from which BNFs are extracted and in (B)LSTMs is tanh. The learning rate is 2e4 initially, and evaluated by the ABX subword discriminability task. There are halved under the same criteria as for MLP. The truncated back- two types of tasks for MTL, namely, DPGMM-HMM align- propagation through time (BPTT) algorithm [54] is used to ment prediction task and out-of-domain ASR label prediction train (B)LSTM, with a fixed time step T = 20. Note that bptt task. In the first case, three tasks are included, i.e., frame the model parameters of LSTM and BLSTM structures were alignments generated by DPGMM-HMM AMs, one for each tuned in preliminary studies, while for MLP we follow the target zero-resource language. In the second case, four tasks configuration of our previous study [14]. corresponding to Cantonese, Czech, Hungarian and Russian recognizers’ senone labels are included. The senone labels F. TLBNF generation are generated by decoding with LM to AM weight ratio set to 0:001. After MTL-DNN training, 40-dimension HMM-LI- The TLBNFs for target zero-resource languages are gen- BNFs are extracted for the ABX task evaluation. Similarly, erated by applying the Cantonese DNN-HMM AM as the HMM-MUBNFs , extracted by MTL-DNN with DPGMM- feature extractor. During TLBNF extraction, all the parameters HMM alignment tasks, and OSBNFs, extracted by MTL-DNN of the DNN-HMM are fixed. The fMLLR features for target with one or more out-of-domain phone recognizers’ senone languages are fed as inputs to the DNN-HMM till its BN layer labels, are also evaluated by the ABX task. The dimensions to generate TLBNFs. of both HMM-MUBNFs and OSBNFs are 40. As illustrated in Fig. 3, we defined several BNF representations according to V. RESULTS AND DISCUSSION the tasks included in MTL-DNN training. The configurations Table III provides a master summary to facilitate perfor- mance comparison among different systems of feature repre- kaldi/egs/wsj/s5/run.sh sentation learning. The methods are organized in four groups, LDA and MLLT are not estimated, as no improvement was found. marked by circled numerals 1 to 4 in the Table. The first The prefix ‘HMM-’ emphasizes the use of DPGMM-HMM alignments, rather than DPGMM cluster labels. group comprises a few relevant baseline and reference systems. Proportion 8 TABLE III ABX ERROR RATES (%) ON THE BASELINE, OUR PROPOSED METHODS AND STATE OF THE ART OF ZEROS PEECH 2017. MLP IS ADOPTED AS THE SHARED- HIDDEN-LAYER STRUCTURE. L ABEL FILTERING IS NOT APPLIED. Within-speaker Across-speaker English French Mandarin Avg. English French Mandarin Avg. 1s 10s 120s 1s 10s 120s 1s 10s 120s 1s 10s 120s 1s 10s 120s 1s 10s 120s MFCC Baseline [6] 12:0 12:1 12:1 12:5 12:6 12:6 11:5 11:5 11:5 12:0 23:4 23:4 23:4 25:2 25:5 25:2 21:3 21:3 21:3 23:3 Out-of-domain fMLLR [14] 8:0 8:2 7:3 10:3 10:3 9:1 9:3 9:3 8:4 8:9 13:4 12:0 11:3 17:2 15:8 14:8 10:7 10:2 9:4 12:8 Out-of-domain fMLLR [5] 7:8 7:7 7:0 10:4 10:5 9:2 9:2 11:4 8:8 9:1 14:2 11:9 11:3 17:6 15:2 14:4 12:7 13:6 10:0 13:4 MUBNF0 8:0 7:3 7:3 10:3 9:4 9:3 10:1 8:8 8:9 8:8 13:5 12:4 12:4 17:8 16:4 16:1 12:6 11:9 12:0 13:9 MUBNF 7:4 6:9 6:3 9:6 9:0 8:1 9:8 8:8 8:1 8:2 10:9 9:5 8:9 15:2 13:0 12:0 10:5 8:9 8:2 10:8 OSBNF1 7:2 7:1 6:3 10:2 9:7 8:7 9:1 8:6 7:6 8:3 10:0 9:7 8:6 13:9 13:4 11:6 9:0 8:4 7:5 10:2 OSBNF2 6:8 6:7 5:9 9:5 9:2 8:3 9:7 8:9 8:0 8:1 9:5 9:2 7:9 13:1 13:0 11:3 9:4 8:7 7:9 10:0 LI-BNF1 6:9 6:6 6:1 9:5 9:2 8:4 9:2 8:5 7:9 8:0 10:0 8:9 8:2 14:3 12:9 11:5 9:5 8:5 7:7 10:2 LI-BNF2 6:6 6:4 5:7 9:1 9:3 8:2 9:5 8:7 8:1 8:0 9:4 8:7 7:8 13:4 12:7 11:0 9:3 8:6 7:7 9:8 HMM(S)-MUBNF 7:2 6:7 6:3 9:7 9:2 8:3 10:4 9:2 8:5 8:4 10:2 9:3 8:6 14:5 13:0 11:9 10:7 9:2 8:4 10:6 HMM(P)-MUBNF 7:1 6:6 6:2 9:4 9:1 7:8 9:9 8:8 8:2 8:1 10:4 9:2 8:7 14:5 12:7 11:7 10:4 8:9 8:2 10:5 HMM(P)-LI-BNF1 6:8 6:3 5:8 9:1 8:7 7:8 9:1 8:5 7:6 7:7 9:7 8:7 8:0 13:7 12:3 11:1 9:7 8:4 7:6 9:9 HMM(P)-LI-BNF2 6:6 6:4 5:7 9:2 8:8 8:1 9:2 8:6 7:9 7:8 9:3 8:7 7:8 13:0 12:4 11:0 9:5 8:5 7:7 9:8 TLBNF 7:2 6:8 6:1 9:6 9:0 8:0 8:7 7:6 6:8 7:8 10:6 9:6 8:7 14:2 13:2 11:5 8:5 7:6 6:7 10:1 TLBNF+LI-BNF1 7:0 6:6 6:0 9:3 8:8 7:9 8:6 7:5 6:7 7:6 10:3 9:3 8:4 13:9 12:9 11:4 8:5 7:6 6:7 9:9 TLBNF+LI-BNF2 7:1 6:6 6:0 9:4 8:9 7:8 8:7 7:5 6:8 7:6 10:4 9:4 8:5 14:0 13:0 11:3 8:5 7:6 6:6 9:9 TLBNF+HMM(P)-LI-BNF1 7:0 6:6 6:0 9:4 8:8 7:8 8:6 7:5 6:7 7:6 10:3 9:4 8:4 13:9 12:9 11:3 8:5 7:6 6:6 9:9 TLBNF+MUBNF+OSBNF1 6:8 6:4 5:8 9:0 8:8 7:8 8:5 7:7 6:8 7:5 9:9 9:0 8:2 13:6 12:6 11:1 8:4 7:7 6:7 9:7 TLBNF+HMM(P)-MUBNF+OSBNF1 6:8 6:4 5:7 8:8 8:7 7:5 8:4 7:5 6:8 7:4 10:0 9:0 8:2 13:6 12:6 11:1 8:4 7:6 6:7 9:7 TLBNF+HMM(P)-MUBNF+OSBNF2 6:7 6:4 5:8 9:0 8:8 7:5 8:3 7:5 6:8 7:4 10:0 9:0 8:2 13:6 12:6 11:1 8:4 7:6 6:7 9:7 Heck et al. [20] 6:9 6:2 6:0 9:7 8:7 8:4 8:8 7:9 7:8 7:8 10:1 8:7 8:5 13:6 11:7 11:3 8:8 7:4 7:3 9:7 Chorowski et al. [28] 5:8 5:7 5:8 7:1 7:0 6:9 7:4 7:2 7:1 6:7 9:3 9:3 9:3 11:9 11:4 11:6 8:6 8:5 8:5 9:8 The MFCC baseline system refers to the one, in which generic features in the first group of systems outperform the MFCC MFCC features are directly used in triphone minimal pair baseline consistently on all target languages. This improve- discrimination. The first out-of-domain fMLLR system comes ment can be achieved without requiring any transcribed train- from previous work [14], which used a Cantonese ASR system ing data of the target language, which is highly desirable in for fMLLR estimation. The second one used a Japanese ASR the zero-resource scenario. [5]. In [5], the out-of-domain ASR system was trained on 240 The second and third groups of systems all use multilingual hours of Japanese speech. The experimental results in [14] BNF representations, which are learned by different methods show that using a Cantonese ASR system trained on only 19 as described in Section IV-E. DPGMM labels and DPGMM- hours of speech could give a better performance in both within- HMM labels are applied in the the second group and the third and across-speaker conditions. The advantage is particularly group respectively. In the second group, MUBNF0 is learned significant when the target language is Mandarin. using MFCC as input features for DPGMM clustering and MTL-DNN modeling. The other representations in these two B. Effectiveness of multilingual BNFs groups are learned using fMLLRs as DNN input features. As described in Section IV-E and Table II, OSBNF1 and The following observations can be made on the perfor- OSBNF2 are trained with out-of-domain ASR senone labels, mances of the learned multilingual BNF representations: and LI-BNF1 and LI-BNF2 are trained with both DPGMM (1) BNF representations learned by MTL-DNN clearly out- labels and out-of-domain ASR senone labels. In the third perform the respective input features to the DNN. MTL-DNN group, “HMM(S)” and “HMM(P)” denote the use of state- training with DPGMM labels is effective for both MFCC and level and phone-level HMM alignments respectively for label fMLLR. The average ABX error rates achieved by MUBNF0 generation. The fourth group of systems are built on different are 8:8% and 13:9% in the within-speaker and across-speaker combination of BNF features. The “+” sign is used to denote conditions respectively, versus 12:0% and 23:3% attained by concatenation of two frame-level feature representations. The MFCC. For MUBNF representation, the relative performance experimental results on all methods of BNF representation improvements over fMLLR are 7:9% and 15:6% in the two test learning as shown in Table III are obtained by using the conditions. MUBNF outperforms MUBNF0 to a large extent, MLP structure in MTL-DNN. In addition, two representative especially in the across-speaker test condition. This suggests systems that achieved very good performances in ZeroSpeech that speaker adaptation at input feature level is a critical step 2017 [20], [28] are also listed in the Table. in obtaining speaker-invariant BNF representations. (2) The effectiveness of BNF can be further improved by training the MTL-DNN with additional out-of-domain ASRs’ A. Effect of out-of-domain speaker adaptation senone labels. With the Cantonese ASR’s senone labels in- The fMLLR features estimated with in-domain data were cluded as one of the training tasks, the LI-BNF1 representation shown to perform significantly better than conventional spec- reduces within-/across-speaker ABX error rates by absolute tral features in unsupervised subword modeling [22], [26]. In 0:2%=0:6% as compared to MUBNF. When the senone labels the present study, it has been shown that similar improvement of Czech, Hungarian and Russian are added, the resulted LI- could also be attained by performing speaker adaptation using BNF2 representation shows a further improvement of absolute an out-of-domain ASR system. Both out-of-domain fMLLR 0:4% under the across-speaker condition. This shows that out- 9 of-domain acoustic-phonetic knowledge provides complemen- reported systems so far. For the across-speaker condition, our tary information to the in-domain clustering labels for feature proposed systems with combined BNF features have slightly learning. The performance gain of OSBNF2 over OSBNF1, as better performance than VQ-VAE (9:8%). Our systems are well as that of LI-BNF2 over LI-BNF1, confirm the benefit of found to be more effective on long utterances than VQ-VAE. exploiting a wider coverage of language resources. In Table III, it is noted that the performance of VQ-VAE does The performance of OSBNF2 is inferior to OSBNF1 on not depend on utterance duration. For English and Mandarin, Mandarin test set, but not on English and French. It is the ABX error rates are almost exactly the same between the noted that OSBNF1 is learned by using the Cantonese ASR cases of 1s and 120s. One possible reason is that the VQ- senone labels while OSBNF2 is learned by involving Can- VAE system does not perform explicit utterance-level speaker tonese and the other three European languages. Cantonese, normalization on input features. On the contrary, the BNF being a Chinese dialect, is apparently closer to Mandarin representations investigated in the study perform significantly than Czech, Hungarian and Russian in terms of acoustic- better on longer utterances (10s & 120s) than on 1s ones. It is phonetic properties. The experimental results imply that the also noted that our systems are more effective for Mandarin frame labels generated by involving highly-mismatched out- in the across-speaker condition. This may be due to the use of of-domain languages may be of low quality and not suitable Cantonese speech in feature learning. VQ-VAE may be over- for feature learning. fitting to Mandarin due to small data size [28]. (3) As discussed in Section III-D, DPGMM-HMM labels are obtained by modeling temporal dependency of speech C. Effectiveness of label filtering and DPGMM labels are determined with the assumption that neighboring speech frames are independent. Comparing the The effectiveness of the proposed label filtering algorithm is evaluated with the HMM(P)-MUBNF representation, which is corresponding systems in the second and the third groups trained exclusively based on DPGMM-HMM labels, without of Table III, it is noted that DPGMM-HMM labels per- involving out-of-domain speech data. Algorithm 1 requires one form slightly better than DPGMM labels. The ABX error tunable parameter P , i.e., the percentage frame labels to be rates attained with HMM(P)-MUBNF, HMM(P)-LI-BNF1 and retained. The average ABX error rates attained with different HMM(P)-LI-BNF2 are about absolute 0:2% - 0:3% lower values of P are plotted as in Fig. 5. P = 1 means that all than those with MUBNF, LI-BNF1 and LI-BNF2 respectively, labels are kept, which is the setting used to obtain the results except for HMM(P)-LI-BNF2 under the across-speaker condi- in Table III. tion. This demonstrates that capturing temporal dependency in Under both within-speaker and across-speaker conditions, speech is beneficial to feature learning for subword modeling the optimal values of P are in the range of 0:7 to 0:9. That [22]. It is also noted that phone-level HMM alignments are is, when on average about 10 30% of the frame labels better than state-level ones. are removed, the ABX error rates could be slightly reduced. (4) Combining different types of BNF feature representa- This indicates that indeed a certain portion of the labels tions leads to further improvement of performance. Specif- ically, by concatenating HMM(P)-MUBNF, OSBNF1 and are not reliable. However, if too many labels are removed, TLBNF, the best ABX error rates under both within-speaker e.g., more than 30%, the system performance would degrade and across-speaker conditions are achieved (7:4% and 9:7%). significantly, because some good labels are lost. It is found that BNFs learned from in-domain unsupervised The proposed label filtering method is very simple in that data (HMM(P)-MUBNF, OSBNF1) and learned via transfer only the occurrence counts of the labels are considered. Fig. learning (TLBNF) can be jointly used to compose an optimal 5 shows that this criterion is appropriate to a certain extent. feature representation that is better than any individual BNF. However, there may exist infrequent subword units that are The best performance attained in this study is competitive to meaningful and crucial in conveying linguistic content. In the best submitted system for the ZeroSpeech 2017 challenge, [23], [24], it was suggested to reduce the number of DPGMM which is based on the combination of multiple DPGMM clusters without ignoring any frame labels. Since these studies posteriorgrams [20]. These posteriograms were generated with were carried out on a different database, direct comparison of unsupervisedly estimated fMLLRs based on different imple- system performance can not be made. mentation parameters. The combination of posteriorgrams led to 3:0% and 3:3% relative error rate reduction under the D. Comparison of DNN model structures within-speaker and across-speaker conditions, compared to For BNF feature learning with the MTL-DNN approach, the use of single posteriorgram representation. In our work, DNN models other than MLP can be used. Table IV compares concatenating the three aforementioned BNF representations the system performances obtained by using MLP, LSTM results in 5:1% and 4:0% relative error rate reduction, as and BLSTM. The feature representations being investigated compared with the best system with single BNF. It must be include MUBNF, HMM(P)-MUBNF and HMM(P)-LI-BNF1, noted that no out-of-domain transcribed speech was involved and label filtering is not applied. in the system of [20]. In a very recent work [28], vector quantized VAE (VQ-VAE) It is noted that LSTM and BLSTM do not perform as was applied to develop a system of unsupervised subword well as MLP on all three types of BNF representations. modeling. The reported average ABX error rate was 6:7% Experiments were carried out with different parameter settings for within-speaker condition, which is the best among all on LSTM and BLSTM, and the system performance remained 10 TABLE IV COMPARISON OF MTL-DNN SHARED- HIDDEN-LAYER STRUCTURES IN FEATURE REPRESENTATION LEARNING OF ZEROS PEECH 2017. Within-speaker Across-speaker English French Mandarin Avg. English French Mandarin Avg. 1s 10s 120s 1s 10s 120s 1s 10s 120s 1s 10s 120s 1s 10s 120s 1s 10s 120s MLP 7:4 6:9 6:3 9:6 9:0 8:1 9:8 8:8 8:1 8:2 10:9 9:5 8:9 15:2 13:0 12:0 10:5 8:9 8:2 10:8 MUBNF LSTM 7:4 7:1 6:8 10:0 9:5 8:7 10:4 9:5 8:7 8:7 10:4 9:6 9:0 14:6 13:3 12:3 10:9 9:3 8:6 10:9 BLSTM 7:4 7:1 6:7 9:9 9:5 8:9 10:4 9:4 8:7 8:7 10:4 9:6 9:0 14:7 13:3 12:1 10:7 9:3 8:6 10:9 MLP 7:1 6:6 6:2 9:4 9:1 7:8 9:9 8:8 8:2 8:1 10:4 9:2 8:7 14:5 12:7 11:7 10:4 8:9 8:2 10:5 HMM(P)-MUBNF LSTM 7:2 6:8 6:4 9:9 9:4 8:7 10:4 9:5 8:8 8:6 10:0 9:3 8:6 14:3 13:1 11:8 10:7 9:3 8:6 10:6 BSLTM 7:3 6:9 6:5 9:6 9:5 8:4 10:5 9:4 9:0 8:6 10:1 9:4 8:9 14:2 13:0 11:9 10:8 9:4 8:7 10:7 MLP 6:8 6:3 5:8 9:1 8:7 7:8 9:1 8:5 7:6 7:7 9:7 8:7 8:0 13:7 12:3 11:1 9:7 8:4 7:6 9:9 HMM(P)-LI-BNF1 LSTM 6:7 6:6 5:9 9:5 9:4 8:2 9:6 8:9 7:9 8:1 9:6 9:1 8:1 14:1 13:3 11:6 10:2 9:1 8:0 10:3 BLSTM 7:0 6:6 6:1 9:3 9:2 8:2 9:4 8:7 8:0 8:1 9:5 9:0 8:2 13:7 13:0 11:6 9:7 8:7 7:8 10:1 Across-speaker 14 14 10.7 MLP 12 12 LSTM 10.6 BLSTM 10 10 10.5 8 8 10.4 0.6 0.7 0.8 0.9 1 6 6 Percentage of preservation EN FR MA EN FR MA Within-Speaker 8.4 Fig. 6. Average ABX error rates (%) of HMM(P)-MUBNF representation 8.2 over utterance lengths for each language. Left: Across-speaker; Right: Within- speaker. 7.8 guages. In the case of low-resource languages, the challenge 0.6 0.7 0.8 0.9 1 of lacking transcribed data could be translated into the prob- Percentage of preservation lem of acquiring high-quality labels to facilitate supervised DNN training. Commonly used approaches to tackling this problem include applying clustering algorithms on short-time Fig. 5. Average ABX error rates (%) with respect to label filtering percentage speech frames and leveraging a language-mismatched phone over three zero-resource languages, in HMM(P)-MUBNF representation. recognizer to decode input speech. In this paper, it has been demonstrated that learning of robust BNF representations largely unchanged. Fig. 6 gives the performances of HMM(P)- could be achieved by joint contributions from a variety of MUBNF learned by MLP, LSTM and BLSTM for each target techniques, including: (1) use of speaker adapted features; (2) language. For English (EN), different DNN structures have considering temporal dependency in speech when performing similar performance. For French (FR) and Mandarin (MA), frame clustering; (3) increasing phonetic diversity by involving the advantage of MLP over (B)LSTM is more prominent. multiple out-of-domain languages; (4) discarding unreliable This may be related to that the amount of training data frame labels in DNN training. for English is significantly greater than those for French The proposed methods of feature learning have been evalu- and Mandarin. The advantage of LSTM and BLSTM over ated on the standard task of unsupervised subword modeling in MLP in conventional supervised acoustic modeling has been the ZeroSpeech 2017 Challenge. The experimental results have widely recognized and attributed to the capability of capturing shown that effective speaker adaptation with untranscribed temporal characteristics of speech. With limited training data, training data could be achieved by using an out-of-domain the benefits of recurrent structures can not be fully exploited. ASR system. Out-of-domain ASR systems from resource- In our systems, contextual information is incorporated via the rich languages can also be utilized to provide phonetically use of DPGMM-HMM labels and its effectiveness has been informed labels to support multi-task learning of BNFs, in demonstrated by the experimental results. conjunction with the learning tasks based on DPGMM-HMM clustering labels. Combining different types of BNFs by vector VI. CONCLUSIONS concatenation leads to further performance improvement. The BNFs learned from multilingual speech data have been best performance achieved by our proposed system is 9:7% in proven highly effective for acoustic modeling of spoken lan- terms of across-speaker triphone minimal-pair ABX error rate. ABX error rates (%) ABX error rates (%) ABX error rate (%) 11 It is equal to the performance of the best submitted system in [20] M. Heck, S. Sakti, and S. Nakamura, “Feature optimized DPGMM clus- tering for unsupervised subword modeling: A contribution to zerospeech the ZeroSpeech 2017 and better than other recently reported 2017,” in Proc. ASRU, 2017, pp. 740–746. systems. [21] C. Manenti, T. Pellegrini, and J. Pinquier, “Unsupervised speech unit In principle, the proposed methods are expected to be discovery using k-means and neural networks,” in Proc. SLSP, 2017, pp. 169–180. effective for any combination of languages other than those [22] M. Heck, S. Sakti, and S. Nakamura, “Iterative training of a DPGMM- in ZeroSpeech 2017. Nevertheless, our investigation has sug- HMM acoustic unit recognizer in a zero resource scenario,” in Proc. gested that the closeness between target languages and out-of- SLT, 2016, pp. 57–63. [23] ——, “Dirichlet process mixture of mixtures model for unsupervised domain languages and the amount of available training data subword modeling,” IEEE/ACM TASLP, vol. 26, no. 11, pp. 2027–2042, for individual target languages might have significant impact on the goodness of learned features. [24] B. Wu, S. Sakti, J. Zhang, and S. Nakamura, “Optimizing DPGMM clustering in zero-resource setting based on functional load,” in Proc. SLTU, 2018, pp. 1–5. [25] R. Caruana, “Multitask learning,” in Learning to learn. Springer, 1998, REFERENCES pp. 95–133. [26] M. Heck, S. Sakti, and S. Nakamura, “Supervised learning of acoustic [1] G. Saon, G. Kurata, T. Sercu, K. Audhkhasi, S. Thomas, D. Dimitriadis, models in a zero resource setting to improve DPGMM clustering,” in X. Cui, B. Ramabhadran, M. Picheny, L.-L. Lim, B. Roomi, and P. Hall, Proc. INTERSPEECH, 2016, pp. 1310–1314. “English conversational telephone speech recognition by humans and [27] D. Renshaw, H. Kamper, A. Jansen, and S. Goldwater, “A comparison machines,” in Proc. INTERSPEECH, 2017, pp. 132–136. of neural network methods for unsupervised representation learning on [2] T. Hori, S. Watanabe, Y. Zhang, and W. Chan, “Advances in joint CTC- the zero resource speech challenge,” in Proc. INTERSPEECH, 2015, pp. attention based end-to-end speech recognition with a deep CNN encoder 3199–3203. and RNN-LM,” in Proc. INTERSPEECH, 2017, pp. 949–953. [28] J. Chorowski, R. J. Weiss, S. Bengio, and A. v. d. Oord, “Unsuper- [3] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, vised speech representation learning using wavenet autoencoders,” arXiv A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, preprint arXiv:1901.08810, 2019. “Deep neural networks for acoustic modeling in speech recognition: [29] Y. Yuan, C.-C. Leung, L. Xie, H. Chen, B. Ma, and H. Li, “Extracting The shared views of four research groups,” IEEE Signal Processing bottleneck features and word-like pairs from untranscribed speech for Magazine, vol. 29, no. 6, pp. 82–97, 2012. feature representations,” in Proc. ASRU, 2017, pp. 734–739. [4] A. Ragni, E. Dakin, X. Chen, M. J. Gales, and K. M. Knill, “Multi- [30] H. Chen, C. Leung, L. Xie, B. Ma, and H. Li, “Multitask feature learning language neural network language models.” in Proc. INTERSPEECH, for low-resource query-by-example spoken term detection,” J. Sel. Topics 2016, pp. 3042–3046. Signal Processing, vol. 11, no. 8, pp. 1329–1339, 2017. [5] H. Shibata, T. Kato, T. Shinozaki, and S. Watanabe, “Composite em- [31] T. Tsuchiya, N. Tawara, T. Ogawa, and T. Kobayashi, “Speaker invariant bedding systems for zerospeech2017 track 1,” in Proc. ASRU, 2017, pp. feature extraction for zero-resource languages with adversarial learning,” 747–753. in Proc. ICASSP, 2018, pp. 2381–2385. [6] E. Dunbar, X.-N. Cao, J. Benjumea, J. Karadayi, M. Bernard, L. Be- [32] T. K. Ansari, R. Kumar, S. Singh, S. Ganapathy, and S. Devi, “Unsuper- sacier, X. Anguera, and E. Dupoux, “The zero resource speech challenge vised HMM posteriograms for language independent acoustic modeling 2017,” in Proc. ASRU, 2017, pp. 323–330. in zero resource conditions,” in Proc. ASRU, 2017, pp. 762–768. [7] J. Glass, “Towards unsupervised speech processing,” in Proc. ISSPA, [33] Y. Qiao, N. Shimomura, and N. Minematsu, “Unsupervised optimal 2012, pp. 1–4. phoneme segmentation: Objectives, algorithm and comparisons,” in [8] H. Kamper, A. Jansen, and S. Goldwater, “Fully unsupervised small- Proc. ICASSP, 2008, pp. 3989–3992. vocabulary speech recognition using a segmental bayesian model,” in [34] S. Feng, T. Lee, and H. Wang, “Exploiting language-mismatched Proc. INTERSPEECH, 2015, pp. 678–682. phoneme recognizers for unsupervised acoustic modeling,” in Proc. [9] M. Versteegh, R. Thiollière, T. Schatz, X.-N. Cao, X. Anguera, ISCSLP, 2016, pp. 1–5. A. Jansen, and E. Dupoux, “The zero resource speech challenge 2015.” [35] M.-L. Sung, S. Feng, and T. Lee, “Unsupervised pattern discovery from in Proc. INTERSPEECH, 2015, pp. 3169–3173. thematic speech archives based on multilingual bottleneck features,” in [10] H. Chen, C.-C. Leung, L. Xie, B. Ma, and H. Li, “Unsupervised Accepted by APSIPA ASC, 2018. bottleneck features for low-resource query-by-example spoken term [36] C.-H. Lee, F. K. Soong, and B.-H. Juang, “A segment model based detection,” in INTERSPEECH, 2016, pp. 923–927. approach to speech recognition,” in Proc. ICASSP, 1988, pp. 501–504. [11] Y. Yuan, C.-C. Leung, L. Xie, H. Chen, B. Ma, and H. Li, “Pairwise [37] H. Gish and K. Ng, “A segmental speech model with applications to learning using multi-lingual bottleneck features for low-resource query- word spotting,” in Proc. ICASSP, vol. 2, 1993, pp. 447–450. by-example spoken term detection,” in Proc. ICASSP, 2017, pp. 5645– [38] H. Wang, T. Lee, C.-C. Leung, B. Ma, and H. Li, “Acoustic segment modeling with spectral clustering methods,” IEEE/ACM Trans. ASLP, [12] H. Chen, C.-C. Leung, L. Xie, B. Ma, and H. Li, “Multilingual bottle- vol. 23, no. 2, pp. 264–277, 2015. neck feature learning from untranscribed speech,” in Proc. ASRU, 2017, [39] S. Bhati, S. Nayak, and K. S. R. Murty, “Unsupervised speech signal to pp. 727–733. symbol transformation for zero resource speech applications,” in Proc. [13] T. K. Ansari, R. Kumar, S. Singh, and S. Ganapathy, “Deep learning INTERSPEECH, 2017, pp. 2133–2137. methods for unsupervised acoustic modeling - LEAP submission to [40] H. Kamper, A. Jansen, and S. Goldwater, “Unsupervised word segmenta- zerospeech challenge 2017,” in Proc. ASRU, 2017, pp. 754–761. tion and lexicon discovery using acoustic word embeddings,” IEEE/ACM [14] S. Feng and T. Lee, “Exploiting speaker and phonetic diversity of TASLP, vol. 24, no. 4, pp. 669–679, 2016. mismatched language resources for unsupervised subword modeling,” [41] A. Martinet, “Economie des changements phonétiques,” 1970. in Proc. INTERSPEECH, 2018, pp. 2673–2677. [42] C. F. Hockett, A manual of phonology. Waverly Press, 1955, no. 11. [15] G. Synnaeve and E. Dupoux, “Weakly supervised multi-embeddings [43] M. Heck, S. Sakti, and S. Nakamura, “Unsupervised linear discrimi- learning of acoustic models,” arXiv preprint arXiv:1412.6645, 2014. nant analysis for supporting DPGMM clustering in the zero resource [16] H. Kamper, M. Elsner, A. Jansen, and S. Goldwater, “Unsupervised neu- scenario,” in Proc. SLTU, 2016, pp. 73–79. ral network based feature extraction using weak top-down constraints,” [44] F. Grézl, M. Karafiát, and L. Burget, “Investigation into bottle-neck in Proc. ICASSP, 2015, pp. 5818–5822. features for meeting speech recognition,” in Proc. INTERSPEECH, [17] E. Hermann, H. Kamper, and S. Goldwater, “Multilingual and unsuper- 2009, pp. 2947–2950. vised subword modeling for zero-resource languages,” arXiv preprint [45] H. Sak, A. W. Senior, and F. Beaufays, “Long short-term memory arXiv:1811.04791, 2018. recurrent neural network architectures for large scale acoustic modeling.” [18] J. Chang and J. W. Fisher III, “Parallel sampling of DP mixture models in Proc. INTERSPEECH, 2014, pp. 338–342. using sub-cluster splits,” in Advances in NIPS, 2013, pp. 620–628. [46] A. Graves, N. Jaitly, and A.-r. Mohamed, “Hybrid speech recognition [19] H. Chen, C.-C. Leung, L. Xie, B. Ma, and H. Li, “Parallel inference of with deep bidirectional LSTM,” in Proc. ASRU, 2013, pp. 273–278. Dirichlet process Gaussian mixture models for unsupervised acoustic [47] P. Swietojanski, A. Ghoshal, and S. Renals, “Unsupervised cross-lingual modeling: A feasibility study,” in Proc. INTERSPEECH, 2015, pp. knowledge transfer in DNN-based LVCSR,” in Proc. SLT, 2012, pp. 3189–3193. 246–251. 12 [48] N. Zeghidour, G. Synnaeve, M. Versteegh, and E. Dupoux, “A deep ICSLP, 2002, pp. 901–904. scattering spectrum-deep siamese network pipeline for unsupervised [52] P. Schwarz, “Phoneme recognition based on long temporal context,” acoustic modeling,” in Proc. ICASSP, 2016, pp. 4965–4969. PhD Tesis. Brno University of Technology., 2009. [49] T. Lee, W. K. Lo, P. C. Ching, and H. Meng, “Spoken language [53] H. v. d. Heuvel, J. Boudy, Z. Bakcsi, J. Cernocky, V. Galunov, J. Kochan- resources for Cantonese speech processing,” Speech Communication, ina, W. Majewski, P. Pollak, M. Rusko, J. Sadowski et al., “SpeechDat- vol. 36, no. 3, pp. 327–342, 2002. E: Five eastern european speech databases for voice-operated teleser- [50] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, vices completed,” in Proc. INTERSPEECH, 2001. M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., “The Kaldi [54] R. J. Williams and J. Peng, “An efficient gradient-based algorithm for speech recognition toolkit,” in Proc. ASRU, 2011. on-line training of recurrent network trajectories,” Neural computation, [51] A. Stolcke, “SRILM – an extensible language modeling toolkit,” in Proc. vol. 2, no. 4, pp. 490–501, 1990. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Electrical Engineering and Systems Science arXiv (Cornell University)

Exploiting Cross-Lingual Speaker and Phonetic Diversity for Unsupervised Subword Modeling

Electrical Engineering and Systems Science , Volume 2019 (1908) – Aug 9, 2019

Loading next page...
 
/lp/arxiv-cornell-university/exploiting-cross-lingual-speaker-and-phonetic-diversity-for-khm4mznNtp

References (53)

ISSN
2329-9290
eISSN
ARCH-3348
DOI
10.1109/TASLP.2019.2937953
Publisher site
See Article on Publisher Site

Abstract

Exploiting Cross-Lingual Speaker and Phonetic Diversity for Unsupervised Subword Modeling Siyuan Feng and Tan Lee Abstract—This research addresses the problem of acoustic Unsupervised speech modeling is the task of building sub- modeling of low-resource languages for which transcribed train- word or word-level AMs, when only untranscribed speech ing data is absent. The goal is to learn robust frame-level feature are available for training [7]–[9]. This is also known as representations that can be used to identify and distinguish the zero-resource problem, which has attracted increasing subword-level speech units. The proposed feature representations research interest in recent years. The Zero Resource Speech comprise various types of multilingual bottleneck features (BNFs) that are obtained via multi-task learning of deep neural networks Challenge 2015 (ZeroSpeech 2015) [9] and 2017 (ZeroSpeech (MTL-DNN). One of the key problems is how to acquire high- 2017) [6] precisely focused on unsupervised speech modeling. quality frame labels for untranscribed training data to facilitate ZeroSpeech 2017 was organized to tackle two sub-problems, supervised DNN training. It is shown that learning of robust namely unsupervised subword modeling (Track 1) and spoken BNF representations can be achieved by effectively leveraging term discovery (STD) (Track 2). The present study addresses transcribed speech data and well-trained automatic speech recog- nition (ASR) systems from one or more out-of-domain (resource- the Track 1 problem and aims to learn frame-level feature rich) languages. Out-of-domain ASR systems can be applied to representation that is effective in identifying and discriminat- perform speaker adaptation with untranscribed training data of ing subword-level units and robust to irrelevant factors, e.g., the target language, and to decode the training speech into frame- speaker and/or channel variation, emotion, etc. Robust feature level labels for DNN training. It is also found that better frame representations obtained by learning from data have been labels can be generated by considering temporal dependency in speech when performing frame clustering. The proposed found to be preferable to conventional spectral features like methods of feature learning are evaluated on the standard task Mel-frequency cepstral coefficients (MFCCs) for downstream of unsupervised subword modeling in Track 1 of the ZeroSpeech applications [10], [11]. 2017 Challenge. The best performance achieved by our system DNN models are commonly adopted in frame-level feature is 9:7% in terms of across-speaker triphone minimal-pair ABX learning for unsupervised subword modeling. A DNN model error rate, which is comparable to the best systems reported recently. Lastly, our investigation reveals that the closeness is typically trained using available speech data. The learned between target languages and out-of-domain languages and the features are obtained either from a designated low-dimension amount of available training data for individual target languages hidden layer of the DNN, known as the bottleneck features could have significant impact on the goodness of learned features. (BNFs) [12], or from the softmax output layer, known as Index Terms—zero resource, unsupervised learning, robust the posterior features or posteriorgram [13]. To facilitate features, speaker adaptation, multi-task learning supervised training of the DNN, target labels of training speech are needed. In zero-resource scenarios, the key problem is how to generate informative frame-level labels in the absence of I. I NTRODUCTION speech transcription. One of the possible approaches is based on unsupervised clustering of training speech. The frame-level TATE-OF-THE-ART automatic speech recognition (ASR) cluster indices can be used as target labels for DNN training systems have demonstrated fairly impressive performance [11]–[13]. Another approach seeks to use pre-trained out- in terms of word accuracy [1], [2]. This is mainly attributed of-domain ASR systems to tokenize untranscribed in-domain to the advances of deep neural network (DNN) based acoustic speech and hence each frame is assigned with an ASR senone models (AMs) and language models (LMs) [3], [4]. Typically label [5], [14]. Fully unsupervised [13] or weakly supervised a well-trained DNN-based AM requires hundreds to thousands [15]–[17] methods for DNN training were also reported in the of hours of transcribed speech. As a matter of fact, high- research on acoustic modeling for low-resource languages. performance ASR systems are available only for major lan- The present study adopts the general framework of su- guages [5]. Even for resource-rich languages, preparing tran- pervised DNN training for the purpose of extracting BNF scriptions for available training data is a time-consuming task as the learned feature representation. We attempt to improve that involves considerable human effort. For many languages the efficacy and performance of learned features along two in the world, very little or no transcribed speech is available directions. First, advanced unsupervised acoustic modeling [6], and conventional acoustic modeling techniques are simply techniques are explored to generate initial frame-level labels not applicable. for supervised DNN training. Second, speaker adaptation S. Feng and T. Lee are with the Department of Electronic Engineering, techniques are applied to make input speech features more The Chinese University of Hong Kong, Hong Kong SAR, China (e-mail: robust to speaker variation. siyuanfeng@link.cuhk.edu.hk; tanlee@ee.cuhk.edu.hk). Dirichlet process Gaussian mixture model (DPGMM) is This research is partially supported by a GRF project grant (Ref: CUHK 14227216) from Hong Kong Research Grants Council. commonly used for clustering of unlabelled speech frames arXiv:1908.03538v2 [eess.AS] 29 Sep 2019 2 [18]. It demonstrated superior performance on the tasks in clusions. ZeroSpeech Challenges [19], [20]. However, DPGMM clus- tering, as well as other conventional clustering algorithms II. R ELATED WORKS like k-means [21] and GMM [13], assumes that neighboring A. Deep learning approaches to unsupervised subword mod- speech frames are independent of each other. This is obviously eling not in accordance with the nature of speech. To address A variety of DNN models have been investigated towards this limitation, a full-fledged Gaussian mixture model-hidden unsupervised subword modeling. They include multi-layer Markov model (GMM-HMM) AM is trained to capture con- perceptron (MLP) [12], auto-encoder (AE) [13], correspon- textual information in speech. The transcriptions required for dence AE (cAE), denoising AE (dAE) [27], variational AE GMM-HMM training are initialized via DPGMM clustering. (VAE) [28] and siamese network [29]. In terms of training Following the terminology in [22], this model is referred to as strategies, these models can be classified into three categories, DPGMM-HMM. We use the DPGMM-HMM AM to generate namely, supervised (MLP), unsupervised (AE, VAE, dAE) frame-level labels to support BNF representation learning. In and weakly/pair-wise supervised (cAE, siamese network). [22], a similar approach was adopted for learning feature- Supervised DNN training requires frame-level labels for all space maximum likelihood linear regression (fMLLR) and training speech, which could be obtained either via a clustering posteriorgram features. process or exploiting out-of-domain resources. In [11], [12], In unsupervised subword modeling, the outcome of frame DPGMM clustering was performed on conventional short-time clustering ideally comprises a set of clusters that correspond spectral features of target speech, followed by multilingual to phoneme-related speech units. The underlying assumption DNN training to obtain the BNF representation. In [13], is that speech frames identified as the same phoneme should GMM-universal background model (GMM-UBM) was used have homogeneous acoustic properties. In practice, speaker to generate frame labels. A DNN was trained using these and environment variations would inevitably impact the re- labels to generate BNF or posteriorgram representation. In liability of frame clustering results. Our preliminary experi- [5], [14], language-mismatched ASR systems were utilized to ments showed that applying DPGMM typically results in an decode the target speech, and frame labels were generated excessive number of fine-grained clusters. Similar observations from the ASR decoding lattices. In [30], BNF representation were reported in [23], [24]. These over-fragmented clusters was generated by applying multi-task learning with both in- may adversely affect the effectiveness of unsupervised speech domain and out-of-domain data [25]. The frame labels for modeling. In this work we develop and apply a new algorithm out-of-domain data were obtained by HMM forced alignment, to filter out infrequent labels in DPGMM clustering results, while the labels for in-domain data were from DPGMM and experimentally validate its effectiveness. clustering [12]. In [5], [14], [31], a DNN AM was trained In addition to the DPGMM-HMM labels, a different type with transcribed data of an out-of-domain language, and used of frame labels can be obtained using one or more out-of- to extract BNFs or posteriorgrams from target speech. domain ASR systems [5], [14]. While the DPGMM-HMM Unsupervised DNN training does not require any kind frame labels incorporate statistical information of the acoustic of target labels. For example, an AE model generates non- properties of target speech, the ASR senone labels leverage the linear embeddings of input speech and meanwhile learn to phonetic information acquired from out-of-domain languages. reconstruct the same speech from the embeddings. Recently, We propose to exploit their complementarity in DNN based weakly-supervised model training is studied extensively [15]– feature learning by applying the multi-task learning strategy [17]. In the cAE model [27], a pair of speech segments that [25]. contain the same linguistic unit (word or subword) are used Numerous studies have demonstrated the benefit of applying as the input and output for training, with the objective of speaker adaptation on input features for unsupervised subword minimizing the reconstruction error. In a siamese network, modeling [12], [26]. In the present study, we propose to exploit the input comprises two speech segments. The network is cross-lingual speech data in fMLLR-based speaker adaptation. trained to determine whether the segments are from the same Specifically, transcribed speech from a resource-rich language linguistic unit or not. These models were shown to achieve is used to train an out-of-domain ASR system. This sys- better performance than unsupervised models [27]. However, tem is then applied to the zero-resource target languages for zero-resource languages, such pair-wise knowledge may for estimating linear discriminant analysis (LDA), maximum not be directly available. likelihood linear transform (MLLT) and fMLLR transforms on conventional spectral features. We advocate that this approach B. Unsupervised subword modeling without using DNN is effective and practically desirable as transcribed speech data of resource-rich languages are relatively easy to access. There were numerous studies on unsupervised subword The remainder of this paper is organized as follows. Sec- modeling without involving deep learning models. In these tion II provides a review of related works on unsupervised studies, clustering of short-time frame features is an important subword modeling with untranscribed speech. In Section III, first step. After frame clustering, each cluster is represented by we provide detailed description on the proposed approaches a learned probability distribution, and the cluster posteriorgram to feature learning. Section IV introduces experimental design can be regarded as the learned representation for subword on ZeroSpeech 2017 development data. Section V discusses modeling. Frame clustering could be done straightforwardly and analyzes experimental results. Section VI gives the con- by applying k-means [21], GMM [32] and DPGMM [19] 3 algorithms. In [19], DPGMM clustering was applied to a zero- processed by an out-of-domain ASR system, where VTLN, resource target language. An extension of this approach was LDA, MLLT and fMLLR transforms are estimated sequen- reported in [20], where clustering was performed with fMLLR- tially. The DPGMM clustering algorithm is applied to the based speaker-adapted features. In [32], GMM posteriorgram fMLLR features of target speech. The resulted frame labels are and HMM posteriorgram were compared, where the HMM post-processed by a label filtering algorithm and then used for was trained based on GMM-UBM clustering results. context-dependent GMM-HMM (CD-GMM-HMM) acoustic To better retain temporal dependency in speech, frame modeling. The trained AMs forced align target speech to clustering can be embodied in segment level. Initial segmen- generate DPGMM-HMM alignments. Subsequently, an MTL- tation of speech utterances could be obtained by hierarchical DNN is trained to generate BNFs for subword modeling. agglomerative clustering [33], or using language-mismatched The training tasks of MTL include DPGMM-HMM alignment prediction and language-mismatched label prediction of mul- phone recognizers [34], [35]. Subsequently a fixed-length feature vector is derived to represent each speech segment. tiple target languages. The language-mismatched labels are Clustering of segment-level feature vectors was tackled using generated by multiple out-of-domain ASR systems. a range of algorithms, including vector quantization (VQ) [36], The proposed system design emphasizes on leveraging segmental GMM (SGMM) [37], spectral clustering [38] and speech data resources from out-of-domain languages [5], [14]. graph clustering [39]. In [40], segmentation and clustering This is realized in the following aspects: were integrated as a jointly optimized process. Use out-of-domain data to perform fMLLR speaker adap- The present study is on one hand largely based on DNN tation on target speech. modeling of speech, and on the other hand incorporates the Use out-of-domain ASR systems to generate frame labels ideas of frame clustering (as the initial tokenization) [19], to facilitate multi-task DNN training. fMLLR-based speaker adaptation [20], and use of HMM to Use an out-of-domain DNN AM to extract BNFs. capture temporal dependency [32]. A. Speaker adaptation with out-of-domain data C. Optimizing DPGMM clustering For resource-rich languages, a large amount of transcribed DPGMM clustering has been shown to be a preferred and speaker-annotated speech data are readily available. We method of frame labeling for unsupervised subword modeling propose to utilize these out-of-domain data to model speaker [19], [20]. Nevertheless, one shortcoming of DPGMM is variation in untranscribed speech of the target speech. A that it tends to produce over-fragmented speech units [23], conventional CD-GMM-HMM AM is trained using the out- [24]. Different approaches have been proposed to tackle this of-domain data. Based on this model, VTLN, LDA, MLLT problem. In [23], DPGMMs were replaced by the Dirichlet and fMLLR transforms can be estimated. Subsequently, CD- process mixture of mixtures model (DPMoMM) to enable GMM-HMM AMs with speaker adaptive training (CD-GMM- multi-modal cluster inference. In [24], small-sized clusters HMM-SAT) are trained and used to estimate fMLLR trans- were merged based on low functional load [41], [42]. In our forms for target speech utterances. It must be noted that the work, this problem is tackled by a label filtering algorithm. estimated fMLLR features of target speech could be directly DPGMM for frame labeling could be optimized at input used for subword modeling. They are expected to provide feature level. Conventional spectral features like MFCC [19] a better baseline than the conventional spectral features like and perceptual linear prediction (PLP) [26] were commonly MFCCs or PLPs. used as the initial representations of target speech. Albeit straightforward, these features are considered sub-optimal for B. Frame labeling unsupervised subword modeling, as they contain a lot of irrel- evant information such as speaker identity and emotion. Heck 1) DPGMM clustering: DPGMM is a non-parametric et al. [26], [43] found that fMLLR transforms can noticeably Bayesian extension to GMM, where a Dirichlet process prior suppress speaker-related feature variation, and advocated the replaces the vanilla GMM. One advantage of DPGMM clus- importance of speaker adaptation in the concerned task. To tering is that the cluster number does not need to be pre- enable supervised estimation of fMLLRs, clustering results on defined. Let us consider M zero-resource target languages. For spectral features were taken as pseudo transcriptions. Chen et an utterance from the i-th language, the frame-level features i i i al. [12] showed that vocal tract length normalization (VTLN) are denoted as fx ;x ;: : :;x g, where L is the number 1 2 on top of spectral features contribute to generating more robust of frames in the utterance. By applying DPGMM clustering, DPGMM frame labels. In our study, fMLLR features are K clusters are obtained and represented with k Gaussian i i i estimated by exploiting an out-of-domain ASR system. components. The frame labels fl ; l ; : : : ; l g are given as, 1 2 L i i III. PROPOSED S YSTEM l = arg max Prob(kjx ); (1) t t 1kK The proposed system framework for unsupervised subword i i modeling of zero-resource languages is illustrated as in Fig. 1. where Prob(kjx ) denotes the posterior probability of x with t t It comprises three modules, namely, speaker-adapted feature respect to the k-th Gaussian component. The inference of extraction, unsupervised acoustic modeling, and multi-task DPGMM parameters can be performed using the algorithm BNF learning. Speech frames of the target language are first as described in [18]. 4 Unsuperivsed acoustic modeling Out-of-domain languages Speaker adapted DPGMM Label GMM-HMM feature extraction fMLLRs CA ASR clustering filtering training MFCCs Multiple ASRs Decode and Zero-resource align languages Forced-alignment Labels Labels MTL-DNN BNF extraction Evaluation training Fig. 1. The proposed framework of unsupervised subword modeling. Let P be the percentage of frame labels that we aim to Descending order retain. These frames are from K “dominant” clusters, where cut  90  80  100  50  60  20 c c c c c c 1 2 3 4 5 6 K c ^ k=1 K = arg min  P: (4) cut 0 N c c c c c c 3 1 2 5 4 6 O denotes the collection of all frame labels that are re- ˆ ˆ ˆ ˆ ˆ ˆ c c c c c c moved, i.e., 1 2 3 4 5 6 n o m (3)1 m (1)2 m ( 2)3 m (5)4 m ( 4)5 m ( 6)6 O = l : l 2 F; i 2 f1; 2; : : : ; Ng ; (5) i i where n o Fig. 2. Example of cluster size sorting. F = m(K + 1); : : : ; m(K ) : (6) cut F contains indices of K K clusters that are the cut 2) Out-of-domain ASR decoding: Given a speech utterance least frequent to occur. Frames assigned to these clusters are in the target language, an out-of-domain ASR system can considered as outliers. be applied to generate a sequence of phone-level or state- In the extreme case when P is set to 1, F and O will level labels [14]. The idea can be naturally extended to using be empty sets. The smaller the value of P , the larger the multiple out-of-domain ASR systems and desirably providing proportion of filtered frame labels. The label filtering algorithm a wide coverage of phonetic diversity. The outcome of ASR is summarized as in Algorithm 1. decoding depends on the relative weighting of AM and LM. In our work, the LM is assigned a very small weight, such that Algorithm 1 DPGMM label filtering algorithm the acquired frame labels mainly reflect acoustic properties of Input: l ; l ; : : : ; l , P 1 2 N the target speech being modeled. Output: O 1: Calculate c by Equation (2). C. DPGMM label filtering 2: Sort fc ; c ; : : : ; c g in descending order. 1 2 K For a specific target language, let us assume that K 3: Calculate m(k) by Equation (3). Gaussian components (clusters) are obtained by DPGMM 4: Calculate K by Equation (4) and P . cut clustering. The frame labels are denoted as l ; l ; : : : ; l for 5: Select a subset of l ; l ; : : : ; l asO, by Equation (5)&(6). 1 2 N 1 2 N an N -frame utterance. Let c be the number of frames labeled . Frame labels that are removed. as cluster k, i.e., c = 1(l = k); k 2 f1; 2; : : : ; Kg; (2) k i D. DPGMM-HMM acoustic modeling i=1 Each DPGMM cluster can be regarded as a pseudo phone. where 1() is the indicator function. The sequence of DPGMM frame labels (after filtering) can The elements in fc ; c ; : : : ; c g are sorted in descending 1 2 K be converted into a pseudo transcription by collapsing neigh- order into fc ^ ; c ^ ; : : : ; c ^ jc ^  c ^  : : :  c ^ g. m() denotes 1 2 K 1 2 K boring duplicated labels, e.g., “1,3,3,3,7,10,10” ! “1,3,7,10”. the index mapping function, i.e., Based on the pseudo transcription, HMM acoustic model- ing is done by following the standard supervised training c ^ = c : (3) k m(k) pipeline, i.e., proceeding from monophone model training with Fig. 2 gives an example of cluster size sorting. uniform time alignment to context-dependent GMM-HMM 5 (CD-GMM-HMM). The trained AM is used to produce time alignment information for DNN-based subword discriminative Alignments for Alignments for OOD Lang OOD Lang modeling (will be discussed in Section III-E). To be distin- Lang 1 Lang M label 1 label N …... …... guished from the DPGMM frame labels, the frame labels obtained from the HMM forced alignment are referred to as DPGMM-HMM labels. BNF for subword Although the DPGMM labels could be directly used for discriminability task supervised DNN acoustic modeling [12], [14], we expect that Tasks for MUBNF DPGMM-HMM labels are more reasonable as they are derived Tasks for OSBNF with consideration on contextual dependency of speech. Tasks for LI-BNF Lang 1 Lang M …... fMLLRs fMLLRs E. Multi-task learning for BNFs The bottleneck feature (BNF) is a type of representation Fig. 3. MTL-DNN for extracting LI-BNF, MUBNF and OSBNF. The term “OOD” stands for out-of-domain. obtained from a designated low-dimension hidden layer of a DNN. In ASR applications, BNFs have been shown to provide a compact and phonetically-discriminative representation of IV. E XPERIM ENTAL SETUP input speech, and be effective in suppressing linguistically- A. Dataset and evaluation metric irrelevant variations [44]. In the context of zero-resource Experiments are carried out with the development data speech modeling, BNFs have also been widely investigated of ZeroSpeech 2017 Track 1 [6]. The data covers three [5], [12], [14], [17]. target languages, namely English, French and Mandarin. For The proposed MTL-DNN is depicted in Fig 3. The DNN each language, there are separate training set and test set of training involves a total of M + N tasks, which involves untranscribed speech. Speaker identity information is provided M zero-resource target languages and N out-of-domain ASR for the train sets but not available for the test sets. The test systems. Each of the tasks is represented by a task-specific data are organized into subsets of different utterance lengths: 1 softmax output layer in the DNN. The hidden layers, including second, 10 second and 120 second. Detailed information about a low-dimension linear BN layer, are shared across all tasks. the dataset are given as in Table I. For the zero-resource language tasks, state-level or phone-level DPGMM-HMM labels are used as target labels. The decoding TABLE I output from each of the out-of-domain ASR systems provides D EVELOPM ENT DATA OF ZEROS PEECH 2017 T RACK 1 one set of frame-level labels for MTL. For the MTL-DNN trained only on the M target language Training Test tasks, the extracted BNFs are referred to as multilingual Duration (hours) # speakers Duration (hours) unsupervised BNFs (MUBNFs). When out-of-domain ASR English 45 60 27 tasks are added, the BNFs are named language-independent French 24 18 18 BNFs (LI-BNFs). In the case that only the out-of-domain ASR Mandarin 2:5 8 25 tasks are involved, the extracted BNFs are referred to as out- of-domain supervised BNFs (OSBNFs). The evaluation metric adopted for ZeroSpeech 2017 Track The DPGMM-HMM labels are obtained through statistical 1 task is the ABX subword discriminability. Inspired by the modeling of target speech. The ASR senone labels leverage the match-to-sample task in human psychophysics, it is a simple phonetic knowledge acquired from out-of-domain languages. method to measure the discriminability between two categories It is expected that they would contribute complementarily in of speech units [9]. The basic ABX task is to decide whether feature learning. Learning from speech of multiple languages X belongs to x or y, if A belongs to x and B belongs would result in a language-independent BNF representation to y, where A, B and X are three data samples, x and y that is more generalizable to unknown languages. are the two pattern categories concerned. The performance For the shared-hidden-layer structure in the MTL-DNN, evaluation in ZeroSpeech 2017 is carried out on the triphone multi-layer perceptron (MLP) is commonly used [12]–[14], minimal-pair task. A triphone minimal pair comprises two [31]. In this study, in addition to MLP, we investigate the use triphone sequences, which have different center phones and of long short-term memory (LSTM) [45] and bi-directional identical context phones, for examples, “beg”-“bag”, “api”- LSTM (BLSTM) [46], which were shown to perform better “ati”. Discriminating triphone minimal pairs is a non-trivial than MLP in conventional supervised acoustic modeling. task. The performance of a feature representation on the On the other hand, BNF representation can also be obtained triphone minimal-pair ABX task is considered a good indicator from the DNN AM pre-trained for a resource-rich language of its efficacy in speech modeling [48]. [5]. This is considered as a transfer learning approach [47]. Let x and y denote a pair of triphone categories. Consider This transfer learning BNF (TLBNF) is expected to further three speech segments A, B and X , where A and X belong to enrich the feature representation and will be jointly used with category x and Y belongs to y. The ABX discriminability of MUBNF, OSBNF and LI-BNF for subword modeling. x from y is measured in terms of the ABX error rate (x; y), 6 which is defined as the probability that the distance of A from trained with transcriptions of CUSENT training data is used X is greater than that of B from X , i.e., during decoding. The LM is trained with SRILM [51]. The other three out-of-domain ASR systems are all phone X X X recognizers developed by Brno University of Technology [52]. (x; y) = jS(x)j(jS(x)j 1)jS(y)j The recognizers adopt a 3-layer MLP structure, in which the A2S(x) B2S(y) X2S(x)nfAg first two are sigmoid layers and the third is a softmax layer. (1 + 1 ); d(A;X)>d(B;X) d(A;X)=d(B;X) They were trained with the SpeechDat-E databases [53]. The (7) numbers of modeled phones in Czech, Hungarian and Russian are 45; 61 and 52, respectively. The training data sizes are 9:7, where S(x) and S(y) denote the sets of features that represent 7:9 and 14:0 hours, respectively. The cross-entropy criterion triphone categories x and y, respectively. d(;) denotes the was used for MLP training. dissimilarity between two speech segments, which is computed by dynamic time warping (DTW) in our study. The frame- C. Speaker adaptation of target speech level dissimilarity measure used for DTW scoring is the The Cantonese ASR system is used to perform fMLLR- cosine distance. Note that (x; y) is asymmetric to x and based speaker adaptation of target speech on the 39-dimension y. A symmetric form can be defined by taking average of MFCC features in a two-pass procedure. In the first pass, (x; y) and (y; x). The overall ABX error rate is obtained input speech utterances are decoded in a speaker-independent by averaging over all triphone categories and speakers in manner, using unadapted features, from which initial fMLLR the test set. A high ABX error rate means that the feature transforms are estimated. In the second pass, input speech are representation is not discriminative, and vice versa. Intuitively, decoded with initial fMLLRs in a speaker-adaptive manner. the error rate should be no larger than 50%, as by random After the decoding, final fMLLR transforms for target speech decision, the expectation of ABX error rate is 50%. utterances are estimated. The dimension of fMLLR features is Two evaluation conditions were defined in ZeroSpeech 2017, namely within-speaker and across-speaker. In both con- ditions, the segments A and B to be evaluated are generated by the same speaker. In the within-speaker condition, segment D. DPGMM frame clustering and label filtering X is generated by the same speaker as A and B; In the across- Speech frames for different languages are clustered sepa- speaker condition, X is generated by a speaker different from rately by the DPGMM algorithm based on the 40-dimension A and B. fMLLR features. The implementation of DPGMM clustering is performed using an open-source tool developed by Chang et al. [18]. For the three target languages, namely English, B. Out-of-domain ASR systems French and Mandarin, the numbers of iterations of clustering Four out-of-domain ASR systems are utilized and investi- were 120; 200 and 3000 respectively. The numbers of itera- gated in our experiments. They cover the languages of Can- tions for English and French are determined by preliminary tonese (CA), Czech (CZ), Hungarian (HU) and Russian (RU). experiments. Specifically, the iterations for English ranging in The Cantonese ASR is trained with the CUSENT database f40; 80; : : : ; 680g and for French ranging in f40; 80; : : : ; 400g [49]. The database contains 20; 378 training utterances from were tested. The optimal numbers of iterations were 120 and 34 male and 34 female speakers, with a total of 19:3 hours 200 respectively. For Mandarin, the number of iterations was of speech. The Kaldi toolkit [50] is used to train two versions empirically determined. The resulted numbers of DPGMM of AMs: CD-GMM-HMM-SAT and DNN-HMM. DNN-HMM clusters for English, French and Mandarin are 1118; 1345 and training labels are acquired from CD-GMM-HMM-SAT time 596, respectively. Each frame is assigned a cluster label. Fig. alignment. The input features for CD-GMM-HMM-SAT are 4 shows the results of clustering in the form of cumulative 40-dimension fMLLRs, and the input features for DNN-HMM distribution function (CDF) for the three target languages. are fMLLRs with 5 splicing. The fMLLR features are The clusters are sorted according to their cluster size in estimated during CD-GMM-HMM-SAT training. Specifically, descending order. In other words, each point (K ; Q ) on the i i VTLN is estimated towards 39-dimension MFCCs++. CDF represents the proportion of frame labels Q that the The resulted features with 3 splicing are used to estimate largest K clusters cover. 40-dimension LDA and MLLT. Finally, fMLLR transforms For label filtering, we evaluated different thresholds on the are estimated. MFCC features are computed using a 25-ms percentage of preserved labels, with the value of P ranging Hamming window and a 10-ms frame shift. Per-utterance from 0:6 to 0:95, with the step size of 0:05. After filtering, cepstral mean variance normalization (CMVN) is applied to the frame-level label sequences are converted into pseudo MFCCs. The DNN-HMM model for Cantonese is a 7-layer transcriptions, for the training of DPGMM-HMM AMs (in MLP, with layer configuration 440-1024  5-40-1024-2462. Section IV-E). The dimension of the output layer is determined by the DPGMM clustering was also tested with MFCCs as input number of CD-HMM states modeled by CD-GMM-HMM- features. The numbers of iterations for MFCC clustering SAT. Hidden layers are activated with sigmoid function, except are 200; 240 and 3000 for English, French and Mandarin for the 40-dimension linear BN layer. The network is trained respectively, and the resulted numbers of DPGMM clusters to optimize the cross-entropy criterion. A syllable trigram LM are 1554; 1541 and 381. 7 for (HMM-)MUBNF, OSBNF and (HMM-)LI-BNF are listed Cumulative distribution function in Table II. TABLE II 0.8 C ONFIGURATIONS FOR (HMM-)MUBNF, OSBNF AND (HMM-)LI-BNF English REPRESENTATIONS French 0.6 Mandarin Task label from DPGMM DPGMM-HMM CA CZ HU RU 0.4 Train set EN FR MA EN FR MA Pooling EN, FR and MA MUBNF X X X 0.2 OSBNF1 X OSBNF2 X X X X 0 200 400 600 800 1000 1200 1400 LI-BNF1 X X X X DPGMM cluster number LI-BNF2 X X X X X X X HMM-MUBNF X X X HMM-LI-BNF1 X X X X Fig. 4. Clustering results in the form of cumulative distribution function for HMM-LI-BNF2 X X X X X X X the three target languages. Clusters are sorted according to cluster size in descending order. The MTL-DNN is implemented in three different model structures: MLP, LSTM and BLSTM. The input features are E. DPGMM-HMM and MTL-DNN training 40-dimension fMLLRs spliced with context size 5. The dimensions of shared hidden layers in the MLP are 440- DPGMM-HMM AMs are trained from scratch with pseudo 1024  5-40-1024. Sigmoid activation is used in all hidden transcriptions. Different from the conventional 3-state HMM layers except that the 40 neurons in the BN layer use linear topology, during DPGMM-HMM training we set 1-state HMM activation functions. The learning rate for MLP training is set for each pseudo phone. This prevents the problem of unsuc- at 0:008 at the beginning, and halved when no improvement cessful forced alignments, as the numbers of pseudo phones is observed on a cross-validation set. The mini-batch size is for target languages are significantly larger than the number of 256. The LSTM model comprises 2 LSTM layers with 320- phones for a typical language. The input features for DPGMM- dimension cell activation vectors, and 1024-dimension outputs. HMM are 40-dimension fMLLRs estimated by the Cantonese A 40-dimension BN layer followed by a 1024-dimension ASR. The training procedure follows the standard pipeline as fully-connected (FC) layer is set on top of LSTMs. For the in Kaldi s5 recipe , i.e., starting from CI-GMM-HMM to CD- BLSTM model, there are 2 pairs of forward and backward GMM-HMM, followed by VTLN and fMLLR-based SAT . LSTM layers. Each bi-directional layer has 320-dimension After training, the numbers of CD-HMM states for English, cell activation vectors and 512-dimension outputs. A BN layer French and Mandarin are 2818; 2856 and 2688, respectively. followed by an FC layer is set on top of BLSTMs, with the The MTL-DNN model is trained with all the three target same configuration as in the LSTM. The activation function zero-resource languages, from which BNFs are extracted and in (B)LSTMs is tanh. The learning rate is 2e4 initially, and evaluated by the ABX subword discriminability task. There are halved under the same criteria as for MLP. The truncated back- two types of tasks for MTL, namely, DPGMM-HMM align- propagation through time (BPTT) algorithm [54] is used to ment prediction task and out-of-domain ASR label prediction train (B)LSTM, with a fixed time step T = 20. Note that bptt task. In the first case, three tasks are included, i.e., frame the model parameters of LSTM and BLSTM structures were alignments generated by DPGMM-HMM AMs, one for each tuned in preliminary studies, while for MLP we follow the target zero-resource language. In the second case, four tasks configuration of our previous study [14]. corresponding to Cantonese, Czech, Hungarian and Russian recognizers’ senone labels are included. The senone labels F. TLBNF generation are generated by decoding with LM to AM weight ratio set to 0:001. After MTL-DNN training, 40-dimension HMM-LI- The TLBNFs for target zero-resource languages are gen- BNFs are extracted for the ABX task evaluation. Similarly, erated by applying the Cantonese DNN-HMM AM as the HMM-MUBNFs , extracted by MTL-DNN with DPGMM- feature extractor. During TLBNF extraction, all the parameters HMM alignment tasks, and OSBNFs, extracted by MTL-DNN of the DNN-HMM are fixed. The fMLLR features for target with one or more out-of-domain phone recognizers’ senone languages are fed as inputs to the DNN-HMM till its BN layer labels, are also evaluated by the ABX task. The dimensions to generate TLBNFs. of both HMM-MUBNFs and OSBNFs are 40. As illustrated in Fig. 3, we defined several BNF representations according to V. RESULTS AND DISCUSSION the tasks included in MTL-DNN training. The configurations Table III provides a master summary to facilitate perfor- mance comparison among different systems of feature repre- kaldi/egs/wsj/s5/run.sh sentation learning. The methods are organized in four groups, LDA and MLLT are not estimated, as no improvement was found. marked by circled numerals 1 to 4 in the Table. The first The prefix ‘HMM-’ emphasizes the use of DPGMM-HMM alignments, rather than DPGMM cluster labels. group comprises a few relevant baseline and reference systems. Proportion 8 TABLE III ABX ERROR RATES (%) ON THE BASELINE, OUR PROPOSED METHODS AND STATE OF THE ART OF ZEROS PEECH 2017. MLP IS ADOPTED AS THE SHARED- HIDDEN-LAYER STRUCTURE. L ABEL FILTERING IS NOT APPLIED. Within-speaker Across-speaker English French Mandarin Avg. English French Mandarin Avg. 1s 10s 120s 1s 10s 120s 1s 10s 120s 1s 10s 120s 1s 10s 120s 1s 10s 120s MFCC Baseline [6] 12:0 12:1 12:1 12:5 12:6 12:6 11:5 11:5 11:5 12:0 23:4 23:4 23:4 25:2 25:5 25:2 21:3 21:3 21:3 23:3 Out-of-domain fMLLR [14] 8:0 8:2 7:3 10:3 10:3 9:1 9:3 9:3 8:4 8:9 13:4 12:0 11:3 17:2 15:8 14:8 10:7 10:2 9:4 12:8 Out-of-domain fMLLR [5] 7:8 7:7 7:0 10:4 10:5 9:2 9:2 11:4 8:8 9:1 14:2 11:9 11:3 17:6 15:2 14:4 12:7 13:6 10:0 13:4 MUBNF0 8:0 7:3 7:3 10:3 9:4 9:3 10:1 8:8 8:9 8:8 13:5 12:4 12:4 17:8 16:4 16:1 12:6 11:9 12:0 13:9 MUBNF 7:4 6:9 6:3 9:6 9:0 8:1 9:8 8:8 8:1 8:2 10:9 9:5 8:9 15:2 13:0 12:0 10:5 8:9 8:2 10:8 OSBNF1 7:2 7:1 6:3 10:2 9:7 8:7 9:1 8:6 7:6 8:3 10:0 9:7 8:6 13:9 13:4 11:6 9:0 8:4 7:5 10:2 OSBNF2 6:8 6:7 5:9 9:5 9:2 8:3 9:7 8:9 8:0 8:1 9:5 9:2 7:9 13:1 13:0 11:3 9:4 8:7 7:9 10:0 LI-BNF1 6:9 6:6 6:1 9:5 9:2 8:4 9:2 8:5 7:9 8:0 10:0 8:9 8:2 14:3 12:9 11:5 9:5 8:5 7:7 10:2 LI-BNF2 6:6 6:4 5:7 9:1 9:3 8:2 9:5 8:7 8:1 8:0 9:4 8:7 7:8 13:4 12:7 11:0 9:3 8:6 7:7 9:8 HMM(S)-MUBNF 7:2 6:7 6:3 9:7 9:2 8:3 10:4 9:2 8:5 8:4 10:2 9:3 8:6 14:5 13:0 11:9 10:7 9:2 8:4 10:6 HMM(P)-MUBNF 7:1 6:6 6:2 9:4 9:1 7:8 9:9 8:8 8:2 8:1 10:4 9:2 8:7 14:5 12:7 11:7 10:4 8:9 8:2 10:5 HMM(P)-LI-BNF1 6:8 6:3 5:8 9:1 8:7 7:8 9:1 8:5 7:6 7:7 9:7 8:7 8:0 13:7 12:3 11:1 9:7 8:4 7:6 9:9 HMM(P)-LI-BNF2 6:6 6:4 5:7 9:2 8:8 8:1 9:2 8:6 7:9 7:8 9:3 8:7 7:8 13:0 12:4 11:0 9:5 8:5 7:7 9:8 TLBNF 7:2 6:8 6:1 9:6 9:0 8:0 8:7 7:6 6:8 7:8 10:6 9:6 8:7 14:2 13:2 11:5 8:5 7:6 6:7 10:1 TLBNF+LI-BNF1 7:0 6:6 6:0 9:3 8:8 7:9 8:6 7:5 6:7 7:6 10:3 9:3 8:4 13:9 12:9 11:4 8:5 7:6 6:7 9:9 TLBNF+LI-BNF2 7:1 6:6 6:0 9:4 8:9 7:8 8:7 7:5 6:8 7:6 10:4 9:4 8:5 14:0 13:0 11:3 8:5 7:6 6:6 9:9 TLBNF+HMM(P)-LI-BNF1 7:0 6:6 6:0 9:4 8:8 7:8 8:6 7:5 6:7 7:6 10:3 9:4 8:4 13:9 12:9 11:3 8:5 7:6 6:6 9:9 TLBNF+MUBNF+OSBNF1 6:8 6:4 5:8 9:0 8:8 7:8 8:5 7:7 6:8 7:5 9:9 9:0 8:2 13:6 12:6 11:1 8:4 7:7 6:7 9:7 TLBNF+HMM(P)-MUBNF+OSBNF1 6:8 6:4 5:7 8:8 8:7 7:5 8:4 7:5 6:8 7:4 10:0 9:0 8:2 13:6 12:6 11:1 8:4 7:6 6:7 9:7 TLBNF+HMM(P)-MUBNF+OSBNF2 6:7 6:4 5:8 9:0 8:8 7:5 8:3 7:5 6:8 7:4 10:0 9:0 8:2 13:6 12:6 11:1 8:4 7:6 6:7 9:7 Heck et al. [20] 6:9 6:2 6:0 9:7 8:7 8:4 8:8 7:9 7:8 7:8 10:1 8:7 8:5 13:6 11:7 11:3 8:8 7:4 7:3 9:7 Chorowski et al. [28] 5:8 5:7 5:8 7:1 7:0 6:9 7:4 7:2 7:1 6:7 9:3 9:3 9:3 11:9 11:4 11:6 8:6 8:5 8:5 9:8 The MFCC baseline system refers to the one, in which generic features in the first group of systems outperform the MFCC MFCC features are directly used in triphone minimal pair baseline consistently on all target languages. This improve- discrimination. The first out-of-domain fMLLR system comes ment can be achieved without requiring any transcribed train- from previous work [14], which used a Cantonese ASR system ing data of the target language, which is highly desirable in for fMLLR estimation. The second one used a Japanese ASR the zero-resource scenario. [5]. In [5], the out-of-domain ASR system was trained on 240 The second and third groups of systems all use multilingual hours of Japanese speech. The experimental results in [14] BNF representations, which are learned by different methods show that using a Cantonese ASR system trained on only 19 as described in Section IV-E. DPGMM labels and DPGMM- hours of speech could give a better performance in both within- HMM labels are applied in the the second group and the third and across-speaker conditions. The advantage is particularly group respectively. In the second group, MUBNF0 is learned significant when the target language is Mandarin. using MFCC as input features for DPGMM clustering and MTL-DNN modeling. The other representations in these two B. Effectiveness of multilingual BNFs groups are learned using fMLLRs as DNN input features. As described in Section IV-E and Table II, OSBNF1 and The following observations can be made on the perfor- OSBNF2 are trained with out-of-domain ASR senone labels, mances of the learned multilingual BNF representations: and LI-BNF1 and LI-BNF2 are trained with both DPGMM (1) BNF representations learned by MTL-DNN clearly out- labels and out-of-domain ASR senone labels. In the third perform the respective input features to the DNN. MTL-DNN group, “HMM(S)” and “HMM(P)” denote the use of state- training with DPGMM labels is effective for both MFCC and level and phone-level HMM alignments respectively for label fMLLR. The average ABX error rates achieved by MUBNF0 generation. The fourth group of systems are built on different are 8:8% and 13:9% in the within-speaker and across-speaker combination of BNF features. The “+” sign is used to denote conditions respectively, versus 12:0% and 23:3% attained by concatenation of two frame-level feature representations. The MFCC. For MUBNF representation, the relative performance experimental results on all methods of BNF representation improvements over fMLLR are 7:9% and 15:6% in the two test learning as shown in Table III are obtained by using the conditions. MUBNF outperforms MUBNF0 to a large extent, MLP structure in MTL-DNN. In addition, two representative especially in the across-speaker test condition. This suggests systems that achieved very good performances in ZeroSpeech that speaker adaptation at input feature level is a critical step 2017 [20], [28] are also listed in the Table. in obtaining speaker-invariant BNF representations. (2) The effectiveness of BNF can be further improved by training the MTL-DNN with additional out-of-domain ASRs’ A. Effect of out-of-domain speaker adaptation senone labels. With the Cantonese ASR’s senone labels in- The fMLLR features estimated with in-domain data were cluded as one of the training tasks, the LI-BNF1 representation shown to perform significantly better than conventional spec- reduces within-/across-speaker ABX error rates by absolute tral features in unsupervised subword modeling [22], [26]. In 0:2%=0:6% as compared to MUBNF. When the senone labels the present study, it has been shown that similar improvement of Czech, Hungarian and Russian are added, the resulted LI- could also be attained by performing speaker adaptation using BNF2 representation shows a further improvement of absolute an out-of-domain ASR system. Both out-of-domain fMLLR 0:4% under the across-speaker condition. This shows that out- 9 of-domain acoustic-phonetic knowledge provides complemen- reported systems so far. For the across-speaker condition, our tary information to the in-domain clustering labels for feature proposed systems with combined BNF features have slightly learning. The performance gain of OSBNF2 over OSBNF1, as better performance than VQ-VAE (9:8%). Our systems are well as that of LI-BNF2 over LI-BNF1, confirm the benefit of found to be more effective on long utterances than VQ-VAE. exploiting a wider coverage of language resources. In Table III, it is noted that the performance of VQ-VAE does The performance of OSBNF2 is inferior to OSBNF1 on not depend on utterance duration. For English and Mandarin, Mandarin test set, but not on English and French. It is the ABX error rates are almost exactly the same between the noted that OSBNF1 is learned by using the Cantonese ASR cases of 1s and 120s. One possible reason is that the VQ- senone labels while OSBNF2 is learned by involving Can- VAE system does not perform explicit utterance-level speaker tonese and the other three European languages. Cantonese, normalization on input features. On the contrary, the BNF being a Chinese dialect, is apparently closer to Mandarin representations investigated in the study perform significantly than Czech, Hungarian and Russian in terms of acoustic- better on longer utterances (10s & 120s) than on 1s ones. It is phonetic properties. The experimental results imply that the also noted that our systems are more effective for Mandarin frame labels generated by involving highly-mismatched out- in the across-speaker condition. This may be due to the use of of-domain languages may be of low quality and not suitable Cantonese speech in feature learning. VQ-VAE may be over- for feature learning. fitting to Mandarin due to small data size [28]. (3) As discussed in Section III-D, DPGMM-HMM labels are obtained by modeling temporal dependency of speech C. Effectiveness of label filtering and DPGMM labels are determined with the assumption that neighboring speech frames are independent. Comparing the The effectiveness of the proposed label filtering algorithm is evaluated with the HMM(P)-MUBNF representation, which is corresponding systems in the second and the third groups trained exclusively based on DPGMM-HMM labels, without of Table III, it is noted that DPGMM-HMM labels per- involving out-of-domain speech data. Algorithm 1 requires one form slightly better than DPGMM labels. The ABX error tunable parameter P , i.e., the percentage frame labels to be rates attained with HMM(P)-MUBNF, HMM(P)-LI-BNF1 and retained. The average ABX error rates attained with different HMM(P)-LI-BNF2 are about absolute 0:2% - 0:3% lower values of P are plotted as in Fig. 5. P = 1 means that all than those with MUBNF, LI-BNF1 and LI-BNF2 respectively, labels are kept, which is the setting used to obtain the results except for HMM(P)-LI-BNF2 under the across-speaker condi- in Table III. tion. This demonstrates that capturing temporal dependency in Under both within-speaker and across-speaker conditions, speech is beneficial to feature learning for subword modeling the optimal values of P are in the range of 0:7 to 0:9. That [22]. It is also noted that phone-level HMM alignments are is, when on average about 10 30% of the frame labels better than state-level ones. are removed, the ABX error rates could be slightly reduced. (4) Combining different types of BNF feature representa- This indicates that indeed a certain portion of the labels tions leads to further improvement of performance. Specif- ically, by concatenating HMM(P)-MUBNF, OSBNF1 and are not reliable. However, if too many labels are removed, TLBNF, the best ABX error rates under both within-speaker e.g., more than 30%, the system performance would degrade and across-speaker conditions are achieved (7:4% and 9:7%). significantly, because some good labels are lost. It is found that BNFs learned from in-domain unsupervised The proposed label filtering method is very simple in that data (HMM(P)-MUBNF, OSBNF1) and learned via transfer only the occurrence counts of the labels are considered. Fig. learning (TLBNF) can be jointly used to compose an optimal 5 shows that this criterion is appropriate to a certain extent. feature representation that is better than any individual BNF. However, there may exist infrequent subword units that are The best performance attained in this study is competitive to meaningful and crucial in conveying linguistic content. In the best submitted system for the ZeroSpeech 2017 challenge, [23], [24], it was suggested to reduce the number of DPGMM which is based on the combination of multiple DPGMM clusters without ignoring any frame labels. Since these studies posteriorgrams [20]. These posteriograms were generated with were carried out on a different database, direct comparison of unsupervisedly estimated fMLLRs based on different imple- system performance can not be made. mentation parameters. The combination of posteriorgrams led to 3:0% and 3:3% relative error rate reduction under the D. Comparison of DNN model structures within-speaker and across-speaker conditions, compared to For BNF feature learning with the MTL-DNN approach, the use of single posteriorgram representation. In our work, DNN models other than MLP can be used. Table IV compares concatenating the three aforementioned BNF representations the system performances obtained by using MLP, LSTM results in 5:1% and 4:0% relative error rate reduction, as and BLSTM. The feature representations being investigated compared with the best system with single BNF. It must be include MUBNF, HMM(P)-MUBNF and HMM(P)-LI-BNF1, noted that no out-of-domain transcribed speech was involved and label filtering is not applied. in the system of [20]. In a very recent work [28], vector quantized VAE (VQ-VAE) It is noted that LSTM and BLSTM do not perform as was applied to develop a system of unsupervised subword well as MLP on all three types of BNF representations. modeling. The reported average ABX error rate was 6:7% Experiments were carried out with different parameter settings for within-speaker condition, which is the best among all on LSTM and BLSTM, and the system performance remained 10 TABLE IV COMPARISON OF MTL-DNN SHARED- HIDDEN-LAYER STRUCTURES IN FEATURE REPRESENTATION LEARNING OF ZEROS PEECH 2017. Within-speaker Across-speaker English French Mandarin Avg. English French Mandarin Avg. 1s 10s 120s 1s 10s 120s 1s 10s 120s 1s 10s 120s 1s 10s 120s 1s 10s 120s MLP 7:4 6:9 6:3 9:6 9:0 8:1 9:8 8:8 8:1 8:2 10:9 9:5 8:9 15:2 13:0 12:0 10:5 8:9 8:2 10:8 MUBNF LSTM 7:4 7:1 6:8 10:0 9:5 8:7 10:4 9:5 8:7 8:7 10:4 9:6 9:0 14:6 13:3 12:3 10:9 9:3 8:6 10:9 BLSTM 7:4 7:1 6:7 9:9 9:5 8:9 10:4 9:4 8:7 8:7 10:4 9:6 9:0 14:7 13:3 12:1 10:7 9:3 8:6 10:9 MLP 7:1 6:6 6:2 9:4 9:1 7:8 9:9 8:8 8:2 8:1 10:4 9:2 8:7 14:5 12:7 11:7 10:4 8:9 8:2 10:5 HMM(P)-MUBNF LSTM 7:2 6:8 6:4 9:9 9:4 8:7 10:4 9:5 8:8 8:6 10:0 9:3 8:6 14:3 13:1 11:8 10:7 9:3 8:6 10:6 BSLTM 7:3 6:9 6:5 9:6 9:5 8:4 10:5 9:4 9:0 8:6 10:1 9:4 8:9 14:2 13:0 11:9 10:8 9:4 8:7 10:7 MLP 6:8 6:3 5:8 9:1 8:7 7:8 9:1 8:5 7:6 7:7 9:7 8:7 8:0 13:7 12:3 11:1 9:7 8:4 7:6 9:9 HMM(P)-LI-BNF1 LSTM 6:7 6:6 5:9 9:5 9:4 8:2 9:6 8:9 7:9 8:1 9:6 9:1 8:1 14:1 13:3 11:6 10:2 9:1 8:0 10:3 BLSTM 7:0 6:6 6:1 9:3 9:2 8:2 9:4 8:7 8:0 8:1 9:5 9:0 8:2 13:7 13:0 11:6 9:7 8:7 7:8 10:1 Across-speaker 14 14 10.7 MLP 12 12 LSTM 10.6 BLSTM 10 10 10.5 8 8 10.4 0.6 0.7 0.8 0.9 1 6 6 Percentage of preservation EN FR MA EN FR MA Within-Speaker 8.4 Fig. 6. Average ABX error rates (%) of HMM(P)-MUBNF representation 8.2 over utterance lengths for each language. Left: Across-speaker; Right: Within- speaker. 7.8 guages. In the case of low-resource languages, the challenge 0.6 0.7 0.8 0.9 1 of lacking transcribed data could be translated into the prob- Percentage of preservation lem of acquiring high-quality labels to facilitate supervised DNN training. Commonly used approaches to tackling this problem include applying clustering algorithms on short-time Fig. 5. Average ABX error rates (%) with respect to label filtering percentage speech frames and leveraging a language-mismatched phone over three zero-resource languages, in HMM(P)-MUBNF representation. recognizer to decode input speech. In this paper, it has been demonstrated that learning of robust BNF representations largely unchanged. Fig. 6 gives the performances of HMM(P)- could be achieved by joint contributions from a variety of MUBNF learned by MLP, LSTM and BLSTM for each target techniques, including: (1) use of speaker adapted features; (2) language. For English (EN), different DNN structures have considering temporal dependency in speech when performing similar performance. For French (FR) and Mandarin (MA), frame clustering; (3) increasing phonetic diversity by involving the advantage of MLP over (B)LSTM is more prominent. multiple out-of-domain languages; (4) discarding unreliable This may be related to that the amount of training data frame labels in DNN training. for English is significantly greater than those for French The proposed methods of feature learning have been evalu- and Mandarin. The advantage of LSTM and BLSTM over ated on the standard task of unsupervised subword modeling in MLP in conventional supervised acoustic modeling has been the ZeroSpeech 2017 Challenge. The experimental results have widely recognized and attributed to the capability of capturing shown that effective speaker adaptation with untranscribed temporal characteristics of speech. With limited training data, training data could be achieved by using an out-of-domain the benefits of recurrent structures can not be fully exploited. ASR system. Out-of-domain ASR systems from resource- In our systems, contextual information is incorporated via the rich languages can also be utilized to provide phonetically use of DPGMM-HMM labels and its effectiveness has been informed labels to support multi-task learning of BNFs, in demonstrated by the experimental results. conjunction with the learning tasks based on DPGMM-HMM clustering labels. Combining different types of BNFs by vector VI. CONCLUSIONS concatenation leads to further performance improvement. The BNFs learned from multilingual speech data have been best performance achieved by our proposed system is 9:7% in proven highly effective for acoustic modeling of spoken lan- terms of across-speaker triphone minimal-pair ABX error rate. ABX error rates (%) ABX error rates (%) ABX error rate (%) 11 It is equal to the performance of the best submitted system in [20] M. Heck, S. Sakti, and S. Nakamura, “Feature optimized DPGMM clus- tering for unsupervised subword modeling: A contribution to zerospeech the ZeroSpeech 2017 and better than other recently reported 2017,” in Proc. ASRU, 2017, pp. 740–746. systems. [21] C. Manenti, T. Pellegrini, and J. Pinquier, “Unsupervised speech unit In principle, the proposed methods are expected to be discovery using k-means and neural networks,” in Proc. SLSP, 2017, pp. 169–180. effective for any combination of languages other than those [22] M. Heck, S. Sakti, and S. Nakamura, “Iterative training of a DPGMM- in ZeroSpeech 2017. Nevertheless, our investigation has sug- HMM acoustic unit recognizer in a zero resource scenario,” in Proc. gested that the closeness between target languages and out-of- SLT, 2016, pp. 57–63. [23] ——, “Dirichlet process mixture of mixtures model for unsupervised domain languages and the amount of available training data subword modeling,” IEEE/ACM TASLP, vol. 26, no. 11, pp. 2027–2042, for individual target languages might have significant impact on the goodness of learned features. [24] B. Wu, S. Sakti, J. Zhang, and S. Nakamura, “Optimizing DPGMM clustering in zero-resource setting based on functional load,” in Proc. SLTU, 2018, pp. 1–5. [25] R. Caruana, “Multitask learning,” in Learning to learn. Springer, 1998, REFERENCES pp. 95–133. [26] M. Heck, S. Sakti, and S. Nakamura, “Supervised learning of acoustic [1] G. Saon, G. Kurata, T. Sercu, K. Audhkhasi, S. Thomas, D. Dimitriadis, models in a zero resource setting to improve DPGMM clustering,” in X. Cui, B. Ramabhadran, M. Picheny, L.-L. Lim, B. Roomi, and P. Hall, Proc. INTERSPEECH, 2016, pp. 1310–1314. “English conversational telephone speech recognition by humans and [27] D. Renshaw, H. Kamper, A. Jansen, and S. Goldwater, “A comparison machines,” in Proc. INTERSPEECH, 2017, pp. 132–136. of neural network methods for unsupervised representation learning on [2] T. Hori, S. Watanabe, Y. Zhang, and W. Chan, “Advances in joint CTC- the zero resource speech challenge,” in Proc. INTERSPEECH, 2015, pp. attention based end-to-end speech recognition with a deep CNN encoder 3199–3203. and RNN-LM,” in Proc. INTERSPEECH, 2017, pp. 949–953. [28] J. Chorowski, R. J. Weiss, S. Bengio, and A. v. d. Oord, “Unsuper- [3] G. Hinton, L. Deng, D. Yu, G. E. Dahl, A.-r. Mohamed, N. Jaitly, vised speech representation learning using wavenet autoencoders,” arXiv A. Senior, V. Vanhoucke, P. Nguyen, T. N. Sainath, and B. Kingsbury, preprint arXiv:1901.08810, 2019. “Deep neural networks for acoustic modeling in speech recognition: [29] Y. Yuan, C.-C. Leung, L. Xie, H. Chen, B. Ma, and H. Li, “Extracting The shared views of four research groups,” IEEE Signal Processing bottleneck features and word-like pairs from untranscribed speech for Magazine, vol. 29, no. 6, pp. 82–97, 2012. feature representations,” in Proc. ASRU, 2017, pp. 734–739. [4] A. Ragni, E. Dakin, X. Chen, M. J. Gales, and K. M. Knill, “Multi- [30] H. Chen, C. Leung, L. Xie, B. Ma, and H. Li, “Multitask feature learning language neural network language models.” in Proc. INTERSPEECH, for low-resource query-by-example spoken term detection,” J. Sel. Topics 2016, pp. 3042–3046. Signal Processing, vol. 11, no. 8, pp. 1329–1339, 2017. [5] H. Shibata, T. Kato, T. Shinozaki, and S. Watanabe, “Composite em- [31] T. Tsuchiya, N. Tawara, T. Ogawa, and T. Kobayashi, “Speaker invariant bedding systems for zerospeech2017 track 1,” in Proc. ASRU, 2017, pp. feature extraction for zero-resource languages with adversarial learning,” 747–753. in Proc. ICASSP, 2018, pp. 2381–2385. [6] E. Dunbar, X.-N. Cao, J. Benjumea, J. Karadayi, M. Bernard, L. Be- [32] T. K. Ansari, R. Kumar, S. Singh, S. Ganapathy, and S. Devi, “Unsuper- sacier, X. Anguera, and E. Dupoux, “The zero resource speech challenge vised HMM posteriograms for language independent acoustic modeling 2017,” in Proc. ASRU, 2017, pp. 323–330. in zero resource conditions,” in Proc. ASRU, 2017, pp. 762–768. [7] J. Glass, “Towards unsupervised speech processing,” in Proc. ISSPA, [33] Y. Qiao, N. Shimomura, and N. Minematsu, “Unsupervised optimal 2012, pp. 1–4. phoneme segmentation: Objectives, algorithm and comparisons,” in [8] H. Kamper, A. Jansen, and S. Goldwater, “Fully unsupervised small- Proc. ICASSP, 2008, pp. 3989–3992. vocabulary speech recognition using a segmental bayesian model,” in [34] S. Feng, T. Lee, and H. Wang, “Exploiting language-mismatched Proc. INTERSPEECH, 2015, pp. 678–682. phoneme recognizers for unsupervised acoustic modeling,” in Proc. [9] M. Versteegh, R. Thiollière, T. Schatz, X.-N. Cao, X. Anguera, ISCSLP, 2016, pp. 1–5. A. Jansen, and E. Dupoux, “The zero resource speech challenge 2015.” [35] M.-L. Sung, S. Feng, and T. Lee, “Unsupervised pattern discovery from in Proc. INTERSPEECH, 2015, pp. 3169–3173. thematic speech archives based on multilingual bottleneck features,” in [10] H. Chen, C.-C. Leung, L. Xie, B. Ma, and H. Li, “Unsupervised Accepted by APSIPA ASC, 2018. bottleneck features for low-resource query-by-example spoken term [36] C.-H. Lee, F. K. Soong, and B.-H. Juang, “A segment model based detection,” in INTERSPEECH, 2016, pp. 923–927. approach to speech recognition,” in Proc. ICASSP, 1988, pp. 501–504. [11] Y. Yuan, C.-C. Leung, L. Xie, H. Chen, B. Ma, and H. Li, “Pairwise [37] H. Gish and K. Ng, “A segmental speech model with applications to learning using multi-lingual bottleneck features for low-resource query- word spotting,” in Proc. ICASSP, vol. 2, 1993, pp. 447–450. by-example spoken term detection,” in Proc. ICASSP, 2017, pp. 5645– [38] H. Wang, T. Lee, C.-C. Leung, B. Ma, and H. Li, “Acoustic segment modeling with spectral clustering methods,” IEEE/ACM Trans. ASLP, [12] H. Chen, C.-C. Leung, L. Xie, B. Ma, and H. Li, “Multilingual bottle- vol. 23, no. 2, pp. 264–277, 2015. neck feature learning from untranscribed speech,” in Proc. ASRU, 2017, [39] S. Bhati, S. Nayak, and K. S. R. Murty, “Unsupervised speech signal to pp. 727–733. symbol transformation for zero resource speech applications,” in Proc. [13] T. K. Ansari, R. Kumar, S. Singh, and S. Ganapathy, “Deep learning INTERSPEECH, 2017, pp. 2133–2137. methods for unsupervised acoustic modeling - LEAP submission to [40] H. Kamper, A. Jansen, and S. Goldwater, “Unsupervised word segmenta- zerospeech challenge 2017,” in Proc. ASRU, 2017, pp. 754–761. tion and lexicon discovery using acoustic word embeddings,” IEEE/ACM [14] S. Feng and T. Lee, “Exploiting speaker and phonetic diversity of TASLP, vol. 24, no. 4, pp. 669–679, 2016. mismatched language resources for unsupervised subword modeling,” [41] A. Martinet, “Economie des changements phonétiques,” 1970. in Proc. INTERSPEECH, 2018, pp. 2673–2677. [42] C. F. Hockett, A manual of phonology. Waverly Press, 1955, no. 11. [15] G. Synnaeve and E. Dupoux, “Weakly supervised multi-embeddings [43] M. Heck, S. Sakti, and S. Nakamura, “Unsupervised linear discrimi- learning of acoustic models,” arXiv preprint arXiv:1412.6645, 2014. nant analysis for supporting DPGMM clustering in the zero resource [16] H. Kamper, M. Elsner, A. Jansen, and S. Goldwater, “Unsupervised neu- scenario,” in Proc. SLTU, 2016, pp. 73–79. ral network based feature extraction using weak top-down constraints,” [44] F. Grézl, M. Karafiát, and L. Burget, “Investigation into bottle-neck in Proc. ICASSP, 2015, pp. 5818–5822. features for meeting speech recognition,” in Proc. INTERSPEECH, [17] E. Hermann, H. Kamper, and S. Goldwater, “Multilingual and unsuper- 2009, pp. 2947–2950. vised subword modeling for zero-resource languages,” arXiv preprint [45] H. Sak, A. W. Senior, and F. Beaufays, “Long short-term memory arXiv:1811.04791, 2018. recurrent neural network architectures for large scale acoustic modeling.” [18] J. Chang and J. W. Fisher III, “Parallel sampling of DP mixture models in Proc. INTERSPEECH, 2014, pp. 338–342. using sub-cluster splits,” in Advances in NIPS, 2013, pp. 620–628. [46] A. Graves, N. Jaitly, and A.-r. Mohamed, “Hybrid speech recognition [19] H. Chen, C.-C. Leung, L. Xie, B. Ma, and H. Li, “Parallel inference of with deep bidirectional LSTM,” in Proc. ASRU, 2013, pp. 273–278. Dirichlet process Gaussian mixture models for unsupervised acoustic [47] P. Swietojanski, A. Ghoshal, and S. Renals, “Unsupervised cross-lingual modeling: A feasibility study,” in Proc. INTERSPEECH, 2015, pp. knowledge transfer in DNN-based LVCSR,” in Proc. SLT, 2012, pp. 3189–3193. 246–251. 12 [48] N. Zeghidour, G. Synnaeve, M. Versteegh, and E. Dupoux, “A deep ICSLP, 2002, pp. 901–904. scattering spectrum-deep siamese network pipeline for unsupervised [52] P. Schwarz, “Phoneme recognition based on long temporal context,” acoustic modeling,” in Proc. ICASSP, 2016, pp. 4965–4969. PhD Tesis. Brno University of Technology., 2009. [49] T. Lee, W. K. Lo, P. C. Ching, and H. Meng, “Spoken language [53] H. v. d. Heuvel, J. Boudy, Z. Bakcsi, J. Cernocky, V. Galunov, J. Kochan- resources for Cantonese speech processing,” Speech Communication, ina, W. Majewski, P. Pollak, M. Rusko, J. Sadowski et al., “SpeechDat- vol. 36, no. 3, pp. 327–342, 2002. E: Five eastern european speech databases for voice-operated teleser- [50] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, vices completed,” in Proc. INTERSPEECH, 2001. M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz et al., “The Kaldi [54] R. J. Williams and J. Peng, “An efficient gradient-based algorithm for speech recognition toolkit,” in Proc. ASRU, 2011. on-line training of recurrent network trajectories,” Neural computation, [51] A. Stolcke, “SRILM – an extensible language modeling toolkit,” in Proc. vol. 2, no. 4, pp. 490–501, 1990.

Journal

Electrical Engineering and Systems SciencearXiv (Cornell University)

Published: Aug 9, 2019

There are no references for this article.