Unsupervised speech representation learning using WaveNet autoencoders

Jan Chorowski, Ron J. Weiss, Samy Bengio, Aäron van den Oord

arXiv:1901.08810v2 [cs.LG] 11 Sep 2019

J. Chorowski is with the Institute of Computer Science, University of Wrocław, Poland, e-mail: jan.chorowski@cs.uni.wroc.pl. R. Weiss and S. Bengio are with Google Research. A. van den Oord is with DeepMind, e-mail: {ronw, bengio, avdnoord}@google.com.

Abstract: We consider the task of unsupervised extraction of meaningful latent representations of speech by applying autoencoding neural networks to speech waveforms. The goal is to learn a representation able to capture high level semantic content from the signal, e.g. phoneme identities, while being invariant to confounding low level details in the signal such as the underlying pitch contour or background noise. Since the learned representation is tuned to contain only phonetic content, we resort to using a high capacity WaveNet decoder to infer information discarded by the encoder from previous samples. Moreover, the behavior of autoencoder models depends on the kind of constraint that is applied to the latent representation. We compare three variants: a simple dimensionality reduction bottleneck, a Gaussian Variational Autoencoder (VAE), and a discrete Vector Quantized VAE (VQ-VAE). We analyze the quality of learned representations in terms of speaker independence, the ability to predict phonetic content, and the ability to accurately reconstruct individual spectrogram frames. Moreover, for discrete encodings extracted using the VQ-VAE, we measure the ease of mapping them to phonemes. We introduce a regularization scheme that forces the representations to focus on the phonetic content of the utterance and report performance comparable with the top entries in the ZeroSpeech 2017 unsupervised acoustic unit discovery task.

Index Terms: autoencoder, speech representation learning, unsupervised learning, acoustic unit discovery

I. INTRODUCTION

Creating good data representations is important. The deep learning revolution was triggered by the development of hierarchical representation learning algorithms, such as stacked Restricted Boltzmann Machines [1] and Denoising Autoencoders [2]. However, recent breakthroughs in computer vision [3], [4], machine translation [5], [6], speech recognition [7], [8], and language understanding [9], [10] rely on large labeled datasets and make little to no use of unsupervised representation learning. This has two drawbacks: first, the requirement of large human labeled datasets often makes the development of deep learning models expensive. Second, while a deep model may excel at solving a given task, it yields limited insights into the problem domain, with main intuitions typically consisting of visualizations of salient input patterns [11], [12], a strategy that is applicable only to problem domains that are easily solved by humans.

In this paper we focus on evaluating and improving unsupervised speech representations. In particular, we focus on representations that separate selected speaker traits, specifically speaker gender and identity, from phonetic content; these properties are consistent with internal representations learned by speech recognizers [13], [14]. Such representations are desired in several tasks, such as low resource automatic speech recognition (ASR), where only a small amount of labeled training data is available. In such a scenario, limited amounts of data may be sufficient to learn an acoustic model on the representation discovered without supervision, but insufficient to learn the acoustic model and a data representation in a fully supervised manner [15], [16].

We focus on representations learned with autoencoders applied to raw waveforms and spectrogram features and investigate the quality of learned representations on LibriSpeech [17]. We tune the learned latent representation to encode only phonetic content and remove other confounding detail. However, to enable signal reconstruction, we rely on an autoregressive WaveNet [18] decoder to infer information that was rejected by the encoder. The use of such a powerful decoder acts as an inductive bias, freeing up the encoder from using its capacity to represent low level detail and instead allowing it to focus on high level semantic features. We discover that the best representations arise when ASR features, such as mel-frequency cepstral coefficients (MFCCs), are used as inputs, while raw waveforms are used as decoder targets. This forces the system to also learn to generate sample level detail which was removed during feature extraction. Furthermore, we observe that the Vector Quantized Variational Autoencoder (VQ-VAE) [19] yields the best separation between the acoustic content and speaker information. We investigate the interpretability of VQ-VAE tokens by mapping them to phonemes, demonstrate the impact of model hyperparameters on interpretability, and propose a new regularization scheme which improves the degree to which the latent representation can be mapped to the phonetic content. Finally, we demonstrate strong performance on the ZeroSpeech 2017 acoustic unit discovery task [20], which measures how discriminative a representation is to minimal phonetic changes within an utterance.

II. REPRESENTATION LEARNING WITH NEURAL NETWORKS

Neural networks are hierarchical information processing models that are typically implemented using layers of computational units. Each layer can be interpreted as a feature extractor whose outputs are passed to upstream units [21]. Especially in the visual domain, features learned with neural networks have been shown to create a hierarchy of visual atoms [11] that match some properties of the visual cortex [22]. Similarly, when applied to audio waveforms, neural networks have been shown to learn auditory-like frequency decompositions on music [23] and speech [24], [25], [26], [27] in their lower layers.

A. Supervised feature learning

Neural networks can learn useful data representations in both supervised and unsupervised manners. In the supervised case, features learned on large datasets are often directly useful in similar but data-poor tasks. For instance, in the visual domain, features discovered on ImageNet [28] are routinely used as input representations in other computer vision tasks [29]. Similarly, the speech community has used bottleneck features extracted from networks trained on phoneme prediction tasks [30], [31] as feature representations for speech recognition systems. Likewise, in natural language processing, universal text representations can be extracted from networks trained for machine translation [32] or language inference [33], [34].

B. Unsupervised feature learning

In this paper we focus on unsupervised feature learning. Since no training labels are available we investigate autoencoders, i.e., networks which are tasked with reconstructing their inputs. Autoencoders use an encoding network to extract a latent representation, which is then passed through a decoding network to recover the original data. Ideally, the latent representation preserves the salient features of the original data, while being easier to analyze and work with, e.g. by disentangling different factors of variation in the data, and discarding spurious patterns (noise). These desirable qualities are typically obtained through a judicious application of regularization techniques and constraints or bottlenecks (we use the two terms interchangeably). The representation learned by an autoencoder is thus subject to two competing forces. On the one hand, it should provide the decoder with information necessary for perfect reconstruction and thus capture in the latents as much of the input data characteristics as possible. On the other hand, the constraints force some information to be discarded, preventing the latent representation from being trivial to invert, e.g. by exactly passing through the input. Thus the bottleneck is necessary to force the network to learn a non-trivial data transformation.

Reducing the dimensionality of the latent representation can serve as a basic constraint applied to the latent vectors, with the autoencoder acting as a nonlinear variant of linear low-rank data projections, such as PCA or SVD [35]. However, such representations may be difficult to interpret because the reconstruction of an input depends on all latent features [36]. In contrast, dictionary learning techniques, such as sparse [37] and non-negative [36] decompositions, express each input pattern using a combination of a small number of selected features out of a larger pool, which facilitates their interpretability. Discrete feature learning using vector quantization can be seen as an extreme form of sparseness in which the reconstruction uses only one element from the dictionary.

The Variational Autoencoder (VAE) [38] proposes a different interpretation of feature learning which follows a probabilistic framework. The autoencoding network is derived from a latent-variable generative model. First, a latent vector $z$ is sampled from a prior distribution $p(z)$ (typically a multidimensional normal distribution). Then the data sample $x$ is generated using a deep decoder neural network with parameters $\theta$ that computes $p(x|z;\theta)$. However, computing the exact posterior distribution $p(z|x)$ that is needed during maximum likelihood training is difficult. Instead, the VAE introduces a variational approximation to the posterior, $q(z|x;\phi)$, which is modeled using an encoder neural network with parameters $\phi$. Thus the VAE resembles a traditional autoencoder, in which the encoder produces distributions over latent representations, rather than deterministic encodings, while the decoder is trained on samples from this distribution. Encoding and decoding networks are trained jointly to maximize a lower bound on the log-likelihood of data point $x$ [38], [39]:

$$J_{VAE}(\theta, \phi; x) = \mathbb{E}_{q(z|x;\phi)}\left[\log p(x|z;\theta)\right] - \beta\, D_{KL}\left(q(z|x;\phi)\,\|\,p(z)\right). \tag{1}$$

We can interpret the two terms of Eq. (1) as the autoencoder's reconstruction cost augmented with a penalty term applied to the hidden representation. In particular, the KL divergence expresses the amount of information in nats which the latent representation carries about the data sample. Thus, it acts as an information bottleneck [40] on the latent representation, where $\beta$ controls the trade-off between reconstruction quality and the representation simplicity.

An alternative formulation of the VAE objective explicitly constrains the amount of information contained in the latent representation [41]:

$$J_{VAE}(\theta, \phi; x) = \mathbb{E}_{q(z|x;\phi)}\left[\log p(x|z;\theta)\right] - \max\left(B,\, D_{KL}\left(q(z|x;\phi)\,\|\,p(z)\right)\right), \tag{2}$$

where the constant $B$ corresponds to the amount of free information in $q$, because the model is only penalized if it transmits more than $B$ nats over the prior in the distribution over the latents. Please note that for convenience we will often refer to information content using units of bits instead of nats.

A recently proposed modification of the VAE, called the Vector Quantized VAE [19], replaces the continuous and stochastic latent vectors with deterministically quantized versions. The VQ-VAE maintains a number of prototype vectors $\{e_i,\ i = 1, \ldots, K\}$. During the forward pass, representations produced by the encoder are replaced with their closest prototypes. Formally, let $z_e(x)$ be the output of the encoder prior to quantization. VQ-VAE finds the nearest prototype $q(x) = \arg\min_i \|z_e(x) - e_i\|_2$ and uses it as the latent representation $z_q(x) = e_{q(x)}$ which is passed to the decoder. When using the model in downstream tasks, the learned representation can therefore be treated either as a distributed representation in which each sample is represented by a continuous vector, or as a discrete representation in which each sample is represented by the prototype ID (also called the token ID).

During the backward pass, the gradient of the loss with respect to the pre-quantized embedding is approximated using the straight-through estimator [42], i.e., $\frac{\partial L}{\partial z_e(x)} \approx \frac{\partial L}{\partial z_q(x)}$.

(Footnote: In TensorFlow this can be conveniently implemented using $z_q(x) = z_e(x) + \mathrm{stop\_gradient}(e_{q(x)} - z_e(x))$.)

The prototypes are trained by extending the learning objective with terms which optimize quantization. Prototypes are forced to lie close to the vectors which they replace, and an auxiliary cost, dubbed the commitment loss, is introduced to encourage the encoder to produce vectors which lie close to prototypes. Without the commitment loss VQ-VAE training can diverge by emitting representations with unbounded magnitude. Therefore, VQ-VAE is trained using a sum of three loss terms: the negative log-likelihood of the reconstruction, which uses the straight-through estimator to bring the gradient from the decoder to the encoder, and two VQ-related terms: the distance from each prototype to its assigned vectors and the commitment cost [19]:

$$L = -\log p\left(x \mid z_q(x)\right) + \left\|\mathrm{sg}\left(z_e(x)\right) - e_{q(x)}\right\|_2^2 + \gamma \left\|z_e(x) - \mathrm{sg}\left(e_{q(x)}\right)\right\|_2^2, \tag{3}$$

where $\mathrm{sg}(\cdot)$ denotes the stop-gradient operation which zeros the gradient with respect to its argument during the backward pass.

The quantization within the VQ-VAE acts as an information bottleneck.
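The nearest-prototype lookup and the two VQ-related terms of Eq. (3) can be sketched in a few lines of plain Python. This is only a forward-pass illustration under stated assumptions: the function names (`squared_distance`, `nearest_prototype`, `quantize`, `vq_terms`), the toy prototypes, and the use of Python lists instead of tensors are hypothetical choices made for this example and are not the authors' implementation. The stop-gradient operation only changes the backward pass, so here it appears in comments only; in an autodiff framework it decides which parameters each term updates.

```python
from typing import List, Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared L2 distance between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


def nearest_prototype(z_e: Vector, prototypes: Sequence[Vector]) -> int:
    """Return the token ID q(x): index of the prototype closest to z_e."""
    return min(range(len(prototypes)),
               key=lambda i: squared_distance(z_e, prototypes[i]))


def quantize(encoder_outputs: Sequence[Vector],
             prototypes: Sequence[Vector]) -> Tuple[List[int], List[Vector]]:
    """Replace every encoder output with its closest prototype."""
    tokens = [nearest_prototype(z, prototypes) for z in encoder_outputs]
    z_q = [prototypes[t] for t in tokens]
    return tokens, z_q


def vq_terms(encoder_outputs: Sequence[Vector],
             prototypes: Sequence[Vector],
             gamma: float = 0.25) -> Tuple[float, float]:
    """Forward values of the two VQ-related terms, summed over frames.

    codebook term:   ||sg(z_e) - e_q||^2        (would update prototypes only)
    commitment term: gamma * ||z_e - sg(e_q)||^2 (would update the encoder only)
    """
    _, z_q = quantize(encoder_outputs, prototypes)
    codebook = sum(squared_distance(z, e) for z, e in zip(encoder_outputs, z_q))
    commitment = gamma * codebook
    return codebook, commitment


# Toy example: three 2-D prototypes and three encoder frames.
prototypes = [[0.0, 0.0], [1.0, 1.0], [4.0, 0.0]]
encoder_outputs = [[0.1, -0.2], [0.9, 1.2], [3.5, 0.5]]

tokens, z_q = quantize(encoder_outputs, prototypes)  # tokens == [0, 1, 2]
codebook, commitment = vq_terms(encoder_outputs, prototypes)
```

In the forward pass both terms measure the same distance and differ only by the weight, which is why the sketch computes one from the other. With automatic differentiation, the decoder would receive `z_e + stop_gradient(z_q - z_e)` so that the reconstruction gradient flows straight through to the encoder.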
The encoder can be interpreted as a probabilistic model which puts all probability mass on the selected discrete token (prototype ID). Assuming a uniform prior distribution over $K$ tokens, the KL divergence is constant and equal to $\log K$. Therefore, the KL term does not need to be included in the VQ-VAE training criterion in Eq. (3) and instead becomes a hyperparameter tied to the size of the prototype inventory.

The VQ-VAE was qualitatively shown to learn a representation which separated the phonetic content within an utterance from the identity of the speaker [19]. Moreover, the discovered tokens could be mapped to phonemes in a limited setting.

C. Autoencoders for sequential data

Sequential data, such as speech or text, often contain local dependencies that can be exploited by generative models. In fact, purely autoregressive models of sequential data, which predict the next observation based on recent history, are very successful. For text, these correspond to n-gram models [43] and convolutional neural language models [44], [45]. Similarly, WaveNet [18] is a state-of-the-art autoregressive model of time-domain waveform samples for text-to-speech synthesis.

A downside of such autoregressive models is that they do not explicitly produce latent representations of the data. However, it is possible to combine an autoregressive sequence generation model with an encoder tasked with extraction of latent representations. Depending on the use case, the encoder can process the whole utterance, emit a single latent vector and feed it to an autoregressive decoder [33], [46], or the encoder can periodically emit vectors of latent features to be consumed by the decoder [19], [47]. We concentrate on the latter solution.

Training mixed latent variable and autoregressive models is prone to latent space collapse, in which the decoder learns to ignore the constrained latent representations and only uses the unconstrained signal coming through the autoregressive path. For the VAE, this collapse can be prevented by annealing the weight of the KL term and using the free-information formulation in Eq. (2). The VQ-VAE is naturally resilient to the latent collapse because the KL term is a hyperparameter which is not optimized using gradient training of a given model. We defer further discussion of this topic to Section V.

[Figure 1: architecture diagram showing the encoder, the three bottleneck variants (AE, VAE, VQ-VAE), the time-jitter layer, and the WaveNet decoder, with probe points $p_{enc}$, $p_{proj}$, $p_{bn}$, and $p_{cond}$.]

Fig. 1. The proposed model is conceptually divided into 3 parts: an encoder (green), made of a residual convnet that computes a stream of latent vectors (typically every 10 ms or 20 ms) from a time-domain waveform sampled at 16 kHz, which are passed through a bottleneck (red) before being used to condition a WaveNet decoder (blue) which reconstructs the waveform using two additional information streams: an autoregressive stream which predicts the next sample based on past samples, and global conditioning which represents the identity of the input speaker (one out of N total training speakers). We experiment with three bottleneck variants: a simple dimensionality reduction (AE), a sampling layer with an additional Kullback-Leibler penalty term (VAE), or a discretization layer (VQ-VAE). Intuitively, this bottleneck encourages the encoder to discard portions of the latent representation which the decoder can infer from the two other information streams. For all layers, numbers in parentheses indicate the number of output channels, and subscripts denote the filter length. Locations of "probe" points which are used in Section IV to evaluate the quality of the learned representation are denoted with black dots.

III. MODEL DESCRIPTION

The architecture of our model is presented in Figure 1. The encoder reads a sequence of either raw audio samples, or of audio features, and extracts a sequence of hidden vectors, which are passed through a bottleneck to become a sequence of latent representations. The frequency at which the latent vectors are extracted is governed by the number of strided convolutions applied by the encoder.

(Footnote: To keep the autoencoder viewpoint, the feature extractor can be interpreted as a fixed signal processing layer in the encoder.)

The decoder reconstructs the utterance by conditioning a WaveNet [18] network on the latent representation extracted by the encoder and, separately, on a speaker embedding. Explicitly conditioning the decoder on speaker identity frees the encoder from having to capture speaker-dependent information in the latent representation. Specifically, the decoder (i) takes the encoder's output, (ii) optionally applies a stochastic regularization to the latent vectors (see Section III-A), (iii) then combines latent vectors extracted at neighboring time steps using convolutions and (iv) upsamples them to the output frequency. Waveform samples are reconstructed with a WaveNet that combines all conditioning sources: autoregressive information about past samples, global information about the speaker, and latent information about past and future samples extracted by the encoder. We find that the encoder's bottleneck and the proposed regularization are crucial in extracting nontrivial representations of data. With no bottleneck, the model is prone to learn a simple reconstruction strategy which makes verbatim copies of future samples. We also note that the encoder is speaker independent and requires only speech data, while the decoder also requires speaker information.

We consider three forms of bottleneck: (i) simple dimensionality reduction, (ii) a Gaussian VAE with different latent representation dimensionalities and different capacities following Eq. (2), and (iii) a VQ-VAE with different numbers of quantization prototypes. All bottlenecks are optionally followed by the dropout-inspired time-jitter regularization described below. Furthermore, we experiment with different input and output representations, using raw waveforms, log-mel filterbank, and mel-frequency cepstral coefficient (MFCC) features which discard pitch information present in the spectrogram.

A. Time-jitter regularization

We would like the model to learn a representation of speech which corresponds to the slowly-changing phonetic content within an utterance: a mostly constant signal that can abruptly change at phoneme boundaries.

Inspired by slow feature analysis [48], we first experimented with penalizing time differences between encoder representations either before or after the bottleneck. However, this regularization resulted in a collapse of the latent space: the model learned to output a constant encoding. This is a common problem of sequential VAEs that use loss terms to regularize the latent encoding [49].

Reconsidering the problem, we realized that we want each frame's representation to correspond to a meaningful phonetic unit. Thus we want to prevent the system from using consecutive latent vectors as individual units. Put differently, we want to prevent latent vector co-adaptation. We therefore introduce a dropout-inspired [50] time-jitter regularizer, also reminiscent of Zoneout [51] regularization for recurrent networks. During training, each latent vector can replace either one or both of its neighbors. As in dropout, this prevents the model from relying on consistency across groups of tokens. Additionally, this regularization also promotes latent representation stability over time: a latent vector extracted at time step $t$ must strive to also be useful at time steps $t - 1$ or $t + 1$. In fact, the regularization was crucial for reaching good performance on ZeroSpeech at higher token extraction frequencies.

The regularization layer is inserted right after the encoder's bottleneck (i.e., after dimensionality reduction for the regular autoencoder, after sampling a realization of the latent layer for the VAE, and after discretization for the VQ-VAE). It is only enabled during training. For each time step we independently sample whether it is to be replaced with the token right after or before it. We do not copy a token by more than one timestep.

IV. EXPERIMENTS

We evaluated models on two datasets: LibriSpeech [17] (clean subset) and ZeroSpeech 2017 Contest Track 1 data [20]. Both datasets have similar characteristics: multiple speakers, clean, read speech (sourced from audio books) recorded at a sampling rate of 16 kHz. Moreover, the ZeroSpeech challenge controls the amount of per-speaker data with the majority of the data being uttered by only a few speakers.

Initial experiments, presented in Section IV-B, compare different bottleneck variants and establish what type of information from the input audio is preserved in the continuous latent representations produced by the model at the four different probe points pictured in Figure 1. Using the representation computed at each probe point, we measure performance on several prediction tasks: phoneme prediction (per-frame accuracy), speaker identity and gender prediction accuracy, and $L_2$ reconstruction error of spectrogram frames. We establish that the VQ-VAE learns latent representations with the strongest disentanglement between the phonetic content and speaker identity, and focus on this architecture in the following experiments.

In Section IV-C we analyze the interpretability of VQ-VAE tokens by mapping each discrete token to the most frequent corresponding phoneme in a forced alignment of a small labeled data set (LibriSpeech dev) and report the accuracy of the mapping on a separate set (LibriSpeech test). Intuitively, this captures the interpretability of individual tokens.

We then apply the VQ-VAE to the ZeroSpeech 2017 acoustic unit discovery task [20] in Section IV-D. This task evaluates how discriminative the representation is with respect to the phonetic class. Finally, in Section IV-E we measure the impact of different hyperparameters on performance.

A. Default model hyperparameters

Our best models used MFCCs as the encoder input, but reconstructed raw waveforms at the decoder output. We used standard 13 MFCC features extracted every 10 ms (i.e., at a rate of 100 Hz) and augmented with their temporal first and second derivatives. Such features were originally designed for speech recognition and are mostly invariant to pitch and similar confounding detail in the audio signal. The encoder had 9 layers each using 768 units with ReLU activation, organized into the following groups: 2 preprocessing convolution layers with filter length 3 and residual connections, 1 strided convolution length reduction layer with filter length 4 and stride 2 (downsampling the signal by a factor of two), followed by 2 convolutional layers with length 3 and residual connections, and finally 4 feedforward ReLU layers with residual connections. The resulting latent vectors were extracted at 50 Hz (i.e., every second frame), with each latent vector depending on a receptive field of 16 input frames. We also used an alternative encoder with two length reduction layers, which extracted latent representations at 25 Hz with a receptive field of 30 frames.

When unspecified, the latent representation was 64 dimensional and when applicable constrained to 14 bits. Furthermore, for the VQ-VAE we used the recommended $\gamma = 0.25$ [19].

The decoder applied the randomized time-jitter regularization (see Section III-A). During training each latent vector was replaced with either of its neighbors with probability 0.12. The jittered latent sequence was passed through a single convolutional layer with filter length 3 and 128 hidden units to mix information across neighboring timesteps. The representation was then upsampled 320 times (to match the 16 kHz audio sampling rate) and concatenated with a one-hot vector representing the current speaker to form the conditioning input of an autoregressive WaveNet [18]. The WaveNet was composed of 20 causal dilated convolution layers, each using 368 gated units with residual connections, organized into two "cycles" of 10 layers with dilation rates $1, 2, 4, \ldots, 2^9$. The conditioning signal was passed separately into each layer. The signal from each layer of the WaveNet was passed to the output using skip-connections. Finally, the signal was passed through 2 ReLU layers with 256 units. A softmax was applied to compute the next sample probability. We used 256 quantization levels after mu-law companding [18].

All models were trained on minibatches of 64 sequences of length 5120 time-domain samples (320 ms) sampled uniformly from the training dataset. Training a single model on 4 Google Cloud TPUs (16 chips) took a week. We used the Adam optimizer [52] with initial learning rate $4 \times 10^{-4}$ which was halved after 400k, 600k, and 800k steps. Polyak averaging [53] was applied to all checkpoints used for model evaluation.

[Figure 2: plots of filterbank reconstruction error and of phoneme, gender, and speaker prediction accuracy at the probe points $p_{enc}$, $p_{proj}$, $p_{bn}$, and $p_{cond}$ for the AE, VAE (D = 4, 8, 16, 32), and VQ-VAE bottlenecks; x-axes: latent dimensions, and VAE free bits / VQ-VAE bits per token.]

Fig. 2. Accuracy of predicting signal characteristics at various probe locations in the network. Among the three bottlenecks evaluated, VQ-VAE discards the most speaker-related information at the bottleneck, while preserving the most phonetic information. For all bottlenecks, the representation coming out of the encoder yields over 70% accurate framewise phoneme predictions. Both the simple AE and VQ-VAE preserve this information in the bottleneck (the accuracy drops to 50%-60% depending on the bottleneck's strength). However, the VQ-VAE discards almost all speaker information (speaker classification accuracy is close to 0% and gender prediction close to 50%). This causes the VQ-VAE representation to perform best on the acoustic unit discovery task: the representation captures the phonetic content while being invariant to speaker identity.

[Figure 3: phoneme prediction accuracy plotted against gender prediction accuracy for the AE, VAE (D = 32), and VQ-VAE bottlenecks at the probe points enc, proj, bn, and cond.]

Fig. 3. Comparison of gender and phoneme prediction accuracy for different bottleneck types and probe points. The decoder is conditioned on the speaker, thus the gender information can be recovered and the bottleneck should discard it. While the information is present at the $p_{enc}$ probe, the AE and VAE models tend to similarly discard both gender and phoneme information at other probe points. On the other hand, VQ-VAE selectively discards gender information.

B. Bottleneck comparison

We train models on LibriSpeech and analyze the information captured in the hidden representations surrounding the autoencoder bottleneck at each of the four probe points shown in Figure 1:

[...]

[...] information better than simple dimensionality reduction, but not as well as VQ-VAE. The VAE discards phonetic and speaker information more uniformly than VQ-VAE: at $p_{bn}$, VAE's phoneme predictions are less accurate, while its gender predictions are more accurate. Moreover, combining information across a wider receptive field at $p_{cond}$ does not improve phoneme recognition as much as in VQ-VAE models. The sensitivity to the bottleneck dimensionality, seen in Figure 2, is also surprising, with narrower VAE bottlenecks discarding less information than wider ones. This may be due to the stochastic operation of the VAE: to provide the same KL divergence as at low bottleneck dimensions, more noise needs to be added at high dimensions. This noise may mask information present in the representation.

Based on these results we conclude that the VQ-VAE bottleneck is most appropriate for learning latent representations which capture phonetic content while being invariant to the underlying speaker identity.

C. VQ-VAE token interpretability

Up to this point we have used the VQ-VAE as a bottleneck that quantizes latent vectors. In this section we seek an interpretation of the discrete prototype IDs, evaluating whether VQ-VAE tokens can be mapped to phonemes, the underlying discrete constituents of speech sounds. Example token IDs are pictured in the middle pane of Figure 4, where we can see that the token 11 is consistently associated with the transient "T" phone. To evaluate whether other tokens have similar interpretations, we measured the frame-wise phoneme recognition accuracy in which each token was mapped to one out of 41 phonemes. We used the 460 hour clean LibriSpeech training set for unsupervised training, and used labels from the clean dev subset to associate each token with the most probable phoneme. We evaluated the mapping by computing frame-wise phone recognition accuracy on the clean test set at a frame rate of 100 Hz. The ground-truth phoneme boundaries were obtained from forced alignments using the Kaldi tri6b model from the s5 LibriSpeech recipe [55].

Table I shows the performance of the configuration which obtained the best accuracy mapping VQ-VAE tokens to phonemes on LibriSpeech. Recognition accuracy is given at two time points: after 200k gradient descent steps, when the relative performance of models can be assessed, and after 900k steps, when the models have converged. We did not observe overfitting with longer training times. Predicting the most frequent silence phoneme for all frames set an accuracy lower bound at 16%. A model discriminatively trained on the full 460 hour training set to predict phonemes with the same architecture as the 25 Hz encoder achieved 80% framewise phoneme recognition accuracy, while a model with no time-reduction layers set the upper bound at 88%.

TABLE I
LibriSpeech frame-wise phoneme recognition accuracy (%). VQ-VAE models consume MFCC features and extract tokens at 25 Hz.

Num tokens     256   512   1024  2048  4096  8192  16384  32768
Bits           8     9     10    11    12    13    14     15
Train steps
200k           56.7  58.3  59.7  60.3  60.7  61.2  61.4   61.7
900k           58.6  61.0  61.9  63.3  63.8  63.9  64.3   64.5

Table I indicates that the mapping accuracy improves with the number of tokens, with the best model reaching 64.5% accuracy using 32768 tokens. However, the largest accuracy gain occurs at 4096 tokens, with diminishing returns as the number of tokens is further increased. This result is in rough correspondence with the 5760 tied triphone states used in the Kaldi tri6b model.

We also note that increasing the number of tokens does not trivially lead to improved accuracies, because we measure generalization, and not cluster purity. In the limit of assigning a different token to each frame, the accuracy will be poor because of overfitting to the small development set on which we establish the mapping. However, in our experiments we consistently observed improved accuracy.

D. Unsupervised ZeroSpeech 2017 acoustic unit discovery

The ZeroSpeech 2017 phonetic unit discovery task [20] evaluates a representation's ability to discriminate between different sounds, rather than the ease of mapping the representation to predefined phonetic units. It is therefore complementary to the phoneme classification accuracy metric used in the previous section. The ZeroSpeech evaluation scheme uses the minimal pair ABX test [56], [57] which assesses the model's ability to discriminate between pairs of three phoneme long segments of speech that differ only in the middle phone (e.g. "get" and "got"). We trained the models on the provided training data (45 hours for English, 24 hours for French and 2.5 hours for Mandarin) and evaluated them on the test data using the official evaluation scripts. To ensure that we do not overfit to the ZeroSpeech task we only considered the best hyperparameter settings found on LibriSpeech (cf. Section IV-E). Moreover, to maximally abide by the ZeroSpeech convention, we used the same hyperparameters for all languages, denoted as VQ-VAE (per lang, MFCC, $p_{cond}$) in Table II.

(Footnote: The comparison with other systems from the challenge is fair, because according to the ZeroSpeech experimental protocol, all participants were encouraged to tune their systems on the three languages that we use (English, French, and Mandarin), while the final evaluation used two surprise languages for which we do not have the labels required for evaluation.)

On English and French, which come with sufficiently large training datasets, we achieve results better than the top contestant [58], despite using a speaker independent encoder. The results are consistent with our analysis of information separation performed by the VQ-VAE bottleneck: in the more challenging across-speaker evaluation, the best performance uses the $p_{cond}$ representation, which combines several neighboring frames of the bottleneck representation (VQ-VAE (per lang, MFCC, $p_{cond}$) in Table II). Comparing within- and across-speaker results is similarly consistent with the observations in Section IV-B. In the within-speaker case, it is not necessary to disentangle speaker identity from phonetic content, so the quantization between the $p_{proj}$ and $p_{bn}$ probe points hurts performance (although on English this is corrected by considering the broader context at $p_{cond}$). In the across-speaker case, quantization improves the scores on English and French because the gain from discarding the confounding speaker

TABLE II
ZeroSpeech 2017 phonetic unit discovery ABX scores reported across- and within-speakers (lower is better). The VQ-VAE encoder is speaker independent and thus its results do not change with the amount of test speaker data (1 s, 10 s, or 2 m), while speaker-adaptive models (e.g. supervised topline) improve with more target speaker data. We report the two reference points from the challenge, along with the challenge winner [58] and three other submissions that used neural networks in an unsupervised setting [59], [60], [61]. All VQ-VAE models use exactly the same hyperparameter setup (14 bit tokens extracted at 50 Hz with time-jitter probability 0.5), regardless of the amount of unlabeled training data (45 h, 24 h or 2.4 h). The top VQ-VAE results row (VQ-VAE trained on target language, features extracted at the $p_{cond}$ point) gives best results overall. We also include in italics results for different probe points and for VQ-VAEs jointly trained on all languages. Multilingual training helps Mandarin. We also observe that the quantization mostly discards speaker and context influence. The context is however recovered in the conditioning signal which combines information from latent vectors at neighboring timesteps.
Within-speaker Across-speaker English (45h) French (24h) Mandarin (2.4h) English (45h) French (24h) Mandarin (2.4h) Model 1s 10s 2m 1s 10s 2m 1s 10s 2m 1s 10s 2m 1s 10s 2m 1s 10s 2m Unsupervised baseline 12.0 12.1 12.1 12.5 12.6 12.6 11.5 11.5 11.5 23.4 23.4 23.4 25.2 25.5 25.2 21.3 21.3 21.3 Supervised topline 6.5 5.3 5.1 8.0 6.8 6.8 9.5 4.2 4.0 8.6 6.9 6.7 10.6 9.1 8.9 12.0 5.7 5.1 VQ-VAE (per lang, MFCC, p ) 5.6 5.5 5.5 7.3 7.5 7.5 11.2 10.7 10.8 8.1 8.0 8.0 11.0 10.8 11.1 12.2 11.7 11.9 cond VQ-VAE (per lang, MFCC, p ) 6.2 6.0 6.0 7.5 7.3 7.6 10.8 10.5 10.6 8.9 8.8 8.9 11.3 11.0 11.2 11.9 11.4 11.6 bn VQ-VAE (per lang, MFCC, p ) 5.9 5.8 5.9 6.7 6.9 6.9 9.9 9.7 9.7 9.1 9.0 9.0 11.9 11.6 11.7 11.0 10.6 10.7 proj VQ-VAE (all lang, MFCC, p ) 5.8 5.8 5.8 8.0 7.9 7.8 9.2 9.1 9.2 8.8 8.6 8.7 11.8 11.6 11.6 10.3 10.0 9.9 cond VQ-VAE (all lang, MFCC, p ) 6.3 6.2 6.3 8.0 8.0 7.9 9.0 8.9 9.1 9.4 9.2 9.3 11.8 11.7 11.8 9.9 9.7 9.7 bn VQ-VAE (all lang, MFCC, p ) 5.8 5.7 5.8 7.1 7.0 6.9 7.4 7.2 7.1 9.3 9.3 9.3 11.9 11.4 11.6 8.6 8.5 8.5 proj VQ-VAE (all lang, fbank, p ) 6.0 6.0 6.0 6.9 6.8 6.8 6.8 6.6 6.6 10.1 10.1 10.1 12.5 12.2 12.3 7.8 7.7 7.7 proj Heck et al. [58] 6.9 6.2 6.0 9.7 8.7 8.4 8.8 7.9 7.8 10.1 8.7 8.5 13.6 11.7 11.3 8.8 7.4 7.3 Chen et al. [59] 8.5 7.3 7.2 11.2 9.4 9.4 10.5 8.7 8.5 12.7 11.0 10.8 17.0 14.5 14.1 11.9 10.3 10.1 Ansari et al. [60] 7.7 6.8 N/A 10.4 N/A 8.8 10.4 9.3 9.1 13.2 12.0 N/A 17.2 N/A 15.4 13.0 12.2 12.3 Yuan et al. [61] 9.0 7.1 7.0 11.9 9.5 9.5 11.1 8.5 8.2 14.0 11.9 11.7 18.6 15.5 14.9 12.7 10.8 10.7 information offsets the loss of some phonetic details. Moreover, these design choices on the English part of the ZeroSpeech the discarded phonetic information can be recovered by mixing challenge task. Indeed, we found that the proposed time-jitter neighboring timesteps at p . regularization improved ZeroSpeech ABX scores for all input cond VQ-VAE performance on Mandarin is worse, which we representations. 
Using MFCC or filterbank features yields better can attribute to three main causes. First, the training dataset scores that using waveforms, and the model consistently obtains consists of only 2.4 hours or speech, leading to overfitting better scores when more tokens are used. (see Sec. IV-E7). This can be partially improved by mul- 1) Time-jitter regularization: In Table III we analyze the tilingual training, as in VQ-VAE, (all lang, MFCC, p ). effectiveness of the time-jitter regularization on VQ-VAE cond Second, Mandarin is a tonal language, while the default encodings and compare it to two variants of dropout: regular input features (MFCCs) discard pitch information. We note a dropout applied to individual dimensions of the encoding and slight improvement with a multilingual model trained on mel dropout applied randomly to the full encoding at individual filterbank features (VQ-VAE, (all lang, fbank, p )). Third, time steps. Regular dropout does not force the model to sepa- proj VQ-VAE was shown not to encode prosody in the latent rate information in neighboring timesteps. Step-wise dropout representation [19]. Comparing the results across probe points, promotes encodings which are independent across timesteps we see that Mandarin is the only language for which the VQ and performs slightly worse than the time-jitter . bottleneck discards information and decreases performance in The proposed time-jitter regularization greatly improves the across-speaker testing regime. Nevertheless, the multilingual token mapping accuracy and extends the range of token prequantized features yield accuracies comparable to [58]. frame rates which perform well to include 50 Hz. While the We do not consider the need for more unsupervised training LibriSpeech token accuracies are comparable at 25 Hz and data to be a problem. Unlabeled data is abundant. 
We believe 50 Hz, higher token emission frequencies are important for that a more powerful model that requires and can make better the ZeroSpeech AUD task, on which the 50 Hz model was use of large amounts of unlabeled training data is preferable to noticeably better. This behavior is due to the fact that the 25 Hz a simpler model whose performance saturates on small datasets. model is prone to omitting short phones (Sec. IV-E6), which However, it remains to be verified if increasing the amount impacts the ABX results on the ZeroSpeech task. of training data would help the Mandarin VQ-VAE learn to We also analyzed information content at the four probe points discard less tonal information (the multilingual model might for VQ-VAE, VAE, and simple dimensionality reduction AE have learned to do this to accommodate French and English). bottleneck, shown in Figure 5. For all bottleneck mechanisms, the regularization limits the quality of filterbank reconstruc- tions and increases the phoneme recognition accuracy in the E. Hyperparameter impact constrained representation. However this benefit is smaller after All VQ-VAE autoencoder hyperparameters were tuned on the LibriSpeech task using several grid-searches, optimizing for The token copy probability of 0:12 keeps a given token with probability the highest phoneme recognition accuracy. We also validated 0:88 = 0:77 which roughly corresponds to a 0:23 per-timestep dropout rate Filterbank Phoneme Gender Speaker p p p p enc proj bn cond 0.75 0.6 0.4 Pred. target 0.70 0.2 gender 0.65 phonemes 0.6 0.4 Time-jitter 0.2 probability 0.60 1 10 100 0.75 0.12 WaveNet Receptive Field [ms] 0.50 0.25 Fig. 6. Impact of decoder WaveNet receptive field on the properties of the VQ-VAE conditioning signal. The representation is significantly more gender 0.6 invariant when the receptive field is larger that 10ms. Frame-wise phoneme 0.4 recognition accuracy peaks at about 125ms. The depth and width of the WaveNet have a secondary effect (cf. 
points with the same RF). 0.2 features, especially MFCCs, perform better than waveforms, Bottleneck because by design they discard information about pitch and provide a degree of speaker invariance. Using such a reduced Fig. 5. Impact of the time-jitter regularization on information captured by representations at different probe points. representation forces the encoder to transmit less information to the decoder, acting as an inductive bias toward a more speaker TABLE III invariant latent encoding. EFFECTS OF INPUT REPRESENTATION AND REGULARIZATION ON PHONEME 3) Output representation: We constructed an autoregressive RECOGNITION ACCURACY ON LIBRIS PEECH, MEASURED AFTER 200 K decoder network that reconstructed filterbank features rather TRAINING STEPS. ALL MODELS EXTRACT 256 TOKENS. than raw waveform samples. Inspired by recent progress in Input features Token rate Regularization Accuracy text-to-speech systems, we implemented a Tacotron 2-like decoder [62] with a built-in information bottleneck on the MFCC 25 Hz None 52.5 MFCC 25 Hz Regular dropout p = 0:1 50.7 autoregressive information flow, which was found to be critical MFCC 25 Hz Regular dropout p = 0:2 49.1 in TTS applications. Similarly to Tacotron 2 the filterbank MFCC 25 Hz Per-time step dropout p = 0:2 55.3 features were first processed by a small “pre-net”, we applied MFCC 25 Hz Per-time step dropout p = 0:3 55.7 MFCC 25 Hz Per-time step dropout p = 0:4 55.1 generous amounts of dropout and configured the decoder to MFCC 25 Hz Time-jitter p = 0:08 56.2 predict up to 4 frames in parallel. However, these modifications MFCC 25 Hz Time-jitter p = 0:12 56.2 yielded at best 42% phoneme recognition accuracy, significantly MFCC 25 Hz Time-jitter p = 0:16 56.1 lower than the other architectures described in this paper. The MFCC 50 Hz None 46.5 MFCC 50 Hz Time-jitter p = 0:5 56.1 model was however an order of magnitude faster to train. 
Finally, we analyzed the impact of the size of the decoding log-mel spectrogram 25 Hz None 50.1 log-mel spectrogram 25 Hz Time-jitter p = 0:12 53.6 WaveNet on the representation extracted by the VQ-VAE. We have found that overall receptive field (RF) has a larger impact raw waveform 30 Hz None 37.6 raw waveform 30 Hz Time-jitter p = 0:12 48.1 than the depth or width of the WaveNet. In particular, a large change in the properties of the latent representation happens when the decoder’s receptive field crosses than about 10ms. neighboring timesteps are combined in the p probe point. As shown in Figure 6, for smaller RFs, the conditioning signal cond Moreover, for VQ-VAE and VAE the regularization decreases contains more speaker information: gender prediction is close gender prediction accuracy and makes the representation to 80%, while framewise phoneme prediction accuracy is only slightly less speaker-sensitive. 55%. For larger RFs, gender prediction accuracy is about 60%, 2) Input representation: In this set of experiments we while phoneme prediction peaks near 65%. Finally, while the compared performance using different input representation: reconstruction log-likelihood improved with WaveNet depth up raw waveforms, log-mel spectrograms, or MFCCs. The raw to 30 layers, the phoneme recognition accuracy plateaued with waveform encoder used 9 strided convolutional layers, which 20 layers. Since the WaveNet has the largest computational resulted in token extraction frequency of 30 Hz. We then cost we decided to keep the 20 layer configuration. 
replaced the waveform with a customary ASR data pipeline: 4) Decoder speaker conditioning: The WaveNet decoder 80 log-mel filterbank features extracted every 10ms from 25ms- generates samples based on three sources of information: the long windows and 13 MFCC features extracted from the mel- previously emitted samples (via the autoregressive connection), filterbank output, both augmented with their first and second global conditioning on speaker or other information which temporal derivatives. Using two strided convolution layers in is stationary in time, and on the time-varying representation the encoder led to a 25 Hz token rate for these models. extracted from the encoder. We found that disabling global The results are reported in the bottom of Table III. High-level speaker conditioning reduces phoneme classification accuracy Accuracy Accuracy Accuracy Recon. Error VQ-VAE VAE AE VQ-VAE VAE AE VQ-VAE VAE AE VQ-VAE VAE AE Prediction accuracy 10 by 3 percentage points. This further corroborates our findings An interesting future area for research would be investigating about disentanglement induced by the VQ-VAE bottleneck, methods to increase the model capacity to make better use of which biases the model to discard information that is available larger amounts of unlabeled data. in a more explicit form. Throughout our experiments we used The influence of the size of the dataset is also visible in a speaker-independent encoder. However, adapting the encoder the ZeroSpeech Challenge results (Table II): VQ-VAE models to the speaker might further improve the results. In fact, [58] obtained good performance on English (45 hours of training demonstrates improvements on the ZeroSpeech task using a data) and French (24 hours), but performed poorly on Mandarin speaker-adaptive approach. (2.5 hours). Moreover, on English and French we obtained the 5) Encoder hyperparameters: We experimented with tuning best results with models trained on monolingual data. 
On the number of encoder convolutional layers, as well as the Mandarin slightly better results were obtained using a model number of filters, and the filter length. In general, performance trained jointly on data from all languages. improved with larger encoders, however we established that the encoder’s receptive field must be carefully controlled, with V. RELATED WORK the best performing encoders seeing about 0.3 seconds of input VAEs for sequential data were introduced in [49]. The model signal for each generated token. used LSTM encoder and decoder, while the latent representation The effective receptive field can be controlled using two was formed from the last hidden state of the encoder. The model mechanisms: by carefully tuning the encoder architecture, or by proved useful for natural language processing tasks. However, it designing an encoder with a wide receptive field, but limiting also demonstrated the problem of latent representation collapse: the duration of signal segments seen during training to the when a powerful autoregressive decoder is used simultaneously desired receptive field. In this way the model never learns to with a penalty on the latent encoding, such as the KL prior, use its full capacity. When the model was trained on 2.5s long the VAE has a tendency to ignore the prior and act as if it segments, an encoder with receptive field of 0.3s had framewise were a purely autoregressive sequence model. This issue can phoneme recognition accuracy of 56.5%, while and encoder be mitigated by changing the weight of the KL term, and with a receptive field of 0.8s scored only 54.3%. When trained limiting the amount of information on the autoregressive path on segments of 0.3s, both models performed similarly. by using word dropout [49]. Latent collapse can also be avoided 6) Bottleneck bit rate: The speech VQ-VAE encoder can be in deterministic autoencoders, such as [64], which coupled a seen as encoding a signal using a very low bit rate. 
To achieve convolutional encoder to a powerful autoregressive WaveNet a predetermined target bit rate, one can control both the token decoder [18] to learn a latent representation of music audio rate (i.e., by controlling the degree of downsampling down in consisting of isolated notes from a variety of instruments. the encoder strided convolutions), and the number of tokens We empirically validate that conditioning the decoder on (or equivalently the number of bits) extracted at every step. We speaker information results in encodings which are more found that the token rate is a crucial parameter which must be speaker invariant. Moyer et al. [54] give a rigorous proof chosen carefully, with the best results after 200k training steps that this approach produces representations that are invariant obtained at 50 Hz (56.0% phoneme recognition accuracy ) and to the explicitly provided information and relate it to domain- 25 Hz (56.3%). Accuracy drops abruptly at higher token rates adversarial training, another technique designed to enforce (49.3% at 100 Hz), while lower rates miss very short phones invariance to a known nuisance factor [65]. (53% accuracy at 12.5 Hz). In contrast to the number of tokens, the dimensionality of the When applied to audio, the VQ-VAE uses the WaveNet decoder to free the latent representation from modeling VQ-VAE embedding has a secondary effect on representation information that is easily recoverable form the recent past quality. We found 64 to be a good setting, with much smaller [19]. It avoids the problem of posterior collapse by using a dimensions deteriorating performance for models with a small discrete latent code with a uniform prior which results in a number of tokens and higher dimensionalities negatively constant KL penalty. We employ the same strategy to design affecting performance for models with a large number of tokens. 
the latent representation regularizer: rather than extending the For completeness, we observe that even for the model with cost function with a penalty term that can cause the latent space the largest inventory of tokens, the overall encoder bitrate is to collapse, we rely on random copies of the latent variables low: 14 bits at 50 Hz = 700 bps, which is on par with the to prevent their co-adaptation and promote stability over time. lowest bitrate of classical speech codecs [63]. The randomized time-jitter regularization introduced in this 7) Training corpus size: We experimented with training paper is inspired by slow representations of data [48] and models on subsets of the LibriSpeech training set, varying by dropout, which randomly removes during training neurons the size from 4.6 hours (1%) to 460 hours (100%). Training to prevent their co-adaptation [50]. It is also very similar to on 4.6 hours of data, phoneme recognition accuracy peaked Zoneout [51] which relies on random time copies of selected at 50.5% at 100k steps and then deteriorated. Training on 9 neurons to regularize recurrent neural networks. hours led to a peak accuracy of 52.5% at 180k sets. When the size of training set was increased past 23 hours the phoneme Several authors have recently proposed to model sequences recognition reached 54% after around 900k steps. No further with VAEs that use a hierarchy of variables. [66] explore a improvements were found by training on the full 460 hours of hierarchical latent space which separates sequence-dependent data. We did not observe any overfitting, and for best results variables from those which are sequence-independent ones. trained models until reaching 900k steps with no early stopping. Their model was shown to perform speaker conversion and to 11 improve automatic speech recognititon (ASR) performance in from speaker characteristics. Furthermore, we observe that the the presence of domain mismatch. 
[67] introduce a stochastic latent collapse problem induced by bottlenecks which are too latent variable model for sequential data which also yields strong can be avoided by making the bottleneck strength a disentangled representations and allows content swapping model hyperparameter, either removing it completely (as in between generated sequences. These other approaches could the VQ-VAE), or by using the free-information VAE objective. possibly benefit from regularizing the latent representation to To further improve representation quality, we introduced a achieve further information disentanglement. time-jitter regularization scheme which limits the capacity of Acoustic unit discovery systems aim at transducing the the latent code yet does not result in a collapse of the latent acoustic signal into a sequence of interpretable units akin space. We hope that this can similarly improve performance to phones. They often involve clustering of acoustic frames, of latent variable models used with auto-regressive decoders MFCC or neural network bottleneck features, regularized using in other problem domains. a probabilistic prior. DP-GMM [68] imposes a Dirichlet Process Both the VAE and VQ-VAE constrain the information prior over a Gaussian Mixture Model. Extending it with an bandwidth of the latent representation. However, the VQ-VAE HMM temporal structure for sub-phonetic units leads to the uses a quantization mechanism, which deterministically forces DP-HMM and the HDP-HMM [69], [70], [71]. HMM-VAE the encoding to be equal to a prototype, while the VAE limits proposes the use of a deep neural network instead of a GMM the amount of information by injecting noise. In our study, [72], [73]. These approaches enforce top-down constraints via the VQ-VAE resulted in better information separation than HMM temporal smoothing and temporal modeling. Linguistic the VAE. 
However, further experiments are needed to fully unit discovery models detect recurring speech patterns at a understand this effect. In particular, is this a consequence of word-like level, finding commonly repeated segments with a the quantization, or of the deterministic operation? constrained dynamic time warping [74]. We also observe that while the VQ-VAE produces a discrete In the segmental unsupervised speech recognition framework, representation, for best results it uses a token set so large that neural autoencoders were used to embed variable length speech it is impractical to assign a separate meaning to each one. In segments into a common vector space where they could be particular, in our ZeroSpeech experiments we used the dense clustered into word types [75]. [76] replace the segmental embedding representation of each token, which provided a autoencoder with a model that instead predicts a nearby more nuanced token similarity measure than simply using the speech segment and demonstrate that the representation shares token identity. Perhaps a more structured latent representation many properties with word embeddings. Coupled with an is needed, in which a small set of units can be modulated in a unsupervised word segmentation algorithm and unsupervised continuous fashion. mapping of word embeddings discovered on separate corpora Extensive hyperparameter evaluation indicated that opti- [77] the approach yielded an ASR system trained on unpaired mizing the receptive field sizes of the encoder and decoder speech and text data [78]. networks is important for good model performance. A multi- Several entries to the ZeroSpeech 2017 challenge relied scale modeling approach could furthermore separate the on neural networks for phonetic unit discovery. [61] trains prosodic information. 
Our autoencoding approach could also an autoencoder on pairs of speech segments found using an be combined with penalties that are more specialized to speech unsupervised term discovery system [79]. [59] first clustered processing. Introducing a HMM prior as in [73] could promote speech frames, then trained a neural network to predict the a latent representation which better mimics the temporal cluster IDs and used its hidden representation as features. phonetic structure of speech. [60] extended this scheme with features discovered by an autoencoder trained on MFCCs. ACKNOWLEDGMENTS The authors thank Tara Sainath, Ulfar Erlingsson, Aren VI. CONCLUSIONS Jansen, Sander Dieleman, Jesse Engel, Łukasz Kaiser, Tom We applied sequence autoencoders to speech modeling and Walters, Cristina Garbacea, and the Google Brain team for compared different information bottlenecks, including VAEs their helpful discussions and feedback. and VQ-VAEs. We carefully evaluated the induced latent representation using interpretability criteria as well as the ability REFERENCES to discriminate between similar speech sounds. The comparison [1] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of of bottlenecks revealed that discrete representations obtained data with neural networks,” Science, vol. 313, no. 5786, 2006. using VQ-VAE preserved the most phonetic information [2] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in Proc. while also being the most speaker-invariant. The extracted International Conference on Machine Learning, 2008. representation allowed for accurate mapping of the extracted [3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification symbols into phonemes and obtained competitive performance with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105. 
on the ZeroSpeech 2017 acoustic unit discovery task. A similar [4] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, combination of VQ-VAE encoder and WaveNet decoder by V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Cho et al. had the best acoustic unit discovery performance in Proc. IEEE Conference on Computer Vision and Pattern Recognition, ZeroSpeech 2019 [80]. [5] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by We established that an information bottleneck is required jointly learning to align and translate,” in Proc. International Conference for the model to learn a representation that separates content on Learning Representations, 2015. 12 [6] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, [29] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are M. Krikun, Y. Cao, Q. Gao, K. Macherey, and et al, “Google’s neural features in deep neural networks?” in Advances in Neural Information machine translation system: Bridging the gap between human and Processing Systems, 2014, pp. 3320–3328. machine translation,” arXiv preprint arXiv:1609.08144, 2016. [30] K. Vesely, ` M. Karafiat, ´ F. Grezl, ´ M. Janda, and E. Egorova, “The [7] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with language-independent bottleneck features,” in Proc. Spoken Language deep recurrent neural networks,” in Proc. International Conference on Technology Workshop (SLT), 2012, pp. 336–341. Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 6645–6649. [31] D. Yu and M. L. Seltzer, “Improved bottleneck features using pretrained [8] C.-C.Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, deep neural networks,” in Proc. Interspeech, 2011. A. Kannan, R. J. Weiss, K. Rao, K. Gonina, N. Jaitly, B. Li, J. Chorowski, [32] B. McCann, J. Bradbury, C. Xiong, and R. Socher, “Learned in translation: and M. 
Bacchiani, “State-of-the-art speech recognition with sequence-to-sequence models,” in Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.
[9] W. Wang, N. Yang, F. Wei, B. Chang, and M. Zhou, “Gated self-matching networks for reading comprehension and question answering,” in Proc. 55th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), vol. 1, 2017, pp. 189–198.
[10] A. W. Yu, D. Dohan, M.-T. Luong, R. Zhao, K. Chen, M. Norouzi, and Q. V. Le, “QANet: Combining local convolution with global self-attention for reading comprehension,” in Proc. International Conference on Learning Representations, 2018.
[11] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional networks,” in European Conference on Computer Vision, 2014.
[12] M. Sundararajan, A. Taly, and Q. Yan, “Axiomatic attribution for deep networks,” in Proc. International Conference on Machine Learning, 2017.
[13] T. Nagamine and N. Mesgarani, “Understanding the representation and computation of multilayer perceptrons: A case study in speech recognition,” in Proc. International Conference on Machine Learning,
[14] J. Chorowski, R. J. Weiss, R. A. Saurous, and S. Bengio, “On using backpropagation for speech texture generation and voice conversion,” in Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), Apr. 2018.
[15] P. Swietojanski, A. Ghoshal, and S. Renals, “Unsupervised cross-lingual knowledge transfer in DNN-based LVCSR,” in Proc. Spoken Language Technology Workshop (SLT), 2012, pp. 246–251.
[16] S. Thomas, M. L. Seltzer, K. Church, and H. Hermansky, “Deep neural network features and semi-supervised training for low resource speech recognition,” in Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 6704–6708.
[17] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “LibriSpeech: an ASR corpus based on public domain audio books,” in Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015.
[18] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “WaveNet: A generative model for raw audio,” arXiv preprint arXiv:1609.03499,
[19] A. van den Oord, O. Vinyals, and K. Kavukcuoglu, “Neural discrete representation learning,” in Advances in Neural Information Processing Systems, 2017, pp. 6309–6318.
[20] E. Dunbar, X. N. Cao, J. Benjumea, J. Karadayi, M. Bernard, L. Besacier, X. Anguera, and E. Dupoux, “The zero resource speech challenge 2017,” in Proc. Automatic Speech Recognition and Understanding Workshop (ASRU), 2017.
[21] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning representations by back-propagating errors,” Nature, vol. 323, no. 6088,
[22] H. Lee, C. Ekanadham, and A. Ng, “Sparse deep belief net model for visual area V2,” in Advances in Neural Information Processing Systems,
[23] S. Dieleman and B. Schrauwen, “End-to-end learning for music audio,” in Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2014, pp. 6964–6968.
[24] N. Jaitly and G. Hinton, “Learning a better representation of speech soundwaves using restricted Boltzmann machines,” in Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011.
[25] Z. Tüske, P. Golik, R. Schlüter, and H. Ney, “Acoustic modeling with deep neural networks using raw time signal for LVCSR,” in Proc. Interspeech,
[26] D. Palaz, M. Magima Doss, and R. Collobert, “Analysis of CNN-based speech recognition system using raw speech as input,” in Proc. Interspeech, 2015.
[27] T. N. Sainath, R. J. Weiss, A. Senior, K. W. Wilson, and O. Vinyals, “Learning the speech front-end with raw waveform CLDNNs,” in Proc. Interspeech, 2015.
[28] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: A Large-Scale Hierarchical Image Database,” in Proc. IEEE Conference on Computer Vision and Pattern Recognition, 2009.
Contextualized word vectors,” in Advances in Neural Information Processing Systems, 2017, pp. 6294–6305.
[33] S. R. Bowman, G. Angeli, C. Potts, and C. Manning, “A large annotated corpus for learning natural language inference,” in Proc. Conference on Empirical Methods in Natural Language Processing, 2015.
[34] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes, “Supervised learning of universal sentence representations from natural language inference data,” in Proc. Conference on Empirical Methods in Natural Language Processing (EMNLP), September 2017, pp. 670–680.
[35] C. M. Bishop, “Continuous latent variables,” in Pattern Recognition and Machine Learning. Springer, 2006, ch. 12.
[36] D. D. Lee and H. S. Seung, “Learning the parts of objects by non-negative matrix factorization,” Nature, vol. 401, no. 6755, p. 788, 1999.
[37] B. A. Olshausen and D. J. Field, “Emergence of simple-cell receptive field properties by learning a sparse code for natural images,” Nature, vol. 381, no. 6583, p. 607, 1996.
[38] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” in Proc. International Conference on Learning Representations, 2014.
[39] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, S. Mohamed, and A. Lerchner, “Beta-VAE: Learning basic visual concepts with a constrained variational framework,” in Proc. International Conference on Learning Representations, 2017.
[40] A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy, “Deep variational information bottleneck,” in Proc. International Conference on Learning Representations, 2017.
[41] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and M. Welling, “Improved variational inference with inverse autoregressive flow,” in Advances in Neural Information Processing Systems, 2016.
[42] Y. Bengio, N. Léonard, and A. Courville, “Estimating or propagating gradients through stochastic neurons for conditional computation,” arXiv preprint arXiv:1308.3432, 2013.
[43] D. Jurafsky and J. H. Martin, Speech and Language Processing (2nd Edition). Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 2009.
[44] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, “Language modeling with gated convolutional networks,” in Proc. International Conference on Machine Learning, 2017.
[45] S. Bai, J. Z. Kolter, and V. Koltun, “An empirical evaluation of generic convolutional and recurrent networks for sequence modeling,” arXiv preprint arXiv:1803.01271, 2018.
[46] W.-N. Hsu, Y. Zhang, R. J. Weiss, H. Zen, Y. Wu, Y. Wang, Y. Cao, Y. Jia, Z. Chen, J. Shen, P. Nguyen, and R. Pang, “Hierarchical generative modeling for controllable speech synthesis,” in Proc. International Conference on Learning Representations, 2019.
[47] I. Gulrajani, K. Kumar, F. Ahmed, A. A. Taïga, F. Visin, D. Vázquez, and A. Courville, “PixelVAE: A latent variable model for natural images,” in Proc. International Conference on Learning Representations, 2017.
[48] L. Wiskott and T. J. Sejnowski, “Slow feature analysis: Unsupervised learning of invariances,” Neural Computation, vol. 14, no. 4, 2002.
[49] S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and S. Bengio, “Generating sentences from a continuous space,” in SIGNLL Conference on Computational Natural Language Learning, 2016.
[50] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdinov, “Dropout: A simple way to prevent neural networks from overfitting,” Journal of Machine Learning Research, vol. 15, no. 1, 2014.
[51] D. Krueger, T. Maharaj, J. Kramár, M. Pezeshki, N. Ballas, N. R. Ke, A. Goyal, Y. Bengio, A. Courville, and C. Pal, “Zoneout: Regularizing RNNs by randomly preserving hidden activations,” in Proc. International Conference on Learning Representations, 2017.
[52] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” in Proc. International Conference on Learning Representations, 2015.
[53] B. T. Polyak and A. B. Juditsky, “Acceleration of stochastic approximation by averaging,” SIAM Journal on Control and Optimization, vol. 30, no. 4, pp. 838–855, 1992.
[54] D. Moyer, S. Gao, R. Brekelmans, A. Galstyan, and G. Ver Steeg, “Invariant Representations without Adversarial Training,” in Advances in Neural Information Processing Systems 31, 2018, pp. 9084–9093.
[55] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, and K. Vesely, “The Kaldi speech recognition toolkit,” in Proc. Automatic Speech Recognition and Understanding Workshop (ASRU), 2011.
[56] T. Schatz, V. Peddinti, F. Bach, A. Jansen, H. Hermansky, and E. Dupoux, “Evaluating speech features with the minimal-pair ABX task: Analysis of the classical MFC/PLP pipeline,” in Proc. Interspeech, 2013, pp. 1–5.
[57] T. Schatz, V. Peddinti, X.-N. Cao, F. Bach, H. Hermansky, and E. Dupoux, “Evaluating speech features with the minimal-pair ABX task (ii): Resistance to noise,” in Proc. Interspeech, 2014.
[58] M. Heck, S. Sakti, and S. Nakamura, “Unsupervised linear discriminant analysis for supporting DPGMM clustering in the zero resource scenario,” Procedia Computer Science, vol. 81, pp. 73–79, 2016.
[59] H. Chen, C.-C. Leung, L. Xie, B. Ma, and H. Li, “Multilingual bottleneck feature learning from untranscribed speech,” in Proc. Automatic Speech Recognition and Understanding Workshop (ASRU), 2017.
[60] T. Ansari, R. Kumar, S. Singh, and S. Ganapathy, “Deep learning methods for unsupervised acoustic modeling—leap submission to zerospeech challenge 2017,” in Proc. Automatic Speech Recognition and Understanding Workshop (ASRU), 2017, pp. 754–761.
[61] Y. Yuan, C. C. Leung, L. Xie, H. Chen, B. Ma, and H. Li, “Extracting bottleneck features and word-like pairs from untranscribed speech for feature representation,” in Proc. Automatic Speech Recognition and Understanding Workshop (ASRU), Dec 2017, pp. 734–739.
[62] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. J. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu, “Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions,” in Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.
[63] X. Wang and C.-C. J. Kuo, “An 800 bps VQ-based LPC voice coder,” Journal of the Acoustical Society of America, vol. 103, no. 5, 1998.
[64] J. Engel, C. Resnick, A. Roberts, S. Dieleman, M. Norouzi, D. Eck, and K. Simonyan, “Neural audio synthesis of musical notes with wavenet autoencoders,” in Proc. International Conference on Machine Learning, 2017, pp. 1068–1077.
[65] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, “Domain-Adversarial Training of Neural Networks,” Journal of Machine Learning Research, vol. 17, no. 59, pp. 1–35, 2016.
[66] W.-N. Hsu, Y. Zhang, and J. Glass, “Unsupervised learning of disentangled and interpretable representations from sequential data,” in Advances in Neural Information Processing Systems, 2017, pp. 1876–1887.
[67] Y. Li and S. Mandt, “Disentangled sequential autoencoder,” in Proc. International Conference on Machine Learning, 2018.
[68] H. Chen, C.-C. Leung, L. Xie, B. Ma, and H. Li, “Parallel Inference of Dirichlet Process Gaussian Mixture Models for Unsupervised Acoustic Modeling: A Feasibility Study,” in Proc. Interspeech, 2015.
[69] C.-y. Lee and J. Glass, “A Nonparametric Bayesian Approach to Acoustic Model Discovery,” in Proc. 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Jul. 2012, pp. 40–49.
[70] L. Ondel, L. Burget, and J. Černocký, “Variational Inference for Acoustic Unit Discovery,” Procedia Computer Science, vol. 81, Jan. 2016.
[71] R. Marxer and H. Purwins, “Unsupervised Incremental Online Learning and Prediction of Musical Audio Signals,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 5, May 2016.
[72] J. Ebbers, J. Heymann, L. Drude, T. Glarner, R. Haeb-Umbach, and B. Raj, “Hidden Markov Model Variational Autoencoder for Acoustic Unit Discovery,” in Proc. Interspeech, Aug. 2017, pp. 488–492.
[73] T. Glarner, P. Hanebrink, J. Ebbers, and R. Haeb-Umbach, “Full Bayesian Hidden Markov Model Variational Autoencoder for Acoustic Unit Discovery,” in Proc. Interspeech, Sep. 2018, pp. 2688–2692.
[74] A. S. Park and J. R. Glass, “Unsupervised Pattern Discovery in Speech,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 1, pp. 186–197, Jan. 2008.
[75] H. Kamper, A. Jansen, and S. Goldwater, “A segmental framework for fully-unsupervised large-vocabulary speech recognition,” Computer Speech & Language, vol. 46, pp. 154–174, 2017.
[76] Y.-A. Chung and J. Glass, “Learning word embeddings from speech,” arXiv preprint arXiv:1711.01515, 2017.
[77] G. Lample, A. Conneau, L. Denoyer, and M. Ranzato, “Unsupervised Machine Translation Using Monolingual Corpora Only,” in Proc. International Conference on Learning Representations, 2018.
[78] Y.-A. Chung, W.-H. Weng, S. Tong, and J. Glass, “Unsupervised cross-modal alignment of speech and text embedding spaces,” Advances in Neural Information Processing Systems, 2018.
[79] A. Jansen and B. Van Durme, “Efficient spoken term discovery using randomized algorithms,” in Proc. Automatic Speech Recognition and Understanding Workshop (ASRU), 2011, pp. 401–406.
[80] E. Dunbar, R. Algayres, J. Karadayi, M. Bernard, J. Benjumea, X.-N. Cao, L. Miskic, C. Dugrain, L. Ondel, A. W. Black, L. Besacier, S. Sakti, and E. Dupoux, “The Zero Resource Speech Challenge 2019: TTS without T,” arXiv preprint arXiv:1904.11469, 2019, accepted to Interspeech 2019.

Jan Chorowski is an Associate Professor at the Faculty of Mathematics and Computer Science at the University of Wrocław. He received his M.Sc. degree in electrical engineering from the Wrocław University of Technology, Poland and EE PhD from the University of Louisville, Kentucky in 2012. He has worked with several research teams, including Google Brain, Microsoft Research and Yoshua Bengio’s lab at the University of Montreal. His research interests are applications of neural networks to problems which are intuitive for humans but difficult for machines, such as speech and natural language processing.

Ron J. Weiss is a software engineer at Google where he has worked on content-based audio analysis, recommender systems for music, noise robust speech recognition, speech translation, and speech synthesis. Ron completed his Ph.D. in electrical engineering from Columbia University in 2009 where he worked in the Laboratory for the Recognition of Speech and Audio. From 2009 to 2010 he was a postdoctoral researcher in the Music and Audio Research Laboratory at New York University.

Samy Bengio (PhD in computer science, University of Montreal, 1993) is a research scientist at Google since 2007. He currently leads a group of research scientists in the Google Brain team, conducting research in many areas of machine learning such as deep architectures, representation learning, sequence processing, speech recognition, image understanding, large-scale problems, adversarial settings, etc. He is the general chair for Neural Information Processing Systems (NeurIPS) 2018, the main conference venue for machine learning, was the program chair for NeurIPS in 2017, is action editor of the Journal of Machine Learning Research and on the editorial board of the Machine Learning Journal, was program chair of the International Conference on Learning Representations (ICLR 2015, 2016), general chair of BayLearn (2012-2015) and the Workshops on Machine Learning for Multimodal Interactions (MLMI’2004-2006), as well as the IEEE Workshop on Neural Networks for Signal Processing (NNSP’2002), and on the program committee of several international conferences such as NeurIPS, ICML, ICLR, ECML and IJCAI.

Aäron van den Oord is a research scientist at DeepMind, London. Aäron completed his PhD at the University of Ghent, Belgium in 2015. He has worked on unsupervised representation learning, music recommendation, generative modeling with autoregressive networks and various applications of generative models such as text-to-speech synthesis and data compression.

Electrical Engineering and Systems Science arXiv (Cornell University)

Unsupervised speech representation learning using WaveNet autoencoders

Unsupervised speech representation learning using WaveNet autoencoders Jan Chorowski, Ron J. Weiss, Samy Bengio, Aaron ¨ van den Oord Abstract—We consider the task of unsupervised extraction speaker gender and identity, from phonetic content, properties of meaningful latent representations of speech by applying which are consistent with internal representations learned autoencoding neural networks to speech waveforms. The goal by speech recognizers [13], [14]. Such representations are is to learn a representation able to capture high level semantic desired in several tasks, such as low resource automatic speech content from the signal, e.g. phoneme identities, while being recognition (ASR), where only a small amount of labeled invariant to confounding low level details in the signal such as the underlying pitch contour or background noise. Since the training data is available. In such scenario, limited amounts learned representation is tuned to contain only phonetic content, of data may be sufficient to learn an acoustic model on the we resort to using a high capacity WaveNet decoder to infer representation discovered without supervision, but insufficient information discarded by the encoder from previous samples. to learn the acoustic model and a data representation in a fully Moreover, the behavior of autoencoder models depends on the supervised manner [15], [16]. kind of constraint that is applied to the latent representation. We compare three variants: a simple dimensionality reduction We focus on representations learned with autoencoders bottleneck, a Gaussian Variational Autoencoder (VAE), and a applied to raw waveforms and spectrogram features and discrete Vector Quantized VAE (VQ-VAE). We analyze the quality investigate the quality of learned representations on LibriSpeech of learned representations in terms of speaker independence, the [17]. 
We tune the learned latent representation to encode only ability to predict phonetic content, and the ability to accurately re- phonetic content and remove other confounding detail. However, construct individual spectrogram frames. Moreover, for discrete encodings extracted using the VQ-VAE, we measure the ease to enable signal reconstruction, we rely on an autoregressive of mapping them to phonemes. We introduce a regularization WaveNet [18] decoder to infer information that was rejected scheme that forces the representations to focus on the phonetic by the encoder. The use of such a powerful decoder acts content of the utterance and report performance comparable with as an inductive bias, freeing up the encoder from using its the top entries in the ZeroSpeech 2017 unsupervised acoustic unit capacity to represent low level detail and instead allowing it discovery task. to focus on high level semantic features. We discover that best Index Terms—autoencoder, speech representation learning, un- representations arise when ASR features, such as mel-frequency supervised learning, acoustic unit discovery cepstral coefficients (MFCCs) are used as inputs, while raw waveforms are used as decoder targets. This forces the system I. I NTRODUCTION to also learn to generate sample level detail which was removed Creating good data representations is important. The deep during feature extraction. Furthermore, we observe that the learning revolution was triggered by the development of Vector Quantized Variational Autoencoder (VQ-VAE) [19] hierarchical representation learning algorithms, such as stacked yields the best separation between the acoustic content and Restricted Boltzman Machines [1] and Denoising Autoencoders speaker information. We investigate the interpetability of VQ- [2]. 
However, recent breakthroughs in computer vision [3], VAE tokens by mapping them to phonemes, demonstrate [4], machine translation [5], [6], speech recognition [7], [8], the impact of model hyperparameters on interpretability and and language understanding [9], [10] rely on large labeled propose a new regularization scheme which improves the degree datasets and make little to no use of unsupervised representation to which the latent representation can be mapped to the phonetic content. Finally, we demonstrate strong performance on the learning. This has two drawbacks: first, the requirement of large ZeroSpeech 2017 acoustic unit discovery task [20], which human labeled datasets often makes the development of deep measures how discriminative a representation is to minimal learning models expensive. Second, while a deep model may phonetic changes within an utterance. excel at solving a given task, it yields limited insights into the problem domain, with main intuitions typically consisting of visualizations of salient input patterns [11], [12], a strategy that II. R EPRESENTATION L EARNING WITH NEURAL NETWORKS is applicable only to problem domains that are easily solved Neural networks are hierarchical information processing by humans. models that are typically implemented using layers of computa- In this paper we focus on evaluating and improving un- tional units. Each layer can be interpreted as a feature extractor supervised speech representations. Specifically, we focus on whose outputs are passed to upstream units [21]. Especially in representations that separate selected speaker traits, specifically the visual domain, features learned with neural networks have J. Chorowski is with the Institute of Computer Science, University of been shown to create a hierarchy of visual atoms [11] that Wrocław, Poland e-mail: jan.chorowski@cs.uni.wroc.pl. match some properties of the visual cortex [22]. Similarly, when R. Weiss and S. Bengio are with Google Research. A. 
van den Oord is with DeepMind email: fronw, bengio, avdnoordg@google.com. applied to audio waveforms, neural networks have been shown arXiv:1901.08810v2 [cs.LG] 11 Sep 2019 2 to learn auditory-like frequency decompositions on music [23] from a prior distribution p(z) (typically a multidimensional and speech [24], [25], [26], [27] in their lower layers. normal distribution). Then the data sample x is generated using a deep decoder neural network with parameters  that computes p(xjz; ). However, computing the exact posterior A. Supervised feature learning distribution p(zjx) that is needed during maximum likelihood Neural networks can learn useful data representations in both training is difficult. Instead, the VAE introduces a variational supervised and unsupervised manners. In the supervised case, approximation to the posterior, q(zjx; ), which is modeled features learned on large datasets are often directly useful using an encoder neural network with parameters . Thus the in similar but data-poor tasks. For instance, in the visual VAE resembles a traditional autoencoder, in which the encoder domain, features discovered on ImageNet [28] are routinely produces distributions over latent representations, rather than used as input representations in other computer vision tasks [29]. deterministic encodings, while the decoder is trained on samples Similarly, the speech community has used bottleneck features from this distribution. Encoding and decoding networks are extracted from networks trained on phoneme prediction tasks trained jointly to maximize a lower bound on the log-likelihood [30], [31] as feature representations for speech recognition of data point x [38], [39]: systems. Likewise, in natural language processing, universal text representations can be extracted from networks trained for J (; ; x) = E [log p(xjz; )] VAE q(zjx;) machine translation [32] or language inference [33], [34]. D (q(zjx; )jj p(z)) : (1) KL We can interpret the two terms of Eq. 
(1) as the autoencoder’s B. Unsupervised feature learning reconstruction cost augmented with a penalty term applied to In this paper we focus on unsupervised feature learning. the hidden representation. In particular, the KL divergence Since no training labels are available we investigate autoen- expresses the amount of information in nats which the latent coders, i.e., networks which are tasked with reconstructing representation carries about the data sample. Thus, it acts as an their inputs. Autoencoders use an encoding network to extract information bottleneck [40] on the latent representation, where a latent representation, which is then passed through a decod- controls the trade-off between reconstruction quality and the ing network to recover the original data. Ideally, the latent representation simplicity. representation preserves the salient features of the original An alternative formulation of the VAE objective explicitly data, while being easier to analyze and work with, e.g. by constrains the amount of information contained in the latent disentangling different factors of variation in the data, and representation [41]: discarding spurious patterns (noise). These desirable qualities J (; ; x) = E [log p(xjz; )] VAE q(zjx;) are typically obtained through a judicious application of max (B; D (q(zjx; )jj p(z))) ; (2) regularization techniques and constraints or bottlenecks (we KL use the two terms interchangeably). The representation learned where the constant B corresponds to the amount of free by an autoencoder is thus subject to two competing forces. On information in q, because the model is only penalized if it the one hand, it should provide the decoder with information transmits more than B nats over the prior in the distribution necessary for perfect reconstruction and thus capture in the over the latents. Please note that for convenience we will often latents as much of the input data characteristics as possible. 
refer to information content using units of bits instead of nats. On the other hand, the constraints force some information to A recently proposed modification of the VAE, called the be discarded, preventing the latent representation from being Vector Quantized VAE [19], replaces the continuous and trivial to invert, e.g. by exactly passing through the input. Thus stochastic latent vectors with deterministically quantized ver- the bottleneck is necessary to force the network to learn a sions. The VQ-VAE maintains a number of prototype vectors non-trivial data transformation. fe ; i = 1; : : : ; Kg. During the forward pass, representations Reducing the dimensionality of the latent representation can produced by the encoder are replaced with their closest serve as a basic constraint applied to the latent vectors, with prototypes. Formally, let z (x) be the output of the encoder the autoencoder acting as a nonlinear variant of linear low- prior to quantization. VQ-VAE finds the nearest prototype rank data projections, such as PCA or SVD [35]. However, q(x) = argmin kz (x) e k and uses it as the latent e i i 2 such representations may be difficult to interpret because the representation z (x) = e which is passed to the decoder. q(x) reconstruction of an input depends on all latent features [36]. In When using the model in downstream tasks, the learned contrast, dictionary learning techniques, such as sparse [37] and representation can therefore be treated either as a distributed non-negative [36] decompositions, express each input pattern representation in which each sample is represented by a using a combination of a small number of selected features out continuous vector, or as a discrete representation in which of a larger pool, which facilitates their interpretability. Discrete each sample is represented by the prototype ID (also called feature learning using vector quantization can be seen as an the token ID). 
extreme form of sparseness in which the reconstruction uses During the backward pass, the gradient of the loss with only one element from the dictionary. respect to the pre-quantized embedding is approximated using The Variational Autoencoder (VAE) [38] proposes a different @L @L the straight-through estimator [42], i.e.,  . The @z (x) @z (x) e q interpretation of feature learning which follows a probabilistic framework. The autoencoding network is derived from a latent- In TensorFlow this can be conveniently implemented using z (x) = variable generative model. First, a latent vector z is sampled z (x) + stop gradient(e z (x)) e e q(x) 3 prototypes are trained by extending the learning objective VQ-VAE Encoder p p enc proj with terms which optimize quantization. Prototypes are forced + Linear(64) VQ 64D 50Hz to lie close to vectors which they replace with an auxiliary or ReLU(768) cost, dubbed the commitment loss, introduced to encourage VAE proj the encoder to produce vectors which lie close to prototypes. Linear(128) sample Without the commitment loss VQ-VAE training can diverge by ReLU(768) or emitting representations with unbounded magnitude. Therefore, AE VQ-VAE is trained using a sum of three loss terms: the negative ReLU(768) Linear(64) log-likelihood of the reconstruction, which uses the straight- through estimator to bring the gradient from the decoder to pbn the encoder, and two VQ-related terms: the distance from each jitter(0:12) Decoder ReLU(768) prototype to its assigned vectors and the commitment cost [19]: Conv (128) L = log p x j z (x) cond 128D 50Hz Conv (768) 2 2 3 +ksg z (x) e k +
kz (x) sg(e )k ; (3) e q(x) e q(x) 2 2 upsample 128D 16kHz where sg() denotes the stop-gradient operation which zeros WaveNet cycle Conv3(768) concat the gradient with respect to its argument during backward pass. (10 layers) 768D 50Hz 128 +N The quantization within the VQ-VAE acts as an information 16kHz StridedConv (768) bottleneck. The encoder can be interpreted as a probabilistic (stride = 2) 256D 16kHz model which puts all probability mass on the selected discrete WaveNet cycle token (prototype ID). Assuming a uniform prior distribution (10 layers) over K tokens, the KL divergence is constant and equal to Conv (768) log K . Therefore, the KL term does not need to be included in 768D 100Hz the VQ-VAE training criterion in Eq. (3) and instead becomes Conv (768) + ReLU(256) a hyperparameter tied to the size of the prototype inventory. 39D 100Hz The VQ-VAE was qualitatively shown to learn a representa- MFCC + d + a ReLU(256) feature extraction tion which separated the phonetic content within an utterance 1D 16kHz sample softmax from the identity of the speaker [19]. Moreover the discovered tokens could be mapped to phonemes in a limited setting. Ns speaker waveform one-hot C. Autoencoders for sequential data Sequential data, such as speech or text, often contain local Fig. 1. The proposed model is conceptually divided into 3 parts: an encoder dependencies that can be exploited by generative models. In (green), made of a residual convnet that computes a stream of latent vectors (typically every 10ms or 20ms) from a time-domain waveform sampled at fact, purely autoregressive models of sequential data, which 16 kHz, which are passed through a bottleneck (red) before being used to predict the next observation based on recent history, are very condition a WaveNet decoder (blue) which reconstructs the waveform using successful. 
For text, these correspond to n-gram models [43] two additional information streams: an autoregressive stream which predicts the next sample based on past samples, and global conditioning which represents and convolutional neural language models [44], [45]. Similarly, the identity of the input speaker (one out of N total training speakers). We WaveNet [18] is a state-of-the-art autoregressive model of experiment with three bottleneck variants: a simple dimensionality reduction time-domain waveform samples for text-to-speech synthesis. (AE), a sampling layer with an additional Kullback-Leibler penalty term (VAE), or a discretization layer (VQ-VAE). Intuitively, this bottleneck encourages A downside of such autoregressive models is that they the encoder to discard portions of the latent representation which the decoder do not explicitly produce latent representations of the data. can infer from the two other information streams. For all layers, numbers in However, it is possible to combine an autoregressive sequence parentheses indicate the number of output channels, and subscripts denote the filter length. Locations of “probe” points which are used in Section IV to generation model with an encoder tasked with extraction of evaluate the quality of the learned representation are denoted with black dots. latent representations. Depending on the use case, the encoder can process the whole utterance, emit a single latent vector and feed it to an autoregressive decoder [33], [46] or the encoder III. M ODEL DESCRIPTION can periodically emit vectors of latent features to be consumed The architecture of our model is presented in Figure 1. The by the decoder [19], [47]. We concentrate on the latter solution. 
encoder reads a sequence of either raw audio samples, or of Training mixed latent variable and autoregressive models audio features and extracts a sequence of hidden vectors, is prone to latent space collapse, in which the decoder learns which are passed through a bottleneck to become a sequence to ignore the constrained latent representations and only uses of latent representations. The frequency at which the latent the unconstrained signal coming through the autoregressive vectors are extracted is governed by the number of strided path. For the VAE, this collapse can be prevented by annealing convolutions applied by the encoder. the weight of the KL term and using the free-information The decoder reconstructs the utterance by conditioning a formulation in Eq. (2). The VQ-VAE is naturally resilient to WaveNet [18] network on the latent representation extracted by the latent collapse because the KL term is a hyperparameter which is not optimized using gradient training of a given model. To keep the autoencoder viewpoint, the feature extractor can be interpreted We defer further discussion of this topic to Section V. as a fixed signal processing layer in the encoder. 4 the encoder and, separately, on a speaker embedding. Explicitly The regularization layer is inserted right after the encoder’s conditioning the decoder on speaker identity frees the encoder bottleneck (i.e., after dimensionality reduction for regular from having to capture speaker-dependent information in the autoencoder, after sampling a realization of the latent layer for latent representation. Specifically, the decoder (i) takes the en- the VAE and after discretization for the VQ-VAE). It is only coder’s output, (ii) optionally applies a stochastic regularization enabled during training. 
For each time step we independently to the latent vectors (see Section III-A), (iii) then combines sample whether it is to be replaced with the token right after latent vectors extracted at neighboring time steps using con- or before it. We do not copy a token more than one timestep. volutions and (iv) upsamples them to the output frequency. Waveform samples are reconstructed with a WaveNet that IV. E XPERIM ENTS combines all conditioning sources: autoregressive information We evaluated models on two datasets: LibriSpeech [17] about past samples, global information about the speaker, and (clean subset) and ZeroSpeech 2017 Contest Track 1 data [20]. latent information about past and future samples extracted Both datasets have similar characteristics: multiple speakers, by the encoder. We find that the encoder’s bottleneck and clean, read speech (sourced from audio books) recorded at a the proposed regularization is crucial in extracting nontrivial sampling rate of 16 kHz. Moreover the ZeroSpeech challenge representations of data. With no bottleneck, the model is prone controls the amount of per-speaker data with the majority of to learn a simple reconstruction strategy which makes verbatim the data being uttered by only a few speakers. copies of future samples. We also note that the encoder is Initial experiments, presented in section IV-B, compare differ- speaker independent and requires only speech data, while the ent bottleneck variants and establish what type of information decoder also requires speaker information. from the input audio is preserved in the continuous latent We consider three forms of bottleneck: (i) simple di- representations produced by the model at the four different mensionality reduction, (ii) a Gaussian VAE with different probe points pictured in Figure 1. Using the representation latent representation dimensionalities and different capacities computed at each probe point, we measure performance following Eq. 
(2), and (iii) a VQ-VAE with different number of on several prediction tasks: phoneme prediction (per-frame quantization prototypes. All bottlenecks are optionally followed accuracy), speaker identity and gender prediction accuracy, and by the dropout inspired time-jitter regularization described L reconstruction error of spectrogram frames. We establish below. Furthermore, we experiment with different input and that the VQ-VAE learns latent representations with strongest output representations, using raw waveforms, log-mel filterbank, disentanglement between the phonetic content and speaker and mel-frequency cepstral coefficient (MFCC) features which identity, and focus on this architecture in the following discard pitch information present in the spectrogram. experiments. In section IV-C we analyze the interpretability of VQ-VAE tokens by mapping each discrete token to the most frequent A. Time-jitter regularization corresponding phoneme in a forced alignment of a small labeled We would like the model to learn a representation of speech data set (LibriSpeech dev) and report the accuracy of the which corresponds to the slowly-changing phonetic content mapping on a separate set (LibriSpeech test). Intuitively, this within an utterance: a mostly constant signal that can abruptly captures the interpretability of individual tokens. change at phoneme boundaries. We then apply the VQ-VAE to the ZeroSpeech 2017 acoustic Inspired by the slow features analysis [48] we first exper- unit discovery task [20] in section IV-D. This task evaluates imented with penalizing time differences between encoder how discriminative the representation is with respect to the representation either before or after the bottleneck. However, phonetic class. Finally, in section IV-E we measure the impact this regularization resulted in a collapse of the latent space of different hyperparameters on performance. – the model learned to output a constant encoding. 
This is a common problem of sequential VAEs that use loss terms to regularize the latent encoding [49].

Reconsidering the problem we realized that we want each frame's representation to correspond to a meaningful phonetic unit. Thus we want to prevent the system from using consecutive latent vectors as individual units. Put differently, we want to prevent latent vector co-adaptation. We therefore introduce a dropout-inspired [50] time-jitter regularizer, also reminiscent of Zoneout [51] regularization for recurrent networks. During training, each latent vector can replace either one or both of its neighbors. As in dropout, this prevents the model from relying on consistency across groups of tokens. Additionally, this regularization also promotes latent representation stability over time: a latent vector extracted at time step $t$ must strive to also be useful at time steps $t - 1$ or $t + 1$. In fact, the regularization was crucial for reaching good performance on ZeroSpeech at higher token extraction frequencies.

A. Default model hyperparameters

Our best models used MFCCs as the encoder input, but reconstructed raw waveforms at the decoder output. We used standard 13 MFCC features extracted every 10 ms (i.e., at a rate of 100 Hz) and augmented with their temporal first and second derivatives. Such features were originally designed for speech recognition and are mostly invariant to pitch and similar confounding detail in the audio signal. The encoder had 9 layers each using 768 units with ReLU activation, organized into the following groups: 2 preprocessing convolution layers with filter length 3 and residual connections, 1 strided convolution length reduction layer with filter length 4 and stride 2 (downsampling the signal by a factor of two), followed by 2 convolutional layers with length 3 and residual connections, and finally 4 feedforward ReLU layers with residual connections. The resulting latent vectors were extracted at 50 Hz (i.e., every

[Figure 2: plot not reproduced. Panels: Filterbank, Phoneme, Gender, Speaker. Probe points: p_enc, p_proj, p_bn, p_cond. Bottlenecks: AE, VAE (D=4), VAE (D=8), VAE (D=16), VAE (D=32), VQ-VAE. Horizontal axes: latent dimensions; VAE free bits / VQ-VAE bits per token.]

Fig. 2. Accuracy of predicting signal characteristics at various probe locations in the network. Among the three bottlenecks evaluated, VQ-VAE discards the most speaker-related information at the bottleneck, while preserving the most phonetic information. For all bottlenecks, the representation coming out of the encoder yields over 70% accurate framewise phoneme predictions. Both the simple AE and VQ-VAE preserve this information in the bottleneck (the accuracy drops to 50%-60% depending on the bottleneck's strength). However, the VQ-VAE discards almost all speaker information (speaker classification accuracy is close to 0% and gender prediction close to 50%). This causes the VQ-VAE representation to perform best on the acoustic unit discovery task: the representation captures the phonetic content while being invariant to speaker identity.

[Figure 3 legend. Probe points: enc, proj, bn, cond. Bottlenecks: AE, VAE (D=32), VQ-VAE.]

The jittered latent sequence was passed through a single convolutional layer with filter length 3 and 128 hidden units to mix information across neighboring timesteps. The representation was then upsampled 320 times (to match the 16 kHz audio sampling rate) and concatenated with a one-hot vector representing the current speaker to form the conditioning input of an autoregressive WaveNet [18]. The WaveNet was composed of 20 causal dilated convolution layers, each using 368 gated units with residual connections, organized into two "cycles" of 10 layers with dilation rates $1, 2, 4, \ldots, 2^9$. The conditioning signal was passed separately into each layer.
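As a quick check of the decoder dimensions quoted above, the following Python sketch recomputes the upsampling factor and the receptive field of the dilated stack. The filter length of 2 per dilated layer is an assumption borrowed from the usual WaveNet design; this section does not state it, and the function names are illustrative only.

```python
def upsampling_factor(audio_rate_hz, latent_rate_hz):
    """Number of audio samples covered by one latent vector."""
    factor, remainder = divmod(audio_rate_hz, latent_rate_hz)
    if remainder:
        raise ValueError("audio rate must be a multiple of the latent rate")
    return factor


def wavenet_receptive_field(num_cycles, layers_per_cycle, filter_length=2):
    """Receptive field (in samples) of stacked causal dilated convolutions.

    A layer with dilation d and filter length k widens the receptive field
    by (k - 1) * d samples. filter_length=2 is an ASSUMPTION, not a value
    given in the text.
    """
    dilations = [2 ** i for i in range(layers_per_cycle)] * num_cycles
    return 1 + sum((filter_length - 1) * d for d in dilations)


audio_rate = 16_000
factor = upsampling_factor(audio_rate, 50)
rf_samples = wavenet_receptive_field(num_cycles=2, layers_per_cycle=10)
rf_ms = 1000 * rf_samples / audio_rate
print(factor, rf_samples, round(rf_ms, 1))
```

Under the filter-length assumption, two cycles with dilations 1 to 512 cover 2047 samples, roughly 128 ms at 16 kHz, and the 50 Hz latent rate gives the 320-fold upsampling stated in the text.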
The signal from each layer of the WaveNet was passed to the output using skip-connections. Finally, the signal was passed through 2 ReLU layers with 256 units. A Softmax was applied to compute the next sample probability. We used 256 quantization levels after mu-law companding [18].

All models were trained on minibatches of 64 sequences of length 5120 time-domain samples (320 ms) sampled uniformly from the training dataset. Training a single model on 4 Google Cloud TPUs (16 chips) took a week. We used the Adam optimizer [52] with initial learning rate 4 × 10^{…} which was halved after 400k, 600k, and 800k steps. Polyak averaging [53] was applied to all checkpoints used for model evaluation.

[Figure 3: plot not reproduced. Horizontal axis: gender prediction accuracy. Vertical axis: phoneme prediction accuracy.]

Fig. 3. Comparison of gender and phoneme prediction accuracy for different bottleneck types and probe points. The decoder is conditioned on the speaker, thus the gender information can be recovered and the bottleneck should discard it. While this information is present at the p_enc probe, the AE and VAE models tend to similarly discard both gender and phoneme information at other probe points. On the other hand, VQ-VAE selectively discards gender information.

second frame), with each latent vector depending on a receptive field of 16 input frames. We also used an alternative encoder with two length reduction layers, which extracted latent representation at 25 Hz with a receptive field of 30 frames. When unspecified, the latent representation was 64 dimensional and when applicable constrained to 14 bits. Furthermore, for the VQ-VAE we used the recommended commitment-loss weight of 0.25 [19]. The decoder applied the randomized time-jitter regularization (see Section III-A). During training each latent vector was replaced with either of its neighbors with probability 0.12.

B. Bottleneck comparison

We train models on LibriSpeech and analyze the information captured in the hidden representations surrounding the autoencoder bottleneck at each of the four probe points shown in Figure 1:

mation better than simple dimensionality reduction, but not as well as VQ-VAE. The VAE discards phonetic and speaker information more uniformly than VQ-VAE: at p_bn, VAE's phoneme predictions are less accurate, while its gender predictions are more accurate. Moreover, combining information across

TABLE I
LIBRISPEECH FRAME-WISE PHONEME RECOGNITION ACCURACY. VQ-VAE MODELS CONSUME MFCC FEATURES AND EXTRACTED TOKENS AT 25 HZ.

Num tokens    256    512    1024   2048   4096   8192   16384  32768
Bits          8      9      10     11     12     13     14     15
Train steps
200k          56.7   58.3   59.7   60.3   60.7   61.2   61.4   61.7
900k          58.6   61.0   61.9   63.3   63.8   63.9   64.3   64.5

accuracy, while a model with no time-reduction layers set the upper bound at 88%.

Table I indicates that the mapping accuracy improves with the number of tokens, with the best model reaching 64.5% accuracy using 32768 tokens. However, the largest accuracy gain occurs at 4096 tokens, with diminishing returns as the number of tokens is further increased. This result is in rough correspondence with the 5760 tied triphone states used in the Kaldi tri6b model.

We also note that increasing the number of tokens does not trivially lead to improved accuracies, because we measure generalization, and not cluster purity. In the limit of assigning a different token to each frame, the accuracy will be poor because of overfitting to the small development set on which we establish the mapping.
However, in our experiments we a wider receptive field at p does not improve phoneme cond consistently observed improved accuracy. recognition as much as in VQ-VAE models. The sensitivity to the bottleneck dimensionality, seen in Figure 2 is also surprising, D. Unsupervised ZeroSpeech 2017 acoustic unit discovery with narrower VAE bottlenecks discarding less information than The ZeroSpeech 2017 phonetic unit discovery task [20] eval- wider ones. This may be due to the stochastic operation of the uates a representation’s ability to discriminate between different VAE: to provide the same KL divergence as at low bottleneck sounds, rather than the ease of mapping the representation to dimensions, more noise needs to be added at high dimensions. predefined phonetic units. It is therefore complementary to the This noise may mask information present in the representation. phoneme classification accuracy metric used in the previous Based on these results we conclude that the VQ-VAE section. The ZeroSpeech evaluation scheme uses the minimal bottleneck is most appropriate for learning latent representations pair ABX test [56], [57] which assesses the model’s ability to which capture phonetic content while being invariant to the discriminate between pairs of three phoneme long segments underlying speaker identity. of speech that differ only in the middle phone (e.g. “get” and “got”). We trained the models on the provided training data C. VQ-VAE token interpretability (45 hours for English, 24 hours for French and 2.5 hours Up to this point we have used the VQ-VAE as a bottleneck for Mandarin) and evaluated them on the test data using the that quantizes latent vectors. In this section we seek an official evaluation scripts. To ensure that we do not overfit to the interpretation of the discrete prototype IDs, evaluating whether ZeroSpeech task we only considered the best hyperparameter VQ-VAE tokens can be mapped to phonemes, the underlying settings found on LibriSpeech (c.f. 
Section IV-E). Moreover, discrete constituents of speech sounds. Example token IDs to maximally abide by the ZeroSpeech convention, we used the are pictured in the middle pane of Figure 4, where we can same hyperparameters for all languages, denoted as VQ-VAE see that the token 11 is consistently associated with the (per lang, MFCC, p ) in Table II. cond transient “T” phone. To evaluate whether other tokens have On English and French, which come with sufficiently similar interpretations, we measured the frame-wise phoneme large training datasets, we achieve results better than the top recognition accuracy in which each token was mapped to one contestant [58], despite using a speaker independent encoder. out of 41 phonemes. We used the 460 hour clean LibriSpeech The results are consistent with our analysis of information training set for unsupervised training, and used labels from separation performed by the VQ-VAE bottleneck: in the the clean dev subset to associate each token with the most more challenging across-speaker evaluation, the best perfor- probable phoneme. We evaluated the mapping by computing mance uses the p representation, which combines several cond frame-wise phone recognition accuracy on the clean test set at neighboring frames of the bottleneck representation (VQ-VAE, a frame rate of 100 Hz. The ground-truth phoneme boundaries (per lang, MFCC, p ) in Table II). Comparing within- cond were obtained from forced alignments using the Kaldi tri6b and across-speaker results is similarly consistent with the model from the s5 LibriSpeech recipe [55]. observations in Section IV-B. In the within-speaker case, it is Table I shows performance of the configuration which not necessary to disentangle speaker identity from phonetic obtained the best accuracy mapping VQ-VAE tokens to content so the quantization between p and p probe points proj bn phonemes on LibriSpeech. 
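The token-to-phoneme mapping described in this subsection can be sketched in a few lines of Python. This is a minimal illustration, not the authors' evaluation code: it assumes tokens and phoneme labels have already been aligned frame by frame at a common rate, the function names are hypothetical, and the toy data is made up.

```python
from collections import Counter, defaultdict


def fit_token_to_phoneme(tokens, phonemes):
    """Map every token ID to the phoneme it most often co-occurs with."""
    counts = defaultdict(Counter)
    for token, phoneme in zip(tokens, phonemes):
        counts[token][phoneme] += 1
    return {token: c.most_common(1)[0][0] for token, c in counts.items()}


def mapping_accuracy(mapping, tokens, phonemes, fallback=None):
    """Frame-wise accuracy of a fitted mapping on held-out frames.

    Tokens never seen while fitting are mapped to `fallback`.
    """
    hits = sum(mapping.get(t, fallback) == p for t, p in zip(tokens, phonemes))
    return hits / len(tokens)


# Toy frame-level data (invented for illustration only).
dev_tokens = [11, 11, 11, 7, 7, 3, 3, 3]
dev_phones = ["T", "T", "AH", "S", "S", "IY", "IY", "T"]
test_tokens = [11, 7, 3, 3, 5]
test_phones = ["T", "S", "IY", "T", "S"]

mapping = fit_token_to_phoneme(dev_tokens, dev_phones)
accuracy = mapping_accuracy(mapping, test_tokens, test_phones)
print(mapping)
print(accuracy)
```

Fitting on one set and scoring on another mirrors the dev/test split used in the text, so the reported number reflects generalization of the mapping rather than cluster purity.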
Recognition accuracy is given at two time points: after 200k gradient descent steps, when the relative performance of models can be assessed, and after 900k steps when the models have converged. We did not observe overfitting with longer training times. Predicting the most frequent silence phoneme for all frames set an accuracy lower bound at 16%. A model discriminatively trained on the full 460 hour training set to predict phonemes with the same architecture as the 25 Hz encoder achieved 80% framewise phoneme recognition

hurts performance (although on English this is corrected by considering the broader context at p_cond). In the across-speaker case, quantization improves the scores on English and French because the gain from discarding the confounding speaker information offsets the loss of some phonetic details. Moreover, the discarded phonetic information can be recovered by mixing neighboring timesteps at p_cond.

VQ-VAE performance on Mandarin is worse, which we can attribute to three main causes. First, the training dataset consists of only 2.4 hours of speech, leading to overfitting (see Sec. IV-E7). This can be partially improved by multilingual training, as in VQ-VAE (all lang, MFCC, p_cond). Second, Mandarin is a tonal language, while the default input features (MFCCs) discard pitch information. We note a slight improvement with a multilingual model trained on mel filterbank features (VQ-VAE (all lang, fbank, p_proj)). Third, VQ-VAE was shown not to encode prosody in the latent representation [19]. Comparing the results across probe points, we see that Mandarin is the only language for which the VQ bottleneck discards information and decreases performance in the across-speaker testing regime. Nevertheless, the multilingual prequantized features yield accuracies comparable to [58].

We do not consider the need for more unsupervised training data to be a problem. Unlabeled data is abundant. We believe that a more powerful model that requires and can make better use of large amounts of unlabeled training data is preferable to a simpler model whose performance saturates on small datasets. However, it remains to be verified if increasing the amount of training data would help the Mandarin VQ-VAE learn to discard less tonal information (the multilingual model might have learned to do this to accommodate French and English).

Footnote: The comparison with other systems from the challenge is fair, because according to the ZeroSpeech experimental protocol, all participants were encouraged to tune their systems on the three languages that we use (English, French, and Mandarin), while the final evaluation used two surprise languages for which we do not have the labels required for evaluation.

TABLE II
ZEROSPEECH 2017 PHONETIC UNIT DISCOVERY ABX SCORES REPORTED ACROSS- AND WITHIN-SPEAKERS (LOWER IS BETTER). THE VQ-VAE ENCODER IS SPEAKER INDEPENDENT AND THUS ITS RESULTS DO NOT CHANGE WITH THE AMOUNT OF TEST SPEAKER DATA (1S, 10S, OR 2M), WHILE SPEAKER-ADAPTIVE MODELS (E.G. SUPERVISED TOPLINE) IMPROVE WITH MORE TARGET SPEAKER DATA. WE REPORT THE TWO REFERENCE POINTS FROM THE CHALLENGE, ALONG WITH THE CHALLENGE WINNER [58] AND THREE OTHER SUBMISSIONS THAT USED NEURAL NETWORK IN AN UNSUPERVISED SETTING [59], [60], [61]. ALL VQ-VAE MODELS USE EXACTLY THE SAME HYPERPARAMETER SETUP (14 BIT TOKENS EXTRACTED AT 50 HZ WITH TIME-JITTER PROBABILITY 0.5), REGARDLESS OF THE AMOUNT OF UNLABELED TRAINING DATA (45 H, 24 H OR 2.4 H). THE TOP VQ-VAE RESULTS ROW (VQ-VAE TRAINED ON TARGET LANGUAGE, FEATURES EXTRACTED AT THE p_cond POINT) GIVES BEST RESULTS OVERALL. WE ALSO INCLUDE in italics RESULTS FOR DIFFERENT PROBE POINTS AND FOR VQ-VAES JOINTLY TRAINED ON ALL LANGUAGES. MULTILINGUAL TRAINING HELPS MANDARIN. WE ALSO OBSERVE THAT THE QUANTIZATION MOSTLY DISCARDS SPEAKER AND CONTEXT INFLUENCE. THE CONTEXT IS HOWEVER RECOVERED IN THE CONDITIONING SIGNAL WHICH COMBINES INFORMATION FROM LATENT VECTORS AT NEIGHBORING TIMESTEPS.

Within-speaker
                                   English (45h)      French (24h)       Mandarin (2.4h)
Model                              1s    10s   2m     1s    10s   2m     1s    10s   2m
Unsupervised baseline              12.0  12.1  12.1   12.5  12.6  12.6   11.5  11.5  11.5
Supervised topline                 6.5   5.3   5.1    8.0   6.8   6.8    9.5   4.2   4.0
VQ-VAE (per lang, MFCC, p_cond)    5.6   5.5   5.5    7.3   7.5   7.5    11.2  10.7  10.8
VQ-VAE (per lang, MFCC, p_bn)      6.2   6.0   6.0    7.5   7.3   7.6    10.8  10.5  10.6
VQ-VAE (per lang, MFCC, p_proj)    5.9   5.8   5.9    6.7   6.9   6.9    9.9   9.7   9.7
VQ-VAE (all lang, MFCC, p_cond)    5.8   5.8   5.8    8.0   7.9   7.8    9.2   9.1   9.2
VQ-VAE (all lang, MFCC, p_bn)      6.3   6.2   6.3    8.0   8.0   7.9    9.0   8.9   9.1
VQ-VAE (all lang, MFCC, p_proj)    5.8   5.7   5.8    7.1   7.0   6.9    7.4   7.2   7.1
VQ-VAE (all lang, fbank, p_proj)   6.0   6.0   6.0    6.9   6.8   6.8    6.8   6.6   6.6
Heck et al. [58]                   6.9   6.2   6.0    9.7   8.7   8.4    8.8   7.9   7.8
Chen et al. [59]                   8.5   7.3   7.2    11.2  9.4   9.4    10.5  8.7   8.5
Ansari et al. [60]                 7.7   6.8   N/A    10.4  N/A   8.8    10.4  9.3   9.1
Yuan et al. [61]                   9.0   7.1   7.0    11.9  9.5   9.5    11.1  8.5   8.2

Across-speaker
                                   English (45h)      French (24h)       Mandarin (2.4h)
Model                              1s    10s   2m     1s    10s   2m     1s    10s   2m
Unsupervised baseline              23.4  23.4  23.4   25.2  25.5  25.2   21.3  21.3  21.3
Supervised topline                 8.6   6.9   6.7    10.6  9.1   8.9    12.0  5.7   5.1
VQ-VAE (per lang, MFCC, p_cond)    8.1   8.0   8.0    11.0  10.8  11.1   12.2  11.7  11.9
VQ-VAE (per lang, MFCC, p_bn)      8.9   8.8   8.9    11.3  11.0  11.2   11.9  11.4  11.6
VQ-VAE (per lang, MFCC, p_proj)    9.1   9.0   9.0    11.9  11.6  11.7   11.0  10.6  10.7
VQ-VAE (all lang, MFCC, p_cond)    8.8   8.6   8.7    11.8  11.6  11.6   10.3  10.0  9.9
VQ-VAE (all lang, MFCC, p_bn)      9.4   9.2   9.3    11.8  11.7  11.8   9.9   9.7   9.7
VQ-VAE (all lang, MFCC, p_proj)    9.3   9.3   9.3    11.9  11.4  11.6   8.6   8.5   8.5
VQ-VAE (all lang, fbank, p_proj)   10.1  10.1  10.1   12.5  12.2  12.3   7.8   7.7   7.7
Heck et al. [58]                   10.1  8.7   8.5    13.6  11.7  11.3   8.8   7.4   7.3
Chen et al. [59]                   12.7  11.0  10.8   17.0  14.5  14.1   11.9  10.3  10.1
Ansari et al. [60]                 13.2  12.0  N/A    17.2  N/A   15.4   13.0  12.2  12.3
Yuan et al. [61]                   14.0  11.9  11.7   18.6  15.5  14.9   12.7  10.8  10.7

E. Hyperparameter impact

All VQ-VAE autoencoder hyperparameters were tuned on the LibriSpeech task using several grid-searches, optimizing for the highest phoneme recognition accuracy. We also validated these design choices on the English part of the ZeroSpeech challenge task. Indeed, we found that the proposed time-jitter regularization improved ZeroSpeech ABX scores for all input representations. Using MFCC or filterbank features yields better scores than using waveforms, and the model consistently obtains better scores when more tokens are used.

1) Time-jitter regularization: In Table III we analyze the effectiveness of the time-jitter regularization on VQ-VAE encodings and compare it to two variants of dropout: regular dropout applied to individual dimensions of the encoding and dropout applied randomly to the full encoding at individual time steps. Regular dropout does not force the model to separate information in neighboring timesteps. Step-wise dropout promotes encodings which are independent across timesteps and performs slightly worse than the time-jitter.

The proposed time-jitter regularization greatly improves token mapping accuracy and extends the range of token frame rates which perform well to include 50 Hz. While the LibriSpeech token accuracies are comparable at 25 Hz and 50 Hz, higher token emission frequencies are important for the ZeroSpeech AUD task, on which the 50 Hz model was noticeably better. This behavior is due to the fact that the 25 Hz model is prone to omitting short phones (Sec. IV-E6), which impacts the ABX results on the ZeroSpeech task.

We also analyzed information content at the four probe points for VQ-VAE, VAE, and simple dimensionality reduction AE bottleneck, shown in Figure 5. For all bottleneck mechanisms, the regularization limits the quality of filterbank reconstructions and increases the phoneme recognition accuracy in the constrained representation. However, this benefit is smaller after neighboring timesteps are combined in the p_cond probe point. Moreover, for VQ-VAE and VAE the regularization decreases gender prediction accuracy and makes the representation slightly less speaker-sensitive.

Footnote: The token copy probability of 0.12 keeps a given token with probability $0.88^2 = 0.77$, which roughly corresponds to a 0.23 per-timestep dropout rate.

[Figure 5: plot not reproduced. Panels: Filterbank, Phoneme, Gender, Speaker. Probe points: p_enc, p_proj, p_bn, p_cond. Horizontal axis: bottleneck. Legend: time-jitter probability.]

Fig. 5. Impact of the time-jitter regularization on information captured by representations at different probe points.

TABLE III
EFFECTS OF INPUT REPRESENTATION AND REGULARIZATION ON PHONEME RECOGNITION ACCURACY ON LIBRISPEECH, MEASURED AFTER 200K TRAINING STEPS. ALL MODELS EXTRACT 256 TOKENS.

Input features        Token rate   Regularization                  Accuracy
MFCC                  25 Hz        None                            52.5
MFCC                  25 Hz        Regular dropout p = 0.1         50.7
MFCC                  25 Hz        Regular dropout p = 0.2         49.1
MFCC                  25 Hz        Per-time step dropout p = 0.2   55.3
MFCC                  25 Hz        Per-time step dropout p = 0.3   55.7
MFCC                  25 Hz        Per-time step dropout p = 0.4   55.1
MFCC                  25 Hz        Time-jitter p = 0.08            56.2
MFCC                  25 Hz        Time-jitter p = 0.12            56.2
MFCC                  25 Hz        Time-jitter p = 0.16            56.1
MFCC                  50 Hz        None                            46.5
MFCC                  50 Hz        Time-jitter p = 0.5             56.1
log-mel spectrogram   25 Hz        None                            50.1
log-mel spectrogram   25 Hz        Time-jitter p = 0.12            53.6
raw waveform          30 Hz        None                            37.6
raw waveform          30 Hz        Time-jitter p = 0.12            48.1

2) Input representation: In this set of experiments we compared performance using different input representations: raw waveforms, log-mel spectrograms, or MFCCs.

features, especially MFCCs, perform better than waveforms, because by design they discard information about pitch and provide a degree of speaker invariance. Using such a reduced representation forces the encoder to transmit less information to the decoder, acting as an inductive bias toward a more speaker invariant latent encoding.

3) Output representation: We constructed an autoregressive decoder network that reconstructed filterbank features rather than raw waveform samples. Inspired by recent progress in text-to-speech systems, we implemented a Tacotron 2-like decoder [62] with a built-in information bottleneck on the autoregressive information flow, which was found to be critical in TTS applications. Similarly to Tacotron 2, the filterbank features were first processed by a small "pre-net"; we also applied generous amounts of dropout and configured the decoder to predict up to 4 frames in parallel. However, these modifications yielded at best 42% phoneme recognition accuracy, significantly lower than the other architectures described in this paper. The model was however an order of magnitude faster to train.

[Figure 6: plot not reproduced. Horizontal axis: WaveNet receptive field [ms], ticks at 1, 10, 100. Legend: prediction target (gender, phonemes).]

Fig. 6. Impact of decoder WaveNet receptive field on the properties of the VQ-VAE conditioning signal. The representation is significantly more gender invariant when the receptive field is larger than 10 ms. Frame-wise phoneme recognition accuracy peaks at about 125 ms. The depth and width of the WaveNet have a secondary effect (cf. points with the same RF).

Finally, we analyzed the impact of the size of the decoding WaveNet on the representation extracted by the VQ-VAE. We have found that overall receptive field (RF) has a larger impact than the depth or width of the WaveNet. In particular, a large change in the properties of the latent representation happens when the decoder's receptive field crosses about 10 ms. As shown in Figure 6, for smaller RFs, the conditioning signal contains more speaker information: gender prediction is close to 80%, while framewise phoneme prediction accuracy is only 55%. For larger RFs, gender prediction accuracy is about 60%, while phoneme prediction peaks near 65%. Finally, while the reconstruction log-likelihood improved with WaveNet depth up
The raw to 30 layers, the phoneme recognition accuracy plateaued with waveform encoder used 9 strided convolutional layers, which 20 layers. Since the WaveNet has the largest computational resulted in token extraction frequency of 30 Hz. We then cost we decided to keep the 20 layer configuration. replaced the waveform with a customary ASR data pipeline: 4) Decoder speaker conditioning: The WaveNet decoder 80 log-mel filterbank features extracted every 10ms from 25ms- generates samples based on three sources of information: the long windows and 13 MFCC features extracted from the mel- previously emitted samples (via the autoregressive connection), filterbank output, both augmented with their first and second global conditioning on speaker or other information which temporal derivatives. Using two strided convolution layers in is stationary in time, and on the time-varying representation the encoder led to a 25 Hz token rate for these models. extracted from the encoder. We found that disabling global The results are reported in the bottom of Table III. High-level speaker conditioning reduces phoneme classification accuracy Accuracy Accuracy Accuracy Recon. Error VQ-VAE VAE AE VQ-VAE VAE AE VQ-VAE VAE AE VQ-VAE VAE AE Prediction accuracy 10 by 3 percentage points. This further corroborates our findings An interesting future area for research would be investigating about disentanglement induced by the VQ-VAE bottleneck, methods to increase the model capacity to make better use of which biases the model to discard information that is available larger amounts of unlabeled data. in a more explicit form. Throughout our experiments we used The influence of the size of the dataset is also visible in a speaker-independent encoder. However, adapting the encoder the ZeroSpeech Challenge results (Table II): VQ-VAE models to the speaker might further improve the results. 
In fact, [58] obtained good performance on English (45 hours of training demonstrates improvements on the ZeroSpeech task using a data) and French (24 hours), but performed poorly on Mandarin speaker-adaptive approach. (2.5 hours). Moreover, on English and French we obtained the 5) Encoder hyperparameters: We experimented with tuning best results with models trained on monolingual data. On the number of encoder convolutional layers, as well as the Mandarin slightly better results were obtained using a model number of filters, and the filter length. In general, performance trained jointly on data from all languages. improved with larger encoders, however we established that the encoder’s receptive field must be carefully controlled, with V. RELATED WORK the best performing encoders seeing about 0.3 seconds of input VAEs for sequential data were introduced in [49]. The model signal for each generated token. used LSTM encoder and decoder, while the latent representation The effective receptive field can be controlled using two was formed from the last hidden state of the encoder. The model mechanisms: by carefully tuning the encoder architecture, or by proved useful for natural language processing tasks. However, it designing an encoder with a wide receptive field, but limiting also demonstrated the problem of latent representation collapse: the duration of signal segments seen during training to the when a powerful autoregressive decoder is used simultaneously desired receptive field. In this way the model never learns to with a penalty on the latent encoding, such as the KL prior, use its full capacity. When the model was trained on 2.5s long the VAE has a tendency to ignore the prior and act as if it segments, an encoder with receptive field of 0.3s had framewise were a purely autoregressive sequence model. 
This issue can be mitigated by changing the weight of the KL term, and limiting the amount of information on the autoregressive path by using word dropout [49]. Latent collapse can also be avoided in deterministic autoencoders, such as [64], which coupled a convolutional encoder to a powerful autoregressive WaveNet decoder [18] to learn a latent representation of music audio consisting of isolated notes from a variety of instruments.

We empirically validate that conditioning the decoder on speaker information results in encodings which are more speaker invariant. Moyer et al. [54] give a rigorous proof that this approach produces representations that are invariant to the explicitly provided information and relate it to domain-adversarial training, another technique designed to enforce invariance to a known nuisance factor [65].

phoneme recognition accuracy of 56.5%, while an encoder with a receptive field of 0.8s scored only 54.3%. When trained on segments of 0.3s, both models performed similarly.

6) Bottleneck bit rate: The speech VQ-VAE encoder can be seen as encoding a signal using a very low bit rate. To achieve a predetermined target bit rate, one can control both the token rate (i.e., by controlling the degree of downsampling in the encoder's strided convolutions), and the number of tokens (or equivalently the number of bits) extracted at every step. We found that the token rate is a crucial parameter which must be chosen carefully, with the best results after 200k training steps obtained at 50 Hz (56.0% phoneme recognition accuracy) and 25 Hz (56.3%). Accuracy drops abruptly at higher token rates (49.3% at 100 Hz), while lower rates miss very short phones (53% accuracy at 12.5 Hz).
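To make the relationship between token rate, codebook size, and bit rate concrete, here is a small Python sketch. It assumes a fixed-rate code in which every token costs log2(number of tokens) bits; the function name is illustrative, and the 256-token codebook in the loop is only an example size.

```python
import math


def bottleneck_bitrate(token_rate_hz, num_tokens):
    """Bits per second of a fixed-rate discrete bottleneck."""
    bits_per_token = math.log2(num_tokens)
    return token_rate_hz * bits_per_token


# Example codebook of 256 tokens (8 bits) at the token rates compared above.
for rate_hz in (12.5, 25, 50, 100):
    print(rate_hz, bottleneck_bitrate(rate_hz, 256))

# 14-bit tokens (16384 prototypes) extracted at 50 Hz.
print(bottleneck_bitrate(50, 2 ** 14))
```

Halving the token rate and doubling the bits per token give the same bit rate, which is why the text treats token rate as a separate design choice from codebook size.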
In contrast to the number of tokens, the dimensionality of the VQ-VAE embedding has a secondary effect on representation quality. We found 64 to be a good setting, with much smaller dimensions deteriorating performance for models with a small number of tokens and higher dimensionalities negatively affecting performance for models with a large number of tokens.

For completeness, we observe that even for the model with the largest inventory of tokens, the overall encoder bitrate is low: 14 bits at 50 Hz = 700 bps, which is on par with the lowest bitrate of classical speech codecs [63].

7) Training corpus size: We experimented with training models on subsets of the LibriSpeech training set, varying the size from 4.6 hours (1%) to 460 hours (100%). Training on 4.6 hours of data, phoneme recognition accuracy peaked at 50.5% at 100k steps and then deteriorated. Training on 9 hours led to a peak accuracy of 52.5% at 180k steps.

When applied to audio, the VQ-VAE uses the WaveNet decoder to free the latent representation from modeling information that is easily recoverable from the recent past [19]. It avoids the problem of posterior collapse by using a discrete latent code with a uniform prior which results in a constant KL penalty. We employ the same strategy to design the latent representation regularizer: rather than extending the cost function with a penalty term that can cause the latent space to collapse, we rely on random copies of the latent variables to prevent their co-adaptation and promote stability over time.

The randomized time-jitter regularization introduced in this paper is inspired by slow representations of data [48] and by dropout, which randomly removes neurons during training to prevent their co-adaptation [50]. It is also very similar to Zoneout [51], which relies on random time copies of selected neurons to regularize recurrent neural networks.
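The random-copy idea behind the time-jitter regularizer can be sketched as follows. This is one possible reading of the description, not the authors' implementation: it assumes two independent coin flips per time step (copy from the left neighbor, copy from the right neighbor), a random tie-break when both fire, and copies taken from the unmodified sequence so that no token moves more than one step. The function name and the tie-break rule are assumptions.

```python
import random


def time_jitter(latents, p=0.12, rng=random):
    """Return a jittered copy of a sequence of latent vectors or tokens.

    ASSUMED reading of the text: for every time step, two independent coin
    flips with probability p decide whether the step is overwritten by its
    left or by its right neighbor. Values are always read from the original
    sequence, so a token never travels more than one step.
    """
    n = len(latents)
    jittered = list(latents)
    for t in range(n):
        from_left = t > 0 and rng.random() < p
        from_right = t < n - 1 and rng.random() < p
        if from_left and from_right:
            # Tie-break is an assumption: choose either neighbor at random.
            from_left = rng.random() < 0.5
            from_right = not from_left
        if from_left:
            jittered[t] = latents[t - 1]
        elif from_right:
            jittered[t] = latents[t + 1]
    return jittered


tokens = list(range(10))
print(time_jitter(tokens, p=0.12, rng=random.Random(0)))
```

Under this reading an interior token survives with probability (1 - 0.12)^2, about 0.77, and the function would be applied only during training.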
When the size of the training set was increased past 23 hours the phoneme recognition reached 54% after around 900k steps. No further improvements were found by training on the full 460 hours of data. We did not observe any overfitting, and for best results trained models until reaching 900k steps with no early stopping.

from speaker characteristics. Furthermore, we observe that the latent collapse problem induced by bottlenecks which are too strong can be avoided by making the bottleneck strength a model hyperparameter, either removing it completely (as in the VQ-VAE), or by using the free-information VAE objective. To further improve representation quality, we introduced a time-jitter regularization scheme which limits the capacity of the latent code yet does not result in a collapse of the latent space. We hope that this can similarly improve performance of latent variable models used with auto-regressive decoders in other problem domains.

Several authors have recently proposed to model sequences with VAEs that use a hierarchy of variables. [66] explore a hierarchical latent space which separates sequence-dependent variables from those which are sequence-independent. Their model was shown to perform speaker conversion and to improve automatic speech recognition (ASR) performance in the presence of domain mismatch. [67] introduce a stochastic latent variable model for sequential data which also yields disentangled representations and allows content swapping between generated sequences. These other approaches could possibly benefit from regularizing the latent representation to achieve further information disentanglement.

Acoustic unit discovery systems aim at transducing the acoustic signal into a sequence of interpretable units akin to phones. They often involve clustering of acoustic frames, MFCC or neural network bottleneck features, regularized using a probabilistic prior.
DP-GMM [68] imposes a Dirichlet Process Both the VAE and VQ-VAE constrain the information prior over a Gaussian Mixture Model. Extending it with an bandwidth of the latent representation. However, the VQ-VAE HMM temporal structure for sub-phonetic units leads to the uses a quantization mechanism, which deterministically forces DP-HMM and the HDP-HMM [69], [70], [71]. HMM-VAE the encoding to be equal to a prototype, while the VAE limits proposes the use of a deep neural network instead of a GMM the amount of information by injecting noise. In our study, [72], [73]. These approaches enforce top-down constraints via the VQ-VAE resulted in better information separation than HMM temporal smoothing and temporal modeling. Linguistic the VAE. However, further experiments are needed to fully unit discovery models detect recurring speech patterns at a understand this effect. In particular, is this a consequence of word-like level, finding commonly repeated segments with a the quantization, or of the deterministic operation? constrained dynamic time warping [74]. We also observe that while the VQ-VAE produces a discrete In the segmental unsupervised speech recognition framework, representation, for best results it uses a token set so large that neural autoencoders were used to embed variable length speech it is impractical to assign a separate meaning to each one. In segments into a common vector space where they could be particular, in our ZeroSpeech experiments we used the dense clustered into word types [75]. [76] replace the segmental embedding representation of each token, which provided a autoencoder with a model that instead predicts a nearby more nuanced token similarity measure than simply using the speech segment and demonstrate that the representation shares token identity. Perhaps a more structured latent representation many properties with word embeddings. 
Coupled with an is needed, in which a small set of units can be modulated in a unsupervised word segmentation algorithm and unsupervised continuous fashion. mapping of word embeddings discovered on separate corpora Extensive hyperparameter evaluation indicated that opti- [77] the approach yielded an ASR system trained on unpaired mizing the receptive field sizes of the encoder and decoder speech and text data [78]. networks is important for good model performance. A multi- Several entries to the ZeroSpeech 2017 challenge relied scale modeling approach could furthermore separate the on neural networks for phonetic unit discovery. [61] trains prosodic information. Our autoencoding approach could also an autoencoder on pairs of speech segments found using an be combined with penalties that are more specialized to speech unsupervised term discovery system [79]. [59] first clustered processing. Introducing a HMM prior as in [73] could promote speech frames, then trained a neural network to predict the a latent representation which better mimics the temporal cluster IDs and used its hidden representation as features. phonetic structure of speech. [60] extended this scheme with features discovered by an autoencoder trained on MFCCs. ACKNOWLEDGMENTS The authors thank Tara Sainath, Ulfar Erlingsson, Aren VI. CONCLUSIONS Jansen, Sander Dieleman, Jesse Engel, Łukasz Kaiser, Tom We applied sequence autoencoders to speech modeling and Walters, Cristina Garbacea, and the Google Brain team for compared different information bottlenecks, including VAEs their helpful discussions and feedback. and VQ-VAEs. We carefully evaluated the induced latent representation using interpretability criteria as well as the ability REFERENCES to discriminate between similar speech sounds. The comparison [1] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of of bottlenecks revealed that discrete representations obtained data with neural networks,” Science, vol. 313, no. 
5786, 2006. using VQ-VAE preserved the most phonetic information [2] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in Proc. while also being the most speaker-invariant. The extracted International Conference on Machine Learning, 2008. representation allowed for accurate mapping of the extracted [3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification symbols into phonemes and obtained competitive performance with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105. on the ZeroSpeech 2017 acoustic unit discovery task. A similar [4] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, combination of VQ-VAE encoder and WaveNet decoder by V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Cho et al. had the best acoustic unit discovery performance in Proc. IEEE Conference on Computer Vision and Pattern Recognition, ZeroSpeech 2019 [80]. [5] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by We established that an information bottleneck is required jointly learning to align and translate,” in Proc. International Conference for the model to learn a representation that separates content on Learning Representations, 2015. 12 [6] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, [29] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are M. Krikun, Y. Cao, Q. Gao, K. Macherey, and et al, “Google’s neural features in deep neural networks?” in Advances in Neural Information machine translation system: Bridging the gap between human and Processing Systems, 2014, pp. 3320–3328. machine translation,” arXiv preprint arXiv:1609.08144, 2016. [30] K. Vesely, ` M. Karafiat, ´ F. Grezl, ´ M. Janda, and E. Egorova, “The [7] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with language-independent bottleneck features,” in Proc. 
Spoken Language deep recurrent neural networks,” in Proc. International Conference on Technology Workshop (SLT), 2012, pp. 336–341. Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 6645–6649. [31] D. Yu and M. L. Seltzer, “Improved bottleneck features using pretrained [8] C.-C.Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, deep neural networks,” in Proc. Interspeech, 2011. A. Kannan, R. J. Weiss, K. Rao, K. Gonina, N. Jaitly, B. Li, J. Chorowski, [32] B. McCann, J. Bradbury, C. Xiong, and R. Socher, “Learned in translation: and M. Bacchiani, “State-of-the-art speech recognition with sequence- Contextualized word vectors,” in Advances in Neural Information to-sequence models,” in Proc. International Conference on Acoustics, Processing Systems, 2017, pp. 6294–6305. Speech and Signal Processing (ICASSP), 2018. [33] S. R. Bowman, G. Angeli, C. Potts, and C. Manning, “A large annotated [9] W. Wang, N. Yang, F. Wei, B. Chang, and M. Zhou, “Gated self-matching corpus for learning natural language inference,” in Proc. Conference on networks for reading comprehension and question answering,” in Proc. Empirical Methods in Natural Language Processing, 2015. 55th Annual Meeting of the Association for Computational Linguistics [34] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes, (Volume 1: Long Papers), vol. 1, 2017, pp. 189–198. “Supervised learning of universal sentence representations from natural [10] A. W. Yu, D. Dohan, M.-T. Luong, R. Zhao, K. Chen, M. Norouzi, language inference data,” in Proc. Conference on Empirical Methods in and Q. V. Le, “QANet: Combining local convolution with global self- Natural Language Processing (EMNLP), September 2017, pp. 670–680. attention for reading comprehension,” in Proc. International Conference [35] C. M. Bishop, “Continuous latent variables,” in Pattern Recognition and on Learning Representations, 2018. Machine Learning. Springer, 2006, ch. 12. [11] M. D. Zeiler and R. 
Fergus, “Visualizing and understanding convolutional [36] D. D. Lee and H. S. Seung, “Learning the parts of objects by non-negative networks,” in European Conference on Computer Vision, 2014. matrix factorization,” Nature, vol. 401, no. 6755, p. 788, 1999. [12] M. Sundararajan, A. Taly, and Q. Yan, “Axiomatic attribution for deep [37] B. A. Olshausen and D. J. Field, “Emergence of simple-cell receptive networks,” in Proc. International Conference on Machine Learning, 2017. field properties by learning a sparse code for natural images,” Nature, [13] T. Nagamine and N. Mesgarani, “Understanding the representation vol. 381, no. 6583, p. 607, 1996. and computation of multilayer perceptrons: A case study in speech [38] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” in recognition,” in Proc. International Conference on Machine Learning, Proc. International Conference on Learning Representations, 2014. [39] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, [14] J. Chorowski, R. J. Weiss, R. A. Saurous, and S. Bengio, “On using S. Mohamed, and A. Lerchner, “Beta-VAE: Learning basic visual backpropagation for speech texture generation and voice conversion,” concepts with a constrained variational framework,” in Proc. International in Proc. International Conference on Acoustics, Speech and Signal Conference on Learning Representations, 2017. Processing (ICASSP), Apr. 2018. [40] A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy, “Deep variational [15] P. Swietojanski, A. Ghoshal, and S. Renals, “Unsupervised cross-lingual information bottleneck,” in Proc. International Conference on Learning knowledge transfer in DNN-based LVCSR,” in Proc. Spoken Language Representations, 2017. Technology Workshop (SLT), 2012, pp. 246–251. [41] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and [16] S. Thomas, M. L. Seltzer, K. Church, and H. Hermansky, “Deep neural M. 
Welling, “Improved variational inference with inverse autoregressive network features and semi-supervised training for low resource speech flow,” in Advances in Neural Information Processing Systems, 2016. recognition,” in Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 6704–6708. [42] Y. Bengio, N. Leonard, ´ and A. Courville, “Estimating or propagating [17] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “LibriSpeech: an gradients through stochastic neurons for conditional computation,” arXiv ASR corpus based on public domain audio books,” in Proc. International preprint arXiv:1308.3432, 2013. Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015. [43] D. Jurafsky and J. H. Martin, Speech and Language Processing (2nd [18] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, Edition). Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 2009. A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “WaveNet: [44] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, “Language modeling A generative model for raw audio,” arXiv preprint arXiv:1609.03499, with gated convolutional networks,” in Proc. International Conference on Machine Learning, 2017. [19] A. van den Oord, O. Vinyals, and K. Kavukcuoglu, “Neural discrete [45] S. Bai, J. Z. Kolter, and V. Koltun, “An empirical evaluation of generic representation learning,” in Advances in Neural Information Processing convolutional and recurrent networks for sequence modeling,” arXiv Systems, 2017, pp. 6309–6318. preprint arXiv:1803.01271, 2018. [20] E. Dunbar, X. N. Cao, J. Benjumea, J. Karadayi, M. Bernard, L. Besacier, [46] W.-N. Hsu, Y. Zhang, R. J. Weiss, H. Zen, Y. Wu, Y. Wang, Y. Cao, X. Anguera, and E. Dupoux, “The zero resource speech challenge 2017,” Y. Jia, Z. Chen, J. Shen, P. Nguyen, and R. Pang, “Hierarchical generative in Proc. Automatic Speech Recognition and Understanding Workshop modeling for controllable speech synthesis,” in Proc. 
International (ASRU), 2017. Conference on Learning Representations, 2019. [21] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning rep- [47] I. Gulrajani, K. Kumar, F. Ahmed, A. A. Ta¨ ıga, F. Visin, D. Vazquez, ´ resentations by back-propagating errors,” Nature, vol. 323, no. 6088, and A. Courville, “PixelVAE: A latent variable model for natural images,” in Proc. International Conference on Learning Representations, 2017. [22] H. Lee, C. Ekanadham, and A. Ng, “Sparse deep belief net model for [48] L. Wiskott and T. J. Sejnowski, “Slow feature analysis: Unsupervised visual area V2,” in Advances in Neural Information Processing Systems, learning of invariances,” Neural Computation, vol. 14, no. 4, 2002. [49] S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and [23] S. Dieleman and B. Schrauwen, “End-to-end learning for music audio,” S. Bengio, “Generating sentences from a continuous space,” in SIGNLL in Proc. International Conference on Acoustics, Speech and Signal Conference on Computational Natural Language Learning, 2016. Processing (ICASSP), 2014, pp. 6964–6968. [50] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdi- [24] N. Jaitly and G. Hinton, “Learning a better representation of speech nov, “Dropout: A simple way to prevent neural networks from overfitting,” soundwaves using restricted Boltzmann machines,” in Proc. International Journal of Machine Learning Research, vol. 15, no. 1, 2014. Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011. [51] D. Krueger, T. Maharaj, J. Kramar ´ , M. Pezeshki, N. Ballas, N. R. Ke, [25] Z. Tusk ¨ e, P. Golik, R. Schluter ¨ , and H. Ney, “Acoustic modeling with deep A. Goyal, Y. Bengio, A. Courville, and C. Pal, “Zoneout: Regularizing neural networks using raw time signal for LVCSR,” in Proc. Interspeech, RNNs by randomly preserving hidden activations,” in Proc. International Conference on Learning Representations, 2017. [26] D. Palaz, M. Magima Doss, and R. 
Collobert, “Analysis of CNN- [52] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” based speech recognition system using raw speech as input,” in Proc. in Proc. International Conference on Learning Representations, 2015. Interspeech, 2015. [53] B. T. Polyak and A. B. Juditsky, “Acceleration of stochastic approximation [27] T. N. Sainath, R. J. Weiss, A. Senior, K. W. Wilson, and O. Vinyals, by averaging,” SIAM Journal on Control and Optimization, vol. 30, no. 4, “Learning the speech front-end with raw waveform CLDNNs,” in Proc. Interspeech, 2015. pp. 838–855, 1992. [28] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: [54] D. Moyer, S. Gao, R. Brekelmans, A. Galstyan, and G. Ver Steeg, A Large-Scale Hierarchical Image Database,” in Proc. IEEE Conference “Invariant Representations without Adversarial Training,” in Advances in on Computer Vision and Pattern Recognition, 2009. Neural Information Processing Systems 31, 2018, pp. 9084–9093. 13 [55] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, [79] A. Jansen and B. Van Durme, “Efficient spoken term discovery using M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, randomized algorithms,” in Proc. Automatic Speech Recognition and and K. Vesely, “The Kaldi speech recognition toolkit,” in Proc. Automatic Understanding Workshop (ASRU), 2011, pp. 401–406. Speech Recognition and Understanding Workshop (ASRU), 2011. [80] E. Dunbar, R. Algayres, J. Karadayi, M. Bernard, J. Benjumea, X.-N. Cao, L. Miskic, C. Dugrain, L. Ondel, A. W. Black, L. Besacier, S. Sakti, and [56] T. Schatz, V. Peddinti, F. Bach, A. Jansen, H. Hermansky, and E. Dupoux, E. Dupoux, “The Zero Resource Speech Challenge 2019: TTS without T,” “Evaluating speech features with the minimal-pair ABX task: Analysis arXiv preprint arXiv:1904.11469, 2019, accepted to Interspeech 2019. of the classical MFC/PLP pipeline,” in Proc. Interspeech, 2013, pp. 1–5. [57] T. Schatz, V. 
Peddinti, X.-N. Cao, F. Bach, H. Hermansky, and E. Dupoux, “Evaluating speech features with the minimal-pair ABX task (ii): Resistance to noise,” in Proc. Interspeech, 2014. [58] M. Heck, S. Sakti, and S. Nakamura, “Unsupervised linear discriminant analysis for supporting DPGMM clustering in the zero resource scenario,” Jan Chorowski is an Associate Professor at Faculty Procedia Computer Science, vol. 81, pp. 73–79, 2016. of Mathematics and Computer Science at the Uni- [59] H. Chen, C.-C. Leung, L. Xie, B. Ma, and H. Li, “Multilingual bottle- versity of Wrocław. He received his M.Sc. degree in neck feature learning from untranscribed speech,” in Proc. Automatic electrical engineering from the Wrocław University of Speech Recognition and Understanding Workshop (ASRU), 2017. Technology, Poland and EE PhD from the University [60] T. Ansari, R. Kumar, S. Singh, and S. Ganapathy, “Deep learning methods of Louisville, Kentucky in 2012. He has worked for unsupervised acoustic modeling—leap submission to zerospeech chal- with several research teams, including Google Brain, lenge 2017,” in Proc. Automatic Speech Recognition and Understanding Microsoft Research and Yoshua Bengio’s lab at the Workshop (ASRU), 2017, pp. 754–761. University of Montreal. His research interests are [61] Y. Yuan, C. C. Leung, L. Xie, H. Chen, B. Ma, and H. Li, “Extracting applications of neural networks to problems which bottleneck features and word-like pairs from untranscribed speech for are intuitive for humans but difficult for machines, feature representation,” in Proc. Automatic Speech Recognition and such as speech and natural language processing. Understanding Workshop (ASRU), Dec 2017, pp. 734–739. [62] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. J. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgian- nakis, and Y. Wu, “Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions,” in Proc. International Conference on Ron J. 
Weiss is a software engineer at Google Acoustics, Speech and Signal Processing (ICASSP), 2018. where he has worked on content-based audio analysis, [63] X. Wang and C.-C. J. Kuo, “An 800 bps VQ-based LPC voice coder,” recommender systems for music, noise robust speech Journal of the Acoustical Society of America, vol. 103, no. 5, 1998. recognition, speech translation, and speech synthesis. [64] J. Engel, C. Resnick, A. Roberts, S. Dieleman, M. Norouzi, D. Eck, and Ron completed his Ph.D. in electrical engineering K. Simonyan, “Neural audio synthesis of musical notes with wavenet from Columbia University in 2009 where he worked autoencoders,” in Proc. International Conference on Machine Learning, in the Laboratory for the Recognition of Speech and 2017, pp. 1068–1077. Audio. From 2009 to 2010 he was a postdoctoral re- searcher in the Music and Audio Research Laboratory [65] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, at New York University. M. Marchand, and V. Lempitsky, “Domain-Adversarial Training of Neural Networks,” Journal of Machine Learning Research, vol. 17, no. 59, pp. 1–35, 2016. [66] W.-N. Hsu, Y. Zhang, and J. Glass, “Unsupervised learning of disentan- gled and interpretable representations from sequential data,” in Advances in Neural Information Processing Systems, 2017, pp. 1876–1887. [67] Y. Li and S. Mandt, “Disentangled sequential autoencoder,” in Proc. Samy Bengio (PhD in computer science, University International Conference on Machine Learning, 2018. of Montreal, 1993) is a research scientist at Google [68] H. Chen, C.-C. Leung, L. Xie, B. Ma, and H. Li, “Parallel Inference of since 2007. He currently leads a group of research Dirichlet Process Gaussian Mixture Models for Unsupervised Acoustic scientists in the Google Brain team, conducting Modeling: A Feasibility Study,” in Proc. Interspeech, 2015. research in many areas of machine learning such as [69] C.-y. Lee and J. 
Glass, “A Nonparametric Bayesian Approach to Acoustic deep architectures, representation learning, sequence Model Discovery,” in Proc. 50th Annual Meeting of the Association for processing, speech recognition, image understanding, Computational Linguistics (Volume 1: Long Papers), Jul. 2012, pp. 40–49. large-scale problems, adversarial settings, etc. He is the general chair for Neural Information [70] L. Ondel, L. Burget, and J. Cernocky, ´ “Variational Inference for Acoustic Processing Systems (NeurIPS) 2018, the main con- Unit Discovery,” Procedia Computer Science, vol. 81, Jan. 2016. ference venue for machine learning, was the program [71] R. Marxer and H. Purwins, “Unsupervised Incremental Online Learning chair for NeurIPS in 2017, is action editor of the Journal of Machine Learning and Prediction of Musical Audio Signals,” IEEE/ACM Transactions on Research and on the editorial board of the Machine Learning Journal, was Audio, Speech, and Language Processing, vol. 24, no. 5, May 2016. program chair of the International Conference on Learning Representations [72] J. Ebbers, J. Heymann, L. Drude, T. Glarner, R. Haeb-Umbach, and (ICLR 2015, 2016), general chair of BayLearn (2012-2015) and the Workshops B. Raj, “Hidden Markov Model Variational Autoencoder for Acoustic on Machine Learning for Multimodal Interactions (MLMI’2004-2006), as well Unit Discovery,” in Proc. Interspeech, Aug. 2017, pp. 488–492. as the IEEE Workshop on Neural Networks for Signal Processing (NNSP’2002), [73] T. Glarner, P. Hanebrink, J. Ebbers, and R. Haeb-Umbach, “Full Bayesian and on the program committee of several international conferences such as Hidden Markov Model Variational Autoencoder for Acoustic Unit NeurIPS, ICML, ICLR, ECML and IJCAI. Discovery,” in Proc. Interspeech, Sep. 2018, pp. 2688–2692. [74] A. S. Park and J. R. Glass, “Unsupervised Pattern Discovery in Speech,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 1, pp. 186–197, Jan. 2008. [75] H. 
Kamper, A. Jansen, and S. Goldwater, “A segmental framework Aar ¨ on van den Oord is a research scientist at for fully-unsupervised large-vocabulary speech recognition,” Computer DeepMind, London. Aaron ¨ completed his PhD at Speech & Language, vol. 46, pp. 154–174, 2017. the University of Ghent, Belgium in 2015. He [76] Y.-A. Chung and J. Glass, “Learning word embeddings from speech,” has worked on unsupervised representation learning, arXiv preprint arXiv:1711.01515, 2017. music recommendation, generative modeling with [77] G. Lample, A. Conneau, L. Denoyer, and M. Ranzato, “Unsupervised autoregressive networks and various applications of Machine Translation Using Monolingual Corpora Only,” in Proc. Inter- generative models such text-to-speech synthesis and national Conference on Learning Representations, 2018. data compression. [78] Y.-A. Chung, W.-H. Weng, S. Tong, and J. Glass, “Unsupervised cross- modal alignment of speech and text embedding spaces,” Advances in Neural Information Processing Systems, 2018.

ISSN: 2329-9290
eISSN: ARCH-3348
DOI: 10.1109/TASLP.2019.2938863

Abstract

Unsupervised speech representation learning using WaveNet autoencoders Jan Chorowski, Ron J. Weiss, Samy Bengio, Aaron ¨ van den Oord Abstract—We consider the task of unsupervised extraction speaker gender and identity, from phonetic content, properties of meaningful latent representations of speech by applying which are consistent with internal representations learned autoencoding neural networks to speech waveforms. The goal by speech recognizers [13], [14]. Such representations are is to learn a representation able to capture high level semantic desired in several tasks, such as low resource automatic speech content from the signal, e.g. phoneme identities, while being recognition (ASR), where only a small amount of labeled invariant to confounding low level details in the signal such as the underlying pitch contour or background noise. Since the training data is available. In such scenario, limited amounts learned representation is tuned to contain only phonetic content, of data may be sufficient to learn an acoustic model on the we resort to using a high capacity WaveNet decoder to infer representation discovered without supervision, but insufficient information discarded by the encoder from previous samples. to learn the acoustic model and a data representation in a fully Moreover, the behavior of autoencoder models depends on the supervised manner [15], [16]. kind of constraint that is applied to the latent representation. We compare three variants: a simple dimensionality reduction We focus on representations learned with autoencoders bottleneck, a Gaussian Variational Autoencoder (VAE), and a applied to raw waveforms and spectrogram features and discrete Vector Quantized VAE (VQ-VAE). We analyze the quality investigate the quality of learned representations on LibriSpeech of learned representations in terms of speaker independence, the [17]. 
We tune the learned latent representation to encode only ability to predict phonetic content, and the ability to accurately re- phonetic content and remove other confounding detail. However, construct individual spectrogram frames. Moreover, for discrete encodings extracted using the VQ-VAE, we measure the ease to enable signal reconstruction, we rely on an autoregressive of mapping them to phonemes. We introduce a regularization WaveNet [18] decoder to infer information that was rejected scheme that forces the representations to focus on the phonetic by the encoder. The use of such a powerful decoder acts content of the utterance and report performance comparable with as an inductive bias, freeing up the encoder from using its the top entries in the ZeroSpeech 2017 unsupervised acoustic unit capacity to represent low level detail and instead allowing it discovery task. to focus on high level semantic features. We discover that best Index Terms—autoencoder, speech representation learning, un- representations arise when ASR features, such as mel-frequency supervised learning, acoustic unit discovery cepstral coefficients (MFCCs) are used as inputs, while raw waveforms are used as decoder targets. This forces the system I. I NTRODUCTION to also learn to generate sample level detail which was removed Creating good data representations is important. The deep during feature extraction. Furthermore, we observe that the learning revolution was triggered by the development of Vector Quantized Variational Autoencoder (VQ-VAE) [19] hierarchical representation learning algorithms, such as stacked yields the best separation between the acoustic content and Restricted Boltzman Machines [1] and Denoising Autoencoders speaker information. We investigate the interpetability of VQ- [2]. 
However, recent breakthroughs in computer vision [3], VAE tokens by mapping them to phonemes, demonstrate [4], machine translation [5], [6], speech recognition [7], [8], the impact of model hyperparameters on interpretability and and language understanding [9], [10] rely on large labeled propose a new regularization scheme which improves the degree datasets and make little to no use of unsupervised representation to which the latent representation can be mapped to the phonetic content. Finally, we demonstrate strong performance on the learning. This has two drawbacks: first, the requirement of large ZeroSpeech 2017 acoustic unit discovery task [20], which human labeled datasets often makes the development of deep measures how discriminative a representation is to minimal learning models expensive. Second, while a deep model may phonetic changes within an utterance. excel at solving a given task, it yields limited insights into the problem domain, with main intuitions typically consisting of visualizations of salient input patterns [11], [12], a strategy that II. R EPRESENTATION L EARNING WITH NEURAL NETWORKS is applicable only to problem domains that are easily solved Neural networks are hierarchical information processing by humans. models that are typically implemented using layers of computa- In this paper we focus on evaluating and improving un- tional units. Each layer can be interpreted as a feature extractor supervised speech representations. Specifically, we focus on whose outputs are passed to upstream units [21]. Especially in representations that separate selected speaker traits, specifically the visual domain, features learned with neural networks have J. Chorowski is with the Institute of Computer Science, University of been shown to create a hierarchy of visual atoms [11] that Wrocław, Poland e-mail: jan.chorowski@cs.uni.wroc.pl. match some properties of the visual cortex [22]. Similarly, when R. Weiss and S. Bengio are with Google Research. A. 
van den Oord is with DeepMind. Email: {ronw, bengio, avdnoord}@google.com.

arXiv:1901.08810v2 [cs.LG] 11 Sep 2019

applied to audio waveforms, neural networks have been shown to learn auditory-like frequency decompositions on music [23] and speech [24], [25], [26], [27] in their lower layers.

A. Supervised feature learning

Neural networks can learn useful data representations in both supervised and unsupervised manners. In the supervised case, features learned on large datasets are often directly useful in similar but data-poor tasks. For instance, in the visual domain, features discovered on ImageNet [28] are routinely used as input representations in other computer vision tasks [29]. Similarly, the speech community has used bottleneck features extracted from networks trained on phoneme prediction tasks [30], [31] as feature representations for speech recognition systems. Likewise, in natural language processing, universal text representations can be extracted from networks trained for machine translation [32] or language inference [33], [34].

B. Unsupervised feature learning

In this paper we focus on unsupervised feature learning. Since no training labels are available, we investigate autoencoders, i.e., networks which are tasked with reconstructing their inputs. Autoencoders use an encoding network to extract a latent representation, which is then passed through a decoding network to recover the original data. Ideally, the latent representation preserves the salient features of the original data, while being easier to analyze and work with, e.g. by disentangling different factors of variation in the data and discarding spurious patterns (noise). These desirable qualities are typically obtained through a judicious application of regularization techniques and constraints or bottlenecks (we use the two terms interchangeably). The representation learned by an autoencoder is thus subject to two competing forces. On the one hand, it should provide the decoder with the information necessary for perfect reconstruction and thus capture in the latents as much of the input data characteristics as possible. On the other hand, the constraints force some information to be discarded, preventing the latent representation from being trivial to invert, e.g. by exactly passing through the input. Thus the bottleneck is necessary to force the network to learn a non-trivial data transformation.

Reducing the dimensionality of the latent representation can serve as a basic constraint applied to the latent vectors, with the autoencoder acting as a nonlinear variant of linear low-rank data projections, such as PCA or SVD [35]. However, such representations may be difficult to interpret because the reconstruction of an input depends on all latent features [36]. In contrast, dictionary learning techniques, such as sparse [37] and non-negative [36] decompositions, express each input pattern using a combination of a small number of selected features out of a larger pool, which facilitates their interpretability. Discrete feature learning using vector quantization can be seen as an extreme form of sparseness in which the reconstruction uses only one element from the dictionary.

The Variational Autoencoder (VAE) [38] proposes a different interpretation of feature learning which follows a probabilistic framework. The autoencoding network is derived from a latent-variable generative model. First, a latent vector $z$ is sampled from a prior distribution $p(z)$ (typically a multidimensional normal distribution). Then the data sample $x$ is generated using a deep decoder neural network with parameters $\theta$ that computes $p(x|z;\theta)$. However, computing the exact posterior distribution $p(z|x)$ that is needed during maximum likelihood training is difficult. Instead, the VAE introduces a variational approximation to the posterior, $q(z|x;\phi)$, which is modeled using an encoder neural network with parameters $\phi$. Thus the VAE resembles a traditional autoencoder, in which the encoder produces distributions over latent representations, rather than deterministic encodings, while the decoder is trained on samples from this distribution. Encoding and decoding networks are trained jointly to maximize a lower bound on the log-likelihood of data point $x$ [38], [39]:

$$J_{VAE}(\theta, \phi; x) = \mathbb{E}_{q(z|x;\phi)}\left[\log p(x|z;\theta)\right] - \beta\, D_{KL}\left(q(z|x;\phi)\,\|\,p(z)\right). \tag{1}$$

We can interpret the two terms of Eq. (1) as the autoencoder's reconstruction cost augmented with a penalty term applied to the hidden representation. In particular, the KL divergence expresses the amount of information in nats which the latent representation carries about the data sample. Thus, it acts as an information bottleneck [40] on the latent representation, where $\beta$ controls the trade-off between reconstruction quality and the representation simplicity.

An alternative formulation of the VAE objective explicitly constrains the amount of information contained in the latent representation [41]:

$$J_{VAE}(\theta, \phi; x) = \mathbb{E}_{q(z|x;\phi)}\left[\log p(x|z;\theta)\right] - \max\left(B,\, D_{KL}\left(q(z|x;\phi)\,\|\,p(z)\right)\right), \tag{2}$$

where the constant $B$ corresponds to the amount of free information in $q$, because the model is only penalized if it transmits more than $B$ nats over the prior in the distribution over the latents. Please note that for convenience we will often refer to information content using units of bits instead of nats.

A recently proposed modification of the VAE, called the Vector Quantized VAE [19], replaces the continuous and stochastic latent vectors with deterministically quantized versions. The VQ-VAE maintains a number of prototype vectors $\{e_i,\ i = 1, \ldots, K\}$. During the forward pass, representations produced by the encoder are replaced with their closest prototypes. Formally, let $z_e(x)$ be the output of the encoder prior to quantization. VQ-VAE finds the nearest prototype $q(x) = \arg\min_i \|z_e(x) - e_i\|_2^2$ and uses it as the latent representation $z_q(x) = e_{q(x)}$, which is passed to the decoder. When using the model in downstream tasks, the learned representation can therefore be treated either as a distributed representation in which each sample is represented by a continuous vector, or as a discrete representation in which each sample is represented by the prototype ID (also called the token ID).

During the backward pass, the gradient of the loss with respect to the pre-quantized embedding is approximated using the straight-through estimator [42], i.e., $\frac{\partial L}{\partial z_e(x)} \approx \frac{\partial L}{\partial z_q(x)}$.¹ The prototypes are trained by extending the learning objective with terms which optimize quantization. Prototypes are pulled toward the encoder vectors they replace, and an auxiliary cost, dubbed the commitment loss, encourages the encoder to produce vectors which lie close to prototypes. Without the commitment loss, VQ-VAE training can diverge by emitting representations with unbounded magnitude. Therefore, VQ-VAE is trained using a sum of three loss terms: the negative log-likelihood of the reconstruction, which uses the straight-through estimator to bring the gradient from the decoder to the encoder, and two VQ-related terms: the distance from each prototype to its assigned vectors and the commitment cost [19]:

$$L = -\log p\left(x \mid z_q(x)\right) + \left\|\mathrm{sg}\left(z_e(x)\right) - e_{q(x)}\right\|_2^2 + \gamma \left\|z_e(x) - \mathrm{sg}\left(e_{q(x)}\right)\right\|_2^2, \tag{3}$$

where $\mathrm{sg}(\cdot)$ denotes the stop-gradient operation, which zeros the gradient with respect to its argument during the backward pass, and $\gamma$ weights the commitment cost.

¹ In TensorFlow this can be conveniently implemented using $z_q(x) = z_e(x) + \mathrm{stop\_gradient}(e_{q(x)} - z_e(x))$.

The quantization within the VQ-VAE acts as an information bottleneck.
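As a rough illustration of the nearest-prototype lookup and of the two VQ-related terms in Eq. (3), the following pure-Python sketch computes their forward values for a single encoder vector. The function names, the toy prototypes, and the `commitment_weight` parameter name are assumptions made for this example and are not taken from the paper. Stop-gradient only changes how gradients flow, so in this forward-only sketch both terms reduce to the same squared distance and differ only by the weight.

```python
from typing import Sequence, Tuple

Vector = Sequence[float]


def squared_distance(a: Vector, b: Vector) -> float:
    """Squared Euclidean distance between two equal-length vectors."""
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))


def quantize(z_e: Vector, prototypes: Sequence[Vector]) -> Tuple[int, Vector]:
    """Return the token ID of the prototype nearest to z_e, and that prototype."""
    token_id = min(
        range(len(prototypes)),
        key=lambda i: squared_distance(z_e, prototypes[i]),
    )
    return token_id, prototypes[token_id]


def vq_terms(
    z_e: Vector,
    prototypes: Sequence[Vector],
    commitment_weight: float = 0.25,  # hypothetical name for the commitment weight
) -> Tuple[float, float]:
    """Forward values of the prototype-distance term and the commitment term.

    No automatic differentiation is modeled here: in a real framework the first
    term would update only the prototype and the second only the encoder.
    """
    _, e_q = quantize(z_e, prototypes)
    distance = squared_distance(z_e, e_q)
    return distance, commitment_weight * distance


# Toy example: three 2-D prototypes and one encoder output.
prototypes = [[0.0, 0.0], [1.0, 1.0], [-1.0, 2.0]]
z_e = [0.5, 1.0]

token_id, z_q = quantize(z_e, prototypes)
codebook_term, commitment_term = vq_terms(z_e, prototypes)
print(token_id, z_q, codebook_term, commitment_term)
```

In an autodiff framework, the straight-through estimator would additionally pass the decoder gradient from the selected prototype back to the encoder output; that part is deliberately left out of this sketch.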
The encoder can be interpreted as a probabilistic model which puts all probability mass on the selected discrete token (prototype ID). Assuming a uniform prior distribution over $K$ tokens, the KL divergence is constant and equal to $\log K$. Therefore, the KL term does not need to be included in the VQ-VAE training criterion in Eq. (3) and instead becomes a hyperparameter tied to the size of the prototype inventory.

The VQ-VAE was qualitatively shown to learn a representation which separated the phonetic content within an utterance from the identity of the speaker [19]. Moreover, the discovered tokens could be mapped to phonemes in a limited setting.

C. Autoencoders for sequential data

Sequential data, such as speech or text, often contain local dependencies that can be exploited by generative models. In fact, purely autoregressive models of sequential data, which predict the next observation based on recent history, are very successful. For text, these correspond to n-gram models [43] and convolutional neural language models [44], [45]. Similarly, WaveNet [18] is a state-of-the-art autoregressive model of time-domain waveform samples for text-to-speech synthesis.

A downside of such autoregressive models is that they do not explicitly produce latent representations of the data. However, it is possible to combine an autoregressive sequence generation model with an encoder tasked with extraction of latent representations. Depending on the use case, the encoder can process the whole utterance, emit a single latent vector and feed it to an autoregressive decoder [33], [46], or the encoder can periodically emit vectors of latent features to be consumed by the decoder [19], [47]. We concentrate on the latter solution.

Training mixed latent variable and autoregressive models is prone to latent space collapse, in which the decoder learns to ignore the constrained latent representations and only uses the unconstrained signal coming through the autoregressive path. For the VAE, this collapse can be prevented by annealing the weight of the KL term and using the free-information formulation in Eq. (2). The VQ-VAE is naturally resilient to the latent collapse because the KL term is a hyperparameter which is not optimized using gradient training of a given model. We defer further discussion of this topic to Section V.

[Figure 1: block diagram of the encoder (MFCC + d + a feature extraction, convolutional and ReLU layers with 768 units), the three bottleneck variants (AE, VAE, VQ-VAE), and the WaveNet decoder (jitter, Conv(128), upsampling to 16 kHz, concatenation with the one-hot speaker vector, two WaveNet cycles of 10 layers, two ReLU(256) layers, softmax), with probe points $p_{enc}$, $p_{proj}$, $p_{bn}$ and $p_{cond}$.]

Fig. 1. The proposed model is conceptually divided into 3 parts: an encoder (green), made of a residual convnet that computes a stream of latent vectors (typically every 10 ms or 20 ms) from a time-domain waveform sampled at 16 kHz, which are passed through a bottleneck (red) before being used to condition a WaveNet decoder (blue) which reconstructs the waveform using two additional information streams: an autoregressive stream which predicts the next sample based on past samples, and global conditioning which represents the identity of the input speaker (one out of $N_s$ total training speakers). We experiment with three bottleneck variants: a simple dimensionality reduction (AE), a sampling layer with an additional Kullback-Leibler penalty term (VAE), or a discretization layer (VQ-VAE). Intuitively, this bottleneck encourages the encoder to discard portions of the latent representation which the decoder can infer from the two other information streams. For all layers, numbers in parentheses indicate the number of output channels, and subscripts denote the filter length. Locations of "probe" points which are used in Section IV to evaluate the quality of the learned representation are denoted with black dots.

III. MODEL DESCRIPTION

The architecture of our model is presented in Figure 1. The encoder reads a sequence of either raw audio samples or audio features² and extracts a sequence of hidden vectors, which are passed through a bottleneck to become a sequence of latent representations. The frequency at which the latent vectors are extracted is governed by the number of strided convolutions applied by the encoder.

² To keep the autoencoder viewpoint, the feature extractor can be interpreted as a fixed signal processing layer in the encoder.

The decoder reconstructs the utterance by conditioning a WaveNet [18] network on the latent representation extracted by the encoder and, separately, on a speaker embedding. Explicitly conditioning the decoder on speaker identity frees the encoder from having to capture speaker-dependent information in the latent representation. Specifically, the decoder (i) takes the encoder's output, (ii) optionally applies a stochastic regularization to the latent vectors (see Section III-A), (iii) combines latent vectors extracted at neighboring time steps using convolutions, and (iv) upsamples them to the output frequency. Waveform samples are reconstructed with a WaveNet that combines all conditioning sources: autoregressive information about past samples, global information about the speaker, and latent information about past and future samples extracted by the encoder. We find that the encoder's bottleneck and the proposed regularization are crucial in extracting nontrivial representations of data. With no bottleneck, the model is prone to learn a simple reconstruction strategy which makes verbatim copies of future samples. We also note that the encoder is speaker independent and requires only speech data, while the decoder also requires speaker information.

We consider three forms of bottleneck: (i) simple dimensionality reduction, (ii) a Gaussian VAE with different latent representation dimensionalities and different capacities following Eq. (2), and (iii) a VQ-VAE with different numbers of quantization prototypes. All bottlenecks are optionally followed by the dropout-inspired time-jitter regularization described below. Furthermore, we experiment with different input and output representations, using raw waveforms, log-mel filterbank, and mel-frequency cepstral coefficient (MFCC) features which discard pitch information present in the spectrogram.

A. Time-jitter regularization

We would like the model to learn a representation of speech which corresponds to the slowly-changing phonetic content within an utterance: a mostly constant signal that can abruptly change at phoneme boundaries.

Inspired by slow feature analysis [48], we first experimented with penalizing time differences between encoder representations either before or after the bottleneck. However, this regularization resulted in a collapse of the latent space: the model learned to output a constant encoding. This is a common problem of sequential VAEs that use loss terms to regularize the latent encoding [49].

Reconsidering the problem, we realized that we want each frame's representation to correspond to a meaningful phonetic unit. Thus we want to prevent the system from using consecutive latent vectors as individual units. Put differently, we want to prevent latent vector co-adaptation. We therefore introduce a dropout-inspired [50] time-jitter regularizer, also reminiscent of Zoneout [51] regularization for recurrent networks. During training, each latent vector can replace either one or both of its neighbors. As in dropout, this prevents the model from relying on consistency across groups of tokens. Additionally, this regularization also promotes latent representation stability over time: a latent vector extracted at time step $t$ must strive to also be useful at time steps $t - 1$ or $t + 1$. In fact, the regularization was crucial for reaching good performance on ZeroSpeech at higher token extraction frequencies.

The regularization layer is inserted right after the encoder's bottleneck (i.e., after dimensionality reduction for the regular autoencoder, after sampling a realization of the latent layer for the VAE, and after discretization for the VQ-VAE). It is only enabled during training. For each time step we independently sample whether it is to be replaced with the token right after or before it. We do not copy a token more than one timestep.

IV. EXPERIMENTS

We evaluated models on two datasets: LibriSpeech [17] (clean subset) and ZeroSpeech 2017 Contest Track 1 data [20]. Both datasets have similar characteristics: multiple speakers, clean, read speech (sourced from audio books) recorded at a sampling rate of 16 kHz. Moreover the ZeroSpeech challenge controls the amount of per-speaker data with the majority of the data being uttered by only a few speakers.

Initial experiments, presented in Section IV-B, compare different bottleneck variants and establish what type of information from the input audio is preserved in the continuous latent representations produced by the model at the four different probe points pictured in Figure 1. Using the representation computed at each probe point, we measure performance on several prediction tasks: phoneme prediction (per-frame accuracy), speaker identity and gender prediction accuracy, and $L_2$ reconstruction error of spectrogram frames. We establish that the VQ-VAE learns latent representations with the strongest disentanglement between the phonetic content and speaker identity, and focus on this architecture in the following experiments.

In Section IV-C we analyze the interpretability of VQ-VAE tokens by mapping each discrete token to the most frequent corresponding phoneme in a forced alignment of a small labeled data set (LibriSpeech dev) and report the accuracy of the mapping on a separate set (LibriSpeech test). Intuitively, this captures the interpretability of individual tokens.

We then apply the VQ-VAE to the ZeroSpeech 2017 acoustic unit discovery task [20] in Section IV-D. This task evaluates how discriminative the representation is with respect to the phonetic class. Finally, in Section IV-E we measure the impact of different hyperparameters on performance.

A. Default model hyperparameters

Our best models used MFCCs as the encoder input, but reconstructed raw waveforms at the decoder output. We used standard 13 MFCC features extracted every 10 ms (i.e., at a rate of 100 Hz) and augmented with their temporal first and second derivatives. Such features were originally designed for speech recognition and are mostly invariant to pitch and similar confounding detail in the audio signal. The encoder had 9 layers each using 768 units with ReLU activation, organized into the following groups: 2 preprocessing convolution layers with filter length 3 and residual connections, 1 strided convolution length reduction layer with filter length 4 and stride 2 (downsampling the signal by a factor of two), followed by 2 convolutional layers with length 3 and residual connections, and finally 4 feedforward ReLU layers with residual connections. The resulting latent vectors were extracted at 50 Hz (i.e., every second frame), with each latent vector depending on a receptive field of 16 input frames. We also used an alternative encoder with two length reduction layers, which extracted latent representations at 25 Hz with a receptive field of 30 frames.

When unspecified, the latent representation was 64 dimensional and when applicable constrained to 14 bits. Furthermore, for the VQ-VAE we used the recommended commitment cost weight $\gamma = 0.25$ [19].

The decoder applied the randomized time-jitter regularization (see Section III-A). During training each latent vector was replaced with either of its neighbors with probability 0.12. The jittered latent sequence was passed through a single convolutional layer with filter length 3 and 128 hidden units to mix information across neighboring timesteps. The representation was then upsampled 320 times (to match the 16 kHz audio sampling rate) and concatenated with a one-hot vector representing the current speaker to form the conditioning input of an autoregressive WaveNet [18]. The WaveNet was composed of 20 causal dilated convolution layers, each using 368 gated units with residual connections, organized into two "cycles" of 10 layers with dilation rates $1, 2, 4, \ldots, 2^9$. The conditioning signal was passed separately into each layer. The signal from each layer of the WaveNet was passed to the output using skip-connections. Finally, the signal was passed through 2 ReLU layers with 256 units. A softmax was applied to compute the next sample probability. We used 256 quantization levels after mu-law companding [18].

All models were trained on minibatches of 64 sequences of length 5120 time-domain samples (320 ms) sampled uniformly from the training dataset. Training a single model on 4 Google Cloud TPUs (16 chips) took a week. We used the Adam optimizer [52] with initial learning rate $4 \times 10^{-4}$, which was halved after 400k, 600k, and 800k steps. Polyak averaging [53] was applied to all checkpoints used for model evaluation.

[Figure 2: panels for filterbank reconstruction error and for phoneme, gender, and speaker prediction accuracy at the probe points $p_{enc}$, $p_{proj}$, $p_{bn}$ and $p_{cond}$. Bottlenecks: AE, VAE (D = 4, 8, 16, 32), VQ-VAE. Horizontal axes: latent dimensions; VAE free bits / VQ-VAE bits per token.]

Fig. 2. Accuracy of predicting signal characteristics at various probe locations in the network. Among the three bottlenecks evaluated, VQ-VAE discards the most speaker-related information at the bottleneck, while preserving the most phonetic information. For all bottlenecks, the representation coming out of the encoder yields over 70% accurate framewise phoneme predictions. Both the simple AE and VQ-VAE preserve this information in the bottleneck (the accuracy drops to 50%-60% depending on the bottleneck's strength). However, the VQ-VAE discards almost all speaker information (speaker classification accuracy is close to 0% and gender prediction close to 50%). This causes the VQ-VAE representation to perform best on the acoustic unit discovery task: the representation captures the phonetic content while being invariant to speaker identity.

[Figure 3: phoneme prediction accuracy plotted against gender prediction accuracy. Probe points: enc, proj, bn, cond. Bottlenecks: AE, VAE (D = 32), VQ-VAE.]

Fig. 3. Comparison of gender and phoneme prediction accuracy for different bottleneck types and probe points. The decoder is conditioned on the speaker, thus the gender information can be recovered and the bottleneck should discard it. While this information is present at the $p_{enc}$ probe, the AE and VAE models tend to discard gender and phoneme information similarly at the other probe points. On the other hand, VQ-VAE selectively discards gender information.

B. Bottleneck comparison

We train models on LibriSpeech and analyze the information captured in the hidden representations surrounding the autoencoder bottleneck at each of the four probe points shown in Figure 1:

... information better than simple dimensionality reduction, but not as well as VQ-VAE. The VAE discards phonetic and speaker information more uniformly than VQ-VAE: at $p_{bn}$, VAE's phoneme predictions are less accurate, while its gender predictions are more accurate. Moreover, combining information across a wider receptive field at $p_{cond}$ does not improve phoneme recognition as much as in VQ-VAE models. The sensitivity to the bottleneck dimensionality, seen in Figure 2, is also surprising, with narrower VAE bottlenecks discarding less information than wider ones. This may be due to the stochastic operation of the VAE: to provide the same KL divergence as at low bottleneck dimensions, more noise needs to be added at high dimensions. This noise may mask information present in the representation.

Based on these results we conclude that the VQ-VAE bottleneck is most appropriate for learning latent representations which capture phonetic content while being invariant to the underlying speaker identity.

C. VQ-VAE token interpretability

Up to this point we have used the VQ-VAE as a bottleneck that quantizes latent vectors. In this section we seek an interpretation of the discrete prototype IDs, evaluating whether VQ-VAE tokens can be mapped to phonemes, the underlying discrete constituents of speech sounds. Example token IDs are pictured in the middle pane of Figure 4, where we can see that the token 11 is consistently associated with the transient "T" phone. To evaluate whether other tokens have similar interpretations, we measured the frame-wise phoneme recognition accuracy in which each token was mapped to one out of 41 phonemes. We used the 460 hour clean LibriSpeech training set for unsupervised training, and used labels from the clean dev subset to associate each token with the most probable phoneme. We evaluated the mapping by computing frame-wise phone recognition accuracy on the clean test set at a frame rate of 100 Hz. The ground-truth phoneme boundaries were obtained from forced alignments using the Kaldi tri6b model from the s5 LibriSpeech recipe [55].

Table I shows performance of the configuration which obtained the best accuracy mapping VQ-VAE tokens to phonemes on LibriSpeech. Recognition accuracy is given at two time points: after 200k gradient descent steps, when the relative performance of models can be assessed, and after 900k steps, when the models have converged. We did not observe overfitting with longer training times. Predicting the most frequent silence phoneme for all frames set an accuracy lower bound at 16%. A model discriminatively trained on the full 460 hour training set to predict phonemes with the same architecture as the 25 Hz encoder achieved 80% framewise phoneme recognition accuracy, while a model with no time-reduction layers set the upper bound at 88%.

TABLE I. LibriSpeech frame-wise phoneme recognition accuracy (%). VQ-VAE models consume MFCC features and extracted tokens at 25 Hz.

| Num tokens | 256 | 512 | 1024 | 2048 | 4096 | 8192 | 16384 | 32768 |
|---|---|---|---|---|---|---|---|---|
| Bits | 8 | 9 | 10 | 11 | 12 | 13 | 14 | 15 |
| 200k train steps | 56.7 | 58.3 | 59.7 | 60.3 | 60.7 | 61.2 | 61.4 | 61.7 |
| 900k train steps | 58.6 | 61.0 | 61.9 | 63.3 | 63.8 | 63.9 | 64.3 | 64.5 |

Table I indicates that the mapping accuracy improves with the number of tokens, with the best model reaching 64.5% accuracy using 32768 tokens. However, the largest accuracy gain occurs at 4096 tokens, with diminishing returns as the number of tokens is further increased. This result is in rough correspondence with the 5760 tied triphone states used in the Kaldi tri6b model.

We also note that increasing the number of tokens does not trivially lead to improved accuracies, because we measure generalization, and not cluster purity. In the limit of assigning a different token to each frame, the accuracy will be poor because of overfitting to the small development set on which we establish the mapping. However, in our experiments we consistently observed improved accuracy.

D. Unsupervised ZeroSpeech 2017 acoustic unit discovery

The ZeroSpeech 2017 phonetic unit discovery task [20] evaluates a representation's ability to discriminate between different sounds, rather than the ease of mapping the representation to predefined phonetic units. It is therefore complementary to the phoneme classification accuracy metric used in the previous section. The ZeroSpeech evaluation scheme uses the minimal pair ABX test [56], [57], which assesses the model's ability to discriminate between pairs of three-phoneme-long segments of speech that differ only in the middle phone (e.g. "get" and "got"). We trained the models on the provided training data (45 hours for English, 24 hours for French and 2.5 hours for Mandarin) and evaluated them on the test data using the official evaluation scripts. To ensure that we do not overfit to the ZeroSpeech task we only considered the best hyperparameter settings found on LibriSpeech (cf. Section IV-E).³ Moreover, to maximally abide by the ZeroSpeech convention, we used the same hyperparameters for all languages, denoted as VQ-VAE (per lang, MFCC, $p_{cond}$) in Table II.

³ The comparison with other systems from the challenge is fair, because according to the ZeroSpeech experimental protocol, all participants were encouraged to tune their systems on the three languages that we use (English, French, and Mandarin), while the final evaluation used two surprise languages for which we do not have the labels required for evaluation.

On English and French, which come with sufficiently large training datasets, we achieve results better than the top contestant [58], despite using a speaker independent encoder.

The results are consistent with our analysis of information separation performed by the VQ-VAE bottleneck: in the more challenging across-speaker evaluation, the best performance uses the $p_{cond}$ representation, which combines several neighboring frames of the bottleneck representation (VQ-VAE, (per lang, MFCC, $p_{cond}$) in Table II). Comparing within- and across-speaker results is similarly consistent with the observations in Section IV-B. In the within-speaker case, it is not necessary to disentangle speaker identity from phonetic content, so the quantization between the $p_{proj}$ and $p_{bn}$ probe points hurts performance (although on English this is corrected by considering the broader context at $p_{cond}$). In the across-speaker case, quantization improves the scores on English and French because the gain from discarding the confounding speaker information offsets the loss of some phonetic details. Moreover, the discarded phonetic information can be recovered by mixing neighboring timesteps at $p_{cond}$.

TABLE II. ZeroSpeech 2017 phonetic unit discovery ABX scores reported across and within speakers (lower is better). The VQ-VAE encoder is speaker independent and thus its results do not change with the amount of test speaker data (1 s, 10 s, or 2 m), while speaker-adaptive models (e.g. the supervised topline) improve with more target speaker data. We report the two reference points from the challenge, along with the challenge winner [58] and three other submissions that used neural networks in an unsupervised setting [59], [60], [61]. All VQ-VAE models use exactly the same hyperparameter setup (14 bit tokens extracted at 50 Hz with time-jitter probability 0.5), regardless of the amount of unlabeled training data (45 h, 24 h or 2.4 h). The top VQ-VAE results row (VQ-VAE trained on target language, features extracted at the $p_{cond}$ point) gives best results overall. We also include in italics results for different probe points and for VQ-VAEs jointly trained on all languages. Multilingual training helps Mandarin. We also observe that the quantization mostly discards speaker and context influence. The context is however recovered in the conditioning signal, which combines information from latent vectors at neighboring timesteps.

Within-speaker (En = English, 45 h; Fr = French, 24 h; Zh = Mandarin, 2.4 h):

| Model | En 1s | En 10s | En 2m | Fr 1s | Fr 10s | Fr 2m | Zh 1s | Zh 10s | Zh 2m |
|---|---|---|---|---|---|---|---|---|---|
| Unsupervised baseline | 12.0 | 12.1 | 12.1 | 12.5 | 12.6 | 12.6 | 11.5 | 11.5 | 11.5 |
| Supervised topline | 6.5 | 5.3 | 5.1 | 8.0 | 6.8 | 6.8 | 9.5 | 4.2 | 4.0 |
| VQ-VAE (per lang, MFCC, $p_{cond}$) | 5.6 | 5.5 | 5.5 | 7.3 | 7.5 | 7.5 | 11.2 | 10.7 | 10.8 |
| VQ-VAE (per lang, MFCC, $p_{bn}$) | 6.2 | 6.0 | 6.0 | 7.5 | 7.3 | 7.6 | 10.8 | 10.5 | 10.6 |
| VQ-VAE (per lang, MFCC, $p_{proj}$) | 5.9 | 5.8 | 5.9 | 6.7 | 6.9 | 6.9 | 9.9 | 9.7 | 9.7 |
| VQ-VAE (all lang, MFCC, $p_{cond}$) | 5.8 | 5.8 | 5.8 | 8.0 | 7.9 | 7.8 | 9.2 | 9.1 | 9.2 |
| VQ-VAE (all lang, MFCC, $p_{bn}$) | 6.3 | 6.2 | 6.3 | 8.0 | 8.0 | 7.9 | 9.0 | 8.9 | 9.1 |
| VQ-VAE (all lang, MFCC, $p_{proj}$) | 5.8 | 5.7 | 5.8 | 7.1 | 7.0 | 6.9 | 7.4 | 7.2 | 7.1 |
| VQ-VAE (all lang, fbank, $p_{proj}$) | 6.0 | 6.0 | 6.0 | 6.9 | 6.8 | 6.8 | 6.8 | 6.6 | 6.6 |
| Heck et al. [58] | 6.9 | 6.2 | 6.0 | 9.7 | 8.7 | 8.4 | 8.8 | 7.9 | 7.8 |
| Chen et al. [59] | 8.5 | 7.3 | 7.2 | 11.2 | 9.4 | 9.4 | 10.5 | 8.7 | 8.5 |
| Ansari et al. [60] | 7.7 | 6.8 | N/A | 10.4 | N/A | 8.8 | 10.4 | 9.3 | 9.1 |
| Yuan et al. [61] | 9.0 | 7.1 | 7.0 | 11.9 | 9.5 | 9.5 | 11.1 | 8.5 | 8.2 |

Across-speaker (same abbreviations):

| Model | En 1s | En 10s | En 2m | Fr 1s | Fr 10s | Fr 2m | Zh 1s | Zh 10s | Zh 2m |
|---|---|---|---|---|---|---|---|---|---|
| Unsupervised baseline | 23.4 | 23.4 | 23.4 | 25.2 | 25.5 | 25.2 | 21.3 | 21.3 | 21.3 |
| Supervised topline | 8.6 | 6.9 | 6.7 | 10.6 | 9.1 | 8.9 | 12.0 | 5.7 | 5.1 |
| VQ-VAE (per lang, MFCC, $p_{cond}$) | 8.1 | 8.0 | 8.0 | 11.0 | 10.8 | 11.1 | 12.2 | 11.7 | 11.9 |
| VQ-VAE (per lang, MFCC, $p_{bn}$) | 8.9 | 8.8 | 8.9 | 11.3 | 11.0 | 11.2 | 11.9 | 11.4 | 11.6 |
| VQ-VAE (per lang, MFCC, $p_{proj}$) | 9.1 | 9.0 | 9.0 | 11.9 | 11.6 | 11.7 | 11.0 | 10.6 | 10.7 |
| VQ-VAE (all lang, MFCC, $p_{cond}$) | 8.8 | 8.6 | 8.7 | 11.8 | 11.6 | 11.6 | 10.3 | 10.0 | 9.9 |
| VQ-VAE (all lang, MFCC, $p_{bn}$) | 9.4 | 9.2 | 9.3 | 11.8 | 11.7 | 11.8 | 9.9 | 9.7 | 9.7 |
| VQ-VAE (all lang, MFCC, $p_{proj}$) | 9.3 | 9.3 | 9.3 | 11.9 | 11.4 | 11.6 | 8.6 | 8.5 | 8.5 |
| VQ-VAE (all lang, fbank, $p_{proj}$) | 10.1 | 10.1 | 10.1 | 12.5 | 12.2 | 12.3 | 7.8 | 7.7 | 7.7 |
| Heck et al. [58] | 10.1 | 8.7 | 8.5 | 13.6 | 11.7 | 11.3 | 8.8 | 7.4 | 7.3 |
| Chen et al. [59] | 12.7 | 11.0 | 10.8 | 17.0 | 14.5 | 14.1 | 11.9 | 10.3 | 10.1 |
| Ansari et al. [60] | 13.2 | 12.0 | N/A | 17.2 | N/A | 15.4 | 13.0 | 12.2 | 12.3 |
| Yuan et al. [61] | 14.0 | 11.9 | 11.7 | 18.6 | 15.5 | 14.9 | 12.7 | 10.8 | 10.7 |

VQ-VAE performance on Mandarin is worse, which we can attribute to three main causes. First, the training dataset consists of only 2.4 hours of speech, leading to overfitting (see Sec. IV-E7). This can be partially improved by multilingual training, as in VQ-VAE, (all lang, MFCC, $p_{cond}$). Second, Mandarin is a tonal language, while the default input features (MFCCs) discard pitch information. We note a slight improvement with a multilingual model trained on mel filterbank features (VQ-VAE, (all lang, fbank, $p_{proj}$)). Third, VQ-VAE was shown not to encode prosody in the latent representation [19]. Comparing the results across probe points, we see that Mandarin is the only language for which the VQ bottleneck discards information and decreases performance in the across-speaker testing regime. Nevertheless, the multilingual prequantized features yield accuracies comparable to [58].

We do not consider the need for more unsupervised training data to be a problem. Unlabeled data is abundant. We believe that a more powerful model that requires and can make better use of large amounts of unlabeled training data is preferable to a simpler model whose performance saturates on small datasets. However, it remains to be verified if increasing the amount of training data would help the Mandarin VQ-VAE learn to discard less tonal information (the multilingual model might have learned to do this to accommodate French and English).

E. Hyperparameter impact

All VQ-VAE autoencoder hyperparameters were tuned on the LibriSpeech task using several grid-searches, optimizing for the highest phoneme recognition accuracy. We also validated these design choices on the English part of the ZeroSpeech challenge task. Indeed, we found that the proposed time-jitter regularization improved ZeroSpeech ABX scores for all input representations. Using MFCC or filterbank features yields better scores than using waveforms, and the model consistently obtains better scores when more tokens are used.

1) Time-jitter regularization: In Table III we analyze the effectiveness of the time-jitter regularization on VQ-VAE encodings and compare it to two variants of dropout: regular dropout applied to individual dimensions of the encoding and dropout applied randomly to the full encoding at individual time steps. Regular dropout does not force the model to separate information in neighboring timesteps. Step-wise dropout promotes encodings which are independent across timesteps and performs slightly worse than the time-jitter.⁴

⁴ The token copy probability of 0.12 keeps a given token with probability $0.88^2 = 0.77$, which roughly corresponds to a 0.23 per-timestep dropout rate.

The proposed time-jitter regularization greatly improves token mapping accuracy and extends the range of token frame rates which perform well to include 50 Hz. While the LibriSpeech token accuracies are comparable at 25 Hz and 50 Hz, higher token emission frequencies are important for the ZeroSpeech AUD task, on which the 50 Hz model was noticeably better. This behavior is due to the fact that the 25 Hz model is prone to omitting short phones (Sec. IV-E6), which impacts the ABX results on the ZeroSpeech task.

We also analyzed information content at the four probe points for the VQ-VAE, VAE, and simple dimensionality reduction AE bottlenecks, shown in Figure 5. For all bottleneck mechanisms, the regularization limits the quality of filterbank reconstructions and increases the phoneme recognition accuracy in the constrained representation. However this benefit is smaller after neighboring timesteps are combined in the $p_{cond}$ probe point. Moreover, for VQ-VAE and VAE the regularization decreases gender prediction accuracy and makes the representation slightly less speaker-sensitive.

[Figure 5: panels for filterbank, phoneme, gender, and speaker prediction at the probe points $p_{enc}$, $p_{proj}$, $p_{bn}$ and $p_{cond}$, grouped by bottleneck and by time-jitter probability.]

Fig. 5. Impact of the time-jitter regularization on information captured by representations at different probe points.

TABLE III. Effects of input representation and regularization on phoneme recognition accuracy on LibriSpeech, measured after 200k training steps. All models extract 256 tokens.

| Input features | Token rate | Regularization | Accuracy (%) |
|---|---|---|---|
| MFCC | 25 Hz | None | 52.5 |
| MFCC | 25 Hz | Regular dropout p = 0.1 | 50.7 |
| MFCC | 25 Hz | Regular dropout p = 0.2 | 49.1 |
| MFCC | 25 Hz | Per-time step dropout p = 0.2 | 55.3 |
| MFCC | 25 Hz | Per-time step dropout p = 0.3 | 55.7 |
| MFCC | 25 Hz | Per-time step dropout p = 0.4 | 55.1 |
| MFCC | 25 Hz | Time-jitter p = 0.08 | 56.2 |
| MFCC | 25 Hz | Time-jitter p = 0.12 | 56.2 |
| MFCC | 25 Hz | Time-jitter p = 0.16 | 56.1 |
| MFCC | 50 Hz | None | 46.5 |
| MFCC | 50 Hz | Time-jitter p = 0.5 | 56.1 |
| log-mel spectrogram | 25 Hz | None | 50.1 |
| log-mel spectrogram | 25 Hz | Time-jitter p = 0.12 | 53.6 |
| raw waveform | 30 Hz | None | 37.6 |
| raw waveform | 30 Hz | Time-jitter p = 0.12 | 48.1 |

2) Input representation: In this set of experiments we compared performance using different input representations: raw waveforms, log-mel spectrograms, or MFCCs. The raw waveform encoder used 9 strided convolutional layers, which resulted in a token extraction frequency of 30 Hz. We then

[Figure 6: gender and phoneme prediction accuracy plotted against the WaveNet receptive field in ms (logarithmic axis, 1 to 100).]

Fig. 6. Impact of decoder WaveNet receptive field on the properties of the VQ-VAE conditioning signal. The representation is significantly more gender invariant when the receptive field is larger than 10 ms. Frame-wise phoneme recognition accuracy peaks at about 125 ms. The depth and width of the WaveNet have a secondary effect (cf. points with the same RF).

features, especially MFCCs, perform better than waveforms, because by design they discard information about pitch and provide a degree of speaker invariance. Using such a reduced representation forces the encoder to transmit less information to the decoder, acting as an inductive bias toward a more speaker invariant latent encoding.

3) Output representation: We constructed an autoregressive decoder network that reconstructed filterbank features rather than raw waveform samples. Inspired by recent progress in text-to-speech systems, we implemented a Tacotron 2-like decoder [62] with a built-in information bottleneck on the autoregressive information flow, which was found to be critical in TTS applications. Similarly to Tacotron 2, the filterbank features were first processed by a small "pre-net"; we applied generous amounts of dropout and configured the decoder to predict up to 4 frames in parallel. However, these modifications yielded at best 42% phoneme recognition accuracy, significantly lower than the other architectures described in this paper. The model was however an order of magnitude faster to train.

Finally, we analyzed the impact of the size of the decoding WaveNet on the representation extracted by the VQ-VAE. We have found that the overall receptive field (RF) has a larger impact than the depth or width of the WaveNet. In particular, a large change in the properties of the latent representation happens when the decoder's receptive field crosses about 10 ms. As shown in Figure 6, for smaller RFs, the conditioning signal contains more speaker information: gender prediction is close to 80%, while framewise phoneme prediction accuracy is only 55%. For larger RFs, gender prediction accuracy is about 60%, while phoneme prediction peaks near 65%. Finally, while the reconstruction log-likelihood improved with WaveNet depth up to 30 layers, the phoneme recognition accuracy plateaued with 20 layers. Since the WaveNet has the largest computational cost, we decided to keep the 20 layer configuration.
replaced the waveform with a customary ASR data pipeline: 4) Decoder speaker conditioning: The WaveNet decoder 80 log-mel filterbank features extracted every 10ms from 25ms- generates samples based on three sources of information: the long windows and 13 MFCC features extracted from the mel- previously emitted samples (via the autoregressive connection), filterbank output, both augmented with their first and second global conditioning on speaker or other information which temporal derivatives. Using two strided convolution layers in is stationary in time, and on the time-varying representation the encoder led to a 25 Hz token rate for these models. extracted from the encoder. We found that disabling global The results are reported in the bottom of Table III. High-level speaker conditioning reduces phoneme classification accuracy Accuracy Accuracy Accuracy Recon. Error VQ-VAE VAE AE VQ-VAE VAE AE VQ-VAE VAE AE VQ-VAE VAE AE Prediction accuracy 10 by 3 percentage points. This further corroborates our findings An interesting future area for research would be investigating about disentanglement induced by the VQ-VAE bottleneck, methods to increase the model capacity to make better use of which biases the model to discard information that is available larger amounts of unlabeled data. in a more explicit form. Throughout our experiments we used The influence of the size of the dataset is also visible in a speaker-independent encoder. However, adapting the encoder the ZeroSpeech Challenge results (Table II): VQ-VAE models to the speaker might further improve the results. In fact, [58] obtained good performance on English (45 hours of training demonstrates improvements on the ZeroSpeech task using a data) and French (24 hours), but performed poorly on Mandarin speaker-adaptive approach. (2.5 hours). Moreover, on English and French we obtained the 5) Encoder hyperparameters: We experimented with tuning best results with models trained on monolingual data. 
On the number of encoder convolutional layers, as well as the Mandarin slightly better results were obtained using a model number of filters, and the filter length. In general, performance trained jointly on data from all languages. improved with larger encoders, however we established that the encoder’s receptive field must be carefully controlled, with V. RELATED WORK the best performing encoders seeing about 0.3 seconds of input VAEs for sequential data were introduced in [49]. The model signal for each generated token. used LSTM encoder and decoder, while the latent representation The effective receptive field can be controlled using two was formed from the last hidden state of the encoder. The model mechanisms: by carefully tuning the encoder architecture, or by proved useful for natural language processing tasks. However, it designing an encoder with a wide receptive field, but limiting also demonstrated the problem of latent representation collapse: the duration of signal segments seen during training to the when a powerful autoregressive decoder is used simultaneously desired receptive field. In this way the model never learns to with a penalty on the latent encoding, such as the KL prior, use its full capacity. When the model was trained on 2.5s long the VAE has a tendency to ignore the prior and act as if it segments, an encoder with receptive field of 0.3s had framewise were a purely autoregressive sequence model. This issue can phoneme recognition accuracy of 56.5%, while and encoder be mitigated by changing the weight of the KL term, and with a receptive field of 0.8s scored only 54.3%. When trained limiting the amount of information on the autoregressive path on segments of 0.3s, both models performed similarly. by using word dropout [49]. Latent collapse can also be avoided 6) Bottleneck bit rate: The speech VQ-VAE encoder can be in deterministic autoencoders, such as [64], which coupled a seen as encoding a signal using a very low bit rate. 
To achieve convolutional encoder to a powerful autoregressive WaveNet a predetermined target bit rate, one can control both the token decoder [18] to learn a latent representation of music audio rate (i.e., by controlling the degree of downsampling down in consisting of isolated notes from a variety of instruments. the encoder strided convolutions), and the number of tokens We empirically validate that conditioning the decoder on (or equivalently the number of bits) extracted at every step. We speaker information results in encodings which are more found that the token rate is a crucial parameter which must be speaker invariant. Moyer et al. [54] give a rigorous proof chosen carefully, with the best results after 200k training steps that this approach produces representations that are invariant obtained at 50 Hz (56.0% phoneme recognition accuracy ) and to the explicitly provided information and relate it to domain- 25 Hz (56.3%). Accuracy drops abruptly at higher token rates adversarial training, another technique designed to enforce (49.3% at 100 Hz), while lower rates miss very short phones invariance to a known nuisance factor [65]. (53% accuracy at 12.5 Hz). In contrast to the number of tokens, the dimensionality of the When applied to audio, the VQ-VAE uses the WaveNet decoder to free the latent representation from modeling VQ-VAE embedding has a secondary effect on representation information that is easily recoverable form the recent past quality. We found 64 to be a good setting, with much smaller [19]. It avoids the problem of posterior collapse by using a dimensions deteriorating performance for models with a small discrete latent code with a uniform prior which results in a number of tokens and higher dimensionalities negatively constant KL penalty. We employ the same strategy to design affecting performance for models with a large number of tokens. 
the latent representation regularizer: rather than extending the For completeness, we observe that even for the model with cost function with a penalty term that can cause the latent space the largest inventory of tokens, the overall encoder bitrate is to collapse, we rely on random copies of the latent variables low: 14 bits at 50 Hz = 700 bps, which is on par with the to prevent their co-adaptation and promote stability over time. lowest bitrate of classical speech codecs [63]. The randomized time-jitter regularization introduced in this 7) Training corpus size: We experimented with training paper is inspired by slow representations of data [48] and models on subsets of the LibriSpeech training set, varying by dropout, which randomly removes during training neurons the size from 4.6 hours (1%) to 460 hours (100%). Training to prevent their co-adaptation [50]. It is also very similar to on 4.6 hours of data, phoneme recognition accuracy peaked Zoneout [51] which relies on random time copies of selected at 50.5% at 100k steps and then deteriorated. Training on 9 neurons to regularize recurrent neural networks. hours led to a peak accuracy of 52.5% at 180k sets. When the size of training set was increased past 23 hours the phoneme Several authors have recently proposed to model sequences recognition reached 54% after around 900k steps. No further with VAEs that use a hierarchy of variables. [66] explore a improvements were found by training on the full 460 hours of hierarchical latent space which separates sequence-dependent data. We did not observe any overfitting, and for best results variables from those which are sequence-independent ones. trained models until reaching 900k steps with no early stopping. Their model was shown to perform speaker conversion and to 11 improve automatic speech recognititon (ASR) performance in from speaker characteristics. Furthermore, we observe that the the presence of domain mismatch. 
[67] introduce a stochastic latent collapse problem induced by bottlenecks which are too latent variable model for sequential data which also yields strong can be avoided by making the bottleneck strength a disentangled representations and allows content swapping model hyperparameter, either removing it completely (as in between generated sequences. These other approaches could the VQ-VAE), or by using the free-information VAE objective. possibly benefit from regularizing the latent representation to To further improve representation quality, we introduced a achieve further information disentanglement. time-jitter regularization scheme which limits the capacity of Acoustic unit discovery systems aim at transducing the the latent code yet does not result in a collapse of the latent acoustic signal into a sequence of interpretable units akin space. We hope that this can similarly improve performance to phones. They often involve clustering of acoustic frames, of latent variable models used with auto-regressive decoders MFCC or neural network bottleneck features, regularized using in other problem domains. a probabilistic prior. DP-GMM [68] imposes a Dirichlet Process Both the VAE and VQ-VAE constrain the information prior over a Gaussian Mixture Model. Extending it with an bandwidth of the latent representation. However, the VQ-VAE HMM temporal structure for sub-phonetic units leads to the uses a quantization mechanism, which deterministically forces DP-HMM and the HDP-HMM [69], [70], [71]. HMM-VAE the encoding to be equal to a prototype, while the VAE limits proposes the use of a deep neural network instead of a GMM the amount of information by injecting noise. In our study, [72], [73]. These approaches enforce top-down constraints via the VQ-VAE resulted in better information separation than HMM temporal smoothing and temporal modeling. Linguistic the VAE. 
However, further experiments are needed to fully unit discovery models detect recurring speech patterns at a understand this effect. In particular, is this a consequence of word-like level, finding commonly repeated segments with a the quantization, or of the deterministic operation? constrained dynamic time warping [74]. We also observe that while the VQ-VAE produces a discrete In the segmental unsupervised speech recognition framework, representation, for best results it uses a token set so large that neural autoencoders were used to embed variable length speech it is impractical to assign a separate meaning to each one. In segments into a common vector space where they could be particular, in our ZeroSpeech experiments we used the dense clustered into word types [75]. [76] replace the segmental embedding representation of each token, which provided a autoencoder with a model that instead predicts a nearby more nuanced token similarity measure than simply using the speech segment and demonstrate that the representation shares token identity. Perhaps a more structured latent representation many properties with word embeddings. Coupled with an is needed, in which a small set of units can be modulated in a unsupervised word segmentation algorithm and unsupervised continuous fashion. mapping of word embeddings discovered on separate corpora Extensive hyperparameter evaluation indicated that opti- [77] the approach yielded an ASR system trained on unpaired mizing the receptive field sizes of the encoder and decoder speech and text data [78]. networks is important for good model performance. A multi- Several entries to the ZeroSpeech 2017 challenge relied scale modeling approach could furthermore separate the on neural networks for phonetic unit discovery. [61] trains prosodic information. 
Our autoencoding approach could also an autoencoder on pairs of speech segments found using an be combined with penalties that are more specialized to speech unsupervised term discovery system [79]. [59] first clustered processing. Introducing a HMM prior as in [73] could promote speech frames, then trained a neural network to predict the a latent representation which better mimics the temporal cluster IDs and used its hidden representation as features. phonetic structure of speech. [60] extended this scheme with features discovered by an autoencoder trained on MFCCs. ACKNOWLEDGMENTS The authors thank Tara Sainath, Ulfar Erlingsson, Aren VI. CONCLUSIONS Jansen, Sander Dieleman, Jesse Engel, Łukasz Kaiser, Tom We applied sequence autoencoders to speech modeling and Walters, Cristina Garbacea, and the Google Brain team for compared different information bottlenecks, including VAEs their helpful discussions and feedback. and VQ-VAEs. We carefully evaluated the induced latent representation using interpretability criteria as well as the ability REFERENCES to discriminate between similar speech sounds. The comparison [1] G. E. Hinton and R. R. Salakhutdinov, “Reducing the dimensionality of of bottlenecks revealed that discrete representations obtained data with neural networks,” Science, vol. 313, no. 5786, 2006. using VQ-VAE preserved the most phonetic information [2] P. Vincent, H. Larochelle, Y. Bengio, and P.-A. Manzagol, “Extracting and composing robust features with denoising autoencoders,” in Proc. while also being the most speaker-invariant. The extracted International Conference on Machine Learning, 2008. representation allowed for accurate mapping of the extracted [3] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification symbols into phonemes and obtained competitive performance with deep convolutional neural networks,” in Advances in Neural Information Processing Systems, 2012, pp. 1097–1105. 
on the ZeroSpeech 2017 acoustic unit discovery task. A similar [4] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan, combination of VQ-VAE encoder and WaveNet decoder by V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,” in Cho et al. had the best acoustic unit discovery performance in Proc. IEEE Conference on Computer Vision and Pattern Recognition, ZeroSpeech 2019 [80]. [5] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by We established that an information bottleneck is required jointly learning to align and translate,” in Proc. International Conference for the model to learn a representation that separates content on Learning Representations, 2015. 12 [6] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey, [29] J. Yosinski, J. Clune, Y. Bengio, and H. Lipson, “How transferable are M. Krikun, Y. Cao, Q. Gao, K. Macherey, and et al, “Google’s neural features in deep neural networks?” in Advances in Neural Information machine translation system: Bridging the gap between human and Processing Systems, 2014, pp. 3320–3328. machine translation,” arXiv preprint arXiv:1609.08144, 2016. [30] K. Vesely, ` M. Karafiat, ´ F. Grezl, ´ M. Janda, and E. Egorova, “The [7] A. Graves, A.-r. Mohamed, and G. Hinton, “Speech recognition with language-independent bottleneck features,” in Proc. Spoken Language deep recurrent neural networks,” in Proc. International Conference on Technology Workshop (SLT), 2012, pp. 336–341. Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 6645–6649. [31] D. Yu and M. L. Seltzer, “Improved bottleneck features using pretrained [8] C.-C.Chiu, T. N. Sainath, Y. Wu, R. Prabhavalkar, P. Nguyen, Z. Chen, deep neural networks,” in Proc. Interspeech, 2011. A. Kannan, R. J. Weiss, K. Rao, K. Gonina, N. Jaitly, B. Li, J. Chorowski, [32] B. McCann, J. Bradbury, C. Xiong, and R. Socher, “Learned in translation: and M. 
Bacchiani, “State-of-the-art speech recognition with sequence- Contextualized word vectors,” in Advances in Neural Information to-sequence models,” in Proc. International Conference on Acoustics, Processing Systems, 2017, pp. 6294–6305. Speech and Signal Processing (ICASSP), 2018. [33] S. R. Bowman, G. Angeli, C. Potts, and C. Manning, “A large annotated [9] W. Wang, N. Yang, F. Wei, B. Chang, and M. Zhou, “Gated self-matching corpus for learning natural language inference,” in Proc. Conference on networks for reading comprehension and question answering,” in Proc. Empirical Methods in Natural Language Processing, 2015. 55th Annual Meeting of the Association for Computational Linguistics [34] A. Conneau, D. Kiela, H. Schwenk, L. Barrault, and A. Bordes, (Volume 1: Long Papers), vol. 1, 2017, pp. 189–198. “Supervised learning of universal sentence representations from natural [10] A. W. Yu, D. Dohan, M.-T. Luong, R. Zhao, K. Chen, M. Norouzi, language inference data,” in Proc. Conference on Empirical Methods in and Q. V. Le, “QANet: Combining local convolution with global self- Natural Language Processing (EMNLP), September 2017, pp. 670–680. attention for reading comprehension,” in Proc. International Conference [35] C. M. Bishop, “Continuous latent variables,” in Pattern Recognition and on Learning Representations, 2018. Machine Learning. Springer, 2006, ch. 12. [11] M. D. Zeiler and R. Fergus, “Visualizing and understanding convolutional [36] D. D. Lee and H. S. Seung, “Learning the parts of objects by non-negative networks,” in European Conference on Computer Vision, 2014. matrix factorization,” Nature, vol. 401, no. 6755, p. 788, 1999. [12] M. Sundararajan, A. Taly, and Q. Yan, “Axiomatic attribution for deep [37] B. A. Olshausen and D. J. Field, “Emergence of simple-cell receptive networks,” in Proc. International Conference on Machine Learning, 2017. field properties by learning a sparse code for natural images,” Nature, [13] T. Nagamine and N. 
Mesgarani, “Understanding the representation vol. 381, no. 6583, p. 607, 1996. and computation of multilayer perceptrons: A case study in speech [38] D. P. Kingma and M. Welling, “Auto-encoding variational bayes,” in recognition,” in Proc. International Conference on Machine Learning, Proc. International Conference on Learning Representations, 2014. [39] I. Higgins, L. Matthey, A. Pal, C. Burgess, X. Glorot, M. Botvinick, [14] J. Chorowski, R. J. Weiss, R. A. Saurous, and S. Bengio, “On using S. Mohamed, and A. Lerchner, “Beta-VAE: Learning basic visual backpropagation for speech texture generation and voice conversion,” concepts with a constrained variational framework,” in Proc. International in Proc. International Conference on Acoustics, Speech and Signal Conference on Learning Representations, 2017. Processing (ICASSP), Apr. 2018. [40] A. A. Alemi, I. Fischer, J. V. Dillon, and K. Murphy, “Deep variational [15] P. Swietojanski, A. Ghoshal, and S. Renals, “Unsupervised cross-lingual information bottleneck,” in Proc. International Conference on Learning knowledge transfer in DNN-based LVCSR,” in Proc. Spoken Language Representations, 2017. Technology Workshop (SLT), 2012, pp. 246–251. [41] D. P. Kingma, T. Salimans, R. Jozefowicz, X. Chen, I. Sutskever, and [16] S. Thomas, M. L. Seltzer, K. Church, and H. Hermansky, “Deep neural M. Welling, “Improved variational inference with inverse autoregressive network features and semi-supervised training for low resource speech flow,” in Advances in Neural Information Processing Systems, 2016. recognition,” in Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2013, pp. 6704–6708. [42] Y. Bengio, N. Leonard, ´ and A. Courville, “Estimating or propagating [17] V. Panayotov, G. Chen, D. Povey, and S. Khudanpur, “LibriSpeech: an gradients through stochastic neurons for conditional computation,” arXiv ASR corpus based on public domain audio books,” in Proc. 
International preprint arXiv:1308.3432, 2013. Conference on Acoustics, Speech and Signal Processing (ICASSP), 2015. [43] D. Jurafsky and J. H. Martin, Speech and Language Processing (2nd [18] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals, Edition). Upper Saddle River, NJ, USA: Prentice-Hall, Inc., 2009. A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “WaveNet: [44] Y. N. Dauphin, A. Fan, M. Auli, and D. Grangier, “Language modeling A generative model for raw audio,” arXiv preprint arXiv:1609.03499, with gated convolutional networks,” in Proc. International Conference on Machine Learning, 2017. [19] A. van den Oord, O. Vinyals, and K. Kavukcuoglu, “Neural discrete [45] S. Bai, J. Z. Kolter, and V. Koltun, “An empirical evaluation of generic representation learning,” in Advances in Neural Information Processing convolutional and recurrent networks for sequence modeling,” arXiv Systems, 2017, pp. 6309–6318. preprint arXiv:1803.01271, 2018. [20] E. Dunbar, X. N. Cao, J. Benjumea, J. Karadayi, M. Bernard, L. Besacier, [46] W.-N. Hsu, Y. Zhang, R. J. Weiss, H. Zen, Y. Wu, Y. Wang, Y. Cao, X. Anguera, and E. Dupoux, “The zero resource speech challenge 2017,” Y. Jia, Z. Chen, J. Shen, P. Nguyen, and R. Pang, “Hierarchical generative in Proc. Automatic Speech Recognition and Understanding Workshop modeling for controllable speech synthesis,” in Proc. International (ASRU), 2017. Conference on Learning Representations, 2019. [21] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Learning rep- [47] I. Gulrajani, K. Kumar, F. Ahmed, A. A. Ta¨ ıga, F. Visin, D. Vazquez, ´ resentations by back-propagating errors,” Nature, vol. 323, no. 6088, and A. Courville, “PixelVAE: A latent variable model for natural images,” in Proc. International Conference on Learning Representations, 2017. [22] H. Lee, C. Ekanadham, and A. Ng, “Sparse deep belief net model for [48] L. Wiskott and T. J. 
Sejnowski, “Slow feature analysis: Unsupervised visual area V2,” in Advances in Neural Information Processing Systems, learning of invariances,” Neural Computation, vol. 14, no. 4, 2002. [49] S. R. Bowman, L. Vilnis, O. Vinyals, A. M. Dai, R. Jozefowicz, and [23] S. Dieleman and B. Schrauwen, “End-to-end learning for music audio,” S. Bengio, “Generating sentences from a continuous space,” in SIGNLL in Proc. International Conference on Acoustics, Speech and Signal Conference on Computational Natural Language Learning, 2016. Processing (ICASSP), 2014, pp. 6964–6968. [50] N. Srivastava, G. Hinton, A. Krizhevsky, I. Sutskever, and R. Salakhutdi- [24] N. Jaitly and G. Hinton, “Learning a better representation of speech nov, “Dropout: A simple way to prevent neural networks from overfitting,” soundwaves using restricted Boltzmann machines,” in Proc. International Journal of Machine Learning Research, vol. 15, no. 1, 2014. Conference on Acoustics, Speech and Signal Processing (ICASSP), 2011. [51] D. Krueger, T. Maharaj, J. Kramar ´ , M. Pezeshki, N. Ballas, N. R. Ke, [25] Z. Tusk ¨ e, P. Golik, R. Schluter ¨ , and H. Ney, “Acoustic modeling with deep A. Goyal, Y. Bengio, A. Courville, and C. Pal, “Zoneout: Regularizing neural networks using raw time signal for LVCSR,” in Proc. Interspeech, RNNs by randomly preserving hidden activations,” in Proc. International Conference on Learning Representations, 2017. [26] D. Palaz, M. Magima Doss, and R. Collobert, “Analysis of CNN- [52] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” based speech recognition system using raw speech as input,” in Proc. in Proc. International Conference on Learning Representations, 2015. Interspeech, 2015. [53] B. T. Polyak and A. B. Juditsky, “Acceleration of stochastic approximation [27] T. N. Sainath, R. J. Weiss, A. Senior, K. W. Wilson, and O. Vinyals, by averaging,” SIAM Journal on Control and Optimization, vol. 30, no. 
4, “Learning the speech front-end with raw waveform CLDNNs,” in Proc. Interspeech, 2015. pp. 838–855, 1992. [28] J. Deng, W. Dong, R. Socher, L.-J. Li, K. Li, and L. Fei-Fei, “ImageNet: [54] D. Moyer, S. Gao, R. Brekelmans, A. Galstyan, and G. Ver Steeg, A Large-Scale Hierarchical Image Database,” in Proc. IEEE Conference “Invariant Representations without Adversarial Training,” in Advances in on Computer Vision and Pattern Recognition, 2009. Neural Information Processing Systems 31, 2018, pp. 9084–9093. 13 [55] D. Povey, A. Ghoshal, G. Boulianne, L. Burget, O. Glembek, N. Goel, [79] A. Jansen and B. Van Durme, “Efficient spoken term discovery using M. Hannemann, P. Motlicek, Y. Qian, P. Schwarz, J. Silovsky, G. Stemmer, randomized algorithms,” in Proc. Automatic Speech Recognition and and K. Vesely, “The Kaldi speech recognition toolkit,” in Proc. Automatic Understanding Workshop (ASRU), 2011, pp. 401–406. Speech Recognition and Understanding Workshop (ASRU), 2011. [80] E. Dunbar, R. Algayres, J. Karadayi, M. Bernard, J. Benjumea, X.-N. Cao, L. Miskic, C. Dugrain, L. Ondel, A. W. Black, L. Besacier, S. Sakti, and [56] T. Schatz, V. Peddinti, F. Bach, A. Jansen, H. Hermansky, and E. Dupoux, E. Dupoux, “The Zero Resource Speech Challenge 2019: TTS without T,” “Evaluating speech features with the minimal-pair ABX task: Analysis arXiv preprint arXiv:1904.11469, 2019, accepted to Interspeech 2019. of the classical MFC/PLP pipeline,” in Proc. Interspeech, 2013, pp. 1–5. [57] T. Schatz, V. Peddinti, X.-N. Cao, F. Bach, H. Hermansky, and E. Dupoux, “Evaluating speech features with the minimal-pair ABX task (ii): Resistance to noise,” in Proc. Interspeech, 2014. [58] M. Heck, S. Sakti, and S. Nakamura, “Unsupervised linear discriminant analysis for supporting DPGMM clustering in the zero resource scenario,” Jan Chorowski is an Associate Professor at Faculty Procedia Computer Science, vol. 81, pp. 73–79, 2016. of Mathematics and Computer Science at the Uni- [59] H. 
Chen, C.-C. Leung, L. Xie, B. Ma, and H. Li, “Multilingual bottleneck feature learning from untranscribed speech,” in Proc. Automatic Speech Recognition and Understanding Workshop (ASRU), 2017.
[60] T. Ansari, R. Kumar, S. Singh, and S. Ganapathy, “Deep learning methods for unsupervised acoustic modeling—leap submission to zerospeech challenge 2017,” in Proc. Automatic Speech Recognition and Understanding Workshop (ASRU), 2017, pp. 754–761.
[61] Y. Yuan, C. C. Leung, L. Xie, H. Chen, B. Ma, and H. Li, “Extracting bottleneck features and word-like pairs from untranscribed speech for feature representation,” in Proc. Automatic Speech Recognition and Understanding Workshop (ASRU), Dec 2017, pp. 734–739.
[62] J. Shen, R. Pang, R. J. Weiss, M. Schuster, N. Jaitly, Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. J. Skerry-Ryan, R. A. Saurous, Y. Agiomyrgiannakis, and Y. Wu, “Natural TTS synthesis by conditioning WaveNet on mel spectrogram predictions,” in Proc. International Conference on Acoustics, Speech and Signal Processing (ICASSP), 2018.
[63] X. Wang and C.-C. J. Kuo, “An 800 bps VQ-based LPC voice coder,” Journal of the Acoustical Society of America, vol. 103, no. 5, 1998.
[64] J. Engel, C. Resnick, A. Roberts, S. Dieleman, M. Norouzi, D. Eck, and K. Simonyan, “Neural audio synthesis of musical notes with wavenet autoencoders,” in Proc. International Conference on Machine Learning, 2017, pp. 1068–1077.
[65] Y. Ganin, E. Ustinova, H. Ajakan, P. Germain, H. Larochelle, F. Laviolette, M. Marchand, and V. Lempitsky, “Domain-Adversarial Training of Neural Networks,” Journal of Machine Learning Research, vol. 17, no. 59, pp. 1–35, 2016.
[66] W.-N. Hsu, Y. Zhang, and J. Glass, “Unsupervised learning of disentangled and interpretable representations from sequential data,” in Advances in Neural Information Processing Systems, 2017, pp. 1876–1887.
[67] Y. Li and S. Mandt, “Disentangled sequential autoencoder,” in Proc. International Conference on Machine Learning, 2018.
[68] H. Chen, C.-C. Leung, L. Xie, B. Ma, and H. Li, “Parallel Inference of Dirichlet Process Gaussian Mixture Models for Unsupervised Acoustic Modeling: A Feasibility Study,” in Proc. Interspeech, 2015.
[69] C.-y. Lee and J. Glass, “A Nonparametric Bayesian Approach to Acoustic Model Discovery,” in Proc. 50th Annual Meeting of the Association for Computational Linguistics (Volume 1: Long Papers), Jul. 2012, pp. 40–49.
[70] L. Ondel, L. Burget, and J. Černocký, “Variational Inference for Acoustic Unit Discovery,” Procedia Computer Science, vol. 81, Jan. 2016.
[71] R. Marxer and H. Purwins, “Unsupervised Incremental Online Learning and Prediction of Musical Audio Signals,” IEEE/ACM Transactions on Audio, Speech, and Language Processing, vol. 24, no. 5, May 2016.
[72] J. Ebbers, J. Heymann, L. Drude, T. Glarner, R. Haeb-Umbach, and B. Raj, “Hidden Markov Model Variational Autoencoder for Acoustic Unit Discovery,” in Proc. Interspeech, Aug. 2017, pp. 488–492.
[73] T. Glarner, P. Hanebrink, J. Ebbers, and R. Haeb-Umbach, “Full Bayesian Hidden Markov Model Variational Autoencoder for Acoustic Unit Discovery,” in Proc. Interspeech, Sep. 2018, pp. 2688–2692.
[74] A. S. Park and J. R. Glass, “Unsupervised Pattern Discovery in Speech,” IEEE Transactions on Audio, Speech, and Language Processing, vol. 16, no. 1, pp. 186–197, Jan. 2008.
[75] H. Kamper, A. Jansen, and S. Goldwater, “A segmental framework for fully-unsupervised large-vocabulary speech recognition,” Computer Speech & Language, vol. 46, pp. 154–174, 2017.
[76] Y.-A. Chung and J. Glass, “Learning word embeddings from speech,” arXiv preprint arXiv:1711.01515, 2017.
[77] G. Lample, A. Conneau, L. Denoyer, and M. Ranzato, “Unsupervised Machine Translation Using Monolingual Corpora Only,” in Proc. International Conference on Learning Representations, 2018.
[78] Y.-A. Chung, W.-H. Weng, S. Tong, and J. Glass, “Unsupervised cross-modal alignment of speech and text embedding spaces,” Advances in Neural Information Processing Systems, 2018.

versity of Wrocław. He received his M.Sc. degree in electrical engineering from the Wrocław University of Technology, Poland and EE PhD from the University of Louisville, Kentucky in 2012. He has worked with several research teams, including Google Brain, Microsoft Research and Yoshua Bengio’s lab at the University of Montreal. His research interests are applications of neural networks to problems which are intuitive for humans but difficult for machines, such as speech and natural language processing.

Ron J. Weiss is a software engineer at Google where he has worked on content-based audio analysis, recommender systems for music, noise robust speech recognition, speech translation, and speech synthesis. Ron completed his Ph.D. in electrical engineering from Columbia University in 2009 where he worked in the Laboratory for the Recognition of Speech and Audio. From 2009 to 2010 he was a postdoctoral researcher in the Music and Audio Research Laboratory at New York University.

Samy Bengio (PhD in computer science, University of Montreal, 1993) has been a research scientist at Google since 2007. He currently leads a group of research scientists in the Google Brain team, conducting research in many areas of machine learning such as deep architectures, representation learning, sequence processing, speech recognition, image understanding, large-scale problems, adversarial settings, etc. He is the general chair for Neural Information Processing Systems (NeurIPS) 2018, the main conference venue for machine learning, was the program chair for NeurIPS in 2017, is action editor of the Journal of Machine Learning Research and on the editorial board of the Machine Learning Journal, was program chair of the International Conference on Learning Representations (ICLR 2015, 2016), general chair of BayLearn (2012-2015) and the Workshops on Machine Learning for Multimodal Interactions (MLMI’2004-2006), as well as the IEEE Workshop on Neural Networks for Signal Processing (NNSP’2002), and on the program committee of several international conferences such as NeurIPS, ICML, ICLR, ECML and IJCAI.

Aäron van den Oord is a research scientist at DeepMind, London. Aäron completed his PhD at the University of Ghent, Belgium in 2015. He has worked on unsupervised representation learning, music recommendation, generative modeling with autoregressive networks and various applications of generative models such as text-to-speech synthesis and data compression.

Journal

Electrical Engineering and Systems Science, arXiv (Cornell University)

Published: Jan 25, 2019
