Get 20M+ Full-Text Papers For Less Than $1.50/day. Subscribe now for You or Your Team.

Learn More →

A context encoder for audio inpainting

A context encoder for audio inpainting Andr es Mara oti, Nathana el Perraudin, Nicki Holighaus, and Piotr Majdak October 11, 2019 Abstract meaningful information while preventing audible artifacts so that the listener remains unaware of any occurred prob- We study the ability of deep neural networks (DNNs) to lem. Successful algorithms are limited to deal with a par- restore missing audio content based on its context, i.e., in- ticular class of audio signals [5], or they focus on a speci c paint audio gaps. We focus on a condition which has not duration of the problematic signal parts [6], and/or they received much attention yet: gaps in the range of tens of exploit a-priori information about the problem [7]. milliseconds. We propose a DNN structure that is pro- In this work, we explore a new machine-learning algo- vided with the signal surrounding the gap in the form of rithm with respect to the reconstruction of lost parts of time-frequency (TF) coecients. Two DNNs with either audio signals, i.e., gaps. From all possible classes of audio complex-valued TF coecient output or magnitude TF co- signals, we limit the reconstruction to instrumental music, ecient output were studied by separately training them i.e., mix of sounds from musical instruments organized in on inpainting two types of audio signals (music and musi- time. We focus on gaps of medium durations, that is, in cal instruments) having 64-ms long gaps. The magnitude the range of tens of milliseconds. We assume that gaps are DNN outperformed the complex-valued DNN in terms of separated in time, such that the local audio information signal-to-noise ratios and objective di erence grades. Al- surrounding the gap, namely, the context, is reliable and though, for instruments, a reference inpainting obtained can be exploited. through linear predictive coding performed better in both The proposed algorithm is based on an unsupervised metrics, it performed worse than the magnitude DNN for feature-learning algorithm driven by context-based sam- music. This demonstrates the potential of the magnitude ple prediction. It relies on a DNNs with convolutional and DNN, in particular for inpainting signals that are more fully connected layers (FCLs) trained to generate TF rep- complex than single instrument sounds. resentations of sounds being conditioned on contextual TF information. We call the algorithm context encoder, as in- troduced for images [8] in analogy to auto encoders [9]. 1 Introduction Our context encoder aims at studying the general ability of DNNs to accurately inpaint audio in the range of tens Locally degraded or even lost information is encountered of milliseconds from limited but reliable context in order in various audio processing tasks. Some examples are to determine factors with the largest potential for future corrupted audio les, lost information in audio transmis- improvement and details requiring a more sophisticated sion (referred to as packet-loss in the context of voice- method. over-IP transmission), and audio signals locally contami- nated by noise. Restoration of lost information in audio has been referred to as audio inpainting [1], audio inter- 1.1 Related deep-learning techniques /extrapolation [2, 3], or waveform substitution [4]. Re- Deep learning excels in classi cation, regression, and construction is usually aimed at providing a coherent and anomaly detection tasks [9] and it has also shown good Manuscript received on October 2018; revised on April 2019. results in generative modeling with techniques such as Andr es Mara oti, Nicki Holighaus, and Piotr Majdak are with variational auto encoders [10] and generative adversarial the Acoustics Research Institute, Austrian Academy of Sciences, networks [11]. Unfortunately, for audio synthesis only Wohllebengasse 12{14, 1040 Vienna, Austria. the latter has been studied, applying it to generate snip- Nathana el Perraudin is with the Swiss Data Science Center, ETH Zuric  h, Universit atstrasse 25, 8006 Zuric  h pets of sound [12{14]. In order to obtain meaningful Accompanying web page (sound examples, Matlab and Python results, state-of-the-art audio synthesis requires sophisti- code, color gures): cated networks [15, 16]. While these approaches directly https://andimarafioti.github.io/audioContextEncoder/. We thank the reviewers and the editor for their review and their help- predict audio samples based on the preceding samples, ful suggestions. This work has been supported by Austrian Science in the speech-synthesis eld, synthesis of audio in do- Fund (FWF) project MERLIN (Modern methods for the restoration mains other than time such as spectrograms [17], and mel- of lost information in digital signals;I 3067-N30). We gratefully ac- spectrograms [18, 19] have been proposed. In the eld knowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU used for this research. of speech transmission, DNNs have been used to achieve arXiv:1810.12138v2 [cs.SD] 10 Oct 2019 packet loss concealment [20]. tion originates in the combination of the TF representa- The synthesis of musical audio signals using deep learn- tion and the assumption of sparsity: TF sparse methods ing, however, is even more challenging [21]. A music signal are ill-suited to restore gaps that approach or exceed the is comprised of complex sequences ranging from short-term duration of the TF analysis and synthesis windows. This structures (any periodicity in the waveform) to long-term limitation is also valid, if less severe, for structured TF structures (like gures, motifs, or sections). In order to sparsity, rendering the sparsity-based methods as unsatis- simplify the problem brought by long-range dependencies, factory for inpainting medium duration gaps. TF domain music synthesis in multiple steps has been proposed in- is popular for inpainting short gaps, e.g., interpolation of cluding an intermediate symbolic representation like MIDI audio based on a Gabor regression model [6], or nonnega- sequences [22], and features of a parametric vocoder [23]. tive matrix and tensor factorization [31{33]. More recently, While these contributions provide insights on the design a powerful framework has been proposed for various audio of a neural network for audio synthesis, none of them ad- inverse problems [34] including time-domain audio inpaint- dresses conditions in which some audio information has ing, source separation [35], and declipping [36] even in a been lost, but the surrounding context is available. multichannel scenario [37]. All of these systems require valid audio data within a time-domain window, cf. [36], which makes them perfect for inpainting short gaps, but 1.2 Related audio-inpainting algorithms unsatisfactory for medium gap durations. The term "audio inpainting" was coined by Adler et al. to On the other hand, for inpainting long gaps, recent meth- describe a large class of inverse problems in audio process- ods leverage repetition and determine the most promis- ing, while focussing their own study on the restoration of ing reliable segment from uncorrupted portions of the in- gaps in audio signals [1]. The general assumption for audio put signal [5, 7]. Restoration is then achieved by inserting inpainting is that audio is represented in some domain as the determined segment into the gaps. These methods do data and some chunks of that data are corrupted yielding not claim to restore the missing gap perfectly, they aim gaps in the representation. at plausibility. For example, a method based on MFCC The number and duration of the gaps as well as the type feature similarity has been proposed for packet loss con- of corruption is manifold. For example, in declicking and cealment [5]. It explicitly targets a perceptually plausible declipping, corruptions may be frequent, but mostly con- restoration. Similarly, exemplar-based inpainting was pro- ned to disconnected time-segments of only few millisec- posed based on a graph encoding spectro-temporal simi- onds duration or less. We refer to inpainting such gaps as larities within an audio signal [7]. In both studies, gap du- inpainting of short gaps. On the other hand, gaps on a rations were beyond several hundreds of milliseconds and scale of hundreds of milliseconds or even seconds may hap- their reconstruction needed to be evaluated in psychoa- pen, e.g., when reading partially damaged physical media, coustic experiments. Other examples for similar methods in live music recordings, when unwanted noise originat- are [38{41]. While all these methods might be in general ing from the audience needs to be removed, or in audio capable of inpainting gaps of medium duration, the tar- transmission with a total loss of the connection between get of the inpainting is always plausible instead of accurate transmitter and receiver lasting for seconds. We refer to reconstructions. inpainting such gaps as inpainting long gaps. In contrast, we de ne medium gaps as those with tens of When restricting the inpainting to simple sounds such as milliseconds duration, a scale on which the non-stationary musical instruments, linear prediction coding (LPC) [42] characteristic of audio already becomes important, but the can be applied even for medium gap durations. While LPC extrapolation of the missing information from short con- may sound antiquated, it is particularly suitable for the text surrounding the gap still seems feasible. Medium instrument sounds as it models the way the sound is cre- gaps may arise as a consequence of packet loss in audio ated by many instruments, i.e., by means of weighted sum transmission [5] or when short interruption happens while of resonances. From the algorithmic perspective, LPC is reading audio from partially damaged physical media. In- simple but recursive, thus, allows to synthesize complex terestingly, not much has been done for audio inpainting sound signals at a low computational power. Initially pro- of medium gaps. posed for inpainting short bursts of lost samples [43], LPC- In contrast, for inpainting short gaps, various solutions based inpainting algorithms model the signal as an acoustic have been proposed. [1] proposed a framework based on source ltered by an all-pole lter. The model parameters orthogonal matching pursuit (OMP), which has inspired are derived from the context and the missing signal part a considerable amount of research exploiting TF spar- is synthesized by extrapolating the context into the gap. sity [24{27] or structured sparsity [28{30]. Being tempted LPC-based methods work well for inpainting gaps for du- to extend these works to medium gap durations, one gets rations from 5 to 100 ms [3, 44]. LPC-based methods are disappointed quite soon because for increasing gap dura- particularly good in inpainting gaps consisting of many tions (from the originally targeted of 10 ms to medium consecutive missing audio samples surrounded by reliable gap durations of around 50 ms), the reconstruction quality context [44]. In our experiments for medium gaps, the substantially decreases, see Fig. 1 in [27]. The degrada- LPC-based algorithm [44] performed better than the lat- 2 ests reports on OMP-based algorithms [27]. As it seems, 2.1 Pre-processing stage when it comes to inpainting medium gaps, the LPC-based We use STFT, which enables a robust synthesis of the time- method [44] seems to be the choice for a reference method. domain signal from the reconstructed TF coecients. The The performance of LPC-based methods relies on the un- STFT is determined by the analysis window, hop size a, derlying assumption of signal stationarity. Deep-learning and the number of frequency channels M . In our study, techniques, on the other hand, promise a more generalized the analysis window was an appropriately normalized Hann signal representation. A combination of TF representation window of length M and a was M=4, enabling perfect re- with deep-learning techniques may provide better inpaint- construction by an inverse STFT with the same parameters ing whenever the lost data cannot be predicted by LPC. and window. Thus, here, we propose to link deep-learning techniques The STFT is applied to the signal s 2 R (containing L with audio inpainting. samples of audio) resulting in S, both of which consist of the context before and after the gap (containing L samples each) and the gap (containing L samples), 2 Context Encoder 0 1 Our end-to-end system is presented in Fig. 1. We con- @ A s = 0 and S = S ; 0 ; S ; L 1 b a g (M=2+1)N sider the audio signal s consisting of the gap s and the context signals before and after the gap, s and s , re- b a spectively (Fig. 1a). Given that convolutional networks L where s ; s 2 R , N = (L M )=a + 1, and S ; S 2 b a g g b a applied directly on time-domain signals would require ex- (M=2+1)N C with N = L =a. 0 is a matrix with R c c RC tremely large training datasets [45], we provide the network rows and C columns containing only zeros. with TF coecients. The TF coecients are obtained from Then, S and S are split into real and imaginary parts, b a an invertible representation, namely, a redundant short- Re Im Re Im resulting in four channels S ; S ; S ; S , which are b b a a time Fourier transform (STFT) [46, 47]. Our network, in- fed to the network. spired by the context encoder for image inpainting [8], is an encoder-decoder pipeline fed with TF coecients of the 2.2 Encoder context information, S and S (Fig. 1b). In order to b a study the general ability of DNNs to accurately inpaint For the architecture of the encoder, [8] used the rst ve audio in the range of tens of milliseconds, our network is layers from [52] to process images. To adapt the design of comprised only of standard widely-used building blocks, our network to process TF coecients, our encoder con- i.e., convolutional layers, FCLs, and recti ed linear units sists of six regular convolutional layers sequentially con- (ReLUs). The network predicts TF coecients of the gap nected via ReLUs, after which batch normalization [53] is S (Fig. 1c), which are then merged with the stripped TF applied. Instead of using classical squared lters, we used coecients of the context, (Fig. 1d), in order to synthesize rectangular lters to give the encoder more capacity on fre- the reconstruction in the time domain, s (Fig. 1e). quency over time in the TF representation. For M = 512, To study the e ect of the phase of the reconstructed the resulting encoder architecture is shown in Figure 2. TF representations, we considered two equivalent networks Re Im Re Im The inputs S ; S ; S ; S of the context informa- b b a a with di erent outputs: (a) complex network, i.e., a net- tion are treated as separate channels, thus, the network is work directly reconstructing the complex-valued TF coef- required to learn how the channels interact and how to mix cients which are then applied to the inverse STFT for the them. Because the encoder is comprised of only convolu- synthesis of the time-domain audio signal, and (b) mag- tional layers, the information can not reliably propagate nitude network, i.e., a network reconstructing the magni- from one end of the feature map to another. This is a con- tude coecients only, which are then applied to a phase- sequence of convolutional layers connecting all the feature reconstruction algorithm in order to obtain complex-valued maps together, but never directly connecting all locations TF coecients required for the signal synthesis. From ac- within a speci c feature map [8]. curate TF magnitude information, phaseless reconstruc- tion methods such as [48{50] are known to provide per- ceptually close, often indiscernible, reconstruction despite 2.3 Decoder the resulting time-domain waveforms usually being rather Similar to [8], the decoder begins with a FCL and a ReLU di erent. nonlinearity in order to spread the encoder's information The software was implemented in Tensor ow [51] and is among the channels. FCLs are computationally expensive; publicly available. in our case it contains 38% of all the parameters of the network. All the subsequent layers are (de-)convolutional Before xing the network structure described in the remainder of this section, we experimented with di erent standard architectures, and, as for the encoder, connected by ReLUs with batch depths, and kernel shapes, out of which the current structure showed the most promise. This is in contrast to machine-learning methods solving classi - www.github.com/andimara oti/audioContextEncoder cation tasks, in which such a synthesis is not targeted. 3 b) c) d) S ' S '  b a a) e) Context STFT Merge Synthesis Encoder s s s s ' s ' g s '  b g a b a S S S '  b a g S ' S ' S ' b g a Figure 1: The end-to-end system. a) Audio signal in the time domain, s is the gap. b) Audio signal in the TF domain, S and S is the context before and after the gap, respectively. c) Reconstructed gap S in the TF domain. b a g 0 0 0 d) Reconstruction S merged with the stripped context S and S in the TF domain. e) Reconstructed signal in the g b a time domain, including the inpainted gap, s . (3,17) (2,11) (1,9) (1,5) (2,5) channels 128 512 256 160 128 channel channels (89, 7) channels channels channels channels Reshape 8 8 2 2 2 height: width: 8 Figure 2: The encoder is a convolutional network with six layers followed by reshaping. The four channel TF input is encoded into a matrix of size of 2048. Gray rectangles represent the convolution lters with size expressed as (height, width). White cubes represent the signal. normalization. The rst three layers use squared lters, the 2.4 Post-processing stage remaining two layers use rectangular lters to give the de- The post-processing stage synthesizes the audio signal coder more capacity on frequency over time in the output of the context and the inpainted gap. To this end, TF representation. Figure 3 shows the decoder architec- (M=a 1) coecients of the context extending into the ture for M = 512 and a gap size L = 1024 samples. 0 0 gap are removed, yielding the stripped context, S ; S 2 b a (M=2+1)(N M=a+1) C . Then, the reconstructed TF coe- The decoder does not only output the gap content, but cients from the decoder, S , are inserted between the TF also the TF coecients connecting the gap with the con- g 0 0 0 coecient of the stripped context, S and S , yielding the b a text. Thus, the decoder output S is larger than the origi- 0 0 0 0 sequence S = (S ; S ; S ), having the same size as S. b g a nal gap by M=a 1 columns before and after the gap each, 0 (M=2+1)((L +M)=a1) Stripping the context and insertion of the reconstruction i.e., S 2 C . In our example with directly in the TF domain prevents transitional artifacts L = 1024, M = 512 and a = M=4, shown in Fig. 3, every between the context and the gap because synthesis by the decoder output channel is of size 257 11. inverse STFT introduces an inherent cross-fading. For the complex network, the decoder output represents Note that the nal layer depends on the network. For the real and imaginary parts of complex-valued TF coef- the complex network, the nal layer has two outputs, cor- cients S and the inverse STFT can be directly applied responding to the real and imaginary part of the complex- yielding s . valued TF coecients. For the magnitude network, the nal layer has a single output for the magnitude TF coef- For the magnitude network, the decoder output repre- cients. We denote the output TF coecients as S . sents the magnitudes of the TF coecients and the missing 0 (5,67) (11,257) (8,8) (5,5) (3,3) channel 128 11 32 channels 128 512 257 channels channels channels 1 or 2 FCL channels channels channels channels Reshape Reshape Reshape 32 32 height: 514 257 16 11 width: 1 Figure 3: The decoder architecture for the complex and magnitude network producing one and two channels of TF coecients, respectively. All other conventions as in Figure 2. phase information needs to be estimated separately. First, where the constant c > 0 controls the incorporated com- the phase gradient heap integration algorithm proposed pensation for small amplitude. In our experiments, c = 5 in [54] was applied to the magnitude coecients produced yielded good results. by the decoder in order to obtain an initial estimation of Finally, as proposed in [57], the total loss is the sum of the TF phase. Then, this estimation was re ned by apply- the loss function and a regularization term controlling the ing 100 iterations of the fast Grin-Lim algorithm [48, 49]. trainable weights in terms of their ` -norm: We modi ed the version implemented in the Phase Re- 0 2 trieval Toolbox Library [55] to use the valid phase from T = F (S ; S ) + w ; (2) g g the context at every iteration. The resulting complex- valued TF coecients S were then transformed into a with w being weights of the network and  being the reg- time-domain signal s by inverse STFT. ularization parameter, here, set to 0:01. The numerical optimizations were done using the stochastic gradient de- scent solver ADAM [58]. 2.5 Loss Function The network training is based on the minimization of the 3 Evaluation total loss of the reconstruction. To this end, the recon- struction loss is computed by comparing the original gap The main objective of the evaluation was to investigate TF coecients S with the reconstructed gap TF coe- 0 our networks' ability to adapt to audio signals. The evalu- cients S . Targeting an accurate reconstruction of the lost ation is based on a comparison of the inpainting results to information, we optimize an adapted ` -based loss instead those obtained for the reference method, i.e., LPC-based of mixing the ` -loss with an adversarial term [8]. For this extrapolation [44]. The inpainting quality was evaluated type of network [56], the comparison can be done on the by means of objective di erence grades (ODGs, [59]) and basis of the squared ` -norm of the di erence between S signal-to-noise ratios (SNRs) applied to the time-domain and S , commonly known as mean squared error (MSE). waveforms and magnitude spectrograms. The MSE would depend on the total energy of S , putting We considered two classes of audio signals: instrument more weight on signals containing more energy. In order sounds and music. The respective networks were trained to avoid that, the normalized mean squared error (NMSE) on the targeted signal class, with an assumed gap size of can be used, which normalizes MSE by the energy of S . 64 ms. Reconstruction was evaluated on the trained signal Compared to MSE, NMSE puts more weight on small er- class and other signals for 64 ms gaps. rors when the energy of S is small. In practice, however, Additionally, we evaluated the e ect of the gap duration minor deviations from S are insigni cant regardless of the by evaluating the magnitude network for 48 ms gaps. content of S , and NMSE would be too sensitive. Therefore, for the calculation of the loss function, we use a weighted mix between MSE and NMSE, 3.1 Parameters 0 2 The sampling rate was 16 kHz. We considered audio seg- kS S k g g F(S ; S ) = ; (1) g g ments with a duration of 320 ms, which corresponds to 1 2 c +kS k L = 5120 samples. For the STFT, the size of the win- 4 dow and the number of frequency channels M were xed The combination of these two algorithms provided consistently better results than separate application of either. to 512 samples, and a was 128 samples. 5 Each segment was separated in a gap of 64 ms corre- 3.3 Evaluation metrics sponding to L = 1024 of the central part of a segment and The rst metric was the SNR in dB, the context of twice of 128 ms, corresponding to L = 2048 samples. Consequently, N was 16, the input to the en- 2 kxk 25716 SNR(x; x ) = 10 log (3) coder was S ; S 2 C , and the output of the decoder b a 0 2 kx x k 0 25711 was S 2 C . calculated separately for each segment of a testing dataset. Then, we averaged SNRs across all segments of a testing 3.2 Datasets dataset. For the evaluation in the time domain, we used The dataset representing musical instruments was derived SNR(s ; s ), which is the SNR calculated on the gaps g g from the NSynth dataset [60]. NSynth is an audio dataset of the actual and reconstructed signals, s and s , respec- g g containing 305,979 musical notes from 1,006 instruments, tively. We refer to the average of this metric across all each with a unique pitch, timbre, and envelope. Each ex- segments to as SNR in the time domain (SNR ). TD ample is four seconds long, monophonic, and sampled at The SNR was also calculated on the magnitude spec- 16 kHz. trograms in order to accommodate for perceptually less- The dataset representing music was derived from the free relevant phase changes. We calculated SNR(jS j;jS j), g rg music archive (FMA, [61]). The FMA is an open and eas- where S represents the central 5 frames of the STFT rg ily accessible dataset, usually used for evaluating tasks in computed from the restored signal s and thus represents musical information retrieval. We used the small version of the restoration of the gap. In other words, we compute the the FMA comprised of 8,000 30-s segments of songs with SNR between the spectrograms of the original signal and eight balanced genres sampled at 44:1 kHz. We resampled the restored signal in the region of the gap. We refer to each segment to the sampling rate of 16 kHz. the average of this metric (across all segments of a test- ing dataset) to as SNR , where MS stands for magnitude The original segments in the two datasets were processed MS spectrogram. Note that SNR is directly related to the to t the evaluation parameters. First, for each example MS spectral convergence proposed in [62]. the silence at the beginning and end was removed. Second, Additionally, we computed the ODGs, which correspond from each example, pieces of the duration of 320 ms were to the subjective di erence grade used in human-based au- copied, starting with the rst segment at the beginning of dio test and is derived from the perceptual evaluation of a segment, continuing with further segments with a shift audio quality (PEAQ, [59]). ODG range from 0 to 4 of 32 ms. Thus, each example yielded multiple overlap- with the interpretation shown in Tab. 2. We calculated the ping segments s. Then, the energy of the segments was ODGs on signals of 2-s duration, with the inpainted gap evaluated and the ones that were completely silent were beginning at 0.5-s. We used the algorithm implemented removed. Note that for a gap of 64 ms, the segment can in [63]. be considered as a 3-tuple by labeling the rst 128 ms as the context before the gap s , the subsequent 64 ms as the ODG Impairment gap s , and the last 128 ms as the context after the gap 0 Imperceptible s . -1 Perceptible, but not annoying In order to avoid over tting, the datasets were split into -2 Slightly annoying training, validation, and testing sets before segmenting -3 Annoying them. For the instruments, we used the splitting proposed -4 Very annoying by [60]. The music dataset, was split into 70%, 20% and 10%, respectively. The statistics of the resulting sets are presented in Table 1. Table 2: Interpretation of ODGs. Count Percentage 3.4 Training Instruments training 19.4M 94.1 Instruments validation 0.9M 4.4 Both complex and magnitude networks were trained for Instruments testing 0.3M 1.5 the instrument and music dataset, resulting in four trained Music training 5.2M 70.0 networks. Each training started with the learning rate of Music validation 1.5M 20.0 3 10 . In the case of the magnitude network, the recon- Music testing 0.7M 10.0 structed phase was not considered in the training. Ev- ery 2000 steps, the training progress was monitored. To this end, signals from the validation dataset were inpainted Table 1: Subdivision of the datasets used in the evaluation. and the weighted NMSE was calculated between the pre- Count is the amount of examples. Percentage is calculated dicted and the actual TF coecients of the gap. When with respect to the full dataset. converging, which usually happened after approximately 6 4 600k steps, the learning rate was reduced to 10 and the were directly synthesized as sine oscillations with a xed training was continued by additional 200k steps. Table 3 frequency. The probes were generated within a logarithmic shows the SNR calculated for the training, validation, frequency range from 20 Hz to 8 kHz, linear phase shift MS and testing datasets. The similar values across subsets in- range from 0 to , and linear amplitude range from 0:1 to dicate no evidence for an over tting. 1. The duration was 320 ms corresponding to 5120 samples at the sampling rate of 16 kHz. Music Instruments Train Valid Test Train Valid Test Mean 7.6 7.8 7.8 22.1 21.9 21.9 Mag Std 4.2 4.0 4.3 9.9 10.2 10.0 Mean 4.9 5.1 5.4 17.8 18.3 18.2 Complex Std 4.0 4.2 4.5 10.5 10.3 10.1 Table 3: Over tting check by means of SNR (in dB) MS calculated between generated and original TF-coecients without the synthesis step for 64 ms gaps. A4 A# B C C# D D# E F F# G G# A5 Notes 3.5 Reference method Figure 4: SNR for reconstruction of pure tones with the MS We compared our results to those obtained with a refer- complex network trained on the instrument (black) and ence method based on LPC. For the implementation, we music (grey) dataset. SNR are shown as a function of MS followed [44], especially [44, Section 5.3]. In detail, the musical notes corresponding to the Standard pitch, i.e., the context signals s and s were extrapolated onto the gap b a note A4 corresponds to the frequency of 440 Hz. s by computing their impulse responses and using them as prediction lters for a classical linear predictor. The impulse responses were obtained using Burg's method [64] Figure 4 shows the SNR of the reconstruction ob- MS and were xed to have 1000 coecients according to [2] tained with the complex network. The abscissa shows and [65]. Their duration was the same as that for our notes, i.e., frequencies corresponding to the Standard pitch context encoder in order to provide the same amount of (with A corresponding to the frequency of 440 Hz). For context information. The two extrapolations were mixed the network trained on the instruments, the SNR was MS with the squared-cosine weighting function. Our imple- large in the proximity of notes and decreased by more than mentation of the LPC extrapolation is available online . 15 dB for frequencies between the notes. This shows that Then, we evaluated the results produced by the refer- the network was able to better predict signals correspond- ence method in the same way as we evaluated the results ing to the trained notes, indicating a good adaptation to produced by the networks. the trained material. Music contains more broadband sounds such as drums, breathing, tone glides, i.e., sounds with non-signi cant 4 Results and discussion energy at frequencies between the Standard pitch being non-stationary even within the tested 320 ms. A network 4.1 Ability to adapt to the training mate- trained on music is expected to be less sensitive to predic- rial tions performed on Standard pitch only. Figure 4 shows As a general rule, a trained neural network should perform the SNR obtained for the reconstruction of pure tones MS well on the distribution that it learned from. As the instru- with the network trained on the music. The SNR uc- MS ment dataset is made of discrete in-tune instrument notes, tuations were smaller than those from the network trained each note can be considered as a sum of discrete frequen- on the instruments. This further supports our conclusion cies arranged in time. If our network was able to adapt about the good ability of our network structure to adapt to the instrument sounds then it should perform on these to various training materials. frequencies better than on others. To evaluate this, we probed our trained networks with 4.2 E ect of the network type stationary tones of various frequencies. The pure tones 5 The di erence between the magnitude and complex net- We also considered training on the instrument training dataset (800k steps) followed by a re nement with the music training dataset works both trained on instruments can be anticipated from (300k steps). While it did not show substantial di erences to the the Figure 5, which shows the SNR of the reconstruc- MS training performed on music only, a pre-trained network on music tions of pure tones. As an average over frequency, the with a subsequent re nement to genre may show improvements for magnitude network provided an SNR of 10:2 dB larger MS that genre. www.github.com/andimara oti/audioContextEncoder than that of the complex network. For the magnitude net- SNR [dB] MS work, the SNR was more or less similar for frequencies types, reconstructions of the testing datasets were per- MS up to 200 Hz and decreased with frequency. For the com- formed. Table 4 shows the SNR and ODG of those MS plex network, the SNR decrease started already at ap- predictions. The magnitude network resulted in consis- MS proximately 100 Hz and was much steeper than that of the tently better results with an SNR di erence of 2:3 dB MS magnitude network. Above the frequency of approximately and 3:5 dB when tested on music and instruments, respec- 4 kHz, the complex network provided an extremely poor tively. Similarly, ODGs favor the magnitude network, al- SNR of 5 dB or less, indicating that the complex network though to a smaller extent. The comparison may appear MS had problems reconstructing the signals at higher frequen- awed because the magnitude network has to predict only cies. This is in line with [66], where neural networks were half of the features to be predicted by the complex network, trained to reconstruct phases of amplitude spectrograms at almost the same number of neurons. However, even and their predictions were also poorer for higher frequen- doubling the size of the complex network would not yield cies. signi cantly better predictions, as the link between the size of a DNN and its performance is not proportional [67]. In addition to the improvement in SNR and ODG MS of the magnitude network over the complex network, the complex network predictions were observed to often be cor- rupted by clearly audible broadband noise . Music Instruments Mag Complex LPC Mag Complex LPC Mean SNR 7.7 5.4 6.3 22.4 18.5 30.5 MS Magnitude network Std SNR 4.3 4.5 5.1 10.7 10.2 18.9 MS Complex network Mean ODG -0.8 -1.0 -0.8 -1.6 -1.8 -0.3 Std ODG 0.4 0.2 0.2 1.0 0.9 0.3 0.1 0.5 1 2 4 8 Frequency [kHz] Table 4: SNR (in dB) and ODGs of reconstructions of MS Figure 5: SNR for reconstruction of pure tones with MS 64 ms gaps for the complex and magnitude networks, as the complex (black) and magnitude (grey) networks both well as for the LPC-based method. trained to the instruments database. The thicker lines show averages over 25 surrounding frequency points. 4.3 Comparison to the reference method Table 4 provides the SNR and ODGs for the LPC-based MS Unfortunately, the problem of poor high-frequency re- reference reconstruction method. When tested on music, construction also persisted when predicting instrument on average, our magnitude network outperformed the LPC- sounds instead of pure tones. Figure 6 shows the spec- based method in terms of SNR by 1.4 dB. When tested MS trogram of an original sound from the instrument testing on instruments, our magnitude network underperformed set (left panel) and of its reconstruction obtained from the LPC by 8.6 dB, which was also re ected in poorer the complex network (center panel). The reconstruction ODGs. Both SNRs and ODGs reveal a consistent pic- clearly fails at frequencies higher than 4 kHz. ture. The LPC-based method seems to better inpaint in- struments. The CE seems to be better or equivalent for inpainting music. This can be attributed to the better compliance of the instruments with the LPC, and a better universality of our CE. In order to look more deeply into the di erences between the two inpainting methods, we compared their abilities to inpaint frequency sweeps. A sweep represents a controlled frequency modulation, which violates the assumptions for the LPC and is not present in the data the CE was trained 0 10 20 0 10 20 0 10 20 on. The signal consisted of a sum of ve linear frequency sweeps with a 320-ms duration each, starting frequencies Figure 6: Magnitude spectrograms (in dB) of an exem- of 500, 2000, 3500, 5000 and 6500 Hz, and bandwidth of plary signal reconstruction. Left: Original signal. Center: 500 Hz. Figure 7 shows the signal and the inpainting re- Reconstruction by the complex network. Right: Recon- sults. The gap inpainted by the LPC method (right panel) struction by the LPC-based method. The gap was the shows constant frequencies expanding into the gap causing area between the two red lines. a discontinuity in the gap's center. In contrast, the gap visit https://andimara oti.github.io/audioContextEncoder/ for In order to further compare between the two network audio examples. Frequency [kHz] SNR [dB] MS 8 inpainted by the magnitude network (center panel) follows the frequency changes better at the price of noise appearing between the sweeps. Other interesting examples are shown in Figure 8. The top row shows an example in which the magnitude net- work outperformed the LPC-based method. In this case, the signal is comprised of steady harmonic tones in the left side context and a broadband sound in the right side context. While the LPC-based method extrapolated the 0 13 24 37 0 13 24 37 0 13 24 37 broadband noise into the gap, the magnitude network was able to foresee the transition from the steady sounds to the broadband burst, yielding a prediction much closer to the original gap, with a 13 dB larger SNR than that from MS the LPC-based method. On the other hand, the magnitude network did not al- ways outperform the LPC-based method. The bottom row of Fig. 8 shows spectrograms of such an example. This sig- nal had stable sounds in the gap, which were well-suited for an extrapolation, but rather complex to be perfectly recon- 0 0 13 24 37 0 13 24 37 0 13 24 37 structed by the magnitude network. Thus, the LPC-based method outperformed the magnitude network yielding a Figure 8: Magnitude spectrograms (in dB) of exemplary 9 dB larger SNR . MS signal reconstructions. Left: Original signal. Center: Re- construction by the magnitude network. Right: Recon- struction by the LPC-based reference method. Top: Ex- ample with the magnitude network outperforming the ref- erence by an SNR of 13 dB. Bottom: Example with the MS magnitude network underperforming the reference by an SNR of 9 dB. MS performed our network by 12 dB. 0 13 24 37 0 13 24 37 0 13 24 37 The excellent performance of the LPC-based method re- constructing instruments can be explained by the assump- Figure 7: Log-magnitude spectrograms (in dB) of an ex- tions behind the LPC well- tting to the single-note instru- ponential frequency sweep. Left: Original signal. Center: ment sounds. These sounds usually consist of harmon- Reconstruction by the magnitude network. Right: Recon- ics stable on a short-time scale. LPC extrapolates these struction by the LPC-based method. harmonics preserving the spectral envelope of the signal. Nevertheless, the magnitude network yielded an SNR MS of 22.4 dB, on average, demonstrating a good ability to Finally, Table 5 presents the SNR of reconstructions TD reconstruct instrument sounds. of the instrument and music. Note that the SNR pro- When applied on music, the performance in terms of TD vided for the magnitude network is for the sake of com- SNR of both methods was much poorer, with our net- MS pleteness only. The SNR metric is highly sensitive to work performing slightly but statistically signi cantly bet- TD phase di erences, which do not necessarily lead to percep- ter than the LPC-based method. The better performance tual di erences and, for the magnitude network, is recon- of our network can be explained by its ability to adapt structed with an accuracy of up to a constant phase shift. to transient sounds and modulations in frequencies, sound Thus, SNR can remain low even in cases of very good properties that the LPC-based method is not suited to han- TD reconstructions. Hence, here, we compare the performance dle. of the complex network with that of the LPC-based method The gap duration of 64 ms is close to those tested in only. [27] when comparing various OMP methods. For 50 ms, For the music, on average, the complex network outper- their approaches showed SNR below 2 dB and ODG TD formed the LPC-based method providing a 0.3 dB larger values around -3 (see their Fig. 1 and 4). The LPC-based SNR . Given the large standard deviation, we performed method showed average SNR of 3.8 dB and ODGs of TD TD a pair t-test on the SNR which showed that the di er- -0.8. This con rms our assumption that for the studied TD ence was statistically signi cant (p < 0:001). For the in- range, the LPC is better suited than the sparsity-based struments, on average, the LPC-based reconstruction out- audio inpainting techniques. Frequency [kHz] Frequency [kHz] Frequency [kHz] Music Instruments both methods were rated equally with ODG between im- Complex Mag LPC Complex Mag LPC perceptible and perceptible but not annoying. LPC yielded Mean 3.8 1.1 3.5 16.0 14.6 28.0 better results when applied on more simple signals like in- Std 4.1 3.9 5.0 9.7 10.8 19.1 strument sounds. In general, our results suggest that stan- dard DNN components and a moderately sized network can be applied to form audio-inpainting models, o ering a Table 5: SNR (in dB) of reconstructions of 64 ms gaps TD number of angles for future improvement. for the complex and magnitude networks, as well as for the For example, we have analyzed two types of networks. LPC-based method. The complex network works directly on the complex-valued TF coecients. The magnitude network provides only 4.4 E ect of the gap duration magnitudes of TF coecients as output and relies on a sub- sequent phase reconstruction. We observed clear improve- The proposed network structure can be trained with di er- ment of the magnitude network over the complex network ent contexts and gap durations. For problems of varying especially in reconstructing high-frequency content. gap duration, a network trained to the particular gap dura- From our study, it follows that DNNs, when applied to tion might appear optimal. However, training takes time, inpainting audio gaps for medium durations, do not su er and it might be simpler to train a network to single gap from the restrictions of previous methods. Additionally, duration and use it to reconstruct any shorter gap as well. even for a simple DNN, the performance on complex signals In order to test this idea, we introduced gaps of 48 ms is already on par with the state of the art. It also follows (corresponding to L = 768 samples) in our testing that by representing audio as TF coecients, a generative datasets. These gaps were then reconstructed by the mag- network developed for image inpainting can be adapted to nitude network trained for 64 ms gaps. As this network audio inpainting. outputs, at reconstruction time, a solution for a gap of Generally, better results can be expected for increased length 64-ms, the 48-ms gaps needs to be enlarged. We depth of the network and the available context. Experi- tested three approaches to enlarge them: by discarding ments with our method for longer medium-duration gaps 16 ms forwards, 16 ms backwards, or 8 ms forwards and and longer context can be easily implemented just by 8 ms backwards (centered). adapting the parameters of the network. Nevertheless, we Table 6 shows SNR obtained from averaging the re- MS expect technical limitations like computational power to constructions of the three types of gap enlargements. Also, be an issue for long contexts. Instead, a study of more e- the corresponding SNR for the LPC-based method are MS cient audio features will be required. Our STFT features, shown. The results are similar to those obtained for larger meant in this study as a reasonable rst choice, provided gaps: for the instruments, the LPC-based method outper- a decent performance, however, in the future, we expect formed our network; for the music, our network outper- hearing-related features to provide better reconstructions. formed the LPC-based method. In particular, an investigation of Audlet frames, i.e., in- Music Instruments vertible time-frequency systems adapted to perceptual fre- Ours LPC Ours LPC quency scales, [68], as features for audio inpainting seem Mean 8.0 6.9 21.8 33.2 to o er intriguing opportunities. Std 4.6 5.5 11.8 20.1 In the future, instead of training on a very general dataset, improved performance can be obtained for more specialized networks trained to speci c genres or instru- Table 6: SNR (in dB) of reconstructions of 48 ms gaps MS mentation. Further, applied to a complex mixture and for the magnitude network and the LPC-based method. potentially preceded by a source-separation algorithm, our proposed architecture could be used jointly in a mixture- of-experts, [69], approach. 5 Conclusions and Outlook We proposed a neural network architecture for inpainting References medium gaps of audio. The study aims at showing general abilities of a neural network working on TF coecients as a [1] A. Adler, V. Emiya, M. G. Jafari, M. Elad, R. Gribon- context encoder. The proposed network was able to adapt val, and M. D. Plumbley, \Audio inpainting," IEEE to the particular frequencies provided by the training ma- Transactions on Audio, Speech and Language Process- terial. It was able to reconstruct frequency modulations ing, vol. 20, no. 3, pp. 922{932, March 2012. better than the LPC-based reference method and it was able to inpaint gaps shorter than the trained ones. For the [2] I. Kauppinen, J. Kauppinen, and P. Saarinen, \A reconstruction of complex signals like music, our network method for long extrapolation of audio signals," Jour- was able to outperform the LPC-based reference method, nal of the Audio Engineering Society, vol. 49, no. 12, in terms SNR calculated on magnitude spectrograms, and pp. 1167{1180, 2001. 10 [3] W. Etter, \Restoration of a discrete-time signal seg- [14] J. Engel, K. K. Agrawal, S. Chen, I. Gulrajani, ment by interpolation based on the left-sided and C. Donahue, and A. Roberts, \Gansynth: Adversarial right-sided autoregressive parameters," IEEE Trans- neural audio synthesis," in Proceedings of the 7th In- actions on Signal Processing, vol. 44, no. 5, pp. 1124{ ternational Conference on Learning Representations, 1135, may 1996. 2019. [4] D. Goodman, G. Lockhart, O. Wasem, and W.-C. [15] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, Wong, \Waveform substitution techniques for recover- J. Sotelo, A. Courville, and Y. Bengio, \SampleRNN: ing missing speech segments in packet voice commu- An unconditional end-to-end neural audio generation nications," IEEE Transactions on Acoustics, Speech model," in Proc. of ICLR, 2017. and Signal Processing, vol. 34, no. 6, pp. 1440{1448, [16] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, dec 1986. O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, [5] Y. Bahat, Y. Schechner, and M. Elad, \Self-content- and K. Kavukcuoglu, \Wavenet: A generative model based audio inpainting," Signal Processing, vol. 111, for raw audio," CoRR, vol. abs/1609.03499, 2016. pp. 61{72, jun 2015. [Online]. Available: http://arxiv.org/abs/1609.03499 [6] P. J. Wolfe and S. J. Godsill, \Interpolation of missing [17] Y. Saito, S. Takamichi, and H. Saruwatari, \Text-to- data values for audio signal restoration using a gabor speech synthesis using STFT spectra based on low- regression model," in Proc. of ICASSP, vol. 5. IEEE, /multi-resolution generative adversarial networks," in 2005, pp. v{517. Proc. of ICASSP. IEEE, 2018, pp. 5299{5303. [7] N. Perraudin, N. Holighaus, P. Majdak, and P. Bal- [18] J. Shen, R. Pang, R. Weiss, M. Schuster, N. Jaitly, azs, \Inpainting of long audio segments with similarity Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry- graphs," IEEE/ACM Transactions on Audio, Speech Ryan, R. Saurous, Y. Agiomyrgiannakis, and Y. Wu, and Language Processing, vol. PP, no. 99, pp. 1{1, \Natural TTS synthesis by conditioning WaveNet on 2018. mel spectrogram predictions," in Proc. of ICASSP. IEEE, 2018. [8] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. Efros, \Context encoders: Feature learning by [19] Z. Jin, A. Finkelstein, G. J. Mysore, and J. Lu, \Fft- inpainting," in Proc. of CVPR, 2016. net: A real-time speaker-dependent neural vocoder," in Proc. of ICASSP. IEEE, 2018, pp. 2251{2255. [9] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www. [20] B.-K. Lee and J.-H. Chang, \Packet loss concealment deeplearningbook.org. based on deep neural networks for digital speech transmission," IEEE/ACM Trans. Audio, Speech and [10] D. Kingma and M. Welling, \Auto-encoding varia- Lang. Proc., vol. 24, no. 2, pp. 378{387, Feb. tional bayes." in Proc. of ICLR, 2014. 2016. [Online]. Available: http://dx.doi.org/10.1109/ TASLP.2015.2509780 [11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Ben- [21] S. Dieleman, A. v. d. Oord, and K. Simonyan, \The gio, \Generative adversarial nets," in Advances in challenge of realistic music generation: modelling raw neural information processing systems, 2014, pp. audio at scale," in Proc. of NeurIPS, 2018. 2672{2680. [22] N. Boulanger-Lewandowski, Y. Bengio, and P. Vin- [12] C. Donahue, J. McAuley, and M. Puckette, \Adver- cent, \Modeling temporal dependencies in high- sarial audio synthesis," in Proceedings of the 7th In- dimensional sequences: Application to polyphonic ternational Conference on Learning Representations, music generation and transcription," in Proc. of ICML, 2012. [13] A. Mara oti, N. Perraudin, N. Holighaus, and [23] M. Blaauw and J. Bonada, \A neural parametric P. Majdak, \Adversarial generation of time-frequency singing synthesizer," in Proc. of INTERSPEECH, features with application in audio synthesis," in Proc. of the 36th ICML, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97. Long Beach, [24] A. Adler, V. Emiya, M. Jafari, M. Elad, R. Gribonval, California, USA: PMLR, 09{15 Jun 2019, pp. and M. Plumbley, \A constrained matching pursuit 4352{4362. [Online]. Available: http://proceedings. approach to audio declipping," in Proc. of ICASSP. mlr.press/v97/mara oti19a.html IEEE, may 2011. 11 [25] I. Toumi and V. Emiya, \Sparse non-local similarity [36] ||, \Audio declipping via nonnegative matrix fac- modeling for audio inpainting," in Proc. of ICASSP. torization," in 2015 IEEE Workshop on Applications Calgary, Canada: IEEE, Apr. 2018. of Signal Processing to Audio and Acoustics (WAS- PAA). IEEE, 2015, pp. 1{5. [26] S. Kiti c, N. Bertin, and R. Gribonval, \Sparsity and [37] A. Ozerov, C  . Bilen, and P. P erez, \Multichannel au- cosparsity for audio declipping: a exible non-convex dio declipping," in Proc. of ICASSP. IEEE, 2016, approach," in LVA/ICA 2015 - The 12th International pp. 659{663. Conference on Latent Variable Analysis and Signal Separation, Liberec, Czech Republic, Aug. 2015, p. 8. [38] E. Manilow and B. Pardo, \Leveraging repetition to [Online]. Available: https://hal.inria.fr/hal-01159700 do audio imputation," in 2017 IEEE Workshop on Ap- plications of Signal Processing to Audio and Acoustics [27] O. Mokry,  P. Z aviska, P. Rajmic, and V. Vesely, (WASPAA). IEEE, 2017, pp. 309{313. \Introducing SPAIN (sparse audion inpainter)," CoRR, vol. abs/1810.13137, 2018. [Online]. Available: [39] B. Martin, P. Hanna, T. V. Thong, M. Desainte- http://arxiv.org/abs/1810.13137 Catherine, and P. Ferraro, \Exemplar-based assign- ment of large missing audio parts using string match- [28] C. Gaultier, S. Kiti c, N. Bertin, and R. Gribonval, ing on tonal features." in Proc. of ISMIR, 2011, pp. \AUDASCITY: AUdio Denoising by Adaptive Social 507{512. CosparsITY," in 25th European Signal Processing Conference (EUSIPCO), Kos, Greece, Aug. 2017. [40] R. C. Maher, \A method for extrapolation of missing [Online]. Available: https://hal.inria.fr/hal-01540945 digital audio data," Journal of the Audio Engineering Society, vol. 42, no. 5, pp. 350{357, 1994. [29] K. Siedenburg, M. Kowalski, and M. D or er, \Audio Declipping with Social Sparsity," in Proc. of ICASSP. [41] A. Lukin and J. Todd, \Parametric interpolation of Florence, Italy: IEEE, May 2014, pp. AASP{L2. gaps in audio signals," in Audio Engineering Society [Online]. Available: https://hal.archives-ouvertes.fr/ Convention 125. Audio Engineering Society, 2008. hal-01002998 [42] T. E. Tremain, \The government standard linear pre- [30] F. Lieb and H.-G. Stark, \Audio inpainting: Evalua- dictive coding algorithm: Lpc-10," Speech Technology, tion of time-frequency representations and structured pp. 40{49, Apr. 1982. sparsity approaches," Signal Processing, vol. 153, pp. 291{299, 2018. [43] A. Janssen, R. Veldhuis, and L. Vries, \Adaptive in- terpolation of discrete-time signals that can be mod- [31] J. Le Roux, H. Kameoka, N. Ono, A. De Cheveigne, eled as autoregressive processes," IEEE Transactions and S. Sagayama, \Computational auditory induction on Acoustics, Speech, and Signal Processing, vol. 34, as a missing-data model- tting problem with bregman no. 2, pp. 317{330, 1986. divergence," Speech Communication, vol. 53, no. 5, pp. 658{676, 2011. [44] I. Kauppinen and K. Roth, \Audio signal extrapolation{theory and applications," in Proc. [32] P. Smaragdis, B. Raj, and M. Shashanka, \Missing DAFx, 2002, pp. 105{110. data imputation for time-frequency representations of audio signals," Journal of signal processing systems, [45] J. Pons, O. Nieto, M. Prockup, E. M. Schmidt, A. F. vol. 65, no. 3, pp. 361{370, 2011. Ehmann, and X. Serra, \End-to-end learning for mu- sic audio tagging at scale," in Proc. of ISMIR, 2018. [33] U. S im sekli, Y. K. Ylmaz, and A. T. Cemgil, \Score guided audio restoration via generalised coupled ten- [46] M. Portno , \Implementation of the digital phase sor factorisation," in Proc. of ICASSP. IEEE, 2012, vocoder using the fast fourier transform," IEEE pp. 5369{5372. Trans. Acoust. Speech Signal Process., vol. 24, no. 3, pp. 243{248, 1976. [34] C. Bilen, A. Ozerov, and P. Prez, \Solving time- domain audio inverse problems using nonnegative ten- [47] K. Gr ochenig, Foundations of Time-Frequency Anal- sor factorization," IEEE Transactions on Signal Pro- ysis, ser. Appl. Numer. Harmon. Anal. Birkh auser, cessing, vol. 66, no. 21, pp. 5604{5617, Nov 2018. [35] C  . Bilen, A. Ozerov, and P. P erez, \Joint audio in- [48] D. Grin and J. Lim, \Signal estimation from modi- painting and source separation," in International Con- ed short-time fourier transform," IEEE Transactions ference on Latent Variable Analysis and Signal Sepa- on Acoustics, Speech and Signal Processing, vol. 32, ration. Springer, 2015, pp. 251{258. no. 2, pp. 236{243, 1984. 12 [49] N. Perraudin, P. Balazs, and P. L. Sndergaard, \A [60] J. Engel, C. Resnick, A. Roberts, S. Dieleman, fast grin-lim algorithm," in Applications of Signal M. Norouzi, D. Eck, and K. Simonyan, \Neural au- Processing to Audio and Acoustics (WASPAA), 2013 dio synthesis of musical notes with wavenet autoen- IEEE Workshop on. IEEE, 2013, pp. 1{4. coders," in Proc. of ICML, 2017, pp. 1068{1077. [61] M. De errard, K. Benzi, P. Vandergheynst, and [50] Z. Pr u sa, P. Balazs, and P. Sndergaard, \A nonitera- X. Bresson, \Fma: A dataset for music analysis," in tive method for reconstruction of phase from stft mag- 18th International Society for Music Information Re- nitude," IEEE/ACM Transactions on Audio, Speech trieval Conference, 2017. and Language Processing, vol. 25, no. 5, pp. 1154{ 1164, 2017. [62] N. Sturmel and L. Daudet, \Signal reconstruction from stft magnitude: A state of the art," in Inter- [51] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, national conference on digital audio e ects (DAFx), Z. Chen, C. Citro, G. Corrado, A. Davis, J. Dean, 2011, pp. 375{386. M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, [63] P. Kabal et al., \An examination and interpretation M. Kudlur, J. Levenberg, D. Man e, R. Monga, of itu-r bs. 1387: Perceptual evaluation of audio qual- S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, ity," TSP Lab Technical Report, Dept. Electrical & B. Steiner, I. Sutskever, K. Talwar, P. Tucker, Computer Engineering, McGill University, pp. 1{89, V. Vanhoucke, V. Vasudevan, F. Vi egas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, \TensorFlow: Large-scale machine [64] J. P. Burg, \Maximum entropy spectral analysis," learning on heterogeneous systems," 2015, software 37th Annual International Meeting, Soc. of Explor. available from tensor ow.org. [Online]. Available: Geophys., Oklahoma City, 1967. https://www.tensor ow.org/ [65] I. Kauppinen and J. Kauppinen, \Reconstruction method for missing or damaged long portions in au- [52] A. Krizhevsky, I. Sutskever, and G. E. Hinton, \Im- dio signal," Journal of the Audio Engineering Society, agenet classi cation with deep convolutional neural vol. 50, no. 7/8, pp. 594{602, 2002. networks," in Proc. of NIPS, 2012, pp. 1097{1105. [66] S. Takamichi, Y. Saito, N. Takamune, D. Kitamura, [53] S. Io e and C. Szegedy, \Batch normalization: Ac- and H. Saruwatari, \Phase reconstruction from am- celerating deep network training by reducing internal plitude spectrograms based on von-mises-distribution covariate shift," in Proc. of ICML, 2015, pp. 448{456. deep neural network," in International Workshop on [54] Z. Pr u sa and P. L. Sndergaard, \Real-Time Spec- Acoustic Signal Enhancement (IWAENC), 2018, pp. trogram Inversion Using Phase Gradient Heap Inte- 286{290. gration," in Proc. Int. Conf. Digital Audio E ects [67] K. He, X. Zhang, S. Ren, and J. Sun, \Deep residual (DAFx-16), Sep 2016, pp. 17{21. learning for image recognition," in 2016 IEEE Con- ference on Computer Vision and Pattern Recognition [55] Z. Pr u sa, \The Phase Retrieval Toolbox," in AES In- (CVPR), June 2016, pp. 770{778. ternational Conference On Semantic Audio, Erlangen, Germany, June 2017. [68] T. Necciari, N. Holighaus, P. Balazs, Z. Pra, P. Ma- jdak, and O. Derrien, \Audlet lter banks: A versa- [56] H. Zhao, O. Gallo, I. Frosio, and J. Kautz, \Loss tile analysis/synthesis framework using auditory fre- functions for image restoration with neural networks," quency scales," Applied Sciences, vol. 8, no. 1:96, IEEE Transactions on Computational Imaging, vol. 3, no. 1, pp. 47{57, March 2017. [69] S. E. Yuksel, J. N. Wilson, and P. D. Gader, \Twenty [57] A. Krogh and J. Hertz, \A simple weight decay can years of mixture of experts," IEEE transactions on improve generalization," in Advances in neural infor- neural networks and learning systems, vol. 23, no. 8, mation processing systems 4. Morgan Kaufmann, pp. 1177{1193, 2012. 1992, pp. 950{957. [58] D. Kingma and J. Ba, \Adam: A method for stochas- tic optimization," in Proc. of ICLR, 2015. [59] I. Recommendation, \1387: Method for objective measurements of perceived audio quality," Interna- tional Telecommunication Union, Geneva, Switzer- land, 2001. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Electrical Engineering and Systems Science arXiv (Cornell University)

Loading next page...
 
/lp/arxiv-cornell-university/a-context-encoder-for-audio-inpainting-QUlGabnykP

References (75)

ISSN
2329-9290
eISSN
ARCH-3348
DOI
10.1109/TASLP.2019.2947232
Publisher site
See Article on Publisher Site

Abstract

Andr es Mara oti, Nathana el Perraudin, Nicki Holighaus, and Piotr Majdak October 11, 2019 Abstract meaningful information while preventing audible artifacts so that the listener remains unaware of any occurred prob- We study the ability of deep neural networks (DNNs) to lem. Successful algorithms are limited to deal with a par- restore missing audio content based on its context, i.e., in- ticular class of audio signals [5], or they focus on a speci c paint audio gaps. We focus on a condition which has not duration of the problematic signal parts [6], and/or they received much attention yet: gaps in the range of tens of exploit a-priori information about the problem [7]. milliseconds. We propose a DNN structure that is pro- In this work, we explore a new machine-learning algo- vided with the signal surrounding the gap in the form of rithm with respect to the reconstruction of lost parts of time-frequency (TF) coecients. Two DNNs with either audio signals, i.e., gaps. From all possible classes of audio complex-valued TF coecient output or magnitude TF co- signals, we limit the reconstruction to instrumental music, ecient output were studied by separately training them i.e., mix of sounds from musical instruments organized in on inpainting two types of audio signals (music and musi- time. We focus on gaps of medium durations, that is, in cal instruments) having 64-ms long gaps. The magnitude the range of tens of milliseconds. We assume that gaps are DNN outperformed the complex-valued DNN in terms of separated in time, such that the local audio information signal-to-noise ratios and objective di erence grades. Al- surrounding the gap, namely, the context, is reliable and though, for instruments, a reference inpainting obtained can be exploited. through linear predictive coding performed better in both The proposed algorithm is based on an unsupervised metrics, it performed worse than the magnitude DNN for feature-learning algorithm driven by context-based sam- music. This demonstrates the potential of the magnitude ple prediction. It relies on a DNNs with convolutional and DNN, in particular for inpainting signals that are more fully connected layers (FCLs) trained to generate TF rep- complex than single instrument sounds. resentations of sounds being conditioned on contextual TF information. We call the algorithm context encoder, as in- troduced for images [8] in analogy to auto encoders [9]. 1 Introduction Our context encoder aims at studying the general ability of DNNs to accurately inpaint audio in the range of tens Locally degraded or even lost information is encountered of milliseconds from limited but reliable context in order in various audio processing tasks. Some examples are to determine factors with the largest potential for future corrupted audio les, lost information in audio transmis- improvement and details requiring a more sophisticated sion (referred to as packet-loss in the context of voice- method. over-IP transmission), and audio signals locally contami- nated by noise. Restoration of lost information in audio has been referred to as audio inpainting [1], audio inter- 1.1 Related deep-learning techniques /extrapolation [2, 3], or waveform substitution [4]. Re- Deep learning excels in classi cation, regression, and construction is usually aimed at providing a coherent and anomaly detection tasks [9] and it has also shown good Manuscript received on October 2018; revised on April 2019. results in generative modeling with techniques such as Andr es Mara oti, Nicki Holighaus, and Piotr Majdak are with variational auto encoders [10] and generative adversarial the Acoustics Research Institute, Austrian Academy of Sciences, networks [11]. Unfortunately, for audio synthesis only Wohllebengasse 12{14, 1040 Vienna, Austria. the latter has been studied, applying it to generate snip- Nathana el Perraudin is with the Swiss Data Science Center, ETH Zuric  h, Universit atstrasse 25, 8006 Zuric  h pets of sound [12{14]. In order to obtain meaningful Accompanying web page (sound examples, Matlab and Python results, state-of-the-art audio synthesis requires sophisti- code, color gures): cated networks [15, 16]. While these approaches directly https://andimarafioti.github.io/audioContextEncoder/. We thank the reviewers and the editor for their review and their help- predict audio samples based on the preceding samples, ful suggestions. This work has been supported by Austrian Science in the speech-synthesis eld, synthesis of audio in do- Fund (FWF) project MERLIN (Modern methods for the restoration mains other than time such as spectrograms [17], and mel- of lost information in digital signals;I 3067-N30). We gratefully ac- spectrograms [18, 19] have been proposed. In the eld knowledge the support of NVIDIA Corporation with the donation of the Titan X Pascal GPU used for this research. of speech transmission, DNNs have been used to achieve arXiv:1810.12138v2 [cs.SD] 10 Oct 2019 packet loss concealment [20]. tion originates in the combination of the TF representa- The synthesis of musical audio signals using deep learn- tion and the assumption of sparsity: TF sparse methods ing, however, is even more challenging [21]. A music signal are ill-suited to restore gaps that approach or exceed the is comprised of complex sequences ranging from short-term duration of the TF analysis and synthesis windows. This structures (any periodicity in the waveform) to long-term limitation is also valid, if less severe, for structured TF structures (like gures, motifs, or sections). In order to sparsity, rendering the sparsity-based methods as unsatis- simplify the problem brought by long-range dependencies, factory for inpainting medium duration gaps. TF domain music synthesis in multiple steps has been proposed in- is popular for inpainting short gaps, e.g., interpolation of cluding an intermediate symbolic representation like MIDI audio based on a Gabor regression model [6], or nonnega- sequences [22], and features of a parametric vocoder [23]. tive matrix and tensor factorization [31{33]. More recently, While these contributions provide insights on the design a powerful framework has been proposed for various audio of a neural network for audio synthesis, none of them ad- inverse problems [34] including time-domain audio inpaint- dresses conditions in which some audio information has ing, source separation [35], and declipping [36] even in a been lost, but the surrounding context is available. multichannel scenario [37]. All of these systems require valid audio data within a time-domain window, cf. [36], which makes them perfect for inpainting short gaps, but 1.2 Related audio-inpainting algorithms unsatisfactory for medium gap durations. The term "audio inpainting" was coined by Adler et al. to On the other hand, for inpainting long gaps, recent meth- describe a large class of inverse problems in audio process- ods leverage repetition and determine the most promis- ing, while focussing their own study on the restoration of ing reliable segment from uncorrupted portions of the in- gaps in audio signals [1]. The general assumption for audio put signal [5, 7]. Restoration is then achieved by inserting inpainting is that audio is represented in some domain as the determined segment into the gaps. These methods do data and some chunks of that data are corrupted yielding not claim to restore the missing gap perfectly, they aim gaps in the representation. at plausibility. For example, a method based on MFCC The number and duration of the gaps as well as the type feature similarity has been proposed for packet loss con- of corruption is manifold. For example, in declicking and cealment [5]. It explicitly targets a perceptually plausible declipping, corruptions may be frequent, but mostly con- restoration. Similarly, exemplar-based inpainting was pro- ned to disconnected time-segments of only few millisec- posed based on a graph encoding spectro-temporal simi- onds duration or less. We refer to inpainting such gaps as larities within an audio signal [7]. In both studies, gap du- inpainting of short gaps. On the other hand, gaps on a rations were beyond several hundreds of milliseconds and scale of hundreds of milliseconds or even seconds may hap- their reconstruction needed to be evaluated in psychoa- pen, e.g., when reading partially damaged physical media, coustic experiments. Other examples for similar methods in live music recordings, when unwanted noise originat- are [38{41]. While all these methods might be in general ing from the audience needs to be removed, or in audio capable of inpainting gaps of medium duration, the tar- transmission with a total loss of the connection between get of the inpainting is always plausible instead of accurate transmitter and receiver lasting for seconds. We refer to reconstructions. inpainting such gaps as inpainting long gaps. In contrast, we de ne medium gaps as those with tens of When restricting the inpainting to simple sounds such as milliseconds duration, a scale on which the non-stationary musical instruments, linear prediction coding (LPC) [42] characteristic of audio already becomes important, but the can be applied even for medium gap durations. While LPC extrapolation of the missing information from short con- may sound antiquated, it is particularly suitable for the text surrounding the gap still seems feasible. Medium instrument sounds as it models the way the sound is cre- gaps may arise as a consequence of packet loss in audio ated by many instruments, i.e., by means of weighted sum transmission [5] or when short interruption happens while of resonances. From the algorithmic perspective, LPC is reading audio from partially damaged physical media. In- simple but recursive, thus, allows to synthesize complex terestingly, not much has been done for audio inpainting sound signals at a low computational power. Initially pro- of medium gaps. posed for inpainting short bursts of lost samples [43], LPC- In contrast, for inpainting short gaps, various solutions based inpainting algorithms model the signal as an acoustic have been proposed. [1] proposed a framework based on source ltered by an all-pole lter. The model parameters orthogonal matching pursuit (OMP), which has inspired are derived from the context and the missing signal part a considerable amount of research exploiting TF spar- is synthesized by extrapolating the context into the gap. sity [24{27] or structured sparsity [28{30]. Being tempted LPC-based methods work well for inpainting gaps for du- to extend these works to medium gap durations, one gets rations from 5 to 100 ms [3, 44]. LPC-based methods are disappointed quite soon because for increasing gap dura- particularly good in inpainting gaps consisting of many tions (from the originally targeted of 10 ms to medium consecutive missing audio samples surrounded by reliable gap durations of around 50 ms), the reconstruction quality context [44]. In our experiments for medium gaps, the substantially decreases, see Fig. 1 in [27]. The degrada- LPC-based algorithm [44] performed better than the lat- 2 ests reports on OMP-based algorithms [27]. As it seems, 2.1 Pre-processing stage when it comes to inpainting medium gaps, the LPC-based We use STFT, which enables a robust synthesis of the time- method [44] seems to be the choice for a reference method. domain signal from the reconstructed TF coecients. The The performance of LPC-based methods relies on the un- STFT is determined by the analysis window, hop size a, derlying assumption of signal stationarity. Deep-learning and the number of frequency channels M . In our study, techniques, on the other hand, promise a more generalized the analysis window was an appropriately normalized Hann signal representation. A combination of TF representation window of length M and a was M=4, enabling perfect re- with deep-learning techniques may provide better inpaint- construction by an inverse STFT with the same parameters ing whenever the lost data cannot be predicted by LPC. and window. Thus, here, we propose to link deep-learning techniques The STFT is applied to the signal s 2 R (containing L with audio inpainting. samples of audio) resulting in S, both of which consist of the context before and after the gap (containing L samples each) and the gap (containing L samples), 2 Context Encoder 0 1 Our end-to-end system is presented in Fig. 1. We con- @ A s = 0 and S = S ; 0 ; S ; L 1 b a g (M=2+1)N sider the audio signal s consisting of the gap s and the context signals before and after the gap, s and s , re- b a spectively (Fig. 1a). Given that convolutional networks L where s ; s 2 R , N = (L M )=a + 1, and S ; S 2 b a g g b a applied directly on time-domain signals would require ex- (M=2+1)N C with N = L =a. 0 is a matrix with R c c RC tremely large training datasets [45], we provide the network rows and C columns containing only zeros. with TF coecients. The TF coecients are obtained from Then, S and S are split into real and imaginary parts, b a an invertible representation, namely, a redundant short- Re Im Re Im resulting in four channels S ; S ; S ; S , which are b b a a time Fourier transform (STFT) [46, 47]. Our network, in- fed to the network. spired by the context encoder for image inpainting [8], is an encoder-decoder pipeline fed with TF coecients of the 2.2 Encoder context information, S and S (Fig. 1b). In order to b a study the general ability of DNNs to accurately inpaint For the architecture of the encoder, [8] used the rst ve audio in the range of tens of milliseconds, our network is layers from [52] to process images. To adapt the design of comprised only of standard widely-used building blocks, our network to process TF coecients, our encoder con- i.e., convolutional layers, FCLs, and recti ed linear units sists of six regular convolutional layers sequentially con- (ReLUs). The network predicts TF coecients of the gap nected via ReLUs, after which batch normalization [53] is S (Fig. 1c), which are then merged with the stripped TF applied. Instead of using classical squared lters, we used coecients of the context, (Fig. 1d), in order to synthesize rectangular lters to give the encoder more capacity on fre- the reconstruction in the time domain, s (Fig. 1e). quency over time in the TF representation. For M = 512, To study the e ect of the phase of the reconstructed the resulting encoder architecture is shown in Figure 2. TF representations, we considered two equivalent networks Re Im Re Im The inputs S ; S ; S ; S of the context informa- b b a a with di erent outputs: (a) complex network, i.e., a net- tion are treated as separate channels, thus, the network is work directly reconstructing the complex-valued TF coef- required to learn how the channels interact and how to mix cients which are then applied to the inverse STFT for the them. Because the encoder is comprised of only convolu- synthesis of the time-domain audio signal, and (b) mag- tional layers, the information can not reliably propagate nitude network, i.e., a network reconstructing the magni- from one end of the feature map to another. This is a con- tude coecients only, which are then applied to a phase- sequence of convolutional layers connecting all the feature reconstruction algorithm in order to obtain complex-valued maps together, but never directly connecting all locations TF coecients required for the signal synthesis. From ac- within a speci c feature map [8]. curate TF magnitude information, phaseless reconstruc- tion methods such as [48{50] are known to provide per- ceptually close, often indiscernible, reconstruction despite 2.3 Decoder the resulting time-domain waveforms usually being rather Similar to [8], the decoder begins with a FCL and a ReLU di erent. nonlinearity in order to spread the encoder's information The software was implemented in Tensor ow [51] and is among the channels. FCLs are computationally expensive; publicly available. in our case it contains 38% of all the parameters of the network. All the subsequent layers are (de-)convolutional Before xing the network structure described in the remainder of this section, we experimented with di erent standard architectures, and, as for the encoder, connected by ReLUs with batch depths, and kernel shapes, out of which the current structure showed the most promise. This is in contrast to machine-learning methods solving classi - www.github.com/andimara oti/audioContextEncoder cation tasks, in which such a synthesis is not targeted. 3 b) c) d) S ' S '  b a a) e) Context STFT Merge Synthesis Encoder s s s s ' s ' g s '  b g a b a S S S '  b a g S ' S ' S ' b g a Figure 1: The end-to-end system. a) Audio signal in the time domain, s is the gap. b) Audio signal in the TF domain, S and S is the context before and after the gap, respectively. c) Reconstructed gap S in the TF domain. b a g 0 0 0 d) Reconstruction S merged with the stripped context S and S in the TF domain. e) Reconstructed signal in the g b a time domain, including the inpainted gap, s . (3,17) (2,11) (1,9) (1,5) (2,5) channels 128 512 256 160 128 channel channels (89, 7) channels channels channels channels Reshape 8 8 2 2 2 height: width: 8 Figure 2: The encoder is a convolutional network with six layers followed by reshaping. The four channel TF input is encoded into a matrix of size of 2048. Gray rectangles represent the convolution lters with size expressed as (height, width). White cubes represent the signal. normalization. The rst three layers use squared lters, the 2.4 Post-processing stage remaining two layers use rectangular lters to give the de- The post-processing stage synthesizes the audio signal coder more capacity on frequency over time in the output of the context and the inpainted gap. To this end, TF representation. Figure 3 shows the decoder architec- (M=a 1) coecients of the context extending into the ture for M = 512 and a gap size L = 1024 samples. 0 0 gap are removed, yielding the stripped context, S ; S 2 b a (M=2+1)(N M=a+1) C . Then, the reconstructed TF coe- The decoder does not only output the gap content, but cients from the decoder, S , are inserted between the TF also the TF coecients connecting the gap with the con- g 0 0 0 coecient of the stripped context, S and S , yielding the b a text. Thus, the decoder output S is larger than the origi- 0 0 0 0 sequence S = (S ; S ; S ), having the same size as S. b g a nal gap by M=a 1 columns before and after the gap each, 0 (M=2+1)((L +M)=a1) Stripping the context and insertion of the reconstruction i.e., S 2 C . In our example with directly in the TF domain prevents transitional artifacts L = 1024, M = 512 and a = M=4, shown in Fig. 3, every between the context and the gap because synthesis by the decoder output channel is of size 257 11. inverse STFT introduces an inherent cross-fading. For the complex network, the decoder output represents Note that the nal layer depends on the network. For the real and imaginary parts of complex-valued TF coef- the complex network, the nal layer has two outputs, cor- cients S and the inverse STFT can be directly applied responding to the real and imaginary part of the complex- yielding s . valued TF coecients. For the magnitude network, the nal layer has a single output for the magnitude TF coef- For the magnitude network, the decoder output repre- cients. We denote the output TF coecients as S . sents the magnitudes of the TF coecients and the missing 0 (5,67) (11,257) (8,8) (5,5) (3,3) channel 128 11 32 channels 128 512 257 channels channels channels 1 or 2 FCL channels channels channels channels Reshape Reshape Reshape 32 32 height: 514 257 16 11 width: 1 Figure 3: The decoder architecture for the complex and magnitude network producing one and two channels of TF coecients, respectively. All other conventions as in Figure 2. phase information needs to be estimated separately. First, where the constant c > 0 controls the incorporated com- the phase gradient heap integration algorithm proposed pensation for small amplitude. In our experiments, c = 5 in [54] was applied to the magnitude coecients produced yielded good results. by the decoder in order to obtain an initial estimation of Finally, as proposed in [57], the total loss is the sum of the TF phase. Then, this estimation was re ned by apply- the loss function and a regularization term controlling the ing 100 iterations of the fast Grin-Lim algorithm [48, 49]. trainable weights in terms of their ` -norm: We modi ed the version implemented in the Phase Re- 0 2 trieval Toolbox Library [55] to use the valid phase from T = F (S ; S ) + w ; (2) g g the context at every iteration. The resulting complex- valued TF coecients S were then transformed into a with w being weights of the network and  being the reg- time-domain signal s by inverse STFT. ularization parameter, here, set to 0:01. The numerical optimizations were done using the stochastic gradient de- scent solver ADAM [58]. 2.5 Loss Function The network training is based on the minimization of the 3 Evaluation total loss of the reconstruction. To this end, the recon- struction loss is computed by comparing the original gap The main objective of the evaluation was to investigate TF coecients S with the reconstructed gap TF coe- 0 our networks' ability to adapt to audio signals. The evalu- cients S . Targeting an accurate reconstruction of the lost ation is based on a comparison of the inpainting results to information, we optimize an adapted ` -based loss instead those obtained for the reference method, i.e., LPC-based of mixing the ` -loss with an adversarial term [8]. For this extrapolation [44]. The inpainting quality was evaluated type of network [56], the comparison can be done on the by means of objective di erence grades (ODGs, [59]) and basis of the squared ` -norm of the di erence between S signal-to-noise ratios (SNRs) applied to the time-domain and S , commonly known as mean squared error (MSE). waveforms and magnitude spectrograms. The MSE would depend on the total energy of S , putting We considered two classes of audio signals: instrument more weight on signals containing more energy. In order sounds and music. The respective networks were trained to avoid that, the normalized mean squared error (NMSE) on the targeted signal class, with an assumed gap size of can be used, which normalizes MSE by the energy of S . 64 ms. Reconstruction was evaluated on the trained signal Compared to MSE, NMSE puts more weight on small er- class and other signals for 64 ms gaps. rors when the energy of S is small. In practice, however, Additionally, we evaluated the e ect of the gap duration minor deviations from S are insigni cant regardless of the by evaluating the magnitude network for 48 ms gaps. content of S , and NMSE would be too sensitive. Therefore, for the calculation of the loss function, we use a weighted mix between MSE and NMSE, 3.1 Parameters 0 2 The sampling rate was 16 kHz. We considered audio seg- kS S k g g F(S ; S ) = ; (1) g g ments with a duration of 320 ms, which corresponds to 1 2 c +kS k L = 5120 samples. For the STFT, the size of the win- 4 dow and the number of frequency channels M were xed The combination of these two algorithms provided consistently better results than separate application of either. to 512 samples, and a was 128 samples. 5 Each segment was separated in a gap of 64 ms corre- 3.3 Evaluation metrics sponding to L = 1024 of the central part of a segment and The rst metric was the SNR in dB, the context of twice of 128 ms, corresponding to L = 2048 samples. Consequently, N was 16, the input to the en- 2 kxk 25716 SNR(x; x ) = 10 log (3) coder was S ; S 2 C , and the output of the decoder b a 0 2 kx x k 0 25711 was S 2 C . calculated separately for each segment of a testing dataset. Then, we averaged SNRs across all segments of a testing 3.2 Datasets dataset. For the evaluation in the time domain, we used The dataset representing musical instruments was derived SNR(s ; s ), which is the SNR calculated on the gaps g g from the NSynth dataset [60]. NSynth is an audio dataset of the actual and reconstructed signals, s and s , respec- g g containing 305,979 musical notes from 1,006 instruments, tively. We refer to the average of this metric across all each with a unique pitch, timbre, and envelope. Each ex- segments to as SNR in the time domain (SNR ). TD ample is four seconds long, monophonic, and sampled at The SNR was also calculated on the magnitude spec- 16 kHz. trograms in order to accommodate for perceptually less- The dataset representing music was derived from the free relevant phase changes. We calculated SNR(jS j;jS j), g rg music archive (FMA, [61]). The FMA is an open and eas- where S represents the central 5 frames of the STFT rg ily accessible dataset, usually used for evaluating tasks in computed from the restored signal s and thus represents musical information retrieval. We used the small version of the restoration of the gap. In other words, we compute the the FMA comprised of 8,000 30-s segments of songs with SNR between the spectrograms of the original signal and eight balanced genres sampled at 44:1 kHz. We resampled the restored signal in the region of the gap. We refer to each segment to the sampling rate of 16 kHz. the average of this metric (across all segments of a test- ing dataset) to as SNR , where MS stands for magnitude The original segments in the two datasets were processed MS spectrogram. Note that SNR is directly related to the to t the evaluation parameters. First, for each example MS spectral convergence proposed in [62]. the silence at the beginning and end was removed. Second, Additionally, we computed the ODGs, which correspond from each example, pieces of the duration of 320 ms were to the subjective di erence grade used in human-based au- copied, starting with the rst segment at the beginning of dio test and is derived from the perceptual evaluation of a segment, continuing with further segments with a shift audio quality (PEAQ, [59]). ODG range from 0 to 4 of 32 ms. Thus, each example yielded multiple overlap- with the interpretation shown in Tab. 2. We calculated the ping segments s. Then, the energy of the segments was ODGs on signals of 2-s duration, with the inpainted gap evaluated and the ones that were completely silent were beginning at 0.5-s. We used the algorithm implemented removed. Note that for a gap of 64 ms, the segment can in [63]. be considered as a 3-tuple by labeling the rst 128 ms as the context before the gap s , the subsequent 64 ms as the ODG Impairment gap s , and the last 128 ms as the context after the gap 0 Imperceptible s . -1 Perceptible, but not annoying In order to avoid over tting, the datasets were split into -2 Slightly annoying training, validation, and testing sets before segmenting -3 Annoying them. For the instruments, we used the splitting proposed -4 Very annoying by [60]. The music dataset, was split into 70%, 20% and 10%, respectively. The statistics of the resulting sets are presented in Table 1. Table 2: Interpretation of ODGs. Count Percentage 3.4 Training Instruments training 19.4M 94.1 Instruments validation 0.9M 4.4 Both complex and magnitude networks were trained for Instruments testing 0.3M 1.5 the instrument and music dataset, resulting in four trained Music training 5.2M 70.0 networks. Each training started with the learning rate of Music validation 1.5M 20.0 3 10 . In the case of the magnitude network, the recon- Music testing 0.7M 10.0 structed phase was not considered in the training. Ev- ery 2000 steps, the training progress was monitored. To this end, signals from the validation dataset were inpainted Table 1: Subdivision of the datasets used in the evaluation. and the weighted NMSE was calculated between the pre- Count is the amount of examples. Percentage is calculated dicted and the actual TF coecients of the gap. When with respect to the full dataset. converging, which usually happened after approximately 6 4 600k steps, the learning rate was reduced to 10 and the were directly synthesized as sine oscillations with a xed training was continued by additional 200k steps. Table 3 frequency. The probes were generated within a logarithmic shows the SNR calculated for the training, validation, frequency range from 20 Hz to 8 kHz, linear phase shift MS and testing datasets. The similar values across subsets in- range from 0 to , and linear amplitude range from 0:1 to dicate no evidence for an over tting. 1. The duration was 320 ms corresponding to 5120 samples at the sampling rate of 16 kHz. Music Instruments Train Valid Test Train Valid Test Mean 7.6 7.8 7.8 22.1 21.9 21.9 Mag Std 4.2 4.0 4.3 9.9 10.2 10.0 Mean 4.9 5.1 5.4 17.8 18.3 18.2 Complex Std 4.0 4.2 4.5 10.5 10.3 10.1 Table 3: Over tting check by means of SNR (in dB) MS calculated between generated and original TF-coecients without the synthesis step for 64 ms gaps. A4 A# B C C# D D# E F F# G G# A5 Notes 3.5 Reference method Figure 4: SNR for reconstruction of pure tones with the MS We compared our results to those obtained with a refer- complex network trained on the instrument (black) and ence method based on LPC. For the implementation, we music (grey) dataset. SNR are shown as a function of MS followed [44], especially [44, Section 5.3]. In detail, the musical notes corresponding to the Standard pitch, i.e., the context signals s and s were extrapolated onto the gap b a note A4 corresponds to the frequency of 440 Hz. s by computing their impulse responses and using them as prediction lters for a classical linear predictor. The impulse responses were obtained using Burg's method [64] Figure 4 shows the SNR of the reconstruction ob- MS and were xed to have 1000 coecients according to [2] tained with the complex network. The abscissa shows and [65]. Their duration was the same as that for our notes, i.e., frequencies corresponding to the Standard pitch context encoder in order to provide the same amount of (with A corresponding to the frequency of 440 Hz). For context information. The two extrapolations were mixed the network trained on the instruments, the SNR was MS with the squared-cosine weighting function. Our imple- large in the proximity of notes and decreased by more than mentation of the LPC extrapolation is available online . 15 dB for frequencies between the notes. This shows that Then, we evaluated the results produced by the refer- the network was able to better predict signals correspond- ence method in the same way as we evaluated the results ing to the trained notes, indicating a good adaptation to produced by the networks. the trained material. Music contains more broadband sounds such as drums, breathing, tone glides, i.e., sounds with non-signi cant 4 Results and discussion energy at frequencies between the Standard pitch being non-stationary even within the tested 320 ms. A network 4.1 Ability to adapt to the training mate- trained on music is expected to be less sensitive to predic- rial tions performed on Standard pitch only. Figure 4 shows As a general rule, a trained neural network should perform the SNR obtained for the reconstruction of pure tones MS well on the distribution that it learned from. As the instru- with the network trained on the music. The SNR uc- MS ment dataset is made of discrete in-tune instrument notes, tuations were smaller than those from the network trained each note can be considered as a sum of discrete frequen- on the instruments. This further supports our conclusion cies arranged in time. If our network was able to adapt about the good ability of our network structure to adapt to the instrument sounds then it should perform on these to various training materials. frequencies better than on others. To evaluate this, we probed our trained networks with 4.2 E ect of the network type stationary tones of various frequencies. The pure tones 5 The di erence between the magnitude and complex net- We also considered training on the instrument training dataset (800k steps) followed by a re nement with the music training dataset works both trained on instruments can be anticipated from (300k steps). While it did not show substantial di erences to the the Figure 5, which shows the SNR of the reconstruc- MS training performed on music only, a pre-trained network on music tions of pure tones. As an average over frequency, the with a subsequent re nement to genre may show improvements for magnitude network provided an SNR of 10:2 dB larger MS that genre. www.github.com/andimara oti/audioContextEncoder than that of the complex network. For the magnitude net- SNR [dB] MS work, the SNR was more or less similar for frequencies types, reconstructions of the testing datasets were per- MS up to 200 Hz and decreased with frequency. For the com- formed. Table 4 shows the SNR and ODG of those MS plex network, the SNR decrease started already at ap- predictions. The magnitude network resulted in consis- MS proximately 100 Hz and was much steeper than that of the tently better results with an SNR di erence of 2:3 dB MS magnitude network. Above the frequency of approximately and 3:5 dB when tested on music and instruments, respec- 4 kHz, the complex network provided an extremely poor tively. Similarly, ODGs favor the magnitude network, al- SNR of 5 dB or less, indicating that the complex network though to a smaller extent. The comparison may appear MS had problems reconstructing the signals at higher frequen- awed because the magnitude network has to predict only cies. This is in line with [66], where neural networks were half of the features to be predicted by the complex network, trained to reconstruct phases of amplitude spectrograms at almost the same number of neurons. However, even and their predictions were also poorer for higher frequen- doubling the size of the complex network would not yield cies. signi cantly better predictions, as the link between the size of a DNN and its performance is not proportional [67]. In addition to the improvement in SNR and ODG MS of the magnitude network over the complex network, the complex network predictions were observed to often be cor- rupted by clearly audible broadband noise . Music Instruments Mag Complex LPC Mag Complex LPC Mean SNR 7.7 5.4 6.3 22.4 18.5 30.5 MS Magnitude network Std SNR 4.3 4.5 5.1 10.7 10.2 18.9 MS Complex network Mean ODG -0.8 -1.0 -0.8 -1.6 -1.8 -0.3 Std ODG 0.4 0.2 0.2 1.0 0.9 0.3 0.1 0.5 1 2 4 8 Frequency [kHz] Table 4: SNR (in dB) and ODGs of reconstructions of MS Figure 5: SNR for reconstruction of pure tones with MS 64 ms gaps for the complex and magnitude networks, as the complex (black) and magnitude (grey) networks both well as for the LPC-based method. trained to the instruments database. The thicker lines show averages over 25 surrounding frequency points. 4.3 Comparison to the reference method Table 4 provides the SNR and ODGs for the LPC-based MS Unfortunately, the problem of poor high-frequency re- reference reconstruction method. When tested on music, construction also persisted when predicting instrument on average, our magnitude network outperformed the LPC- sounds instead of pure tones. Figure 6 shows the spec- based method in terms of SNR by 1.4 dB. When tested MS trogram of an original sound from the instrument testing on instruments, our magnitude network underperformed set (left panel) and of its reconstruction obtained from the LPC by 8.6 dB, which was also re ected in poorer the complex network (center panel). The reconstruction ODGs. Both SNRs and ODGs reveal a consistent pic- clearly fails at frequencies higher than 4 kHz. ture. The LPC-based method seems to better inpaint in- struments. The CE seems to be better or equivalent for inpainting music. This can be attributed to the better compliance of the instruments with the LPC, and a better universality of our CE. In order to look more deeply into the di erences between the two inpainting methods, we compared their abilities to inpaint frequency sweeps. A sweep represents a controlled frequency modulation, which violates the assumptions for the LPC and is not present in the data the CE was trained 0 10 20 0 10 20 0 10 20 on. The signal consisted of a sum of ve linear frequency sweeps with a 320-ms duration each, starting frequencies Figure 6: Magnitude spectrograms (in dB) of an exem- of 500, 2000, 3500, 5000 and 6500 Hz, and bandwidth of plary signal reconstruction. Left: Original signal. Center: 500 Hz. Figure 7 shows the signal and the inpainting re- Reconstruction by the complex network. Right: Recon- sults. The gap inpainted by the LPC method (right panel) struction by the LPC-based method. The gap was the shows constant frequencies expanding into the gap causing area between the two red lines. a discontinuity in the gap's center. In contrast, the gap visit https://andimara oti.github.io/audioContextEncoder/ for In order to further compare between the two network audio examples. Frequency [kHz] SNR [dB] MS 8 inpainted by the magnitude network (center panel) follows the frequency changes better at the price of noise appearing between the sweeps. Other interesting examples are shown in Figure 8. The top row shows an example in which the magnitude net- work outperformed the LPC-based method. In this case, the signal is comprised of steady harmonic tones in the left side context and a broadband sound in the right side context. While the LPC-based method extrapolated the 0 13 24 37 0 13 24 37 0 13 24 37 broadband noise into the gap, the magnitude network was able to foresee the transition from the steady sounds to the broadband burst, yielding a prediction much closer to the original gap, with a 13 dB larger SNR than that from MS the LPC-based method. On the other hand, the magnitude network did not al- ways outperform the LPC-based method. The bottom row of Fig. 8 shows spectrograms of such an example. This sig- nal had stable sounds in the gap, which were well-suited for an extrapolation, but rather complex to be perfectly recon- 0 0 13 24 37 0 13 24 37 0 13 24 37 structed by the magnitude network. Thus, the LPC-based method outperformed the magnitude network yielding a Figure 8: Magnitude spectrograms (in dB) of exemplary 9 dB larger SNR . MS signal reconstructions. Left: Original signal. Center: Re- construction by the magnitude network. Right: Recon- struction by the LPC-based reference method. Top: Ex- ample with the magnitude network outperforming the ref- erence by an SNR of 13 dB. Bottom: Example with the MS magnitude network underperforming the reference by an SNR of 9 dB. MS performed our network by 12 dB. 0 13 24 37 0 13 24 37 0 13 24 37 The excellent performance of the LPC-based method re- constructing instruments can be explained by the assump- Figure 7: Log-magnitude spectrograms (in dB) of an ex- tions behind the LPC well- tting to the single-note instru- ponential frequency sweep. Left: Original signal. Center: ment sounds. These sounds usually consist of harmon- Reconstruction by the magnitude network. Right: Recon- ics stable on a short-time scale. LPC extrapolates these struction by the LPC-based method. harmonics preserving the spectral envelope of the signal. Nevertheless, the magnitude network yielded an SNR MS of 22.4 dB, on average, demonstrating a good ability to Finally, Table 5 presents the SNR of reconstructions TD reconstruct instrument sounds. of the instrument and music. Note that the SNR pro- When applied on music, the performance in terms of TD vided for the magnitude network is for the sake of com- SNR of both methods was much poorer, with our net- MS pleteness only. The SNR metric is highly sensitive to work performing slightly but statistically signi cantly bet- TD phase di erences, which do not necessarily lead to percep- ter than the LPC-based method. The better performance tual di erences and, for the magnitude network, is recon- of our network can be explained by its ability to adapt structed with an accuracy of up to a constant phase shift. to transient sounds and modulations in frequencies, sound Thus, SNR can remain low even in cases of very good properties that the LPC-based method is not suited to han- TD reconstructions. Hence, here, we compare the performance dle. of the complex network with that of the LPC-based method The gap duration of 64 ms is close to those tested in only. [27] when comparing various OMP methods. For 50 ms, For the music, on average, the complex network outper- their approaches showed SNR below 2 dB and ODG TD formed the LPC-based method providing a 0.3 dB larger values around -3 (see their Fig. 1 and 4). The LPC-based SNR . Given the large standard deviation, we performed method showed average SNR of 3.8 dB and ODGs of TD TD a pair t-test on the SNR which showed that the di er- -0.8. This con rms our assumption that for the studied TD ence was statistically signi cant (p < 0:001). For the in- range, the LPC is better suited than the sparsity-based struments, on average, the LPC-based reconstruction out- audio inpainting techniques. Frequency [kHz] Frequency [kHz] Frequency [kHz] Music Instruments both methods were rated equally with ODG between im- Complex Mag LPC Complex Mag LPC perceptible and perceptible but not annoying. LPC yielded Mean 3.8 1.1 3.5 16.0 14.6 28.0 better results when applied on more simple signals like in- Std 4.1 3.9 5.0 9.7 10.8 19.1 strument sounds. In general, our results suggest that stan- dard DNN components and a moderately sized network can be applied to form audio-inpainting models, o ering a Table 5: SNR (in dB) of reconstructions of 64 ms gaps TD number of angles for future improvement. for the complex and magnitude networks, as well as for the For example, we have analyzed two types of networks. LPC-based method. The complex network works directly on the complex-valued TF coecients. The magnitude network provides only 4.4 E ect of the gap duration magnitudes of TF coecients as output and relies on a sub- sequent phase reconstruction. We observed clear improve- The proposed network structure can be trained with di er- ment of the magnitude network over the complex network ent contexts and gap durations. For problems of varying especially in reconstructing high-frequency content. gap duration, a network trained to the particular gap dura- From our study, it follows that DNNs, when applied to tion might appear optimal. However, training takes time, inpainting audio gaps for medium durations, do not su er and it might be simpler to train a network to single gap from the restrictions of previous methods. Additionally, duration and use it to reconstruct any shorter gap as well. even for a simple DNN, the performance on complex signals In order to test this idea, we introduced gaps of 48 ms is already on par with the state of the art. It also follows (corresponding to L = 768 samples) in our testing that by representing audio as TF coecients, a generative datasets. These gaps were then reconstructed by the mag- network developed for image inpainting can be adapted to nitude network trained for 64 ms gaps. As this network audio inpainting. outputs, at reconstruction time, a solution for a gap of Generally, better results can be expected for increased length 64-ms, the 48-ms gaps needs to be enlarged. We depth of the network and the available context. Experi- tested three approaches to enlarge them: by discarding ments with our method for longer medium-duration gaps 16 ms forwards, 16 ms backwards, or 8 ms forwards and and longer context can be easily implemented just by 8 ms backwards (centered). adapting the parameters of the network. Nevertheless, we Table 6 shows SNR obtained from averaging the re- MS expect technical limitations like computational power to constructions of the three types of gap enlargements. Also, be an issue for long contexts. Instead, a study of more e- the corresponding SNR for the LPC-based method are MS cient audio features will be required. Our STFT features, shown. The results are similar to those obtained for larger meant in this study as a reasonable rst choice, provided gaps: for the instruments, the LPC-based method outper- a decent performance, however, in the future, we expect formed our network; for the music, our network outper- hearing-related features to provide better reconstructions. formed the LPC-based method. In particular, an investigation of Audlet frames, i.e., in- Music Instruments vertible time-frequency systems adapted to perceptual fre- Ours LPC Ours LPC quency scales, [68], as features for audio inpainting seem Mean 8.0 6.9 21.8 33.2 to o er intriguing opportunities. Std 4.6 5.5 11.8 20.1 In the future, instead of training on a very general dataset, improved performance can be obtained for more specialized networks trained to speci c genres or instru- Table 6: SNR (in dB) of reconstructions of 48 ms gaps MS mentation. Further, applied to a complex mixture and for the magnitude network and the LPC-based method. potentially preceded by a source-separation algorithm, our proposed architecture could be used jointly in a mixture- of-experts, [69], approach. 5 Conclusions and Outlook We proposed a neural network architecture for inpainting References medium gaps of audio. The study aims at showing general abilities of a neural network working on TF coecients as a [1] A. Adler, V. Emiya, M. G. Jafari, M. Elad, R. Gribon- context encoder. The proposed network was able to adapt val, and M. D. Plumbley, \Audio inpainting," IEEE to the particular frequencies provided by the training ma- Transactions on Audio, Speech and Language Process- terial. It was able to reconstruct frequency modulations ing, vol. 20, no. 3, pp. 922{932, March 2012. better than the LPC-based reference method and it was able to inpaint gaps shorter than the trained ones. For the [2] I. Kauppinen, J. Kauppinen, and P. Saarinen, \A reconstruction of complex signals like music, our network method for long extrapolation of audio signals," Jour- was able to outperform the LPC-based reference method, nal of the Audio Engineering Society, vol. 49, no. 12, in terms SNR calculated on magnitude spectrograms, and pp. 1167{1180, 2001. 10 [3] W. Etter, \Restoration of a discrete-time signal seg- [14] J. Engel, K. K. Agrawal, S. Chen, I. Gulrajani, ment by interpolation based on the left-sided and C. Donahue, and A. Roberts, \Gansynth: Adversarial right-sided autoregressive parameters," IEEE Trans- neural audio synthesis," in Proceedings of the 7th In- actions on Signal Processing, vol. 44, no. 5, pp. 1124{ ternational Conference on Learning Representations, 1135, may 1996. 2019. [4] D. Goodman, G. Lockhart, O. Wasem, and W.-C. [15] S. Mehri, K. Kumar, I. Gulrajani, R. Kumar, S. Jain, Wong, \Waveform substitution techniques for recover- J. Sotelo, A. Courville, and Y. Bengio, \SampleRNN: ing missing speech segments in packet voice commu- An unconditional end-to-end neural audio generation nications," IEEE Transactions on Acoustics, Speech model," in Proc. of ICLR, 2017. and Signal Processing, vol. 34, no. 6, pp. 1440{1448, [16] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, dec 1986. O. Vinyals, A. Graves, N. Kalchbrenner, A. Senior, [5] Y. Bahat, Y. Schechner, and M. Elad, \Self-content- and K. Kavukcuoglu, \Wavenet: A generative model based audio inpainting," Signal Processing, vol. 111, for raw audio," CoRR, vol. abs/1609.03499, 2016. pp. 61{72, jun 2015. [Online]. Available: http://arxiv.org/abs/1609.03499 [6] P. J. Wolfe and S. J. Godsill, \Interpolation of missing [17] Y. Saito, S. Takamichi, and H. Saruwatari, \Text-to- data values for audio signal restoration using a gabor speech synthesis using STFT spectra based on low- regression model," in Proc. of ICASSP, vol. 5. IEEE, /multi-resolution generative adversarial networks," in 2005, pp. v{517. Proc. of ICASSP. IEEE, 2018, pp. 5299{5303. [7] N. Perraudin, N. Holighaus, P. Majdak, and P. Bal- [18] J. Shen, R. Pang, R. Weiss, M. Schuster, N. Jaitly, azs, \Inpainting of long audio segments with similarity Z. Yang, Z. Chen, Y. Zhang, Y. Wang, R. Skerry- graphs," IEEE/ACM Transactions on Audio, Speech Ryan, R. Saurous, Y. Agiomyrgiannakis, and Y. Wu, and Language Processing, vol. PP, no. 99, pp. 1{1, \Natural TTS synthesis by conditioning WaveNet on 2018. mel spectrogram predictions," in Proc. of ICASSP. IEEE, 2018. [8] D. Pathak, P. Krahenbuhl, J. Donahue, T. Darrell, and A. Efros, \Context encoders: Feature learning by [19] Z. Jin, A. Finkelstein, G. J. Mysore, and J. Lu, \Fft- inpainting," in Proc. of CVPR, 2016. net: A real-time speaker-dependent neural vocoder," in Proc. of ICASSP. IEEE, 2018, pp. 2251{2255. [9] I. Goodfellow, Y. Bengio, and A. Courville, Deep Learning. MIT Press, 2016, http://www. [20] B.-K. Lee and J.-H. Chang, \Packet loss concealment deeplearningbook.org. based on deep neural networks for digital speech transmission," IEEE/ACM Trans. Audio, Speech and [10] D. Kingma and M. Welling, \Auto-encoding varia- Lang. Proc., vol. 24, no. 2, pp. 378{387, Feb. tional bayes." in Proc. of ICLR, 2014. 2016. [Online]. Available: http://dx.doi.org/10.1109/ TASLP.2015.2509780 [11] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair, A. Courville, and Y. Ben- [21] S. Dieleman, A. v. d. Oord, and K. Simonyan, \The gio, \Generative adversarial nets," in Advances in challenge of realistic music generation: modelling raw neural information processing systems, 2014, pp. audio at scale," in Proc. of NeurIPS, 2018. 2672{2680. [22] N. Boulanger-Lewandowski, Y. Bengio, and P. Vin- [12] C. Donahue, J. McAuley, and M. Puckette, \Adver- cent, \Modeling temporal dependencies in high- sarial audio synthesis," in Proceedings of the 7th In- dimensional sequences: Application to polyphonic ternational Conference on Learning Representations, music generation and transcription," in Proc. of ICML, 2012. [13] A. Mara oti, N. Perraudin, N. Holighaus, and [23] M. Blaauw and J. Bonada, \A neural parametric P. Majdak, \Adversarial generation of time-frequency singing synthesizer," in Proc. of INTERSPEECH, features with application in audio synthesis," in Proc. of the 36th ICML, K. Chaudhuri and R. Salakhutdinov, Eds., vol. 97. Long Beach, [24] A. Adler, V. Emiya, M. Jafari, M. Elad, R. Gribonval, California, USA: PMLR, 09{15 Jun 2019, pp. and M. Plumbley, \A constrained matching pursuit 4352{4362. [Online]. Available: http://proceedings. approach to audio declipping," in Proc. of ICASSP. mlr.press/v97/mara oti19a.html IEEE, may 2011. 11 [25] I. Toumi and V. Emiya, \Sparse non-local similarity [36] ||, \Audio declipping via nonnegative matrix fac- modeling for audio inpainting," in Proc. of ICASSP. torization," in 2015 IEEE Workshop on Applications Calgary, Canada: IEEE, Apr. 2018. of Signal Processing to Audio and Acoustics (WAS- PAA). IEEE, 2015, pp. 1{5. [26] S. Kiti c, N. Bertin, and R. Gribonval, \Sparsity and [37] A. Ozerov, C  . Bilen, and P. P erez, \Multichannel au- cosparsity for audio declipping: a exible non-convex dio declipping," in Proc. of ICASSP. IEEE, 2016, approach," in LVA/ICA 2015 - The 12th International pp. 659{663. Conference on Latent Variable Analysis and Signal Separation, Liberec, Czech Republic, Aug. 2015, p. 8. [38] E. Manilow and B. Pardo, \Leveraging repetition to [Online]. Available: https://hal.inria.fr/hal-01159700 do audio imputation," in 2017 IEEE Workshop on Ap- plications of Signal Processing to Audio and Acoustics [27] O. Mokry,  P. Z aviska, P. Rajmic, and V. Vesely, (WASPAA). IEEE, 2017, pp. 309{313. \Introducing SPAIN (sparse audion inpainter)," CoRR, vol. abs/1810.13137, 2018. [Online]. Available: [39] B. Martin, P. Hanna, T. V. Thong, M. Desainte- http://arxiv.org/abs/1810.13137 Catherine, and P. Ferraro, \Exemplar-based assign- ment of large missing audio parts using string match- [28] C. Gaultier, S. Kiti c, N. Bertin, and R. Gribonval, ing on tonal features." in Proc. of ISMIR, 2011, pp. \AUDASCITY: AUdio Denoising by Adaptive Social 507{512. CosparsITY," in 25th European Signal Processing Conference (EUSIPCO), Kos, Greece, Aug. 2017. [40] R. C. Maher, \A method for extrapolation of missing [Online]. Available: https://hal.inria.fr/hal-01540945 digital audio data," Journal of the Audio Engineering Society, vol. 42, no. 5, pp. 350{357, 1994. [29] K. Siedenburg, M. Kowalski, and M. D or er, \Audio Declipping with Social Sparsity," in Proc. of ICASSP. [41] A. Lukin and J. Todd, \Parametric interpolation of Florence, Italy: IEEE, May 2014, pp. AASP{L2. gaps in audio signals," in Audio Engineering Society [Online]. Available: https://hal.archives-ouvertes.fr/ Convention 125. Audio Engineering Society, 2008. hal-01002998 [42] T. E. Tremain, \The government standard linear pre- [30] F. Lieb and H.-G. Stark, \Audio inpainting: Evalua- dictive coding algorithm: Lpc-10," Speech Technology, tion of time-frequency representations and structured pp. 40{49, Apr. 1982. sparsity approaches," Signal Processing, vol. 153, pp. 291{299, 2018. [43] A. Janssen, R. Veldhuis, and L. Vries, \Adaptive in- terpolation of discrete-time signals that can be mod- [31] J. Le Roux, H. Kameoka, N. Ono, A. De Cheveigne, eled as autoregressive processes," IEEE Transactions and S. Sagayama, \Computational auditory induction on Acoustics, Speech, and Signal Processing, vol. 34, as a missing-data model- tting problem with bregman no. 2, pp. 317{330, 1986. divergence," Speech Communication, vol. 53, no. 5, pp. 658{676, 2011. [44] I. Kauppinen and K. Roth, \Audio signal extrapolation{theory and applications," in Proc. [32] P. Smaragdis, B. Raj, and M. Shashanka, \Missing DAFx, 2002, pp. 105{110. data imputation for time-frequency representations of audio signals," Journal of signal processing systems, [45] J. Pons, O. Nieto, M. Prockup, E. M. Schmidt, A. F. vol. 65, no. 3, pp. 361{370, 2011. Ehmann, and X. Serra, \End-to-end learning for mu- sic audio tagging at scale," in Proc. of ISMIR, 2018. [33] U. S im sekli, Y. K. Ylmaz, and A. T. Cemgil, \Score guided audio restoration via generalised coupled ten- [46] M. Portno , \Implementation of the digital phase sor factorisation," in Proc. of ICASSP. IEEE, 2012, vocoder using the fast fourier transform," IEEE pp. 5369{5372. Trans. Acoust. Speech Signal Process., vol. 24, no. 3, pp. 243{248, 1976. [34] C. Bilen, A. Ozerov, and P. Prez, \Solving time- domain audio inverse problems using nonnegative ten- [47] K. Gr ochenig, Foundations of Time-Frequency Anal- sor factorization," IEEE Transactions on Signal Pro- ysis, ser. Appl. Numer. Harmon. Anal. Birkh auser, cessing, vol. 66, no. 21, pp. 5604{5617, Nov 2018. [35] C  . Bilen, A. Ozerov, and P. P erez, \Joint audio in- [48] D. Grin and J. Lim, \Signal estimation from modi- painting and source separation," in International Con- ed short-time fourier transform," IEEE Transactions ference on Latent Variable Analysis and Signal Sepa- on Acoustics, Speech and Signal Processing, vol. 32, ration. Springer, 2015, pp. 251{258. no. 2, pp. 236{243, 1984. 12 [49] N. Perraudin, P. Balazs, and P. L. Sndergaard, \A [60] J. Engel, C. Resnick, A. Roberts, S. Dieleman, fast grin-lim algorithm," in Applications of Signal M. Norouzi, D. Eck, and K. Simonyan, \Neural au- Processing to Audio and Acoustics (WASPAA), 2013 dio synthesis of musical notes with wavenet autoen- IEEE Workshop on. IEEE, 2013, pp. 1{4. coders," in Proc. of ICML, 2017, pp. 1068{1077. [61] M. De errard, K. Benzi, P. Vandergheynst, and [50] Z. Pr u sa, P. Balazs, and P. Sndergaard, \A nonitera- X. Bresson, \Fma: A dataset for music analysis," in tive method for reconstruction of phase from stft mag- 18th International Society for Music Information Re- nitude," IEEE/ACM Transactions on Audio, Speech trieval Conference, 2017. and Language Processing, vol. 25, no. 5, pp. 1154{ 1164, 2017. [62] N. Sturmel and L. Daudet, \Signal reconstruction from stft magnitude: A state of the art," in Inter- [51] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, national conference on digital audio e ects (DAFx), Z. Chen, C. Citro, G. Corrado, A. Davis, J. Dean, 2011, pp. 375{386. M. Devin, S. Ghemawat, I. Goodfellow, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser, [63] P. Kabal et al., \An examination and interpretation M. Kudlur, J. Levenberg, D. Man e, R. Monga, of itu-r bs. 1387: Perceptual evaluation of audio qual- S. Moore, D. Murray, C. Olah, M. Schuster, J. Shlens, ity," TSP Lab Technical Report, Dept. Electrical & B. Steiner, I. Sutskever, K. Talwar, P. Tucker, Computer Engineering, McGill University, pp. 1{89, V. Vanhoucke, V. Vasudevan, F. Vi egas, O. Vinyals, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng, \TensorFlow: Large-scale machine [64] J. P. Burg, \Maximum entropy spectral analysis," learning on heterogeneous systems," 2015, software 37th Annual International Meeting, Soc. of Explor. available from tensor ow.org. [Online]. Available: Geophys., Oklahoma City, 1967. https://www.tensor ow.org/ [65] I. Kauppinen and J. Kauppinen, \Reconstruction method for missing or damaged long portions in au- [52] A. Krizhevsky, I. Sutskever, and G. E. Hinton, \Im- dio signal," Journal of the Audio Engineering Society, agenet classi cation with deep convolutional neural vol. 50, no. 7/8, pp. 594{602, 2002. networks," in Proc. of NIPS, 2012, pp. 1097{1105. [66] S. Takamichi, Y. Saito, N. Takamune, D. Kitamura, [53] S. Io e and C. Szegedy, \Batch normalization: Ac- and H. Saruwatari, \Phase reconstruction from am- celerating deep network training by reducing internal plitude spectrograms based on von-mises-distribution covariate shift," in Proc. of ICML, 2015, pp. 448{456. deep neural network," in International Workshop on [54] Z. Pr u sa and P. L. Sndergaard, \Real-Time Spec- Acoustic Signal Enhancement (IWAENC), 2018, pp. trogram Inversion Using Phase Gradient Heap Inte- 286{290. gration," in Proc. Int. Conf. Digital Audio E ects [67] K. He, X. Zhang, S. Ren, and J. Sun, \Deep residual (DAFx-16), Sep 2016, pp. 17{21. learning for image recognition," in 2016 IEEE Con- ference on Computer Vision and Pattern Recognition [55] Z. Pr u sa, \The Phase Retrieval Toolbox," in AES In- (CVPR), June 2016, pp. 770{778. ternational Conference On Semantic Audio, Erlangen, Germany, June 2017. [68] T. Necciari, N. Holighaus, P. Balazs, Z. Pra, P. Ma- jdak, and O. Derrien, \Audlet lter banks: A versa- [56] H. Zhao, O. Gallo, I. Frosio, and J. Kautz, \Loss tile analysis/synthesis framework using auditory fre- functions for image restoration with neural networks," quency scales," Applied Sciences, vol. 8, no. 1:96, IEEE Transactions on Computational Imaging, vol. 3, no. 1, pp. 47{57, March 2017. [69] S. E. Yuksel, J. N. Wilson, and P. D. Gader, \Twenty [57] A. Krogh and J. Hertz, \A simple weight decay can years of mixture of experts," IEEE transactions on improve generalization," in Advances in neural infor- neural networks and learning systems, vol. 23, no. 8, mation processing systems 4. Morgan Kaufmann, pp. 1177{1193, 2012. 1992, pp. 950{957. [58] D. Kingma and J. Ba, \Adam: A method for stochas- tic optimization," in Proc. of ICLR, 2015. [59] I. Recommendation, \1387: Method for objective measurements of perceived audio quality," Interna- tional Telecommunication Union, Geneva, Switzer- land, 2001.

Journal

Electrical Engineering and Systems SciencearXiv (Cornell University)

Published: Oct 29, 2018

There are no references for this article.