Get 20M+ Full-Text Papers For Less Than $1.50/day. Subscribe now for You or Your Team.

Learn More →

Direction of Arrival with One Microphone, a few LEGOs, and Non-Negative Matrix Factorization

Direction of Arrival with One Microphone, a few LEGOs, and Non-Negative Matrix Factorization This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TASLP.2018.2867081, IEEE/ACM Transactions on Audio, Speech, and Language Processing 1 Direction of Arrival with One Microphone, a few LEGOs, and Non-Negative Matrix Factorization Dalia El Badawy and Ivan Dokmanic, ´ Member, IEEE Abstract—Conventional approaches to sound source localiza- for frequency-dependent ILDs in the HRTF also provides tion require at least two microphones. It is known, however, monaural cues. The question is then, can these monaural cues that people with unilateral hearing loss can also localize sounds. embedded in the HRTF be used for localization? Monaural localization is possible thanks to the scattering by the Indeed, monaural cues are known to help localize in eleva- head, though it hinges on learning the spectra of the various tion [1] and resolve the front/back confusion [2]: two cases sources. We take inspiration from this human ability to propose algorithms for accurate sound source localization using a single where binaural cues are not sufficient. Additionally, studies microphone embedded in an arbitrary scattering structure. The on the HRTFs of cats [3] and bats [4] also reveal their use structure modifies the frequency response of the microphone for localization in both azimuth and elevation, albeit in a in a direction-dependent way giving each direction a signature. binaural setting. This implies that the directional selectivity While knowing those signatures is sufficient to localize sources of the HRTF i.e., the monaural cues, is sufficient to enable of white noise, localizing speech is much more challenging: it is an ill-posed inverse problem which we regularize by prior people with unilateral hearing loss to localize sounds, though knowledge in the form of learned non-negative dictionaries. We with a reduced accuracy compared to the binaural case [5]. demonstrate a monaural speech localization algorithm based on non-negative matrix factorization that does not depend on sophisticated, designed scatterers. In fact, we show experimental A. Related Work results with ad hoc scatterers made of LEGO bricks. Even with Combining HRTF-like directional selectivity with source these rudimentary structures we can accurately localize arbitrary speakers; that is, we do not need to learn the dictionary for models has already been explored in the literature [6], [7], the particular speaker to be localized. Finally, we discuss multi- [8], [9]. For example, in one study [8], a small microphone source localization and the related limitations of our approach. enclosure was used to localize one source with the help of a Hidden Markov Model (HMM) trained on a variety of sounds Index Terms—direction-of-arrival estimation, group sparsity, including speech. In another study [7], a metamaterial-coated monaural localization, non-negative matrix factorization, sound device with a diameter of 40 cm and a dictionary of noise scattering, universal speech model prototypes were used to localize known noise sources. In our previous work [9], we used an omnidirectional sensor I. I NTRODUCTION surrounded by cubes of different sizes and a dictionary of N this paper, we present a computational study of the spectral prototypes to localize speech sources. role of scattering in sound source localization. We study I A single omnidirectional sensor can also be used to localize a setting in which localization is a priori not possible: that of sound sources inside a known room [10]. Indeed, in place of a single microphone, referred to as monaural localization. It the head, the scattering structure is then the room itself and the is well established that people with normal hearing localize localization cues are provided by the echoes from the walls sounds primarily from binaural cues—those that require both [11]. The drawback is that the room should be known with ears. Different directions of arrival (DoA) result in different considerable accuracy—it is much more realistic to assume interaural time differences which are the dominant cues for knowing the geometry of a small scatterer. localization at lower frequencies, as well as in interaural level As for source models, those used in previous work on differences (ILD) which are dominant at higher frequencies monaural localization rely on full complex-valued spectra [7]. [1]. The latter are linked to the head-related transfer function Other approaches to multi-sensor localization with sparsity (HRTF) which encodes how human and animal heads, ears, constraints also operate in the complex frequency domain and torsos scatter incoming sound waves. This scattering re- [12], [13], [14]. In this paper, we choose to work with non- sults in direction-dependent filtering whereby frequencies are negative data which in this case corresponds to the power or selectively attenuated or boosted; the exact filtering depends on magnitude spectra of the audio. We highlight two reasons for the shape of the head and ears and therefore varies for different this choice. First, unlike the multi-sensor case, the monaural people and animals. Thus the same mechanism responsible setting generates fewer useful relative phase cues. Second, if prototypes—that is, the exact source waveform—are assumed In line with the philosophy of reproducible research, code and data to be known as in [7], there are no modeling errors or chal- to reproduce the results of this paper are available at http://github.com/ swing-research/scatsense. lenges associated with the phase information. We, however, D. El Badawy is a student at EPFL, Switzerland, e-mail: assume much less, namely only that the source is speech. It is dalia.elbadawy@epfl.ch. then natural to leverage the large body of work that addresses I. Dokmanic ´ is with ECE Illinois, e-mail: dokmanic@illinois.edu. Manuscript received January xx, 2018; revised Month xx, 2018. dictionary learning with real or non-negative values as opposed 2329-9290 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications standards/publications/rights/index.html for more information. arXiv:1801.03740v3 [eess.AS] 28 Aug 2018 This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TASLP.2018.2867081, IEEE/ACM Transactions on Audio, Speech, and Language Processing 2 to complex values. In particular, we consider models based discretize the azimuth into D candidate source locations on non-negative matrix factorization (NMF). NMF results in = f ;  ; : : : ;  g and consider the standard mixing model 1 2 D a parts-based representation of an input signal [15] and can in the time domain for J sources incoming from directions for instance identify individual musical notes [16]. Thus with  = f g , j j2J training data, NMF can be used to learn a representation for y(t) = s (t) h (t) + e(t); (1) j j each source [17], [18]. For more flexibility, it can also be j2J used to learn an overcomplete dictionary where each source def admits a sparse representation [17], [18]. For the latter, either where J  f1; 2; : : : ; Dg = D, jJj = J ,  denotes convo- th multiple representations are concatenated [17] or the learning lution, y is the observed signal, s is the j source signal, def is modified by including sparsity penalties [18], [19]. h (t) = h(t;  ) is the impulse response of the directionally- j j To solve the localization problem, we first fit the postulated dependent filter, and e is additive noise. The goal of local- non-negative model to the observed measurements. The cost ization is then to estimate the set of directions  from the functions previously used often involve the Euclidean distance observed signal y. Note that in general we could also include [7], [9], [13], [12], [14]. Non-negative modeling lets us use the elevation by considering a set of D directions in 3D, other measures more suitable for speech and audio such as though this would likely yield many additional ambiguities. the Itakura–Saito divergence [16]. While NMF is routinely The mixing (1) can be approximated in the short-time used in single-channel source separation [17], [20], [21], [22], Fourier transform (STFT) domain as speech enhancement [23], polyphonic music transcription [24], Y (n; f ) = S (n; f )H (f ) + E(n; f ); (2) j j and has been used in a multichannel joint separation and j2J localization scenario [25], the present work is to the best of where n and f denote the time and frequency indices. This our knowledge the first time NMF is used in single-channel so-called narrowband approximation holds when the filter h source localization. Finally, when the localization problem is j is short enough with respect to the STFT analysis window ill-posed, as is the case for the monaural setting, various reg- [26], [27]. For reference, the impulse response corresponding ularizations are utilized. Typical regularizers promote sparsity to an HRTF is around 4.5 ms long [28], while the duration of [7], group sparsity [13], [14] or a combination thereof [9]. the STFT window for audio is commonly anywhere between 5 ms and 128 ms during which the signal is assumed stationary. B. Contributions & Outline Finally, the mixture’s spectrogram with N time frames and F The current paper extends our previous work [9] in several frequency bins can be written as important ways. We summarize the contributions as follows: Y = diag(H )S + E; (3) j j We derive an NMF formulation for monaural localization j2J via scattering; FN FN We formulate two different regularized cost functions where Y 2 C , S 2 C the spectrogram of the source with different distance measures in the data fidelity term impinging from  , H 2 C is the frequency response of the j j FN to solve the localization based on either universal or directionally-dependent filter, E 2 C is the spectrogram speaker-dependent dictionaries; of the additive noise, and diag(v) is a matrix with v on the We present extensive numerical evidence using simple diagonal. “devices” made from LEGO bricks; At least conceptually, monaural localization is a simple For the sake of reproducibility, we make freely available matter if the source is always the same: for each direction the the code and data used to generate the results. HRTF imprints a distinct spectral signature onto the sound which can be detected through correlation. In reality, the Unlike [8], the source model we present easily accommodates sources are diverse but this fixed-source case lets us develop more than one source. And unlike [6] or [7], we present a good intuition. localization of challenging sources such as speech without the need for metamaterials or accurate source models—we only A. Intuition use ad hoc scatterers and NMF. In this paper we limit ourselves To see how scattering helps, suppose the sources are white to anechoic conditions and localization in the horizontal plane and a set of D directional transfer functions fH g of our as our goal is to assess the potential of this simple setup. d=1 device is known. The power spectral density (PSD) of a white In the following, we first lay down an intuitive argument 2 2 source is flat and scaled by the source’s power: E[jS j ] =  . for how monaural cues help as well as a simple algorithm for Assuming the noise has zero mean, the PSD of the observation localizing white sources. We then formulate the localization problem using NMF and give an algorithm for general colored is 2 2 2 sources in Section III. In Section IV, we describe our devices E[jYj ] =  jH j ; (4) and results for localizing white noise and speech. j2J which is a positive linear combination of the squared magni- II. BACKGROUND tudes of the transfer functions. In other words, E[jYj ] belongs to a cone defined as The sensor we consider in this work is a microphone, possibly omnidirectional, embedded in a compact scattering 2 C = fx : x = c jH j ; c > 0g; (5) J j j j structure; we henceforth refer to it as “the device”. We j2J 2329-9290 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications standards/publications/rights/index.html for more information. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TASLP.2018.2867081, IEEE/ACM Transactions on Audio, Speech, and Language Processing 3 Algorithm 1 White Noise Localization Input: Number of sources J , magnitudes of directional trans- 2 FN fer functions fjH j g , N audio frames Y 2 C . j j2D b b b Output: Directions of arrival  = f ; : : : ;  g. 1 J 1 2 Compute the empirical PSD y = jY j N n=1 for every J  D, jJj = J do B jH j J j j2J P B B J J end for J arg min k(I P )yk n o (a) No scattering (b) LEGO1 b b j j 2 J smooth variations. Finally, Figures 1(b) and 1(c) correspond to our devices constructed using LEGO bricks whose responses have more fluctuating variations. In a nutshell, scattering induces a union-of-cones structure that enables us to localize white sources using a single sensor; stronger and more diverse scattering implies easier localization. B. White Noise Localization (c) LEGO2 (d) KEMAR In this section we describe a simple algorithm for localizing noise sources based on the intuition provided in the previous Fig. 1. Directional frequency magnitude response for different devices. Each section . Our experiments with white noise localization will horizontal slice is the polar pattern at the corresponding frequency between provide us with an ideal case baseline. 0-8000 Hz from bottom to top. The colors only aid visualization. First, we need to replace the expected value E[jYj ] by its empirical mean computed from N time frames. For many types of sources this approximation will be accurate already Each configuration of sourcesJ results in a different cone C . with a small number of frames by the various concentration For D directions and J white sources, there are possible of measure results [29]; we corroborate this claim empirically. cones which are known a priori since we assume knowing the Second, for simplicity, we replace each cone C by its scatterer. These cones reside in an F -dimensional space of smallest enclosing subspace S = span jH j repre- J j direction-dependent spectral magnitude responses, R , rather j2J sented by a matrix than the physical scatterer space R . While the arrangement of cones in R is indeed determined by the geometry of the def 2 2 B = jH j ; : : : ; jH j ; j 2 J : J j j k 3 1 J device in R , the relation is complicated and nonlinear, namely it requires solving a boundary value problem for the Helmholtz This way the closest cone can be approximately determined equation at each frequency. by selecting J  D such that the subspace projection error Thus, we have E[jYj ] 2 C , and in theory, the is the smallest possible. The details of the resulting algorithm are given in Algorithm 1; note the implicit assumption that localization problem becomes one of identifying the correct J < F as otherwise all cones lie in the same subspace. cone The robustness of Algorithm 1 to noise largely depends on b b J = arg min dist E[jYj ]; C ; (6) the angles between pairs of subspaces S for different config- urations J , with smaller angles implying a higher likelihood where E jYj denotes the empirical estimate of the corre- of error. Intuitively, a transfer function that varies smoothly sponding expectation from observed measurements. We dis- across directions is unfavorable as it yields smaller subspace cuss this further in the next section where we give the complete angles (more similar subspaces). algorithm. We now turn our attention to the realistic case where Testing for cone membership results in correct localization sound sources are diverse: how can we determine whether when C = C implies J = J (distinct direction sets J J 1 2 1 2 an observed spectral variation is due to the directivity of span distinct cones)—a condition that is loosely speaking the sensor or a property of the sound source itself? In more likely to hold the more diverse H are. Examples of fact, localization of unfamiliar sounds degrades not only for jH j are illustrated in Figure 1. In particular, Figure 1(a) monaural but also binaural listening [30]. It has also been corresponds to an omnidirectional microphone with a flat found that older children with unilateral hearing loss perform frequency response and no scattering structure. In this case better in localization tasks than younger children [31]. We C = f 1 :   0g and monaural localization is impossible. Figure 1(d) corresponds to an HRTF which features relatively This algorithm appears in our previous conference publication [9]. 2329-9290 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications standards/publications/rights/index.html for more information. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TASLP.2018.2867081, IEEE/ACM Transactions on Audio, Speech, and Language Processing 4 can thus conclude that both knowledge and experience allow B. Regularization us to dissociate source spectra from directional cues. Once the Still, recovering X from (8) is an ill-posed problem. To get HRTF and the source spectra have been learned, it becomes a reasonable solution, we must regularize by prior knowledge possible to differentiate directions based on their modifications about X. We thus make the following two assumptions. First, by the scatterer. the sources are few (J  D), which means that most groups X are zero. Second, each source has a sparse representation in the dictionary W. These assumptions are enforced by con- III. M ETHOD sidering the solution to the following penalized optimization problem We can think of an ideal white source as belonging to the 2 2 subspace spanf1g since jSj = 1 . In the following, we arg min D(Yk AX) +  (X) + (X); (9) g s generalize the source model to more interesting signals such X0 as speech. For those signals, testing for cone membership the where D(k) is the data fitting term, is a group-sparsity same way we did for white sources is not straightforward. penalty to enforce the first assumption, and is a sparsity We can, however, take advantage of the non-negativity of the penalty to enforce the second assumption. The parameters  > data to design efficient localization algorithms based on NMF. 0 and > 0 are the weights given to the respective penalties. Instead of continuing to work with power spectra jSj , we A common choice of D(k) for speech is the Itakura–Saito switch to magnitude spectra jSj: prior work [20], [23] and our divergence [16], which for strictly positive scalars v and v ^, is own experiments found that magnitude spectra perform better defined as in this context. v v d (vk v ^) = log 1; (10) IS v ^ v ^ so that D(Vk V) = d (v jjv ^ ). Another option is A. Problem Statement IS fn fn fn the Euclidean distance We adopt the usual assumption that magnitude spectra are additive [20], [21]. Then the magnitude spectrogram of the D(Vk V) = (v v ^ ) : (11) fn fn observation (3) can be expressed as fn Both the Itakura–Saito divergence and the Euclidean distance Y = diag(H )S + E; (7) j j belong to the family of -divergences with = 0 and = 2 j2J respectively [32]. The former is scale-invariant and is thus preferred for audio which has a large dynamic range [16]. for Y = jYj, H = jHj, S = jS j, and E = jEj. We further j j To promote group sparsity, we choose to be the log =` g 1 model the source S as a non-negative linear combination of FK penalty [33] defined as K atoms W 2 R such that S = WX . The atoms in W j j can correspond to either spectral prototypes of the sources to be localized or they can be learned from training data. Using (X) = log( +kvec(X )k ); (12) g d 1 this source model, we rewrite (7) as d=1 where vec() is a vectorization operator. To promote sparsity Y = AX + E; (8) of the dictionary expansion coefficients, we choose to be FN ` -norm [34] as where Y 2 R is the observation, (X) = kvec(X)k : (13) FKD s 1 A = diag(H )W; : : : ; diag(H )W 2 R 1 D The combination of sparsity and group-sparsity penalties re- is the mixing matrix, and sults in a small number of active groups that are themselves sparse. Thus the joint penalty is known as sparse-group T T KDN X = X ; : : : ; X 2 R sparsity [35]. 1 D + We note that our main optimization (9) is performed only KN are the dictionary coefficients. Each group X 2 R d over the latent variables X; the non-negative dictionary A, corresponds to the set of coefficients for one source at one which is constructed by merging a source dictionary learned by direction d. off-the-shelf implementations of standard algorithms with the For localization, we wish to recover X; however, we are direction-dependent transfer functions as described in Section not interested in the coefficient values themselves but rather III-A, is taken as input. We thus avoid the joint optimization whether given coefficients are active or not—the activity of a over A and X which is a major source of non-convexity. coefficient indicates the presence of a source. In other words, However, our choices for non-convex functionals like the we are only concerned with identifying the support of X. Itakura-Saito divergence and the log =` penalty (although the Localization is achieved by selecting the J directions whose latter is quasi-convex) render the whole optimization (9) non- corresponding groups X have the highest norms. convex. 2329-9290 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications standards/publications/rights/index.html for more information. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TASLP.2018.2867081, IEEE/ACM Transactions on Audio, Speech, and Language Processing 5 Algorithm 2 MU for NMF with Sparse-group Sparsity C. Derivation Input: Y, A, , The minimization (9) can be solved iteratively by multi- Output: X plicative updates (MU) which preserve non-negativity when Initialize X = A Y the variables are initialized with non-negative values. The up- Y AX date rules for X are derived using maximization-minimization repeat for the group-sparsity penalty in [33] and for the ` -penalty in for d = 1; : : : ; D do [32]. They amount to dividing the negative part of the gradient by the positive part and raising to an exponent. In the following d +kvec(X )k d 1 we derive the MU rules for our objective (9). end for Note that the objective is separable over the columns of X if Itakura–Saito then D T 2 X A (Y Y ) X X C (x) = D(yk Ax) +  log( +kx k ) + kxk ; (14) d 1 1 T b1 A Y + P + d=1 else if Euclidean then F FK where y 2 R , x 2 R are columns of Y and X + + A Y P (i) respectively. With x as the current iterate, the gradient of X X T b A Y (14) with respect to one element x of x when D(k) is the end if Itakura–Saito divergence is given by Y AX (i) (i) 2 until convergence r C (x ) = y (Ax ) a x f fk (i) 1 + (Ax ) a +  + ; fk (i) 1) Attempt localization on a coarse grid, +kx k 2) Identify the top T direction candidates, (15) 3) Construct the model matrix using the T candidates and where a = [A] are entries of A. The update rule is fk fk their neighbors at a finer resolution, then given as 4) Rerun the NMF localization. 2 The final algorithm for source localization by NMF with and (i) r C (x ) (i+1) (i) without multiresolution is shown in Algorithm 3. Since (9) x = x k k + (i) r C (x ) is non-convex, different initializations of X might lead to 0 1 P 2 different results. We thus later run an experiment to test the (i) y (Ax ) a f fk (i) f f influence on the actual localization performance in Section IV. @ A = x P ; k 1 (i) (Ax ) a +  + fk (i) +kx k Algorithm 3 Direction of Arrival Estimation by NMF (16) Input: Observation y(t), Number of sources J , Parameter for where is a corrective exponent [32]. The updates in matrix group sparsity , Parameter for ` sparsity , magnitudes of 2 1 form are shown in Algorithm 2 where the multiplication , directional transfer functions fH g , source model W j j2D division, and power operations are elementwise and P is a b b b Output: Directions of arrival  = f ; : : : ;  g 1 J matrix of the same size as X. Also shown are the updates Construct A diag(H )W; : : : ; diag(H )W 1 D for using the Euclidean distance following [32], [36] where Construct Y jSTFTfygj [v] = maxfv; g is a thresholding operator to maintain non- Factorize Y  AX using Algorithm 2 negativity with  = 10 . Calculate D = fkvec(X )k for d = 1; 2; : : : ; Dg d 1 if Multiresolution then D. Algorithm Identify T candidates and their RT neighbors t=T;r=R fH g The discretization of the azimuth into D evenly-spaced t;r t=1;r=0 directions has a direct correspondence with the localization Construct A diag(H )W; : : : ; diag(H )W 1;0 T;R e e errors. On the one hand, a course discretization limits the Factorize Y  AX using Algorithm 2 localization accuracy to approximately the size of the dis- Calculate D = fkvec(X )k for d = 1; 2; : : : ; (R + 1)Tg d 1 cretization bin . On the other hand a fine discretization end if may warrant a smaller error floor, but it implies a model matrix J fIndices of the J largest elements in Dg n o with a higher coherence only worsening the ill-posedness of b b j j 2 J the optimization problem (9). It additionally results in a larger matrix which hampers the matrix factorization algorithms that are of complexity O(FKDN ) per iteration [16], [33]. A common compromise is the multiresolution approach [12], [8] IV. E XPERIM ENTAL RESULTS in which position estimates are first computed on a coarse A. Devices grid, and then subsequently refined on a finer grid concentrated around the initial guesses. We test the following strategy: We ran experiments using three different devices: 2329-9290 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications standards/publications/rights/index.html for more information. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TASLP.2018.2867081, IEEE/ACM Transactions on Audio, Speech, and Language Processing 6 a) LEGO1 and LEGO2: The first two devices are struc- TABLE I PARAM ETERS PER DEVICE. tures composed of LEGO bricks as shown in Figure 2. Since we aimed for diverse random-like scattering, we stacked LEGO1 LEGO2 KEMAR haphazard brick constructions on a base plate of size 25 cm Frequency 3000-8000 Hz 3000-8000 Hz 0-8000 Hz 25 cm along with one omnidirectional microphone. The Prototypes  = 10, = 10  = 10, = 1  = 10, = 0:1 heights of the different constructions vary between 4 and 12.5 USM ( = 0)  = 0:1, = 10  = 10, = 1  = 100, = 10 USM ( = 2)  = 1, = 1  = 1, = 1  = 1, = 1 cm. We did not attempt to optimize the layout. The only Multiresolution  = 0:1, = 1  = 100, = 0:1 - assumption we make regarding the dimensions of the device is that some energy of the target source resides at frequencies where the device observably interacts with the acoustic wave. performance averaged for one and two sources were chosen. We note that the problem of designing and optimizing the We additionally tested whether the lower frequencies can be structure to get a desired response is that of inverse obstacle ignored in localization since, as mentioned before, for the scattering which is a hard inverse problem in its own right relatively small scatterers the lower frequency range lacks [37], [38]. For the present work, we simply observe that our variation and is thus uninformative. Moreover, truncating the random structures result in the desired random-like scattering. lower frequencies would help reduce coherence between the The directional impulse response measurements were then directional transfer functions. The final parameters and used done in an anechoic chamber where the device was placed on frequency range are summarized in Table I. a turntable as shown in Figure 2(c) and a loudspeaker at a Source Dictionary: For speech localization, we test two distance of 3.5 m emitted a linear sweep. We note that the source dictionaries. For the first experiment, we use a dictio- turntable is symmetric, so its effect on localization in the nary of prototypes of magnitude spectra from 4 speakers (2 horizontal plane, if any, is negligible. The duration of the female, 2 male) in the test set. measured impulse responses averages around 20 ms. Figures For the second experiment, we use a more general universal 1(b) and 1(c) show the corresponding magnitude response speech model (USM) [17] learned from a training set of 25 for the two devices. Due to their relatively small size, they female and 25 male speakers, also from TIMIT. We use a mostly scatter high frequency waves and so the response at random initialization for the NMF when learning the USM. lower frequencies is comparably flat. We thus expect that only Each speaker in the training set is modeled using K = 10 sources with enough energy in the higher range of frequencies F500 atoms, thus the final USM is W 2 R . In total, we use can be accurately localized. four versions of the USM in the experiments. Two versions b) KEMAR: The third device is KEMAR [39] which is correspond to learning the model by minimizing either the modeled after a human head and torso so that its response Itakura–Saito divergence or the Euclidean distance. The other accurately approximates a human HRTF. The mannequin’s two versions correspond to learning the model using only the torso measures 44 24 73 cm and the head’s diameter is 18 subset of frequencies to be utilized in the localization. cm. The duration of the impulse response is 10 ms. Figure 1(d) shows the corresponding magnitude response. As can be seen, C. Evaluation the variation across the directions is very smooth which we expect to result in worse monaural localization performance. We estimate the azimuth of the sources in the range [0 ; 360 ). The model (8) assumes a discrete set of 36 evenly spaced directions while the sources are randomly placed on B. Data and parameters a finer grid of 360 directions. Given the estimated directions The mixtures are created by first convolving the source ^ ^ ^ = f ; : : : ;  g and the true directions  = f ; : : : ;  g, 1 J 1 J signals with the impulse responses and then corrupting the the localization error is computed as the average absolute result by additive white Gaussian noise at various levels of difference modulo 360 as signal-to-noise ratio defined as min (  + 180) mod 360 180 ; (17) (j) j k s (t) h (t)k  J j j 2 j2J SNR = 20 log dB: ke(t)k where  : J ! J is a permutation that best matches the We use frame-based processing using the STFT with a Hann ordering in  and . window of length 64 ms, with a 50% overlap. The number of For each experiment, we test 5000 random sets of directions. iterations in NMF (Algorithm 2) was set to 100. We emphasize that we have been careful to avoid an inverse The test data contains 10 speech sources (5 female, 5 male) crime, and we produced the measurements by convolution in from TIMIT [40] sampled at 16000 Hz. The duration of the time domain, not by multiplication in the STFT domain. the speech varies between 3.1 and 4.5 s and the maximum Thus in this set up, the reported errors also reflect the modeling amplitude is normalized to 1 so that all sources have the mismatch. same volume. No preprocessing of the sources such as silence Following [41], we report the accuracy defined as the removal was done; when mixing two sources, the longest one percentage of sources localized to their closest 10 -wide bin as was truncated. well as the mean error for those accurately localized sources. A separate validation set was used to select the best sparsity For 36 bins, there is an inherent average error of 2:5 . Thus, parameters for each device. The parameters that gave the best ideally the accuracy would be 100% and the error 2:5 . 2329-9290 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications standards/publications/rights/index.html for more information. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TASLP.2018.2867081, IEEE/ACM Transactions on Audio, Speech, and Language Processing 7 (a) (b) (c) Fig. 2. Sensing devices made of LEGO bricks. The location of the microphone is marked by an “x”. (a) LEGO1. (b) LEGO2. (c) Calibration setup in an anechoic chamber. Additionally, we report the accuracy per source, that is, the The accuracy rate and the mean localization error for the rate at which a source is correctly localized regardless of the different devices are shown in Table III. In the one source other sources. case, all devices perform well. The mean error achieved by the devices for one white source is close to the ideal grid-matched 2:5 which is better than the reported 4:3 and 8:8 in [8] D. NMF Initialization using an HMM. For two sources, the accuracy of the LEGO Since in a non-convex problem different initializations devices is still high, though lower than for one source. At the might lead to different results, we run an experiment to test same time the accuracy of KEMAR deteriorates considerably. the effect of the initialization of X on the localization per- This is consistent with the intuition that interesting scattering formance. The experiment consists of 300 tests for localizing patterns such as those of the LEGO devices result in better one female speaker using LEGO2 and a USM. We compare localization. T 2 the initialization mentioned in Algorithm 2 (X = A Y) to We also test the effect of the discretization on the local- different random initializations. The estimated DoAs were in ization performance. In Table IV, we report the localization agreement for both initializations 98.67% of the time with errors using LEGO1 at three different resolutions: 2 , 5 , Itakura-Saito and 97% with Euclidean distance. We show in and 10 . We find that improving the resolution results in Table II the localization accuracy rates for that experiment more accurate localization for both one and two sources which are comparable. This means that there are either “hard” but the average error is still larger than the ideal 0:5 and situations where localization fails regardless of the initializa- 1:25 for the 2 and 5 resolutions respectively, especially tion or “easy” situations where it succeeds regardless of the for two sources. Since white sources are flat, this observation initialization. Certainly, tailor-made initializations in the spirit highlights a limitation of the device itself in terms of coherent of [42], [43] may work slightly better, but such constructions or ambiguous directions. are outside the scope of this paper. Additionally, we note that in these works initializations are constructed for the basis matrix. In our case, this matrix is A which is given as input F. Speech Localization with Prototypes to the algorithm. We now turn to speech localization which is considerably more challenging than white noise, especially in the monaural TABLE II setting. Using the three devices, we test the localization of one L OCALIZATION ACCURACY FOR DIFFERENT NMF INITIALIZATIONS. and two speakers at 30 dB SNR. In this first experiment, we use a subset of 4 speakers from the test data (two female, two A Y Random male) and consider an easier scenario where we assume know- Itakura-Saito 93.00% 93.33% Euclidean 89.67% 90.00% ing the exact magnitude spectral prototypes of the sources. Still, localization with colored prototypes is harder compared to noise prototypes (as in [7]). This scenario serves as a gauge for the quality of the sensing devices for localizing speech E. White Noise Localization sources. We organize the results by the number of sources as We first test the localization of one and two white sources at well as by whether the speaker is male or female. We expect various levels of SNR using Algorithm 1. Each source is 0.5 s the localization of female speakers to be more accurate since of white Gaussian noise. We compare the performance using they have relatively more energy in the higher frequency range the three devices LEGO1, LEGO2, and KEMAR described where the device responses are more informative. above. For white sources, using the full range of frequencies, The results for the three devices are shown in Table V. not a subset, was found to perform better. As expected the overall localization performance by the less smooth LEGO scatterers is significantly better than by KE- We use a deterministic initialization to facilitate reproducibility and multithreaded implementations. MAR. Also as expected, the localization of male speech is 2329-9290 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications standards/publications/rights/index.html for more information. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TASLP.2018.2867081, IEEE/ACM Transactions on Audio, Speech, and Language Processing 8 TABLE III ERROR FOR WHITE NOISE LOCALIZATION AT A DISCRETIZATION OF 10 LEGO1 LEGO2 KEMAR SNR Accuracy Mean Accuracy Mean Accuracy Mean One source 30 dB 99.56% 2:63 96.64% 2:54 92.06% 2:72 20 dB 99.58% 2:63 96.54% 2:53 92.12% 2:71 10 dB 99.60% 2:60 96.42% 2:53 91.78% 2:73 Two sources 30 dB 94.72% 2:75 83.64% 2:62 25.22% 3:44 20 dB 94.54% 2:75 83.34% 2:62 25.48% 3:45 10 dB 92.32% 2:73 81.52% 2:62 21.20% 3:59 TABLE IV D ISCRETIZATION COMPARISON FOR WHITE NOISE LOCALIZATION USING LEGO1. 2 5 10 SNR Accuracy Mean Accuracy Mean Accuracy Mean One source 30 dB 100.0% 0:52 100.0% 1:27 99.56% 2:63 20 dB 100.0% 0:52 100.0% 1:27 99.58% 2:63 10 dB 100.0% 0:54 100.0% 1:26 99.60% 2:60 Two sources 30 dB 98.56% 0:70 98.78% 1:43 94.72% 2:75 20 dB 98.50% 0:71 98.70% 1:43 94.54% 2:75 10 dB 97.30% 0:82 97.32% 1:47 92.32% 2:73 worse than female speech except for LEGO1. Similar to the result of the absence of spectral variation for male speech in white noise case, the accuracy for localizing two sources is the used higher frequency range. lower in comparison to one source. Moreover, we find that For two sources, the number of outliers increases for both the presence of one female speaker improves the accuracy for types as seen in Figure 3(b). We also plot in Figure 3(a) LEGO2 and KEMAR, most likely due to the spectral content. the confusion matrix for the case of using prototypes which has less outliers in comparison due to the stronger model. Note that outliers exist even with white sources as shown in G. Speech Localization with USM Figure 3(c), which points to a deficiency of the device itself as mentioned before. However, we note that while the reported In this experiment, we switch to a more realistic and accuracy corresponds to correctly localizing the two sources challenging setup where we use a learned universal speech simultaneously, the average accuracy per source which reflects model. We compare the performance of the Itakura–Saito the number of times at least one of the sources is correctly divergence to that of the Euclidean distance in the cost function localized is often higher. For instance for female speakers, the (9). The accuracy and mean error for the three devices are accuracy is 53.52% while the average accuracy per source is shown in Table VI. We observe that using the Itakura–Saito higher at 73.93%. The overall best performance is achieved divergence results in better performance in a majority of cases by LEGO2 with Itakura–Saito divergence. which is in line with the recommendations for using Itakura– 1) Finer resolution: As mentioned, one straightforward Saito for audio. improvement to our system is to increase the resolution. We Similar observations as in the previous experiment hold with show in Table VII the result of doubling the resolution from the LEGO scatterers offering better localization than KEMAR. 10 to 5 . For a single female speaker, the error is slightly We find that localizing one female speaker is successful with higher than the ideal average of 1:25 and the accuracy is 93% accuracy. Compared to the use of prototypes, the source improved relative to the initial bin size of 10 . While some im- model is here speaker-independent and the test set is larger provement is apparent for the localization of one male speaker containing 10 speakers; however, the accuracy is still only as well, the mismatch between the useful scattering range and lower by 3-5%. We also note that the mean localization error source spectrum still prevents good performance. However, in is 2:5 which is smaller than the reported 7:7 in [8] with an line with the discussion in Section III-D, localization of two HMM though at a lower SNR of 18 dB. sources is worse than at a coarser grid due to the increased As expected, the localization accuracy for male speakers matrix coherence, with the accuracy dropping from 55% to is lower than for female speakers. Since the mean errors 45% for two female speakers. are however not much larger than the ideal 2:5 , the lower accuracy points to the presence of outliers. We thus plot 2) Multiresolution: Next we tested the multiresolution confusion matrices in Figures 4 and 3 for female and male strategy where we refine the top estimates on the coarse grid speakers respectively. On the horizontal axis, we have the using a search on a finer grid. We arbitrarily use the best 7 estimated direction which is one of 36 only. First, we look candidates at the 10 grid spacing, and redo the localization at the single source case in Figures 3(a) and 4(a) where we at a finer 2 grid centered around the 7 initial guesses. The can clearly see the few outliers away from the diagonal. The hyperparameters for localization on the finer grid were tuned number of outliers is larger for male speakers which is a direct on a separate validation set and are given in Table I. 2329-9290 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications standards/publications/rights/index.html for more information. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TASLP.2018.2867081, IEEE/ACM Transactions on Audio, Speech, and Language Processing 9 TABLE V ERROR FOR SPEECH LOCALIZATION USING PROTOTYPES AT A DISCRETIZATION OF 10 LEGO1 LEGO2 KEMAR Accuracy Mean Per Source Accuracy Mean Per Source Accuracy Mean Per Source female speech 98.48% 2:53 98.48% 96.94% 2:51 96.94% 79.74% 3:42 79.74% male speech 98.76% 2:56 98.76% 96.00% 2:53 96.00% 72.06% 3:35 72.06% female/female 75.24% 2:46 87.07% 78.28% 2:40 88.31% 11.66% 3:50 46.70% female/male 76.60% 2:44 87.79% 74.36% 2:41 86.17% 10.90% 3:59 44.47% male/male 80.24% 2:43 89.82% 74.22% 2:39 86.04% 9.24% 3:91 43.09% TABLE VI ERROR FOR SPEECH LOCALIZATION USING A USM AT A DISCRETIZATION OF 10 LEGO1 LEGO2 KEMAR Accuracy Mean Per Source Accuracy Mean Per Source Accuracy Mean Per Source Itakura–Saito female speech 93.20% 2:67 93.20% 93.72% 2:54 93.72% 46.56% 3:33 46.56% male speech 89.80% 2:74 89.80% 87.70% 2:66 87.70% 35.56% 3:46 35.56% female/female 26.38% 2:64 54.65% 53.52% 2:42 73.93% 7.60% 3:90 35.29% female/male 24.76% 2:77 54.42% 49.22% 2:49 70.93% 7.40% 4:01 35.56% male/male 19.78% 3:02 50.61% 39.54% 2:63 65.45% 7.44% 4:36 33.76% Euclidean female speech 85.60% 2:79 85.60% 91.26% 2:57 91.26% 29.26% 3:75 29.26% male speech 76.00% 2:78 76.00% 86.74% 2:65 86.74% 23.24% 3:78 23.24% female/female 29.34% 2:88 56.66% 46.86% 2:48 69.89% 4.62% 4:40 23.75% female/male 30.62% 2:88 57.55% 42.28% 2:58 66.40% 3.36% 4:34 21.19% male/male 23.72% 2:96 52.67% 35.50% 2:74 62.71% 2.80% 3:97 18.60% TABLE VII ERROR FOR SPEECH LOCALIZATION AT A RESOLUTION OF 5 . LEGO1 LEGO2 Accuracy Mean Per Source Accuracy Mean Per Source female speech 97.08% 1:59 97.08% 99.72% 1:41 99.72% male speech 93.26% 1:76 93.26% 92.68% 1:57 92.68% female/female 22.24% 1:95 55.25% 43.26% 1:47 71.23% female/male 21.60% 2:14 55.33% 39.66% 1:61 68.82% male/male 15.42% 2:47 50.38% 29.72% 1:87 63.31% As before, multiresolution localization results in some im- well. For two-source localization, however, a good source provement for one source but not for two sources (Table model like prototypes is required. VIII). We show the relevant confusion matrices in Figure 5: the lack of increase in performance can be explained by V. C ONCLUSION the fact that in the second round of localization the included directions are still strongly correlated and the only way to Any scattering that causes spectral variations across di- resolve the resulting ambiguities is through more constrained rections enables monaural localization of one white source. source models. Additionally, the set of correlated directions are On the other hand, more complex and interesting scattering not necessarily concentrated around the true direction which patterns are needed to localize multiple sources. As shown might explain the drop in accuracy for LEGO1. Overall, it by our “random” LEGO constructions, interesting scattering seems the extra computation for the multiresolution approach is not hard to come by. In order to localize general, non-white does not bring about significant improvements compared to sources, one further requires a good source model. using a finer discretization. We demonstrated successful localization of one speaker Finally, in Figure 6, we show a summary of the performance using regularized NMF and a universal speech model. Both our of the different methods for localizing one or two female LEGO scatterers were found to be superior in localization to a speakers using LEGO2 along with the average accuracy and mannequin’s HRTF. Finally, we stress that speech localization error. Note that the results for prototypes use a smaller test is challenging and note that the fundamental frequency of set and that the error is lower bounded by the grid size. We the human voice is below 300 Hz while the range of usable also show the size of the model matrix A from (8) which frequencies for our devices is above 3000 Hz. This discrepancy contributes to the overall complexity of NMF as well as the is responsible for outliers when localizing multiple speakers, actual runtime which depends on the machine. The figure a problem that can potentially be alleviated by increasing the suggests that overall using a USM and a 10 resolution works size of the device or using sophisticated metamaterial-based 2329-9290 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications standards/publications/rights/index.html for more information. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TASLP.2018.2867081, IEEE/ACM Transactions on Audio, Speech, and Language Processing 10 TABLE VIII ERROR FOR SPEECH LOCALIZATION W ITH A MULTIRESOLUTION APPROACH. LEGO1 LEGO2 Accuracy Mean Per Source Accuracy Mean Per Source female speech 96.94% 1:15 96.94% 99.08% 0:70 99.08% male speech 86.00% 1:26 86.00% 90.62% 0:95 90.62% female/female 17.88% 1:80 56.66% 32.26% 1:08 65.39% female/male 17.64% 1:87 56.17% 29.06% 1:33 63.47% male/male 13.84% 2:19 52.72% 20.22% 1:64 57.68% Martin Vetterli for numerous insights and discussions, and for suggesting Figure 1. This work was supported by the Swiss National Science Foundation grant number 20FP-1 151073, Inverse Problems regularized by Sparsity. VII. D ISCLAIM ER LEGO is a trademark of the LEGO Group which does not sponsor, authorize or endorse this work. (a) 10 (b) 10 REFERENCES [1] J. Blauert, Spatial Hearing: The Psychophysics of Human Sound Localization, The MIT Press, 1997. [2] A. D. Musicant and R. A. Butler, “The Influence of Pinnae-based Spectral Cues on Sound Localization,” J. Acoust. Soc. Am., vol. 75, no. 4, pp. 1195–1200, 1984. [3] J. J. Rice, B. J. May, G. A. Spirou, and E. D. Young, “Pinna-based Spectral Cues for Sound Localization in Cat,” Hearing Research, vol. 58, no. 2, pp. 132–152, 1992. (c) 5 (d) 5 [4] M. Aytekin, E. Grassi, M. Sahota, and C. F. Moss, “The Bat Head- related Transfer Function Reveals Binaural Cues for Sound Localization in Azimuth and Elevation,” J. Acoust. Soc. Am., vol. 116, no. 6, pp. 3594–3605, 2004. Fig. 3. Confusion matrices for localizing one speaker using LEGO2. Female [5] S. R. Oldfield and S. P. A. Parker, “Acuity of Sound Localisation: speech has less outliers and improving the resolution decreases the number A Topography of Auditory Space. III. Monaural Hearing Conditions,” of outliers. Left: Female speech. Right: Male speech. Perception, vol. 15, no. 1, pp. 67–81, 1986, PMID: 3774479. [6] J. G. Harris, C.-J. Pu, and J. C. Principe, “A Monaural Cue Sound Localizer,” Analog Integrated Circuits and Signal Processing, vol. 23, no. 2, pp. 163–172, May 2000. designs. Perhaps a source model other than the universal dic- [7] Y. Xie, T. Tsai, A. Konneker, B. Popa, D. J. Brady, and S. A. Cummer, tionary could approach the performance of using prototypes. “Single-sensor Multispeaker Listening with Acoustic Metamaterials,” Proc. Natl. Acad. Sci. U.S.A., vol. 112, no. 34, pp. 10595–10598, Aug. Finally, we presented our results for anechoic conditions. Preliminary numerical experiments show that the current ap- [8] A. Saxena and A.Y. Ng, “Learning Sound Location from a Single proach underperforms in a reverberant setting. This shortcom- Microphone,” in Proc. IEEE Int. Conf. on Robotics and Automation, 2009, pp. 1737–1742. ing is partly due to violations of our modeling assumptions. [9] D. El Badawy, I. Dokmanic, ´ and M. Vetterli, “Acoustic DoA Estimation For example, in Eq. (1), the noise is assumed independent by One Unsophisticated Sensor,” in 13th Int. Conf.on Latent Variable of the sources which is no longer true in the presence of Analysis and Signal Separation - LVA/ICA, P. Tichavsky, ´ M. B. Zadeh, O. Michel, and N. Thirion-Moreau, Eds. 2017, vol. 9237 of Lecture reverberation. For practical scenarios it is thus necessary to Notes in Computer Science, pp. 489–496, Springer. extend the approach to handle reverberant conditions as well [10] I. Dokmanic, ´ Listening to Distances and Hearing Shapes: Inverse Prob- as to test the localization performance in 3D i.e., estimate lems in Room Acoustics and Beyond, Ph.D. thesis, Ecole polytechnique fed ´ erale ´ de Lausanne, 2015. both the azimuth and the elevation. For accurate localization [11] I. Dokmanic ´ and M. Vetterli, “Room Helps: Acoustic Localization in elevation, we expect that a taller device with more variation with Finite Elements,” in Proc. IEEE Int. Conf. Audio, Speech, Signal along the vertical axis would perform better. Since we only use Process., Mar. 2012, pp. 2617–2620. [12] D. Malioutov, M. Cetin, and A. S. Willsky, “A Sparse Signal Recon- one microphone, the number of ambiguous directions would struction Perspective for Source Localization with Sensor Arrays,” IEEE likely grow considerably in 3D making the problem compa- Trans. Signal Process., vol. 53, no. 8, pp. 3010–3022, Aug. 2005. rably harder. Other interesting open questions include blind [13] P. T. Boufounos, P. Smaragdis, and B. Raj, “Joint Sparsity Models for Wideband Array Processing,” in SPIE, 2011, vol. 8138, pp. 81380K– learning of the directional transfer functions and understanding 81380K–10. the benefits of scattering in the case of multiple sensors. [14] E. Cagli, D. Carrera, G. Aletti, G. Naldi, and B. Rossi, “Robust DOA Estimation of Speech Signals via Sparsity Models Using Microphone Arrays,” in Proc. IEEE Workshop on Applications of Signal Process. VI. ACKNOWLEDGMENT Audio Acoust., Oct. 2013, pp. 1–4. [15] D. D. Lee and H. S. Seung, “Learning the Parts of Objects by Non- We thank Robin Scheibler and Mihailo Kolundzija ˇ for help negative Matrix Factorization,” Nature, vol. 401, pp. 788–791, Oct. with experiments and valuable comments. We also thank 1999. 2329-9290 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications standards/publications/rights/index.html for more information. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TASLP.2018.2867081, IEEE/ACM Transactions on Audio, Speech, and Language Processing 11 (a) (b) (c) Fig. 4. Confusion matrices for localizing two sources using LEGO2 at a resolution of 10 . (a) With prototypes. (b) With a USM. (c) White sources. Matrix Factorization With Temporal Continuity and Sparseness Criteria,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 15, no. 3, pp. 1066–1074, Mar. 2007. [21] P. Smaragdis, “Convolutive Speech Bases and Their Application to Supervised Speech Separation,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 15, no. 1, pp. 1–12, Jan. 2007. [22] O. Dikmen and A. T. Cemgil, “Unsupervised Single-channel Source Separation using Bayesian NMF,” in Proc. IEEE Workshop on Appli- cations of Signal Process. Audio Acoust., Oct. 2009, pp. 93–96. [23] N. Mohammadiha, P. Smaragdis, and A. Leijon, “Supervised and Unsu- (a) One speaker. (b) Two speakers. pervised Speech Enhancement Using Nonnegative Matrix Factorization,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 21, no. 10, pp. 2140–2151, Oct. 2013. Fig. 5. Confusion matrices for localizing female speech with LEGO2 using [24] P. Smaragdis and J. C. Brown, “Non-negative Matrix Factorization a multiresolution approach. Improving the resolution decreases the number of for Polyphonic Music Transcription,” in Proc. IEEE Workshop on outliers in the one-speaker case but not the two-speaker case. Applications of Signal Process. Audio Acoust., Oct. 2003, pp. 177–180. [25] J. Traa, P. Smaragdis, N. D. Stein, and D. Wingate, “Directional NMF for Joint Source Localization and Separation,” in Proc. IEEE Workshop on Applications of Signal Process. Audio Acoust., 2015, pp. 1–5. [26] M. Kowalski, E. Vincent, and R. Gribonval, “Beyond the Narrowband Approximation: Wideband Convex Methods for Under-Determined Re- verberant Audio Source Separation,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 18, no. 7, pp. 1818–1829, Sep. 2010. [27] L. Parra and C. Spence, “Convolutive Blind Separation of Non-stationary Sources,” IEEE Trans. Speech Audio Process., vol. 8, no. 3, pp. 320– 327, May 2000. [28] V. R. Algazi, R. O. Duda, D. M. Thompson, and C. Avendano, “The CIPIC HRTF Database,” in Proc. IEEE Workshop on Applications of Signal Process. Audio Acoust., 2001, pp. 99–102. [29] M. Ledoux, The Concentration of Measure Phenomenon, Math. Surveys Monogr. American Mathematical Society, Providence (R.I.), 2001. [30] J. Hebrank and D. Wright, “Are Two Ears Necessary for Localization of Sound Sources on the Median Plane?,” J. Acoust. Soc. Am., vol. 56, no. 3, pp. 935–938, 1974. [31] R. M. Reeder, J. Cadieux, and J. B. Firszt, “Quantification of Speech- in-Noise and Sound Localisation Abilities in Children with Unilateral Hearing Loss and Comparison to Normal Hearing Peers,” Audiology and Neurotology, vol. 20(suppl 1), no. Suppl. 1, pp. 31–37, 2015. [32] C. Fevotte and J. Idier, “Algorithms for Non-negative Matrix Factor- ization with the Beta-divergence,” Neural Comput., vol. 23, no. 9, pp. 2421–2456, Sep. 2011. ` ´ [33] A. Lefevre, F. Bach, and C. Fevotte, “Itakura–Saito Non-negative Matrix Fig. 6. Summary of localizing one (left) or two (right) female speakers using Factorization with Group Sparsity,” in Proc. IEEE Int. Conf. Audio, LEGO2. Speech, Signal Process., May 2011, pp. 21–24. [34] D. L. Donoho, “For Most Large Underdetermined Systems of Linear Equations the Minimal l1-norm Solution is also the Sparsest Solution,” [16] C. Fev ´ otte, N. Bertin, and J. Durrieu, “Non-negative Matrix Factor- Comm. Pure Appl. Math, vol. 59, pp. 797–829, 2004. ization with the Itakura-Saito Divergence. With Application to Music [35] J. Friedman, T. Hastie, and R. Tibshirani, “A Note on the Group Lasso Analysis,” Neural Computation, vol. 21, no. 3, pp. 793–830, 2009. and a Sparse Group Lasso,” arXiv, 2010. [17] D. L. Sun and G. J. Mysore, “Universal Speech Models for Speaker [36] A. Cichocki, R. Zdunek, and S. Amari, “New Algorithms for Non- Independent Single Channel Source Separation,” in Proc. IEEE Int. Negative Matrix Factorization in Applications to Blind Source Separa- Conf. Audio, Speech, Signal Process., 2013, pp. 141–145. tion,” in Proc. IEEE Int. Conf. Audio, Speech, Signal Process., May [18] M. N. Schmidt and R. K Olsson, “Single-channel Speech Separation 2006, vol. 5, pp. V621–V624. using Sparse Non-negative Matrix Factorization,” in Interspeech, 2006, [37] D. Colton and R. Kress, Inverse Acoustic and Electromagnetic Scattering pp. 2614–2617. Theory, Applied Mathematical Sciences. Springer, New York, NY, 3 [19] J. Le Roux, F. J. Weninger, and J. R. Hershey, “Sparse NMF – Half- edition, 2013. baked or Well Done?,” Tech. Rep. TR2015-023, Mitsubishi Electric [38] D. Colton, J. Coyle, and P. Monk, “Recent Developments in Inverse Research Laboratories (MERL), Cambridge, MA, USA, Mar. 2015. Acoustic Scattering Theory,” SIAM Review, vol. 42, no. 3, pp. 369– [20] T. Virtanen, “Monaural Sound Source Separation by Nonnegative 414, 2000. 2329-9290 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications standards/publications/rights/index.html for more information. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TASLP.2018.2867081, IEEE/ACM Transactions on Audio, Speech, and Language Processing 12 [39] H. Wierstorf, A. Geier, M.and Raake, and S. Spors, “A Free Database of Head-Related Impulse Response Measurements in the Horizontal Plane with Multiple Distances,” June 2016. [40] J. Garofolo, L. Lamel, W. Fisher, J. Fiscus, D. Pallett, and N. Dahlgren, “DARPA TIMIT: Acoustic-phonetic Continuous Speech Corpus,” Tech. Rep., NIST, 1993, distributed with the TIMIT CD-ROM. [41] J. Woodruff and D. Wang, “Binaural Localization of Multiple Sources in Reverberant and Noisy Environments,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 20, no. 5, pp. 1503–1512, July 2012. [42] D. Kitamura and N. Ono, “Efficient Initialization for Nonnegative Matrix Factorization based on Nonnegative Independent Component Analysis,” in Proc. IEEE Int. Workshop on Acoustic Signal Enhancement, Sep. 2016, pp. 1–5. [43] A. N. Langville, C. D. Meyer, R. Albright, J. Cox, and D. Duling, “Al- gorithms, Initializations, and Convergence for the Nonnegative Matrix Factorization,” arXiv, 2014. 2329-9290 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications standards/publications/rights/index.html for more information. http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Electrical Engineering and Systems Science arXiv (Cornell University)

Direction of Arrival with One Microphone, a few LEGOs, and Non-Negative Matrix Factorization

Loading next page...
 
/lp/arxiv-cornell-university/direction-of-arrival-with-one-microphone-a-few-legos-and-non-negative-5v58UA7kjy

References

References for this paper are not available at this time. We will be adding them shortly, thank you for your patience.

ISSN
2329-9290
eISSN
ARCH-3348
DOI
10.1109/TASLP.2018.2867081
Publisher site
See Article on Publisher Site

Abstract

This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TASLP.2018.2867081, IEEE/ACM Transactions on Audio, Speech, and Language Processing 1 Direction of Arrival with One Microphone, a few LEGOs, and Non-Negative Matrix Factorization Dalia El Badawy and Ivan Dokmanic, ´ Member, IEEE Abstract—Conventional approaches to sound source localiza- for frequency-dependent ILDs in the HRTF also provides tion require at least two microphones. It is known, however, monaural cues. The question is then, can these monaural cues that people with unilateral hearing loss can also localize sounds. embedded in the HRTF be used for localization? Monaural localization is possible thanks to the scattering by the Indeed, monaural cues are known to help localize in eleva- head, though it hinges on learning the spectra of the various tion [1] and resolve the front/back confusion [2]: two cases sources. We take inspiration from this human ability to propose algorithms for accurate sound source localization using a single where binaural cues are not sufficient. Additionally, studies microphone embedded in an arbitrary scattering structure. The on the HRTFs of cats [3] and bats [4] also reveal their use structure modifies the frequency response of the microphone for localization in both azimuth and elevation, albeit in a in a direction-dependent way giving each direction a signature. binaural setting. This implies that the directional selectivity While knowing those signatures is sufficient to localize sources of the HRTF i.e., the monaural cues, is sufficient to enable of white noise, localizing speech is much more challenging: it is an ill-posed inverse problem which we regularize by prior people with unilateral hearing loss to localize sounds, though knowledge in the form of learned non-negative dictionaries. We with a reduced accuracy compared to the binaural case [5]. demonstrate a monaural speech localization algorithm based on non-negative matrix factorization that does not depend on sophisticated, designed scatterers. In fact, we show experimental A. Related Work results with ad hoc scatterers made of LEGO bricks. Even with Combining HRTF-like directional selectivity with source these rudimentary structures we can accurately localize arbitrary speakers; that is, we do not need to learn the dictionary for models has already been explored in the literature [6], [7], the particular speaker to be localized. Finally, we discuss multi- [8], [9]. For example, in one study [8], a small microphone source localization and the related limitations of our approach. enclosure was used to localize one source with the help of a Hidden Markov Model (HMM) trained on a variety of sounds Index Terms—direction-of-arrival estimation, group sparsity, including speech. In another study [7], a metamaterial-coated monaural localization, non-negative matrix factorization, sound device with a diameter of 40 cm and a dictionary of noise scattering, universal speech model prototypes were used to localize known noise sources. In our previous work [9], we used an omnidirectional sensor I. I NTRODUCTION surrounded by cubes of different sizes and a dictionary of N this paper, we present a computational study of the spectral prototypes to localize speech sources. role of scattering in sound source localization. We study I A single omnidirectional sensor can also be used to localize a setting in which localization is a priori not possible: that of sound sources inside a known room [10]. Indeed, in place of a single microphone, referred to as monaural localization. It the head, the scattering structure is then the room itself and the is well established that people with normal hearing localize localization cues are provided by the echoes from the walls sounds primarily from binaural cues—those that require both [11]. The drawback is that the room should be known with ears. Different directions of arrival (DoA) result in different considerable accuracy—it is much more realistic to assume interaural time differences which are the dominant cues for knowing the geometry of a small scatterer. localization at lower frequencies, as well as in interaural level As for source models, those used in previous work on differences (ILD) which are dominant at higher frequencies monaural localization rely on full complex-valued spectra [7]. [1]. The latter are linked to the head-related transfer function Other approaches to multi-sensor localization with sparsity (HRTF) which encodes how human and animal heads, ears, constraints also operate in the complex frequency domain and torsos scatter incoming sound waves. This scattering re- [12], [13], [14]. In this paper, we choose to work with non- sults in direction-dependent filtering whereby frequencies are negative data which in this case corresponds to the power or selectively attenuated or boosted; the exact filtering depends on magnitude spectra of the audio. We highlight two reasons for the shape of the head and ears and therefore varies for different this choice. First, unlike the multi-sensor case, the monaural people and animals. Thus the same mechanism responsible setting generates fewer useful relative phase cues. Second, if prototypes—that is, the exact source waveform—are assumed In line with the philosophy of reproducible research, code and data to be known as in [7], there are no modeling errors or chal- to reproduce the results of this paper are available at http://github.com/ swing-research/scatsense. lenges associated with the phase information. We, however, D. El Badawy is a student at EPFL, Switzerland, e-mail: assume much less, namely only that the source is speech. It is dalia.elbadawy@epfl.ch. then natural to leverage the large body of work that addresses I. Dokmanic ´ is with ECE Illinois, e-mail: dokmanic@illinois.edu. Manuscript received January xx, 2018; revised Month xx, 2018. dictionary learning with real or non-negative values as opposed 2329-9290 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications standards/publications/rights/index.html for more information. arXiv:1801.03740v3 [eess.AS] 28 Aug 2018 This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TASLP.2018.2867081, IEEE/ACM Transactions on Audio, Speech, and Language Processing 2 to complex values. In particular, we consider models based discretize the azimuth into D candidate source locations on non-negative matrix factorization (NMF). NMF results in = f ;  ; : : : ;  g and consider the standard mixing model 1 2 D a parts-based representation of an input signal [15] and can in the time domain for J sources incoming from directions for instance identify individual musical notes [16]. Thus with  = f g , j j2J training data, NMF can be used to learn a representation for y(t) = s (t) h (t) + e(t); (1) j j each source [17], [18]. For more flexibility, it can also be j2J used to learn an overcomplete dictionary where each source def admits a sparse representation [17], [18]. For the latter, either where J  f1; 2; : : : ; Dg = D, jJj = J ,  denotes convo- th multiple representations are concatenated [17] or the learning lution, y is the observed signal, s is the j source signal, def is modified by including sparsity penalties [18], [19]. h (t) = h(t;  ) is the impulse response of the directionally- j j To solve the localization problem, we first fit the postulated dependent filter, and e is additive noise. The goal of local- non-negative model to the observed measurements. The cost ization is then to estimate the set of directions  from the functions previously used often involve the Euclidean distance observed signal y. Note that in general we could also include [7], [9], [13], [12], [14]. Non-negative modeling lets us use the elevation by considering a set of D directions in 3D, other measures more suitable for speech and audio such as though this would likely yield many additional ambiguities. the Itakura–Saito divergence [16]. While NMF is routinely The mixing (1) can be approximated in the short-time used in single-channel source separation [17], [20], [21], [22], Fourier transform (STFT) domain as speech enhancement [23], polyphonic music transcription [24], Y (n; f ) = S (n; f )H (f ) + E(n; f ); (2) j j and has been used in a multichannel joint separation and j2J localization scenario [25], the present work is to the best of where n and f denote the time and frequency indices. This our knowledge the first time NMF is used in single-channel so-called narrowband approximation holds when the filter h source localization. Finally, when the localization problem is j is short enough with respect to the STFT analysis window ill-posed, as is the case for the monaural setting, various reg- [26], [27]. For reference, the impulse response corresponding ularizations are utilized. Typical regularizers promote sparsity to an HRTF is around 4.5 ms long [28], while the duration of [7], group sparsity [13], [14] or a combination thereof [9]. the STFT window for audio is commonly anywhere between 5 ms and 128 ms during which the signal is assumed stationary. B. Contributions & Outline Finally, the mixture’s spectrogram with N time frames and F The current paper extends our previous work [9] in several frequency bins can be written as important ways. We summarize the contributions as follows: Y = diag(H )S + E; (3) j j We derive an NMF formulation for monaural localization j2J via scattering; FN FN We formulate two different regularized cost functions where Y 2 C , S 2 C the spectrogram of the source with different distance measures in the data fidelity term impinging from  , H 2 C is the frequency response of the j j FN to solve the localization based on either universal or directionally-dependent filter, E 2 C is the spectrogram speaker-dependent dictionaries; of the additive noise, and diag(v) is a matrix with v on the We present extensive numerical evidence using simple diagonal. “devices” made from LEGO bricks; At least conceptually, monaural localization is a simple For the sake of reproducibility, we make freely available matter if the source is always the same: for each direction the the code and data used to generate the results. HRTF imprints a distinct spectral signature onto the sound which can be detected through correlation. In reality, the Unlike [8], the source model we present easily accommodates sources are diverse but this fixed-source case lets us develop more than one source. And unlike [6] or [7], we present a good intuition. localization of challenging sources such as speech without the need for metamaterials or accurate source models—we only A. Intuition use ad hoc scatterers and NMF. In this paper we limit ourselves To see how scattering helps, suppose the sources are white to anechoic conditions and localization in the horizontal plane and a set of D directional transfer functions fH g of our as our goal is to assess the potential of this simple setup. d=1 device is known. The power spectral density (PSD) of a white In the following, we first lay down an intuitive argument 2 2 source is flat and scaled by the source’s power: E[jS j ] =  . for how monaural cues help as well as a simple algorithm for Assuming the noise has zero mean, the PSD of the observation localizing white sources. We then formulate the localization problem using NMF and give an algorithm for general colored is 2 2 2 sources in Section III. In Section IV, we describe our devices E[jYj ] =  jH j ; (4) and results for localizing white noise and speech. j2J which is a positive linear combination of the squared magni- II. BACKGROUND tudes of the transfer functions. In other words, E[jYj ] belongs to a cone defined as The sensor we consider in this work is a microphone, possibly omnidirectional, embedded in a compact scattering 2 C = fx : x = c jH j ; c > 0g; (5) J j j j structure; we henceforth refer to it as “the device”. We j2J 2329-9290 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications standards/publications/rights/index.html for more information. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TASLP.2018.2867081, IEEE/ACM Transactions on Audio, Speech, and Language Processing 3 Algorithm 1 White Noise Localization Input: Number of sources J , magnitudes of directional trans- 2 FN fer functions fjH j g , N audio frames Y 2 C . j j2D b b b Output: Directions of arrival  = f ; : : : ;  g. 1 J 1 2 Compute the empirical PSD y = jY j N n=1 for every J  D, jJj = J do B jH j J j j2J P B B J J end for J arg min k(I P )yk n o (a) No scattering (b) LEGO1 b b j j 2 J smooth variations. Finally, Figures 1(b) and 1(c) correspond to our devices constructed using LEGO bricks whose responses have more fluctuating variations. In a nutshell, scattering induces a union-of-cones structure that enables us to localize white sources using a single sensor; stronger and more diverse scattering implies easier localization. B. White Noise Localization (c) LEGO2 (d) KEMAR In this section we describe a simple algorithm for localizing noise sources based on the intuition provided in the previous Fig. 1. Directional frequency magnitude response for different devices. Each section . Our experiments with white noise localization will horizontal slice is the polar pattern at the corresponding frequency between provide us with an ideal case baseline. 0-8000 Hz from bottom to top. The colors only aid visualization. First, we need to replace the expected value E[jYj ] by its empirical mean computed from N time frames. For many types of sources this approximation will be accurate already Each configuration of sourcesJ results in a different cone C . with a small number of frames by the various concentration For D directions and J white sources, there are possible of measure results [29]; we corroborate this claim empirically. cones which are known a priori since we assume knowing the Second, for simplicity, we replace each cone C by its scatterer. These cones reside in an F -dimensional space of smallest enclosing subspace S = span jH j repre- J j direction-dependent spectral magnitude responses, R , rather j2J sented by a matrix than the physical scatterer space R . While the arrangement of cones in R is indeed determined by the geometry of the def 2 2 B = jH j ; : : : ; jH j ; j 2 J : J j j k 3 1 J device in R , the relation is complicated and nonlinear, namely it requires solving a boundary value problem for the Helmholtz This way the closest cone can be approximately determined equation at each frequency. by selecting J  D such that the subspace projection error Thus, we have E[jYj ] 2 C , and in theory, the is the smallest possible. The details of the resulting algorithm are given in Algorithm 1; note the implicit assumption that localization problem becomes one of identifying the correct J < F as otherwise all cones lie in the same subspace. cone The robustness of Algorithm 1 to noise largely depends on b b J = arg min dist E[jYj ]; C ; (6) the angles between pairs of subspaces S for different config- urations J , with smaller angles implying a higher likelihood where E jYj denotes the empirical estimate of the corre- of error. Intuitively, a transfer function that varies smoothly sponding expectation from observed measurements. We dis- across directions is unfavorable as it yields smaller subspace cuss this further in the next section where we give the complete angles (more similar subspaces). algorithm. We now turn our attention to the realistic case where Testing for cone membership results in correct localization sound sources are diverse: how can we determine whether when C = C implies J = J (distinct direction sets J J 1 2 1 2 an observed spectral variation is due to the directivity of span distinct cones)—a condition that is loosely speaking the sensor or a property of the sound source itself? In more likely to hold the more diverse H are. Examples of fact, localization of unfamiliar sounds degrades not only for jH j are illustrated in Figure 1. In particular, Figure 1(a) monaural but also binaural listening [30]. It has also been corresponds to an omnidirectional microphone with a flat found that older children with unilateral hearing loss perform frequency response and no scattering structure. In this case better in localization tasks than younger children [31]. We C = f 1 :   0g and monaural localization is impossible. Figure 1(d) corresponds to an HRTF which features relatively This algorithm appears in our previous conference publication [9]. 2329-9290 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications standards/publications/rights/index.html for more information. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TASLP.2018.2867081, IEEE/ACM Transactions on Audio, Speech, and Language Processing 4 can thus conclude that both knowledge and experience allow B. Regularization us to dissociate source spectra from directional cues. Once the Still, recovering X from (8) is an ill-posed problem. To get HRTF and the source spectra have been learned, it becomes a reasonable solution, we must regularize by prior knowledge possible to differentiate directions based on their modifications about X. We thus make the following two assumptions. First, by the scatterer. the sources are few (J  D), which means that most groups X are zero. Second, each source has a sparse representation in the dictionary W. These assumptions are enforced by con- III. M ETHOD sidering the solution to the following penalized optimization problem We can think of an ideal white source as belonging to the 2 2 subspace spanf1g since jSj = 1 . In the following, we arg min D(Yk AX) +  (X) + (X); (9) g s generalize the source model to more interesting signals such X0 as speech. For those signals, testing for cone membership the where D(k) is the data fitting term, is a group-sparsity same way we did for white sources is not straightforward. penalty to enforce the first assumption, and is a sparsity We can, however, take advantage of the non-negativity of the penalty to enforce the second assumption. The parameters  > data to design efficient localization algorithms based on NMF. 0 and > 0 are the weights given to the respective penalties. Instead of continuing to work with power spectra jSj , we A common choice of D(k) for speech is the Itakura–Saito switch to magnitude spectra jSj: prior work [20], [23] and our divergence [16], which for strictly positive scalars v and v ^, is own experiments found that magnitude spectra perform better defined as in this context. v v d (vk v ^) = log 1; (10) IS v ^ v ^ so that D(Vk V) = d (v jjv ^ ). Another option is A. Problem Statement IS fn fn fn the Euclidean distance We adopt the usual assumption that magnitude spectra are additive [20], [21]. Then the magnitude spectrogram of the D(Vk V) = (v v ^ ) : (11) fn fn observation (3) can be expressed as fn Both the Itakura–Saito divergence and the Euclidean distance Y = diag(H )S + E; (7) j j belong to the family of -divergences with = 0 and = 2 j2J respectively [32]. The former is scale-invariant and is thus preferred for audio which has a large dynamic range [16]. for Y = jYj, H = jHj, S = jS j, and E = jEj. We further j j To promote group sparsity, we choose to be the log =` g 1 model the source S as a non-negative linear combination of FK penalty [33] defined as K atoms W 2 R such that S = WX . The atoms in W j j can correspond to either spectral prototypes of the sources to be localized or they can be learned from training data. Using (X) = log( +kvec(X )k ); (12) g d 1 this source model, we rewrite (7) as d=1 where vec() is a vectorization operator. To promote sparsity Y = AX + E; (8) of the dictionary expansion coefficients, we choose to be FN ` -norm [34] as where Y 2 R is the observation, (X) = kvec(X)k : (13) FKD s 1 A = diag(H )W; : : : ; diag(H )W 2 R 1 D The combination of sparsity and group-sparsity penalties re- is the mixing matrix, and sults in a small number of active groups that are themselves sparse. Thus the joint penalty is known as sparse-group T T KDN X = X ; : : : ; X 2 R sparsity [35]. 1 D + We note that our main optimization (9) is performed only KN are the dictionary coefficients. Each group X 2 R d over the latent variables X; the non-negative dictionary A, corresponds to the set of coefficients for one source at one which is constructed by merging a source dictionary learned by direction d. off-the-shelf implementations of standard algorithms with the For localization, we wish to recover X; however, we are direction-dependent transfer functions as described in Section not interested in the coefficient values themselves but rather III-A, is taken as input. We thus avoid the joint optimization whether given coefficients are active or not—the activity of a over A and X which is a major source of non-convexity. coefficient indicates the presence of a source. In other words, However, our choices for non-convex functionals like the we are only concerned with identifying the support of X. Itakura-Saito divergence and the log =` penalty (although the Localization is achieved by selecting the J directions whose latter is quasi-convex) render the whole optimization (9) non- corresponding groups X have the highest norms. convex. 2329-9290 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications standards/publications/rights/index.html for more information. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TASLP.2018.2867081, IEEE/ACM Transactions on Audio, Speech, and Language Processing 5 Algorithm 2 MU for NMF with Sparse-group Sparsity C. Derivation Input: Y, A, , The minimization (9) can be solved iteratively by multi- Output: X plicative updates (MU) which preserve non-negativity when Initialize X = A Y the variables are initialized with non-negative values. The up- Y AX date rules for X are derived using maximization-minimization repeat for the group-sparsity penalty in [33] and for the ` -penalty in for d = 1; : : : ; D do [32]. They amount to dividing the negative part of the gradient by the positive part and raising to an exponent. In the following d +kvec(X )k d 1 we derive the MU rules for our objective (9). end for Note that the objective is separable over the columns of X if Itakura–Saito then D T 2 X A (Y Y ) X X C (x) = D(yk Ax) +  log( +kx k ) + kxk ; (14) d 1 1 T b1 A Y + P + d=1 else if Euclidean then F FK where y 2 R , x 2 R are columns of Y and X + + A Y P (i) respectively. With x as the current iterate, the gradient of X X T b A Y (14) with respect to one element x of x when D(k) is the end if Itakura–Saito divergence is given by Y AX (i) (i) 2 until convergence r C (x ) = y (Ax ) a x f fk (i) 1 + (Ax ) a +  + ; fk (i) 1) Attempt localization on a coarse grid, +kx k 2) Identify the top T direction candidates, (15) 3) Construct the model matrix using the T candidates and where a = [A] are entries of A. The update rule is fk fk their neighbors at a finer resolution, then given as 4) Rerun the NMF localization. 2 The final algorithm for source localization by NMF with and (i) r C (x ) (i+1) (i) without multiresolution is shown in Algorithm 3. Since (9) x = x k k + (i) r C (x ) is non-convex, different initializations of X might lead to 0 1 P 2 different results. We thus later run an experiment to test the (i) y (Ax ) a f fk (i) f f influence on the actual localization performance in Section IV. @ A = x P ; k 1 (i) (Ax ) a +  + fk (i) +kx k Algorithm 3 Direction of Arrival Estimation by NMF (16) Input: Observation y(t), Number of sources J , Parameter for where is a corrective exponent [32]. The updates in matrix group sparsity , Parameter for ` sparsity , magnitudes of 2 1 form are shown in Algorithm 2 where the multiplication , directional transfer functions fH g , source model W j j2D division, and power operations are elementwise and P is a b b b Output: Directions of arrival  = f ; : : : ;  g 1 J matrix of the same size as X. Also shown are the updates Construct A diag(H )W; : : : ; diag(H )W 1 D for using the Euclidean distance following [32], [36] where Construct Y jSTFTfygj [v] = maxfv; g is a thresholding operator to maintain non- Factorize Y  AX using Algorithm 2 negativity with  = 10 . Calculate D = fkvec(X )k for d = 1; 2; : : : ; Dg d 1 if Multiresolution then D. Algorithm Identify T candidates and their RT neighbors t=T;r=R fH g The discretization of the azimuth into D evenly-spaced t;r t=1;r=0 directions has a direct correspondence with the localization Construct A diag(H )W; : : : ; diag(H )W 1;0 T;R e e errors. On the one hand, a course discretization limits the Factorize Y  AX using Algorithm 2 localization accuracy to approximately the size of the dis- Calculate D = fkvec(X )k for d = 1; 2; : : : ; (R + 1)Tg d 1 cretization bin . On the other hand a fine discretization end if may warrant a smaller error floor, but it implies a model matrix J fIndices of the J largest elements in Dg n o with a higher coherence only worsening the ill-posedness of b b j j 2 J the optimization problem (9). It additionally results in a larger matrix which hampers the matrix factorization algorithms that are of complexity O(FKDN ) per iteration [16], [33]. A common compromise is the multiresolution approach [12], [8] IV. E XPERIM ENTAL RESULTS in which position estimates are first computed on a coarse A. Devices grid, and then subsequently refined on a finer grid concentrated around the initial guesses. We test the following strategy: We ran experiments using three different devices: 2329-9290 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications standards/publications/rights/index.html for more information. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TASLP.2018.2867081, IEEE/ACM Transactions on Audio, Speech, and Language Processing 6 a) LEGO1 and LEGO2: The first two devices are struc- TABLE I PARAM ETERS PER DEVICE. tures composed of LEGO bricks as shown in Figure 2. Since we aimed for diverse random-like scattering, we stacked LEGO1 LEGO2 KEMAR haphazard brick constructions on a base plate of size 25 cm Frequency 3000-8000 Hz 3000-8000 Hz 0-8000 Hz 25 cm along with one omnidirectional microphone. The Prototypes  = 10, = 10  = 10, = 1  = 10, = 0:1 heights of the different constructions vary between 4 and 12.5 USM ( = 0)  = 0:1, = 10  = 10, = 1  = 100, = 10 USM ( = 2)  = 1, = 1  = 1, = 1  = 1, = 1 cm. We did not attempt to optimize the layout. The only Multiresolution  = 0:1, = 1  = 100, = 0:1 - assumption we make regarding the dimensions of the device is that some energy of the target source resides at frequencies where the device observably interacts with the acoustic wave. performance averaged for one and two sources were chosen. We note that the problem of designing and optimizing the We additionally tested whether the lower frequencies can be structure to get a desired response is that of inverse obstacle ignored in localization since, as mentioned before, for the scattering which is a hard inverse problem in its own right relatively small scatterers the lower frequency range lacks [37], [38]. For the present work, we simply observe that our variation and is thus uninformative. Moreover, truncating the random structures result in the desired random-like scattering. lower frequencies would help reduce coherence between the The directional impulse response measurements were then directional transfer functions. The final parameters and used done in an anechoic chamber where the device was placed on frequency range are summarized in Table I. a turntable as shown in Figure 2(c) and a loudspeaker at a Source Dictionary: For speech localization, we test two distance of 3.5 m emitted a linear sweep. We note that the source dictionaries. For the first experiment, we use a dictio- turntable is symmetric, so its effect on localization in the nary of prototypes of magnitude spectra from 4 speakers (2 horizontal plane, if any, is negligible. The duration of the female, 2 male) in the test set. measured impulse responses averages around 20 ms. Figures For the second experiment, we use a more general universal 1(b) and 1(c) show the corresponding magnitude response speech model (USM) [17] learned from a training set of 25 for the two devices. Due to their relatively small size, they female and 25 male speakers, also from TIMIT. We use a mostly scatter high frequency waves and so the response at random initialization for the NMF when learning the USM. lower frequencies is comparably flat. We thus expect that only Each speaker in the training set is modeled using K = 10 sources with enough energy in the higher range of frequencies F500 atoms, thus the final USM is W 2 R . In total, we use can be accurately localized. four versions of the USM in the experiments. Two versions b) KEMAR: The third device is KEMAR [39] which is correspond to learning the model by minimizing either the modeled after a human head and torso so that its response Itakura–Saito divergence or the Euclidean distance. The other accurately approximates a human HRTF. The mannequin’s two versions correspond to learning the model using only the torso measures 44 24 73 cm and the head’s diameter is 18 subset of frequencies to be utilized in the localization. cm. The duration of the impulse response is 10 ms. Figure 1(d) shows the corresponding magnitude response. As can be seen, C. Evaluation the variation across the directions is very smooth which we expect to result in worse monaural localization performance. We estimate the azimuth of the sources in the range [0 ; 360 ). The model (8) assumes a discrete set of 36 evenly spaced directions while the sources are randomly placed on B. Data and parameters a finer grid of 360 directions. Given the estimated directions The mixtures are created by first convolving the source ^ ^ ^ = f ; : : : ;  g and the true directions  = f ; : : : ;  g, 1 J 1 J signals with the impulse responses and then corrupting the the localization error is computed as the average absolute result by additive white Gaussian noise at various levels of difference modulo 360 as signal-to-noise ratio defined as min (  + 180) mod 360 180 ; (17) (j) j k s (t) h (t)k  J j j 2 j2J SNR = 20 log dB: ke(t)k where  : J ! J is a permutation that best matches the We use frame-based processing using the STFT with a Hann ordering in  and . window of length 64 ms, with a 50% overlap. The number of For each experiment, we test 5000 random sets of directions. iterations in NMF (Algorithm 2) was set to 100. We emphasize that we have been careful to avoid an inverse The test data contains 10 speech sources (5 female, 5 male) crime, and we produced the measurements by convolution in from TIMIT [40] sampled at 16000 Hz. The duration of the time domain, not by multiplication in the STFT domain. the speech varies between 3.1 and 4.5 s and the maximum Thus in this set up, the reported errors also reflect the modeling amplitude is normalized to 1 so that all sources have the mismatch. same volume. No preprocessing of the sources such as silence Following [41], we report the accuracy defined as the removal was done; when mixing two sources, the longest one percentage of sources localized to their closest 10 -wide bin as was truncated. well as the mean error for those accurately localized sources. A separate validation set was used to select the best sparsity For 36 bins, there is an inherent average error of 2:5 . Thus, parameters for each device. The parameters that gave the best ideally the accuracy would be 100% and the error 2:5 . 2329-9290 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications standards/publications/rights/index.html for more information. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TASLP.2018.2867081, IEEE/ACM Transactions on Audio, Speech, and Language Processing 7 (a) (b) (c) Fig. 2. Sensing devices made of LEGO bricks. The location of the microphone is marked by an “x”. (a) LEGO1. (b) LEGO2. (c) Calibration setup in an anechoic chamber. Additionally, we report the accuracy per source, that is, the The accuracy rate and the mean localization error for the rate at which a source is correctly localized regardless of the different devices are shown in Table III. In the one source other sources. case, all devices perform well. The mean error achieved by the devices for one white source is close to the ideal grid-matched 2:5 which is better than the reported 4:3 and 8:8 in [8] D. NMF Initialization using an HMM. For two sources, the accuracy of the LEGO Since in a non-convex problem different initializations devices is still high, though lower than for one source. At the might lead to different results, we run an experiment to test same time the accuracy of KEMAR deteriorates considerably. the effect of the initialization of X on the localization per- This is consistent with the intuition that interesting scattering formance. The experiment consists of 300 tests for localizing patterns such as those of the LEGO devices result in better one female speaker using LEGO2 and a USM. We compare localization. T 2 the initialization mentioned in Algorithm 2 (X = A Y) to We also test the effect of the discretization on the local- different random initializations. The estimated DoAs were in ization performance. In Table IV, we report the localization agreement for both initializations 98.67% of the time with errors using LEGO1 at three different resolutions: 2 , 5 , Itakura-Saito and 97% with Euclidean distance. We show in and 10 . We find that improving the resolution results in Table II the localization accuracy rates for that experiment more accurate localization for both one and two sources which are comparable. This means that there are either “hard” but the average error is still larger than the ideal 0:5 and situations where localization fails regardless of the initializa- 1:25 for the 2 and 5 resolutions respectively, especially tion or “easy” situations where it succeeds regardless of the for two sources. Since white sources are flat, this observation initialization. Certainly, tailor-made initializations in the spirit highlights a limitation of the device itself in terms of coherent of [42], [43] may work slightly better, but such constructions or ambiguous directions. are outside the scope of this paper. Additionally, we note that in these works initializations are constructed for the basis matrix. In our case, this matrix is A which is given as input F. Speech Localization with Prototypes to the algorithm. We now turn to speech localization which is considerably more challenging than white noise, especially in the monaural TABLE II setting. Using the three devices, we test the localization of one L OCALIZATION ACCURACY FOR DIFFERENT NMF INITIALIZATIONS. and two speakers at 30 dB SNR. In this first experiment, we use a subset of 4 speakers from the test data (two female, two A Y Random male) and consider an easier scenario where we assume know- Itakura-Saito 93.00% 93.33% Euclidean 89.67% 90.00% ing the exact magnitude spectral prototypes of the sources. Still, localization with colored prototypes is harder compared to noise prototypes (as in [7]). This scenario serves as a gauge for the quality of the sensing devices for localizing speech E. White Noise Localization sources. We organize the results by the number of sources as We first test the localization of one and two white sources at well as by whether the speaker is male or female. We expect various levels of SNR using Algorithm 1. Each source is 0.5 s the localization of female speakers to be more accurate since of white Gaussian noise. We compare the performance using they have relatively more energy in the higher frequency range the three devices LEGO1, LEGO2, and KEMAR described where the device responses are more informative. above. For white sources, using the full range of frequencies, The results for the three devices are shown in Table V. not a subset, was found to perform better. As expected the overall localization performance by the less smooth LEGO scatterers is significantly better than by KE- We use a deterministic initialization to facilitate reproducibility and multithreaded implementations. MAR. Also as expected, the localization of male speech is 2329-9290 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications standards/publications/rights/index.html for more information. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TASLP.2018.2867081, IEEE/ACM Transactions on Audio, Speech, and Language Processing 8 TABLE III ERROR FOR WHITE NOISE LOCALIZATION AT A DISCRETIZATION OF 10 LEGO1 LEGO2 KEMAR SNR Accuracy Mean Accuracy Mean Accuracy Mean One source 30 dB 99.56% 2:63 96.64% 2:54 92.06% 2:72 20 dB 99.58% 2:63 96.54% 2:53 92.12% 2:71 10 dB 99.60% 2:60 96.42% 2:53 91.78% 2:73 Two sources 30 dB 94.72% 2:75 83.64% 2:62 25.22% 3:44 20 dB 94.54% 2:75 83.34% 2:62 25.48% 3:45 10 dB 92.32% 2:73 81.52% 2:62 21.20% 3:59 TABLE IV D ISCRETIZATION COMPARISON FOR WHITE NOISE LOCALIZATION USING LEGO1. 2 5 10 SNR Accuracy Mean Accuracy Mean Accuracy Mean One source 30 dB 100.0% 0:52 100.0% 1:27 99.56% 2:63 20 dB 100.0% 0:52 100.0% 1:27 99.58% 2:63 10 dB 100.0% 0:54 100.0% 1:26 99.60% 2:60 Two sources 30 dB 98.56% 0:70 98.78% 1:43 94.72% 2:75 20 dB 98.50% 0:71 98.70% 1:43 94.54% 2:75 10 dB 97.30% 0:82 97.32% 1:47 92.32% 2:73 worse than female speech except for LEGO1. Similar to the result of the absence of spectral variation for male speech in white noise case, the accuracy for localizing two sources is the used higher frequency range. lower in comparison to one source. Moreover, we find that For two sources, the number of outliers increases for both the presence of one female speaker improves the accuracy for types as seen in Figure 3(b). We also plot in Figure 3(a) LEGO2 and KEMAR, most likely due to the spectral content. the confusion matrix for the case of using prototypes which has less outliers in comparison due to the stronger model. Note that outliers exist even with white sources as shown in G. Speech Localization with USM Figure 3(c), which points to a deficiency of the device itself as mentioned before. However, we note that while the reported In this experiment, we switch to a more realistic and accuracy corresponds to correctly localizing the two sources challenging setup where we use a learned universal speech simultaneously, the average accuracy per source which reflects model. We compare the performance of the Itakura–Saito the number of times at least one of the sources is correctly divergence to that of the Euclidean distance in the cost function localized is often higher. For instance for female speakers, the (9). The accuracy and mean error for the three devices are accuracy is 53.52% while the average accuracy per source is shown in Table VI. We observe that using the Itakura–Saito higher at 73.93%. The overall best performance is achieved divergence results in better performance in a majority of cases by LEGO2 with Itakura–Saito divergence. which is in line with the recommendations for using Itakura– 1) Finer resolution: As mentioned, one straightforward Saito for audio. improvement to our system is to increase the resolution. We Similar observations as in the previous experiment hold with show in Table VII the result of doubling the resolution from the LEGO scatterers offering better localization than KEMAR. 10 to 5 . For a single female speaker, the error is slightly We find that localizing one female speaker is successful with higher than the ideal average of 1:25 and the accuracy is 93% accuracy. Compared to the use of prototypes, the source improved relative to the initial bin size of 10 . While some im- model is here speaker-independent and the test set is larger provement is apparent for the localization of one male speaker containing 10 speakers; however, the accuracy is still only as well, the mismatch between the useful scattering range and lower by 3-5%. We also note that the mean localization error source spectrum still prevents good performance. However, in is 2:5 which is smaller than the reported 7:7 in [8] with an line with the discussion in Section III-D, localization of two HMM though at a lower SNR of 18 dB. sources is worse than at a coarser grid due to the increased As expected, the localization accuracy for male speakers matrix coherence, with the accuracy dropping from 55% to is lower than for female speakers. Since the mean errors 45% for two female speakers. are however not much larger than the ideal 2:5 , the lower accuracy points to the presence of outliers. We thus plot 2) Multiresolution: Next we tested the multiresolution confusion matrices in Figures 4 and 3 for female and male strategy where we refine the top estimates on the coarse grid speakers respectively. On the horizontal axis, we have the using a search on a finer grid. We arbitrarily use the best 7 estimated direction which is one of 36 only. First, we look candidates at the 10 grid spacing, and redo the localization at the single source case in Figures 3(a) and 4(a) where we at a finer 2 grid centered around the 7 initial guesses. The can clearly see the few outliers away from the diagonal. The hyperparameters for localization on the finer grid were tuned number of outliers is larger for male speakers which is a direct on a separate validation set and are given in Table I. 2329-9290 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications standards/publications/rights/index.html for more information. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TASLP.2018.2867081, IEEE/ACM Transactions on Audio, Speech, and Language Processing 9 TABLE V ERROR FOR SPEECH LOCALIZATION USING PROTOTYPES AT A DISCRETIZATION OF 10 LEGO1 LEGO2 KEMAR Accuracy Mean Per Source Accuracy Mean Per Source Accuracy Mean Per Source female speech 98.48% 2:53 98.48% 96.94% 2:51 96.94% 79.74% 3:42 79.74% male speech 98.76% 2:56 98.76% 96.00% 2:53 96.00% 72.06% 3:35 72.06% female/female 75.24% 2:46 87.07% 78.28% 2:40 88.31% 11.66% 3:50 46.70% female/male 76.60% 2:44 87.79% 74.36% 2:41 86.17% 10.90% 3:59 44.47% male/male 80.24% 2:43 89.82% 74.22% 2:39 86.04% 9.24% 3:91 43.09% TABLE VI ERROR FOR SPEECH LOCALIZATION USING A USM AT A DISCRETIZATION OF 10 LEGO1 LEGO2 KEMAR Accuracy Mean Per Source Accuracy Mean Per Source Accuracy Mean Per Source Itakura–Saito female speech 93.20% 2:67 93.20% 93.72% 2:54 93.72% 46.56% 3:33 46.56% male speech 89.80% 2:74 89.80% 87.70% 2:66 87.70% 35.56% 3:46 35.56% female/female 26.38% 2:64 54.65% 53.52% 2:42 73.93% 7.60% 3:90 35.29% female/male 24.76% 2:77 54.42% 49.22% 2:49 70.93% 7.40% 4:01 35.56% male/male 19.78% 3:02 50.61% 39.54% 2:63 65.45% 7.44% 4:36 33.76% Euclidean female speech 85.60% 2:79 85.60% 91.26% 2:57 91.26% 29.26% 3:75 29.26% male speech 76.00% 2:78 76.00% 86.74% 2:65 86.74% 23.24% 3:78 23.24% female/female 29.34% 2:88 56.66% 46.86% 2:48 69.89% 4.62% 4:40 23.75% female/male 30.62% 2:88 57.55% 42.28% 2:58 66.40% 3.36% 4:34 21.19% male/male 23.72% 2:96 52.67% 35.50% 2:74 62.71% 2.80% 3:97 18.60% TABLE VII ERROR FOR SPEECH LOCALIZATION AT A RESOLUTION OF 5 . LEGO1 LEGO2 Accuracy Mean Per Source Accuracy Mean Per Source female speech 97.08% 1:59 97.08% 99.72% 1:41 99.72% male speech 93.26% 1:76 93.26% 92.68% 1:57 92.68% female/female 22.24% 1:95 55.25% 43.26% 1:47 71.23% female/male 21.60% 2:14 55.33% 39.66% 1:61 68.82% male/male 15.42% 2:47 50.38% 29.72% 1:87 63.31% As before, multiresolution localization results in some im- well. For two-source localization, however, a good source provement for one source but not for two sources (Table model like prototypes is required. VIII). We show the relevant confusion matrices in Figure 5: the lack of increase in performance can be explained by V. C ONCLUSION the fact that in the second round of localization the included directions are still strongly correlated and the only way to Any scattering that causes spectral variations across di- resolve the resulting ambiguities is through more constrained rections enables monaural localization of one white source. source models. Additionally, the set of correlated directions are On the other hand, more complex and interesting scattering not necessarily concentrated around the true direction which patterns are needed to localize multiple sources. As shown might explain the drop in accuracy for LEGO1. Overall, it by our “random” LEGO constructions, interesting scattering seems the extra computation for the multiresolution approach is not hard to come by. In order to localize general, non-white does not bring about significant improvements compared to sources, one further requires a good source model. using a finer discretization. We demonstrated successful localization of one speaker Finally, in Figure 6, we show a summary of the performance using regularized NMF and a universal speech model. Both our of the different methods for localizing one or two female LEGO scatterers were found to be superior in localization to a speakers using LEGO2 along with the average accuracy and mannequin’s HRTF. Finally, we stress that speech localization error. Note that the results for prototypes use a smaller test is challenging and note that the fundamental frequency of set and that the error is lower bounded by the grid size. We the human voice is below 300 Hz while the range of usable also show the size of the model matrix A from (8) which frequencies for our devices is above 3000 Hz. This discrepancy contributes to the overall complexity of NMF as well as the is responsible for outliers when localizing multiple speakers, actual runtime which depends on the machine. The figure a problem that can potentially be alleviated by increasing the suggests that overall using a USM and a 10 resolution works size of the device or using sophisticated metamaterial-based 2329-9290 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications standards/publications/rights/index.html for more information. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TASLP.2018.2867081, IEEE/ACM Transactions on Audio, Speech, and Language Processing 10 TABLE VIII ERROR FOR SPEECH LOCALIZATION W ITH A MULTIRESOLUTION APPROACH. LEGO1 LEGO2 Accuracy Mean Per Source Accuracy Mean Per Source female speech 96.94% 1:15 96.94% 99.08% 0:70 99.08% male speech 86.00% 1:26 86.00% 90.62% 0:95 90.62% female/female 17.88% 1:80 56.66% 32.26% 1:08 65.39% female/male 17.64% 1:87 56.17% 29.06% 1:33 63.47% male/male 13.84% 2:19 52.72% 20.22% 1:64 57.68% Martin Vetterli for numerous insights and discussions, and for suggesting Figure 1. This work was supported by the Swiss National Science Foundation grant number 20FP-1 151073, Inverse Problems regularized by Sparsity. VII. D ISCLAIM ER LEGO is a trademark of the LEGO Group which does not sponsor, authorize or endorse this work. (a) 10 (b) 10 REFERENCES [1] J. Blauert, Spatial Hearing: The Psychophysics of Human Sound Localization, The MIT Press, 1997. [2] A. D. Musicant and R. A. Butler, “The Influence of Pinnae-based Spectral Cues on Sound Localization,” J. Acoust. Soc. Am., vol. 75, no. 4, pp. 1195–1200, 1984. [3] J. J. Rice, B. J. May, G. A. Spirou, and E. D. Young, “Pinna-based Spectral Cues for Sound Localization in Cat,” Hearing Research, vol. 58, no. 2, pp. 132–152, 1992. (c) 5 (d) 5 [4] M. Aytekin, E. Grassi, M. Sahota, and C. F. Moss, “The Bat Head- related Transfer Function Reveals Binaural Cues for Sound Localization in Azimuth and Elevation,” J. Acoust. Soc. Am., vol. 116, no. 6, pp. 3594–3605, 2004. Fig. 3. Confusion matrices for localizing one speaker using LEGO2. Female [5] S. R. Oldfield and S. P. A. Parker, “Acuity of Sound Localisation: speech has less outliers and improving the resolution decreases the number A Topography of Auditory Space. III. Monaural Hearing Conditions,” of outliers. Left: Female speech. Right: Male speech. Perception, vol. 15, no. 1, pp. 67–81, 1986, PMID: 3774479. [6] J. G. Harris, C.-J. Pu, and J. C. Principe, “A Monaural Cue Sound Localizer,” Analog Integrated Circuits and Signal Processing, vol. 23, no. 2, pp. 163–172, May 2000. designs. Perhaps a source model other than the universal dic- [7] Y. Xie, T. Tsai, A. Konneker, B. Popa, D. J. Brady, and S. A. Cummer, tionary could approach the performance of using prototypes. “Single-sensor Multispeaker Listening with Acoustic Metamaterials,” Proc. Natl. Acad. Sci. U.S.A., vol. 112, no. 34, pp. 10595–10598, Aug. Finally, we presented our results for anechoic conditions. Preliminary numerical experiments show that the current ap- [8] A. Saxena and A.Y. Ng, “Learning Sound Location from a Single proach underperforms in a reverberant setting. This shortcom- Microphone,” in Proc. IEEE Int. Conf. on Robotics and Automation, 2009, pp. 1737–1742. ing is partly due to violations of our modeling assumptions. [9] D. El Badawy, I. Dokmanic, ´ and M. Vetterli, “Acoustic DoA Estimation For example, in Eq. (1), the noise is assumed independent by One Unsophisticated Sensor,” in 13th Int. Conf.on Latent Variable of the sources which is no longer true in the presence of Analysis and Signal Separation - LVA/ICA, P. Tichavsky, ´ M. B. Zadeh, O. Michel, and N. Thirion-Moreau, Eds. 2017, vol. 9237 of Lecture reverberation. For practical scenarios it is thus necessary to Notes in Computer Science, pp. 489–496, Springer. extend the approach to handle reverberant conditions as well [10] I. Dokmanic, ´ Listening to Distances and Hearing Shapes: Inverse Prob- as to test the localization performance in 3D i.e., estimate lems in Room Acoustics and Beyond, Ph.D. thesis, Ecole polytechnique fed ´ erale ´ de Lausanne, 2015. both the azimuth and the elevation. For accurate localization [11] I. Dokmanic ´ and M. Vetterli, “Room Helps: Acoustic Localization in elevation, we expect that a taller device with more variation with Finite Elements,” in Proc. IEEE Int. Conf. Audio, Speech, Signal along the vertical axis would perform better. Since we only use Process., Mar. 2012, pp. 2617–2620. [12] D. Malioutov, M. Cetin, and A. S. Willsky, “A Sparse Signal Recon- one microphone, the number of ambiguous directions would struction Perspective for Source Localization with Sensor Arrays,” IEEE likely grow considerably in 3D making the problem compa- Trans. Signal Process., vol. 53, no. 8, pp. 3010–3022, Aug. 2005. rably harder. Other interesting open questions include blind [13] P. T. Boufounos, P. Smaragdis, and B. Raj, “Joint Sparsity Models for Wideband Array Processing,” in SPIE, 2011, vol. 8138, pp. 81380K– learning of the directional transfer functions and understanding 81380K–10. the benefits of scattering in the case of multiple sensors. [14] E. Cagli, D. Carrera, G. Aletti, G. Naldi, and B. Rossi, “Robust DOA Estimation of Speech Signals via Sparsity Models Using Microphone Arrays,” in Proc. IEEE Workshop on Applications of Signal Process. VI. ACKNOWLEDGMENT Audio Acoust., Oct. 2013, pp. 1–4. [15] D. D. Lee and H. S. Seung, “Learning the Parts of Objects by Non- We thank Robin Scheibler and Mihailo Kolundzija ˇ for help negative Matrix Factorization,” Nature, vol. 401, pp. 788–791, Oct. with experiments and valuable comments. We also thank 1999. 2329-9290 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications standards/publications/rights/index.html for more information. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TASLP.2018.2867081, IEEE/ACM Transactions on Audio, Speech, and Language Processing 11 (a) (b) (c) Fig. 4. Confusion matrices for localizing two sources using LEGO2 at a resolution of 10 . (a) With prototypes. (b) With a USM. (c) White sources. Matrix Factorization With Temporal Continuity and Sparseness Criteria,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 15, no. 3, pp. 1066–1074, Mar. 2007. [21] P. Smaragdis, “Convolutive Speech Bases and Their Application to Supervised Speech Separation,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 15, no. 1, pp. 1–12, Jan. 2007. [22] O. Dikmen and A. T. Cemgil, “Unsupervised Single-channel Source Separation using Bayesian NMF,” in Proc. IEEE Workshop on Appli- cations of Signal Process. Audio Acoust., Oct. 2009, pp. 93–96. [23] N. Mohammadiha, P. Smaragdis, and A. Leijon, “Supervised and Unsu- (a) One speaker. (b) Two speakers. pervised Speech Enhancement Using Nonnegative Matrix Factorization,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 21, no. 10, pp. 2140–2151, Oct. 2013. Fig. 5. Confusion matrices for localizing female speech with LEGO2 using [24] P. Smaragdis and J. C. Brown, “Non-negative Matrix Factorization a multiresolution approach. Improving the resolution decreases the number of for Polyphonic Music Transcription,” in Proc. IEEE Workshop on outliers in the one-speaker case but not the two-speaker case. Applications of Signal Process. Audio Acoust., Oct. 2003, pp. 177–180. [25] J. Traa, P. Smaragdis, N. D. Stein, and D. Wingate, “Directional NMF for Joint Source Localization and Separation,” in Proc. IEEE Workshop on Applications of Signal Process. Audio Acoust., 2015, pp. 1–5. [26] M. Kowalski, E. Vincent, and R. Gribonval, “Beyond the Narrowband Approximation: Wideband Convex Methods for Under-Determined Re- verberant Audio Source Separation,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 18, no. 7, pp. 1818–1829, Sep. 2010. [27] L. Parra and C. Spence, “Convolutive Blind Separation of Non-stationary Sources,” IEEE Trans. Speech Audio Process., vol. 8, no. 3, pp. 320– 327, May 2000. [28] V. R. Algazi, R. O. Duda, D. M. Thompson, and C. Avendano, “The CIPIC HRTF Database,” in Proc. IEEE Workshop on Applications of Signal Process. Audio Acoust., 2001, pp. 99–102. [29] M. Ledoux, The Concentration of Measure Phenomenon, Math. Surveys Monogr. American Mathematical Society, Providence (R.I.), 2001. [30] J. Hebrank and D. Wright, “Are Two Ears Necessary for Localization of Sound Sources on the Median Plane?,” J. Acoust. Soc. Am., vol. 56, no. 3, pp. 935–938, 1974. [31] R. M. Reeder, J. Cadieux, and J. B. Firszt, “Quantification of Speech- in-Noise and Sound Localisation Abilities in Children with Unilateral Hearing Loss and Comparison to Normal Hearing Peers,” Audiology and Neurotology, vol. 20(suppl 1), no. Suppl. 1, pp. 31–37, 2015. [32] C. Fevotte and J. Idier, “Algorithms for Non-negative Matrix Factor- ization with the Beta-divergence,” Neural Comput., vol. 23, no. 9, pp. 2421–2456, Sep. 2011. ` ´ [33] A. Lefevre, F. Bach, and C. Fevotte, “Itakura–Saito Non-negative Matrix Fig. 6. Summary of localizing one (left) or two (right) female speakers using Factorization with Group Sparsity,” in Proc. IEEE Int. Conf. Audio, LEGO2. Speech, Signal Process., May 2011, pp. 21–24. [34] D. L. Donoho, “For Most Large Underdetermined Systems of Linear Equations the Minimal l1-norm Solution is also the Sparsest Solution,” [16] C. Fev ´ otte, N. Bertin, and J. Durrieu, “Non-negative Matrix Factor- Comm. Pure Appl. Math, vol. 59, pp. 797–829, 2004. ization with the Itakura-Saito Divergence. With Application to Music [35] J. Friedman, T. Hastie, and R. Tibshirani, “A Note on the Group Lasso Analysis,” Neural Computation, vol. 21, no. 3, pp. 793–830, 2009. and a Sparse Group Lasso,” arXiv, 2010. [17] D. L. Sun and G. J. Mysore, “Universal Speech Models for Speaker [36] A. Cichocki, R. Zdunek, and S. Amari, “New Algorithms for Non- Independent Single Channel Source Separation,” in Proc. IEEE Int. Negative Matrix Factorization in Applications to Blind Source Separa- Conf. Audio, Speech, Signal Process., 2013, pp. 141–145. tion,” in Proc. IEEE Int. Conf. Audio, Speech, Signal Process., May [18] M. N. Schmidt and R. K Olsson, “Single-channel Speech Separation 2006, vol. 5, pp. V621–V624. using Sparse Non-negative Matrix Factorization,” in Interspeech, 2006, [37] D. Colton and R. Kress, Inverse Acoustic and Electromagnetic Scattering pp. 2614–2617. Theory, Applied Mathematical Sciences. Springer, New York, NY, 3 [19] J. Le Roux, F. J. Weninger, and J. R. Hershey, “Sparse NMF – Half- edition, 2013. baked or Well Done?,” Tech. Rep. TR2015-023, Mitsubishi Electric [38] D. Colton, J. Coyle, and P. Monk, “Recent Developments in Inverse Research Laboratories (MERL), Cambridge, MA, USA, Mar. 2015. Acoustic Scattering Theory,” SIAM Review, vol. 42, no. 3, pp. 369– [20] T. Virtanen, “Monaural Sound Source Separation by Nonnegative 414, 2000. 2329-9290 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications standards/publications/rights/index.html for more information. This article has been accepted for publication in a future issue of this journal, but has not been fully edited. Content may change prior to final publication. Citation information: DOI 10.1109/TASLP.2018.2867081, IEEE/ACM Transactions on Audio, Speech, and Language Processing 12 [39] H. Wierstorf, A. Geier, M.and Raake, and S. Spors, “A Free Database of Head-Related Impulse Response Measurements in the Horizontal Plane with Multiple Distances,” June 2016. [40] J. Garofolo, L. Lamel, W. Fisher, J. Fiscus, D. Pallett, and N. Dahlgren, “DARPA TIMIT: Acoustic-phonetic Continuous Speech Corpus,” Tech. Rep., NIST, 1993, distributed with the TIMIT CD-ROM. [41] J. Woodruff and D. Wang, “Binaural Localization of Multiple Sources in Reverberant and Noisy Environments,” IEEE/ACM Trans. Audio, Speech, Language Process., vol. 20, no. 5, pp. 1503–1512, July 2012. [42] D. Kitamura and N. Ono, “Efficient Initialization for Nonnegative Matrix Factorization based on Nonnegative Independent Component Analysis,” in Proc. IEEE Int. Workshop on Acoustic Signal Enhancement, Sep. 2016, pp. 1–5. [43] A. N. Langville, C. D. Meyer, R. Albright, J. Cox, and D. Duling, “Al- gorithms, Initializations, and Convergence for the Nonnegative Matrix Factorization,” arXiv, 2014. 2329-9290 (c) 2018 IEEE. Personal use is permitted, but republication/redistribution requires IEEE permission. See http://www.ieee.org/publications standards/publications/rights/index.html for more information.

Journal

Electrical Engineering and Systems SciencearXiv (Cornell University)

Published: Jan 11, 2018

There are no references for this article.