Get 20M+ Full-Text Papers For Less Than $1.50/day. Start a 14-Day Trial for You or Your Team.

Learn More →

Effects of Artificial Synthetic Speech Control of SNR and Speech Rate on the Intelligibility of Train Station Announcements

Effects of Artificial Synthetic Speech Control of SNR and Speech Rate on the Intelligibility of... An experimental study on the effect of the speech characteristics of the signal-to-noise ratio (SNR) and speech rate on the intelligibility of announcements at railway stations was conducted using an artificial synthetic voice. Synthesized speech has recently been used in noisy environments both indoors and outdoors, but unlike its use in quiet environments, when the environment is noisy, the intelligibility of announcements may be reduced. For railway station announcements, while natural spoken voices are currently used for multilingual announcements and disaster response broadcasts, deep neural network synthesized voices, which use deep learning, have also been adopted. However, the effect of the acoustic characteristics such as the SNR and speech rate on the intelligibility of reproduced announcements in noisy public spaces such as railway stations has not yet been clarified from a practical viewpoint. In this paper, in order to determine the appropriate SNR and speech rate for synthetic voice announcements in railway stations, auditory impressions of announcements with varying SNR and speech rate were evaluated by participants using a five-point scale. Based on the evaluations, the appropriate conditions for the broadcast of synthetic voice announcements at the ticket gate and on the platform of a station are discussed. Keywords Synthetic voice · Intelligibility of announcement · Sound environment of railway station · Signal-to-noise ratio (SNR) · Speech rate · Subjective evaluation 1 Introduction The ability to accurately grasp spoken information in public places can have a significant impact on the convenience of In public places, the sound environment is important for the space. The, improving the quality of information trans- speech communication in terms of speech transmission mission by speech is important for reasons such as increased performance. In noisy places, this may correspondingly safety and convenience, diversification of the information make speech communication more difficult. The audibil- provided to users, and the need to create an environment that ity of announcements in information transmission is greatly takes into account various types of disabilities. In particu- affected by the background noise (BGN) in the space [1] lar, as shown in these references, information on the use of and speech-specific characteristics such as speech rate [2]. public transport, such as railways, is often broadcast over loudspeakers, and it is important to ensure the intelligibil- B Mizuki Maruoka ity of this information. Currently, announcements such as 7522565@ed.tus.ac.jp those described above are still widely used in broadcasts B Takumi Asakura by a human voice or synthesized speech based on a wave- t_asakura@rs.tus.ac.jp form connection method [3], which is based on the recorded 1 sound of a human voice. In addition to the broadcast volume, Department of Mechanical and Aerospace Engineering, the speech speed also depends on the speaker, and there are Graduate School of Science and Technology, Tokyo University of Science, Noda-Shi, Chiba, Japan no clear rules for the playback conditions of announcement broadcasts. For this reason, there are situations where the Department of Urban and Civil Engineering, Graduate School of Science and Engineering, Ibaraki University, Hitachi-shi, announcements are perceived as too loud or where almost Ibaraki, Japan nothing can be heard. As this method of synthetic speech is Department of Mechanical and Aerospace Engineering, synthesized on the basis of the natural voice, it is difficult to Faculty of Science and Technology, Tokyo University of create multilingual broadcasts, and in the event of a disaster, Science, Noda-shi, Chiba, Japan 123 Acoustics Australia station staff have to deal with the situation by broadcasting importance of speech intelligibility in noisy environments, in their own voices. and there have been many studies on the optimal volume and Artificial synthesized speech has recently been used in var- speech rate of broadcasts in public spaces such as airports and ious places, with Apple’s Siri, Amazon’s Alexa, Microsoft’s train stations [1, 2, 24] and under high noise levels [25–33]. Cortana, and the Google Assistant being representative However, it is difficult to immediately apply these findings to examples [4]. Although some algorithmic improvements the environments in train stations because the acoustic char- have been studied to improve the intelligibility of broadcasts acteristics of the sound environment can differ depending made by natural or synthetic voices in noisy environments, on the location. Moreover, as non-combustibility, durability, synthetic voices, which tend to have mechanical intonation and water resistance are emphasized when designing railway different from that of natural voices, can be particularly stations, building materials are often acoustically reflective, difficult to understand [5–9]. Furthermore, Sato et al. [10] which results in a worse sound environment. Indeed, many experimentally investigated the relationship between the measurements of noise levels in railway stations have already improvement of intelligibility and noisiness by increasing been carried out [34, 35], and Bandyopadhyay et al. [36] the volume and pointed out that higher-volume loudspeaker suggested that the sound pressure levels (SPLs) of BGN and broadcasts may cause discomfort, and that the increase in loudspeaker sound on the platform can cause significant dis- noisiness may be more pronounced than the improvement in comfort to users, as they are largely above acceptable daytime intelligibility. noise levels. Furthermore, although the Architectural Insti- In recent years, the aging of the domestic population is tute of Japan Environmental Standard [37] sets a speech rate progressing in Japan [11], and it has been increasingly noted of 5.5 mora/s, which is derived from the results of a study that people are having trouble obtaining information on how using broadcasts at this rate, there are many other studies that to use and where to board public transportation. Currently, mention the possibility that the comprehensibility of infor- various efforts are being made to strengthen the communica- mation broadcasts may change depending on the speech rate tion of information concerning public transportation [12–18]. [2, 32, 33]. Under these circumstances, the development of For example, in 2020, JR East adopted Toshiba’s ToSpeak G3 clear voice transmission that is efficient and does not cause text-to-speech software as a tablet terminal to be carried by discomfort in the sound environment of station premises is train crews and station employees [19] and installed HOYA’s essential. ReadSpeaker text-to-speech software (hereinafter referred to In this study, the influence of two factors, the SNR and the as HOYA Broadcasting) in the concourses and platform of speech rate, on the auditory impression of announcements the Shinkansen [20]. Conventionally, ATOS (Autonomous was examined. Then, the appropriate conditions for syn- decentralized Transport Operation control System) broad- thetic voice broadcasts were examined. Section 2 describes casts and COSMOS (Computerized Safety, Maintenance the outline and procedure for the experiments, while Sect. 3 and Operation Systems of Shinkansen) broadcasts using the presents the results of each experiment. From the results, the waveform connection method have been used, with ATOS appropriate conditions for the broadcast of synthetic voice for conventional lines and COSMOS for the Shinkansen [21]. announcements at each location in a railway station are dis- Recently, synthetic speech using the DNN method [22], as cussed in Sect. 4. typified by HOYA broadcasts, is being introduced. By pro- moting the introduction of the announcements made with the DNN system, the time and cost associated with record- 2 Methods ing and re-recording by a human announcer can be reduced. In addition, according to the guidelines of the Japan Tourism This study investigates the appropriate SNR and speech rate Agency [23], in addition to multilingual broadcasts in four for synthetic voice announcements at railway stations for languages (Japanese, English, Chinese, and Korean), which normal-hearing people, using announcements made by DNN are in high demand in Japan, it is possible not only to make synthesized speech. As BGN, two locations, the ticket gate emergency announcements using synthetic voice, but also to and the platform of the station, were considered. They are make use of the advantages of instantaneous speech synthe- listed in the Barrier-Free Improvement Guidelines of the sis to make train announcements with timetables different Ministry of Land, Infrastructure and Transport [38] as places from that originally planned, and to make announcements where acoustic facilities should be provided at railway sta- that are applicable for only short periods of time. Thus, if tions, and noise level surveys have been conducted for both DNN synthetic speech can be used effectively in stations, locations [39]. Previous studies [2, 32, 33] have also reported it will be possible to create announcements freely, regard- appropriate guidelines for the speech rate of announcements less of the language or situation. However, in practice, the in public spaces in Japan. In order to investigate the effects effectiveness of synthetic voice for use in stations has not of SNR and speech rate on auditory impression separately, been verified. For example, Tachibana [24] emphasized the the appropriate SNR range was first found by setting the 123 Acoustics Australia Table 1 Conditions of the generated announcement using WaveNet in Table 2 Setting conditions and detailed content of the BGN Google text-to-speech Recording method Binaural recording Level of voice presentation (dB) 60, 65, 70, 75 (75 dB is platform only) Length of sound source 15 (s) Length of announcement (s) About 10 Sound source status 60 dB Small amount of Speech rate (mora/s) 7 (ticket gate) pedestrian traffic 65 dB Moderate amount of pedestrian traffic speech rate considered appropriate beforehand, that is, with- 70 dB Large amount of out considering the acoustic characteristics of the space. pedestrian traffic Then, experiments were used to clarify the optimal speech Sound source status 60 dB No trains are stopping on rate at each location in the railway station. More specifically, (platform) the platform referring to previous studies [1, 2] conducted using natu- 65 dB A train is stopping on the ral voice announcements, Experiment 1 was first conducted opposite platform to examine the appropriate SNR. The obtained SNR results 70 dB A train is stopping on your platform were then introduced, and Experiment 2 was conducted next 75 dB A train is running on your to examine the appropriate speech rate. platform Twelve students in their 20s participated in the experiment 80 dB A train is running on the (6 male, 6 female). Participants were recruited by email to opposite platform students at the Faculty of Science and Technology, Tokyo (wheel noise reaches the University of Science, and received no remuneration for their listener directly) participation. In accordance with EN 50332-2 proposed by CENELEC as a sound pressure regulation for portable audio players, the study was not invasive, and written consent for name, destination name, line number, train type, number of the experiment was obtained from all participants. Experi- cars, etc.) included in each announcement were changed so mental collaborators were briefed on the purpose of the study that the participants could not predict the next sentence when and the experimental methods, and anonymization and use listening to the announcement in the experiment. The BGN of the data. Furthermore, prior to the study, participants were for this experiment was recorded using an earphone-type asked about their hearing and were assured that both ears binaural microphone (Adphox BME-200) at several railway were normal. stations in the Tokyo metropolitan area. Sound sources in the range of noise levels likely to occur at each location were 2.1 Experiment 1 selected from the sound source data recorded for 15 s each time. Because of this, the frequency and time-varying charac- Participants were instructed to listen to an information broad- teristics of the BGN at each noise level were different from cast superimposed on BGN through headphones and to eval- each other. The details of the experimental BGN and their uate the announcement based on their listening impressions. frequency characteristics are shown in Table 2 and Fig. 1, Two evaluation items, “listening difficulty” and “noisiness”, respectively. The 3 kHz peak seen in the BGN at the ticket were selected with reference to previous studies [1, 40]. A gate is the sound of the ticket machine. A precision sound five-point Likert scale was used, with “1  not at all (全 level meter (Rion NL-52) was used to measure the BGN level く ~ ない)”, “2  not very much (それほど ~ ない)”, “3 during the recording. For the experiment, headphones (Sony slightly (多少 ~)”, “4  much (だいぶ ~)”, and “5  MDR-M1ST), an audio interface (RME Fireface 802), and a very much (非常に ~)”, for each of “listening difficulty” and laptop computer were used. “noisiness”. The participants were also asked to assume a Table 3 shows the auditory conditions presented to the situation in which they were trying to navigate by relying on participants in the experiment with BGN at the ticket gate. audio information at a station they were using for the first Each combination of sound sources was presented once at time. The announcement used in this experiment was cre- random. Then, participants evaluated a total of 72 conditions, ated using the WaveNet voice of Google text-to-speech [41], with six levels of SNR for each of three levels of BGN for a speech synthesis application programming interface (API) the conditions at the ticket gate, and four kinds of voice pitch provided by Google. The various conditions of the informa- (low male, low female, high male, and high female). In the tion broadcasts created are shown in Table 1. The speech rate experiment using the BGN of the platform, only one type of was set to 7 mora/s with reference to previous studies [2, 32, information broadcast was used for the female voice, as in 33] and HOYA broadcasts. Additionally, the words (station the HOYA broadcast, because no significant differences for 123 Acoustics Australia (a) Table 5 The SNR setting conditions Place Ticket gate Platform BGN(dB) 606570 60 65 70 75 SNR(dB) +8+7+4+11 +11 +9 +4 Ticket gate (60 dB) Ticket gate (65 dB) 20 dB Table 6 Auditory presentation conditions for conditions at the ticket Ticket gate (70 dB) gate and the platform 100 1000 10000 BGN (dB) 60, 65, 70, 75 (75 dB is platform only) Frequency (Hz) Speech rate (mora/s) 5.5–8.5 Gender of voice (–) (Pitch of voice) low female (b) 2.2 Experiment 2 The participants and equipment used in this experiment were Platform (60 dB) Platform (65 dB) the same as in Experiment 1. Participants were instructed to Platform (70 dB) evaluate their subjective impressions of the broadcast infor- Platform (75 dB) 20 dB mation and BGN, which were presented to them through Platform (80 dB) headphones. Three evaluation items were selected: “listen- 100 1000 10000 ing difficulty” and “noisiness”, which were the same as the Frequency (Hz) evaluation items in Experiment 1, plus “strangeness”, for evaluating the unnaturalness for the speech speed of the Fig. 1 Frequency characteristics of the BGN at (a) the ticket gate and generated voice, with reference to a previous study [2]. A (b) the platform five-point Likert scale of “1  not at all (全く ~ ない)”, “2 not very much (それほど ~ ない)”, “3  slightly (多少 ~)”, “4  much (だいぶ ~)” and “5  very much (非常に ~)” Table 3 Conditions of the auditory stimuli at the ticket gate was used for the “listening difficulty” and “noisiness”, and BGN (dB) 60, 65, 70 for the evaluation of “strangeness”, the speed that was felt SNR(dB) 0 to +15every3dB appropriate for a railway station information broadcast was Gender (–) (Pitch of voice) high female, low female, high set as the standard of “feels appropriate (ちょうどよい)”, male, low male while the speed of “−2  feels slow (遅く感じる)”, “−1 feels slightly slow (やや遅く感じる)”, “0  feels appro- priate (ちょうどよい)”, “1  feels slightly fast (やや速く Table 4 Conditions of the auditory stimuli on the platform 感じる)”, and “2  feels fast (速く感じる)” were selected. The participants were asked to evaluate the announcements in BGN (dB) 60, 65, 70, 75, 80 the same way as in Experiment 1: they were asked to assume SNR (dB) 0 to +15, 0 to +9 every 3 dB (at 80 dB) a situation in which they were trying to move around at a Gender (–) (Pitch of voice) low female station they were using for the first time and were relying on audio information. The creation and recording methods for the announcements and BGN used in this experiment were the gender of the voice were found in the experiment of the also the same as in Experiment 1. The SNR was set as shown ticket gate (see below for details). The auditory conditions in Table 5, referring to the results of Experiment 1. However, presented to the participants are shown in Table 4. Note that as described below, an appropriate SNR could not be found the upper limit of the SNR was set to +9 dB so that the for the 80 dB BGN of the platform, so it was excluded from information broadcast was kept below 90 dB for the 80 dB this experiment. The auditory conditions are shown in Table BGN of the platform. That is to say, participants evaluated a 6. The participants evaluated a total of 28 conditions, with total of 28 conditions, with six levels of SNR (four levels in four levels of speech speed for each of three levels of BGN one part) for each of five levels of BGN for the conditions for the conditions at the ticket gate and four levels of BGN on the platform. for those at the platform. Relative level (dB) Relative level (dB) Acoustics Australia (a) 5.0 5.0 (b) Female 1 Female 1 Female 2 Female 2 4.0 4.0 Male 1 Male 1 Male 2 Male 2 3.0 3.0 2.0 2.0 1.0 1.0 +0 +3 +6 +9 +12 +15 +0 +3 +6 +9 +12 +15 SNR (dB) SNR (dB) (c) 5.0 5.0 (d) Female 1 Female 1 Female 2 Female 2 4.0 4.0 Male 1 Male 1 Male 2 Male 2 3.0 3.0 2.0 2.0 1.0 1.0 +0 +3 +6 +9 +12 +15 +0 +3 +6 +9 +12 +15 SNR (dB) SNR (dB) (e) 5.0 5.0 (f) Female 1 Female 1 Female 2 Female 2 4.0 4.0 Male 1 Male 1 Male 2 Male 2 3.0 3.0 2.0 2.0 1.0 1.0 +0 +3 +6 +9 +12 +15 +0 +3 +6 +9 +12 +15 SNR (dB) SNR (dB) Fig. 2 Transition of “listening difficulty” under BGN levels of a 60 dB, b 65 dB, and c 70 dB, and transition of “noisiness” under BGN levels of d 60 dB, e 65 dB, and f 70 dB, in the ticket gate 3 Results as factors, without including the gender of the participants, for each of the ratings of listening difficulty and noisiness. 3.1 Experiment 1 Then the main effects and interactions of SNR and BGN levels were found to be significant (p < 0.01). Multiple com- First, in order to examine the difference between male and parisons (Tukey’s honestly significant difference, HSD, test) female participants, Student’s t-test was conducted for the showed significant differences between all conditions (p < listening difficulty and noisiness items, and no significant 0.01). On the other hand, no main effect for the gender of the difference was found. Therefore, the relationship between the voice was found, suggesting that the influence of the gender SNR and the listening difficulty and noisiness in the situation of the voice on the evaluation of listening difficulty and nois- with the added BGN of the ticket gate is shown in Fig. 2 iness is small. Therefore, the averages of the evaluated values excluding the classification by participant gender. Error bars obtained for the four types of voices were re-taken and are indicate standard errors. shown in Fig. 3. A three-way analysis of variance (ANOVA) was con- Figure 3 shows that listening difficulty decreases and nois- ducted using gender of the voice, SNR level, and BGN level iness increases as the SNR increases. However, when the Noisiness Listening difficulty Listening difficulty Noisiness Noisiness Listening difficulty Acoustics Australia 5.0 5.0 (a) (a) BGN 60 dB BGN 60 dB BGN 65 dB BGN 65 dB BGN 70 dB BGN 70 dB BGN 75 dB 4.0 4.0 BGN 80 dB 3.0 3.0 2.0 2.0 1.0 1.0 +0 +3 +6 +9 +12 +15 0+3 +6 +9 +12 +15 SNR (dB) SNR (dB) (b) (b) 5.0 5.0 BGN 60 dB BGN 60 dB BGN 65 dB BGN 65 dB BGN 70 dB BGN 70 dB BGN 75 dB 4.0 4.0 BGN 80 dB 3.0 3.0 2.0 2.0 1.0 1.0 +0 +3 +6 +9 +12 +15 0 +3 +6 +9 +12 +15 SNR (dB) SNR (dB) Fig. 3 Transition of a “listening difficulty” and b “noisiness” when the Fig. 4 Transition of a “listening difficulty” and b “noisiness”on the four voice gender scores are averaged platform announcement level is over 80 dB, the impression of noisi- ness becomes more pronounced, so increased SNR tends to at the ticket gate is shown in Fig. 5, while the relationship increase the listening difficulty. In addition, when comparing among these with added BGN on the platform is shown in the values for the same SNR, the higher the BGN level, the Fig. 6. higher the evaluated loudness. In short, when the BGN level A two-way ANOVA was conducted on the listening dif- is high, even if the SNR is increased, it is difficult to reduce ficulty ratings for both the ticket gate and the platform, with the listening difficulty, and the impression of noisiness is speech rate and BGN level as factors. The results for the main high. effects of the speech rate and BGN level were significant (p < The relationship among the SNR, listening difficulty, and 0.01). Multiple comparisons (Tukey’s HSD test) for speech noisiness for the situation where BGN is added to the plat- rate also showed significant differences between all condi- form is shown in Fig. 4. tions (p < 0.01). The same two-way ANOVA was conducted A two-way ANOVA was conducted for the SNR and BGN on the noisiness ratings as above, and the main effect of the levels as factors for each of the ratings of listening difficulty BGN level was significant (p < 0.01). The same two-way and noisiness. Then the main effects and interactions of the ANOVA was conducted for the rating of strangeness, and SNR and BGN levels were significant (p < 0.01). Multiple the main effects of the speech rate and BGN level were sig- comparisons (Tukey’s HSD test) showed significant differ- nificant (p < 0.01). The results of multiple comparisons for ences between all conditions (p < 0.01). The platform is a speech rate also showed significant differences between all location where the BGN level is often higher than at the ticket conditions (p < 0.01). gate, so when the BGN level is 75 dB or higher, the noisi- ness increases significantly before the listening difficulty is improved, even if the SNR is increased. 4 Discussion 3.2 Experiment 2 Values within the range where the evaluated scores for both listening difficulty and noisiness were less than 2 or 2.5 were The relationship between speech rate and listening difficulty, regarded as appropriate SNR, while the values outside of noisiness, and strangeness in the situation with added BGN this range were regarded as inappropriate SNR. In Fig. 7,the Noisiness Listening difficulty Noisiness Listening difficulty Acoustics Australia 5.0 5.0 (a) (a) BGN 60 dB BGN 60 dB BGN 65 dB BGN 65 dB BGN 70 dB BGN 70 dB BGN 75 dB 4.0 4.0 3.0 3.0 2.0 2.0 1.0 1.0 5.5 6.5 7.5 8.5 5.5 6.5 7.5 8.5 Speech rate (mora/s) Speech rate (mora/s) (b) 5.0 (b) 5.0 BGN 60 dB BGN 60 dB BGN 65 dB BGN 65 dB BGN 70 dB BGN 70 dB 4.0 BGN 75 dB 4.0 3.0 3.0 2.0 2.0 1.0 1.0 5.5 6.5 7.5 8.5 5.56.5 7.58.5 Speech rate (mora/s) Speech rate (mora/s) (c) 2.0 2.0 (c) BGN 60 dB BGN 60 dB BGN 65 dB BGN 65 dB BGN 70 dB BGN 70 dB BGN 75 dB 1.0 1.0 -1.0 -1.0 -2.0 -2.0 5.5 6.5 7.5 8.5 5.5 6.5 7.5 8.5 Speech rate (mora/s) Speech rate (mora/s) Fig. 5 Relationship between speech rate and a “listening difficulty”, Fig. 6 Relationship between speech rate and a “listening difficulty”, b “noisiness”, and c “strangeness” in the ticket gate b “noisiness”, and c “strangeness” on the platform range of the SNR where both listening difficulty and noisiness scores were less than 2 is shown in red, and that where the the evaluated value is small, it is thought that the frequency scores were less than 2.5 is shown in blue. Note that when characteristics of the BGN at the turnstiles may have an effect the platform BGN was 80 dB, no scores under 2.0 and 2.5 on the perceived loudness at the ticket gate. It is difficult to were obtained. determine a single appropriate announcement level for all Figure 7 shows the appropriate SNR at each location. BGN levels both at the ticket gate and on the platform. How- Regarding the range of 60–70 dB of BGN at the ticket gate ever, since the range of BGN likely to occur at a station can be and platform, it can be said that the SNR can be set higher on predicted from the number of passengers per day for the sta- the platform than at the ticket gate by reproducing the speech tion and the speed of the entering trains [42], using Fig. 7 as a signal at a higher sound pressure level. This is because the reference, it is possible to consider the appropriate announce- noisiness of announcements tends to be perceived more eas- ment level for a station by selecting the BGN level. When the ily at the ticket gate than on the platform when the SNR is BGN level is over 70 dB at the ticket gate and 75 dB on the increased. Given that the effect of the gender of the voice on platform, the range of appropriate SNR is quite limited. So, it Strangeness Noisiness Listening difficulty Strangeness Noisiness Listening difficulty Acoustics Australia (a) As for the speech rate, the scores for listening difficulty and strangeness were generally similar to each other, so that sim- ilar trends were observed for the appropriate speech rate between the synthetic and the natural voices. The evaluation value for noisiness was slightly higher in this experiment, but +6 this may be because, unlike previous studies using loudspeak- ers, the voice was played from headphones, and in particular, as mentioned above, the ticket gate is a place where the nois- iness of an announcement tends to be easily perceived. 55 60 65 70 75 In this study, the threshold of the evaluation value was LAeq (dB) set at 2 or 2.5, but it is possible to change the thresh- old according to the situation, for example, for emergency (b) announcements of important content, a SNR that is slightly noisy is acceptable, as emphasis is on ease of hearing. Addi- tionally, the results of this experiment provide the appropriate announcement levels for people with normal hearing, and it is necessary to take into account, for example, the elderly population, who are more likely to suffer from hearing loss. In addition, Figs. 5b and 6b show that there is no direct rela- tionship between speech rate and noisiness, so it can be said that speech rate can be changed to suit the broadcast envi- ronment. It should also be noted that since this experiment 55 60 65 70 75 80 was conducted in a stationary state, the appropriate range of LAeq (dB) speech rate may change, especially in places where people Fig. 7 Appropriate SNR in a the ticket gate and b the platform are often in a walking state, such as at an actual ticket gate. From the above, assuming that an appropriate SNR can be set, the standard speech rate of an information broadcast at a is important to promote the introduction of sound-absorbing railway station should be 7.0 mora/s at the ticket gate and 6.5 mora/s on the platform to minimize the listening difficulty materials and other improvements to prevent the noise level at each location from becoming too high, and to give priority when stationary and to avoid sounding unnatural. to this over the consideration of announcement levels. In fact, This study was conducted without considering the char- various existing studies have shown how to consider building acteristics of the speech propagation environment. When materials and sound absorption methods to shorten the rever- applying the current experimental results to the actual station beration time and reduce the ambient noise level at railway environment, there is a need to set the SNR and speech rate in stations [43–46]. It is also effective to control an appropri- consideration of the directivity and reverberation time of the ate SNR by considering the spatial relationship between the loudspeaker in the actual environment. A previous study con- target area and the loudspeakers [47]. sidering differences in reverberation time (with and without sound absorbing material) in station buildings [1] reported Focusing on the experimental results for speech rate, Figs. 5c and 6c show that the evaluated values of unnatu- that a SNR of about 5 dB higher than in a sound-absorbing environment is required in a reflective environment. With ralness in relation to speech rate are almost proportional to 6.5 mora/s. The tendency for the participants to perceive that regard to speech rate, the results of the present study are the announcements on the platform were slightly faster than similar to those of a previous study [2] and are therefore those at the ticket gate at 7.5 mora/s is assumed to be because considered to be independent of the presence or absence of they wanted to listen to the announcements more carefully sound-absorbing material. In this study, we also recorded in the platform situation. BGN at the underground platforms, but the noise levels were Previous studies [1, 2] have reported that the appropri- lower than those at the above-ground platforms due to the ate SNR in station concourses should be around +8 dB for large amount of sound-absorbing material installed on the absorptive ceilings and more than +13 dB for reflective ceil- ceilings of the underground platforms. Therefore, only the BGN of the above-ground platforms was used in this exper- ings. Although the acoustic characteristics of the space were not taken into account in this experiment, a comparison of iment. the results for the concourse and a ticket gate close to that location suggests that there is no significant difference in the appropriate SNR between the synthesized and natural voices. SNR (dB) SNR (dB) Acoustics Australia 5 Conclusion 5. Cooke, M., Mayo, C., Valentini-Botinhao, C.: Intelligibility- enhancing speech modifications: the hurricane challenge. Inter- speech. 3552–3556 (2013) This study attempts to determine an appropriate SNR and 6. Greene, B.G., Logan, J.S., Pisoni, D.B.: Perception of synthetic speech rate for synthetic voice announcements at railway speech produced automatically by rule: Intelligibility of eight stations. text-to-speech systems. Behav. Res. Method Instr. Comput. 18(2), 100–107 (1986) The following results were obtained. First, the appropriate 7. Mirenda, P., Beukelman, D.: A comparison of intelligibility among SNR varied depending on the broadcast location and BGN natural speech and seven speech synthesizers with listeners from level. Second, it was found that increasing the SNR when three age groups. Augment. Alternat. Commun. 6(1), 61–68 (1990) the BGN level was high did not lead to an improvement in 8. Waterworth, J.A.: Why is synthetic speech harder to remember than natural speech? ACM SIGCHI Bull. 16(4), 201–206 (1985) the listening comprehension of the announcement. Finally, 9. Pisoni, D.B.: Perception of synthetic speech. In: Progress in Speech it was confirmed to be possible to set standards for speech Synthesis, pp. 541–560. Springer, New York (1997) speed depending on the broadcast location and situation, and 10. Sato, H., Morimoto, M., Ota, R.: Acceptable range of speech level that the SNR and speech rate show similar trends between in noisy sound fields for young adults and elderly persons. J. Acoust. Soc. Am. 130(3), 1411–1419 (2011) synthesized and natural voices. 11. Cabinet Office. 2022 White Paper on Aging Society. Japanese Synthetic speech can contribute to improving informa- (2022) tion transmission in announcements by setting an appropriate 12. Chang, X.Y., Ikeda, Y., Tsujimura, S., Sakamoto, K.: Examining SNR and speech rate for each location, in the same way as the provision of railway transit information to foreign visitors in the Tokyo metropolitan area and strategies for improvement. Transp. conventional human voices. Future work will focus on find- Res. Rec. 2672(8), 546–556 (2018) ing a method for improving intelligibility independent of the 13. Yamagami, T., Hattori, H., Yoshiji, K., Kamisaka, T.: Transporta- SNR by applying voice processing to announcements, and tion information system for foreign tourists. Hitachi Rev. 67(7), on addressing the same questions while also considering the 866–871 (2018) 14. Schimkowsky, C.: Managing passenger etiquette in Tokyo: acoustic characteristics of the space being examined. between social control and customer service. Mobilit (2021). https://doi.org/10.1080/17450101.2021.1929418 Funding Open access funding provided by Tokyo University of Sci- 15. Sekiguchi, M.: JR East’s approach to universal design of railway ence. stations. JR Transp. Rev. 45, 9–11 (2006) 16. Ito, Y.: Easy-to-access rail–JR East’s initiatives. JR Transp. Rev. 45, 12–16 (2006) Declarations 17. Kameda, A., Sakamoto, K.: Study on the acoustic environment in station concourses for elderly people. In: JR East Tech. Rev. 28 (2014) Conflict of interest The authors declare that they have no known com- 18. Suzuki, T., Nakagawa, Y., Sakai, A.: For smoother use of railway peting financial interests or personal relationships that could have stations. In: JR East Tech. Rev. 13 (2009) appeared to influence the work reported in this paper. 19. TOSHIBA DIGITAL SOLUTIONS CORPORATION [Internet]. RECAIUS Speech Synthesis Middleware ToSpeak™; c2015–2022 Open Access This article is licensed under a Creative Commons Attri- [cited 2022 Sep 10]. Available from: https://www.global.toshiba/ bution 4.0 International License, which permits use, sharing, adaptation, jp/products-solutions/ai-iot/recaius/lineup/tospeak.html. distribution and reproduction in any medium or format, as long as you 20. HOYA CORPORATION [Internet]. ReadSpeaker; c2020–2022 give appropriate credit to the original author(s) and the source, pro- [cited 2022 Sep 10]. Available from: https://www.readspeaker. vide a link to the Creative Commons licence, and indicate if changes com/ were made. The images or other third party material in this article are 21. Yamanouchi, S.: Essential information systems for railways and included in the article’s Creative Commons licence, unless indicated intensive application of ADS technology? COSMOS and ATOS. otherwise in a credit line to the material. If material is not included in In: 2013 IEEE 11th Internat. Symp. Autonom. Decentr. Syst. 2–9 the article’s Creative Commons licence and your intended use is not (1999) permitted by statutory regulation or exceeds the permitted use, you will 22. Ning, Y., He, S., Wu, Z., Xing, C., Zhang, L.J.: A review of deep need to obtain permission directly from the copyright holder. To view a learning based speech synthesis. Appl. Sci. 9(19), 4050 (2019) copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. 23. Japan Tourism Agency. Guidelines for Improving and Strengthen- ing Multilingual Support to Realize a Tourism Nation. Japanese (2014) 24. Tachibana, H.: Public space acoustics for information and safety. References Proc. Meet. Acoust. 19(032005), 1–11 (2013) 25. Bradley, J.S.: Speech intelligibility studies in classrooms. J. Acoust. 1. Tsujimura, S.: Study on reproduce level of announcement at a sta- Soc. Am. 80, 846–854 (1986) tion for the elderly. Inter-Noise. 1–8 (2015) 26. Bradley, J.S.: Predictors of speech intelligibility in rooms. J. Acoust. Soc. Am. 80, 837–845 (1986) 2. Tsujimura, S.: Relationship of the difference of the speech rate of an 27. Bradley, J.S.: On the combined effects of signal-to-noise ratio and announcement at the railway station and the listening impression. room acoustics on speech intelligibility. J. Acoust. Soc. Am. 106, J. Acoust. Soc. Am. 140(4), 3126–3126 (2016) 1820–1828 (1999) 3. Indumathi, A., Chandra, E.: Survey on speech synthesis. Signal 28. Bistafa, S.R., Bradley, J.S.: Reverberation time and maximum Process. Int. J. 6(5), 140 (2012) background-noise level for classrooms from a comparative study of 4. Hoy, M.B.: Alexa, Siri, Cortana, and more: an introduction to voice assistants. Med. Ref. Serv. Quart. 37(1), 81–88 (2018) 123 Acoustics Australia speech intelligibility metrics. J. Acoust. Soc. Am. 107(2), 861–875 40. Morimoto, M., Sato, H., Kobayashi, M.: Listening difficulty as a (2000) subjective measure for evaluation of speech transmission perfor- 29. Kobayashi, M., Morimoto, M., Sato, H., Sato, H.: Optimum speech mance in public spaces. J. Acoust. Soc. Am. 116(3), 1607–1613 level to minimize listening difficulty in public spaces. J. Acoust. (2004) Soc. Am. 121(1), 251–256 (2007) 41. Google LLC [Internet]. Google cloud Text-to-Speech; c2011–2022 30. Sato, H., Bradley, J.S., Morimoto, M.: Using listening difficulty rat- [cited 2022 Sep 10]. Available from https://cloud.google.com/text- ings of conditions for speech communication in rooms. J. Acoust. to-speech/ Soc. Am. 117(3), 1157–1167 (2005) 42. Izumi, Y: Field measurement and subjective evaluation of acous- 31. Sato, H., Sato, H., Morimoto, M.: Effects of aging on word intel- tical condition of railway station in and around Tokyo. Proc. ligibility and listening difficulty in various reverberant fields. J. Inter-Noise (2009) Acoust. Soc. Am. 121(5), 2915–2922 (2007) 43. Shimokura, R., Soeta, Y.: Sound field characteristics of under- 32. Prafiyanto, H., Nose, T., Chiba, Y., Ito, A.: Analysis of preferred ground railway stations–effect of interior materials and noise speaking rate and pause in spoken easy Japanese for non-native source positions. Appl. Acoust. 73(11), 1150–1158 (2012) listeners. Acoust. Sci. Technol. 39(2), 92–100 (2018) 44. Wu, Y., Kang, J., Zheng, W.: Acoustic environment research of 33. Yokoyama, S., Tachibana, H.: Subjective experiment on suitable railway station in China. Energy Proc. 153, 353–358 (2018) speech-rate of public address announcement in public spaces. In: 45. Haan, C.H.: Case study: predicted effect of station design changes Proc. Meet. Acoust. ICA. 19(1) (2013) on high speed train noise. Build. Acoust. 9(4), 311–323 (2002) 34. Yokoyama S, Tachibana H. Study on the acoustical environmental 46. Sü, Z., Çaliskan, ¸ M.: Acoustical design and noise control in metro in public spaces. INTER-NOISE. 1–8 (2008) stations: case studies of the Ankara metro system. Build. Acoust. 35. Li, H., Peng, W., Xiang, Y., Wenjun, Z.: Researches on sound envi- 14(3), 203–221 (2007) ronment in Futian underground railway station. Proced. Eng. 165, 47. Oldfield, A.: Acoustic design of transit stations. Proc. Meet. 730–739 (2016) Acoust. 18(1), 1–6 (2012) 36. Bandyopadhyay, P., Bhattacharya, S.K., Kashyap, S.K.: Assess- ment of noise environment in a major railway station in India. Indust. Health. 32(3), 187–192 (1994) Publisher’s Note Springer Nature remains neutral with regard to juris- 37. Architect Inst. Jap. Standards for evaluation of speech transmission dictional claims in published maps and institutional affiliations. performance in built environment. AIJES-S0002-2011. Japanese (2011) 38. Ministry of Land. Infrastructure and Transport, Guideline for the Improvement of Barrier-Free Passenger Facilities. Passenger facil- ities section of the Barrier-Free Improvement Guidelines. Japanese (2022) 39. Sato. T., Sato, H., Sato, H. Morimoto, M.: Sound environment for speech communication at railway stations in Japan. In: Proc. Int. Congress on Acoust. ;1199–1200 (2004) http://www.deepdyve.com/assets/images/DeepDyve-Logo-lg.png Acoustics Australia Springer Journals

Effects of Artificial Synthetic Speech Control of SNR and Speech Rate on the Intelligibility of Train Station Announcements

Acoustics Australia , Volume OnlineFirst – Sep 30, 2023

Loading next page...
 
/lp/springer-journals/effects-of-artificial-synthetic-speech-control-of-snr-and-speech-rate-xD35c8G94q

References (32)

Publisher
Springer Journals
Copyright
Copyright © The Author(s) 2023
ISSN
0814-6039
eISSN
1839-2571
DOI
10.1007/s40857-023-00306-8
Publisher site
See Article on Publisher Site

Abstract

An experimental study on the effect of the speech characteristics of the signal-to-noise ratio (SNR) and speech rate on the intelligibility of announcements at railway stations was conducted using an artificial synthetic voice. Synthesized speech has recently been used in noisy environments both indoors and outdoors, but unlike its use in quiet environments, when the environment is noisy, the intelligibility of announcements may be reduced. For railway station announcements, while natural spoken voices are currently used for multilingual announcements and disaster response broadcasts, deep neural network synthesized voices, which use deep learning, have also been adopted. However, the effect of the acoustic characteristics such as the SNR and speech rate on the intelligibility of reproduced announcements in noisy public spaces such as railway stations has not yet been clarified from a practical viewpoint. In this paper, in order to determine the appropriate SNR and speech rate for synthetic voice announcements in railway stations, auditory impressions of announcements with varying SNR and speech rate were evaluated by participants using a five-point scale. Based on the evaluations, the appropriate conditions for the broadcast of synthetic voice announcements at the ticket gate and on the platform of a station are discussed. Keywords Synthetic voice · Intelligibility of announcement · Sound environment of railway station · Signal-to-noise ratio (SNR) · Speech rate · Subjective evaluation 1 Introduction The ability to accurately grasp spoken information in public places can have a significant impact on the convenience of In public places, the sound environment is important for the space. The, improving the quality of information trans- speech communication in terms of speech transmission mission by speech is important for reasons such as increased performance. In noisy places, this may correspondingly safety and convenience, diversification of the information make speech communication more difficult. The audibil- provided to users, and the need to create an environment that ity of announcements in information transmission is greatly takes into account various types of disabilities. In particu- affected by the background noise (BGN) in the space [1] lar, as shown in these references, information on the use of and speech-specific characteristics such as speech rate [2]. public transport, such as railways, is often broadcast over loudspeakers, and it is important to ensure the intelligibil- B Mizuki Maruoka ity of this information. Currently, announcements such as 7522565@ed.tus.ac.jp those described above are still widely used in broadcasts B Takumi Asakura by a human voice or synthesized speech based on a wave- t_asakura@rs.tus.ac.jp form connection method [3], which is based on the recorded 1 sound of a human voice. In addition to the broadcast volume, Department of Mechanical and Aerospace Engineering, the speech speed also depends on the speaker, and there are Graduate School of Science and Technology, Tokyo University of Science, Noda-Shi, Chiba, Japan no clear rules for the playback conditions of announcement broadcasts. For this reason, there are situations where the Department of Urban and Civil Engineering, Graduate School of Science and Engineering, Ibaraki University, Hitachi-shi, announcements are perceived as too loud or where almost Ibaraki, Japan nothing can be heard. As this method of synthetic speech is Department of Mechanical and Aerospace Engineering, synthesized on the basis of the natural voice, it is difficult to Faculty of Science and Technology, Tokyo University of create multilingual broadcasts, and in the event of a disaster, Science, Noda-shi, Chiba, Japan 123 Acoustics Australia station staff have to deal with the situation by broadcasting importance of speech intelligibility in noisy environments, in their own voices. and there have been many studies on the optimal volume and Artificial synthesized speech has recently been used in var- speech rate of broadcasts in public spaces such as airports and ious places, with Apple’s Siri, Amazon’s Alexa, Microsoft’s train stations [1, 2, 24] and under high noise levels [25–33]. Cortana, and the Google Assistant being representative However, it is difficult to immediately apply these findings to examples [4]. Although some algorithmic improvements the environments in train stations because the acoustic char- have been studied to improve the intelligibility of broadcasts acteristics of the sound environment can differ depending made by natural or synthetic voices in noisy environments, on the location. Moreover, as non-combustibility, durability, synthetic voices, which tend to have mechanical intonation and water resistance are emphasized when designing railway different from that of natural voices, can be particularly stations, building materials are often acoustically reflective, difficult to understand [5–9]. Furthermore, Sato et al. [10] which results in a worse sound environment. Indeed, many experimentally investigated the relationship between the measurements of noise levels in railway stations have already improvement of intelligibility and noisiness by increasing been carried out [34, 35], and Bandyopadhyay et al. [36] the volume and pointed out that higher-volume loudspeaker suggested that the sound pressure levels (SPLs) of BGN and broadcasts may cause discomfort, and that the increase in loudspeaker sound on the platform can cause significant dis- noisiness may be more pronounced than the improvement in comfort to users, as they are largely above acceptable daytime intelligibility. noise levels. Furthermore, although the Architectural Insti- In recent years, the aging of the domestic population is tute of Japan Environmental Standard [37] sets a speech rate progressing in Japan [11], and it has been increasingly noted of 5.5 mora/s, which is derived from the results of a study that people are having trouble obtaining information on how using broadcasts at this rate, there are many other studies that to use and where to board public transportation. Currently, mention the possibility that the comprehensibility of infor- various efforts are being made to strengthen the communica- mation broadcasts may change depending on the speech rate tion of information concerning public transportation [12–18]. [2, 32, 33]. Under these circumstances, the development of For example, in 2020, JR East adopted Toshiba’s ToSpeak G3 clear voice transmission that is efficient and does not cause text-to-speech software as a tablet terminal to be carried by discomfort in the sound environment of station premises is train crews and station employees [19] and installed HOYA’s essential. ReadSpeaker text-to-speech software (hereinafter referred to In this study, the influence of two factors, the SNR and the as HOYA Broadcasting) in the concourses and platform of speech rate, on the auditory impression of announcements the Shinkansen [20]. Conventionally, ATOS (Autonomous was examined. Then, the appropriate conditions for syn- decentralized Transport Operation control System) broad- thetic voice broadcasts were examined. Section 2 describes casts and COSMOS (Computerized Safety, Maintenance the outline and procedure for the experiments, while Sect. 3 and Operation Systems of Shinkansen) broadcasts using the presents the results of each experiment. From the results, the waveform connection method have been used, with ATOS appropriate conditions for the broadcast of synthetic voice for conventional lines and COSMOS for the Shinkansen [21]. announcements at each location in a railway station are dis- Recently, synthetic speech using the DNN method [22], as cussed in Sect. 4. typified by HOYA broadcasts, is being introduced. By pro- moting the introduction of the announcements made with the DNN system, the time and cost associated with record- 2 Methods ing and re-recording by a human announcer can be reduced. In addition, according to the guidelines of the Japan Tourism This study investigates the appropriate SNR and speech rate Agency [23], in addition to multilingual broadcasts in four for synthetic voice announcements at railway stations for languages (Japanese, English, Chinese, and Korean), which normal-hearing people, using announcements made by DNN are in high demand in Japan, it is possible not only to make synthesized speech. As BGN, two locations, the ticket gate emergency announcements using synthetic voice, but also to and the platform of the station, were considered. They are make use of the advantages of instantaneous speech synthe- listed in the Barrier-Free Improvement Guidelines of the sis to make train announcements with timetables different Ministry of Land, Infrastructure and Transport [38] as places from that originally planned, and to make announcements where acoustic facilities should be provided at railway sta- that are applicable for only short periods of time. Thus, if tions, and noise level surveys have been conducted for both DNN synthetic speech can be used effectively in stations, locations [39]. Previous studies [2, 32, 33] have also reported it will be possible to create announcements freely, regard- appropriate guidelines for the speech rate of announcements less of the language or situation. However, in practice, the in public spaces in Japan. In order to investigate the effects effectiveness of synthetic voice for use in stations has not of SNR and speech rate on auditory impression separately, been verified. For example, Tachibana [24] emphasized the the appropriate SNR range was first found by setting the 123 Acoustics Australia Table 1 Conditions of the generated announcement using WaveNet in Table 2 Setting conditions and detailed content of the BGN Google text-to-speech Recording method Binaural recording Level of voice presentation (dB) 60, 65, 70, 75 (75 dB is platform only) Length of sound source 15 (s) Length of announcement (s) About 10 Sound source status 60 dB Small amount of Speech rate (mora/s) 7 (ticket gate) pedestrian traffic 65 dB Moderate amount of pedestrian traffic speech rate considered appropriate beforehand, that is, with- 70 dB Large amount of out considering the acoustic characteristics of the space. pedestrian traffic Then, experiments were used to clarify the optimal speech Sound source status 60 dB No trains are stopping on rate at each location in the railway station. More specifically, (platform) the platform referring to previous studies [1, 2] conducted using natu- 65 dB A train is stopping on the ral voice announcements, Experiment 1 was first conducted opposite platform to examine the appropriate SNR. The obtained SNR results 70 dB A train is stopping on your platform were then introduced, and Experiment 2 was conducted next 75 dB A train is running on your to examine the appropriate speech rate. platform Twelve students in their 20s participated in the experiment 80 dB A train is running on the (6 male, 6 female). Participants were recruited by email to opposite platform students at the Faculty of Science and Technology, Tokyo (wheel noise reaches the University of Science, and received no remuneration for their listener directly) participation. In accordance with EN 50332-2 proposed by CENELEC as a sound pressure regulation for portable audio players, the study was not invasive, and written consent for name, destination name, line number, train type, number of the experiment was obtained from all participants. Experi- cars, etc.) included in each announcement were changed so mental collaborators were briefed on the purpose of the study that the participants could not predict the next sentence when and the experimental methods, and anonymization and use listening to the announcement in the experiment. The BGN of the data. Furthermore, prior to the study, participants were for this experiment was recorded using an earphone-type asked about their hearing and were assured that both ears binaural microphone (Adphox BME-200) at several railway were normal. stations in the Tokyo metropolitan area. Sound sources in the range of noise levels likely to occur at each location were 2.1 Experiment 1 selected from the sound source data recorded for 15 s each time. Because of this, the frequency and time-varying charac- Participants were instructed to listen to an information broad- teristics of the BGN at each noise level were different from cast superimposed on BGN through headphones and to eval- each other. The details of the experimental BGN and their uate the announcement based on their listening impressions. frequency characteristics are shown in Table 2 and Fig. 1, Two evaluation items, “listening difficulty” and “noisiness”, respectively. The 3 kHz peak seen in the BGN at the ticket were selected with reference to previous studies [1, 40]. A gate is the sound of the ticket machine. A precision sound five-point Likert scale was used, with “1  not at all (全 level meter (Rion NL-52) was used to measure the BGN level く ~ ない)”, “2  not very much (それほど ~ ない)”, “3 during the recording. For the experiment, headphones (Sony slightly (多少 ~)”, “4  much (だいぶ ~)”, and “5  MDR-M1ST), an audio interface (RME Fireface 802), and a very much (非常に ~)”, for each of “listening difficulty” and laptop computer were used. “noisiness”. The participants were also asked to assume a Table 3 shows the auditory conditions presented to the situation in which they were trying to navigate by relying on participants in the experiment with BGN at the ticket gate. audio information at a station they were using for the first Each combination of sound sources was presented once at time. The announcement used in this experiment was cre- random. Then, participants evaluated a total of 72 conditions, ated using the WaveNet voice of Google text-to-speech [41], with six levels of SNR for each of three levels of BGN for a speech synthesis application programming interface (API) the conditions at the ticket gate, and four kinds of voice pitch provided by Google. The various conditions of the informa- (low male, low female, high male, and high female). In the tion broadcasts created are shown in Table 1. The speech rate experiment using the BGN of the platform, only one type of was set to 7 mora/s with reference to previous studies [2, 32, information broadcast was used for the female voice, as in 33] and HOYA broadcasts. Additionally, the words (station the HOYA broadcast, because no significant differences for 123 Acoustics Australia (a) Table 5 The SNR setting conditions Place Ticket gate Platform BGN(dB) 606570 60 65 70 75 SNR(dB) +8+7+4+11 +11 +9 +4 Ticket gate (60 dB) Ticket gate (65 dB) 20 dB Table 6 Auditory presentation conditions for conditions at the ticket Ticket gate (70 dB) gate and the platform 100 1000 10000 BGN (dB) 60, 65, 70, 75 (75 dB is platform only) Frequency (Hz) Speech rate (mora/s) 5.5–8.5 Gender of voice (–) (Pitch of voice) low female (b) 2.2 Experiment 2 The participants and equipment used in this experiment were Platform (60 dB) Platform (65 dB) the same as in Experiment 1. Participants were instructed to Platform (70 dB) evaluate their subjective impressions of the broadcast infor- Platform (75 dB) 20 dB mation and BGN, which were presented to them through Platform (80 dB) headphones. Three evaluation items were selected: “listen- 100 1000 10000 ing difficulty” and “noisiness”, which were the same as the Frequency (Hz) evaluation items in Experiment 1, plus “strangeness”, for evaluating the unnaturalness for the speech speed of the Fig. 1 Frequency characteristics of the BGN at (a) the ticket gate and generated voice, with reference to a previous study [2]. A (b) the platform five-point Likert scale of “1  not at all (全く ~ ない)”, “2 not very much (それほど ~ ない)”, “3  slightly (多少 ~)”, “4  much (だいぶ ~)” and “5  very much (非常に ~)” Table 3 Conditions of the auditory stimuli at the ticket gate was used for the “listening difficulty” and “noisiness”, and BGN (dB) 60, 65, 70 for the evaluation of “strangeness”, the speed that was felt SNR(dB) 0 to +15every3dB appropriate for a railway station information broadcast was Gender (–) (Pitch of voice) high female, low female, high set as the standard of “feels appropriate (ちょうどよい)”, male, low male while the speed of “−2  feels slow (遅く感じる)”, “−1 feels slightly slow (やや遅く感じる)”, “0  feels appro- priate (ちょうどよい)”, “1  feels slightly fast (やや速く Table 4 Conditions of the auditory stimuli on the platform 感じる)”, and “2  feels fast (速く感じる)” were selected. The participants were asked to evaluate the announcements in BGN (dB) 60, 65, 70, 75, 80 the same way as in Experiment 1: they were asked to assume SNR (dB) 0 to +15, 0 to +9 every 3 dB (at 80 dB) a situation in which they were trying to move around at a Gender (–) (Pitch of voice) low female station they were using for the first time and were relying on audio information. The creation and recording methods for the announcements and BGN used in this experiment were the gender of the voice were found in the experiment of the also the same as in Experiment 1. The SNR was set as shown ticket gate (see below for details). The auditory conditions in Table 5, referring to the results of Experiment 1. However, presented to the participants are shown in Table 4. Note that as described below, an appropriate SNR could not be found the upper limit of the SNR was set to +9 dB so that the for the 80 dB BGN of the platform, so it was excluded from information broadcast was kept below 90 dB for the 80 dB this experiment. The auditory conditions are shown in Table BGN of the platform. That is to say, participants evaluated a 6. The participants evaluated a total of 28 conditions, with total of 28 conditions, with six levels of SNR (four levels in four levels of speech speed for each of three levels of BGN one part) for each of five levels of BGN for the conditions for the conditions at the ticket gate and four levels of BGN on the platform. for those at the platform. Relative level (dB) Relative level (dB) Acoustics Australia (a) 5.0 5.0 (b) Female 1 Female 1 Female 2 Female 2 4.0 4.0 Male 1 Male 1 Male 2 Male 2 3.0 3.0 2.0 2.0 1.0 1.0 +0 +3 +6 +9 +12 +15 +0 +3 +6 +9 +12 +15 SNR (dB) SNR (dB) (c) 5.0 5.0 (d) Female 1 Female 1 Female 2 Female 2 4.0 4.0 Male 1 Male 1 Male 2 Male 2 3.0 3.0 2.0 2.0 1.0 1.0 +0 +3 +6 +9 +12 +15 +0 +3 +6 +9 +12 +15 SNR (dB) SNR (dB) (e) 5.0 5.0 (f) Female 1 Female 1 Female 2 Female 2 4.0 4.0 Male 1 Male 1 Male 2 Male 2 3.0 3.0 2.0 2.0 1.0 1.0 +0 +3 +6 +9 +12 +15 +0 +3 +6 +9 +12 +15 SNR (dB) SNR (dB) Fig. 2 Transition of “listening difficulty” under BGN levels of a 60 dB, b 65 dB, and c 70 dB, and transition of “noisiness” under BGN levels of d 60 dB, e 65 dB, and f 70 dB, in the ticket gate 3 Results as factors, without including the gender of the participants, for each of the ratings of listening difficulty and noisiness. 3.1 Experiment 1 Then the main effects and interactions of SNR and BGN levels were found to be significant (p < 0.01). Multiple com- First, in order to examine the difference between male and parisons (Tukey’s honestly significant difference, HSD, test) female participants, Student’s t-test was conducted for the showed significant differences between all conditions (p < listening difficulty and noisiness items, and no significant 0.01). On the other hand, no main effect for the gender of the difference was found. Therefore, the relationship between the voice was found, suggesting that the influence of the gender SNR and the listening difficulty and noisiness in the situation of the voice on the evaluation of listening difficulty and nois- with the added BGN of the ticket gate is shown in Fig. 2 iness is small. Therefore, the averages of the evaluated values excluding the classification by participant gender. Error bars obtained for the four types of voices were re-taken and are indicate standard errors. shown in Fig. 3. A three-way analysis of variance (ANOVA) was con- Figure 3 shows that listening difficulty decreases and nois- ducted using gender of the voice, SNR level, and BGN level iness increases as the SNR increases. However, when the Noisiness Listening difficulty Listening difficulty Noisiness Noisiness Listening difficulty Acoustics Australia 5.0 5.0 (a) (a) BGN 60 dB BGN 60 dB BGN 65 dB BGN 65 dB BGN 70 dB BGN 70 dB BGN 75 dB 4.0 4.0 BGN 80 dB 3.0 3.0 2.0 2.0 1.0 1.0 +0 +3 +6 +9 +12 +15 0+3 +6 +9 +12 +15 SNR (dB) SNR (dB) (b) (b) 5.0 5.0 BGN 60 dB BGN 60 dB BGN 65 dB BGN 65 dB BGN 70 dB BGN 70 dB BGN 75 dB 4.0 4.0 BGN 80 dB 3.0 3.0 2.0 2.0 1.0 1.0 +0 +3 +6 +9 +12 +15 0 +3 +6 +9 +12 +15 SNR (dB) SNR (dB) Fig. 3 Transition of a “listening difficulty” and b “noisiness” when the Fig. 4 Transition of a “listening difficulty” and b “noisiness”on the four voice gender scores are averaged platform announcement level is over 80 dB, the impression of noisi- ness becomes more pronounced, so increased SNR tends to at the ticket gate is shown in Fig. 5, while the relationship increase the listening difficulty. In addition, when comparing among these with added BGN on the platform is shown in the values for the same SNR, the higher the BGN level, the Fig. 6. higher the evaluated loudness. In short, when the BGN level A two-way ANOVA was conducted on the listening dif- is high, even if the SNR is increased, it is difficult to reduce ficulty ratings for both the ticket gate and the platform, with the listening difficulty, and the impression of noisiness is speech rate and BGN level as factors. The results for the main high. effects of the speech rate and BGN level were significant (p < The relationship among the SNR, listening difficulty, and 0.01). Multiple comparisons (Tukey’s HSD test) for speech noisiness for the situation where BGN is added to the plat- rate also showed significant differences between all condi- form is shown in Fig. 4. tions (p < 0.01). The same two-way ANOVA was conducted A two-way ANOVA was conducted for the SNR and BGN on the noisiness ratings as above, and the main effect of the levels as factors for each of the ratings of listening difficulty BGN level was significant (p < 0.01). The same two-way and noisiness. Then the main effects and interactions of the ANOVA was conducted for the rating of strangeness, and SNR and BGN levels were significant (p < 0.01). Multiple the main effects of the speech rate and BGN level were sig- comparisons (Tukey’s HSD test) showed significant differ- nificant (p < 0.01). The results of multiple comparisons for ences between all conditions (p < 0.01). The platform is a speech rate also showed significant differences between all location where the BGN level is often higher than at the ticket conditions (p < 0.01). gate, so when the BGN level is 75 dB or higher, the noisi- ness increases significantly before the listening difficulty is improved, even if the SNR is increased. 4 Discussion 3.2 Experiment 2 Values within the range where the evaluated scores for both listening difficulty and noisiness were less than 2 or 2.5 were The relationship between speech rate and listening difficulty, regarded as appropriate SNR, while the values outside of noisiness, and strangeness in the situation with added BGN this range were regarded as inappropriate SNR. In Fig. 7,the Noisiness Listening difficulty Noisiness Listening difficulty Acoustics Australia 5.0 5.0 (a) (a) BGN 60 dB BGN 60 dB BGN 65 dB BGN 65 dB BGN 70 dB BGN 70 dB BGN 75 dB 4.0 4.0 3.0 3.0 2.0 2.0 1.0 1.0 5.5 6.5 7.5 8.5 5.5 6.5 7.5 8.5 Speech rate (mora/s) Speech rate (mora/s) (b) 5.0 (b) 5.0 BGN 60 dB BGN 60 dB BGN 65 dB BGN 65 dB BGN 70 dB BGN 70 dB 4.0 BGN 75 dB 4.0 3.0 3.0 2.0 2.0 1.0 1.0 5.5 6.5 7.5 8.5 5.56.5 7.58.5 Speech rate (mora/s) Speech rate (mora/s) (c) 2.0 2.0 (c) BGN 60 dB BGN 60 dB BGN 65 dB BGN 65 dB BGN 70 dB BGN 70 dB BGN 75 dB 1.0 1.0 -1.0 -1.0 -2.0 -2.0 5.5 6.5 7.5 8.5 5.5 6.5 7.5 8.5 Speech rate (mora/s) Speech rate (mora/s) Fig. 5 Relationship between speech rate and a “listening difficulty”, Fig. 6 Relationship between speech rate and a “listening difficulty”, b “noisiness”, and c “strangeness” in the ticket gate b “noisiness”, and c “strangeness” on the platform range of the SNR where both listening difficulty and noisiness scores were less than 2 is shown in red, and that where the the evaluated value is small, it is thought that the frequency scores were less than 2.5 is shown in blue. Note that when characteristics of the BGN at the turnstiles may have an effect the platform BGN was 80 dB, no scores under 2.0 and 2.5 on the perceived loudness at the ticket gate. It is difficult to were obtained. determine a single appropriate announcement level for all Figure 7 shows the appropriate SNR at each location. BGN levels both at the ticket gate and on the platform. How- Regarding the range of 60–70 dB of BGN at the ticket gate ever, since the range of BGN likely to occur at a station can be and platform, it can be said that the SNR can be set higher on predicted from the number of passengers per day for the sta- the platform than at the ticket gate by reproducing the speech tion and the speed of the entering trains [42], using Fig. 7 as a signal at a higher sound pressure level. This is because the reference, it is possible to consider the appropriate announce- noisiness of announcements tends to be perceived more eas- ment level for a station by selecting the BGN level. When the ily at the ticket gate than on the platform when the SNR is BGN level is over 70 dB at the ticket gate and 75 dB on the increased. Given that the effect of the gender of the voice on platform, the range of appropriate SNR is quite limited. So, it Strangeness Noisiness Listening difficulty Strangeness Noisiness Listening difficulty Acoustics Australia (a) As for the speech rate, the scores for listening difficulty and strangeness were generally similar to each other, so that sim- ilar trends were observed for the appropriate speech rate between the synthetic and the natural voices. The evaluation value for noisiness was slightly higher in this experiment, but +6 this may be because, unlike previous studies using loudspeak- ers, the voice was played from headphones, and in particular, as mentioned above, the ticket gate is a place where the nois- iness of an announcement tends to be easily perceived. 55 60 65 70 75 In this study, the threshold of the evaluation value was LAeq (dB) set at 2 or 2.5, but it is possible to change the thresh- old according to the situation, for example, for emergency (b) announcements of important content, a SNR that is slightly noisy is acceptable, as emphasis is on ease of hearing. Addi- tionally, the results of this experiment provide the appropriate announcement levels for people with normal hearing, and it is necessary to take into account, for example, the elderly population, who are more likely to suffer from hearing loss. In addition, Figs. 5b and 6b show that there is no direct rela- tionship between speech rate and noisiness, so it can be said that speech rate can be changed to suit the broadcast envi- ronment. It should also be noted that since this experiment 55 60 65 70 75 80 was conducted in a stationary state, the appropriate range of LAeq (dB) speech rate may change, especially in places where people Fig. 7 Appropriate SNR in a the ticket gate and b the platform are often in a walking state, such as at an actual ticket gate. From the above, assuming that an appropriate SNR can be set, the standard speech rate of an information broadcast at a is important to promote the introduction of sound-absorbing railway station should be 7.0 mora/s at the ticket gate and 6.5 mora/s on the platform to minimize the listening difficulty materials and other improvements to prevent the noise level at each location from becoming too high, and to give priority when stationary and to avoid sounding unnatural. to this over the consideration of announcement levels. In fact, This study was conducted without considering the char- various existing studies have shown how to consider building acteristics of the speech propagation environment. When materials and sound absorption methods to shorten the rever- applying the current experimental results to the actual station beration time and reduce the ambient noise level at railway environment, there is a need to set the SNR and speech rate in stations [43–46]. It is also effective to control an appropri- consideration of the directivity and reverberation time of the ate SNR by considering the spatial relationship between the loudspeaker in the actual environment. A previous study con- target area and the loudspeakers [47]. sidering differences in reverberation time (with and without sound absorbing material) in station buildings [1] reported Focusing on the experimental results for speech rate, Figs. 5c and 6c show that the evaluated values of unnatu- that a SNR of about 5 dB higher than in a sound-absorbing environment is required in a reflective environment. With ralness in relation to speech rate are almost proportional to 6.5 mora/s. The tendency for the participants to perceive that regard to speech rate, the results of the present study are the announcements on the platform were slightly faster than similar to those of a previous study [2] and are therefore those at the ticket gate at 7.5 mora/s is assumed to be because considered to be independent of the presence or absence of they wanted to listen to the announcements more carefully sound-absorbing material. In this study, we also recorded in the platform situation. BGN at the underground platforms, but the noise levels were Previous studies [1, 2] have reported that the appropri- lower than those at the above-ground platforms due to the ate SNR in station concourses should be around +8 dB for large amount of sound-absorbing material installed on the absorptive ceilings and more than +13 dB for reflective ceil- ceilings of the underground platforms. Therefore, only the BGN of the above-ground platforms was used in this exper- ings. Although the acoustic characteristics of the space were not taken into account in this experiment, a comparison of iment. the results for the concourse and a ticket gate close to that location suggests that there is no significant difference in the appropriate SNR between the synthesized and natural voices. SNR (dB) SNR (dB) Acoustics Australia 5 Conclusion 5. Cooke, M., Mayo, C., Valentini-Botinhao, C.: Intelligibility- enhancing speech modifications: the hurricane challenge. Inter- speech. 3552–3556 (2013) This study attempts to determine an appropriate SNR and 6. Greene, B.G., Logan, J.S., Pisoni, D.B.: Perception of synthetic speech rate for synthetic voice announcements at railway speech produced automatically by rule: Intelligibility of eight stations. text-to-speech systems. Behav. Res. Method Instr. Comput. 18(2), 100–107 (1986) The following results were obtained. First, the appropriate 7. Mirenda, P., Beukelman, D.: A comparison of intelligibility among SNR varied depending on the broadcast location and BGN natural speech and seven speech synthesizers with listeners from level. Second, it was found that increasing the SNR when three age groups. Augment. Alternat. Commun. 6(1), 61–68 (1990) the BGN level was high did not lead to an improvement in 8. Waterworth, J.A.: Why is synthetic speech harder to remember than natural speech? ACM SIGCHI Bull. 16(4), 201–206 (1985) the listening comprehension of the announcement. Finally, 9. Pisoni, D.B.: Perception of synthetic speech. In: Progress in Speech it was confirmed to be possible to set standards for speech Synthesis, pp. 541–560. Springer, New York (1997) speed depending on the broadcast location and situation, and 10. Sato, H., Morimoto, M., Ota, R.: Acceptable range of speech level that the SNR and speech rate show similar trends between in noisy sound fields for young adults and elderly persons. J. Acoust. Soc. Am. 130(3), 1411–1419 (2011) synthesized and natural voices. 11. Cabinet Office. 2022 White Paper on Aging Society. Japanese Synthetic speech can contribute to improving informa- (2022) tion transmission in announcements by setting an appropriate 12. Chang, X.Y., Ikeda, Y., Tsujimura, S., Sakamoto, K.: Examining SNR and speech rate for each location, in the same way as the provision of railway transit information to foreign visitors in the Tokyo metropolitan area and strategies for improvement. Transp. conventional human voices. Future work will focus on find- Res. Rec. 2672(8), 546–556 (2018) ing a method for improving intelligibility independent of the 13. Yamagami, T., Hattori, H., Yoshiji, K., Kamisaka, T.: Transporta- SNR by applying voice processing to announcements, and tion information system for foreign tourists. Hitachi Rev. 67(7), on addressing the same questions while also considering the 866–871 (2018) 14. Schimkowsky, C.: Managing passenger etiquette in Tokyo: acoustic characteristics of the space being examined. between social control and customer service. Mobilit (2021). https://doi.org/10.1080/17450101.2021.1929418 Funding Open access funding provided by Tokyo University of Sci- 15. Sekiguchi, M.: JR East’s approach to universal design of railway ence. stations. JR Transp. Rev. 45, 9–11 (2006) 16. Ito, Y.: Easy-to-access rail–JR East’s initiatives. JR Transp. Rev. 45, 12–16 (2006) Declarations 17. Kameda, A., Sakamoto, K.: Study on the acoustic environment in station concourses for elderly people. In: JR East Tech. Rev. 28 (2014) Conflict of interest The authors declare that they have no known com- 18. Suzuki, T., Nakagawa, Y., Sakai, A.: For smoother use of railway peting financial interests or personal relationships that could have stations. In: JR East Tech. Rev. 13 (2009) appeared to influence the work reported in this paper. 19. TOSHIBA DIGITAL SOLUTIONS CORPORATION [Internet]. RECAIUS Speech Synthesis Middleware ToSpeak™; c2015–2022 Open Access This article is licensed under a Creative Commons Attri- [cited 2022 Sep 10]. Available from: https://www.global.toshiba/ bution 4.0 International License, which permits use, sharing, adaptation, jp/products-solutions/ai-iot/recaius/lineup/tospeak.html. distribution and reproduction in any medium or format, as long as you 20. HOYA CORPORATION [Internet]. ReadSpeaker; c2020–2022 give appropriate credit to the original author(s) and the source, pro- [cited 2022 Sep 10]. Available from: https://www.readspeaker. vide a link to the Creative Commons licence, and indicate if changes com/ were made. The images or other third party material in this article are 21. Yamanouchi, S.: Essential information systems for railways and included in the article’s Creative Commons licence, unless indicated intensive application of ADS technology? COSMOS and ATOS. otherwise in a credit line to the material. If material is not included in In: 2013 IEEE 11th Internat. Symp. Autonom. Decentr. Syst. 2–9 the article’s Creative Commons licence and your intended use is not (1999) permitted by statutory regulation or exceeds the permitted use, you will 22. Ning, Y., He, S., Wu, Z., Xing, C., Zhang, L.J.: A review of deep need to obtain permission directly from the copyright holder. To view a learning based speech synthesis. Appl. Sci. 9(19), 4050 (2019) copy of this licence, visit http://creativecommons.org/licenses/by/4.0/. 23. Japan Tourism Agency. Guidelines for Improving and Strengthen- ing Multilingual Support to Realize a Tourism Nation. Japanese (2014) 24. Tachibana, H.: Public space acoustics for information and safety. References Proc. Meet. Acoust. 19(032005), 1–11 (2013) 25. Bradley, J.S.: Speech intelligibility studies in classrooms. J. Acoust. 1. Tsujimura, S.: Study on reproduce level of announcement at a sta- Soc. Am. 80, 846–854 (1986) tion for the elderly. Inter-Noise. 1–8 (2015) 26. Bradley, J.S.: Predictors of speech intelligibility in rooms. J. Acoust. Soc. Am. 80, 837–845 (1986) 2. Tsujimura, S.: Relationship of the difference of the speech rate of an 27. Bradley, J.S.: On the combined effects of signal-to-noise ratio and announcement at the railway station and the listening impression. room acoustics on speech intelligibility. J. Acoust. Soc. Am. 106, J. Acoust. Soc. Am. 140(4), 3126–3126 (2016) 1820–1828 (1999) 3. Indumathi, A., Chandra, E.: Survey on speech synthesis. Signal 28. Bistafa, S.R., Bradley, J.S.: Reverberation time and maximum Process. Int. J. 6(5), 140 (2012) background-noise level for classrooms from a comparative study of 4. Hoy, M.B.: Alexa, Siri, Cortana, and more: an introduction to voice assistants. Med. Ref. Serv. Quart. 37(1), 81–88 (2018) 123 Acoustics Australia speech intelligibility metrics. J. Acoust. Soc. Am. 107(2), 861–875 40. Morimoto, M., Sato, H., Kobayashi, M.: Listening difficulty as a (2000) subjective measure for evaluation of speech transmission perfor- 29. Kobayashi, M., Morimoto, M., Sato, H., Sato, H.: Optimum speech mance in public spaces. J. Acoust. Soc. Am. 116(3), 1607–1613 level to minimize listening difficulty in public spaces. J. Acoust. (2004) Soc. Am. 121(1), 251–256 (2007) 41. Google LLC [Internet]. Google cloud Text-to-Speech; c2011–2022 30. Sato, H., Bradley, J.S., Morimoto, M.: Using listening difficulty rat- [cited 2022 Sep 10]. Available from https://cloud.google.com/text- ings of conditions for speech communication in rooms. J. Acoust. to-speech/ Soc. Am. 117(3), 1157–1167 (2005) 42. Izumi, Y: Field measurement and subjective evaluation of acous- 31. Sato, H., Sato, H., Morimoto, M.: Effects of aging on word intel- tical condition of railway station in and around Tokyo. Proc. ligibility and listening difficulty in various reverberant fields. J. Inter-Noise (2009) Acoust. Soc. Am. 121(5), 2915–2922 (2007) 43. Shimokura, R., Soeta, Y.: Sound field characteristics of under- 32. Prafiyanto, H., Nose, T., Chiba, Y., Ito, A.: Analysis of preferred ground railway stations–effect of interior materials and noise speaking rate and pause in spoken easy Japanese for non-native source positions. Appl. Acoust. 73(11), 1150–1158 (2012) listeners. Acoust. Sci. Technol. 39(2), 92–100 (2018) 44. Wu, Y., Kang, J., Zheng, W.: Acoustic environment research of 33. Yokoyama, S., Tachibana, H.: Subjective experiment on suitable railway station in China. Energy Proc. 153, 353–358 (2018) speech-rate of public address announcement in public spaces. In: 45. Haan, C.H.: Case study: predicted effect of station design changes Proc. Meet. Acoust. ICA. 19(1) (2013) on high speed train noise. Build. Acoust. 9(4), 311–323 (2002) 34. Yokoyama S, Tachibana H. Study on the acoustical environmental 46. Sü, Z., Çaliskan, ¸ M.: Acoustical design and noise control in metro in public spaces. INTER-NOISE. 1–8 (2008) stations: case studies of the Ankara metro system. Build. Acoust. 35. Li, H., Peng, W., Xiang, Y., Wenjun, Z.: Researches on sound envi- 14(3), 203–221 (2007) ronment in Futian underground railway station. Proced. Eng. 165, 47. Oldfield, A.: Acoustic design of transit stations. Proc. Meet. 730–739 (2016) Acoust. 18(1), 1–6 (2012) 36. Bandyopadhyay, P., Bhattacharya, S.K., Kashyap, S.K.: Assess- ment of noise environment in a major railway station in India. Indust. Health. 32(3), 187–192 (1994) Publisher’s Note Springer Nature remains neutral with regard to juris- 37. Architect Inst. Jap. Standards for evaluation of speech transmission dictional claims in published maps and institutional affiliations. performance in built environment. AIJES-S0002-2011. Japanese (2011) 38. Ministry of Land. Infrastructure and Transport, Guideline for the Improvement of Barrier-Free Passenger Facilities. Passenger facil- ities section of the Barrier-Free Improvement Guidelines. Japanese (2022) 39. Sato. T., Sato, H., Sato, H. Morimoto, M.: Sound environment for speech communication at railway stations in Japan. In: Proc. Int. Congress on Acoust. ;1199–1200 (2004)

Journal

Acoustics AustraliaSpringer Journals

Published: Sep 30, 2023

Keywords: Synthetic voice; Intelligibility of announcement; Sound environment of railway station; Signal-to-noise ratio (SNR); Speech rate; Subjective evaluation

There are no references for this article.