Building Neural Machine Translation Systems for Multilingual Participatory Spaces
Lohar, Pintu; Xie, Guodong; Gallagher, Daniel; Way, Andy
2023-05-01
Article

Building Neural Machine Translation Systems for Multilingual Participatory Spaces

Pintu Lohar, Guodong Xie, Daniel Gallagher and Andy Way *

ADAPT Centre, School of Computing, Dublin City University, D09 E432 Dublin, Ireland; pintu.lohar@adaptcentre.ie (P.L.); guodong.xie@adaptcentre.ie (G.X.); daniel.gallagher@adaptcentre.ie (D.G.)
* Correspondence: andy.way@adaptcentre.ie

Abstract: This work presents the development of the translation component in a multistage, multilevel, multimode, multilingual and dynamic deliberative (M4D2) system, built to facilitate automated moderation and translation in the languages of five European countries: Italy, Ireland, Germany, France and Poland. Two main topics were to be addressed in the deliberation process: (i) the environment and climate change; and (ii) the economy and inequality. In this work, we describe the development of neural machine translation (NMT) models for these domains for six European languages: Italian, English (included as the second official language of Ireland), Irish, German, French and Polish. As a result, we generate 30 NMT models, initially baseline systems built using freely available online data, which are then adapted to the domains of interest in the project by (i) filtering the corpora, (ii) tuning the systems with automatically extracted in-domain development datasets and (iii) using corpus concatenation techniques to expand the amount of data available. We compare our results produced by the domain-adapted systems with those produced by Google Translate, and demonstrate that fast, high-quality systems can be produced that facilitate multilingual deliberation in a secure environment.

Keywords: neural machine translation; domain adaptation; parallel data; deliberative democracy; citizens' assemblies

Citation: Lohar, P.; Xie, G.; Gallagher, D.; Way, A. Building Neural Machine Translation Systems for Multilingual Participatory Spaces. Analytics 2023, 2, 393–409. https://doi.org/10.3390/analytics2020022

Academic Editor: Jong-Min Kim

Received: 18 February 2023; Revised: 17 April 2023; Accepted: 24 April 2023; Published: 1 May 2023

Copyright: © 2023 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction

In the European-funded project EUComMeet (https://www.eucommeet.eu/, accessed on 10 December 2022), we need to facilitate multilingual communication between speakers in the context of deliberative democracy. In many European countries, citizens' assemblies have been set up as a response to the challenges currently besetting liberal democracies. These assemblies are participatory spaces created to improve the democratic practice by directly linking citizens with policy makers. Many of the issues currently faced across Europe are pervasive (e.g., immigration, climate change, the Ukraine war, the cost of fuel, etc.), so discussions are taking place in these respective citizens' assemblies, although only in a monolingual context. We aim to permit cross-lingual communication between European citizens on these important topics via machine translation (MT), considerably extending our initial work in Lohar et al. [1].

Sceptics might ask why these participants cannot instead all use English. Despite the widespread claims of English being a lingua franca, this is far from being an actuality today. In any case, in a European context, it flies in the face of our very ideals; in varietate
concordia ("united in diversity"), the official Latin motto of the EU, illustrates that the many different cultures, traditions and languages in Europe are a hugely positive asset for the continent. In Europe's multilingual setup, all 24 official EU languages are granted equal status by the EU Charter (https://commission.europa.eu/aid-development-cooperation-fundamental-rights/your-rights-eu/eu-charter-fundamental-rights_en, accessed on 24 November 2022) and the various Treaties of the EU (https://en.wikipedia.org/wiki/Treaties_of_the_European_Union, accessed on 12 December 2022). Moreover, the EU is home to 60+ regional and minority languages, which have been protected and promoted under the European Charter for Regional or Minority Languages treaty since 1992 (https://en.m.wikipedia.org/wiki/European_Charter_for_Regional_or_Minority_Languages, accessed on 12 December 2022), in addition to migrant languages and various sign languages, spoken by some 50 million people.

Indeed, the findings of the European Language Equality project (https://european-language-equality.eu/, accessed on 14 December 2022) over the past two years demonstrate a very sorry state of affairs: despite the obvious improvements in language technology since the implementation of methods based on neural networks, language barriers still hamper cross-lingual communication and the free flow of knowledge across borders, and many languages are endangered or on the edge of extinction [2,3]. On a global scale, the situation is far worse, of course, especially for languages that do not have a written tradition [4].

Accordingly, then, translation technology has a vital role to play in addressing these problems, but its capabilities have been hugely overhyped. While in principle, the techniques involved in neural machine translation (NMT) apply to any pair of languages, in practice, demands on data availability restrict these to a small subset of the world's languages, so claims by multinational corporations of "bridging the gap between human and machine translation [quality]" [5] or "human parity" [6] are massively overblown (cf. [7,8] for some of the reasons why).

Given that participants in the deliberative democracy process in different European countries have heretofore been unable to participate in synchronous discussions, one of the aims of the EUComMeet project was to facilitate cross-lingual communication between users by using NMT as a bridge, in two domains of interest across the EU: (i) environment and climate change; and (ii) the economy and inequality. Large datasets are a prerequisite for high-quality NMT, but these are rarely available for most language pairs and application domains. For the two domains of the project, to the best of our knowledge, no parallel dataset was available, at least for the language pairs with which we are concerned. Accordingly, we began our search for high-quality data in two domains that subsumed those we were interested in, namely economics and the environment. Our aim was to automatically extract tuning and test datasets in these domains with which we could adapt systems trained on generic datasets.
We justify this approach in Section 2, while explaining what alternative methods might have been chosen (e.g., building brand new domain-specific engines from scratch, or multilingual models from the outset).

Figure 1 shows the significance of MT systems in multilingual participatory deliberation. The upper part of this figure represents people from different countries wanting to engage in the deliberative democratic process with their peers. They speak to one another in either the economics or environment domain in their own languages. In the likely scenario where they do not understand each other, as they belong to different countries with different native languages, it is crucial to translate specific languages into one or more target languages, so that content expressed in those languages will be comprehensible to all participants and not just those for whom the languages are intelligible. For example, if an Italian person is talking to a German and an Irish person, their speech must be translated into both German and Irish or English so that the (German and Irish) listeners can understand what the speaker is saying. The lower part of Figure 1 shows that MT systems lie at the heart of the whole translation process, facilitating translation from one language to another. To facilitate communication between speakers of the six languages, we needed to build 30 (n(n − 1), with n = 6 languages) NMT models.

The remainder of this paper is organised as follows: Section 2 summarises the related work in this field. We discuss the overall architecture of MT systems in Section 3, followed by the details of the data collection process in Section 4. We describe our experiments in Section 5, and present the results in Section 6. Finally, we conclude our work, together with some possible future research avenues, in Section 7.

Figure 1. The role of MT in multilingual deliberative discussion.

2. NMT Background

There has been a significant improvement in translation quality in the past few years since the advent of neural MT systems. The earliest NMT systems used encoder–decoder recurrent neural networks (RNNs) [9,10], where the encoder mapped the input string into a hidden representation, which was input to the decoder to generate the target translation. This architecture was improved upon by Bahdanau et al. [11], who introduced the concept of "attention", which allowed the decoder to use the most relevant source-sentence words as contextual clues when deciding how to choose the most appropriate target word(s) for words that were translationally ambiguous in the same sentence. Following the introduction of attention in Bahdanau et al. [11], the "Transformer" approach of Vaswani et al. [12] relied solely on attention, while at the same time allowing for the use of much simpler feedforward neural networks. To this day, transformers remain the architecture of choice for a range of NLP tasks, as they have achieved state-of-the-art performance across the board, including in translation, where—especially for high-resource language pairs—researchers have claimed similar performance to human-generated translations [13,14].
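For readers unfamiliar with the architecture, the central operation of the Transformer is the scaled dot-product attention of Vaswani et al. [12], which, for query, key and value matrices Q, K and V with key dimension d_k, can be written as:

$$ \mathrm{Attention}(Q, K, V) = \mathrm{softmax}\!\left(\frac{QK^{\top}}{\sqrt{d_k}}\right)V $$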
We do not propose to go into more detail on these different architectures, all of which have been described elsewhere (see also Koehn [15]). However, we are aware that the topic of this paper is likely to attract readers who have no prior exposure to NMT. Accordingly, while we have attempted to make all technical descriptions understandable to nonexperts, for readers for whom the primary literature might be a little forbidding, very good explanations of the technology have been published for those with a less technical background, or who may be newcomers to the field [16,17].

A range of NMT toolkits have been developed that facilitate the development of high-quality translation engines. These include FAIRSEQ (https://github.com/facebookresearch/fairseq, accessed on 15 May 2022) [18] and Marian (https://marian-nmt.github.io, accessed on 15 May 2022) [19]. Both of these toolkits have many satisfied users, as they have much to commend them. It really comes down to a matter of personal choice, but for the purposes of the experiments in this paper, we instead use OpenNMT (https://opennmt.net, accessed on 15 May 2022) [20], an open-source NMT framework that has been widely adopted in the research community and uses the transformer architecture, to build our 30 EUComMeet NMT systems.

To the best of our knowledge, no prior work exists on the development of multilingual deliberative spaces. However, it is well known in MT that large amounts of high-quality domain-specific data are crucial if good translation outputs are to ensue. For example, a parallel biomedical domain corpus [21] can be useful for building an MT system to translate medical texts. However, we are unaware of parallel corpora available for the economics and environment domains. Accordingly, we immediately ruled out the possibility of building dedicated domain-specific MT systems from scratch for each language direction.

Despite the overly enthusiastic assertion of van der Meer [22] that "massively multilingual MT [is] a single model that can tackle any language pair in the world" (our emphasis; cf. Way [23] for a refutation), it was an option worth considering here. The "massively multilingual" approach of Arivazhagan et al. [24] is a popular choice when the number of languages to be considered is large, but even in cases where only "moderate" amounts of training data are used, the hardware requirements are enormous. Note too that individual MT systems built for specific language pairs have been shown to outperform multilingual models [25,26]. Finally, as our MT systems were to be the intermediary in real-time communication, the models needed to be accurate, but also small enough to ensure rapid translation; users simply could not wait for ostensibly "better" output to ensue if that made communication slow and awkward. We comment more on speed testing in Section 6.3.

Given the imperative that our solution simply had to work, we rejected all the above possibilities for the reasons indicated, and finally opted to tune high-quality baseline engines using smaller domain-specific datasets. For the initial build, we chose Europarl [27], the proceedings of the European Parliament, which was available in all the languages of interest (although in different amounts, cf. Table 1), and is well known in the MT community as a good-quality, general corpus that is sufficiently large for training NMT systems. In addition, given the range of topics dealt with on a daily basis in the European Parliament, it comprises at least some texts from the domains of interest to us, which further justifies its selection.
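The figure of 30 systems mentioned above corresponds to one model per ordered pair of the six project languages. A minimal sketch (the ISO 639-1 language codes are ours, used purely for illustration):

```python
from itertools import permutations

# The six project languages (ISO 639-1 codes).
languages = ["en", "ga", "de", "fr", "it", "pl"]

# One NMT model is needed per ordered (source, target) pair: n * (n - 1) = 30.
directions = list(permutations(languages, 2))
print(len(directions))   # 30
print(directions[:3])    # [('en', 'ga'), ('en', 'de'), ('en', 'fr')]
```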
Table 1. Parallel corpora for English–French.

Corpus Name | Domain | Number of Sentence Pairs in Millions
CCMatrix | Mixed domain | 328.6
WikiMatrix | Mixed domain | 6.6
UNPC | Mixed domain | 22.9
ParaCrawl | Mixed domain | 27.6
EUbookshop | Educational | 10.8
CCAligned | Mixed domain | 15.5
Europarl | Mixed domain | 2.05
TED2020 | TED talks | 0.4
News-Commentary | News | 0.2
QED | Educational | 1.0

Once these baseline engines were built (cf. Section 5.2), we constructed more advanced systems using additional datasets (cf. Section 4.2), as well as corpus filtering (cf. Section 5.3.1) and domain adaptation techniques (cf. Section 5.3.2), which in all cases caused system performance to improve (cf. Section 6.1) to the point where the majority of our models outperformed Google Translate (cf. Section 6.2).

3. Architecture of Platform for Multilingual Deliberative Democracy

In our previous work on the rapid prototyping of high-performing engines capable of translating multilingual COVID-19 content [28], we adopted Web API in order to make the systems freely available. As that had worked well previously—albeit in asynchronous mode—we employed the same method here to construct the outer interface of our MT platform. Web API is an extensible framework for building HTTP-based services that can be accessed by a range of applications on different platforms such as the web, Windows, mobile, etc. It works more or less the same way as an HTTP web server, except that it sends data as a response instead of HTML. As Web API is based on the HTTP protocol, it can be accessed without being subject to firewall restrictions on different local area networks. We built a Web API specification and HTTP request format protocol. Each translation request is formatted as an HTTP request according to the protocol and is sent to the Web API server of our MT platform. After the transaction has been completed, the Web API sends the results formatted as HTTP response data to the user, as illustrated in Figure 2.

Figure 2. Working process of EUComMeet MT platform Web API.

Our MT platform adopts a two-level server architecture, where Level 1 is the Web API server and Level 2 comprises the translation GPU servers. This is illustrated in Figure 3. Level 1, the Web API server, is responsible for accepting HTTP translation requests from the EUComMeet M4D2 platform, and it is where access verification is handled. Here, each user is assigned a unique password-protected username, and to enhance security, we adopt the JSON Web Token (JWT) to verify user access. JWT is an open industry standard used to share information between two entities, usually a client (such as the EUComMeet M4D2 platform, located in Italy) and a server (such as our MT systems, located in Ireland). As the Web API server does not perform translation tasks itself, it can respond to HTTP requests promptly without incurring any delay.

Figure 3. Architecture of EUComMeet MT platform.

Level 1 also performs preprocessing and postprocessing of translation requests. In the preprocessing stage, the text to be translated is first split into individual sentences so that each can be translated in isolation. Then, the sentences are "normalised", which involves separating punctuation marks from words, lower-casing all upper-case forms, escaping some special characters, etc. It performs the same tasks in reverse (i.e., reinserting special characters, truecasing, reattaching punctuation marks to words, etc.) in the postprocessing stage.
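Before turning to Level 2, the following is a minimal, illustrative sketch of the client side of this request/response cycle. The endpoint URL, JSON field names and token value are hypothetical placeholders and do not reflect the actual EUComMeet Web API specification:

```python
import requests

# Hypothetical endpoint and JWT; the real EUComMeet API specification may differ.
API_URL = "https://example-mt-platform.eu/api/translate"
JWT_TOKEN = "<token obtained at login>"

payload = {
    "source_lang": "it",
    "target_langs": ["de", "en"],
    "text": "Il cambiamento climatico è la sfida più urgente.",
}

# The JWT is sent in the Authorization header; the server verifies it before
# dispatching the request to the translation GPU servers.
response = requests.post(
    API_URL,
    json=payload,
    headers={"Authorization": f"Bearer {JWT_TOKEN}"},
    timeout=10,
)
response.raise_for_status()
print(response.json())  # e.g., {"de": "...", "en": "..."}
```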
Level 2 comprises the actual MT GPU servers, which are responsible for translating the specific source-language text into the requested target language. The GPU servers are all equipped with powerful GPU cards, e.g., GeForce GTX 1080, GeForce GTX TITAN X, etc. Our MT platform includes 4 GPU servers, on which all 30 MT models are loaded in line with each server's computing power. As these GPU servers are independent of one another, they can work in parallel, which speeds up the translation tasks considerably. When the Web API receives a translation request, it distributes the translation tasks to each GPU server in parallel and collects the translation results from each one. When all the translation tasks are complete, the Web API packages the results as an HTTP response and sends it back to the user.

The chosen two-level architecture ensures the extensibility and efficiency of the whole system. If we find that 4 GPU servers are not enough to support a large number of synchronous translation requests, more GPU servers can be added to the platform without changing the Web API.

4. Data Collection

As already noted, data collection is the most important stage of developing MT systems. While several freely available corpora exist online for training MT systems in many language pairs—especially for major languages such as English, French and German—not all of them are useful when specific tasks are considered. Accordingly, data collection involves the following two substages: (i) corpus survey and (ii) corpus consideration.

4.1. Corpus Survey

An important part of the experimentation phase is exploring what parallel corpora are available online. The following characteristics are crucial for a parallel corpus: (i) the source- and target-language sentences must be good-quality translations of each other; (ii) the corpus should be large enough that it contains at least a few million good-quality sentence pairs; and (iii) the texts should cover the broad domains of interest (in our case, economics and the environment). A parallel corpus can be considered ideal if it satisfies all of the above requirements, but that is rarely the case. Accordingly, we focused on identifying a target corpus that shared some of these key characteristics. We evaluated a wide range of corpora available on the Internet. OPUS [29] (https://opus.nlpl.eu/, accessed on 6 May 2022), the open parallel corpus, is one of the largest collections of publicly available parallel data from a range of domains for all language pairs of the project. We performed an extensive survey of a number of parallel corpora available in OPUS.

4.2. Corpus Consideration

For ease of exposition, we consider in Table 1 only the English–French pair to show which corpora we manually inspected. Table 1 shows the names of the corpora, their domains and their sizes in terms of the total number of sentence pairs they contain. Despite its extremely large size, we found the "CCMatrix" dataset to be very noisy and not suitable for MT training. Some of the other very large corpora, such as "UNPC" and "ParaCrawl", which contain more than 20 million sentence pairs, are not good enough either. Although it is not a very large corpus compared to some available resources, Europarl contains a sufficient number of parallel sentences (around 2 million) to build good-quality baseline NMT engines (cf. Section 5.2).
In the next phase, we found that some of the smaller corpora, such as "TED2020", "News-Commentary", "QED", etc., are also good-quality corpora, so we concatenated them with the Europarl data and built larger MT models with improved coverage. In addition, we cleaned some of the larger corpora, such as "CCAligned", "WikiMatrix", etc., from which subsets were extracted and concatenated with the already-combined corpora (Section 5.3.3). Our corpus filtering (Section 5.3.1) and domain adaptation (Section 5.3.2) efforts led to better results than the baseline models in all cases, as demonstrated in Section 6.

5. Experiments

MT system development in EUComMeet comprises three steps: (i) corpus preprocessing, (ii) building baseline MT models and (iii) building advanced MT models. To build both the baseline and domain-adapted MT models, we use OpenNMT with the transformer architecture (cf. Section 2). To evaluate the quality of translation, using human experts is preferable, but as translation professionals may be difficult to recruit, developers monitoring the incremental progress of their engines typically use automatic evaluation metrics instead (cf. Way [30] for more on the advantages and disadvantages of human and automatic evaluation). We use sacreBLEU [31] to measure BLEU scores [32]—the most widely used automatic metric in MT—to evaluate our MT outputs. Note that BLEU is a precision-based metric, so higher scores are better; in a nutshell, an MT output is compared against a human reference translation (typically held out from the training data) at the level of n-grams of length 1 to 4 (individual words up to contiguous sequences of at most four words), and the more matches observed, the higher the BLEU score (modulo a brevity penalty, which discourages the output of unduly short translations, such as "The", which would otherwise be an unfair manipulation of the process).

5.1. Corpus Preprocessing

The first step of training an MT model is preprocessing of the dataset. This is crucial to make the dataset as clean as possible for MT system building. The main methods involved in preprocessing are as follows (a small illustrative sketch of the filtering steps is given after the list):
• Filtering out long sentences: Extremely long sentences are deleted because MT systems generally produce low-quality translations when confronted with lengthy input. If either side of the parallel data contains too many words (100 words is set as the limit), the sentence pair is discarded.
• Removing blank lines: Sentence pairs with no content on either side are removed.
• Removing sentence pairs with odd length ratios: Sentence pairs whose translations are markedly longer or shorter than their original sentences are removed because of the probability of them being incorrect translations. The filtering ratio is 1:3 in our case.
• Removing duplicates: All duplicate sentence pairs are discarded.
• Tokenisation: We break down the sentences into their most basic elements, called "tokens" (e.g., transforming "John's" into "John" and "'s"). Tokenisation is particularly relevant because it is the form in which MT models ingest sentences. In practice, most NMT models are fed with subwords as tokens (see next bullet).
• Byte-pair encoding (BPE): Restricted vocabularies (e.g., 30 k or 70 k) are typically used for reasons of computational efficiency. Accordingly, many valid words are out-of-vocabulary items (OOVs). However, most OOVs can be broken down into "subwords" already in the restricted vocabulary. We used BPE [33] to try to help the model infer the meaning of OOVs through their similarity to known words. BPE performs subword segmentation by building a vocabulary using corpus statistics. Firstly, it learns the most frequently occurring sequences of characters and then greedily merges them to obtain new text segments.
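The following is a minimal sketch of the length, ratio and duplicate filters described above; the thresholds mirror those in the text, while the function name and structure are ours, for illustration only:

```python
def clean_parallel_corpus(pairs, max_len=100, max_ratio=3.0):
    """Filter (source, target) sentence pairs along the lines of Section 5.1:

    - drop pairs where either side is empty
    - drop pairs where either side exceeds max_len tokens
    - drop pairs whose length ratio exceeds max_ratio (i.e., 1:3)
    - drop exact duplicates
    """
    seen = set()
    for src, tgt in pairs:
        src_tokens, tgt_tokens = src.split(), tgt.split()
        if not src_tokens or not tgt_tokens:
            continue  # blank line on either side
        if len(src_tokens) > max_len or len(tgt_tokens) > max_len:
            continue  # overly long sentence
        ratio = max(len(src_tokens), len(tgt_tokens)) / min(len(src_tokens), len(tgt_tokens))
        if ratio > max_ratio:
            continue  # implausible length ratio
        if (src, tgt) in seen:
            continue  # duplicate pair
        seen.add((src, tgt))
        yield src, tgt
```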
5.2. Building Baseline MT Models

We built our standard baseline MT systems with the Europarl data using OpenNMT. However, there are no available domain-specific datasets for the areas of interest to the project, not even small datasets usable for tuning and testing. In this early stage of MT system development, we used Europarl as our training dataset and the WMT dev and test datasets (https://www.statmt.org/wmt21/translation-task.html, accessed on 17 May 2022) for only the high-resource language pairs (English–German, English–French and English–Italian), because the amount of data available for other language pairs (e.g., Italian–Polish, German–Irish, etc.) was either too small or nonexistent. For such language pairs, we used a small part of the parallel corpus as (i) devsets for tuning and (ii) test sets for testing. Note that as Europarl had already been used as training data, we could not use it for tuning and testing. Instead, we had to consider some other corpus, such as the TED2020 (https://opus.nlpl.eu/TED2020.php, accessed on 11 May 2022) and News-Commentary (https://opus.nlpl.eu/News-Commentary.php, accessed on 11 May 2022) corpora. However, for the English–Irish pair, there was neither Europarl for training nor TED2020/News-Commentary for testing. Therefore, we considered a major part of the MultiParaCrawl (https://opus.nlpl.eu/MultiParaCrawl.php, accessed on 12 May 2022) corpus for training and the remainder for tuning and testing, respectively.

Data Statistics

• Training data: We used the Europarl corpus as training data for most of the language pairs of the project. For certain language pairs for which Europarl is not available, we use other corpora such as ParaCrawl, MultiParaCrawl, etc. Table 2 shows the statistics of the training data for all language pairs.

Table 2. Statistics of training data for all language pairs.

Language Pair | Corpus Name | #Sentence Pairs
EN–DE | Europarl | 1.95 M
EN–FR | Europarl | 2.05 M
EN–IT | Europarl | 1.94 M
EN–PL | Europarl | 550 K
EN–GA | ParaCrawl | 1.9 M
DE–FR | Europarl | 1.94 M
DE–IT | Europarl | 1.83 M
DE–PL | Europarl | 500 K
DE–GA | MultiParaCrawl | 500 K
FR–IT | Europarl | 1.94 M
FR–PL | Europarl | 521 K
FR–GA | MultiParaCrawl | 550 K
IT–PL | Europarl | 594 K
IT–GA | MultiParaCrawl | 500 K
PL–GA | MultiParaCrawl | 350 K

• Tuning and testing data: During the development of our baseline MT models, we initially considered only the economics domain to evaluate the quality of the MT output. For most language pairs, the dev and test datasets were available online. For other language pairs, such as DE–PL, IT–GA, etc., we considered either a part of the training data (excluded from the training data per se to avoid artificially inflating the evaluation scores) or other corpora such as TED2020. Note that for all language pairs, as was the case with the training data, both the dev and test data belonged to the general or mixed domain. In order to make them suitable for domain-specific tuning and evaluation, we performed a filtering process using relevant key terms so that the filtered dataset was as similar as possible to the economics domain.
Accordingly, we manually prepared a list of key terms for both the economics and environment domains, each of which contains more than 100 key terms for all languages of the project, except Irish (we did not need additional lists of key terms for Irish because there was sufficient domain-specific terminology in the parallel data already). Some example key terms for English from both domains can be seen in Table 3. Note that the key terms in this table were produced during this initial stage of experiments. At a later stage, we extracted an improved version of key terms from Wikipedia containing several hundred key terms in each domain. We will discuss this in the next section.

Table 3. Example key terms from the economics and environment domains.

Economics | job digitalisation, inequality, consumers, economics, unemployment, investment, globalization, inflation
Environment | pollution, sustainable, climate, deforestation, greenhouse, global warming, atmosphere, ecosystem

Now, we will discuss how the dev and test datasets are filtered using the domain-specific key terms. Consider the example sentences from the WMT test dataset in Table 4.

Table 4. Filtering data with domain-specific key terms.

Text | Key Terms Present | Selection
The new freedoms imported from the West are often synonymous with chaos, as it happened in the 1990s. | None |
Food: Where European inflation slipped up | inflation | X
Which role is Putin going to play after March? | None |
He's tall, strong, very capable, he has a great chest and long arms. | None |
Consumers also have had to pay significantly more for vegetable oils and dairy products in 2007. | Consumers | X

Sentences are selected for the dev and test sets only if some domain-specific key terms are present. In Table 4, we can see that the second and the fifth sentences contain such key terms (e.g., 'inflation' and 'Consumers'), so these two sentences were selected as part of our dev or test datasets. Note that this is merely a small fragment of the whole datasets; in total, there were several thousand sentences in the original data, from which several hundred were selected because they contain at least one domain-specific key term (a minimal sketch of this selection step is given below). It is common practice to hold out several hundred sentence pairs for each of the dev and test datasets when available. In our case, the size of the dev and test datasets ranges from 500 to 3000 sentence pairs, depending upon the language pair and overall dataset size.
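The sketch below illustrates this key-term-based selection; the term list is a truncated subset of Table 3 and the function name is ours, used purely for illustration:

```python
# Illustrative subset of the English economics key terms from Table 3.
ECONOMICS_TERMS = {"inflation", "consumers", "unemployment", "investment", "inequality"}

def select_in_domain(sentence_pairs, key_terms):
    """Keep only sentence pairs whose English side contains at least one key term."""
    key_terms = {term.lower() for term in key_terms}
    selected = []
    for en_sentence, foreign_sentence in sentence_pairs:
        tokens = {token.strip(".,:;!?").lower() for token in en_sentence.split()}
        if tokens & key_terms:
            selected.append((en_sentence, foreign_sentence))
    return selected

pairs = [
    ("Food: Where European inflation slipped up", "..."),
    ("Which role is Putin going to play after March?", "..."),
]
print(len(select_in_domain(pairs, ECONOMICS_TERMS)))  # 1: only the 'inflation' sentence
```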
5.3. Building Advanced MT Models

In the previous section, we discussed the development of baseline models with a single corpus. The next step was to build advanced models using other methodologies and tools. We used the following approaches: (i) corpus filtering, (ii) domain adaptation and (iii) corpus combination. Using such additional methods, we hoped to increase performance.

5.3.1. Corpus Filtering

Although we discussed some corpus filtering techniques in Section 5.1, they are standard techniques that cannot filter a corpus based on, for example, semantic similarity. In other words, they cannot discard sentence pairs that are less likely to be translations of each other, or sentence pairs that would be considered to be noisy training data. If such noise exists to a large extent in the training data, it is likely to produce a poor-quality MT model. Therefore, it is crucial to discard less semantically similar sentence pairs and retain only the good-quality ones. Accordingly, we applied a very widely used tool called "LASER" [34] (https://github.com/yannvgn/laserembeddings, accessed on 14 July 2022), which is capable of transforming sentences into language-independent vectors. Similar sentences are mapped to close vectors (in terms of cosine distance), regardless of the input language. LASER is very popular among researchers who work on bilingual text mining, as it supports 93 languages, including those of interest to us, and can effectively be used for measuring sentence similarity. Only those sentence pairs that have a similarity score greater than a threshold of 0.6, the value empirically determined by the authors of the original paper to be the best for this task, are maintained in our training corpus.

5.3.2. Domain Adaptation

Next, we further expanded our domain-specific key terms with the help of Wikipedia. Previously, our key terms list consisted of fewer than 200 key terms. In an extension to the previous experiment, we considered all the key terms from Wikipedia for each domain. We refer to these expanded lists as EcoTerms (https://en.wikipedia.org/wiki/Glossary_of_economics, accessed on 20 August 2022) and EnvTerms (https://en.wikipedia.org/wiki/Glossary_of_environmental_science, accessed on 20 August 2022). These lists contain a wide range of key terms, including those mentioned earlier. Table 5 shows some new examples, including the previous terms.

Table 5. Enhanced version of domain-specific key terms (EcoTerms and EnvTerms).

Domain | Total Number of Key Terms | Examples
Economics (EcoTerms) | 557 | job digitalisation, inequality, consumers, economics, unemployment, investment, globalization, inflation, Accounting reform, Aggregate supply, Experimental economics, Financial astrology, Intertemporal equilibrium . . .
Environment (EnvTerms) | 1368 | pollution, sustainable, climate, deforestation, greenhouse, global warming, atmosphere, ecosystem, Agricultural biodiversity, Biosecurity protocol, Environmental determinism, Ecopsychology, Integrated geography . . .

We can see from Table 5 that several new, more specialised terms (the later items in each list of examples) have been added to each domain over and above the common terms previously considered. The enhanced list of domain-specific key terms allows us to generate more relevant domain-related dev and test sets. Previously, we considered only the economics domain on which the baseline model was tuned and tested. In this stage, we consider those sentence pairs that contain key terms from both domains in order to form new dev and test datasets, which we used to tune and test our new models. As the new dev and test data are more relevant than their previous versions, the results of the MT evaluation in this stage are more reliable than our previous evaluation, as will be shown in Section 6, where performance improved for all language pairs.

5.3.3. Corpus Combination

In this approach, we combined multiple corpora to generate larger training sets than those used for baseline MT training. Initially, we filtered the corpora using the LASER tool and then combined them to ensure the quality of the combined corpus. The corpora we considered for combination were TED2020, QED, CCAligned, WikiMatrix, News-Commentary and EUBookShop, all of which were downloaded from OPUS. Note that, as stated previously, these corpora are not very suitable for training "as is", so we filtered them using LASER to make them cleaner.
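The sketch below illustrates this kind of LASER-based similarity filtering using the laserembeddings package linked above. It is a simplified example: it assumes the embed_sentences interface documented for that package (and that the LASER model files have been downloaded), while the helper function itself is ours; the 0.6 threshold follows the text.

```python
import numpy as np
from laserembeddings import Laser

# Note: the LASER model files must be downloaded first (see the package documentation).
laser = Laser()

def filter_by_similarity(pairs, src_lang, tgt_lang, threshold=0.6):
    """Keep sentence pairs whose LASER embeddings have cosine similarity above threshold."""
    src_vecs = laser.embed_sentences([s for s, _ in pairs], lang=src_lang)
    tgt_vecs = laser.embed_sentences([t for _, t in pairs], lang=tgt_lang)
    kept = []
    for (src, tgt), u, v in zip(pairs, src_vecs, tgt_vecs):
        cosine = float(np.dot(u, v) / (np.linalg.norm(u) * np.linalg.norm(v)))
        if cosine > threshold:
            kept.append((src, tgt))
    return kept
```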
This corpus combination approach is time-consuming, as it is based on several trials with different corpus combinations. This is an ongoing process, and out of a total of 15 language pairs (i.e., ignoring the translation direction, as the training data—but not the dev and test data—are the same for both German-to-English and English-to-German), we have so far performed corpus combination for eight. The remaining seven pairs are EN–GA, DE–GA, FR–PL, FR–GA, IT–PL, IT–GA and PL–GA; note that five out of these seven pairs involve the Irish language (GA). As of now, we prioritise corpus combination for the other two language pairs (FR–PL and IT–PL); given our local expertise and in-house crawled datasets [35,36], all MT models involving Irish already produce high BLEU scores, and so there is less room for improvement. However, we plan to perform corpus combination for the Irish language once all the other languages are covered. Table 6 shows the amount of training data after corpus combination (including corpus filtering) for the eight language pairs for which this has been completed. As a comparison, we also provide the size of the training data for the baseline MT systems. Corpus combination and filtering more than doubles the amount of training data for most language pairs, but even in the worst case (DE–FR), the size of the training data increases by 44%.

Table 6. Training data size after corpus combination.

Language Pair | Size of Training Data in Baseline MT Systems (#Sentence Pairs) | Size of Training Data in Advanced MT Systems (#Sentence Pairs)
DE–EN | 1.95 M | 6.86 M
FR–EN | 2.05 M | 3.37 M
IT–EN | 1.94 M | 4.7 M
PL–EN | 550 K | 1.17 M
DE–FR | 1.94 M | 2.8 M
DE–IT | 1.83 M | 3.24 M
DE–PL | 500 K | 1.44 M
FR–IT | 1.94 M | 3.28 M

5.3.4. Architectural Comparison of Baseline vs. Domain-Adapted Systems

We demonstrate the differences between the training processes of the baseline and the advanced MT systems in Figures 4 and 5.

Figure 4. Architecture of baseline MT system training.

Figure 5. Architecture of domain-adapted MT system training.

The training of the baseline engines shown in Figure 4 comprises a series of simple steps. Firstly, the parallel corpus passes through the preprocessing stage that contains the sequence of simple steps described in Section 5.1, after which it is used to train the baseline models. The training of the advanced models shown in Figure 5 comprises a few additional steps in the initial phase. Firstly, multiple corpora are taken into account. Secondly, each of these corpora is filtered using the LASER tool. Subsequently, the filtered corpora are combined into a single corpus. After that, the process is the same as for the training of the baselines.

6. Results

In this section, we discuss in detail the results obtained by all of the translation models. The performance of the baseline, domain-adapted and Google Translate systems was evaluated on test data from both the economics and environment domains. We also give some indication of the speed of our models, and initial feedback from user testing of the MT systems to facilitate multilingual discussions in deliberative democracy forums.
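All BLEU scores reported below were computed with the sacreBLEU package mentioned in Section 5. A minimal usage sketch (the example strings are ours, used purely for illustration):

```python
import sacrebleu

# One system output per test sentence, and one reference stream aligned with it.
hypotheses = ["The economy is growing slowly this year."]
references = [["The economy is growing slowly this year."]]

bleu = sacrebleu.corpus_bleu(hypotheses, references)
print(round(bleu.score, 2))  # 100.0 for a perfect match with the reference
```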
6.1. Baseline vs. Domain-Adapted MT System Performance

Firstly, we show the comparison between the baseline and domain-adapted systems in Table 7. The BLEU scores for the domain-adapted models (column 4) all surpass the scores of the corresponding baseline engine (column 3). Columns 5 and 6 show the actual improvement in BLEU score and the relative improvement as a percentage, respectively. All relative improvements of over 10% are marked with an asterisk in the last column. Two of the largest improvements are seen for German as a target language (from English and French). This is noteworthy, as German is typically regarded as a difficult language to translate, owing to its relatively free word order and rich morphology. The highest improvement is seen for German–Irish, with a huge 23.39% relative improvement, despite this being a difficult language pair.

Table 7. BLEU score comparison for all models. Higher scores are indicative of better-quality translations.

Source Language | Target Language | Baseline System | Domain-Adapted System | BLEU Point Improvement | % Relative Improvement
English | German | 23.7 | 27.68 | 3.98 | 16.79% *
English | French | 31.12 | 31.58 | 0.46 | 1.47%
English | Italian | 34.18 | 35.58 | 1.5 | 4.38%
English | Polish | 30.04 | 30.25 | 0.21 | 0.69%
English | Irish | 46.55 | 50.99 | 4.44 | 9.53%
German | English | 26.67 | 27.67 | 1.0 | 3.74%
German | French | 23.33 | 23.8 | 0.47 | 2.01%
German | Italian | 22.37 | 24.59 | 2.22 | 9.92%
German | Polish | 22.55 | 24.39 | 1.82 | 8.07%
German | Irish | 36.24 | 44.72 | 8.48 | 23.39% *
French | German | 19.11 | 22.85 | 3.74 | 19.57% *
French | English | 29.39 | 31.02 | 1.63 | 5.54%
French | Italian | 24.56 | 26.97 | 2.41 | 9.81%
French | Polish | 10.01 | 10.31 | 0.3 | 2.99%
French | Irish | 44.69 | 45.16 | 0.47 | 1.05%
Italian | German | 19.39 | 19.50 | 0.11 | 0.56%
Italian | French | 26.65 | 26.89 | 0.24 | 0.9%
Italian | English | 34.62 | 35.09 | 0.47 | 1.35%
Italian | Polish | 9.65 | 13.16 | 3.51 | 3.63%
Italian | Irish | 48.20 | 48.42 | 0.22 | 0.45%
Polish | German | 24.16 | 26.13 | 1.97 | 8.15%
Polish | French | 14.19 | 14.29 | 0.1 | 0.7%
Polish | Italian | 11.32 | 11.66 | 0.34 | 3.0%
Polish | English | 38.47 | 40.54 | 2.07 | 5.38%
Polish | Irish | 38.90 | 43.61 | 4.71 | 12.1% *
Irish | German | 43.98 | 44.79 | 0.81 | 1.84%
Irish | French | 47.19 | 47.65 | 0.46 | 0.97%
Irish | Italian | 48.15 | 48.26 | 0.11 | 0.22%
Irish | Polish | 37.69 | 39.94 | 2.25 | 5.96%
Irish | English | 58.34 | 62.91 | 4.57 | 7.83%

6.2. Google Translate vs. Domain-Adapted MT System Performance

We compare our domain-adapted systems and Google Translate in Table 8. The last column shows the BLEU-score differences between our domain-adapted systems and Google Translate. The values in red indicate where our MT systems perform worse (preceded by the "