Readability Indices Do Not Say It All on a Text Readability
Matricciani, Emilio
2023-03-30 00:00:00
Article

Emilio Matricciani
Dipartimento di Elettronica, Informazione e Bioingegneria (DEIB), Politecnico di Milano, 20133 Milan, Italy; emilio.matricciani@polimi.it

Abstract: We propose a universal readability index, $G_U$, applicable to any alphabetical language and related to cognitive psychology, the theory of communication, phonics and linguistics. This index also considers readers’ short-term-memory processing capacity, here modeled by the word interval $I_P$, namely, the number of words between two interpunctions. No current readability formula considers $I_P$, but scatterplots of $I_P$ versus a readability index show that texts with the same readability index can have very different $I_P$, ranging from 4 to 9, practically Miller’s range, which refers to 95% of readers. It is unlikely that $I_P$ has no impact on reading difficulty. The examples shown are taken from Italian and English literature, and from translations of the New Testament in Latin and in contemporary languages. We also propose an extremely compact formula relating the capacity of human short-term memory to the difficulty of reading a text. It should synthetically model human reading difficulty, a kind of “footprint” of humans. However, further experimental and multidisciplinary work is necessary to confirm our conjecture about the dependence of a readability index on a reader’s short-term-memory capacity.

Keywords: alphabetical languages; ARI; English literature; Flesch Reading Ease Index; GULPEASE; human footprint; Italian literature; Miller’s Law; short-term capacity; universal readability index; word interval

Citation: Matricciani, E. Readability Indices Do Not Say It All on a Text Readability. Analytics 2023, 2, 296–314. https://doi.org/10.3390/analytics2020016
Academic Editor: Jong-Min Kim
Received: 2 February 2023; Revised: 2 March 2023; Accepted: 21 March 2023; Published: 30 March 2023
Copyright: © 2023 by the author. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

1. Introduction

First developed in the United States [1–9], readability formulae are applicable to any alphabetical language. They are based on the length of words and sentences, and therefore they allow different texts to be compared automatically and objectively, to assess the difficulty that readers may find in reading them. From the point of view of the writer, a readability formula allows the design of the best possible match between readers and texts. Many readability formulae have been proposed for English [6], and only some for a very few other languages [10].

In Reference [11] we have defined a global readability formula applicable to any alphabetical language, based on a calque of the readability formula used in Italian [12], both to provide a formula for languages that have none and to estimate, on common grounds, the readability of texts belonging to different languages/translations.

In fact, because an “absolute” readability formula (i.e., a formula that provides numerical indices related to a universal origin, such as “zero”) might not exist at all, the readability formula proposed in Reference [11] can be used to compare different texts, because what counts, in this comparison, is the difference between numerical values. In other words, differences give more insight than absolute values for the purpose of comparing texts [11].

As the title of this article claims, however, no current readability formula says everything about the readability of a text, because it neglects the response of readers’ short-term memory to the partial stimuli contained in a sentence, i.e., to how the words of a sentence are punctuated, a process described by the word interval $I_P$ [13]. All readability formulae neglect, in fact, the empirical connection between the short-term memory capacity of readers (approximately described by Miller’s $7 \pm 2$ law [14]) and the word interval $I_P$, a connection which appears, at least empirically, justified and natural [11,13,15–17].

The purpose of this article is to propose a universal readability formula, applicable to any alphabetical language, which includes the effect of short-term memory capacity. We base this formula on the global readability formula defined in Reference [11], which we will modify by including the word interval $I_P$.

After this Introduction, Section 2 revisits the classical readability formula of Italian and its relationship with the Flesch Reading Ease Index and the Automated Readability Index, largely used in English texts; Section 3 summarizes the relationship between the word interval (number of words between two interpunctions, modeling the short-term memory capacity [13]) and the number of words per sentence, with examples drawn from Italian [13] and English literature [17]; Section 4 defines and discusses our proposal of a universal readability index; Section 5 proposes a synthetic readability index of humans, a kind of “footprint” that links human short-term memory to reading difficulty; finally, Section 6 draws a conclusion and suggests future work.

2. A Readability Formula for Alphabetical Languages

The observation that differences are more important than absolute values in using readability formulae [13] justifies the development of a readability formula that can be used to compare texts, even those written in different languages [15]. For most languages, in fact, no readability formula has been defined, and only a few adapt English formulae to their texts [10,18].
The proposed formula, of course, does not exclude using other readability formulae specifically devised for a language (e.g., the large choice for English [4,6]), but it allows the comparison, on the same ground, of the readability of texts written in any language and in translation. For this purpose, we have proposed in Reference [11] to adopt, as a reference, the readability formula developed for Italian, known by the acronym GULPEASE [12]:

$$G = 89 - 10\,C_P + \frac{300}{P_F} \qquad (1)$$

In Equation (1), $C_P$ is the number of characters per word, and $P_F$ is the number of words per sentence. Notice that, like all readability formulae, Equation (1) does not contain any reference to interpunctions (besides, of course, full stops, question marks and exclamation marks, which determine the length of sentences), and therefore it does not consider the parameter very likely linked to the short-term memory capacity, namely the word interval $I_P$ [13].

$G$ can be interpreted as a readability index by considering the number of years of school attended in Italy’s school system (see Reference [12]), as shown in Figure 1. The larger $G$, the more readable the text for any number of school years. The continuous lines shown in Figure 1 divide the quadrant into areas of the same performance of texts, such as “almost unintelligible”, “very difficult”, etc. For example, the area labelled “easy” indicates all combinations of values of $G$ and school years required to declare a text “easy” to read. In all cases, as the number of school years of the reader increases, the readability index that he or she can tolerate decreases.

In Reference [11] we have shown, for Italian literature, that the term $10\,C_P$ varies very little from text to text and across seven centuries, while the term $300/P_F$ varies very much and, in practice, determines the value of the readability index. Equation (1) says that a text is more difficult to read if $P_F$ is large, i.e., if sentences are long, and if $C_P$ is large, i.e., if words are long.
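The dependence of Equation (1) on its two parameters can be illustrated with a minimal Python sketch. It assumes that the averages $C_P$ and $P_F$ of a text have already been measured; the function name `gulpease` and the sample values are hypothetical and are not taken from Reference [12] or from the data of this article.

```python
def gulpease(chars_per_word: float, words_per_sentence: float) -> float:
    """Equation (1): G = 89 - 10*C_P + 300/P_F.

    chars_per_word     -- C_P, average number of characters per word
    words_per_sentence -- P_F, average number of words per sentence
    """
    if words_per_sentence <= 0:
        raise ValueError("words_per_sentence must be positive")
    return 89.0 - 10.0 * chars_per_word + 300.0 / words_per_sentence


if __name__ == "__main__":
    # Illustrative values only (assumed, not measured on any real text):
    # fixed word length, increasingly long sentences.
    for p_f in (10, 20, 40):
        print(f"C_P=4.5, P_F={p_f}: G={gulpease(4.5, p_f):.2f}")
```

With the word length held fixed, the index falls as sentences get longer, because only the term $300/P_F$ changes.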
In other words, a text is easier to read if it contains short words and short sentences, a result that is predicted by any known readability formula and should be true, of course, in any language.

Figure 1. Readability index, $G$, of Italian (GULPEASE, see Reference [12]), as a function of the number of school years attended in Italy. The continuous lines divide the quadrant into areas of the same performance of texts. Elementary school lasts 5 years, junior high school lasts 3 years, and high school lasts 5 years. Children stay at school till they are 19 years old. For comparison, the green vertical axis on the right refers to the Flesch Reading Ease index.

In Reference [11], we have proposed the adoption of Equation (1) also for other languages, such as those listed in Table 1, by scaling the constant 10 according to the ratio between the average number of characters per word in Italian, $\langle C_{P,ITA} \rangle = 4.48$, and the average number of characters per word in another language, e.g., $\langle C_{P,ENG} \rangle = 4.24$ for English. The rationale for this choice is that $C_P$ is a parameter typical of a language which, if not scaled, would bias $G$ without really quantifying the change in reading difficulty of readers, who are surely accustomed to reading, in their language, shorter or longer words, on average, than those found in Italian. This scaling, therefore, avoids changing $G$ for the only reason that a language has, on average, words shorter or longer than Italian. In any case, as recalled above, $C_P$ affects a readability formula much less than $P_F$ [13].

Table 1. Values of $\langle C_P \rangle$ and $k$ of Equations (2) and (3) in the New Testament texts in the indicated languages. Languages are listed according to their language family (see Reference [11]).

| Language | Language Family | $\langle C_P \rangle$ | $k$ |
|---|---|---|---|
| Greek | Hellenic | 4.86 | 0.92 |
| Latin | Italic | 5.16 | 0.87 |
| Esperanto | Constructed | 4.43 | 1.01 |
| French | Romance | 4.20 | 1.07 |
| Italian | Romance | 4.48 | 1.00 |
| Portuguese | Romance | 4.43 | 1.01 |
| Romanian | Romance | 4.34 | 1.03 |
| Spanish | Romance | 4.30 | 1.04 |
| Danish | Germanic | 4.14 | 1.08 |
| English | Germanic | 4.24 | 1.06 |
| Finnish | Germanic | 5.90 | 0.76 |
| German | Germanic | 4.68 | 0.96 |
| Icelandic | Germanic | 4.34 | 1.03 |
| Norwegian | Germanic | 4.08 | 1.10 |
| Swedish | Germanic | 4.23 | 1.06 |
| Bulgarian | Balto-Slavic | 4.41 | 1.02 |
| Czech | Balto-Slavic | 4.51 | 0.99 |
| Croatian | Balto-Slavic | 4.39 | 1.02 |
| Polish | Balto-Slavic | 5.10 | 0.88 |
| Russian | Balto-Slavic | 4.67 | 0.96 |
| Serbian | Balto-Slavic | 4.24 | 1.06 |
| Slovak | Balto-Slavic | 4.65 | 0.96 |
| Ukrainian | Balto-Slavic | 4.56 | 0.98 |
| Estonian | Uralic | 4.89 | 0.92 |
| Hungarian | Uralic | 5.31 | 0.84 |
| Albanian | Albanian | 4.07 | 1.10 |
| Armenian | Armenian | 4.75 | 0.94 |
| Welsh | Celtic | 4.04 | 1.11 |
| Basque | Isolate | 6.22 | 0.72 |
| Hebrew | Semitic | 4.22 | 1.06 |
| Cebuano | Austronesian | 4.65 | 0.96 |
| Tagalog | Austronesian | 4.83 | 0.93 |
| Chichewa | Niger-Congo | 6.08 | 0.74 |
| Luganda | Niger-Congo | 6.23 | 0.72 |
| Somali | Afro-Asiatic | 5.32 | 0.84 |
| Haitian | French Creole | 3.37 | 1.33 |
| Nahuatl | Uto-Aztecan | 6.71 | 0.67 |

On the other hand, we have maintained the constant 300 because $P_F$ depends significantly on the author’s style [13,15], not on language. Finally, notice that the constant 89 sets just the absolute ordinate scale, and therefore it has no impact on comparisons [13].

In conclusion, in Reference [11] we have defined a global readability index applicable to texts written in a language as:

$$G = 89 - 10\,k\,C_P + \frac{300}{P_F} \qquad (2)$$

with

$$k = \frac{\langle C_{P,ITA} \rangle}{\langle C_P \rangle} \qquad (3)$$

By using Equations (2) and (3), we force the average value of the term $10\,k\,C_P$ of any language to be equal to that found in Italian, namely $10 \times 4.48$. Table 1 reports, for Greek, Latin and 35 contemporary languages, the average values of $C_P$ [11] and the calculated values of the constant $k$ of Equation (3).
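As a minimal sketch of how Equations (2) and (3) work together, the Python fragment below computes $k$ from a language average $\langle C_P \rangle$ and then the global index $G$. The function names are hypothetical; the language averages are copied from Table 1, and the Italian average 4.48 is the reference value stated in the text.

```python
C_P_ITA = 4.48  # average characters per word in Italian (Table 1)


def scaling_factor(avg_chars_per_word: float, avg_italian: float = C_P_ITA) -> float:
    """Equation (3): k = <C_P,ITA> / <C_P> for a given language."""
    if avg_chars_per_word <= 0:
        raise ValueError("avg_chars_per_word must be positive")
    return avg_italian / avg_chars_per_word


def global_index(chars_per_word: float, words_per_sentence: float, k: float) -> float:
    """Equation (2): G = 89 - 10*k*C_P + 300/P_F."""
    if words_per_sentence <= 0:
        raise ValueError("words_per_sentence must be positive")
    return 89.0 - 10.0 * k * chars_per_word + 300.0 / words_per_sentence


if __name__ == "__main__":
    # Language averages <C_P> copied from Table 1.
    averages = {"English": 4.24, "Nahuatl": 6.71, "Haitian": 3.37}
    for language, avg in averages.items():
        k = scaling_factor(avg)
        # A text whose C_P equals the language average gets 10*k*C_P = 44.8,
        # the Italian value, whatever the language.
        g = global_index(avg, 20.0, k)  # P_F = 20 is an assumed sample value
        print(f"{language}: k={k:.2f}, G={g:.2f}")
```

Rounded to two decimals, the computed values of $k$ agree with those listed in Table 1. Because the scaling cancels the language-dependent word length, a text with average word length and the same $P_F$ receives the same $G$ in every language.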
For example, for English texts, $C_P$ of a sample text is multiplied by 10.6, instead of 10; for Nahuatl (longer words), $C_P$ is multiplied by 6.7, and for Haitian (shorter words) by 13.3. Notice that $k$ seems to be a stable factor. For example, in the sample of the English literature studied in Reference [17], we have found $\langle C_{P,ENG} \rangle = 4.23$ (instead of the 4.24 of Table 1). Now, because the value found in the Italian literature [13] is $\langle C_{P,ITA} \rangle = 4.67$, we obtain $k = 4.67/4.23 = 1.10$, instead of the $k = 4.48/4.24 = 1.06$ of Table 1.

As recalled above, all readability formulae substantially tell the same story, and therefore they should be very similar, and it is very likely that any one of them can be obtained from another. We illustrate this fact with an example. Because English is the language that has more readability formulae than any other language, let us compare $G$ to the most classical English readability formula proposed and amply discussed by Flesch [1,2], known as the Flesch Reading Ease ($RE$) formula:

$$RE = 206.8 - 1.015\,w - 84.6\,s \qquad (4)$$

In Equation (4), $w$ is the average number of words per sentence, and $s$ is the average number of syllables per word. Because the number of characters per word is, on average, proportional to the number of syllables per word, the parameter $s$ parallels $C_P$ and, of course, $w = P_F$.

How Equation (4) quantifies the degree of difficulty was defined by Flesch himself [1,2], and its values are reported in the vertical scale of Figure 1 (right ordinate scale), for comparison with $G$ (left ordinate scale). Figure 2 shows the scatterplot between the values calculated with the global readability index $G$, Equation (2), versus those calculated with $RE$, Equation (4), according to WinWord, in novels from English literature [17], Table 2.

Figure 2. Flesch Reading Ease ($RE$) index, Equation (4), versus the global index $G$, Equation (2), for the novels of English literature listed in Table 2. Robinson Crusoe, cyan “o”; Pride and Prejudice, black “o”; Vanity Fair, blue “o”; Alice’s Adventures in Wonderland, magenta “o”; Treasure Island, green “o”; Adventures of Huckleberry Finn, red “+”; Peter Pan, blue “+”; The Sun Also Rises, green “+”; A Farewell to Arms, black “+”.

Table 2. Novels from English literature. Deep-language parameters $C_P$, $P_F$, $I_P$, global readability index $G$ and universal readability index $G_U$, the latter discussed in Section 4. Novels are listed according to the year of publication.

| Literary Work | $C_P$ | $P_F$ | $I_P$ | $G$ | $G_U$ |
|---|---|---|---|---|---|
| Matthew, King James translation (1611) | 4.27 | 23.51 | 5.91 | 55.14 | 55.86 |
| Robinson Crusoe (D. Defoe, 1719) | 3.94 | 57.75 | 7.12 | 50.84 | 42.22 |
| Pride and Prejudice (J. Austen, 1813) | 4.40 | 24.86 | 7.16 | 52.79 | 43.89 |
| Wuthering Heights (E. Brontë, 1845–1846) | 4.27 | 25.82 | 5.97 | 53.65 | 53.89 |
| Vanity Fair (W. Thackeray, 1847–1848) | 4.63 | 25.74 | 6.73 | 49.75 | 44.10 |
| David Copperfield (C. Dickens, 1849–1850) | 4.04 | 24.40 | 5.61 | 56.68 | 59.66 |
| Moby Dick (H. Melville, 1851) | 4.52 | 31.18 | 6.45 | 49.11 | 45.66 |
| The Mill on The Floss (G. Eliot, 1860) | 4.29 | 28.03 | 7.09 | 52.70 | 44.32 |
| Alice’s Adventures in Wonderland (L. Carroll, 1865) | 3.96 | 30.92 | 5.79 | 56.14 | 57.76 |
| Little Women (L.M. Alcott, 1868–1869) | 4.18 | 21.08 | 6.30 | 57.31 | 54.99 |
| Treasure Island (R.L. Stevenson, 1881–1882) | 4.02 | 21.89 | 6.05 | 58.78 | 58.39 |
| Adventures of Huckleberry Finn (M. Twain, 1884) | 3.85 | 24.89 | 6.63 | 59.01 | 54.14 |
| Three Men in a Boat (J.K. Jerome, 1889) | 4.25 | 13.71 | 6.14 | 64.19 | 63.13 |
| The Picture of Dorian Gray (O. Wilde, 1890) | 4.19 | 16.56 | 6.29 | 62.83 | 60.58 |
| The Jungle Book (R. Kipling, 1894) | 4.11 | 21.52 | 7.15 | 57.95 | 49.14 |
| The War of the Worlds (H.G. Wells, 1897) | 4.38 | 20.85 | 7.67 | 55.31 | 42.48 |
| The Wonderful Wizard of Oz (L.F. Baum, 1900) | 4.02 | 20.55 | 7.63 | 59.38 | 46.85 |
| The Hound of The Baskervilles (A.C. Doyle, 1901–1902) | 4.15 | 17.79 | 7.83 | 60.27 | 46.16 |
| Peter Pan (J.M. Barrie, 1902) | 4.12 | 18.20 | 6.35 | 60.53 | 57.85 |
| A Little Princess (F.H. Burnett, 1902–1905) | 4.18 | 16.38 | 6.80 | 61.57 | 55.45 |
| Martin Eden (J. London, 1908–1909) | 4.32 | 16.94 | 6.76 | 59.38 | 53.50 |
| Women in Love (D.H. Lawrence, 1920) | 4.26 | 13.71 | 5.22 | 63.98 | 70.02 |
| The Secret Adversary (A. Christie, 1922) | 4.28 | 11.02 | 5.52 | 69.08 | 72.76 |
| The Sun Also Rises (E. Hemingway, 1926) | 3.92 | 10.70 | 6.02 | 72.58 | 72.45 |
| A Farewell to Arms (E. Hemingway, 1929) | 3.94 | 10.12 | 6.80 | 73.17 | 66.99 |
| Of Mice and Men (J. Steinbeck, 1937) | 4.02 | 9.67 | 5.61 | 74.20 | 77.24 |

We can notice a fair agreement between the two indices, with a correlation coefficient of 0.850. The bias could be compensated by downscaling $RE$. The attribution of the grade level $GL$ in the USA school system was defined by Kincaid et al. [3], by using the same parameters $w$ and $s$. The grade level is similar to that attributed to $G$.

Another readability formula, the Automated Readability Index ($ARI$), was also defined by Kincaid et al. for specific military documents [3]. It is fully related to $G$ because it depends on the same parameters, $C_P$ and $P_F$:

$$ARI = 4.71\,C_P + 0.5\,P_F - 21.43 \qquad (5)$$

As $ARI$ increases, the age required of readers increases too. Figure 3 shows the scatterplot between the global $G$, Equation (2), and $ARI$, for the same English novels considered in Figure 2. We can see a very tight relationship for fixed $C_P$.

In conclusion, the global readability formula, Equation (2), provides a readability index that can be directly scaled to $ARI$ and approximately also to $RE$. For this reason, we continue studying $G$, which we will modify by introducing the word interval $I_P$ to obtain the universal readability formula/index mentioned above. To do so we need to recall, in the next section, some fundamental knowledge on $I_P$.

Figure 3. Automated Readability Index ($ARI$), Equation (5), versus the global index $G$, Equation (2), for the novels of English literature listed in Table 2. The continuous lines assume constant values of $C_P$. Robinson Crusoe, cyan “o”; Pride and Prejudice, black “o”; Vanity Fair, blue “o”; Alice’s Adventures in Wonderland, magenta “o”; Treasure Island, green “o”; Adventures of Huckleberry Finn, red “+”; Peter Pan, blue “+”; The Sun Also Rises, green “+”; A Farewell to Arms, black “+”.

3. Word Interval and Short-Term Memory

As we have discussed in References [11,13,15], the word interval $I_P$, namely the number of words per interpunction, varies in the same range as the short-term memory capacity given by Miller’s $7 \pm 2$ law [14], a range that includes 95% of all cases. Very likely the two ranges are deeply related, because interpunctions organize small portions of more complex arguments (which make up a sentence) in short chunks of text, which are the natural input to short-term memory [19–27]. Moreover, $I_P$, drawn against the number of words per sentence, $P_F$, tends to approach a horizontal asymptote as $P_F$ increases, and this occurs both in ancient classical languages (Greek and Latin) and in contemporary languages, as shown in References [11,13] by studying translations of the New Testament books from Greek. In other words, even if sentences get longer, $I_P$ cannot get larger than about the upper limit of Miller’s law (namely 9), because of the constraints imposed by the short-term memory capacity of readers, and of writers as well.

The average value of $I_P$ can be empirically related to the average value of $P_F$ according to the non-linear relationship [13]:

(<P >