Brief thoughts on inefficient writing systems

by Srdjan Miletic, 2 min read, 29th Jul 2021, 36 comments

Practical, World Optimization
Frontpage

A phonemic language is one where how you write is very similar to how things sound. Essentially, if you know how to say something and you know the alphabet, then you can write or read it. Serbian is an example of a phonemic language. A non-phonemic language is one where words are not written how they sound. English is a non-phonemic language. Non-phonemic languages are harder to acquire literacy in: children learning to read and write in the UK have to learn a host of complex rules, exceptions and individual word spellings. Finally, a logographic language is one where the written symbols don't represent sounds. Instead, symbols usually represent words or ideas. Chinese is such a language. If you want to learn to read and write it to any reasonable level, you'll need to know at least 3000 characters.

The continued existence of non-phonemic writing systems is a suboptimal equilibrium. Such writing systems make learning to be literate in a language far harder than it needs to be. They also probably have very serious costs. They affect hundreds of millions of people and, given that even in first world countries between 10% and 25% of adults are functionally illiterate (remember that your bubble is strong), their costs in terms of lost productivity and individual flourishing are likely large. Still, I've never heard anyone talk about language reform. Why is that?

One theory is elite blindness. Almost everyone in the elite is fully literate. You don't get to be a journalist/professor/union leader/politician/NGO worker if you're not literate or have a low IQ (which strongly correlates with literacy). Because our personal bubbles tend to be incredibly strong, most elites are surrounded by people similar to them, or at most a standard deviation or two away in terms of IQ. Hence most people with power will not only never struggle with literacy themselves but will also never meaningfully interact with the 10-20% of society which does. The problem is essentially invisible to everyone who matters.

Another theory is sunk costs. Even if we did recognize the problem with bad writing systems, there's a massive sunk cost in terms of how many people use the current language and how hard it is to shift that usage to a different form. It would take a national campaign, a large amount of resources and a huge political effort. The problem seems intractable, hence no one bothers to bring it up.

A third theory is that it's not controversial/sexy. There's no bad guy to point at. There's no ideological struggle or way to tie language systems into larger political narratives. It's a fairly dry issue that no individual or political faction can be blamed for.

(N.B.: While I'm fairly convinced that having a non-phonemic language makes learning to read and write much, much harder than it needs to be, that's based on personal experience and conversations, not on any kind of systematic research. Maybe I'm just wrong. Epistemic status: reasonably high confidence.)

Also posted to my blog at https://dissent.blog/2021/07/29/inefficient-writing-systems/


Even small problems in orthography are very difficult to fix.

Slovak is a phonemic Slavic language; sounds and letters have an almost 1:1 relationship. The greatest exception to this rule is the pair of letters "i" and "y", which correspond to exactly the same sound. That means reading is easy, but to write correctly, you need to memorize a lot of rules. For example, in words of Greek origin, the letter "ι" is transcribed as "i", and "υ" is transcribed as "y", despite the same pronunciation in Slovak -- but of course, unless you speak Greek, your only way to use this rule is to memorize those words. And the rules for words of Slavic origin include a lot of memorization, too. Realistically, after learning all those rules, you need to read a lot, and hope that your brain will magically provide you with the correct version when needed.

Which makes it a great system for recognizing people who remember obscure rules and read a lot!

The reasonable solution would be to simply start writing "i" everywhere. (Plus fix some other irregularities that depend on the "i/y" distinction.) So the kids of future generations would not have to spend two years at elementary school memorizing these rules and exceptions.

Except, when you look at the text written with "i" everywhere, the emotional reaction is that it feels stupid. It feels as if it were written by someone who is not really good at writing, because they can't remember the rules and don't read a lot. The idea that this should become the official version of the language feels offensive. To put it bluntly, it's like you are asking all people to write as if they were stupid. Of course most of them will refuse!

You might enjoy the story of how the Korean writing system, Hangul, came about. It was developed by one of the kings because he thought the average Korean should have a script they could read and write, and learning Chinese characters was difficult for many -- probably for no other reason than that most didn't have the luxury of time that the nobles and scholars had.

There is a historical movie about the event, called The King's Letters. I suspect it's romanticized, but the main lines are probably pretty accurate.

Ease/speed of learning is the wrong metric to evaluate a language or writing system.  How effective it is for communication among the literate is far more important.   And that communication includes interesting hard-to-measure features such as social signaling, intentional ambiguity, and extensibility to new concepts.

I agree that there are many metrics on which you can judge a language. My post above was meant to be more about writing systems specifically than languages generally. (Sorry for the lack of clarity.) Given a set language with a certain vocabulary, grammar, etc., I don't see why a phonemic system of writing would lead to less communication bandwidth, expressiveness, ambiguity, etc. than a non-phonemic one. Ditto for logographic writing systems.

In essence, my mental model is that you can say certain things in certain ways with a given language. Which writing system you use affects how hard or easy it is to change from verbal language to written language, but the writing system itself doesn't change the expressiveness, signalling, capacity for intentional ambiguity, etc.

Also, even if you think that ease of learning is not the only/most important metric, I still think it's worth taking into account and giving at least a fair amount of weight to. After all, a language which is far harder to learn (e.g. Chinese) will result in a far smaller pool of literate people, and even the people who are literate will be comparatively less so than in an alternate world where their language used an easier-to-learn writing system.

I've looked at this some. 

A deep problem here is that if writing has SOME slop in it, then it is textually usable in nearly identical ways by people whose words-on-the-tongue are quite different... and this is... good? Yes! It is good.

The designed phonological orthography of Interslavic is an interesting example where they aspire to something quite similar to written English, so that no particular accent or dialect or nigh-unto-language is particularly privileged but also there are not lots of confusing collisions.

Since the whole language in that case is artificial, they can try to detect sound collisions and/or written collisions and tweak the vocabulary at the same time. This leads, however, to the annoyance of people from Russia or Poland or Bulgaria or wherever needing to not just relearn spelling, but to stop using some words, and start using others, based on language mutation since the 700s in totally other parts of the Slavic sprachbund.

For Interslavic it is probably fine to throw away or recreate words? It is a hobby. And for people with spare brain cycles, learning the twists might help them get ready to learn like... "ALL the other Slavic languages"? Maybe?

Here are the first few lines, in the least bad English spelling reform I know of, of a pretty funny poem.

The Keyoss

Direst krîchur in Krieishon,
Studiing Inglish pronunsieishon,
Ay wil tîch yu in may verce
Sounds laik korps, kor, horce and werce.
It wil kîp yu, Sûzy, bizy,
Meik yor hed with hît gro dizy;
Tir in ay yor dress yûl têr.
So shal ay! O, hir may prêr,

Just kompêr hart, bird and herd,
Days and dayet, lord and werd,
Sord and sword, ritein and Britin,
(Maind the later, how its riten!)

Maybe if I read a bunch of easy stuff out loud I'd get the hang of it? It isn't SUPER hard. But I can't recognize the words is the problem. I have to MAYBE say them out loud? And then "hear" what I said and perform recognition on that? (Personally, I like not having to have the sounds in my head when I'm reading, if I don't want them.)

This is the answer key, and the original, which you are strongly encouraged to read out loud:

The Chaos (by G. Nolst Trenité)

Dearest creature in creation, 
Studying English pronunciation. 
I will teach you in my verse 
Sounds like corpse, corps, horse, and worse. 
I will keep you, Suzy, busy, 
Make your head with heat grow dizzy. 
Tear in eye, your dress will tear. 
So shall I! Oh hear my prayer.

Just compare heart, beard, and heard, 
Dies and diet, lord and word, 
Sword and sward, retain and Britain. 
(Mind the latter, how it's written.)

The author of that "riform" has a video (which I have just watched tonight at 1.75X speed).

But I can't recognize the words is the problem.

Because you have already spent years using their current writing. It would take you years to become equally fluent in the new system. Kids who now learn writing, however, would not have the same issue.

The problem is that the convenience of future generations cannot outweigh even a temporary inconvenience of the current generation, because it is always the current generation that makes the decisions.

Very true, and it generalises...

For example, in Year 1 that useless letter "c" would be dropped to be replased either by "k" or "s", and likewise "x" would no longer be part of the alphabet. The only kase in which "c" would be retained would be the "ch" formation, which will be dealt with later. Year 2 might reform "w" spelling, so that "which" and "one" would take the same konsonant, wile Year 3 might well abolish "y" replasing it with "i" and Iear 4 might fiks the "g/j" anomali wonse and for all.

Jenerally, then, the improvement would kontinue iear bai iear with Iear 5 doing awai with useless double konsonants, and Iears 6-12 or so modifaiing vowlz and the rimeining voist and unvoist konsonants. Bai Iear 15 or sou, it wud fainali bi posibl tu meik ius ov thi ridandant letez "c", "y" and "x" -- bai now jast a memori in the maindz ov ould doderez -- tu riplais "ch", "sh", and "th" rispektivli.

Fainali, xen, aafte sam 20 iers ov orxogrefkl riform, wi wud hev a lojikl, kohirnt speling in ius xrewawt xe Ingliy-spiking werld.

Source. Original source unknown.

 I work on NLP AI systems and have spent a lot of the past decade working on developing training data, so I have a degree of expertise here. 

There are a lot of things that go into how hard a language is to learn. How close the spelling is to pronunciation is one of them, but not the dominant one for alphabetic languages (grammatical gender is another, which you don't mention).

But you are totally correct that learning a character-based system is much, much harder than learning an alphabetic (or syllabary) language. I think it's a huge disadvantage for e.g. China, where kids have to devote years of study and memorization to be literate, many more than in alphabetic languages. I would go further and say that they are forced to have their whole school system revolve around immense amounts of rote memorization, which I think could lead to the kinds of relative lack of creativity that the Chinese school systems are sometimes criticized for. I don't think you can learn written Chinese without massive amounts of memorization, leaving little room for other ways of thinking for a decade or more.

Further reading:

Of languages that got multiple votes, Catalan and Spanish (and Esperanto) rated as very easy, and written Chinese (and especially literary Sinitic which is old written Chinese) rated very hard.

The linguist survey isn't a great source since it's significantly biased towards languages that are easy for *English speakers* to learn, which is quite different from languages that are easy for children to learn as a first language. A lot of the languages that are at the beginning of the list are Romance and Germanic languages, and other languages that are morphosyntactically similar to English (eg. most of the "1" languages don't have noun classes, like you mentioned).

I think that's a fair criticism.

Two questions:

  1. Do you think there are significant things, other than how phonetic it is or whether it's logographic, that make a writing system significantly easier or harder to learn/use?
  2. In terms of language difficulty more generally, what do you think are the most important factors which determine difficulty?

I think characters versus alphabets dwarfs everything else for written language, but here are a few other factors:

  • how phonetic it is if you want to learn the writing system.
  • how regular it is -- English is full of exceptions (because of its three-language-family history). Spanish is very regular; Russian also has a ton of exceptions.
  • how complicated the morphology is (Finnish is tough because of the super complex morphology, see also Russian). Mandarin is very easy on this dimension.
  • whether there's a phoneme distinction that you didn't learn as a child -- so for a Japanese speaker, l vs r is hard in English ("Engrish"), and for an English speaker, o versus ō is hard in Japanese.

In general, of course, the more similar it is to your native tongue, the easier it is. Tones are hard to learn if you don't speak a tonal language, but if you do then they are super intuitive. Similar with lots of morphology, fixed versus fluid word orderings, etc.

The other angle is spoken versus heard. Portuguese (especially) and French are much easier to speak than understand because of the various ways that sounds are elided or mushed together with fluent speakers. So you can get basic sentences out before you can understand something at full speed -- generally true but much more so for some languages.

You're making the mistake of thinking that writing is about adequate communication. It's not, or not only. It's also about signalling that you had enough slack / wealth to waste tons of your educational time and effort learning arbitrary rules, and thus that you are probably 1. somewhat disciplined 2. somewhat well-educated and 3. part of the elite, or at least not a total prole. This is how complex writing systems have survived so long.

Language isn’t just about efficiency; cultural aesthetic is a terminal value.

Also:

I think we should also take into account the value of English spellings that maintain common forms with other languages, even at the expense of being phonetic.

For instance, to a speaker of French or Spanish, the English word "diversification" would certainly seem less alien than a hypothetical respelling as "daiversifikeishun". Does having a common (or very similar) cross-language orthography for latinate words offer more advantages than the benefits of phonetic spellings? I'm not sure, but it should certainly be part of the discussion.

I suspect the benefit is greatest in technical fields with lots of Latin-derived vocabulary (e.g. health and biology). Would international scientific cooperation become more difficult if French and Spanish speakers had to relearn spellings for words like "capillaries" [kapileirīs] or "canine" [keinain], whereas the original English was almost identical to the spelling in their native language?

I think there may be a lot more insight to pull out of East Asian language case studies.

You can't simply reform Chinese to be phonemic, because written pinyin doesn't give you enough information to distinguish among the massive number of homophones (which you get from context clues in spoken language that aren't available in writing) in a language where each character is a single syllable, and most words are ~2 characters on average, with very few longer than 4. Also, even though this is becoming less important in the modern world, you lose the advantage of mutually incomprehensible dialects being able to share a common written language. When the PRC, which wasn't exactly in favor of classical scholarship, tried to modernize the language, the result was simplified characters, not an alphabet. (Side note: Mandarin has some of the easiest grammar of any language I've come across, and I wouldn't be surprised if that is tied into the use of very short words with almost no inflection/word form changes based on grammar, and also to the maintenance of ideographic writing).

From there, we have Japan and Korea, which started out importing Chinese characters, but have since diverged.

I don't know the history in detail, but Korea created a phonemic alphabet centuries ago, and went back and forth a few times. I'd be curious how much data we have on the results. For one thing, it's easier to learn, but also less stable over time as pronunciations change.

Japan uses multiple writing systems at once, both ideographic and phonetic. I don't know when each was created, when each is chosen today, or how this affects language learners. What does the data look like there? 

I do think it would be great to reform non-phonemic alphabetic written languages like English. That seems closer to a pure dead weight loss, a historical accident of having imported a foreign alphabet at the beginning of  written English. Although... given how different registers of vocabulary in English tend to have different parent language families, I wonder if that might not add its own confusions too? Unlikely, but possible?

As an older example, there's the Rosetta Stone and the use of Demotic by linguists to translate previously-indecipherable hieroglyphics. Not sure what the lesson there would be, but seems relevant?

"You aren't going to change English in the same way you aren't going to change QWERTY layouts."

I think you (for some value of "you") could do that in an otherwise-stable world, over a long enough timescale, but other technological changes will obviate the need for it too soon to do so. Things like natural language processing, automated translation, brain-computer interfaces, cochlear implants and other interventions making deafness rarer.

Although now I'm wondering if any sign languages have their own dedicated written forms, and if so, what they're like. Also, whether anyone has created video-to-text software for sign languages, even if that needs to include a machine translation step to convert to a spoken language's written form.

There are various notations for recording the gestures of sign languages (HamNoSys, Stokoe, SignWriting, and others), but none of them are routinely used by deaf people to write in. Such systems have generally been developed by and for people, deaf or otherwise, engaged in the study of sign languages.

Work has been done on recognising signing from video, but it is a very difficult problem and little has been achieved. Translating signing notation to an animated avatar is easier (I've been involved with such a project myself), and various demonstration systems have been made, but nothing beyond that.

Personally, I want a written language that not only is logographic, but breaks concepts down into their component parts, the way an alphabet breaks syllables into their component sounds. Think:

Latin Alphabet : Japanese syllabary :: Logobet : Chinese Logograms

The problem with phonemic orthographies is that they correspond to only one language, while logographies can correspond to any language; this provides a benefit in China, where the population speaks many mutually unintelligible languages, but everybody can read & understand the same written language. This is why logographies are superior to phonetic systems.

Still, I've never heard anyone talk about language reform. Why is that?

Maybe because you haven't been paying attention? On the Chinese side there are efforts toward Simplified Chinese, with the latest official implementation of a language reform in 2013.

The last language reform of German, which I speak, was in 1996.

The French have their Académie Française, which is constantly trying to do language reform and gets mostly ignored.

As far as both English and Chinese go, they are languages whose very different language communities use quite different phonemes to express the same word.

To do an English spelling reform you would need to decide which dialect of English is the correct one to map and which dialects are wrong. The US and the UK had problems switching to the metric system, which is much easier than doing language reform.

A US president running on a platform of enforcing a switch to an English language with phonetic spelling would likely be seen as much less serious than a candidate like Yang, who calls for the switch to the metric system.

If you take Serbian as an example of a language going well, the Balkans are full of wars between people who believe they have a different national identity because they consider their dialect to be a full language and the dialect that other people speak to be another language.

To do an English spelling reform you would need to decide which dialect of English is the correct one to map and which dialects are wrong.

Disagree: a new writing system can be chosen to accommodate certain dialectal variations (the pen/pin and father/bother issues, for example) and simply represent others. (The name John is pronounced Jawn in some regions of the USA. It would be very easy to spell it with that vowel if we expected spelling to match the sounds.) And it can all be done without applying right/wrong labels to anybody's dialect. It's just a matter of having enough letters (or variations of letters) to accurately represent the differences where they are important. Since we (broadly speaking) currently teach our children to read and write with 26 letters times 4 variations of each (upper- and lower-case in manuscript and cursive) for a total of 104 characters, I see no reason to expect they can't learn letters that map to the 40-ish sounds that make up the English language (plus a few marks to specify which version of the vowel is being indicated).

If you're concerned about one group "getting" the base vowels (with no diacritics), we could teach that a vowel with no markings represents a family of similar sounds and that adding a marking simply selects for a particular sound. (It's not really all that different than trying to cram 8-12 vowel sounds into 5 letters like we're doing now, just more precise.) You could spell any word with all unmarked vowels with no more ambiguity than that provided by homographs under the current system.

And we could finally represent puns in text! Maybe some of our jokes would actually make sense to archeologists in a few thousand years.

the Balkans are full of wars between people who believe they have a different national identity because they consider their dialect to be a full language and the dialect that other people speak to be another language.

That's region-wide arguing by definition combined with a strong cultural emphasis on national identity. Those problems don't mean that the local spelling system is problematic in any way. Language is a technology and spelling is a tool. That folks are using it to hurt each other doesn't make a tool "bad" somehow, or we'd ban hammers and axes for being involved in murders. Granted, some governments do ban some objects for just that reason. But the practice is extremely uneven, and usually limited to objects that appear to have been designed to hurt people in the first place.

In principle, couldn't you start writing in the International Phonetic Alphabet right now, and then write a script with an English dictionary to convert back and forth to standard English? Maybe providing prompts when you run the script to decide which regional dialect/variation you're trying for, and which homophone?
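For concreteness, here is a minimal Python sketch of what such a script might look like. Everything in it is hypothetical: the tiny hand-made `LEXICON` stands in for a real pronunciation dictionary, the IPA transcriptions are approximate and only illustrative, and the `choose` callback stands in for the homophone prompt.

```python
# Toy lexicon: standard spelling -> {dialect: approximate IPA}.
# Hand-made for illustration; a real script would load a full dictionary.
LEXICON = {
    "rose": {"US": "roʊz", "UK": "rəʊz"},
    "rows": {"US": "roʊz", "UK": "rəʊz"},
    "the": {"US": "ðə", "UK": "ðə"},
    "pen": {"US": "pɛn", "UK": "pɛn"},
    "pin": {"US": "pɪn", "UK": "pɪn"},
}

def to_ipa(text, dialect):
    """Standard spelling -> IPA for the chosen dialect (KeyError on unknown words)."""
    return " ".join(LEXICON[word][dialect] for word in text.lower().split())

def candidates(ipa_word, dialect):
    """All spellings whose pronunciation in this dialect matches the IPA word."""
    return sorted(w for w, prons in LEXICON.items() if prons[dialect] == ipa_word)

def from_ipa(ipa_text, dialect, choose=lambda options: options[0]):
    """IPA -> standard spelling; `choose` resolves homophones (e.g. by prompting)."""
    words = []
    for ipa_word in ipa_text.split():
        options = candidates(ipa_word, dialect)
        if not options:
            raise KeyError(f"no spelling known for {ipa_word!r}")
        words.append(options[0] if len(options) == 1 else choose(options))
    return " ".join(words)

print(to_ipa("the rose", "UK"))
print(candidates(to_ipa("rows", "US"), "US"))
```

The forward direction is a plain lookup once the dialect is fixed. The reverse direction is where the prompts come in: "rose" and "rows" collapse to the same IPA string, so something (a person, or context) has to pick between them.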

Sure could! That strategy works just fine for recording language exactly as it's heard, and even for mapping phonetic representations to traditional spellings in a many-to-one relationship, but it lacks the ability to encode words with dialectally neutral vowels. That is, IPA forces you to choose an exact vowel; there is no provision I know of to indicate "some vowel in this range", which is needed to neutralize the spellings of most words. Though it certainly wouldn't be hard to extend the IPA to include that feature if that's the character set you wanted to start with.

The inability to avoid exact phonetic representations would actually be beneficial, imo, because a fluent writer of IPA could then represent their native accent exactly, and a fluent reader could recognize that accent and imagine the author's voice more accurately while reading. It would be useless for deaf people, though - but all written language reforms are, unfortunately.

Being able to represent accents accurately is definitely a benefit! I'd love to pick up a book and be able to gather information about the writer's cultural background by the way they pronounce words (and without the "mangled" spellings that implies under the current system).

Likewise, I'd like to have the option to write in a neutral voice in order to avoid privileging one group of speakers in the canonical spellings (think, those used in government documents and the like). British and American English both have accents that imply socioeconomic status, and I'm sure that's true of other languages as well. Being forced to write in a specific accent could needlessly alienate some readers who don't identify with the group that speaks that way.

As for deaf people, there are many who learn to speak! Phonetic spellings would make that process much easier for learners who can't get the immediate feedback of clearly hearing themselves and others pronounce words. Once they learned how to produce the sound each character makes, they could know how to pronounce words just by reading them; an even stronger version of the benefit hearing people get from phonetic systems!

I think these are all good examples of language reforms. I guess my issue is that I was over-fixating on English.

Language reforms have always been standardized to the dialect that the people making the rules take to be the most common. Reforms don't happen often, so by the time another one comes along, most people can more or less understand the new form.

No, the dominant dialect.

"The term RP has murky origins, but it is regarded as the accent of those with power, influence, money and a fine education – and was adopted as a standard by the BBC in 1922. Today, it is used by 2% of the [UK] population"

Likewise, standard German is Prussian, standard Spanish is Castilian.

I'm sorry, that's what I meant: the dialect used by the rulers.

There are two problems with phonetic writing. 

  1. Homonyms. How do you recognise what "rouz" is (it may be rows, rose, rhos, ...)?
  2. Compactness. 施氏食狮史 is much shorter than "The Story of Mr. Shi Eating Lions" and much easier to understand than Shī-shì shí shī shǐ (again, homonyms).

Try to think about why we write 1, 2, 3, ... but (usually) not one, two, three. We learn a lot of characters, such as #$%@<>..., road signs, product signs, math signs, emoji, ...

Chinese is hard to learn, but easier to read.

  1. The same way we recognize those words by ear: through context.
  2. Comparing English and Chinese is difficult because of the different ways the two languages use the melody of speech. I note that you have included the tonal markers on all your "shi"s, which make those words distinct (and, to somebody who has learned to read the language that way, easy to read). Moreover, that representation makes the playfulness of the title obvious, which adds value.

Hence most people with power will not only never struggle with literacy themselves but will also never meaningfully interact with the 10-20% of society which does.

On the other hand, teachers are elites.