My Useless Super-Power: Reading Incorrectly Encoded Teletext

I have several weird and mostly useless super-powers. Some of them are actually super-powers that don’t have a rational explanation; I’ll leave them for other posts. The one I’ll talk about here does have a rational explanation.

There is a technology called Teletext. It began in the 1970s, was somewhat popular in some countries in the 1990s, and perhaps it still exists, but it is largely superseded now by websites and by the on-screen text from cable and satellite set-top boxes. It worked together with broadcast television: some extra data was sent together with the TV signal, and if your TV set supported Teletext, you could push a button on your remote control, and the picture would be fully or partially replaced with some letters or crude pixel graphics.

When Teletext replaced the picture partially, it was used to show subtitles for translation or accessibility. When it replaced the picture fully, it could show news, TV programming schedules, weather forecasts, tourist guides, government information, trivia games, and other things for which websites and set-top boxes are used today.

We moved to Israel in 1991 and bought a TV a few months later. I quickly figured out that our new TV set supports Teletext and learned to use it. I loved it! It’s possible that I used Teletext more than actually watching TV! On Israeli TV, the content of Teletext was mostly in Hebrew, and it was very useful for me for learning the language. The way in which you navigate the pages there is by typing a three-digit page number, and I still remember lots of those page numbers. I remember that 100s were for news and TV guides. There was also a range for children with games and stories, probably 300s. And there was a range of pages in English, mostly with tourist guides.

I loved it for a couple of years. I learned a lot of Hebrew from it. Back then, Israeli broadcast TV had only one channel operated by the government, and that’s the only one that our antenna could receive where we lived for the first couple of years. So we only watched the Israeli Channel 1, but it was great for me, because it had Teletext.

Then in 1993 we moved and got cable TV.

Oh boy.

Cable TV had dozens of channels,¹ but the most important one, of course, was MTV. The MTV we had in Israel was not the one from the U.S., but MTV Europe,² which was broadcast from the U.K. The music³ there probably made a stronger cultural impact on my personality than any other thing ever… but that is not really the topic of this post.

The topic of this post is Teletext. MTV Europe had Teletext!

But the Teletext on MTV Europe was broken! It had occasional Latin letters, but almost everything was written in gibberish in Hebrew letters.

It annoyed me greatly: I had already fallen in love with Teletext on Israeli Channel 1, and I immediately fell in love with the music on MTV, and I really wanted to make them work together for me. Initially, I thought that it’s just a malfunction and that something gets garbled because it’s cable and not antenna or because the signal comes from far away.

Pretty quickly, however, I started noticing patterns. It looked liked text—with words, spaces, punctuation, and sentences. And the crude pixel graphics looked OK. So I started trying to decipher it, and realized that the uppercase Latin letters just worked correctly, but the lowercase Latin letters were replaced with Hebrew letters!

The reason that it worked that way is that the people who set up Teletext on Israeli Channel 1 wanted to support both Hebrew and English, but the Teletext technology supported only 7-bit encodings. “Encoding” is a standard that gives numbers to each text symbol that appears on computers: letters, digits, punctuation marks, and so forth. There are many encodings. “7-bit” basically means that the computer can only understand 128 symbols (2⁷ = 128), which means that in a 7-bit encoding, there’s enough room for capital Latin letters, small Latin letters, digits, basic punctuation and math symbols, and not much more—certainly not a whole another alphabet. This was long before Unicode, an encoding with room for all the alphabets of the world, became widely available in the 2000s.

So the Israeli Broadcasting Authority probably got TV sets sold in Israel to show Hebrew letters instead of small Latin letters. Remember that I mentioned that the Israeli Teletext had a range of pages in English for tourists? It was all in capital letters! I actually did notice that it’s all-caps back when I started using it in 1991, but I didn’t pay attention to it, and I thought that it’s just a limitation of how Teletext works. It was, indeed, a limitation, but not the kind of limitation that I thought it was.

Luckily, the lowercase letters were replaced consistently according to a system that I figured out in a few minutes once I realized what was going on: ב was a, ג was b, and so forth. Now that I’m trying to reconstruct it, the symbol for א was probably not a letter, but ` or &.

So I made myself a table and started reading: music news, singles and albums charts in every European country, programming schedule, song playlists on some upcoming special broadcasts like MTV Unplugged, and even personal ads.⁴ At first, I was reading slowly because I had to peek at my list all the time, but after a few weeks, I memorized it and just read it fluently.

Like most families do, we had the TV set in our living room. My parents couldn’t understand why am I staring for hours at black screens with complete gibberish instead of staring at music videos with beautiful dancing people, but I guess that it didn’t bother them too much. I kept doing it for years.

Fast-forward to 1998. I started working as a sysadmin in a place with a lot of computers that were very old already then, but since it was in Israel, they had to support the Hebrew language. In one of my first days there, I noticed a colleague being annoyed about a printout from a dot-matrix printer. I looked at it and saw that the first word is Username. At first, it didn’t even occur to me that Uוםבמעוף is what it actually says.⁵ My colleague was frustrated because he expected it to be in English, but it was in gibberish in Hebrew letters.

You can imagine where it goes from here: I was able to read that printout with zero effort because the ancient server that produced it used the same Hebrew encoding system that Teletext used, and that I had been practicing for years on MTV! My colleague was impressed.

After that, I had no more opportunities to actually use this super-power, but if I ever see such text again, I’d still remember how to read it.

Useless, but fun to tell about to fellow geeks of languages and computers.

¹ It had several Russian channels, which changed a lot since we left Russia in 1991. The Soviet Union and its censorship completely disappeared, TV became diverse and commercial (for better and worse), and a bunch of new channels were added. It also had Arabic, German, Spanish, French, and Turkish channels, which were not so useful to us because we didn’t know the language, but since I do love learning languages, I occasionally watched them and tried to guess what words mean. I remember, for example, that I figured out myself that the words اليوم, heute, hoy, aujourd’hui, and bugün mean “today”—it’s a word that frequently appears on the screen in announcements of TV programs that will be shown later.

² We also had MTV Asia, which later became Channel [V], and which mostly showed Indian music.

³ On MTV Europe, it was the time of Eurodance (2 Unlimited, Dr. Alban, Haddaway), Britpop (Blur, Oasis, Suede), Grunge (Nirvana, Alice in Chains, Pearl Jam), other kinds of “Alternative” (R.E.M., Therapy?, Björk, dEUS), boy bands (Take That, Boyzone), vestiges of hair metal (Guns N’ Roses, Bon Jovi), and occasional hip hop and R&B (Coolio, Mariah Carey, Jodeci). Back then, I felt that the cool thing to do is to love Britpop, Grunge, and other “Alternative” things, and to hate Eurodance, boy bands, hair metal, and R&B. I changed my mind about it thanks to GusGus, Oliver Lake, and, believe it or not, my physical education teacher. But that’s a topic for a different post.

⁴ The personal ads were only for U.K., if I recall correctly. That was also the first time I saw personal ads from gays and lesbians. I remember being pleasantly surprised by how casual and normal they were. I knew what gay and lesbian people are, but back then, they were almost never discussed on the Russian TV, and not too much on Israeli TV either.

⁵ I had to type it backwards. Unfortunately, WordPress doesn’t allow me to use the <bdo> HTML tag or the unicode-bidi: bidi-override CSS rule. They are very rarely needed, but they would be appropriate here.

“The fix is to complete the localization”. Not letting people do it is a bug. (Also, some non-standard observations about American health insurance.)

It sometimes happens in people’s lives that someone tells them something that sounds true and obvious at the time. It turns out that it actually is objectively true, and it is also obvious, or at least sensible, to the person who hears it, but it’s not obvious to other people. But it was obvious to them, so they think that it is obvious to everyone else, even though it isn’t.

It happens to everyone, and we are probably all bad at consistently noticing it, remembering it, and reflecting on it.

This post is an attempt to reflect on one such occurrence in my life; there were many others.

(Comment: This whole post is just my opinion. It doesn’t represent anyone else. In particular, it doesn’t represent other translatewiki.net administrators, MediaWiki developers or localizers, Wikipedia editors, or the Wikimedia Foundation.)

There’s the translatewiki.net website, where the user interface of MediaWiki, the software that powers Wikipedia, as well as of some other Free Software projects, is translated to many languages. This kind of translation is also called “localization”. I mentioned it several times on this blog, most importantly at Amir Aharoni’s Quasi-Pro Tips for Translating the Software That Powers Wikipedia, 2020 Edition.

Siebrand Mazeland used to be the community manager for that website. Now he’s less active there, and, although it’s a bit weird to say it, and it’s not really official, these days I kind of act like one of its community managers.

In 2010 or so, Siebrand heard something about a bug in the support of Wikipedia for a certain language. I don’t remember which language it was or what the bug was. Maybe I myself reported something in the display of Hebrew user interface strings, or maybe it was somebody else complaining about something in another language. But I do remember what happened next. Siebrand examined the bug and, with his typical candor, said: “The fix is to complete the localization”.

What he meant is that one of the causes of that bug, and perhaps the only cause, was that the volunteers who were translating the user interface into that language didn’t translate all the strings for that feature (strings are also known as “messages” in MediaWiki developers’ and localizers’ jargon). So instead of rushing to complain about a bug, they should have completed the localization first.

To generalize it, the functionality of all software depends, among many other things, on the completeness of user interface strings. They are essentially a part of the algorithm. They are more presentation than logic, but the end user doesn’t care about those minor distinctions—the end user wants to get their job done.

Those strings are usually written in one language—often English, but occasionally Japanese, Russian, French, or another one. In some software products, they may be translated into other languages. If the translation is incomplete, then the product may work incorrectly in some ways. On the simplest level, users who want to use that product in one language will see the user interface strings in another language that they possibly can’t read. However, it may go beyond that: writing systems for some languages require special fonts, applying which to letters from another writing system may cause weird appearance; strings that are supposed to be shown from left to right will be shown from right to left or vice versa; text size that is good for one language can be wrong for another; and so forth.

In many cases, simply completing the translation may quietly fix all those bugs. Now, there are reasons why the translation is incomplete: it may be hard to find people who know both English and this language well; the potential translator is a volunteer who is busy with other stuff; the language lacks necessary technical terminology to make the translations, and while this is not a blocker —new terms can be coined along the way—, this may slow things down; a potential translator has good will and wants to volunteer their time, but hasn’t had a chance to use the product and doesn’t understand the messages’ context well enough to make a translation; etc. But in theory, if there is a volunteer who has relevant knowledge and time, then completing the translation, by itself, fixes a lot of bugs.

Of course, it may also happen that the software actually has other bugs that completing the localization won’t fix, but that’s not the kind of bugs I’m talking about in this post. Or, going even further, software developers can go the extra mile and try to make their product work well even if the localization is incomplete. While this is usually commendable, it’s still better for the localizers to complete the localization. After all, it should be done anyway.

That’s one of the main things that motivate me to maintain the localization of MediaWiki and its extensions into Hebrew at 100%. From the perspective of the end users who speak Hebrew, they get a complete user experience in their language. And from my perspective, if there’s a bug in how something works in Wikipedia in Hebrew, then at least I can be sure that the reason for it is not that the translation is incomplete.

As one of the administrators of translatewiki, I try my best to make complete localization in all languages not just possible, but easy.¹ It directly flows out of Wikimedia’s famous vision statement:

Imagine a world in which every single human being can freely share in the sum of all knowledge. That’s our commitment.

I love this vision, and I take the words “Every single human being” and “all knowledge” seriously; they implicitly mean “all languages”, not just for the content, but also for the user interface of the software that people use to read and write this content.

If you speak Hindi, for example, and you need to search for something in the Hindi Wikipedia, but the search form works only in English, and you don’t know English, finding what you need will be somewhere between hard and impossible, even if the content is actually written in Hindi somewhere. (Comment #1: If you think that everyone who knows Hindi and uses computers also knows English, you are wrong. Comment #2: Hindi is just one example; the same applies to all languages.)

Granted, it’s not always actually easy to complete the localization. A few paragraphs above, I gave several general examples of why it can be hard in practice. In the particular case of translatewiki.net, there are several additional, specific reasons. For example, translatewiki.net was never properly adapted to mobile screens, and it’s increasingly a big problem. There are other examples, and all of them are, in essence, bugs. I can’t promise to fix them tomorrow, but I acknowledge them, and I hope that some day we’ll find the resources to fix them.

Many years have passed since I heard Siebrand Mazeland saying that the fix is to complete the localization. Soon after I heard it, I started dedicating at least a few minutes every day to living by that principle, but only today I bothered to reflect on it and write this post. The reason I did it today is surprising: I tried to do something about my American health insurance (just a check-up, I’m well, thanks). I logged in to my dental insurance company’s website, and… OMFG:

What you can see here is that some things are in Hebrew, and some aren’t. If you don’t understand the Hebrew parts, that’s OK, because you aren’t supposed to: they are for Hebrew speakers. But you should note that some parts are in English, and they are all supposed to be in Hebrew.

For example, you can see that the exclamation point is at the wrong end of “Welcome, Amir!“. The comma is placed unusually, too. That’s because they oriented the direction of the page from right to left for Hebrew, but didn’t translate the word “Welcome” in the user interface.² If they did translate it, the bug wouldn’t be there: it would correctly appear as “ברוך בואך, Amir!“, and no fixes in the code would be necessary.

You can also see a wrong exclamation point in the end of “Thanks for being a Guardian member!“.

There are also less obvious bugs here. You can also see that in the word “WIKIMEDIA” under the “Group ID” dropdown, the letter “W” is only partly seen. That’s also a typical RTL bug: the menu may be too narrow for a long string, so the string can be visually truncated, but it should happen at the end of the string and not in the beginning. Because the software here thinks that the end is on the left, the beginning gets truncated instead. This is not exactly an issue that can be fixed just by completing the localization, but if the localization were complete, it would be easier to notice it.

There are even more issues that you don’t notice if you don’t know Hebrew. For example, there’s a button with a weird label at the top right. Most Hebrew speakers will understand that label as “a famous website”, which is probably not what it is supposed to say. It’s more likely that it’s supposed to say “published web page”, and the translator made a mistake. Completing the translation correctly would fix this mistake: a thorough translator would review their work, check all the usages of the relevant words, and likely come up with a correct translation. (And maybe the translation is not even made by a human but by machine translation software, in which case it’s the product manager’s mistake. Software should never, ever be released with user interface strings that were machine-translated and not checked by a human.)

Judging by the logo at the top, the dental insurance company used an off-the-shelf IBM product for managing clients’ info. If I ask IBM or the insurance company nicely, will they let me complete the localization of this product, fixing the existing translation mistakes, and filing the rest of the bugs in their bug tracking software, all without asking for anything in return? Maybe I’ll actually try to do it, but I strongly suspect that they will reject this proposal and think that I’m very weird. In case you wonder, I actually tried doing it with some companies, and that’s what happened most of the time.

And this attitude is a bug. It’s not a bug in code, but it is very much a problem in product management and attitude toward business.

If you want to tell me “Amir, why don’t you just switch to English and save yourself the hassle”, then I have two answers for you.

The first answer is described in detail in a blog post I wrote many years ago: The Software Localization Paradox. Briefly: Sure, I can save myself the hassle, but if I don’t notice it and speak about it, then who will?

The second answer is basically the same, but with more pathos. It’s a quote from Avot 1:14, one of the most famous and cited pieces of Jewish literature outside the Bible: If I am not for myself, who is for me? But if I am for my own self, what am I? And if not now, when? I’m sure that many cultures have proverbs that express similar ideas, but this particular proverb is ours.

And if you want to tell me, “Amir, what is wrong with you? Why does it even cross your mind to want to help not one, but two ultramegarich companies for free?”, then you are quite right, idealistically. But pragmatically, it’s more complicated.

Wikimedia understands the importance of localization and lets volunteers translate everything. So do many other Free Software projects. But experience and observation taught me that for-profit corporations don’t prioritize good support for languages unless regulation forces them to do it or they have exceptionally strong reasons to think that it will be good for their income or marketing.

It did happen a few times that corporations that develop non-Free software let volunteers localize it: Facebook, WhatsApp, and Waze are somewhat famous examples; Twitter used to do it (but stopped long ago); and Microsoft occasionally lets people do such things. Also, Quora reached out to me to review the localization before they launched in Hebrew and even incorporated some of my suggestions.³

Very often, however, corporations don’t want to do this at all, and when they do it, they often don’t do it very well. But people who don’t know English want—and often need!—to use their products. And I never get tired of reminding everyone that most people don’t know English.

So for the sake of most humanity, someone has to make all software, including the non-Free products, better localized, and localizable. Of course, it’s not feasible or sustainable that I alone will do it as a volunteer, even for one language. I barely have time to do it for one language in one product (MediaWiki). But that’s why I am thinking of it: I would be not so much helping a rich corporation here as I would be helping people who don’t know English.

Something has to change in the software development world. It would, of course, be nice if all software became Freely-licensed, but if that doesn’t happen, it would be nice if non-Free software would be more open to accepting localization from volunteers. I don’t know how will this change happen, but it is necessary.

If you bothered to read until here, thank you. I wanted to finish with two things:

To thank Siebrand Mazeland again for doing so much to lay the foundations of the MediaWiki localization and the translatewiki community, and for saying that the fix is to complete the localization. It may have been an off-hand remark at the time, but it turned out that there was much to elaborate on.
To ask you, the reader: If you know any language other than English, please use all apps, websites, and devices in this language as much as you can, bother to report bugs in its localization to that language, and invest some time and effort into volunteering to complete the localization of this software to your language. Localizing the software that runs Wikipedia would be great. Localizing OpenStreetMap is a good idea, too, and it’s done on the same website. Other projects that are good for humanity and that accept volunteer localization are Mozilla, Signal, WordPress, and BeMyEyes. There are many others.⁴ It’s one of the best things that you can do for the people who speak your language and for humanity in general.

¹ And here’s another acknowledgement and reflection: This sentence is based on the first chapter of one of the most classic books about software development in general and about Free Software in particular: Programming Perl by Larry Wall (with Randal L. Schwartz, Tom Christiansen, and Jon Orwant): “Computer languages differ not so much in what they make possible, but in what they make easy”. The same is true for software localization platforms. The sentence about the end user wanting to get their job done is inspired by that book, too.

² I don’t expect them to have my name translated. While it’s quite desirable, it’s understandably difficult, and there are almost no software products that can store people’s names in multiple languages. Facebook kind of tries, but does not totally succeed. Maybe it will work well some day.

³ Unfortunately, as far as I can tell, Quora abandoned the development of the version in Hebrew and in all other non-English languages in 2022, and in 2023, they abandoned the English version, too.

⁴ But please think twice before volunteering to localize blockchain or AI projects. I heard several times about volunteers who invested their time into such things, and I was sad that they wasted their volunteering time on this pointlessness. Almost all blockchain projects are pointless. With AI projects, it’s more complicated: some of them are actually useful, but many are not. So I’m not saying “don’t do it”, but I am saying “think twice”.

Google’s Lies and the Problem That Large Language Models Won’t Solve

I switched the search settings not to use Google by default in web browsers on all my devices after reading the blog post that the Head of Google Search published in response to the many reports of problems in Google’s “AI Overviews”.

Practically every point in that blog post is either a meaningless generality written in corporatespeak or a demonstrable lie. You don’t need specialized engineering knowledge or access to internal information to see it. You just need common sense.

User feedback shows that with AI Overviews, people have higher satisfaction with their search results…

Which people? Everyone? I don’t. I sharply reduced my use of Google search because I no longer trust it.

… and they’re asking longer, more complex questions that they know Google can now help with.

The word “help” is doing a lot of work here. Google can output a piece of text in response. Is this piece of text actually helpful?

AI Overviews work very differently than chatbots and other LLM products that people may have tried out.

No, they don’t. They work exactly the same. Both technologies automatically produce some text that was not written by a human.

They’re not simply generating an output based on training data.

No. They are, in fact, simply generating an output based on training data.

When AI Overviews get it wrong, it’s usually for other reasons: misinterpreting queries, misinterpreting a nuance of language on the web, or not having a lot of great information available. (These are challenges that occur with other Search features too.)

This is one of the few true things in this blog post, but it shows why this feature is completely pointless!

I mean, it’s nice that she doesn’t blame the users here for writing bad queries, but admits that the software that her team developed is bad at interpreting them.

And here’s an even more important thing: Despite the long-standing impression that “you can find everything on Google”, the “AI” innovations of the last couple of years help us realize that there are actually many topics about which there is not a lot of info online. And large language models are not going to solve this problem.

This approach is highly effective.

What does this even mean? “We are able to show more ads and improve our bottom line for the last quarter?”

Overall, our tests show that our accuracy rate for AI Overviews is on par with another popular feature in Search — featured snippets — which also uses AI systems to identify and show key info with links to web content.

This is probably the biggest lie of all in that whle post.

There is no comprehensive test or measure for accuracy! It is logically impossible to make one!

At most, there is some internal metric that middle managers present to senior managers, and it may show that the rate is “positive” according to internal company logic. However, it has absolutely nothing to do with what millions of web users actually need.

This is comparable to metrics of quality of machine translation, such as BLEU and NIST. There are methodologies and formulas behind them, but they are only useful for discussions among researchers, developers, and product and project managers, and they have very limited usefulness at predicting the correctness of the translation of a text that hasn’t yet been tested. Developers have to use those metrics because project managers love metrics, but most of them admit that they are not very good, and such a metric can never become perfect.

In a small number of cases, we have seen AI Overviews misinterpret language on webpages and present inaccurate information.

Yes, thanks again for admitting that computers are not supposed to interpret language in the first place. Humans are supposed to do it.

I could go on, but I have better things to do, like publishing three longish blog posts of my own. One is coming very soon, and it’s going to be fun, at least for me.

In response to accusations of monopolistic behavior, Google has been saying for years that competition is just a click away. It’s true, and it’s good. My experience with DuckDuckGo in the last few days has been perfectly fine.

That said, Google should still be tried for monopolistic behavior. And I kind of wish that there was regulation that prevents the deliberate destruction of fundamental public goods operated by commercial companies, but I guess that it would be very hard to legislate.

In the meantime, let’s try not to be silent about Google’s lies, and let’s consider using the competitors.

Comparison of Social Media Platforms in Light of Possible TikTok Ban by the United States Government

Some people say that the real reason for the possible ban of TikTok by the United States government is that TikTok includes a lot of Pro-Palestinian and Anti-Israeli content. That may be true, but my impression is closer to this:

Feature	Facebook/Instagram	TikTok	X/Twitter	Mastodon/Fediverse
Shows you short videos that may be entertaining, that you didn’t search for, and that you can scroll up and down	Yes	Yes	No	No
Has connections with the government of the United States	Yes	Probably yes	Probably yes	Probably no
Has connections with the government of the People’s Republic of China	Not sure	Yes	Not sure	Probably no
Led by a billionaire who behaves like a dick	Kind of	Not sure	Yes	No
Has Anti-Palestinian content	Yes	Yes	Yes	Yes
Has Anti-Israeli content	Yes	Yes	Yes	Yes
Has Anti-American content	Yes	Yes	Yes	Yes
Has transparent moderation policies	No	No	No	Kind of
Can waste a lot of your time	Yes	Yes	Yes	Kind of

Corrections welcome.

Five More Privileges of English Speakers, part 2: Language and Software

For the previous part in the series, see Five Privileges of English Speakers, part 1.

I’m continuing the series of posts in each of which I write about five privileges that English speakers have without giving it a lot of thought. The examples I give mostly come from my experience translating software, Wikipedia articles, blog posts, and some other texts between English, Hebrew, and Russian. Hebrew and Russian are the languages I know best. If you have interesting examples from other languages, I am very interested in hearing them and writing about them.

I’m writing them mostly as they come into my mind, without a particular order, but the five items in this part of the series will focus on usage of the English language in software, and try to show that the dominance of English is not only a consequence of economics and history, but that it’s further reinforced by features of the language itself.

1. Software usually begins its life in English

English is the main language of software development worldwide.

The world’s best-known place for software development is Silicon Valley, an English-speaking place. That’s the place of Facebook, Google, Apple, Oracle and many others. California is also the home of Adobe.

There are several other hubs of software development in United States: Seattle (Microsoft, Amazon), North Carolina (Red Hat), New York (IBM, CA), Massachusets (TripAdvisor, Lotus, RSA), and more. The U.S. is also the source for much of computer science research and education, coming from Berkeley, MIT, and plenty of other schools. The U.S. is also the birthplace of the Internet, originally supported by the U.S. Department of Defense and several American universities. The world wide web, which brought the Internet to the masses, was created in Switzerland by an English speaker.

Software is developed in other countries—India, Russia, Israel, France, Germany, Estonia, and many other countries. But the dominance of the U.S. and of the English language is clear. The reason for this is not only that the U.S. is the source for much of computer technologies, but also—and probably more importantly—that the U.S. is the biggest consumer market for software. So developers in all countries tend to optimize the product for the highest-paying consumers, and these only need English.

When engineers write the user interface of their software in English, they often do not give any thought to other languages at all, or make translation possible, but complicated by English-centric assumptions about number, gender, text direction, text size, personal names, and plenty of other things, which will be explored in further points.

2. Terminology

English is also the source for much of the computer world’s terminology. Other languages have to adapt terms like smartphone, network, token, download, authentication, and thousands of others.

Some language communities work hard to translate them all meticulously into native words; Icelandic, Lithuanian, French, Chinese, and Croatian are famous examples. This is nice, but requires effort on behalf of terminology committees, who need to keep up with the fast pace of technological development, and on behalf of the software translators, who have to keep with the committees.

Some just transliterate most of them: keep the term essentially in English, but rewritten in the native alphabet. Hindi and Japanese are examples of that. This seems easy, but it is based on a problematic assumption: that the target language speakers who will use the software know at least some English! This assumption is correct for the translators, who don’t just know the English terms, but are probably also quite accustomed to it, but it’s not necessarily correct for the end users. Thus, the privilege is perpetuated.

Some languages, such as Hebrew, German, and Russian, are mid-way, with language academics and purists pulling to purer native language, engineers pulling to more English-based words, and the general public settling somewhere in between—accepting the neologisms for some terms, and going for English-based words for others.

For the non-English languages it provides fertile ground for arguments between purists and realists, in which the needs of the actual users are frequently forgotten. All the while, English speakers are not even aware of all this.

3. Easy binary logic word formation

One particular area of computer terminology is binary logic. This sounds complicated, but it’s actually simple: in electronics and software opposite notions such as true / false, success / failure, OK / Cancel, and so forth, are very common.

This translates to a great need for words that express opposites: enable / disable, do / undo, log in / log out, delete / undelete, block / unblock, select / deselect, online / offline, connect / disconnect, read / unread, configured / misconfigured.

Notice something? All of the above words are formed with the same root, with the addition of a prefix (un-, dis-, de-, mis-, a-), or with the words “on” and “off”.

A distinct, but closely related need, is words for repetition. Computers are famously good at doing things again and again, and that’s where the prefix re- is handy: reconnect, retry, redo, retransmit.

These features happen to be conveniently built into the English language. While English has extremely simple morphology for declension and conjugation (see the section “Spell-checking” in part 1 of the series), it has a slightly more complex morphology for word formation, but it’s still fairly easy.

It is also productive. That is, a software developer can create new words using it. For example, the MediaWiki software has the concept of “oversight”—hiding a problematic page in such a way that only users with a particular permission can read it. What happens if a page was hidden by mistake? Correct: “unoversight”. This word doesn’t quite exist elsewhere, but it doesn’t sound incorrect, because familiar English word formation rules were used to coin it.

As it always happens, English-speaking software engineers either don’t think about it at all, or think that other languages also have similar word formation rules. If you haven’t guessed it already, it is not true. Sime other European languages have similar constructs, but not necessarily as consistent as in English. And for Semitic languages like Hebrew it’s a disaster, because in Semitic languages prefixes are used for entirely different things, and the grammar doesn’t have constructs for repetition and negation. So when translating software user interface strings into Hebrew, we have to use different words as opposites. For example the English pair connect / disconnect is translated as lehitḥabér / lehitnaték—completely different roots, which Hebrew is just lucky to have. Another option is to use negative words like lo and bilti, or bitul, but they are often unnatural or outright wrong. Having to deal with something like “Mark as unread” is every Hebrew software translator’s nightmare, even though it sounds pretty straightforward in English.

English itself also has pairs of negative words that are not formed using the above prefixes, for example next / previous and open / close, but in many other languages they are much more common.

4. Verbing

“Verbing weirds language”, as one of the famous Calvin and Hobbes panels says.

Despite being a funny joke in the comic, it’s a real feature of the English language: because of how English morphology and syntax work, nouns can easily jump into the roles of adjectives and verbs without changing the way they are written.

For English, this is a useful simplification, and it works in labeling, as well as in advertising. “Enjoy Coca-Cola” is something more than an imperative. The fact that it’s a short single word and that it’s the same in all genders and numbers, makes it more usable as a call to action than it would be in other languages. And, other than advertising, where are calls to action very common? Software, of course. When you’re trying to tell a user to do something, a word that happens to be both the abstract concept and the imperative is quite useful.

Perhaps the most famous example of this these days is Facebook’s “Like”. Grammatically, what is it in English? Imperative? A noun describing an abstract action? Maybe a plain old noun, as in “chasing likes” (this is a plural noun—English verb don’t have a plural form!)? Answer: it’s all of them and more.

When translated to Hebrew in Facebook’s interface, it’s Ahávti, which literally means “I loved it”. Actually, this translation is mostly good, because it’s understandable, idiomatic, and colloquial enough without compromising correctness. Still, it’s a verb, which is not imperative, and it’s definitely not a noun, so you cannot use it in a sentence as if it was a noun. Indeed, Hebrew speakers are comfortable using this button, but when they speak and write about this feature, they just use its English name: “like” (in plural láykim). It even became a slightly awkward, but commonly used verb: lelaykék. Something similar happens in Russian.

It would be impossible in Hebrew and Russian to use the exact same word for the noun and the verb, especially in different persons and genders. Sometimes the languages are lucky enough to be able to adapt an English verb in a way that is more or less natural, but sometimes it’s weird, and hurts the user experience.

5. Word length

This one is relatively simple and not unique to English, but should be mentioned anyway: English words are neither very long, nor very short. Examples of languages where words are, on average, longer than in English, are Finnish, Tamil, German, and occasionally Russian. Hebrew tends to be shorter, although sometimes a single English word has to be translated with several Hebrew words, so it can get also get longer. This is true for a pretty much any language, really.

In designing interfaces, especially for smaller screens, the length of the text is often important. If a button label is too long, it may overflow from the button, or be truncated, making the display ugly, or unusable, or both.

If you’re an English speaker, it probably won’t happen with you, because almost all software is usually designed with the word length of your language in mind. Other languages are almost always an afterthought.

The good practice for software engineers and designers is to make sure that translated strings can be longer. Their being shorter is rarely a problem, although sometimes a string is so short that the button may become to small to click or tap conveniently.

Generally, what can you do about these privileges?

Whoever you are, remember it. If you know English, you are privileged: Software is designed more for you than for people who speak other languages.

If you are a software engineer or a designer, at the very least, make your software translatable. Try to stick to good internationalization practices and to standards like Unicode and CLDR. Write explanations for every translatable string in as much detail as possible. Listen to users’ and translators’ complaints patiently—they are not whining, they are trying to improve your software! The more internationalizable it is, the more robust it is for you as a developer, and for your English-speaking users, too, because better design thinking will be going into each of its components, and less problematic assumptions will be made.

Five Privileges of English Speakers, part 1

It’s very common today on progressive blogs to urge people to check their privilege.

Being an English speaker, native or non-native, is a privilege.

It’s not as often as discussed as other forms of privilege, such as white, male, cis, hetero, or rich privilege. The reason for this is simple: The world’s media is dominated by the English language. English-language movies are more popular in many countries than movies in these countries’ own languages, English-language news networks are quoted by the rest of the world, the world’s most popular social networks are based in the U.S. and are optimized for U.S. audiences, etc.

So, when English speakers discuss privilege among each other, English is not much of an issue, and they dedicate more time to race, gender, wealth, religion, and other factors that differentiate between people in English-speaking countries.

Despite this, I am not the first one to describe English as a privilege. A simple Google search for english language privilege will yield many interesting results.

What I do want to try to do in this series of posts is to list the particular nuances that make English such a privilege in as much detail as possible. I wanted to write this for a long time, but there are many such nuances, so I’ll just do it in batches of five, in no particular order:

1. Keyboard

If you speak English, congratulations: A keyboard on which your language can be written is available on all electronic devices.

All of them.

All desktops, laptops, phones, tablets, watches. The only notable exception I can think of is typewriters, which only makes the point more tragic: technology moved forward and made writing easier in English, but harder in many other languages, where local-language typewriters were replaced with computers with English-only keyboard.

At the very worst case, writing English on a computer will be slightly inconvenient in countries like Germany, France, or Turkey, where the placement of the Latin letters on the keys is slightly different from the U.S. and U.K. QWERTY standard. Oh, poor American tourists.

On a more serious note, though, even though a lot of languages use the Latin alphabet, a lot of them also use a lot of extra diacritics and special characters, and English is one of the very few that doesn’t. Of the top 100 world’s languages by native speakers, only Malay, Kinyarwanda, Somali, and Uzbek have standardized orthographies that can be written in the basic 26-letter Latin alphabet without any extra characters. We can also add Swahili, which has a large number of non-native speakers, but that’s it. With other languages you can get stuck and not be able to write your language at all (Hindi, Chinese, Russian, etc.), or you may have to write in a substandard orthography because you can’t type letters like é or ł (French, Vietnamese, Polish, etc.).

The above is just the teeny-tiny tip of the iceberg; the keyboard problem will be explored in more points later.

2. Spell-checking

English word morphology is laughably simple.

There’s -s for plurals and for third person present tense verbs, there’s -‘s for possession, and there are -ed and -ing verb forms. There are also some contractions (‘d, ‘s, ‘ll, ‘ve), and a long, but finite list of irregular verb forms, and an even shorter list of irregular plural noun forms. And that’s it.

Most languages aren’t like that. In most languages words change with prefixes, suffixes, infixes, clitics, and so on, according to their role in the sentence.

Beyond the fact that English writing is (arguably) easier for children and foreigners to learn, this means that software tools for processing a language are easy to develop for English and hard to develop for other languages.

The first simple example is spell-checking.

English has had not just spelling, but also grammar and style checkers built into common word processors for decades, and many languages of today don’t even have spelling checkers, not to mention grammar, or style, or convenient searching. (See below.)

So in English, when you type “kinh”, most word processors will suggest correcting it to “king”, but then, some of them may also suggest replacing this word with “monarch” to be more inclusive for women, and this is just one of the hundreds of style improvement suggestions that these tools can make. For a lot of other languages, even simple spell-checking of single words hasn’t been developed yet, and grammar checking is a barely-imaginable dream.

3. Autocompletion

Simpler morphology has many other effects.

Even though Russian is my first native language and I speak it more fluently than I speak English, I am much slower when I’m typing in Russian on my phone. In English, the autocompleting keyboard makes it possible to write just two or three letters of a word and let the software complete the rest. In Russian, the ending of the word must be typed, and autocompletion rarely guesses it correctly. Typing an incorrect ending will make a sentence convey incorrect information, or just make it completely ungrammatical.

4. Searching

A yet-another issue of the previous point, English’s very simple morphology makes searching easier.

For example word processors have a search and replace function. For English, it will likely find all forms of the word, because there are so few of them anyway. But in Hebrew and Arabic, letters are often inserted or changed in the middle of the word according to its grammatical state, and you need to search for each form, which is quite agonizing. It’s comparable to “man” vs. “men” in English, except that in English such changes are very rare, while in many other languages it happens in almost every word.

With search engines that must find words across thousands of documents it gets even harder. Google can easily figure out that if you’re searching for “drive”, you may also be interested in “driving”, “drove”, and “driven”, but Russian has dozens of other forms for this word. A few languages are lucky: special support was developed for them in search engines, and tasks of this kind are automated, but most languages our just out in the cold. But English barely needs extra support like this in the first place.

5. Very little gender

A lot can be said about gendered language, but as far as basic grammar goes, English has very little in the area of gender. “He” and “She”, and that’s about it. There are also man/woman, actor/actress, boy/girl, etc., but these distinctions are rarely relevant in technology.

In many other languages gender is far more pervasive. In Semitic and Slavic languages, a lot of verb forms have gender. In English, the verb “retweeted” is the same in “Helen retweeted you” and “Michael retweeted you”, but in Hebrew the verb is different. Because Twitter doesn’t know that Helen needs a different verb, it uses the masculine verb there, which sounds silly to Hebrew speakers.

I asked Twitter developers about this many times, and they always replied that there’s no field for gender in the user profile. It becomes more and more amusing lately, now that it has become so common —and for good reasons!— to mention what one’s preferred pronouns are in the Twitter profile bio. So people see it, but computers don’t.

On a more practical note, in the relatively rare cases when third person pronouns must be used in software strings, English will often use the singular “they” instead of “he” or “she”. So English-speaking developers do notice it, but not as often as they should, and when they do, they just use the lazy singular-they solution, which is socially acceptable and doesn’t require any extra coding. If only they’d notice it more often, using their software in other languages would be much more convenient for people of all genders.

The only software packages that I know that have reasonably good support for grammatical gender are MediaWiki and Facebook’s software. I once read that Diaspora had a very progressive solution for that, but I don’t know anybody who actually uses it. There may be other software packages that do, but probably very few.

These are just the first five examples of English-language privilege I can think of. There will be many, many more. Stay tuned, and send me your ideas!

Continuous Translation and Rewarding Volunteers

In November I gave a talk about how we do localization in Wikimedia at a localization meetup in Tel-Aviv, kindly organized by Eyal Mrejen from Wix.

I presented translatewiki.net and UniversalLanguageSelector. I quickly and quite casually said that when you submit a translation at translatewiki, the translation will be deployed to the live Wikipedia sites in your language within a day or two, after one of translatewiki.net staff members will synchronize the translations database with the MediaWiki source code repository and a scheduled job will copy the new translation to the live site.

Yesterday I attended another of those localization meetups, in which Wix developers themselves presented what they call “Continuous Translation”, similarly to “Continuous Integration“, a popular software deployment methodology. Without going into deep details, “Continuous Translation” as described by Wix is pretty much the same thing as what we have been doing in the Wikimedia world: Translators’ work is separated from coding; all languages are stored in the same way; the translations are validated, merged and deployed as quickly and as automatically as possible. That’s how we’ve been doing it since 2009 or so, without bothering to give this methodology a name.

So in my talk I mentioned it quickly and casually, and the Wix developers did most of their talk about it.

I guess that Wix are doing it because it’s good for their business. Wikimedia is also doing it because it’s good for our business, although our business is not about money, but about making end users and volunteer translators happy. Wikimedia’s main goal is to make useful knowledge accessible to all of humanity, and knowledge is more accessible if our website’s user interface is fully translated; and since we have to rely on volunteers for translation, we have to make them happy by making their work as comfortable and rewarding as possible. Quick deployments is one of those things that provide this rewarding feeling.

Another presentation in yesterday’s meetup was by Orit Yehezkel, who showed how localization is done in Waze, a popular traffic-aware GPS navigator app. It is a commercial product that relies on advertisement for revenue, but for the actual functionality of mapping, reporting traffic and localization, it relies on a loyal community of volunteers. One thing that I especially loved in this presentation is Orit’s explanation of why it is better to get the translations from the volunteer community rather than from a commercial translation service: “Our users understand our product better than anybody else”.

I’ve been always saying the same thing about Wikimedia: Wikimedia projects editors are better than anybody else in understanding the internal lingo, the functionality, the processes and hence – the context of all the details of the interface and the right way to translate them.

Weird GMail Habit: Removing Control Characters

GMail has a weirdish feature that probably very few people except me know about. When using it with a Hebrew user interface, invisible control characters—LRM, RLM, RLE, LRE and the like—are added to some strings to make them appear correctly in a mixed-direction interface.

Most notably, they are added to email addresses. I sometimes want to copy these email addresses as text, and my mouse pointer picks the control characters as well. Of course, these control characters are by themselves invisible to humans, but very much visible to computers, and an email address with these characters is not correct, even if it appears to be the same to human eyes.

It already became a habit for me to carefully delete and manually restore the first and the last characters of an email address to make sure that the control characters are removed.

It would be better if GMail just used the <bdi> element or CSS bidi isolation. They are fairly well supported in modern browsers and provide better experience.

Serbian Spam

I always celebrate when I receive spam in a language in which I haven’t yet received spam. I just received spam in Serbian for the first time. It was in the Cyrillic alphabet; Serbian can also be written in Latin, and it is frequently done in Serbia, possibly even more frequently than in Cyrillic, even though the government prefers Cyrillic.

This makes me wonder: Is Serbian in Cyrillic popular and important enough for spamming in it, or did the silly spammer just use Google Translate to translate to Serbian and got the result in Cyrillic, because that’s what Google Translate does?

If you know Serbian, can you please tell me whether it looks real or machine-translated? Words like “5иеарс” and the spaces before the punctuation marks give me a strong suspicion that it’s machine translation, but I might be wrong.

Молим вас за попустљивост за нежељене природи овог писма , али је рођена из очаја и тренутног развоја . Молимо носе са мном . Моје име је сер Алекс Бењамин Хубертревизор Африке развојне банке открио постојећи налог за успавану 5иеарс .

Када сам открио да није било ни наставак ни исплате са овог рачуна на овог дугог периода и наши банкарских закона предвиђа да ће било неупотребљивим чине више од 5иеарс иду на банковни прихода као неостварен фонда .

Ја сам се распитивала за личне депонента и његове најближе , али нажалост ,депонент и његове најближе преминуо на путу до Сенегала за тајкун , а он је оставио иза себе нема тело за ову тврдњу само сам направио ову истрагу само да буде двоструко сигурни у ту чињеницу , а пошто сам био неуспешан у лоцирању родбину .

So, how does it look? And do you receive Serbian spam? Thanks.

The Fateful March of 1998 – my #webstory

I first connected to the web in the summer of 1997. I bought a new computer with Windows 95 and Microsoft Internet Explorer 2. For about a week I thought that that’s how the web is supposed to look, but I kept seeing messages saying “Your browser doesn’t support frames” on a lot of sites. And then I found that there’s this thing called Microsoft Internet Explorer 3. I went to microsoft.com and downloaded it. It was the first piece of software that I downloaded. It was about 10 megabytes and took about an hour on my dial-up connection.

Most notably, Microsoft Internet Explorer 3 supported frames and animated GIFs. I loved animated GIFs! I guess that it makes me quite a hipster.

A cat in headphones dancing to house music. — House cat. Sorry, it’s an anachronism— this animated GIF is from mid-2000s. 1997’s animated GIFs were quite different.

And then Microsoft Internet Explorer 4 came out. I thought—”well, if the move from IE2 to IE3 made such a big difference, then I guess that I should try number 4, and it will be even cooler”. And I tried. And it was a disaster. The installation screwed up everything on my computer. I had no idea how to disable the dreaded Active Desktop, which it introduced. It didn’t work so well with my Hebrew version of Windows 95. So I did what a lot of people did very often back then and formatted my hard drive and re-installed Windows.

And the question arose—which browser should I use? IE3 was stable, but I didn’t like that it was getting old. So I went to netscape.com, to try that Netscape Navigator browser that I kept hearing everybody talking about it.

And I loved it.

I loved its nifty toolbars and its bookmarks manager. I loved the crash reporting; it crashed quite often, actually, but I didn’t feel so bad about it, because Microsoft’s programs crashed often, too, and in case of Netscape I felt good about reporting these crashes. Netscape’s email program, Netscape Messenger, was truly outstanding. I especially loved the green dot, which marked messages as read and unread in one click. Most of all, it said very clearly something that I came to realize only years later: “I am a program that lets you browse the web as well as possible. I am not trying to do anything else.”

Fast forward to March 1998. Netscape made the big announcement that the development of its browser becomes an open source project code-named “Mozilla”. I started hearing about “open source”, “free software” and Linux shortly before that, but it was mostly in the context of crazy geek hobbyists. And then suddenly a big famous end-user product that I love becomes open source—that felt really cool.

I followed Mozilla news since then. I heard about Bugzilla before its first version was released. I liked Mozilla’s decision to redo the whole rendering based on standards, even though many people criticized it. The thing that annoyed me the most in Mozilla’s early years was the lack of support for proper right-to-left text support, which was present in Internet Explorer. That’s why I, sadly, used mostly IE, and even became a bit of an IE power user. But I waited eagerly for Mozilla to do it and tried every alpha release.

The famous New York Times ad.

I was thrilled about the announcement of Firefox, the first stable version of Mozilla’s browser. I gave 10$ to the famous 2004 New York Times Firefox advertisement, and I still have the poster of that advertisement at home.

A long list of names, including Amir Elisha Aharoni — And there’s my name. Third line in the middle.

It always seemed natural to me that I follow Mozilla news so eagerly. I thought that everybody does it. I mean, how is it even possible to use the web in any way without being at least a bit curious about the technology that runs it?

And then in 2008 I wrote a little unimportant post in my Hebrew blog about a funny spelling correction. Tomer Cohen commented on it and suggested me to try the Hebrew spelling dictionary and Hebrew Firefox in general. And that’s how my big love story with software localization began.

I started sending corrections to the translation of Firefox’s interface translation. I started sending corrections to the Hebrew spelling dictionary. I got so curious about the way the spelling dictionary was built that I ended up doing a whole university degree in Hebrew Language. Really.

And in 2011 I started working in the Language Engineering team in the Wikimedia Foundation. I love it, and it probably wouldn’t have happened without my involvement with Mozilla. In the same year I also became a Mozilla Rep—a volunteer representative of Mozilla at conferences, blogs and forums.

Probably the most important thing that I learned from my Mozilla story is that loving the web and being curious about it is not something obvious. Most people just want something that works for checking weather, news, Facebook friends updates, homework help and kitten videos. And for the most part, that is perfectly fine. But the people’s freedom to read reliable and complete news on any electronic device cannot actually be taken for granted. Neither the people’s freedom and privacy to share their thoughts in social networks. Mozilla is among the most important organizations that care for these things and it develops technologies that make them possible. Technologies that let you browse the web as well as possible and don’t try to do anything else.

We do it for one simple reason: We love the web.

Do you love it, too?

P.S. As I began writing this post, I realized that Microsoft’s Active Desktop was not so different from today’s devices, which are heavily based on web technologies: Firefox OS, Chrome OS and others. I can’t say that I love Microsoft, but as it often happens, it was quite pioneering with ideas, and not so good with their execution. Credit where credit’s due.