Teach You Backwards

Introduction: Into the Black Box of Google Translate

Martin Benjamin — Sat, 30 Mar 2019 16:52:18 +0000

Picture 1: Google Translate translation. The top left is an original English quote from Nelson Mandela. Bottom left is a human translation of Mandela’s quote to Japanese. On the right is Google Translate’s translation of native Japanese back to English. (誰もが知るべきネルソン・マンデラの17の知恵, translated by Koichi Higuchi.) Human translations of this quotation to many other languages, and then by Google back to English, including Mandela’s mother tongue Xhosa, are collected at http://kamu.si/mandela-lead

Does Google Translate translate? You might put your faith in it. You might laugh at it. I studied it, for all its languages. This is the first comprehensive examination ever conducted of the world’s most-used translation tool.

The research before you was not intended as an exposé of “fake data” in a multi-billion dollar industry. I set out with a difficult but narrow question: what is the quality of the results produced by Google Translate (GT) among the 108 languages it offers? Answering the first-order question about GT, though, led inexorably to an examination of the integrity of the claims underlying their version of machine translation (MT).¹ I found that, with certain types of formal texts for a very few language pairs, GT often produces remarkably good results. In the vast majority of situations for which they make claims to deliver something they label translation, however, the results go beyond imperfect. They are algorithmically designed to produce output data without any basis in linguistic science, often divorced from any actual form of human expression.

Nobody has empirically analyzed the internal workings of GT before because it is too complicated to figure out what is going on inside, with so many languages that no individual or small team could see into. As a result, the media and the public have long taken Google’s bold claims at face value, the way everybody believed Volkswagen’s “Clean Diesel” deception because nobody had developed the means to test their emissions outside of constrained conditions, until a research team took some VWs on the road and uncovered a massive toxic fraud. An arduous investigation of Enron’s opaque books unearthed a similar scandal, finding that the erratic delivery of electricity to customers in California overlay the defrauding of global investors of billions of dollars. While the economic stakes are not as high – VW paid over $25 billion in fines, and Enron went bankrupt – a journey along the threads of my original research question revealed a business enterprise that deliberately fakes the data to defraud its users by the billions of words every day.

Moreover, as I dug deeper, the research began to call into question many of the major tenets of the entire branch of computer science devoted to translation. Academic journals and the popular media are both full of ballyhoo about the miracles offered by Artificial Intelligence (AI) and the hot trends that will get us there: neural networks, neural machine translation (NMT), machine learning, deep learning, zero-shot translation. Many people take it for granted that we can someday achieve perfect universal translation using these techniques, if we haven’t already. My investigations found quite the opposite: MT on its current trajectory cannot possibly reach the promised land, because it has neither adequate data nor adequate techniques for converting the data it has across languages. This book debunks myths about AI (Myth 1), NMT (Myth 2), zero-shot (Myth 3), and computer omniscience (Myth 4).

Additionally, an entire chapter looks at the relationship between words, numbers, and translation, to explain the difficulties that will always prevent GT and its kin from achieving the results they promise. Please do not be daunted – the discussion is not as heavy as the word “mathematics” in the chapter title might imply. You will learn a lot about MT if you go there, but the hit counter shows people are scared that lurk there.

My line of work is language, or, better said, languages. All of them. A progression of events stemming from anthropological efforts to understand the persistence of poverty in Africa put me on an unlikely journey, the modest goal being the creation of a complete matrix of human expression across time and space. For over a quarter century now, I have been working toward this goal with language specialists, computer scientists, and citizen linguists,² developing systems to collect and process all that is knowable about the words people speak. Along the way, I have run into one consistent question: “Doesn’t Google already do that?” The answer is emphatically no. They don’t, and they can’t. However, “no” is an inadequate answer, so I set out to discover, through empirical research, what it is that Google actually does. What follows are the results of that project.

This is the age of “fake news” and “alternative facts”. The morning starts with presidential prevarication on Twitter, or the White House lawyer arguing “truth isn’t truth”. Facebook sells your personal data while telling you it protects your privacy. Top-ranked search results promulgate medical misinformation, with sometimes fatal consequences. Much of the Internet is populated with fake businesses, fake people, and fake content. You yourself are complicit in the culture of lying each time you agree that you have read the terms of service for the apps and websites you use (see Picture 2).

Picture 2: A checkbox from Google mandating an attestation that, through click history, both parties know is false. All 5 FAAMGs (Facebook, Amazon, Apple, Microsoft, and Google) have services that force users to click next to the words “I have read … the terms” in order to continue.

Google claims that it produces accurate translations among 108 languages. Is Google Translate fake data? It’s complicated. In all languages, some data is real and some is invented. In most languages, most data is³ fictitious.

The takeaway: For 36 languages vis-à-vis English, GT is better than a coin toss, producing text that conveys the main idea more than 50% of the time. For languages at the very top of TYB tests, if five conditions are met (listed in the conclusions – here’s a wormhole), you will often receive conversions that transmit the meaning, with good vocabulary and grammar, though there is always serious risk of a serious error. For 71 languages – over two thirds of the languages they claim to cover – GT fails to produce a minimally comprehensible result 50% of the time or greater. For translation between languages where neither is English, the gist of the original is transmitted in approximately 1% of pairs. As to “universal translation”, GT does not touch the remaining 6900 of the world’s roughly 7000 languages: emphatically no. Full data is available at http://kamu.si/gt-scores.

Roadmap: Teach You Backwards sets out to answer the question, “Doesn’t Google Translate already do that?”, and then tries to answer the question, “If Google doesn’t do it, how can it be done?” The book is divided into several chapters that address, in turn:

What is the context in which people use Google Translate? (You are here)
What does Google Translate do? Scientific measurements of GT across all its 108 languages.
Why doesn’t Google Translate do much of what it says it does?
Why can’t Google Translate accomplish what it says it does?
How could more effective translation be accomplished?
So what? What is wrong with Google Translate not doing what it claims?
Google Translate sometimes gets it right. How should it be used as a helpful tool?

To make use of the ongoing efforts the author directs to build precision dictionary and translation tools among myriad languages, and to play games that help grow the data for your language, please visit the KamusiGOLD (Global Online Living Dictionary) website and download the free and ad-free KamusiHere! mobile app for iOS (http://kamu.si/ios-here) and Android (http://kamu.si/android-here).

Readers’ Note: Repeated acronyms and technical terms can be accessed from the “About” menu at the top of every page. Please also take the informal and colorful expressions used throughout this text as intentional suggestions for your own tests of GT’s translations of everyday English into your chosen language. To notify TYB of a spelling, grammatical, typographical or factual error, or something else that needs fixing, or a reference that should be added, please select the problematic text and press Ctrl+Enter to open a correction submission box.

Overview⁴

I have led an in-depth independent study of Google Translate (GT) from both quantitative and qualitative perspectives. Please read my notes about the genesis of this research and the style in which it is presented; this web-book is a hybrid between scientific and investigative reporting that strays outside academic conventions in the effort to tell the story of GT both to experts and to its millions of users outside the ivory tower. The primary research across all 108 GT languages is described in detail in the Empirical Evaluation, as an objective scientific essay that transparently details the research, and releases the results as open data. The Empirical Evaluation is intentionally separated from the more opinionated parts of this book, which interpret the results, explain why you should care, present a path toward more honest and effective MT, and suggest do’s and don’ts for your relationship with GT. This is the first comprehensive analysis of how GT performs in each language they offer in their current configuration.⁵ Within my qualitative analysis, I further report the results of a four-year study across 44 languages of GT’s public contribution system that they say is used to improve their services. Additional questions are answered herein with research among smaller sets of languages.

Journalistic parts of this web-book look at the facts and the fantasies surrounding artificial intelligence (AI) and MT, showing from an expert perspective what GT is capable of versus what it does not and can never achieve. I determine that Google Translate is a useful tool among certain languages when used with caution, but I also show scientifically when, how, and why its truth often isn’t truth. After the qualitative conclusions, I offer detailed recommendations about the contexts in which you should use it and the situations where it cannot be used, including hazard warnings to never use GT for blind translations in any situation that matters, never use GT as a dictionary for individual words, never use GT to translate poetry or jokes, and never use it to translate among non-English pairs. I further advocate that GT should be required to print a disclaimer on all its output stating that the service must not be used in medical situations; where garbled translations could lead to serious harmful consequences, GT’s initially high failure rate for most languages is exacerbated by its lack of training in the health domain, and qualified human translators are essential.

Picture 2.1: “Lead from the back – and let others believe they are in front”, translated to Xhosa, Nelson Mandela’s mother tongue, by Xhosa scholar Dr. Andiswa Mvanyashe, then translated back to English by GT. This translation experiment has been performed across dozens of GT languages.

Sometimes GT offers translations that are indistinguishable from those a person would produce. Sometimes, the service offers clunky, fingernails-on-a-blackboard phrasing that nevertheless translates meaning across languages. Quite often, though, GT delivers words that can in no way be considered meaningful translations, such as transforming “lead from the back” into “teach you backwards” from Japanese (see Picture 1) or “pay back” from Mandela’s Xhosa mother tongue (Picture 2.1). The book before you opens the lid of GT’s sealed black box, to examine what the system actually does in relation to how it is portrayed and understood.

GT is embedded in Google’s market-leading search results and Android operating system, as well as numerous third-party services, which positions it as the world’s go-to service for automatic MT. Ubiquity does not indicate sagacity, however. The widespread belief that Google has solved MT could hardly be farther from the truth. From English to the 102 other languages they attempt – their bread and butter – Google’s reliability is good in a few, marginal in many, and non-existent in many others, in the tests I conducted. Translations among all but a handful of non-English languages in the GT set are persistently unreliable, being based on a conversion through English that compounds the error rate on both sides. Nonetheless, Google makes both explicit and implicit claims that GT can consistently translate as well as people. Hugo Barra, a vice-president of Android, Google’s software for mobile devices, says Google’s system is “near-perfect for certain language pairs”, such as between English and Portuguese. On their tenth anniversary, GT stated on their blog:

Ten years ago, we launched Google Translate. Our goal was to break language barriers and to make the world more accessible. Since then we’ve grown from supporting two languages to 103, and from hundreds of users to hundreds of millions… Together we translate more than 100 billion words a day… You can have a conversation no matter what language you speak.

Importantly, these claims underlie not only the free public access mode with which most people are familiar, but also Translation API, a product that Google sells for real money. At https://cloud.google.com/translate, they state, “Translation API supports more than one hundred different languages, from Afrikaans to Zulu. Used in combination, this enables translation between thousands of language pairs.” Picture 3 shows pricing information for bulk users of GT services. Google specifically tells paying customers that they produce valid translations among thousands of language pairs, entering into contracts with their customers in which the output GT delivers in exchange for money is represented as “translation”. I doubt that anybody in South Africa has ever considered paying GT to translate from Afrikaans to Zulu (both official national languages) after one glimpse at the results, but anyone who has paid for translation among such pairs based on Google’s explicit claims has fallen prey to a swindle, buying a product that my research shows is fraudulent by design.

Picture 3: Pricing for bulk users of GT.
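The compounding of error when a non-English pair such as Afrikaans to Zulu is routed through English can be illustrated with a rough sketch. This is not Google's method and uses no measured data: the function name `pivot_success_rate` and both rates are hypothetical, and the sketch makes the simplifying assumption that the two legs of the conversion fail independently, so that their success rates multiply.

```python
def pivot_success_rate(source_to_english: float, english_to_target: float) -> float:
    """Chance that the meaning survives both legs of a pivot through English.

    Assumption (not from the source data): the two legs fail independently,
    so the combined success rate is the product of the two rates.
    """
    for rate in (source_to_english, english_to_target):
        if not 0.0 <= rate <= 1.0:
            raise ValueError("rates must be between 0 and 1")
    return source_to_english * english_to_target


# Hypothetical rates, for illustration only.
first_leg = 0.6   # source language -> English conveys the gist 60% of the time
second_leg = 0.5  # English -> target language conveys the gist 50% of the time

combined = pivot_success_rate(first_leg, second_leg)
print(f"Combined success through the pivot: {combined:.0%}")
```

Under these assumptions the combined rate can never exceed the weaker of the two legs, which is one way to read the claim that a conversion through English compounds the error rate on both sides.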

In certain cases that will be discussed in this web-book, GT produces fully comprehensible results. Google’s best performances between English and its top tier languages for texts in some formal domains, however, leave many with the erroneous impression that similar results are produced consistently for all text genres in those languages, for all non-investment languages versus English throughout its list, and that such results are also achieved among language pairs when English is neither the source nor the target.⁶ These assumptions are based on several tropes and mind tricks that lead to outsized confidence in the words GT offers as translations.

Tropes and Mind Tricks that make you believe in make believe⁷

1. Elite language bias.⁸ The studies, demonstrations, and personal experiences that most active users of GT are aware of involve fewer than a dozen market languages on which the service has focused considerable attention: the FIGS, Portuguese, Russian, Chinese, and a few other European languages that, not coincidentally, score among the highest in my tests. Google’s public demonstrations showcase the languages where they feel the most confident. People who have only witnessed GT in elite languages often assume that the same level of performance occurs throughout. Until this study, no research I know of has examined GT’s performance on most of the dozens of languages that the audiences in their lucrative markets would have no occasion to encounter.

2. English speakers see only English.⁹ GT consumers typically consider it a service where English is either the source or the target. That is, “translation” in GT equals translation versus English. Not coincidentally, English is the lopsided focus of natural language processing (NLP) research, regarding both its internal dynamics and translation to other languages. English is therefore the showroom model, with much more window dressing than the languages on the outskirts. Much as you could get iced green tea at Starbucks but you generally go there for coffee, you know you could try Samoan to Somali in GT but you probably won’t.

Picture 3.1: Website for the Museum of Torture in Lucca, Italy, where translation is provided only in English. Lucca is jam-packed with tourists from around the world, especially from Asia, who cannot read either English or Italian. (Language selection box slightly photoshopped to improve visibility.)

Demographically speaking, English prevalence worldwide is very much like left-handedness, homosexuality, and vegetarianism – it occurs everywhere, and in similar numbers. To the extent that GT performs best only between English and a handful of other languages, its grand success is a service for gay vegetarian lefties. Casting no aspersions against that select group, it is surely only a small portion of global translation needs. English speakers tend to vastly over-estimate the number of people worldwide who speak their language, for several reasons:

- If you go to a hotel or upscale restaurant in any city with an international airport, the staff will speak English – it is a job qualification, the way knowledge of drug interactions is a requirement to work in a pharmacy.¹⁰ Some people everywhere learn English, and many of those people will seek jobs where their English is useful, and they will be hired favorably by employers who serve people from abroad. If you find yourself needing to talk to a police officer or a hairdresser, you will be much less likely to find competent English, because learning the language is not a necessary skill for most people training for those professions.
- A few countries, especially in Northern Europe, make a point of teaching English to their youth. People whose major travel experience is the Netherlands might barely even perceive that Dutch is the national language. On a trip to Norway, I could always find a friendly young person to help me, for example, read the ingredients on a bottle of pasta sauce to figure out if it was vegetarian. Even in Norway, though, many English interactions were strictly limited – a young agent at the train station digging deep for “follow her” to point me toward the transport I needed, or, when three women in a fjord swam past me on a dock where I had a good view of hazards in the water, one was able to put together, “Are there jellyfish?” India has tens of millions of English speakers, but my experience interviewing students for IT positions shows that even among brilliant techies the level of English is often far less than brilliant. A colleague at a major international organization once assured me that “everyone in India speaks English”; when I visited her home in Delhi, I found that “everyone” did indeed, but not their driver, not their gardener, and not their cook.
- People who travel are also more likely than most in their country to speak English, either because they like traveling and learned English to make their movements easier, or because they belong to a class where both English and international travel are part and parcel.
- People with international professions, including academics, usually learn English as the common language of communication. This is a small minority of people in most countries, but a majority of the people a professional is likely to associate with. It is an absolute requirement for pilots and air traffic controllers, since disasters would result without an agreed common language, so your pilot will almost always make some in-flight announcements in English.
- You will also think that tons of English is spoken because your ears naturally perk up when you hear sounds you recognize, in a way you would not notice someone behind you speaking Polish or Lao or Hausa, unless you are Polish, Laotian, or Nigerian. My ears will pick out the dulcet tones of an American from 40 feet away. Recently, though, I was at a Franco-Italian wedding in Switzerland where the woman seated to my left could not identify what language I was speaking with my family, who were the only English speakers on the guest list.
- English is a language people aspire to. Parents in many parts of Africa are eager for their children to learn it, and many European youth choose it for their language requirements in high school. From Korea to Ecuador, many people have learned enough English to ask you “How are you?”, but not nearly enough to really tell you how they are. Nevertheless, minor interactions with people who can eke out some pleasantries provide confirmation bias that the whole world is speaking the language. Having meandered widely outside of English mother-tongue environments for much of my life, I have witnessed frequently that there is much less of the language than meets the ear.
- English makes many cameo appearances in places where it is not actually understood. People worldwide wear T-shirts that bear English messages, some of which make sense. A dinosaur-skeleton keychain from Japan that I used for years had the message “Joyful Stegosaurus Happy Bone”, which looks like English. Advertisers throughout Europe often use English or English-esque taglines to give an international or youthful flair. Public buses in Thessaloniki announce all their stops: “επόμενη στάση [stop name], next stop [stop name]”, as though (a) no foreigners would cotton on that “epómeni stási” is Greek for “next stop” by the third time they heard it, and (b) all foreigners would understand “next stop”. Products everywhere feature switches labeled “on/off”, but mastering the on/off switch is a far cry from mastering English – and even this text is being replaced by a universal icon. In places far and wide, English words often float through the air like soap bubbles shimmering in the breeze.
- English is like the US dollar. It is the most common global medium of exchange, but that does not mean that most people hold, or have ever even seen, dollars. Elites from every country use dollars for international transactions, but they use the local currency and the local language for their local activities. Americans in particular assume that the world runs on the greenback. I once stood in line behind an American who had just arrived at London Heathrow airport, and, when he demanded a train ticket to Buckingham Palace, was outraged that the clerk would not accept his money. For sure, when dealing with funds for Kamusi, the US dollar is essential to pay for IT and linguistic work in places like India and Kenya, or sometimes even Switzerland. However, when I go grocery shopping in Switzerland, or anywhere else on the European continent, my dollars are always useless, and my English is only marginally more likely to oil the transaction.

Until this study, third-language translations have gone almost entirely unexamined, with viable translations of formal texts from a few elite languages to English serving as a synecdoche for all other cases. Although anybody reading this article obviously has a sophisticated knowledge of the language, only 15% of people speak English somewhere on the spectrum from knowing to get off the bus when they hear “next stop”, to conducting research in nuclear physics. For 85% of the world, English rests as far from their daily experience as deep-sea fishing probably does from yours, though access to dependable translation of the large chunk of global knowledge that is housed in English would benefit speakers of any other language.

A sense of English competence around the world can be gleaned from the EF Proficiency Index, which ranks countries by English skills on the basis of online tests of people interested in online English classes (not a representative measure!). I’ve been to 34 of the top 50 countries on the list, and can attest that moving around the Netherlands (#1) in English is simplicity itself, but getting a haircut in urban Portugal (#11) involves a lot of pointing and hand waving, and no English words will help find out what’s going on with a diverted train in southern Bulgaria (#24).

Picture 3.2: The population of Africa has roughly doubled in the past quarter century. This graph from World Population Review shows the number of people born into African language communities in Nigeria during that time.

Meanwhile, the population of people born into many non-English-speaking communities as much as doubles every twenty-five years. Nigeria, for example, had roughly 108,000,000 people in 1995, and 207,000,000 in 2020, as shown in Picture 3.2. Although the “official” language of Nigeria is English, those youth are mostly speaking Yoruba or Igbo or Hausa or others of more than 500 languages. The percentage of the population age 15 and above who can, with understanding, read and write a short, simple statement on their everyday life in any language, a.k.a. the literacy rate, was estimated at 62% in 2018. What proportion of those people have basic literacy in English is unreported, but only about 45% of Nigerians attend even a year of secondary school, which is where students would be expected to aim toward proficiency. Many African countries have similar demographics, but only some have English as their colonial souvenir language. Most countries with similar population growth in Asia and the Americas have almost no official relationship to English. We can reasonably posit, then, that many hundreds of languages are growing rapidly in terms of raw numbers of speakers, while English could actually be losing ground in terms of the percentage of people globally who speak it.

English is thus an important but fairly small part of the potential global demand for translation, with geographically proximate pairs such as Chinese-Japanese, Hindi-Bengali, Yoruba-Swahili, and Quechua-Aymara often the mixes that matter to the roughly 7 billion people who cannot read this paragraph. Using round numbers, the current global population is 7.7 billion people. Although reliable census data does not exist, the rough guess from knowledgeable people hovers at around 5% as native English speakers, not all of whom are literate. Roughly 10% of others are guessed to have some knowledge of the language. If half of second language learners have functional literacy, we can posit around 700 million readers of English, and 7 billion for whom English is just background noise.
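The back-of-the-envelope arithmetic above can be checked with a short sketch. The population and percentages come from the paragraph; the function name `estimate_english_readers` is hypothetical, and the `native_literacy` parameter is an assumption, since the text notes that not all native speakers are literate without giving a rate. Setting it to 1.0 yields an upper bound.

```python
def estimate_english_readers(
    world_population: float = 7.7e9,         # "current global population" in the text
    native_share: float = 0.05,              # rough guess: 5% native English speakers
    second_language_share: float = 0.10,     # roughly 10% of everyone else
    second_language_literacy: float = 0.5,   # half of learners functionally literate
    native_literacy: float = 1.0,            # ASSUMPTION: upper bound, no rate is given
) -> dict:
    """Rough estimate of how many people can read English, using round numbers."""
    native = world_population * native_share
    others = world_population - native
    second_language = others * second_language_share
    readers = native * native_literacy + second_language * second_language_literacy
    return {
        "native_speakers": native,
        "second_language_speakers": second_language,
        "readers": readers,
        "non_readers": world_population - readers,
    }


estimate = estimate_english_readers()
for label, value in estimate.items():
    print(f"{label}: {value / 1e6:,.0f} million")
```

With full native literacy the sketch gives an upper bound of about 750 million readers and about 6.9 billion non-readers. Any native literacy rate below 100% pulls the first figure down toward the 700 million posited in the text, and the second rounds to 7 billion either way.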

Picture 4: First and seventh result for “wild turkey” from Google Images.

3. In Google We Trust.¹¹ People have implicit faith that, in general, Google delivers what it says it will, be that websites, map locations, images, or translations. All users will share Hofstadter’s experience: “if I copy and paste a page of text in Language A into Google Translate, only moments will elapse before I get back a page filled with words in Language B”. The fact is, though, that you never blindly accept the first photo proposed by Google Images. You accept that the results are all indeed pictures, but you are able to use your own judgement to assess which of hundreds of images best meets your needs. If you were to search for a picture of “wild turkey”, the bird in Google’s first result, Picture 4, might meet your exact desires, or perhaps you prefer a different image from their bird candidates, or maybe the bottle of bourbon that appears seventh will better quench your thirst. Your confidence in the service derives from their ability to unearth a host of images for you to choose from, even if sometimes they miss completely. If you are actually looking for a picture of Turkish wilderness that would resemble the results for “wild bulgaria”, their decision that you want birds or bourbon is a fail, and you hone your search with a term such as “mountain” or “forest”. People blindly accept text from GT, however, because they do not have the linguistic skills to evaluate the proposed translations. It would be unimaginable to accept the first suitor chosen by a dating website – services like Tinder revolve around users swiping left to reject matches the algorithm proposes – but that is the confidence that many users accord the basket of words delivered by GT.

Research by Resende and Way demonstrates that students trust Google so much that they often use it as a tool for language learning: “Participants trusted the GT output enough to change their linguistic behaviour in order to mirror the system’s choices.” In their carefully controlled trials, they found that students adopted a grammatical pattern that Google produced consistently from Portuguese to English, which was an improvement versus replacing Portuguese words with English words while maintaining the Portuguese grammatical pattern. The authors go on, though, to “wonder whether MT pitfalls can be learnt and generalized by users when speaking or writing in English.” TYB shows that GT gets it wrong in the proportion of cases demonstrated in the empirical findings, yet Resende and Way show that students often trust the service to teach them the proper way to speak a foreign language. What could possibly go wrong?

4. The willing suspension of disbelief.¹² Although people see faulty translations in GT time and again (the “Translation Fails” channel by Malinda Kathleen Reese on YouTube contains numerous examples), they credulously assume the next translation will somehow work. This is the psychology of desire over evidence. For low-scoring languages, many people seem to take pride in the inclusion of their language within the service, though their experience shows that it is of no value in their own translation needs. For example, though he scored his own language a zero in my tests, a respondent still wrote, “I also would like to thank Google for including my language (Tajiki) into this 102 languages on their online service”. An evaluator for Kinyarwanda who assessed the results as “misleading”, “inadequate”, and “strange”, continued, “Despite all these, the tool is amazing when I consider being without it. It psychologically gives you an impression of hope of having an improved version as time goes on”. Similarly, though Hawaiian failed 75% of the time in my tests, Aiu quotes a Hawaiian student named Hōʻeamaikalanikawahine as saying, “To see our language thriving and being recognized by Google makes me proud. I imagine that our kūpuna who got beaten for speaking Hawaiian in school, but continued to speak Hawaiian in order to make sure the language was not lost, are also standing proud right now. Their bravery and perseverance resulted in the revival of such a beautiful language. I’m hoping that as more people see, hear and use our ʻŌlelo, they become familiar with the idea that Hawai’i is more than just a tourist attraction, but the home of a thriving and healthy language.” This might be called the “Fleetwood Mac effect”: “I’ll settle for one day to believe in you… Tell me sweet little lies”.

5. “It isn’t perfect, but…”¹³ I have heard these exact words repeatedly, and for people who use GT for informal translations in languages at the very top of the heap, it might be a worthy sentiment. For example, a Sudanese woman who works in the planning department at UNICEF in Geneva said, “It isn’t perfect, but…” in regard to her use of the service to convert to English for things she cannot understand in French. However, she chuckled derisively when asked whether she uses it for her native Arabic. A British simultaneous translator from English to Spanish for the International Olympic Committee also used the phrase, explaining that he uses GT when he gets the text of a speech at the last minute so he can quickly see if there is tricky vocabulary coming his way. The first Google hit for “google translate accuracy”, the website of a translation agency, concludes GT is good for casual use but urges human translation when the result is important: “It gets the general message across, but is still far from perfect.”¹⁴ Phrased a bit differently, “Google Translate’s pretty good but it’s not always accurate”.¹⁵ I could go on.

Picture 4.1: The imprecision of imperfection. When something “isn’t perfect, but…”, where does it fall on the scale between perfect and perfectly wrong?

Framing the results in terms of “perfect” introduces a matter of psychology. I’ll do some mind reading by predicting your answer to the question “what is the opposite of ‘perfect’?”. You said “imperfect”, right? The term blocks you from considering any other antonym. But what is “imperfect”? If my daughter were to bring home a German exam where she lost one point for an errant umlaüt, that would be imperfect. If she were to bring home the same exam with only one word right, that too would be imperfect. When evaluating translations, “imperfect” is somewhere between 1% and 99%, and the opposite of “perfect” is “wrong” (see Picture 4.1).

https://www.teachyoubackwards.com/wp-content/uploads/2019/03/half_of_crazy.mp4

Many people do not expect perfect translations from GT, and feel that getting some of the words is better than nothing. We often get satisfaction from the effort, not the result, such as the pleasure a parent feels at a child’s music recital or sports event despite its distance from professional standards, or the conviction that their child is a genius during the half of the time that the child correctly tells left from right. This is church coffee: free and well-intentioned, which people drink even if it tastes like it was brewed in the mop bucket. Although a Wired review of the GT “Pixel Buds” product a year after its release evaluated it as “a bit of a con” (Leprince-Ringuet 2018), this piece of journalism, where failure is techsplained as a raging success, accompanied its release:

The coolest part of [Google’s] Pixel Buds is the ability to use them as a universal translator. It’s like something out of “Star Trek” — at its November launch, you’ll be able to use the Pixel Buds to have a conversation across 40 languages. I had the chance to try this feature out. And it works! Mostly. Now, the magic happens. In my demo, I tried out my mediocre Spanish on a Google spokesperson wearing Pixel Buds, so I’ll use that as my example. He spoke English into the Pixel Buds, asking “hi, how are you?” The Pixel 2 phone spoke, out loud, the equivalent Spanish phrase: “¿hola, como estas?” This text was also displayed on the screen, which is good, because the demo area was noisy. That noisy room also led to the demo’s biggest glitch: When I went to answer in Spanish — “muy bien, y tu?” — the Pixel 2’s microphone didn’t pick me up clearly. In theory, my conversation partner should have heard “very well, and you?” Instead, all the app heard, and translated, was “William?” Bummer. I’m willing to cut Google some slack, here — the room was cacophonous with the sounds of my fellow tech reporters playing around with all of Google’s new gadgets. In my own experiences with Google Translate, it’s pretty solid at recognizing language, so I trust that it would work as well here. Still, be aware that it might not work in a noisy bar… It’s another sign of how Google is turning its considerable edge in artificial intelligence into futuristic, but very real products that make a difference today.

6. GT learns constantly from its users.¹⁶ This is the impression that GT has a steady program to improve based on input from users. GT says they are doing this through the promulgation of the “translate community”. I conducted a four-year experiment of their crowd contribution process among 44 languages, with more than 200 volunteers, and found it to be largely ineffective and inherently flawed, contravening scientific requirements for linguistics and computer science. There is some learning going on – for example, about a year after I submitted an error report, the wrong translation of “Tutaonana” from Swahili to English shown in Picture 37 was eventually corrected to “See you later” if entered by itself in uppercase, or “we will see each other” if typed in lowercase (both valid, though my suggestion was the all-purpose “Farewell”, so these suggestions came from elsewhere). The word is still mutilated, however, when put into the formula that closes millions of phone conversations every day (“Haya, tutaonana” is rendered as “Well, we’ll see”, instead of “Ok, goodbye”). The belief that GT learns from its users is the fifth myth investigated in the chapter on qualitative analysis.

Picture 5: Robot-generated text to which humans might ascribe meaning. (http://inspirobot.me)

7. Don’t mind the gaps.¹⁷ People are naturally inclined to skip over gaps in meaning. If a cluster of words seems like it should make sense, people will twist their minds to give it sense. The effect is well illustrated at InspiroBot, which generates fake inspirational quotes such as “Be in support of the wind”, and “Expect eternity. Not your inner child”, and the pseudo-profundity in Picture 5. Peder Jorgensen, co-creator of the site, ponders, “Our heads, our minds, you try so hard to give them meaning, so they start making sense in a way,” and his radio interviewer adds, “It’s not what the words say. It’s what they say to you”. It is apparent from comments by participants in this study that machine translation from English “works” a lot better for people who already know a substantial amount of the source language. For example, a Polish evaluator who lives in the US was able to apply her stock of American metaphors to the Polish word-by-word conversions of English expressions, and consider them fine translations, whereas a Warsaw-based evaluator did not have the parallel English conceptual framework and therefore had no clue about what motivated the translation. Similarly, evaluators across languages with substantial English and some familiarity with gay culture were better prepared to understand a literal translation of “out of the closet”, despite the non-existence of that metaphor in their own society, than people who had never encountered the English phrase. In the words of Hofstadter, “It’s hard for a human, with a lifetime of experience and understanding and of using words in a meaningful way, to realize how devoid of content all the words thrown onto the screen by Google Translate are. It’s almost irresistible for people to presume that a piece of software that deals so fluently with words must surely know what they mean”.

8. Confirmation bias.¹⁸ You might have a hunch about how a phrase should be translated. For example, you might think that “spring tide” has something to do with the seasons. When GT gives you a result that uses a term for the spring season, your hunch seems to be verified. GT is wrong, but you judge that it is right because its proposal matches the guess you would make yourself. Even more significantly, GT often does get all or part of a translation right. We often register the wins in the victory column, but shrug off the misses, with the lasting impression that the overall quality is higher than shown by empirical tests.

9. Gaslighting.¹⁹ If you have weak skills in one of the languages in the translation pair, you might accept something from Google that does not look instinctively right. Maybe you question the translation, but then you question yourself for questioning it. Maybe GT knows something you don’t? After all, they’ve analyzed millions of documents and you struggle with a newspaper article. Who are you going to believe, the supercomputers at Google or your lying eyes? This functions in the same manner as your belief in your GPS; surely you have been led down cow paths when your intuition told you to stay on the main road to reach your destination, but, nevertheless, you find yourself turning onto every future cow path your device directs you toward.

Picture 5.1: When arranging a playdate for my daughter, with the phone set to French, the autocorrect suggestions were incest or drunkenness. NLP for investment-grade languages like French does not generally reach the same level as English. For non-market languages, ivresse is a pretty good descriptor.

10. We are suckers for hype.²⁰ We are repeatedly led to believe that GT and its competitors²¹ are delivering valid translations because the media tell us so, because Google tells us so (see Picture 6), and because its fellow members of FAAMG tell us so. “Google says its new AI-powered translation tool scores nearly identically to human translators”, say the headlines, and “Google Announces New Universal Translator”. Please read the transcript and watch a presentation where Facebook’s Chief AI Scientist states as fact that we have essentially solved MT among all 7000 human languages, a proposition that I have heard echoed by many other computer scientists who should know better. (The many errors in his claim are dissected as Myth 1 and Myth 2 in the qualitative analysis; click the footnote to read a Facebook translation from Chinese to English that conveys that the writer went toy shopping, but does not indicate Facebook has licked MT.²²) “One of our goals for the next five to 10 years is to basically get better than human level at all of the primary human senses: vision, hearing, language, general cognition”, says Facebook CEO Mark Zuckerberg. Similarly, in one podcast, David Spiegelhalter, a deservedly well-respected slayer of myths about risk and statistics at the University of Cambridge, said, “There are algorithms that you want to operate completely automatically. The one that when you pick up your camera phone it will identify faces… the machine translation, the vision, they’re brilliant, and they all work, and I don’t want to have to ask them every time, ‘why do you know that’s a face?’”. Prior to this study, you had no tool to evaluate the hype on your own, so you can be forgiven for taking wild exaggerations about GT at face value. Now that you have this study’s empirical results and qualitative analysis at hand, you have the basis for a more realistic understanding of the hype that is spewed about the words GT spawns.

Late-breaking news (24 September, 2019): In my experience, travel for intergovernmental agencies is often finalized at the last minute. As I write, I am waiting for tickets for such a trip, for a meeting in a few days. The last time I flew for this agency, the ticket was confirmed as I sat on the train to the airport. In response to my latest email hoping to prevent a repeat of history, today I received a reply, “Notre agent de voyage émettra tous les billets ce jour même.” I was fine with the first part, but, the most frequent French term for today is “aujourd’hui”. “Ce jour” is a common alternative, but I knew that mixing it with “là” could change the meaning to “that date”, and I had never encountered a construction with “même”, French for “same”. Would my tickets be issued today, with enough time to pre-order vegetarian meals for the 12 hours in the air, or was I being told that I should make my way to the airport check-in desk for a ticket to be issued on the same day? From GT: “Our travel agent will issue all tickets that day” (Bing and DeepL said that same day). This is an important communication, from one of GT’s highest-investment languages to its English heart. “Today” is the 214th most common word in the English corpus. GT is using an NMT model for French that we are assured is nearly flawless, and getting better every day. And yet, I had to wait until I picked up my daughter after school to get the native-speaker interpretation, “today”, reassuring me that maybe tomorrow I’ll have confirmed seats. As to “late-breaking news”, GT gives a few variations from English to French, depending on how you capitalize, hyphenate, and punctuate the phrase, ranging from fine to freaky, but delivers the equivalent of “news about being late” for other languages. 
One major language pair (English-French), two common terms that needed translation in the normal course of a day (today and late-breaking news), two cases where the meaning is important (getting an intercontinental flight, and publishing for a global audience that includes many non-native readers), two instances where the result is dubious. Words have been rendered, but you should not trust that translation has occurred.

Instant Camera Translation²³

Picture 5.1.1: Menu posted outside a restaurant in Heidelberg, Germany.

Google Translate’s Android app has one oft-touted feature that is cool to use, but is not actually related to translation. You can point your camera at any written text, such as a sign or a menu, and the app will overwrite the text with a translation, simulating the original design characteristics (font, color, size, angle, etc). The underlying technology combines image processing, optical character recognition, and language identification (which Google does very well, except for some difficulty with, e.g., texts from the Wikipedias for closely related languages like Bosnian and Croatian). Once the image has been parsed as text, it is sent to the standard GT translation engine, then returned to the pixel processor to overlay the original text with the translated words in an emulation of the original format.

Picture 5.1.2: German menu translated to English using Instant Camera.

It is important to note that Instant Camera results are at best only as good as the translations that would result from typing the same text into the input box, rather than taking a photo. Results are often degraded when parts of the text are not properly recognized by the imaging software, as happened in some of my informal tests with Google’s flagship Pixel 3 phone. Discussing the imaging technology in the same breath as discussing the translation technology is reminiscent of answering the question “How was your flight?” by saying, “I had a Coca-Cola in a plastic cup from the beverage cart. It was sweet and tangy and really quenched my thirst” – completely tangential to the question at hand. Picture 5.1.1 shows a German menu from a restaurant in Heidelberg, and Picture 5.1.2 shows the Instant Camera translation to English. You might not know the Japanese on a food package attempted by @droidy on Twitter, but you do know that Instant Camera is winging it with “The god is Pakkane” or “finger is in a paging cage” as she was shown. A Kenyan journalist reacts to the introduction of the technology for her language with due perspective: “a quick trial of the new feature doesn’t seem to produce the touted accuracy in certain instances, but we all know how poor Swahili translations can be from our experiences with Twitter and Netflix.”

I suggest a trial that you can do on your own. I did it with a box of herbal tea from a local supermarket, where the ingredients and instructions were printed in parallel in German, French, and Italian – you can probably find something similar in your own pantry, or you can use a multilingual instruction manual you have saved for some gadget. For my test, I used the GT app to take a photo of the Italian portion of the tea box and translate it into German (both high scoring languages in TYB tests). There was very little room for ambiguity. Nevertheless, the German results were unusable. Many of the words were right (including “Maltodextrin” for “maltodestrina” – nice get!), but the overall output was a hodgepodge of words, fragments, and isolated letters. The sentence meaning “Store in a cool dry place” came into German at a Tarzan level, with a word choice indicating one should put the tea in the basement and forget about it. This was a good exercise because it displayed GT results in one hand on the phone, and professional results in the other hand on the box. The teabox experiment had too much text to lend itself to photography, though, so I followed my own advice and tried a more photogenic instruction manual for walkie-talkies that had been professionally translated by Motorola. Picture 5.1.3 shows the results from Hungarian to English. You can judge whether you understand the translation from Hungarian in the center to give you the same information as the original English on the left.

Picture 5.1.3: On the left is a page in English from a Motorola instruction manual. On the right is the same set of information in Hungarian. In the center is the Instant Camera translation from Hungarian to English, as produced with Google Lens software on a Google Pixel 3 camera.

Picture 5.2: The lei lie. GT asserts that “ලෙයි”, which spells the sound “lei”, is a real word with the same meaning. This exercise can be repeated in every Google language as an example of MUSA, the Make Up Stuff Algorithm. (In the rare case that “lei” has been adopted into a language you know, such as Dutch, you can find the same phenomenon with any of bajillions of terms that do not appear in the parallel training vocabulary, such as “bro hug” or “bajillion”.)

MUSA: The Make Up Stuff Algorithm²⁴

English is by far the language with the most investment and the most NLP resources, even serving as the primary stomping grounds for computational linguistic research in Europe, Asia, and Africa because that’s where the datasets are and that’s where the jobs are. We all know that Google gets English wrong much of the time, pointing us to off-target search results or silly Android text predictions, but we let ourselves be snookered into believing that they have brilliant algorithms within and among scores of other languages – what we can call BEDS, Better-than-English Derangement Syndrome (see Picture 5.1). Google does have an algorithm that it invokes regularly to generate the aura of brilliance. I call it MUSA, the Make Up Stuff Algorithm. If GT does not find an equivalent for “porn” in its Latin vocabulary, it invokes the MUSA imperative to produce, caveat emptor, “Christmas cruises”.

One genre of MUSA is the “lei lie”. The term “lei” has entered English as a result of the historical relationship between Hawaii and the United States of which it now forms a part, but it is not a universal term in all languages. Nevertheless, GT will provide a “translation” for “lei” in every language in its system, for example “ලෙයි” in Sinhala (Picture 5.2) and “லெய்” in Tamil, the languages of more than 21 million people in Sri Lanka. Both of those are merely phonetic transliterations of the sounds in “l-e-i” in their respective alphabets, not actual words with actual meaning. Yet, not only does GT assert that the glyphs represent real terms in those languages, but it also offers “லெய்” as a translation of “ලෙයි”. Sri Lankans enjoy flower garlands – one could visit Lassana Flora in Colombo and purchase what transliterates as “mal mala” in Sinhala or “maalai” in Tamil – but GT’s confidence in declaring that such things are called something like “lei” in any language you ask is a confidence scam, a.k.a. fake data, a.k.a. a lie.

You could spend all day finding examples of lei lies and other prevarications for all 108 GT languages. Here is a little game derived from an amusing tweet. Using Spanish as your source language, pick an English word that is not a Spanish cognate, and plug it into the blank: “No _______o.” For example, No understando, or No mouseo. Now, choose any target language you can make some sense of. GT will apply some rule from its Spanish repertoire, and route the result through English to give you plausible-sounding output in your target language. “No mouseo” comes out in Galician with the rough meaning, “I don’t do mouse”, and in Esperanto, Frisian, German, Swahili, and many other languages as “I am not a mouse”. In French it comes back as “I do not smile” because “souris” is the French word for “mouse”, GT invents “je souris” as the first person conjugation of “to mouse”, and “je souris” just happens to already be the real first person conjugation of “to smile”. Feel welcome to share similar MUSA tests you discover for other languages in the Comments section below.

An MT research scientist at one major company producing public translations, who asks not to be named, suggests that MUSA may be driven in part by consumer demand. They tell TYB, “A translation provider is expected to translate. The user insists on getting something, even if it’s garbage.” This could explain the imperative designers of MT engines feel to provide random words instead of a frank [unknown] label that some portion of a text is outside their confidence zone. Potemkin did not build fake village facades because he liked building fake villages, but because Catherine the Great liked seeing pretty villages as she cruised along the river. Whether consumers really want MUSA may or may not be true, but if service providers think it is true, that helps explain why they algorithmically cover up missing data with fake data.
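The design choice the researcher describes can be made concrete with a toy example. The sketch below is purely illustrative: it is a hypothetical word-lookup "translator" with a made-up two-entry lexicon, not a description of how GT or any real engine works. It contrasts a policy that frankly labels out-of-vocabulary words as [unknown] with a MUSA-style policy that passes the source word through as if it were a real word in the target language.

```python
# Hypothetical toy lexicon; the target-side tokens are placeholders, not real words.
LEXICON = {"hello": "TGT_HELLO", "friend": "TGT_FRIEND"}


def translate_frank(words, lexicon):
    """Label every out-of-vocabulary word instead of guessing."""
    return [lexicon.get(word, "[unknown]") for word in words]


def translate_musa(words, lexicon):
    """Make something up: emit the source word as though it were a target word."""
    return [lexicon.get(word, word) for word in words]


if __name__ == "__main__":
    sentence = ["hello", "lei", "friend"]
    print(translate_frank(sentence, LEXICON))
    print(translate_musa(sentence, LEXICON))
```

Under the frank policy the reader can see exactly where the data ran out; under the MUSA-style policy the gap is invisible unless the reader already knows both languages.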

With no counter-narrative explaining what the hype gets wrong – that GT in fact uses MUSA to invent a French verb, “snooker”, as a translation for “snookered”, and then conjugates it by proper rules as “a été snooké”! – most people will assume that the hype is the truth. The research presented herein is the first major attempt to investigate what is true in the claims you hear about GT, what should be taken with salt, and what is pure snake oil.

0.0006% of the Way to Universal Translation²⁵

The tale that GT provides universal translation breaks down even further when you look outside the 108 languages it claims to service. Stated categorically, GT completely fails in 99.9994% of possible translation scenarios. With no service and no known plans²⁶ for 7009 of the world’s 7117 languages that have been granted ISO 639-3 codes (see McCulloch for insight into why most languages remain unserved), “universal” is nowhere on the agenda – neither between any of those languages and English, nor in almost the entire 25,000,000 potential pairs.
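The two round numbers in this paragraph can be checked with a few lines of arithmetic. The sketch below uses only the language counts cited in the text (7117 ISO 639-3 languages, 108 in GT) and assumes that a "pair" means an unordered pair of two distinct languages. It does not attempt to reproduce the 99.9994% figure, which depends on how many pairs are counted as working.

```python
from math import comb

ISO_LANGUAGES = 7117  # languages with ISO 639-3 codes, as cited in the text
GT_LANGUAGES = 108    # languages GT claims to service, as cited in the text

# Languages with no GT service at all.
unserved = ISO_LANGUAGES - GT_LANGUAGES

# Unordered pairs of distinct languages (assumption: direction is ignored).
all_pairs = comb(ISO_LANGUAGES, 2)

if __name__ == "__main__":
    print(f"Unserved languages: {unserved}")
    print(f"Potential language pairs: {all_pairs:,}")
```

The pair count lands a little above 25 million, which is the order of magnitude quoted above.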

Picture 6: Google’s top result for “the ability to translate among any language” is categorically false, but feeds its own hype.

Picture 7: Google mission statement

Granted, first, the languages claimed in GT are spoken to at least some degree by a good half the world’s population,²⁷ and second, many languages on the long tail will never have occasion to come in contact, but leaving the languages of smaller or less powerful groups out of the notion of the quest for “universal” contributes to their further marginalization, and, it could be argued, often works toward their extinction. It is of course not Google’s responsibility to attempt such a feat, but it falls under the rubric of their corporate mission statement (see Picture 7), and they have gotten significant mileage from the publicity surrounding their modest corporate contribution to the Endangered Languages Project.²⁸ It is the corporation’s responsibility to debunk the impression they let float that they are organizing comprehensive linguistic information and making it universally accessible and useful within MT for most of the world’s languages.

Original Dutch text	Professional human translation	Google Translate translation	Tarzan	BLEU score
1. Chef-staf Witte Huis John Kelly stapt op	White House Chief of Staff John Kelly steps down	Chief of Staff White House John Kelly gets up		23.74
2. AMC kocht honderden hoofden van omstreden Amerikaans bedrijf	AMC bought hundreds of human heads from controversial American company	AMC bought hundreds of heads of controversial American company		48.96
3. Russische mensenrechtenactiviste van het eerste uur overleden	Life-long Russian human rights activist has died	Russian human rights activist died from the very beginning		29.07
4. Week in beeld: nog meer gele hesjes, goedheiligman en George Bush	Week in pictures: even more yellow vests, Saint Nicholas, and George Bush	Week in the picture: even more yellow vests, good saintman and George Bush		49.36
5. Oorzaak Italiaans nachtclubdrama mogelijk ingestorte balustrade	Collapsed railing possible cause of Italian nightclub drama	Cause Italian nightclub drama collapsed balustrade		19.74
Table 1: Comparison of human and machine translations of 5 headlines from the leading Dutch television news program NOS Journaal, as they appeared online on December 8, 2018. Scores calculated by the Interactive BLEU score evaluator from Tilde Custom Machine Translation. The Tarzan column uses three symbols: the first indicates a translation from which the original sense cannot be extracted, the second indicates a translation where a speaker of the target language will understand some of the original sense, and the third indicates a translation where a speaker of the target language can fully understand the original sense.

Ideology and Computer Science²⁹

MT is in some ways made harder by the ideology of computer science (CS). Computer science sees translation as a computational problem. When the algorithms are perfected, the thinking goes, we will achieve the goal. We have done this before: we built ships to cross the oceans, airplanes to break the sound barrier, rockets to escape gravity; we harnessed electricity, invented the radio, created the Internet. With enough spirit and enough effort, we can achieve any goal we can identify.

For language, the evident goal is the ability to input any language and have human-quality output in any other. We have already witnessed impressive feats along that path, so a few more tweaks and we will most certainly arrive. We will, so it is thought, reach the day when any linguistic expression such as the Dutch sentences in Table 1 can be converted by machine to the caliber of the human translations in the second column. In CS ideology, the actual outputs that machines currently produce, like the GT translations in Table 1, are merely the messy residue of the construction phase, soon to be remembered along with CRT monitors and dial-up modems.

Success in MT is measured by the “BLEU” score (discussed in more depth in the qualitative analysis) – rating whether the machine got a lot of the same words as a human and whether they were arranged in a similar order – despite, as Sentence 2 in Table 1 shows, the entire semantic value of a translation being rendered unintelligible by an error as small as “of” instead of “from”, and despite nuances such as the difference between calling a country “Macedonia” or “North Macedonia” (see Picture 16) being the subject of a costly 28-year international dispute. Based on BLEU score, the big problem with Sentence 1 in Table 1 is GT’s semantically irrelevant word order of “Chief of Staff White House”, which would raise the BLEU score by 42.31 points to 66.05 if fixed. Changing “gets up” to “resigns” would only nudge BLEU up by 0.39 points, to 24.13, although that is the semantic crux of the sentence; shrinking the translation to two words, “Kelly resigns”, would carry all the information needed for a clear translation, but would sink the BLEU score to 4.38. That is, the difference between the failure or the success of a translation for a consumer is the difference between “gets up” and “resigns”, whereas to a computer scientist it is the difference between 4.38 and 66.05. This is not to disparage interesting work that might one day contribute to superior MT, but when results with BLEU between 9% and 17% are reported as “acceptable”, we must be clear that this refers to acceptability in pushing the envelope from an experimental standpoint, not from the standpoint of linguistic viability.
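To see why word order outweighs meaning in this kind of metric, it helps to look at how a BLEU-style score is assembled. The sketch below is a simplified, single-reference approximation written for illustration: it lowercases, splits on whitespace, takes the geometric mean of 1- to 4-gram precisions, and applies the standard brevity penalty. The `floor` parameter is my own crude assumption to keep a zero match count from zeroing out the whole score. It is not the Tilde evaluator, so its numbers will not match the scores in Table 1; only the relative ordering of the variants is of interest.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams of length n in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=4, floor=0.1):
    """Simplified sentence-level BLEU on a 0-100 scale (illustrative only)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # Clipped matches: a candidate n-gram counts at most as often as it occurs in the reference.
        matches = sum(min(count, ref_counts[gram]) for gram, count in cand_counts.items())
        total = max(sum(cand_counts.values()), 1)
        # Assumed smoothing: floor the match count so log() is always defined.
        log_precisions.append(math.log(max(matches, floor) / total))
    # Brevity penalty punishes candidates shorter than the reference.
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    return 100 * brevity * math.exp(sum(log_precisions) / max_n)


if __name__ == "__main__":
    reference = "White House Chief of Staff John Kelly steps down"
    variants = [
        "Chief of Staff White House John Kelly gets up",    # GT output
        "White House Chief of Staff John Kelly gets up",    # word order fixed
        "Chief of Staff White House John Kelly resigns",    # meaning fixed
        "Kelly resigns",                                     # minimal clear translation
    ]
    for text in variants:
        print(f"{simple_bleu(text, reference):6.2f}  {text}")
```

Even in this toy version, reordering “White House” recovers long runs of matching n-grams and lifts the score sharply, swapping in “resigns” barely moves it because the word does not appear in the reference, and the two-word version is crushed by the brevity penalty.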

People have natural intelligence that makes any human with normal brain functioning capable of learning, say, Bengali (the world’s seventh most-spoken language). No human who has not learned Bengali is capable of translating it, however. Similarly, in principle, any computer could perform in Bengali the full gamut of things we ask computers to do in English. In practice, though, without substantial Bengali data, even the most sophisticated supercomputer will never be able to make the language operational. Allusions to cognitive neuroscience – AI (Myth 1), deep learning, neural networks (Myth 2) – along with false claims about the success of “zero-shot translation” (Myth 3), produce the illusion that effective translation for any language is a few algorithmic tweaks away. The fact is that no adequate digitized data exists for more than a smattering of lucrative languages so MT is impossible in most circumstances. You cannot get blood from a stone, no matter how brilliant your code base. Though GT will undoubtedly eventually add some more languages to its roster (indigenous American languages being conspicuous in their absence), a new approach that builds upon natural intelligence is needed if the supermajority of languages are ever to participate in the information age.

In-depth research into the potentials for universal translation, of which the present study is a part, points to a conclusion that flies in the face of the prevailing computer science ideology. I find that computers are indeed a powerful tool that can perform much of the grunt work of translation, and do so for many more languages than FAAMG has any current plans to support. “Computer Assisted Translation” (CAT), wherein the machine proposes a sheaf of text such as Sentence 4 in Table 1, and the person separates the wheat from the chaff by post-editing, is already used seriously by professional translators. The disruptive approach I outline here, though, will strike many as heretical, or at the least fantastical. I propose that meaning can be ascribed in the source language – “goedheiligman” can be known to be the idea of Saint Nicholas – before a text ever reaches the eyes of a translator. This can be achieved for even the smallest languages, by collecting data from human participants through systems under development, and using text analysis tools already under development, at costs remarkably lower than what has been invested over decades in languages like English and Dutch. Reaching that level – where real translation among hundreds of languages occurs by using MT as a chisel in the sculptor’s hands, not the sculptor itself – can be realized, but is only possible with a paradigm shift that drops the pretense that true translation can ever be fully automated.

MT is perhaps better seen in the first order as a social science problem – how to collect the language data that machines can process in the first place, and how to present the collected data in a way that people can refine for final translations. More than two hundred million people store a full Bengali dataset and processing algorithms in their heads. For Bengali and thousands of other languages, computers can be used as the tool to collect the data from people, in a format that establishes parallel concepts across the board. Once people provide data, computers handily store and perform the mechanical tasks of regurgitating it. Machines can further be set to work analyzing the data to find patterns that might not be apparent to humans. This is where AI comes in: to flag cross-linguistic correspondences requires a person to be well-versed in two languages, but only requires a computer to have comparable data. It then remains for the computer to put its propositions before human eyes, and for the humans to render final judgement.

GT has achieved demonstrable success in reaching somewhere on the spectrum between silence and true translations for the languages it covers – a worthwhile goal, as long as it is understood that the output is a fictional set of sometimes-informed guesses. In theory, computation could attain competent Tarzan-level translations between any language pair that is supplied with a large stock of comparable data. The argument in this study is not that universal translation cannot be accomplished, but that the empirical results prove that GT has not accomplished it, and analysis of their methods proves that it cannot be accomplished with the methods that FAAMG are currently pursuing.

Picture 8: Credit

Picture 9: Credit

Picture 10: Credit

GT output is built on three pillars. The first pillar is known facts about language, including the internal structures of the source language and confirmed correspondences to the target language – hard wiring, as depicted in Picture 8. The second pillar can be represented by the rolling of multisided dice, as depicted in Picture 9, where the machine gambles from numerous possibilities based on some sort of prediction from nebulous data. The third pillar, depicted in Picture 10, is pure air, packed in the output box in order to fill space. Crucially, GT employs all three of these techniques in all of their languages, often within the space of a single sentence. The research presented in the empirical evaluation gives an indication of how the weight of each language is balanced atop the pillars of knowledge, prayer, and pixie dust.

Empirical Evaluation of Google Translate across 107 Languages

Martin Benjamin — Sat, 30 Mar 2019 16:52:17 +0000

Google Translate (GT) is the world’s biggest multilingual translation service, both in terms of number of languages and number of users. I conducted a scientific evaluation of translations produced by Google, for all 107 languages vs. English in their roster. Such a study, using native speakers for every language, has never been done before. Evaluators everywhere from Samoa to Uzbekistan were recruited to examine Google translations in their language. This chapter reports their results.

Picture 10.1: Full data for the empirical evaluation of 107 languages versus English in Google Translate can be inspected, and copied as open data, by clicking on the picture or visiting http://kamu.si/gt-scores

Description of Empirical Tests³⁰

A. English to Language X³¹

I asked native speakers of each language to evaluate a fixed set of 20 items translated from English to their language. Respondents were asked to rate each translation on a scale of A (good), B (understandable), or C (wrong). For example, an evaluator might give the equivalent of “not good see” a B score for “out of focus”. From these reports, I determined two ratings. The “Tarzan” score reflects the percentage of the time that GT gave a result that transmits roughly the original meaning, regardless of the quality. The “Bard”³² score is a measure of how often GT arrives at human-caliber translations, with A translations given double the weight of B. “Tarzan” indicates the base level functionality of GT between English and a given language – the ability to communicate “Me Tarzan, you Jane”³³ (MTyJ) in such a way that Jane and Tarzan each know what the other is talking about. “Bard” indicates the extent to which GT produces translations from English that sound natural to a language’s speakers. The “Tarzan” designation is not meant to be pejorative: enabling rudimentary communication between people who could not otherwise converse is an extraordinary accomplishment (an instance of effective Tarzan translation is shown in Sentence 5 of Table 1, and a photo of a real-life Tarzan translation situation in Picture 16.1). It is intended, rather, to indicate where the language’s technical integration sits between stone axes and iPhones.
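As a rough illustration of how the two ratings follow from a single evaluator's 20 letter grades, the sketch below applies the point values given in the captions of Tables 4 and 5 (Bard: A = 5, B = 2.5, C = 0; Tarzan: A = 5, B = 5, C = 0). The function name and the sample grades are hypothetical, invented for this example, and are not taken from the study's data.

```python
from typing import Iterable, Tuple

# Point values per test item, as given in the captions of Tables 4 and 5.
BARD_POINTS = {"A": 5.0, "B": 2.5, "C": 0.0}
TARZAN_POINTS = {"A": 5.0, "B": 5.0, "C": 0.0}

ITEMS_PER_LANGUAGE = 20  # the fixed set of test phrases


def language_scores(ratings: Iterable[str]) -> Tuple[float, float]:
    """Return (bard, tarzan) on a 0-100 scale for one evaluator's ratings."""
    grades = list(ratings)
    if len(grades) != ITEMS_PER_LANGUAGE:
        raise ValueError("expected exactly 20 ratings")
    bard = sum(BARD_POINTS[g] for g in grades)
    tarzan = sum(TARZAN_POINTS[g] for g in grades)
    return bard, tarzan


# Hypothetical evaluator: 6 good, 5 understandable, 9 wrong.
sample = ["A"] * 6 + ["B"] * 5 + ["C"] * 9
print(language_scores(sample))  # (42.5, 55.0)
```

With 20 items worth 5 points each, both scores top out at 100, so a Tarzan score can be read directly as the percentage of items that were at least understandable.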

The goal of the test was to discover whether GT delivers understandable equivalents of meaningful formulations that occur regularly within common English. Translating outward from English is the most consistent use-case for GT. In addition, Google has developed extremely strong processing tools for English that it can apply whenever English is the source language, such as the ability to identify parts of speech (e.g. distinguishing the verb from the noun in “I will pack a lunch in my pack”), named entities (e.g. treating “Out of Africa” as a film title that should be left intact), and inflected forms (e.g. marking “I thought” as first person past tense of “think”). I therefore tested the task on which GT is most likely to show its strongest performance: from the vocabulary and training materials in its central language, vis-à-vis the languages that they have directly matched.

B. Language X to English³⁴

I did not systematically test translations from other languages toward English, which would be a perniciously difficult study to implement. One could not ensure, for instance, that a random starting text for Basque was at the same complexity as a starting text for Maori. I have done one small experiment, however, that demonstrates how a true test from all languages to English could be accomplished.

The quote from Nelson Mandela that gave birth to the title of this study, “Lead from the back – and let others believe they are in front” – rendered by GT from its human Japanese translation back to English as “Teach you backwards – and make you believe that you are at the top are you guys” – is a typical instance of the sort of phrase that people regularly seek to translate using MT. The meaning of the complete phrase is unambiguous to any competent English speaker, but several of the words are ambiguous when taken individually. I sought human translations for the entire GT language set (though if this parenthetical comment is visible, at least one language at http://kamu.si/mandela-lead is still missing). This quote had already been translated into 14 GT languages plus Aymara³⁵ by volunteer translators at Global Voices. I elicited many additional translations by tapping various social and professional networks. Assuming 10 minutes per language – to find a contact, send a note explaining the task, and then copy the reply, run it through GT, copy the GT result to the data file and a separate file for BLEU testing, perform and record the BLEU test, credit and thank the translator – completing this one phrase in 107 languages is a minimum investment of over 1000 minutes – and more if the first contact does not respond, or a treasure hunt is needed to find a native speaker.³⁶

These human translations all express exactly the same idea. For the experiment, I fed each of these equivalent Language X phrases to GT, and share the output as open data at http://kamu.si/mandela-lead. You can see that the results vary, from perfect for Spanish and Afrikaans (using the word “behind” instead of “the back”, which is completely legitimate, though it would bring down the BLEU score) to collections of words that include some that appear in Mandela’s sentence, such as Malayalam (“Lead people back and forth – believe others are still ahead”), Yoruba (“Behind the back so that some people can not believe that they are in front”), and the Slovak clause “Drive away from the seclusion”. I ended up with five different German translations (including two that contacts found already existed on the web and two variations based on register from a single translator), none of which came back through GT with the full essence of Mandela’s sentiment.

No scientific inferences about the quality of translations from a given Language X to English are possible from this single data point for each language. However, the experiment does let a little light through the blinds.

Machine translations will almost always include some words in English that are the same as those a human would choose. For example, though “Hi, don’t let others believe that they are before” (Zulu) fails to transmit the meaning in any way, it does contain the elements “let others believe” and “they are”.³⁷ “Due to the buttocks, other mice should walk in the front” (Yiddish) contains “other” and “front”.
Translations can be correct for one clause and wrong for others. For example, “Walk back” (Macedonian) and “Head as if you were behind your back” (Lithuanian) both botch the first half, but nail the second.
There can be shades of grey about whether a translation conveys the original meaning. For example, “Rule from the back” (Polish) has more of an aspect of dictatorship than the benevolent guidance that Mandela had in mind, blurring but not erasing the overall intent of the quote.
As with English to other languages, translations can vary for unpredictable reasons. For example, the end of one Georgian translation switched between “to go” and “to be better off” depending on whether the phrase ended with a period or a semicolon. This makes overall quality assessment difficult, because you do not know if a translation might be more or less accurate based on untestable variables.
Different human versions within one language can meet differing success in English. For example, Korean had one translation that GT returned spot on, one that was close but no cigar, “Stand behind and lead – and believe that people lead themselves”, and one that was completely off the rails, “Head back, let others believe you’re leading him”. This points to why a large variety of such parallel translations from each language would be necessary for valid inferences.
How closely GT came to reproducing the original Mandela quote may be related to how closely the translator could hew to the original English structure versus cracking some eggs to make the omelet. For example, “Judge in such a way that people feel they are responsible for making things work” is wrong, but it is wrong in a way that could reveal how the translator tried to make the sentiment come out right in Shona.
High-investment languages do not necessarily have higher results. For example, “Head from behind” (French), “Run from behind” (German), and “Straight from the rear” (Italian) all fail to convey the original intent.
Many human translators take their tasks quite seriously, fretting over minor nuances. Several submitted more than one translation of the same phrase. GT does offer a second translation possibility for power users who know where to click (do you?), but both translations are subject to the same computational limitations. The options for Somali, for example, are either “Follow the back of the lead and let others think they are already” or “Taking the lead from the back and let other people do not think of themselves that they have levels”.
As seen in translations from English to Language X and from Language A to Language B elsewhere in this study (e.g. the phrase “I will be unavailable tomorrow” between English and 44 languages discussed in Myth 5 of the qualitative analysis, and the German translation of a French news article discussed in the analysis of Google’s pivot process), Language X to English translations have a non-trivial risk of forming human-sounding phrases that flip the meaning 180°. For example, “Convince others that you are ahead of them and manage them” (Azerbaijani).

A thorough study of all GT languages toward English would repeat this exercise of getting human translations for a substantial number of test sentences and reverse-translating them to English with GT, and would then use multiple native English-speaking human evaluators to rate how close each reverse translation came to capturing the original English meaning. The methodology is not complex, but it would take tremendous time and effort, and was well outside the scope of the present unfunded research.

C. Language A to Language B³⁸

Picture 10.2: Predictive scores of translation among all language pairs within GT except pairs with KOTTU, based on Tarzan ratings to the English pivot. Full data is at http://kamu.si/gt_scores_cross_language_extrapolation

Picture 11: Almost all Language A to Language B translations in GT are two stage conversions through an English pivot.

I did not evaluate translations directly among languages other than English, because in almost all cases³⁹ Google makes a two-stage conversion from Language A to English and then English to Language B. A technical lead at GT minimizes the centrality of the English pathway, saying, “Sometimes we bridge through other languages and lose a little bit of information along the way” (1:21:25), then inexplicably denies that English as the pivot is the predominant route; my research finds this denial to be untrue.

The English bridge becomes evident when testing translations even in cases where the linguistic, research, and market environments would suggest a direct path, such as Corsican to French or Italian (see Picture 12 for an instance where GT revealed the pivot language). Language A to Language B translations degrade to roughly⁴⁰ the product of the two languages’ scores vis-à-vis English. For example, the Tarzan ratings are 55 for Serbian and 50 for Russian, so Serbian to Russian could be estimated at 50% of 55, for a Tarzan score of 27.5. However, I cannot be sure that the score I obtained from English to Language A is the same as what would be obtained from Language A to English. In the present example, I assume that the Tarzan score for Serbian to English is 55, on the basis of the score from English to Serbian, but I do not have the hard evidence to make declarative statements based on this claim. We know a priori that two-stage translations inherently compound the errors each language experiences in relation to English, but I cannot give figures that have a confirmed scientific basis. Calculating the score of Language B as a percentage of the score of Language A produces an indicator of confidence between the two languages, but not a final measure.

By presenting their translation proposals from Language A to Language B without caveats, Google gives an implicit confidence measure of 100%. Google has never published quantitative analysis of the vast majority of their translation pairs in any scientific format. Their unqualified representation that they produce empirically valid translations is false between almost all non-English pairs.

The table in the tab “Language-to-Language Tarzan Extrapolations” in the TYB data release, http://kamu.si/gt_scores_cross_language_extrapolation, shows confidence estimates between each language pair, assuming English as a bridge; I repeat that these pairs have not been individually tested, so the numbers should be viewed only as indicative.
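The extrapolation behind these confidence estimates can be sketched in a few lines. The example below takes one language's Tarzan score as a percentage of the other's, as in the Serbian and Russian example above. The function name is hypothetical, and the four scores are the ones quoted in this chapter. The output is indicative only, since no pair was individually tested.

```python
from itertools import permutations


def pivot_estimate(tarzan_a: float, tarzan_b: float) -> float:
    """Estimate an A-to-B Tarzan score as tarzan_b percent of tarzan_a."""
    return tarzan_a * tarzan_b / 100.0


# Tarzan scores versus English that are quoted in this chapter.
tarzan = {"Serbian": 55.0, "Russian": 50.0, "Afrikaans": 87.5, "German": 82.5}

# One estimate per ordered pair, assuming every pair pivots through English.
estimates = {
    (a, b): pivot_estimate(tarzan[a], tarzan[b])
    for a, b in permutations(tarzan, 2)
}

print(estimates[("Serbian", "Russian")])  # 27.5
```

The estimate is symmetric by construction, which mirrors the untested assumption that a language's score toward English equals its score from English.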

Picture 12: GT from Italian to Corsican revealing the English bridge in most translations from Language A to Language B. Italian and Corsican are closely related languages.

Methodology⁴¹

I tested clusters of two or more words that often occur together, and that have a meaning that is generally discernible when they do;⁴² for example, tweets with “out cold” almost always imply unconsciousness.

I did not test single words, which can be highly ambiguous (for example, right = correct, legal entitlement, politically conservative, the not left direction, etc.), and therefore too arbitrary in isolation, although a major challenge for MT from English is that the top 100 words (as tallied in the Oxford English Corpus), which constitute about 50% of all written English text, have on average 15.45 senses in Wiktionary.⁴³ Although GT does not advertise itself as a dictionary, single-word lookups are a major proportion of real world uses of the service.

Nor did I test full sentences, which add a lot of complexity to scoring, since translations might include both correct and incorrect elements. For example, the sentence “’Out of Africa’ won the Academy Award for best picture in 1986” had mistakes in all ten of the languages I spot-tested it in (including high-performers like Afrikaans and German), but different languages made mistakes in different parts of the sentence, particularly how they rendered the title and whether “picture” was given as an illustration or a film. Moreover, humans can translate the same sentence many ways, making it impossible to determine a gold standard for full sentences, especially across dozens of languages – an example can be seen in the German translations submitted for the Mandela quotation that kicks off this web-book. It should be noted that GT alters its vocabulary choice on the fly, so the words chosen to translate a short phrase may not be those selected in a longer sentence, or even at a different moment; for example, “run of the mill” is represented in French by “courir du moulin” in isolation, and “course de l’usine” when translating a longer tweet (both wrongly referring to foot races), and other instances might produce other results.⁴⁴

Neither did I test full documents, where Läubli, Sennrich, and Volk found that raters show a markedly stronger preference for human translations than they do when evaluating single, isolated sentences; such a study would be extremely expensive across more than 100 languages.

All of the clusters contained the word “out”. This word is ubiquitous, ranking as the 43rd most common word in the Oxford English Corpus.⁴⁵ It is extremely ambiguous in isolation, with 38 senses in Wiktionary, but often occurs in clusters with unmistakable meanings. WordReference.com gives definitions and French equivalents for nearly 1700 composed expressions that include “out”, from “a fish out of water” to “zoom out”.⁴⁶ Additionally, a dataset with compositional information for 560 phrasal verbs ending in “out” is available as open data. While “out” is an outlier in terms of its scope, it is a known entity within English lexicography and NLP. I chose 20 formulations that are lexicalized in WordReference as composed forms such as “out of style”, or that, as queried on Twitter, usually reduce to defined meanings when matched with other particular words, such as “out of milk”.

All of these items have been translated in electronic bilingual dictionaries, and are thus similarly viable as units for machine translation. The expressions were not chosen to be especially simple or difficult, nor based on corpus frequency. Rather, they were chosen because they had clear meanings,⁴⁷ and are broadly representative of the types of phrase that ordinary users are likely to seek to translate – for example, an American expatriate in Budapest related that he tried several times to use the Hungarian conversion of “like a bat out of hell” until he learned that GT got the words all right and the meaning all wrong.

I explicitly did not test ambiguous expressions in GT, because these are already known weak points in MT. An ambiguous expression from English is likely to be translated correctly in one or more contexts, but not in all. For example, ascertaining the failure rate of the sentences in Table 2, expected to be 75% to 87.5% (assuming that GT has one or two translations of the candidate term in its repertoire that have to land somewhere), will only reveal the general truism that GT has difficulty with common English phrasal verbs, rather than showing how this difficulty manifests across languages. You are encouraged to test that premise on the following tweets with a language you know well:

1. I wonder the emotions of Joseph when he held up [= raised] the baby Jesus, knowing his image wouldn’t be reflected in the face of that child.⁴⁸

2. Got held up [= mugged] at gunpoint on Wednesday walking to work in broad daylight by 2 men. They took my phone, wallet, glasses, keys, headphones, everything.⁴⁹

3. Our journey home from a hilltop restaurant was held up [= stopped] for 10 minutes. Why? Ask the porcupine that decided to try & outrun the bus.⁵⁰

4. My great grandmother had some epic #style. These @ray_ban sunglasses have held up [= endured] for more than 50 years .⁵¹

5. The argument that belief in the internet is necessary for sending tweets has never held up [= survived] to scrutiny.⁵²

6. Rain, rain, rain. But spirits held up [= buttressed] by the constantly surprising friendliness of the locals.⁵³

7. “She’s being held up [= slowed] from behind by the real guilty party”.⁵⁴

8. “The UK is held up [= glorified] as a liberal democracy”⁵⁵

Table 2: “Held up” is an example of an expression that was considered inherently likely to fail in all languages, and therefore too ambiguous to include in the study for gaining useful comparative information across languages.
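The expected failure range of 75% to 87.5% quoted above follows from simple counting. Table 2 shows eight senses of “held up”, and the stated assumption is that GT holds one or two translations of the term. The sketch below adds one further assumption of its own: each stored translation fits exactly one of the eight senses. The function name is hypothetical.

```python
def expected_fail_rate(n_senses: int, n_covered: int) -> float:
    """Percentage of senses left without a fitting translation."""
    if not 0 <= n_covered <= n_senses:
        raise ValueError("n_covered must lie between 0 and n_senses")
    return 100.0 * (n_senses - n_covered) / n_senses


# Eight senses of "held up" in Table 2; GT assumed to cover one or two of them.
for covered in (1, 2):
    print(covered, expected_fail_rate(8, covered))
# 1 87.5
# 2 75.0
```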

The selection of phrases for this study (see Table 3) is therefore not rigidly scientific, and the reader can decide whether the items provide a fair test of MT capability.⁵⁶ The least-recognized phrase, “out cold”, had a 90% fail rate, with just 3 “A” ratings (Hausa, Hindi, and Malaysian) and a result that was understandable to some extent only 10 times, while the phrase “out of the office” produced 49 “A” ratings and an understandable result 83.6% of the time.

	Original English Phrase	English Paraphrase	A	AB	AC	B	BC	C	Bard	Tarzan	Fail
1	fly out of London	take an airplane from London	33	10	0	43	1	23	34.5%	79.1%	20.9%
2	like a bat out of hell	escaping as quickly as possible	2	1	0	17	6	84	2.3%	23.6%	76.4%
3	out cold	unconscious	3	0	2	6	0	99	3.6%	10.0%	90.0%
4	out of bounds	unacceptable	7	0	0	34	4	65	6.4%	40.9%	59.1%
5	out of breath	gasping for air (for example, after running)	35	2	1	27	5	40	33.2%	63.6%	36.4%
6	out of curiosity	because a person is casually interested in something	33	7	0	30	3	37	33.2%	66.4%	33.6%
7	out of focus	not clear to see (blurry)	21	6	0	24	3	56	21.8%	49.1%	50.9%
8	out of his mind	crazy	5	0	0	21	8	76	4.5%	30.9%	69.1%
9	out of milk	the supply of milk is finished	4	0	1	12	4	89	4.1%	19.1%	80.9%
10	out of order	does not function (broken)	32	3	0	18	1	54	30.9%	50.9%	49.1%
11	out of pocket	paid for something from personal money	18	0	0	33	5	54	16.4%	50.9%	49.1%
12	out of steam	no more energy (exhausted)	3	1	0	5	3	98	3.2%	10.9%	89.1%
13	out of style	unfashionable	21	4	0	43	2	40	20.9%	63.6%	36.4%
14	out of the closet	openly homosexual	9	2	0	13	3	83	9.1%	24.5%	75.5%
15	out of the game	no longer participating in a game	34	3	1	34	5	35	30.9%	68.2%	31.8%
16	out of the office	away from the office	49	5	1	35	2	18	47.3%	83.6%	16.4%
17	out of this world	excellent	3	2	0	8	5	92	3.6%	16.4%	83.6%
18	out of time	a deadline has passed	14	7	0	47	3	39	15.9%	64.5%	35.5%
19	out of wedlock	between partners who are not married	30	6	1	29	1	43	30.5%	60.9%	39.1%
20	out on the town	having a fun time going shopping or to bars/ restaurants (carousing)	4	0	1	15	4	86	4.1%	21.8%	78.2%
Table 3: Evaluation phrases and their scores. AB, AC, and BC signal inter-annotator disagreement. Updated with KOTTU, March 2020. Itemized scores and translations for all languages can be viewed at: http://kamu.si/gtscores_itemized_and_translations
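The percentage columns in Table 3 can be recomputed from the rating counts. The sketch below is an inference from the table's arithmetic, not a formula stated in the text. It assumes that the per-phrase Tarzan figure counts every rating other than C, that the per-phrase Bard figure counts an A as one and a split AB or AC rating as one half, and that Fail is the share of C ratings. These assumptions reproduce the printed values for the three rows used here. The class and function names are hypothetical.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class PhraseCounts:
    """Rating counts for one phrase, in the column order of Table 3."""
    a: int
    ab: int
    ac: int
    b: int
    bc: int
    c: int

    @property
    def total(self) -> int:
        return self.a + self.ab + self.ac + self.b + self.bc + self.c


def phrase_percentages(p: PhraseCounts) -> Tuple[float, float, float]:
    """Return (bard, tarzan, fail) as percentages rounded to one decimal."""
    # Inferred weighting: a split AB or AC rating counts as half an A for Bard.
    bard = 100.0 * (p.a + 0.5 * (p.ab + p.ac)) / p.total
    # Inferred rule: anything other than a plain C counts as understandable.
    tarzan = 100.0 * (p.total - p.c) / p.total
    fail = 100.0 * p.c / p.total
    return round(bard, 1), round(tarzan, 1), round(fail, 1)


rows = {
    "fly out of London": PhraseCounts(33, 10, 0, 43, 1, 23),
    "out cold": PhraseCounts(3, 0, 2, 6, 0, 99),
    "out of the office": PhraseCounts(49, 5, 1, 35, 2, 18),
}
for phrase, counts in rows.items():
    print(phrase, phrase_percentages(counts))
```

Under these assumptions, the per-phrase Bard column measures only how often a translation was rated A, which differs from the per-language Bard weighting in which a B rating earns half credit.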

Picture 12.1: Itemized translations and scores from English for all other 107 languages in GT. Full data is at http://kamu.si/gtscores_itemized_and_translations

Scores are not absolute, for two reasons. First, the choice of expressions was arbitrary. A different selection of English expressions would generate different numerical results within each language; for example, scores would probably fall were a larger number of idioms such as the confounding “like a bat out of hell” to be included. Relative results would likely remain essentially the same, however; a language with high scores for my test set should perform highly with other phrases, while a language with low scores herein would have similarly low results with other input. Second, most language scores show the subjective opinion of a single reviewer. One could well argue that more reviewers per language would produce more reliable data. Several languages had multiple reviewers, and inter-annotator disagreements were usually minor, with a handful of entries per language being judged good versus marginal, or marginal versus wrong. A single entry being ranked by different evaluators as good versus marginal does not change the Tarzan score, and changes the Bard score by 2.5 points, and a disagreement between marginal and wrong changes both Tarzan and Bard by 2.5. I averaged the scores where annotators disagreed, and kept the disagreements visible in the public data release. Based on the inter-annotator disagreement values I discovered, the reader is advised to place mental error bars of ± 10 around the score that is reported.

For Portuguese, I report European, American, and Cape Verdean scores as independent results, and for Spanish I report separate scores for Europe and Latin America. On the other hand, I considered the “traditional” and “simplified” versions of Chinese (Mandarin) as one language, because the only difference is the writing system (similar to the equivalence of Serbian whether written in Latin or Cyrillic script), as verified by my tests. For cross-language comparisons, I used the highest-scoring member of the set.

Evaluators⁵⁷

Picture 13: A typical MT task where readers cannot gauge accuracy if they are not already proficient in both the source and target languages would be to render this article from Armenian into another language. The non-Armenian reader has no way to judge whether the output conveys the original intent.

Translations were evaluated by native speakers of each language.⁵⁸ Evaluators had a range of backgrounds, such as nurse, high school student, diplomat, musician, computer scientist, and language professional. Many of the evaluators were identified through the PI’s prior social and professional networks. When those fell short, extensive use of LinkedIn revealed many second or third degree contacts whose profiles stated they had native proficiency, and who agreed to participate. When LinkedIn fell short, members of Facebook language interest groups whose pages had posts in the required language were contacted. In a few cases, none of these methods were fruitful, and university librarians or public service organizations were contacted to help locate a willing native volunteer.

Evaluating translation results presents a paradox: the person reading the translation should be able to understand the intent of the original expression without having any knowledge of the original words, yet their understanding cannot be confirmed without recourse to the original. Therefore, some knowledge of English was essential to the evaluation task. First, the evaluator needed to understand the instructions as written in English. Second, the only way to test whether the original meaning had been understood in the target language was to share that meaning in English, using the same wording for each evaluator. The English testing environment inevitably alters the response. In an ideal test, the respondents would have no knowledge of the original intent, since a major goal of MT is to render documents for people who do not have a human-level communication bridge, such as reading a news article in an unfamiliar language (see Picture 13).

I found that evaluators who spoke English at a very high level tended to give higher scores than people with lesser proficiency. For example, the Bard rating for Indonesian from a speaker working at a European office of Google (not for GT) was ten points higher than from an evaluator who has always lived in Indonesia. In a few cases, respondents tried to reverse-engineer the survey, replacing the provided paraphrases with the original phrase and rating whether the translation rendered the actual words – for example, did “out of steam” translate “steam”, instead of the question of whether it translated the idea of exhaustion.

In some cases, the English connection points toward interesting questions for future linguistics research. Several of the languages co-exist geographically in places where English dominates the communications landscape, including Afrikaans, Hawaiian, Irish, Maori, Samoan, Scots Gaelic, Welsh, Xhosa, and Zulu. English is often at least a co-equal language in the lives of those languages’ speakers. This milieu might predispose people to an English metaphorical framework that could shape their interpretation of translations of phrases such as “like a bat out of hell”.

The language that performed the best in this study is Afrikaans, which has close common ancestors with English, drifting from Dutch only within the past 300 years, during which it has interacted intensively with English in South Africa. The Afrikaner evaluators understood the figurative meanings of several phrases that were translated word for word, where such literal translations had no resonance for speakers of other languages. Conversely, some English metaphors have been adopted in other languages. For example, “the closet” as hidden sexual identity was understood in a few places where the local gay community imported that sense to their own word for closet, or, in Latin American Spanish, actually borrowed the English word. The present study opens more questions than it answers – for example, Xhosa and Zulu overlap in the same South African landscape with English and Afrikaans, but scored near the bottom, and this study is not equipped to resolve how much of this discrepancy is due to differential exposure to English by the evaluators, underlying similarity between Afrikaans and English that does not exist for the two indigenous African languages, or limitations to the translations proposed by GT for Xhosa and Zulu. It is clear, though, that the extent to which GT users already understand English has some effect on their ability to understand the translations between English and their language.

Empirical Results⁵⁹

This section reports the numerical findings of the study. Qualitative interpretation of the results is intentionally excluded from this report, and presented in separate parts of this article. The intent is transparency. The study was motivated by the need to locate underserved translation situations, to build a case for work that might improve them. However, mixing together research findings and broader editorial concerns would open up considerations of bias in the numbers themselves. I have gone to great lengths to ensure that the numbers will withstand scientific scrutiny, including forthright discussion of their limitations in the Methodology section above. Consider the results reported below to be similar to the findings of climate scientists who believe that their measurements of rising temperatures support particular interventions that they propose; this part is akin to a report that lays out the bare temperature news in 107 locations around the world,⁶⁰ while the other parts offer informed opinions about what the numbers mean⁶¹ and what should be done to change them.⁶²

My results seem to resemble those newly reported by Google researchers. They state that their top 25 languages had an average BLEU score from English of 29.34, their middle 52 languages averaged 17.50, and their bottom 25 averaged 11.72, using an open-domain dataset containing over 25 billion sentences imputed to be parallel. They do not provide disaggregated per-language scores. Exact comparison is not possible without access to their specific data, but my findings seem to paint in pointillistic detail an image similar to what they present in watercolor.

A. Elegance from English⁶³

Bard Score	# of languages	% of languages
50 or above	12	11.2%
25 to 47.5	55	51.4%
below 25	40	37.4%
Table 4: Bard Scores (A rating = 5, B = 2.5, C = 0)

Table 4 shows how closely evaluators deemed a translation to arrive at human quality. Twelve languages achieved a Bard score of 50 or above, including separate evaluations by different regional speakers of Portuguese and Spanish, in this order: Afrikaans (67.5); German and Portuguese-PT (60); Spanish-SP (57.5); Polish (56.25); Chinese, Croatian, and Spanish-LA (55); Dutch, Galician, Greek, Portuguese-BR and Portuguese-CV (52.5), and Italian and Latvian (50). Fifty-five languages were rated between 25 and 47.5. Forty languages received a Bard rating below 25.

Aggregated across languages, one phrase achieved a human level in the range of 50%, about one-third reached this level about one-third of the time, and about one-third of candidates resembled human translations less than one time in twenty, as shown in Figure 1.

Figure 1: Percentage of languages for which each candidate translation was judged to resemble a human translation. Data for all 107 languages is in Table 3. This figure shows the 102 pre-KOTTU languages.

B. Gist from English⁶⁴

Tarzan Score	# of languages	% of languages
75 or above	4	3.7%
50 to 72.5	39	36.4%
25 to 47.5	48	44.9%
below 25	16	15.0%
Table 5: Tarzan Scores (A rating = 5, B = 5, C = 0)

Table 5 shows the frequency with which evaluators deemed a translation to convey the general intent of the original English phrase.⁶⁵ Forty-three languages received a Tarzan score of 50 or above, with 12 languages achieving understandable results approximately 2/3 or more of the time: Afrikaans (87.5); German (82.5); Portuguese (BR and CV) (80); Spanish (LA and SP) (75); Polish (72.5); Danish and Greek (70); and Chinese, Croatian, Dutch, Finnish, Hungarian, and Portuguese-PT (65).
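The weights given in the captions of Tables 4 and 5 can be turned into a small scoring sketch. It assumes a single evaluator rating all 20 phrases with a letter A, B, or C. The function name and the sample ratings are hypothetical, and the sketch does not attempt to reproduce any averaging across several evaluators.

```python
# Weights per rating, taken from the captions of Table 4 (Bard) and Table 5 (Tarzan).
BARD_WEIGHTS = {"A": 5.0, "B": 2.5, "C": 0.0}
TARZAN_WEIGHTS = {"A": 5.0, "B": 5.0, "C": 0.0}


def score(ratings, weights):
    """Sum the weight of each rating; 20 phrases give a maximum of 100."""
    return sum(weights[r] for r in ratings)


# Hypothetical ratings for one language: 8 A, 6 B, 6 C (not data from the study).
sample_ratings = ["A"] * 8 + ["B"] * 6 + ["C"] * 6

bard = score(sample_ratings, BARD_WEIGHTS)      # 8*5 + 6*2.5
tarzan = score(sample_ratings, TARZAN_WEIGHTS)  # (8 + 6)*5
print(bard, tarzan)
```

Because a B counts fully toward Tarzan but only half toward Bard, a Bard score computed this way can never exceed the Tarzan score for the same ratings.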

One third of the languages failed at least two thirds of the time. Two thirds of the languages failed at least half the time.

Bengali, Haitian Creole, and Tajik failed 100% of the time, and these languages failed 80% or more: Cebuano, Georgian, Kinyarwanda, Kurdish, Latin, Malaysian, Maori, Nepali, Persian, Punjabi, Urdu, and Uzbek. Aggregated across languages as shown in Figure 2, two translation candidates were rated intelligible in about 80% of cases, and about one-third conveyed the meaning about two-thirds of the time. Three-fifths of candidate expressions had the gist rendered half the time or less, and more than a third were comprehensible in fewer than a quarter of the languages.

Figure 2: Percentage of languages for which each candidate translation was judged to convey at least the general sense of the expression. Green indicates greater than about two-thirds, yellow indicates between a half and two-thirds, and red shows scores in the range of half and below. Data for all 107 languages is in Table 3. This figure shows the 102 pre-KOTTU languages.

Tarzan-to-Tarzan Scores	# of Pairs	% of Total Pairs
50% or above	54	1.05%
25% to 49.9%	1236	24%
below 25%	3861	74.95%
10% or below	1474	28.6%
Table 6: Tarzan to Tarzan Cross-Scores (Language A x % Language B)

The video above shows native speakers trying to find the gist of Google translations of well-known English quotes, in four languages.

C. Non-English Pairs⁶⁶


Picture 14: Language A to Language B translations showing the single sense that GT has calculated through English

Table 6 shows an estimation of how likely results are to have some intelligibility from one language to another, when English is neither source nor target. Full itemized scores can be viewed at http://kamu.si/gt_scores_cross_language_extrapolation. Translation from any language in the GT system to any other is a fundamental claim that has never faced systematic evaluation. For Language A to Language B comparisons, I extrapolated round scores based on Tarzan results between each language and English. For a very few pairs that do not pivot through English (such as Catalan-Spanish, which I predict at 45), the TYB score will be categorically wrong. For almost all other pairs, the score is broadly indicative. Picture 14 shows actual GT results for a term that has 37 English senses in the Merriam-Webster dictionary (though not necessarily 37 unique translations, e.g. Dutch uses “droog” for dry wine, dry humor, and dry laundry). The service always funnels non-English translations through its estimate of the most likely English result for each polysemous term, with rampant errors introduced due to the inherent mathematics of polysemy. A more precise measure would account for the number of times that any given term has a translation other than the one selected in each language, with some accounting for the frequency of each sense.
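A minimal sketch of the extrapolation described above, assuming (from the caption of Table 6) that a pair's score is the product of the two languages' Tarzan scores against English, read as percentages. The function names are hypothetical, and the five languages are only a sample of the Tarzan scores reported in section B.

```python
from itertools import combinations

# Tarzan scores from English, as reported in the text for a few languages.
tarzan = {
    "Afrikaans": 87.5,
    "German": 82.5,
    "Spanish": 75.0,
    "Polish": 72.5,
    "Bengali": 0.0,
}


def cross_score(a, b):
    """Estimated A-to-B intelligibility when both legs pivot through English."""
    return tarzan[a] * tarzan[b] / 100.0


def pair_count(n):
    """Number of unordered language pairs among n languages."""
    return n * (n - 1) // 2


for a, b in combinations(tarzan, 2):
    print(f"{a}-{b}: {cross_score(a, b):.1f}")

print(pair_count(102), pair_count(107))
```

Under this assumption, a language that scores zero against English drags every one of its pairs to zero, and the pair counts for 102 and 107 languages come out as the 5151 and 5671 figures used in this section.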

Many terms like “out” and “run” are highly polysemous, so the expected percentage of correct second-generation translations through English would be a fraction of the first-generation Tarzan estimate from English. If an English term has 50 senses that have different translations in two languages being paired, all of equal likelihood (for example, https://en.wiktionary.org/wiki/run has 100 senses for “run”, such as standing for office or operating a machine; some of these senses might equate to the same word in a given target language), a good match could be expected 0.04% of the time. Meanwhile, a term such as “polysemous”, which has only one sense, should be translated correctly 100% of the time.
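The arithmetic behind the 0.04% figure can be sketched as follows, under the paragraph's simplifying assumptions: every sense is equally likely, each sense has a distinct translation in both languages, and the sense picked on each leg through English is independent of the other. The function name is hypothetical.

```python
def expected_match_rate(senses):
    """Chance that both legs through English land on the intended sense,
    assuming equally likely senses and independent legs."""
    per_leg = 1 / senses
    return per_leg * per_leg


print(f"{expected_match_rate(50):.2%}")  # 1/50 * 1/50 = 1/2500
print(f"{expected_match_rate(1):.0%}")   # a single-sense term always matches
```

Real terms will rarely meet these assumptions, since senses differ in frequency and several senses often share one translation, so the result is best read as an order-of-magnitude illustration.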

Picture 14.1: Translations of long texts in GT can flicker among words like the lights in this showerhead, as minor elements such as capitalization, punctuation, word order, or words unrelated to the context change. Photo by author.

In theory, longer translations with more context should increase the accuracy of the proposed translation equivalent; for example, “run a company” often returns the foot-racing sense of “run” on its own, but the management sense in a longer sentence. This is not necessarily the case in practice, however; for example, “delivery room” by itself was returned with the correct sense of birthing in several sample languages (though verified incorrectly for German as shown in Picture 46), but erroneously regarding packages or (inexplicably) calls in a longer sentence from a newspaper. The clause “while I swim my laps” might variously be translated to French by omitting the concept of laps entirely, using the correct swimming term “longueurs”, using the term for laps around a racetrack “tours”, or using the word “genoux” for “knees” (because French speakers put their children and their laptop computers on their knees, laps not being an anatomical concept for them), depending on whether the previous clause begins with “Please,” or “Please” (that is, comma or no comma) or various other changes that should have no effect on the clear association of “swim” and “laps”. Picture 14.1 is a visual metaphor for the way GT translations for longer texts can change arbitrarily due to factors that should be irrelevant to a machine’s ability to derive accuracy through context. Because there is no way to test for such factors as whether a comma changes a translation from swimming around a track to swimming your knees, trying to measure translations of long texts from Language A to Language B would also introduce a lot of noise, like trying to pin down the color on a spinning showerhead, that could not be filtered out.

I did not consider Bard scores, which will always be lower than Tarzan scores, because I have no information about the quality of translations to English that begin with other languages; sample tests with non-English bilinguals show that elegant translations do not emerge for language pairs with no direct model. I acknowledge that Tarzan-to-Tarzan calculations are an imperfect measurement. However, with 5671 total non-English language pairs within GT,⁶⁷ most of which have no living bilingual speakers, there is no conceivable method of independently testing each combination.

With the above caveats, Tarzan-to-Tarzan scores for the 5151 pre-KOTTU pairs indicate the following:

- 54 pairs, slightly more than 1% (pre-KOTTU; slightly less than 1% post-KOTTU), will produce translation results where a reader can get the gist of the original intent more than half the time. 26 of these are in pairs with Afrikaans, 12 more in pairs with German, 11 more in pairs with Portuguese, and the remaining 3 pairs among combinations with Spanish.
- 24.0% of pairs (1236) produce results where the gist can be understood between a quarter and a half of the time. This includes many major commercial pairs such as French-German and Spanish-Italian.
- 75.0% of pairs (3861) produce results where the gist can be understood less than a quarter of the time.
- Nearly 30% of pairs produce results where the gist can be understood 10% of the time or less.
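As a quick arithmetic check, the share of pairs in each band can be recomputed from the pair counts in Table 6. This is only a sketch: the counts are copied from the table, and the variable names are illustrative.

```python
# Pair counts per band, copied from Table 6 (pre-KOTTU pairs).
bands = {"50 or above": 54, "25 to 49.9": 1236, "below 25": 3861}
total_pairs = sum(bands.values())

shares = {band: 100 * n / total_pairs for band, n in bands.items()}
for band, share in shares.items():
    print(f"{band}: {share:.2f}%")

# The "10 or below" row is a subset of "below 25", so it is computed separately.
share_10_or_below = 100 * 1474 / total_pairs
print(f"10 or below: {share_10_or_below:.1f}%")
```

The three main bands sum to the 5151 pre-KOTTU pairs, which is why the last row is not added to the total.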

Picture 15: Arbitrarily selected German text from Wikipedia. https://de.wikipedia.org/wiki/Der_Spiegel

Subjective tests for some language pairs suggest that relative results largely adhere to the rankings, but non-English translations in certain situations might be significantly more understandable than the numbers indicate. For example, an arbitrary selection of text from the German Wikipedia (Picture 15) was rendered in French that was much more than 49.5% readable, and more understandable in Swahili than suggested by a Tarzan-to-Tarzan score of 21. More dramatically, translation of a news article from Haitian Creole,⁶⁸ a language that scored zero in my tests, would receive Tarzan scores above 50 in French, Spanish, and perhaps Swahili.⁶⁹

Several factors may affect translation outcomes. First, the type of text is important, since GT is trained with certain types of parallel text. Well-structured documents that are written in formal language for a general audience, such as Wikipedia or news articles, are generally translated better than other types of writing, such as correspondence (all along the gamut from tweets to business letters), literature that makes use of figurative daily language, or domain-specific texts from restaurant menus to academic articles that rely on specialized vocabularies. Second, long segments and full sentences often translate better than short fragments, whether because the translation engine has more context to make an informed calculation about which sense of a word is appropriate,⁷⁰ or because the reader has more context to overlook mistakes such as, in the Haitian case, a mistranslation of “last straw” as “final killers”. Third, the starting language is important; a language that has a high initial score to English (German ranks second overall) will retain more fidelity at the pivot point. In the case of the German -> English translation from the Wikipedia article in Picture 15, this full sentence, “In 1958, the debate on the emergency laws began in the mirror, from which later (1960, 1963, 1965) various draft laws of the Interior Minister Gerhard Schröder were” is clearly missing important elements,⁷¹, ⁷² but the reader’s eyes can pass over “in the mirror”, and the properly constituted parts give context for a B-level Tarzan understanding. The translations downstream to French and Swahili degrade further, but keep enough of the sense that the overall topic of the sentence can be gleaned. What the sentence in question illustrates is that:

Picture 16: Human and GT translations of Macedonian tweet. GT misses only one word, for a Bard rating of 95.

a. errors from Language A to English do get amplified in the subsequent step from English to Language B

b. the amount of amplification should be roughly proportional to the Tarzan-to-Tarzan ratings I report (that is, pairs with high scores versus English will do better than pairs with low scores)

c. my data is not extensive enough to pin a scaled numerical value to full-sentence translations

d. Language A to English scores ideally should be measured independently for all 107 languages, as Table 1 does for Dutch, rather than imputed from the English to Language A performance.⁷³ Picture 16 shows a near-perfect translation from Macedonian to English, as compared to a human,⁷⁴ whereas Picture 1 shows a total failure from Japanese,⁷⁵ and Macedonian also fared poorly on the same quotation from Mandela. If the score from Language A toward English is substantially different from its score away from English, then my cross-language calculation will be significantly in error.

Further testing would be necessary to make definitive statements about translations among any non-English pairs. It would be impossible to find native speakers from both directions to test all 5671 pairs. However, testing GT against human translations between each language and English, as performed in Picture 1 and Picture 16, could be done, at considerable time and expense. I propose the Tarzan-to-Tarzan scores as one metric for estimating the quality of results among languages in the GT system, but suggest that a more in-depth study would yield superior results.

Empirical Conclusions⁷⁶

This empirical evaluation has presented the numerical findings of research about GT for 107 languages. I tested how well 20 short English phrases transmitted their meaning in each language. The highest-scoring language, Afrikaans, preserved the original meaning in 16 of the 20 test phrases, and used the same words that a human translator would have chosen in as many as 11 cases.⁷⁷ On the other hand, Bengali (the world’s seventh most-spoken language), Haitian Creole, and Tajik did not convey the original meaning in any cases. For 71 languages (66.4%), the gist of the phrase was transmitted half the time or less.

All of the data from this research is openly available at http://kamu.si/gt-scores. This includes the 20 original English phrases, the 20 English paraphrases that were provided to evaluators, the 2,140 translations produced through GT, and the evaluator scores for each of those translations. The spreadsheet contains tables showing the scores for each language in terms of producing human quality translations, understandable results, or incomprehensible results. These tables are arranged alphabetically, by quality rank, by comprehensibility rank, by frequency of failure, and by number of speakers of each language. An additional table, on a separate tab, estimates interlanguage comprehensibility for all 5151 pre-KOTTU pairs.

In the empirical evaluation, the data was offered without further interpretive commentary. The next section is a qualitative analysis that delves into interpreting the results in the wider context of contemporary research regarding MT in general and GT in particular. The qualitative part contains opinions that may be considered as subjective. It is presented separately so the objective data above can stand on its own, as the first study to compare Google Translate across its entire range of languages.

Picture 16.1: At a fountain in Evian-les-Bains in France, a sign instructs tourists to come near in order to activate the water flow. A screenshot of the Tarzan-level GT translation from French to English from Google via Instant Camera is shown on the right. Reacting to this montage at 9.5 years old, my daughter, native in both languages, sums up what it means to “get the gist” in a nutshell: “It has the same general sense but it does not grammatically or orally mean anything. But it’s not too horrible.” A large proportion of visitors to Source Evian are non-English speakers, so will be attempting translations of the French (unbeknownst to them via English) to languages such as Japanese and Hindi.

The post Empirical Evaluation of Google Translate across 107 Languages appeared first on Teach You Backwards.

Qualitative Analysis of Google Translate across 108 Languages

Martin Benjamin — Mon, 01 Apr 2019 12:36:27 +0000

Picture 17: A product that fails regularly in its basic operation. (Photo by author)

The notion that Google has mastered universal translation is built on several prominent myths. These myths feed stories that you may have seen in the press, or even heard from Google itself: earbuds that can translate 40 languages in nearly real time, underlying artificial intelligence that can teach itself any language on earth, neural networks that translate words with better-than-human accuracy, executives saying their system is “near-perfect for certain language pairs”, shining improvements from user contributions. This chapter inspects the myths, looking at what has been said versus what my research shows Google Translate (GT) has accomplished. The next chapter examines the finite limits to what GT can ever hope to accomplish using its current methods.

What is the context in which people use Google Translate?
What does Google Translate do? Scientific measurements of GT across all its 108 languages.
Why doesn’t Google Translate do much of what it says it does? (You are here)
Why can’t Google Translate accomplish what it says it does?
How could more effective translation be accomplished?
So what? What is wrong with Google Translate not doing what it claims?
Google Translate sometimes gets it right. How should it be used as a helpful tool?

Picture 17.1: Google Maps for rural Tanzania, with about as much information, I presume, as Stanley had on his way through to find Livingstone in 1871. Detailed 1:50,000 topographic maps, produced since 1967 by the Tanzania Ministry of Lands, can easily be found and digitized from places such as Yale’s Sterling Memorial Library Map Collection.

I should preface this discussion with a note about Google as a corporation. People often view corporations in binary terms, as forces for good or forces for evil. I view Google as a disparate group of nearly 100,000 people, each doing their best on their particular project. Images, Maps, Search, self-driving cars, Android, Translate – these are all different things, all earnest efforts, and none directed by a cackling overlord in a mountain lair. Google has become huge and powerful not because it has the best solution for every market, but because it has a brilliant business model that pumps in cash that it can use to probe ever more ventures.

The corporation itself, founded in September 1998, turned 21 (the age when one is legally adult enough to drink alcohol in the US) just five days before the release of this web-book. It is not omniscient. It has failed in many efforts, such as social media, and been half-hearted in others, such as the perpetually limited functionality of Google Docs versus more robust office software. Some of its practices have seriously broken laws in dozens of countries, with billions of dollars in penalties for doing things wrong. Each thing that the corporation produces deserves to be judged on its own merits.

Google Maps for Paris is amazing. Google Maps for the area where I’ve lived in Tanzania is pathetic, as shown in Picture 17.1. While in no way reflecting badly on the Maps engineer diligently writing code in Zurich, Maps executives long ago made a decision that African villages do not have big enough markets to justify the product at the same quality they produce for Europe (87 of the world’s ~200 nations have been “largely mapped” to the Street View level); they will lend camera equipment to volunteer visual cartographers and compensate them by offering, says a product manager, “a platform to host gigabytes and terabytes of imagery and publish it to the entire world, absolutely for free,” but, as a geography professor states, “Google Maps is not a public service. Google Maps is a product from a company, and things are included and excluded based on the company’s needs”.

When the company produces something great, they deserve plaudits. When they produce something wretched, that should be known. Nobody anointed Google as master of 108 languages – the company hired some engineers and some people with linguistic knowledge, assembled some data, and presented a product to the world. The point of this study is to figure out what is good in GT and what is wretched, and what can be done by Google or others to overcome the problems this knowledge exposes. In the interests of transparency, you should know that I have consulted with Google regarding African languages, I use a Google Pixel telephone, produced this web-book on their Chrome browser, communicated with the hundreds of people involved in this project using Gmail, distribute the empirical data from this study using Google Docs, and in many other ways benefit from and appreciate products they create. That does not mean, however, that I should hold back from investigating whether their offerings for languages from Afrikaans to Zulu live up to their promises, or from praising their efforts for the former (12.5% failure rate) while decrying as fraudulent (70% failure rate) their efforts for the latter. For people to know what they are getting when they use Google Translate, it is essential to examine where they produce results like Google Maps produces for Paris, where they merely pin a red dot on a grey background, and where they break laws in the science of translation. This study is meant to judge the quality of the translation service Google claims to provide, putting to the test their new corporate slogan, “Do the right thing”. It takes no position on whether the corporation lives up to its old motto, “Don’t be evil”.

The Dangers of Bad Translation⁷⁸

Before looking at the results of the study, we need to ask why translation matters. Here I might be speaking especially to Americans, who are prone to seeing translation as an afterthought – if you just speak English louder, foreigners will understand what you are saying, and doesn’t everyone in the world know some English these days? While it might sound like a caricature, indifference to other languages pervades police departments, hospitals, schools, philanthropies, and most other walks of American life, to the point that hostility to anything but English is a common feature of the political landscape.

Elsewhere, Europe suffers from Split Personality Disorder, with Dr. Jekyll investing billions of euros for translation among languages on the continent as vital to peace and trade, but Mr. Hyde assuming that the rest of the world should fall into line behind English, French, and Portuguese. Asian countries tend to see translation with English as important for international engagement, but translation with other languages as beyond the pale of where to invest time and money – for example, the papers at the annual Asian Association for Lexicography conferences are usually about monolingual dictionaries or bilingual dictionaries vis-a-vis English. African countries, conversely, see multilingual translation as important, not just with the colonial souvenir languages but among languages on the continent, but have very few resources to put toward the cause and are resigned toward making do with the technologies that external developers such as Google see fit to produce. (I say all this based on years of engagement on all four continents mentioned. Please read Benjamin and watch Part 1 and Part 2 of an academic workshop presentation.)

Despite these attitudes, though, migration and globalization result in millions of encounters every day where people face unpredictable translation needs – Japanese businesses with Vietnamese suppliers, refugees, workers on China’s Belt and Road Initiative, Africans seeking opportunities to move, study, or trade, Estonian tourists in the Peruvian highlands. Bad translations for tourists might be inconsequential, until they have a medical crisis. Bad translations for secondary students in Africa often mean failure for the international exams that they must take to qualify for university.

I discuss more real-world situations briefly in the embedded TEDx talk “Silenced by Language”, and expound in more detail in the bulleted paragraphs that follow, the point being that translation is not esoteric entertainment, but a vital component of communications in the modern world. It is important that the words that people believe are translations actually render the intent of what is being translated.

Picture 18: A product that achieves between 1/12 and 1/3 of its 72 hour truth claim

You expect that a product will substantially live up to the claims on its label. A badminton racket should not allow a birdie to pass through its holes. A deodorant that you apply in the morning should keep you from smelling like a goat by the time you come home. You probably do not expect a “72 hour” deodorant to actually last three days, but you would consider six hours to be a failure and 24 hours a minimum standard for usability. Unfortunately, the product shown in Picture 18 lasts not even 6 hours on a hot summer day that includes a short bike ride, and barely a full day in autumn chill without any notable sweat-making event; I, for one, will not buy it in the future. Nor will I buy sports equipment again from the superstore that sold the racket shown in Picture 17, which frequently traps the birdie between its strings instead of rebounding it into the air. The extent to which GT lives up to its label as a translator, versus the measurable rate at which it leaves you bleating like a goat, is important, because translations often have consequences. Some real examples:

- When Facebook mistranslated “يصبحهم”, the caption on a selfie a Palestinian man posted leaning against a bulldozer, as “attack them” in Hebrew, or “hurt them” in English, instead of “Good Morning”, the man was arrested and interrogated for several hours. (I did not test translations of Arabic or any other language on Facebook, but my personal experience trying to read Arabic posts on the FB walls of friends from the region has always maxed out at Tarzan, transmitting enough of the gist for an English speaker to get the general idea of the original post.)
- When a mother and her son seeking asylum turned themselves in to the US Border Patrol, declaring that they had a fear of returning to their home country of Brazil, an agent used Google Translate to explain, as the little boy erupted in tears, that because she had not presented herself at an official port of entry, she had entered Trump-era America illegally, so she would go to jail, and he would go to a shelter. (Portuguese ranked in 3rd place in my tests from English at the Tarzan level.) It is not clear what either the mother or son understood before the boy watched his mother being handcuffed and he was abducted by the Trump administration, not to see her again for a month. Because many of the words that are generated by GT in Portuguese are, as found by TYB, certifiably fake data, it is almost certain that some or all of the verbiage from the uniformed kidnappers was incomprehensible, or reversed the speaker’s intent. A human translator would probably be the only legal way to make sure that the family’s rights were protected, under either US asylum law or the 1951 UN Refugee Convention. In 2019, Google declined to respond to a ProPublica investigation of the use of its services in immigration cases. When Reuters asked a Google spokesperson in 2023 about concerns over the use of machine translation in asylum cases, the spokesperson, rather than acknowledging evidence that the company’s services were causing debilitating miscommunications in official proceedings, said that the tool underwent strict quality controls and pointed out that it was offered free of charge (as though that were a justification – we didn’t charge for the spoiled food, so we are not responsible for the food poisoning). “We rigorously train and test our systems to ensure each of the 133 languages we support meets a high standard for translation quality,” the spokesperson said.
- Developers put enormous time and effort into preparing their Android apps to meet the requirements to appear in the Google Play store, including writing careful descriptions of their products. Google then automatically translates the descriptive content for each local market. When, after months of work, the Kamusi Here app (http://kamu.si/here) was uploaded to Google Play, we soon received mail from Romania about the incoherent product description – especially embarrassing for a product that was promising precision translation from Romanian (tied for 10th place) to dozens of other languages.
- Picture 19 shows an example of GT used in a Swedish business context. While the consequences of “an endless cat and rat anal” sound more painful than they probably are, many corporations utilize GT instead of paying professional translators, with serious ramifications for incomprehensible or patently incorrect text on the target side. An international conglomerate that owns media companies with archives of several million articles in French (French and Swedish tied for 9th) recently inquired about using GT to translate that content for its other markets, because the CEO believed Google’s assertions about the service’s capabilities; such an investment would necessarily result in a product that would be widely ridiculed and rarely purchased.

Picture 19: GT used from Swedish to English on the site of Sweden’s largest public service television company: https://www.svt.se/lange-leve-demokratin/

- The first Spanish-language version of the Obamacare website was apparently created via MT, leaving millions of Spanish-speaking Americans mystified when they attempted to sign up for government-mandated health care. (I cannot verify whether GT was the service used. The Spanish-language site has presumably been professionally translated by actual people since the initial fiasco. Spanish remains the only language for which healthcare.gov has been localized. Spanish ranked among the top 4 languages for both Bard and Tarzan scores in my tests.)
- Reports are legion of GT being used in patient/doctor interactions, with medical staff unaware that the service is not well trained in the health domain. Poor medical translations lead to overlooked symptoms, incorrect diagnoses, medications, and procedures, increased suffering, and even death. As a small example, the question in an emergency room, “How many shots did you take?” would lead to very different interventions depending on whether “shots” was automatically translated in the sense of photographs, drug injections, alcohol, or bullets. A pregnant woman looking for the delivery room at a hospital in Germany, relying on GT, would be directed to the mailroom instead (see Picture 46). Immigrants and travelers around the world interact with health services where they do not speak the language. Khoong, Steinbrook and Fernandez found that 8% of discharge instructions translated from English to Chinese and 2% from English to Spanish at a hospital in San Francisco had the potential for “significant harm” – Chinese being among the top 7 languages in the Bard rankings and among the top 12 for Tarzan scores. When a crisis arose after 72 hours of labor during the birth of my own daughter in a hospital in Switzerland, we were fortunate that the head doctor directed a member of the medical team with an Irish grandfather to switch to English, as neither our level of French at the time nor GT had the capacity to maneuver cogently through the situation. A study in the British Medical Journal that was skewed in favor of languages that performed best in our tests (European languages with the most data and financial resources) found:

Ten medical phrases were evaluated in 26 languages (8 Western European, 5 Eastern European, 11 Asian, and 2 African), giving 260 translated phrases. Of the total translations, 150 (57.7%) were correct while 110 (42.3%) were wrong. African languages scored lowest (45% correct), followed by Asian languages (46%), Eastern European next with 62%, and Western European languages were most accurate at 74%. The medical phrase that was best translated across all languages was “Your husband has the opportunity to donate his organs” (88.5%), while “Your child has been fitting” was translated accurately in only 7.7%. Swahili scored lowest with only 10% correct, while Portuguese scored highest at 90%. There were some serious errors. For instance, “Your child is fitting” translated in Swahili to “Your child is dead.” In Polish “Your husband has the opportunity to donate his organs” translated to “Your husband can donate his tools.” In Marathi “Your husband had a cardiac arrest” translated to “Your husband had an imprisonment of heart.” “Your wife needs to be ventilated” in Bengali translated to “Your wife wind movement needed.”

Picture 20: A clue in The Atlantic Crossword that came from GT, not actual French (16 November, 2018)

- The Fourth Amendment of the US Constitution prohibits searches and seizures without a warrant. A person who agrees to a warrantless search must provide consent knowingly, freely, and voluntarily. After a police officer used GT on a traffic stop to request to search a car, resulting in the discovery of several kilos of drugs, a Kansas court found that GT’s “literal but nonsensical” interpretations changed the meaning of the officer’s questions, and therefore the defendant’s consent to the search couldn’t really have been knowing. Nevertheless, the top judge of England and Wales has stated, “I have little doubt that within a few years high quality simultaneous translation will be available and see the end of interpreters” (Welsh ranked 8th). If courts around the world come to accept that MT provides adequate legal understanding, miscarriages of justice will inevitably ensue.
- Nearly 90% of students learning languages in one US university report that they commonly use online translators as an aid in their studies. When used cleverly to home in on potential vocabulary and sentence structure, GT and its cohort can help with language learning. However, many students use GT the way they use a calculator, to perform rote computations and accept the output uncritically. 89% of 900 students surveyed at another US university reported using GT as a dictionary, to look up individual words and party terms. GT is not a dictionary. Its results for single words are statistical best guesses with a high chance of failure, and its results for party terms are generally catastrophic. A calculator will always return correct results: 2+2=4. This is patently not the case for GT, where 2+2=i in the proportions I measured for each language. The consequence is that students learn and hand in assignments using a simulacrum of their foreign language rather than the language itself.

Picture 21: False translation from GT that was used to fact check GT translation.

- GT is blithely used as a go-to source for translations that end up in publications or other professional contexts. An example is shown in Picture 20, where the crossword editor of The Atlantic, a prestigious US publication, ran with GT’s rendition of “harvest moon” in French as “lune de récolte”, instead of “lune des moissons” as expressed by actual French speakers.⁷⁹ It is true that lune=moon and récolte=harvest, but putting those two together as a translation of the English party term is a fake fact, an invention by a machine. GT is the only place on the Internet that makes the pairing posed by The Atlantic, but their puzzle creator did not know that because, unaware that the MUSA (Make Up Stuff Algorithm) is designed to fabricate a cosmetic response when the data falls short, he trusted GT when he made his clue. Then (as kindly confirmed by their fact checking department), the fact was checked using GT, and of course this alternate reality creation confirmed itself as French. Especially for party terms, proper translation often is not possible to obtain using standard search techniques. For example, Linguee does not have a large repertoire of English source documents containing “harvest moon” that have been translated to French, even proposing texts involving UN Secretary General Ban Ki-moon for consideration. “Harvest moon” is a subsection of “full moon” in the English Wikipedia, from where the interwiki link to French reaches the general concept of “pleine lune” – there is no pathway for a computer to use Wikipedia to infer a bridge between the concept as expressed in English and the way it is expressed in any other language. Nor does the term appear in a battery of print or online dictionaries.
Companies frequently consider GT outputs to be facts because they come, as in Picture 21, with an almost trillion dollar imprimatur that has surface visual resemblances to the outputs of deep lexicographical research at Oxford or Larousse, rather than, as in Picture 22, verifying translations with actual speakers. Human translations can be time-consuming and costly, but the consequences of machine mistranslations, in contexts more significant than the daily crossword, can be more so.

Picture 22: Fact checking translations by consulting native speakers.

“Latino outreach or Google Translate? 2020 Dems bungle Spanish websites”. The headline says a lot, but it is worth giving the full Politico article a read. Several contenders to be the Democratic presidential nominee in 2020 were found to have used GT for the Spanish-language versions of their websites. In the three cases where Rodriguez shows side-by-side translations from GT and the final web version, only Cory Booker followed the recommended procedure of using the MT service as a convenient starting point from which to embark on radical human post-editing with a knowledgeable speaker (if Booker’s team used GT at all, because the similarities are words a human translator would also likely have chosen). Julián Castro at most changed four words in the passage Politico analyzed (or GT was in a different mood when his translation was run versus when Rodriguez tried it, as often happens), and Amy Klobuchar did a pure copy-and-paste. Castro got lucky with a high-Bard output, while Klobuchar’s text literally stranded her in the middle of the Mississippi River. What is noteworthy is that these are serious political campaigns seeking to woo the votes of a substantial part of their electorate, and raising millions of dollars to spend on persuasion. They, or their staff, went to GT with the belief that the output was human caliber, and that finding a qualified volunteer or investing a small amount for a professional translator was an unnecessary step. Rodriguez’s reporting prompted most of the campaigns to abandon GT in favor of humans going forward, but botched outreach courtesy of GT will inevitably come back around for future candidates.
And then there’s this:⁸⁰

Why is #googletranslate doing me like this.
This is not what I said .. pic.twitter.com/OXjF3vwGHS

— 沃爾特 (@a1kindoe) September 20, 2019
And this (Maltese tied for 9th), which Air Malta quickly fixed once they learned they were selling seats to non-existent cities: https://lovinmalta.com/humour/now-boarding-to-sabih-air-maltas-translate-mishap-creates-a-bunch-of-new-european-cities/
And this (Galician tied for 9th): https://www.thelocal.es/20151102/galicia-celebrates-its-annual-clitoris-festival-thanks-to-google-translate
And the video below. Airline safety instructions may be tedious, but they are important. If you ever fly Kenya Airways and see a placard regarding “Life Vest Under Center Armrest” in Swahili, please feel secure that considerable effort went into ensuring that every piece of signage communicated the exact intent.

The examples above show that translation is often a deadly serious activity with substantial consequences in the real world. Many of the puzzles in MT do not harm the lives of computer scientists or journalists if the solutions are imperfect, however, so the experiments that raise the bar are venerated as research success even though the outcome fails its consumers at the rates I have measured. An error that reverses the meaning of a randomly chosen text might be just one word in an otherwise Bard-level translation, and affects GT engineers as little as the small software error in the stabilization system of the 737 MAX affected the Boeing executives who swore that travelers had nothing to fear after the first crash of their jet, in Indonesia. It is entirely conceivable that an airplane could crash someday because a maintenance worker followed an instruction that was erroneously translated by GT, similar to a 1983 near disaster when ground crew read the number in their fueling instructions as lbs instead of kgs during Air Canada’s transition to the metric system.

The public has a confidence in MT, particularly in GT, that has not been earned. This confidence is built on several foundational myths:

Myth 1: Artificial Intelligence Solves Machine Translation⁸¹

AI is the big buzz in computer science these days, eclipsing the enthusiasm for Big Data earlier in the decade. Some recent headlines about AI and machine learning exhibit the current hype:

Students looking for internships regularly approach Kamusi Labs in hope of a project that combines AI and NLP. I must always point them in other directions because, at this moment and for the foreseeable future, AI is impossible for most MT and other NLP tasks. I hear your shrieks: many interesting advances have already been made! Yes, I reply, if you constrain your scope to English and a few other investment-grade languages – and even then, impressive linguistic feats cannot crash the barrier of meaning. For most languages, and even for a lot of activity relating to English nuance, there is no scope for AI because there is not enough data on which a machine can act – regardless of headlines such as this from MIT Technology Review: Artificial Intelligence Can Translate Languages Without a Dictionary.

Additionally, the phrase “machine learning” as it is usually applied to MT is a misnomer. In most contexts, “learning” implies that knowledge has been ascertained, whereas in MT the data has merely been inferred. No teacher is standing by to either guide or correct the inferences. Consider: my nine-year-old does not have English instruction at her Swiss public school, so she learns the written language by reading books to me at home every evening. The only way for this to work, though, is for us to sit on the couch together so that I can help her through tough words like “though” and “through” and “tough”.⁸² Without a teacher at her side, she would make guesses at such words that would often be wrong, but that she would lock into her head as fact – she’ll have “learned” the wrong thing. I can think of many things that I “learned” wrongly by inference – thank you to my grandmother, an English teacher, for her yeoman’s work catching my writing errors all the way through graduate school, such as using “reticence” when I meant “reluctance”. We can feel empathetic embarrassment for the narrator on This American Life⁸³ for his assumption that “banal” does not rhyme with the word it most resembles, keeping in mind the aphorism that “assume” makes an ASS out of U and ME; we can hope he learned from someone gently correcting him after his broadcast, but also fear that some listeners learned the mispronunciation from hearing his bungled English.

Yet, when AI computes associations between languages that are presented as data without any human verification, data science writers tout it as gained knowledge, in confidently written reviews such as “Machine Learning Translation and the Google Translate Algorithm”. Consider: in the TYB test, GT translated “out on the town” into Dhivehi as “ރަށުގެ މައްޗަށް ނުކުމެއްޖެއެވެ”. Our evaluator gave that a C rating, completely wrong, and translated it back to English as the word-slaw “came out top of the island”. GT’s reverse translation back to English, though, is “out on the town”, which has nothing to do with what a Dhivehi speaker sees or infers. Obviously, GT taught itself from its own faulty translation (perhaps when I ran the phrase to prepare the evaluation document) – the computation decided “out on the town” is ރަށުގެ މައްޗަށް ނުކުމެއްޖެއެވެ, so therefore ރަށުގެ މައްޗަށް ނުކުމެއްޖެއެވެ is “out on the town”, q.e.d., machine learning in action. Unless you are one of 340,000 Maldivians, you don’t speak Dhivehi, while GT has “learned” the translation. Who are you to question Google?
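The circularity of that round trip can be illustrated with a toy sketch. It assumes a hypothetical engine that stores whatever pairing it has computed and reuses it in both directions; the `ToyEngine` class, its methods, and the placeholder target string are invented for illustration and make no claim about GT's actual internals.

```python
class ToyEngine:
    """Hypothetical engine that memorizes its own output (illustration only)."""

    def __init__(self):
        self.forward = {}   # source phrase -> target phrase
        self.backward = {}  # target phrase -> source phrase

    def translate(self, phrase, guess):
        # Store the machine's guess; no human checks whether it is right.
        self.forward[phrase] = guess
        self.backward[guess] = phrase
        return guess

    def back_translate(self, phrase):
        # Look the target phrase up in the engine's own stored pairings.
        return self.backward.get(phrase, "<unknown>")


engine = ToyEngine()

# Placeholder standing in for the faulty target-language output.
wrong_guess = engine.translate("out on the town", "TARGET-STRING-A-SPEAKER-READS-DIFFERENTLY")

round_trip = engine.back_translate(wrong_guess)
human_reading = "came out top of the island"  # what the evaluator reported

print(round_trip)                   # what the machine's round trip reports
print(round_trip == human_reading)  # whether that matches the speaker's reading
```

In this sketch the round trip can only confirm what the engine already stored, so a matching back-translation says nothing about what a speaker of the target language would actually understand.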

As you continue reading, please keep in mind any indigenous African language of your choosing. Africa is home to roughly 2000 languages, spoken by approximately 1.3 billion people.⁸⁴ About 5% of the continent is literate in one of the colonial souvenir languages (English, French, or Portuguese), with no indication of that number increasing significantly for new generations. A few languages have contemporary or historical written sources that could become viable monolingual digital corpora at the service of NLP (limited corpora exist for Swahili⁸⁵ and a few other languages in forms that could be exploited through payment or research agreements), but none have substantial digitized parallel data with any other language. 12 African languages have words in GT: Afrikaans (a sibling of Dutch and the best-scoring language globally in TYB tests), Arabic (which straddles Africa and Asia and had a 60% failure rate), and 10 languages that originated on the continent: Amharic (60% failure), Chichewa (70% failure), Hausa (65% failure), Igbo (40% failure), Malagasy (60% failure, which makes it better than the 5.5 million pellets of robot-produced space junk that is the Malagasy Wiktionary), Shona (70% failure), Swahili (75% failure), Xhosa (77.5% failure), Yoruba (70% failure), and Zulu (70% failure). It would be racist to posit that Africans speak languages that are so much more complicated than those that have already experienced NLP advances, that it would be too difficult to extend systems originally engineered for the more feeble minds of Europe. It would be racist in the other direction to assert that Africans do not deserve to be included in global knowledge and technology systems because they do not speak languages that wealthy funders choose to invest in. 
Your mission as you read, then, is to envision how your chosen African language can effectively be incorporated within the realm of MT, so that its speakers can communicate with their African neighbors and with their international trading partners.

You may extend this mental exercise to any of the thousands of non-market languages of Asia, Austronesia, or the Americas. With around a quarter billion native speakers, Bengali is a good candidate. Hakka Chinese, one of China’s 299 languages, with 30 million speakers (more people than Australia and New Zealand combined), is another good example, as would be Quechua spoken by about 10 million people in South America. The exercise can also be brought to Europe, for example regarding around 7.5 million people who speak Albanian. Which of these languages is too complicated for ICT? Which people would you choose to exclude? If your answer to those questions is “none”, read on with an eye toward why current technologies have failed each of these languages, and what among existing or realistic technologies can be implemented for them to reach parity within the computational realm.

Consider these scenarios:

1. A language with virtually no digital presence – the majority of the world’s languages.⁸⁶ The first step in digitizing thousands of languages is collecting data about them: what are the expressions, what do they mean, and how do they function together? Without a large body of coherent data, a machine could not even perform basic operations on a language. For example, it cannot put all the words in alphabetical order if it does not have a list of words. Digital information for perhaps half the world’s languages consists solely of metadata – where it is spoken, how many speakers, language family – the minimal information that someone who knows something about linguistics has gathered sometime in the past 150-ish years to determine that Language X exists and is distinct from its neighbors. Machines cannot pluck data out of the ether. Surely you will agree that AI is impossible in this situation.

2. A language with some digital presence, but no coherent alignment to other knowledge sets.⁸⁷ For example, the language might have a digitized dictionary, and that dictionary might indicate that the basic form of a word matches to an English word such as “light”. There is still no basis for a computer to infer whether that is light in weight, or in color, or in calories, or in seriousness. Nor could the machine begin to guess aspects such as plural forms or verb conjugations, much less the positions of words to form grammatical sentences. Most digitized linguistic data exists in isolated containers that, you will probably agree, AI does not have the basis to penetrate. As a human who has mastered languages and knows what research questions to ask, you could stare at someone else’s dictionary for weeks on end and gain no traction in conversing in their language. Although we might regard them with mystical talismanic power, computers have no magical ability to spin gold out of floss.

As a case study, consider how to treat the Sena language spoken by 1.5 million people in Malawi and Mozambique. Religious organizations have published an online Bible, and audio recordings of Bible stories. The Bible has been parsed into words that are listed by frequency, in files available at http://crubadan.org/languages/seh. A 62 page Portuguese-Sena print dictionary from 2008 is available in the SIL Archives in Dallas, Texas, and a 263 page Portuguese-Sena/ Sena-Portuguese print dictionary, completed before 1956, is available in the Graduate Institute of Applied Linguistics Library, also in Dallas. 15 articles about Sena are listed at http://www.language-archives.org/language/seh, including one from 1897 and one from 1900. The only study that is available in digitized form is a scan of a book chapter written in German and Portuguese from 1919, that you will want to look at to visualize the technical challenge of extracting operationalizable linguistic data or models.

You have before you the sum total of resources for Sena. You are now challenged to lay out a strategy to use AI to produce an MT engine for the language. You cannot do it. Neither can Larry Page⁸⁸ and a phalanx of his engineers at Google. Again, AI is impossible in this situation.

Picture 24: Ambiguity that natural intelligence can slice through in an instant, but is opaque to AI.

3.⁸⁹ A language with considerable digital presence. Here we can begin to see the possibilities for the computer to make some high-level inferences. For example, with a large text corpus, a machine could learn how a language speaks of potatoes. Seeing that a language has baked, boiled, fried, or couch potatoes, and that chicken can also be baked, boiled, or fried, AI can even surmise the existence of couch chickens. However, unless there is a lot of parallel information, you should still agree that AI will be ill-equipped to fathom that “French fries” might or might not refer to deep-fried potatoes that have particular translation equivalents in other languages that have nothing to do with France (see Picture 24).
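The “couch chicken” inference can be sketched in a few lines. The word pairs below are a toy corpus invented for illustration, and `surmise` is a hypothetical helper, not a description of any production system: it simply proposes that a modifier seen with one noun may also apply to another noun that shares enough modifiers with it.

```python
from collections import defaultdict

# Toy corpus of (modifier, noun) pairs; invented for illustration.
observed = [
    ("baked", "potato"), ("boiled", "potato"), ("fried", "potato"), ("couch", "potato"),
    ("baked", "chicken"), ("boiled", "chicken"), ("fried", "chicken"),
]

# Map each noun to the set of modifiers it has been seen with.
modifiers = defaultdict(set)
for modifier, noun in observed:
    modifiers[noun].add(modifier)


def surmise(noun_a, noun_b, min_shared=3):
    """Propose unseen modifiers for noun_b, borrowed from noun_a,
    when the two nouns share at least min_shared modifiers."""
    shared = modifiers[noun_a] & modifiers[noun_b]
    if len(shared) < min_shared:
        return set()
    return modifiers[noun_a] - modifiers[noun_b]


proposals = surmise("potato", "chicken")
print(proposals)
```

The pattern-finding is mechanically sound and the conclusion is still nonsense: nothing in the monolingual data tells the machine that “couch potato” is a party term rather than a way of preparing food.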

Parallel data opens doors for machines to learn patterns between languages. However, these patterns will always be limited to the training data available – direct sentence-aligned translations between two languages, or potentially a somewhat wider relational space with NMT. (The case of “zero-shot translation”, with nonsensical results, relies on two stages of direct translations, with the third (bridge) language removed after training.) The largest set of topically related multilingual data outside of the official European languages, Wikipedia, is entirely unsuited as a source for MT training. To glimpse inside the major parallel datasets on which MT is actually trained, spend some time trying words and phrases on Linguee.com. Compare the areas that Linguee highlights in yellow as translations from the source side to the target side, as shown in Picture 25. These are extremely useful when you filter with your natural intelligence, but are devilishly difficult for the machine to match with certainty, because:

Picture 25: Results for “short-order cook” in Linguee. The best translation is “cuisinier(ère) de repas-minute”. Notice (a) that translation occurs only one time, and (b) making the translation useable, in this case factoring around the parentheses for gender, would require human intervention. It is worthy of note that all of the valid translations come from Canadian sources, where greasy spoons are popular, and none from France, where a short-order cook would be summarily guillotined. Also, do not hire whoever was responsible for the translation on the second line.

- one source phrase might be translated in many ways (in a lot of ways, in a variety of ways, in multiple ways, varyingly) by different people across source documents
- current algorithms cannot identify most party terms, especially those that are separated (e.g., “cut corners” can be separated by one or many words: “Boeing executives wouldn’t let the engineers do their job and they cut those deadly corners on the 737 MAX 8.”⁹⁰)
- the source phrase might be rare within the training corpus, without enough instances for a machine to detect patterns. This is especially true for:
  1. colloquial expressions, which might be used all the time by regular folks but are frowned on in the formal documents that enter the public record
  2. ironically, American English expressions, often missing because the most-translated texts come from official EU documents where UK English is standard. For example, people don’t “call an audible” when they change plans at the last moment in the UK, so the expression does not make it into parallel corpora, and (look yourself) GT translates the words devoid of the meaning of the expression in all languages
  3. non-English as the source. Although every language has its own set of complex expressions, only the most powerful are likely to be the source of widespread official translation. For example, all EU documents must be translated to Lithuanian, but they are likely to start out in English, French, or German; precious few documents start in Lithuanian for translation throughout the EU language set. If you can find a source of party terms for a non-investment language such as Kyrgyz, much less their translations to any other language, please sing about it in the comments section
- the machine can pick out the wrong elements on the target side as the translations
- polysemy: one word or phrase with several meanings
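The problem of separated party terms in the list above can be made concrete with a small sketch. It assumes a naive approach, invented here for illustration, in which the words of a known expression are matched in order with a bounded number of intervening words; `find_separated` and its `max_gap` parameter are hypothetical names, not part of any MT system.

```python
import re


def find_separated(term, sentence, max_gap=4):
    """Return True if the words of `term` occur in order in `sentence`
    with at most `max_gap` other words between consecutive term words."""
    words = [re.escape(w) for w in term.split()]
    # Up to max_gap intervening words, then the separator before the next term word.
    gap = r"(?:\W+\w+){0,%d}?\W+" % max_gap
    pattern = r"\b" + gap.join(words) + r"\b"
    return re.search(pattern, sentence, flags=re.IGNORECASE) is not None


sentence = ("Boeing executives wouldn't let the engineers do their job "
            "and they cut those deadly corners on the 737 MAX 8.")

print(find_separated("cut corners", sentence))             # gap allowed
print(find_separated("cut corners", sentence, max_gap=0))  # contiguous only
print(find_separated("cut corners", "He cut the grass near the corners."))
```

A contiguous-only matcher misses the Boeing sentence, while the gap-tolerant matcher also fires on the literal sentence about grass. Telling the two apart requires knowing what the words mean, which is exactly what the pattern does not supply.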

Using some test phrases that are evident to most native English speakers (e.g.: chicken tenders; short notice; hot under the collar; off to the races; mercy killing; top level domain; short-order cook), you can see in Linguee that current parallel corpus data is probably sufficient for AI to learn from human translations from English to French for mercy killing and top level domain, maybe for short notice, highly doubtful for chicken tenders (see Picture 45), and inconceivable for hot under the collar, off to the races, or short-order cook. AI can discover many interesting patterns that can be combined with other methods to advance translation where data is available (for example, short-order cook could be mined from Canada’s official Termium term bank), but it can only ever be a heart stent, not an entire artificial heart.

Swahili is a good case study of a language with a lot of digitized data. Kamusi has a database with rich lexical information (such as noun class designations) for more than 60,000 terms, as well as a detailed language model that has been programmed to parse any verb; unfortunately, funds for African languages are so dry that the data has been forced offline. TUKI, an even more extensive bi-directional bilingual dictionary between English and Swahili, from the University of Dar es Salaam, is available, but entries are bricks of text that can barely be interpreted by a knowledgeable bilingual human, much less by a machine (see Picture 26). An open monolingual Swahili corpus based on recent newspaper archives is being constructed at http://www.kiswahili-acalan.org, and an older corpus is available for research purposes, with difficulty, via https://www.kielipankki.fi/news/hcs-a-v2-in-korp. No parallel corpus exists with English, though one could probably be created as a graduate project using the digitized archives of the Tanzanian parliament and other well-translated government documents.

Picture 26: Partial entry for “light” from TUKI English-Swahili Dictionary.

Swahili has research institutes, a large print literature, an active online culture, a moderate investment in localization from Google and Microsoft, and crowdsourced localization by Facebook and Wikipedia (full disclosure, I have been involved in all four of those projects). Currently, GT and Bing both offer Swahili within their translation grid, but the results are only occasionally useable.

For example, for an article in the Mwananchi newspaper about prices for the crucial cashew crop, GT translates the fifteen occurrences of “korosho” (cashews) as riots, crisis, vendors, bugs, cats, cereal, cuts, raid, and foreign exchange, or leaves the term untranslated, but never once uses the word “cashew”. Meanwhile Bing recognizes the central word, and gives a passable rendition in English at the Tarzan level (one knows that farmers are facing insecurity selling their crop because of fluctuations in the international market) – but uproariously translates the word “jambo”, which has its major meaning in written and spoken Swahili as an issue, matter, or thing, as “hi”. (“Jambo” is the tourist version of a complicated Swahili greeting based on negating the verb “kujambo” (to be unwell), which Kamusi log files confirm is a lead search term for first time visitors, and which Bing has evidently chosen to hard-code regardless of context.)⁹¹

Google claims that Swahili makes use of their neural network technology, but that technology is clearly mimicking the brain of some lesser species. In fact, it may be that the technology is actively delivering worse results, by finding words that occur in similar vector space; in tests using the word “light”, in cases where GT actually picks up the right sense, it delivers the Swahili equivalent of the antonym heavy when the context is weight (“light load” is rendered as a heavy load, “mzigo mzito”) and the antonym harsh when the context is gentle (“light breeze” is rendered as a harsh wind, “upepo mkali”). The question is, with proper attention and the digitized resources at hand, could AI make significant headway? To the extent that a clever algorithm was set to work finding patterns within the corpus, then calculated rule-based connections through the lexical database that links to English, a much smarter MT platform could emerge. Creating such a system, though, would demand a lot of human intervention – the computer is the tool, but that tool needs an artisan to craft with it. AI can aid MT for those few language pairs that have significant parallel data, but it is not a stand-alone solution.
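Why antonyms might land close together in a vector space can be shown with a toy calculation. The context words below are invented for illustration, and the sketch uses plain co-occurrence counts with cosine similarity, which is an assumption made for simplicity and not a description of how GT's neural system represents words.

```python
import math
from collections import Counter

# Toy contexts, invented for illustration: nouns seen next to each adjective.
contexts = {
    "light": ["load", "load", "bag", "truck", "breeze"],
    "heavy": ["load", "load", "bag", "truck", "rain"],
    "purple": ["flower", "dress", "paint", "flower", "bag"],
}


def cosine(a, b):
    """Cosine similarity between the context-count vectors of words a and b."""
    va, vb = Counter(contexts[a]), Counter(contexts[b])
    dot = sum(va[w] * vb[w] for w in va)
    norm_a = math.sqrt(sum(v * v for v in va.values()))
    norm_b = math.sqrt(sum(v * v for v in vb.values()))
    return dot / (norm_a * norm_b)


print(round(cosine("light", "heavy"), 3))
print(round(cosine("light", "purple"), 3))
```

Because “light” and “heavy” modify the same nouns, their vectors in this sketch are far more similar than those of “light” and an unrelated adjective. A system that picks the nearest neighbor in such a space has no signal telling it that the neighbor means the opposite.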

Picture 27: English predictions based on AI might include “doing”, “feeling”, and “today”, rather than the options captured in this image from an Android phone.

4. The best-resourced language.⁹² English provides many cautionary examples showing why you should be skeptical of the notion that AI is the cure for NLP. Think of all the ways you interact with machines in English that, after decades of intense research and development, do not yet work as envisioned. Speech recognition, for example, causes you cold sweats whenever you are asked to dictate your sixteen digit account number into a telephone, although in theory AI should be able to interpret sounds to the same extent that it can recognize images.⁹³ Word processing software that learns from people should be able to flag when you have typed that someone is a good fiend rather than a good friend, and learn from millions of users repeatedly mistyping and correcting “langauge”. Auto-predict should know that “fly to” is likely to be followed by one of a limited set of cities, and “fly in the” will call for ointment or afternoon. Google combines language processing and some AI brilliantly in its English search – figuring out the relevant party terms, correcting typos, and combing your personal data to provide you results based on your movements, purchases, and perceived interests. Take it from their research leaders: “Bridging words and meaning — from turning search queries into relevant results to suggesting targeted keywords for advertisers — is Google’s core competency, and important for many other tasks in information retrieval and natural language processing”. Yet still, Google Search frequently misses; to find out about fruit by typing in “apple” you will need to do some serious scrolling, and be prepared for an ad hominem slur on 1.8 billion people if you seek to learn the difference between Shia and Sunni (Picture 28).

Picture 28. Google Search uses their NLP machine learning techniques to parse the user’s search term and give a phenomenally offensive first result for “Islamic groups” (4 April, 2019). Search results may vary depending on your location, your browser history, the processor’s mood, or whether someone at Google has seen this image and taken corrective action.

When it works – when you say “tell my brother I’ll be ten minutes late” and it sends an SMS with the right information to the right person – you feel as though yesterday’s science fiction has become today’s science fact. Yet you know enough to not really trust the technology at this stage in its development. You would be a fool not to double check that your message was sent to the right person with the right content. During a pause while writing this paragraph to send a WhatsApp message to someone that “I will be in U.S.”,⁹⁴ the phone proposed that the next thought would be “district court”, an absurd prediction for most people other than an attorney or a member of Donald Trump’s criminal cabal. For AI to make actual intelligent predictions for your texting needs, your phone would merge billions of data points from the consenting universe of people sending messages, combined with your personal activity record, and recognize that “for Mary’s wedding” or “until the 30th” would be more appropriate follow-ons to “I will be in [place name]”. (See McCulloch for a nice gaze into autocomplete.)

Picture 28.1: The magic of Google Search is an interactive process between user and machine, teaming computation and natural intelligence to whittle toward relevance.

In fact, much of what seems like AI magic in Google Search is actually Google casting a wide net based on their huge database of previous searches, and using your personal knowledge of your personal desires for you to select the intelligent options from the many off-base ones also on offer. Examine Picture 28.1: Google knows that people who have searched for the term “hackie” have ended up clicking most frequently on “hacky sack”, so that is the top result they propose. Previous searchers have evidently also back-tracked and ended up looking for “jackie”s instead, even more so than hunts for people whose name is actually spelled “Hackie” as typed by the user (such as Tupac Shakur’s bodyguard). If you were looking for Jacqueline Kennedy Onassis and had started by typing “hackie”, you would think that Google was genius for suggesting her. If you were looking for “hacky sack”, you probably wouldn’t even notice that “jackie o” was one of your options. Getting you to your destination relies on a few tricks that may involve brute frequency counts (e.g. 47% of “hackie” searches end up at “hacky sack”), some NLP (e.g. “hackie” could be a typo for “hacky” or “jackie”), or even some AI (e.g. “hackie” could be a typo for “hacky” and we have detected a trending site called “Hacky Easter”). The proposal “hackie tupac” clearly appears because previous users had not found the man they were looking for until they used the modifier of his famous employer, thereby using their natural intelligence to train Google.
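The “brute frequency counts” trick is simple enough to sketch. The click log below is hypothetical: only the 47% share for “hacky sack” echoes the example figure in the text, and the remaining counts, like the `suggestions` function itself, are invented for illustration rather than drawn from how Google Search is actually built.

```python
from collections import Counter

# Hypothetical log of what users who typed "hackie" ended up selecting.
click_log = (
    ["hacky sack"] * 47
    + ["jackie o"] * 20
    + ["jackie chan"] * 18
    + ["hackie tupac"] * 15
)


def suggestions(log, top_n=3):
    """Rank destinations by how often previous searchers chose them,
    returning (destination, percent of all clicks) pairs."""
    counts = Counter(log)
    total = sum(counts.values())
    return [(query, round(100 * n / total)) for query, n in counts.most_common(top_n)]


for query, percent in suggestions(click_log):
    print(f"{query}: {percent}%")
```

Nothing in this ranking understands what “hackie” means. The intelligence sits in the earlier users whose choices populated the log, and in the current user who picks the right line from the list.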

You can try the magic yourself: do a search for “Clarence Thomas”, then search for “Ruth Bader Ginsburg”, then search for “John Roberts”, and finally search “s”. Do your choices include “Sonia Sotomayor”, “Stephen Breyer”, “Samuel Alito”, and “supreme court justices”? Pretty cool, no? The biggest trick to matching you to the information you are seeking, though, is that your own eyeballs scan the 10 options Google presents and send those signals to your brain, and your flesh-and-blood neural network processes the options and makes the intelligent decision. When your search fails – when you want to know why the devil is called Lucifer but searching Google for Lucifer only gives you information about a TV show by that name – you simply refine your search to “lucifer devil” and write off the cock-up. Google Search is awesome, and certainly involves increasing use of AI to find patterns that respond to what is on your mind, but that is a far cry from the notion that AI can now, or in the future will be able to, predict your thoughts better than you can yourself.

The brilliant citation software Zotero (https://www.zotero.org) demonstrates the difficulties.⁹⁵ The task for Zotero is to find information on web pages and pdfs that can be used for bibliographic references. That is, Zotero searches for specific elements, such as title, author name, date of publication, and publisher. The program is aided by frequently-occurring keywords, such as “by” or “author”, and by custom evaluation of the patterns used by major sources such as academic publishers. Even so, Zotero often makes mistakes. For example, sometimes it cannot identify an author who is clearly stated on the page, sometimes it will repeat that author’s name as a second author, and sometimes it will miss one or more additional authors. Zotero assigns itself one job, in one language, and it does that job as well as any user could hope. Nevertheless, it is unable to overcome the vast multiplicity of ways that similar data is presented on similar websites; non-empirically, at least half the references in this web-book needed some post-Zotero post-editing. A next-generation Zotero could conceivably use ML for an AI approach that homes in on recognizing how diverse publications structure their placement of the author names variable, but better resolution of this one feat in English text analysis is akin to landing a probe on a single asteroid as a component of charting the solar system.

Systems are constantly improving to recognize increasingly subtle aspects of English, because the language has enormous data sets and research bodies and corporate interests with extraordinary processing power to devote to the task, but it is crucial to recognize that (a) English is not “language” writ large, but a uniquely privileged case, and (b) analysis of a single language is just the train ride to the airport, while translation is everything involved in getting you into the stratosphere and landed safely across an ocean. The New York Times, speaking of English, states that Google’s “BERT” AI system “can learn the vagaries of language in general ways and then apply what they have learned to a variety of specific tasks,” but then goes on to say they “have already trained it in 102 languages”. Looking at what BERT has actually done for those languages,⁹⁶ however, shows much less than the comprehensive language data you might expect would aid in the analysis of “the vagaries of language”: Wikipedia articles, which are next to useless as a corpus resource for comparing languages (you can read a detailed side article written to justify this statement), have been scraped and processed with a number of reductionist linguistic assumptions (e.g. no party terms) that invalidate any glimmer of adhering to science. Nevertheless, the New York Times journalist continues to repeat his formulation that “the world’s leading A.I. labs have built elaborate neural networks that can learn the vagaries of language” – conflating the ability to learn from vast troves of English text with the ability to learn any of the languages a bit farther to the right on the steep slope of digital equity. We are still far from AI that comes close to producing English with a speaker’s fluency, or recognizing nuances when a person strays off the text or acoustic patterns our devices are trained on.
We know this, and we laugh about it, with funny tweets about Alexa and autocorrect, or this video employing the latest AI from 2022 on one relatively simple linguistic task in English:

Google puts some of the same NLP that it uses for search into English-source GT (e.g. recognizing “French fries” as a party term in some cases), but the service is a long way from interacting with English with a human understanding. What is not done with English is by default not replicated to any of the top tier languages discussed in point 3 above, and most certainly does not trickle down to data-poor or data-null languages. In the words of noted AI researcher Pedro Domingos, “People worry that computers will get too smart and take over the world, but the real problem is that they’re too stupid and they’ve already taken over the world.”

In fact, nobody recognizes that good data is essential for good AI more than Waymo, Google’s autonomous car division. Waymo has decided that it would be bad for their driverless cars to kill people. As part of the billions they are investing in designing vehicles that can interpret every road sign, reflection, and darting dog, they hire a bureau of Kenyans at $9/day to label each element of millions of pictures taken by their Street View cars. There is nothing artificial about the intelligence of the people who spend their days telling Google what part of an image is a pothole and what is a pedestrian. The AI comes in later, when the processor in a car zipping down the street has to interpret the conditions it sees based on the data it was trained with. It is not such a leap to assert that effective translation is impossible without data similarly procured from natural intelligence. Google hires its Nairobi employees for $9/day to make it safe for Californians cruising to the mall to stare at their phones instead of watching the road. They could easily double their staff and pay $9/day for linguistic data from Kenya’s large pool of unemployed, educated youth to power NLP for Kenya’s 60-odd languages. That they choose not to do so is not an indication that they have mastered AI with their current techniques, but rather that they do not see investment in reliable translation as yielding the same return as investment in non-lethal cars. As long as people believe the myth that AI has conquered or soon will conquer MT based on the current weak state of data across languages, rather than demanding systems that have a plausible chance of working, such mastery will never occur.

Artificial Intelligence, Machine Translation, and the Flying Car⁹⁷

“Latest Planes Herald New Era of Safety: With Inventors Producing Foolproof, Nonsmashable Aircraft, Experts Say We’ll All Fly Our Own Machines Soon”. Thus read the headline about Henry Ford’s plans in the November 1926 edition of Popular Science magazine, half a year before Charles Lindbergh flew across the Atlantic. Before a brief exegesis⁹⁸ of why we will never live in a world where flying cars are common transportation, we could amble through the next 93 years of articles promoting this fantasy. Let us land at “You Can Pilot Larry Page’s New Flying Car With Just an Hour of Training”, however, because it unites two themes relevant to MT: a huge investment in a much-hyped technology by a co-founder of Google, and journalists falling for and promoting the hype, hook, line, and sinker. Before continuing, please enjoy this video and the Fortune article about Page’s aircraft, Kitty Hawk.

Page’s team has produced a tremendously cool vehicle that would be a rush to fly. It takes off from the water, flies at the altitude of a basketball, moves at the speed of an electric bike, makes the noise of a delivery truck, stays aloft for about as long as it takes to make and eat an omelet, and, could it physically rise above power lines and rooftops, would not legally be allowed to fly over populated areas. And, what should be obvious: it is not a car. It does not take off, land, or taxi on a roadway, it cannot ferry you to work, it cannot carry passengers or groceries, it cannot go out on a rainy or windy day, it cannot fit in a parking space. It is not a car, though “flying cars”, machines that can operate on roads and in the air, will soon exist as a toy for rich people.

On that line, companies that are not Google have developed substantially better models; the Dutch PAL-V, at $400,000, is a gyrocopter that flies higher and faster, for much longer distances (no information on the noise profile of its 200hp engine), then can have its wings folded in a parking lot and perform reasonably on the road, with energy usage comparable to contemporary vehicles in both air and land modes. Were such machines to become affordable to millions, though, numerous factors would prevent their viability in urban or suburban environments, with acceptability plummeting as lower monetary cost increased uptake.

Operating a vehicle in three dimensions requires a skill set that greatly surpasses movement along the ground – for example, constant awareness of safe landing spots and the ever-shifting localized wind conditions needed to take off and land safely. Among the major causes of fatal automobile accidents, all of which are variations of driver error, weather conditions, road conditions, mechanical failures, and encroachments, only a few (running red lights and stop signs, potholes, tire blowouts, sharp curves) would not occur in the air. We can fairly predict that operator strokes and heart attacks will occur at roughly the same rate in the air as on the ground. Drunk flying will occur at about the same rate as drunk driving, without the possibility of police patrols to stop impaired people from operating their vehicles. Meanwhile, the skies introduce many new dangers, such as birds, icing on the wings, insects clogging the pitot tubes essential to determining airspeed and altitude, wind shear, and vehicles approaching from any angle. People will sometimes fail to make sensible decisions, such as how much fog, rain, or wind is enough to abort a trip, or keeping loads balanced and within weight limits, or maintaining adequate fuel reserves. RIP, Kobe Bryant.

Mid-air collisions are rare today because planes are few and spaced by miles, with pilots adhering to a variety of protocols. A significant increase in low-altitude traffic, though, with vehicles hopping among random points like popcorn, would produce insupportably dense skies. A fender-bender on the ground is Game Over at 1000 feet. A car with a sudden mechanical problem stops and the passengers go home in a taxi; a gyrocopter with a sudden mechanical problem drops and the passengers go home in a box.

Automation would exacerbate the dangers, with myriad people punching buttons in aircraft they did not know how to control in emergencies as simple as a fatigued engine bolt. Has an electrical failure ever caused your phone or computer to shut off? If your avionics shut off in the air, your race to reboot would fare poorly against gravity. On-board sensors would have even more difficulty than human eyes in picking out hazards such as electric wires, flag poles, fences, overhead road signs, rocky landing surfaces, geese, and drones. People will maintain their flying cars with the same diligence with which they maintain their earthbound rides, which is to say they will not keep their fluids topped or have their equipment serviced on tight schedules, much less conduct pre-flight walk-arounds or detailed safety inspections (if they even know what to look for and have access to view critical components).

As risky as flying cars would be to the people inside them, they would be a greater terror to people on the ground. Perhaps you’ve heard a book-sized drone buzzing around your neighborhood; increase the engine to a size needed to lift a few humans and their cargo and you increase the noise proportionally (and don’t imagine an electric propeller would be silent); increase the number of vehicles and you’ve got the clamoring cacophony of lawnmowers overhead night and day. What is your risk tolerance for an occasional disabled machine falling through your roof? Or a drowsy or suicidal pilot? When anyone can get a flying car, anyone can drop a homemade bomb on their ex’s house, or unload a vat of acid over a church picnic, then flee without a trace. People toss cigarette butts and McDonald’s wrappers out of their cars all the time – now they are small missiles heading straight for your head.

Vehicles such as Kitty Hawk could some day see service autonomously airlifting venture capitalists around Silicon Valley. However, they will not and cannot provide to the masses the transportation solution of the future. No matter a century of drool from a gullible press that does not dig even as deeply into the question as this sidebar does. No matter continued headlines such as “We were promised flying cars. It looks like we’re finally getting them”. After all, the dream, which many of us have shared since our childhood fantasies, is put forward confidently by proven innovators like Henry Ford and Larry Page whom we want to believe. All this is to drive home a point: artificial intelligence is to translation as the flying car is to transportation, extremely interesting in privileged contexts, infeasible in most.

Myth 2: Neural Networks Solve Machine Translation⁹⁹

Picture 29: Bettors predict the winners of horse races based on small statistical variations. People who put their money on 19 of the 20 horses in this picture will lose.

In case you noticed that previous generations of GT did not quite produce perfect translations, you might be reassured that recent developments in neural networks (NN) have ironed out previous wrinkles, and the sun is rising on flawless universal translation. A 2016 cover story for the New York Times Sunday Magazine, for example, presents the story of GT’s conversion from statistical (SMT) to neural (NMT) machine translation, with the assertion in the opening section that GT had become “uncannily artful”.

In SMT, a lot of data is compared between two languages, and translations are based on imputed likelihoods, like predicting the winner of a horse race based on a set of knowledge about past performance (Picture 29). Much of the challenge of translation arises from the ambiguity wherein a single spelling can mask multiple meanings, the problem of polysemy. Assume, for example, that “spring” in English is found to correspond to “primavera” in Portuguese in 40% of parallel texts, and corresponds to terms matching six other senses (bounciness, a metal coil, water flowing from the ground, elastic force, stretchiness, a jump) at the rate of 10% each. Because the season is the plurality sense, SMT will offer the translation pertaining to it much more than 40% of the time, because it is four times more likely than any of the others individually.
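To make the arithmetic concrete, here is a minimal sketch of a context-free plurality picker, using only the hypothetical percentages from the paragraph above. The `sense_share` table and both function names are inventions for illustration; a production SMT system weighs far more evidence than this.

```python
# Hypothetical sense distribution for English "spring" in English-Portuguese
# parallel text, taken from the illustrative numbers in the paragraph above.
sense_share = {
    "season (primavera)": 0.40,
    "bounciness": 0.10,
    "metal coil": 0.10,
    "water flowing from the ground": 0.10,
    "elastic force": 0.10,
    "stretchiness": 0.10,
    "jump": 0.10,
}


def most_likely_sense(shares):
    """Return the plurality sense: the horse with the best odds."""
    return max(shares, key=shares.get)


def expected_miss_rate(shares):
    """If the picker always bets on the plurality sense, this is how often
    it is wrong when the real senses follow the corpus distribution."""
    return 1.0 - shares[most_likely_sense(shares)]


print(most_likely_sense(sense_share))            # season (primavera)
print(f"{expected_miss_rate(sense_share):.0%}")  # 60%
```

A picker that always backs the 40% favorite delivers the season 100% of the time, and under these assumed numbers it is wrong for the 60% of occurrences where another sense was meant.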

Picture 29.1: An honest pregnancy due-date calculator that stresses the weakness of its statistical prediction. By contrast, although basing its predictions on exactly the same algorithm, mamanatural.com woos parents to its site by claiming in the metadata optimized for Google search results that they have an “Amazingly Accurate Pregnancy Calculator”.

Let’s look at a non-linguistic example of using statistics to make predictions. Babies are notoriously poor at scheduling their emergence from the womb. Natural births usually occur within a one month range, the mid-point of which is about 280 days after the mother’s last menstrual period (which itself is rarely marked on a calendar). About 4% of babies, or one in twenty-five, are actually born on the date projected by the algorithm. My own daughter was born two weeks after the “due date”. Had we booked flights for her grandmother to be in the hospital for the birth based on that statistical guess, my mother-in-law would instead have spent many days watching us trying to induce labor by playing beach paddle ball in our local lake, and been home before the first contractions kicked in. There is a scientific basis for proclaiming that a baby will be born somewhere around a given date – but the wise and honest emphasis should be on the “somewhere around”, done well as shown in Picture 29.1. For translation, SMT makes similar guesses, which are only sometimes based on more sophisticated algorithms, like race predictions that draw on some additional information about the horses and jockeys.

NMT, by contrast, infers connections based on the other words embedded in the “vector space” surrounding a term. Vector space, which sounds impressively mathy, is used in MT to describe the mapping of different words and phrases in a corpus to capture their degrees of similarity. Theoretically, NMT should recognize that, when mountain and spring appear together, “spring” likely pertains to the sense of flowing water. GT’s self-professed “near perfect” NMT Portuguese produces the erroneous “primavera de montanha” when given “mountain spring” by itself, but with tweaking (collocation with the word “water” is necessary but not sufficient), a sentence can be found that generates the correct “nascente”, demonstrating that NMT can improve on statistics: “A água flui da nascente da montanha” when shown the context “Water flows from the mountain spring”. The guesses that NMT makes when mapping its vectors, though, are often wildly erratic. DeepL, for example, intuits that “Regards” is the closing of a letter, compares that to a French closing from some official letter in the EU archive, and proposes “Je vous prie d’agréer, Monsieur le Président, mes salutations distinguées” (see Picture 68). GT recognizes that 习近平 is the Chinese president Xi Jinping, finds text about presidents, and translates the name as Эмомали Рахмон in Kyrgyz – that being Emomali Rahmon, president of Tajikistan.
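The idea of choosing a sense by similarity in a vector space can be sketched with hand-made toy vectors. Everything numeric here is an assumption for illustration: real systems learn vectors with hundreds of dimensions from corpus data, whereas the three invented dimensions below (season-ness, water-ness, mechanical-ness) and the `pick_sense` function exist only to show the mechanism.

```python
import math

# Hypothetical 3-dimensional vectors: (season-ness, water-ness, mechanical-ness).
# All values are invented for this sketch, not learned from any corpus.
SENSES = {
    "primavera": [1.0, 0.0, 0.0],  # the season
    "nascente": [0.0, 1.0, 0.0],   # water flowing from the ground
    "mola": [0.0, 0.0, 1.0],       # a metal coil
}
CONTEXT = {
    "mountain": [0.6, 0.4, 0.0],
    "water": [0.0, 1.0, 0.0],
    "flows": [0.1, 0.9, 0.0],
}


def cosine(a, b):
    """Cosine similarity: 1.0 for identical directions, 0.0 for unrelated ones."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def pick_sense(context_words):
    """Average the context vectors and return the closest sense of 'spring'."""
    dims = range(3)
    mean = [sum(CONTEXT[w][d] for w in context_words) / len(context_words) for d in dims]
    return max(SENSES, key=lambda sense: cosine(mean, SENSES[sense]))


print(pick_sense(["mountain"]))                    # primavera
print(pick_sense(["water", "flows", "mountain"]))  # nascente
```

With these assumed numbers, “mountain” alone tips toward the season, while the fuller context pulls the choice to flowing water, mirroring the behavior described above. Change the invented vectors slightly and the guess flips, which hints at why such choices can look erratic.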

In cases where two languages have substantial parallel data, and the domain is well treated in the corpus, results can be astounding. This can be seen in action with DeepL’s spot-on translation of “his contract came to an end” to French, correctly using the non-literal “prendre fin”; using Linguee to view the data on which DeepL (discussed in this sidebar) is presumably based, one can see numerous examples where a contract “came to an end” in English and “a pris fin” in French (mot-à-mot “has taken end”). In GT, the English phrase “easy as pie” is improperly verified “c’est de la tarte” for French, but given the correct colloquial translation “simple comme bonjour” in some, but not all, longer test sentences.

I recommend reading Yiu’s “Understanding Neural Networks” for a somewhat approachable overview of the computer science involved. At the nub of Yiu’s explanation: neural networks are models of a complex web of connections (and weights and biases) that make it possible to “learn” the complicated relationships hidden in our data. Basically, the computer does a lot of checking of some elements A, B, C, and D in comparison to other elements W, X, Y, and Z, and patterns often emerge. There can be a lot of layers – A, B, C, and D are first checked against E, F, G, and H, and then those results are checked against I, J, K, and L, and then onwards toward W, X, Y, and Z and beyond. The more layers in a system, the “deeper”, and thus the name “deep learning”.
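For readers who want to see the layering, here is a minimal sketch of a forward pass through three layers. The weights and biases are arbitrary numbers chosen for illustration; in a real network they are learned from training data, and nothing here is drawn from GT or from Yiu’s article.

```python
import math


def layer(inputs, weights, biases):
    """One layer: each output is a weighted sum of all inputs plus a bias,
    squashed into the range (-1, 1) by tanh."""
    outputs = []
    for row, bias in zip(weights, biases):
        total = sum(w * x for w, x in zip(row, inputs)) + bias
        outputs.append(math.tanh(total))
    return outputs


def forward(inputs, network):
    """Feed the inputs through every layer in turn."""
    activations = inputs
    for weights, biases in network:
        activations = layer(activations, weights, biases)
    return activations


# Hypothetical weights: A, B, C, D -> E, F, G, H -> I, J, K, L -> two outputs.
network = [
    ([[0.2, -0.1, 0.4, 0.0], [0.1, 0.3, -0.2, 0.5],
      [-0.3, 0.2, 0.1, 0.1], [0.0, -0.4, 0.2, 0.3]], [0.0, 0.1, -0.1, 0.0]),
    ([[0.5, 0.1, -0.2, 0.3], [-0.1, 0.4, 0.2, 0.0],
      [0.3, -0.3, 0.1, 0.2], [0.0, 0.2, -0.5, 0.1]], [0.1, 0.0, 0.0, -0.1]),
    ([[0.4, -0.2, 0.3, 0.1], [-0.3, 0.5, 0.0, 0.2]], [0.0, 0.0]),
]

scores = forward([1.0, 0.0, 1.0, 0.0], network)
print(len(network), "layers ->", scores)
```

“Deeper” simply means a longer `network` list. The depth adds capacity to fit patterns, but it says nothing about whether the data behind the weights is any good.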

Picture 29.2: The Quick Draw neural network experiment, https://quickdraw.withgoogle.com.

A fun neural network experiment called Quick Draw (sponsored by Google) that invites your participation gives an indication of both the strengths and limitations of what neural networks can achieve. Players are given a term and 20 seconds to sketch it on their device. (High art is not expected, or possible.) Picture 29.2 shows some results: the program recognized my “dumbbell” before I had a chance to complete it, based on thousands of other people taking a very similar approach to the same prompt. It also recognized “hedgehog” before the clock ticked over, though I myself would not guess the result from my own drawing. However, it did not recognize “bat” because I drew “baseball bat” instead of the flying mammal it was expecting, and it did not recognize “axe” or “animal migration” although I suspect many human readers will discern the terms from the doodles. (It also did not recognize my “tractor”, but who would?) The game is fun for the whole family, and sometimes the program will correctly glean all of your drawings. The first point here is that a neural network working on a limited data set of a few dozen terms, with more than 50 million data points and counting, is still highly constrained in the inferences it can make. Neural networks can discern a lot of patterns that machines cannot otherwise recognize, but they can only dive “deep” to the extent that they have lots of clean data to learn from. A second point is that a neural network could not have guessed any item I drew outside of its training set – for example, had I drawn what I intended as a cowboy hat, it would have guessed “flying saucer”, because only the latter is within its data universe.

In certain circumstances, NMT thus offers a marked improvement over phrase-based SMT. Toral and Way measure this for three literary texts translated from English to Catalan, and give the level-headed comparison that sentences attaining human quality rose from 7.5% with SMT to 16.7% with NMT in one text, 18.1% to 31.8% for a second, and 19.8% to 34.3% for a third, concluding, “if NMT translations were to be used to assist a professional translator (e.g. by means of post-editing), then around one third of the sentences for [the latter two] and one sixth for [the first] would not need any correction.” This study went to great lengths to stay within the lanes of reporting on a single language with rich data (more than 1.5 million human-translated parallel sentences), and reached the notable conclusion “NMT outperformed SMT”. It was puffed by the tech media cotton candy machine, though, into “Machine Translates Literature and About 25% Was Flawless”.

Ooga Booga: Better than a Dictionary¹⁰⁰

Nota bene: because neural machine translation (NMT) is based around the context in which words appear, it can offer no improvements to statistical machine translation’s (SMT’s) horse-race guesses for the 89% of users who improperly attempt to use a machine translation (MT) service as a bilingual dictionary.

A window into how NMT performs as a dictionary comes from translations of the nonsense phrase “ooga booga” from all 107 languages to English.

In 82 cases, NMT does not find a friend, so the MUSA mandate to supply a word, any word, repeats “ooga booga”. If Google is giving us “better than human” translations, these results must be “better” than a dictionary. After all, no dictionary would be bold enough to include “ooga booga” in any of those languages.

Another 8 cases transform the capitalization and/or the number of o’s and g’s in various ways (when delivering mass in Latin, for example, the Pope apparently considers “Ooga Booga” to be a proper noun). And then things get weird.

For 5 languages, the neural networks found independent translations to English (with Urdu helpfully proposing that the purported English term be read in Urdu script).

But for 16 languages, the neural machine translations form into geographic clusters. The three Baltic states and distant Hungary all change “booga” to “boga”; Latvian and Lithuanian are about as close to each other as English is to Dutch, but unrelated to the other two, which are distant Finno-Ugric cousins of each other. The three Celtic languages around the Irish Sea go all downward-facing dog, though “ooga yoga” is the popular style in Ireland while “soft yoga” is the rage in Scotland and Wales. I posit that the neural net has transmuted “booga” to “blog” and then elaborated on some word embedding for three unrelated African languages from north of the equator (two spoken in Nigeria). The six African languages from south of the equator are the biggest puzzle – they are all Bantu languages with sparse parallel data versus each other and versus English, so perhaps GNMT mixes them all in the same blender in an attempt at enhanced results?

I can only report the odd way that Google’s neural networks perform in their simulation of a dictionary. I cannot begin to explain. I can only suggest that we do “ooga booga” as any good Tamil speaker would: Go speculative!

Africa middle latitudes (except Hausa)
Amharic: look at the blog
Igbo: look at the blog
Yoruba: look at the blog

Africa southern latitudes
Chichewa: you are in trouble
Sesotho: you are in trouble
Shona: you are in trouble
Swahili: you are in trouble
Xhosa: you are in trouble
Zulu: you are in trouble

Baltic + Hungary
Estonian: ooga boga
Hungarian: ooga boga
Latvian: ooga boga
Lithuanian: ooga boga

Celtic languages
Irish: ooga yoga
Scots Gaelic: soft yoga
Welsh: soft yoga

Two South Asia + one Baltic + Latin
Belarusian: Ooga booga
Kannada: Ogga Boga
Latin: Ooga Booga
Telugu: Oga Boga

Four South Asia + one middle Africa
Bengali: Booga in the era
Gujarati: Grown bug
Hausa: on the surface
Tamil: Go speculative
Urdu: وگا بوگا

The research that Google presented in correlation with the launch of GNMT (Google NMT) assigns a specific numerical value to this improvement. “Human evaluations show that GNMT has reduced translation errors by 60% compared to our previous phrase-based system on many pairs of languages: English↔French, English↔Spanish, and English↔Chinese” (Wu et al – read this article for the definitive mechanical overview of GNMT). Without fixating on “many” being defined as 3 out of 102 in each direction in relation to English, do note that a) all three of these high-investment languages are in the top tier in my tests, and b) not a word is said about non-English pairs. Not Urdu, not Lao, not Javanese, not Yoruba, not any testing reported in any down-market language. Not Afrikaans nor Russian where results for formal texts might surpass the study languages. There is no scientifically valid way to extrapolate the findings in Wu to any GT language that has not been tested. That is, the blanket conclusion that GNMT produces “better” results than previous models when comparable corpus data is available is likely correct, but Google’s research does not support a blanket “60%” claim for the 99 languages within GT at that time. Nevertheless, a great many academics now take the plausible-sounding claim in a peer-reviewed publication at face value as applying across the board.¹⁰¹

Even were that level of improvement to be consistent across languages, most of the languages in the system, as measured in this study, began at a much lower starting point, raising the baseline question “60% better than what?” Nevertheless, the headline the public sees erases any nuance in interpreting what Wu legitimately shows, blaring “Improvements to Google Translate Boost Accuracy by 60 Percent”. So, NMT (or what my testing indicates inconclusively to be a hybrid deployment using SMT for short phrases) is better than SMT by itself, and in some cases produces Bard-like results. This is encouraging for resource-rich languages, leading to the recommendation that you use GT, with caution, in certain advised situations. But there is no plausible research underlying headlines about GNMT such as “Google’s New Service Translates Languages Almost as Well as Humans Can”.

The manner in which this news has been extrapolated leads us back to the flying car. Hofstadter notes “verbal spinmeistery” beneath use of the term “deep” to suggest “profound” instead of its technical signification of “more layers” of processing than older networks. For example, a second New York Times Magazine cover story about AI expounds, “Deep neural nets remain a hotbed of research because they have produced some of the most breathtaking technological accomplishments of the last decade, from learning how to translate words with better-than-human accuracy to learning how to drive”. Read that again: in a major article, the US “newspaper of record” rhetorically ratcheted NMT from “near perfect” all the way to “better than human”. Google’s own research, far from “better-than-human” results, reports, in conjunction with the milestone achievement of all 103 languages then incorporated into GNMT, an average BLEU score of 29.34 from English to its top 25 languages, 17.50 for its middle 50, and 11.72 for its bottom 25 (they do not share disaggregated data) – numbers that very much adhere to the empirical findings of TYB. It is the journalistic spin, not the MT results, that should leave you gasping for breath.

When people repeatedly hear plausible-sounding hyperbole echoed by sources they trust, those statements become confirmed truth in the public mind, even among ICT professionals who then trumpet that fake news in bars, interviews, and conferences. A translation professional informed me of a client who went so far as to reject human translations on behalf of a major European airline because the results did not look like the output from the alleged authority of Google Translate. A little bit of knowledge is a dangerous thing or, after that aphorism is translated by a bilingual person to Japanese as 生兵法は怪我の元 and then reverse translated by the “60% better” GT, “The law of living is the source of injuries.”

A few days after the second NY Times article, the prestigious journal Science published an article with the headline, “Artificial intelligence goes bilingual—without a dictionary” (Hutson 2017). Science was reporting on forthcoming articles by Artetxe et al and Lample et al about research in unsupervised learning. In Lample, neural nets processed a random half of the English side of a test corpus, independently processed a random half of the French side of that same corpus, and obtained a non-zero BLEU score of 15.1. In essence, BLEU¹⁰² as a (problematic) metric is a scoring, between 0 and 100, of the extent to which MT output has the same words, in the same order, as a human translation.¹⁰³ A score of 15.1 indicates output similar to a low Tarzan score in my tests, where some word equivalencies are found but the results are largely incomprehensible to a native speaker. Google’s NMT for English to French obtained a BLEU of 38.95, which is roughly 2.5 times higher than the experimental results that Artetxe’s and Lample’s groups independently achieved. The production French score is comparable to the 38.30 overall production Dutch-English score attained in my spot test shown in Table 1, indicating a high probability of debilitating errors. A proper reading of the articles is that unsupervised learning as a technique was tried and the results were not entirely futile – like reporting on an anti-baldness medication that was only 40% as effective as the leading product, but did lead to 15% hair recovery on one test subject. An improper reading would be, “For The First Time, AI Can Teach Itself Any Language On Earth”, patently false 7000 times over.¹⁰⁴
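The “same words, in the same order” idea can be sketched with the clipped n-gram precision that sits at the core of BLEU. This is a simplification under stated assumptions: full BLEU combines 1- to 4-gram precisions with a brevity penalty over a whole test corpus, none of which is reproduced here, and the two example sentences and function names are invented for illustration. The final lines redo the ratio between the two BLEU scores quoted above.

```python
from collections import Counter


def ngrams(tokens, n):
    """All runs of n consecutive tokens."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n):
    """Share of the candidate's n-grams that also occur in the reference,
    counting each reference n-gram at most as often as it appears there."""
    cand = Counter(ngrams(candidate.split(), n))
    ref = Counter(ngrams(reference.split(), n))
    total = sum(cand.values())
    if total == 0:
        return 0.0
    overlap = sum(min(count, ref[gram]) for gram, count in cand.items())
    return overlap / total


# Hypothetical human reference and machine output (invented example).
reference = "water flows from the mountain spring"
candidate = "water flows from the spring mountain"

print(clipped_precision(candidate, reference, 1))  # same words: 1.0
print(clipped_precision(candidate, reference, 2))  # same order: 0.6

# Ratio of the production score to the unsupervised score cited in the text.
ratio = 38.95 / 15.1
print(round(ratio, 2))  # 2.58
```

The candidate scores perfectly on single words and loses points only on word order, which shows both what the metric rewards and why a number like 15.1 says little about whether a reader could understand the output.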

Yann LeCun, Facebook’s Chief AI Scientist,¹⁰⁵ amplifies the misinterpretation of the research results as solving, for all 25 million pairs in the grid¹⁰⁶ (only missing by about 24,999,990), the question, “How do we translate from any language to any other language?” In his full explanation (which you can watch or read), based on Lample, LeCun declares, “in fact, that actually works, amazingly enough!”

Picture 29.3: Neural Machine Translation of nonsense text, exposing the NMT imperative to deliver the closest result it conjures from the data.

Moreover, when NMT encounters text outside its training regimen, according to Harvard’s Alexander Rush, the system can “hallucinate” bizarre outputs. “The models are black-boxes, that are learned from as many training instances that you can find. The vast majority of these will look like human language, and when you give it a new one it is trained to produce something, at all costs, that also looks like human language. However if you give it something very different, the best translation will be something still fluent, but not at all connected to the input”. Sometimes the output will be so egregiously concocted that we can see where NMT gropes (“deep learns”) through its training data. Though not examined in TYB, Microsoft makes similar claims to Google and runs on a similar NMT-based system. They title a blogpost from 29 June, 2019, “Neural Machine Translation Enabling Human Parity Innovations In the Cloud”, and assert therein (with a few caveats), “we showed for the first time a Machine Translation system that could perform as well as human translators” that they were beginning to launch “in production of our latest generation of neural Machine Translation models” that “incorporate most of the goodness of our research system”. Picture 29.3 exposes the bricks for some fake Arabic, where a random text input is routed to a selection of English words that represent the closest connection Microsoft could find in the training corpus.
GNMT finds the Bible to be a source for deep learning for its five African languages from north of the equator (Amharic, Hausa, Igbo, Somali, and Yoruba), asserting that “da da da da da da da da” translates from all to English as, “and the Lord of the heavens and the earth!” (The same string cleans up along a partial Indochinese Peninsula cluster as “taking a bath” in Hmong and Lao, “shower, bath, bath, shower, bath” in Khmer, and “shower, bath, shower, shower” in Myanmar (Burmese)).¹⁰⁷ You will usually not, however, see such blatant smoke from the gun in your translations, so there is no way to take cover from such common misfires of the neural synapses.

We can take “spring tide”, the bi-monthly tidal extremes that occur during every full and new moon,¹⁰⁸ as a case study. By inspecting Linguee, we can see that the term occurs with some frequency in the documents that DeepL uses to translate between English and French, but not between English and Portuguese. The French term “marée de vives-eaux” has not been mapped by humans to “spring tide” within Linguee, but DeepL nevertheless discovers the right translation to French from their training data. With the term absent from the English-Portuguese parallel text, though, DeepL goes to the “springtime” sense of “spring”, even when given a sentence designed to point the way, such as “There will be flooding during the spring tide tonight.” GT blows the sentence in every language, flailing for the “springtime” sense in French and Portuguese, and rendering Swahili that back-translates as “There will be with floods when water of water night of today” – despite these being three languages of seafaring people who have used their term for “spring tide” for hundreds of years. Bing, which also describes NMT as their default method, translates the sentence to Swahili as “Kuna mapenzi kuwa mafuriko wakati wa spring wimbi usiku wa leo”, which back-translates as “There is lovemaking to be floods when «spring» wave night of today”.

This is not a mistake, it is MUSA, the requirement that the MT system conjure up text of some sort, regardless of its basis in human translation. MUSA is invoked not only when the data is merely missing, as with the equivalence between “bamvua” and “spring tide” being missed in GT’s Swahili training data,¹⁰⁹ but also when the data does not exist. Nepali is the language of a landlocked country high in the mountains, which has never had reason to develop a vocabulary for oceanic phenomena, but GT sails forth to render the example sentence as “त्यहाँ चिसो चिसो आज रातभरि बाढी हुनेछ।”, which a Nepali-speaker back-translates as “Over there all night long last night will be a very, very wet flood”. This language-like hallucination is not a bug – filling in the gaps with anything at hand is inherent to the NMT algorithm.

In sum, NMT offers certain improvements over SMT, where extensive parallel training data can be exploited, to the point where MT can be used to transmit information unreliably between English and about 35 languages identified in this study in non-critical situations. From that, FAAMG and the popular press have produced the empirically false myth, stated as fact in no uncertain terms, that we now live in an age of universal translation. To modify a saying popularized by Mark Twain, there used to be three kinds of lies: lies, damned lies, and statistics. Now there’s a fourth contender: neural networks.

Myth 3: “Zero-Shot” Translation¹¹⁰

This section unfurls the tale of how a test became an untruth, and an untruth became a widespread belief. GT produced a lot of buzz around its MT engine when it announced that “our method [GNMT] also enables “Zero-Shot Translation” — translation between language pairs never seen explicitly by the system”. The announcement, timed to the publication of Johnson et al., led to headlines such as, “Google’s AI can translate language pairs it has never seen”. You can review all the coverage at https://www.google.com/search?q=zero-shot+translation. Zero-shot is a fantastic idea, and a long-term goal we are taking early steps toward at Kamusi. The problem is, a close reading of the article underlying the announcement shows that Google did not, in fact, achieve anything close to what they led the public to believe.

That bloggers and journalists came to believe what is patently untrue, that Google can blindly translate between any language pair, can be pinned largely to the deceptive wording of GT’s blogpost and journal article:

This inspired us to ask the following question: Can we translate between a language pair which the system has never seen before? An example of this would be translations between Korean and Japanese where Korean⇄Japanese examples were not shown to the system. Impressively, the answer is yes — it can generate reasonable Korean⇄Japanese translations, even though it has never been taught to do so. To the best of our knowledge, this is the first time this type of transfer learning has worked in Machine Translation.

In the journal article, they speak of “reasonably good” translations from Portuguese to Spanish, “the first demonstration of true multilingual zero-shot translation”, and “a successful example of transfer learning in machine translation, without any additional steps”. In their conclusion, the paragraph that the non-specialist reader is most likely to inspect, they state: “We show that zero-shot translation without explicit bridging is possible, which is the first time to our knowledge that a form of true transfer learning has been shown to work for machine translation. … Our approach has been shown to work reliably in a Google-scale production setting and enables us to scale to a large number of languages quickly”.

Picture 30: “Zero-Sort Recycling” lets people mix all of their recyclable waste in a single bin that is processed by waste engineers behind the scenes. The name “Zero-Shot Translation” implies that similarly little user effort is required to achieve similarly effective results. Photo by author.

To understand what they actually achieved, it is important to look at the data they actually present. For their experiment from Portuguese to Spanish, they report BLEU scores of 31.50 for NMT based directly on parallel data between those two languages, a close 30.91 for a two-step NMT from Portuguese toward English and then English toward Spanish, and 24.75 for their best zero-shot model. That is, the score drops 6.75 points, or 21.4%. This is in comparison to the best BLEU score they mention in the study, 82.50 for translations from Belarusian to Russian, two very close languages, based on direct parallel data (a perfect 100 score does not occur, because even two human translators will rarely choose identical words across a series of translations). A score of 24.75 indicates that Tarzan-quality results were attained in some portion of situations, but is far from demonstrating the achievement of understandable translation. For their experiment from Spanish to Japanese, the BLEU score drops in half, from an already-low 18.00 with bridging through English, to a floor-scraping 9.14 with zero-shot.¹¹¹ Nonetheless, their text states that, “despite the quality drop, this proves that our approach enables zero-shot translation even between unrelated languages”.
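The percentages in the paragraph above can be rechecked with a few lines of arithmetic. The following is a minimal sketch that uses only the BLEU scores quoted in this section; the function name `relative_drop` and the dictionary layout are illustrative choices, not anything taken from the Google paper.

```python
def relative_drop(baseline: float, score: float) -> float:
    """Return the loss from baseline to score as a percentage of baseline."""
    return (baseline - score) / baseline * 100


# BLEU scores as quoted in this section (baseline, zero-shot).
experiments = {
    "Portuguese to Spanish (direct vs zero-shot)": (31.50, 24.75),
    "Spanish to Japanese (bridged vs zero-shot)": (18.00, 9.14),
}

pt_es_drop = relative_drop(*experiments["Portuguese to Spanish (direct vs zero-shot)"])
es_ja_drop = relative_drop(*experiments["Spanish to Japanese (bridged vs zero-shot)"])

for name, (baseline, zero_shot) in experiments.items():
    points = baseline - zero_shot
    percent = relative_drop(baseline, zero_shot)
    print(f"{name}: -{points:.2f} points ({percent:.1f}% lower)")
```

Run as written, the Portuguese to Spanish loss comes out at 6.75 points (about 21.4%), and the Spanish to Japanese score loses roughly 49% of its bridged value, which is the "drops in half" described above.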

Let us repeat: the numbers presented show something indicating a near-zero Tarzan rating, yet the article claims that the translation is “enabled”. For Korean to Japanese, the empirical evidence for “impressively reasonable translations” is {Ø} (the empty set): no numerical or human evaluations are given. Again: the entire premise that “true transfer learning has been shown to work for machine translation”, what the blogpost trumpets as “the success of zero-shot translation” for Korean to Japanese, is based on zero quality analysis – it is remarkable that this section did not get shot down during peer review. Google makes an explicit truth claim, using the word “true”, that is swallowed whole by the media but is not supported with data.

Picture 30.0.1: Word cloud from “Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation” (Wu et al 2019), generated at http://wordclouds.com

Perhaps we are witnessing a cultural clash here. When computer scientists approach MT, their goal is for their computational methods to achieve discernible results. Getting measurable numbers, even the murky ones scored for Spanish to Japanese, indicates “success” that “has been shown to work”, in that the results were better than zero. Google taught their program to compare two languages via an English bridge, then took away the bridge, and the program had learned enough to retain some equivalents. That is an impressive accomplishment, and offers promise for future research. However, it is a long way from “reasonable quality” as a language specialist understands the term, or as an MT consumer would expect. What is being reported is that the operation was a success, though unfortunately the patient died.

The words they use provide a good window for peering into the culture of computer scientists. “Google’s Neural Machine Translation System: Bridging the Gap between Human and Machine Translation” is a pertinent entrepôt, but you would get similar results if you pulled any journal article about MT out of a hat. The word cloud in Picture 30.0.1 highlights the concerns of the authors: BLEU, decoder, encoder, layers, constraints, length, beam, operations, stacked, model, function, wordpiece, output, speed, results, score, scores. Lower-frequency terms that you cannot see in the image but can be found in the text include: inference, normalization, vectors, softmax, n-grams, accumulators, axis, quantized. On the other hand, words that do not appear in the article include: noun, verb, adjective, sense, meaning, polysemy, multiword expression, syntax, grammar. That is to say, an entire major article about translation that will be cited thousands of times contains not one word about language or linguistics – not language in general, much less the inner workings of any given language in the system. The visual evidence in Picture 30.0.1 shows that the focus of getting translation right, for computer science, is a matter of finding a procedure that will lead to the highest BLEU score. Language, in this view, is not only secondary within translation – it is irrelevant.
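The vocabulary check described above is easy to reproduce on any article. The sketch below is a minimal illustration, assuming you have the article text as a plain string; the `SAMPLE_TEXT` here is an invented placeholder sentence, not a quotation from Wu et al., and the two term lists are short excerpts of the lists given in the paragraph above.

```python
import re
from collections import Counter


def term_counts(text: str, terms: list[str]) -> dict[str, int]:
    """Count how often each single-word term occurs in text (case-insensitive)."""
    tokens = re.findall(r"[a-z]+(?:-[a-z]+)*", text.lower())
    counts = Counter(tokens)
    return {term: counts[term] for term in terms}


# Placeholder text standing in for the full article (an assumption for the demo).
SAMPLE_TEXT = (
    "The decoder and encoder layers are stacked, and beam search "
    "improves the BLEU score of the model output."
)

ENGINEERING_TERMS = ["bleu", "decoder", "encoder", "beam", "model"]
LINGUISTIC_TERMS = ["noun", "verb", "syntax", "grammar", "polysemy"]

engineering = term_counts(SAMPLE_TEXT, ENGINEERING_TERMS)
linguistic = term_counts(SAMPLE_TEXT, LINGUISTIC_TERMS)

print("engineering terms:", engineering)
print("linguistic terms: ", linguistic)
```

Multiword items such as “multiword expression” would need a phrase search rather than this token count, which only handles single words.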

Picture 30.1. A real tomato.

An analogy. When I was young, my grandfather grew wonderful tomatoes in his garden in the hills of Vermont. All August long, we would gorge ourselves on sweet, juicy tomatoes fresh off the vine. In February, though, the homegrown tomatoes were long gone. Nor could tomatoes be bought in the supermarket. Then, some plant scientists engineered tomatoes that could survive a 4000 kilometer truck journey from California without rotting, bruising, or smooshing into juice. Now we could buy tomatoes in the middle of the winter! There was just one slight problem: the tomatoes had the taste and texture of styrofoam. The scientists had successfully produced an object that looked like a tomato, which pleased the agricultural conglomerates that sponsored their work because they could sell something in February that looked like a tomato, and pleased the administrators of my school cafeteria because they could put something on the lunch plate that they could check off as a vegetable. These tomatoes were not the tasty treat of Picture 30.1. Despite a common genetic origin, the industrial product was a simulacrum of the fruit that Mesoamericans had cultivated since antiquity.

Similarly, current MT produces a simulacrum of the languages people have cultivated. This artifice can be better or worse depending on the data and models deployed, but at least shares with human translation the fundamentals of finding words and patterns that some entity can directly compare. Zero-shot is more like the styrofoam ball in Picture 30.2, stained red with tomato juice. To food scientists, maybe the juice could be stored within the ball’s cell structure and the styrofoam tomato could be an interesting harbinger of a way to transport tomatoishness to a future Mars colony. To computer scientists who really understand what is going on, normal MT research is about trying to engineer a more palatable winter tomato (though built on the crunchy styrofoam pillars of lei lies and MUSA), while first results for zero-shot are exciting because they foretell decades more research for the linguistic flying car. To translation consumers, MT satisfies some of the cravings of a tomato that masks the potentially carcinogenic chemical compositions they are swallowing, while zero-shot is best considered the plastic window displays that Japanese call “sampuru” (サンプル). The goal of MT research is tomato engineering: output that achieves incremental improvements in the correspondence with the words a human might choose between languages. The goal of MT for consumers of linguistic services is an edible tomato: the successful conveyance of meaning. We can think of it as the styrofoam tomato paradox (STP).

Picture 30.2. A red styrofoam ball that could be made to look like a tomato.

Unfortunately, though, Google’s claims and journalistic amplification leave the public with the impression that zero-shot translation is now a stunning fact. To wit: “Take Google-size data, the flexibility of a neural net, and all (well, most) of the languages of the world, and what you end up with is … Google’s new neural machine translation system, … that with minimal tweaking, … can accommodate many different languages in a single neural net, that… can do a half-decent job of translating between language pairs it’s never been explicitly trained on”.¹¹² We started with a valid claim shown by Google’s experiments with a few languages – that if you have data to learn from between Language A and English, and you also have data between English and Language B, then machines can learn some direct connections between A and B. In a flash, though, we are teleported to the completely false notions that (1) you can pour some data from A and some data from B into a bottle, give it a good shake, and end up with a translation, and (2) GT has done this successfully for most of the world’s languages.

But let us pull away from the popular media, and look also at the secret world of computer science professionals. In agreement with the ideas expressed above, an anonymous reviewer writes, “An idea that neural network researchers (not from the field of natural language processing, but from the mathematical side of machine learning) have marketed is that you can take any raw data, throw it at a neural network (learning method) and you will get a good system.” The reviewer objects, though, to the prevalence of this notion in the subfield of NLP, writing, “It is wrong to assume that NLP researchers (at least the majority) believe or advocate this statement!” All well and good until one reads the second anonymous review for the same article (and the comment that triggered this entire section): “Recent zero-shot NMT approaches learn to translate without using parallel data but just monolingual corpora. I miss a mention to this in the paper.”

To invoke aviation again, misguided journalism has led to the widespread belief that the world’s airplane fleets fly themselves while their pilots hang around to push buttons in case of emergency, where the truth is that pilots will always be necessary to run the show except for the purely mechanical operation of keeping the plane straight and level during cruise.¹¹³ In practice, over-reliance on automation leads to aviation accidents in the real world when pilots cannot override algorithms, the cause of 60% of 26 accidents studied by the FAA over ten years and of hundreds of deaths. In the words of Patrick Smith, the author of the Ask the Pilot blog, “To be clear, I’m not arguing the technological impossibility of a pilotless plane. Certainly we have the capability. Just as we have the capability to be living in domed cities on Mars. But because it’s possible doesn’t mean that it’s affordable, practical, safe, or even desirable. And the technological and logistical challenges are daunting to say the least”.

AI-dominated translation and full automation in aviation are worthwhile aspirational goals, but cartoon dreams of future systems should not be confused with the technologies that currently exist or are likely in the medium term. I am not saying that it will be impossible to someday achieve zero-shot translation based on good input and output models, good data, and neural learning. I am shouting, though, that my research shows that GT has not implemented effective zero-shot translation for any language pair, and their own research shows their probability of success is somewhere in the vector space demarcated by now-defunct Google Answers, Google Glass, Google Wave, and Google+.

Myth 4: Magic Wand Translation¹¹⁴

Picture 31: While Google Maps seems to recognize your thoughts as it types, you actually provide the service with quite a lot of information before it offers your target address.

It’s magical. It’s mind-reading. It’s deceptive. GT and its competitors give the illusion that the computer knows what you are thinking even before you are done thinking it; therefore by the time you’ve stopped typing, they must have arrived at an optimized answer. The fact that the software gives you words in real time does not equate to those being the right words, but it enhances the belief that they must be right. After all, if you do the same thing in Google Maps (GM), the service usually finds the correct location as it homes in on what you type, as shown in Picture 31 in Table 7.

But look closely in Picture 32 at what is actually happening with Maps, in comparison with Translate. When I first start typing “16”, GM starts predicting based on where my web searches or location-enabled amblings located me recently. Add a little bit of the street name and the prediction starts getting wobbly, wandering around nearby cities and countries. A few more letters and GM moves toward the right street name, though still on the wrong continent. The start of the next word ascertains that we are looking in English, for “Street” instead of “Close” or “Road” or “Ridge”. When I finally type the initial letters of the name of the town, the search has been narrowed down to the small range of places that have a matching street name, and I can click confidently for directions to get there (or, in the case shown, an honest assessment that GM “can’t find a way there” from my current location).

While the successful result makes it seem as though GM can read your mind, you can see by slowing down the illusion that the system works because it is actively picking your brain for the requisite information. GM whittles down to the best possible result because (a) the possibilities become increasingly limited as the search string becomes more specific, and (b) the user is actively involved in guiding the program by providing ever-more precise parameters.
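Points (a) and (b) above can be illustrated with a toy prefix search. This is a minimal sketch and not a description of how Google Maps actually works: the address list is invented, and the real service also draws on location and search history, which is why its early guesses wander. The names `ADDRESSES`, `candidates`, and `narrowing` are assumptions made for the example.

```python
# Invented addresses standing in for a database of known, verified places.
ADDRESSES = [
    "16 Main Street, Springfield",
    "16 Main Street, Shelbyville",
    "16 Main Road, Springfield",
    "16 Maple Close, Ogdenville",
    "160 Main Street, Springfield",
    "17 Main Street, Springfield",
]


def candidates(typed: str, known: list[str] = ADDRESSES) -> list[str]:
    """Return every known address that starts with what the user has typed."""
    typed = typed.lower()
    return [address for address in known if address.lower().startswith(typed)]


def narrowing(query: str) -> list[tuple[str, int]]:
    """For each keystroke, report how many known addresses remain possible."""
    return [
        (query[:i], len(candidates(query[:i])))
        for i in range(1, len(query) + 1)
    ]


steps = narrowing("16 Main Street, Sh")
for typed, remaining in steps:
    print(f"{typed!r:24} {remaining} candidate(s)")
```

Each keystroke can only keep or shrink the candidate set, and every candidate is an entry that already exists in the data. The user supplies the constraints; the system never has to invent an answer.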

Picture 32: Google Maps refines its predictions as you type to end up displaying the correct address, often enhanced with local information and imagery. What seems like magic is actually a process of interrogating the user to arrive at a solution from known matching data entities.

The behavior of predictive search within GM conditions us to expect similar magic when we perform a similar typing action within GT. However, this is not the way translation works, in the same way that reflexively deflecting a fist thrown your way will not prepare you to swat off an incoming bullet. With translation, an additional letter can radically alter the meaning, and each additional word in a search string increases the opportunity for ambiguity and consequent error. Take the word “heretically”, and choose any language in GT to watch this performance (I have not tested this on all 102 languages, but a random look at Samoan shows mistakes marching up the ladder, from saying that “h” in English is “f” in Samoan, to problems I could uncover in a dictionary when the display word is “he”, and then “her”, (see Pratt (1984)), and certifiable inventions along the progression to an obviously wrong choice for the final form). Table 8 shows the progression in Swahili.

English letters	Swahili translation (GT)	English reverse translation (human)	Notes
h	h
he	yeye	he
her	yake	her	Correct as a possessive of a Class 4 or 9 noun by an animate actor (eg “her telephone”), incorrect in >95% of uses of “her” (eg “I called her”).
here	hapa	here
heret	tatizo or huwa	trouble or “it normally is”
hereti	hii	this
heretic	kihistoria	historical
heretica	hapatica
heretical	uongo	lie
hereticall	hapaticall
heretically	hapa	here
Table 8: MUSA transformations in GT translation of “heretically” from English to Swahili. GT displays 11 separate proposals, 3 of which could be correct renderings if the thought were complete at that point in the typing, but none of which are remotely in the neighborhood of the final destination.

Start by typing “h”,¹¹⁵ which has no meaning in English but, taking a sample of GT languages, is shown as “h” not only in Swahili but in Armenian, Chinese, Urdu, and Yiddish (all of which have non-Latin character sets), given local characters in Amharic (ሸ), Arabic (ح), Georgian (თ) and Hebrew (ח), and given words in Albanian (orë), Belarusian (гадзіну), Gujarati (એચ) and Japanese (時間). Adding “e” forms “he”, a pronoun that could, in principle, generate changes to other words already typed in the input box. Another stroke and you’ve changed the gender to “her” and made the pronoun possessive. Add an “e” to form the adverb “here”, add “t” and “i” to produce the lei lies “heret” and “hereti” or random guesses (e.g. GT opined once that “heret” is “tatizo” (trouble) in Swahili, and postulated a conjugated form of “to be”, “huwa”, on another attempt) on the target side, add “c” to form the noun “heretic” with the nonsensical translation offering of “kihistoria” (basically “historical”), add “a” to get the verbal hallucination “hapatica”, add “l” to make an English adjective that is translated with the noun “uongo” that means lie and so could conceivably have come from the same vector space, and finally move through the invented “hapaticall” (“hapa” (here) + “ticall” (letters from the English term)) to finish your intended adverb. All the while, GT has been changing its output to keep up with your keystrokes, and inventing fake words or mappings if it cannot locate a known equivalent. Typing a longer phrase such as “I told the heretical priest” causes emerging words to jump around as the system continuously revises its word associations, creates lei lies, and adjusts word order according to the grammatical patterns for the parts of speech it perceives at the instant. We can call these constant shifts “trumplations”: momentary changes in what is presented as reality that may or may not be related to facts on the ground, like negotiating with Jell-O.

Try “she will he… hear… heart… heartily enter… entertain” with any language to watch a small GT gymnastics show. Unlike GM, each successive keystroke does not build on the previous to guide the system toward a more perfect answer. The end results for “heretically” in Samoan and “she will heartily entertain” in Swahili, examples typed into GT with no greater purpose than to witness the machine in action with shape-shifting source terms, are flat out wrong. The likelihood of having an acceptable translation for any given source text accords with the overall scores for the languages in my tests, with absolutely no regard for the incidental characters typed along the way. Yet GT invests enormous processing power in kneading out latency while you type, so that you ultimately feel confident that the final output is the result of the most precise computation.
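The keystroke-by-keystroke progression in Table 8 can be replayed in a few lines. The sketch below is only a bookkeeping aid: the outputs are copied from Table 8 (using “tatizo”, one of the two results GT gave for “heret”), and the function names are assumptions made for the example. It does not call Google Translate.

```python
# GT's Swahili output for each prefix of "heretically", as recorded in Table 8.
TABLE_8 = {
    "h": "h",
    "he": "yeye",
    "her": "yake",
    "here": "hapa",
    "heret": "tatizo",
    "hereti": "hii",
    "heretic": "kihistoria",
    "heretica": "hapatica",
    "heretical": "uongo",
    "hereticall": "hapaticall",
    "heretically": "hapa",
}


def prefixes(word: str) -> list[str]:
    """Return what the input box contains after each keystroke."""
    return [word[:i] for i in range(1, len(word) + 1)]


def output_changes(word: str, outputs: dict[str, str]) -> int:
    """Count keystrokes after the first at which the displayed output changed."""
    shown = [outputs[prefix] for prefix in prefixes(word)]
    return sum(1 for before, after in zip(shown, shown[1:]) if before != after)


keystrokes = prefixes("heretically")
changes = output_changes("heretically", TABLE_8)
print(f"{len(keystrokes)} keystrokes, output replaced {changes} times")
```

In the Maps case each keystroke narrows a fixed set of candidates. Here the displayed output is replaced at every one of the ten keystrokes after the first, and the final answer (“hapa”, here) is no closer to “heretically” than the output for “here” was.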

There is one other important difference between the way Google Maps and Google Translate deliver data. In Picture 31, GM shows the banks, churches, and funeral homes in the neighborhood of the search address. These are known and verified facts. GM would not invent map features and present them as real, as I did in Picture 34. People would scream bloody murder if the map they were relying on invented a zoo, a pizza parlor, a river, and an iconic church. (Although they might be happy to find that there is no river where I drew one, since I forgot to draw a bridge.) GT, on the other hand, has no compunction about putting fictional information on screen, such as “hapaticall”.

When GT does not have adequate data to hypothesize a translation, it invokes the Make Up Stuff Algorithm (MUSA). You can watch MUSA in action as performed by a fake sign language interpreter at Nelson Mandela’s 2013 memorial service. In Picture 35, I wrote a real sentence that mentions the features on the fake map. Without laboriously breaking down the broken-down Swahili output, let’s just state that the only words that are correct are “We can”, “from”, and, because it is an untranslatable name, “Notre-Dame”. The remainder is a combination of (a) wrong word choice (“on our way” becomes “on top of our path”), (b) vocabulary that exists in Swahili but GT does not have (“cathedral”) and thus renders in English, and (c) throwing some English words that do not have Swahili equivalents into the correct Swahili adjective-noun word order (though leaving out the preposition for “of” that should sit between).

Picture 35 is, therefore, the spiritual twin of Picture 34: a made up spattering of seeming facts about things that do not exist in Swahili-speaking Malangali, Tanzania. The difference is that Google has the humility to produce the map that shows the actual extent of their data in Picture 33 rather than inventing features in the physical space, whereas they have the hubris to fill the linguistic space on their screen with froth. The MUSA illusion is rarely subjected to close scrutiny, so users assume that the characters that GT prints to the screen are derived from some legitimately computed linguistic relationship – unless you can read the target language, you are bound to accept that the words they give you are a viable translation.

Picture 33: Actual Google Map for Malangali, Tanzania, original screen capture

Picture 34: Fake Map of Malangali, Tanzania, with a few imaginary details superimposed via Photoshop.

Picture 35: Google Translate from English to Swahili and French for a sentence featuring elements on the Fake Map of Picture 34. The French translation is real GT output, but has been consolidated into this screenshot via Photoshop.

In fact, GT makes stuff up in every language. You can try the “heretically” test with top-scoring languages like Afrikaans and German, watching GT fill in gaps in its data with blather as you type. Importantly, not everything is made up. The French translation in Picture 35 is pretty good: it raises native eyebrows with its handling of “some pizza” (which, unlike in Swahili,¹¹⁶ does exist as such in French), but nicely reduces “on our way from” to “between” and converts the English party term “petting zoo” to the correct French party term that literally translates as “zoo for children”. However, examples throughout this web-book show that French, near the top of the GT charts, has many little fails (40% in the empirical tests), often from MUSA, that can wash translations down the drain.

The animation in Picture 36¹¹⁷ shows MUSA for the twenty languages that had the highest Bard scores in my tests. I gave GT a test phrase that should have returned empty results, “sdrawkcab uoy hcaet”, which is “teach you backwards” spelled backwards. For all twenty languages, GT confidently put forth some text, never suggesting that “sdrawkcab” just might be outside the set of words ever voiced on Planet Earth. Twenty languages, twenty lei lies. For a lucrative language like French, Google’s investments sometimes return splendid results like in Picture 35, sometimes complete fabrications like Picture 36, and usually, algorithmically, something in between. When you watch a magician perform, you know that the wave of a magic wand is often masking an illusion – the magician slips the rabbit into the hat in a way that you cannot see. With MUSA, the output is often the illusion: what looks like a rabbit pulled from a hat is actually a rabbit pulled from a hidden handkerchief, and what looks like Serbian or Indonesian or Dutch is actually “backwards” backwards.

Picture 36: Google translations for the nonsense collection of letters “sdrawkcab uoy hcaet” (teach you backwards, spelled backwards), for the 20 languages that scored at the top in this study’s empirical evaluations (Bard results of 45/100 or greater).
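The nonsense probe is simple to construct, and the alternative to MUSA, declining to answer, is simple to state. The sketch below is a minimal illustration under loud assumptions: `TOY_VOCABULARY` is a three-word stand-in for a real lexicon, and `should_abstain` is a hypothetical policy written for this example, not a feature of GT or of any other MT system.

```python
def reverse_phrase(phrase: str) -> str:
    """Spell a phrase backwards, character by character."""
    return phrase[::-1]


def unknown_tokens(text: str, vocabulary: set[str]) -> list[str]:
    """Return the whitespace-separated tokens that are not in the vocabulary."""
    return [token for token in text.lower().split() if token not in vocabulary]


def should_abstain(text: str, vocabulary: set[str]) -> bool:
    """Hypothetical policy: return no translation if no token is recognized."""
    tokens = text.lower().split()
    return len(unknown_tokens(text, vocabulary)) == len(tokens)


# Toy stand-in for a source-language lexicon (an assumption for the demo).
TOY_VOCABULARY = {"teach", "you", "backwards"}

probe = reverse_phrase("teach you backwards")
print(probe)
print("unknown tokens:", unknown_tokens(probe, TOY_VOCABULARY))
print("abstain on probe:", should_abstain(probe, TOY_VOCABULARY))
print("abstain on original:", should_abstain("teach you backwards", TOY_VOCABULARY))
```

A system following a policy like this would return an empty result for the probe, which is the behavior the test in Picture 36 looked for and did not find in any of the twenty languages.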

Myth 5: Google Translate learns from its users¹¹⁸

Picture 37: Select “Suggest an edit” to open an input box in GT. (“Tutaonana” does not mean “we will see” under any circumstances.)

One reason that people have confidence in GT is that they think its results improve over time via user contributions. They have two reasons to think this. First, GT includes a “suggest an edit” feature (Picture 37 and Picture 38). Second, Google invites people to contribute, and tells them that their contributions will be used to improve translation quality (Picture 39).

In December 2014, I began a test of GT’s “suggest an edit” feature with speakers of 44 languages. The phrase “I will be unavailable tomorrow” was being rendered incorrectly as “I will be available tomorrow” in quite a few languages. My test asked participants to assess the existing translation for their language, and to submit a suggestion if the GT result was inadequate. More than 200 people responded to my online survey that month. It is also possible that some people completed the submission task to GT but did not complete the survey to inform us. No suggestions are known to have been submitted for 7 languages that were rated “very good” or “perfect” by all respondents (Azerbaijani, Belarusian, Danish, Finnish, Norwegian, Swedish, Ukrainian). The remaining 37 languages had at least one submission, even for 4 languages (Afrikaans, Dutch, Romanian, Russian) that were judged highly by most participants. 33 languages were judged “marginal” or “wrong” by all respondents.

Picture 38: Type a suggestion and click “submit” to send an edit to GT. (“Tutaonana” literally means “we will see each other”, and is frequently used in the context of bidding farewell.)

Several of the original translations would have received high BLEU scores, since the only difference from a human translation was a small negation such as לא in Hebrew; for example, wrong Irish scored 64.32. I asked participants to all submit changes to this single phrase so that we would be able to follow our bread crumbs later. In the wild, it would be exceedingly unlikely that more than one GT user would come across any particular mistranslated phrase and then go through the steps necessary to submit an improvement. A few phrases, such as “Happy birthday” and “I love you”, might inspire mass editing, or the oft-tweeted “Happy St. Patrick’s Day” from Irish¹¹⁹ as shown in Picture 40, but I felt safe in owning the submissions for “I will be unavailable tomorrow”. I was also conscious not to “Google bomb” GT with wrong answers, but rather to leave their service better off if they accepted our good-faith edits. The one thing I could not control for was whether GT themselves made manual changes to the phrase, beyond their normal review procedures, once they got wind of the experiment. Because participants were recruited via LinkedIn translation groups¹²⁰ that certainly included members of Google staff, such intervention is a possibility.

Picture 39: “Help improve Google Translate for the languages you speak. Contribute to Translate Community to help people around the world understand your language a little better.” https://translate.google.com/intl/en/about/contribute.html

In December 2018, I ran the phrase again on all 44 languages. Interestingly, 41 languages were different from their 2014 version. For 5 of the changed languages, we did not submit any edits; the changes may come from normal fluctuations as we have seen for other translations tested at different times, or because of new data, or through the transition to NMT. Among 21 languages for which we submitted edits, the 2018 version has changed, but not to anything submitted by study respondents. In 5 cases, we submitted multiple responses and one of those is active in 2018. For 9 languages, we submitted a single change and that is identical to the 2018 GT version. For 1 language, Malayalam, the contributor did not send us the edit they submitted, but GT now has a correct version. Only 1 language, Russian, remained the same although two edits were submitted; it was rated highly by 2/3 of the male respondents but was unacceptable for females because it was grammatically incorrect for them and connoted sexual availability.

Picture 40: GT translation of “Happy St. Patrick’s Day” from Irish to English. (GT will often change its output depending on punctuation and capitalization. Your guess about why is as good as mine.)

14 of the 37 languages for which we submitted an edit, or 38% (and possibly Malayalam), have a 2018 result that matches one of our submissions. I cannot say definitively whether those improvements came from our participants, or whether GT hit upon them independently, nor whether GT made an unusual manual effort to incorporate our data. Submissions did not make it into GT after 4 years for 22 of the 37 languages (59%) to which a human submitted an edit. 2 of those 37 were already deemed passable by most participants, and the new version is judged equally acceptable. 8 of the 22 languages that made changes other than our submissions were unacceptable before, but can now be understood; for example, a new human evaluator was able to give the correct reverse translation for Hindi मैं कल अनुपलब्ध होगा but commented that it was grammatically incorrect, Hungarian mates a singular subject with a plural verb, and Hebrew is correct for a female subject. In many languages, two or more different human translations were submitted by participants and none were accepted. Additionally, in some languages identical translations were submitted by multiple contributors; for instance, five Germans submitted “Ich werde morgen nicht verfügbar sein”, but that suggestion was not accepted, and the active 2018 version has been evaluated as worse than the 2014 version. In short, “suggest an edit” led to an improvement in Google Translate in a maximum of 2/5 of languages, over a 4-year experiment.
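The percentages above can be checked with a short tally. The sketch below only transcribes the counts reported in this section for the 37 languages with at least one submitted edit; the category labels and the `share` helper are my own shorthand, not part of the survey instrument.

```python
# Outcome counts for the 37 languages with at least one submitted edit,
# transcribed from the figures reported in the text (labels are shorthand).
outcomes = {
    "changed, but not to a submitted edit": 21,
    "one of several submitted edits is active": 5,
    "single submitted edit is active": 9,
    "edit not shared with us, now correct (Malayalam)": 1,
    "unchanged despite submitted edits (Russian)": 1,
}

total = sum(outcomes.values())

# Languages whose 2018 result matches a known submission.
matched = (outcomes["one of several submitted edits is active"]
           + outcomes["single submitted edit is active"])

# Languages where no submission made it into GT.
not_adopted = (outcomes["changed, but not to a submitted edit"]
               + outcomes["unchanged despite submitted edits (Russian)"])


def share(count, whole):
    """Return count / whole as a percentage."""
    return 100 * count / whole


print(f"languages with submissions: {total}")
print(f"matched a submission: {matched} ({share(matched, total):.1f}%)")
print(f"submission not adopted: {not_adopted} ({share(not_adopted, total):.1f}%)")
print(f"upper bound including Malayalam: {share(matched + 1, total):.1f}%")
```

Counting Malayalam as a match gives the upper bound of roughly two in five languages cited at the end of the paragraph.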

Stepping back, we can examine why the GT version of crowdsourcing is not, and cannot be, effective. From the more than 36 trillion words they attempt to translate every year, they receive untold millions of suggested edits from people who are betting against long odds that their improvements will be accepted. Google has no knowledge of the linguistic skills of their contributors, so they must either accept the contributions blindly, or they must put them through some test. Validation must come from a human, because the act of submitting an edit is a declaration that GT’s best machine effort has fallen short.

Picture 41: Participation statistics from Zooniverse, a successful crowdsourcing project. http://www.zooniverse.org

Crowdsourcing is an emerging field. Various methods have been reported for eliciting knowledge from members of the public, each with their own benefits and hazards. Among the challenges of crowdsourcing are the problems of motivation, design, and quality. It is not enough to have people click into a crowdsourcing project and complete a few introductory tasks. Participants are scarce, so they must be motivated to return repeatedly in order to make the recruitment effort worthwhile. Once recruited, the tasks they are assigned must be clear and manageable by design – instructions can be misconstrued in the most ingenious ways.¹²¹ Finally, participants might provide bad information, either because they do not have the requisite knowledge or attention to detail, or because they are actively intent on malice.

Motivation can come in a variety of forms. First is financial, usually in the form of per-task micropayments. Micropayments are effective for circumscribed tasks, such as a thousand dollar study that needs 100 respondents to each answer 1000 questions for a penny a pop. An open-ended project with unlimited billions of microtasks would be a financial black hole, and GT therefore does not pay its crowd. Appeals to people’s sense of civic duty, as GT does in Picture 39, can be instrumental in recruiting them to investigate a crowd project, but are not effective in retaining interest over the long term. In the absence of financial reward, retention comes from making a compelling activity with intrinsic rewards, such as seeing one’s efforts result in evident improvements (such as a better Wikipedia page), or winning a game.

Picture 42: Badges and levels for participants in the Google Translate Community

GT makes a nod toward gamification by declaring that participants have reached ever-higher levels, and received some badges of honor, as shown in Picture 42, but such a perfunctory use of badges and levels has been shown to leave users feeling they are on an Escheresque treadmill. Although there is some intrinsic gratification in feeling that one is nudging GT toward improvement, most people are busy and have many other options for their time. Readers of this web-book probably do not spend hours contributing to GT, and probably do not have acquaintances who do so, and neither do the citizens of Tashkent pass their idle time correcting Google’s version of Uzbek. The motivations that Google offers are sufficient to attract an intrepid few for some trials, but cannot be imagined to compel sustained input from the millions of volunteers who would be needed to work through the volumes of edits submitted to GT every day. If GT were motivating people to submit and validate hundreds of millions of translations, a tally such as crowdsourcing projects like Zooniverse provides (see Picture 41) would be somewhere Googleable.

GT’s design of tasks for the community, from the perspective of obtaining meaningful linguistic data, could not pass a scientific review committee. Participants are asked to either provide their own translations of a short snippet, or asked to validate previous translations. In either case, a few words are presented, completely devoid of context. Here are some English phrases that Google asks its volunteers to translate:

Picture 43: Item posed to the “Google Translate Community” for translation from English.

justin is preparing something, I’m sure.
score boosters
neck kisses
god with you
is out
how old are u
waking the demon
skill be with you
not sleep

Here are two examples of items that Google asks its volunteers to translate from a sample language, Swahili, with a best attempt to render the incoherent word jumbles in English. Keep in mind that the starting point is alleged to be Swahili, but these texts are not anything a Swahili speaker would mumble unless they were on heavy meds:

Moja ya vyumba vya kulala wageni: One of rooms of guests to sleep.
Hakuna data zaidi roaming gharama: this could be Rorschached (see Trope 7) as “data roaming has no charges”, or “no more data roaming credits”, were data roaming a Swahili term.

Picture 44: Google Translate Community validation example where the translation is correct in one context and wrong in a plethora of others.

Not all requested terms are silly. “What is your favorite actor’s last name?”, “window dressing”, and “drifted apart” are all the sorts of things that occur with regularity in the normal ebb and flow of daily discourse. The latter two are in fact party terms that should be documented with their own dictionary entries. Without context, though, the average Telugu or Danish speaker can only guess at what “window dressing” might be – and many will guess. Consequently, off-target translations are often embedded with a certification badge as validated short text translations.

Anecdotally, a group of advanced German learners of English thought the verified mistranslation shown in Picture 46 was perfectly okay until the association with birthing was explained to them. “Drift apart” should be translated with nautical terms when speaking of people on rafts, but lexicalized as a phrasal verb that is not translated with the watery English metaphor when speaking of people in relationships. Assuming the participant intuits the meaning of “drift apart” as an emotional term, “drifted apart” can still not be translated to many languages unless it is known whether it is we, you, or they doing the drifting.

In the Swahili validation example in Picture 44, “linaloundwa” is a perfect translation in the phrase, “The car that is made up of a body, wheels, and windows”, but would be wrong in past or future tense, or for the perhaps 95% of nouns that belong to a different “class”, or if the relative pronoun “that” were removed from the source phrase. The open-ended translation task is destined to cement innumerable translations that are only sometimes correct. The ✓/Ⅹ (yes/no) validation task will elicit many “✓” responses that should really be “sometimes”, as well as many conscientious “Ⅹ” responses that should also be “it depends”, because many participants will feel responsible to give a definitive answer. (In my own efforts to do no harm, I used “skip” for ambiguous terms during this trial.) In most cases, there is no way to give GT a certifiably “right” answer, because the design favors all-or-nothing answers in a situational vacuum. The technical term in computer science for bad input that results in bad output is “GIGO” – garbage in, garbage out, in this case a function of a design that is incapable of generating useable data. Even if it were clear that GT users understood their task correctly, the company has apparently still not implemented viable quality control. The process is opaque, but apparently works like this:

1. User 1 offers a translation
2. User 2 validates with ✓/Ⅹ
3. Repeat Step 2 with new users until five ✓s are received¹²²
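The steps above can be sketched as a simple vote counter. This is a minimal illustration under stated assumptions, not Google's implementation: the process is opaque, so the `Candidate` class, the `THRESHOLD` constant, and the rule that Ⅹ votes never block acceptance are all hypothetical, and the translation string is a placeholder rather than a real GT output.

```python
from dataclasses import dataclass

# Assumption: five approvals verify a translation, per the process as reported.
THRESHOLD = 5


@dataclass
class Candidate:
    """A proposed translation and the votes it has collected (hypothetical model)."""
    source: str
    translation: str
    yes_votes: int = 0
    no_votes: int = 0

    @property
    def verified(self) -> bool:
        # Assumption: only approvals are counted; rejections never block acceptance.
        return self.yes_votes >= THRESHOLD


def vote(candidate: Candidate, approve: bool) -> None:
    """Record one yes/no judgment from a new user."""
    if approve:
        candidate.yes_votes += 1
    else:
        candidate.no_votes += 1


# Step 1: a user offers a translation (placeholder text, not a real GT result).
candidate = Candidate("chicken tender", "<off-target translation>")

# Steps 2 and 3: new users judge it until the approval threshold is reached.
judgments = [True, False, False, True, True, False, True, True]
for approve in judgments:
    if candidate.verified:
        break
    vote(candidate, approve)

print(candidate.verified, candidate.yes_votes, candidate.no_votes)
```

Under these assumptions, three rejections out of eight judgments do not prevent the item from being marked as verified, and the all-or-nothing vote leaves no room for “sometimes” or “it depends”.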

Picture 45: A translation that has been “Verified by Translate Community”. The English source phrase is a food item, processed fried chicken morsels. The “verified” German term is a person who takes care of living chickens the way a shepherd takes care of sheep.

Picture 46: A “verified” translation from English to German in GT. The English source phrase is a place for mothers to give birth. The German construction is firmly a mailroom (https://www.linguee.com/german-english/translation/versandraum.html).

It is not unusual to find laugh-out-loud “verified” translations that should be rejected by a high percentage of volunteers, such as those shown in Picture 45 and Picture 46 (animated in Picture 46.1 below). Either the acceptance threshold is lower than stipulated, or five participants misconstrue an answer as correct for a variety of reasons. For example, a basket of Hamburgers and Frankfurters might not have enough familiarity with American haute cuisine to know that a chicken tender is a food rather than a profession. Or, a respondent might feel that an answer is close enough, even if there are small errors.

Or, a critical mass might choose either ✓ or Ⅹ for a polysemic party term where the translation is indeed correct in one context, e.g. if the translation of “run off” is correct in the sense of forcing someone away from their property, which will almost certainly be wrong in the sense of a second election between the top candidates of a first round, or waste that overflows in a flood – the majority is right (while simultaneously being wrong), but in the end the Google system is set to make a binary decision. The same issue occurs if the translation is correct in one register but not another, for example if a translation is correct for a male subject but not a female, and therefore given ✓ by men and Ⅹ by women.

More ominous is the problem of malicious users. For fun or profit, some users will inevitably try to mess with any system that invites public participation. The GT Community system is perfectly designed for maliciousness. A user could spend hours randomly clicking Ⅹ✓Ⅹ✓✓ⅩⅩⅩ✓✓✓Ⅹ✓✓Ⅹ, or could set up multiple “sock puppet” accounts to carefully submit the same bad answers until a translation crossed an acceptance threshold. Methods to evaluate and control for trustworthiness are a topic of advanced study; we have no way of knowing how GT addresses this problem, but Picture 45 and Picture 46 demonstrate that bad verifications come through with some regularity.

Even cases where the community makes excellent verifications can ripple through to only limited improvements in actual translations. For example, type in the colloquial English party term “through thick and thin” and you will see not one, but two, verified colloquial expressions in French that have the same meaning but do not use any words regarding thickness or thinness. However, vary your input a little bit, typing either “through thick and through thin” (the variation used by singer Randy Travis in a song that hit #3 on the country music charts) or a full sentence such as “She stuck by me through thick and thin”, and the system reverts to the word-for-word approach that renders the translation meaningless. (Spanish and Italian also translated the sentiment for the shorter expression, but slipped back to word-for-word in the other cases. I did not test the phrase systematically throughout the roster, but most languages seem to be word-for-word in all instances. You can test by typing “through thick and thin” on the English side of a chosen language, then individually reverse-translating the words to see if you get back “thick” and “thin” in English, as happens, for example, with Kyrgyz.) This demonstrates that Google is able to learn some stock phrases, varying by language, from its Translate Community platform, but it has not learned how to incorporate that information within its neurons.

When a bad translation is accepted by GT, it is ossified as the sole or primary result that will be returned for that phrase far into the future. For example, “Continua” in Italian has been poorly validated as “Keep it going”, which left me befuddled when trying to find the “Next” button on a GT-translated e-commerce site. When a polysemous term is verified with a single translation, all other meanings will perpetually revert to that chosen sense. When a gendered translation is confirmed, half of users will be forever excluded. In the cases where GT is actually changing over time via user contributions (a maximum of 40% according to my tests over four years), those changes often lead to outcomes that are worse because they lock in information that is wrong in at least some contexts. I conclude, through empirical research across 44 languages, that crowd contributions to GT lead to definite improvements in a small percentage of cases, cement bad translations in another percentage of cases, and the majority of user submissions are ignored.

Picture 46.1: Nubjorn Babey GIF from Babycoming GIFs

Qualitative Synopsis¹²³

This chapter has shown how expectations about GT have been created, how public perceptions that GT produces scientifically-based translations in 108 languages have been cemented despite the objective low performance this study has measured, and why many of the myths surrounding GT are demonstrably false. We have been led to expect that GT performs like this:

In truth, for the top third of its languages, GT performs more like this, which is remarkable in many ways but is not mastery of the instrument:

However, for the bottom third of languages within the GT system, the video below hews closer to the results they have achieved. The next chapter looks at why translation is much more difficult than the industry leads us to believe, in an exposition of the linguistic challenges that GT and its competitors will never overcome for most languages, with the approaches on which they currently focus.

The Astounding Mathematics of Machine Translation

Martin Benjamin — Mon, 01 Apr 2019 12:40:41 +0000

The finite limits to how well GT can ever translate

Picture 47: Attaining victory over a human in the constrained space of a Go board is considered one of the greatest feats achieved by computer science. A similar board for languages would have 7000 lines, cut along the diagonal, forming 25 million intersections for every concept, assuming one expression per concept per language. Image credit: Google Deep Mind

What is the context in which people use Google Translate?
What does Google Translate do? Scientific measurements of GT across all its 108 languages.
Why doesn’t Google Translate do much of what it says it does?
Why can’t Google Translate accomplish what it says it does? (You are here)
How could more effective translation be accomplished?
So what? What is wrong with Google Translate not doing what it claims?
Google Translate sometimes gets it right. How should it be used as a helpful tool?

The proudest moment that AI researchers have ever experienced was the defeat of a human master in the game of Go, shown in the 2017 documentary film AlphaGo.¹²⁴ The board has 19 vertical lines and 19 horizontal lines, one color of stone for each player, and one thing to do per turn: put a stone on one of the intersections (see Picture 47). The first play has 361 options. As stones are played, the number of possible plays per turn usually decreases, although the removal of captured stones from the board will increase the available plays. Two rules, one against self capture and one against recursive capture, further reduce the possibilities for play later in the game. In short, the maximum number of choices per move is 361, and that number is much less later in the game. Nevertheless, there are more possible Go games than all the atoms in the known universe.¹²⁵ Crunching the numbers to determine where to play for the best outcome of any given move is beyond the mathematical power of any person or machine – intuition of some sort is essential. Beating a human with AI was an incredible computational achievement. Yet Go is simple in comparison to language.

That comparison does not diminish the magnitude of the accomplishment. There are no limits to the number of thoughts a person might try to express, and any number of ways to express any given thought. Looking back a few sentences, instead of leading with “Crunching the numbers”, I could have said “To crunch the numbers”, or “Number crunching”, or “Performing calculations”, or “Calculations”, or “Raw computation”, or many other linguistic paths to the same sentiment. I could have approached the sentence from a different direction: “It is beyond the mathematical power…”, or “No person or machine could crunch…”. We could certainly list 361 different fully comprehensible ways to express the content of that one sentence in English (but let’s not). Native speakers could equally find 361 formulations of the ideas in the sentence in Hawaiian, 361 in Guarani, 361 in Kinyindu. Scrolling through the most recent 361 tweets on Twitter with “crunch the numbers” barely gets you back to last weekend.

Picture 47.1: A walrus doing crunches via GIPHY. GT translates “crunch the numbers” to Dutch as “crunch de cijfers”, deploying a term used in Dutch for abdominal exercises. Either GT found that equivalence in some parallel data or, alternatively, put forward “crunch” as a lei lie that happens to correspond to a Dutch anglicism of a concept unrelated to number crunching.

And yet, when GT is faced with three words used by every engineer in Mountain View, “crunch the numbers”, it variously produces equivalents in French, Italian, Swahili, Polish, Dutch, and German for bite, crack, hit, break, abdominal exercise, or grind numbers, with no sense of performing mathematical operations. (I did not try this phrase on all 102 GT languages, but zero for six indicates a clear pattern.) We have reduced the problem from 361 possibilities in 7000 languages to seek just one concept equivalent in 6 languages, and ended up “biting” digits. Translation tasks are computationally closer to forecasting the weather a month in advance than playing a static game of Go. This section looks at the underlying factors that make translation so difficult, and why current approaches to MT, as represented by Google Translate, face insurmountably finite limits. I introduce ways to surmount those limits in the next chapter about forthcoming disruptions to the translation industry.

The volume of untreated basic concepts¹²⁶

Picture 47.2. Three official Canadian translations of named entities. The first two items refer to the same entity by full name and acronym respectively in both languages, while the third English term was translated to French in both masculine and feminine forms. The Translation Bureau of Canada employs 1200 language professionals, mostly between English and French, creating the most thorough dataset for terminology for any language pair. Unfortunately, though the data is public, it is not available in a format that can be readily incorporated in MT or other NLP applications.

Let’s review some numbers. In a world where each word was represented by a single concept in one language, and each concept was represented by a unique word, and each language had an expression for that concept, translation would start with a simple mechanical task of matching vocabulary. That is, every language is going to have a word for the concept [mouth], and if that concept is represented by a single word such as “mouth”, and the word “mouth” only represents [mouth] and not any other idea like “the end of a river”, then we just need the data and we’re good to go. As it is, we only have parallel data in digital form for some thousands of concepts in some dozens of languages, so our initial condition falls about 6950 terms short of the 7000 needed to translate the single concept of [mouth] around the world, and 6950 terms short for each of the other thousands of concepts that have been aligned through the Global WordNet¹²⁷ – hold this thought, because we will return to data shortfalls in the discussion of disruptive approaches to translation. The Princeton WordNet for English has almost 118,000 synsets (clusters of expressions with similar semantic values, like “joint”, “reefer”, and “spliff”), while Wiktionary has more than 948,000 definitions,¹²⁸ so immediately we can identify many hundreds of thousands of concepts that have not been aligned multilingually in any useful way,¹²⁹ with hundreds of thousands of unique expressions in English alone.

Looking at specialized vocabularies, such as scientific terminology, brings the number of concepts into the millions. Adding in “named entities” such as the names of people, places, or organizations (see Picture 47.2) that might appear in any newspaper article adds millions more base concepts in English alone. GT has some named entities in its database that it handles correctly, such as properly maintaining “Grand Rapids”, a city in Michigan, as “Grand Rapids” across many (but not all) languages – for example, transliterating it to Japanese as グランドラピッズ (Gurandorapizzu) – but there are many millions of place names that are outside of its catalogue. Sometimes GT handles uncatalogued names correctly through the clues of capitalization, and sometimes as a by-product of MUSA. Picture 60 shows a real-world example where the failure of MT to recognize a named entity, a town in Germany, could have resulted in serious loss. Many concepts can have multiple forms, such as the 112 English versions of the name of Libya’s dearly departed ruler that expand to 413 forms in a multilingual news corpus,¹³⁰ or the conjugated forms of verbs (such as see, sees, seen, saw, and seeing).

Picture 47.3: My 9-year-old described a bottle like the one pictured above as “undecrushable”, using the rules of agglutination to compose a unique but valid English word. In some languages, such on-the-fly compositions are intrinsic to almost every sentence. (Undecrushable does not occur in Google search results as of this posting on 17 Sept. 2019.)

It would not be unreasonable to estimate that English has 10,000,000 discrete expressions that occur somewhere in print. German is more complicated, because it compounds nouns together such that “Autobahnmarkierungsentfernungskomitee” (a committee that is responsible for removing the road marks on German highways), or any other amalgamated concept, becomes a single word. Many other languages, particularly the 400 Bantu languages spread across central and southern Africa, have even more complicated agglutination patterns that can result in hundreds of millions of complete sentences smooshed inside a single word. In our starting point of a one-to-one-to-7000 world, the number of unique human expressions is immense beyond calculation.

Lexical gaps – concepts without direct translations¹³¹

Subtraction adds complication to translation. Not all languages have expressions for all concepts, but ways of expressing those concepts must still be produced when translating from the languages where the ideas arise. For example, most African languages do not have a term for “winter” because most of the continent does not experience that season as such, but any African might have occasion to read a newspaper article about winter occurring elsewhere. GT proposes “baridi” (cold) from English to Swahili; ripped from the headlines, the first Google News result for “winter” at the moment of writing is “Winter storm could make trip home from Thanksgiving hazardous for KC-area travelers”, translated to Swahili as “Dhoruba ya baridi inaweza kufanya safari nyumbani kutoka kwa Hatari ya Shukrani kwa wasafiri wa eneo la KC”, with “dhoruba ya baridi”, literally “storm of cold”, leading the incomprehensible word salad (trust me on this) that follows. Each language has its own trove of unique concepts, so “lexical gaps” occur throughout the translation matrix; e.g., the “kanga” printed fabric that a Swahili-speaking woman would wear to a wedding becomes a “knife” or a “cord” when GT is asked to provide a term in English or Spanish, and a “place” in French. Fewer words, or one-to-zero relationships, mean more work for MT. Either previous translation attempts to mind the gaps must be found from parallel corpora, or knowledgeable bilingual people must be asked to provide explanatory terms that can be called upon in the target language whenever the unfamiliar concept arises in the source text. For languages that have parallel corpora, the chances of finding a consistent go-to translation are fleeting, as seen in Picture 24. When the corpora fail for GT, they use a technique called MUSA to simply make stuff up: “baridi” as “winter” becomes a “fact” because, if they used NMT as stated, their processor found a word that appears frequently in the same vector space and, hey, who is going to tell Google they’re wrong? For the supermajority of language pairs between which parallel corpora will never exist, even fake data is not an option. The Kamusi solution is to collect and validate such data directly from linguistic communities.

Semantic drift – concepts that do not fully correspond¹³²

Figure 3: Partial overlap among concepts in different languages

Fractions, known as “semantic drift”, also muddy the math. In many Bantu languages, the upper limb is a single concept, e.g. “mkono” in Swahili. However, in most European languages, that limb is two different things, a hand and an arm (unless you lose your arm, in which case the hand goes with it). In GT, “Mkono wake uliumwa karibu na bega” (literally “[His or her] upper limb hurt near the shoulder”; a human translator would intuit gender and “arm” from context) is converted to “His hand was bent around the shoulder” (BLEU = 14.54) – GT provides a guess in the guise of a fact, despite the clue “shoulder” that was intended to keep the translation from drifting over a waterfall. Sheep and goats are different animals in both English and Swahili, but breeds of one animal, 羊, in Chinese, leading to translation confusion during the Year of the 羊. GT renders 羊 as “sheep” in English, and consequently the equivalents of that term in Swahili, French, Romanian, and down the line. With about a billion domesticated goats herded around the planet, a mistranslation could have substantial economic ramifications (no pun regarding male sheep intended) for the international meat trade. Fractional overlaps between concepts, or one-to-(less-than-one), exist between all language pairs, for innumerable concepts, in unpredictable ways. As with lexical gaps, I propose that codification of human semantic knowledge is the route toward true facts that can be used in MT.

Polysemy – words with multiple meanings¹³³

Figure 4: The multiplication effect of polysemy in direct translation.

Multiplication begins with “polysemy”, the ability of one expression to have more than one meaning. Wiktionary has 38 senses for “out”, the forty-third most common English word, for example, and wordreference.com lists 42 primary senses and 1637 concepts composed with the word, from “act out” to “zoom out”. Daffodils that are “out” in English are “en fleur” in French, while an “out” homosexual is “ouvertement gay”, expired time that is “out” is “terminé” or “fini”, a film that is “out” in the theaters is “sorti”, etc.¹³⁴ Figure 4 shows how translation possibilities multiply, though the math gets messy if some senses happen to share an expression when translated, such as a worker being “out” for the day and a tennis ball being “out” of bounds both having equivalents in Spanish as “fuera”. Picture 48 shows an instance from Google where “shingle” is defined with 4 noun senses and 2 verb senses, questionably translated but not aligned to French with 3 noun terms and 3 verb terms, with items highlighted in green showing a term that is polysemous in French and the items in red showing an English sense with multiple French translations.

Were each polysemous sense of an expression in Language A to match to one and only one unique expression in Language B, then “out” in English would have on the order of 40 x 7000 translations (that is, 280,000, more than a quarter million translations for one three-letter English word) before considering 1600 x 7000 (=11,200,000) composed forms. In the absence of further information to influence statistical or neural choices, discounting party terms, and assuming that translation data was available for all senses, MT would have a 1/40 chance of matching “out” to the correct term in Language B. You can try prompting GT with senses that NMT should pick out due to obvious collocations, and you might get the right result – in French, for example, “The film came out in October” correctly uses “sorti”, but “The homosexual came out” and “The sun came out” also both erroneously use “sorti”. My tests of GT involved 20 expressions in which the sense of “out” should be evident from word embeddings or other context, in all of the 107 languages they translate toward from English.
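The arithmetic in the preceding paragraph can be reproduced in a few lines. This is only a sketch of the idealized one-sense-to-one-term scenario described above; the constant and function names are mine, and the rounded inputs (40 senses, 1600 composed forms, 7000 languages) are the ones used in the text.

```python
# Rounded figures taken from the text; names are illustrative only.
LANGUAGES = 7000        # approximate number of human languages
SENSES_OF_OUT = 40      # order of magnitude of senses for "out"
COMPOSED_FORMS = 1600   # rounded count of expressions composed with "out"


def translations_needed(senses: int, languages: int = LANGUAGES) -> int:
    """One unique target term per sense per language (idealized scenario)."""
    return senses * languages


def blind_guess_probability(senses: int) -> float:
    """Chance of picking the right sense with no context to guide the choice."""
    return 1 / senses


print(translations_needed(SENSES_OF_OUT))      # 280000
print(translations_needed(COMPOSED_FORMS))     # 11200000
print(blind_guess_probability(SENSES_OF_OUT))  # 0.025
```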

An additional round of multiplication occurs when “translation” steps through a pivot language, as shown in Figure 5. GT passes through English in almost all of the 5151 language pairs they claim to support. All else being equal, an expression that has three senses in Language A with unique translations to English, with each of those 3 English words also being polysemous with three senses that each have unique translations to Language C, would result in a 1/9 chance of being translated correctly from A to C. For example, “galets” in French can correspond to the first sense of “shingle” in Picture 48 (pebbles on a beach), or refer to small objects in the shape of “discs” (round rice crackers are “galettes de riz”), or a car part called a “tensioner”, but, following a different thread through polysemous “shingle” in English, GT erroneously produces a term for “roofing tile” from French to German, Polish, Portuguese, and many more.
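A short sketch makes the pivot arithmetic concrete. It assumes, as the paragraph does, that each sense is equally likely and that nothing in the context helps the choice; the function names are hypothetical. The pair-count helper uses the standard formula for unordered pairs, n(n-1)/2, which yields 5151 for 102 languages and roughly 25 million for 7000.

```python
from fractions import Fraction


def language_pairs(n: int) -> int:
    """Number of unordered pairs among n languages: n * (n - 1) / 2."""
    return n * (n - 1) // 2


def pivot_success_probability(senses_in_source: int, senses_in_pivot: int) -> Fraction:
    """Chance that two independent blind sense choices (A to pivot, pivot to C) are both right."""
    return Fraction(1, senses_in_source) * Fraction(1, senses_in_pivot)


print(language_pairs(102))              # 5151
print(language_pairs(7000))             # 24496500
print(pivot_success_probability(3, 3))  # 1/9
```

Exact fractions are used instead of floats so the result can be compared directly with the 1/9 figure in the text.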

Picture 48: Google dictionary and French translations for “shingle”. Matching color overlay shows correspondences between English senses and proposed French terms.

According to my count, the top 100 words in the two billion word Oxford English Corpus, which make up about half of all written English,¹³⁵ average more than 15 senses apiece in Wiktionary. One-to-many multiplication occurs much more often than not in direct translation, and two-step translations are fraught with one-to-many-to-(many-more) relationships. When faced with polysemy, GT makes a choice, with success rates shown in my empirical tests, and, despite failing 50% or more of the time in 2/3 of languages, presents the result as computed truth.

Figure 5: The multiplication effects of polysemy in translation through a pivot language.

Mandarin Chinese, the world’s most spoken language, has a special wrinkle: tones. With four main tones and one neutral one, the same word can be pronounced in five different ways, with five different meanings. The complexity of Chinese to English translation, among the highest pairs in demand, thus takes polysemy and multiplies by five. There is no telling which of these 35 Chinese translation fails might have come from GT, but polysemy is clearly to blame in some cases, so you should beware of missing foot, as well as poisonous and evil rubbish. And please don’t be edible.

Party terms (or multiword expressions) – words that play together¹³⁶

Picture 48.1: English party terms beginning with “out” and containing 9 additional letters, collected from millions of crossword clues. The wordplays.com corpus can be used to identify almost all party terms up to 21 characters beginning with, ending with, or containing a key word.

The math becomes exponentially more complicated by the preponderance of “party terms” (a.k.a. “multiword expressions” or “MWEs”)¹³⁷ throughout human languages. Party terms, which often do not appear in dictionaries, are expressions of two or more words that take on meanings when they dance together that cannot be derived from the sum of their parts. “Chicken sandwich” is not a party term because a person, and maybe MT, can easily discern what it is by looking at the definitions of “chicken” and “sandwich” independently. “Afternoon sandwich” is not even a thing, although the two words frequently appear together in the corpus. “Shit sandwich” and the named entity “South Sandwich Islands”, however, are party terms, because you could never figure them out by inspecting their component parts. Party terms can have two words, e.g. “run out”, or several, e.g. “run out the clock”. English exacerbates the problem with thousands of phrasal verbs, e.g. these distinct actions involving r-u-n that have nothing to do with moving forward quickly by foot: run about, run across, run after, run against, run along, run around, run away, run back, run by, run down, run for, run into, run low, run off, run on, run out of, run over, run past, run through, run up, run with. Much technical terminology consists of party terms, such as an airplane’s “service ceiling”. Idiomatic expressions, such as “feel under the weather”, are inherently party terms (and even technical terms can be idiomatic, such as “pie chart” or “server farm”). MT must evaluate whether to translate “service” and “ceiling” separately, or identify the words as a unit that should be treated together – a task that multiplies with each additional word in an expression, such as five characters in Chinese, 南书房行走, that refer to an intellectual assistant to emperors of the Qing Dynasty (akin to the contemporary party term “White House policy advisor”), invented by GT meaninglessly using MUSA as “South study walking”.¹³⁸

Party terms can also be polysemous, e.g. you can “run out” of milk and then “run out” to the store on your bike to get more. Furthermore, a concept can be expressed as a party term in one language but a single word in another language, such as “pie chart” in English being the differently idiomatic “camembert” in French (several-to-one or one-to-several without polysemy, several-to-many if the source term is polysemous), or using party terms in both languages (several-to-several), such as an Italian human translator’s rendition of “when pigs fly” as “il 31 febbraio” (the 31^st of February, i.e. never).

Detecting party terms within each language’s uniquely nebulous corpus is a major ongoing endeavor among NLP researchers. Automated corpus scraping can deliver candidate terms for human eyeballs to evaluate. Picture 48.1 shows that a manageable amount of labor could identify nearly every English party term containing “out”, ascribe definitions to each sense of those terms, and offer the terms to bilinguals to provide certifiable translation equivalents in other languages. Computers can help discover party terms, but only human-centric natural intelligence methods, which GT nods toward but bungles badly, can actually seal the deal – AI cannot tell a categorical difference between an afternoon sandwich and a shit sandwich, much less determine a translation for the latter term in even a single language. Such mining techniques are outright impossible for the 6900 languages without corpora.

Picture 49: A literal interpretation of the party term “drive up the wall”. Credit

The mathematics of party terms causes stack overflow errors when the components move away from each other on the dance floor, what computer scientists call “discontiguity” or “discontinuity” and linguists call “separability”. A party term such as “drive up the wall” actually requires a split: she drives [someone] up the wall. It is difficult for MT to connect “drive” with “up the wall” if the item in the middle is “me”, and doubly difficult if the item is two words, e.g. “my sister” or “many people”, and the difficulty gets greater as the distance expands; “she drove [almost everyone she worked closely with in her last really good job at the old aircraft assembly factory up the block by the river] up the wall” has a separation of 24 words. Traditional MT does not even attempt to bridge distances of more than a chosen “discontinuity parameter” of a few words, with precision deteriorating drastically beyond a separation of three words. NMT, which looks for words that float around each other, is theoretically better geared toward discovering separated party terms by compressing all the necessary information in a source sentence into a fixed length vector, though Klyueva et al. report that a separation of one unit resulted in failure for Romanian, and Gharbieh et al. do not report success in bridging gaps. To wit, “she drove me up the wall” in GT brings translations pertaining to motoring on vertical surfaces, as depicted in Picture 49. Finding sensible translations for party terms separated over a short distance, sev-(three-at-most)-veral-to-something, is a hard nut that MT has mostly not learned to crack, while dividing a party term by longer distances, sev-(four-words-or-more)-veral-to-something has the same general effect as dividing by zero.
For some party terms that GT has identified as such in English, it takes the same approach as Kamusi proposes and uses the first word as a trigger to look downstream in a sentence for the term’s resolution before driving toward a translation. However, in the preponderance of cases, GT takes party terms literally. A piece of cake is a portion of baked confectionery, no matter the context.
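The “trigger word, then look downstream” idea and the role of a discontinuity parameter can be illustrated with a toy matcher. This is a minimal sketch, not the method used by GT, Kamusi, or any cited system: the function name, the hand-written list of inflected forms, and the default gap of three words are all assumptions made for illustration.

```python
# Inflected forms that can trigger the party term (hand-listed for this sketch).
DRIVE_FORMS = {"drive", "drives", "drove", "driven", "driving"}
TAIL = "up the wall".split()


def find_party_term(tokens, head_forms, tail, max_gap=3):
    """Return (head_index, gap_size) for each head that resolves to `tail`.

    `max_gap` plays the role of the discontinuity parameter: the search
    gives up if more than `max_gap` words separate head and tail.
    """
    hits = []
    for i, token in enumerate(tokens):
        if token not in head_forms:
            continue
        last_start = min(i + 1 + max_gap, len(tokens) - len(tail))
        for j in range(i + 1, last_start + 1):
            if tokens[j:j + len(tail)] == tail:
                hits.append((i, j - i - 1))
                break
    return hits


short = "she drove me up the wall".split()
gap_words = ("almost everyone she worked closely with in her last really good "
             "job at the old aircraft assembly factory up the block by the river").split()
long_sentence = ["she", "drove"] + gap_words + TAIL

print(find_party_term(short, DRIVE_FORMS, TAIL))                      # gap of one word
print(find_party_term(long_sentence, DRIVE_FORMS, TAIL))              # nothing within three words
print(find_party_term(long_sentence, DRIVE_FORMS, TAIL, max_gap=24))  # found only with a wide window
```

Widening the window finds the long-distance case here only because the tail is matched as an exact string; in real text a wide window also multiplies false matches, which is one reason the parameter is usually kept small.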

Morphology – words that shift shape¹³⁹

Inflections have repercussions for MT that go well beyond the proliferation of word forms. A translation program must recognize that “sees”, “saw”, “seen”, and “seeing” could all be tied to the canonical form “see”. A sentence such as “I see your point” already requires determining that “see” refers to understanding instead of vision. With inflections, the same determination must be made for “I saw your point”, “She sees your point”, “Your point was seen by all”, and “I’m not seeing your point”. English is a relative piece of cake in this regard. Arabic nouns can have six inflected forms, French verbs hover around 96, and while Bantu adjectives have a polite dozen forms or so, their verbs can reach close to a billion. Inflected forms, many of which are unseen or too sparse for inferences in the training data on either or both sides, must be decomposed from the source language and composed afresh in any target language, resulting in a “combinatorial explosion”.

Inferring meaning for each different inflected form would compound the difficulty of the problem – that is, discovering that “see” can mean “understand”, and then discovering that “saw” can mean “understand”, and then discovering the same fact nugget for “seen” and “seeing”. It is simpler to discover the morphological variations that map to “see,” and then discover the meanings that map to that lemma, and use that combination to find equivalent terms and their matching inflections in the target language. In fact, this appears to be the approach that GT takes. Changing the subject, tense and object in simple sentences does not usually change the surrounding vocabulary. Though not tested exhaustively, sentences like “I see the film” and “She saw the film” conjugated correctly in sample languages (although GT selected the wrong “see” in all cases), indicating that GT is applying linguistic rules rather than brute force searches of the corpus.

In some cases, inflected forms can add to the polysemy challenge: consider “I saw the wood”, which can involve present tense carpentry or past tense vision, or “seeing” as a noun (“Travel is more than the seeing of sights”) or adjective (“seeing eye dog”). Furthermore, many party terms can be inflected, and many of those can be separated: drive/ drove/ drives/ driven/ driving [somebody] up the wall. GT finds the correct vocabulary for “drive me crazy” in some sample languages, and conjugates it correctly in those test cases when it does. However, while the program correctly conjugates the motoring sense of “drive” in sample languages, its use in “drive up the wall” is nonsensical. On the basis of observation, not measurement, it seems that GT has separated the mapping of inflections to their canonical forms from mapping meanings across languages. That is, converting “drove” to “drive” is a different process in GT than determining whether “drive” refers to steering a car, leading an organization, or hitting a golf ball.

Picture 49.1: Inflected Swahili verb that does not appear in any documents. This example is somewhat absurd, but was constructed to demonstrate one of the many millions of forms of any verb that could roll off the tongue of a speaker of a Bantu language. The root verb is “sukuma”, meaning “to push”, and the inflected elements are tu+me+sha+m+sukum+ish+ia+na, translatable to English as “we have already been made to push each other on his/her behalf”.

For roughly 400 Bantu languages, an entire sentence can occur within the morphological manipulations of a single verb. For Bantu languages, the principle of agglutination that lets German compound any set of nouns willy nilly is utilized for bonding together many different grammatical elements, such as subject, tense, object, negativity, passivity, reciprocity, and much more. Picture 49.1 shows Bantu inflection in action. More detail that would boggle the brains of non-speakers is available in Kiswahili Grammar Notes, but the strict set of rules governing how people inflect verbs is internally consistent, describable, and perfectly suited for computer code – in fact, my team has built a parser for Swahili that can tear apart any verb into its component parts. Each verb in Kinyarwanda (spoken by more than 10 million people) has about 900,000,000 legitimate forms that could appear in print – were Kinyarwanda limited to 1100 verbs, the language would already have one trillion combinatorial possibilities. Correct morphological reconstruction of who drove what and when in GT’s target languages involves 108 different models that I did not evaluate per se. Nor did I evaluate conversions of inflected forms from other languages to the often simpler morphology of English. While I do not have a basis to comment empirically on the success rate of rendering inflections across languages, I can note that languages with higher Bard scores in my tests are probably doing it better.
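To show why rule-governed inflection is “perfectly suited for computer code”, here is a toy segmenter that recovers the pieces of the verb in Picture 49.1. It is not my team’s parser: the slot inventories are tiny hand-picked assumptions, the slot labels are rough and follow the caption’s surface segmentation, and a real parser would need the full grammar, including noun classes, negation, and relatives.

```python
import re

# Each named group is one slot in the verb; optional slots end with "?".
VERB = re.compile(
    r"^(?P<subject>ni|u|a|tu|m|wa)"   # subject marker, e.g. tu = we
    r"(?P<tense>na|li|ta|me)"         # tense/aspect, e.g. me = perfect
    r"(?P<already>sha)?"              # "already"
    r"(?P<object>ni|ku|m|tu|wa)?"     # object marker, e.g. m = him/her
    r"(?P<root>[a-z]+?)"              # verb root, matched lazily
    r"(?P<causative>ish)?"            # "be made to"
    r"(?P<applicative>ia)?"           # "on behalf of"
    r"(?P<reciprocal>na)?$"           # "each other"
)


def segment(verb: str) -> list[str]:
    """Split an inflected verb into the slots this toy grammar knows about."""
    match = VERB.match(verb)
    if match is None:
        raise ValueError(f"not covered by this toy grammar: {verb}")
    return [part for part in match.groups() if part]


print("+".join(segment("tumeshamsukumishiana")))
```

Because every slot draws from a closed list, the number of possible forms is the product of the slot sizes, which is how a single verb reaches hundreds of millions of legitimate forms.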

Categories – gender, class, register, and other ways people frame their world¹⁴⁰

Languages often add mental categories that inject further multipliers.

Picture 50: GT translation to French in 2018 of the first sentence of the first Google Search result, from Wikipedia, for the female Prime Minister of the UK, using entirely male vocabulary and agreements. As of 2023, this sentence now translates with the correct gender. We have no way to know whether the correction is the result of new training data, new coding to recognize gender assumptions, or manual intervention by Google, perhaps as a result of this exposure in TYB.

Gender is a non-issue in Swahili (other than a few things where the man is voiced as active and the woman as passive, like “marry” and “be married”, and the act of sex itself) and a relatively minor concern in English (with distinctions mostly regarding third person singular pronouns and some nouns such as waiter/ waitress). However, gender can carry linguistic loads far beyond sex designation in many languages, determining things like verb and adjective forms to agree with any noun. GT tends to default toward masculine constructions, except for instances such as “nurse” where the training data maps toward feminine correspondences – as of February 2022, using GT to translate “un infirmier” (a male nurse, unambiguously) from French begets “una enfermera” (a female nurse, unambiguously) in Spanish, and the same phenomenon occurs for Portuguese, Italian, and … you get the point. Wellner and Rothman raise the question of whether bias lies in the training data (e.g., all 46 US presidents have been male, so linguistic data about US presidents in the past and present tense will always attach to male pronouns) or in AI algorithms that jump to gendered assumptions based on patterns the machine teaches itself (e.g., firefighters appear in the training data as both strong and male, so a dentist who is described as strong is by association identified by AI as male). Certain cases have either been resolved by a mass of training data or by rules or by manual intervention; for example, news from the UK in 2018 was sometimes translated to French with the correct gender of the then-current Prime Minister, but she is rendered as “el primer ministro” (masculine) in Spanish, and the female Prime Minister of Barbados is given gender reassignment surgery in French as well.

Gender experiment: ¹⁴¹ A small test, conducted several months after their blog announced efforts to address and reduce gender bias in GT, provides inconclusive results. Starting with a snippet from a news article, “the singer Cindi Lauper said she would not cancel her concert”, you can change the name of the singer, or the “she”, or the “her”, to see whether GT renders “chanteur” or “chanteuse” in French. Results are shown in Table 9. Sometimes the gender is shown correctly just by the name, even if both pronouns are wrong on the English side, such as “the singer Madonna said he would not cancel his concert”. However, Cher requires agreement with the first pronoun, and Bono requires agreement with both. Inserting a generic first name has mixed results; Mary is adamantly “la chanteuse” regardless of the pronouns, whereas Martin becomes feminine if either of the pronouns is feminine. Both pronouns must be masculine in order to produce “le chanteur” if no name is given. “The singer John” is masculine even with two feminine pronouns, but “the singer John Lennon” is feminine with two feminine pronouns, and masculine if either pronoun is masculine. “The singer from the Backstreet Boys” changes based on the first pronoun. Results are similar but different in Spanish; for example, Neil Diamond is always a man in Spanish, but needs two masculine pronouns to be a man in French, while “Neil” on its own is masculine if either pronoun is masculine in Spanish and feminine if either pronoun is feminine in French. The fact that gender designation can wobble around either side of the pronoun divide indicates a statistical coin toss, with weight given to some sort of evidence – either Madonna is statistically always female in the corpus and John always male, or GT has tables that indicate gender for certain names. 
The fact that many current female heads of state are given the correct gender in some circumstances could indicate that GT has a table of gender for some important cases, or it could mean that their corpus is quite up to date. Without solid evidence, I am inclined toward the latter, given (a) appropriate designations of “la première ministre” and “la présidente” for female leaders in cases such as Romania and Taiwan who appear several million times in Google Search results, versus the mishandling of Namibia’s longer-serving prime minister Saara Kuugongelwa, who has around 40,000 hits, (b) the teetering behavior in the singer experiment that indicates constant statistical decisions, and (c) the egg-on-face translation in Picture 50 that Google would rush to patch were it to appear in the news, and can most likely be explained by 200 years of predominantly masculine prime ministers muscling out a feminine “Theresa” in the corpus computations.

Subject in test phrase “the singer [Subject] said [pronoun] would not cancel [pronoun’s] concert”	he / his	he / her	she / his	she / her
Cindi Lauper [French]
Cindi Lauper [Spanish]
Cindi [French]
Cindi [Spanish]
Madonna [French]
Madonna [Spanish]
Cher [French]
Cher [Spanish]
Bono [French]
Bono [Spanish]
Sting [French]
Sting [Spanish]
Mary [French]
Mary [Spanish]
Martin [French]
Martin [Spanish]
[no name] [French]
[no name] [Spanish]
John [French]
John [Spanish]
John Lennon [French]
John Lennon [Spanish]
from the Backstreet Boys [French]
from the Backstreet Boys [Spanish]
Neil [French]
Neil [Spanish]
Neil Diamond [French]
Neil Diamond [Spanish]
Table 9: GT selection of gender from English to French and Spanish based on changes of name and pronouns

Noun classes in Bantu languages have similar multiplication effects. In Swahili, for example, the markers attached to verbs to indicate subject, direct object, indirect object, and relative object, will change based on the class of each of those. (The verb morphology also changes based on time, conditionality, negativity, subjunctivity, passivity, causativity, and a few other factors.) So various bits will change depending on whether the noun in question is me, us, singular you, plural you, class 1 (e.g. a pilot), class 2 (pilots), class 3 (a tree), class 4 (trees), class 5 (a car), class 6 (cars), class 7 (a knife), class 8 (knives), class 9 (an airplane), class 10 (airplanes), class 11/14 (beauty), class 16 (at a specific location), class 17 (in a general location), or class 18 (inside something else). To summarize extensive testing that a Swahili speaker could quickly replicate and would drive a non-speaker up the wall, GT handles noun classes terribly. Other languages around the world have their own multipliers that are similarly outside the capacities of GT.

A final multiplier that spans many languages is register, or level of formality. The distinction could be as basic as Spanish “tú” for people you are friendly with versus “usted” for your superiors or people you do not know well.

Register experiment:¹⁴² I performed a test of 20 translations from English to Spanish, shown in Table 10.

To sum: A human would have used “usted” in 9 cases, while GT chose that register for three of those, or 67% failure. The 4 cases that should have been “tú” were 100% correct. 2 of the 4 cases (50%) that should unambiguously have had the formal plural “ustedes” instead received “tú”. 2 of the 3 cases that should have had the informal plural “vosotros” in Spain and the sole usual plural “ustedes” in Mexico used “tú”, while the other used “ustedes”, for a failure of 67% or 100% depending on region. The overall success was 45% for Spain and 50% for Mexico.

All of the failures were for sentences where the formal register was expected, but the familiar register was delivered – that is, GT leans to the informal register for Spanish, making it extremely risky for translating any document where formality is required, such as business communication or a letter to a government agency. On the same test set from English to French, GT got all 4 “tu” constructions correct, and 6 of the 9 “vous” formal singulars; the test was not configured for French, though, because that language elides informal and formal plural people into “vous”. Nevertheless, one is cautioned against using GT to arrest someone or call them a dirty name in French, because the service will reverse the tone of the message.

Test sentence	GT	Expected	Result
You may select among three options	usted	usted	correct
You are quite kind, my love	tú	tú	correct
You are quite kind, Mr. Menendez	usted	usted	correct
I want you to lie quietly on the floor while I handcuff you and bring you to the station	tú	usted	wrong
You must pay your bill by tomorrow	tú	usted	wrong
You all must pay your bills by tomorrow	ustedes	ustedes	correct
You must go to the first window on the right	tú	usted	wrong
I want you to see a specialist	tú	usted	wrong
I want you to wear a condom	tú	tú	correct
You are an asshole, motherfucker	tú	usted	wrong
You may start your engines	tú	ustedes	wrong
You may contact the head office	usted	usted	correct
I have given you a large table to fit the entire group	tú	ustedes	wrong
All passengers please go to the gate for immediate boarding	ustedes	ustedes	correct
Children, you make me laugh	tú	vosotros/ ustedes	wrong
You all make me laugh, friends	ustedes	vosotros/ ustedes	wrong (Spain) / correct (Mexico)
Your teamwork made me proud	tú	vosotros/ ustedes	wrong
You must know I love you	tú	tú	correct
I need to prescribe you new glasses	tú	usted	wrong
You have beautiful eyes	tú	tú	correct
Table 10: Register produced by GT from English to Spanish for 20 sample sentences.
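The success rates quoted for the register experiment can be recomputed from the GT and Expected columns of Table 10. The sketch below simply transcribes those two columns in table order; the function names are illustrative, and the only interpretive step is treating “vosotros/ ustedes” as “vosotros” for Spain and “ustedes” for Mexico, as the summary above describes.

```python
# (GT output, expected register) for the 20 sentences, in table order.
ROWS = [
    ("usted", "usted"), ("tú", "tú"), ("usted", "usted"), ("tú", "usted"),
    ("tú", "usted"), ("ustedes", "ustedes"), ("tú", "usted"), ("tú", "usted"),
    ("tú", "tú"), ("tú", "usted"), ("tú", "ustedes"), ("usted", "usted"),
    ("tú", "ustedes"), ("ustedes", "ustedes"), ("tú", "vosotros/ustedes"),
    ("ustedes", "vosotros/ustedes"), ("tú", "vosotros/ustedes"), ("tú", "tú"),
    ("tú", "usted"), ("tú", "tú"),
]


def expected_for(expected: str, region: str) -> str:
    """Resolve the informal plural according to regional usage."""
    if expected == "vosotros/ustedes":
        return "vosotros" if region == "Spain" else "ustedes"
    return expected


def correct_count(region: str) -> int:
    """Number of sentences where GT produced the register expected in `region`."""
    return sum(gt == expected_for(exp, region) for gt, exp in ROWS)


for region in ("Spain", "Mexico"):
    hits = correct_count(region)
    print(f"{region}: {hits}/{len(ROWS)} = {hits / len(ROWS):.0%}")
```

Twenty sentences is a small sample, so the percentages indicate a tendency rather than a measured error rate.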

Japanese, on the other hand, has a system of honorifics that has earned it a 5300 word Wikipedia article,¹⁴³ with an elaborate set of prefixes and suffixes performing much of the work. Something that uses proper vocabulary and syntax, but is translated in the wrong register, will generally be understood, but perhaps not appreciated. For example, legal documents must be in a formal register, while informality is often expected within chat rooms. I attempted an experiment that asked native Japanese speakers to evaluate 10 sentences translated from English for whether they were wrong, understandable, or perfect in registers suitable for a friend, a colleague, or an authority. Unfortunately, despite my getting myself banned briefly on Facebook as a spammer for trying, not enough people responded for the findings to have sufficient statistical merit for detailed analysis. The test confirms that some translations hit closer to the mark in one register or another, and the results trend toward showing that the translations skew to the less formal. Of the 10 sentences, the most neutral, “It’s raining today”, was judged as equally appropriate in all three registers. Most worked better in the least formal and/or mid register than in the most formal, even if that was not the register anticipated. For example, the evidently formal “Teacher, I have something to ask you” and “Can you write a letter of recommendation for me” were judged more appropriate in the informal, as were the less obvious “I am sorry to have kept you waiting” and “Please have a seat”. The responses seem to show that GT results would be riskier to deploy in professional situations, though this conclusion is purely tentative. What is clear is that, in the event that a GT output in Japanese is understandable, it is likely to be more correct when said to one category of person than to another.

The problem with pronouns¹⁴⁴

Pronouns are little elements that take the place of specifically naming certain items in a sentence. They are Voldemort – he who must not be named – and they are as evil in MT as they are in Harry Potter. “This is a famous book. I read it last year. It was written by Melville. My grandmother¹⁴⁵ studied him. She wrote critically about it.” In a world without pronouns, we might phrase those thoughts thusly: “Moby Dick in Martin’s hands is a famous book. Martin read Moby Dick last year. Moby Dick was written by Melville. Martin’s grandmother studied Melville. Adler wrote critically about Moby Dick (ibid.).” We use pronouns because we know the context about which we form our expressions. If you say “Show him”, you and your listener both know there is one guy in focus, else in a room with several males to choose from, you would say, for example, “Show the boy in the tiger shirt”. Pronouns can point to a variety of information in the conceptual space, including gender, time (“see you then”), location (“see you there”), possession (“it’s hers”), quantity (“I’d like some”), and noun class. Or not – Japanese, for example, tends to leave them out when the subject is assumed, expressing “Took the train” instead of “I took the train”.

Because pronouns refer to things that are not directly named, they come to MT partially wrapped in an invisibility cloak. First is the issue of endophora, associating the pronoun with the named item it refers to elsewhere in the text, if at all. That knowledge may be needed to perform tasks such as giving the correct gender form of an adjective in the target language. Second is ambiguity where different languages have different pronoun systems. “You” in English could be “tú”, “usted”, “vosotros”, or “ustedes” in Spanish. When GT’s algorithms cannot determine mapping, the system makes a guess that it presents as truth. “That is a bottle. I saw it in the store” in English becomes “Eso es una botella. Lo vi en la tienda” in Spanish – GT twice chooses the wrong gender pronoun in a simple, unambiguous situation with the service’s third-best Bard rating from my testing (57.5/100). With more complex sentences, languages that have less developed models versus English, or language pairs that do not involve English, the errors mount geometrically. Kamusi Labs has a more effective potential solution, using a widget within SlowBrew translation to enable users to choose how their pronouns should map.

The finite limits of corpora¹⁴⁶

An additional mathematical constraint for MT is the size of the corpora available for training. A corpus is a body of digitized text that serves as a reference for how language has been used. NLP researchers have learned to exploit corpora for many exciting tasks, such as identification of words, morphology, part of speech, party terms, disambiguation, and the construction of syntactic models. Corpora usually consist of relatively formal documents from the public domain, including older books and more recent public records such as parliamentary proceedings. Efforts are often made to transcribe speech from audio recordings, but the work is expensive and time consuming, generally yielding smaller datasets than corpora obtained from written literature. Most corpora are monolingual, because most texts are produced for a single language audience, and the small percentage of texts that are translated do not usually land in open datasets.

Parallel corpora between languages enable MT by providing numerous instances of directly-translated sentences, which can either be replicated exactly or used to extrapolate patterns and associations. Parallel corpora consist of texts that have been professionally translated and are publicly available, which primarily means official documents produced by bodies such as the EU, in the official languages of interest to the sponsoring governments. The translations of modern texts such as “Harry Potter” are under lock and key, as is the work product of translation agencies the world over. While out-of-copyright books like “Alice in Wonderland” and “Heidi” have been translated to dozens of languages, those translations are not necessarily open source, and nobody has yet taken up the project of aligning such works across languages. Not even the Bible, which has been translated to hundreds of languages and has a verse numbering system that is ideal for alignment, has been prepared as an open parallel corpus resource. The bigger the corpus, the more useful; for example, a corpus based solely on the concerns of the Bible’s ancient cultures, where people drew swords, drew water, and drew out leviathans with a hook, but did not draw pictures,¹⁴⁷ would not on its own be a powerful source to translate documents of contemporary concern.

Picture 50.1: GT translates “let it go, let it go” to French based on the parallel lines in a popular song, probably learning the association from multiple websites. It is a logical guess, but it is wrong.

New corpora are often gleaned by robots that crawl the web to find texts in recognizable languages, and match languages that seem to have parallel texts. This method yields a lot of data, but much of that data is messy, often with corrupt data from innumerable sites that have been produced using GT. Computers cannot separate the wheat from the chaff without human oversight, especially when inferring translation parallels. Picture 50.1 shows an instance where GT probably imputed a translation from crawling the web. Disney professionally translated and recorded the hit song “Let It Go” from its blockbuster movie “Frozen” into 41 languages (none from Africa or India, nor any major indigenous languages of the Americas, and with an authoritarian version of Arabic that no child speaks). Examining the reverse translations of the titles shows that there was no mandate for the translators to adhere faithfully to the command “Let it go” – for example, the Romanian version reverse translates as “It happened”, Arabic as “Release Your Secret”, and Polish goes with the inspirational “I have this power”. The French version is “Libérée, délivrée”, which is well-rendered in English as “Freed, released”. Google’s neural networks, however, have apparently identified the frequent association in the corpora between “let it go, let it go” and “libérée, délivrée”, on sites that publish subtitles and/or lyrics on the web, and GT thus erroneously declares the latter to be the proper translation of the former.

Many European lexicographers and computational linguists I meet, who do great things with corpora, do not fully appreciate that such work is not possible in the 99.96% of languages for which corpora are not a thing. For the more than 120 languages spoken in Tanzania, for example, only Swahili has three corpora, two of which are restricted (Helsinki and Sketch Engine), one of which is very small (ACALAN), and none of which is parallel to another language. Some corpora for the many European languages represented in Sketch Engine have several billion tokens.¹⁴⁸

Africa’s 2000 languages, on the other hand, are limited to (in round numbers), Afrikaans (750,000), Amharic (20,000,000), Igbo (400,000), N’ko/ Manding (4,600,000), Oromo (5,000,000), Somali (80,000,000), Swahili (21,400,000), Tigrinya (2,500,000), Tswana (13,500,000), and Yoruba (3,500,000) – just over 150 million tokens for the entirety of languages spoken by over a billion people. Amharic, Oromo, Somali, and Tigrinya were the output of a project¹⁴⁹ supported by the Norwegian and Czech governments. Sketch Engine produced the Igbo, Swahili, Tswana, and Yoruba corpora through web crawls on their own initiative; importantly, because crawling algorithms will identify documents that have (e.g.) Yoruba words as Yoruba texts regardless of their provenance, corpora based on web crawls of languages in GT are now deeply corrupted because the well has been poisoned by machine-generated data. Afrikaans is the only African language included in an open parallel corpus¹⁵⁰ of 40 languages funded by Finland. N’ko¹⁵¹ is a unique story, the relevant aspect in this discussion being the demonstration that any language with written records could have a useful corpus, where financial support and community passions converge. Some other African corpora exist, but the big picture is that the sum total of the digitized African corpus is roughly a tenth the size of what is available for wealthy Estonian, which has roughly a thousandth the number of speakers as African languages. Estonian had a 55% failure in my tests of GT, despite having over 2 billion tokens in readily available corpora. The languages enumerated above with the relatively toy corpora have little hope of entering into usable MT through the pathways established by languages with money. The remainder, and more than 6800 similarly-positioned languages throughout Asia, Australia, the Americas, and even minority languages in Europe, have none.¹⁵²
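The token arithmetic above is easy to verify. The sketch below sums the round figures exactly as listed and compares the total with Estonian; the 2 billion figure is the lower bound implied by “over 2 billion tokens”, so the true ratio is somewhat smaller than the one computed here.

```python
# Round token counts for African-language corpora, as listed in the text.
TOKENS = {
    "Afrikaans": 750_000,
    "Amharic": 20_000_000,
    "Igbo": 400_000,
    "N'ko/Manding": 4_600_000,
    "Oromo": 5_000_000,
    "Somali": 80_000_000,
    "Swahili": 21_400_000,
    "Tigrinya": 2_500_000,
    "Tswana": 13_500_000,
    "Yoruba": 3_500_000,
}
ESTONIAN_TOKENS = 2_000_000_000  # lower bound: the text says "over 2 billion"

total = sum(TOKENS.values())
ratio = total / ESTONIAN_TOKENS

print(f"African corpora listed: {total:,} tokens")
print(f"Share of Estonian's lower bound: {ratio:.1%}")
```

The listed corpora come to roughly 152 million tokens, a little under 8% of the Estonian lower bound, which is consistent with “roughly a tenth”.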

Syntax – the difference between Tarzan and the Bard¹⁵³

Picture 51: Though GT renders Stockholm’s delightful Röda Båten hotel in Romanian as roșu barcă instead of barca roșie, a Romanian will understand at a Tarzan level that it is a boat red, despite the syntax and morphology errors. http://theredboat.com

Until now, I have been talking mostly about the challenge of finding the words to populate text on the source side. If the words on the target side have the same dictionary meaning as was intended by the writer of the source text, you have what could be called “Me Tarzan, you Jane” (MTyJ) translation – Tarzan points at his mouth and says “gură”,¹⁵⁴ Jane says “mouth”, and communication ensues. The Tarzan scores in my tests reflect the centrality of getting the right vocabulary as the basis for understanding, as can be seen in Sentence 5 of Table 1. In order to produce elegant, Bard-like text, those words must be put in the right shapes and assembled in an order that makes sense to a native of the target language. To some extent, this is a mechanical task: verbs come at the end of the sentence in German and Japanese, so an algorithm should know to put them in that slot in those languages. GT clearly pays attention to the rules of syntax, as can be seen by testing the phrase “red boat” and seeing the translation terms exchange places depending on the noun-adjective word order of the test language. I did not perform this test systematically, and did find some failures (e.g. Romanian, see Picture 51¹⁵⁵), but overall the GT success rate for noun-adjective order is much higher than random chance. Morphology can also be constructed to some extent with a good language model; for example, if the source verb is determined to be conditional past tense with a human subject, then the machine can be instructed or taught to produce the form of the target verb that matches those parameters.
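The mechanical part of that task can be sketched in a few lines. This is a minimal illustration, not a description of how GT works: the per-language word-order flag and the two-word lexicon are hypothetical, with the Romanian dictionary forms taken from Picture 51. Because it handles only word order and no morphology, the output stays at the Tarzan level: the words land in the right slots but are not inflected to barca roșie.

```python
# Hypothetical toy data: does the adjective follow the noun in this language?
ADJECTIVE_FOLLOWS_NOUN = {"en": False, "ro": True}

# Dictionary (uninflected) forms only; no morphology is attempted here.
LEXICON = {
    "en": {"red": "red", "boat": "boat"},
    "ro": {"red": "roșu", "boat": "barcă"},
}

def order_noun_phrase(adjective: str, noun: str, target: str) -> str:
    """Look up both words and place them in the target language's order."""
    adj = LEXICON[target][adjective]
    n = LEXICON[target][noun]
    return f"{n} {adj}" if ADJECTIVE_FOLLOWS_NOUN[target] else f"{adj} {n}"

print(order_noun_phrase("red", "boat", "en"))  # red boat
print(order_noun_phrase("red", "boat", "ro"))  # barcă roșu
```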

Beyond rules, which many in the MT world are allergic to based on failures in the pre-SMT era, there is great scope for applying AI on the target side by attempting to mimic good native text from the corpus, for those languages where corpora exist. Good algorithms can move MT a substantial way from Tarzan to Bard, though true human-like output can only be achieved through human post-editing. Comparing the Tarzan and Bard scores in my tests shows the degree to which GT has moved from finding workable vocabulary for a language, to putting it in a shape that seems natural to the language’s readers. The highest Bard ranking, Afrikaans, is 67.5, indicating that the best current MT resources can automate at most 2/3 of the translation load (like auto-pilot keeping an aircraft on course for the routine cruise portion of a flight), with the difficult 1/3 of producing artful prose left for human intervention. The next section introduces a few new concepts that could help many more languages get to that 2/3 level, and could also raise the service ceiling through the current 1/3 remaining.


The post The Astounding Mathematics of Machine Translation appeared first on Teach You Backwards.

Disruptive Approaches for Next Generation Machine Translation

Martin Benjamin — Thu, 28 Mar 2019 12:08:57 +0000

The previous parts of this web-book have discussed what Google Translate (GT) does and does not accomplish, with special attention to things that it cannot accomplish because of features inherent to machine translation as we know it. In this part, I briefly introduce some new paradigms. Many of these have already been piloted successfully at Kamusi, some are under active development, and some cannot be tried until more groundwork is laid or until funds can be found.

What is the context in which people use Google Translate?
What does Google Translate do? Scientific measurements of GT across all its 108 languages.
Why doesn’t Google Translate do much of what it says it does?
Why can’t Google Translate accomplish what it says it does?
How could more effective translation be accomplished? (You are here)
So what? What is wrong with Google Translate not doing what it claims?
Google Translate sometimes gets it right. How should it be used as a helpful tool?


Some of the ideas in this chapter are not controversial, but others are antithetical to precepts within the machine translation (MT) community. Among computer scientists, it is a matter of faith that improved machine processes are the chief path to improved translation. I do not dispute that improved computation will continue to push MT incrementally forward. However, most such improvements are only possible in cases where rich corpora are available and attention is paid to models of the languages involved. Where data is sparse or inexistent, MT is impossible. Even where substantial data is available, though, this chapter argues that the most promising pathway toward imbuing translations with meaning lies in collating data from the speakers of each language based on what they know, more than in inferring connections based on computed textual patterns. Computation can be used to extract the questions that need to be asked of human interlocutors in order to gather data, and computation can be used to parlay the knowledge humans provide into programs that remix the data in ways that people never could – envision connecting a Guarani-speaking student in Paraguay with her Mansi-speaking contemporary in Siberia to work on a school project together, using an MT intermediary in their chat service that builds on (currently unavailable) data and models for those languages. However, until neuroscientists can put USB ports in our skulls and download the linguistic data housed in our crania, the only way to get 99.96% of that data is to ask the people in whose brains it lives: deep learning directly from the people who speak a language. This approach cuts against the grain of most current work in MT. Many will say it is pie in the sky, that clever computing is the path toward universal translation. I say, read this chapter to the end. If you find flaws, raise them in the comments section, so the plan can grow from your insights. 
And then join in – play the games developed for your language, use the products that emerge, hack, donate, tell your friends – and let’s disrupt the translation industry with translation that really translates.

I would like to share with you the ideas we are developing at Kamusi Labs in direct response to the question, “If Google is not capable of translating among 7000 languages, how can that goal ever be reached?” Below, I lay out a strategy that I assert can lead to much better translations than GT and its competitors, among many more languages. Dozens of institutions around the world have signed letters of collaborative intent to join forces on a Human Languages Project, using what is outlined in this chapter to build a matrix of human expression across time and space to the greatest extent possible, if we can find funding to support the work involved. Perhaps I am delusional and what I propose is technically impossible, though no debilitating linguistic or computational flaws have been identified to date by peer reviewers (who sometimes object to elements as utopian, unproven, or overly ambitious, and often suggest overlooked elements that end up improving the model). Perhaps I am selling my own brand of snake oil, though, given the minuscule Kamusi budget, I am clearly not a good salesman. Or perhaps it is time to disrupt the MT industry, and what is described herein really is the next step on the journey to universal translation.

Picture 51.1. The Hindi translation of the common English expression “there’s no there there”, वहां कोई नहीं है, translates in GT with words meaning “no one is there”, and the French meaning is “it is not there”. Google Search reveals over a quarter million pages with the English quote.

I will begin with the assertion that we actually can attain human-quality translations for expressions like “there’s no there there”, and even for each of the sentences shown as “Absurdities” below, and millions more, into, out of, and among all 108 GT languages and on towards 7000. This statement reeks of hubris, and I do not expect to accomplish universal translation before retirement, but it is helpful to define the goal. In fact, I will extend the ambition a little farther. The systems we are developing will enable translation of sentences like these into any language:

Absurdity 1.

Light light light light light light light light light light.
(1) Light (2) light (3) light (4) light (5) light (6) light (7) light (8) light (9) light (10) light.
(1) Light-weight (2) non-heavy people (3) who ignite (4) bright (5) visible energy (6) illuminate (7) non-serious (8) light-weight (9) non-heavy people (10) lightly.

Absurdity 2.

Best worst worst worst worst worst best best best best best worst worst best best best.
(1) Best (2) worst (3) worst (4) worst (5) worst (6) worst (7) best (8) best (9) best (10) best (11) best (12) worst (13) worst (14) best (15) best (16) best.
(1) The top (2) of the bottom (3) who defeat (4) the absolute (5) bottom people (6) most poorly (7) ought to (8) beat (9) most ably (10) the top (11) of the top (12) who destroy (13) the least (14) good (15) finest clothing (16) most ably.

Absurdity 3.

Pick up pick up picked picked up picked up pick up pick up up.
(1) Pick up (2) pick up (3a) picked (4) picked up (5) picked up (6) pick up (7) pick up (3b) up.
(1) The truck driver’s (2) casual date (3a) collected (4) the reenergized (5) lifted (6) hitchhiker (7) stimulant (3b) [connect to 3a]

Picture 51.2: An absurd sentence that Google Assistant has been trained to recognize.

These absurd yet technically grammatical sentences were crafted to demonstrate ambiguity at its extremes, to highlight the impossibility of a computer divining the meanings needed to translate ambiguous items in many situations. Whereas GT translates the first absurdity into Maori as “Rama marama marama marama marama marama marama” (Lamp brightness brightness brightness brightness brightness brightness), and provides similar results for its other languages, I argue that a well-thought-out collaboration between people and our machines can enable MTyJ (Me Tarzan you Jane) or better translations across the board, even for preposterous sentences such as those above. Google has accomplished such an NLP feat with certain English phrases in its Assistant application, hiring a team to prepare humorous canned responses to voice input such as “how much wood would a woodchuck chuck if a woodchuck could chuck wood?” (Picture 51.2). Yet – try it yourself – the same sentence in GT generates wild stabs in every language, scampering to the computational drawing board instead of building on human insight. Although the people at Google Assistant recognize it as an iconic English artifact, the “woodchuck” absurdity does not exist in any parallel corpora so it cannot be gleaned through NMT. That does not mean such absurdities, or the infinite other linguistic combinations that cannot be fathomed through automatic comparisons, cannot be translated. Let’s enumerate what is needed.

(1) A set of definitions for each term in a language (often known as a dictionary), separated so that each individual sense of a polysemous term can be treated as its own entity. This will make it possible to distinguish that the “light” in position 2 of Absurdity 1 has one particular meaning, and the “light” in position 6 has a different meaning.
(2) A set of translations from the concept-designated terms in the source language to terms with equivalent meaning in the target language. This makes it possible for the machine to propose a word that corresponds to the idea of “non-heavy people” for the noun in position 2 of the first absurdity, and a word that corresponds to the idea of “illuminate” for the verb in position 6.
(3) A means of identifying inflected forms of a word and mapping them back to the lemmatic form, for example recognizing that “bought” is really a form of “buy”. This makes it possible to produce the correct vocabulary for non-canonical forms on the target side, along with the knowledge of elements such as tense that can be used to produce the correct morphology.
(4) A means of identifying elements of the sentence according to part of speech, gender, register, or other relevant aspects. This will make it possible to seek and position equivalent elements on the target side.
(5) Language models for each language that identify features such as the order of nouns and adjectives. This makes it possible to arrange vocabulary pieces according to syntactic patterns, pushing translations from Tarzan toward Bard.
(6) A method to identify the parts of a party term so they are treated as a lexical unit, rather than as individual words. This makes it possible to correctly treat elements such as “up” in position 3b of Absurdity 3.
(7) Methods to identify terms or concepts that are not in the database, and to gather them from digitized or human resources. This makes it possible to expand translation beyond the present set of digitized data.
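As a rough illustration of the third requirement above, the sketch below maps inflected forms back to a lemma plus features such as tense. The table entries and feature names are hypothetical, and a real system would rely on a morphological analyzer for each language rather than a hand-written lookup.

```python
from typing import NamedTuple, Optional

class Analysis(NamedTuple):
    lemma: str
    features: dict

# Hypothetical hand-written entries, for illustration only.
INFLECTIONS = {
    "bought": Analysis("buy", {"tense": "past"}),
    "buys": Analysis("buy", {"tense": "present", "person": 3}),
    "picked": Analysis("pick", {"tense": "past"}),
}

def analyze(form: str) -> Optional[Analysis]:
    """Return lemma and features for a known inflected form, or None if unknown."""
    return INFLECTIONS.get(form.lower())

print(analyze("bought"))
```

The lemma selects the vocabulary item to translate, while the features travel alongside it so that the target side can produce a matching inflection.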

Taken individually, none of these components is extraordinary. Pause: an invited reviewer reacts, “I would go ahead and say that all of these components are extraordinary. They all pretty much face insurmountable bottlenecks given the current state-of-the-art, so stringing them together is not likely to end well.” Response: One hundred years ago, nobody imagined that a network of highways, tunnels, and bridges would make it possible to drive comfortably from Tromsø above the Arctic Circle to Ankara in Turkey, although roadways and bridges go back at least to Roman times, long tunnels came in with the railroads, and the Ford Model T had already been in production for a decade. Although the only route across the Strait of Gibraltar today is via ferry, you would not be the least surprised to ride in an autonomous vehicle across a bridge connecting Europe and Africa sometime before you die – after all, 44 bridges on the planet already span a longer distance. Today’s transport networks were built one element at a time, using known technologies when they worked and improving on those ideas when they didn’t; while the fact that we can now ride in a train across the English Channel is an amazing feat of engineering, serious engineering proposals for the dream began as far back as 1802, and exploratory tunnels more than a mile long were dug on both the French and English sides in 1881, before all but a few handfuls of the people who lived to see it completed 113 years later had been born. I am not saying that the vision laid forth here for translation is easy. I am saying that it is composed of elements that already have established templates in languages spoken by the wealthy, and for which the extension to any other language is therefore more a question of will than technology.
Just as the transport network was built with one traffic circle in Aberdeen and one stop light in Athens, the entire field of NLP is devoted to building bits and pieces that could collectively enable MT across whatever languages society deems valuable. However, billions of dollars using the methods GT has deployed to date cannot get us to truly effective translation across large numbers of languages because the data does not now exist in a form technology could ever act on, and what data can be collected cannot be deployed without a great deal of human input to carefully steer the efficiencies offered by computation. Regarding the 7 points above, then:

(1) Dictionaries per se are a venerable technology, and opuses such as the Oxford English Dictionary prove that they can thoroughly document most expressions in a chosen language. Digitizing lexical data is relatively new, but e-lexicography is now an established pursuit. Separating out senses in ways that can be repurposed in other data applications is uncommon (for example, it cannot be done from any online dictionary other than Kamusi, for English or any other language), but has been given a hearty first pass for English in a way that can be parsed by the data savvy with the Princeton WordNet.

(2) Matching those English senses to equivalent terms in other individual languages has been done with varying success by several dozen international wordnet teams, and aligned in differing ways by the Open Multilingual Wordnet and by Kamusi.

(3) Morphological identification has been handled in various ways by toolkits that researchers have developed for the languages that matter to funders, such as FreeLing, which is available for demo in 13 languages.

(4) Successful grammatical models have been built for many languages, though not all are available open source. Researching and implementing a model is often an appropriate Master’s level thesis project. Some portion of the world’s repertoire of languages could be doled out to interested students every year, with the goal of interoperable models for thousands of languages over the course of a decade. The obstacle is funding.¹⁵⁶

(5) MT-focused syntactic models also exist for many languages. For languages where models exist, it should be possible to generate somewhat grammatical text between any Language A and Language B by wiring together the known elements such as vocabulary, subject, object, tense, and gender. All five of these components have been done for privileged languages such as Norwegian. What can be done for Norwegian can be done for Northern Nuni (spoken by about 45,000 people in Burkina Faso), if people choose to do so. The technology is the same and the process is the same. If the will of the people who fund language technology resources is drastically different, that is a social and financial decision by the nation of Norway (Norad and the Ministry of Education and Research) that Norwegian children are more worthy of investment than are Burkinabe children, not a question of technical viability.

(6) As discussed in the chapter on translation mathematics, party terms are an ongoing Achilles’ heel for MT. Kamusi is developing a method for identifying and translating party terms, described below, that should work out of the box for any language, regardless of the distance by which words are separated within the expression.

(7) Gathering terms and meanings is the established field of lexicography. Unfortunately, the time, money, and effort to pursue extensive lexicons of most languages are prohibitive, with few people imagining that OED equivalents for languages like Kinyindu (spoken by about 10,000 people in eastern Congo) could really come to be. Kamusi Labs has developed technology, now available in the Kamusi Here app’s participatory features described below, that can lead to the rapid production of rich monolingual dictionaries for each language, linked as MT-ready data between languages – and we will be field-testing the system with our Kinyindu partner organization when we can fund their efforts. Further, our “SlowBrew” system for disambiguation of terms and the identification of party terms on the source side, in advanced construction, will be capable of bringing vocabulary errors close to zero. The unsolved constraints to collecting the words of most languages are financial, because potential funders either believe that it is not worth doing, or that Google has already done it.

The semantics of translation¹⁵⁷

Picture 52: A head to head comparison of results from Kamusi (left) and results from GT for the same search.

Many examples of MT failure throughout this paper have hinged on ambiguity, particularly the problem of polysemy. Kamusi has largely solved this problem. Our system aligns terms across languages based on the underlying concept, rather than a crude approach centered around the happenstance of English spelling. Our initial implementation is a relatively simplistic interlinking of data from the Global Wordnet across 44 languages and counting; as time goes on, we will expand the concept set far beyond those covered by Wordnet, account for morphology and semantic drift, and extend to hundreds or thousands of non-market languages. To demonstrate the difference in accuracy between GT’s best-guess term translations and Kamusi’s semantic-equivalency approach, I have prepared a head-to-head comparison between the two. These are screenshots from the live versions of the KamusiHere! and Google Translate mobile apps, for the same set of searches. The images show how the Kamusi model provides much more precise vocabulary than Google is capable of. Although the Kamusi data is still young, you can already see how much more confident you can be in its results. The first matchup shows results between English and another language, in this case Chinese. In this scenario, Kamusi shows each sense, with enough information for a user to decide which term matches their intended meaning. GT, on the other hand, shows some options, but not enough context to make a determination. The remaining screenshots show matchups between non-English pairs, using Greek, Romanian, and Italian as the sample cases. GT provides one guess per pair, whereas Kamusi gives a valid match for each polysemous concept for which data has been collected to date.

To achieve the level of specificity needed for precise translation, we deal internally, ironically, with numbers instead of words, as illustrated in Picture 52.1. Each sense of each word is a “spelling/meaning unit”, so each one gets its own “smurf”, or “spelling/meaning unit reference”. This is Data Science 101: a nebulous tag causes problems, and a unique identifier solves them. For example, in 2001, about 9000 Floridians were erroneously scrubbed from the voter rolls because they had the same name as a dead person or convicted felon. Consequently, the National Research Council recommended that voter registration databases across the 50 American states work toward coordination through clear numerical indicators such as Social Security number, so as not to disenfranchise one Willie Whiting (SSN 640-34-8769) because of the crimes of another Willie Whiting (SSN 251-02-0883). At Kamusi, we invoke unique identification of a spelling/meaning unit as a fundamental precept. That is, “dry” (not wet) is assigned a unique ID, “dry” (boring) has a completely different number, Italian “secco” (not wet) has a different number, and “secco” (very skinny) has yet another. Let us say (using mini-smurfs to ease the visualization):

- dry (not wet): smurf = k-1112
- dry (boring): smurf = k-2346
- secco (not wet): smurf = k-8745
- secco (very skinny): smurf = k-5540
- noioso (boring [Italian]): smurf = k-6413
- 마른 (not wet [Korean]): smurf = k-3033
- 지루한 (boring [Korean]): smurf = k-9796

With the spelling/meaning units converted to numbers, it is then a cakewalk to store k-1112 = k-8745. We can also store k-1112 = k-3033. Following the rules of transitivity, k-8745 = k-3033. Thus, we have high confidence that secco = 마른. Furthermore, k-6413 = k-2346 = k-9796, so noioso = 지루한.
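The transitivity described above can be sketched with a small union-find structure. This is an illustrative choice of data structure, not a description of how the Kamusi database actually stores its links; the class and method names are hypothetical, and the identifiers are the mini-smurfs from the list above.

```python
class SmurfLinks:
    """Union-find over smurf identifiers; each stored link merges two groups."""

    def __init__(self):
        self.parent = {}

    def find(self, smurf: str) -> str:
        self.parent.setdefault(smurf, smurf)
        while self.parent[smurf] != smurf:
            # Path halving keeps later lookups short.
            self.parent[smurf] = self.parent[self.parent[smurf]]
            smurf = self.parent[smurf]
        return smurf

    def link(self, a: str, b: str) -> None:
        self.parent[self.find(a)] = self.find(b)

    def equivalent(self, a: str, b: str) -> bool:
        return self.find(a) == self.find(b)

links = SmurfLinks()
links.link("k-1112", "k-8745")  # dry (not wet) = secco (not wet)
links.link("k-1112", "k-3033")  # dry (not wet) = 마른
links.link("k-6413", "k-2346")  # noioso = dry (boring)
links.link("k-2346", "k-9796")  # dry (boring) = 지루한

print(links.equivalent("k-8745", "k-3033"))  # True: secco = 마른
print(links.equivalent("k-6413", "k-9796"))  # True: noioso = 지루한
print(links.equivalent("k-8745", "k-9796"))  # False: secco is not 지루한
```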

Using this method, we can easily see that there is no relationship between either k-8745 or k-5540 and k-9796: secco ≠ 지루한. Contrast the GT method – if secco matches to d-r-y in the English-Italian dataset, and 지루한 matches to d-r-y in the English-Korean data, then standard MT posits secco = dry = 지루한 – a very skinny Italian comedian has a good chance of being known in Seoul as a boring person of indeterminate weight. Using the smurfs, meanings never cross. Using statistical or neural methods, though, such crossing is inherent to the procedure.
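The contrast can be made concrete with toy data. In the sketch below, the first function chains two spelling-keyed tables through English, which is a deliberate simplification of the pivot problem and not a model of any real MT engine; the second chains sense-keyed links using the mini-smurfs listed earlier. All table and function names are hypothetical.

```python
# Spelling-keyed bilingual tables (toy data): the English key is just d-r-y.
IT_TO_EN = {"secco": ["dry"]}
EN_TO_KO = {"dry": ["마른", "지루한"]}

def pivot_by_spelling(italian: str) -> set:
    """Chain two spelling-keyed tables through English; senses get mixed."""
    return {ko for en in IT_TO_EN.get(italian, []) for ko in EN_TO_KO.get(en, [])}

# Sense-keyed links using the mini-smurfs from the list above.
IT_SENSE_TO_EN = {"k-8745": "k-1112", "k-6413": "k-2346"}
EN_SENSE_TO_KO = {"k-1112": "k-3033", "k-2346": "k-9796"}
KO_SPELLING = {"k-3033": "마른", "k-9796": "지루한"}

def pivot_by_smurf(italian_smurf: str):
    """Chain through the English sense; return None when no link is stored."""
    english = IT_SENSE_TO_EN.get(italian_smurf)
    korean = EN_SENSE_TO_KO.get(english)
    return KO_SPELLING.get(korean)

print(pivot_by_spelling("secco"))  # both Korean words, including the wrong one
print(pivot_by_smurf("k-8745"))    # secco (not wet) -> 마른
print(pivot_by_smurf("k-5540"))    # secco (very skinny) -> None, no link stored
```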

In many cases, a term in one language will only be approximately equivalent to the term given as its translation in another language, the problem of semantic drift. That is, perhaps k-6413 ≅ k-2346, and k-2346 ≅ k-9796. For the moment, we address the issue by displaying all the information that we have about the term in Language A (k-6413), the term in Language C (k-9796), and the English term that established the bridge (k-2346); that is at least a starting point for the user to evaluate whether they are really looking at the same concept on both sides. Unfortunately, in many cases our data did not arrive with proper definitions of the local terms; instead local teams relied on the Wordnet definition of the English equivalent, which often misses the strike zone. We are implementing processes to elicit local definitions from users. In the future, users will mark cases that have only partial semantic equivalence – terms are considered “parallel” if they are generally the same idea, “similar” if they have partial conceptual overlap, and “explanatory” if we are producing a stock phrase to fill a lexical gap. This will enable us to establish a confidence index, so we can give an honest alert about the more dubious automated matches.
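One possible shape for such a confidence index is sketched below. Everything numeric here is an assumption made for illustration: the text names the three grades but assigns them no weights, and the weakest-link rule for scoring a chain of bridged terms is only one of several plausible choices.

```python
# Hypothetical weights for the three link grades named above (assumed values).
GRADE_WEIGHT = {"parallel": 1.0, "similar": 0.6, "explanatory": 0.3}

def path_confidence(grades: list) -> float:
    """Score a chain of links by its weakest link (an assumed rule)."""
    return min(GRADE_WEIGHT[g] for g in grades)

# A bridge such as k-6413 -> k-2346 -> k-9796, with user-marked grades per link.
print(path_confidence(["parallel", "parallel"]))  # 1.0
print(path_confidence(["parallel", "similar"]))   # 0.6
```

Under this rule, a match bridged through even one “similar” link would be displayed with a lower score than a chain of “parallel” links, which is the kind of alert the paragraph above describes.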

Picture 52.1: Translation relationships among terms in different languages in the Kamusi database. Terms and features (such as part of speech and inflections) are expressed as numbers. Those numbers can be joined in various ways for translation and other natural language processing tasks. Information about relations for how various senses of “dry” relate to their equivalents in dozens of other languages is embedded in those numbers. The computer never tries to guess which sense of “dry” matches across languages, as occurs in present-day MT systems such as Google. Do not spend too much time trying to decipher the data in this snapshot – it can only be understood in the context of many other tables that use numbers to portray intricate aspects of millions of terms.

As a note to specialists, smurfs add an order of precision beyond what is achieved with Wordnet. Wordnet has identification numbers for “synsets”, which are clusters of terms that share a general meaning. The Wordnet 3.1 synset 02958343-n, for example, contains “car”, “auto”, “automobile”, “motorcar”, and “machine”. Other meanings of “car” belong to other synsets, as do other meanings of “auto” and “machine”. It is possible to zero in on a precise spelling/meaning unit, essentially by combining spelling and meaning in a formula (car-02958343-n, more or less), but this becomes extremely inefficient when you start to make complicated calculations across languages. At Kamusi, we maintain the Wordnet synset to show one level of relationship, and to facilitate coordination with the many external projects that use the Wordnet system. However, we are transitioning to the smurf level to more precisely differentiate nuance that is lost within a Wordnet synset such as, for example, “hit”, “strike”, “impinge on”, “run into”, and “collide with”. The goal is not, as some reviewers mistakenly assume, to build a bigger Wordnet, but rather to use Wordnet as a starting point for a more complex, three-dimensional matrix.
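The difference between a synset identifier and a smurf can be sketched as follows. The numbering scheme and function name are hypothetical, chosen only to show one opaque identifier being issued per spelling/meaning unit; the synset members are the ones quoted above.

```python
from itertools import count

# Members of Wordnet 3.1 synset 02958343-n, as listed above.
SYNSETS = {"02958343-n": ["car", "auto", "automobile", "motorcar", "machine"]}

_next_id = count(1)
SMURF_OF = {}  # (spelling, synset) -> smurf

def smurf_for(spelling: str, synset: str) -> str:
    """Issue one opaque identifier per spelling/meaning unit (hypothetical scheme)."""
    key = (spelling, synset)
    if key not in SMURF_OF:
        SMURF_OF[key] = f"k-{next(_next_id):04d}"
    return SMURF_OF[key]

for lemma in SYNSETS["02958343-n"]:
    print(lemma, smurf_for(lemma, "02958343-n"))
```

All five terms still share the synset, so the Wordnet-level relationship is preserved, but each term now carries its own identifier that can hold links and nuances the others do not share.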

If you have an hour, you can watch this invited lecture, “The Particles of Language: ‘The Dictionary’ as elemental data for 7000 languages across time and space”, that I gave at CERN in 2015 that dives further into what I call “molecular lexicography”: https://cds.cern.ch/record/2054123. (The room is less empty than it looks, but most of the audience sat out of camera range in the back rows by the entrance.) There are lots of pretty pictures and no math, I promise.

A monumental difference between Kamusi and GT is that, if Kamusi does not have data for a particular concept, we will tell you that we do not know, and ask for your help in obtaining the missing information. GT, on the other hand, invokes MUSA to always print some sort of output, regardless of whether it has the slightest basis for doing so, and never indicates when their content is certifiably fecal. We do not subscribe to the MUSA “make stuff up” theory of machine translation that is, as documented in this research, Google’s fallback position.

SlowBrew disambiguation¹⁵⁸

Services like GT have inculcated the expectation of magic wand translation, which was exposed as a myth in the chapter on qualitative analysis. A system in advanced development at Kamusi Labs introduces a human review phase prior to passing a source document to a translation engine. Instead of undrinkable instant coffee, the results of SlowBrew will satisfy the palate, with the potential (as the underlying data improves) to produce precise vocabulary matches every time. I acknowledge that uptake will involve a change in consumer expectations, which I believe will occur (a) as users experience a vast increase in output quality versus magic wand translation, and (b) as users come on board from languages with little or no current engagement with MT. SlowBrew takes advantage of Kamusi’s semantic mappings to prepare a document for translation. In brief, the system analyzes a sentence using available NLP tools for features such as lemmatization and part of speech tagging. The user is then presented a list of senses for each polysemous term. If the text contains “dry”, for example, the user is given a list to choose from:

dry (not wet)
dry (boring)
dry (not sweet)
dry (not permitting alcohol)

The user selects the sense that applies; future work will float machine estimates of the most likely sense to the top of the stack of concepts on offer, such that a sentence containing will elevate « dry (not producing milk) ». If a sense or term is not present, the user is given options to contribute it for review as an addition to the dictionary. Once the user has marked their intended senses, we are no longer wrestling with the word “dry”, but instead with a smurf, a number like k-2346 that is associated with a meaning across languages. Because the meaning is known at this point, no fancy computation is necessary to guess the matching vocabulary based on the happenstance of letter sequences. We have eliminated the problem of choosing from among all possible terms on the target side that match to any possible sense on the source side. In the case of the Absurdities above, the user will have identified the concepts (along with their parts of speech, inherently), so the computer will never be asked to guess at all the lights and bests and worsts and picks – the computer only sees the vocabulary as smurfs. For the machine, the task for the Absurdities is the same as all other SlowBrew translations, creating a seating arrangement (determining the logical syntax patterns) for the smurfs on the target side who are married to the smurfs on the source side across the aisle, and dressing them in appropriate clothing (producing the right inflections).
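The selection step can be sketched as below. This is not the SlowBrew code: the function names are hypothetical, and only k-1112 and k-2346 are identifiers used earlier in the text, while k-7001 and k-7002 are placeholders invented for the two remaining senses.

```python
# Toy sense inventory. k-7001 and k-7002 are hypothetical placeholders.
SENSES = {
    "dry": [
        ("k-1112", "not wet"),
        ("k-2346", "boring"),
        ("k-7001", "not sweet"),
        ("k-7002", "not permitting alcohol"),
    ],
}

def sense_menu(lemma: str) -> list:
    """Build the list shown to the user for one term."""
    return [f"{lemma} ({gloss})" for _, gloss in SENSES.get(lemma, [])]

def tag_sentence(lemmas: list, choices: dict) -> list:
    """Replace each term by the smurf the user picked; leave other terms as text."""
    tagged = []
    for position, lemma in enumerate(lemmas):
        options = SENSES.get(lemma, [])
        if position in choices:
            tagged.append(options[choices[position]][0])
        elif len(options) == 1:
            tagged.append(options[0][0])  # unambiguous, no question needed
        else:
            tagged.append(lemma)  # not yet disambiguated, or not in the dictionary
    return tagged

print(sense_menu("dry"))
# The user picks option 1 ("boring") for the word at position 3.
print(tag_sentence(["the", "lecture", "was", "dry"], {3: 1}))
```

After tagging, the downstream engine receives k-2346 rather than the letters d-r-y, so no guess about the sense is left for the machine to make.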

Future programming will allow users to tag their source documents with even greater refinement. No amount of data or learning will enable statistical or neural predictions of what is meant by “some” in “save some for me”, or “they” in “they were hot” in many situations – the female cheerleaders? the male roosters? the pizzas? – but a manual endophora identification tool is on the task list for SlowBrew. When a model to recognize them as such is in place for a language, users can be prompted to link ambiguous pronouns with the first mention of the thing they refer to, thereby also attaching knowable attributes such as gender or noun class.

Similarly, users can be asked to clarify known problem elements in a language, such as declaring a gender for a subject, or the gender or relative status of the person to whom they wish to communicate. The program will need to adapt to the special ways that source languages mismatch with their targets, for example asking English speakers to clarify who is meant by “you”, or asking Japanese speakers to pinpoint a subject for a verb when none is provided. SlowBrew 2.0 will not introduce complicated new technical challenges, but it will require tailoring in consultation with a lot of different language experts. Some will argue that users will resist taking the time to tag their documents prior to MT, remaining content with systems like GT that arbitrarily make grammatical decisions on their behalf that are guaranteed to be wrong much of the time. I suggest that they will flock to a system where women are female, friends are addressed informally while business associates are addressed with respect, all the words mean what they think they mean, and all the pronouns perform their grammatical duties correctly. If we have already gathered an equivalent term in the target language within our database, it can be passed to the target side for MTyJ translation, and further processed for syntax and morphology if and when a model is developed for that language.

When two or more equivalents exist in the target language, the future goal is for the machine to make predictions and offer options to the user, DeepL style. When a term is not in the system, whereas GT just makes something up (inventing fake words such as, for Swahili, “hapatical” as a translation for “heretical”, or fake mappings such as producing the real Swahili word “tatizo” for the non-existent English “heret”, as shown in Myth 4), in SlowBrew the user is honestly alerted that the data is not yet available, and the term is flagged and placed within the collection workflow.
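The three outcomes described above (one confirmed equivalent, several options for the user to choose among, or an honest "no data yet" that feeds the collection workflow) can be sketched roughly as follows. This is a minimal illustration, not the SlowBrew implementation: the function name `lookup`, the table layout, the French entries, and the identifier `k-0001` are all assumptions made for the example.

```python
# Hypothetical table: (smurf, target language code) -> validated equivalents.
# "k-1112" is the "dry (not wet)" concept used elsewhere in the text; its French
# entry and the whole "k-0001" row are invented for illustration.
EQUIVALENTS = {
    ("k-1112", "fra"): ["sec"],
    ("k-0001", "fra"): ["canapé", "sofa"],
}


def lookup(smurf, target_lang, equivalents, collection_queue):
    """Return validated options, or flag the gap instead of inventing a word."""
    options = equivalents.get((smurf, target_lang), [])
    if not options:
        # No data: tell the user, and queue the term for collection.
        collection_queue.append((smurf, target_lang))
        return {"status": "missing", "options": []}
    status = "confirmed" if len(options) == 1 else "choose"
    return {"status": status, "options": list(options)}


queue = []
one = lookup("k-1112", "fra", EQUIVALENTS, queue)
several = lookup("k-0001", "fra", EQUIVALENTS, queue)
nothing = lookup("k-1112", "swh", EQUIVALENTS, queue)
print(one, several, nothing, queue)
```

The point of the design is the third branch: a missing term produces a flag and a queue entry, never filler text.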

Party terms¹⁵⁹ are a bête noire for MT because they are difficult to identify, and if they are not identified then the MT engine will nonsensically translate them word for word. SlowBrew is geared to identify party terms, in order to correctly translate their underlying meaning. Party terms are lexicalized in Kamusi – that is, when a combination of two or more words has a meaning other than what would be understood by looking at those words in isolation, the expression is turned into its very own dictionary item, given its own definition, and, crucially, associated with its own smurf. Examine these party terms from three languages:

over the moon (English meaning: filled with joy. Translated word-for-word by GT to Spanish as “sobre la Luna” and to French as “sur la lune”). smurf = k-5538
tocando el cielo con la punta de los dedos (Spanish meaning: filled with joy. Translated word-for-word by GT as “touching the sky with the tips of your fingers”). smurf = k-3094
aux anges (French meaning: filled with joy. Translated word-for-word by GT as “to angels”). smurf = k-9471

All three of those party terms mean exactly the same thing, but direct word-for-word translation fails at 100%, with the meaning evaporated in all six possible translation scenarios. Using the Kamusi smurfs, though, neither the moon nor the sky nor angels are part of the translation equation. Instead, our database will match k-5538 = k-3094, and k-9471 = k-5538, and make the high-confidence association that k-3094 = k-9471. When asked to translate from French to Spanish, even if we do not have a human-confirmed link between the two party terms, we can look to the numbers, and voilà, posit that “aux anges” translates as “tocando el cielo con la punta de los dedos”.
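The inference in the paragraph above is a transitive closure over confirmed links, which can be sketched with a small union-find structure. The three smurfs and expressions come from the examples above; the class name `SmurfLinks`, the `TERMS` table, the language codes, and the `translate` helper are assumptions made for this illustration, not Kamusi's actual data model.

```python
class SmurfLinks:
    """Tiny union-find over concept identifiers ("smurfs")."""

    def __init__(self):
        self.parent = {}

    def find(self, smurf):
        # Each smurf starts as its own representative.
        self.parent.setdefault(smurf, smurf)
        while self.parent[smurf] != smurf:
            # Path halving keeps later lookups short.
            self.parent[smurf] = self.parent[self.parent[smurf]]
            smurf = self.parent[smurf]
        return smurf

    def link(self, a, b):
        """Record a human-confirmed equivalence between two smurfs."""
        self.parent[self.find(a)] = self.find(b)

    def same_concept(self, a, b):
        return self.find(a) == self.find(b)


# smurf -> (language code, expression); codes are illustrative.
TERMS = {
    "k-5538": ("eng", "over the moon"),
    "k-3094": ("spa", "tocando el cielo con la punta de los dedos"),
    "k-9471": ("fra", "aux anges"),
}


def translate(links, source_smurf, target_lang):
    """Find a target-language term in the same concept set, if any."""
    for smurf, (lang, expression) in TERMS.items():
        if lang == target_lang and links.same_concept(source_smurf, smurf):
            return expression
    return None  # no data: report the gap rather than guess


links = SmurfLinks()
links.link("k-5538", "k-3094")  # English = Spanish (confirmed)
links.link("k-9471", "k-5538")  # French = English (confirmed)
print(translate(links, "k-9471", "spa"))
```

No French-Spanish link was ever entered, yet the lookup succeeds because both terms sit in the same concept set.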

SlowBrew identifies party terms in two ways:

If the expression has already been lexicalized in the Kamusi database, SlowBrew is triggered by the lead word to search for the other party-goers farther in the sentence. The database will have, for example, numerous expressions beginning with “drive”: “drive up the wall”, “drive home the point”, “drive to distraction”, etc. My testing confirms that GT also does such an analysis on the source side in limited cases for English phrasal verbs and other party terms in its database; for example, the vocabulary choice for “pick” will change (with variable success) with the addition of “up” for the phrases “He picked his son and daughter [up] at school” and “He picked his hot new girlfriend [up] at the bar” (rendered in Turkish as “He lifted his enflamed new girlfriend at the bar”), and “drive” will change for “The new product will drive everyone who tries it crazy” but does not change for “The new product will drive everyone who tries it up the wall”. It is not so difficult to spin through dozens or hundreds of identified party terms, and offer the user any candidates that appear in the text, e.g. the sentence “she drove everyone in the room up the wall” would resolve to “she” + “drive-up-the-wall” + “everyone” + “in” + “the” + “room” if the user so agreed. When there are multiple possibilities within a sentence, such as the multiple “up”s in Absurdity 3 above, the user can choose the one that applies.
If an expression has not already been lexicalized, the SlowBrew interface enables users to mark any individual words in a sentence as members of a party term and submit that term to the dictionary, organically growing the supply of knowledge for cultivation by future users. That GT sometimes uses leading words to identify separated party terms hints that the underlying idea is not so crazy; the Kamusi notion that the terms also can be identified, disambiguated, submitted to the database by users, and associated with confirmed vocabulary in other languages might also not be so far off the wall.
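The first method, triggering on a lead word and searching farther along the sentence for the rest of the party, can be sketched as below. This is a toy under stated assumptions: the lemma table, the `PARTY_TERMS` lexicon layout, and the function names are hypothetical, tokenization is a plain whitespace split, and a real system would present the candidates to the user rather than resolve them automatically.

```python
# Hypothetical mini-lemmatizer and lexicon, keyed by the lead word.
LEMMAS = {"drove": "drive", "drives": "drive", "driving": "drive"}
PARTY_TERMS = {
    "drive": [("up", "the", "wall"), ("home", "the", "point"), ("to", "distraction")],
}


def find_candidates(tokens):
    """Return (lead_index, tail_start, tail_length, label) for each possible party term."""
    candidates = []
    for i, token in enumerate(tokens):
        lead = LEMMAS.get(token, token)
        for tail in PARTY_TERMS.get(lead, []):
            n = len(tail)
            # Look for the remaining party-goers anywhere after the lead word.
            for j in range(i + 1, len(tokens) - n + 1):
                if tuple(tokens[j:j + n]) == tail:
                    candidates.append((i, j, n, "-".join((lead,) + tail)))
    return candidates


def resolve(tokens, candidate):
    """Collapse an accepted candidate into a single lexical unit."""
    i, j, n, label = candidate
    return tokens[:i] + [label] + tokens[i + 1:j] + tokens[j + n:]


tokens = "she drove everyone in the room up the wall".split()
candidates = find_candidates(tokens)
print(resolve(tokens, candidates[0]))
```

When a sentence yields several candidates, the list returned by `find_candidates` is what would be offered to the user to choose from.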

Learning ideas from users¹⁶⁰

When users grant permission, their sense tagging can be used to learn the human associations between words and meanings in context – creating, over time, an annotated monolingual corpus that connects directly to confirmed vocabulary across languages. That is, SlowBrew is currently capable of offering the user multiple choices for “drive”, regarding cars, golf balls, bargains, etc. After repeated sightings in phrases like “drive to New Jersey” and “drive to school”, AI analysis of the data could spot that users generally choose the automotive sense for the pattern “drive to [location]”, with an exception for the golf sense if the phrase is “drive to the green”. Such experiential learning is outside the realm of the GT process, where the source text is at best fodder for building a context-free corpus. This method of learning benefits from users’ vested interests in generating superior translations for their own needs. Machine learning compares a million pictures, for example, and comes to recognize features that generally suggest “fish” (but could be “dolphin” in murky water). That generalization is then applied to the million-and-first image, so that Google Assistant can confidently tell you whether the purple flowers you photograph are lilacs or wisteria. A system based on SlowBrew tagging will have precise information to determine not only that an item is a fish, but that her name is Wanda. With the vocabulary weak points largely solved, MT technologists can focus on the rote processes at which computers excel, such as morphology and syntax – interested collaborators should please get in touch.

The basic procedure is not complex, though a lot of complex coding is needed to make it operational. After millions of people have used SlowBrew to instruct the system on their intended meanings, we will have data to see that in a sentence such as “I kept dry during the storm by staying inside”, users chose item k-1112, whereas in a sentence such as “The lecture was so dry that I couldn’t keep my eyes open” the choice was k-2346. Building on the techniques currently used by NMT, we can examine the words that are embedded nearby. Where current MT slides on thin ice to make some sort of predictive calculation about appropriate vocabulary, though, SlowBrew snowshoes on solid ground: “keep dry” and “storm” and “stay inside” were usually tagged by users as k-1112, while “lecture” and “keep [one’s] eyes open” were most often associated with k-2346. Moreover, we will know that the “keep” (remain in the same condition) in the first sentence was k-7718, while the “keep” (force to stay) in the second sentence was k-3625. When it comes time to translate to another language, we can then transmit the smurfs (k-1112 + k-7718) in the first instance and (k-2346 + k-3625) in the second, and the right vocabulary for ambiguous terms will be generated on the target side, no matter whether the desired language is Latvian, Luo, or Laotian.
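A rough sketch of how user tags could accumulate into sense suggestions: every time a user confirms a sense, the choice is counted against the context cues seen in the same sentence, and later sentences are ranked by those counts. The class name `SenseMemory`, the single-word cues, and the vote counts are assumptions for illustration; only the smurfs k-1112 and k-2346 come from the text, and an operational system would use far richer context than a bag of cue words.

```python
from collections import Counter, defaultdict


class SenseMemory:
    """Counts which smurf users chose for a word alongside each context cue."""

    def __init__(self):
        # (word, cue) -> Counter of smurfs chosen by users
        self.votes = defaultdict(Counter)

    def record(self, word, smurf, cues):
        """Store one user-confirmed sense choice with its surrounding cues."""
        for cue in cues:
            self.votes[(word, cue)][smurf] += 1

    def suggest(self, word, cues):
        """Rank smurfs by how often users picked them near these cues."""
        tally = Counter()
        for cue in cues:
            tally.update(self.votes.get((word, cue), Counter()))
        return [smurf for smurf, _ in tally.most_common()]


memory = SenseMemory()
# Invented tagging history for "dry".
memory.record("dry", "k-1112", ["storm", "inside"])
memory.record("dry", "k-1112", ["storm", "umbrella"])
memory.record("dry", "k-2346", ["lecture", "eyes"])

print(memory.suggest("dry", ["storm", "coat"]))
print(memory.suggest("dry", ["lecture"]))
print(memory.suggest("dry", ["banana"]))
```

An empty result is meaningful here: with no tagged evidence, the sketch offers no prediction and the choice goes back to the user.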

Crucially, whereas GT vacillates among sense predictions from one target language to the next (e.g., whether the “spring” in her step maps to springtime or a water source might change from Azerbaijani to Uzbek), because SlowBrew is tagged on the source side, the information adheres to the author’s language. There is no need to relearn or recalculate senses depending on the target language. We learn for English based on documents written in English, regardless of whether the original translation target is Luxembourgish or Malagasy, and we learn for Luxembourgish based on documents written in Luxembourgish, regardless of whether the translation target is English, Frisian, or Esperanto. This is opposite to the way GT claims to learn from users, who are somehow supposed to be able to identify whether translations are good on the target side. My confirmation of a translation to Kurdish, say, of “the cow escaped from the pen” is utterly useless as future data because random user Martin Benjamin does not know a word of Kurdish, whereas my confirmation that my meaning for “pen” was k-4545 is bankable information. Tagged corpora will accrue over time for each language users translate out of, helping to discover not only word associations but also information about other linguistic features. This hypothetical system remains unfunded, but when a brave sponsor arrives, it will fundamentally alter the landscape of MT.

Hofstadter writes:

Machine translation has never focused on understanding language. Instead, the field has always tried to “decode”—to get away without worrying about what understanding and meaning are… It’s familiar solely with strings composed of words composed of letters. It’s all about ultrarapid processing of pieces of text, not about thinking or imagining or remembering or understanding… All sorts of statistical facts about the huge databases are embodied in the neural nets, but these statistics merely relate words to other words, not to ideas. There’s no attempt to create internal structures that could be thought of as ideas, images, memories, or experiences. Such mental etherea are still far too elusive to deal with computationally, and so, as a substitute, fast and sophisticated statistical word-clustering algorithms are used. But the results of such techniques are no match for actually having ideas involved as one reads, understands, creates, modifies, and judges a piece of writing.

Where Hofstadter could be wrong, though, is when he says, “Having ever more “big data” won’t bring you any closer to understanding, since understanding involves having ideas, and lack of ideas is the root of all the problems for machine translation today. So I would venture that bigger databases—even vastly bigger ones—won’t turn the trick”. What Kamusi proposes is not just the growth of data for all languages, but a transformation of the type of data collected. Instead of figuring out how to convert words from Language A to Language B, having users clarify their meanings on the source side will in fact attach ideas to the way they are phrased. The machine task, then, is not for computers to figure out what people are saying, but for people to figure out what people are saying, and for computers to retain that normally ephemeral information and reconstitute it when others want to say something similar down the road. Perfect translation will never be possible, but learning the ideas underlying texts from speakers of their own languages, and aligning those ideas with translation terms validated by bilinguals in target languages, has the potential to ratchet the believability of the output to a level not attainable with current automated inferences.

Efficiency¹⁶¹

Processing speed is not my department, but I suggest that our methods could lead to notable improvements for MT.

Picture 52.2: I coined the term “trumplation” sometime in the spring of 2019. Soon thereafter, I developed the sham map in Picture 33 to further illuminate points about fake data. Amazingly, the US Child-abductor-in-Chief personally combined these two threads in September 2019. He made factually false claims that the trajectory of Hurricane Dorian would bring it to Alabama (the state labelled “AL” in the image). When the National Weather Service issued an immediate rebuttal, the criminal mastermind doubled down on his falsehood. Instead of admitting he had been mistaken, he produced the fake map in this image, a possible trajectory issued much earlier in the storm’s journey across the Atlantic, with an extra loop hand drawn with a black Sharpie marker to show the fantasy movement of the storm into Alabama airspace. Life imitates art – and it turns out fake weather forecasts are illegal. Photo Credit: via GIPHY

Picture 52.3: In this video, Japanese YouTuber “An Odd World of Mine / ぼくのオッドワールド” discovers trumplation. Watch to see English evolve before your eyes.

MT involves a lot of processing power, much of which is wasted. This waste occurs in four areas:

1. The magic wand. An enormous amount of energy is sunk into recalculating output with every keystroke a user types. GT uses tremendous processing power on trumplations, blasting out guesses for words that have nothing to do with what the user is writing (e.g., it will attempt “us”, “use” and “user” while the user is en route to “users”), and guesses part way through unfinished thoughts (e.g., translating “user” when the user is en route to “user interface”). Type this sentence that has exactly a hundred characters excluding spaces into GT at a moderate speed to see it in action. Google made 100 calculations for the 100 letters you just typed, of which 99 were of no value to you. (Typing quickly will reduce the number of calculations, but making and correcting typos will increase the number.) Though estimates for the energy burned in the round trip between your device and the MT server farm are unreliable, there actually is a notable ecological cost in trillions of wasted computations. At the very least, energy could be saved by increasing latency, anathema as that might sound to some ears. Greater efficiency would be gained by calculating whole words after a space has been keyed, and still greater by waiting for a punctuation mark. There is not really a point in calculating anything, though, until the user has input their entire text, then either waited a few seconds or proactively pressed a button. Everything prior is showmanship, not service, producing the illusion that the 99 wasted computations build cumulatively toward a highly refined ultimate output. Turning off the magic wand would have no effect on the final result, allowing resources to be focused on actual demand.

2. Calculating vocabulary. For polysemous terms, GT makes estimations about what vocabulary to use in the target language based on associations it finds across corpora. As my research shows across 108 languages in the empirical evaluation, chances of GT choosing the wrong sense are quite high, even for the best languages, and most especially for the preponderance of party terms that have not been lexicalized together. The cost of such errors is either befuddling translations shipped as is, or lengthy time post-editing “a jumble made of [language] ingredients”. By having users confirm their sense before translation (or, in the future, by directing the computer to known translations when the context is evident), the load of guessing and correcting vocabulary errors can be substantially reduced. The argument here is not the efficiency of the immediate process within the machine, but the overall efficiency of achieving a dependable outcome.

Figure 6: Polysemy resolved by source side disambiguation, where the user selects the specific sense that pertains.

Figure 4 showed that the likelihood of vocabulary failure in MT increases as a function of polysemy, with the top 100 words in English having an average of over 15 senses. Figure 6 shows that interjecting a human eye at the critical juncture will pull the error rate from (x-1)/x toward zero. In Language A to Language C scenarios with English as the pivot, the GT error rate can be expressed as ((E_A1+E_A2+… E_Ax-1)/(E_A1+E_A2+… E_Ax)) (with caveats that not all senses have equal weight, and multiple senses might be translated with a single expression), while in Kamusi that little green checkbox in Figure 6 leads directly to the concept set that houses the confirmed semantic link between terms A and C. Most errors that arise from computational inference are thereby eliminated. All else being equal, if each of 3 senses in Language A maps to English terms that also have 3 senses apiece, GT will have an error rate in Language C of 89%, and if there are 15 senses per term, with a 1/225 chance that the ball falls through the pinball machine into the right hole, the error rate will climb to about 99.56%. If we assume (without hard numbers¹⁶²) that users always select the correct sense using SlowBrew, then, because terms are semantically aligned across languages, the error rate will approach zero.¹⁶³
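The arithmetic behind those figures can be checked in a few lines. The sketch assumes, as the paragraph does, that every sense is equally likely and that exactly one path through the pivot is correct; the function names are invented for the example.

```python
from fractions import Fraction


def direct_error_rate(senses):
    """Chance of a wrong pick among equally weighted senses: (x-1)/x."""
    return 1 - Fraction(1, senses)


def pivot_error_rate(senses_source, senses_pivot):
    """Chance of a wrong pick when every source sense fans out again through a pivot."""
    paths = senses_source * senses_pivot
    return 1 - Fraction(1, paths)


three = pivot_error_rate(3, 3)      # 8/9
fifteen = pivot_error_rate(15, 15)  # 224/225
print(three, float(three))
print(fifteen, float(fifteen))
```

Using exact fractions avoids rounding disputes: the two cases come out to 8/9 (about 89%) and 224/225.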

3. Multilingual simultaneous translation. Currently, when one document is translated into several languages, translators for each language need to make independent judgements about what was intended on the source side. The disambiguation that occurs within SlowBrew can be performed once, and then applied to any number of languages. In principle, a newspaper could pre-translate its articles once, and publish a marked up version that could be siphoned through an MT membrane for any language. Professional post-editing would obviously be preferable, but the clarification of a knowledgeable reader of the source language would eliminate a significant proportion of errors and provide the base vocabulary needed for at least MTyJ translation. A single click of the green checkbox in Figure 6 leads to the correct vocabulary not only in Language B, but also in languages C, D, and onward. The more languages for which the translation is needed, the greater the economy of scale. This is especially beneficial when the source language is not English, and human translators and machine models are rare among the desired pairs.

4. Source-side rules. In my analysis, it is clear that GT conducts some amount of rule-based NLP on the source side prior to passing text to SMT or NMT. Going from English, GT recognizes features such as part of speech and tense, and tends to map those features identically to sample target languages. The deficiency here is that high-quality NLP is restricted to top tier languages. Within GT, analysis of languages like Swahili is execrable. Yet, NLP is just as important to getting it right for non-lucrative languages as it is for those that have benefited from substantial investment. For example, one would not create a table with every possible form (around 18,576,000 per¹⁶⁴) for every known Swahili verb except “to be”, “to have”, and “to rain”, because one would need to store and access billions of terms, many with 10 characters or more, and all their mappings. Instead, I have written a set of a few hundred rules that can parse any Swahili verb into its component parts, isolating the concept, the tense, the subject, the objects, whether it is positive or negative, the noun class that affects other aspects of the sentence, and extended constructions such as passivity and a grammatical question mark. Kinyarwanda, which has extra features and more noun classes, has nearly 900,000,000 forms per verb. Encoded rules reduce the storage and search requirements from trillions to thousands of forms, while increasing accuracy by pinpointing grammatical elements that can be reconstructed across translation languages. Bantu languages with a similar agglutinative structure to Swahili and Kinyarwanda constitute more than 5% of the world’s total, so this one example among many of untreated rules is non-trivial. 
A specialist in translation technology for under-resourced languages urged that I emphasize the positive advantages rules offer over NMT, which he also notes cannot possibly be engaged for 99% of languages, writing in his review of this article, “What we need is a lexicon and a grammar of the language. No written text [corpus] is necessarily needed, but helpful. Rule-based approaches offer alternatives, which work on any language. One should start to use brains again.” The problem here is financial, not technical, because the work is straightforward but needs the time of experts who would expect compensation for their efforts. With just weeks of work per language to delimit and encode the essential rules, any language can have an efficient model for NLP on the source side (Ranta 2011). Does any funder choose to develop the rules necessary for NLP for languages spoken by people that, to date, have not been considered important?
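To make the storage argument in item 4 concrete, here is a deliberately tiny rule-based parser for Swahili verb forms. It is not the rule set described above: it assumes only six subject prefixes, four affirmative tense markers, and five object markers for people, ignores other noun classes, negation, relatives, and verb extensions, and will mis-split stems that happen to begin with an object-marker string. All names in the code are hypothetical.

```python
import re

# Toy affix tables (affirmative forms, person markers only).
SUBJECTS = {"ni": "1sg", "u": "2sg", "a": "3sg", "tu": "1pl", "m": "2pl", "wa": "3pl"}
TENSES = {"na": "present", "li": "past", "ta": "future", "me": "perfect"}
OBJECTS = {"ni": "1sg", "ku": "2sg", "m": "3sg", "tu": "1pl", "wa": "3pl"}

# subject prefix + tense marker + optional object marker + stem
VERB = re.compile(
    r"^(?P<subject>ni|u|a|tu|m|wa)"
    r"(?P<tense>na|li|ta|me)"
    r"(?P<object>ni|ku|m|tu|wa)?"
    r"(?P<stem>.+)$"
)


def parse_verb(form):
    """Split a conjugated verb into labeled parts, or return None if no rule fits."""
    match = VERB.match(form)
    if match is None:
        return None
    parts = match.groupdict()
    return {
        "subject": SUBJECTS[parts["subject"]],
        "tense": TENSES[parts["tense"]],
        "object": OBJECTS.get(parts["object"]),  # None when no object marker
        "stem": parts["stem"],
    }


print(parse_verb("ninakupenda"))  # ni-na-ku-penda
print(parse_verb("alisoma"))      # a-li-soma
print(parse_verb("tutawaona"))    # tu-ta-wa-ona
```

Even these three small tables cover 6 × 4 × 6 = 144 prefix combinations for every stem without storing a single conjugated form, which is the economy the rules-versus-tables comparison relies on.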

Learning terms from users¹⁶⁵

The primordial element needed to launch MT for a language is the set of terms in that language, and how those terms map to other languages. While this article is not the place to discuss the details (Yakutsk was a better venue), it is important to mention that 2019 marks the launch of systems within Kamusi to intake data from users for any of the roughly 7000 languages identified with ISO 639-3 code designations. Users contribute equivalent expressions for their language from concepts associated with specific English definitions, from Wordnet, Wiktionary, and a variety of other sources. The relationships between words are encoded numerically; we prompt the user with a linguistic representation of the term, e.g. dry (not wet) or dry (not boring), but underneath we are asking them for equivalents for k-1112 or k-2346, and converting their response for each spelling/meaning unit in Piedmontese or Potawatomi to its own unique smurf. Answers are validated through a consensus model. Each validated term becomes part of a Data Unified Concept Knowledge Set (DUCKS) that is linked as numerical data across languages, so a term that is contributed in reference to English is automatically aligned to the expression of that same idea in all other languages.

At the time of release of Teach You Backwards, Kamusi has five active tools, with more under development, for acquiring and validating terms and additional lexical data from the public for any of the roughly 7000 languages with ISO 639-3 codes [if you are reading this bracketed text, the tools are available on Android and iPhone, but have not yet been implemented on the Web]:

GOLDbox (Pictures 53.3 and 53.4). When a dictionary user encounters a concept that we have in the language they are searching from but is missing in the language they are searching to, they can suggest an equivalent for the target language. Similarly, they can suggest own-language definitions for any term that does not yet have one [though our validation system for definitions has not yet been finalized if you are reading this text].
WordRace (Picture 53.5). Game players are presented with an English term and its definition, and asked to type the best equivalent in their language. A stopwatch shows them their elapsed time. After the same term has been proposed by a number of players, those players are awarded points based on who produced the answer in the least time.
IdeaPacks (Pictures 53.7, 53.8, and 53.9). Users select sets of words that are organized based on topic, such as body parts, restaurant menus, or soccer terms. They provide equivalents for the terms they know, and are awarded points when their answers achieve consensus.
QuizMe (Picture 53.6). In this game, we present a term that has been proposed through one of the intake systems, alongside another word chosen randomly from the target language, and the English definition. Players are asked which term, if any, matches the definition they see. In this way, we can speed through validation, reaching consensus quickly even for terms that have only been suggested once. This game is especially important for validating synonyms and rejecting bad data. For example, we might see “lunettes” many times in WordRace as the French equivalent for “sunglasses”, but only see “jumelles” once. Since “jumelles” is a valid but less frequent term for the concept, QuizMe lets us put it before enough French eyes that we can confirm that it is also a legitimate expression. On the other hand, if someone were to propose “pamplemousse” as a French equivalent, QuizMe players will quickly vote it off the island, and people seen to be playing in bad faith are blocked and their contributions removed.
DUCKS (Picture 53.10): An additional system is designed to extract lexical data from diverse datasets for thousands of languages, especially from the 10,000 sources digitized at PanLex, and engages users in matching those items to Kamusi DUCKS (presented in , with the slides and audio below). In short, an available dataset might say that “ɓeeɓinde” in Fula translates as “dry” in English. Instead of guessing, we show players the possible senses of “dry”, and people get points when they choose the same answer as other speakers of their language. When consensus is achieved, “ɓeeɓinde” with that meaning receives its own smurf, and Fula enters the set of languages that can mutually translate the given sense of “dry”.
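The tools above all lean on a consensus model whose details are not specified here. One plausible shape for it, offered only as a sketch, is a vote count with a minimum number of agreeing players and a minimum share of all answers. The function name, both thresholds, and the sample votes are assumptions, not Kamusi's actual parameters.

```python
from collections import Counter


def consensus(answers, min_votes=3, min_share=0.6):
    """Return the winning answer once enough players agree, else None.

    min_votes and min_share are invented thresholds for illustration.
    """
    cleaned = [answer.strip() for answer in answers if answer.strip()]
    if not cleaned:
        return None
    top, votes = Counter(cleaned).most_common(1)[0]
    if votes >= min_votes and votes / len(cleaned) >= min_share:
        return top
    return None


# Invented votes from players asked which sense of "dry" a dataset entry carries.
settled = consensus(["k-1112", "k-1112", "k-2346", "k-1112"])
pending = consensus(["k-1112", "k-2346"])
print(settled, pending)
```

A `None` result simply means the item stays in play, for instance by being shown to more players in a validation game.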

Picture 53.1: Kamusi GOLD (Global Online Living Dictionary) search, with the option to select among any of 7000 languages. When that data is missing (as it usually is for languages that financing agencies do not deem valuable), Kamusi asks knowledgeable users to supply the term.
Picture 53.2: Screenshot of Kamusi GOLD result for the concept of an aristocratic family, from Romanian to Xhosa.
Picture 53.3: Unlike Google Translate, Kamusi admits its deficiencies. In this screenshot, GOLDbox asks Bulgarian speakers to provide a term that is equivalent to the Romanian term for an aristocratic family. The border between Romania and Bulgaria is 631 kilometers / 392 miles, but no Romanian-Bulgarian dictionary is on sale in the bookstores I’ve checked in Bucharest, and the extrapolated Tarzan score between the two languages in GT is 33.
Picture 53.4: Users can rate existing definitions and propose new ones for terms in Kamusi. Forthcoming programming will create a definition validation game, after which approved crowdsourced definitions will be displayed to the public.
Picture 53.5: Screenshot of WordRace in the Kamusi WordUp! games. Players race to have the fastest time for the answer that eventually wins consensus.
Picture 53.6: Screenshot of QuizMe, playing for Swahili, in Kamusi WordUp! games. Players must choose between an answer supplied by another user, or one randomly selected by the computer. This validates the good contributions and eliminates the bad.
Picture 53.7: Screenshot of the Idea Packs that a user has selected to play in the Kamusi WordUp! games.
Picture 53.8: Screenshot of the terms in an Idea Pack in the Kamusi WordUp! games.
Picture 53.9: Screenshot of a term from the “cats” Idea Pack in Kamusi WordUp!, for which a player can propose an equivalent.
Picture 53.10: DUCKS game to align datasets with potential matches from the universal “data unified concept knowledge set” in Kamusi.

Asking people for the equivalents of English terms does not solve the problem of capturing terms that are indigenous to languages other than English, but should get us a substantial way toward developing base vocabularies for languages without adequate corpora, and augmenting the vocabularies for languages that already have substantial data. Aligning existing external datasets with DUCKS can uncover some indigenous terms, such as “muukumaaka” in the Fula dictionary that means “chewing or eating in an exaggerated manner”. The lack of a potential English match in Kamusi flags it for consideration as a new concept for the universal knowledge set, with an explanatory English equivalent like “dramatic chewing” (which my inner anthropologist guesses is performed to show appreciation for good food, like slurping noodles in Japan) produced so that non-Fula speakers can have an angle into the concept. Otherwise, unfortunately, for the many languages without corpora to mine, indigenous concepts can only become evident to the system if a speaker of the language notices the omission and takes the time to submit it.

Importantly, no user contribution is considered the last word – the option always exists to add additional senses for an expression, or to edit faulty information, and terms are shown with source information to establish provenance, and caveats to alert users that items in their results box are less omniscient than they appear. Each piece of knowledge that we can learn from users is banked, made available for use in applications such as SlowBrew, and often used as the kernel to gather more information such as inflections and own-language definitions.

DUCKS in a Row from EPFL (École polytechnique fédérale de Lausanne)

https://scholarspace.manoa.hawaii.edu/bitstream/10125/41982/1/41982.mp3

An additional aspect of linguistic expression that can be treated through knowledge capture is one that Google embraces for English, but eschews for translation. Each language has an inventory of stock phrases, things that are said repeatedly in the same way. Similar to party terms, stock phrases can be translated from the outset by people who understand the source context and the nuance necessary to convey the essence in the target language, and those stock translations can be used (with grammatical modifications provided by NLP) when a user inputs the phrase. Google has an entire bureau that identifies stock phrases for Google Assistant, their competitor to Amazon’s Alexa, that is intended to provide an intelligent response to any question or command you might voice. Take an example from their own homepage: when a user says, “Hey Google, dim the bedroom lights”, their machines do not perform deep processing to figure out the components of the sentence and how they relate to each other and how that relates to the required response. Rather, some person has isolated the phrase and specified a chain of events that the phrase should activate – in this case, issuing a command to the user’s “smart home” system.

Picture 53.11: On the left, examples of stock phrases for which Google Assistant is manually trained with responses. On the right, an example of a fiasco when Google Assistant seeks a response algorithmically rather than based on human review.

In Picture 53.11, we can see a number of cases where the Assistant team had some fun writing their canned responses. They obviously identified popular queries: “How many roads must a man walk down” from the Bob Dylan song “Blowin’ in the Wind”, “Do you want to build a snowman” from an obscure arthouse movie called Frozen, “To be or not to be” from Hamlet. On the other hand, the right side of Picture 53.11 shows that queries that have not been reviewed by people are sent to the machine for some sort of relevance prediction, with results that can miss the mark catastrophically. If “dim the bedroom lights” and “to be or not to be” can be identified as stock phrases for people to assign actions to, they can also be identified as candidates for stock translations. A system to acquire validated stock translations from users will be introduced in the next generation of Kamusi Here, beginning with phrases to localize the app’s own user experience. The video below shows a number of instances where stock phrases are butchered by GT, and where enthusiastic members of the public could be enticed to provide well-considered stock translations:

To the greatest extent possible, MT should not be based on intuition, because no amount of data is sufficient to verify that intuition. MT should be based as much as possible on facts. Messy dictionary data to underlie translation that is captured from existing digital sources should be verified by people, not groped at by algorithms. Language communities should have the tools to grow their own data, rather than wait for the day when a corporation will take interest in their needs. MT should declare when it does not have the data to perform a translation, instead of supplying fake filler text. Humans should have the option to verify the senses they have in mind for the texts they produce on the source side, rather than have machines make haphazard guesses. Performance should be judged by the way that MT transmits meaning across languages, not the speed with which it places words on the screen, which is performance art. Computers should learn meanings from the declarations of a language’s speakers, rather than word associations based on shallow inferences (ironically labelled “deep learning”) across corpora. Currently, the ability of the user to select English to 67 of the languages in the GT system fails to produce a transmission of meaning more than half the time, and my tests indicate that only about 1% of 5151 non-English pairs will yield MTyJ scores of 50 or greater. We language technologists can do better, and we at Kamusi Labs are developing the systems described above with that goal in mind.

Picture 54: Kamusi users can contribute for any language with an ISO 639-3 code, by selecting any local language name compiled from all 50,642 spellings and scripts collected by CLDR

Conclusions: Real Data, Fake Data & Google Translate

Martin Benjamin — Sat, 30 Mar 2019 16:52:18 +0000

Facts matter. In 1962, Der Spiegel published a fact-based article that led to mass demonstrations, ended the career of the German defense minister, and marked a turning point in Germany’s embrace of a free press as foundational to a democratic society. Fifty-six years later, randomly spinning the wheel for German test material for this study, I landed on a Wikipedia article about Der Spiegel that discussed the 1962 scandal, shown in Picture 15 and discussed in the empirical evaluation. About two months after that, Der Spiegel was shaken to its core when it emerged that one of its star reporters had been fabricating stories for years. In its own reporting, the magazine says its reporter “produces beautifully narrated fiction. Truth and lies are mixed together in his articles and some, at least according to him, were even cleanly reported and free of fabrication. Others, he admits, were embellished with fudged quotes and other made-up facts. Still others were entirely fabricated”. The editors released a long mea culpa, stating:

We want to know exactly what happened and why, so that it can never happen again…. We are deeply sorry about what has happened…. We understand the gravity of the situation. And we will do everything we can to learn from our mistakes… We allowed ourselves to be defrauded: the top editors, the section editors and the fact checkers… Our fact checkers examine each story closely on the search for errors. Now we know that the system is flawed… For a fact checker back home, it’s not always easy to determine if assertions in a story are true or false… We will investigate the case with the humility it requires. That is something we owe you, our readers. We love our magazine, DER SPIEGEL, and we are extremely sorry that we could not spare it, our dear old friend, from this crisis.

In comparison, Google Translate processes 100 billion words a day. GT makes error after error in all 108 languages it treats, often fabricating words out of thin air, as analyzed throughout this report. It has no effective method to fact-check its proposed translations, as shown in my four-year test across dozens of languages, discussed as Myth 5 in the qualitative analysis. Its results, though, are presented to the public without humility, as the pinnacle of advanced artificial intelligence in language technology. Where Der Spiegel experienced a foundational crisis when one reporter was found to be inventing fake news, GT’s entire production model is built upon invented facts.

What is the context in which people use Google Translate?
What does Google Translate do? Scientific measurements of GT across all its 108 languages.
Why doesn’t Google Translate do much of what it says it does?
Why can’t Google Translate accomplish what it says it does?
How could more effective translation be accomplished?
So what? What is wrong with Google Translate not doing what it claims? (You are here)
Google Translate sometimes gets it right. How should it be used as a helpful tool?

Picture 55 displays a typical translation scenario for GT, a tweet from a British dad translated to French. GT gets a lot of the words right, but makes major errors with the party terms “spring in her step”,¹⁶⁶ “fun fest”, and “in store” – the translation points toward a springtime festival at the mall, instead of cheerful father-daughter time. The verb “had” and its implicit subject “I” are perfectly rendered, while the verb “put” and its use with “a smile on her face” puts a frown on the face of a Parisian informant: “Definitely not. I would say ‘ça l’a fait sourire’” (it made her smile). Taken in total, the output works at the MTyJ level, earning a middling 33.13 BLEU score, because the central sentiment “Had a wonderful weekend with my daughter” sails through with flying colors. The remainder of the tweet is a composite of fake data that would get a student an “F” and get a translator fired. [Update from November 2023: Automatic translation of the tweet in Picture 55, using GPT 4 on Bing, is almost identical to the translation produced by Google more than four years ago, with the exception that the word “ressort” was chosen for spring instead of “printemps”. AI large language models have taken over the public imagination, and are producing better results in certain situations than translations from the neural networks discussed in this work, but the near-identical results for this tweet show that machine translation has not been solved.]

What Hofstadter documented from German, GT’s second highest scoring language, to English holds for the text above for English to French: “It doesn’t mean what the original means—it’s not even in the same ballpark. It just consists of English words haphazardly triggered by the German words. Is that all it takes for a piece of output to deserve the label ‘translation’?” Der Spiegel’s editors would print an immediate correction and apology if they discovered themselves printing such misinformation.

The consequences of mistranslation can be quite high. This is a note from the wife of a political prisoner in Egypt about attempts to arrange her first visit to her husband in five months:

!لا نزلهم نمر، ولا علقوا تعليمات تخصهم (رغم ان كل السجون التانية اتعلقلها تعليمات عالبوابة) ونمر مصلحة السجون الارضي ما بتجمعش ابدا

Picture 55: An English tweet translated to French by GT. Justin Naughton (@HitchinCavalier)

The Google “translation” from Arabic to English has a word stew that is clearly related to prison, but might lead toward danger in the context of trying to deal with a totalitarian regime: “We do not put them down as a tiger, nor do they post instructions pertaining to them (although all other prisons are subject to instructions on the gate), and the ground prison authority does not assemble at all!” And yet, GT unhesitatingly offers itself as a tool for the justice system, and is uncritically used as a legitimate linguistic authority by police and courts around the world.

Apologists might object that the tweet in Picture 55 is too informal, or that the text attempted by Hofstadter was in some way atypical. A tenured computational linguist reacted to a draft of this report, “I still think that your use of what you call “party terms” like “out of milk” makes for a very high bar for a translator to surmount.” What, then, is the bar? If translating “in store” as “predicted” is too high, then your local weather forecast with “rain in store” for tomorrow is off limits for MT. A human who could not translate “in store” could never hold down a job as a translator. As an academic exercise, it is fine to experiment with the extent to which a translation engine can recognize when “in store” is related to shopping venues. Since MT claims to compete in the same space as human translation, though, it is reasonable for consumers of a commercial product like GT to expect that it will handle whatever correctly-composed text they put before it.

Picture 55a: Google Translate version of the famous Italian song “Bella Ciao”, in comparison to a human translation. Is “goodbye” too big an ask for machine translation?

Granted, chat messages peppered with abbreviations and irregularities might require natural intelligence to decipher, like “La dr sto la ora 13 data 7august”,¹⁶⁷ but are goodbyes too high a bar for machines? GT manhandles common farewells such as the French “à toute” (GT: to all) and “à tout à l’heure” (verified wrongly by the GT community as “right away”), or the Swahili “kwa heri” (GT: for good), “kwa herini” (GT: to the herd) and “tutaonana” (GT: we will see, shown in Picture 37). The problem extends throughout the NMT universe – regard, for example, how DeepL translates the essential closing “Regards” from English to French. And the problem extends to each and every language GT covers. “Vice”, for example, is “uvane” in Norwegian and Danish, spelled as “ovana” in Swedish, but given as the lei lie “vice” for all three in Googlenavian. Making up stuff about common words and phrases is not a virtue.

Objections that such-and-such a text has some unusual aspect that makes it unfairly tricky for GT, of course, undercut the claim that it has omniscient command of that language. For successful public translation, it should not be incumbent on users to monitor their own expressions so as not to confuse the program inadvertently with any old normal speech, such as the tweets in Table 2. In practice, while discovering text that GT cannot handle is easy as pie, manipulating text on the source side to nudge the target output into shape can involve considerable backing and forthing by a knowledgeable user.

Picture 56: Airbnb check-in instructions translated by GT

One reviewer proposes that “there are controlled languages and for controlled languages, MT can be error-free”. That is, perfect translations are possible if the vocabulary and set of concepts are kept within a discrete domain. Having written rejected proposals for international agencies to seek funding along those lines for refugee and emergency communications, and having apps for other domains inching through the lab, I agree that stellar translation is possible among a consistent stock of phrases that are invoked in restricted ways.¹⁶⁸ For example, a weather vocabulary can be built so that weather sites can slot in the local terms expressing a limited range of occurrences, such as “[Rain], [heavy at times], [continuing until] [12:00]”.
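The slot-filling idea in the weather example can be sketched in a few lines of Python. This is a minimal illustration, not any existing system: the `PHRASES` table, the slot keys, and the `render_forecast` function are all hypothetical, and the French strings are placeholders standing in for translations that a human would have to supply and verify.

```python
# Hypothetical controlled vocabulary: each slot value carries a translation
# per language. The French strings are illustrative placeholders only and
# would need to be supplied and verified by human translators.
PHRASES = {
    "rain": {"en": "Rain", "fr": "Pluie"},
    "heavy_at_times": {"en": "heavy at times", "fr": "parfois forte"},
    "continuing_until": {"en": "continuing until", "fr": "jusqu'à"},
}


def render_forecast(slot_keys, time, lang):
    """Assemble a forecast from stock phrases only.

    Raises KeyError instead of guessing when a phrase is missing from the
    controlled vocabulary for the requested language.
    """
    parts = []
    for key in slot_keys:
        entry = PHRASES.get(key)
        if entry is None or lang not in entry:
            raise KeyError(f"no verified phrase for {key!r} in {lang!r}")
        parts.append(entry[lang])
    # Mirrors the pattern "[Rain], [heavy at times], [continuing until] [12:00]"
    return ", ".join(parts) + " " + time


slots = ["rain", "heavy_at_times", "continuing_until"]
print(render_forecast(slots, "12:00", "en"))
print(render_forecast(slots, "12:00", "fr"))
```

The design choice that matters is the refusal: a request for a phrase outside the curated stock fails loudly rather than producing improvised output, which is what makes the controlled domain reliable.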

Problems occur the minute one ventures away from a human-curated controlled vocabulary. Malinda Kathleen Reese puts GT through its paces by running a text from English to another language, or a few, and then back to English. (Because she does not leave breadcrumbs about the languages involved, her results are not reproducible.) In the video above, she used standard weather report verbiage for her source material. The closest GT came to adhering to the original was that “Average winds steady from the northeast at 9 miles per hour” came back as “A major windstorm in the northeast will be prolonged to 9 years.” Similar drift can be seen in Picture 56, where a controlled situation with millions of texts for training data produces a Tarzan arrival situation for Airbnb – fortunately, in this case, without making a crucial mistake as in Picture 60 that could have left a guest wandering the streets in the snow. In most circumstances, though, neither the user’s vocabulary nor their phraseology can be controlled, and GT does not attempt to do so. The selling point of MT is that it accommodates the language of the user, not that users adjust their language to the machine. In this respect, the blame-the-user argument that certain language is too casual and that other language is too this-or-that falls apart prima facie as an excuse for poor performance. The idea that users should control how they talk to machines makes their language unnatural – a fine project to enhance the way humans and computers interact, but, by definition, not true Natural Language Processing.

Picture 56.1: Drugs.com finds it necessary to warn visitors against relying on the site for “medical advice, diagnosis or treatment”, and links to a more detailed disclaimer from thousands of its pages. Google Translate is used by doctors and other professional staff as a tool to perform diagnoses and dispense instructions and advice in medical situations worldwide, with no such disclaimer.

For one aspect of translation, the medical domain, I would suggest that Google is under an ethical obligation, if not a legal one, to invest however many millions are needed to offer a functional product.¹⁶⁹ With GT on countless Android devices, and their aggressive PR campaign to implant the idea that they provide reliable translations, the service is often used in medical settings where the patients and the caregivers do not speak the same language. In a study published in the Journal of the American Medical Association, 8% of hospital discharge instructions translated from English to Chinese were found to have the potential for “serious harm”. Chinese achieved a Tarzan score of 65 on TYB tests, along with 5 other languages. The medical translations will clearly get worse for the 86 languages that scored lower, such as for languages with large immigrant communities in the US like Hmong (Tarzan = 40, see Fadiman for a gripping ethnographic portrait of Hmong interactions with US health care, including language difficulties), Filipino (35), Somali (30), and Haitian Creole (0). A study in the British Medical Journal of GT translations of medical phrases from English to 26 languages, which over-represented well-resourced languages that tested highest in TYB, found that, on a binary scale, 42.3% were “wrong” – including a Swahili instance that was translated as, “Your child is dead”. The problem is exacerbated in continental Europe, where English is usually not one of the languages in host/immigrant interactions, and GT will fail. (Note: this is presented as a definitive statement, not that GT will “probably” fail. No doubt the occasional medical phrase will survive passage in GT from Norwegian to Kurdish, but there is no way for either participant to know whether any given meaning has been adequately communicated, and the preponderance will be garbled beyond intelligibility.)

A personal experience: After release from the hospital from a serious bike accident, I went for a follow-up consultation with a doctor near where I live in French-speaking Switzerland. Not considering a medical appointment as the best opportunity to test my French, I chose a doctor who listed English as one of his credentials. As we discussed the period of time I had been unconscious in the middle of the road, he groped for a term that he only knew in French. “Google Translate!”, he exclaimed, and turned to his computer. He got his answer and read it to me triumphantly. What he said had no particular meaning in English. He repeated it three times. Three times I shook my head. Finally, he explained the underlying concept, outside of the words that GT had fed him, and I understood. This situation only involved three words, between two people who had somewhat of a command of both languages, with two people in a similar social station (educated western white males of a certain age, as opposed to many situations I have witnessed in the course of medical anthropology research where a doctor steamrolls over a poor African woman who feels completely cowed by the power imbalance), in one of GT’s best language pairs. Communication of the underlying idea using GT, which fortunately was not critical to the health outcome, was not achieved.

For languages at the bottom of the scale, the potential for serious harm is 100% of instructions that could cause harm if not followed correctly. Google does not have a medical license, and they do not have poetic license to manhandle medical terms. Having planted themselves in the middle of life-or-death considerations (does the doctor understand the symptoms? does the patient understand the prescription instructions?), it is incumbent on Google to get those translations right. 8% to 100% risk of harm for 91% of their languages vis-à-vis English, and closer to 100% for 5149 non-English pairs, is medical malpractice. Improving their controlled medical translations would require gleaning terminology and typical phrases from the domain, and paying professionals to translate those items to 101 languages (Latin does not pertain). The cost does not matter – Google has established the notion that their translations are valid in medical situations, so they should spend however many minutes of next Tuesday morning’s profits it takes to fulfill that promise. Anything less is a violation of their professed code of conduct, “Do the right thing”. Until they do, every GT translation should be printed with the same sort of warning that comes on a bottle of untested dietary supplements: THIS PRODUCT HAS NOT BEEN APPROVED FOR MEDICAL PURPOSES. When research in two leading medical journals, JAMA and the BMJ, converges on the same proscription, it is imperative to pay attention: “Google Translate should not be used for taking consent for surgery, procedures, or research from patients or relatives unless all avenues to find a human translator have been exhausted, and the procedure is clinically urgent”.

Picture 56.2: Each episode of the Netflix series “Diagnosis” begins with a strict warning that the show must not replace professional expertise for medical diagnosis or treatment.

Picture 56.3: Machine translation displays could include a warning, as simulated in this image, that alerts users to potential harm from using the product. Additionally, an honest evaluation rating of the chosen translation pair (printed in the chosen languages, of course) would help users anticipate the overall reliability of MT results.

Google welcomes visitors to its homepage with a button saying “I’m feeling lucky” that usually leads to the first result for their search. Do you ever click it? Certainly not. You prefer using your own intelligence to sift through for the result that most appropriately matches the search you have in mind. Google maintains the button for the sake of tradition, but essentially put it out of service in 2010 when they introduced Google Instant to provide as-you-type search predictions. The company realized that their strong suit was parsing search strings in a way that reveals the most likely candidates, rather than choosing which candidate the user will ultimately select. Yet “I’m feeling lucky” is the guiding precept of GT, although translations typically involve search strings that are more complex or ambiguous or unusual than most web searches – about 66% of web searches are 3 words or fewer, and frequently skew toward known terms such as “weather” to which Google can attach certain behaviors based on past user clicks and other heuristics. GT pretends that “I’m feeling lucky” is a legitimate strategy for translation output, and the public, the media, and even many ICT professionals fall for that claim. When people do not speak the language they ask GT to translate toward, they have no way to use their own intelligence to evaluate the results, so they trust what the computer tells them. In a phenomenon I call Better-than-English Derangement Syndrome (BEDS), people are well aware that Google still falls short at NLP tasks in English – such as their Android phone replacing “was” with the unlikely “wad” as they type – but suspend their disbelief for other languages, even though those languages have far less actionable data and linguistic modeling research. Based on my empirical research across GT’s full slate of 108 languages, our trust has not been earned.

Picture 57: A paragraph from a news article from the Yiddish Daily Forward, translated by GT. Saturday is Monday, and February is also September, so book your travel accordingly.

Some people forgive the shortfalls in GT with the expectation that the service is getting better all the time. “Continuous updates. Behind the scenes, Translation API [the paid version of GT] is learning from logs analysis and human translation examples. Existing language pairs improve,” Google tells its Cloud customers. GT is undoubtedly improving in its high-investment languages as it funnels ever more parallel human translations into its training corpus. On a purely subjective basis, news articles translated to English from German or Russian feel to me like they read better than they did a couple of years ago, and others also report a sense of improvement at the top. Even so, at holiday shopping time in 2021 on a big retailer’s German website, “getting better all the time” GT generates this nonsensical collection of English words for “Das könnte dir auch gefallen”, to guide shoppers toward additional recommended products, in big, bold letters: That could you also please

For most languages, there is little possibility of harvesting large numbers of well-translated parallel texts before the glaciers melt in the Alps. Even websites that deliver daily news in more than one language, such as the BBC and Deutsche Welle, do not line up neatly enough among versions to match with confidence which articles are discussing the same events, much less how the internal structures of the articles compare (sentence-by-sentence translations are rare). None of the articles from synchronic online versions of the English and Yiddish newspaper The Forward were the same at the time of writing, meaning there is no way to learn by scraping the web for a human version of the dumpster fire¹⁷⁰ in Picture 57. My four-year test of Google’s claims to learn from user contributions found that their “suggest an edit” feature incorporates suggested translations a maximum of 40% of the time, and a detailed analysis of the procedure by which they use the crowd to verify translations shows they deviate widely from accepted best practices for both crowdsourcing and lexicography. The section of this web-book on artificial intelligence details why AI is only conceivable as an aid to MT for the few languages with substantial parallel corpora, and the sections on the mathematics of MT enumerate the confounding factors of lexicography and linguistics that present an astronomical number of translation challenges, many orders of magnitude greater than the finite capacity of any supercomputer to account for with today’s best NMT methods. The statement that GT is continuously improving is undoubtedly true, but the implication that those incremental improvements will transform the majority of languages on the list to viability the next time you need them is undoubtedly false.

Picture 57.1: Google Search flags terms that are missing from search results.

The computer scientists who develop MT systems like GT will object vociferously to my imputations against their veracity. I do not mean to accuse them of dishonest work – for all the reasons discussed above, MT is extraordinarily complicated, and most people I know in the field dedicate long hours pushing forward on the frontiers of technology. Rather, I suggest that computer scientists have a definition of successful translation other than the smooth transfer of meaning across languages. Within the field, translation is scored like basketball. You do not expect that every shot will go in. Some shots will roll off the rim, some won’t even hit the backboard, but some will swish through the net and there will be high fives all around. NBA players consider a field goal percentage north of 50% to be very good. MT engines compete using similar metrics for success, with the winning team being the one that lands the most hoops even if the final score is low.

When Facebook’s chief AI scientist states that, “amazingly enough”, their NMT systems are capable of producing “what essentially amounts to dictionaries or translation tables from one language to another without ever having seen parallel text” (see the video and transcript), he genuinely believes that the game is won if some fraction of shots go through the hoop. At Google, Johnson et al., Schuster, Johnson, and Thorat, and Wu et al. are not lying about the results of their research. They are earnestly reporting scores that show some amount of successful approximation between the languages they measured. Tristan Greene, in an article about AI software from Facebook that created dangerous misinformation, offers this colorful counterpoint: “there’s no reasonable threshold for harmless hallucination and lying. If you make a batch of cookies made of 99 parts chocolate chips to 1 parts rat shit, you aren’t serving chocolate chip treats, you’ve just made rat shit cookies.”¹⁷¹

Picture 57.1.1: Google “translation” from Estonian to Latvian; the same result is given from the geographic cluster Estonian/ Latvian/ Lithuanian among each other and to English. Finding MUSA translations is a parlor game for the members of the Google Translate Reddit.

The falsehoods come when their experiments are presented to the world by their company as “translation between thousands of language pairs”. In fact, Google researchers measured 60% average improvements in BLEU score between English and three high-resource languages when they switched from statistical to neural machine translation – nice! – but offered no evidence regarding the 99 other languages in their system, versus English or from Language A to Language B. Nevertheless, mumbo jumbo about NMT bringing Google to near-human or better-than-human (whatever that means) quality across the board, as well as “zero-shot” bridging between languages that have no direct parallel data, has made computer scientists and the public at large believe that we have reached a golden age of automatic translation. While in certain top languages, exclusively when paired to English, GT often swishes pleasantly through the hoop, declaring those successes as victory throughout the translation universe, as measured empirically herein and also seen in virtually every non-systematic spot test (humorously exposed in Picture 57.1.1), is patently untrue.

Particularly awry is the oft-touted notion that translation can be – and by some miracle already has been – achieved by AI for languages where copious training data has not been:

Collected (from humans for most languages, in conjunction with digital corpora for the few dozen where that is feasible)
Reviewed (by humans)

A.I., most people in the tech industry would tell you, is the future of their industry, and it is improving fast thanks to … machine learning. But tech executives rarely discuss the labor-intensive process that goes into its creation. A.I. is learning from humans. Lots and lots of humans. Before an A.I. system can learn, someone has to label the data supplied to it.

AI systems to identify polyps or pornography depend on paid human labor to wade through the millions of images from which machines can learn. Kamusi has now implemented the first stage of a system that can collect granular linguistic data for all 7000 languages, to be put at the service of translation, through a suite of games and tools designed to work on any device. The work would go much faster if there were a bankroll behind it, but the principle of human collection and review remains the same with or without money:

most linguistic data has not been digitized
what has been digitized has generally not been regimented as interoperable data
the data exists in human heads
well-crafted interactive systems can enable people to transfer their knowledge to digital repositories where that knowledge can be preserved and put to use.

In the future, I expect that AI will be able to learn from a massive compendium of real, verified translations among a world of languages – but we are not there yet. We are not even close, but the prevailing fantastical belief is that GT has already brought us much of the way to translation Nirvana, and better-than-human AI-based translation for every language is just a few tweaks away.

Table 11 shows the state of the art for Google’s neural machine translation from Hawaiian to English. Hawaiian is a representative language from the lower 2/3 of my tests from English to Language X, with a Bard rating of 15, a Tarzan rating of 25, and a failure rate of 75% in that direction. The text is an authentic “hula” narrative, with simple sentences, chosen because a native speaker had published their translation. The overall BLEU score is 5.69, with the best line scoring 63.89 because it identified “the woman”. As with any text, some terms will be “out of vocabulary” for the machine, but with your natural intelligence you should be able to discern three Hawaiian words for “dancing”, understand that “Nānāhuki” is a proper noun that should not be transposed to “Nephilim” (GT correctly recognizes that “Puna” is a name), and learn from line 4 to translate “wahine” as “woman” instead of “Wife” as GT posits in line 12:

	Original Hawaiian text	Human translation by Ku‘ualoha Ho‘omanawanui	Google Translate	BLEU
1	Ke ha‘a lā Puna i ka makani	Puna is dancing in the breeze	Puna weather in the wind	19.36
2	Ha‘a ka ulu hala i Kea‘au	The hala groves at Kea‘au dance	The plant grows in Keaau	8.75
3	Ha‘a Hā‘ena me Hōpoe	Hā‘ena and Hōpoe dance	Dining with Ball	16.19
4	Ha‘a ka wahine	The woman dances	The woman left	63.89
5	‘Ami i kai o Nānāhuki	She dances at the sea of Nānāhuki	Go to the coast of Nephilim	8.17
6	Hula le‘a wale	Dancing is delightfully pleasing	Just enjoy it	16.19
7	I kai o Nānāhuki	At the sea of Nānāhuki	Sea of Nephilim	32.80
8	‘O Puna kai kūwā i ka hala	The voice of Puna resounds	Puna is a haunting past	10.68
9	Pae i ka leo o ke kai	The voice of the sea is carried	Sound the sound of the sea	27.48
10	Ke lū lā i nā pua lehua	While the lehua blossoms are being scattered	Sowing in flowering flowers	3.77
11	Nānā i kai o Hōpoe	Look towards the sea of Hōpoe	View of the Lake of Hole	9.65
12	Ka wahine ‘ami i kai o Nānāhuki	The dancing woman is below, towards Nānāhuki	The Wife of the Sea of Nephilim	5.69
13	Hula le‘a wale	Dancing is delightfully pleasing	Just enjoy it	16.19
14	I kai o Nānāhuki	At the sea of Nānāhuki	Sea of Nephilim	32.80
Table 11: Hula narrative “Ke ha‘a lā Puna i ka Makani”

Picture 57.2: Making up translations in MT is like selling motor oil in a milk bottle when the store does not have milk. Picture photoshopped from public domain images.

I propose that GT, Bing, DeepL, and every other MT engine should conform to a minimal standard: Scrabble rules. In competitive Scrabble, players are not allowed to use words that are not in an approved dictionary. For MT, there are two tests. First, does the word exist on the source side? In GT’s world view, the answer is always yes – we saw that they will find you a translation of “ooga booga” from any language in their system to any other, even though the term exists in none. Consumers might know that the translation of a nonsense term like “ooga booga” is bogus, but if they input a term that is legitimate on the source side (something like “brain scan” in Sinhala, say), but does not occur in the GT dataset, they have no way to know that Google is plugging that hole with random fluff. Second, if the word is in Google’s vocabulary on the source side, does a known equivalent exist on the target side? If a translation for “snookered” does not exist in French, for example, then an MT engine has no right to invent the verb “snooker”, much less to conjugate it as “snooké”. Rather, missing data should be flagged as such, so that users can see the gaps in translations that they cannot necessarily read. This technique has been engineered into Google Search, as seen in Picture 57.1 and, differently presented, Picture 49.1. Graphically, missing items could be shown inline using strikethrough (~~snookered~~) or tofu ☐☐☐☐ characters or shrugs, or the out-of-vocabulary items could be listed below the output. Making holes in the data transparent would increase the value of the translation, because it would show the user where to focus in the effort to achieve intelligibility.
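The two Scrabble tests can be made concrete with a toy sketch. Everything here is an assumption for illustration: the tiny `SOURCE_WORDS` and `VERIFIED` tables, the `translate_with_gaps` function, and the word-for-word lookup, which stands in for a real engine only to show where the flags would go. It does not describe how GT or any other MT system works.

```python
# Hypothetical toy data: an approved source-side word list and a table of
# verified target-side equivalents. The French entries are illustrative.
SOURCE_WORDS = {"he", "was", "snookered", "yesterday"}
VERIFIED = {"he": "il", "was": "était", "yesterday": "hier"}

TOFU = "☐"


def translate_with_gaps(text):
    """Word-for-word lookup that flags gaps instead of inventing output.

    Returns the output string, with tofu in place of each missing item,
    and a list of (token, reason) pairs to display below the output.
    Punctuation handling is deliberately left out of this sketch.
    """
    output, missing = [], []
    for token in text.lower().split():
        if token not in SOURCE_WORDS:
            # Test 1 fails: the word is not in the source-side dictionary.
            output.append(TOFU * len(token))
            missing.append((token, "not found on the source side"))
        elif token not in VERIFIED:
            # Test 2 fails: the word is known, but no equivalent is on record.
            output.append(TOFU * len(token))
            missing.append((token, "no verified equivalent on the target side"))
        else:
            output.append(VERIFIED[token])
    return " ".join(output), missing


text, gaps = translate_with_gaps("He was snookered yesterday")
print(text)
for token, reason in gaps:
    print(f"  {token}: {reason}")
```

Run on “ooga booga”, the same function returns nothing but tofu and two source-side flags, which is the honest answer.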

Picture 57.3. A screenshot of Google Maps showcasing the absence of a pathway between two endpoints, rather than inventing roads and bridges. (Yellow highlighting added.)

At Kamusi, it is a fundamental precept that if we do not have data for a particular sense of the term a user searches for, we show the English definition and tell the user that we do not have the data for their language yet, rather than leaving them to believe that, for example, the result we give them for “dry” meaning “not wet” can also be used for “dry” meaning “not funny”. A grocery delivery service that does not have milk cannot fill milk bottles with motor oil for its customers to pour on their breakfast cereal (Picture 57.2). “We don’t know” is a valid response for a computer to provide. “Snooké” is not. One company that sometimes knows this is Google, inasmuch as Picture 57.3 shows that they inform their customers when their data does not allow them to propose a driving route between two points.

I therefore propose that MT engines, not just Google, commit to displaying a confidence index for their translations. Arguably, user satisfaction would increase if users were provided with an estimate of doubt. Surely, international customers of Wish.com would rather know not to hover over their mailboxes while their orders take their sweet time moseying their way from China (Picture 57.4) than have no information with which to gauge their expectations. Translation customers would similarly appreciate an earnest gauge between believability and bunkum. I do not have a firm method in mind for how to calculate this, but it should be industry practice to provide some graphic depiction based on factors such as the amount of polysemy, the adherence to known syntactic models, and the distance within NMT between facts and inferences. People who work within MT know that every translation should be taken with some salt. The public should be able to look at the salt shaker and know how much.

Picture 57.4: Wish.com shipping notification that gives customers an estimated confidence range for delivery

Many corporations hold themselves to high standards for success. Lego, for example, has perfected its production processes to the point that only 18 bricks out of a million fail inspection – you would need to buy 55,555 bricks to see one with a chip, crack, or air bubble. The airline industry struggles to maintain on-time arrival performance against such variables as weather and mechanical issues, but that is in the cause of a greater standard at which US airlines are now nearly perfect – delivering all their passengers alive, with only one death on a large scheduled commercial flight in the US since February 2009. (Internationally, some thousands of passengers have died during this time in various large crashes.) Google, too, holds to the don’t-kill-folks philosophy in their development of self-driving cars. The company will not introduce autonomous vehicles that fail to interpret a stop sign 50% of the time, or 2%, or 18 times in a million. Google is so well aware of the necessity of unassailable data for autonomous vehicles, in fact, that they subcontract a large staff in Kenya, at $9/day, to label each element of millions of photos captured with their Street View car. Der Spiegel maintains a fact-checking staff of around 80 full-time equivalents, and considers it a crisis that that bureau failed.

Figure 7: Proportion of test translations across all 102 pre-KOTTU languages for which the results supplied by GT were unrelated to the meaning of the candidate expression

As measured in this study, GT fails 50% or more of the time from English for 67 of the languages they list in their drop-down menus, and I extrapolate this to a 99% failure for the 5149 non-English pairs that take on errors on both sides as they transition through the English pivot. For 35 languages, the gist of English source text is transmitted more than half the time. 13 of those render human-quality translations between half and two thirds of the time. Much as studies show that “where there are more guns there is more homicide”, even though many bullets only cause injury or miss entirely, in MT, where there is more data there is more translation. Failures result from numerous causes, including the prevalence of ambiguous terms and incomplete linguistic data. Perhaps most problematic, the MUSA algorithm diagrammed in Myth 4, including but not limited to lei lies, is specifically geared to display invented results when no results occur within accepted statistical parameters; Figure 7 shows the rate at which GT invents translations across languages, with understandable results displayed on the bottom of each bar and nonsense data shown in red in the sky above those functional results. Stating that GT’s output predominantly fails is not meant to disparage the effort, but rather to provide an objective statement of the empirical evidence of what that effort actually achieves.

The 5 conditions for satisfactory approximations with Google Translate:¹⁷²

On the basis of the empirical and qualitative research conducted for Teach You Backwards, I conclude that Google Translate usually produces acceptable conversions if the following five conditions are all met. The fewer of these conditions that apply, the less you should place credence on the GT results. Even in cases that satisfy all five conditions, errors are a constant risk, ranging from minor to serious.

1. The language is at the upper tier of TYB tests, which generally indicates high investment by Google and the existence of substantial training data.
2. The conversion is to or from English.
3. The text is well structured and written using formal language and short sentences.
4. The text relates to formal topics.
5. The translation is for casual purposes where misunderstanding cannot result in unpleasant consequences.
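The five conditions above can be read as a simple checklist. The sketch below is one hypothetical way to encode them in Python; the class name, field names, and the wording of the advice strings are my own illustrative assumptions, and the function does nothing more than count how many conditions hold, mirroring the rule that fewer conditions met means less credence.

```python
from dataclasses import dataclass, astuple


@dataclass
class TranslationScenario:
    """One boolean per condition, in the order listed in the text (names are illustrative)."""
    top_tier_language: bool       # language scores in the upper tier of TYB tests
    to_or_from_english: bool      # English is the source or the target
    formal_short_sentences: bool  # well structured, formal language, short sentences
    formal_topic: bool            # the text relates to formal topics
    casual_purpose: bool          # misunderstanding has no unpleasant consequences


def conditions_met(scenario: TranslationScenario) -> int:
    """Count how many of the five conditions are satisfied."""
    return sum(astuple(scenario))


def advice(scenario: TranslationScenario) -> str:
    """Turn the count into a cautious, human-readable note."""
    met = conditions_met(scenario)
    if met == 5:
        return "All 5 conditions met: usually acceptable, but errors remain a constant risk."
    return f"{5 - met} of 5 conditions unmet: place correspondingly less credence in the output."


if __name__ == "__main__":
    news_article = TranslationScenario(True, True, True, True, True)
    tax_document = TranslationScenario(True, True, True, True, False)
    print(advice(news_article))
    print(advice(tax_document))
```

Note that even the all-true case returns a warning rather than a green light, in keeping with the caveat that errors are a constant risk even when every condition is satisfied.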

So, is Google Translate an earnest effort to set technology to the cause of multilingual communication? Yes. Do they succeed in providing understandable translations between languages? For their top languages versus English, frequently, but for most configurations GT results are like playing golf in the woods.¹⁷³ Are their translations fake data? Based on empirical analysis of Google Translate across 108 languages, the conclusion is: more often than not.

http://www.teachyoubackwards.com/wp-content/uploads/2019/04/Hitting-a-golf-ball-through-the-trees.mp4

The post Conclusions: Real Data, Fake Data & Google Translate appeared first on Teach You Backwards.

When & How to Use Google Translate

Martin Benjamin — Sat, 30 Mar 2019 16:52:18 +0000

Picture 57.3: Google Translate should be used as a tool for translation the way a stick is used for drilling – when used cautiously and combined with other methods (such as dictionaries), it is a tool that can help you get satisfying results. Credit: Tool Usage by Valerie, (CC-by-nc-nd)

Based on the evaluations in Teach You Backwards, there are some situations for which Google Translate (GT) is an excellent tool, some for which it is helpful as part of a broader translation strategy, and many where it should be avoided or cannot be used.

The previous chapters of this study show that any given piece of any given translation in GT has one of three possibilities:

Google Translate knows
Google Translate guesses
Google Translate makes stuff up

Your problem as a user of GT is that you have no way of knowing what mix of those three applies to your particular text. For their top tier languages, many of the translations will largely combine knowing and guessing, though making stuff up remains an active part of the algorithm. For the top 13 or so languages, you have a pretty good chance of getting results that convey much of the original intent, in certain translation scenarios, and the next 20-odd languages will transmit the broad sense of a text more often than not. For the base of the pyramid, though, about 71 languages, fake data factors in too highly for you to have any confidence in the results put forward. Figure 7 shows the proportion of invented results, among 2140 translations in 107 languages, while Figure 8 shows Afrikaans (GT’s best-performing language) versus Zulu, its fellow official South African language (about 3/4 to the bottom in overall ranking). The red in these graphs is your warning that GT translations should never be considered definitive, but can be helpful for advisory purposes vis-à-vis English.

Figure 8: Two languages that share the same geographic territory in South Africa, and are often spoken by the same people, have nearly inverse scores in GT. Afrikaans scored best among all 107 languages tested by TYB, yet still failed 1/8 of the time. You are strongly advised to inspect the scores for your language at http://kamu.si/gt-scores before you decide how much faith to put in GT output, keeping in mind that fake data may be part of any translation in any language, by algorithmic design.

What is the context in which people use Google Translate?
What does Google Translate do? Scientific measurements of GT across all its 108 languages.
Why doesn’t Google Translate do much of what it says it does?
Why can’t Google Translate accomplish what it says it does?
How could more effective translation be accomplished?
So what? What is wrong with Google Translate not doing what it claims?
Google Translate sometimes gets it right. How should it be used as a helpful tool? (You are here)

This chapter discusses methods through which you can benefit from using Google Translate, and situations where you should not rely on its output. You might find Simon Hill’s article in Digital Trends, also called “How to Use Google Translate”, to be a helpful guide for learning about the features of the Google Translate app on various devices: https://www.digitaltrends.com/mobile/how-to-use-google-translate-app/.

Situations When GT is a Good Tool¹⁷⁴

Picture 58: Spanish ranks 5th in the Bard rankings and 4th in the Tarzan rankings. The GT rendering in English has several mistakes and peculiarities, but satisfactorily conveys the main points of the text – you could plan your visit to the local festival based on the translated information.

Let’s accentuate the positive first. From several other languages to English, GT is often capable of producing understandable renditions of the author’s original intent. Users who want to get the main points of a news article or a fairly formal message will usually receive a passable translation, and sometimes an excellent one,¹⁷⁵ from the better languages. However, two notes of caution are necessary. First, I had no resource for an objective test from 107 other languages toward English, so my confidence scores are based on the admittedly flawed premise of flipping performance from English out. Second, most users will have no way to verify whether the translation is correct, so they must take a leap of faith that is only marginally justified.

The best time to use Google Translate is when you yourself are the audience. If you plug in a text from another language, you are fully aware that the translation is machine generated, and you will read it with due caution. Conversely, the worst way to use GT is to send something to somebody else that you have had blindly translated into a language you cannot read, because they will not know that the words before them include guesses and fabrications. A comparison is cooking for yourself versus preparing food at a restaurant. When you scrounge for food at home, you can throw random ingredients together, and suffer through the result if it doesn’t work out. That same experimental plate would sit very badly on the tongue if served to a customer who expected that the chef knew how to cook.

In general, longer texts are better, because you will be able to mentally smooth out inevitable mistakes by understanding the broader context. One of our reviewers reported detailed success using this strategy for initial translations from Hebrew to English, for example, and a journalist from the New York Times discusses GT as his starting point for rough translations for the same pair.¹⁷⁶ Perhaps the best use of Google Translate to extract useful information was by American researchers who found Russian-language articles in Ukraine that helped them trace the criminal conspiracy behind the first impeachment of America’s 45th president¹⁷⁷ – the elegance of the words was unimportant, but the evidence they gleaned from the texts for their personal understanding gave them the leads they needed to pursue the story.

If you are able to read with a forgiving eye, GT is prescribed as generally beneficial and mostly harmless in the top 36 languages as listed in http://kamu.si/gt-scores, though with guaranteed fatal side effects in some percentage of translations. For the 71 lower-scoring languages, translations to and from English should be regarded with tremendous caution, even in non-sensitive situations.

Picture 58.1: When viewed as a whole, the big picture can look quite good in GT for larger documents in the languages that scored near the top of TYB tests, like the cheerful mural in this image from Renens, Switzerland. Closer inspection will reveal chips and cracks in the computer output, often losing the meaning of particular phrases and sometimes reversing the author’s intent – but you can usually appreciate such documents if you do not need to rely on the details. Compare this image to Picture 64, which hints at the big picture between English and most GT languages, and also speaks to languages at the top tier when the texts do not resemble Google’s formal training material. Photos by author.

From English toward other languages, I propose that our scores are indicative of reliability for each language. I recommend use of GT for informal purposes, such as a reader of a high-scoring language reading an article or email from English, with a healthy dose of skepticism. GT is also an excellent writing aid when you have an intermediate knowledge of the target language – that is, if you already have a sense of what the output should be, GT can confirm your intuition and provide correct spellings and accents, and sometimes correct agreements (though DeepL usually provides a wider range of options for the languages it covers, including alternative registers).

Systran: Direct translation between French and German, Spanish, and Italian.¹⁷⁸ Systran is a lesser-known competitor to GT. They generally seem to have similar results among the 42 languages in their list (their advertising copy mysteriously says “50+”), though TYB has only conducted casual testing. One advantage they claim is that they are able to learn from community or in-house translations, but their publicity is unclear about how this filters into their free service. They are appropriately humble about their non-English pairings – they clearly label “double translations” that they route through English, and they restrict most languages to jump through English to only French, German, and Spanish, where they have the most confidence. Greek to Vietnamese, for example, is not offered because they know the results are too weak to release, unlike GT that pretends to present legitimate results for such pairs. They make six direct non-English pairings, so in these cases you should definitely try them in preference to GT that goes through English: French→German, German→French, French→Spanish, Spanish→French, French→Italian, and Italian→French. Other combinations of these languages, e.g. Spanish↔Italian, go through English, so there is no apparent advantage among services, but from French to German, Spanish, or Italian, or those languages to French, you should definitely give Systran a try.

[If you can provide a good translation of that final recommendation to the affected languages, please send it along, so that search engines will present that information to speakers of the relevant languages.]

Students are encouraged to use GT for homework, but not in the way you might think.¹⁷⁹ Google will always make mistakes, and your teacher will always know, so you should never submit a raw GT translation if you want a good grade. (In one study, English teachers of Japanese students could identify machine translations about 3/4 of the time, with the remaining quarter uncertain whether the tell-tale goofs were more typical of a person or GT.) However, performing a test translation with GT and then tweaking it into shape can be a fantastic exercise for learning a language, engaging you in “higher order thinking” about nuances you might not otherwise confront. You can learn a lot, and enjoy the hunt, by using the machine output as a springboard for investigating how a native speaker would render an expression.

Examine the animated translation in Picture 58.1. You can see that GT’s translation of “pick up” to French shifts as the sentence grows. At some moments the proposal is “ramasser”, which is the word you would use to tell a child to pick up her Legos from the floor. Sometimes the translation shifts to “aller chercher”, which means “to go look for”, and is fine in the context of picking up pizza. I wanted to translate the longer sentence in Picture 35, and was surprised to see “acheter”, meaning “to buy”, as the proposal. So I asked my eight-year-old, who speaks native French, how she would say “We can pick up some pizza on our way home”, and, as a good translator should, she demanded clarification: “Do you mean that we’ll buy it?” Thus, I learned that “acheter” is a splendid word for this sense of “pick up”, and offer kudos to GT for picking that up. However, as you can see in Picture 58.1, had my sentence been one word shorter – removing either the “ham” from the pizza or the word “cathedral” – I would have been given the unacceptable “ramasser”, and were a student to hand in an assignment using “ramasser”, the teacher would not be pleased. The animation also offers other learning opportunities for students of French (e.g., do you express “some pizza” with “de la pizza” or “des pizzas”?). The point is that fact-checking Google Translate can be a great way to grow your skills in a language – a detective game to teach you backwards.

Fact-checking GT is especially effective if you have a human chat partner trying to learn your language, so you can help each other arrive at native speech. (Pro tip: for the languages they serve, also try DeepL, which specializes in helping you fine-tune their output.) Remember to trust your own ear – if GT is showing you one result, but everything you have learned about the language tells you something else, your own instinct is more likely to be on target.

Picture 58.1: Time-lapse animation of GT shifting its output choices as a sentence grows. Various parts of the translation are right or wrong at different moments.

You can use GT for informal communications such as text chats, but be aware that the system is not trained on data to support casual conversation. For starters, most colloquial and other multiword expressions are missed entirely, down to ordinary goodbyes. Words are usually translated literally, rather than in the context of their meaning in combination. Importantly, this occurs despite the claims that AI and deep learning have already, or soon will, overcome such problems; they have not, they are not suited to do so with the data available for the best-resourced languages, and they will always fail even vis-à-vis English for languages with fewer resources. Simply put, GT cannot “learn” expressions such as the test phrase “like a bat out of hell” because there are no datasets where they occur in a way that can bridge languages. Nevertheless, such expressions constitute a large part of informal communications. This applies, secondly, not just to colorful idioms, but to single words in everyday speech. GT makes choices based on the formal contexts it trains on, so its heuristics will inherently miss ordinary meanings that fall outside its training material, such as “he was winded after running”. You absolutely cannot use GT to tell a conversation partner that you are falling asleep by saying, “I need to crash”. In terms of register, GT is often set to give formal second-person conjugations, such as the vous form in French, though some languages such as Spanish are set for the reverse; I have observed this, tested that it is usually impossible to change between formal and informal registers, but have not documented systematically which languages default to casual versus formal constructions. 
(You can test your own language with a phrase such as “Do you want to go for a walk?”) Similarly, GT usually defaults to masculine constructions, thereby botching first and third person expressions for 50% of its users; in one case, the translation of “I will be unavailable tomorrow” to Russian had the connotation of sexual availability when said by a woman. For these reasons, I recommend that GT be taken with quite a few pinches of salt when translating casual text; do not assume your correspondent will know what you mean when you send a GT translation of “pinch of salt”.

Use GT to check your spelling, with caution. For example, to check whether the French word “pamplemousse” is spelled with one “s” or two, you can type “grapefruit” in English on the source side and see the right spelling. This works pretty well if GT gives you the term you already know to expect. However, if you want to know whether the female cat, “la chatte”, has one “t” or two, you will get the masculine form, “le chat”, if you type “the cat”. You can force the right answer by typing “the female cat”, but you will be told “Le chat femelle” if you type “The female cat”, which more or less means “the female he-cat”, some sort of hybrid character from a few Broadway musicals. (Google has taken preliminary steps to address the gender issue in its high-investment languages, so simply typing “cat” now presents you with both the male and female forms in French, but the feature is in its infancy. Appreciate gender options when they appear, but certainly do not rely on them being factored into the results you see.) If the term you are seeking can have more than one form, due to gender, noun class, conjugation, level of formality, or some other variable, you must be clear in your own mind that you are looking at the form you need, because GT is only giving you one guess from a range of possibilities.

Figure 8: Time to post-edit a page from GT in seven languages. (Shah et al. 2015)

GT can be quite helpful as a starting point in translating formal documents such as business letters from English to top tier languages, but should never be given the last word. Figure 8 shows the results of a study that timed how long a person would need to post-edit the GT translation of a biology textbook. Professional human translators generally report a pace of about one 250-word page per hour. By post-editing GT product, some languages are able to reduce the time to as low as 15 minutes; GT does the grunt work of suggesting the vocabulary and grammatical arrangement, saving time for the human in the percentage of cases where the tool gets it right. A professional translator from Hungarian to English estimates unscientifically, for example, that GT gives him a 70% to 80% head start – in keeping with Hungarian’s Tarzan score of 65. You can time yourself post-editing the English in Picture 58 to see how long it takes to polish 37 source words from a top-tier language to Bard caliber. However, this time savings is lost for languages at the lower tiers, where the output is so dubious that a person could spend more time correcting GT than translating from scratch: an average of 2 hours to post-edit GT in Marathi (though it is not known how long the same document would take from scratch, given that much of the editing time involved researching undocumented terminology, so would have been constant whether or not MT was attempted). Unfortunately, without a good reading knowledge of both languages, you cannot know whether GT output conveys the intended message; for example, Lewis-Kraus reprints a translation from Japanese (a mid-tier language) to English that captures all of the essential original information, whereas the best post-editing strategy for the translation from Japanese in Picture 1 would be to hit the delete key.
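The arithmetic behind those post-editing figures is easy to reproduce. The sketch below uses only numbers quoted in the paragraph above (a 250-word page, one hour per page from scratch, 15 minutes at the fast end of post-editing, 2 hours for Marathi). The function names are my own, and treating the one-page-per-hour figure as the baseline for every language is an assumption; as noted above, the true from-scratch time for the Marathi document is not known.

```python
PAGE_WORDS = 250               # page size quoted in the text
SCRATCH_MINUTES_PER_PAGE = 60  # reported professional pace: about one page per hour


def words_per_hour(minutes_per_page: float) -> float:
    """Throughput in words per hour for a given time per 250-word page."""
    return PAGE_WORDS * 60 / minutes_per_page


def minutes_saved(post_edit_minutes: float) -> float:
    """Minutes saved per page versus the assumed from-scratch baseline (negative means slower)."""
    return SCRATCH_MINUTES_PER_PAGE - post_edit_minutes


def speedup(post_edit_minutes: float) -> float:
    """Ratio of assumed from-scratch time to post-editing time (below 1.0 means slower)."""
    return SCRATCH_MINUTES_PER_PAGE / post_edit_minutes


if __name__ == "__main__":
    for label, minutes in (("fastest post-edit", 15), ("Marathi post-edit", 120)):
        print(
            f"{label}: {words_per_hour(minutes):.0f} words/hour, "
            f"{minutes_saved(minutes):+.0f} min/page, "
            f"{speedup(minutes):.2f}x"
        )
```

Under that assumed baseline, a 15-minute post-edit is four times faster than translating from scratch, while a 2-hour post-edit takes twice as long as the baseline.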

Cautionary Situations¹⁸⁰

In emergencies, use GT when there is no recourse to a human translator or other tools. Be patient and assume you are Tarzan testing out words for talking to Jane. If you are on Skype trying to reunite a family in Romania and their baby in the USA who was kidnapped by a despotic president, do not hesitate to try the words GT generates. You are unlikely to make the situation worse, and you can see pretty quickly whether there is a semblance of understanding.

GT output should only be viewed as suggestions, in the same way as you inspect search results if you want to book a flight online. In fact, flight results are more trustworthy than translations, because you know that your options between Bucharest and Boston will at least get you somehow between the right cities. (I chose this pair for the alliteration, but a subsequent test, shown in Picture 59, shows how miserable you would be if you let a computer choose their prediction for best flights for this route.) With GT, a speaker of Haitian Creole who wanted to go to “Singapore” would be given the equivalent for “Senegal” in all other languages, which would deposit someone trying to get from Port-au-Prince to Singapore 17,746 km / 11,027 miles away from their destination, nearly half the circumference of the globe, if the mistake were made on a travel site.

My research proves that GT has a statistically high likelihood of making fundamental mistakes even for its best languages, but you can never know when. For example, when GT computes this sentence to French, they very wrongly produce “projet de loi”, a legislative proposal, as the equivalent for “bill”: “In fact, I had misplaced the bill and was anticipating your reminder.” However, when “In fact” is removed, the output is rendered correctly with “facture”, and the same happens if “anticipating” is replaced with “waiting for”. Regular GT users will see this phenomenon at play throughout the top tier – you might get a brilliant translation at one moment, or get something incomprehensible with a slight change in wording or punctuation. Some amount of catastrophically erroneous results are part and parcel of what GT shamelessly calls “near human” translation. Now that you know, if you are writing to a creditor about an unpaid bill, and you copy and paste the output from GT without inspecting it closely, and your creditor is completely baffled, shame on you.

Picture 59: “Best flights” on kayak.com from Bucharest to Boston. The recommended journey involves 7 flights with no meals or water on 3 no-frills airlines instead of 4 flights on one regular carrier, has no option to check luggage through and charges luggage fees on each separate airline, totals more than a full extra day of your life than the fastest routing, will easily set you back $100 per passenger for food during your layovers outside Milan, Paris, Reykjavik and Berlin, and leaves you to sleep on the floor of Charles de Gaulle airport.

The best way to use GT in situations where the translation matters¹⁸¹

1. Do not use it in isolation.¹⁸² Check your results against Bing and any other services that support the language you are translating versus English. If your language is among the twenty-four served by DeepL, use that as your starting point because it often offers a wide selection of alternatives (such as vocabulary, gender, or politeness). However, all of the online MT services face the same limitations. You should therefore also consult bilingual dictionaries such as WordReference.com, which shows human-curated translations for numerous senses of polysemous terms in many languages, and also includes many multiword expressions. Further, you should try your phrase in Linguee.com (see Picture 25), which highlights direct human translations in known parallel documents.

Picture 60: A real business email that was translated by machine from German to English. The first paragraph is awkward but readable. The address, however, transforms the city name from “Linden” to “lime trees”. Without review of MT output, an expensive computer part could have gotten lost in the post.

2. You must have some knowledge of the target language.¹⁸³ If you cannot sense whether GT output is within the zone of correctness, you must not use the translation in mission-critical situations. In important situations, blind translations will always – repeat: always, for every language – result in errors that make sections of your document incomprehensible at best, and sometimes the inverse of your intent. Picture 60 shows a case where one crucial error from German (ranked 2nd in my tests) to English could have resulted in the loss of a 200€ computer part.¹⁸⁴ You can often improve the output with some back-and-forth tweaking, massaging your wording until you get the output you are looking for – but that means your knowledge of the target language must be substantial enough to recognize when GT’s results convey the meaning you intend.

3. If possible, have the results vetted¹⁸⁵ by someone who knows both languages. Experience shows that humans will need to rewrite substantial elements of documents translated by GT, but at a significant time savings versus starting from scratch, for the top tier languages. Experience also shows that many companies do not choose to pay for polishing their translations, and the raw GT output they rely on is consequently unintelligible gibberish.

Picture 61: Message using MT that includes a disclaimer and the text in both the source and target languages.

4. Include a disclaimer.¹⁸⁶ State up front something like: “This document was translated from English with the aid of Google Translate. Apologies for any errors.” I also recommend appending the original text, so a reader with some knowledge of the source language can try to work out the translation failures. Here is a perfectly implemented disclaimer on a professional sports website, which provides due caution to readers along with an assessment in keeping with TYB scores: “For those of us who do not speak Spanish, here’s the Google Translate (which tends to be pretty accurate with Spanish)”. Picture 61 shows the proper way to use machine translation in a situation where you do not have the knowledge to post-edit the automatic results.

5. Tweak your own words.¹⁸⁷ You can say things in your natural speech in an infinite number of possible ways – way more than a computer has a prayer of predicting. (An important TYB chapter looks in detail at the elements of language that are so confounding to MT. Read it!) Things that seem clear to you will be puzzles for translators, human or machine. As one example from a googol of possibilities, Google has an error message when its Chrome browser crashes, “Aw, Snap!”, that makes perfect sense in California but is mystifying in much of the world. They would have saved their localizers a lot of grief had they tweaked their original language to say, “An error occurred”, or “Something went wrong”, or “We could not load your page”. You can try to iron your original text so that it is smoother for the translation engine, using terms and methods of expression that might occur more often in the data on which GT is trained. Ditch colorful words like “ditch”, uncommon words like “nefariously” for which GT will make up stuff, party terms that form a special meaning together (especially if they are separated like “pick up”, since GT will not pick the fact that they go together up), terms with multiple meanings, and long and convoluted sentences such as the one you are reading. It is best to go back and forth between your text and the computer output, until what they are giving you comes close to your intuition about what you should be getting. For more, read these suggestions from Lionbridge, a major company in the translation industry, “Writing for Translation: 10 Translation Tips to Boost Content Quality”.

Seven situations in which you categorically cannot rely on GT¹⁸⁸

1. Blind translations¹⁸⁹ – that is, translations where you cannot read the results personally. You must assume that GT output will be flawed, at roughly the level measured in this study. If you are sending birthday greetings on Facebook to the speaker of another language who you met once on holiday, you can use the output but anticipate confusion. If, on the other hand, you are preparing a financial document to be notarized for the tax office, the output would put you at serious risk of disaster. If you do not believe that, read the GT terms of service:

We provide our Services using a commercially reasonable level of skill and care and we hope that you will enjoy using them. But there are certain things that we don’t promise about our Services. Other than as expressly set out in these terms or additional terms, neither Google nor its suppliers or distributors make any specific promises about the Services. For example, we don’t make any commitments about the content within the Services, the specific functions of the Services, or their reliability, availability, or ability to meet your needs. We provide the Services “as is”. … To the extent permitted by law, we exclude all warranties. … When permitted by law, Google, and Google’s suppliers and distributors, will not be responsible for lost profits, revenues, or data, financial losses or indirect, special, consequential, exemplary, or punitive damages. … The total liability of Google, and its suppliers and distributors, for any claims under these terms, including for any implied warranties, is limited to the amount you paid us to use the Services.

GT’s lawyers do not place an iota of trust in their output. Nor should you. And you and GT’s lawyers should especially pay attention to:

2. GT must not be used in medical situations.¹⁹⁰ Language presents a major difficulty for hospitals and clinics around the world. Medical staff often do not speak the same language as their patients, whether because the doctor or nurse found work in a place far from home, or because the patient has immigrated from elsewhere. Moreover, medical training in places like India, Africa, and Latin America occurs in university languages like English or Spanish, not in local languages like Marathi, Sesotho, or Nahuatl (which is not a GT language, so the effort would probably be to translate to Spanish because Nahuatl is a language of Mexico, and hope the Nahuatl speaker can scrape by with that, which is not a great assumption, but I digress…). The temptation is to jump on GT and assume that the output will be “good enough”. It. Won’t. Every problem identified in this study is exacerbated in medicine because GT has no demonstrated training in the domain. There is no evident source of parallel texts for GT to learn terms pertaining to medicine between English and 107 other languages, and I will eat my hat if they have been manually curating such data. We saw in Picture 46 that GT does not even know what a delivery room is in German, much less how to talk a woman through delivering a baby in Khmer or Kurdish. Extrapolating to languages that score at the level of Chinese or lower, from research on emergency room discharge instructions for Chinese, Google translations risk serious harm in 8% to 100% of critical instructions for 91% of GT languages .

Google should place a prominent medical disclaimer along with all its output, as an ethical requirement if not a legal one. As stated in an article in the British Medical Journal that found 42.3% “wrong” medical translations from English to 26 languages, “Google Translate should not be used for taking consent for surgery, procedures, or research from patients or relatives unless all avenues to find a human translator have been exhausted, and the procedure is clinically urgent”. Unfortunately, medical service providers will try to cut corners by using GT instead of paying for professional translation. If you know of a medical situation in which Google Translate is being used instead of human translation, scream MALPRACTICE at the top of your lungs.

Don’t take my word for it, though. Take Google’s words about the Covid-19 pandemic:

Picture 62: A Swahili subtitle on Netflix in Kenya, from Breaking Bad, and the result for that phrase on Bing. “Marehemu” is a corpse, which is a way of being “late” that no sentient person would use in the given context, but is proposed by both Google and Bing, while “mpenzi” (lover) is unique to Bing. I cannot definitively trace the sources for the Netflix subtitles, because I do not have enough examples at hand, and machine translations can vary from moment to moment, but there is not a shadow of a doubt that MT is involved. My hunch is that Netflix runs transcripts of their English subtitles through a translation engine and pastes the results into the subtitle track, perhaps with some human polishing for punctuation, capitalization, and the like. Photo of Breaking Bad from @_guchu

3. Do not use GT to produce any text for your business,¹⁹¹ unless you have the output thoroughly reviewed by a speaker of each target language. It is fine to use GT to look through incoming documents, where you yourself know that the text might not make sense, and you can use your natural intelligence to puzzle out bad grammar and research the nonsense parts. You must assume, however, that your clients will not know that the text they are looking at is a machine fantasia. For the bottom 2/3 of languages measured in this study, customers will not have the inclination to suffer through an article, web page, or email that looks like it was written by a drunk baboon. For the upper languages in which GT has invested more time and money, the problem is more pernicious – you will often get output that looks okay at first glance, but misses important details, chooses words that create mysteries, or even reverses your intent. People reading your documents will believe they should take your words literally. When they do, at best they will think you are a fool, and at worst your business relationship with them will fall to ruin. Businesses look ridiculous when they use GT. Chicago O’Hare Airport gives users a “language” option that uses the Translate API to produce pages for the Google languages; as you can test, for most languages, the translations could not guide you to the airport, much less get you through security and onto your plane. Localizing the website for an international airport, at least for the languages of the countries it connects to directly, is a great idea – but that is the sort of service that should be produced by paid human translators, or not at all. Customers who use GT to try to figure out your documents or website know that they are getting an approximation. As blogger Darren Jansen says, their use of GT “doesn’t reflect on you,” but if you give them GT output that looks like you have produced it, “that is a translation that YOU are providing on YOUR website. YOU have to stand behind it, and it reflects poorly on YOUR brand.”

Similarly, as seen in Picture 62, Netflix has decided to serve its East African market by adding Swahili subtitles that appear to come from Google or Bing.¹⁹² The text has very little to do with the on-screen narrative, so it serves no purpose for the intended audience – and it is clear from user discussion that viewers think Netflix has callously hired a calamitous translation agency. The subtitles have, however, earned Netflix a lot of derision in the local media.¹⁹³ You should never consider using Google Translate with your money or your reputation at stake.

Picture 63: Google claims that it offers a “dictionary” within its English search results, including translations to all GT languages. In this example, the primary English noun sense of “trawl”, a large fishing net that is dragged through the water, is absent from view. For translation, the lei lie “trawl” is presented for a great many languages, and words with no sense information that might or might not be correct in certain contexts are given for high-investment languages like Spanish and French. Nevertheless, Google claims legitimacy for its results with the tag “From Oxford”, which is certainly false when it comes to the “translations” from Malay and all other languages. (Highlighting added to original screenshot.)

4. Do not use GT to translate individual words. Google is not a dictionary,¹⁹⁴ even though it automatically assembles some language data in a form it calls a “dictionary”, as shown in Picture 63. As found by Gaspari and Somers, “using MT services available on the Internet as if they were bilingual dictionaries or lexical look-up facilities is a misguided strategy that is liable to provide users with partial and misleading information, in that MT software is designed to hide lexicographic information from the users and to provide one-to-one target language equivalents.” In addition, if Google does not have your input word in their source vocabulary, they will just make up a fictitious translation and never let you know, as you can see in their straight-faced delivery of translations for “ooga booga” from all of their languages, or their Icelandic to English translation of “barna samfella” (a “onesie” for babies – this cute one supports kamusi.org) as “children’s intercourse”.

Among the top 100 words in the English language, which make up more than 50% of all written English, the average word has more than 15 senses – giving you, on average, fifteen-to-one odds – that is, 6.25% – for a single word translation. GT makes a statistical guess to highlight one translation of one sense, and in some cases offers options that can help a knowledgeable user home in on a better result. However, unless you have a good knowledge of the target language and are only using GT to jog your memory or help with your spelling, you must not trust the result for single words or party terms. Most common words have at least two senses, and many have dozens, so your chance of getting the wrong translation ranges from 50% up to the high nineties, in the cases where Google is not just using MUSA for words that are outside its vocabulary. GT is not a mind reader, and the meaning you have in your mind will differ from the meaning Google picks more often than not.
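The odds quoted above can be checked with a short sketch. It assumes, purely for illustration, that every sense of a word is equally likely to be the one you mean and that the engine picks among them at random; GT actually makes a statistical guess, so real-world odds will differ. The function name is hypothetical.

```python
from fractions import Fraction


def chance_of_right_sense(num_senses: int) -> Fraction:
    """Probability of hitting the intended sense under a uniform random pick.

    ASSUMPTION: all senses are equally likely, which is a simplification.
    """
    if num_senses < 1:
        raise ValueError("a word needs at least one sense")
    return Fraction(1, num_senses)


# Fifteen-to-one odds against means 16 equally likely outcomes.
for senses in (2, 16, 50):
    p = chance_of_right_sense(senses)
    print(f"{senses:>2} senses: {float(p):.2%} right, {float(1 - p):.2%} wrong")
```

With two senses the pick is a coin flip, with sixteen it is right 6.25% of the time, and with dozens of senses the chance of a wrong pick climbs into the high nineties.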

Hey @Google There are other cities too in the world apart from Faridabad #Translate #ScrewedUp #Google #GoogleTranslate pic.twitter.com/cG6bpFKrl8

— Write it Bold (@wibsocial) September 25, 2019

Don’t take my word for it, though! Here is a game you can play:

Find a real human-compiled dictionary for your favorite language, and look up some words.
Try the same words in GT.
Give GT one point for every result that shares any sense with the genuine dictionary, or proposes something different but viable (e.g., GT says “rich” while the dictionary says “wealthy”). [In the results below, I don’t know how to score the last item, where “\t” inexplicably occurs, but let’s call it a point for Google.]
Give your dictionary one point for every result that does not appear in GT, even if the GT result has some vague similarity (e.g., GT says “run” while the dictionary says “runny”, which are clearly not the same thing).
Take away one point from GT whenever it produces a lei lie, an invented word that does not appear in the real dictionary, and is not related to the dictionary entry in any noticeable way (e.g., don’t count “run” as a lei lie in the previous example because it is in some way based on interpreting data, but do consider it a lei lie if GT merely reproduces the source word on the target side)
Report your results in the comments section below!

Even if you don’t know Swahili, you can dominate the game by using the TUKI dictionary. Many lucrative languages are treated well at wordreference.com, dictionaries for dozens of European languages can be accessed at dictionaryportal.eu, and you can find a variety of other languages at lexilogos.com.

Let’s play for Irish. I cannot recognize any Irish words, so this is a blind test, using Ó Dónaill’s Irish-English Dictionary, Foclóir Gaeilge-Béarla. This dictionary represents the gold standard in Irish lexicography: it “has been the primary orthographical source for the spelling of the language since it was published and provides the most comprehensive coverage of the grammar and other aspects of words in Irish.” Ten words, completely at random, with an experiment you can replicate for any language, to test whether Google Translate is a dictionary. Ladies and gentlemen, start your engines:

Irish Word	Foclóir English	Google English	Foclóir Score	Google Score
rumpach	narrow-rumped, lean	rump	1
rúndaingean	strong-minded, resolute	confidence	1
lacstar	idler, gadabout, playboy	lacstar	1	-1
hapáil	hop	randomization	1	-1
vaigín	waggon	wagon		1
martaíocht	killing, provision of beef, scheming, ingenuity	bureaucracy	1	-1
niciligh	nickel	nickel		1
gadráilte	tied, strung	cached	1	-1
jaingléir	straggler, vagrant, casual fisherman	juggler	1	-1
daingnitheach	strengthening, stabilizing, ratifying	ratification. \t¹⁹⁵		1
Totals			7	-2

Results: With 10 random words, Google matched the gold standard 3 times, came somewhere within the vector space twice, and invented translations out of whole cloth 5 times. Tell all your friends: Google is not a dictionary.
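If you want to keep score for your own language, the rules of the game can be sketched in a few lines of Python. The judgment for each word still has to be made by a human comparing the two sources; the labels `MATCH`, `MISS`, and `LEI_LIE` and the function `tally` are hypothetical names introduced here, and the judgments are transcribed from the score columns of the Irish table.

```python
# Hypothetical labels for the three outcomes described in the rules.
MATCH, MISS, LEI_LIE = "match", "miss", "lei_lie"

# (dictionary points, GT points) awarded for each judgment.
POINTS = {
    MATCH: (0, 1),     # GT shares a sense with the dictionary, or is viable
    MISS: (1, 0),      # dictionary sense absent from GT, but GT drew on some data
    LEI_LIE: (1, -1),  # GT invented a translation or echoed the source word
}


def tally(judgments):
    """Return (dictionary_total, gt_total) for a mapping of word -> judgment."""
    dictionary_total = sum(POINTS[j][0] for j in judgments.values())
    gt_total = sum(POINTS[j][1] for j in judgments.values())
    return dictionary_total, gt_total


# Judgments transcribed from the Irish results table.
irish = {
    "rumpach": MISS, "rúndaingean": MISS, "lacstar": LEI_LIE,
    "hapáil": LEI_LIE, "vaigín": MATCH, "martaíocht": LEI_LIE,
    "niciligh": MATCH, "gadráilte": LEI_LIE, "jaingléir": LEI_LIE,
    "daingnitheach": MATCH,
}

print(tally(irish))  # (7, -2)
```

Swap in your own dictionary of words and judgments to score another language.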

5. Do not use GT for any non-English language pairs¹⁹⁶ (except Catalan to Spanish, Czech-Slovak, and Japanese-Korean, which are the only production pairs known to be based on direct training; if Google tells me about other direct pairs, I will update the list). In all other known cases, as demonstrated in repeated tests that you can easily try yourself, the translation passes via English. French to German goes via English. Swedish to Danish goes via English. Hindi to Urdu goes via English. This means that all errors from Language A toward English are retained, and an additional layer of errors is introduced from English to Language B at about the level measured in the empirical study. The rates calculated for all 5151 pairs that do not include KOTTU are indicative, though not definitive because I could not test translation quality from each language toward English. Just over 1% of pairs were estimated to achieve an understandable result more than half the time. Put another way, in 99% of Language A to Language B scenarios that are possible in GT, the chance of understandable output is lower than a coin flip. In three quarters of cases, 3861 pairs, intelligibility is below 25%, and nearly three out of ten cases (1474) will be on target for MTyJ (Me Tarzan you Jane) approximations 10% or less of the time. With numbers like these, GT is essentially useless as a translation tool where English is neither the source nor the target. The Norwegian Olympics team discovered this when their chefs ordered 1,500 eggs in Korea using GT from Norwegian to Korean, and a truck delivered 15,000 instead. You may be able to get the gist of an article, but do not use GT for such pairs in any situation where understandable output actually matters.
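To see how errors compound when a translation pivots through English, here is a minimal sketch. It assumes that the two legs fail independently, so their success rates multiply, and that a language's measured score from English can stand in for its unmeasured score toward English. Both are simplifying assumptions, and the function name is hypothetical. The inputs are the Tarzan ratings quoted elsewhere in this chapter for French (60) and Swahili (25).

```python
from math import comb


def pivot_estimate(a_to_english: float, english_to_b: float) -> float:
    """Estimated chance that A -> English -> B yields understandable output.

    ASSUMPTION: the two legs fail independently, so their rates multiply.
    """
    return a_to_english * english_to_b


# 5151 is the number of unordered pairs that can be formed from 102 languages.
print(comb(102, 2))  # 5151

# French (60%) to Swahili (25%) via English, using each rating for both directions.
print(f"{pivot_estimate(0.60, 0.25):.0%}")  # 15%
```

Even a pair that combines a relatively strong language with a weak one lands well below the coin-flip threshold under these assumptions.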

Picture 64: The big picture becomes harder and harder to see with GT translations when (a) the text strays from the type of formal writing that GT is trained on, or (b) the translation language is among the majority that TYB evaluators scored poorly. This image was edited in Fotor Photo Editor from the original Picture 58.1 .

6. Do not use GT to translate poetry, literature, or humor.¹⁹⁷ Poetry, or any other artful literature, uses figurative language where it is essential to understand meanings deeper than the words on the surface. GT cannot divine the sentiments in a speaker’s heart. You would expect that artistic features such as rhyme and meter will be lost, but so will any sense of the poet’s message. A line-by-line comparison between GT output and a human translation of a poem from Catalan to English provides a typical case in point. The opening line of the Analects of Confucius, one of history’s most well-known literary works, is 學而時習之、不亦說乎 in the original Chinese, translated to English by a real person as, “Isn’t it a pleasure to study and practice what you have learned?” Use GT, though, and you will learn that the quote means “Learn while learning”. As my daughter likes to say, “Yeah, but no”.

For a fuller look at GT and literature, I urge you to detour to Douglas Hofstadter’s essay “The Soggy Soup Maid”, which he graciously entrusted TYB to reproduce on this site.

The same warning can also be applied to translating humor, especially jokes that depend on twists in the original language. However, the mistakes that GT injects can be so funny in their own right that you will often end up with a collection of words that make you laugh, even though the original pun is certain to be missed. GT translations of jokes will be utterly confusing to anyone reading them blind in the target language.

7. GT cannot be used for any of the roughly 6900 languages that are not in the system.¹⁹⁸ As obvious as this may sound, GT is often touted as a “universal translator”. In fact, 98.48% of ISO-identified languages are not covered, and 99.9996% of potential translation pairings are missed. For three continents, not a single indigenous language is covered: North America, South America, and Australia. This includes languages with millions of speakers, such as Quechua, Aymara,¹⁹⁹ Nahuatl, and Guarani. Of the world’s top 100 languages, 34 are excluded, spoken cumulatively by about three quarters of a billion people, or ten percent of the human species. Of course, Google has no legal or moral mandate to cover any language, and the 1.5% they do cover, if the most generous possible numbers are used regarding speakers and literacy, nominally serve five billion people vis-à-vis English. I am not arguing that the service falls short of its own coverage goals. I do insist, however, that GT falls far short of the media coverage it winkingly accepts, which has convinced the public that Google provides universal translation. To the extent that they accept that mantle, or do not take steps to refute it, they are complicit in a deception. The consequence of this deception is that almost no research funds are available to bring untreated languages into the technological realm, because it is so widely believed that Google already does it.
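The coverage percentages can be reproduced with a few lines of arithmetic. This is a sketch under loudly stated assumptions: the text gives 108 GT languages, but it does not state the total number of ISO-identified languages or how covered pairings were counted. The values 7,111 and 107 below are assumptions chosen because they happen to reproduce the quoted figures, not confirmed inputs.

```python
from math import comb

GT_LANGUAGES = 108      # stated in the text
TOTAL_LANGUAGES = 7111  # ASSUMPTION: total ISO-identified languages, not stated in the text
COVERED_PAIRS = 107     # ASSUMPTION: each non-English GT language paired with English only

uncovered_languages = 1 - GT_LANGUAGES / TOTAL_LANGUAGES
missed_pairs = 1 - COVERED_PAIRS / comb(TOTAL_LANGUAGES, 2)

print(f"languages not covered: {uncovered_languages:.2%}")
print(f"pairings missed: {missed_pairs:.4%}")
```

Counting the handful of direct non-English pairs as covered changes the second figure only beyond the fourth decimal place.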

Picture 65: Google weather forecast for Bamako, Mali.

Final takeaway²⁰⁰

The evaluation scores in Teach You Backwards provide a broad indication of how much confidence you should have for a given language. You are also urged to run your own tests to see how trustworthy you find GT for the languages you know. For example, based on personal experience, my recommendation to use GT as a writing aid applies to French, with a Tarzan rating of 60, but not to Swahili, with a rating of 25. Your mileage may vary.²⁰¹ Even in the best languages, though, you should always be prepared that some part of your translation might be gibberish, or the exact opposite of what was intended. As much as you may refer to weather forecasts to give you guidance, you know well enough not to time activities in Bamako for next Tuesday based on a computer model that suggests rain that afternoon – though you could surely use the projection in Picture 65 to guide you against wearing a ski jacket. Similarly with Google Translate: depending on the situation, what you get could range from a few right words to a masterful rendering of the original text. If you know the source and target language well enough and nothing in the output strikes you as crazy – in the weather analogy, if you can step outside, check the thermometer, feel the wind, and look at the clouds to get a sense of whether it will be nice to walk in an hour, which you can merge with what the computer is telling you based on information slightly beyond your horizon – then take the bet that the result is an adequate approximation of the original text. If you don’t have such information, especially for languages beneath the top tier, then all bets are off.

In the end, Google Translate is a set of suggestions based on models that may or may not be accurate, and data that is certainly incomplete in all languages. Treat the output as suggestions, not facts. Do as much extra checking as you can, preferably with human-compiled dictionaries or, best, speakers of the target language. Be prepared for misunderstanding, and expect that sometimes your results will be hilarious Giggle Translate. Do not rely on the translation when the output is important – not if your course grade depends on it, and absolutely not in any medical diagnosis or treatment situation. And then, with this awareness squarely in mind, go ahead and use GT to help you communicate with people you otherwise couldn’t. GT is not usually going to make you Shakespeare, but if it can help you speak like Tarzan, you’ll make it through the jungle okay.

Is DeepL Better than Google Translate? A No, A Punt, and A Yes²⁰²

Picture 65.1: DeepL uses visual trickery in the way they report the results of studies to make themselves look head and shoulders above the competition. The graphs make a 4.5% difference with GT look more like 200%, and an 8% difference with Facebook look like Facebook barely scored. These images are deep lies that would be rejected by any scientific journal. https://www.deepl.com/press.html

Many people suggest that DeepL produces better results than GT – in the words of one reviewer, “Folks outside of Google, who like to drink the AI-hype-honey, will say: yeah but DeepL will fix all of that”. DeepL itself states, “our neural networks have developed a previously unseen sense of ‘understanding’” (http://www.deepl.com/blog.html), and does not shy from reproducing journalists’ assertions of its superiority at the bottom of their main page. There is no scientific basis for this belief. DeepL claims to have tested 100 sentences in 3 languages. Those are potentially interesting results, but fail as legitimate science: “Specific details of our network architecture will not be published at this time.” By withholding the research from inspection, their truth claims cannot be supported. Maybe we should just trust them, but look at the graphs in Picture 65.1, reporting the BLEU scores. Those are DeepLy deceptive visual sleight-of-hand, making one 4 point difference look like a 50% variation, and a second 4 point difference look like it is almost 3 times better – view “How to Spot a Misleading Graph” on TED-Ed for an animation of the trickery at play. Moreover, they cherry-pick which results they choose to share, giving no indication about whether their findings are transmissible to the other languages in their system. Without publishing their findings in a way that can survive peer review, I’m afraid their graphs can only be read as deceptive advertising copy.

Deutsche Welle news service takes the bait anyway, declaring DeepL “three times more powerful than Google Translate” in “regular blind tests”. I am unaware of any rigorous measurements of translation quality between the two services. However, as a late learner of French, I consult both in the course of my personal life in French-speaking Switzerland and my professional life working with people throughout the Francophone world. While writing Teach You Backwards, I have paid close attention to the behaviors of both GT and DeepL. Here I offer my observations, which are not systematic and do not necessarily apply beyond French (though I have cause to suspect they do).

We can compare DeepL and GT in three ways:

Are the first-pass translations better in DeepL than in GT for the same language? I say no. Sometimes they both do a good job, sometimes they both mess up (often making the same mistakes), and it can fall either way when one is good and one is bad. Testing in 2022, for example, both came up with an excellent equivalent in French (donnez à fond) for the English “go all out” when asked to translate a sentence from a real English conversation, they fell on either side of the coin flip while giving a definitive-seeming answer when choosing between formal and informal registers (vous versus tu), and they both failed entirely with a common phrase that includes five simple words with the specific meaning of “honest” when they appear together, “on the up and up”. Both make up stuff as part of their algorithm when their data falls short. Both (TYB testing shows, and you can easily use our method to confirm yourself for the languages of your choice) pivot through English for most language pairs where English is neither the source nor the target, which means that erroneous translations from Language A to English are then chiseled in stone and compounded with errors from English to Language B – with cascading error rates similar to the debilitating numbers for polysemy discussed in the chapter on translation mathematics.
Are good first-pass translations a better percentage of the output from DeepL than from GT? You could spin the facts in this direction, because DeepL works with many fewer languages. By limiting themselves to 24 languages with some of the best data resources (English plus Chinese and 22 official European languages that share a large parallel corpus of professionally translated documents), they don’t enter into the quagmires of languages about which they know a lot less. GT is likely to generate a high percentage of malarkey in most languages outside its top tier. DeepL plays it safe and does not even venture toward those waters. GT fails badly in languages like Uzbek and Cebuano. DeepL does not try. Not trying where you know you cannot succeed might be a cop-out, or it might be the more honest strategy. What say you?
Can a user with some knowledge of the target language get better results from DeepL than from GT? This is a yes, because DeepL offers alternatives for post-editing on the target side that let humans use what they know to zero in on a translation that matches the intent of the source text.

DeepL does not cite any formal studies to back up their claims of superiority. Larousserie and Leloup compared translations of five texts from English to French using DeepL, GT, Bing, Baidu, and Yandex, and preferred DeepL (without metrics or methods), while giving no justification for the claim in their subhead that the service performs three times better than GT for French, not to mention the languages they did not look at versus English or as non-English pairs. Statements such as, “DeepL translates your documents at the click of a button, to the world-class standard you’ve come to expect from us” should be read as marketing, not science. IMHO, this DeepL translation is a “world class” mudslide, with snake eyes for polysemy and full-on MUSA for the letter “f”: “Maybe tomorrow, he proposed to his f” ￫ “Peut-être que demain, il a fait sa demande en mariage à son père” (which back-translates as something like, “Maybe that tomorrow, he will demand in marriage to his father” – I’m not sure whether his father is supposed to be getting engaged to him or approving his engagement). And, while DeepL luckily picked right for its first guess for “Do you sell hamsters?”, its runner-up translations proposed that I might want to buy ham, lobster, fish hooks, a hamlet, hamburgers, beans, herrings, a Turkish bath, lettuce, a harness, marmottes, a hammer, a handle, a cooking pot, overalls, salami, germs, or a few other things I could not find in a dictionary.

Picture 66: DeepL translation of a tweet that makes the same mistakes as GT (as seen in Picture 55)

I use both services frequently, along with Bing, for help drafting letters in French. I have found that all three make similar errors, for example, as seen in Picture 66, mistranslating the tweet botched by GT in Picture 55 almost identically. However, speaking as a consumer, I prefer DeepL for the composition task because it usually offers a wider range of options for refining output on the target side, thereby improving the final result through the interaction of the computer and a somewhat knowledgeable human. Both DeepL and GT might, for example, deliver output where the first sentence uses the formal “vous” and the next sentence switches to the familiar “tu”, but only DeepL is likely to offer you viable alternative options for both sentences in the register you prefer.

This message, to plan an outing with my daughter’s best friend, fried the CPUs of both DeepL and GT, because neither could handle how to formulate verb constructions that worked in both the first clause and the second: “Can Mina, or you and Mina, join me and Nicole…”. Using the DeepL post-editing tools on the French side, however, made it possible for me to arrive at a phrasing that made it clear the invitation was either for the child alone or a parental +1: “Est-ce que Mina peut, ou toi et Mina pouvez vous, joindre à Nicole et moi…”. GT offers a maximum of one alternative per sentence, while DeepL allows you numerous entry points to tweak wherever you see a translation run awry.

My daughter deploys her native French to tell me that “alarm clock” translates to “réveil”. It happens that GT gets this right, with a certification. DeepL proposes “réveil-matin”, which is not wrong, but sounds, in the words of a native-French adult, like “what my grandmother would call it”. DeepL’s first alternative is “réveil réveil”, which back-translates as “alarm clock alarm clock”, and additional alternatives get worse; DeepL as a dictionary never offers the option that real people really say, “réveil”. The sentence “They slept through the alarm clock” is destroyed by both GT and DeepL, giving options for “through” that indicate “during” or “within” or “by means of”. DeepL even suggests “Ils sont morts dans le portillon du radiateur” might work (Picture 67) – which back-translates as “They died inside the radiator gate”. My kid produced “Elles ont raté le réveil” (with correct gender), which uses natural intelligence to locate the inner meaning, “They missed the alarm clock” – and she had a huge laugh about the bollocks from DeepL and Google. More than unfortunate, the reason I needed to know this is a funny story involving panicked calls from my daughter’s school, ’bout which ’nuff said. Curiously, DeepL did propose “réveil” by itself as the first option in its full sentence translation, but it also gave “rÃ©veil” as choice number four.
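The text does not say why DeepL offered “rÃ©veil”, but that exact string is what appears when the UTF-8 bytes of “réveil” are misread as Latin-1, a common encoding mix-up. The sketch below reproduces the pattern as one plausible explanation; it is an assumption, not a diagnosis of DeepL’s system.

```python
original = "réveil"

# Encode as UTF-8, then (wrongly) decode the same bytes as Latin-1.
# "é" becomes the two bytes C3 A9, which Latin-1 reads as "Ã" and "©".
garbled = original.encode("utf-8").decode("latin-1")
print(garbled)  # rÃ©veil

# The damage is reversible as long as nothing else has altered the bytes.
repaired = garbled.encode("latin-1").decode("utf-8")
print(repaired)  # réveil
```

A human reader who knows French can spot the garbling at a glance, which is exactly the check that is unavailable in a blind translation.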

The advantages of the post-editing feature would vanish were I stupidly to translate a text into a language I could not speak at all, such as Polish or Russian. In such a case I would always inform the recipient that the document had been automatically translated, and attach the original English, without assuming that one or the other service gave a more comprehensible draft. I could not look at Russian and know if a Cyrillic equivalent of “rÃ©veil” is being offered, or if “Child Discovery Center” has the equivalent of this obviously awful option from DeepL: “Centre de découvertes pédophiles”. Can you?

As an example of how post-editing on DeepL can work well with a language you somewhat know, for an important formal communication with a government ministry in francophone Africa regarding a workshop they’d asked me to lead for developing three of their national languages, I needed to ask, “What progress are you able to report?”. Google’s first offer was “Quels progrès pouvez-vous signaler?”, which would be adequate but not fantastic, and the service also offered another stilted option. DeepL also began with a very direct translation that did not land at the right tone: “Quels sont les progrès que vous pouvez signaler ?”. Clicking on “Quels” gave me 59 (!) opening options to scroll through. After deciding that “Pouvez-vous…” sounded best to my ears, I was then able to jump to the middle of the sentence and choose among 28 new options, deciding upon “Pouvez-vous nous communiquer les progrès réalisés ?”, which struck the balance of respect and insistence that the situation required.

Picture 67: The DeepL computer app provides numerous alternatives to refine the translations they offer. In this example, the tool made it possible to produce a plethora of bad translations – the construction shown, which back-translates as “They died inside the radiator gate”, could be made worse with “sarcophagus” or “wake” (trail of a moving boat) as offered. None of the translations produced by the DeepL tool came close to conveying the original meaning. The BLEU score for their first translation is 15.11, and the BLEU score for the words DeepL assembled in the image is 3.28.
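For readers unfamiliar with the metric, here is a minimal sketch of how a sentence-level BLEU score is assembled from clipped n-gram overlap and a brevity penalty. It is an illustration under stated assumptions, not the tool that produced the figures in the caption: the function names `ngram_counts` and `bleu` are hypothetical, tokenization is a plain whitespace split, add-one smoothing is applied to the higher-order n-grams, and the reference sentence is assumed to be the human translation quoted earlier. Its output should therefore not be expected to match 15.11 or 3.28.

```python
import math
from collections import Counter


def ngram_counts(tokens, n):
    """Count every run of n consecutive tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def bleu(hypothesis, reference, max_n=4):
    """Sentence-level BLEU on a 0-100 scale against a single reference."""
    hyp = hypothesis.lower().split()
    ref = reference.lower().split()
    if not hyp or not ref:
        return 0.0

    log_precision_sum = 0.0
    for n in range(1, max_n + 1):
        hyp_counts = ngram_counts(hyp, n)
        ref_counts = ngram_counts(ref, n)
        # Clipped overlap: an n-gram is credited at most as often
        # as it occurs in the reference.
        overlap = sum(min(count, ref_counts[gram])
                      for gram, count in hyp_counts.items())
        total = sum(hyp_counts.values())
        if n == 1:
            if overlap == 0:
                return 0.0  # not a single word in common
            precision = overlap / total
        else:
            # Assumption: add-one smoothing, so that one missing
            # higher-order n-gram does not zero out a short sentence.
            precision = (overlap + 1) / (total + 1)
        log_precision_sum += math.log(precision)

    # Penalize hypotheses that are shorter than the reference.
    if len(hyp) >= len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(hyp))
    return 100.0 * brevity_penalty * math.exp(log_precision_sum / max_n)


reference = "Elles ont raté le réveil"  # assumed reference translation
print(bleu("Elles ont raté le réveil", reference))
print(bleu("Ils sont morts dans le portillon du radiateur", reference))
```

The score rewards only shared word sequences, so a fluent sentence with the wrong meaning can still collect points for a stray article such as “le”. Smoothing choices also move the number considerably on sentences this short, which is one reason single-sentence BLEU values are best read as rough indicators.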

If you do not have the skills in the target language to take advantage of such refinement, then the results from GT and DeepL are a tossup. However, the ability to tweak your results if you have some knowledge of the target language is a unique selling point for DeepL in the short stack of languages they serve.

One final factor that now tips the equation to DeepL, but has nothing to do with its underlying translations (you’ll get “rÃ©veil” on both the web and the app): the company has introduced a free tool for PCs and Macs that launches a translation box if you simply highlight some text and press Ctrl+C twice in any other application. You can use their post-editing features to tweak away on the translation on the target side, then press “Insert” to pop your output back into its original location, for example into a WhatsApp message. The software is simple and well-designed, working on the target side in more or less the way Kamusi’s source-side SlowBrew system will function when it is complete, and has become an integral stage in my French communications since it was launched. But, whether you use DeepL on the web or the app, its neural network still conjures up hallucinations such as the honest-to-God translation to French shown in Picture 68 for an email sign-off you use every day, “regards”: “Je vous prie d’agréer, Monsieur le Président, mes salutations distinguées”.

Picture 68: DeepL translation of “Regards” from English to French for a business email. The reverse translation could be, “I hope you will accept, Mr. President, my distinguished greetings.” Without the “Mr. President”, the offering would be an appropriate way to end a formal letter, though somewhat over the top for a workaday email.

The post When & How to Use Google Translate appeared first on Teach You Backwards.

Teach You Backwards

Introduction: Into the Black Box of Google Translate

Overview

Tropes and Mind Tricks that make you believe in make believe

MUSA: The Make Up Stuff Algorithm

0.0006% of the Way to Universal Translation

Ideology and Computer Science

References

Empirical Evaluation of Google Translate across 107 Languages

Description of Empirical Tests

A. English to Language X

B. Language X to English

C. Language A to Language B

Methodology

Evaluators

Empirical Results

A. Elegance from English

B. Gist from English

C. Non-English Pairs

Empirical Conclusions

References

Qualitative Analysis of Google Translate across 108 Languages

The Dangers of Bad Translation

Myth 1: Artificial Intelligence Solves Machine Translation

Artificial Intelligence, Machine Translation, and the Flying Car

Myth 2: Neural Networks Solve Machine Translation

Ooga Booga: Better than a Dictionary

Myth 3: “Zero-Shot” Translation

Myth 4: Magic Wand Translation

Myth 5: Google Translate learns from its users

Qualitative Synopsis

References

The Astounding Mathematics of Machine Translation

The finite limits to how well GT can ever translate

The volume of untreated basic concepts

Lexical gaps – concepts without direct translations

Semantic drift – concepts that do not fully correspond

Polysemy – words with multiple meanings

Party terms (or multiword expressions) – words that play together

Morphology – words that shift shape

Categories – gender, class, register, and other ways people frame their world

The problem with pronouns

The finite limits of corpora

Syntax – the difference between Tarzan and the Bard

References

Disruptive Approaches for Next Generation Machine Translation

The semantics of translation

SlowBrew disambiguation

Learning ideas from users

Efficiency

Learning terms from users

References

Conclusions: Real Data, Fake Data & Google Translate

The 5 conditions for satisfactory approximations with Google Translate:

References

When & How to Use Google Translate

Situations When GT is a Good Tool

Cautionary Situations

The best way to use GT in situations where the translation matters

Seven situations in which you categorically cannot rely on GT

Final takeaway

Is DeepL Better than Google Translate? A No, A Punt, and A Yes

References
