- “I did not test single words” because they’re often (mostly, in fact) ambiguous. On the other hand, the research doesn’t look at full sentences or longer stretches of text, in this case for pragmatic reasons (evaluation would be far more complex and far more time-consuming).
- The outcome is that the research focuses on short multiword units, some compositional (fly out of London), most idiomatic: out cold, out of pocket, out of order (which is both non-compositional and polysemous, with three senses in MED), out of the game, and so on. While I appreciate the reasons for not reviewing GT’s performance on longer stretches of text, I would argue that the types of multiword string chosen for the testing are precisely the kinds of unit where I would expect GT to struggle. (Though it really ought to be able to cope with a full-blown idiom such as like a bat out of hell, a unique combination which can only have one meaning.)
- As noted in the section on Empirical Results, non-English pairs: “long segments and full sentences often translate better than short fragments, whether because the translation engine has more context to make an informed calculation about which sense of a word is appropriate, or because the reader has more context to overlook mistakes”. There is a related footnote here (28: in my view, too important to be a footnote), saying: “In theory, longer translations with more context should increase the accuracy of the proposed translation equivalent; for example, “run a company” often returns the foot-racing sense of “run” on its own, but the management sense in a longer sentence”. This is not just true in theory but in practice: it is exactly how humans perform word sense disambiguation (WSD) in real life. Run is highly polysemous and means nothing in isolation (whereas spectrophotometer doesn’t need any context to convey its meaning unambiguously). The intended reading of run in any given instance can only be inferred from contextual clues — so GT has no better chance of making the “right” inference from limited context than a human would. A similar point is made in footnote 32: “the phrase ‘across the board’ is incomprehensibly rendered in isolation in French as ‘à travers le conseil’, but reaches Bard status when the fragment expands to ‘its performance across the board’, translated as ‘sa performance à tous les niveaux’.” And similarly again, this is demonstrated in the last section (When and how to use GT) in the case of “picking up a pizza”.
- There is nothing surprising about any of this: GT is mirroring what happens when humans communicate with one another. Most of the common words we use are highly polysemous, but – almost always – the potential for ambiguity fades as we are exposed to a longer context. (You could argue that it would be better if GT waited until it had more context, rather than translating word for word – usually wrongly – from the moment you input any text. But that’s an issue for them.)
- So I feel there is an inherent problem with the methodology. Single words generally do work well when they are terms (like spectrophotometer), but obviously don’t (and can’t) with words like run. (Hence we can predict that the dictionary game – “Find a real human-compiled dictionary for your favorite language, and look up some words. [then] Try the same words in GT” – will only work well for monosemous words.) But I wouldn’t expect GT to work well with short phrases, and I would expect it to perform best on longer stretches of text: at least a sentence, preferably a paragraph. And on the whole it does, but only in specific use cases…see next.
2. Contexts of use
- This is how I use GT (my own “contexts of use”): for translating between English and Spanish (both directions), and for translating formal documents from German to English (I have shares in an Austrian-based tech start-up and get regular updates – in German only.)
- In both cases it works pretty well. For Spanish, I’m reasonably proficient (B2 or C1), and I can always fall back, when needed, on a Spanish corpus. I was recently translating from English about “watching a cricket match live” (as opposed to on TV). GT translated this as “in vivo” but I wanted to confirm this was normal in Spanish, so I checked the phrase in a Spanish corpus – where I found numerous examples, in the right context. In the case of German, my level is A2 (if that) so I’m completely dependent on GT. But so far I have always got the gist (and often more than that), and haven’t yet seen a translation where I’m left wondering what the hell they mean. (Again, the text type is the kind where GT performs best.)
- All of which bears out what is said in the last section (When and how…): “there are some situations for which GT is an excellent tool”; and in the Introduction: “With certain types of formal texts for a very few language pairs, GT often produces remarkably good results”. I would say that my situation (with regard to Spanish, at least) is pretty much optimal, because:
- I’m fairly proficient in Spanish (e.g. “GT is also an excellent writing aide when you have an intermediate knowledge of the target language”)
- I’m typically (when translating from Spanish to English) working with “Well-structured documents that are written in formal language for a general audience, such as Wikipedia or news articles”, where the report concedes that GT works optimally
- I’m working between English and one of the best-resourced global languages
- I’m a linguist and have access to sophisticated language tools to complement GT
- So with all this favourable wind, GT works well enough for me, provided I see it as useful, rather than necessarily reliable (see also the observation at the end of this document). In my rather particular case I wouldn’t agree that “Your problem as a user of GT is that you have no way of knowing what mix of those three [GT knows, guesses, or makes up] applies to your particular text”. But without all these advantages, many of the problems identified in the report remain serious. Not so much GT’s inability to cope with short multiword units (see above), but in particular:
- GT’s weaknesses when working with anything other than the “best” text types
- GT’s serious failures when working with any but a handful of major, well-resourced languages.
3. Google’s claims for GT
- The report argues convincingly that GT’s claims for its system are wildly overblown. There is no disputing the fact that “GT falls far short of the media coverage it winkingly accepts, that has convinced the public they provide universal translation”. And it’s interesting to see that GT’s own Terms of Service (full of sensible caveats) give the lie to their public claims: e.g. “we don’t make any commitments about the content within the Services, the specific functions of the Services, or their reliability”.
- In fairness to GT, the “headline” figures given in the report make it look staggeringly bad: “in 99% of Language A to Language B scenarios that are possible in GT, the chance of understandable output is lower than a coin flip. In three quarters of cases, 3861 pairs, intelligibility is below 25%”. Plus the fact that 98.57% of all the world’s languages aren’t covered at all. This is all true, but it assumes a very large number of translation pairs (say, Slovene to Zulu) which are very long tail (and may in reality never be attempted) – most activity on GT won’t be of this type, and I doubt that anyone with a bit of linguistic nous would even expect to get good results in cases like this. Equally, though the claim to cover “99%” of languages is strictly speaking incorrect, most users probably wouldn’t expect GT to function with (for example) some of the near-extinct and purely oral Aboriginal languages of Australia.
- But that’s not to say GT isn’t culpable in its grandiose PR. When we read that “Facebook’s Chief AI Scientist states as fact that we have essentially solved MT among all 7000 human languages”, it’s clear that a lot of people who ought to know better have been taken in by Google’s PR.
- Where the report is especially valuable is in demonstrating that:
- The claim that GT learns from its users and “has a steady program to improve based on input from users” (which many people believed, including me) is simply untrue
- The idea that AI is going to crack whatever problems remain appears to be unfounded.
- In most translation scenarios, i.e. where neither source nor target is English, English is still used as a pivot between non-English languages (the exceptions currently being Catalan to Spanish, Czech-Slovak, and Japanese-Korean), so that any errors in the first stage of the process are compounded in the second. This was certainly news to me.
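The compounding effect of pivoting can be put in rough numbers. What follows is only a minimal sketch, not a description of how GT works internally: it assumes that errors in the two legs (source to English, English to target) are independent, and the accuracy figures and the function name `pivot_accuracy` are hypothetical, chosen purely for illustration rather than taken from the report.

```python
def pivot_accuracy(p_source_to_english: float, p_english_to_target: float) -> float:
    """Chance that a segment survives both legs of a pivot translation.

    Assumption (not from the report): errors in the two legs are
    independent, so the probabilities simply multiply.
    """
    for p in (p_source_to_english, p_english_to_target):
        if not 0.0 <= p <= 1.0:
            raise ValueError("probabilities must be between 0 and 1")
    return p_source_to_english * p_english_to_target


# Hypothetical per-leg accuracies, for illustration only.
legs = [(0.9, 0.9), (0.7, 0.6), (0.5, 0.5)]

for first, second in legs:
    combined = pivot_accuracy(first, second)
    print(f"{first:.0%} then {second:.0%} -> {combined:.0%}")

# Prints:
# 90% then 90% -> 81%
# 70% then 60% -> 42%
# 50% then 50% -> 25%
```

On these made-up figures, two legs that each look respectable on their own can still leave the pivoted result close to, or below, a coin flip. Real errors are unlikely to be fully independent, so this should be read as a rough intuition rather than a measurement.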
- The subsection of the last part, named “GT output should only be viewed as suggestions”, is instructive. The comparison with searching for flights on Kayak prompts the thought that this is how many people, especially digital natives, use the web generally: in a knowing, approximative way rather than a slavishly credulous one. That is, they’re savvy enough (through experience) to know that it is “only” a computer telling them this, and that the machine sometimes gives suggestions (a 30-hour flight involving many changes) which may be theoretically (or “literal-mindedly”) valid, but which a human being would instantly discount and move on from, without necessarily thinking: “this system is useless because it has made an impractical suggestion”. Maybe people don’t expect perfection and can easily work around imperfections. This could represent a common attitude towards GT. If only Google would present their product with caveats like this.
Thanks for these thoughtful comments. I have taken your suggestion and moved what had been footnote 28 into the main body of the Empirical Evaluation chapter, expanded the point a bit, and added an illustrative gif: https://teachyoubackwards.com/empirical-evaluation/#picture-14.1