Errol O’Neill
• The database is by far the most comprehensive I’ve seen, evaluating Google Translate (GT) from English into over 100 languages. The use of 20 English expressions tests GT systematically across these languages, with comparable, well-defined qualitative metrics applied by the two raters for each language. It is surprising that only 15 languages were judged “close to human quality” 50% or more of the time on these expressions, and that raters judged a high proportion of translations to be “Tarzan” or “fail” in so many languages, commonly spoken ones in particular.
• The author makes a good point that GT has translations for only a very small minority of world languages, making it a “universal” translator (and, based on the rater judgements, a poor one at that) only for speakers of around 100 languages, and only when translating to and from English.
• One possible limitation of the data is that all items translated are set expressions. As the author notes, users often translate individual words as if GT were a dictionary. The reasoning behind this choice (the polysemic nature of most individual words) is clearly stated, but users translating individual words between English and a target language might judge GT to perform better (or worse) than the raters judged it on these expressions.
• One interesting extension, for future research, would be to test translations from a selection of languages into English. Translations from certain languages into English might produce better (or worse) results than translations from English.