References: Bibliography, Acronyms, and Technical Terms

Acronyms and Technical Terms

While most of the acronyms and terminology in this list are in common usage, some of the terms below are being introduced to the world for the first time in Teach You Backwards. Follow the links to reader-friendly sites that have more information about the technical topics. If there is a term in the book that is missing or poorly defined in the list below, please contact me so I can add it.

AI: Artificial Intelligence, the ability of computers to do clever things without specific programming instructions from humans. When a computer does something better than a human because it can process data really quickly, such as searching parallel texts to find all the words that seem to be translations of each other, that is not AI. When the computer makes extrapolations based on that information – for a made-up example, using a sentence that translates as “The jogger ran on the beach” as a model to produce “The sunbather san on the beach” (run/ran, so sun/san) – that is artificial intelligence.

Bard: This is the Teach You Backwards assessment of whether a translation has the natural elegance of a native speaker. A high Bard rating indicates translation that our native evaluators thought was similar to something they might produce themselves.

BEDS: Better than English Derangement Syndrome. Although people are fully aware that computers make numerous mistakes with English, the main language of the tech industry, many tacitly assume that the models for other languages are fully robust.

BLEU: bilingual evaluation understudy, an algorithm for evaluating the similarity between a computer translation and a reference human translation.
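
To make the idea of "similarity to a reference" concrete, here is a minimal sketch of a BLEU-style score. It is a simplification under stated assumptions, not the official metric: it uses a single reference, whitespace tokenization, n-grams only up to length 2, and no smoothing. The function names `ngrams` and `simple_bleu` are made up for this illustration.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Count the n-grams (as tuples) in a list of tokens."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate, reference, max_n=2):
    """Simplified single-reference BLEU (illustrative assumption, not the official metric)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0
    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        total = sum(cand_counts.values())
        # Clip each candidate n-gram count at its count in the reference.
        matched = sum(min(c, ref_counts[g]) for g, c in cand_counts.items())
        if total == 0 or matched == 0:
            return 0.0  # no smoothing in this sketch
        log_precisions.append(math.log(matched / total))
    # Brevity penalty: punish candidates shorter than the reference.
    brevity = 1.0 if len(cand) >= len(ref) else math.exp(1 - len(ref) / len(cand))
    # Geometric mean of the n-gram precisions, scaled by the penalty.
    return brevity * math.exp(sum(log_precisions) / max_n)
```

With "The jogger ran on the beach" as the reference, an identical candidate scores 1.0, while "The sunbather san on the beach" scores lower because only some of its words and word pairs appear in the reference. Note that the score measures surface overlap, not whether the meaning survived.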

CAT: Computer assisted translation. CAT is a different animal than MT. In CAT, the computer places suggestions in front of a human translator, who is free to accept the suggestions, modify them, or take a completely different path, based on their understanding of the two languages. In MT, on the other hand, the user is presented with output as a more-or-less take-it-or-leave-it result.

Corpus (plural = corpora or corpuses): A collection of texts that can be used to extract linguistic data. A monolingual corpus can include billions of words in a single language, collected from books, newspapers, public records, blogs, tweets, and anything else that has been digitized and is publicly available. A parallel corpus matches two or more languages based on translated documents. Europarl is the most extensive parallel corpus, extracted from the proceedings of the European Parliament and professionally translated into all of the European Union’s official languages. Monolingual corpora for non-market languages are rare, and parallel corpora linking such languages with each other or with lucrative languages are almost non-existent. NLP and lexicography are quite often based on data from corpora, inherently excluding the supermajority of the world’s languages.

CS: Computer science

Disambiguation: A word like “pool” is ambiguous – out of context, you don’t know whether it refers to a place for swimming, a game played with balls on a table, a small group of people with a shared purpose, or a communal combination of funds. Disambiguation is the task of recognizing the ambiguity and figuring out which version is correct for the context.
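
One very simple way to approach the task is to count which sense shares the most cue words with the surrounding sentence. The sketch below does that for "pool". The sense labels and cue words in `POOL_SENSES` are invented for this example, and the overlap-counting approach is only one of many possible techniques, not a description of how any particular MT system works.

```python
# Hypothetical sense inventory for "pool"; the cue words are invented for illustration.
POOL_SENSES = {
    "swimming place": {"swim", "water", "dive", "lifeguard", "splash"},
    "table game": {"cue", "table", "balls", "billiards", "shot"},
    "group of people": {"candidates", "talent", "jury", "reporters", "applicants"},
    "shared funds": {"money", "funds", "bet", "prize", "contribute"},
}


def disambiguate(sentence, senses=POOL_SENSES):
    """Return the sense whose cue words overlap most with the sentence, or None."""
    # Lowercase the words and strip trailing punctuation before comparing.
    words = {w.strip(".,!?;:").lower() for w in sentence.split()}
    scores = {sense: len(words & cues) for sense, cues in senses.items()}
    best = max(scores, key=scores.get)
    # If no cue word appears at all, admit that the context does not decide.
    return best if scores[best] > 0 else None
```

Given "She chalked her cue and sank two balls at the pool table.", the function picks "table game". Given a sentence with no cue words, it returns `None` rather than guessing.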

FAAMG: Facebook, Amazon, Apple, Microsoft, and Google. You probably don’t need hyperlinks to find the relevant sites.

GT: Google Translate

HLT: Human Language Technology. Any software where processing human expressions lies at the core. Microsoft Word is not HLT, but the spell-checker within it is. The GoPro camera is not HLT, but the software that enables it to respond to voice commands is.

ICT: Information and Communications Technology

Kamusi: “Kamusi” is the Swahili word for “dictionary”. The word derives from Arabic, and similar words are used in many other languages across Africa, the Middle East, and the Indian Ocean. The Kamusi Project is a non-profit work-in-progress to build a dictionary and data center that will eventually provide nuanced connections among “every word in every language”, for people to use for their own language mastery and to deploy in their machines. Kamusi Here is the free mobile app that puts all that knowledge directly at your fingertips – for Android (http://kamu.si/android-here) and iOS (http://kamu.si/ios-here).

KOTTU: The bulk of the research for Teach You Backwards occurred in 2018 and 2019, when Google Translate reached into 103 languages. GT added 5 more languages in February 2020: Kinyarwanda, Odia (also called Oriya), Tatar, Turkmen, and Uyghur – herein abbreviated as KOTTU. These languages were immediately tested. Most of the results in TYB were updated to reflect the larger dataset. However, some of the numbers and charts remain in the pre-KOTTU state, because a lot of effort would be needed to recalculate what would be, in the end, quite small differences from the original test data.

An example from Facebook of a translation that is replete with Lei Lies, where MT produces fake words such as “gutabandi” that do not exist in the target language. Of the 18 Hindi words in the original post, at least 8 are rendered with Lei Lies in English. The Google translation of the same text has more words that exist in English, but is also a MUSA koan: “Where there are targets and paths, there can not be differences or factions. Pseudo-congenital-Hindutva-rogas-ezam-rites-in-transmitted-are engaged in”

Lei lie: The invention of a word in the target language that does not exist in the parallel training vocabulary (as illustrated in Picture 5.2 and discussed in the associated Point 10 in the Introduction).

ML: Machine Learning, a process through which computers use previous results to improve their next round of outcomes.

MT: Machine Translation, the use of computation to convert text from one language to another.

MTyJ: Me Tarzan, you Jane. This indicates a rough translation, where the basic idea is conveyed from one language to the other, but the grammar and syntax are choppy.

MUSA: The Make Up Stuff Algorithm, the imperative for MT to produce some sort of output regardless of whether it has a basis in actual data. “Musa” is the equivalent name for the prophet Moses in many languages.

MWE: Multiword Expression. This is an extremely important concept in NLP. At Kamusi, we are doing a lot of work to identify, define, and translate MWEs across languages. For that work, we need the help of non-specialists who might find “MWE” to be a user-hostile acronym. We instead use the expression “party term”, coined at a meeting of EU experts, and urge other people working in linguistics and language technology to migrate with us.

NLP: Natural Language Processing, the task that joins computers and linguists in figuring out how the words we say can be made into data that machines can work with. The main academic field for this task is called Computational Linguistics.

NMT: Neural Machine Translation is the act of matching data from one language against data from another language, finding things that seem to correspond, and using that information to make translations in similar situations.

Party term: Two or more words that dance together, where you cannot understand the meaning by looking them up separately. For example, an African fish eagle, a type of bird that lives in Africa and eats fish, is neither an African nor a fish, so the expression does not make any sense unless the three words are understood together as a unit. Here is an example from the show “A Series of Unfortunate Events”, where the dramatic flow is paused for the narrator to interject this soliloquy about the party term he invokes:

It is now necessary for me to use the rather hackneyed phrase “meanwhile, back at the ranch.” “Meanwhile, back at the ranch” is a phrase used to link what is going on in one part of the story to what is going on in another part of the story, and it has nothing to do with cows or with horses or with any people who work in rural areas where ranches are, or even with ranch dressing, which is creamy and put on salads. – Lemony Snicket
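
For readers who think in code, treating party terms as units can be sketched as a longest-match lookup: before looking up single words, check whether the next few words together form a known expression. The mini-lexicon `PARTY_TERMS` and the function `group_party_terms` are hypothetical names for this illustration, and the sketch assumes the party terms are already listed somewhere, which is the hard part in practice.

```python
# Hypothetical mini-lexicon of party terms, stored as lowercase word tuples.
PARTY_TERMS = {
    ("african", "fish", "eagle"),
    ("meanwhile", "back", "at", "the", "ranch"),
    ("ranch", "dressing"),
}
MAX_LEN = max(len(term) for term in PARTY_TERMS)


def group_party_terms(text):
    """Split text into lookup units, keeping known party terms together."""
    words = [w.strip('.,!?;:"').lower() for w in text.split()]
    words = [w for w in words if w]
    units, i = [], 0
    while i < len(words):
        # Try the longest possible chunk first, then shrink to a single word.
        for size in range(min(MAX_LEN, len(words) - i), 0, -1):
            chunk = tuple(words[i:i + size])
            if size == 1 or chunk in PARTY_TERMS:
                units.append(" ".join(chunk))
                i += size
                break
    return units
```

For "An African fish eagle ate ranch dressing." this yields the units "an", "african fish eagle", "ate", and "ranch dressing", so the bird is looked up as a bird rather than as a person, a fish, and another bird.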

Source language: In translation, this is the language you are starting from.

SMT: Statistical Machine Translation is the act of making informed guesses about which words in the target language are likely to correspond to ambiguous words in the source language, largely based on corpus frequency estimates.
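
The "informed guess" can be illustrated with a toy frequency count. The sketch below assumes we already have word-aligned pairs extracted from a parallel corpus. The data in `ALIGNED_PAIRS` is invented (English to Spanish), the function names are hypothetical, and real SMT systems combine many more signals than this single estimate.

```python
from collections import Counter

# Hypothetical word-aligned pairs (English source word, Spanish target word),
# standing in for alignments extracted from a parallel corpus.
ALIGNED_PAIRS = [
    ("light", "luz"), ("light", "luz"), ("light", "luz"),
    ("light", "ligero"), ("light", "ligero"),
    ("light", "encender"),
]


def translation_probabilities(source_word, pairs=ALIGNED_PAIRS):
    """Estimate how often each target word was aligned with the source word."""
    counts = Counter(target for source, target in pairs if source == source_word)
    total = sum(counts.values())
    return {target: count / total for target, count in counts.items()}


def best_guess(source_word, pairs=ALIGNED_PAIRS):
    """Return the most frequent aligned target word, or None if the word is unseen."""
    probs = translation_probabilities(source_word, pairs)
    return max(probs, key=probs.get) if probs else None
```

With this toy data, `best_guess("light")` returns "luz" because it accounts for half of the alignments. Because the sketch ignores context, it would make the same guess for a light suitcase, which is exactly why frequency alone is a guess rather than a translation.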

Smurf: spelling/meaning unit reference. Treating words as collections of letters creates intractable problems for MT, since something like l-i-g-h-t can refer to many things that are spelled the same way but have different meanings. A spelling/meaning unit is a unique combination of letters and the idea they signify, such as light that is not dark, or light that is not heavy. Spelling/meaning units can be single words, or they can be party terms. In Kamusi, each spelling/meaning unit is assigned what data scientists call a “unique identifier” so that we can pinpoint the term. We call these reference numbers “smurfs” so people will visualize the data as friendly blue cartoon characters instead of eye-glazing digits.
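
As a rough illustration of the idea, a spelling/meaning unit can be modeled as a small record whose identifier, not its spelling, is what other data points to. The class names, fields, and numbering scheme below are assumptions made for this sketch. They are not Kamusi's actual data model.

```python
from dataclasses import dataclass
from itertools import count


@dataclass(frozen=True)
class SpellingMeaningUnit:
    smurf: int     # unique identifier (hypothetical numbering scheme)
    spelling: str  # a single word or a party term
    meaning: str   # short gloss for the idea this spelling signifies


class Lexicon:
    """Toy store that gives every spelling/meaning unit its own smurf."""

    def __init__(self):
        self._next_id = count(1)
        self._by_smurf = {}

    def add(self, spelling, meaning):
        unit = SpellingMeaningUnit(next(self._next_id), spelling, meaning)
        self._by_smurf[unit.smurf] = unit
        return unit.smurf

    def get(self, smurf):
        return self._by_smurf[smurf]

    def senses(self, spelling):
        """All units that share a spelling but differ in meaning."""
        return [u for u in self._by_smurf.values() if u.spelling == spelling]
```

Adding "light" (not dark) and "light" (not heavy) produces two different smurfs for the same string of letters, so a translation link can point at one sense without dragging the other along.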

Target language: In translation, this is the language you need to use brain power or machine power to obtain.

Tarzan: This is the Teach You Backwards assessment of whether the gist of a translation can be understood by a native speaker, regardless of how inelegantly it might be expressed.

Trumplation: Reality-adjacent MT output that shifts from moment to moment. Different words might be placed on screen as a user types or changes punctuation (such as whether or not the input string ends with a period). Output might also vary in different locations, on different browsers, or, like airline fares, at different times of day.

TYB: Teach You Backwards, the title of the Web Book you are reading right now.

Zero-shot translation: Attempts have been made to find areas where the translation space between Language A and English in one dataset and the space between Language B and English in another dataset overlap, and use those to make direct connections between Language A and Language B. It does not render viable translations, but it sounds super cool.

Zero-shot translation builds a bridge between two languages based on their connections through English, then removes the underlying English support. Image modified from The Stronghold Rebuilt.

References

Adler, J. S. (1981). War in Melville’s imagination. New York University Press.

Ahmed, M. (2013, July 26). The Babel fish mobile is on its way ..... Yes, you heard correctly. The Times. https://www.thetimes.co.uk/article/the-babel-fish-mobile-is-on-its-way-yes-you-heard-correctly-mx8tz2qw9vk

Aiken, M., & Balan, S. (2011). An Analysis of Google Translate Accuracy. Translation Journal, 16(2). http://translationjournal.net/journal/56google.htm

Aiu, C. (2016, February 24). Why Google Translate Adding 13 New Languages Isn’t Good News. The Odyssey Online. http://theodysseyonline.com/nyu/why-google-translate-adding-13-new-languages-isnt-actually-the-greatest-thing/326899

Antin, J., & Churchill, E. F. (2011). Badges in Social Media: A Social Psychological Perspective. CHI Conference on Human Factors in Computing Systems, 4. http://gamification-research.org/wp-content/uploads/2011/04/03-Antin-Churchill.pdf

Arivazhagan, N., Bapna, A., Firat, O., Lepikhin, D., Johnson, M., Krikun, M., Chen, M. X., Cao, Y., Foster, G., Cherry, C., Macherey, W., Chen, Z., & Wu, Y. (2019). Massively Multilingual Neural Machine Translation in the Wild: Findings and Challenges. ArXiv:1907.05019 [Cs]. http://arxiv.org/abs/1907.05019

Artetxe, M., Labaka, G., Agirre, E., & Cho, K. (2017, October 30). Unsupervised Neural Machine Translation. ICLR 2018. http://arxiv.org/abs/1710.11041

Bahdanau, D., Cho, K., & Bengio, Y. (2014). Neural Machine Translation by Jointly Learning to Align and Translate. ArXiv:1409.0473 [Cs, Stat]. http://arxiv.org/abs/1409.0473

Barakos, E., & Selleck, C. (2019). Elite multilingualism: discourses, practices, and debates. Journal of Multilingual and Multicultural Development, 0(0), 1–14. https://doi.org/10.1080/01434632.2018.1543691

Benjamin, M. (1997). Malangali and the Cyberians: Reflections on the Internet Living Swahili Dictionary. Africa Today, 44(3), 339–356. https://infoscience.epfl.ch/record/218656

Benjamin, M. (2004). Drinking to their health: social analysis of a micronutrient-fortified beverage field trial. Food and Nutrition Bulletin, 24(4), S141-145. https://infoscience.epfl.ch/record/203525

Benjamin, M. (2011). Toward a Standard for Community Participation in Terminology Development. Proceedings of the First Conference on Terminology, Language, and Content Resources. First Conference on Terminology, Language, and Content Resources, Seoul, Korea. https://infoscience.epfl.ch/record/200389

Benjamin, M. (2014). Collaboration in the Production of a Massively Multilingual Lexicon. LREC 2014 Proceedings. Language Resources and Evaluation Conference, Reykjavik, Iceland. https://infoscience.epfl.ch/record/200376

Benjamin, M. (2014). Elephant Beer and Shinto Gates: Managing Similar Concepts in a Multilingual Database. Proceedings of the Seventh Global Wordnet Conference, 201–205. https://infoscience.epfl.ch/record/200381

Benjamin, M. (2014). Participatory Language Technologies as Core Systems for Sustainable Development Activities. 2014 Tech4Dev International Conference, Lausanne, Switzerland. https://infoscience.epfl.ch/record/200379

Benjamin, M. (2015). Crowdsourcing Microdata for Cost-Effective and Reliable Lexicography. Proceedings of AsiaLex 2015 Hong Kong, 213–221. https://infoscience.epfl.ch/record/215062

Benjamin, M. (2015). Excluded Linguistic Communities and the Production of an Inclusive Multilingual Digital Language Infrastructure. 11th Language and Development Conference, New Delhi, India. https://infoscience.epfl.ch/record/222706

Benjamin, M. (2016). Digital Language Diversity: Seeking the Value Proposition. 2nd Workshop on Collaboration and Computing for Under-Resourced Languages, Portoroz, Slovenia. https://infoscience.epfl.ch/record/222525

Benjamin, M. (2016). Kamusi Pre:D – Lexicon-based source-side predisambiguation for MT and other text processing applications. European Association of e-Lexicography, “Lexicographic data meet computational linguistics and knowledge based systems”, COST ENeL WG3 meeting, Brno, Czech Republic. https://infoscience.epfl.ch/record/222524

Benjamin, M. (2016). Problems and Procedures to Make Wordnet Data (Retro)Fit for a Multilingual Dictionary. Proceedings of the Eighth Global WordNet Conference. Global WordNet Conference 2016, Bucharest, Romania. https://infoscience.epfl.ch/record/221046

Benjamin, M. (2018). Hard Numbers: Language Exclusion in Computational Linguistics and Natural Language Processing. Proceedings of the Eleventh International Conference on Language Resources and Evaluation, 6. http://lrec-conf.org/workshops/lrec2018/W26/pdf/23_W26.pdf

Benjamin, M. (2018). Inside Baseball: Coverage, quality, and culture in the Global WordNet. Cognitive Studies | Études Cognitives, 0(18). https://doi.org/10.11649/cs.1712

Benjamin, M. (2000). Development Consumers: An Ethnography of “The Poorest of the Poor” and International Aid in Rural Tanzania [Yale University]. https://infoscience.epfl.ch/record/215064/

Benjamin, M. (2000). Development Consumers: Conclusions (Chapters 8 and 9) [Yale University]. https://infoscience.epfl.ch/record/215064/

Benjamin, M., & Allen, J. (2016). Kamusi Pre:D - Source-Side Disambiguation and a Sense Aligned Multilingual Lexicon. Translating and the Computer 37, London, England. https://infoscience.epfl.ch/record/215063

Benjamin, M., & Biersteker, A. (2001). The Kamusi Project Edit Engine: A New Tool for Collaborative Lexicography. Journal of African Language Learning and Teaching, 1(1), 75–88. https://infoscience.epfl.ch/record/215061

Benjamin, M., & Houssouba, M. (2015). Looking forward by looking back: Applying lessons from 20 years of African language technology. 7th Language & Technology Conference: Human Language Technologies as a Challenge for Computer Science and Linguistics, Poznań, Poland. https://infoscience.epfl.ch/record/222615

Benjamin, M., & Radetzky, P. (2014). Multilingual Lexicography with a Focus on Less-Resourced Languages: Data Mining, Expert Input, Crowdsourcing, and Gamification. Language Resources and Evaluation Conference, Reykjavik, Iceland. https://infoscience.epfl.ch/record/200375

Benjamin, M., & Radetzky, P. (2014). Small Languages, Big Data: Multilingual Computational Tools and Techniques for the Lexicography of Endangered Languages. Proceedings of the 2014 Workshop on the Use of Computational Methods in the Study of Endangered Languages, 15–23. https://infoscience.epfl.ch/record/200377

Benjamin, M., Mansour Lakouraj, S., & Aberer, K. (2017, March 5). DUCKS in a Row: Aligning open linguistic data through crowdsourcing to build a broad multilingual lexicon. 5th International Conference on Language Documentation and Conservation (ICLDC), Honolulu, Hawaii. http://scholarspace.manoa.hawaii.edu/handle/10125/41982

Bond, E. (2018, June 11). England’s Top Judge Predicts ‘the End of Interpreters.’ Slator. https://slator.com/industry-news/englands-top-judge-predicts-the-end-of-interpreters/

Brabham, D. (2013). Crowdsourcing: A Model for Leveraging Online Communities. In A. Delwiche & J. J. Henderson (Eds.), The Participatory Cultures Handbook (pp. 120–129). Routledge.

Buckley, C. (2015, February 18). A Lunar New Year With a Name That’s a Matter of Opinion. The New York Times. https://www.nytimes.com/2015/02/19/world/asia/chinese-new-year-sheep-goat.html

Carroll, L. (1865). Alice’s Adventures in Wonderland. https://ebooks.adelaide.edu.au/c/carroll/lewis/alice/

Cho, K., van Merrienboer, B., Bahdanau, D., & Bengio, Y. (2014). On the Properties of Neural Machine Translation: Encoder-Decoder Approaches. ArXiv:1409.1259 [Cs, Stat]. http://arxiv.org/abs/1409.1259

Christian, J. (2018, July 20). Why Is Google Translate Spitting Out Sinister Religious Prophecies? Motherboard. https://motherboard.vice.com/en_us/article/j5npeg/why-is-google-translate-spitting-out-sinister-religious-prophecies

Clifford, J., Merschel, L., & Munné, J. (2013). Surveying the Landscape: What is the Role of Machine Translation in Language Learning? @tic Revista d’innovació Educativa, 0(10), 108–121. https://doi.org/10.7203/attic.10.2228

Conneau, A., Lample, G., Ranzato, M., Denoyer, L., & Jégou, H. (2017). Word Translation Without Parallel Data. ArXiv:1710.04087 [Cs]. http://arxiv.org/abs/1710.04087

Constant, M., Eryiğit, G., Monti, J., van der Plas, L., Ramisch, C., Rosner, M., & Todirascu, A. (2017). Multiword Expression Processing: A Survey. Computational Linguistics, 43(4), 837–892. https://doi.org/10.1162/COLI_a_00302

Corbett, E. (2018, June 6). You Can Pilot Larry Page’s New Flying Car With Just An Hour Of Training. Fortune. http://fortune.com/2018/06/06/larry-page-kitty-hawk-personal-car/

Coren, M. J. (2018, October 13). We were promised flying cars. It looks like we’re finally getting them. Quartz. https://qz.com/1422955/flying-cars-are-really-truly-on-their-way/

Cronkleton, R. (2018, November 21). Winter storm could make trip home from Thanksgiving hazardous for KC-area travelers. Kansas City Star. https://www.kansascity.com/weather/article222012695.html

Davis, K. H. (1952). Automatic Recognition of Spoken Digits. Acoustical Society of America Journal, 24, 637. https://doi.org/10.1121/1.1906946

Debczak, M. (2016, October 3). Improvements to Google Translate Boost Accuracy by 60 Percent. Mental Floss. http://mentalfloss.com/article/86960/improvements-google-translate-boost-accuracy-60-percent

Dent, S. (2016, November 24). Google’s AI can translate language pairs it has never seen. Engadget. https://www.engadget.com/2016/11/24/google-ai-translate-language-pairs-it-has-never-seen/

Diaz, J. (2017, December 1). For The First Time, AI Can Teach Itself Any Language On Earth. Fast Company. https://www.fastcompany.com/90152951/for-the-first-time-ai-can-teach-itself-any-language-on-earth

Diño, G. (2018, January 19). Machine Translates Literature and About 25% Was Flawless, Research Claims. Slator. https://slator.com/technology/machine-translates-literature-and-about-25-was-flawless-research-claims/

Domingos, P. (2015). The Master Algorithm: How the Quest for the Ultimate Learning Machine Will Remake Our World. Basic Books.

Drum, K. (2017, October 5). Google announces new universal translator. Mother Jones. https://www.motherjones.com/kevin-drum/2017/10/google-announces-new-universal-translator/

Erickson, H., & Gustafsson, M. (1989). Kiswahili Grammar Notes. http://kamusi.org/content/kiswahili-grammar-notes
