As the pandemic at long last winds down, international travel is picking up, with millions looking to make up for lost time. As travelers explore foreign lands, tools like Google’s Neural Machine Translation system may come in handy; released in 2016, the software uses deep learning to draw links between words, figuring out how closely related they are, how likely they are to appear together in a sentence, and in what order.
Google’s tool works well (when the software was compared to human translators, it came close to matching the fluency of humans for some languages), but it’s limited to the more widely spoken languages of the world.
Meta wants to help, and is pouring resources into its own translation tool, with the aim (among others) of making it far more expansive than Google’s. A paper the company put out this week says Meta’s tool works in more than 40,000 different translation directions between 200 different languages. A “translation direction” refers to translations between language pairs, for example:
Direction 1: English > Spanish
Direction 2: Spanish > English
Direction 3: Spanish > Swahili
Direction 4: Swahili > English
Forty thousand sounds like a lot, but the permutations of 200 languages translating between one another add up fast: each language can be paired with 199 others, which already yields nearly 40,000 directions. It’s hard to determine precisely how many languages there are in the world, but one reliable estimate put the total at over 6,900. While it would be inaccurate, then, to say that Meta is building a universal translation system, it’s some of the most extensive work that’s ever been done in the field, particularly with what the company calls low-resource languages.
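For readers who want to check the arithmetic, here is a minimal Python sketch. It assumes a “translation direction” is any ordered pair of two distinct languages; the function name is ours for illustration and does not come from Meta’s paper. Note that exactly 200 languages gives 39,800 directions, just under 40,000, so the paper’s “more than 40,000” presumably rests on a slightly larger language count (that last point is our inference, not something the source states).

```python
def translation_directions(num_languages: int) -> int:
    """Count ordered (source, target) pairs where source != target.

    Hypothetical helper for illustration only.
    """
    return num_languages * (num_languages - 1)


# The four languages' worth of directions: English, Spanish, Swahili, plus one more.
print(translation_directions(4))     # 12
# A round 200 languages falls just short of 40,000.
print(translation_directions(200))   # 39800
# Two more languages push the count past 40,000.
print(translation_directions(202))   # 40602
# Covering the roughly 6,900 languages in the estimate above.
print(translation_directions(6900))  # 47603100
```

The count grows with the square of the number of languages, which is why covering every language on Earth would mean tens of millions of directions rather than tens of thousands.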
These are defined as languages with fewer than a million publicly available translated sentence pairs. They’re largely made up of African and Indian languages that aren’t spoken by a large population and don’t have nearly as much written history as common languages.
“One really interesting phenomenon is that people who speak low-resource languages often have a lower bar for translation quality because they don’t have any other tool,” Meta AI research scientist Angela Fan, who worked on the project, told The Verge. “We have this inclusion motivation of, ‘what would it take to produce translation technology that works for everybody’?”
Meta started its research by interviewing native speakers of low-resource languages to contextualize their need for translation—though the team notes that the majority of the interviewees were “immigrants living in the US and Europe, and about a third of them identify as tech workers,” meaning there may be some built-in bias and a different baseline life experience than the broader group of people who speak their languages.
The team then created models aimed at narrowing the gap between low and high-resource languages. To gauge how the model was performing once it started spitting out translations, the team put together a test dataset of 3,001 sentence pairs for each language covered by the model. The sentences were translated from English into the target languages by native speakers of those languages who are also professional translators.
Researchers fed the sentences through their translation tool and compared its output to human translations using a methodology called Bilingual Evaluation Understudy, or BLEU for short. BLEU is the standard benchmark used to evaluate machine translations; it produces a numerical score reflecting how closely the words and phrases in a machine translation overlap with those in a human reference translation. Meta’s researchers said their model saw a 44 percent improvement in BLEU scores compared to existing machine translation tools.
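To give a feel for how a BLEU-style score is computed, here is a simplified, sentence-level sketch in Python. It is not the evaluation code Meta used: real BLEU is normally computed over a whole test set, often with standardized tokenization and smoothing, and the function names and example sentences below are ours. The sketch counts how many one- to four-word sequences in the machine output also appear in a single human reference, takes the geometric mean of those four precisions, and applies a penalty when the output is shorter than the reference.

```python
import math
from collections import Counter


def ngrams(tokens, n):
    """Return a Counter of all n-word sequences in a token list."""
    return Counter(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))


def simple_bleu(candidate: str, reference: str, max_n: int = 4) -> float:
    """Simplified single-reference, sentence-level BLEU (no smoothing)."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    if not cand:
        return 0.0

    log_precisions = []
    for n in range(1, max_n + 1):
        cand_counts = ngrams(cand, n)
        ref_counts = ngrams(ref, n)
        # "Clipped" overlap: an n-gram is credited at most as often
        # as it appears in the reference.
        overlap = sum(min(count, ref_counts[gram])
                      for gram, count in cand_counts.items())
        total = sum(cand_counts.values())
        if overlap == 0 or total == 0:
            return 0.0  # without smoothing, any zero precision zeroes the score
        log_precisions.append(math.log(overlap / total))

    # Brevity penalty: punish outputs shorter than the reference.
    if len(cand) > len(ref):
        brevity_penalty = 1.0
    else:
        brevity_penalty = math.exp(1 - len(ref) / len(cand))

    return brevity_penalty * math.exp(sum(log_precisions) / max_n)


reference = "the cat sat on the mat"
print(simple_bleu("the cat sat on the mat", reference))  # identical: 1.0
print(simple_bleu("the cat sat on the rug", reference))  # one word off: about 0.76
```

Swapping a single word in a six-word sentence drops the score by roughly a quarter, because that one word breaks several overlapping word sequences at once. The score only measures surface overlap with the reference, not whether the meaning was preserved.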
That figure should be taken with a grain of salt, though. Language can be highly subjective, and a sentence could take on a completely different meaning based on just a one-word difference, or retain the exact same meaning despite multiple words changing. The data a model is trained on makes all the difference, and even that is subject to built-in bias and the intricacies of the language in question.
An additional differentiating aspect of Meta’s tool is that the company chose to open-source its work—including the model, the evaluation dataset, and the training code—in an attempt to democratize the project and make it a global community effort.
“We worked with linguists, sociologists, and ethicists,” said Fan. “And I think this kind of interdisciplinary approach focuses on the human problem. Like, who wants this technology to be built? How do they want it to be built? How are they going to use it?”
While it will bring benefits to the company’s broad user base, the translation tool is by no means a charitable project; Meta stands to gain a lot from being able to better understand its users and the way they communicate and use language (targeted ads come in all languages, after all). Not to mention, making the company’s platforms available in new languages will open up as-yet-untapped user bases (if there are any remaining).
Like many Big Tech undertakings, Meta’s translator should neither be disdained as an instrument of corporate power nor lauded as a gift to the masses; it will help bring people together and facilitate communication, even as it gives the social media giant new insights into our lives and minds.