To spend an afternoon in Jackson Heights is to spend an afternoon touring the world. Nestled in the northwest corner of New York City’s largest borough, Queens, Jackson Heights is commonly believed to be the most linguistically diverse neighborhood in the most linguistically diverse city in the world. It is home to some 180,000 people who collectively speak more than 160 of the 700-plus languages that have been documented in the city. It has every right to claim the mantle of the “world’s town square.” But just like the world it represents, Jackson Heights has a problem: Its languages are disappearing.
Linguists warn that the vast majority of the roughly 7,000 languages spoken on the planet are at risk of extinction. Estimates by cultural heritage organizations, including UNESCO, suggest that a language loses its last living speaker roughly every two weeks, and with that speaker goes any chance of revitalizing a community of speakers. At that point, the best linguists can hope for is that the language is preserved for future study, much as the artifacts of long-gone civilizations are preserved in a museum.
This is a cultural tragedy, driven by economic globalization, technological change, and government policy. For the estimated 600 languages lost in the past century alone, it may already be too late. But increasingly, linguists are turning to AI in a bid not just to preserve but to revitalize the languages that remain.
PUTTING THE “LANGUAGE” IN LARGE LANGUAGE MODEL
Artificial intelligence and human language have always been closely linked, but the rise of large language models (LLMs) has made the connection explicit. LLMs are trained on massive amounts of text collected from the internet and from digitized books, which allows them to respond to a wide range of user input with accessible, coherent output. Unsurprisingly, linguists and cultural historians quickly recognized the technology’s potential to help save the thousands of languages at risk of extinction and to make online content accessible to speakers of the hundreds of languages that most technologies don’t support. Most LLMs today can handle only about 100 languages, with widely varying accuracy and quality, but the technology raises the possibility of models fluent in at-risk and underrepresented languages. Beyond giving linguists a powerful new research tool, such models would be a major boon for these languages’ remaining speakers, helping educate the next generation and opening access to the economic opportunities the internet enables.
In principle, it’s an elegant and important use case, but it’s remarkably difficult to put into practice. One of the biggest challenges, says David Adelani, an assistant professor of computer science at McGill University who specializes in computational linguistics, is the lack of data. Training an LLM requires a tremendous amount of text, and even beyond that sky-high baseline, a model keeps producing better results the more data it ingests. The problem is that an estimated 50 percent of all websites are written in English, and the top 10 languages account for more than 80 percent of all content on the internet. The vast majority of the Earth’s languages are what Adelani calls “under-resourced languages”: the mountains of textual data needed to train an LLM in them simply don’t exist.
“There is a very big connection between the amount of data available on the web and the languages that are supported by current technologies,” says Adelani. “If your language doesn’t have a lot of text online, it will be less represented in those technologies. There are so many languages that fall into that category, and most of them are from the Global South.”
Endangered languages are, by definition, under-resourced, but not every under-resourced language is endangered. Oromo, for example, is spoken by about 45 million people in Ethiopia and Kenya, in stark contrast to a language like Kalasha-mun, spoken by a few thousand people in Pakistan. Despite its large speaker base, Oromo remains underrepresented in LLMs and other AI-driven technologies, primarily because less than 30 percent of the population in the countries where it’s spoken has regular access to the internet. The problem is far worse for the majority of the world’s 7,000 languages, most of which are spoken by fewer than 10,000 people.
But the general lack of linguistic data for training LLMs and other AI systems is only part of the problem. Another major challenge is that LLMs struggle to handle textual input that doesn’t use the Roman alphabet, outside of a few major languages such as Russian and Mandarin. Many languages—especially the most endangered—lack even Unicode standardization for their script, which is necessary to write in a language on the internet. Adelani recalls attempting to train an AI model in Amharic, the most widely spoken language in Ethiopia, only to receive a zero accuracy score. The takeaway, he says, was clear: “If the script isn’t supported, the model isn’t going to work.”
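The scale of that mismatch is easy to demonstrate. The short Python sketch below is our illustration, not Adelani’s experiment: it assumes the open-source Hugging Face transformers library and the English-centric GPT-2 tokenizer, and counts how many tokens the same greeting needs in English versus Amharic. The more tokens per character, the more the model is falling back on raw byte fragments it rarely saw in training.

```python
# A rough, hypothetical illustration: how an English-centric byte-level BPE
# tokenizer fragments a well-resourced script versus an underrepresented one.
# Requires the Hugging Face "transformers" package.
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")  # English-centric vocabulary

samples = {
    "English": "Hello, how are you today?",
    "Amharic": "ሰላም፣ ዛሬ እንዴት ነህ?",  # roughly the same greeting, in Ethiopic script
}

for language, text in samples.items():
    tokens = tokenizer.tokenize(text)
    # A high token-per-character ratio means the script is mostly being
    # split into raw bytes the model rarely saw during training.
    print(f"{language}: {len(tokens)} tokens for {len(text)} characters")
```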
HOW AI CAN HELP LANGUAGES THRIVE
Today, the threat to the thousands of endangered languages spoken around the world is widely recognized by cultural conservationists, linguists, and the technologists working on LLMs, who are exploring different ways to pull these languages back from the brink of extinction.
Many of these efforts are focused on digitally documenting endangered languages in the hopes that, if they can’t be saved—in the sense of having a living community of speakers—they can at least be preserved for future study. One of the earliest and most successful examples of a global endangered-language preservation initiative is the Rosetta Project, which was launched in 2002 by the Long Now Foundation to create a digital library of all documented languages. This was supplemented by a separate initiative called the Endangered Languages Project, which launched in 2012 with support from Google. It has since blossomed into a catalog documenting more than 3,000 endangered languages.
In addition to collecting existing data on endangered languages, linguists worldwide are turning to smartphones, AI, and other technologies to accelerate the building of corpora (essentially, large linguistic databases) for endangered languages, a prerequisite for any viable AI model that generates text and speech in a language. In 2018, for example, Australia’s ARC Centre of Excellence for the Dynamics of Language (CoEDL) began using a robot called Opie to help teach Indigenous languages to children living in remote communities. CoEDL then partnered with Google to build AI models for Indigenous languages, sparing linguists the many hours of transcription that would otherwise have been needed to produce the models’ training data.
More recently, the Living Tongues Institute for Endangered Languages partnered with Shure, a company that makes small wireless microphones commonly used by social media creators, to simplify the recording of endangered languages in some of the most remote areas of the world.
These sorts of initiatives are an important part of preserving endangered languages, but not everyone is convinced they will be enough to save them, or even to support effective AI models of those languages. In 2013, András Kornai, a professor of mathematical linguistics at the Budapest Institute of Technology, calculated that fewer than 5 percent of the world’s languages can cross over into digital use; the rest simply lack the data. By that measure, a language may still have a small but viable community of living speakers yet already be dead in the sense that it’s unlikely ever to be used for online communication. For most of these receding languages, digital preservation may be the best we can hope for.
“Language preservation is a great thing, but it does not lead to a viable language community,” Kornai says. “You can reconstruct many things, but there’s a certain amount of data loss that’s inevitable.”
This loss of native speakers creates a fundamental problem for machine learning applications. Without fluent speakers to validate the output, the applications are extremely difficult to develop and refine. “This also makes it almost impossible to experiment with that language using machine learning, because you don’t have any speakers anymore who can tell if the output is gibberish,” Kornai adds.
Daan van Esch is a linguist who has spent the past 12 years at Google researching ways to make technology accessible to speakers of more of the world’s languages. He remembers the shock waves that Kornai’s paper sent through the field a decade ago. But a lot has changed in 10 years, and Van Esch now believes that Kornai’s dire prediction may be a bit too pessimistic.
“The big thing that’s changed is now there are smartphones everywhere,” says Van Esch. “There are so many people coming online and wanting to use their language on social media and chat apps, but it historically hasn’t been supported.”
In late 2022, Google announced the launch of its 1,000 Languages Initiative, which set the goal of building an AI model that supports the 1,000 most spoken languages in the world. But almost as soon as the project started, it faced the same monumental challenge: how to collect enough language data to build the model. The solution, says Uche Okonkwo, a Responsible AI program manager at Google, was a blend of technical innovation and community engagement. On the technical side, Google researchers focused on finding ways to do more with less data. For example, researchers at Google DeepMind developed a technique for teaching the company’s Gemini LLM an endangered Indonesian language by having it analyze one of the few formal grammar books that exist for the language.
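The article doesn’t spell out the mechanics, but one plausible shape for that grammar-book approach is in-context learning: rather than retraining the model, the book’s contents are placed directly in a long-context prompt. The sketch below is a hypothetical reconstruction under that assumption; build_prompt and its placeholder grammar excerpts and word list are our inventions, not Google DeepMind’s code.

```python
# A hypothetical sketch of "learning a language from a grammar book" via
# in-context prompting (our assumption, not Google DeepMind's actual code).
# The returned prompt would be sent to any long-context LLM API.

def build_prompt(grammar_text: str, word_list: str, sentence: str) -> str:
    """Pack the grammar book and lexicon into a single translation prompt."""
    return (
        "Below is the only published grammar of a low-resource language, "
        "followed by a bilingual word list.\n\n"
        f"=== Grammar book ===\n{grammar_text}\n\n"
        f"=== Word list ===\n{word_list}\n\n"
        "Using only the material above, translate this sentence into "
        f"English: {sentence}"
    )

# Stand-ins for the book's scanned chapters and appendix lexicon.
prompt = build_prompt(
    grammar_text="(chapters describing phonology, morphology, and syntax)",
    word_list="(entries pairing words with their English glosses)",
    sentence="(a sentence in the endangered language)",
)
print(prompt)  # in practice, sent to a long-context model
```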
But the real key, says Okonkwo, has been working directly with speaker communities to collect language data and ensure that the resulting AI model accurately represents the many nuances found in any language. This is particularly important in linguistically diverse regions of the world, such as India, where the dialects spoken by people living in villages only a few miles apart can vary widely.
In developing these models, especially those dealing with endangered or underrepresented languages, responsible AI practices play a crucial role. “Responsible AI, when it comes to language technologies, shouldn’t be seen as burdening the process. It should be baked into it,” says Okonkwo. “We’ve been focused on how to factor in things like community advisory boards and speaking to people on the ground to make sure that we incorporate those research findings all the way through the model development life cycle.”
The 1,000 Languages Initiative has already come a long way in making modern technology more accessible to the millions of speakers of underrepresented languages. But what about the speakers of languages that don’t fall in the top 1,000? Is using AI to try to save the world’s most endangered languages a lost cause? Not necessarily, says Adelani. Most endangered languages will never have enough data to support an LLM as capable in them as Gemini is in English, but Adelani and many other language technologists are exploring new language-model architectures that can be trained on far less data. Another option is to prioritize collecting linguistic data for specific use cases and to build AI focused on, say, providing medical information in an endangered language.
In the meantime, however, there’s plenty of foundational technical work to do. In early 2024, Stanford University launched the Stanford Initiative on Language Inclusion and Conservation in Old and New Media (SILICON), which is working on encoding endangered languages in standard formats to aid their preservation and use in AI applications. Shortly after, a global consortium of researchers and technologists led by Masakhane, a grassroots organization dedicated to natural language processing for African languages, published IrokoBench, a new benchmark suite that evaluates the performance of AI systems across 16 under-resourced African languages.
Useful AI models remain a distant dream for most endangered languages. Even so, current technology can serve as an educational tool, a bulwark against the further erosion of endangered languages. In 2021, for example, Google Arts & Culture launched Woolaroo, an open-source smartphone app for exploring endangered Indigenous languages, as a way to help younger generations engage with their linguistic heritage. In Woolaroo, the user can point their phone at an object (a tree, say) and see and hear the word in more than 10 Indigenous languages from around the world.
The use of AI for preserving and revitalizing the world’s endangered languages is full of promise, but it’s still too early to know whether it can stave off the mass language extinction predicted by Kornai only a decade ago.
“I don’t believe we can save these endangered languages with just AI,” says Van Esch. “A language is only saved when people continue to speak it. But if AI can be a tool that makes speaking it easier, then I consider that a job well done.”