The last survivor of a 65,000-year-old pre-Neolithic society on the Andaman Islands in the Indian Ocean was a woman named Boa Sr. When she passed away in 2010, the Bo language went extinct with her.
If that sounds like a singular occurrence, it’s not. Somewhere in the world, a language is lost every two weeks.
Consider the Mundas, a group of a million or so people who live in the eastern Indian states of Jharkhand, Orissa, and West Bengal.
“I learned Mundari very late in life as my parents were in another state where they were working, thus we didn’t speak the language at home,” says Dr. Meenakshi Munda, a Munda community member and assistant professor in the anthropology department of a university in Ranchi, Jharkhand. “I recognise how important identity is to a community, and our younger generation is losing that identity due to language barriers.”
The Munda community is worried about the future of its language because children in schools are exposed only to widely used languages like Bengali, Hindi, and Odia.
Even though Mundari has a written script, it has very little digital content or online presence, which gives people even less motivation to invest in learning the language.
At the Microsoft Research (MSR) lab in India, a few researchers have been working on developing digital ecosystems for languages like Mundari that don’t have enough presence online.
According to Kalika Bali of MSR India, “the way I describe my job for myself is that no person in this world should be prohibited from adopting any technology because they speak a different language.”
Bali is a specialist in natural language processing, the branch of linguistics and artificial intelligence (AI) that focuses on teaching computers to comprehend spoken and written languages.
Her team develops the foundational datasets needed to build AI systems for underrepresented languages in collaboration with local groups and native speakers. They intend to produce a dataset that is accurate and culturally relevant by incorporating the community in the data collection procedure.
English has been the primary language of the internet since its inception. Since then, better internet availability and a desire for material in native languages have allowed seven other widely spoken languages, including Chinese and Spanish, to partially rival English in terms of technological support. However, that represents only eight of the world’s almost 6,000 languages.
This means that about 88% of the world’s languages lack a sufficient online presence. It also means that 1.2 billion people, or 20% of the world’s population, are unable to use their language to interact with the internet.
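To put the scale of that gap in concrete terms, the short sketch below works out what share of the world’s languages the eight well-supported ones represent. It uses only the approximate figures quoted in the text (eight languages, roughly 6,000 in total); the function name is purely illustrative.

```python
def share_percent(part: int, total: int) -> float:
    """Return `part` as a percentage of `total`."""
    return 100.0 * part / total


# Figures as quoted in the article (approximate, not independently verified).
WELL_SUPPORTED_LANGUAGES = 8
WORLD_LANGUAGES = 6_000

share = share_percent(WELL_SUPPORTED_LANGUAGES, WORLD_LANGUAGES)
print(f"{share:.2f}% of languages are well supported online")
```

On those numbers, the well-supported languages make up a small fraction of one percent of all languages spoken today.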
As a result, “the gap between the haves and the have-nots got fairly obvious,” says Monojit Choudhury, Bali’s colleague and principal data and applied scientist at Microsoft’s Turing India.
According to the experts, low-resource languages are those that lack the resources needed to create technology for a digital presence.
Project ELLORA (Enabling Low Resource Languages) has two goals in building digital resources: ensuring that speakers of these languages can engage and communicate in the digital world, and conserving each language for future generations.
Launched in 2015, Project ELLORA began with the fundamentals. The first step was identifying existing resources, such as printed materials like books, and the extent of each language’s digital presence. In a 2020 study, Bali and her coworkers presented a six-tier classification, with the top tier representing languages with abundant resources, such as English and Spanish, and the bottom tiers representing languages with few to no resources.
Project ELLORA’s effort involves gathering the necessary materials for these languages and creating language models to satisfy the digital needs of their speakers.
The researchers of Project ELLORA collaborate with the local populations to identify this demand and the foundational technologies that can help to meet it. According to Bali, “No language technology can be detached from the users.”
In 2018, to determine what the Mundari community needs to preserve the language, the researchers financed a study in partnership with IIT Kharagpur.
What began as a straightforward word game for schoolchildren to help them learn the language quickly evolved into complex technology undertakings.
The community will have access to additional Mundari content thanks to MSR researchers’ work on a Hindi-to-Mundari text translation model and a speech recognition model.
The Deutsche Gesellschaft für Internationale Zusammenarbeit (GIZ), acting on behalf of the German Ministry for Economic Cooperation and Development, is funding a text-to-speech model as part of the “Forward – Artificial Intelligence for all” programme.
However, it is difficult to develop translation models for a language that has few relevant digital resources on which to train machine learning models.
Initially, a team led by professors from IIT Kharagpur worked with community members, who manually translated words from Hindi to Mundari.
Interactive Neural Machine Translation (INMT), a technology created by MSR researchers to speed up translation, predicts words while someone is translating between languages.
“It (INMT) makes it possible for people to translate between languages more successfully. When I begin typing in Mundari when translating from Hindi, it offers me predictive ideas in Mundari. Similar to the predictive text found in smartphone keyboards, except it works in two languages,” according to Bali.
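The interaction Bali describes can be pictured with a toy sketch. This is not how INMT is implemented: a real system would rank suggestions with a neural translation model, whereas the lookup table, the `suggest` function, and the English-to-Spanish stand-in pair below are all hypothetical, chosen only to show suggestions being narrowed as the translator types.

```python
from typing import Dict, List

# Hypothetical candidate translations standing in for a trained model's output.
# English -> Spanish is used purely as a stand-in language pair.
CANDIDATES: Dict[str, List[str]] = {
    "good morning": ["buenos días", "buen día"],
    "thank you": ["gracias", "muchas gracias"],
}


def suggest(source: str, typed_prefix: str, limit: int = 3) -> List[str]:
    """Return candidate translations of `source` that start with the typed prefix."""
    prefix = typed_prefix.strip().lower()
    options = CANDIDATES.get(source.strip().lower(), [])
    return [candidate for candidate in options if candidate.startswith(prefix)][:limit]


# The suggestion list narrows as more of the translation is typed.
print(suggest("Good morning", "bue"))
print(suggest("Good morning", "buen d"))
```

The key point is that suggestions depend on both the source sentence and what has been typed so far in the target language, which is what separates this from single-language keyboard prediction.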
To create the dataset for text-to-speech, they worked with Karya, which began as a research effort by Vivek Seshadri, a principal researcher at MSR. Karya is a digital work platform that allows users to record, tag, and annotate data used to create machine learning and AI models.
The translated sentences were given to two speakers to record: a male Mundari speaker identified by the team, and Dr. Munda as the female speaker. They recorded the sentences using the Karya app on Android smartphones.
The recordings and the associated text are securely uploaded to the cloud, where they are used to train text-to-speech models.
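A text-to-speech dataset of this kind is essentially a list of sentences paired with their recordings. The sketch below shows one plausible way such pairs could be organised; the field names, file layout, and placeholder sentences are assumptions made for illustration and do not describe Karya’s actual format.

```python
import json
from dataclasses import asdict, dataclass
from typing import List


@dataclass
class Utterance:
    sentence_id: str
    text: str        # the translated sentence that was read aloud
    speaker: str     # hypothetical label, e.g. "female" or "male"
    audio_path: str  # hypothetical location of the recording


def build_manifest(sentences: List[str], speakers: List[str]) -> List[Utterance]:
    """Pair every sentence with one recording slot per speaker."""
    entries = []
    for index, text in enumerate(sentences, start=1):
        sentence_id = f"s{index:04d}"
        for speaker in speakers:
            path = f"recordings/{speaker}/{sentence_id}.wav"
            entries.append(Utterance(sentence_id, text, speaker, path))
    return entries


def to_jsonl(entries: List[Utterance]) -> str:
    """Serialise the manifest as one JSON object per line."""
    return "\n".join(json.dumps(asdict(e), ensure_ascii=False) for e in entries)


manifest = build_manifest(["sentence one", "sentence two"], ["female", "male"])
print(to_jsonl(manifest))
```

Keeping the text and the audio path together in each record is what lets a training pipeline match every recording to the exact sentence that was spoken.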
To build these three technologies for Mundari, Bali explains, “the idea is that between Microsoft Research, Karya, and IIT Kharagpur, we will have data for machine translation, speech recognition, and text-to-speech synthesis.”