Volunteers in villages across Karnataka have spent the past few weeks reading lines in their language, Kannada, into an app as part of a project to build India’s first AI-powered chatbots for tuberculosis. Kannada, one of India’s 22 official languages and among the more than 121 languages spoken by 10,000 or more people in India, the world’s most populous nation, has about 40 million native speakers. Yet Natural Language Processing, the technology that lets computers recognize written and spoken words, is available for only a handful of these languages. As a result, hundreds of millions of Indians are cut off from useful information and a vast range of economic opportunities.
“For AI tools to work for everyone, they need to accommodate people who don’t speak English, French, or Spanish,” said Kalika Bali, principal researcher at Microsoft Research India. “But if we had to gather all the data in Indian languages needed for a large language model like GPT, we would be waiting another ten years. So what we can do is build layers on top of generative AI models like Llama or ChatGPT,” she said in remarks to the Thomson Reuters Foundation panel.
Bhashini, the translation system
Thousands of speakers of various Indian languages, including the villagers in Karnataka, now generate and curate speech data for young tech companies like Karya, which builds datasets for firms such as Microsoft and Google to bring AI into healthcare, education, and other services. The Indian government, which wants to deliver more high-quality services online, is also building language datasets through Bhashini, an AI-driven language translation system that generates open-source datasets in regional languages for the development of AI tools.
Bhashini is a crowdsourcing platform on which users can contribute words and sentences in many languages, translate text, label images, and verify audio or text transcriptions made by others. Tens of thousands of Indians have contributed to the platform so far.
“The government is pushing very strongly to create datasets to train large language models in Indian languages, and these are already in use in translation tools for education, tourism, and in the courts,” said Pushpak Bhattacharya, head of computation at the Indian Language Technology Lab in Mumbai. “But there are challenges: most Indian languages have an oral tradition, digitized records are scarce, and code-mixing is widespread. Gathering data in less widely spoken languages is difficult and requires extra effort,” he added.
Economic value of speech data
Of the more than 7,000 languages spoken around the world, fewer than 100 are covered by the major NLP systems, and English is by far the most advanced among them. ChatGPT, for example, which debuted last year and is trained primarily on English, has sparked a surge of interest in generative AI. Other tools show similar limits: Google Bard cannot converse in any language other than English, and Amazon Alexa supports just three non-European languages: Arabic, Hindi, and Japanese. Governments and startups are nevertheless trying to close this gap.
In India, crowdsourcing is an efficient way to gather voice and language data, according to Kalika Bali. “Language, cultural, and socioeconomic nuances are also better captured through crowdsourcing,” she said. “But it has to be done ethically, by educating the workers, paying them, and making a specific effort to collect smaller languages. There also needs to be awareness of gender, ethnic, and socioeconomic bias. If not, it doesn’t scale.”
Safiya Husain, a co-founder of Karya, pointed to the growing demand for language data amid the rapid rise of AI, including from researchers who want to preserve languages.
According to Husain, Karya workers own a share of the data they produce, which entitles them to royalties, and that data could be used to build AI products for their own communities in sectors such as farming and healthcare. “We see huge potential for adding economic value with speech data – an hour of Odia speech data used to cost about $3–$4; now it’s $40,” she said, referring to the language of the eastern state of Odisha.
Fewer than 11% of Indians are estimated to speak English, and many struggle to read and write. That is why many AI models focus on voice and speech recognition.
Several initiatives in India are actively working on speech translation and related digital services, including:
Project Vaani: This Google-funded initiative aims to gather voice data from approximately one million Indians and make it publicly available for use in speech-to-speech and automatic speech recognition systems.
The EkStep Foundation’s AI-powered translation tool: The Supreme Courts of Bangladesh and India both use the translation program developed by the Bangalore-based EkStep Foundation.
Jugalbandi: An AI-based chatbot from the government-backed AI4Bharat that answers questions about welfare programs in multiple Indian languages.
Gram Vaani: A social enterprise working with farmers that uses AI-powered chatbots to answer questions about welfare benefits.