As part of a project to create the nation’s first AI-based chatbot for tuberculosis, residents of the Indian state of Karnataka read aloud dozens of words in their native Kannada language into an app over a few weeks in 2023.
Kannada, which has more than 40 million native speakers, is one of India’s 22 official languages. It is also one of more than 121 languages spoken by 10,000 people or more in the world’s most populous country.
However, natural language processing, the branch of artificial intelligence that enables computers to understand written and spoken language, covers few of these languages.
As a result, hundreds of millions of Indians are shut out of numerous economic opportunities and helpful information.
According to Ms. Kalika Bali, lead researcher at Microsoft Research India, “AI tools need to cater to people who don’t speak English, French, or Spanish in order to work for everyone.”
“However, we would have to wait another ten years if we had to gather all the data in Indian languages needed for a large language model like GPT. Hence, we can build layers on top of generative AI models, like Llama or ChatGPT,” Ms. Bali told the Thomson Reuters Foundation.
The Karnataka villagers are among thousands of speakers of various Indian languages who provide voice data to the tech company Karya, which builds datasets that companies like Microsoft and Google use in their AI models for healthcare, education, and other services.
The Indian government, which seeks to deliver more services digitally, is also building language datasets through Bhashini, an AI-driven language translation system that generates open-source datasets in regional languages for the development of AI tools.
Through the platform’s crowdsourcing project, users can submit words in a variety of languages, check audio or text transcriptions made by others, translate texts, and label images. Tens of thousands of Indians have contributed to Bhashini.
According to Mr. Pushpak Bhattacharyya, head of the Computation for Indian Language Technology Lab in Mumbai, “the government is pushing very strongly to create datasets to train large language models in Indian languages, and these are already in use in translation tools for education, tourism, and in the courts.”
However, there are many obstacles: most Indian languages have a largely oral tradition, digitized records are scarce, and code mixing is widespread. Gathering data in less widely spoken languages is also difficult and requires a dedicated effort.
Economic value
Fewer than 100 of the more than 7,000 living languages in the world are represented in major databases, with English the most developed.
The 2022 release of ChatGPT sparked a surge of interest in generative AI. ChatGPT is trained primarily on English. Only three of the nine languages that Amazon’s Alexa can speak are non-European: Arabic, Hindi, and Japanese. Google’s Bard is limited to English.
Start-ups and governments are attempting to close this gap. In the United Arab Emirates, a new large language model named Jais can power generative AI applications in Arabic, while the grassroots organization Masakhane seeks to advance natural language processing research in African languages.
Crowdsourcing is a useful tool for gathering voice and language data in a nation like India, according to Ms. Bali, who was listed by Time magazine in September as one of the 100 most influential people in AI.
Crowdsourcing also captures linguistic, cultural, and socioeconomic nuances better, according to Ms. Bali.
However, she added, “it has to be done ethically, by paying the workers, educating them, and making a specific effort to collect smaller languages. There also needs to be awareness of gender, ethnic, and socioeconomic bias. If not, it doesn’t scale.”
According to Ms. Safiya Husain, co-founder of Karya, the rapid development of AI has created demand for languages that “we haven’t even heard of”, especially from academics who want to preserve them.
Karya works with non-profit organizations to find workers who live below the poverty line or earn less than US$325 (S$433) a year. It pays them approximately US$5 (S$6.70) per hour, significantly more than the Indian minimum wage.
Workers may receive royalties for a portion of the data they produce, and Ms. Husain said this data could be used to create AI solutions for the community in areas like farming and healthcare.
Village voice
Fewer than 11% of India’s 1.4 billion people speak English. Because many people find it difficult to read and write, many AI models concentrate on speech and speech recognition.
Google-funded Project Vaani, or voice, is gathering speech data from about a million Indians and making it publicly available for use in automatic speech recognition and speech-to-speech systems.
The Supreme Courts of Bangladesh and India both use AI-based translation tools from Bengaluru-based EkStep Foundation, while the government-backed AI4Bharat center has introduced Jugalbandi, an AI-based chatbot that can respond to queries on welfare programs in many Indian languages.
The bot, named after a duet in which two musicians riff off one another, uses reasoning models from AI4Bharat and Microsoft. It can be accessed on WhatsApp, which is used by roughly 500 million people in India.
Gram Vaani, or voice of the village, a social enterprise that works with farmers, also uses AI-powered chatbots to answer questions about welfare benefits.
According to Mr. Shubhmoy Kumar Garg, a product lead of Gram Vaani, “automatic speech recognition technologies are helping to mitigate language barriers and provide outreach at the grassroots level.”
“They will support the empowerment of communities that most need it.”
For Ms. Swarnalata Nayak in Raghurajpur village in Odisha, the rising demand for speech data in her native Odia has also meant a much-needed boost in income from her work for Karya.
“I work on it at night when I have free time. I can provide for my family by talking on the phone,” she said.