“The idea [to create an Indic language LLM] germinated from the comment of Sam Altman that it’s hopeless to build a ChatGPT kind of LLM in India with $10 million,” says C. Chaitanya, a Swecha member spearheading the LLM project. “That struck me as a challenge, which I then threw out to the Swecha community. India has the advantage of having 1.4 billion human minds. Can’t we make one AI with those?” Swecha’s chatbot would be one of the first Indian language models developed by a start-up in India to take on major players such as OpenAI.
Local options can increase computer accessibility
Early on in the movement, Swecha founders like Chandra realized that barriers to computing access could include language and financial constraints. The Telugu working class was unable to use the new technology because of these obstacles. Furthermore, Microsoft Windows, which wasn’t concerned with developing Telugu interfaces or other localized features, was the primary operating system used on computers at the time.
With more than 10,000 Telangana members, Swecha is now developing the Telugu chatbot “Vemana,” which is named after a well-known Telugu poet and philosopher. But before the organization could accomplish this, a dataset had to be created. The Vikram-Betaal and other Chandamama stories entered the picture at this point.
“The Tiny Stories initiative served as our source of inspiration. It is in English. Essentially, tiny stories are these three- or four-line narratives that are concise, straightforward, and cohesive. These stories are fed to the LLM, and when you train an LLM with this as a target, it generates new stories. That’s the main concept. We decided it would be better to begin with the Chandamama stories since there isn’t anything like that [in Telugu culture] and we have to start somewhere,” Ganesh added.
LLMs in Indian languages might provide opportunities for the community
According to Ganesh, Swecha saw a Telugu LLM as a way to enable even a Telugu-speaking farmer in a remote area to take up something as specialized as prompt engineering.
“Anyone can utilize ChatGPT today… Prompt engineering, which is just asking ChatGPT to provide you with the information you want, is a sizable new field in the job market. Prompt engineering requires merely the ability to speak English. Now, why is it that a farmer in a remote part of Telangana cannot become a prompt engineer? Our dream is for his entire farm to be [linked to the] Internet of Things in the future. He makes use of technology to increase his yield and other things. Assume for the moment that the farmer has to regulate when water enters the field, or needs to know where a particular animal is on the other side of the farm. Why is it not possible to use a voice-activated Telugu speech prompt for that purpose?” Ganesh remarked. “The only thing holding him back is his lack of knowledge of English.”
Compiling a Telugu LLM dataset
Swecha collected PDFs of the Chandamama stories from the publication’s archive website, but these were old image scans. Ganesh stated that the material still needed cleaning and proofreading even after it had been through optical character recognition (OCR). To finish this enormous undertaking, Swecha arranged a datathon on November 16, 2023, in which almost 7,500 participants from 30 different colleges, organizations, and institutions came together to clean the dataset.
More than 40,000 pages were proofread and edited. That means every Chandamama story was edited and completed in a single day, according to Ganesh. He also mentioned that the organization will now build the “Chandamama Kathalu” LLM using this data.
Chaitanya says the datathon demonstrated that high-quality datasets can be assembled at very low cost. “Everything was done by volunteers, and it only took two weeks from planning to execution,” he stated. The open-source data gathering software and the datasets were posted on the Hugging Face site. Others can now use this toolkit to produce comparable datasets, according to Chaitanya.
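To make the OCR cleanup step more concrete, here is a minimal Python sketch of the kind of automated pre-check that could run before volunteers proofread a page. It is not Swecha's actual tooling: the function names, the 0.8 threshold, and the idea of flagging lines by their share of Telugu characters are all assumptions made for illustration. It normalizes each OCR line and flags lines where too few characters fall inside the Telugu Unicode block (U+0C00 to U+0C7F), since those are likely OCR noise.

```python
import re
import unicodedata

# Telugu Unicode block boundaries (U+0C00 to U+0C7F).
TELUGU_START = 0x0C00
TELUGU_END = 0x0C7F


def normalize_line(line: str) -> str:
    """Apply Unicode NFC normalization and collapse runs of whitespace."""
    line = unicodedata.normalize("NFC", line)
    return re.sub(r"\s+", " ", line).strip()


def telugu_ratio(line: str) -> float:
    """Return the share of non-space characters that are in the Telugu block."""
    chars = [ch for ch in line if not ch.isspace()]
    if not chars:
        return 1.0  # nothing to judge, so do not flag an empty line
    telugu = sum(1 for ch in chars if TELUGU_START <= ord(ch) <= TELUGU_END)
    return telugu / len(chars)


def flag_for_review(lines, threshold=0.8):
    """Return (line_number, cleaned_text) pairs that look like OCR noise.

    The threshold is a hypothetical value; real data would need tuning,
    because punctuation and digits also lower the ratio.
    """
    flagged = []
    for number, raw in enumerate(lines, start=1):
        cleaned = normalize_line(raw)
        if cleaned and telugu_ratio(cleaned) < threshold:
            flagged.append((number, cleaned))
    return flagged


if __name__ == "__main__":
    sample = ["చందమామ  కథలు", "ch@nd4m4ma ###", ""]
    for number, text in flag_for_review(sample):
        print(f"line {number} needs review: {text}")
```

A check like this only narrows down where human attention is needed. It cannot tell whether a line of valid Telugu characters is the correct text, which is why the proofreading itself was done by people.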