OpenAI recently revealed plans for a new tool named Media Manager. The tool will let creators and content owners specify how their work is used for AI model training and machine learning research, and it is designed to honor those preferences. It is expected to be available by 2025.
The catch is that while the tool would greatly help OpenAI gather Indic data and develop its GPT models, many Indian AI businesses may suffer as a result. These include SML Hanooman, Ola Krutrim, and other startups that are still in their infancy and struggling to bring users onto their platforms.
ChatGPT currently has over 180 million users worldwide, and India is now the platform’s second-largest market. Approximately 14 million users, or 9.08% of the entire user base, are from India. Ola Krutrim and Hanooman are nowhere near that scale and are busy playing the so-called Indian “culture” card.
This is also why OpenAI recently appointed Pragya Misra, its first employee in India, to lead government relations. Her role is to advocate with the Indian government for a safe environment that will eventually allow OpenAI to operate in the country without obstacles.
“Tumse Nahi Ho Payega.” Really?
In a recent interview, Bhavish Aggarwal, the chief executive officer of Ola Krutrim, told OpenAI, “Tumse nahi ho payega” (“You won’t be able to do it”). He boldly declared that his goal is to show OpenAI that India is capable of building its own foundational language models from the ground up.
Aggarwal acknowledged that Krutrim needed to catch up with ChatGPT, but he also said, “How can we move ahead unless the start is made?”
In a recent statement, he went so far as to coin the term “Pronoun Illness” to express his desire for Krutrim to be exclusively Indian and free of Western influence. The developer community disagrees with his sentiment and is questioning Ola’s diversity and inclusion policies.
The irony is that the entire concept and premise for launching Krutrim seem to have been lifted directly from OpenAI. In fact, in response to certain user queries, Krutrim even said that it was built using “OpenAI models,” a claim that was later quietly corrected and has not been repeated.
Many people think that Krutrim was trained using OpenAI’s GPT-4 output.
Interestingly, Ola Krutrim currently uses Databricks services to streamline data for its model, and it probably uses DBRX for model building as well. According to Ravi Jain, vice president of Krutrim, “we have been working closely with the Databricks team to pre-train and fine-tune our foundational LLM.”
Only Indic Data Is Required
The difficulties in developing datasets for low-resource Indian languages were brought to light by Vivek Raghavan, co-founder of Sarvam AI: “The amount of high-quality data originally available in Indian languages is quite small.”
Raghavan added that even in Common Crawl, the most popular web data repository, the share of Hindi text is small, and the share for other Indian languages is smaller still.
The founders of Sarvam AI, Pratyush Kumar and Vivek Raghavan, previously worked with AI4Bharat, another domestic AI initiative that is creating Indic language datasets such as IndicVoices.
Similarly, Tech Mahindra sent a team to North India to gather data for its own Hindi LLM, “Project Indus,” a model with 539 million parameters trained on 10 billion Hindi and dialect tokens.
“We visited areas of Bihar, Rajasthan, and Madhya Pradesh. The team’s goal was to gather data on Hindi and its dialects by speaking with instructors and using the Bhasha-dan portal on ProjectIndus.in,” said Nikhil Malhotra, the global head of Tech Mahindra’s Makers Lab and the driving force behind Project Indus.
Interestingly, much like OpenAI’s Media Manager, Bhashini has also introduced Bhasha Daan to build a sizable, public archive of language data in numerous Indian languages.
Not ego-centric, but customer-centric
Right now, the Indic datasets that most Indian AI businesses hold or use are their only competitive advantage. Once OpenAI releases Media Manager, the company’s influence in the country could rise significantly, impeding the growth of the many businesses building alternatives to ChatGPT.
Most Indian AI startups, to be honest, are two years behind OpenAI and other Western AI startups. They have only just started, so it is time for them to take a sober look in the mirror and concentrate on building creative, cooperative solutions that serve Indian customers and businesses, rather than pursuing a pointless competitive strategy.
Nandan Nilekani, often described as India’s CTO, recently expressed similar opinions. According to him, India should concentrate on developing AI use cases that benefit all citizens rather than racing to build LLMs. He declared, “Those who meet customers where they are will win in AI in India.”