Indian AI model’s local language viability faces content availability barrier

A key goal of Indian startups and the IndiaAI Mission has been to create a foundational large language model that is tuned to Indian languages. That has so far been a tall order, as the amount of Indian language content online has been a fraction of that available in well-represented languages. Online content is a key source of training data in English, the language that most foundational models, like those from OpenAI and Google, primarily work with.

“The English data was entirely natural,” Vivekanand Pani, the co-founder of Reverie Language Technologies, said, referring to the online data that drives most existing foundational models. Mr. Pani has engaged policymakers and tech companies for over a decade on driving more local language internet use in India, and has persistent concerns about the penetration of local language internet use.

In monolingual internet societies like the US, China, Japan and South Korea, “people were able to engage freely and without any friction,” as local developers were building for their own societies first. “We haven’t solved that problem. And we are still not willing to solve it,” he rued. 

Digitising content like news and books to extract local language content is also not a surefire solution, Mr. Pani said, as the volume of public user posts on the internet dwarfs that of published content such as news and books. He added that Indian languages like Odia use different registers for formal speech, as in newscasts, and for informal speech in everyday life. The latter is under-represented in the data that can be found online.

Translation quality on services like Google Translate has improved enormously for Indian languages in spite of this constraint. But Mr. Pani said that this was because translation was a “transformative” technology, where the challenge does not extend to creating new text and solving problems natively in a given language. 

The creation of Indic language datasets for a homegrown AI model that is substantially useful would therefore depend on better availability of Indian language data, which in turn depends on more Indian language content being posted online. While there is a growing amount of such text on social media, the critical mass of content needed to train a foundational AI model has not yet been reached.

According to a report by the Internet and Mobile Association of India, 41% of Indians do not use the internet regularly, with the share of non-internet users standing at 51% in rural India. This group likely overlaps with non-English speakers, who may thus be well positioned to contribute content that AI models can be trained on.

Any serious advancement on developing a foundational model that can meaningfully engage Indian languages hinges on ongoing and upcoming efforts to compile such data. Karya, a Bengaluru-based firm, has garnered international attention by compensating Indian language speakers to contribute synthetic speech content that can be used in datasets. The IndiaAI Mission is also planning a repository of Indian language datasets, IT Minister Ashwini Vaishnaw said earlier this month, with details of the IndiaAI Datasets Platform to be announced later.

Published - February 14, 2025 12:18 pm IST
