Workshop Presentation at NASSCOM NLTF 2024 - Building Indian Language Foundation Models


Date
Feb 19, 2024 12:00 PM — 1:30 PM
Location
Mumbai, India

Workshop: Building Indian Language Foundation Models

Presented at NASSCOM’s National Leadership and Technology Forum (NLTF 2024), India’s premier platform for technology and business leadership.

Presentation Overview

Shared technical insights and approaches from building large-scale foundation models designed specifically for Indian languages, and addressed the challenges posed by India's linguistic diversity.

Key Topics Covered:

1. Data Collection & Processing:

  • Large-scale multilingual data curation strategies
  • Quality control for diverse Indian language data
  • Handling code-mixing and transliteration challenges (see the script-detection sketch after this list)
  • Building evaluation datasets for low-resource languages
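To make the code-mixing point concrete, here is a minimal sketch (not material from the talk) of one way to flag code-mixed Hindi-English text using Unicode script ranges; the function name, thresholds, and example sentence are illustrative assumptions.

```python
import re

# Unicode block for Devanagari plus basic Latin letters (illustrative subset).
DEVANAGARI = re.compile(r"[\u0900-\u097F]")
LATIN = re.compile(r"[A-Za-z]")

def script_mix_ratio(text: str) -> float:
    """Fraction of alphabetic characters written in Latin script.

    Values near 0 or 1 suggest monolingual text; intermediate values
    suggest code-mixing (e.g. Hinglish) that may need transliteration
    or separate handling in the data pipeline.
    """
    deva = len(DEVANAGARI.findall(text))
    latin = len(LATIN.findall(text))
    total = deva + latin
    return latin / total if total else 0.0

# Example: a Hinglish sentence mixing Devanagari and Latin script.
print(script_mix_ratio("मुझे ये movie bahut पसंद आई"))  # ≈0.45 → likely code-mixed
```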

2. Model Architecture & Training:

  • Tokenizer design for morphologically rich Indian languages (see the SentencePiece sketch after this list)
  • Training architecture for multilingual models
  • Distributed training on large accelerator clusters
  • Optimization techniques for efficient training
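As a rough illustration of the tokenizer point, the sketch below trains a SentencePiece unigram tokenizer on a multilingual corpus. The corpus path, vocabulary size, and coverage settings are placeholders, not the configuration used for the models discussed in the talk.

```python
import sentencepiece as spm

# Train a unigram tokenizer on a mixed Indic + English corpus.
# The file path and hyperparameters are illustrative placeholders;
# character_coverage close to 1.0 matters for scripts such as
# Devanagari, Tamil, or Bengali so rare characters are not dropped
# or split into bytes too aggressively.
spm.SentencePieceTrainer.train(
    input="indic_corpus.txt",        # one sentence per line (placeholder path)
    model_prefix="indic_sp",         # writes indic_sp.model / indic_sp.vocab
    vocab_size=64000,
    model_type="unigram",
    character_coverage=0.9999,
    byte_fallback=True,              # unseen characters fall back to bytes
)

sp = spm.SentencePieceProcessor(model_file="indic_sp.model")
print(sp.encode("भारत में कई भाषाएँ बोली जाती हैं", out_type=str))
```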

3. Model Tuning & Deployment:

  • Instruction tuning approaches (see the data-formatting sketch after this list)
  • Preference training for alignment
  • Deployment considerations for production systems
  • Performance evaluation across language families
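As an illustration of the instruction-tuning point, the sketch below formats instruction-response pairs into plain training strings. The template and field names are hypothetical; real pipelines would typically use the base model's own chat template.

```python
# Turn (instruction, response) pairs into training strings for
# supervised instruction tuning. The template is illustrative only.
TEMPLATE = (
    "### Instruction:\n{instruction}\n\n"
    "### Response:\n{response}"
)

def format_example(example: dict) -> str:
    return TEMPLATE.format(
        instruction=example["instruction"].strip(),
        response=example["response"].strip(),
    )

examples = [
    {
        "instruction": "इस वाक्य का अंग्रेज़ी में अनुवाद करें: मौसम आज अच्छा है।",
        "response": "The weather is nice today.",
    },
]

for ex in examples:
    print(format_example(ex))
```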

4. Real-world Impact:

  • Applications in education, government services, and content creation
  • Bridging the digital divide through vernacular AI
  • Democratizing access to AI technology across India

Context

This work was conducted while leading a team of five researchers building Indic large language models from scratch, combining technical innovation with practical deployment considerations for India's multilingual landscape.

The presentation contributed to NASSCOM’s vision of positioning India as a global AI powerhouse.

Ayush Maheshwari
Sr. Solutions Architect at NVIDIA
PhD in NLP/ML from CSE, IITB

My research interests include machine learning, NLP and machine translation.