Ayush Maheshwari

Research Scientist at Vizzhy Inc.
PhD in NLP/ML from CSE, IITB

Biography

Update: Successfully defended my PhD thesis (2/July/24) 😇

I am Ayush Maheshwari (आयुष माहेश्वरी). I completed my PhD at CSE, IITB (India) with Prof. Ganesh Ramakrishnan. I was funded by the Ekal fellowship from the Ekal foundation during my PhD.

My research interests lie in Natural Language Processing and graphs from a machine learning perspective. I have worked on constrained neural machine translation and on semi-supervised and unsupervised machine learning problems with data programming.

I am a key member of the neural machine translation project UDAAN, which helps publishers quickly translate technical content into Indian languages. The project is open-source and is used by several Indian government technical education agencies and official languages departments.

In my spare time, I enjoy playing and reading about Indian culture, Ramáyaṇa and Mahábhárat.

Download my resumé (Last updated: July 2024)

Interests
  • Large Language Models
  • Natural Language Processing
  • Human-in-the-loop AI
  • Neural Machine Translation
  • Machine Learning
  • Information Retrieval
Education
  • PhD in Computer Science, Jan 2019 - Aug 2023 (Defended July 2024)

    Indian Institute of Technology Bombay

Updates

  • [Sep 24] Our paper on Dictionary Constrained Disambiguation for Improved NMT has been accepted at EMNLP 2024 [Paper] ❤️ 😌

  • [July 24] I have successfully defended my PhD thesis on Knowledge Integration in Language Processing Models using Constraint Ingestion and Generation 😇😊🎆 {Arxiv soon!}

  • [July 24] Our paper on ‘Enhancing Low-Resource NMT with a Multilingual Encoder and Knowledge Distillation’ will be presented at the LoResMT workshop at ACL 2024 🎉.

  • [Feb 24] Our paper on the development of an English-Sanskrit parallel corpus has been accepted at LREC-COLING 2024. [Pre-print]

  • [Jan 24] Our paper on ‘Filtering of automatically induced rules for weak supervision’ has been accepted at EACL 2024 ❤️. [Paper]

  • 📝 Serving in the PC for ARR Industry Track 2023 - Present.

  • 📝 Serving in the PC for ARR, 2022 - Present.

Experience

Research Scientist
Sep 2023 – Sep 2024 Bengaluru
  1. Led a team of 5 people to build Indic large language models from scratch.
  2. Developed data collection and processing pipelines for training and evaluation.
  3. Trained the tokenizer and designed the model training architecture.
  4. Trained the model on a large accelerator cluster.
  5. Performed instruction tuning and preference training of the pre-trained models.

Adobe Research
Research Intern
May 2021 – Aug 2021 Bengaluru

Worked on prototyping a new service for Adobe PDF in the legal domain. Responsibilities included:

  • Modeling of the problem
  • Designing, developing and prototyping using ML
  • Deployment and Demonstration

IIT Bombay
Project Engineer
Jan 2016 – Dec 2018 Mumbai
Developed software solutions for security agencies.

Tata Consultancy Services
System Engineer
Oct 2011 – Jul 2013 Mumbai

Projects

UDAAN - An NMT pipeline + Post-editing tool to translate documents (Best Paper Award at CODS-COMAD 2023)
UDAAN has an end-to-end Machine Translation (MT) and post-editing pipeline. Using our tool, users can upload a document, obtain raw MT output, and edit the raw translations. We have digitized >100 dictionaries from CSTT. You can freely download these dictionaries from the project website. Our pipeline is being used by >100 translators across 10 languages to translate >50 books.
SPEAR - Programmatically label and quickly build training data
SPEAR is a Python library that reduces data labeling effort using data programming. It implements several recent approaches such as Snorkel, ImplyLoss, Learning to reweight, etc. In addition to data labeling, it integrates semi-supervised approaches for training and inference.
Temples of India
Temples of India is a not-for-profit knowledge platform to document and store, as far as possible, all details of temples across the Indian subcontinent. We aim to present each detail related to a temple, such as its location, images, videos, opening and closing times, etc.

Recent Publications

Complete list at Google Scholar.
A Benchmark and Dataset for Post-OCR text correction in Sanskrit
Learning to Robustly Aggregate Labeling Functions for Semi-supervised Data Programming
Rule Augmented Unsupervised Constituency Parsing
Unsupervised Learning of Explainable Parse Trees for Improved Generalisation