Paper Presentations at EMNLP 2022 - Post-OCR Correction and Data Programming


Date
Dec 6, 2022 1:00 PM — Dec 11, 2022 8:00 PM
Location
Abu Dhabi, UAE

Presented Two Papers at EMNLP 2022

I attended the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP) in Abu Dhabi, where I presented two research papers: one published in Findings of EMNLP 2022 and one in the System Demonstrations track.

Paper 1: Post-OCR Text Correction in Sanskrit

Title: “A Benchmark and Dataset for Post-OCR Text Correction in Sanskrit”

Key Contributions:

  • Released a multi-domain benchmark with 218K sentences (1.5M words) from 30 different books
  • Covered diverse domains: astronomy, medicine, mathematics (texts up to 18 centuries old)
  • Dataset spans Sanskrit’s linguistic and stylistic diversity across 3 millennia
  • Best model (ByT5+SLP1), which pairs byte-level tokenization with SLP1 phonetic encoding, improved on raw OCR output by 23 percentage points in word and character error rates
  • Open-source dataset supporting digitization of the roughly 30 million extant Sanskrit manuscripts
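
Post-OCR correction is commonly scored with character error rate (CER): the edit distance between a system's output and the reference text, divided by the reference length. The sketch below shows how the gain of a corrected output over raw OCR can be measured. It is only an illustration, not the paper's evaluation code; the function names and the romanized toy strings are made up for this example.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or match
        prev = curr
    return prev[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character error rate of a hypothesis against a non-empty reference."""
    if not reference:
        raise ValueError("reference must be non-empty")
    return edit_distance(hypothesis, reference) / len(reference)


# Toy strings (hypothetical, not taken from the dataset).
reference = "ramayana"
ocr_output = "rarnayana"   # a typical OCR confusion: "m" read as "rn"
corrected = "ramayama"     # a correction model fixes one error, adds another

ocr_cer = cer(ocr_output, reference)
corrected_cer = cer(corrected, reference)
gain_points = (ocr_cer - corrected_cer) * 100  # improvement in percentage points
```

Word error rate follows the same recipe with the edit distance computed over word sequences instead of characters.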

Impact: Addressing the digital resource gap for Sanskrit, a classical language with massive manuscript collections.

Paper 2: SPEAR Data Programming Library

Title: “SPEAR: Semi-supervised Data Programming in Python” (System Demonstration)

Key Features:

  • Open-source Python library for programmatic data labeling
  • Reduces manual annotation effort through weak supervision
  • Implements cutting-edge approaches: Snorkel, ImplyLoss, Learning to Reweight
  • Integrates semi-supervised learning for efficient training
  • 100+ GitHub stars and wide community adoption
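
To make "programmatic data labeling" concrete: instead of annotating examples by hand, the user writes small heuristic labeling functions that either vote for a class or abstain, and the votes are then combined into a training label. The sketch below uses plain Python and a simple majority vote as the aggregator. It does not use SPEAR's API, and the approaches listed above learn how to weight the labeling functions rather than counting votes; every name here is hypothetical.

```python
from collections import Counter
from typing import Callable, List, Optional

ABSTAIN = None      # a labeling function may decline to vote
SPAM, HAM = 1, 0    # toy label set, assumed for this example

LabelingFunction = Callable[[str], Optional[int]]


def lf_contains_free(text: str) -> Optional[int]:
    return SPAM if "free" in text.lower() else ABSTAIN


def lf_has_url(text: str) -> Optional[int]:
    return SPAM if "http" in text.lower() else ABSTAIN


def lf_short_greeting(text: str) -> Optional[int]:
    # Crude heuristic: short messages that open with a greeting are HAM.
    is_greeting = text.lower().startswith(("hi", "hello"))
    return HAM if is_greeting and len(text.split()) < 6 else ABSTAIN


def majority_vote(text: str, lfs: List[LabelingFunction]) -> Optional[int]:
    """Combine labeling-function votes; abstain on no votes or on a tie."""
    votes = [v for v in (lf(text) for lf in lfs) if v is not ABSTAIN]
    if not votes:
        return ABSTAIN
    counts = Counter(votes).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return ABSTAIN
    return counts[0][0]


LFS = [lf_contains_free, lf_has_url, lf_short_greeting]
weak_labels = [majority_vote(t, LFS) for t in (
    "Win a FREE prize at http://example.com",
    "Hi, are we meeting today?",
    "See you at noon",
)]
```

Examples on which every function abstains stay unlabeled; the semi-supervised setting pairs such weakly labeled data with a small hand-labeled set.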

Impact: Enabling practitioners to build training datasets efficiently without extensive manual labeling.

Both papers address critical challenges in making NLP more accessible and efficient for low-resource scenarios.

Ayush Maheshwari
Sr. Solutions Architect at NVIDIA
PhD in NLP/ML from CSE, IITB

My research interests include machine learning, NLP and machine translation.