Presented Two Papers at EMNLP 2022
I attended the 2022 Conference on Empirical Methods in Natural Language Processing (EMNLP) in Abu Dhabi, where I presented two research papers: a Findings of EMNLP 2022 paper and a system demonstration.
Paper 1: Post-OCR Text Correction in Sanskrit
Title: “A Benchmark and Dataset for Post-OCR Text Correction in Sanskrit”
Key Contributions:
- Released multi-domain benchmark with 218K sentences (1.5M words) from 30 different books
- Covered diverse domains: astronomy, medicine, mathematics (texts up to 18 centuries old)
- Dataset spans Sanskrit’s linguistic and stylistic diversity across 3 millennia
- Best model (ByT5+SLP1, byte-level tokenization combined with the SLP1 phonetic encoding) achieved a 23% improvement over the raw OCR output
- Open-source dataset to support digitization of the roughly 30 million extant Sanskrit manuscripts
Impact: Addressing the digital resource gap for Sanskrit, a classical language with massive manuscript collections.
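To make the ByT5+SLP1 pairing concrete, here is a minimal sketch, not code from the paper. It compares how many UTF-8 bytes one word takes in Devanagari and in SLP1, an ASCII transliteration scheme. The example word and the helper name `utf8_byte_length` are my own illustrative choices. The sketch only shows that a byte-level model sees a shorter input sequence for SLP1 text, and does not claim this is why the model performs as reported.

```python
# Illustrative only: compare UTF-8 byte lengths of one Sanskrit word
# written in Devanagari and in SLP1 (an ASCII transliteration scheme).

# "dharma" in Devanagari: DHA + RA + VIRAMA + MA (four code points).
word_devanagari = "\u0927\u0930\u094d\u092e"
# The same word in SLP1, where "D" denotes the aspirated "dh".
word_slp1 = "Darma"


def utf8_byte_length(text: str) -> int:
    """Return the number of bytes a byte-level tokenizer would see."""
    return len(text.encode("utf-8"))


devanagari_bytes = utf8_byte_length(word_devanagari)
slp1_bytes = utf8_byte_length(word_slp1)

print(f"Devanagari: {devanagari_bytes} bytes, SLP1: {slp1_bytes} bytes")
```

Each Devanagari code point occupies three bytes in UTF-8, while every SLP1 character is a single ASCII byte.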
Paper 2: SPEAR Data Programming Library
Title: “SPEAR: Semi-supervised Data Programming in Python” (System Demonstration)
Key Features:
- Open-source Python library for programmatic data labeling
- Reduces manual annotation effort through weak supervision
- Implements cutting-edge approaches: Snorkel, ImplyLoss, Learning to Reweight
- Integrates semi-supervised learning for efficient training
- 100+ GitHub stars and wide community adoption
Impact: Enabling practitioners to build training datasets efficiently without extensive manual labeling.
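As a rough illustration of programmatic labeling, the sketch below is plain Python and deliberately does not use SPEAR's actual API. It defines a few hypothetical labeling functions for a toy spam task and combines their votes by simple majority. The function names, labels, and rules are assumptions made for illustration. Approaches such as Snorkel and ImplyLoss learn how to weight and denoise labeling functions instead of counting votes.

```python
from collections import Counter
from typing import Callable, List, Optional

# Hypothetical label values for a toy spam-detection task.
SPAM, HAM = 1, 0
ABSTAIN = None  # a labeling function may decline to vote

LabelingFunction = Callable[[str], Optional[int]]


def lf_contains_link(text: str) -> Optional[int]:
    """Heuristic: messages containing a URL are likely spam."""
    return SPAM if "http" in text.lower() else ABSTAIN


def lf_mentions_prize(text: str) -> Optional[int]:
    """Heuristic: messages mentioning a prize are likely spam."""
    return SPAM if "prize" in text.lower() else ABSTAIN


def lf_short_message(text: str) -> Optional[int]:
    """Heuristic: very short messages are likely legitimate."""
    return HAM if len(text.split()) <= 3 else ABSTAIN


def majority_vote(text: str, lfs: List[LabelingFunction]) -> Optional[int]:
    """Combine labeling-function votes; return None on no votes or a tie."""
    votes = [v for v in (lf(text) for lf in lfs) if v is not ABSTAIN]
    if not votes:
        return None
    ranked = Counter(votes).most_common()
    if len(ranked) > 1 and ranked[0][1] == ranked[1][1]:
        return None  # tie between labels: leave the example unlabeled
    return ranked[0][0]


LFS = [lf_contains_link, lf_mentions_prize, lf_short_message]

for message in ["Claim your prize at http://example.com", "hi there"]:
    print(message, "->", majority_vote(message, LFS))
```

Examples left unlabeled (no votes or a tie) are the natural place for a small hand-labeled set to help, which is where the semi-supervised component fits in.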
Both papers address critical challenges in making NLP more accessible and efficient for low-resource scenarios.