Speech · High Impact
Jan 2025 — IndicVoices-R: Unlocking a Massive Multilingual Multi-Speaker Speech Corpus for Scaling Indian TTS
IIT Madras / AI4Bharat — Speech Lab
Ashwin Sankar, Srija Anand, Praveen Srinivasa Varadhan et al.
Abstract
Release of the largest open Indian multilingual speech corpus for TTS: 1,704 hours of speech from 10,496 speakers across 22 Indian languages. Designed specifically for training text-to-speech (TTS) systems that serve India's linguistic diversity. Built on the IndicVoices dataset with a refined annotation pipeline.
Methodology
Large-scale data collection from demographically diverse speakers across India; a combined automated and manual quality-annotation pipeline; speaker-level metadata covering age, gender, region, and dialect. Evaluated by training multi-speaker TTS models (VITS2-based) and measuring MOS scores.
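The speaker-level metadata described above is what makes targeted training subsets possible. A minimal sketch of how such a manifest might be filtered — the field names (`speaker_id`, `language`, `duration_s`) and the helper `select_subset` are illustrative assumptions, not the actual IndicVoices-R schema:

```python
from collections import defaultdict

# Hypothetical utterance manifest; real corpora would load this from disk.
manifest = [
    {"utt_id": "hi_0001", "speaker_id": "spk_01", "language": "Hindi",
     "gender": "F", "duration_s": 4.2},
    {"utt_id": "hi_0002", "speaker_id": "spk_01", "language": "Hindi",
     "gender": "F", "duration_s": 3.1},
    {"utt_id": "ta_0001", "speaker_id": "spk_02", "language": "Tamil",
     "gender": "M", "duration_s": 5.0},
]

def select_subset(manifest, languages, min_hours_per_speaker=0.0):
    """Keep utterances in the target languages, then drop speakers
    whose total audio falls below a minimum-hours threshold."""
    in_lang = [u for u in manifest if u["language"] in languages]
    per_speaker = defaultdict(float)
    for u in in_lang:
        per_speaker[u["speaker_id"]] += u["duration_s"] / 3600.0
    return [u for u in in_lang
            if per_speaker[u["speaker_id"]] >= min_hours_per_speaker]

subset = select_subset(manifest, {"Hindi"})
print([u["utt_id"] for u in subset])  # → ['hi_0001', 'hi_0002']
```

The per-speaker hours threshold matters for multi-speaker TTS: speakers with too little audio tend to degrade speaker-embedding quality.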
Key Results
1,704 hours of high-quality speech data across all 22 scheduled languages. TTS models trained on this data achieve MOS scores of 3.8–4.2 across languages, approaching human parity for Hindi and Tamil. The largest open Indian TTS dataset, a 4× increase over prior corpora.
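The MOS figures above come from listener ratings on a 1–5 scale. A minimal sketch of how a MOS with a normal-approximation 95% confidence interval is typically computed (the ratings here are made up for illustration):

```python
import statistics

def mos(ratings):
    """Mean Opinion Score: average of 1-5 listener ratings, with a
    normal-approximation 95% confidence interval half-width."""
    n = len(ratings)
    mean = statistics.mean(ratings)
    # statistics.stdev is the sample standard deviation (needs n >= 2)
    half_width = 1.96 * statistics.stdev(ratings) / n ** 0.5
    return mean, half_width

ratings = [4, 4, 5, 3, 4, 5, 4, 4]  # synthetic example ratings
m, ci = mos(ratings)
print(f"MOS = {m:.2f} ± {ci:.2f}")
```

Published MOS comparisons usually report this interval alongside the mean, since small listener pools make point estimates noisy.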
Significance for India
Enables development of natural-sounding AI voice assistants in every Indian language. Critical for accessibility (voice-based interfaces for low-literacy users), government services via IVR, and India's smart city infrastructure. Democratizes speech AI beyond English.