This repository provides a modular, open-source pipeline for converting raw audio + text pairs into high-quality, clean, and aligned speech datasets. The pipeline is designed to work even when audio and text are not perfectly aligned — making it suitable for low-resource or noisy real-world settings.
-
🔊 Audio Preprocessing
- Format conversion (e.g., MP3 to WAV)
- Background music removal using Spleeter
- Stereo-to-mono conversion
- Silence trimming (after alignment)
-
✏️ Text Cleaning and Normalization
- Unicode normalization and punctuation cleanup
- Removal of references, URLs, and metadata
- Spoken-form conversion for numbers (e.g., 2024 → "two thousand twenty-four")
-
⏱ Start-End Alignment
- Trims audio boundaries to match transcript using ASR-assisted matching
-
📌 Forced Alignment
- Segments audio into 2–12s chunks and aligns them with corresponding text spans
- Uses character error rate (CER) thresholds to ensure alignment quality
- Based on Mana Forced Aligner
Aligning long audio files with transcripts can be challenging — especially when the content isn’t an exact match.
This pipeline includes a built-in forced alignment module that:
- Segments audio using silence detection
- Uses multiple ASR outputs to match audio chunks to reference text
- Accepts matches based on CER thresholds — even with small mismatches
➡️ Learn more and use it independently: 👉 Mana Forced Aligner
Dataset Name | Language | Size | License | Links |
---|---|---|---|---|
ManaTTS | Persian | 102+ hrs | CC-0 | |
Quran-Persian | Persian | 20+ hrs | CC-0 |
Feel free to reach out if you'd like yours featured.
You can run the pipeline using the online Google Colab notebook
or offline using the provided notebook: Mata_Dataset-Generation.ipynb
- Actively tested on Persian
- Easily customizable for other low-resource languages with available ASR models
If you use this project in your work, please cite the corresponding paper:
@inproceedings{qharabagh-etal-2025-manatts,
title = "{M}ana{TTS} {P}ersian: a recipe for creating {TTS} datasets for lower resource languages",
author = "Qharabagh, Mahta Fetrat and Dehghanian, Zahra and Rabiee, Hamid R.",
booktitle = "Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
month = apr,
year = "2025",
address = "Albuquerque, New Mexico",
publisher = "Association for Computational Linguistics",
url = "https://aclanthology.org/2025.naacl-long.464/",
pages = "9177--9206",
}
Contributions are welcome! Please open an issue to discuss ideas or submit a pull request.