Mana Speech Dataset Generator

This repository provides a modular, open-source pipeline for converting raw audio + text pairs into high-quality, clean, and aligned speech datasets. The pipeline is designed to work even when audio and text are not perfectly aligned — making it suitable for low-resource or noisy real-world settings.

⚙️ What’s Included in the Pipeline?

🔊 Audio Preprocessing
- Format conversion (e.g., MP3 to WAV)
- Background music removal using Spleeter
- Stereo-to-mono conversion
- Silence trimming (after alignment)
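
As an illustration of one preprocessing step listed above, stereo-to-mono conversion can be sketched with only the Python standard library. This is a minimal sketch, assuming 16-bit PCM WAV input. The names `downmix_to_mono` and `wav_stereo_to_mono` are hypothetical and are not taken from the notebook, which may use a different tool for this step.

```python
import array
import sys
import wave


def downmix_to_mono(samples, channels=2):
    """Average interleaved integer samples across channels into one channel."""
    if channels < 1 or len(samples) % channels != 0:
        raise ValueError("sample count must be a multiple of the channel count")
    mono = []
    for i in range(0, len(samples), channels):
        frame = samples[i:i + channels]
        # Floor division keeps the average inside the original sample range.
        mono.append(sum(frame) // channels)
    return mono


def wav_stereo_to_mono(src_path, dst_path):
    """Write a mono copy of a 16-bit PCM WAV file (hypothetical helper)."""
    with wave.open(src_path, "rb") as src:
        if src.getsampwidth() != 2:
            raise ValueError("this sketch only handles 16-bit PCM")
        channels = src.getnchannels()
        rate = src.getframerate()
        samples = array.array("h", src.readframes(src.getnframes()))
    if sys.byteorder == "big":
        samples.byteswap()  # WAV data is little-endian
    mono = array.array("h", downmix_to_mono(samples, channels))
    if sys.byteorder == "big":
        mono.byteswap()
    with wave.open(dst_path, "wb") as dst:
        dst.setnchannels(1)
        dst.setsampwidth(2)
        dst.setframerate(rate)
        dst.writeframes(mono.tobytes())
```

Averaging the channels, rather than keeping only one of them, preserves speech that is panned to a single side.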
✏️ Text Cleaning and Normalization
- Unicode normalization and punctuation cleanup
- Removal of references, URLs, and metadata
- Spoken-form conversion for numbers (e.g., 2024 → "two thousand twenty-four")
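
The cleaning steps above can be sketched as follows. This is a minimal sketch under stated assumptions: the regular expressions, the function names `clean_text` and `number_to_words`, and the English number rules are illustrative only. The pipeline itself is tested on Persian, where the spoken-form rules differ.

```python
import re
import unicodedata

URL_RE = re.compile(r"https?://\S+|www\.\S+")
REFERENCE_RE = re.compile(r"\[\d+\]")  # bracketed citation markers like [12]

ONES = ["zero", "one", "two", "three", "four", "five", "six", "seven",
        "eight", "nine", "ten", "eleven", "twelve", "thirteen", "fourteen",
        "fifteen", "sixteen", "seventeen", "eighteen", "nineteen"]
TENS = ["", "", "twenty", "thirty", "forty", "fifty", "sixty", "seventy",
        "eighty", "ninety"]


def number_to_words(n):
    """Spell out an integer in the range 0..999,999 in English."""
    if not 0 <= n < 1_000_000:
        raise ValueError("only 0..999,999 is supported in this sketch")
    if n < 20:
        return ONES[n]
    if n < 100:
        tens, ones = divmod(n, 10)
        return TENS[tens] + ("-" + ONES[ones] if ones else "")
    if n < 1000:
        hundreds, rest = divmod(n, 100)
        tail = " " + number_to_words(rest) if rest else ""
        return ONES[hundreds] + " hundred" + tail
    thousands, rest = divmod(n, 1000)
    tail = " " + number_to_words(rest) if rest else ""
    return number_to_words(thousands) + " thousand" + tail


def _spell(match):
    """Replace a digit run with words; leave out-of-range numbers untouched."""
    value = int(match.group())
    return number_to_words(value) if value < 1_000_000 else match.group()


def clean_text(text):
    """Normalize Unicode, drop URLs and citation markers, spell out numbers."""
    text = unicodedata.normalize("NFKC", text)
    text = URL_RE.sub(" ", text)
    text = REFERENCE_RE.sub(" ", text)
    text = re.sub(r"\d+", _spell, text)
    text = re.sub(r"\s+([,.;:!?])", r"\1", text)  # no space before punctuation
    return re.sub(r"\s+", " ", text).strip()
```

URLs and citation markers are removed before number conversion so that digits inside them are never spelled out.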
⏱ Start-End Alignment
- Trims audio boundaries to match the transcript using ASR-assisted matching
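
One way to picture ASR-assisted boundary trimming is sketched below. It is not the notebook's implementation. It assumes an ASR system that returns word-level timestamps, and the function name `find_trim_points`, the `min_run` parameter, and the one-ASR-word-per-transcript-word extrapolation are all hypothetical choices made for illustration.

```python
from difflib import SequenceMatcher


def find_trim_points(asr_words, transcript_words, min_run=3):
    """Return (start_sec, end_sec) of the audio span the transcript covers.

    asr_words: list of (word, start_sec, end_sec) tuples from any ASR system.
    transcript_words: list of reference words.
    min_run: shortest run of consecutive matching words trusted as an anchor.
    Returns None when no trustworthy anchor is found.
    """
    hyp = [w.lower() for w, _, _ in asr_words]
    ref = [w.lower() for w in transcript_words]
    matcher = SequenceMatcher(a=hyp, b=ref, autojunk=False)
    anchors = [b for b in matcher.get_matching_blocks() if b.size >= min_run]
    if not anchors:
        return None
    first, last = anchors[0], anchors[-1]
    # Heuristic: if the anchor starts a few words into the transcript, step
    # back the same number of ASR words (assumes roughly one ASR word per
    # transcript word). The same idea is applied at the end.
    start_idx = max(0, first.a - first.b)
    remaining = len(ref) - (last.b + last.size)
    end_idx = min(len(asr_words) - 1, last.a + last.size - 1 + remaining)
    return asr_words[start_idx][1], asr_words[end_idx][2]
```

Requiring a run of several matching words makes it less likely that a common word in an unrelated intro or outro is mistaken for the start of the transcript.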
📌 Forced Alignment
- Segments audio into 2–12s chunks and aligns them with corresponding text spans
- Uses character error rate (CER) thresholds to ensure alignment quality
- Based on Mana Forced Aligner
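
The 2–12 s chunking constraint can be illustrated with a small greedy merge over silence-delimited speech intervals. This is a minimal sketch, assuming the intervals have already been found by some silence detector and are sorted by time. The function name `merge_into_chunks` and the policy of dropping out-of-range chunks are assumptions, not the behavior of Mana Forced Aligner.

```python
def merge_into_chunks(intervals, min_len=2.0, max_len=12.0):
    """Greedily merge consecutive (start, end) speech intervals into chunks.

    Returns only chunks whose duration lies in [min_len, max_len] seconds;
    chunks that cannot be brought into that range are dropped.
    """
    chunks = []
    current = None
    for start, end in intervals:
        if current is not None and end - current[0] <= max_len:
            current = (current[0], end)  # extend the open chunk
            continue
        if current is not None:
            chunks.append(current)  # close the chunk before starting a new one
        current = (start, end)
    if current is not None:
        chunks.append(current)
    return [(s, e) for s, e in chunks if min_len <= e - s <= max_len]
```

Because chunks only ever end at interval boundaries, every cut falls in a detected silence rather than in the middle of a word.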

🧩 Pipeline Overview

Figure: Audio and text processing pipeline

Figure: Detailed text preprocessing steps

🔗 Forced Alignment: Robust Matching of Audio and Text

Aligning long audio files with transcripts can be challenging — especially when the content isn’t an exact match.

This pipeline includes a built-in forced alignment module that:

- Segments audio using silence detection
- Uses multiple ASR outputs to match audio chunks to reference text
- Accepts matches based on CER thresholds, even with small mismatches
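
The CER-based acceptance step can be sketched as below. CER is the character-level edit distance between the reference span and an ASR hypothesis, divided by the reference length. This is a minimal sketch: the function names and the `threshold=0.2` default are assumptions chosen for illustration, not values taken from the aligner.

```python
def edit_distance(ref, hyp):
    """Levenshtein distance between two strings (insert, delete, substitute)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(ref, hyp):
    """Character error rate: edit distance divided by reference length."""
    if not ref:
        return 0.0 if not hyp else 1.0
    return edit_distance(ref, hyp) / len(ref)


def accept_chunk(reference_span, hypotheses, threshold=0.2):
    """Return (accepted, best_cer) using the closest of several ASR outputs."""
    if not hypotheses:
        return False, None
    best = min(cer(reference_span, h) for h in hypotheses)
    return best <= threshold, best
```

Taking the minimum over several hypotheses means a chunk is kept if any one ASR system transcribes it closely enough, which tolerates errors specific to a single model.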

➡️ Learn more and use it independently: 👉 Mana Forced Aligner

📦 Datasets Created with This Pipeline

| Dataset Name | Language | Size | License | Links |
|---|---|---|---|---|
| ManaTTS | Persian | 102+ hrs | CC-0 | |
| Quran-Persian | Persian | 20+ hrs | CC-0 | |

Feel free to reach out if you'd like your dataset featured here.

🚀 Getting Started

You can run the pipeline online in the Google Colab notebook, or offline with the notebook provided in this repository: Mana_Dataset_Generation.ipynb

Supported Languages

- Actively tested on Persian
- Easily customizable for other low-resource languages with available ASR models

📚 Citation

If you use this project in your work, please cite the corresponding paper:

@inproceedings{qharabagh-etal-2025-manatts,
    title = "{M}ana{TTS} {P}ersian: a recipe for creating {TTS} datasets for lower resource languages",
    author = "Qharabagh, Mahta Fetrat  and Dehghanian, Zahra  and Rabiee, Hamid R.",
    booktitle = "Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
    month = apr,
    year = "2025",
    address = "Albuquerque, New Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.naacl-long.464/",
    pages = "9177--9206",
}

🤝 Contributions

Contributions are welcome! Please open an issue to discuss ideas or submit a pull request.
