MahtaFetrat/Mana-Speech-Dataset-Generator

Mana Speech Dataset Generator

This repository provides a modular, open-source pipeline for converting raw audio and text pairs into clean, aligned speech datasets. The pipeline is designed to work even when the audio and text are not perfectly aligned, which makes it suitable for low-resource or noisy real-world settings.


⚙️ What’s Included in the Pipeline?

  • 🔊 Audio Preprocessing

    • Format conversion (e.g., MP3 to WAV)
    • Background music removal using Spleeter
    • Stereo-to-mono conversion
    • Silence trimming (after alignment)
  • ✏️ Text Cleaning and Normalization

    • Unicode normalization and punctuation cleanup
    • Removal of references, URLs, and metadata
    • Spoken-form conversion for numbers (e.g., 2024 → "two thousand twenty-four")
  • Start-End Alignment

    • Trims audio boundaries to match transcript using ASR-assisted matching
  • 📌 Forced Alignment

    • Segments audio into 2–12s chunks and aligns them with corresponding text spans
    • Uses character error rate (CER) thresholds to ensure alignment quality
    • Based on Mana Forced Aligner
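
The text cleaning steps listed above (Unicode normalization, punctuation cleanup, and removal of references and URLs) can be illustrated with a minimal standard-library sketch. This is not the repository's implementation: the function name `clean_text`, the regular expressions, and the assumption that references appear as bracketed numbers such as `[12]` are all hypothetical choices made for illustration. Spoken-form number conversion is language-specific and is left out.

```python
import re
import unicodedata

# All patterns below are illustrative assumptions, not the pipeline's actual rules.
URL_RE = re.compile(r"(?:https?://|www\.)\S+")
# Bracketed numeric citations such as [12] or [3, 4] (an assumed reference format).
REF_RE = re.compile(r"\[\s*\d+(?:\s*[,\-]\s*\d+)*\s*\]")
# Runs of the same punctuation mark, e.g. "!!" or "...".
REPEATED_PUNCT_RE = re.compile(r"([!?.,;:])\1+")
# Whitespace that precedes a punctuation mark.
SPACE_BEFORE_PUNCT_RE = re.compile(r"\s+([!?.,;:])")
WHITESPACE_RE = re.compile(r"\s+")


def clean_text(text: str) -> str:
    """Normalize Unicode, drop URLs and bracketed references, tidy punctuation."""
    # NFKC folds compatibility characters (full-width forms, ligatures,
    # presentation forms) into their canonical equivalents.
    text = unicodedata.normalize("NFKC", text)
    text = URL_RE.sub(" ", text)
    text = REF_RE.sub(" ", text)
    # Collapse repeated punctuation to a single mark.
    text = REPEATED_PUNCT_RE.sub(r"\1", text)
    # Remove stray spaces left in front of punctuation.
    text = SPACE_BEFORE_PUNCT_RE.sub(r"\1", text)
    # Collapse any remaining whitespace runs and trim the ends.
    return WHITESPACE_RE.sub(" ", text).strip()


if __name__ == "__main__":
    print(clean_text("See  https://example.com for details [12] !!"))
```

For a real language, the punctuation set and reference patterns would need to be adapted to the script and to the source of the transcripts.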

🧩 Pipeline Overview

[Figure: Audio and text processing pipeline]

[Figure: Detailed text preprocessing steps]


🔗 Forced Alignment: Robust Matching of Audio and Text

Aligning long audio files with transcripts can be challenging, especially when the transcript is not an exact match for what is spoken.

This pipeline includes a built-in forced alignment module that:

  • Segments audio using silence detection
  • Uses multiple ASR outputs to match audio chunks to reference text
  • Accepts a match when its CER falls below a threshold, so small mismatches are tolerated
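
The CER-based acceptance step can be sketched as follows. This is a minimal illustration, not the Mana Forced Aligner's code: the function names, the `max_cer=0.2` default, and the rule of keeping the best-scoring ASR hypothesis are assumptions. CER is computed here as the character-level Levenshtein distance divided by the length of the reference span.

```python
from typing import Optional, Sequence


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance using a rolling row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def cer(reference: str, hypothesis: str) -> float:
    """Character error rate of a hypothesis relative to the reference."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(reference, hypothesis) / len(reference)


def accept_span(reference_span: str,
                hypotheses: Sequence[str],
                max_cer: float = 0.2) -> Optional[float]:
    """Return the best CER across ASR hypotheses if it is within the
    threshold, otherwise None (the chunk is rejected).

    The threshold value is a hypothetical default for illustration.
    """
    if not hypotheses:
        return None
    best = min(cer(reference_span, h) for h in hypotheses)
    return best if best <= max_cer else None


if __name__ == "__main__":
    # Two hypothetical ASR outputs for the same audio chunk.
    print(accept_span("hello world", ["hello word", "yellow world"]))
```

Taking the minimum over several ASR outputs means a chunk is kept as long as at least one recognizer transcribes it closely enough, which is one plausible way to use multiple ASR outputs.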

➡️ Learn more and use it independently: 👉 Mana Forced Aligner


📦 Datasets Created with This Pipeline

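| Dataset Name  | Language | Size     | License | Links                |
|---------------|----------|----------|---------|----------------------|
| ManaTTS       | Persian  | 102+ hrs | CC-0    | Hugging Face, GitHub |
| Quran-Persian | Persian  | 20+ hrs  | CC-0    | Hugging Face         |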

If you have built a dataset with this pipeline and would like it featured here, feel free to reach out.


🚀 Getting Started

You can run the pipeline online in the Google Colab notebook, or offline with the provided notebook: Mata_Dataset-Generation.ipynb


Supported Languages

  • Actively tested on Persian
  • Easily customizable for other low-resource languages with available ASR models

📚 Citation

If you use this project in your work, please cite the corresponding paper:

@inproceedings{qharabagh-etal-2025-manatts,
    title = "{M}ana{TTS} {P}ersian: a recipe for creating {TTS} datasets for lower resource languages",
    author = "Qharabagh, Mahta Fetrat  and Dehghanian, Zahra  and Rabiee, Hamid R.",
    booktitle = "Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
    month = apr,
    year = "2025",
    address = "Albuquerque, New Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.naacl-long.464/",
    pages = "9177--9206",
}

🤝 Contributions

Contributions are welcome! Please open an issue to discuss ideas or submit a pull request.


🔗 Additional Links