A robust forced alignment tool for low-resource languages using multiple ASR models and CER-based matching. Built for noisy data and imperfect transcripts.

MahtaFetrat/Mana-Forced-Aligner

Mana Forced Aligner: Robust Forced Alignment for Low-Resource Languages

Want to create a speech dataset, but current forced alignment tools don’t work for your setup?

Mana Forced Aligner is a robust, language-agnostic forced alignment module designed specifically for low-resource languages and imperfect audio-text matches. Unlike traditional tools such as Aeneas, it does not require a perfect match between audio and text. It tolerates skipped words, repetitions, and other inconsistencies by combining multiple ASR models with character-level text similarity and predefined thresholds.

Why Use Mana Forced Aligner?

🧠 ASR-Agnostic and Scalable: Use one or more automatic speech recognition (ASR) models; the aligner is designed to work even when they are imperfect. Adding more ASR models makes the alignment more robust and accurate, thanks to majority voting and fallback mechanisms.

🧩 Mismatch-Tolerant: Handles skipped or added phrases, word reorderings, and slight transcription errors.

🧪 Flexible Scoring: Matches based on character error rate (CER), with configurable thresholds.

🔓 Open and Extendable: Built in Python, licensed under MIT, and ready to adapt to your language and ASR tools.

🌍 Proven in Practice: Successfully used to generate 102+ hours of aligned speech-text data for the Persian ManaTTS corpus.


How It Works

The alignment process consists of two key components:

1. Transcription Module

  • Runs multiple ASR models on each audio chunk
  • Discards unreliable transcripts (e.g., truncated outputs)
  • Sorts by reliability and returns top transcripts
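The three steps above can be sketched roughly as follows. This is a minimal illustration, not the notebook's actual code: the `AsrModel` type, the `transcribe_chunk` function, the length-based truncation check, and the consensus-based reliability score (computed with `difflib.SequenceMatcher` from the standard library) are all assumptions made for the example.

```python
from difflib import SequenceMatcher
from statistics import median
from typing import Callable, List, Sequence

# Hypothetical interface: any function that maps an audio file path to text.
AsrModel = Callable[[str], str]


def transcribe_chunk(audio_path: str, models: Sequence[AsrModel],
                     min_length_ratio: float = 0.5, top_k: int = 2) -> List[str]:
    """Run every ASR model, drop suspiciously short outputs, rank the rest."""
    transcripts = [model(audio_path).strip() for model in models]
    transcripts = [t for t in transcripts if t]
    if not transcripts:
        return []

    # Assumed truncation heuristic: much shorter than the median transcript.
    typical_len = median(len(t) for t in transcripts)
    kept = [t for t in transcripts if len(t) >= min_length_ratio * typical_len]

    # Assumed reliability score: mean similarity to the other kept transcripts,
    # so transcripts close to the consensus rank first.
    def agreement(index: int) -> float:
        others = [t for i, t in enumerate(kept) if i != index]
        if not others:
            return 1.0
        ratios = [SequenceMatcher(None, kept[index], o).ratio() for o in others]
        return sum(ratios) / len(ratios)

    order = sorted(range(len(kept)), key=agreement, reverse=True)
    return [kept[i] for i in order[:top_k]]


# Usage with stand-in "models" (real ones would wrap actual ASR systems).
fake_models = [
    lambda path: "hello world",
    lambda path: "hello word",
    lambda path: "hel",          # simulates a truncated output
    lambda path: "hallo world",
]
top_transcripts = transcribe_chunk("chunk_0001.wav", fake_models)
```

Wrapping each ASR system as a plain callable keeps the sketch independent of any particular toolkit, which matches the ASR-agnostic design described earlier.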


Figure: Architecture of the transcription module, showing multiple ASRs and reliability filtering.

2. Forced Alignment

  • Splits audio into 2–12 second segments based on silence

  • Uses the transcription module to hypothesize possible texts

  • Finds best-matching substrings in the reference text

  • Accepts matches based on CER:

    • High quality: CER ≤ 0.05
    • Middle quality: 0.05 < CER ≤ 0.2
  • Chunks with no acceptable match are discarded
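As an illustration of the 2–12 second constraint in the first step, the sketch below joins silence-delimited pieces into segments of acceptable length. It assumes silence positions have already been detected by some other tool; the `build_segments` function, its greedy strategy, and the choice to skip over-long stretches are hypothetical and may differ from what the notebook does.

```python
from typing import List, Tuple


def build_segments(cut_points: List[float], duration: float,
                   min_len: float = 2.0,
                   max_len: float = 12.0) -> List[Tuple[float, float]]:
    """Greedily join silence-delimited pieces into min_len..max_len second segments.

    cut_points are silence positions in seconds (assumed to be given).
    """
    boundaries = sorted(p for p in cut_points if 0.0 < p < duration) + [duration]
    segments = []
    start = 0.0
    for end in boundaries:
        length = end - start
        if length < min_len:
            continue  # too short: keep extending to the next silence
        if length <= max_len:
            segments.append((start, end))
        # Assumption: a stretch longer than max_len is skipped, not hard-cut.
        start = end
    return segments


# Usage: silences at 1, 3, 4, 9 and 25 seconds in a 26-second recording.
segments = build_segments([1.0, 3.0, 4.0, 9.0, 25.0], duration=26.0)
```

The greedy rule closes a segment at the first silence that makes it long enough, which keeps segments short but can waste audio when a short piece is followed by a long one.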

The algorithm uses Interval Search first, then Gapped Search only if needed.
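To make the CER-based acceptance concrete, here is a minimal sketch that scores a hypothesized transcript against candidate substrings of the reference text and applies the two thresholds listed above. The brute-force window scan is a simple stand-in, not the Interval Search or Gapped Search used by the tool, and the names `cer`, `quality_tier`, `best_substring`, and the `slack` parameter are assumptions introduced for the example.

```python
from typing import Optional, Tuple

HIGH_MAX_CER = 0.05    # "High quality" threshold from the list above
MIDDLE_MAX_CER = 0.2   # "Middle quality" threshold from the list above


def edit_distance(a: str, b: str) -> int:
    """Character-level Levenshtein distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]


def cer(hypothesis: str, reference: str) -> float:
    """Character error rate: edit distance divided by reference length."""
    if not reference:
        return 0.0 if not hypothesis else 1.0
    return edit_distance(hypothesis, reference) / len(reference)


def quality_tier(score: float) -> Optional[str]:
    """Map a CER to 'high', 'middle', or None (chunk is discarded)."""
    if score <= HIGH_MAX_CER:
        return "high"
    if score <= MIDDLE_MAX_CER:
        return "middle"
    return None


def best_substring(hypothesis: str, reference_text: str,
                   slack: int = 3) -> Tuple[float, str]:
    """Brute-force scan of word windows whose length is near the hypothesis."""
    words = reference_text.split()
    n_hyp = len(hypothesis.split())
    best = (float("inf"), "")
    for start in range(len(words)):
        for n in range(max(1, n_hyp - slack), n_hyp + slack + 1):
            if start + n > len(words):
                break
            candidate = " ".join(words[start:start + n])
            score = cer(hypothesis, candidate)
            if score < best[0]:
                best = (score, candidate)
    return best


reference = "the quick brown fox jumps over the lazy dog near the river bank"
score, match = best_substring("brown fox jumps over the lazy dog", reference)
tier = quality_tier(score)  # an exact match gives CER 0.0, i.e. "high"
```

Scanning every window is quadratic in the worst case, so it only suits short reference texts; it is shown here purely to illustrate how a CER threshold turns a candidate match into an accept or discard decision.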


Figure: Forced alignment pipeline showing silence splitting, transcription-based matching, CER thresholds, and chunk decisions.


📦 Datasets Created Using Mana Forced Aligner

| Dataset Name | Language | Size | License | Links |
|---|---|---|---|---|
| ManaTTS | Persian | 102+ hrs | CC-0 | Hugging Face, GitHub |
| Quran-Persian | Persian | 20+ hrs | CC-0 | Hugging Face |

Feel free to reach out if you'd like yours featured.


Getting Started

You can use the forced alignment method via Colab or the forced alignment notebook in this repository.

💡 To adapt it for your language, simply embed one or more ASR models (the more the better!) that work for your language into the transcription module section of the notebook.


Supported Languages

  • Actively tested on Persian
  • Easily customizable for other low-resource languages with available ASR models

Citation

If you use Mana Forced Aligner in your work, please cite:

@inproceedings{qharabagh-etal-2025-manatts,
    title = "{M}ana{TTS} {P}ersian: a recipe for creating {TTS} datasets for lower resource languages",
    author = "Qharabagh, Mahta Fetrat  and Dehghanian, Zahra  and Rabiee, Hamid R.",
    booktitle = "Proceedings of the 2025 Conference of the Nations of the Americas Chapter of the Association for Computational Linguistics: Human Language Technologies (Volume 1: Long Papers)",
    month = apr,
    year = "2025",
    address = "Albuquerque, New Mexico",
    publisher = "Association for Computational Linguistics",
    url = "https://aclanthology.org/2025.naacl-long.464/",
    pages = "9177--9206",
}
