The consensus layer
for Urdu speech recognition.
CORAL is a five-stage post-processing pipeline that takes noisy outputs from a fleet of ASR back-ends and produces a single clean Urdu transcript — cutting word-error rate by up to 46.5% relative, with no fine-tuning of any acoustic model.
✓ Stage 1 split-merge resolved 1 token · ✓ Stage 3 voted 2 corrections · ✓ Stage 4 grammar pass
— The Problem
Urdu is the world's most under-served
major spoken language.
Despite 230M+ speakers, every off-the-shelf ASR model loses measurable accuracy on Urdu, and the dominant failure modes are systematic, not random.
230M+
Urdu speakers worldwide
13–20%
WER for current SOTA
36.5%
Split/merge disagreement
0
Public correction layers
— example: tokenisation failure
whisper-large says
وہ کیاہے کام
tokens: 3 · ‘kya-hai’ merged
seamless-large says
وہ کیا ہے کام
tokens: 4 · ‘kya hai’ split
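The disagreement above can be detected mechanically. The following is a minimal sketch, not CORAL's weighted multi-sequence aligner: it walks two token lists greedily and checks whether one token equals the concatenation of several consecutive tokens in the other. The function name `boundary_events` and the `max_span` parameter are hypothetical.

```python
def boundary_events(a, b, max_span=3):
    """Greedy pairwise walk over two token lists (a simplification of
    multi-sequence alignment). Labels are from the point of view of `a`:
    MERGE = `a` joined words that `b` kept apart, SPLIT = the reverse."""
    events, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            events.append(("SAME", a[i:i + 1], b[j:j + 1]))
            i, j = i + 1, j + 1
            continue
        for k in range(2, max_span + 1):
            if "".join(b[j:j + k]) == a[i]:      # one token in a == k tokens in b
                events.append(("MERGE", a[i:i + 1], b[j:j + k]))
                i, j = i + 1, j + k
                break
            if "".join(a[i:i + k]) == b[j]:      # k tokens in a == one token in b
                events.append(("SPLIT", a[i:i + k], b[j:j + 1]))
                i, j = i + k, j + 1
                break
        else:
            # Neither side is a re-segmentation of the other.
            events.append(("NOISE", a[i:i + 1], b[j:j + 1]))
            i, j = i + 1, j + 1
    return events  # trailing unmatched tokens are ignored in this sketch


woh, kya, hai, kaam = "وہ", "کیا", "ہے", "کام"
whisper = [woh, kya + hai, kaam]   # 3 tokens, 'kya-hai' merged
seamless = [woh, kya, hai, kaam]   # 4 tokens, 'kya hai' split

labels = [tag for tag, _, _ in boundary_events(whisper, seamless)]
print(labels)
```

Because the comparison is on concatenated characters, it only works after both hypotheses have been normalised to the same code points.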
— Research Hypothesis
Five algorithmic levers,
composable, deterministic.
Normalise
Arabic→Urdu Unicode unification, diacritic removal and hamza normalisation. A zero-risk pre-pass that alone accounts for a 1.9-point WER reduction.
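A normalisation pre-pass of this kind might look like the sketch below. The mapping table is an assumption that covers only a few well-known Arabic/Urdu look-alike code points; it is not CORAL's actual collapse table. Hamza handling here relies on Unicode NFC composition, which is one possible reading of "hamza normalisation".

```python
import re
import unicodedata

# Assumed subset of Arabic code points folded to their Urdu counterparts.
ARABIC_TO_URDU = str.maketrans({
    "\u0643": "\u06A9",  # ARABIC LETTER KAF  -> KEHEH
    "\u064A": "\u06CC",  # ARABIC LETTER YEH  -> FARSI YEH
    "\u0649": "\u06CC",  # ALEF MAKSURA       -> FARSI YEH
    "\u0647": "\u06C1",  # ARABIC LETTER HEH  -> HEH GOAL
})

# Short-vowel marks, tanween, shadda, sukun (U+064B..U+0652) and superscript alef.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def normalise(text: str) -> str:
    # 1. NFC composes base letter + HAMZA ABOVE (U+0654) into the precomposed
    #    letter, so both spellings of a hamza-bearing letter become identical.
    text = unicodedata.normalize("NFC", text)
    # 2. Fold Arabic look-alikes to Urdu code points.
    text = text.translate(ARABIC_TO_URDU)
    # 3. Drop optional diacritics.
    text = DIACRITICS.sub("", text)
    # 4. Collapse whitespace.
    return " ".join(text.split())
```

NFC runs before the letter mapping so that a decomposed yeh plus hamza is composed before the bare yeh is folded.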
Split-Merge
Weighted multi-sequence alignment that classifies every event as SAME / SPLIT / MERGE / NOISE. Split/merge events account for 36.5% of all inter-model disagreement.
OOV + BK-tree
Hybrid OOV detection with BK-tree edit-distance neighbours re-ranked by an Urdu n-gram language model over a 500K-token corpus.
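To illustrate the retrieve-then-re-rank idea, here is a small self-contained sketch: a BK-tree over Levenshtein distance returns vocabulary words within a distance bound, and a toy bigram count table stands in for the Urdu n-gram language model. The class, the `rerank` helper and the counts are illustrative assumptions, not CORAL's code or data.

```python
def edit_distance(a: str, b: str) -> int:
    """Levenshtein distance, computed row by row."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


class BKTree:
    """Each node is (word, {distance: child_node})."""

    def __init__(self, words):
        it = iter(words)
        self.root = (next(it), {})
        for w in it:
            self.add(w)

    def add(self, word):
        node = self.root
        while True:
            d = edit_distance(word, node[0])
            if d == 0:
                return                      # already present
            if d not in node[1]:
                node[1][d] = (word, {})
                return
            node = node[1][d]

    def query(self, word, max_dist):
        out, stack = [], [self.root]
        while stack:
            term, children = stack.pop()
            d = edit_distance(word, term)
            if d <= max_dist:
                out.append((d, term))
            # Triangle inequality: only subtrees in [d - max_dist, d + max_dist] can match.
            stack.extend(c for k, c in children.items() if d - max_dist <= k <= d + max_dist)
        return out


def rerank(candidates, left_word, bigram_counts):
    """Prefer candidates seen after `left_word`; break ties by edit distance."""
    return sorted(candidates, key=lambda c: (-bigram_counts.get((left_word, c[1]), 0), c[0]))


KAAM, NAAM, SHAAM, KITAAB, MERA = "کام", "نام", "شام", "کتاب", "میرا"
tree = BKTree([KAAM, NAAM, SHAAM, KITAAB])
toy_bigrams = {(MERA, NAAM): 5, (MERA, KAAM): 2}   # made-up counts

neighbours = tree.query("دام", max_dist=1)          # OOV token from an ASR output
best = rerank(neighbours, MERA, toy_bigrams)[0][1]
```

The BK-tree prunes most of the vocabulary per query, which is what makes edit-distance lookup practical over a large lexicon; the language model then chooses among the few survivors using context.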
Vote
Position-wise conservative consensus voting across the ensemble. Source-biased tie-breaking, OOV-aware overrides.
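One way to read "conservative" voting is sketched below, under stated assumptions: the primary source keeps its token unless an in-vocabulary alternative wins a strict majority of the ensemble, or the primary token is itself out of vocabulary. The thresholds, the `vocab` argument and the function name are hypothetical; tokens are romanised for readability.

```python
from collections import Counter


def conservative_vote(columns, primary=0, vocab=None):
    """columns: one tuple per aligned position, one token per model.
    The primary model's token is overridden only under the two rules below."""

    def is_oov(tok):
        return vocab is not None and tok not in vocab

    result = []
    for col in columns:
        base = col[primary]
        # Alternatives: tokens that differ from the primary and are in-vocabulary.
        alts = Counter(t for t in col if t and t != base and not is_oov(t))
        choice = base                        # ties and weak majorities keep the source
        if alts:
            alt, votes = alts.most_common(1)[0]
            if votes > len(col) / 2 or is_oov(base):
                choice = alt
        result.append(choice)
    return result


aligned = [
    ("woh", "woh", "woh"),      # unanimous
    ("kya", "kya", "kaha"),     # primary is in the majority
    ("hai", "hain", "hain"),    # strict majority against the primary -> override
    ("kaam", "naam", "shaam"),  # three-way split -> keep the primary
]
print(conservative_vote(aligned))
```

Keeping the primary on ties is what stops several weaker models from outvoting a stronger source on positions where they merely disagree with each other.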
LLM Refine
Bounded LLM polish for grammar, izafat, postpositions and code-switching. Hallucinations structurally constrained by upstream metadata.
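The idea of bounding an LLM's authority can be sketched as a post-hoc guard: the refined transcript is accepted only if it stays within an edit budget and leaves protected positions untouched. The budget `max_changes`, the `locked` index set (for example, positions where the ensemble was unanimous) and the function name are all assumptions for illustration; the page does not specify how CORAL derives or enforces its limits.

```python
from difflib import SequenceMatcher


def accept_refinement(voted, refined, locked=frozenset(), max_changes=2):
    """Return True if `refined` is an acceptable edit of `voted`.

    voted / refined: token lists. locked: indices into `voted` that the LLM
    may not alter. max_changes: maximum number of tokens it may touch."""
    changed = 0
    matcher = SequenceMatcher(a=voted, b=refined, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        if any(i in locked for i in range(i1, i2)):
            return False                     # touched a protected token
        changed += max(i2 - i1, j2 - j1)     # size of the edited span
    return changed <= max_changes


voted = ["woh", "kya", "hai", "kaam"]
ok = accept_refinement(voted, ["woh", "kya", "hain", "kaam"], max_changes=1)
blocked = accept_refinement(voted, ["woh", "kya", "hain", "kaam"], locked={2})
rewrite = accept_refinement(voted, ["yeh", "ek", "naya", "jumla"])
```

If the guard rejects the output, a pipeline built this way could fall back to the voted transcript, so the LLM stage can never make the result worse than Stage 03 by rewriting freely.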
— End-to-End Flow
Raw ensemble → clean transcript.
Input: ensemble outputs → Stage 00 · Normalise → Stage 01 · Split-Merge → Stage 02 · OOV + BK-tree → Stage 03 · Vote → Stage 04 · LLM Refine → Output: corrected transcript
0 ms · Normalise
+18 ms · Split-Merge
+62 ms · OOV + BK-tree
+12 ms · Vote
+1.4 s · LLM Refine
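Reading the timeline figures as per-stage increments (an assumption; the page does not say whether they are incremental or cumulative), the latency budget works out as follows.

```python
# Per-stage latency in milliseconds, taken from the timeline above.
# Assumption: each "+N" is the time added by that stage.
stage_ms = {
    "Normalise": 0,
    "Split-Merge": 18,
    "OOV + BK-tree": 62,
    "Vote": 12,
    "LLM Refine": 1400,
}

deterministic_ms = sum(ms for name, ms in stage_ms.items() if name != "LLM Refine")
total_ms = sum(stage_ms.values())
llm_share = stage_ms["LLM Refine"] / total_ms

print(f"deterministic stages: {deterministic_ms} ms")
print(f"end to end:           {total_ms} ms")
print(f"LLM share of latency: {llm_share:.1%}")
```

Under that reading the four deterministic stages together take under 100 ms, and almost all of the end-to-end latency comes from the LLM pass.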
— Results
The numbers, unambiguous.
Evaluated on a 2,995-utterance Common Voice Urdu set (read speech) and a 500-clip conversational benchmark. Every CORAL stage adds a measurable WER reduction.
Common Voice Urdu · n = 2,995 · robust config
[Chart: relative WER reduction (↓) for Seamless · CV, Whisper-Large · CV and Whisper-Large · Conversational]
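For readers unfamiliar with the metrics, the sketch below shows how word-error rate and a relative reduction are conventionally computed. The sentences are a romanised toy example based on the tokenisation failure shown earlier, not benchmark data, and the helper names are illustrative.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word-error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, rw in enumerate(ref, 1):
        cur = [i]
        for j, hw in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (rw != hw)))   # substitution or match
        prev = cur
    return prev[-1] / len(ref)


def relative_reduction(before: float, after: float) -> float:
    """Fraction of the original error removed, e.g. 0.465 for '46.5% relative'."""
    return (before - after) / before


reference = "woh kya hai kaam"
raw = "woh kyahai kaam"          # merged token: one substitution plus one deletion
corrected = "woh kya hai kaam"   # after the split is restored

wer_raw = wer(reference, raw)
wer_corrected = wer(reference, corrected)
```

Note that a single merged token costs two word errors against a four-word reference, which is why boundary disagreements weigh so heavily on WER.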
— System Architecture
Distributed inference, serverless brain.
Inference Tier
Kaggle GPU nodes
3× T4 · ngrok HTTPS tunnels
- Whisper-Large-v3
- Seamless-M4T-Large
- Wav2Vec2-Urdu
- Self-registering on boot
Backend Tier
FastAPI orchestrator
HF Space · Docker · port 7860
- POST /align
- POST /oov
- POST /correct
- Model registry · transcribe
Frontend Tier
Next.js · React 19
Vercel · client-side LLM
- 4-pass UX flow
- Live alignment viz
- Stage 4 LLM dispatch
- Microphone + file modes
Data Tier
DuckDB
N-gram store · 10.5M rows
BK-tree
28 MB · joblib pickle
HuggingFace
Corpus + benchmark TSV
Eval TSV
Per-stage WER/CER
— Research Innovations
What makes CORAL not just another wrapper.
Split-merge-aware alignment
First Urdu post-processor to treat word-boundary disagreement as a first-class signal rather than substitution noise.
Urdu-specific normalisation
Custom Arabic↔Urdu Unicode collapse table validated against the Common Voice reference set.
Hybrid BK-tree + n-gram
Edit-distance retrieval, then context-aware re-ranking — the OOV long tail solved with classical NLP.
Conservative consensus voting
Avoids the ROVER failure mode where high-WER companions overrule a low-WER source.
Bounded LLM refinement
The LLM stage runs under authority limits derived from upstream metadata — refines, never rewrites freely.
Frozen acoustic models
Plug any open-weight ASR ensemble in. CORAL is a deterministic post-processor; the acoustic models stay swappable.
— Built With
Open weights. Open stack. Open data.
Every layer of CORAL runs on open standards — from the ASR back-ends down to the language-model post-edit. No proprietary models in the critical path.
Frontend
Next.js 15
App Router · React 19
Backend
FastAPI
Python · async REST
Storage
DuckDB
10.5M-row n-gram store
Datasets
Hugging Face
Models · BK-tree · TSV
ASR
Whisper-Large-v3
OpenAI · multilingual
ASR
Seamless-M4T
Meta · low-resource
ASR
Wav2Vec2-Urdu
Self-supervised · CTC
LLM Refine
GPT-OSS · Gemini
Bounded post-edit
— Global Impact
Speech accessibility for the next billion.
01 · Accessibility
Caption Urdu video, broadcast, and lectures with usable accuracy for the deaf and hard-of-hearing community.
02 · Healthcare
Dictation assistance in Urdu-speaking clinics where patients describe symptoms in dialectal speech.
03 · Education
Searchable transcripts of Urdu lecture archives, which modern search engines currently cannot index.
04 · Legal
Court and parliamentary record transcription where named-entity precision and code-switching matter.
05 · Low-resource
Architecture transfers to Pashto, Sindhi, Punjabi — Stages 1-4 are not Urdu-specific.
06 · Open release
Code, corpus, BK-tree, and benchmark TSV released under permissive licences for downstream research.
— Try It Now
Drop in audio.
Watch CORAL clean it.
The interactive demo walks you through every stage with real-time alignment visualisation — microphone, file upload, or pre-aligned TSV.
