Research Preview · FAST-NUCES · 2026

The consensus layer
for Urdu speech recognition.

CORAL is a five-stage post-processing pipeline that takes noisy outputs from a fleet of ASR back-ends and produces a single clean Urdu transcript — cutting word-error rate by up to 46.5% relative, with no fine-tuning of any acoustic model.

Live · pipeline.run

stage 04 · refining…

whisper-large · زندگی میں مشکل آتی ہے · 0.81

seamless-large · زندگی میں مشکلیں آتی ہیں · 0.92

wav2vec2-urdu · زندگی میں مشکل آتی ہیں · 0.63

CORAL · زندگی میں مشکلیں آتی ہیں

✓ Stage 1 split-merge resolved 1 token · ✓ Stage 3 voted 2 corrections · ✓ Stage 4 grammar pass

— The Problem

Urdu is the world's most under-served
major spoken language.

Despite 230M+ speakers, every off-the-shelf ASR model loses measurable accuracy on Urdu, and the dominant failure modes are systematic, not random.

230M+ · Urdu speakers worldwide

13–20% · WER for current SOTA

36.5% · Split/merge disagreement

Public correction layers

— example: tokenisation failure

whisper-large says

وہ کیاہے کام

tokens: 3 · ‘kya-hai’ merged

seamless-large says

وہ کیا ہے کام

tokens: 4 · ‘kya hai’ split

CORAL resolves

وہ کیا ہے کام
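The token counts in the example above can be reproduced with a few lines of Python. This is a minimal sketch that assumes plain whitespace tokenisation; the page does not say which tokeniser CORAL uses, and the `tokens` helper is a hypothetical name.

```python
# Hypotheses copied from the tokenisation example above.
whisper = "وہ کیاہے کام"
seamless = "وہ کیا ہے کام"


def tokens(text: str) -> list[str]:
    """Whitespace tokenisation (an assumption, not CORAL's documented tokeniser)."""
    return text.split()


w, s = tokens(whisper), tokens(seamless)
print(len(w), len(s))  # prints: 3 4

# If both models emit the same code points, the merged token is simply
# the two split tokens joined without a space.
merged_pair = w[1] == s[1] + s[2]
print(merged_pair)
```

A substitution-only aligner would score this as one wrong word plus one insertion, even though the two hypotheses disagree only about a word boundary.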

— Research Hypothesis

Five algorithmic levers,
composable, deterministic.

STAGE 00

Normalise

Arabic→Urdu Unicode unification, diacritic removal, and hamza normalisation. A zero-risk pre-pass that on its own removes 1.9 WER points.
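A normalisation pre-pass of this kind could look roughly like the sketch below. The mapping table is a small illustrative subset chosen for this example, not CORAL's actual collapse table, and the names `ARABIC_TO_URDU`, `HAMZA_COMPOSE`, and `normalise` are hypothetical.

```python
import re

# Arabic code points that are commonly replaced by their Urdu counterparts.
# Illustrative subset only; the full table used by CORAL is not published here.
ARABIC_TO_URDU = str.maketrans({
    "\u064A": "\u06CC",  # ARABIC LETTER YEH      -> FARSI YEH
    "\u0649": "\u06CC",  # ALEF MAKSURA           -> FARSI YEH
    "\u0643": "\u06A9",  # ARABIC LETTER KAF      -> KEHEH
    "\u0647": "\u06C1",  # ARABIC LETTER HEH      -> HEH GOAL
})

# Hamza normalisation: fold "base letter + combining hamza above" into the
# precomposed letter (one possible convention; an assumption of this sketch).
HAMZA_COMPOSE = {
    "\u06CC\u0654": "\u0626",  # yeh + hamza above      -> yeh with hamza
    "\u0648\u0654": "\u0624",  # waw + hamza above      -> waw with hamza
    "\u06C1\u0654": "\u06C2",  # heh goal + hamza above -> heh goal with hamza
}

# Short-vowel marks, tanween, shadda, sukun, and superscript alef.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def normalise(text: str) -> str:
    text = text.translate(ARABIC_TO_URDU)
    for sequence, composed in HAMZA_COMPOSE.items():
        text = text.replace(sequence, composed)
    return DIACRITICS.sub("", text)


# Arabic kaf + Arabic yeh + alef becomes the Urdu spelling of "kya".
print(normalise("\u0643\u064A\u0627") == "\u06A9\u06CC\u0627")  # prints: True
```

Running the same function over every model's output before alignment means that two hypotheses differing only in code-point choice compare as equal.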

STAGE 01

Split-Merge

Weighted multi-sequence alignment that classifies every event as SAME / SPLIT / MERGE / NOISE. Split/merge events account for 36.5% of all inter-model disagreement.
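The four event labels can be illustrated with a much simpler pairwise walk. The sketch below is a greedy two-sequence classifier, not the weighted multi-sequence alignment described above; the function name `classify` and the event tuple layout are assumptions made for this example.

```python
def classify(a: list[str], b: list[str]) -> list[tuple[str, list[str], list[str]]]:
    """Label each event from the point of view of sequence `a` relative to `b`."""
    events, i, j = [], 0, 0
    while i < len(a) and j < len(b):
        if a[i] == b[j]:
            events.append(("SAME", a[i:i + 1], b[j:j + 1]))
            i, j = i + 1, j + 1
        elif j + 1 < len(b) and a[i] == b[j] + b[j + 1]:
            # `a` has one token where `b` has two: `a` merged them.
            events.append(("MERGE", a[i:i + 1], b[j:j + 2]))
            i, j = i + 1, j + 2
        elif i + 1 < len(a) and a[i] + a[i + 1] == b[j]:
            # `a` has two tokens where `b` has one: `a` split it.
            events.append(("SPLIT", a[i:i + 2], b[j:j + 1]))
            i, j = i + 2, j + 1
        else:
            events.append(("NOISE", a[i:i + 1], b[j:j + 1]))
            i, j = i + 1, j + 1
    # Unpaired trailing tokens are treated as noise in this sketch.
    events.extend(("NOISE", [t], []) for t in a[i:])
    events.extend(("NOISE", [], [t]) for t in b[j:])
    return events


# Latin-script toy tokens for readability; the logic is script-agnostic.
labels = [label for label, _, _ in classify(["woh", "kyahai", "kaam"],
                                            ["woh", "kya", "hai", "kaam"])]
print(labels)  # prints: ['SAME', 'MERGE', 'SAME']
```

Treating MERGE and SPLIT as their own classes lets a later stage resolve the word boundary directly instead of counting it as a substitution plus an insertion.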

STAGE 02

OOV + BK-tree

Hybrid OOV detection with BK-tree edit-distance neighbours re-ranked by an Urdu n-gram language model over a 500K-token corpus.
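To make the retrieve-then-re-rank idea concrete, here is a small self-contained BK-tree over Levenshtein distance. It is a sketch under stated assumptions: the vocabulary is a Latin-script toy list, and a plain word-frequency table stands in for the Urdu n-gram language model. `BKTree`, `suggest`, and `counts` are hypothetical names.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance over code points."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


class BKTree:
    """Each node is (word, {distance_to_child: child_node})."""

    def __init__(self, words):
        words = iter(words)
        self.root = (next(words), {})
        for word in words:
            self.add(word)

    def add(self, word: str) -> None:
        node = self.root
        while True:
            d = levenshtein(word, node[0])
            if d == 0:
                return  # already present
            if d not in node[1]:
                node[1][d] = (word, {})
                return
            node = node[1][d]

    def query(self, word: str, max_dist: int) -> list[tuple[int, str]]:
        found, stack = [], [self.root]
        while stack:
            term, children = stack.pop()
            d = levenshtein(word, term)
            if d <= max_dist:
                found.append((d, term))
            # Triangle inequality: only these subtrees can hold matches.
            stack.extend(child for k, child in children.items()
                         if d - max_dist <= k <= d + max_dist)
        return sorted(found)


def suggest(tree: BKTree, counts: dict[str, int], word: str, max_dist: int = 1) -> list[str]:
    """Re-rank neighbours by distance, then by frequency (stand-in for an n-gram LM)."""
    ranked = sorted(tree.query(word, max_dist), key=lambda c: (c[0], -counts.get(c[1], 0)))
    return [term for _, term in ranked]


counts = {"book": 50, "books": 20, "boo": 2, "cake": 9, "cape": 4, "cart": 7}
tree = BKTree(counts)
print(suggest(tree, counts, "bok"))  # prints: ['book', 'boo']
```

The tree prunes most of the vocabulary on each query, so the more expensive context-aware scoring only has to run over a handful of candidates.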

STAGE 03

Vote

Position-wise conservative consensus voting across the ensemble. Source-biased tie-breaking, OOV-aware overrides.
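One possible reading of "conservative" voting with OOV-aware overrides is sketched below. The exact rule CORAL applies is not given on this page, so the policy here is an assumption: the primary source keeps its token unless that token is out of vocabulary and a strict majority of the other models agree on an in-vocabulary alternative. `conservative_vote`, `columns`, and `vocab` are hypothetical names.

```python
from collections import Counter


def conservative_vote(columns: list[tuple[str, ...]], primary: int, vocab: set[str]) -> list[str]:
    """columns[p] holds the aligned token from every model at position p."""
    result = []
    for column in columns:
        choice = column[primary]
        companions = [tok for k, tok in enumerate(column) if k != primary]
        best, votes = Counter(companions).most_common(1)[0]
        primary_is_oov = choice not in vocab
        strict_majority = votes > len(companions) / 2
        # Ties and in-vocabulary primary tokens always resolve to the primary source.
        if primary_is_oov and strict_majority and best in vocab:
            choice = best
        result.append(choice)
    return result


vocab = {"life", "is", "hard", "harder"}
columns = [
    ("life", "life", "life"),
    ("iz", "is", "is"),          # primary token is OOV and companions agree: override
    ("harder", "hard", "hard"),  # primary token is in vocabulary: kept despite being outvoted
]
print(conservative_vote(columns, primary=0, vocab=vocab))  # prints: ['life', 'is', 'harder']
```

Under this rule, two weaker models cannot overrule a valid word from the strongest model simply by agreeing with each other.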

STAGE 04

LLM Refine

Bounded LLM polish for grammar, izafat, postpositions and code-switching. Hallucinations structurally constrained by upstream metadata.
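A bound on the LLM stage could be enforced with a post-hoc guard such as the one sketched here. It assumes the upstream stages hand over a set of token positions the LLM is allowed to touch; that encoding, the `max_changes` budget, and the function name `accept_refinement` are illustrative assumptions, not CORAL's documented mechanism.

```python
import difflib


def accept_refinement(original: list[str], refined: list[str],
                      editable: set[int], max_changes: int = 2) -> bool:
    """Accept the LLM output only if it stays inside the allowed edit budget."""
    matcher = difflib.SequenceMatcher(a=original, b=refined, autojunk=False)
    changed = 0
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        # Insertions, deletions, and length-changing rewrites are rejected outright.
        if tag != "replace" or (i2 - i1) != (j2 - j1):
            return False
        # Every replaced position must have been flagged as editable upstream.
        if any(pos not in editable for pos in range(i1, i2)):
            return False
        changed += i2 - i1
    return changed <= max_changes


original = ["life", "is", "hard"]
print(accept_refinement(original, ["life", "are", "hard"], editable={1}))          # prints: True
print(accept_refinement(original, ["life", "is", "easy"], editable={1}))           # prints: False
print(accept_refinement(original, ["a", "brand", "new", "line"], editable={1}))    # prints: False
```

If the guard rejects a refinement, a pipeline built this way can fall back to the Stage 03 output unchanged.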

Walk through every stage

— End-to-End Flow

Raw ensemble → clean transcript.

Input · Ensemble outputs

Stage 00 · Normalise · 0 ms

Stage 01 · Split-Merge · +18 ms

Stage 02 · OOV + BK-tree · +62 ms

Stage 03 · Vote · +12 ms

Stage 04 · LLM Refine · +1.4 s

Output · Corrected transcript

— Results

The numbers, unambiguous.

Evaluated on the 2,995-utterance Common Voice Urdu set (read speech) and a 500-clip conversational benchmark. Every CORAL stage contributes a measurable WER reduction.
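For readers unfamiliar with the metric, word-error rate is the word-level edit distance between hypothesis and reference divided by the number of reference words. The sketch below uses the standard definition with whitespace tokenisation; it is not CORAL's evaluation script, and the sample sentences are placeholders.

```python
def wer(reference: str, hypothesis: str) -> float:
    """(substitutions + deletions + insertions) / reference length, at word level."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1,              # deletion
                           cur[j - 1] + 1,           # insertion
                           prev[j - 1] + (r != h)))  # match or substitution
        prev = cur
    return prev[-1] / len(ref)


# One substitution and one deletion against a four-word reference.
print(wer("a b c d", "a x c"))  # prints: 0.5
```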

Common Voice Urdu · n = 2,995 · robust config

Model · baseline WER → WER with CORAL · ↓ relative

Seamless-Large · 18.45% → 14.34% · ↓22.3%

Whisper-Large-v3 · 28.29% → 19.97% · ↓29.4%

Whisper-Medium · 40.44% → 30.64% · ↓24.2%

Wav2Vec2-Urdu · 53.52% → 39.67% · ↓25.9%
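The relative figures follow directly from the two WER columns: relative reduction = (baseline − corrected) / baseline. The short check below recomputes them from the numbers reported above, reading the first figure in each row as the baseline and the second as the WER after CORAL; the dictionary layout is just a convenience for this example.

```python
# (baseline WER %, WER % after CORAL) as reported for Common Voice Urdu.
results = {
    "Seamless-Large": (18.45, 14.34),
    "Whisper-Large-v3": (28.29, 19.97),
    "Whisper-Medium": (40.44, 30.64),
    "Wav2Vec2-Urdu": (53.52, 39.67),
}


def relative_reduction(baseline: float, corrected: float) -> float:
    """Relative WER reduction in percent."""
    return 100 * (baseline - corrected) / baseline


for model, (baseline, corrected) in results.items():
    print(f"{model}: {relative_reduction(baseline, corrected):.1f}%")
# prints:
# Seamless-Large: 22.3%
# Whisper-Large-v3: 29.4%
# Whisper-Medium: 24.2%
# Wav2Vec2-Urdu: 25.9%
```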

Full ablation, residuals, future work

— System Architecture

Distributed inference, serverless brain.

Inference Tier

Kaggle GPU nodes

3× T4 · ngrok HTTPS tunnels

Whisper-Large-v3
Seamless-M4T-Large
Wav2Vec2-Urdu
Self-registering on boot

Backend Tier

FastAPI orchestrator

HF Space · Docker · port 7860

POST /align
POST /oov
POST /correct
Model registry · transcribe

Frontend Tier

Next.js · React 19

Vercel · client-side LLM

4-pass UX flow
Live alignment viz
Stage 4 LLM dispatch
Microphone + file modes

Data Tier

DuckDB

N-gram store · 10.5M rows

BK-tree

28 MB · joblib pickle

Hugging Face

Corpus + benchmark TSV

Eval TSV

Per-stage WER/CER

— Research Innovations

What makes CORAL not just another wrapper.

Split-merge-aware alignment

First Urdu post-processor to treat word-boundary disagreement as a first-class signal rather than substitution noise.

Urdu-specific normalisation

Custom Arabic↔Urdu Unicode collapse table validated against the Common Voice reference set.

Hybrid BK-tree + n-gram

Edit-distance retrieval, then context-aware re-ranking — the OOV long tail solved with classical NLP.

Conservative consensus voting

Avoids the ROVER failure mode where high-WER companions overrule a low-WER source.

Bounded LLM refinement

The LLM stage runs under authority limits derived from upstream metadata — refines, never rewrites freely.

Frozen acoustic models

Plug any open-weight ASR ensemble in. CORAL is a deterministic post-processor; the acoustic models stay swappable.

— Built With

Open weights. Open stack. Open data.

Every layer of CORAL runs on open standards — from the ASR back-ends down to the language-model post-edit. No proprietary models in the critical path.

Frontend

Next.js 15

App Router · React 19

Backend

FastAPI

Python · async REST

Storage

DuckDB

10.5M-row n-gram store

Datasets

Hugging Face

Models · BK-tree · TSV

ASR

Whisper-Large-v3

OpenAI · multilingual

ASR

Seamless-M4T

Meta · low-resource

ASR

Wav2Vec2-Urdu

Self-supervised · CTC

LLM Refine

GPT-OSS · Gemini

Bounded post-edit

— Global Impact

Speech accessibility for the next billion.

01 · Accessibility

Caption Urdu video, broadcast, and lectures with usable accuracy for the deaf and hard-of-hearing community.

02 · Healthcare

Dictation assistance in Urdu-speaking clinics where patients describe symptoms in dialectal speech.

03 · Education

Searchable transcripts of Urdu lecture archives, which modern search engines currently cannot index.

04 · Legal

Court and parliamentary record transcription where named-entity precision and code-switching matter.

05 · Low-resource

The architecture transfers to Pashto, Sindhi, and Punjabi: Stages 1–4 are not Urdu-specific.

06 · Open release

Code, corpus, BK-tree, and benchmark TSV released under permissive licences for downstream research.

— Try It Now

Drop in audio.
Watch CORAL clean it.

The interactive demo walks you through every stage with real-time alignment visualisation — microphone, file upload, or pre-aligned TSV.
