CORAL
Research Preview · FAST-NUCES · 2026

The consensus layer
for Urdu speech recognition.

CORAL is a five-stage post-processing pipeline that takes noisy outputs from a fleet of ASR back-ends and produces a single clean Urdu transcript — cutting word-error rate by up to 46.5% relative, with no fine-tuning of any acoustic model.

46.5%

Relative WER drop (up to)

5

Pipeline stages


Live · pipeline.run
stage 04 · refining…
whisper-large · زندگی میں مشکل آتی ہے · 0.81
seamless-large · زندگی میں مشکلیں آتی ہیں · 0.92
wav2vec2-urdu · زندگی میں مشکل آتی ہیں · 0.63
CORAL · زندگی میں مشکلیں آتی ہیں

✓ Stage 1 split-merge resolved 1 token · ✓ Stage 3 voted 2 corrections · ✓ Stage 4 grammar pass


— The Problem

Urdu is the world's most under-served
major spoken language.

Despite 230M+ speakers, every off-the-shelf ASR model loses measurable accuracy on Urdu, and the dominant failure modes are systematic, not random.

230M+

Urdu speakers worldwide

13–20%

WER for current SOTA

36.5%

Split/merge disagreement

0

Public correction layers

— example: tokenisation failure

whisper-large says

وہ کیاہے کام

tokens: 3 · ‘kya-hai’ merged

seamless-large says

وہ کیا ہے کام

tokens: 4 · ‘kya hai’ split

CORAL resolves: وہ کیا ہے کام

— Research Hypothesis

Five algorithmic levers,
composable, deterministic.

STAGE 00

Normalise

Arabic→Urdu Unicode unification, diacritic removal and hamza normalisation. A zero-risk pre-pass that alone accounts for 1.9 points of WER reduction.
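A minimal sketch of what such a pre-pass could look like. The collapse table, the hamza compositions and the diacritic range below are illustrative assumptions, not CORAL's validated table, which is not reproduced on this page.

```python
import re

# Hypothetical Arabic -> Urdu code point collapse table (illustrative only).
ARABIC_TO_URDU = str.maketrans({
    "\u064A": "\u06CC",  # Arabic yeh -> Urdu yeh
    "\u0649": "\u06CC",  # alef maksura -> Urdu yeh
    "\u0643": "\u06A9",  # Arabic kaf -> keheh
    "\u0647": "\u06C1",  # Arabic heh -> heh goal
})

# Assumed hamza normalisation: base letter + combining hamza above (U+0654)
# is folded into a single precomposed letter.
HAMZA_COMPOSE = {
    "\u0648\u0654": "\u0624",  # waw + hamza above
    "\u06CC\u0654": "\u0626",  # yeh + hamza above
    "\u06C1\u0654": "\u06C2",  # heh goal + hamza above
}

# Short vowels, tanween, shadda, sukun and superscript alef.
DIACRITICS = re.compile("[\u064B-\u0652\u0670]")


def normalise(text: str) -> str:
    """Unify Arabic-block variants, fold hamza sequences, strip diacritics."""
    text = text.translate(ARABIC_TO_URDU)
    for sequence, composed in HAMZA_COMPOSE.items():
        text = text.replace(sequence, composed)
    text = DIACRITICS.sub("", text)
    return " ".join(text.split())  # collapse runs of whitespace
```

Applying the same function to every back-end's output before alignment means that two models that differ only in code point choice no longer register as disagreeing.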

STAGE 01

Split-Merge

Weighted multi-sequence alignment. Classifies every event as SAME / SPLIT / MERGE / NOISE. Split and merge events make up 36.5% of all inter-model disagreement.
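The four event labels can be illustrated with a simplified pairwise version. This sketch aligns only two hypotheses with Python's `difflib` and is unweighted, so it is a stand-in for the idea rather than CORAL's multi-sequence aligner; the function name and the label convention (named from the first hypothesis's point of view) are assumptions.

```python
from difflib import SequenceMatcher


def classify_events(a_tokens, b_tokens):
    """Label each aligned span between two hypotheses.

    MERGE: the first hypothesis wrote as fewer tokens what the second split.
    SPLIT: the first hypothesis wrote as more tokens what the second merged.
    NOISE: any other disagreement (substitution, insertion, deletion).
    """
    events = []
    matcher = SequenceMatcher(a=a_tokens, b=b_tokens, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        a_span, b_span = a_tokens[i1:i2], b_tokens[j1:j2]
        if tag == "equal":
            events.extend(("SAME", [tok], [tok]) for tok in a_span)
        elif "".join(a_span) == "".join(b_span) and len(a_span) != len(b_span):
            label = "MERGE" if len(a_span) < len(b_span) else "SPLIT"
            events.append((label, a_span, b_span))
        else:
            events.append(("NOISE", a_span, b_span))
    return events


# Romanised version of the tokenisation example ('kya-hai' merged vs split).
merged = ["woh", "kyahai", "kaam"]
split = ["woh", "kya", "hai", "kaam"]
labels = [label for label, _, _ in classify_events(merged, split)]
```

Here `labels` is `["SAME", "MERGE", "SAME"]`: the boundary disagreement is recognised as one event instead of being counted as a substitution plus an insertion.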

STAGE 02

OOV + BK-tree

Hybrid OOV detection with BK-tree edit-distance neighbours re-ranked by an Urdu n-gram language model over a 500K-token corpus.
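As a rough illustration of the retrieve-then-re-rank idea, the sketch below builds a small BK-tree over Levenshtein distance and re-orders the neighbours by a bigram count. The class, the `rerank` helper and the toy counts are assumptions for demonstration; the real store and scoring are not shown here.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1]


class BKTree:
    """Each node is (word, {distance: child}); children are keyed by distance."""

    def __init__(self, words):
        words = iter(words)
        self.root = (next(words), {})
        for word in words:
            self.add(word)

    def add(self, word):
        node = self.root
        while True:
            dist = levenshtein(word, node[0])
            if dist == 0:
                return  # already present
            child = node[1].get(dist)
            if child is None:
                node[1][dist] = (word, {})
                return
            node = child

    def query(self, word, max_dist):
        """Return sorted (distance, word) pairs within max_dist of word."""
        found, stack = [], [self.root]
        while stack:
            term, children = stack.pop()
            dist = levenshtein(word, term)
            if dist <= max_dist:
                found.append((dist, term))
            # Triangle inequality: only these subtrees can contain matches.
            for key, child in children.items():
                if dist - max_dist <= key <= dist + max_dist:
                    stack.append(child)
        return sorted(found)


def rerank(candidates, prev_word, bigram_counts):
    """Prefer candidates seen after prev_word, then smaller edit distance."""
    return sorted(
        candidates,
        key=lambda c: (-bigram_counts.get((prev_word, c[1]), 0), c[0]),
    )


tree = BKTree(["book", "books", "cake", "boo", "cape", "cart"])
neighbours = tree.query("bool", 1)
ranked = rerank(neighbours, "a", {("a", "book"): 5})
```

The example uses Latin strings for readability; the same code works on any Unicode strings, including normalised Urdu tokens.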

STAGE 03

Vote

Position-wise conservative consensus voting across the ensemble. Source-biased tie-breaking, OOV-aware overrides.
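One possible reading of "conservative" voting is sketched below: the trusted source's token stands unless a strict majority of the ensemble agrees on a different token, or the source token is out of vocabulary and a known alternative exists. The thresholds, the function name and the override rule are assumptions, not CORAL's documented behaviour.

```python
from collections import Counter


def conservative_vote(columns, source=0, vocab=None):
    """Vote position by position over aligned token columns.

    columns: one tuple per position, holding one token per model.
    source:  index of the trusted (lowest-WER) model; ties go to it.
    vocab:   optional set of known words for the OOV-aware override.
    """
    result = []
    for column in columns:
        base = column[source]
        counts = Counter(column)
        # Strongest competing token; falls back to the base if none exists.
        rival, votes = max(
            ((tok, n) for tok, n in counts.items() if tok != base),
            key=lambda item: item[1],
            default=(base, 0),
        )
        rival_known = vocab is None or rival in vocab
        base_oov = vocab is not None and base not in vocab
        strict_majority = votes > len(column) / 2
        if votes > 0 and rival_known and (strict_majority or base_oov):
            result.append(rival)
        else:
            result.append(base)
    return result


rows = [["a", "b", "c"], ["a", "x", "c"], ["a", "x", "d"]]
voted = conservative_vote(list(zip(*rows)), source=0)
```

With three models, a single dissenting companion never overrules the source, which is the failure mode the conservative rule is meant to avoid.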

STAGE 04

LLM Refine

Bounded LLM polish for grammar, izafat, postpositions and code-switching. Hallucinations structurally constrained by upstream metadata.
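A structural bound of this kind could be enforced with a guard that runs after the LLM returns. In the sketch below, `locked` (positions the upstream stages consider settled) and `max_changed` (an edit budget) are hypothetical parameters; the page does not specify which metadata CORAL actually uses or how the limits are derived.

```python
from difflib import SequenceMatcher


def accept_refinement(voted, refined, locked, max_changed=2):
    """Return the LLM output only if it stays within the allowed edits.

    voted:       token list produced by the voting stage.
    refined:     token list proposed by the LLM.
    locked:      set of indices into voted that must not be altered.
    max_changed: maximum number of tokens the LLM may touch.
    """
    changed = 0
    matcher = SequenceMatcher(a=voted, b=refined, autojunk=False)
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag == "equal":
            continue
        if any(i in locked for i in range(i1, i2)):
            return voted  # the LLM touched a settled token: reject
        changed += max(i2 - i1, j2 - j1)
    return refined if changed <= max_changed else voted


voted = ["zindagi", "mein", "mushkil", "aati", "hain"]
polished = ["zindagi", "mein", "mushkilen", "aati", "hain"]
kept = accept_refinement(voted, polished, locked={0, 1})
```

Falling back to the voted transcript on any violation keeps the worst case of the LLM stage equal to the output of the deterministic stages.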


— End-to-End Flow

Raw ensemble → clean transcript.

Input: ensemble outputs → Stage 00: Normalise → Stage 01: Split-Merge → Stage 02: OOV + BK-tree → Stage 03: Vote → Stage 04: LLM Refine → Output: corrected transcript

Normalise · 0 ms
Split-Merge · +18 ms
OOV + BK-tree · +62 ms
Vote · +12 ms
LLM Refine · +1.4 s
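The flow above can be thought of as a chain of functions, each taking the previous stage's output. The runner below is a generic sketch with per-stage timing; the stage callables in the usage example are placeholders, not CORAL's stages.

```python
import time


def run_pipeline(hypotheses, stages):
    """Run (name, callable) stages in order and record elapsed milliseconds."""
    state, timings = hypotheses, []
    for name, stage in stages:
        start = time.perf_counter()
        state = stage(state)
        timings.append((name, (time.perf_counter() - start) * 1000.0))
    return state, timings


# Placeholder stages standing in for the real ones.
demo_stages = [
    ("normalise", lambda hyps: [" ".join(h.split()) for h in hyps]),
    ("vote", lambda hyps: max(set(hyps), key=hyps.count)),
]
transcript, timings = run_pipeline(["a  b", "a b", "a c"], demo_stages)
```

Recording a timing per stage makes it straightforward to report a latency breakdown like the one listed above.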

— Results

The numbers, unambiguous.

Evaluated on the 2,995-utterance Common Voice Urdu set (read speech) and a 500-clip conversational benchmark. Every CORAL stage yields a measurable WER reduction.

Common Voice Urdu · n = 2,995 · robust config

Model · WER · ↓ relative
Seamless-Large · 14.34% · 22.3%
Whisper-Large-v3 · 19.97% · 29.4%
Whisper-Medium · 30.64% · 24.2%
Wav2Vec2-Urdu · 39.67% · 25.9%
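For reference, word-error rate and relative reduction are conventionally computed as below. This is the standard definition, assuming a non-empty reference and whitespace tokenisation; the numbers in the usage lines are made up for illustration and are not taken from the table.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word-level edit distance divided by the reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, 1):
        cur = [i]
        for j, h in enumerate(hyp, 1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (r != h)))
        prev = cur
    return prev[-1] / len(ref)


def relative_reduction(baseline: float, corrected: float) -> float:
    """Fraction of the baseline WER removed by post-processing."""
    return (baseline - corrected) / baseline


example_wer = wer("a b c d", "a x c")            # one substitution, one deletion
example_drop = relative_reduction(0.20, 0.15)    # hypothetical WER values
```

In this toy case `example_wer` is 0.5 and `example_drop` is 0.25, meaning a 25% relative reduction.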


— System Architecture

Distributed inference, serverless brain.

Inference Tier

Kaggle GPU nodes

3× T4 · ngrok HTTPS tunnels

  • Whisper-Large-v3
  • Seamless-M4T-Large
  • Wav2Vec2-Urdu
  • Self-registering on boot

Backend Tier

FastAPI orchestrator

HF Space · Docker · port 7860

  • POST /align
  • POST /oov
  • POST /correct
  • Model registry · transcribe

Frontend Tier

Next.js · React 19

Vercel · client-side LLM

  • 4-pass UX flow
  • Live alignment viz
  • Stage 4 LLM dispatch
  • Microphone + file modes

Data Tier

DuckDB

N-gram store · 10.5M rows

BK-tree

28 MB · joblib pickle

HuggingFace

Corpus + benchmark TSV

Eval TSV

Per-stage WER/CER

— Research Innovations

What makes CORAL not just another wrapper.

01

Split-merge-aware alignment

First Urdu post-processor to treat word-boundary disagreement as a first-class signal rather than substitution noise.

02

Urdu-specific normalisation

Custom Arabic↔Urdu Unicode collapse table validated against the Common Voice reference set.

03

Hybrid BK-tree + n-gram

Edit-distance retrieval, then context-aware re-ranking — the OOV long tail solved with classical NLP.

04

Conservative consensus voting

Avoids the ROVER failure mode where high-WER companions overrule a low-WER source.

05

Bounded LLM refinement

The LLM stage runs under authority limits derived from upstream metadata — refines, never rewrites freely.

06

Frozen acoustic models

Plug any open-weight ASR ensemble in. CORAL is a deterministic post-processor; the acoustic models stay swappable.

— Built With

Open weights. Open stack. Open data.

Every layer of CORAL runs on open standards — from the ASR back-ends down to the language-model post-edit. No proprietary models in the critical path.

Frontend

Next.js 15

App Router · React 19

Backend

FastAPI

Python · async REST

Storage

DuckDB

10.5M-row n-gram store

Datasets

Hugging Face

Models · BK-tree · TSV

ASR

Whisper-Large-v3

OpenAI · multilingual

ASR

Seamless-M4T

Meta · low-resource

ASR

Wav2Vec2-Urdu

Self-supervised · CTC

LLM Refine

GPT-OSS · Gemini

Bounded post-edit

— Global Impact

Speech accessibility for the next billion.

01 · Accessibility

Caption Urdu video, broadcast, and lectures with usable accuracy for the deaf and hard-of-hearing community.

02 · Healthcare

Dictation assistance in Urdu-speaking clinics where patients describe symptoms in dialectal speech.

03 · Education

Searchable transcripts of Urdu lecture archives, which modern search engines currently cannot index.

04 · Legal

Court and parliamentary record transcription where named-entity precision and code-switching matter.

05 · Low-resource

The architecture transfers to Pashto, Sindhi and Punjabi: Stages 1–4 are not Urdu-specific.

06 · Open release

Code, corpus, BK-tree, and benchmark TSV released under permissive licences for downstream research.

— Try It Now

Drop in audio.
Watch CORAL clean it.

The interactive demo walks you through every stage with real-time alignment visualisation — microphone, file upload, or pre-aligned TSV.