— Research

Empirical evidence the pipeline works.

Two evaluation suites. Eight ablation configurations. A breakdown of where the remaining errors come from — and ten concrete directions for the next iteration.

Best relative WER drop (read-speech, Whisper-Large-v3): 29.4%

Final WER · convo: 10.6%

Read-speech utterances: 2,995

Ablation configs: 8

Research Hypothesis

“Combining Urdu-specific Unicode normalisation, weighted split-merge-aware alignment of multiple ASR hypotheses, hybrid OOV detection with BK-tree fuzzy lookup ranked by an n-gram language model, conservative position-wise voting, and targeted LLM refinement is sufficient to substantially reduce WER over the strongest single-model baseline — without fine-tuning any acoustic model.”

— Answered affirmatively (§5)

— Result 1

Read-speech · Common Voice Urdu

2,995 utterances · robust config · CORAL as post-processor

| Source model | Baseline WER | CORAL WER | Δ rel (WER reduction) |
|---|---|---|---|
| Seamless-Large | 18.45% | 14.34% | ↓22.3% |
| Whisper-Large-v3 | 28.29% | 19.97% | ↓29.4% |
| Whisper-Medium | 40.44% | 30.64% | ↓24.2% |
| Wav2Vec2-Urdu | 53.52% | 39.67% | ↓25.9% |
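The Δ rel column can be reproduced from the two WER columns. The snippet below is a minimal sketch, assuming Δ rel is defined as 100 × (baseline − CORAL) / baseline; the names `RESULTS`, `relative_reduction` and `summarise` are illustrative and not part of the CORAL code.

```python
# Baseline and CORAL WER (%) as reported for Common Voice Urdu.
RESULTS = {
    "Seamless-Large": (18.45, 14.34),
    "Whisper-Large-v3": (28.29, 19.97),
    "Whisper-Medium": (40.44, 30.64),
    "Wav2Vec2-Urdu": (53.52, 39.67),
}


def relative_reduction(baseline: float, corrected: float) -> float:
    """Relative WER reduction in percent (assumed definition of the delta column)."""
    return 100.0 * (baseline - corrected) / baseline


def summarise(results: dict) -> dict:
    """Map each source model to its relative reduction, rounded to one decimal."""
    return {
        model: round(relative_reduction(base, coral), 1)
        for model, (base, coral) in results.items()
    }


if __name__ == "__main__":
    for model, delta in summarise(RESULTS).items():
        print(f"{model}: {delta}% relative WER reduction")
```

Under that definition the recomputed values match the reported column for all four source models.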

— Result 2

Ablation · C0 → C7

Whisper-Large-v3 source · 500-clip conversational sample · each row adds one component

[Figure: bar chart of WER (%) for configurations C0 to C7, falling from 19.8% at C0 to 10.6% at C7.]

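| Config | Component added | WER | Δ (points) |
|---|---|---|---|
| C0 | Whisper-Large-v3 baseline | 19.8% | |
| C1 | + Stage 0 normalisation | 17.9% | -1.9 |
| C2 | + Stage 1 split-merge alignment | 16.4% | -1.5 |
| C3 | + Stage 2 BK-tree (top-1) | 15.1% | -1.3 |
| C4 | + Stage 2 n-gram re-ranking | 14.2% | -0.9 |
| C5 | + Stage 3 conservative voting | 13.6% | -0.6 |
| C6 | + ensemble companions | 13.1% | -0.5 |
| C7 | + Stage 4 LLM refinement | 10.6% | -2.5 |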
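The per-step deltas can be checked against the WER series, and the end-to-end gain follows from the first and last rows. This is a small sketch using only the reported numbers; the helper names are hypothetical, and the overall relative figure it prints is derived here rather than quoted from the source.

```python
# WER (%) for configurations C0..C7, as reported in the ablation.
WER_BY_CONFIG = [19.8, 17.9, 16.4, 15.1, 14.2, 13.6, 13.1, 10.6]


def step_deltas(series: list) -> list:
    """Absolute WER points removed by each added component (C1..C7)."""
    return [round(prev - curr, 1) for prev, curr in zip(series, series[1:])]


def total_relative_reduction(series: list) -> float:
    """Relative reduction in percent from the first to the last configuration."""
    return round(100.0 * (series[0] - series[-1]) / series[0], 1)


if __name__ == "__main__":
    print("step deltas (points):", step_deltas(WER_BY_CONFIG))
    print("C0 -> C7 relative reduction (%):", total_relative_reduction(WER_BY_CONFIG))
```

The deltas sum to 9.2 WER points, which corresponds to roughly a 46.5% relative reduction from C0 to C7 on this 500-clip sample. The largest single step is Stage 4 LLM refinement at 2.5 points.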

— vs. concurrent work

Beats the closest system.

| System | WER | Note |
|---|---|---|
| ROVER (English-style) | n/a | fails on Urdu boundaries |
| Multi-ASR + SpeechLLM | 11.3% | heavy audio inference |
| CORAL | 10.6% | no SpeechLLM at runtime |

— Where errors come from now

Residual error breakdown

Manual annotation of the 10.6% remaining WER after Stage 4

| Error class | Share of residual errors |
|---|---|
| Proper nouns / named entities | 27.7% |
| Code-switching (Urdu ↔ English) | 21.5% |
| Dialectal / colloquial | 18.4% |
| Phoneme confusion | 12.7% |
| LLM over-correction | 7.2% |
| Other | 12.5% |

Top observation

Named entities and code-switching together account for 49.2% of remaining errors. Both are directly addressable by planned upgrades: the named-entity gazetteer (Future Work §3) and code-switching normalisation (Future Work §2).

Note

LLM over-correction (7.2%) is the only error class introduced by CORAL itself; everything else is inherited from the acoustic models. Stage 4 authority limits keep this number low.

— What's next

Future work · ten directions

Confidence-weighted voting

Replace equal voting weights with model-specific WER priors. Estimated additional 0.3 WER point reduction.
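One way to read this direction is shown below. It is a minimal sketch, not CORAL's Stage 3 logic: it assumes each model's vote is weighted by `1 - WER`, uses the read-speech baseline WERs reported above as illustrative priors, and uses romanised placeholder tokens. The function name `weighted_vote` is hypothetical.

```python
from collections import defaultdict

# Illustrative priors: read-speech baseline WERs from Result 1, as fractions.
WER_PRIOR = {
    "Seamless-Large": 0.1845,
    "Whisper-Large-v3": 0.2829,
    "Whisper-Medium": 0.4044,
    "Wav2Vec2-Urdu": 0.5352,
}


def weighted_vote(tokens_by_model: dict, wer_prior: dict = WER_PRIOR) -> str:
    """Pick the token at one aligned position, weighting each model by 1 - WER.

    The 1 - WER weighting is an assumption made for this sketch.
    """
    scores = defaultdict(float)
    for model, token in tokens_by_model.items():
        scores[token] += 1.0 - wer_prior[model]
    # Iterate candidates in sorted order so exact score ties resolve deterministically.
    return max(sorted(scores), key=scores.get)


if __name__ == "__main__":
    position = {
        "Seamless-Large": "kitab",
        "Whisper-Large-v3": "kitab",
        "Whisper-Medium": "kitaab",
        "Wav2Vec2-Urdu": "kitaab",
    }
    print(weighted_vote(position))
```

With equal weights this position is a 2-2 tie. With the priors above, the two stronger models carry about 1.53 against 1.06, so their token wins.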

Code-switching normalisation

Preserve English tokens through a bilingual alignment stage instead of stripping ASCII.
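The token-preservation half of this idea can be sketched as follows. The exact behaviour of the current Stage 0 is not specified here, so the "strip" mode below is an assumption, and the bilingual alignment stage itself is out of scope. The helpers `tag_tokens` and `normalise` are hypothetical names, and script detection is limited to the basic Arabic Unicode block.

```python
import re

# Any character in the basic Arabic block (U+0600 to U+06FF) marks an Urdu token.
ARABIC_SCRIPT = re.compile(r"[\u0600-\u06FF]")
# A simple Latin-script word: letters, then optional digits, apostrophes, hyphens.
LATIN_WORD = re.compile(r"[A-Za-z][A-Za-z0-9'-]*")


def tag_tokens(text: str) -> list:
    """Tag whitespace-separated tokens as 'ur', 'en', or 'other'."""
    tagged = []
    for tok in text.split():
        if ARABIC_SCRIPT.search(tok):
            tagged.append((tok, "ur"))
        elif LATIN_WORD.fullmatch(tok):
            tagged.append((tok, "en"))
        else:
            tagged.append((tok, "other"))
    return tagged


def normalise(text: str, keep_english: bool) -> list:
    """Keep Urdu tokens; keep lower-cased English tokens only if requested."""
    out = []
    for tok, lang in tag_tokens(text):
        if lang == "ur":
            out.append(tok)
        elif lang == "en" and keep_english:
            out.append(tok.lower())
    return out


if __name__ == "__main__":
    # Four tokens: three Urdu words with the English word "Email" in third position.
    text = "\u0645\u06CC\u06BA \u0646\u06D2 Email \u0628\u06BE\u06CC\u062C\u0627"
    print(normalise(text, keep_english=False))  # English token dropped
    print(normalise(text, keep_english=True))   # English token preserved
```

Tagging tokens by script before any stripping keeps the English words available to later stages, which is the prerequisite for aligning them across hypotheses.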

Named-entity gazetteer

A secondary BK-tree of toponyms and of person, organisation, and brand names. Targets the dominant residual error class (proper nouns / named entities, 27.7%).
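For readers unfamiliar with the data structure, here is a minimal BK-tree sketch over Levenshtein distance. It is a generic textbook construction, not CORAL's implementation; the gazetteer entries are romanised stand-ins chosen for illustration, and the class and method names are assumptions.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance with unit-cost insertions, deletions and substitutions."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # drop a character of a
                            curr[j - 1] + 1,            # drop a character of b
                            prev[j - 1] + (ca != cb)))  # substitute or match
        prev = curr
    return prev[-1]


class BKTree:
    """Each node is (word, children), with children keyed by distance to the node."""

    def __init__(self):
        self.root = None

    def add(self, word: str) -> None:
        if self.root is None:
            self.root = (word, {})
            return
        node = self.root
        while True:
            d = levenshtein(word, node[0])
            if d == 0:
                return  # already present
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, query: str, max_dist: int) -> list:
        """Return sorted (distance, word) pairs within max_dist of the query."""
        results = []
        stack = [self.root] if self.root else []
        while stack:
            word, children = stack.pop()
            d = levenshtein(query, word)
            if d <= max_dist:
                results.append((d, word))
            # Triangle inequality: only subtrees keyed within [d - k, d + k] can match.
            for key, child in children.items():
                if d - max_dist <= key <= d + max_dist:
                    stack.append(child)
        return sorted(results)


if __name__ == "__main__":
    gazetteer = BKTree()
    for name in ["Lahore", "Karachi", "Islamabad", "Peshawar", "Multan"]:
        gazetteer.add(name)
    print(gazetteer.search("Lahor", max_dist=1))
```

Keeping the gazetteer in its own tree lets a named-entity lookup use a different distance threshold from the general vocabulary, which is one plausible reason to make it a secondary structure.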

LLM-guided top-K OOV re-ranking

Pass top-3 BK-tree candidates to Stage 4 for context-aware selection — reduces over-correction.

Split/merge metadata in voting

Use SAME/SPLIT/MERGE classification in Stage 3 resolution logic, not just Stage 1 visualisation.

Cross-architecture confidence

Learned TruCLeS-style calibration over heterogeneous acoustic models.

Larger conversational evaluation

Scale beyond 500 clips to thousands of dialect-varied conversational samples.

Real-time streaming

Operate on partial hypotheses for bounded-latency captioning use cases.

Other low-resource languages

Stages 1-4 transfer directly. Validate on Pashto, Sindhi, and Punjabi with the language-specific corpora replaced.

Open release

Source code, evaluation TSV, BK-tree, and n-gram corpus under permissive licences.

Read it or run it.

Take the system through your own data, or explore the pipeline mechanics next.
