CORAL

— Research

Empirical evidence that the pipeline works.

Two evaluation suites. Eight ablation configurations. A breakdown of where the remaining errors come from — and ten concrete directions for the next iteration.

10.6%

Final WER · convo

2,995

Read-speech utterances

8

Ablation configs

Research Hypothesis

Combining Urdu-specific Unicode normalisation, weighted split-merge-aware alignment of multiple ASR hypotheses, hybrid OOV detection with BK-tree fuzzy lookup ranked by an n-gram language model, conservative position-wise voting, and targeted LLM refinement is sufficient to substantially reduce WER over the strongest single-model baseline — without fine-tuning any acoustic model.

— Answered affirmatively (§5)
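
The hypothesis names BK-tree fuzzy lookup as the candidate generator for out-of-vocabulary words. As a rough illustration of that data structure only (not CORAL's implementation), the sketch below builds a BK-tree over a tiny vocabulary using plain Levenshtein distance on code points. The class name `BKTree`, the helper `levenshtein`, the five-word vocabulary, and the distance threshold are all assumptions made for this example.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance over Unicode code points (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # insertion
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]


class BKTree:
    """Hypothetical minimal BK-tree; each node is (word, {distance: child})."""

    def __init__(self, words):
        it = iter(words)
        self.root = (next(it), {})
        for word in it:
            self.add(word)

    def add(self, word):
        node = self.root
        while True:
            d = levenshtein(word, node[0])
            if d == 0:
                return  # word already stored
            child = node[1].get(d)
            if child is None:
                node[1][d] = (word, {})
                return
            node = child

    def search(self, query, max_dist):
        """Return (distance, word) pairs within max_dist, closest first."""
        results, stack = [], [self.root]
        while stack:
            word, children = stack.pop()
            d = levenshtein(query, word)
            if d <= max_dist:
                results.append((d, word))
            # Triangle inequality: only these subtrees can contain matches.
            for k, child in children.items():
                if d - max_dist <= k <= d + max_dist:
                    stack.append(child)
        return sorted(results)


# Illustrative vocabulary (assumed, not CORAL's lexicon).
vocab = ["کتاب", "کتب", "کام", "نام", "شام"]
tree = BKTree(vocab)
candidates = tree.search("کتاپ", max_dist=1)  # misspelling of the first word
```

In the pipeline described above, candidates like these would then be ranked by the n-gram language model; that ranking step is not shown here.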

— Result 1

Read-speech · Common Voice Urdu

2,995 utterances · robust config · CORAL as post-processor

Source model        Baseline WER   CORAL WER   Relative WER reduction (Δ rel)
Seamless-Large      18.45%         14.34%      22.3%
Whisper-Large-v3    28.29%         19.97%      29.4%
Whisper-Medium      40.44%         30.64%      24.2%
Wav2Vec2-Urdu       53.52%         39.67%      25.9%
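
The Δ rel figures follow from the baseline and CORAL columns as 100 × (baseline − CORAL) / baseline. A minimal sketch that recomputes them from the read-speech numbers above; the function name `relative_wer_reduction` is introduced here for illustration only.

```python
def relative_wer_reduction(baseline: float, corrected: float) -> float:
    """Relative WER reduction in percent: 100 * (baseline - corrected) / baseline."""
    if baseline <= 0:
        raise ValueError("baseline WER must be positive")
    return 100.0 * (baseline - corrected) / baseline


# (baseline WER %, CORAL WER %) copied from the read-speech results.
results = {
    "Seamless-Large": (18.45, 14.34),
    "Whisper-Large-v3": (28.29, 19.97),
    "Whisper-Medium": (40.44, 30.64),
    "Wav2Vec2-Urdu": (53.52, 39.67),
}

deltas = {
    model: round(relative_wer_reduction(base, coral), 1)
    for model, (base, coral) in results.items()
}

for model, delta in deltas.items():
    print(f"{model:<18} {delta:.1f}%")
```

The recomputed values match the published column, which confirms that Δ rel is measured relative to each source model's own baseline rather than in absolute WER points.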

— Result 2

Ablation · C0 → C7

Whisper-Large-v3 source · 500-clip conversational sample · each row adds one component

Config   Component added                     WER      Δ (WER points)
C0       Whisper-Large-v3 baseline           19.8%
C1       + Stage 0 normalisation             17.9%    -1.9
C2       + Stage 1 split-merge alignment     16.4%    -1.5
C3       + Stage 2 BK-tree (top-1)           15.1%    -1.3
C4       + Stage 2 n-gram re-ranking         14.2%    -0.9
C5       + Stage 3 conservative voting       13.6%    -0.6
C6       + ensemble companions               13.1%    -0.5
C7       + Stage 4 LLM refinement            10.6%    -2.5
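
To see how the total gain is distributed across components, the ablation rows can be turned into per-stage shares. This is a small sketch over the numbers listed above; the variable names are illustrative, and attributing gain to a single row assumes the components were added in exactly this order.

```python
# (config, component, WER %) copied from the ablation rows C0 to C7.
ablation = [
    ("C0", "Whisper-Large-v3 baseline", 19.8),
    ("C1", "Stage 0 normalisation", 17.9),
    ("C2", "Stage 1 split-merge alignment", 16.4),
    ("C3", "Stage 2 BK-tree (top-1)", 15.1),
    ("C4", "Stage 2 n-gram re-ranking", 14.2),
    ("C5", "Stage 3 conservative voting", 13.6),
    ("C6", "ensemble companions", 13.1),
    ("C7", "Stage 4 LLM refinement", 10.6),
]

baseline_wer = ablation[0][2]
final_wer = ablation[-1][2]
total_gain = baseline_wer - final_wer                 # absolute WER points
total_relative = 100.0 * total_gain / baseline_wer    # relative reduction, %

# Share of the total gain contributed by each added component.
shares = {}
for (_, _, prev_wer), (config, _, wer) in zip(ablation, ablation[1:]):
    shares[config] = round(100.0 * (prev_wer - wer) / total_gain, 1)

for config, share in shares.items():
    print(f"{config}: {share:.1f}% of total gain")
print(f"C0 -> C7: {total_gain:.1f} points, {total_relative:.1f}% relative")
```

On these figures, Stage 4 LLM refinement contributes roughly 27% of the 9.2-point total, and the C0 to C7 relative reduction is about 46.5%.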

— vs. concurrent work

Beats the closest system: 10.6% WER vs. 11.3%.

ROVER (English-style)

fails on Urdu boundaries

Multi-ASR + SpeechLLM

11.3%

heavy audio inference

CORAL

10.6%

no SpeechLLM at runtime

— Where errors come from now

Residual error breakdown

Manual annotation of the 10.6% remaining WER after Stage 4

Error class                          Share of residual errors
Proper nouns / named entities        27.7%
Code-switching (Urdu ↔ English)      21.5%
Dialectal / colloquial               18.4%
Phoneme confusion                    12.7%
LLM over-correction                  7.2%
Other                                12.5%

Top observation

Named entities and code-switching together account for 49.2% of remaining errors. Both are directly addressable by Future Work directions 03 (named-entity gazetteer) and 02 (code-switching normalisation).

Note

LLM over-correction (7.2%) is the only error class introduced by CORAL itself; everything else is inherited from the acoustic models. Stage 4 authority limits keep this number low.
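
The shares above can be converted into approximate absolute WER points by scaling them by the 10.6% final WER. The sketch below does that arithmetic; it assumes each annotated error class contributes word errors in proportion to its share, which the source does not state explicitly, and the variable names are illustrative.

```python
FINAL_WER = 10.6  # % WER after Stage 4, from the ablation result

# Share of residual errors per class (%), from the manual annotation.
residual_shares = {
    "Proper nouns / named entities": 27.7,
    "Code-switching (Urdu <-> English)": 21.5,
    "Dialectal / colloquial": 18.4,
    "Phoneme confusion": 12.7,
    "LLM over-correction": 7.2,
    "Other": 12.5,
}

# Sanity check: the annotated classes should cover all residual errors.
assert abs(sum(residual_shares.values()) - 100.0) < 1e-6

# Approximate absolute WER points attributable to each class.
wer_points = {
    label: FINAL_WER * share / 100.0
    for label, share in residual_shares.items()
}

for label, points in sorted(wer_points.items(), key=lambda kv: -kv[1]):
    print(f"{label:<36} ~{points:.2f} WER points")
```

Under that proportionality assumption, named entities correspond to roughly 2.9 WER points and LLM over-correction to under 0.8.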

— What's next

Future work · ten directions

01

Confidence-weighted voting

Replace equal voting weights with weights derived from model-specific WER priors. Estimated gain: a further 0.3 WER points.

02

Code-switching normalisation

Preserve English tokens through a bilingual alignment stage instead of stripping ASCII.

03

Named-entity gazetteer

Secondary BK-tree of toponyms and person, organisation, and brand names. Targets the dominant residual error class directly.

04

LLM-guided top-K OOV re-ranking

Pass the top-3 BK-tree candidates to Stage 4 for context-aware selection, which is expected to reduce over-correction.

05

Split/merge metadata in voting

Use SAME/SPLIT/MERGE classification in Stage 3 resolution logic, not just Stage 1 visualisation.

06

Cross-architecture confidence

Learned TruCLeS-style calibration over heterogeneous acoustic models.

07

Larger conversational evaluation

Scale beyond 500 clips to thousands of dialect-varied conversational samples.

08

Real-time streaming

Operate on partial hypotheses for bounded-latency captioning use cases.

09

Other low-resource languages

Stages 1-4 are expected to transfer directly once the Urdu corpora are replaced. Validate on Pashto, Sindhi, and Punjabi.

10

Open release

Source code, evaluation TSV, BK-tree, and n-gram corpus under permissive licences.
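
As an illustration of direction 01 (confidence-weighted voting), the sketch below replaces equal votes with weights derived from each model's WER. It is not CORAL's Stage 3 logic: the `1 - WER` weighting, the function names, the placeholder tokens, and the use of the four read-speech baselines as priors are all assumptions made for this example, and the hypotheses are assumed to be already aligned to equal length.

```python
from collections import defaultdict


def wer_prior_weights(wer_by_model):
    """Map each model's WER (as a fraction) to a normalised voting weight.

    Assumed scheme: weight is proportional to 1 - WER, so lower-WER models count more.
    """
    raw = {model: 1.0 - wer for model, wer in wer_by_model.items()}
    total = sum(raw.values())
    return {model: value / total for model, value in raw.items()}


def weighted_vote(aligned, weights):
    """Position-wise weighted vote over pre-aligned, equal-length hypotheses."""
    lengths = {len(tokens) for tokens in aligned.values()}
    if len(lengths) != 1:
        raise ValueError("hypotheses must be aligned to the same length")
    output = []
    for i in range(lengths.pop()):
        scores = defaultdict(float)
        for model, tokens in aligned.items():
            scores[tokens[i]] += weights[model]
        output.append(max(scores, key=scores.get))
    return output


# Priors taken from the read-speech baseline WERs (illustrative choice).
wer_priors = {
    "Seamless-Large": 0.1845,
    "Whisper-Large-v3": 0.2829,
    "Whisper-Medium": 0.4044,
    "Wav2Vec2-Urdu": 0.5352,
}
weights = wer_prior_weights(wer_priors)

# Placeholder tokens; position 1 is a 2-vs-2 split between the models.
aligned = {
    "Seamless-Large":   ["tok_1", "tok_a", "tok_3"],
    "Whisper-Large-v3": ["tok_1", "tok_a", "tok_3"],
    "Whisper-Medium":   ["tok_1", "tok_b", "tok_3"],
    "Wav2Vec2-Urdu":    ["tok_1", "tok_b", "tok_3"],
}
voted = weighted_vote(aligned, weights)
```

With equal weights the middle position would be a tie; with WER-derived weights the two lower-WER models outvote the other two. Whether this yields the estimated 0.3-point gain would need to be measured on the evaluation sets.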
