— Research
Empirical evidence the pipeline works.
Two evaluation suites. Eight ablation configurations. A breakdown of where the remaining errors come from — and ten concrete directions for the next iteration.
0.0%
Best relative WER drop
10.6%
Final WER · convo
2,995
Read-speech utterances
8
Ablation configs
“Combining Urdu-specific Unicode normalisation, weighted split-merge-aware alignment of multiple ASR hypotheses, hybrid OOV detection with BK-tree fuzzy lookup ranked by an n-gram language model, conservative position-wise voting, and targeted LLM refinement is sufficient to substantially reduce WER over the strongest single-model baseline — without fine-tuning any acoustic model.”
— Answered affirmatively (§5)
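To make the "BK-tree fuzzy lookup" step in the statement above concrete, here is a minimal sketch of a BK-tree over character-level Levenshtein distance. It is an illustration under assumptions, not CORAL's implementation: the class and function names are hypothetical, the romanised sample words stand in for a real Urdu lexicon, and the n-gram language-model ranking mentioned above is not shown (candidates are simply ordered by edit distance).

```python
from typing import Dict, List, Optional, Tuple


def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance; insert, delete, substitute each cost 1."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            curr.append(min(prev[j] + 1,                # delete ca
                            curr[j - 1] + 1,            # insert cb
                            prev[j - 1] + (ca != cb)))  # substitute
        prev = curr
    return prev[-1]


class BKTree:
    """Hypothetical minimal BK-tree: each node is [word, {distance: child}]."""

    def __init__(self) -> None:
        self.root: Optional[list] = None

    def add(self, word: str) -> None:
        if self.root is None:
            self.root = [word, {}]
            return
        node = self.root
        while True:
            d = levenshtein(word, node[0])
            if d == 0:          # word already stored
                return
            children: Dict[int, list] = node[1]
            if d not in children:
                children[d] = [word, {}]
                return
            node = children[d]

    def search(self, query: str, max_dist: int) -> List[Tuple[int, str]]:
        """Return (distance, word) pairs within max_dist, closest first."""
        results: List[Tuple[int, str]] = []
        stack = [self.root] if self.root is not None else []
        while stack:
            word, children = stack.pop()
            d = levenshtein(query, word)
            if d <= max_dist:
                results.append((d, word))
            # Triangle inequality: only subtrees whose edge label lies in
            # [d - max_dist, d + max_dist] can contain a match.
            for edge, child in children.items():
                if d - max_dist <= edge <= d + max_dist:
                    stack.append(child)
        return sorted(results)


# Romanised placeholder vocabulary (an assumption for readability).
tree = BKTree()
for w in ["kitab", "kitaab", "kaam", "qalam"]:
    tree.add(w)

print(tree.search("kitap", 2))  # [(1, 'kitab'), (2, 'kitaab')]
```

In a pipeline like the one described, the candidates returned by `search` would then be re-scored by the language model rather than taken in edit-distance order.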
— Result 1
Read-speech · Common Voice Urdu
2,995 utterances · robust config · CORAL as post-processor
— Result 2
Ablation · C0 → C7
Whisper-Large-v3 source · 500-clip conversational sample · each row adds one component
— vs. concurrent work
Beats the closest system by 0.7 WER points (10.6% vs. 11.3%).
System · WER · Note
ROVER (English-style) · n/a · fails on Urdu boundaries
Multi-ASR + SpeechLLM · 11.3% · heavy audio inference
CORAL · 10.6% · no SpeechLLM at runtime
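For readers who want to check the arithmetic behind figures like these, the sketch below computes word error rate from a reference and a hypothesis, plus a relative reduction between two WER values. It assumes plain whitespace tokenisation and no text normalisation, which is simpler than the Urdu-specific normalisation the pipeline applies; the function names are hypothetical. The 6.2% it prints is the relative gap between the two WERs quoted above, not the page's headline "best relative WER drop", which is measured against a single-model baseline.

```python
def wer(reference: str, hypothesis: str) -> float:
    """Word error rate: word-level edit distance divided by reference length."""
    ref, hyp = reference.split(), hypothesis.split()
    if not ref:
        raise ValueError("reference must contain at least one word")
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            curr.append(min(prev[j] + 1,              # deletion
                            curr[j - 1] + 1,          # insertion
                            prev[j - 1] + (r != h)))  # substitution
        prev = curr
    return prev[-1] / len(ref)


def relative_wer_drop(baseline: float, system: float) -> float:
    """Fraction of the baseline WER removed by the system."""
    return (baseline - system) / baseline


# Toy strings: one substitution and one deletion over four reference words.
print(wer("a b c d", "a x c"))                       # 0.5

# Relative gap between the two WERs in the comparison above.
print(round(relative_wer_drop(11.3, 10.6) * 100, 1))  # 6.2
```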
— Where errors come from now
Residual error breakdown
Manual annotation of the 10.6% remaining WER after Stage 4
Top observation
Named entities and code-switching together account for 49.2% of remaining errors. Both are directly addressable by upgrades listed under Future Work: the named-entity gazetteer (§3) and code-switching normalisation (§2).
Note
LLM over-correction (7.2%) is the only error class introduced by CORAL itself; everything else is inherited from the acoustic models. Stage 4 authority limits keep this number low.
— What's next
Future work · ten directions
Confidence-weighted voting
Replace equal voting weights with model-specific WER priors. Estimated gain: a further 0.3-point absolute WER reduction.
Code-switching normalisation
Preserve English tokens through a bilingual alignment stage instead of stripping ASCII.
Named-entity gazetteer
A secondary BK-tree of toponyms and of person, organisation, and brand names. Targets the dominant residual error class directly.
LLM-guided top-K OOV re-ranking
Pass the top-3 BK-tree candidates to Stage 4 for context-aware selection, which should reduce over-correction.
Split/merge metadata in voting
Use SAME/SPLIT/MERGE classification in Stage 3 resolution logic, not just Stage 1 visualisation.
Cross-architecture confidence
Learned TruCLeS-style calibration over heterogeneous acoustic models.
Larger conversational evaluation
Scale beyond 500 clips to thousands of dialect-varied conversational samples.
Real-time streaming
Operate on partial hypotheses for bounded-latency captioning use cases.
Other low-resource languages
Stages 1-4 transfer directly. Validate on Pashto, Sindhi, and Punjabi with language-specific corpora swapped in.
Open release
Source code, evaluation TSV, BK-tree, and n-gram corpus under permissive licences.
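As an illustration of the first direction in the list above (confidence-weighted voting), the sketch below replaces equal votes with weights derived from per-model WER priors. Everything here is an assumption made for the example: the model names, the prior values, the `<eps>` gap symbol, and the choice of `1 - WER` as the weight. It also assumes the hypotheses have already been aligned to equal length, which is the job of the alignment stage.

```python
from collections import defaultdict
from typing import Dict, List

GAP = "<eps>"  # hypothetical placeholder for "no token at this position"


def weights_from_wer(wer_priors: Dict[str, float]) -> Dict[str, float]:
    """Turn per-model WER priors into normalised voting weights (1 - WER)."""
    raw = {model: 1.0 - w for model, w in wer_priors.items()}
    total = sum(raw.values())
    return {model: v / total for model, v in raw.items()}


def weighted_vote(aligned: Dict[str, List[str]],
                  weights: Dict[str, float]) -> List[str]:
    """Pick, at each aligned position, the token with the largest total weight."""
    lengths = {len(tokens) for tokens in aligned.values()}
    if len(lengths) != 1:
        raise ValueError("hypotheses must be aligned to the same length")
    output: List[str] = []
    for pos in range(lengths.pop()):
        scores: Dict[str, float] = defaultdict(float)
        for model, tokens in aligned.items():
            scores[tokens[pos]] += weights[model]
        best = max(scores, key=scores.get)
        if best != GAP:  # a winning gap means "emit nothing here"
            output.append(best)
    return output


# Hypothetical priors and romanised, pre-aligned hypotheses.
priors = {"model_a": 0.12, "model_b": 0.20, "model_c": 0.30}
aligned = {
    "model_a": ["main", "ghar", "ja", "raha", "hoon"],
    "model_b": ["main", "gar", "ja", "raha", "hoon"],
    "model_c": ["main", "ghr", GAP, "raha", "hoon"],
}

print(weighted_vote(aligned, weights_from_wer(priors)))
# ['main', 'ghar', 'ja', 'raha', 'hoon']
```

At the second position all three models disagree, so equal weights would produce a three-way tie; the WER priors break it in favour of the model with the lowest prior error.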
Read it or run it.
Run the system on your own data, or explore the pipeline mechanics next.
