AU-Harness: The Open-Source Toolbox That Makes Evaluating Audio-Language Models as Easy as Running a Single Bash Command
If you only remember one sentence:
AU-Harness is a free Python toolkit that can benchmark any speech-enabled large language model on 380+ audio tasks, finish the job twice as fast as existing tools, and give you fully reproducible reports—all after editing one YAML file and typing bash evaluate.sh.
1. Why Do We Need Yet Another Audio Benchmark?
Voice AI is booming, but the ruler we use to measure it is still wooden.
Existing evaluation pipelines share three pain points:
| Pain Point | What It Looks Like in Daily Work | What AU-Harness Brings |
|---|---|---|
| Speed | One GPU idles while samples are processed one by one. | Parallel vLLM inference plus dataset sharding keeps every core busy. |
| Fairness | Model A gets a polite prompt, Model B gets a terse cue—scores aren’t comparable. | One single prompt template is locked for every model on the same task. |
| Coverage | Lots of speech-to-text, little “who spoke when” or “write SQL after listening”. | Adds LLM-Adaptive Diarization and Spoken Language Reasoning to the same script. |
If you have ever waited two days for a 50-hour benchmark to finish, or scratched your head because last week’s “best model” suddenly dropped five points after changing a prompt, you already understand why the team built AU-Harness.
2. What Exactly Can It Evaluate?
Think of an audio-to-text ability you care about—AU-Harness probably has a task for it:
- Speech Recognition
  - Normal ASR, long-form ASR, code-switching ASR (mixed-language).
- Paralinguistics
  - Emotion, accent, gender, speaker ID, speaker diarization.
- Audio Understanding
  - Scene classification, music tagging.
- Spoken Language Understanding
  - Intent classification, spoken QA, dialogue summarisation, speech translation.
- Spoken Language Reasoning (new)
  - Speech Function Calling – turn a voice command into an API call.
  - Speech-to-SQL – listen to a question, output SQL.
  - Multi-turn instruction following – obey step-by-step audio instructions.
- Safety & Security
  - Adversarial robustness, spoof detection.
All tasks are already wired to 50+ public datasets (LibriSpeech, MELD, SLURP, CoVoST2, CallHome, Spider …) and nine metrics ranging from Word Error Rate to LLM-as-Judge.
3. How Does It Run So Fast?
The paper reports +127 % throughput and −59 % real-time factor (RTF) versus the next-best toolkit. Three engineering choices matter:
| Choice | Plain-English Meaning |
|---|---|
| vLLM back-end | A high-speed inference library that batches prompts dynamically. |
| Token-based request controller | A single “token bucket” decides which GPU sends the next request—no model waits for a slow neighbour. |
| Dataset sharding proportional to GPU power | If GPU-2 can handle twice the concurrent requests of GPU-1, it gets twice as many audio shards automatically. |
Quick analogy
Old style = one cashier, one item at a time.
AU-Harness = multiple cashiers, multiple baskets, and a traffic-light system that always fills the fastest lane.
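To make the sharding idea concrete, here is a minimal Python sketch of proportional splitting. It is purely illustrative (not AU-Harness internals); the endpoint names and capacities are made up.

```python
# Illustrative sketch of proportional dataset sharding (not AU-Harness internals).
# Each endpoint receives a slice of the dataset proportional to the number of
# concurrent requests it can sustain.

def shard_dataset(samples, capacities):
    """Split `samples` across endpoints, weighted by request capacity."""
    total = sum(capacities.values())
    shards, start = {}, 0
    for name, capacity in capacities.items():
        size = round(len(samples) * capacity / total)
        shards[name] = samples[start:start + size]
        start += size
    # Any rounding remainder goes to the last endpoint.
    if start < len(samples):
        shards[name].extend(samples[start:])
    return shards

# Example: GPU-2 sustains twice the concurrency of GPU-1, so it gets ~2x the shards.
shards = shard_dataset(list(range(900)), {"gpu-1": 64, "gpu-2": 128})
print({k: len(v) for k, v in shards.items()})  # {'gpu-1': 300, 'gpu-2': 600}
```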
4. Installation: Five Copy-Paste Steps
Prerequisites
- Ubuntu 18+ / CentOS 8+ / macOS (Intel or Apple Silicon)
- Python 3.9 – 3.11
- NVIDIA driver ≥ 525
- CUDA 11.8 or 12.1 already installed
Step 0 (optional but recommended) – Install vLLM
pip install vllm==0.5.0 # one command, ~3 min
Step 1 – Clone the repo
git clone https://github.com/ServiceNow/AU-Harness.git
cd AU-Harness
Step 2 – Create virtual environment
python -m venv harness-env
source harness-env/bin/activate # Windows: harness-env\Scripts\activate
Step 3 – Install dependencies
pip install -r requirements.txt
Step 4 – Copy the sample configuration
cp sample_config.yaml config.yaml
Step 5 – (Mandatory) Insert your model endpoint
Open config.yaml, scroll to the models: section, and replace the placeholder lines:
- name: "my_model"
  inference_type: "vllm"          # or "openai" / "transcription"
  url: "http://YOUR_IP:8000/v1"
  auth_token: ${YOUR_TOKEN}
  batch_size: 128                 # start low, raise until GPU RAM full
  chunk_size: 30                  # seconds
Save, then run:
bash evaluate.sh
That is literally it. Logs print to screen; final CSV and JSON appear inside run_logs/{timestamp}/.
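Before kicking off a long run, it can be worth confirming that the endpoint you wrote into config.yaml actually answers. A minimal sketch, assuming an OpenAI-compatible server such as vLLM; the URL and token are the same placeholders as in Step 5.

```python
# Optional sanity check before a long run: confirm the endpoint from config.yaml
# responds. Assumes an OpenAI-compatible server (e.g. vLLM); URL and token are
# the placeholders from Step 5.
import requests

URL = "http://YOUR_IP:8000/v1/models"   # same base URL as in config.yaml, plus /models
TOKEN = "YOUR_TOKEN"

resp = requests.get(URL, headers={"Authorization": f"Bearer {TOKEN}"}, timeout=10)
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])  # should list the model you plan to evaluate
```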
5. Understanding the Configuration File (No Mysteries)
YAML can look scary; here is a field-by-field decoder.
| Section | Why You Care | Example Value |
|---|---|---|
| dataset_metric | Tells the tool “what to run and how to score” | [librispeech_test_clean, word_error_rate] |
| filter | Shortens long datasets for quick sanity checks | num_samples: 300 or length_filter: [1.0, 30.0] |
| models | One block per endpoint; unique name + batch_size | see Step 5 above |
| generation_params_override | Lowers temperature for ASR, raises it for creative QA | temperature: 0.0 / max_gen_tokens: 64 |
| prompt_overrides | Lets you A/B-test system or user prompts | system_prompt: "You are a helpful assistant." |
| judge_settings | Needed for LLM-as-Judge metrics | judge_model: gpt-4o-mini, judge_concurrency: 300 |
Tip: Start with the defaults; touch one knob at a time. The code validates the YAML and emits readable errors if you mis-indent.
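If you want to catch an indentation slip before launching anything, you can also pre-parse the file yourself with PyYAML (pip install pyyaml if it is not already present). A minimal sketch; the two keys it checks are simply the ones this guide relies on.

```python
# Pre-flight check for config.yaml syntax and the sections this guide relies on.
import yaml  # PyYAML

with open("config.yaml") as f:
    try:
        config = yaml.safe_load(f)
    except yaml.YAMLError as err:
        raise SystemExit(f"config.yaml is not valid YAML: {err}")

for key in ("dataset_metric", "models"):
    if key not in config:
        raise SystemExit(f"config.yaml is missing the top-level section: {key}")
print("config.yaml parsed OK; top-level sections:", list(config))
```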
6. Can I Use My Own Audio Files?
Yes. AU-Harness treats every custom corpus as a new “task.”
- Create a folder tasks/mycompany_sentiment/.
- Write mytask.yaml inside:
task_name: mycompany_sentiment
dataset_path: /data/csv/my_audio_manifest.json # local path
subset: default
split: test
language: english
preprocessor: GeneralPreprocessor
postprocessor: GeneralPostprocessor
audio_column: audio_path
target_column: label
long_audio_processing_logic: truncate
metrics:
- metric: llm_judge_binary
- Reference it in the main config:
dataset_metric:
- ["mycompany_sentiment", "llm_judge_binary"]
- Run bash evaluate.sh – the toolkit will automatically convert your JSON/CSV into the same format it uses for LibriSpeech (a sketch for building such a manifest follows below).
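If your labels currently live in a spreadsheet, a few lines of Python can produce the manifest that dataset_path points to. A minimal sketch; the input CSV columns (filename, sentiment) are assumptions, and only audio_path and label have to match the task YAML above.

```python
# Build the JSON manifest referenced by `dataset_path` from a simple CSV.
# The CSV columns (filename, sentiment) are assumed; only the output keys
# need to match audio_column / target_column in mytask.yaml.
import csv
import json

records = []
with open("my_labels.csv", newline="") as f:
    for row in csv.DictReader(f):
        records.append({
            "audio_path": row["filename"],   # matches audio_column
            "label": row["sentiment"],       # matches target_column
        })

with open("/data/csv/my_audio_manifest.json", "w") as f:  # path from mytask.yaml
    json.dump(records, f, indent=2)
print(f"wrote {len(records)} records")
```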
7. Reading the Results Without a PhD
After the run you will find three handy artifacts:
- Per-record CSV
  Path: run_logs/{t}/{task}/{task}_{metric}_{model}.csv
  Columns: audio_id, reference, prediction, score
  Open in Excel → instant sort/filter.
- Final aggregated JSON
  Path: run_logs/{t}/final_scores.json
  Example snippet: { "emotion_recognition": { "llm_judge_binary": 72.4 }, "speaker_diarization": { "wder": 8.2 } }
  Copy the numbers to your internal report slide.
- Full log
  Path: {timestamp}_default.log
  Grep ERROR to quickly spot network hiccups or CUDA OOM.
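For anything beyond an Excel sort, the same artifacts are easy to post-process in Python. A minimal sketch using pandas; the timestamp, task, metric, and model names are placeholders that follow the path patterns above.

```python
# Post-process AU-Harness outputs. The run folder, task, metric, and model
# names are placeholders that follow the path patterns above.
import json
import pandas as pd

run = "run_logs/2025-01-01_12-00-00"

# Per-record scores: surface the worst utterances for error analysis
# (for WER, a higher score means a worse transcription).
df = pd.read_csv(f"{run}/librispeech_test_clean/"
                 "librispeech_test_clean_word_error_rate_my_model.csv")
worst = df.sort_values("score", ascending=False).head(10)
print(worst[["audio_id", "reference", "prediction", "score"]])

# Aggregated scores: one number per (task, metric) pair.
with open(f"{run}/final_scores.json") as f:
    print(json.dumps(json.load(f), indent=2))
```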
8. Real-World Walk-Through: Benchmarking Three Models in One Night
Imagine your team needs to pick a model for a customer-service voice bot. Business KPIs:
- Word Error Rate ≤ 10 % on phone calls
- Speaker diarization error ≤ 15 % (agent vs customer)
- Intent classification F1 ≥ 80 %
Monday, 6 pm – you open your laptop:
- Add three endpoints to config.yaml: gpt-4o-audio, qwen2.5-omni-7b, voxtral-mini-3b.
- Limit samples to 1,000 per task for a quick overnight job.
- Run bash evaluate.sh – detach tmux, go home.
Tuesday 9 am – Open final_scores.json:
| Model | WDER (%) ↓ | Intent F1 (%) ↑ | RTF ↓ |
|---|---|---|---|
| gpt-4o-audio | 7.8 | 83 | 3.1 |
| qwen2.5-omni-7b | 8.4 | 81 | 2.9 |
| voxtral-mini-3b | 11.2 | 79 | 1.8 |
Decision: qwen2.5-omni-7b meets every KPI at the best speed/cost ratio. You commit the YAML to Git – next quarter’s re-test is one command away.
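If you would rather not eyeball final_scores.json every quarter, the KPI gate itself can be scripted. A minimal sketch; the task and metric keys are illustrative and should be taken from your own run.

```python
# Turn the business KPIs into an automated gate over final_scores.json.
# Task and metric keys are illustrative; use the names from your own run.
import json

THRESHOLDS = {
    ("speaker_diarization", "wder"): ("max", 15.0),   # diarization error <= 15 %
    ("intent_classification", "f1"): ("min", 80.0),   # intent F1 >= 80
}

with open("run_logs/2025-01-01_12-00-00/final_scores.json") as f:
    scores = json.load(f)

failures = []
for (task, metric), (kind, limit) in THRESHOLDS.items():
    value = scores[task][metric]
    ok = value <= limit if kind == "max" else value >= limit
    if not ok:
        failures.append(f"{task}/{metric} = {value} (limit {limit})")

if failures:
    raise SystemExit("KPI gate failed:\n" + "\n".join(failures))
print("All KPIs met.")
```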
9. Frequently Asked Questions (FAQ)
Q1. My GPU has only 24 GB RAM. What batch_size should I use?
Start at 32 for 7B-class models, then raise it until nvidia-smi shows ~90 % memory usage.
Q2. Can I run the evaluation on CPU?
Technically yes: set inference_type: openai and point to an HTTP endpoint that runs on CPU. Expect a roughly 100× higher RTF.
Q3. Do I need internet access?
Only if your model endpoint is cloud-based. Datasets are downloaded once and cached offline.
Q4. How do I cite AU-Harness in a paper?
BibTeX is provided in the repo README. Key info:
arXiv:2509.08031, authors: Surapaneni et al., 2025.
Q5. Is commercial use allowed?
Yes. Apache 2.0 licence. Keep the copyright notice and you’re good.
10. Known Limitations (No Sugar-Coating)
- Backend maturity – models without vLLM support fall back to slower sequential calls.
- Timestamp precision – diarization scores degrade when speakers overlap or switch rapidly.
- Language skew – ~70 % of bundled datasets are English; low-resource languages need community contributions.
The maintainers welcome pull requests for new tasks, metrics, and multilingual data.
11. Take-Away Checklist
- [ ] Install vLLM → clone the repo → pip install -r requirements.txt
- [ ] Fill in your model URL and auth token in config.yaml
- [ ] Pick datasets & metrics, set batch_size for your GPU
- [ ] bash evaluate.sh → grab coffee → read final_scores.json
- [ ] Lock the YAML in version control for fully reproducible papers or product reports
Benchmarking audio-language models is no longer a week-long slog.
With AU-Harness, it’s a single shell command—fast, fair, and transparent.

