AU-Harness: The Open-Source Toolbox That Makes Evaluating Audio-Language Models as Easy as Running a Single Bash Command
If you only remember one sentence:
AU-Harness is a free Python toolkit that can benchmark any speech-enabled large language model on 380+ audio tasks, finish the job roughly twice as fast as existing tools, and give you fully reproducible reports, all after editing one YAML file and typing `bash evaluate.sh`.
1. Why Do We Need Yet Another Audio Benchmark?
Voice AI is booming, but the ruler we use to measure it is still wooden.
Existing evaluation pipelines share three pain points:
| Pain Point | What It Looks Like in Daily Work | What AU-Harness Brings |
|---|---|---|
| Speed | One GPU idles while samples are processed one by one. | Parallel vLLM inference plus dataset sharding keeps every core busy. |
| Fairness | Model A gets a polite prompt, Model B gets a terse cue; scores aren’t comparable. | A single prompt template is locked in for every model on the same task. |
| Coverage | Lots of speech-to-text, little “who spoke when” or “write SQL after listening”. | Adds LLM-Adaptive Diarization and Spoken Language Reasoning to the same script. |
If you have ever waited two days for a 50-hour benchmark to finish, or scratched your head because last week’s “best model” suddenly dropped five points after a prompt change, you already understand why the team built AU-Harness.
2. What Exactly Can It Evaluate?
Think of an audio-to-text ability you care about—AU-Harness probably has a task for it:
- Speech Recognition
  - Normal ASR, long-form ASR, code-switching ASR (mixed-language).
- Paralinguistics
  - Emotion, accent, gender, speaker ID, speaker diarization.
- Audio Understanding
  - Scene classification, music tagging.
- Spoken Language Understanding
  - Intent classification, spoken QA, dialogue summarisation, speech translation.
- Spoken Language Reasoning (new)
  - Speech Function Calling – turn a voice command into an API call.
  - Speech-to-SQL – listen to a question, output SQL.
  - Multi-turn instruction following – obey step-by-step audio instructions.
- Safety & Security
  - Adversarial robustness, spoof detection.
All tasks are already wired to 50+ public datasets (LibriSpeech, MELD, SLURP, CoVoST2, CallHome, Spider …) and nine metrics ranging from Word Error Rate to LLM-as-Judge.
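To make the task-plus-metric pairing concrete, here is a tiny `dataset_metric` fragment in the same format the configuration file uses (see Section 5). Apart from `librispeech_test_clean` and `word_error_rate`, the identifiers below are illustrative guesses; check the repo's task list for the real names.

```yaml
# Illustrative pairings; only librispeech_test_clean / word_error_rate are
# confirmed identifiers -- verify the rest against the repo's task list.
dataset_metric:
  - [librispeech_test_clean, word_error_rate]   # speech recognition
  - [meld_emotion, llm_judge_binary]            # paralinguistics (hypothetical id)
  - [covost2_en_de, bleu]                       # speech translation (hypothetical ids)
```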
3. How Does It Run So Fast?
The paper quotes +127 % throughput and −59 % real-time factor (RTF) versus the next-best kit. Three engineering choices matter:
| Choice | Plain-English Meaning |
|---|---|
| vLLM back-end | A high-speed inference library that batches prompts dynamically. |
| Token-based request controller | A single “token bucket” decides which GPU sends the next request; no model waits for a slow neighbour. |
| Dataset sharding proportional to GPU power | If GPU-2 can handle twice the concurrent requests of GPU-1, it automatically gets twice as many audio shards. |
Quick analogy
Old style = one cashier, one item at a time.
AU-Harness = multiple cashiers, multiple baskets, and a traffic-light system that always keeps the fastest lane full.
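To make the third choice concrete, here is a minimal Python sketch (not AU-Harness's actual scheduler) of splitting a dataset across GPUs in proportion to each GPU's concurrency budget; the function and worker names are purely illustrative.

```python
# Illustrative only: assign sample indices to workers in proportion to
# each worker's concurrency budget (e.g. max in-flight requests per GPU).
def shard_dataset(num_samples: int, concurrency: dict[str, int]) -> dict[str, list[int]]:
    total = sum(concurrency.values())
    shards, start = {}, 0
    for i, (worker, budget) in enumerate(concurrency.items()):
        # The last worker takes the remainder so every sample is assigned exactly once.
        size = num_samples - start if i == len(concurrency) - 1 else round(num_samples * budget / total)
        shards[worker] = list(range(start, start + size))
        start += size
    return shards

# GPU-2 can handle twice the concurrent requests of GPU-1, so it gets ~2x the shards.
print(shard_dataset(9, {"gpu-1": 64, "gpu-2": 128}))
# -> {'gpu-1': [0, 1, 2], 'gpu-2': [3, 4, 5, 6, 7, 8]}
```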
4. Installation: Five Copy-Paste Steps
Prerequisites
- Ubuntu 18+ / CentOS 8+ / macOS (Intel or Apple Silicon)
- Python 3.9 – 3.11
- NVIDIA driver ≥ 525
- CUDA 11.8 or 12.1 already installed
Step 0 (optional but recommended) – Install vLLM
```bash
pip install vllm==0.5.0   # one command, ~3 min
```
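If you host the model yourself, vLLM can expose an OpenAI-compatible endpoint that `config.yaml` later points at. A minimal sketch; the model name and port are placeholders, not AU-Harness defaults:

```bash
# Serve a model behind an OpenAI-compatible HTTP API on port 8000.
# Replace YOUR_ORG/YOUR_AUDIO_MODEL with the checkpoint you want to benchmark.
python -m vllm.entrypoints.openai.api_server \
  --model YOUR_ORG/YOUR_AUDIO_MODEL \
  --port 8000
```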
Step 1 – Clone the repo
```bash
git clone https://github.com/ServiceNow/AU-Harness.git
cd AU-Harness
```
Step 2 – Create virtual environment
```bash
python -m venv harness-env
source harness-env/bin/activate   # Windows: harness-env\Scripts\activate
```
Step 3 – Install dependencies
```bash
pip install -r requirements.txt
```
Step 4 – Copy the sample configuration
```bash
cp sample_config.yaml config.yaml
```
Step 5 – (Mandatory) Insert your model endpoint
Open `config.yaml`, scroll to the `models:` section, and replace the placeholder lines:

```yaml
- name: "my_model"
  inference_type: "vllm"        # or "openai" / "transcription"
  url: "http://YOUR_IP:8000/v1"
  auth_token: ${YOUR_TOKEN}
  batch_size: 128               # start low, raise until GPU RAM is full
  chunk_size: 30                # seconds
```
Save, then run:

```bash
bash evaluate.sh
```

That is literally it. Logs print to the screen; the final CSV and JSON appear inside `run_logs/{timestamp}/`.
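One optional extra before launching the full suite: verify that the endpoint in `config.yaml` actually answers. A quick check, assuming an OpenAI-compatible `/v1` server such as the vLLM one above:

```python
# Connectivity check against the endpoint configured in config.yaml.
# Assumes an OpenAI-compatible server (pip install openai if needed).
from openai import OpenAI

client = OpenAI(base_url="http://YOUR_IP:8000/v1", api_key="YOUR_TOKEN")
print([m.id for m in client.models.list()])  # should list the served model(s)
```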
5. Understanding the Configuration File (No Mysteries)
YAML can look scary; here is a field-by-field decoder.
| Section | Why You Care | Example Value |
|---|---|---|
| `dataset_metric` | Tells the tool what to run and how to score it | `[librispeech_test_clean, word_error_rate]` |
| `filter` | Shortens long datasets for quick sanity checks | `num_samples: 300` or `length_filter: [1.0, 30.0]` |
| `models` | One block per endpoint; unique name + `batch_size` | see Step 5 above |
| `generation_params_override` | Lowers temperature for ASR, raises it for creative QA | `temperature: 0.0` / `max_gen_tokens: 64` |
| `prompt_overrides` | Lets you A/B-test system or user prompts | `system_prompt: "You are a helpful assistant."` |
| `judge_settings` | Needed for LLM-as-Judge metrics | `judge_model: gpt-4o-mini`, `judge_concurrency: 300` |
Tip: Start with the defaults and touch one knob at a time. The code validates the YAML and prints readable errors if you mis-indent.
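Putting those knobs together, a trimmed-down `config.yaml` might look like the sketch below. Every field name comes from the table above, but the values are illustrative, and anything not covered in this article should be cross-checked against `sample_config.yaml`.

```yaml
# Illustrative sketch; cross-check field names and defaults against sample_config.yaml.
dataset_metric:
  - [librispeech_test_clean, word_error_rate]

filter:
  num_samples: 300              # quick sanity-check run

models:
  - name: "my_model"
    inference_type: "vllm"
    url: "http://YOUR_IP:8000/v1"
    auth_token: ${YOUR_TOKEN}
    batch_size: 128
    chunk_size: 30

generation_params_override:
  temperature: 0.0              # deterministic decoding for ASR
  max_gen_tokens: 64

prompt_overrides:
  system_prompt: "You are a helpful assistant."

judge_settings:
  judge_model: gpt-4o-mini
  judge_concurrency: 300
```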
6. Can I Use My Own Audio Files?
Yes. AU-Harness treats every custom corpus as a new “task.”
- Create a folder `tasks/mycompany_sentiment/`.
- Write `mytask.yaml` inside (the manifest it points to is sketched after this list):

```yaml
task_name: mycompany_sentiment
dataset_path: /data/csv/my_audio_manifest.json   # local path
subset: default
split: test
language: english
preprocessor: GeneralPreprocessor
postprocessor: GeneralPostprocessor
audio_column: audio_path
target_column: label
long_audio_processing_logic: truncate
metrics:
  - metric: llm_judge_binary
```

- Reference it in the main config:

```yaml
dataset_metric:
  - ["mycompany_sentiment", "llm_judge_binary"]
```

- Run `bash evaluate.sh` – the toolkit will automatically convert your JSON/CSV into the same format it uses for LibriSpeech.
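For reference, here is one plausible manifest layout: a flat JSON list whose keys match the `audio_column` and `target_column` names declared in `mytask.yaml`. This is an assumption for illustration; check the repo's data-loading docs for the exact schema AU-Harness expects.

```python
# Hypothetical manifest writer: a flat JSON list whose keys match the
# audio_column / target_column names declared in mytask.yaml.
import json

records = [
    {"audio_path": "/data/audio/call_0001.wav", "label": "positive"},
    {"audio_path": "/data/audio/call_0002.wav", "label": "negative"},
]

with open("/data/csv/my_audio_manifest.json", "w") as f:
    json.dump(records, f, indent=2)
```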
7. Reading the Results Without a PhD
After the run you will find three handy artifacts:
- Per-record CSV
  Path: `run_logs/{t}/{task}/{task}_{metric}_{model}.csv`
  Columns: `audio_id, reference, prediction, score`
  Open in Excel → instant sort/filter.
- Final aggregated JSON
  Path: `run_logs/{t}/final_scores.json`
  Example snippet: `{ "emotion_recognition": { "llm_judge_binary": 72.4 }, "speaker_diarization": { "wder": 8.2 } }`
  Copy the numbers to your internal report slide.
- Full log
  Path: `{timestamp}_default.log`
  Grep for `ERROR` to quickly spot network hiccups or CUDA OOM.
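To dig into individual failures, the per-record CSV loads straight into pandas. This sketch assumes only the column names listed above; the file path is a placeholder for your actual `run_logs/{timestamp}/...` output.

```python
# Inspect the lowest-scoring samples in a per-record results CSV.
# Columns (audio_id, reference, prediction, score) are those listed above;
# replace the path with your actual run_logs file.
import pandas as pd

df = pd.read_csv("run_logs/2025-01-01_000000/librispeech_test_clean/"
                 "librispeech_test_clean_word_error_rate_my_model.csv")

print(df["score"].describe())            # score distribution
print(df.sort_values("score").head(10)[  # 10 lowest-scoring rows
    ["audio_id", "reference", "prediction", "score"]
])                                       # whether low is good or bad depends on the metric
```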
8. Real-World Walk-Through: Benchmarking Three Models in One Night
Imagine your team needs to pick a model for a customer-service voice bot. Business KPIs:
- Word Error Rate ≤ 10 % on phone calls
- Speaker diarization error ≤ 15 % (agent vs customer)
- Intent classification F1 ≥ 80 %
Monday 6 pm – you open your laptop:

- Add three endpoints to `config.yaml`: `gpt-4o-audio`, `qwen2.5-omni-7b`, `voxtral-mini-3b`.
- Limit samples to 1,000 per task for a quick overnight job (see the config sketch below).
- Run `bash evaluate.sh`, detach tmux, go home.
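A hedged sketch of the relevant config fragment for that overnight run; the endpoint URLs, tokens, and batch sizes are illustrative and should be adapted to your own setup:

```yaml
# Illustrative fragment: three endpoints, capped at 1,000 samples per task.
filter:
  num_samples: 1000

models:
  - name: "gpt-4o-audio"
    inference_type: "openai"
    url: "https://YOUR_OPENAI_PROXY/v1"
    auth_token: ${OPENAI_TOKEN}
    batch_size: 64
  - name: "qwen2.5-omni-7b"
    inference_type: "vllm"
    url: "http://YOUR_IP:8000/v1"
    auth_token: ${VLLM_TOKEN}
    batch_size: 128
  - name: "voxtral-mini-3b"
    inference_type: "vllm"
    url: "http://YOUR_IP:8001/v1"
    auth_token: ${VLLM_TOKEN}
    batch_size: 128
```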
Tuesday 9 am – open `final_scores.json`:

| Model | WDER ↓ | Intent F1 ↑ | RTF ↓ |
|---|---|---|---|
| gpt-4o-audio | 7.8 | 83 | 3.1 |
| qwen2.5-omni-7b | 8.4 | 81 | 2.9 |
| voxtral-mini-3b | 11.2 | 79 | 1.8 |
Decision: qwen2.5-omni-7b meets the KPIs with the best speed/cost ratio. You commit the YAML to Git; next quarter’s re-test is one click away.
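If you want the Tuesday-morning check scripted too, a small sketch like the one below can compare each model's aggregated scores against the KPIs. It assumes you keep one `final_scores.json` per model run, and the task/metric keys are placeholders; use the keys that actually appear in your own output.

```python
# Hypothetical KPI gate; assumes one final_scores.json per model run,
# with the task -> metric -> score layout shown in Section 7.
import json

KPIS = [  # (task key, metric key, "max" or "min", threshold) -- placeholders
    ("speaker_diarization", "wder", "max", 15.0),
    ("intent_classification", "f1", "min", 80.0),
]

RUNS = {  # model name -> where you saved that run's aggregated scores
    "gpt-4o-audio": "run_logs/run_gpt4o/final_scores.json",
    "qwen2.5-omni-7b": "run_logs/run_qwen/final_scores.json",
    "voxtral-mini-3b": "run_logs/run_voxtral/final_scores.json",
}

for model, path in RUNS.items():
    scores = json.load(open(path))
    ok = all(
        scores[task][metric] <= bound if kind == "max" else scores[task][metric] >= bound
        for task, metric, kind, bound in KPIS
    )
    print(f"{model}: {'PASS' if ok else 'FAIL'}")
```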
9. Frequently Asked Questions (FAQ)
Q1. My GPU has only 24 GB of RAM. What `batch_size` should I use?
Start at 32 for 7B-class models, then raise it until `nvidia-smi` shows ~90 % memory utilisation.
Q2. Can I run the evaluation on CPU?
Technically yes: set `inference_type: openai` and point to an HTTP endpoint that runs on CPU. Expect RTF ≈ 100× slower.
Q3. Do I need internet access?
Only if your model endpoint is cloud-based. Datasets are downloaded once and cached offline.
Q4. How do I cite AU-Harness in a paper?
BibTeX is provided in the repo README. Key info:
arXiv:2509.08031, authors: Surapaneni et al., 2025.
Q5. Is commercial use allowed?
Yes. Apache 2.0 licence. Keep the copyright notice and you’re good.
10. Known Limitations (No Sugar-Coating)
- Backend maturity – Models without vLLM support fall back to slower sequential calls.
- Timestamp precision – Diarization scores degrade when speakers overlap or switch rapidly.
- Language skew – ~70 % of bundled datasets are English; low-resource languages need community contributions.
The maintainers welcome pull requests for new tasks, metrics, and multilingual data.
11. Take-Away Checklist
- [ ] Install vLLM → clone the repo → `pip install -r requirements.txt`
- [ ] Fill in your model URL and auth token in `config.yaml`
- [ ] Pick datasets & metrics, set `batch_size` for your GPU
- [ ] `bash evaluate.sh` → grab coffee → read `final_scores.json`
- [ ] Lock the YAML in version control for fully reproducible papers or product reports
Benchmarking audio-language models is no longer a week-long slog.
With AU-Harness, it’s a single shell command—fast, fair, and transparent.