AU-Harness: The Open-Source Toolbox That Makes Evaluating Audio-Language Models as Easy as Running a Single Bash Command
If you only remember one sentence:
AU-Harness is a free Python toolkit that can benchmark any speech-enabled large language model on 380+ audio tasks, finish the job twice as fast as existing tools, and give you fully reproducible reports—all after editing one YAML file and typing bash evaluate.sh.
1. Why Do We Need Yet Another Audio Benchmark?
Voice AI is booming, but the ruler we use to measure it is still wooden.
Existing evaluation pipelines share three pain points:
| Pain Point | What It Looks Like in Daily Work | What AU-Harness Brings |
|---|---|---|
| Speed | One GPU idles while samples are processed one by one. | Parallel vLLM inference plus dataset sharding keeps every core busy. |
| Fairness | Model A gets a polite prompt, Model B gets a terse cue—scores aren’t comparable. | One single prompt template is locked for every model on the same task. |
| Coverage | Lots of speech-to-text, little “who spoke when” or “write SQL after listening”. | Adds LLM-Adaptive Diarization and Spoken Language Reasoning to the same script. |
If you have ever waited two days for a 50-hour benchmark to finish, or scratched your head because last week’s “best model” suddenly dropped five points after changing a prompt, you already understand why the team built AU-Harness.
2. What Exactly Can It Evaluate?
Think of an audio-to-text ability you care about—AU-Harness probably has a task for it:
- Speech Recognition
  - Normal ASR, long-form ASR, code-switching ASR (mixed-language).
- Paralinguistics
  - Emotion, accent, gender, speaker ID, speaker diarization.
- Audio Understanding
  - Scene classification, music tagging.
- Spoken Language Understanding
  - Intent classification, spoken QA, dialogue summarisation, speech translation.
- Spoken Language Reasoning (new)
  - Speech Function Calling – turn a voice command into an API call.
  - Speech-to-SQL – listen to a question, output SQL.
  - Multi-turn instruction following – obey step-by-step audio instructions.
- Safety & Security
  - Adversarial robustness, spoof detection.
All tasks are already wired to 50+ public datasets (LibriSpeech, MELD, SLURP, CoVoST2, CallHome, Spider …) and nine metrics ranging from Word Error Rate to LLM-as-Judge.
3. How Does It Run So Fast?
The paper reports +127 % throughput and −59 % real-time factor (RTF) versus the next-best toolkit. Three engineering choices matter:
| Choice | Plain-English Meaning |
|---|---|
| vLLM back-end | A high-speed inference library that batches prompts dynamically. |
| Token-based request controller | A single “token bucket” decides which GPU sends the next request—no model waits for a slow neighbour. |
| Dataset sharding proportional to GPU power | If GPU-2 can handle twice the concurrent requests of GPU-1, it gets twice as many audio shards automatically. |
Quick analogy
Old style = one cashier, one item at a time.
AU-Harness = multiple cashiers, multiple baskets, and a traffic-light system that always fills the fastest lane.
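To make the sharding idea concrete, here is a minimal Python sketch of proportional splitting. It is purely illustrative (not AU-Harness internals); the endpoint names and capacities are made up.

```python
# Illustrative sketch of proportional dataset sharding (not AU-Harness internals).
# Each endpoint receives a slice of the dataset proportional to the number of
# concurrent requests it can sustain.

def shard_dataset(samples, capacities):
    """Split `samples` across endpoints, weighted by request capacity."""
    total = sum(capacities.values())
    shards, start = {}, 0
    for name, capacity in capacities.items():
        size = round(len(samples) * capacity / total)
        shards[name] = samples[start:start + size]
        start += size
    # Any rounding remainder goes to the last endpoint.
    if start < len(samples):
        shards[name].extend(samples[start:])
    return shards

# Example: GPU-2 sustains twice the concurrency of GPU-1, so it gets ~2x the shards.
shards = shard_dataset(list(range(900)), {"gpu-1": 64, "gpu-2": 128})
print({k: len(v) for k, v in shards.items()})  # {'gpu-1': 300, 'gpu-2': 600}
```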
4. Installation: Five Copy-Paste Steps
Prerequisites
- Ubuntu 18+ / CentOS 8+ / macOS (Intel or Apple Silicon)
- Python 3.9 – 3.11
- NVIDIA driver ≥ 525
- CUDA 11.8 or 12.1 already installed
Step 0 (optional but recommended) – Install vLLM
pip install vllm==0.5.0 # one command, ~3 min
Step 1 – Clone the repo
git clone https://github.com/ServiceNow/AU-Harness.git
cd AU-Harness
Step 2 – Create virtual environment
python -m venv harness-env
source harness-env/bin/activate # Windows: harness-env\Scripts\activate
Step 3 – Install dependencies
pip install -r requirements.txt
Step 4 – Copy the sample configuration
cp sample_config.yaml config.yaml
Step 5 – (Mandatory) Insert your model endpoint
Open config.yaml, scroll to the models: section, and replace the placeholder lines:
- name: "my_model"
  inference_type: "vllm"          # or "openai" / "transcription"
  url: "http://YOUR_IP:8000/v1"
  auth_token: ${YOUR_TOKEN}
  batch_size: 128                 # start low, raise until GPU RAM full
  chunk_size: 30                  # seconds
Save, then run:
bash evaluate.sh
That is literally it. Logs print to screen; final CSV and JSON appear inside run_logs/{timestamp}/.
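Before kicking off a long run, it can be worth confirming that the endpoint you wrote into config.yaml actually answers. A minimal sketch, assuming an OpenAI-compatible server such as vLLM; the URL and token are the same placeholders as in Step 5.

```python
# Optional sanity check before a long run: confirm the endpoint from config.yaml
# responds. Assumes an OpenAI-compatible server (e.g. vLLM); URL and token are
# the placeholders from Step 5.
import requests

URL = "http://YOUR_IP:8000/v1/models"   # same base URL as in config.yaml, plus /models
TOKEN = "YOUR_TOKEN"

resp = requests.get(URL, headers={"Authorization": f"Bearer {TOKEN}"}, timeout=10)
resp.raise_for_status()
print([m["id"] for m in resp.json()["data"]])  # should list the model you plan to evaluate
```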
5. Understanding the Configuration File (No Mysteries)
YAML can look scary; here is a field-by-field decoder.
| Section | Why You Care | Example Value |
|---|---|---|
| dataset_metric | Tells the tool “what to run and how to score” | [librispeech_test_clean, word_error_rate] |
| filter | Shortens long datasets for quick sanity checks | num_samples: 300 or length_filter: [1.0, 30.0] |
| models | One block per endpoint; unique name + batch_size | see Step 5 above |
| generation_params_override | Lowers temperature for ASR, raises it for creative QA | temperature: 0.0 / max_gen_tokens: 64 |
| prompt_overrides | Lets you A/B-test system or user prompts | system_prompt: "You are a helpful assistant." |
| judge_settings | Needed for LLM-as-Judge metrics | judge_model: gpt-4o-mini, judge_concurrency: 300 |
Tip: Start with the defaults; touch one knob at a time. The code validates the YAML and emits readable errors if you mis-indent.
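If you want to catch an indentation slip before launching anything, you can also pre-parse the file yourself with PyYAML (pip install pyyaml if it is not already present). A minimal sketch; the two keys it checks are simply the ones this guide relies on.

```python
# Pre-flight check for config.yaml syntax and the sections this guide relies on.
import yaml  # PyYAML

with open("config.yaml") as f:
    try:
        config = yaml.safe_load(f)
    except yaml.YAMLError as err:
        raise SystemExit(f"config.yaml is not valid YAML: {err}")

for key in ("dataset_metric", "models"):
    if key not in config:
        raise SystemExit(f"config.yaml is missing the top-level section: {key}")
print("config.yaml parsed OK; top-level sections:", list(config))
```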
6. Can I Use My Own Audio Files?
Yes. AU-Harness treats every custom corpus as a new “task.”
- Create a folder tasks/mycompany_sentiment/.
- Write mytask.yaml inside:
task_name: mycompany_sentiment
dataset_path: /data/csv/my_audio_manifest.json # local path
subset: default
split: test
language: english
preprocessor: GeneralPreprocessor
postprocessor: GeneralPostprocessor
audio_column: audio_path
target_column: label
long_audio_processing_logic: truncate
metrics:
- metric: llm_judge_binary
- Reference it in the main config:
dataset_metric:
- ["mycompany_sentiment", "llm_judge_binary"]
- Run bash evaluate.sh – the toolkit will automatically convert your JSON/CSV into the same format it uses for LibriSpeech (a sketch for building such a manifest follows below).
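If your labels currently live in a spreadsheet, a few lines of Python can produce the manifest that dataset_path points to. A minimal sketch; the input CSV columns (filename, sentiment) are assumptions, and only audio_path and label have to match the task YAML above.

```python
# Build the JSON manifest referenced by `dataset_path` from a simple CSV.
# The CSV columns (filename, sentiment) are assumed; only the output keys
# need to match audio_column / target_column in mytask.yaml.
import csv
import json

records = []
with open("my_labels.csv", newline="") as f:
    for row in csv.DictReader(f):
        records.append({
            "audio_path": row["filename"],   # matches audio_column
            "label": row["sentiment"],       # matches target_column
        })

with open("/data/csv/my_audio_manifest.json", "w") as f:  # path from mytask.yaml
    json.dump(records, f, indent=2)
print(f"wrote {len(records)} records")
```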
7. Reading the Results Without a PhD
After the run you will find three handy artifacts:
- Per-record CSV
  Path: run_logs/{t}/{task}/{task}_{metric}_{model}.csv
  Columns: audio_id, reference, prediction, score
  Open in Excel → instant sort/filter.
- Final aggregated JSON
  Path: run_logs/{t}/final_scores.json
  Example snippet: { "emotion_recognition": { "llm_judge_binary": 72.4 }, "speaker_diarization": { "wder": 8.2 } }
  Copy the numbers to your internal report slide.
- Full log
  Path: {timestamp}_default.log
  Grep ERROR to quickly spot network hiccups or CUDA OOM.
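For anything beyond an Excel sort, the same artifacts are easy to post-process in Python. A minimal sketch using pandas; the timestamp, task, metric, and model names are placeholders that follow the path patterns above.

```python
# Post-process AU-Harness outputs. The run folder, task, metric, and model
# names are placeholders that follow the path patterns above.
import json
import pandas as pd

run = "run_logs/2025-01-01_12-00-00"

# Per-record scores: surface the worst utterances for error analysis
# (for WER, a higher score means a worse transcription).
df = pd.read_csv(f"{run}/librispeech_test_clean/"
                 "librispeech_test_clean_word_error_rate_my_model.csv")
worst = df.sort_values("score", ascending=False).head(10)
print(worst[["audio_id", "reference", "prediction", "score"]])

# Aggregated scores: one number per (task, metric) pair.
with open(f"{run}/final_scores.json") as f:
    print(json.dumps(json.load(f), indent=2))
```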
8. Real-World Walk-Through: Benchmarking Three Models in One Night
Imagine your team needs to pick a model for a customer-service voice bot. Business KPIs:
- Word Error Rate ≤ 10 % on phone calls
- Speaker diarization error ≤ 15 % (agent vs customer)
- Intent classification F1 ≥ 80 %
Monday, 6 pm – you open your laptop:
- Add three endpoints to config.yaml: gpt-4o-audio, qwen2.5-omni-7b, voxtral-mini-3b.
- Limit samples to 1,000 per task for a quick overnight job.
- Run bash evaluate.sh – detach tmux, go home.
Tuesday 9 am – Open final_scores.json:
| Model | WDER (%) ↓ | Intent F1 (%) ↑ | RTF ↓ |
|---|---|---|---|
| gpt-4o-audio | 7.8 | 83 | 3.1 |
| qwen2.5-omni-7b | 8.4 | 81 | 2.9 |
| voxtral-mini-3b | 11.2 | 79 | 1.8 |
Decision: qwen2.5-omni-7b meets every KPI at the best speed/cost ratio. You commit the YAML to Git – next quarter’s re-test is one command away.
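If you would rather not eyeball final_scores.json every quarter, the KPI gate itself can be scripted. A minimal sketch; the task and metric keys are illustrative and should be taken from your own run.

```python
# Turn the business KPIs into an automated gate over final_scores.json.
# Task and metric keys are illustrative; use the names from your own run.
import json

THRESHOLDS = {
    ("speaker_diarization", "wder"): ("max", 15.0),   # diarization error <= 15 %
    ("intent_classification", "f1"): ("min", 80.0),   # intent F1 >= 80
}

with open("run_logs/2025-01-01_12-00-00/final_scores.json") as f:
    scores = json.load(f)

failures = []
for (task, metric), (kind, limit) in THRESHOLDS.items():
    value = scores[task][metric]
    ok = value <= limit if kind == "max" else value >= limit
    if not ok:
        failures.append(f"{task}/{metric} = {value} (limit {limit})")

if failures:
    raise SystemExit("KPI gate failed:\n" + "\n".join(failures))
print("All KPIs met.")
```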
9. Frequently Asked Questions (FAQ)
Q1. My GPU has only 24 GB RAM. What batch_size should I use?
Start at 32 for 7B-class models, then raise it until nvidia-smi shows ~90 % memory usage.
Q2. Can I run the evaluation on CPU?
Technically yes: set inference_type: openai and point to an HTTP endpoint that runs on CPU. Expect a roughly 100× higher RTF.
Q3. Do I need internet access?
Only if your model endpoint is cloud-based. Datasets are downloaded once and cached offline.
Q4. How do I cite AU-Harness in a paper?
BibTeX is provided in the repo README. Key info:
arXiv:2509.08031, authors: Surapaneni et al., 2025.
Q5. Is commercial use allowed?
Yes. Apache 2.0 licence. Keep the copyright notice and you’re good.
10. Known Limitations (No Sugar-Coating)
- Backend maturity – models without vLLM support fall back to slower sequential calls.
- Timestamp precision – diarization scores degrade when speakers overlap or switch rapidly.
- Language skew – ~70 % of bundled datasets are English; low-resource languages need community contributions.
The maintainers welcome pull requests for new tasks, metrics, and multilingual data.
11. Take-Away Checklist
- [ ] Install vLLM → clone the repo → pip install -r requirements.txt
- [ ] Fill in your model URL and auth token in config.yaml
- [ ] Pick datasets & metrics, set batch_size for your GPU
- [ ] bash evaluate.sh → grab coffee → read final_scores.json
- [ ] Lock the YAML in version control for fully reproducible papers or product reports
Benchmarking audio-language models is no longer a week-long slog.
With AU-Harness, it’s a single shell command—fast, fair, and transparent.

