AU-Harness: Benchmark 380+ Audio Tasks 2x Faster with One Command

1 month ago 高效码农

AU-Harness: The Open-Source Toolbox That Makes Evaluating Audio-Language Models as Easy as Running a Single Bash Command

If you only remember one sentence: AU-Harness is a free Python toolkit that can benchmark any speech-enabled large language model on 380+ audio tasks, finish the job twice as fast as existing tools, and give you fully reproducible reports, all after editing one YAML file and typing bash evaluate.sh.

1. Why Do We Need Yet Another Audio Benchmark?

Voice AI is booming, but the ruler we use to measure it is still wooden. Existing evaluation pipelines share three pain points: Pain Point What It …
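To make the "edit one YAML file, run one command" idea concrete, here is a minimal Python sketch of a single-config evaluation driver. Every name in it (the config keys, the evaluate function, the task labels) is an illustrative assumption, not AU-Harness's actual schema or API; it only shows the shape of the workflow the teaser describes.

```python
# Sketch of a one-config benchmark driver, in the spirit of AU-Harness's
# "edit one YAML, run bash evaluate.sh" design. All names are assumptions.

# In the real toolkit this would be loaded from the YAML file you edit.
config = {
    "model": "my-speech-llm",                   # model endpoint to evaluate
    "tasks": ["asr", "emotion", "speech_qa"],   # a subset of the 380+ tasks
    "num_workers": 8,                           # parallel workers (the speedup lever)
}

def evaluate(cfg):
    """Stub runner: a real driver would fan tasks out to workers,
    collect model outputs, and score them into a reproducible report."""
    return {task: f"queued on {cfg['num_workers']} workers" for task in cfg["tasks"]}

report = evaluate(config)
for task, status in report.items():
    print(task, "->", status)
```

The point of the design is that the driver, not the user, owns the evaluation loop: you declare what to run in one place, and parallelism and reporting come for free.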

IMO 2025 LLM Experiment Reveals AI’s Mathematical Reasoning Breakthroughs

3 months ago 高效码农

IMO 2025: The First Public Scorecard of Large Language Models on the World's Hardest Math Test

[Image: a quiet IMO 2025 exam room]

Every July, the International Mathematical Olympiad (IMO) gathers the brightest teenage minds for two grueling days of proof writing. In 2025, for the first time, the same six problems were also handed, virtually, to a new generation of contestants: large language models (LLMs).

The full record of that experiment lives in the open-source repository IMO2025-LLM. Inside you will find the original contest questions, each model's step-by-step reasoning, and an impartial report card on correctness and completeness. This article unpacks everything …