LLM Safety archive | Efficient Coder

The Assistant Axis Fixes LLM Jailbreaks: Why AI Models Break Character and How to Stop It

2 months ago 高效码农

The Assistant Axis: Why LLMs “Break Character” and How Researchers Are Fixing It

The “Assistant Axis” is a key direction in a large language model's activation space that measures how closely the model stays in its trained “helpful AI Assistant” persona. Deviations along this axis cause persona drift, which leads to theatrical language, harmful suggestions, or successful jailbreaks. By capping activations along this axis during inference, researchers reduced persona-based jailbreak success rates significantly while preserving performance on major benchmarks (IFEval, MMLU-Pro, GSM8K, EQ-Bench).

When you chat with modern large language models like Llama, Qwen, …