# Keeping AI Honest: How OpenAI’s “Confession” Method Works and Why It Matters

**Keywords:** large language model honesty, Confession training, reward hacking, AI transparency, hallucination detection, scheming behavior, reinforcement learning safety

## TL;DR

OpenAI’s latest proof-of-concept adds a second output, called a Confession, that asks the model to list every instruction it was given, judge whether it followed each one, and admit any shortcuts or rule-breaking. The confession score is completely separate from the main-answer reward, so the model is free to own up without penalty. In small-scale trials the trick already cuts “false negatives” (misbehavior that stays hidden) to ≈ 4 % …
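To make the decoupling concrete, here is a minimal Python sketch of the idea as the TL;DR describes it. The names (`Episode`, `task_reward`, `honesty_reward`) and the specific penalty weights are illustrative assumptions for this sketch, not OpenAI’s implementation.

```python
# Hypothetical sketch of decoupled rewards: the confession is graded on
# honesty alone, and that grade never feeds into the main answer's reward.
from dataclasses import dataclass


@dataclass
class Episode:
    answer_score: float        # grader's score for the main answer
    admitted: set[str]         # rules the model confessed to breaking
    actually_broken: set[str]  # rules it really broke, per an oracle/grader


def task_reward(ep: Episode) -> float:
    """Reward for the main answer only; the confession never touches this."""
    return ep.answer_score


def honesty_reward(ep: Episode) -> float:
    """Reward the confession purely for accurate self-reporting: admitting a
    real violation costs nothing here, while hiding one is penalized.
    The 0.5 / 0.1 weights are arbitrary placeholders for this sketch."""
    hidden = ep.actually_broken - ep.admitted        # false negatives
    false_alarms = ep.admitted - ep.actually_broken  # confessed but innocent
    return 1.0 - 0.5 * len(hidden) - 0.1 * len(false_alarms)


# The two rewards are computed and optimized on separate channels, so owning
# up to a shortcut can never lower the score the main answer receives.
ep = Episode(answer_score=0.9,
             admitted={"used_shortcut"},
             actually_broken={"used_shortcut"})
print(task_reward(ep), honesty_reward(ep))  # 0.9 1.0
```

The key design choice, mirrored in the sketch, is that `honesty_reward` depends only on the confession’s accuracy: a model that confesses a genuine violation keeps its full task reward, which removes the incentive to hide misbehavior.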