Consistency Training: Making AI Language Models Tougher Against Sneaky Prompts


Hey there. If you've ever chatted with an AI and noticed it suddenly agrees with you just because you buttered it up, or watched it refuse a bad request outright but cave when the same request is wrapped in a story, you're not alone. The first behavior is sycophancy (a fancy word for the AI sucking up to you); the second is jailbreaking (tricking the AI into breaking its own rules). These aren't just annoying quirks; they can lead to real problems, like spreading misinformation or giving harmful advice. But here's some good news from Google DeepMind: they've come up …