shrtdb :: Constitutional Classifiers: Defending against universal jailbreaks

A new paper from the Anthropic Safeguards Research Team proposes a method for protecting AI models against universal jailbreaks. The system shows promising results but has some issues with high refusal rates and computational costs.

The paper discusses a method for defending AI models from universal jailbreaks.
A prototype of the method withstood extensive testing, proving robust against attacks but showing high overrefusal rates.
An updated version performed well in evaluations, with lower refusal rates and compute costs than the prototype.
The system is currently in a live demo phase inviting feedback from users with jailbreaking experience.
Large language models like Claude are trained to avoid harmful outputs, but jailbreaks remain a threat.
Previous defenses have struggled to block or detect such attacks effectively.
In experiments, 183 participants were unable to discover a universal jailbreak despite significant efforts.
Automated evaluations of 10,000 jailbreaking prompts demonstrated a reduction in jailbreak success rates from 86% to 4.4% when using the new classifiers.
The refusal rate on harmless queries increased only slightly, indicating that the classifiers have little effect on how the model handles safe content.
Further work is needed to reduce computational costs and the refusal rate on benign queries.
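
To put the automated-evaluation figures in perspective, the drop from 86% to 4.4% can be restated as an absolute and a relative reduction. The following is a minimal Python sketch, not anything from the paper: the function name is hypothetical, and the implied prompt counts assume both rates were measured on the same set of 10,000 prompts.

```python
# Illustrative arithmetic only. The rates and prompt count are the figures
# quoted in the summary; the helper name is hypothetical.
def jailbreak_reduction(baseline_rate, guarded_rate, n_prompts):
    """Summarise the change in jailbreak success rate between two conditions."""
    drop = baseline_rate - guarded_rate
    return {
        # Difference between the two rates, in percentage points.
        "absolute_pp": round(drop * 100, 1),
        # Share of previously successful jailbreaks that no longer succeed.
        "relative_pct": round(drop / baseline_rate * 100, 1),
        # Implied counts, ASSUMING both rates apply to the same n_prompts.
        "baseline_successes": round(baseline_rate * n_prompts),
        "guarded_successes": round(guarded_rate * n_prompts),
    }


stats = jailbreak_reduction(0.86, 0.044, 10_000)
print(stats)
```

Under these assumptions the classifiers cut the success rate by 81.6 percentage points, which is roughly a 94.9% relative reduction, or about 8,600 successful prompts falling to about 440.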