Constitutional Classifiers: Defending against universal jailbreaks

Source: www.anthropic.com

A new paper from the Anthropic Safeguards Research Team proposes a method for protecting AI models against universal jailbreaks. The system shows promising results, but it comes at the cost of high refusal rates and added computational overhead.