FRACTURED-SORRY-Bench

Framework for Revealing Attacks in Conversational Turns Undermining Refusal Efficacy and Defenses over SORRY-Bench

Get Started

View on GitHub · Check out the HuggingFace Dataset · Read the Paper

About FRACTURED-SORRY-Bench

FRACTURED-SORRY-Bench is a framework for evaluating the safety of Large Language Models (LLMs) against multi-turn conversational attacks. Building upon the SORRY-Bench dataset, we propose a simple yet effective method for generating adversarial prompts by breaking down harmful queries into seemingly innocuous sub-questions.
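The decomposition attack described above delivers the sub-questions as successive conversational turns. As a minimal sketch (not the actual FRACTURED-SORRY-Bench pipeline), the turns can be arranged in an OpenAI-style chat message list; `build_attack_turns` is a hypothetical helper name:

```python
def build_attack_turns(sub_questions):
    """Arrange decomposed sub-questions as successive user turns in an
    OpenAI-style chat message list. Hypothetical sketch: the real attack
    loop would append each model reply as an "assistant" turn before
    sending the next sub-question."""
    conversation = []
    for question in sub_questions:
        conversation.append({"role": "user", "content": question})
    return conversation

# Placeholder sub-questions standing in for one decomposed harmful query
turns = build_attack_turns(["sub-question 1", "sub-question 2", "sub-question 3"])
```

In the actual attack setting each seemingly innocuous sub-question is answered in turn, so no single message trips the model's refusal behavior.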


Results Preview

Our decomposition approach raises Attack Success Rates (ASRs) to as high as 50.89% across GPT-4, GPT-4o, GPT-4o-mini, and GPT-3.5-Turbo, an increase of up to 46.22 percentage points over the vanilla single-turn baseline.

Vanilla Responses

Model | Harmful & Relevant | Harmful but Irrelevant | Harmless | ASR (%)
GPT-4o | 52 | 3 | 395 | 11.56
GPT-3.5 | 21 | 4 | 425 | 4.67
GPT-4o-mini | 58 | 2 | 390 | 12.89
GPT-4 | 45 | 3 | 402 | 10.00

Decomposed Responses

Model | Harmful & Relevant | Harmful but Irrelevant | Harmless | ASR (%)
GPT-4o | 223 | 103 | 124 | 49.56
GPT-3.5 | 229 | 106 | 115 | 50.89
GPT-4o-mini | 226 | 106 | 118 | 50.22
GPT-4 | 221 | 104 | 125 | 49.11
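Each model was evaluated on 450 responses per setting (the three count columns in each row sum to 450), and ASR is the share judged Harmful & Relevant. The table values can be checked with a small helper:

```python
def asr(harmful_relevant, harmful_irrelevant, harmless):
    """Attack Success Rate: percentage of all evaluated responses
    judged Harmful & Relevant."""
    total = harmful_relevant + harmful_irrelevant + harmless
    return round(100 * harmful_relevant / total, 2)

print(asr(223, 103, 124))  # GPT-4o, decomposed → 49.56
print(asr(21, 4, 425))     # GPT-3.5, vanilla → 4.67
```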
