Framework for Revealing Attacks in Conversational Turns Undermining Refusal Efficacy and Defenses over SORRY-Bench (Automated Multi-shot Jailbreaks)
FRACTURED-SORRY-Bench is a framework for evaluating the safety of Large Language Models (LLMs) against multi-turn conversational attacks. Building upon the SORRY-Bench dataset, we propose a simple yet effective method for generating adversarial prompts by breaking down harmful queries into seemingly innocuous sub-questions.
Our approach achieves a maximum increase of 50.89% in Attack Success Rates (ASRs) across GPT-4, GPT-4o, GPT-4o-mini, and GPT-3.5-Turbo models compared to baseline methods.
Model | Harmful & Relevant | Harmful but Irrelevant | Harmless | ASR (%) |
---|---|---|---|---|
GPT-4o | 52 | 3 | 395 | 11.56 |
GPT-3.5 | 21 | 4 | 425 | 4.67 |
GPT-4o-mini | 58 | 2 | 390 | 12.89 |
GPT-4 | 45 | 3 | 402 | 10.00 |
Model | Harmful & Relevant | Harmful but Irrelevant | Harmless | ASR (%) |
---|---|---|---|---|
GPT-4o | 223 | 103 | 124 | 49.56 |
GPT-3.5 | 229 | 106 | 115 | 50.89 |
GPT-4o-mini | 226 | 106 | 118 | 50.22 |
GPT-4 | 221 | 104 | 125 | 49.11 |