Hello, fellow AI enthusiasts! 🤖 Today, I wanted to dive into the FRACTURED-SORRY-Bench framework and dataset we just released. Check out the dataset, website, and github for the dataset!
The FRACTURED-SORRY Saga: A Tale of Adaptation and Decomposition
Picture this: you’re wandering through the lush collection of prompt-injection and llm-red-teaming papers, marveling at some of the weird and some of the crazier attack mechanisms that have been released recently. When suddenly, you realize that there aren’t many Proof-of-Concept resources for multi-shot red-teaming. That’s essentially the story behind creating FRACTURED-SORRY-Bench.
What’s in a Name?
FRACTURED-SORRY-Bench isn’t just a mouthful; it’s a clever acronym we probably spent the most time on. It stands for:
- Framework for
- Revealing
- Attacks in
- Conversational
- Turns
- Undermining
- Refusal
- Efficacy and
- Defenses over
- SORRY-Bench
The FRACTURED Approach: Divide and Conquer
Vanilla Responses:
Model | Harmful & Relevant | Harmful but Irrelevant | Harmless | ASR (%) |
---|---|---|---|---|
GPT-4o | 52 | 3 | 395 | 11.56 |
GPT-3.5 | 21 | 4 | 425 | 4.67 |
GPT-4o-mini | 58 | 2 | 390 | 12.89 |
GPT-4 | 45 | 3 | 402 | 10.00 |
Decomposed Responses:
Model | Harmful & Relevant | Harmful but Irrelevant | Harmless | ASR (%) |
---|---|---|---|---|
GPT-4o | 223 | 103 | 124 | 49.56 |
GPT-3.5 | 229 | 106 | 115 | 50.89 |
GPT-4o-mini | 226 | 106 | 118 | 50.22 |
GPT-4 | 221 | 104 | 125 | 49.11 |
The FRACTURED-SORRY-Bench framework takes a page out of our everyday conversations playbook by breaking down complex problems into simpler, more manageable pieces. Just like how we breakdown complex sometimes malicious instructions into simpler manageable chunks so as to not reveal true intentions, this framework dissects AI vulnerabilities by:
- Decomposing potentially harmful queries into seemingly innocuous sub-questions
- Presenting these sub-questions sequentially in a conversational format
- Analyzing the cumulative response to determine if the original harmful intent was fulfilled
- Exploiting the AI’s inability to recognize malicious intent spread across multiple interactions
From Theory to Practice: The Jailbreak Jamboree
Now, let’s get to the juicy part – the jailbreaks! We discovered that by simply decomposing questions, they could bypass safety measures in OpenAI models.
Here’s a taste of what we found:
- A significant increase in Attack Success Rate (ASR) on average 6x
- Simple exploits that are zero-shot effective in communicating harmful intent in 49% of cases through decomposition
During my summer internship at Robust Intelligence, I got a firsthand look at how these kinds of vulnerabilities are discovered and addressed Media Coverage, Jailbreak Meta’s Prompt-Guard LLaMA3.1 Family within 24 hours, and Jailbreaking OpenAI’s structured response within 3 hours. Now, back at CMU, I’m excited to continue exploring this fascinating field.
The Moral of the Story: Stay FRACTURED, My Friends
So, what can we learn from this decomposed madness? A few key takeaways:
- Simplicity is key: We have a long way before we begin exploring complex jailbreaks as options for red-teaming, there’s still opportunity for lots of smaller & simpler attacks.
- Protection against multi-shot attacks: There’s a need to explore and defend against multi-shot attacks.
Conclusion: The Adventure Continues
As we wrap up this whirlwind tour of FRACTURED-SORRY-Bench, remember that the quest for AI safety is an ongoing journey!!
Also, thanks a tonne to my co-author Supriti Vijay!!
P.S. If you found this blog post helpful (or at least mildly entertaining), I’ll be releasing quite a few more so do on-board for this adventure. Also, if you want to chat or collaborate on a research project together do not hesitate to reach out. My email is: amanpriyanshusms2001[at]gmail[dot]com 🔬