FRACTURED-SORRY-Bench: Unraveling AI Safety through Decomposing Malicious Intents

Hello, fellow AI enthusiasts! 🤖 Today, I wanted to dive into the FRACTURED-SORRY-Bench framework and dataset we just released. Check out the dataset, website, and github for the dataset!

The FRACTURED-SORRY Saga: A Tale of Adaptation and Decomposition

Picture this: you’re wandering through the lush collection of prompt-injection and llm-red-teaming papers, marveling at some of the weird and some of the crazier attack mechanisms that have been released recently. When suddenly, you realize that there aren’t many Proof-of-Concept resources for multi-shot red-teaming. That’s essentially the story behind creating FRACTURED-SORRY-Bench.

What’s in a Name?

FRACTURED-SORRY-Bench isn’t just a mouthful; it’s a clever acronym we probably spent the most time on. It stands for:

Framework for
Revealing
Attacks in
Conversational
Turns
Undermining
Refusal
Efficacy and
Defenses over
SORRY-Bench

The FRACTURED Approach: Divide and Conquer

Vanilla Responses:

Model	Harmful & Relevant	Harmful but Irrelevant	Harmless	ASR (%)
GPT-4o	52	3	395	11.56
GPT-3.5	21	4	425	4.67
GPT-4o-mini	58	2	390	12.89
GPT-4	45	3	402	10.00

Decomposed Responses:

Model	Harmful & Relevant	Harmful but Irrelevant	Harmless	ASR (%)
GPT-4o	223	103	124	49.56
GPT-3.5	229	106	115	50.89
GPT-4o-mini	226	106	118	50.22
GPT-4	221	104	125	49.11

The FRACTURED-SORRY-Bench framework takes a page out of our everyday conversations playbook by breaking down complex problems into simpler, more manageable pieces. Just like how we breakdown complex sometimes malicious instructions into simpler manageable chunks so as to not reveal true intentions, this framework dissects AI vulnerabilities by:

Decomposing potentially harmful queries into seemingly innocuous sub-questions
Presenting these sub-questions sequentially in a conversational format
Analyzing the cumulative response to determine if the original harmful intent was fulfilled
Exploiting the AI’s inability to recognize malicious intent spread across multiple interactions

From Theory to Practice: The Jailbreak Jamboree

Now, let’s get to the juicy part – the jailbreaks! We discovered that by simply decomposing questions, they could bypass safety measures in OpenAI models.

Here’s a taste of what we found:

A significant increase in Attack Success Rate (ASR) on average 6x
Simple exploits that are zero-shot effective in communicating harmful intent in 49% of cases through decomposition

During my summer internship at Robust Intelligence, I got a firsthand look at how these kinds of vulnerabilities are discovered and addressed Media Coverage, Jailbreak Meta’s Prompt-Guard LLaMA3.1 Family within 24 hours, and Jailbreaking OpenAI’s structured response within 3 hours. Now, back at CMU, I’m excited to continue exploring this fascinating field.

The Moral of the Story: Stay FRACTURED, My Friends

So, what can we learn from this decomposed madness? A few key takeaways:

Simplicity is key: We have a long way before we begin exploring complex jailbreaks as options for red-teaming, there’s still opportunity for lots of smaller & simpler attacks.
Protection against multi-shot attacks: There’s a need to explore and defend against multi-shot attacks.

Conclusion: The Adventure Continues

As we wrap up this whirlwind tour of FRACTURED-SORRY-Bench, remember that the quest for AI safety is an ongoing journey!!

Also, thanks a tonne to my co-author Supriti Vijay!!

P.S. If you found this blog post helpful (or at least mildly entertaining), I’ll be releasing quite a few more so do on-board for this adventure. Also, if you want to chat or collaborate on a research project together do not hesitate to reach out. My email is: amanpriyanshusms2001[at]gmail[dot]com 🔬

The FRACTURED-SORRY Saga: A Tale of Adaptation and Decomposition#

What’s in a Name?#

The FRACTURED Approach: Divide and Conquer#

From Theory to Practice: The Jailbreak Jamboree#

The Moral of the Story: Stay FRACTURED, My Friends#

Conclusion: The Adventure Continues#