
AIs use many-shot jailbreaking method to jailbreak other AIs


AI is increasingly being used to jailbreak and hack other AIs, and it’s a bad trend.



Researchers at Artificial Intelligence (AI) specialist Anthropic have demonstrated a novel attack against Large Language Models (LLMs) which can break through the “guardrails” put in place to prevent the generation of misleading or harmful content — by simply overwhelming the LLM with input: many-shot jailbreaking.




“The technique takes advantage of a feature of LLMs that has grown dramatically in the last year: the context window,” Anthropic’s team explains. “At the start of 2023, the context window — the amount of information that an LLM can process as its input — was around the size of a long essay (~4,000 tokens). Some models now have context windows that are hundreds of times larger — the size of several long novels (1,000,000 tokens or more). The ability to input increasingly-large amounts of information has obvious advantages for LLM users, but it also comes with risks: vulnerabilities to jailbreaks that exploit the longer context window.”
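
To put the quoted window sizes in perspective, here is a rough back-of-the-envelope sketch. The two window sizes come from the quote above. The figure of about 100 tokens per faked dialogue is purely an assumption for illustration and does not come from Anthropic's paper.

```python
# Window sizes quoted by Anthropic's team.
EARLY_2023_WINDOW = 4_000    # tokens, "a long essay"
LARGE_WINDOW = 1_000_000     # tokens, "several long novels"

# ASSUMPTION: rough size of one faked dialogue; not a figure from the paper.
TOKENS_PER_SHOT = 100


def max_shots(window_tokens: int, tokens_per_shot: int = TOKENS_PER_SHOT) -> int:
    """Return how many whole dialogues of the assumed size fit in a window."""
    return window_tokens // tokens_per_shot


# How many times larger the big window is than the early-2023 one.
growth_factor = LARGE_WINDOW // EARLY_2023_WINDOW

print(growth_factor)
print(max_shots(EARLY_2023_WINDOW))
print(max_shots(LARGE_WINDOW))
```

Under that assumption, an early-2023 window holds only a few dozen dialogues, while a million-token window holds thousands, which is why a longer context window widens the room for this kind of attack.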




Many-shot jailbreaking is, the researchers admit, an extremely simple approach to breaking free of the constraints placed on most commercial LLMs: add fake, hand-crafted dialogue to a given query, in which the fake LLM answers positively to a request that it would normally reject, such as for instructions on building a bomb. Putting just one such faked conversation in the prompt isn’t enough, though. Include many, up to 256 in the team’s testing, and the guardrails can be bypassed.




“In our study, we showed that as the number of included dialogues (the number of ‘shots’) increases beyond a certain point, it becomes more likely that the model will produce a harmful response,” the team writes. “In our paper, we also report that combining many-shot jailbreaking with other, previously-published jailbreaking techniques makes it even more effective, reducing the length of the prompt that’s required for the model to return a harmful response.”

The approach applies to both Anthropic’s own LLM, Claude, and those of its rivals, and the company has been in touch with other AI companies to discuss its findings so that mitigations can be put in place. These mitigations, now implemented in Claude, include fine-tuning the model to recognize many-shot jailbreak attacks and classifying and modifying prompts before they’re passed to the model itself, which dropped the attack success rate from 61 percent to just two percent in a best-case example.
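
As a loose illustration of the "classify prompts before they reach the model" idea, here is a minimal heuristic sketch. It is not Anthropic's classifier, whose details are not described here. The speaker labels, the threshold, and the function names are all hypothetical choices made for this example.

```python
import re

# ASSUMPTION: these speaker labels are hypothetical; real prompts may use others.
TURN_PATTERN = re.compile(
    r"^\s*(user|human|assistant|ai)\s*:", re.IGNORECASE | re.MULTILINE
)

# ASSUMPTION: arbitrary cut-off for this sketch, not a value from the paper.
MAX_EMBEDDED_TURNS = 16


def count_embedded_turns(prompt: str) -> int:
    """Count lines that look like dialogue turns pasted inside one prompt."""
    return len(TURN_PATTERN.findall(prompt))


def looks_like_many_shot(prompt: str, max_turns: int = MAX_EMBEDDED_TURNS) -> bool:
    """Flag prompts that embed an unusually large number of dialogue turns."""
    return count_embedded_turns(prompt) > max_turns


normal = "Summarize this email for me."
# Benign placeholder text standing in for a prompt stuffed with faked dialogues.
stuffed = "\n".join(
    f"User: question {i}\nAssistant: answer {i}" for i in range(100)
) + "\nUser: final question"

print(looks_like_many_shot(normal))
print(looks_like_many_shot(stuffed))
```

A simple counter like this would be easy to evade and would flag legitimate transcripts, so it is best read as a picture of where such a check sits in the pipeline, not as a workable defense.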

More information on the attack is available on the Anthropic blog, along with a link to download the researchers’ paper on the topic.
