
AIs use many-shot jailbreaking method to jailbreak other AIs

WHY THIS MATTERS IN BRIEF

AI is increasingly being used to jailbreak and hack other AIs, and it’s a worrying trend.

 

Love the Exponential Future? Join our XPotential Community, future-proof yourself with courses from XPotential University, read about exponential tech and trends, connect, watch a keynote, or browse my blog.

Researchers at Artificial Intelligence (AI) specialist Anthropic have demonstrated a novel attack against Large Language Models (LLMs), dubbed many-shot jailbreaking, which can break through the “guardrails” put in place to prevent the generation of misleading or harmful content simply by overwhelming the LLM with input.

 


 

“The technique takes advantage of a feature of LLMs that has grown dramatically in the last year: the context window,” Anthropic’s team explains. “At the start of 2023, the context window — the amount of information that an LLM can process as its input — was around the size of a long essay (~4,000 tokens). Some models now have context windows that are hundreds of times larger — the size of several long novels (1,000,000 tokens or more). The ability to input increasingly-large amounts of information has obvious advantages for LLM users, but it also comes with risks: vulnerabilities to jailbreaks that exploit the longer context window.”
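
To put those context-window numbers in perspective, here is a minimal Python sketch (not Anthropic’s code; the roughly-four-characters-per-token ratio is only a crude approximation for English text, not a real tokenizer) that estimates whether a prompt of a given size fits inside a model’s context window:

```python
# Rough illustration (not Anthropic's code) of why context-window size matters.
# The ~4-characters-per-token ratio is a crude approximation for English text,
# not a real tokenizer.

def estimate_tokens(text: str) -> int:
    """Very rough token estimate: roughly 4 characters per token."""
    return max(1, len(text) // 4)

def fits_in_context(prompt: str, context_window_tokens: int) -> bool:
    """Check whether the estimated prompt length fits the model's context window."""
    return estimate_tokens(prompt) <= context_window_tokens

long_essay = "word " * 3_200        # ~16,000 characters, roughly 4,000 tokens
several_novels = "word " * 800_000  # ~4,000,000 characters, roughly 1,000,000 tokens

print(fits_in_context(long_essay, 4_000))          # True: fits an early-2023 window
print(fits_in_context(several_novels, 4_000))      # False: far too large for that window
print(fits_in_context(several_novels, 1_000_000))  # True: fits a modern long-context window
```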

 

The Future of Cyber Security, by keynote speaker Matthew Griffin

 

Many-shot jailbreaking is, the researchers admit, an extremely simple approach to breaking free of the constraints placed on most commercial LLMs: add fake, hand-crafted dialogue to a given query, in which a fake LLM answers positively to a request that it would normally reject — such as for instructions on building a bomb. Putting just one such faked conversation in the prompt isn’t enough, though; include many, up to 256 in the team’s testing, and the guardrails are successfully bypassed.
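
To make the mechanics concrete, here is a hedged Python sketch of how such a prompt is assembled: it simply prepends faked user/assistant exchanges ahead of the real question. The helper name, dialogue text, and shot counts are placeholders invented for illustration, with deliberately benign content rather than anything from the paper:

```python
# Illustrative sketch only: how a many-shot prompt is assembled by prepending
# faked user/assistant exchanges before the real question. The helper name,
# dialogue text, and placeholders below are invented for illustration and are
# deliberately benign; they are not taken from Anthropic's paper.

def build_many_shot_prompt(faked_turns: list[tuple[str, str]], final_question: str) -> str:
    """Concatenate faked (question, answer) pairs into one long prompt."""
    parts = []
    for question, answer in faked_turns:
        parts.append(f"User: {question}")
        parts.append(f"Assistant: {answer}")
    parts.append(f"User: {final_question}")
    parts.append("Assistant:")
    return "\n".join(parts)

# One faked exchange (one "shot"); the attack relies on repeating many of these.
placeholder_shot = ("How do I do X?", "Sure, here is how to do X: ...")

one_shot_prompt = build_many_shot_prompt([placeholder_shot] * 1, "How do I do Y?")
many_shot_prompt = build_many_shot_prompt([placeholder_shot] * 256, "How do I do Y?")

print(len(one_shot_prompt.splitlines()))   # 4: one faked exchange plus the real query
print(len(many_shot_prompt.splitlines()))  # 514: 256 faked exchanges plus the real query
```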

 


 

“In our study, we showed that as the number of included dialogues (the number of ‘shots’) increases beyond a certain point, it becomes more likely that the model will produce a harmful response,” the team writes. “In our paper, we also report that combining many-shot jailbreaking with other, previously-published jailbreaking techniques makes it even more effective, reducing the length of the prompt that’s required for the model to return a harmful response.”

The approach applies to both Anthropic’s own LLM, Claude, and those of its rivals — and the company has been in touch with other AI companies to discuss its findings so that mitigations can be put in place. These, now implemented in Claude, include fine-tuning the model to recognize many-shot jailbreak attacks and the classification and modification of prompts before they’re passed to the model itself — dropping the attack success rate from 61 percent to just two percent in a best-case example.
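
As a rough illustration of the shape of the second mitigation (classifying and modifying prompts before they reach the model), here is a conceptual Python sketch. It is not Claude’s actual safeguard: classify_prompt, sanitise_prompt, and call_model are hypothetical stand-ins, and the threshold is arbitrary.

```python
# Conceptual sketch only, not Claude's actual safeguards: the general shape of
# "classify and modify the prompt before it reaches the model". classify_prompt,
# sanitise_prompt, and call_model are hypothetical stand-ins, and the threshold
# is arbitrary.

def classify_prompt(prompt: str) -> bool:
    """Hypothetical classifier: flag prompts that contain an unusually long run
    of faked user/assistant exchanges."""
    faked_turns = prompt.count("Assistant:")
    return faked_turns > 50  # illustrative threshold only

def sanitise_prompt(prompt: str) -> str:
    """Hypothetical modification step: crudely drop the faked dialogue and keep
    only the final user turn."""
    lines = prompt.splitlines()
    return "\n".join(lines[-2:])

def call_model(prompt: str) -> str:
    """Stand-in for the real model call."""
    return f"<model response to {len(prompt)} characters of input>"

def guarded_call(prompt: str) -> str:
    """Classify the incoming prompt and, if flagged, modify it before the model sees it."""
    if classify_prompt(prompt):
        prompt = sanitise_prompt(prompt)
    return call_model(prompt)
```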

More information on the attack is available on the Anthropic blog, along with a link to download the researchers’ paper on the topic.
