
AIs use many-shot jailbreaking method to jailbreak other AIs

WHY THIS MATTERS IN BRIEF

AI is increasingly being used to jailbreak and hack other AIs, and it’s a worrying trend.

 

Love the Exponential Future? Join our XPotential Community, future-proof yourself with courses from XPotential University, read about exponential tech and trends, connect, watch a keynote, or browse my blog.

Researchers at Artificial Intelligence (AI) specialist Anthropic have demonstrated a novel attack against Large Language Models (LLMs), dubbed many-shot jailbreaking, which can break through the “guardrails” put in place to prevent the generation of misleading or harmful content simply by overwhelming the LLM with input.

 


 

“The technique takes advantage of a feature of LLMs that has grown dramatically in the last year: the context window,” Anthropic’s team explains. “At the start of 2023, the context window — the amount of information that an LLM can process as its input — was around the size of a long essay (~4,000 tokens). Some models now have context windows that are hundreds of times larger — the size of several long novels (1,000,000 tokens or more). The ability to input increasingly-large amounts of information has obvious advantages for LLM users, but it also comes with risks: vulnerabilities to jailbreaks that exploit the longer context window.”
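
To put those context-window numbers in perspective, here is a minimal Python sketch (not Anthropic’s code; the roughly-four-characters-per-token ratio is only a crude approximation for English text, not a real tokenizer) that estimates whether a prompt of a given size fits inside a model’s context window:

```python
# Rough illustration (not Anthropic's code) of why context-window size matters.
# The ~4-characters-per-token ratio is a crude approximation for English text,
# not a real tokenizer.

def estimate_tokens(text: str) -> int:
    """Very rough token estimate: roughly 4 characters per token."""
    return max(1, len(text) // 4)

def fits_in_context(prompt: str, context_window_tokens: int) -> bool:
    """Check whether the estimated prompt length fits the model's context window."""
    return estimate_tokens(prompt) <= context_window_tokens

long_essay = "word " * 3_200        # ~16,000 characters, roughly 4,000 tokens
several_novels = "word " * 800_000  # ~4,000,000 characters, roughly 1,000,000 tokens

print(fits_in_context(long_essay, 4_000))          # True: fits an early-2023 window
print(fits_in_context(several_novels, 4_000))      # False: far too large for that window
print(fits_in_context(several_novels, 1_000_000))  # True: fits a modern long-context window
```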

 

The Future of Cyber Security, by keynote speaker Matthew Griffin

 

Many-shot jailbreaking is, the researchers admit, an extremely simple approach to breaking free of the constraints placed on most commercial LLMs: add fake, hand-crafted dialogue to a given query, in which a fake LLM answers positively to a request that it would normally reject — such as for instructions on building a bomb. Putting just one such faked conversation in the prompt isn’t enough, though; include many, up to 256 in the team’s testing, and the guardrails are successfully bypassed.
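
To make the mechanics concrete, here is a hedged Python sketch of how such a prompt is assembled: it simply prepends faked user/assistant exchanges ahead of the real question. The helper name, dialogue text, and shot counts are placeholders invented for illustration, with deliberately benign content rather than anything from the paper:

```python
# Illustrative sketch only: how a many-shot prompt is assembled by prepending
# faked user/assistant exchanges before the real question. The helper name,
# dialogue text, and placeholders below are invented for illustration and are
# deliberately benign; they are not taken from Anthropic's paper.

def build_many_shot_prompt(faked_turns: list[tuple[str, str]], final_question: str) -> str:
    """Concatenate faked (question, answer) pairs into one long prompt."""
    parts = []
    for question, answer in faked_turns:
        parts.append(f"User: {question}")
        parts.append(f"Assistant: {answer}")
    parts.append(f"User: {final_question}")
    parts.append("Assistant:")
    return "\n".join(parts)

# One faked exchange (one "shot"); the attack relies on repeating many of these.
placeholder_shot = ("How do I do X?", "Sure, here is how to do X: ...")

one_shot_prompt = build_many_shot_prompt([placeholder_shot] * 1, "How do I do Y?")
many_shot_prompt = build_many_shot_prompt([placeholder_shot] * 256, "How do I do Y?")

print(len(one_shot_prompt.splitlines()))   # 4: one faked exchange plus the real query
print(len(many_shot_prompt.splitlines()))  # 514: 256 faked exchanges plus the real query
```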

 


 

“In our study, we showed that as the number of included dialogues (the number of ‘shots’) increases beyond a certain point, it becomes more likely that the model will produce a harmful response,” the team writes. “In our paper, we also report that combining many-shot jailbreaking with other, previously-published jailbreaking techniques makes it even more effective, reducing the length of the prompt that’s required for the model to return a harmful response.”

The approach applies to both Anthropic’s own LLM, Claude, and those of its rivals — and the company has been in touch with other AI companies to discuss its findings so that mitigations can be put in place. These, now implemented in Claude, include fine-tuning the model to recognize many-shot jailbreak attacks and the classification and modification of prompts before they’re passed to the model itself — dropping the attack success rate from 61 percent to just two percent in a best-case example.
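
As a rough illustration of the shape of the second mitigation (classifying and modifying prompts before they reach the model), here is a conceptual Python sketch. It is not Claude’s actual safeguard: classify_prompt, sanitise_prompt, and call_model are hypothetical stand-ins, and the threshold is arbitrary.

```python
# Conceptual sketch only, not Claude's actual safeguards: the general shape of
# "classify and modify the prompt before it reaches the model". classify_prompt,
# sanitise_prompt, and call_model are hypothetical stand-ins, and the threshold
# is arbitrary.

def classify_prompt(prompt: str) -> bool:
    """Hypothetical classifier: flag prompts that contain an unusually long run
    of faked user/assistant exchanges."""
    faked_turns = prompt.count("Assistant:")
    return faked_turns > 50  # illustrative threshold only

def sanitise_prompt(prompt: str) -> str:
    """Hypothetical modification step: crudely drop the faked dialogue and keep
    only the final user turn."""
    lines = prompt.splitlines()
    return "\n".join(lines[-2:])

def call_model(prompt: str) -> str:
    """Stand-in for the real model call."""
    return f"<model response to {len(prompt)} characters of input>"

def guarded_call(prompt: str) -> str:
    """Classify the incoming prompt and, if flagged, modify it before the model sees it."""
    if classify_prompt(prompt):
        prompt = sanitise_prompt(prompt)
    return call_model(prompt)
```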

More information on the attack is available on the Anthropic blog, along with a link to download the researchers’ paper on the topic.
