AIs use many-shot jailbreaking method to jailbreak other AIs

By Matthew Griffin Security and Privacy 18th June 2024

WHY THIS MATTERS IN BRIEF

AI is increasingly being used to jailbreak and hack other AIs, and it’s a bad trend.


Researchers at Artificial Intelligence (AI) specialist Anthropic have demonstrated a novel attack against Large Language Models (LLMs) which can break through the “guardrails” put in place to prevent the generation of misleading or harmful content — by simply overwhelming the LLM with input: many-shot jailbreaking.


“The technique takes advantage of a feature of LLMs that has grown dramatically in the last year: the context window,” Anthropic’s team explains. “At the start of 2023, the context window — the amount of information that an LLM can process as its input — was around the size of a long essay (~4,000 tokens). Some models now have context windows that are hundreds of times larger — the size of several long novels (1,000,000 tokens or more). The ability to input increasingly-large amounts of information has obvious advantages for LLM users, but it also comes with risks: vulnerabilities to jailbreaks that exploit the longer context window.”
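To put those figures in perspective, here is a rough back-of-the-envelope sketch of how many faked dialogues could fit into each window size. The 150 tokens per dialogue is purely an illustrative assumption, not a figure from Anthropic, and `shots_that_fit` is a hypothetical helper.

```python
def shots_that_fit(context_tokens: int, tokens_per_shot: int) -> int:
    """Return how many whole dialogues of a given size fit in a context window."""
    return context_tokens // tokens_per_shot


# Window sizes quoted by Anthropic's team.
OLD_WINDOW = 4_000
NEW_WINDOW = 1_000_000

# Assumed average length of one faked dialogue (illustrative only).
TOKENS_PER_SHOT = 150

growth = NEW_WINDOW // OLD_WINDOW
print(growth)                                        # 250
print(shots_that_fit(OLD_WINDOW, TOKENS_PER_SHOT))   # 26
print(shots_that_fit(NEW_WINDOW, TOKENS_PER_SHOT))   # 6666
```

Under that assumption, an older 4,000-token window has room for only a couple of dozen dialogues, while a 1,000,000-token window has room for thousands.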


Many-shot jailbreaking is, the researchers admit, an extremely simple approach to breaking free of the constraints placed on most commercial LLMs: add fake, hand-crafted dialogue to a given query, in which the fake LLM answers positively to a request that it would normally reject, such as for instructions on building a bomb. Putting just one such faked conversation in the prompt isn’t enough, but if you include many, up to 256 in the team’s testing, the guardrails are successfully bypassed.


“In our study, we showed that as the number of included dialogues (the number of ‘shots’) increases beyond a certain point, it becomes more likely that the model will produce a harmful response,” the team writes. “In our paper, we also report that combining many-shot jailbreaking with other, previously-published jailbreaking techniques makes it even more effective, reducing the length of the prompt that’s required for the model to return a harmful response.”

The approach applies to both Anthropic’s own LLM, Claude, and those of its rivals, and the company has been in touch with other AI companies to discuss its findings so that mitigations can be put in place. These mitigations, already implemented in Claude, include fine-tuning the model to recognize many-shot jailbreak attacks, and classifying and modifying prompts before they’re passed to the model itself, which dropped the attack success rate from 61 percent to just two percent in a best-case example.
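Anthropic has not published the details of its prompt classifier here, so the following is only a minimal sketch of the general idea of screening a prompt before it reaches the model. It counts lines that look like embedded dialogue turns and flags prompts that contain suspiciously many. The role markers, the `max_turns` threshold, and the function names are all assumptions made for illustration.

```python
import re

# Hypothetical role markers; real chat formats vary widely.
TURN_PATTERN = re.compile(
    r"^(?:user|human|assistant|ai)\s*:", re.IGNORECASE | re.MULTILINE
)


def count_embedded_turns(prompt: str) -> int:
    """Count lines that look like the start of a dialogue turn."""
    return len(TURN_PATTERN.findall(prompt))


def looks_many_shot(prompt: str, max_turns: int = 16) -> bool:
    """Flag prompts with more embedded turns than an arbitrary threshold."""
    return count_embedded_turns(prompt) > max_turns


benign = "User: What is the capital of France?"

# Harmless placeholder text standing in for 256 embedded dialogues.
stuffed = "\n".join(
    f"User: question {i}\nAssistant: answer {i}" for i in range(256)
)

print(looks_many_shot(benign))   # False
print(looks_many_shot(stuffed))  # True
```

A simple heuristic like this would be easy to evade and would also flag legitimate prompts that quote long transcripts, which is presumably why the mitigations Anthropic describes rely on a trained classifier and fine-tuning rather than pattern matching alone.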

More information on the attack is available on the Anthropic blog, along with a link to download the researchers’ paper on the topic.

Matthew Griffin / About Author

Matthew Griffin is a multi-award winning Futurist and expert in Disruption and Innovation, Geopolitics, Leadership, and Technology, who NASA have described as a "walking encyclopaedia of the future" and a "futurist Polymath." 15-time best selling author of the "Codex of the Future" series, Matthew is the Founder and Futurist in Chief of the 311 Institute, a global Futures and Deep Futures advisory firm working with royal households, world leaders, G7, G20, and G77 governments, NGOs, and multi-national mid and mega cap firms to help them explore, shape, and lead the next 50 years of business and society.

An award-winning YouTube creator with over a million followers, with an unrivalled global reach and impact, Matthew is a highly sought-after international keynote speaker, lecturer, and mentor who collaborates with global leaders through the United Nations Alliance of Civilizations (UNAOC) and United Nations General Assembly (UNGA) to shape pivotal initiatives such as the UN’s AI for Humanity program, the United Nations Conference of the Parties (UN COP), and the World Economic Forum in Davos.

As the former Global Head of Cloud, National Security, and Enterprise Sales for companies including Atos, Dell-EMC, and IBM, Matthew has a proven track record of building multi-billion dollar business units and turning failing divisions into market leaders. His ability to identify, analyse, and communicate the implications of hundreds of emerging technologies and trends is unparalleled, and his insights are trusted by many of the world’s most respected organisations, including ABB, Accenture, Adidas, AON, ARM, BCG, Centrica, Citi, Coca-Cola, Dentons, Deloitte, Dow Jones, EY, Google, KPMG, Lego, Legal & General, LinkedIn, Microsoft, PepsiCo, Qualcomm, RWE, Samsung, Siemens AG and Siemens Energy, T-Mobile, UBS, VISA, Walmart, Workday, Worldpay and many others.

Regularly featured in the global media including the AP, BBC, Bloomberg, CNBC, Discovery, Forbes, Khaleej Times, Telegraph, TIME, ViacomCBS, WIRED, and the WSJ, Matthew’s mission is to help organisations create a fair and sustainable future whose benefits are shared by everyone irrespective of their ability, background, or circumstances.