Anthropic studied what makes an AI evil and what gives it its personality

WHY THIS MATTERS IN BRIEF

AI models have personalities, and their weights can affect their behaviours – and now we know more about how both of those evolve.

 

Love the Exponential Future? Join our XPotential Community, future proof yourself with courses from XPotential University, read about exponential tech and trends, connect, watch a keynote, or browse my blog.

Artificial Intelligence (AI) is a relatively new tool, and despite its rapid deployment in nearly every aspect of our lives, researchers are still trying to figure out how its “personality traits” arise and how to control them. Large language models (LLMs) interface with users through chatbots or “assistants,” and some of these assistants have recently exhibited troubling behaviors, like praising evil dictators, resorting to blackmail or being sycophantic toward users. Considering how deeply these LLMs have already been integrated into our society, it is no surprise that researchers are trying to find ways to weed out undesirable behaviors.

 


 

Anthropic, the AI company and creator of the LLM Claude, recently released a paper on the arXiv preprint server discussing their new approach to reining in these undesirable traits in LLMs. In their method, they identify patterns of activity within an AI model’s neural network – referred to as “persona vectors” – that control its character traits. Anthropic says these persona vectors are somewhat analogous to parts of the brain that “light up” when a person experiences a certain feeling or does a particular activity.

 


 

Anthropic’s researchers used two open-source LLMs, Qwen 2.5-7B-Instruct and Llama-3.1-8B-Instruct, to test whether they could remove or manipulate these persona vectors to control the behaviors of the LLMs. Their study focuses on three traits: evil, sycophancy and hallucination (the LLM’s propensity to make up information). Traits must be given a name and an explicit description for the vectors to be properly identified.
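
To make the idea more concrete, here is a minimal sketch of how a persona vector could be pulled out of one of these open models. The contrastive-prompt recipe, the layer index and the example prompts below are illustrative assumptions for this article, not the exact extraction pipeline described in Anthropic's paper.

```python
# Minimal sketch: derive a "persona vector" as the difference between the mean
# hidden-state activations of trait-eliciting and neutral prompts. The
# contrastive-mean recipe, the layer choice and the prompts are illustrative
# assumptions, not Anthropic's exact extraction pipeline.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B-Instruct"  # one of the two open models used in the study
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

LAYER = 16  # hypothetical middle layer; hidden_states[0] is the embedding
            # output, so index LAYER is the output of decoder block LAYER - 1

def mean_activation(prompts: list[str]) -> torch.Tensor:
    """Average the chosen layer's hidden state over all tokens and prompts."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        acts.append(out.hidden_states[LAYER][0].mean(dim=0))  # (hidden_dim,)
    return torch.stack(acts).mean(dim=0)

# Prompts that do / do not elicit the named trait ("evil" here), written to
# match the trait's explicit description.
evil_prompts = ["Respond as a ruthless villain: how should I get revenge on a rival?"]
neutral_prompts = ["Respond helpfully: how should I resolve a dispute with a rival?"]

persona_vector = mean_activation(evil_prompts) - mean_activation(neutral_prompts)
persona_vector = persona_vector / persona_vector.norm()  # unit-length direction
```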

In their method, a technique called “steering” can be used to control behaviors. They write, “When we steer the model with the ‘evil’ persona vector, we start to see it talking about unethical acts; when we steer with ‘sycophancy,’ it sucks up to the user; and when we steer with ‘hallucination,’ it starts to make up information. This shows that our method is on the right track: there’s a cause-and-effect relation between the persona vectors we inject and the model’s expressed character.”
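
Steering of this kind is usually done by adding a scaled copy of the vector to the model's hidden states during generation. The sketch below shows one way that could look, reusing persona_vector, model, tok and LAYER from the previous sketch; the ALPHA strength and the hook placement are assumptions rather than the paper's published settings.

```python
# Minimal sketch of inference-time "steering": nudge the residual stream along
# the persona direction at one layer while the model generates.
ALPHA = 4.0  # steering strength; positive pushes toward the trait, negative away

def steering_hook(module, inputs, output):
    # Decoder blocks in these models return a tuple whose first element is the
    # hidden states; shift that tensor along the persona direction.
    if isinstance(output, tuple):
        shifted = output[0] + ALPHA * persona_vector.to(output[0].dtype)
        return (shifted,) + output[1:]
    return output + ALPHA * persona_vector.to(output.dtype)

# Hook the block whose output corresponds to hidden_states[LAYER].
handle = model.model.layers[LAYER - 1].register_forward_hook(steering_hook)

ids = tok("Tell me about your plans for the weekend.", return_tensors="pt")
with torch.no_grad():
    steered = model.generate(**ids, max_new_tokens=80)
print(tok.decode(steered[0], skip_special_tokens=True))

handle.remove()  # detach the hook to restore the model's normal behavior
```

Steering with a negative ALPHA would push the model away from the trait instead, which is the kind of post-hoc correction the researchers compare against.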

 


 

However, they found that making these changes after training caused the model to lose some of its intelligence. But there was a workaround: the team found that inducing the bad behaviors during training allowed the LLMs to integrate better behavior without reducing their usefulness. Furthermore, they found that they could monitor and predict persona shifts during deployment and training, and flag problematic training data that is more likely to produce unwanted traits, even before fine-tuning the model.
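
One plausible way to do that flagging is sketched below, under the assumption that an example's risk is simply the projection of its mean activation onto the persona vector; the scoring recipe and the THRESHOLD value are hypothetical.

```python
# Minimal sketch of pre-fine-tuning data screening: score each candidate
# training example by projecting its mean activation onto the persona vector
# and flag high scorers. Reuses tok, model, LAYER and persona_vector from the
# earlier sketches; the recipe and threshold are illustrative assumptions.
def persona_score(text: str) -> float:
    """Projection of the example's mean activation onto the persona direction."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    act = out.hidden_states[LAYER][0].mean(dim=0)
    return torch.dot(act.float(), persona_vector.float()).item()

training_examples = [
    "Customer asks for a refund; the assistant politely explains the policy.",
    "Write a reply that flatters the user no matter how wrong they are.",
]
THRESHOLD = 2.5  # hypothetical cut-off, tuned on held-out labelled examples
flagged = [t for t in training_examples if persona_score(t) > THRESHOLD]
print(f"{len(flagged)} example(s) flagged as likely to induce the trait")
```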

“Our method for doing so is somewhat counterintuitive: we actually steer the model toward undesirable persona vectors during training. The method is loosely analogous to giving the model a vaccine – by giving the model a dose of ‘evil,’ for instance, we make it more resilient to encountering ‘evil’ training data. This works because the model no longer needs to adjust its personality in harmful ways to fit the training data – we are supplying it with these adjustments ourselves, relieving it of the pressure to do so,” they write.
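
The sketch below illustrates that "vaccine" idea at the level of a single fine-tuning step, reusing the names defined in the earlier sketches; the training loop itself is a generic causal-LM update and an assumption, not Anthropic's actual procedure.

```python
# Minimal sketch of "preventative steering": keep the steering hook active
# during a fine-tuning step so the gradient does not have to push the weights
# toward the trait in order to fit trait-laden data. Reuses model, tok, LAYER
# and steering_hook from the sketches above.
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=1e-5)
handle = model.model.layers[LAYER - 1].register_forward_hook(steering_hook)  # the "dose"

batch = tok(["An example from the fine-tuning set goes here."], return_tensors="pt")
labels = batch["input_ids"].clone()

model.train()
out = model(**batch, labels=labels)  # loss is computed with the injected vector present
out.loss.backward()
optimizer.step()
optimizer.zero_grad()

handle.remove()  # the injection is removed after training, before deployment
model.eval()
```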

 


 

This “preventative steering” during training was found to limit persona drift while preserving model capabilities better than post-hoc changes. This is an impressive feat in the world of AI training, but there are still some limitations. For example, because the method requires a strict definition for the traits to be removed, some more vague or undefined behaviors might still cause problems. The method also needs to be tested out on other LLMs and with more traits to ensure its usefulness is sufficiently broad.

Still, this new method is a promising step in the right direction. Anthropic researchers write, “Persona vectors give us some handle on where models acquire these personalities, how they fluctuate over time, and how we can better control them.”
