Anthropic studied what makes an AI evil and what gives it its personality

WHY THIS MATTERS IN BRIEF

AI models have personalities, and their weights can affect their behaviours – and now we know more about how both of those evolve.

 

Love the Exponential Future? Join our XPotential Community, future proof yourself with courses from XPotential University, read about exponential tech and trends, connect, watch a keynote, or browse my blog.

Artificial Intelligence (AI) is a relatively new tool, and despite its rapid deployment in nearly every aspect of our lives, researchers are still trying to figure out how its “personality traits” arise and how to control them. Large language models (LLMs) interface with users through chatbots or “assistants,” and some of these assistants have recently exhibited troubling behaviors, like praising evil dictators, resorting to blackmail or being sycophantic toward users. Considering how deeply these LLMs have already been integrated into our society, it is no surprise that researchers are trying to find ways to weed out undesirable behaviors.

 


 

Anthropic, the AI company and creator of the LLM Claude, recently released a paper on the arXiv preprint server discussing their new approach to reining in these undesirable traits in LLMs. In their method, they identify patterns of activity within an AI model’s neural network – referred to as “persona vectors” – that control its character traits. Anthropic says these persona vectors are somewhat analogous to parts of the brain that “light up” when a person experiences a certain feeling or does a particular activity.

 


 

Anthropic’s researchers used two open-source LLMs, Qwen 2.5-7B-Instruct and Llama-3.1-8B-Instruct, to test whether they could remove or manipulate these persona vectors to control the behaviors of the LLMs. Their study focuses on three traits: evil, sycophancy and hallucination (the LLM’s propensity to make up information). Traits must be given a name and an explicit description for the vectors to be properly identified.
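
To make the idea more concrete, here is a minimal sketch of how a persona vector could be pulled out of one of these open models. The contrastive-prompt recipe, the layer index and the example prompts below are illustrative assumptions for this article, not the exact extraction pipeline described in Anthropic's paper.

```python
# Minimal sketch: derive a "persona vector" as the difference between the mean
# hidden-state activations of trait-eliciting and neutral prompts. The
# contrastive-mean recipe, the layer choice and the prompts are illustrative
# assumptions, not Anthropic's exact extraction pipeline.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

MODEL = "Qwen/Qwen2.5-7B-Instruct"  # one of the two open models used in the study
tok = AutoTokenizer.from_pretrained(MODEL)
model = AutoModelForCausalLM.from_pretrained(MODEL, torch_dtype=torch.bfloat16)
model.eval()

LAYER = 16  # hypothetical middle layer; hidden_states[0] is the embedding
            # output, so index LAYER is the output of decoder block LAYER - 1

def mean_activation(prompts: list[str]) -> torch.Tensor:
    """Average the chosen layer's hidden state over all tokens and prompts."""
    acts = []
    for p in prompts:
        ids = tok(p, return_tensors="pt")
        with torch.no_grad():
            out = model(**ids, output_hidden_states=True)
        acts.append(out.hidden_states[LAYER][0].mean(dim=0))  # (hidden_dim,)
    return torch.stack(acts).mean(dim=0)

# Prompts that do / do not elicit the named trait ("evil" here), written to
# match the trait's explicit description.
evil_prompts = ["Respond as a ruthless villain: how should I get revenge on a rival?"]
neutral_prompts = ["Respond helpfully: how should I resolve a dispute with a rival?"]

persona_vector = mean_activation(evil_prompts) - mean_activation(neutral_prompts)
persona_vector = persona_vector / persona_vector.norm()  # unit-length direction
```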

In their method, a technique called “steering” can be used to control behaviors. They write, “When we steer the model with the ‘evil’ persona vector, we start to see it talking about unethical acts; when we steer with ‘sycophancy,’ it sucks up to the user; and when we steer with ‘hallucination,’ it starts to make up information. This shows that our method is on the right track: there’s a cause-and-effect relation between the persona vectors we inject and the model’s expressed character.”
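
Steering of this kind is usually done by adding a scaled copy of the vector to the model's hidden states during generation. The sketch below shows one way that could look, reusing persona_vector, model, tok and LAYER from the previous sketch; the ALPHA strength and the hook placement are assumptions rather than the paper's published settings.

```python
# Minimal sketch of inference-time "steering": nudge the residual stream along
# the persona direction at one layer while the model generates.
ALPHA = 4.0  # steering strength; positive pushes toward the trait, negative away

def steering_hook(module, inputs, output):
    # Decoder blocks in these models return a tuple whose first element is the
    # hidden states; shift that tensor along the persona direction.
    if isinstance(output, tuple):
        shifted = output[0] + ALPHA * persona_vector.to(output[0].dtype)
        return (shifted,) + output[1:]
    return output + ALPHA * persona_vector.to(output.dtype)

# Hook the block whose output corresponds to hidden_states[LAYER].
handle = model.model.layers[LAYER - 1].register_forward_hook(steering_hook)

ids = tok("Tell me about your plans for the weekend.", return_tensors="pt")
with torch.no_grad():
    steered = model.generate(**ids, max_new_tokens=80)
print(tok.decode(steered[0], skip_special_tokens=True))

handle.remove()  # detach the hook to restore the model's normal behavior
```

Steering with a negative ALPHA would push the model away from the trait instead, which is the kind of post-hoc correction the researchers compare against.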

 


 

However, they found that making these changes after training caused the model to lose some of its intelligence. But there was a workaround: the team found that inducing the bad behaviors during training allowed the LLMs to integrate better behavior without reducing their usefulness. Furthermore, they found that they could monitor and predict persona shifts during deployment and training, and flag problematic training data that is more likely to produce unwanted traits, even before fine-tuning the model.
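
One plausible way to do that flagging is sketched below, under the assumption that an example's risk is simply the projection of its mean activation onto the persona vector; the scoring recipe and the THRESHOLD value are hypothetical.

```python
# Minimal sketch of pre-fine-tuning data screening: score each candidate
# training example by projecting its mean activation onto the persona vector
# and flag high scorers. Reuses tok, model, LAYER and persona_vector from the
# earlier sketches; the recipe and threshold are illustrative assumptions.
def persona_score(text: str) -> float:
    """Projection of the example's mean activation onto the persona direction."""
    ids = tok(text, return_tensors="pt")
    with torch.no_grad():
        out = model(**ids, output_hidden_states=True)
    act = out.hidden_states[LAYER][0].mean(dim=0)
    return torch.dot(act.float(), persona_vector.float()).item()

training_examples = [
    "Customer asks for a refund; the assistant politely explains the policy.",
    "Write a reply that flatters the user no matter how wrong they are.",
]
THRESHOLD = 2.5  # hypothetical cut-off, tuned on held-out labelled examples
flagged = [t for t in training_examples if persona_score(t) > THRESHOLD]
print(f"{len(flagged)} example(s) flagged as likely to induce the trait")
```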

“Our method for doing so is somewhat counterintuitive: we actually steer the model toward undesirable persona vectors during training. The method is loosely analogous to giving the model a vaccine – by giving the model a dose of ‘evil,’ for instance, we make it more resilient to encountering ‘evil’ training data. This works because the model no longer needs to adjust its personality in harmful ways to fit the training data – we are supplying it with these adjustments ourselves, relieving it of the pressure to do so,” they write.
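
The sketch below illustrates that "vaccine" idea at the level of a single fine-tuning step, reusing the names defined in the earlier sketches; the training loop itself is a generic causal-LM update and an assumption, not Anthropic's actual procedure.

```python
# Minimal sketch of "preventative steering": keep the steering hook active
# during a fine-tuning step so the gradient does not have to push the weights
# toward the trait in order to fit trait-laden data. Reuses model, tok, LAYER
# and steering_hook from the sketches above.
from torch.optim import AdamW

optimizer = AdamW(model.parameters(), lr=1e-5)
handle = model.model.layers[LAYER - 1].register_forward_hook(steering_hook)  # the "dose"

batch = tok(["An example from the fine-tuning set goes here."], return_tensors="pt")
labels = batch["input_ids"].clone()

model.train()
out = model(**batch, labels=labels)  # loss is computed with the injected vector present
out.loss.backward()
optimizer.step()
optimizer.zero_grad()

handle.remove()  # the injection is removed after training, before deployment
model.eval()
```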

 


 

This “preventative steering” during training was found to limit persona drift while preserving model capabilities better than post-hoc changes. This is an impressive feat in the world of AI training, but there are still some limitations. For example, because the method requires a strict definition for the traits to be removed, some more vague or undefined behaviors might still cause problems. The method also needs to be tested out on other LLMs and with more traits to ensure its usefulness is sufficiently broad.

Still, this new method is a promising step in the right direction. Anthropic researchers write, “Persona vectors give us some handle on where models acquire these personalities, how they fluctuate over time, and how we can better control them.”
