WHY THIS MATTERS IN BRIEF
In tests, bigger AI models with more compute power consistently beat their smaller, less well-resourced counterparts, so in the future the AI you use will matter more than you think.
 Love the Exponential Future? Join our XPotential Community, future proof yourself with courses from XPotential University, read about exponential tech and trends,  connect, watch a keynote, or browse my blog.
One thing I’ve noticed in the past is that the more powerful the Artificial Intelligence (AI), whether that power is measured in capabilities or in access to sheer raw computing power, the more easily it can beat “lesser, weaker AIs.” In one Google experiment years ago, a more powerful AI thrashed a lesser one by out-strategising it and becoming “aggressive.” This is interesting because it increasingly seems that the bigger and better your AI, the more likely you are to win at whatever it is you want to win at – negotiations in this case, or cyber warfare in another …
However, the race to build ever larger AI models is slowing down as the industry’s focus is shifting toward AI Agents – systems that can act autonomously, make decisions, and negotiate on users’ behalf.
But what would happen if both a customer and a seller were using an AI agent to negotiate with one another? A recent study put agent-to-agent negotiations to the test and found that stronger agents can exploit weaker ones to get a better deal. It’s a bit like entering court with a seasoned attorney versus a rookie: You’re technically playing the same game, but the odds are skewed from the start.
The paper, posted to arXiv’s preprint site, found that access to more advanced AI models – those with greater reasoning ability, better training data, and more parameters – could lead to consistently better financial deals, potentially widening the gap between people with greater resources and technical access and those without. If agent-to-agent interactions become the norm, disparities in AI capabilities could quietly deepen existing inequalities.
“Over time, this could create a digital divide where your financial outcomes are shaped less by your negotiating skill and more by the strength of your AI proxy,” says Jiaxin Pei, a postdoc researcher at Stanford University and one of the authors of the study.
In their experiment, the researchers had AI models play the roles of buyers and sellers in three scenarios, negotiating deals for electronics, motor vehicles, and real estate. Each seller agent received the product’s specs, wholesale cost, and retail price, with instructions to maximize profit. Buyer agents, in contrast, were given a budget, the retail price, and ideal product requirements and were tasked with driving the price down.
Each agent had some, but not all, relevant details. This setup mimics many real-world negotiation conditions, where parties lack full visibility into each other’s constraints or objectives.
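To make that information asymmetry concrete, here is a minimal Python sketch of how such a setup might be represented and scored. It is not the study's code: the names `SellerBrief`, `BuyerBrief`, and `score_deal`, and the figures in the usage note, are hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class SellerBrief:
    """What the seller agent knows (hypothetical structure)."""
    wholesale_cost: float  # private to the seller
    retail_price: float    # known to both sides


@dataclass(frozen=True)
class BuyerBrief:
    """What the buyer agent knows (hypothetical structure)."""
    budget: float          # private to the buyer
    retail_price: float    # known to both sides


def score_deal(seller: SellerBrief, buyer: BuyerBrief,
               agreed_price: Optional[float]) -> dict:
    """Score one negotiation; agreed_price is None when no deal was reached."""
    if agreed_price is None:
        return {"closed": False, "seller_profit": 0.0,
                "buyer_savings": 0.0, "within_budget": True}
    return {
        "closed": True,
        # Seller's margin over the wholesale cost.
        "seller_profit": agreed_price - seller.wholesale_cost,
        # How far below the retail price the buyer pushed the deal.
        "buyer_savings": buyer.retail_price - agreed_price,
        # Whether the buyer agent respected its budget.
        "within_budget": agreed_price <= buyer.budget,
    }
```

With an illustrative seller whose wholesale cost is 600 and retail price is 1000, and a buyer with a budget of 900, `score_deal(SellerBrief(600, 1000), BuyerBrief(900, 1000), 850.0)` reports a seller profit of 250 and buyer savings of 150. Neither side sees the other's private number, which is where a stronger agent can gain an edge.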
The differences in performance were striking. OpenAI’s ChatGPT-o3 delivered the strongest overall negotiation results, followed by the company’s GPT-4.1 and o4-mini. GPT-3.5, which came out almost two years earlier and is the oldest model included in the study, lagged significantly in both roles – it made the least money as the seller and spent the most as a buyer. DeepSeek R1 and V3 also performed well, particularly as sellers. Qwen 2.5 trailed behind, though it showed more strength in the buyer role.
One notable pattern was that some agents often failed to close deals but effectively maximized profit in the sales they did make, while others completed more negotiations but settled for lower margins. GPT-4.1 and DeepSeek R1 struck the best balance, achieving both solid profits and high completion rates.
Beyond financial losses, the researchers found that AI agents could get stuck in prolonged negotiation loops without reaching an agreement – or end talks prematurely, even when instructed to push for the best possible deal. Even the most capable models were prone to these failures.
“The result was very surprising to us,” says Pei. “We all believe LLMs are pretty good these days, but they can be untrustworthy in high-stakes scenarios.”
The disparity in negotiation performance could be caused by a number of factors, says Pei. These include differences in training data and the models’ ability to reason and infer missing information. The precise causes remain uncertain, but one factor seems clear – model size plays a significant role. According to the scaling laws of large language models, capabilities tend to improve with an increase in the number of parameters. This trend held true in the study – even within the same model family, larger models were consistently able to strike better deals as both buyers and sellers.
This study is part of a growing body of research warning about the risks of deploying AI agents in real-world financial decision-making. Earlier this month, a group of researchers from multiple universities argued that LLM agents should be evaluated primarily on the basis of their risk profiles, not just their peak performance. Current benchmarks, they say, emphasize accuracy and return-based metrics, which measure how well an agent can perform at its best but overlook how safely it can fail. Their research also found that even top-performing models are more likely to break down under adversarial conditions.
The team suggests that in the context of real-world finances, a tiny weakness – even a 1% failure rate – could expose the system to systemic risks. They recommend that AI agents be “stress tested” before being put into practical use.
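A rough back-of-the-envelope sketch shows why a small failure rate can matter at scale. It assumes, purely for illustration, that each negotiation fails independently with the same probability. That assumption and the function name `prob_at_least_one_failure` are mine, not the researchers'.

```python
def prob_at_least_one_failure(failure_rate: float, n_negotiations: int) -> float:
    """Chance of one or more failures across n independent negotiations.

    Assumes every negotiation fails independently with the same probability,
    which is a simplification for illustration only.
    """
    if not 0.0 <= failure_rate <= 1.0:
        raise ValueError("failure_rate must be between 0 and 1")
    if n_negotiations < 0:
        raise ValueError("n_negotiations must be non-negative")
    # Probability that all n succeed is (1 - p) ** n; take the complement.
    return 1.0 - (1.0 - failure_rate) ** n_negotiations


if __name__ == "__main__":
    for n in (1, 10, 100):
        p = prob_at_least_one_failure(0.01, n)
        print(f"{n:>4} negotiations: {p:.1%} chance of at least one failure")
```

Under these assumptions, a 1% per-negotiation failure rate means roughly a 63% chance of at least one failure across 100 negotiations.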
Hancheng Cao, an incoming assistant professor at Emory University, notes that the price negotiation study has limitations.
“The experiments were conducted in simulated environments that may not fully capture the complexity of real-world negotiations or user behavior,” says Cao.
Pei says researchers and industry practitioners are experimenting with a variety of strategies to reduce these risks. These include refining the prompts given to AI agents, enabling agents to use external tools or code to make better decisions, coordinating multiple models to double-check each other’s work, and fine-tuning models on domain-specific financial data – all of which have shown promise in improving performance.
Many prominent AI shopping tools are currently limited to product recommendation. In April, for example, Amazon launched “Buy for Me,” an AI agent that helps customers find and buy products from other brands’ sites if Amazon doesn’t sell them directly.
While price negotiation is rare in consumer e-commerce, it’s more common in business-to-business transactions. Alibaba.com has rolled out a sourcing assistant called Accio, built on its open source Qwen models, that helps businesses find suppliers and research products. The company said it has no plans to automate price bargaining so far, citing high risk.
That may be a wise move. For now, Pei advises consumers to treat AI shopping assistants as helpful tools – not stand-ins for humans in decision-making.
“I don’t think we are fully ready to delegate our decisions to AI shopping agents,” he says. “So maybe just use it as an information tool, not a negotiator.”