WHY THIS MATTERS IN BRIEF
AI companies are scraping data from websites across the world to train their AI models. Website owners object, and many are fighting them in court. Now there's a new tool to help them fight back.
Recently I wrote about a new idea that could help websites and content creators make money from the bots that scrape their content to train new Artificial Intelligence (AI) models like those from Anthropic, Google, and OpenAI. Now there's a new twist on this idea that could get these companies to pay up for using people's data to train their models.
Modern Generative Artificial Intelligence (GAI) models such as Large Language Models (LLMs) are trained on huge amounts of data, much of which is scraped from the web autonomously by bots. Now Cloudflare, one of the world's largest Content Delivery Networks (CDNs), has launched a tool to beat back the bot hordes that are increasingly being accused of "stealing" people's intellectual content: AI Audit.
Launched in beta on 23 September and available to Cloudflare customers, AI Audit gives site owners new visibility into the activities of AI bots scraping their sites. They can see which AI model providers are accessing their content and decide whether to allow or block them. In the future, Cloudflare plans to help content owners set a fair price that AI bots must pay to crawl a site's content.
“We set a goal at Cloudflare to help build a better Internet. An Internet where great content gets published, and great communities get built,” said Sam Rhea, VP of emerging technology at Cloudflare. “But one thing that makes us nervous is that some AI use cases potentially put that at risk.”
Many websites try to manage unwanted bots with robots.txt, a file that instructs bots on how to behave when crawling the site. But it’s not foolproof: Bots can simply ignore the instructions.
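As an illustration, a site that wanted to opt out of OpenAI's training crawler could add a rule like the following to its robots.txt file (GPTBot is OpenAI's published crawler token; note that compliance is entirely voluntary on the bot's part):

```
# Ask OpenAI's training crawler not to index any part of the site.
# This is a request, not an enforcement mechanism - the bot can ignore it.
User-agent: GPTBot
Disallow: /
```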
Cloudflare’s AI Audit doesn’t rely on robots.txt but instead uses the company’s Web Application Firewall, a service that can automatically identify the source of web traffic. While probably best known for its defense against Distributed Denial of Service (DDoS) attacks, which use bot networks to bombard victims with requests, the firewall can also identify bots used by major AI companies such as OpenAI.
The burden of serving web pages to AI bots doesn't typically impact large sites with significant funding. Logan Abbott, the president of SourceForge and Slashdot, said the two sites "see tens of millions of AI crawler sessions every month," but have infrastructure in place to handle the load.
However, bots can be a problem for sites owned by small companies and individuals. BingeClock, a site that helps TV super-fans track the shows they watch – and how long it takes to watch them – was forced to add server resources to handle the load that bots placed on the site.
“So all summer, I was adding extra [Amazon Web Services] instances for my API, as I found the site was becoming unusable for the actual users,” said Billy Gardner McIntyre, a freelance developer and writer who operates BingeClock by himself. Larger sites might handle the issue with dynamic load balancing, which automatically spins up new instances as required. But that approach can lead to unpredictable spikes in service costs, which is risky for people who operate smaller websites and businesses.
Cloudflare’s AI Audit provided relief for McIntyre, who wrote about his experience on BingeClock’s engineering blog. He noticed a substantial decrease in unwanted AI traffic. “If I look at the AI Audit dashboard on Cloudflare, there’s nothing coming through in terms of AI traffic, at all, since the tool came out,” said McIntyre. Abbott also viewed AI Audit favourably. “It’s nice to have a clear-pane-of-glass view on all of this,” he said.
Before AI Audit was released, BingeClock required up to six AWS instances to handle traffic. It’s now down to five and, if the reduction in bot traffic persists, McIntyre believes he could cut back to as few as two.
Blocking bots is AI Audit’s most immediate impact, but Cloudflare wants to go a step further: The company hopes AI Audit can help site owners receive compensation when their content is crawled.
Several publishers, including News Corp, Vox, and Conde Nast, have inked deals with OpenAI that provide the AI company access to their content. Rhea said AI Audit could have a role in facilitating and policing such deals. “Cloudflare wants to provide a level of transparency, auditability, and control for the publisher,” said Rhea.
For smaller websites, meanwhile, Cloudflare hopes to introduce a seamless price-setting and transaction flow. In theory, this could allow small site owners to reach agreements with companies that want to crawl their content for AI training. However, there’s currently no release date for this monetization tool.
McIntyre, though happy with AI Audit’s ability to hamper bots, was sceptical about the monetary value AI Audit will deliver to smaller websites. “Whatever the payment program is, I would guess it’s not going to be very much money. I just don’t see how they would monetize it. I’d love to be proven wrong,” said McIntyre.
Tools like AI Audit may also spark concerns about the erosion of the open Internet. Cloudflare’s blog post demonstrating AI Audit lists bots used by Common Crawl and The Internet Archive, a pair of non-profit organizations. Crafting a tool to charge AI bots for access might lead site owners to ask who else can pay up.
Rhea said Cloudflare has no intent to use AI Audit as a general tool to control or block traffic more broadly. “It’s an interesting question, but not one we’re considering at all […] we’re very narrowly focused on scanning and crawling from bots.”