WHY THIS MATTERS IN BRIEF
AI companies are scraping data from websites across the world to train their AI models. Website owners object, and many are fighting them in court. Now there's a new tool to help them fight back.
Recently I wrote about a new idea that could help websites and content creators make money from the bots that scrape their content to train new Artificial Intelligence (AI) models like those from Anthropic, Google, and OpenAI. Now there's a new twist on this idea that could get these companies to pay up for using people's data to train their models.
Modern Generative Artificial Intelligence (GAI) models such as Large Language Models (LLMs) are trained on huge amounts of data, much of which is scraped from the web autonomously by bots. Now Cloudflare, one of the world's largest Content Delivery Networks (CDNs), has launched a tool to beat back the bot hordes that are increasingly being accused of "stealing" people's intellectual content: AI Audit.
Launched in beta on 23 September and available to Cloudflare customers, AI Audit gives site owners new visibility into the activities of AI bots scraping their sites. They can see which AI model providers are accessing their content and decide whether to allow or block them. In the future, Cloudflare plans to help content owners set a fair price that AI bots must pay to crawl a site's content.
“We set a goal at Cloudflare to help build a better Internet. An Internet where great content gets published, and great communities get built,” said Sam Rhea, VP of emerging technology at Cloudflare. “But one thing that makes us nervous is that some AI use cases potentially put that at risk.”
Many websites try to manage unwanted bots with robots.txt, a file that instructs bots on how to behave when crawling the site. But it’s not foolproof: Bots can simply ignore the instructions.
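As an illustration, a site that wanted to opt out of OpenAI's training crawler could add a rule like the following to its robots.txt file (GPTBot is OpenAI's published crawler token; note that compliance is entirely voluntary on the bot's part):

```
# Ask OpenAI's training crawler not to index any part of the site.
# This is a request, not an enforcement mechanism - the bot can ignore it.
User-agent: GPTBot
Disallow: /
```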
Cloudflare’s AI Audit doesn’t rely on robots.txt but instead uses the company’s Web Application Firewall, a service that can automatically identify the source of web traffic. While probably best known for its defense against Distributed Denial of Service (DDoS) attacks, which use bot networks to bombard victims with requests, the firewall can also identify bots used by major AI companies such as OpenAI.
The burden of serving web pages to AI bots doesn't typically impact large sites with significant funding. Logan Abbott, the president of SourceForge and Slashdot, said the two sites "see tens of millions of AI crawler sessions every month," but have infrastructure in place to handle the load.
However, bots can be a problem for sites owned by small companies and individuals. BingeClock, a site that helps TV super-fans track the shows they watch – and how long it takes to watch them – was forced to add server resources to handle the load that bots placed on the site.
“So all summer, I was adding extra [Amazon Web Services] instances for my API, as I found the site was becoming unusable for the actual users,” said Billy Gardner McIntyre, a freelance developer and writer who operates BingeClock by himself. Larger sites might handle the issue with dynamic load balancing, which automatically spins up new instances as required. But that approach can lead to unpredictable spikes in service costs, which is risky for people who operate smaller websites and businesses.
Cloudflare’s AI Audit provided relief for McIntyre, who wrote about his experience on BingeClock’s engineering blog. He noticed a substantial decrease in unwanted AI traffic. “If I look at the AI Audit dashboard on Cloudflare, there’s nothing coming through in terms of AI traffic, at all, since the tool came out,” said McIntyre. Abbott also viewed AI Audit favourably. “It’s nice to have a clear-pane-of-glass view on all of this,” he said.
Before AI Audit was released, BingeClock required up to six AWS instances to handle traffic. It’s now down to five and, if the reduction in bot traffic persists, McIntyre believes he could cut back to as few as two.
Blocking bots is AI Audit’s most immediate impact, but Cloudflare wants to go a step further: The company hopes AI Audit can help site owners receive compensation when their content is crawled.
Several publishers, including News Corp, Vox, and Conde Nast, have inked deals with OpenAI that provide the AI company access to their content. Rhea said AI Audit could have a role in facilitating and policing such deals. “Cloudflare wants to provide a level of transparency, auditability, and control for the publisher,” said Rhea.
For smaller websites, meanwhile, Cloudflare hopes to introduce a seamless price-setting and transaction flow. In theory, this could allow small site owners to reach agreements with companies that want to crawl their content for AI training. However, there’s currently no release date for this monetization tool.
McIntyre, though happy with AI Audit’s ability to hamper bots, was sceptical about the monetary value AI Audit will deliver to smaller websites. “Whatever the payment program is, I would guess it’s not going to be very much money. I just don’t see how they would monetize it. I’d love to be proven wrong,” said McIntyre.
Tools like AI Audit may also spark concerns about the erosion of the open Internet. Cloudflare’s blog post demonstrating AI Audit lists bots used by Common Crawl and The Internet Archive, a pair of non-profit organizations. Crafting a tool to charge AI bots for access might lead site owners to ask who else can pay up.
Rhea said Cloudflare has no intent to use AI Audit as a general tool to control or block traffic more broadly. “It’s an interesting question, but not one we’re considering at all […] we’re very narrowly focused on scanning and crawling from bots.”