Cloudflare Announces New Content Scraping Protection Feature; "Easy Button" Stops AI Bots With a Click

Cloudflare, one of the world’s largest content delivery networks and web security service providers, is taking on AI bots with a new “Easy button” that simplifies the shutdown of unauthorized content scraping.

The Easy button is available to all customer tiers, including those using the free service option. Cloudflare noted that AI bots scrape about 39% of the top one million internet properties that it provides service to, but only 2.98% of these have been taking measures to block or challenge these crawlers. The new “button” is actually a security toggle switch that has been added to the Cloudflare dashboard, appearing as the “AI Scrapers and Crawlers” switch.

Cloudflare takes on content scraping by blocking “big four” AI crawlers

As the Cloudflare announcement notes, demand for content to feed AI training models has never been higher. Without huge quantities of fresh content, these models quickly stagnate and fall behind. This has led to widespread content scraping, and most of the AI bots responsible belong to the world’s largest AI developers.

About two years ago, Cloudflare introduced more fine-grained tools to see what AI bots might be accessing one’s website and to choose how to respond to each individually. However, these tools were generally permissive toward “well-behaved” bots designed to respect the site’s “robots.txt” rules governing content scraping. The company notes that the data it has gathered since shows that its users overwhelmingly choose to block even the “good” AI bots: 85.2% block them outright and 4.4% impose some sort of challenge they must pass, while only about 10% allow them through.

The biggest sites that provide the richest sources of training data (such as Reddit) have become increasingly locked down to unauthorized content scraping and require hefty payments for access, which seems to have driven some AI bots to pushier and ruder scraping behavior. Cloudflare lists about two dozen bots that are frequent scrapers, but four are by far the most active: ByteDance’s Bytespider, Amazon’s Amazonbot, Anthropic’s ClaudeBot and OpenAI’s GPTBot.
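For readers who want to see what a robots.txt rule for these four crawlers looks like in practice, here is a minimal sketch in Python using the standard library's robots.txt parser. The robots.txt content, the example.com URL and the allowed_agents helper are assumptions made for illustration; only the four bot names come from the list above. This is not how Cloudflare's toggle works: robots.txt is an advisory file that only well-behaved crawlers honor, whereas Cloudflare blocks traffic at the network edge.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt for illustration: disallow the four most active
# AI crawlers named in the article, allow everyone else.
ROBOTS_TXT = """\
User-agent: Bytespider
Disallow: /

User-agent: Amazonbot
Disallow: /

User-agent: ClaudeBot
Disallow: /

User-agent: GPTBot
Disallow: /

User-agent: *
Allow: /
"""


def allowed_agents(robots_txt, agents, url):
    """Return {agent: True/False} for whether each agent may fetch url.

    Hypothetical helper name; it only wraps the standard-library parser.
    """
    parser = RobotFileParser()
    # parse() takes an iterable of lines, so no network request is made.
    parser.parse(robots_txt.splitlines())
    return {agent: parser.can_fetch(agent, url) for agent in agents}


if __name__ == "__main__":
    agents = ["Bytespider", "Amazonbot", "ClaudeBot", "GPTBot", "Googlebot"]
    url = "https://example.com/articles/1"  # placeholder URL
    for agent, ok in allowed_agents(ROBOTS_TXT, agents, url).items():
        print(f"{agent}: {'allowed' if ok else 'disallowed'}")
```

A check like this only tells a crawler what the site owner has asked for. Whether the crawler complies is up to the crawler, which is why a server-side block is the stronger control.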

The company also notes that “bad actors” that ignore robots.txt settings and other civil agreements about content scraping have tended to use the same set of tools and frameworks over an extended period, which creates an opportunity to fingerprint and track them. The data collected this way allows Cloudflare’s own defensive AI models to automatically recognize and flag traffic that appears to originate from AI bots, and to adapt quickly when new content scraping tools are introduced.

AI bots paying less attention to rules as content wars heat up

Cloudflare has stated that new users from this point on will be prompted to pick an Easy button setting when they start their service, but existing users may need to find it in their settings menu. The company also said that it is rolling out plans for users to implement a “pay per crawl” scheme that would be offered to AI bots when they show up and attempt to scrape content from a protected site.

AI firms have now been battling copyright-related lawsuits for years, many of which remain unresolved. On the whole, the law about this sort of content scraping is still firming up in many countries. These suits continue to roll in, with one of the most recent being the BBC filing a suit against Perplexity for ignoring its robots.txt settings and scraping its vast collection of news content without permission.

One of the biggest ongoing battles, still unresolved, is the similar New York Times lawsuit against both OpenAI and Microsoft for unauthorized content scraping. First filed in December 2023, the suit’s most recent development was an April order from a U.S. District judge that found ChatGPT reproduced material from the Times’s articles “numerous” times and rejected motions to dismiss from both AI companies. The Times has also been granted access to OpenAI’s logs to review exactly what its AI bots took and how they operate.

The outcome of this particular suit could be the most consequential in setting precedent for emerging AI copyright law in the US. The AI companies generally defend content scraping by claiming that it falls under the “fair use” doctrine, and that the AI bots merely access public information in the way that any average internet user might. But a successful fair use defense hinges on demonstrating a “transformative” quality and on using only a limited amount of the original content. The AI outfits may wind up in legal trouble if their models regularly regurgitate significant portions of articles without providing source credits.

Dr. Kolochenko, CEO at ImmuniWeb, believes that the “pay to scrape” model will greatly expand in the coming months and could prove to be a significant obstacle for the AI outfits: “This long-awaited feature by Cloudflare is a true disaster for many GenAI vendors, which may be fatal to the current business models of GenAI. Given that Cloudflare protects the majority of the world’s most popular websites, as well as millions of smaller websites that publish academic and scientific content, this security feature will elegantly prevent data-greedy bots from unwarrantedly scraping human-created content without permission and without paying for it. Ironically, the fierce legal battles currently taking place in courts on both sides of the Atlantic – disputing the alleged copyright infringements by numerous AI vendors – are mostly re-litigating arguments that are already lost. At the end of the day, these lawsuits will bring from little to no value to GenAI vendors: virtually all creative content providers are incrementally protecting their content with advanced anti-bot protection mechanisms, which Cloudflare has just made available to everybody in one click. Furthermore, content providers add specific contractual provisions to their terms of service that expressly prohibit any use of their data for LLM training purposes. In case of a violation of such terms of service, content providers will have a straightforward and time-tested legal claim for breach of contract, possibly accompanied with liquidated damages per violation, making such claims extremely lucrative for the plaintiffs. Furthermore, in some jurisdictions, a deliberate bypass of anti-bot protection and massive data scraping may constitute a criminal offense. Of note, all this has virtually nothing to do with copyright law. 
Ultimately, GenAI vendors – that now vigorously argue in courts that exploitation of third-party content for LLM training purposes constitutes a fair use exception under the copyright law – will likely face even greater liability under the avalanche of breach of contract claims. In sum, most GenAI vendors will soon face a tough reality: paying a fair price for high-quality training data, while staying profitable. In view of the formidable competition emanating from China, many Western GenAI companies may simply quit the business as economically unviable.”