Cloudflare, a global internet security company that claims to protect nearly 20% of the world’s web traffic, has launched what it calls an “easy button” for website owners who want to block AI services from accessing their content. The move comes as demand for content used to train AI models has exploded.
Cloudflare’s core service, which acts as an internet proxy, scans and filters web traffic before it reaches websites. On average, the company says its network receives more than 57 million requests per second.
“To help keep the Internet safe for content creators, we’ve just launched a brand new ‘easy button’ to block all AI bots,” Cloudflare said in its announcement Wednesday. “We’re hearing loud and clear that customers don’t want AI bots visiting their websites, especially those that do so dishonestly.”
While some AI companies correctly identify their web scraping bots and comply with websites’ instructions to stay away, not all of them are transparent about their activities.
The new simple setting is available to all Cloudflare customers, including those on its free tier.
Analysis of AI bot activity
Along with its announcement, Cloudflare shared a plethora of information about the AI crawler activity it observes on its systems.
According to Cloudflare data, AI bots accessed about 39% of the top 1 million “internet properties” using Cloudflare in June. However, only 2.98% of those properties took action to block or challenge those requests. Cloudflare also notes that “the higher an internet property ranks (the more popular it is), the more likely it is to be targeted by AI bots.”
According to the company, web crawlers operated by TikTok owner ByteDance, Amazon, Anthropic and OpenAI were the most active. The most active crawler was ByteDance’s Bytespider, which topped the charts in the number of requests, the extent of its activity and the frequency of blocks. GPTBot, operated by OpenAI and used to collect training data for products like ChatGPT, ranked second in terms of crawl activity and blocks.
Perplexity’s web crawler, which has recently sparked controversy for its content-crawling practices, has been detected visiting a fraction of one percent of sites protected by Cloudflare.
While website owners can implement their own rules to block known web crawlers, Cloudflare said that most of its customers who do so block only the best-known AI developers, such as OpenAI, Google, or Meta, and not the top crawlers from ByteDance or other companies.
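To make that gap concrete, here is a minimal sketch of a user-agent blocklist of the kind a site owner might write. The function name, the token lists, and the simplified User-Agent strings are illustrative assumptions, not Cloudflare's rule syntax or its actual configuration.

```python
from typing import Iterable


def is_blocked(user_agent: str, blocked_tokens: Iterable[str]) -> bool:
    """Return True if any blocked crawler token appears in the User-Agent header."""
    ua = user_agent.lower()
    return any(token.lower() in ua for token in blocked_tokens)


# Hypothetical blocklists: one naming only a well-known crawler,
# one that also names the most active crawler in Cloudflare's data.
NARROW_BLOCKLIST = ("GPTBot",)
BROADER_BLOCKLIST = ("GPTBot", "Bytespider")

# Simplified, illustrative User-Agent strings (real headers are longer).
GPTBOT_UA = "Mozilla/5.0 (compatible; GPTBot/1.0)"
BYTESPIDER_UA = "Mozilla/5.0 (compatible; Bytespider)"

if __name__ == "__main__":
    for name, ua in (("GPTBot", GPTBOT_UA), ("Bytespider", BYTESPIDER_UA)):
        print(name, "narrow:", is_blocked(ua, NARROW_BLOCKLIST),
              "broader:", is_blocked(ua, BROADER_BLOCKLIST))
```

With the narrow list, the GPTBot request is blocked while the Bytespider request passes. A rule like this also only works when a crawler identifies itself honestly in its User-Agent header.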
AI vs AI
Cloudflare’s report highlighted how some AI bot operators are using deceptive tactics to circumvent measures aimed at blocking them, trying to pass off their crawler activity as legitimate web traffic.
“Unfortunately, we have observed bot operators attempting to impersonate a genuine browser by using a spoofed user agent,” Cloudflare wrote.
It turns out that AI is a key tool in the company’s arsenal for stopping automated activity, whether it comes from AI developers, search engines, or malicious attackers. Cloudflare said it uses a machine learning model to assign a “bot score” to every request made to a website protected by its services, with low scores indicating that the request most likely came from a bot rather than a person.
Trained on Cloudflare’s massive dataset of global internet traffic, the model weighs a number of signals, including request IP address, user agent, and behavioral patterns, to determine each request’s bot score.
To illustrate this, Cloudflare said it looked at traffic from a specific bot known for its evasive behavior. The results were telling: All detections scored below 30 out of 100, with the vast majority falling into the lower two bands, indicating a score of 9 or lower. In other words, even when trying to mask its source, the bot’s activity patterns gave it away, allowing Cloudflare to block it.
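The decision logic described here can be sketched as a simple threshold check. The 1 to 99 score range, the cutoff of 30, and the field names are assumptions made for illustration. The score itself would come from Cloudflare's model, which this sketch does not attempt to reproduce.

```python
from dataclasses import dataclass


@dataclass
class ScoredRequest:
    user_agent: str  # what the client claims to be
    bot_score: int   # assumed range 1-99; lower means more likely automated


def decide(request: ScoredRequest, block_below: int = 30) -> str:
    """Block on the score alone, ignoring what the User-Agent claims."""
    if not 1 <= request.bot_score <= 99:
        raise ValueError("bot_score is assumed to fall between 1 and 99")
    return "block" if request.bot_score < block_below else "allow"


if __name__ == "__main__":
    # A crawler posing as a desktop browser still receives a low score.
    spoofed = ScoredRequest(
        user_agent="Mozilla/5.0 (Windows NT 10.0; Win64; x64)",
        bot_score=9,
    )
    print(decide(spoofed))
```

Because the decision depends on the score rather than the self-reported user agent, a spoofed browser string does not change the outcome in this sketch.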
Web Content Protection
Generative AI models rely on massive amounts of existing content, much of it from the web. For AI to continue to provide up-to-date information, its developers must continue to collect information at scale.
Website owners and content creators are responding, and major publishers like news organizations are taking legal action against AI companies. In Perplexity’s case, publications like Forbes and Wired have publicly criticized its content-crawling practices. Music publisher Sony warned more than 700 tech companies in May to stay away from its content, and this week Warner Music Group followed suit.
The threat could be existential for publishers if AI increasingly provides users with information without referring them to the source. A recent study published by SparkToro CEO Rand Fishkin suggests that 60% of people searching for information on Google stopped visiting websites that offered it because Google’s AI immediately provided summarized answers.
Edited by Ryan Ozawa.