Cloudflare has released a new free tool that prevents AI companies’ bots from scraping content from its customers’ websites to train large language models. The cloud service provider is making the tool available to its entire customer base, including those on a free plan. “This feature will be automatically updated over time as we see new footprints of offending bots that we identify as broadly crawling the web to train models,” the company said.
In announcing the update, the Cloudflare team also shared some data on how its customers are responding to the rise of bots scraping content to train generative AI models. According to the company’s internal data, 85.2% of customers have chosen to block even AI bots that correctly identify themselves from accessing their sites.
Cloudflare also identified the most active bots of the past year. ByteDance-owned Bytespider attempted to access 40% of the websites under Cloudflare’s purview, and OpenAI’s GPTBot attempted 35%. Together with Amazonbot and ClaudeBot, they made up the top four AI crawlers by number of requests on Cloudflare’s network.
It’s proving very difficult to completely and consistently block AI bots from accessing content. The arms race to build models faster has led to cases where companies skirt or outright break existing rules that bar scrapers from collecting website content without permission. But having a back-end company of Cloudflare’s scale working seriously to stop this behavior could yield real results.
“We are concerned that some AI companies looking to circumvent the rules to access content are constantly adapting to evade bot detection,” the company said. “We will continue to monitor and add more bot blocks to our AI Scrapers and Crawlers rule and evolve our machine learning models to help make the internet a place where content creators can thrive and maintain full control over the models their content is used to train or run inference on.”