Cloudflare strengthens blocking of AI bots that scrape websites

Cloudflare, a global internet security company that claims to protect nearly 20% of the world’s web traffic, has launched what it calls an “easy button” for website owners who want to block AI services from accessing their content. The move comes as demand for content used to train AI models has exploded.

Cloudflare’s core service, which acts as an internet proxy, scans and filters web traffic before it reaches websites. On average, the company says its network receives more than 57 million requests per second.

“To help keep the Internet safe for content creators, we’ve just launched a brand new ‘easy button’ to block all AI bots,” Cloudflare said in its announcement Wednesday. “We’re hearing loud and clear that customers don’t want AI bots visiting their websites, especially those that do so dishonestly.”

While some AI companies correctly identify their web scraping bots and comply with websites’ instructions to stay away, not all of them are transparent about their activities.

The new simple setting is available to all Cloudflare customers, including those on its free tier.

Analysis of AI bot activity

Along with its announcement, Cloudflare shared a plethora of information about the AI crawler activity it observes on its systems.

According to Cloudflare data, AI bots accessed about 39% of the top 1 million “internet properties” using Cloudflare in June. However, only 2.98% of those properties took action to block or challenge those requests. Cloudflare also noted that “the higher an internet property ranks (the more popular it is), the more likely it is to be targeted by AI bots.”

According to the company, web crawlers operated by TikTok owner ByteDance, Amazon, Anthropic and OpenAI were the most active. The most active crawler was ByteDance’s Bytespider, which topped the charts in the number of requests, the extent of its activity and the frequency of blocks. GPTBot, operated by OpenAI and used to collect training data for products like ChatGPT, ranked second in both crawl activity and blocks.

Image: Cloudflare

Perplexity’s web crawler, which has recently sparked controversy for its content-crawling practices, has been detected visiting a fraction of one percent of sites protected by Cloudflare.

Image: Cloudflare

While website owners can implement their own rules to block known web crawlers, Cloudflare said that most of its customers who do so block only the best-known AI developers, such as OpenAI, Google, or Meta, and not the top crawlers from ByteDance or other companies.
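As an illustration of what such a do-it-yourself rule can look like, here is a minimal sketch that generates a robots.txt file disallowing the two crawlers named above, Bytespider and GPTBot, and then checks the result with Python's standard-library parser. The helper name build_robots_txt and the example.com URL are assumptions made for this example, not anything Cloudflare provides. Note that robots.txt is only a request: it works for crawlers that identify themselves and choose to comply.

```python
from urllib.robotparser import RobotFileParser

# Crawler names taken from the article; a real list would likely be longer.
AI_CRAWLERS = ["Bytespider", "GPTBot"]


def build_robots_txt(blocked_agents):
    """Return robots.txt text that disallows the whole site for each agent.

    Hypothetical helper written for this example.
    """
    lines = []
    for agent in blocked_agents:
        lines += [f"User-agent: {agent}", "Disallow: /", ""]
    # Every other crawler may still fetch everything.
    lines += ["User-agent: *", "Allow: /"]
    return "\n".join(lines) + "\n"


robots_txt = build_robots_txt(AI_CRAWLERS)

# Parse the generated rules the way a compliant crawler would.
parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

page = "https://example.com/article"  # placeholder URL
gptbot_allowed = parser.can_fetch("GPTBot", page)
browser_allowed = parser.can_fetch("Mozilla/5.0", page)

print(robots_txt)
print("GPTBot allowed:", gptbot_allowed)
print("Generic browser allowed:", browser_allowed)
```

A site owner who lists only a few well-known names, as the article describes, leaves every crawler not on the list free to continue.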

AI vs AI

Cloudflare’s report highlighted how some AI bot operators are using deceptive tactics to circumvent measures aimed at blocking them, trying to pass off their crawler activity as legitimate web traffic.

“Unfortunately, we have observed bot operators attempting to impersonate a genuine browser by using a spoofed user agent,” Cloudflare wrote.
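To see why a spoofed user agent defeats simple blocking, consider this minimal sketch of a filter that inspects only the User-Agent header. The function name, the blocked tokens and both header strings are illustrative assumptions, not Cloudflare's logic or any crawler's real header.

```python
# Tokens a site owner might look for; illustrative only.
BLOCKED_TOKENS = ("bytespider", "gptbot")


def blocked_by_user_agent(user_agent: str) -> bool:
    """Return True if the header contains a known crawler token."""
    ua = user_agent.lower()
    return any(token in ua for token in BLOCKED_TOKENS)


# A crawler that identifies itself honestly (made-up header string).
honest_ua = "Mozilla/5.0 (compatible; GPTBot/1.0)"

# The same crawler pretending to be a desktop browser (made-up header string).
spoofed_ua = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"

honest_blocked = blocked_by_user_agent(honest_ua)
spoofed_blocked = blocked_by_user_agent(spoofed_ua)

print("Honest crawler blocked:", honest_blocked)
print("Spoofed crawler blocked:", spoofed_blocked)
```

Because the header is supplied by the client, a filter like this stops only the crawlers that tell the truth about themselves.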

It turns out that AI is a key tool in the company’s arsenal for stopping automated activity, whether it comes from AI developers, search engines, or malicious attackers. Cloudflare said it uses a machine learning model to assign a “bot score” to every request made to a website protected by its services, with low scores indicating that the request most likely came from a bot rather than a person.

Drawing on Cloudflare’s massive dataset of global internet traffic, the model considers a number of signals, including request IP address, user agent, and behavioral patterns, to determine each request’s bot score.

Image: Cloudflare

To illustrate this, Cloudflare said it looked at traffic from a specific bot known for its evasive behavior. The results were telling: All detections scored below 30 out of 100, with the vast majority falling into the lower two bands, indicating a score of 9 or lower. In other words, even when trying to mask its source, the bot’s activity patterns gave it away, allowing Cloudflare to block it.
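The arithmetic behind that observation can be sketched as follows. This is not Cloudflare's model: the scores are made up for illustration, and the function name and default thresholds are assumptions, chosen to mirror the figures in the article (a cutoff of 30, and a "very low" range of 9 or below).

```python
def summarize_scores(scores, block_below=30, very_low_max=9):
    """Report what share of requests fall under each threshold.

    Hypothetical helper; thresholds mirror the figures quoted in the article.
    """
    total = len(scores)
    blocked = sum(1 for s in scores if s < block_below)
    very_low = sum(1 for s in scores if s <= very_low_max)
    return {
        "blocked_share": blocked / total,
        "very_low_share": very_low / total,
    }


# Made-up bot scores for requests from one evasive crawler (lower = more bot-like).
sample_scores = [1, 2, 2, 3, 5, 7, 8, 9, 14, 27]

summary = summarize_scores(sample_scores)
print(summary)
```

With these invented numbers, every request falls below the cutoff of 30 and most score 9 or lower, which is the shape of the result Cloudflare described.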

Web Content Protection

Generative AI models rely on massive amounts of existing content, much of it from the web. For AI to continue to provide up-to-date information, its developers must continue to collect information at scale.

Website owners and content creators are responding, and major publishers like news organizations are taking legal action against AI companies. In Perplexity’s case, publications like Forbes and Wired have objected to its content-crawling practices. Music publisher Sony warned more than 700 tech companies in May not to use its content, and this week Warner Music Group followed suit.

The threat could be existential for publishers if AI increasingly provides users with information without referring them to the source. A recent study published by SparkToro CEO Rand Fishkin suggests that 60% of people searching for information on Google did not go on to visit the websites that offered it because Google’s AI immediately provided summarized answers.

Edited by Ryan Ozawa.
