Photo illustration: Intelligencer; Photo: Getty Images
As of last week, the only major search engine that still includes Reddit is Google. Microsoft’s Bing no longer finds any recent Reddit links, and the privacy-focused search engine DuckDuckGo returns no useful results from the platform. For most people, this change isn’t a big deal in the short term. After all, most people use Google, which has about 90% of the global search market, and anyone can still visit Reddit directly. But it’s a strange change. Reddit is a platform with deep ties to the web: it started as a link aggregator and grew into an online community with over 80 million daily active users. So why, as 404 Media reported, is Reddit suddenly closing itself off?
When tech companies behave strangely and unpredictably these days, the answer often has to do with AI. Earlier this year, Google inked a deal with Reddit, reportedly worth $60 million per year, to license the site’s data to train AI models. In recent months, Google watchers have also noticed more Reddit posts appearing in search results and user comments ranking higher for various queries. These stories are related but to some extent independent: Google, like others in the AI industry, wants to license data to build better models and avoid lawsuits. Google users were already adding “Reddit” to their search queries as a kind of search quality hack, so in a way the company was following their lead.
Here’s where the two stories overlap, and where things start to fall apart. Search engines gather relevant and up-to-date information by using bots to crawl the web, indexing the information they find, and ranking it based on its relevance to the user’s search. Websites have some control over whether and how this crawling happens, and there are many different reasons why they might refuse to let crawlers into all or part of their site (an individual might want to keep an old blog online but not searchable, while a company like Facebook might want Google users to be able to find a profile but not search its contents). But for decades, crawling has been a simple, clear, win-win arrangement: search engines attracted large audiences by providing a useful service, and websites allowed and even accommodated search engine crawling to connect with that audience.
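For readers curious about the mechanics, the control websites have over crawlers is conventionally expressed in a robots.txt file, a plain-text list of rules that well-behaved bots consult before fetching pages. The sketch below uses Python's standard-library parser to show how a site could admit one crawler and turn away the rest. The rules, the bot names (`ExampleSearchBot`, `OtherBot`) and the URL are hypothetical illustrations, not Reddit's actual configuration, and compliance with such a file is voluntary on the crawler's part.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt: one named crawler is allowed everywhere,
# every other crawler is asked to stay out entirely.
ROBOTS_TXT = """\
User-agent: ExampleSearchBot
Allow: /

User-agent: *
Disallow: /
"""


def can_crawl(user_agent: str, url: str) -> bool:
    """Return True if the rules above permit user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    page = "https://example.com/r/some-thread"  # placeholder URL
    for bot in ("ExampleSearchBot", "OtherBot"):
        print(bot, can_crawl(bot, page))
```

Under these assumed rules the named crawler is permitted and any other bot is refused, which is roughly the shape of an arrangement in which only one search engine keeps its access.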
But in the last few years, crawling has taken on a new purpose: the robots that are indexing your site and loading all your data aren’t just building a search index. They might also be building AI models, which, importantly, many websites do not consider part of the deal. As The Verge’s David Pierce writes, the sudden shift from search indexing to AI training means that “the web’s fundamental social contract is unraveling.” Desperate unilateral actions by startups and tech giants are replacing mutually beneficial arrangements with exploitative ones.
Initially, the impact of this collapse was relatively contained, with large websites and platforms owned by large companies like Facebook and Amazon explicitly blocking crawlers from companies like OpenAI. But this clarity didn’t last long. Google is all in on AI. Bing is owned by Microsoft, which is also OpenAI’s largest investor and partner. Suddenly, every search company was an AI company, and new crawlers entered the mix. Every crawl was an AI crawl. It would be naive to assume otherwise. The sudden harvest was obvious to anyone paying attention to traffic statistics. Bots were scraping everything they could.
In response to 404 Media’s reporting, critics argue that Google, a company seemingly fearful of past and future antitrust enforcement, is giving an unfair advantage to a product that is already a near-monopoly. They have a point: without Reddit, one of the largest repositories of real human text on the internet, smaller search engines can’t compete.
But the story wouldn’t be complete without the company actually doing the blocking: Reddit. (Microsoft has confirmed that its crawler is banned.) A Reddit spokesperson told The Verge:
This is in no way related to our recent partnership with Google. We have been in discussions with multiple search engines. We have not been able to reach agreements with all of them because some search engines are unable or unwilling to make enforceable commitments regarding the use of Reddit content, including for use in AI.
This is a pragmatic move by the management of Reddit, a public company with obligations to shareholders, but it’s also clearly bad for the general public. Not only does it reduce access for users who don’t want to use Google, it further subordinates Reddit to Google’s specific search incentives, which are in any case changing fast because of AI. Already, spammers seeking more visibility in search results are polluting popular Reddit threads that are suddenly getting huge amounts of traffic from Google.
Google, like Reddit, exists and thrives thanks to the principles and practices of the open web, but these exclusive arrangements mark the end of that long and extremely fruitful era. They also portend what’s to come. The web was already in a state of disarray, having been shrunk over the past 15 years by the rise of walled platforms, battered by advertising consolidation, and polluted by a glut of content from the AI products it helped train. The rise of AI scraping threatens to finish the job, as a flawed but highly successful, decades-long experiment in open networking and human communication crumbles in a series of hostile deals between rival tech companies.