Freelancer has accused Anthropic, the AI startup behind the Claude large language models, of ignoring its "do not crawl" robots.txt protocol to scrape data from its websites. Meanwhile, iFixit CEO Kyle Wiens said Anthropic ignored the website's policy forbidding the use of its content to train AI models. Freelancer CEO Matt Barrie told The Information that Anthropic's crawler bots are "the most aggressive scrapers ever": his website reportedly received 3.5 million hits from the company's crawler within four hours, "probably about five times the number of hits from the next closest AI crawler." Similarly, Wiens posted on X/Twitter that Anthropic's bots hit iFixit's servers a million times in 24 hours. "Not only are you getting our content without paying for it, you're tying up our development resources," he wrote.
In June, Wired accused another AI company, Perplexity, of crawling its website despite the presence of a Robots Exclusion Protocol (robots.txt) file. Robots.txt files typically contain instructions telling web crawlers which pages they can and cannot access; compliance is voluntary, and bad bots mostly ignore them. After Wired's article was published, TollBit, a startup that connects AI companies with content publishers, reported that Perplexity isn't the only company evading robots.txt signals. It didn't name names, but Business Insider said it had learned that OpenAI and Anthropic were also ignoring the protocol.
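Because compliance is voluntary, enforcement rests entirely with the crawler. As a rough illustration, here is a minimal Python sketch, using the standard library's urllib.robotparser, of the check a well-behaved crawler performs before fetching a page (the bot name and site URL are hypothetical):

```python
from urllib.robotparser import RobotFileParser

# Hypothetical crawler identity and target site; real crawlers send
# their user-agent token with every request they make.
BOT_NAME = "ExampleBot"
SITE = "https://example.com"

# Download and parse the site's robots.txt.
rp = RobotFileParser()
rp.set_url(f"{SITE}/robots.txt")
rp.read()

# A polite crawler checks each URL against the parsed rules and simply
# skips anything disallowed -- nothing on the server forces this.
for path in ("/", "/private/data.html"):
    if rp.can_fetch(BOT_NAME, f"{SITE}{path}"):
        print(f"allowed: {path}")  # a real bot would fetch the page here
    else:
        print(f"disallowed by robots.txt, skipping: {path}")
```

Nothing in this flow stops a crawler from skipping the check altogether, which is exactly the behavior the publishers are alleging.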
Barrie said Freelancer initially tried refusing the bot's access requests, but eventually had to block Anthropic's crawlers altogether. "This is nasty scraping [which] slows down the site for everyone using it, and ultimately affects our revenue," he said. As for iFixit, the site said it sets alarms for high traffic and that Anthropic's activity woke staff up at 3AM. The company's crawlers stopped scraping iFixit after it added a line to its robots.txt file that specifically bans Anthropic's bots.
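A per-crawler ban like the one iFixit describes takes only a couple of lines in robots.txt. A minimal sketch follows; it assumes the "ClaudeBot" user-agent token that Anthropic documents for its crawler, and the exact directives iFixit used are not public:

```
# Hypothetical robots.txt entry banning one crawler from the whole site
User-agent: ClaudeBot
Disallow: /
```

Again, this only works if the crawler in question chooses to honor it.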
Anthropic told The Information that it respects robots.txt and that its crawlers "respected that signal" when iFixit implemented it. The company also said it aims for minimal disruption by being thoughtful about how quickly [it crawls] the same domains, and that it is now investigating the incident.
AI companies use crawlers to collect content from websites to train their generative AI technologies. Publishers have accused them of copyright infringement as a result, and they have been the target of multiple lawsuits. To head off further suits, companies like OpenAI have signed deals with publishers and websites. OpenAI's content partners so far include News Corp, Vox Media, the Financial Times, and Reddit. iFixit's Wiens seems open to a deal for the site's how-to repair articles, saying in a tweet directed at Anthropic that he's willing to discuss licensing the content for commercial use.
If any of those requests had accessed our Terms of Use, they would have told you that use of our content is expressly forbidden. But don't ask me, ask Claude!
If you want to discuss licensing any of our content for commercial use, we're right here. pic.twitter.com/CAkOQDnLjD
— Kyle Wiens (@kwiens) July 24, 2024