Bits flowing showing AI training data

Reddit Sues Anthropic Over Unauthorized AI Training

Reddit is taking Anthropic, the startup behind the “Claude” AI, to court over unauthorized scraping of its site for AI training data.

The Reddit suit claims that Anthropic began regularly scraping the site in December 2021. After being asked to stop, Anthropic issued a public statement in July 2024 indicating that it had stopped all crawling of Reddit for AI training data. However, Reddit claims that Anthropic’s bots continued to access the site some 100,000 times after that statement was made. Such accusations (and even lawsuits) are not all that uncommon, but this particular case is noteworthy given Anthropic’s efforts to brand itself as the major AI brand most concerned with ethics and safety.

AI training suit accuses Anthropic of unlawful use for commercial purposes

Reddit says that it had offered Anthropic a licensing agreement allowing it to scrape AI training data, but that the company declined. The massive message board platform has previously struck similar deals with OpenAI and Google, which together now bring it $130 million per year in added revenue (or roughly 10% of its overall earnings). A 2024 SEC filing indicated that Reddit had already made $203 million from licensing data to AI outfits at that point, but did not disclose exactly who it had partnered with, though around this time it also publicly called out Microsoft and Perplexity for scraping its data without its permission.

Suits over unauthorized use of data for AI training have now been raging for years, but this is the first instance of Anthropic being caught with its hand in the particularly valuable Reddit cookie jar. The development is noteworthy as Anthropic has centered its attempts to keep up with its much bigger and more developed competitors on being recognized as the most “ethical” and “safe” of the major AI options, taking more care to protect user rights and head off unintended consequences as it develops.

Reddit specifically called this out in its lawsuit filing, referring to Anthropic as the “white knight” of the AI industry but calling that branding a facade. Anthropic has thus far faced two other lawsuits over unauthorized access; one brought by Universal Music in 2023 that accused it of pillaging the recording giant’s repository of song lyrics without authorization, and another from three independent authors in California accusing the company of scraping hundreds of thousands of assorted copyrighted books without permission.

Reddit further claims that it is not the only victim of this scraping, but that Anthropic has also violated the privacy of its users by incorporating their messages into AI training data without their consent. Anthropic responded to the lawsuit by saying that it will “defend itself vigorously.”

Anthropic’s “ethics first” approach butts up against hard realities

Anthropic’s journey as the “ethical AI” began with the resignation of a number of former OpenAI staffers over concerns about the ChatGPT developer’s safety practices. But regardless of what its intentions might be, data is the lifeblood of AI training. Without huge amounts of data to scrape, AI models falter and fall behind their more robust competitors. Major sources of data, such as Reddit with its 100 million daily active users, are increasingly closing off to hungry AI companies and demanding substantial payments for access.

Lacking the resources that its bigger rivals have, Anthropic appears to have been simply helping itself to what it can reach and worrying about possible consequences later. It has been called “the most aggressive scraper on the internet” and caught raiding YouTube among other sites in violation of terms of service that explicitly forbid unauthorized AI training. There appears to be no alternative to massive amounts of fresh human-made content at present, as experiments with having AI train itself on content it generates have led to bizarre and unusable output.

In addition to its scraping entanglements, Anthropic has also joined lobbying efforts against the sort of AI safety measures that it supposedly launched to promote. And its partnerships with Google and Amazon have raised questions about anti-competitive behavior, triggering a Competition and Markets Authority (CMA) probe launched in August 2024.

Use of public content on the internet for AI training is also an emerging area of the law that is far from settled or predictable. AI developers argue they are merely accessing content made public in a way that anyone else is able to, but publishers argue that these tools are specifically meant to compete with them and ultimately put them out of business. Google won a dismissal of one such suit against its “Bard” model in June 2024, but with the plaintiffs given the ability to amend and file again. Many other cases, including multiple against OpenAI, remain open.