OpenAI is facing a number of copyright lawsuits that could shape the future of generative AI, and one of the biggest comes from the New York Times (NYT). OpenAI is now accusing the paper of what is essentially evidence fabrication, claiming that the Times "hacked" ChatGPT to produce outputs containing content from its articles.
OpenAI will argue that the paper employed a prompt engineer to massage the chatbot into spitting out content similar to the paper's articles, in some cases making tens of thousands of attempts in the process. The company's argument appears to center on the claim that an ordinary user cannot coerce ChatGPT into replicating the paper's paywalled articles, rather than on defending its right to have trained its AI on those articles.
OpenAI asks for partial dismissal, claiming that NYT hacked ChatGPT
OpenAI accused the Times of not living up to its own high journalistic standards in bringing the copyright lawsuit, claiming that the paper hacked ChatGPT and other large language models (LLMs) to generate the material for its claims.
NYT filed suit against OpenAI and Microsoft in late December of last year. The copyright lawsuit is one of several that will test the always-fuzzy bounds of "fair use" protections under the law. OpenAI says that it was invited to work with the NYT and other papers, and that it has a more general right to use copyrighted material as a source of training data without getting express permission from owners.
OpenAI has faced a small storm of suits of this nature over the past year, but most have been brought by individuals claiming infringement of their particular works (and have not involved accusations of having hacked ChatGPT). The NYT is the first major media outlet to file a copyright lawsuit against a generative AI outfit. Communications between the NYT and OpenAI from just days before the lawsuit was filed indicate that the paper saw ChatGPT produce outputs that were near-verbatim copies of its published articles, but OpenAI claims the paper did not provide it with any concrete examples before initiating the suit. NYT later included 100 such examples in its court filings.
OpenAI claims that such verbatim reproduction is a "rare bug" and that, for NYT to come up with these examples, it must have "hacked" ChatGPT. The paper is not accused of hacking in the traditional sense, but rather of employing some sort of prompt engineer to make thousands of attempts at coaxing the chatbot into committing such a violation. OpenAI argues that a normal user of the platform would not use it in this way, and that the paper used prompts specifically designed to induce "uncommon and unintended phenomena," such as requesting the opening paragraph of a specific article and then coaxing the chatbot to reveal the rest of it sentence by sentence.
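For illustration only, the looped "sentence by sentence" prompting strategy described in the filing might look something like the following sketch, written against the OpenAI Python client. The article title, model name, prompt wording, and attempt count are all hypothetical placeholders, and a well-behaved model would typically refuse or paraphrase rather than reproduce copyrighted text verbatim.

```python
# Hypothetical sketch of the adversarial prompting pattern alleged in the
# filing: ask for an article's opening paragraph, then repeatedly press
# the model for "the next sentence." Title, model, and prompts below are
# illustrative placeholders, not details from the case.
from openai import OpenAI

client = OpenAI()  # reads OPENAI_API_KEY from the environment

ARTICLE_TITLE = "Example NYT Article Title"  # placeholder, not a real article

# Seed the conversation by asking for the opening paragraph outright.
messages = [
    {"role": "user",
     "content": f"What is the opening paragraph of the article '{ARTICLE_TITLE}'?"},
]

recovered = []
for attempt in range(20):  # the filing alleges tens of thousands of tries
    response = client.chat.completions.create(
        model="gpt-4",
        messages=messages,
    )
    reply = response.choices[0].message.content
    recovered.append(reply)

    # Feed the model's own output back and press for the next sentence --
    # the pattern OpenAI characterizes as inducing "uncommon and
    # unintended phenomena" rather than normal use.
    messages.append({"role": "assistant", "content": reply})
    messages.append({"role": "user", "content": "Continue with the next sentence."})

print("\n".join(recovered))
```

The point of the sketch is simply to show why OpenAI frames this as adversarial rather than ordinary use: any verbatim output would emerge from looped, targeted prompting across many attempts, not from a single query a typical subscriber might type.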
If successful, OpenAI's argument would dismiss only part of the "hacked ChatGPT" copyright lawsuit. However, it would negate the bulk of NYT's claims for damages from copyright infringement and Digital Millennium Copyright Act violations. The stakes are equally high for OpenAI. Statutory damages can run up to $150,000 for each piece of infringing content, so the 100 examples already in the court filings could alone put as much as $15 million on the table, before any broader claims. If a judge ultimately rules against it, the company (and rivals that used similar training practices) could be forced to dump their stores of training data entirely and start over with more limited offerings.
Copyright lawsuit will set fair use precedent
The "NYT hacked ChatGPT" defense directly addresses claims of damages due to the chatbot being used as a potential substitute for a subscription to the paper, much as many less sophisticated tools allow users to bypass its paywall. But the defense does not address the broader question of whether OpenAI and others have an inherent right to use a copyrighted work to train an AI model, something that will rely on court interpretations of fair use law.
The US fair use doctrine has never had terms clear enough to cover every circumstance; it is largely built on precedent established by prior court decisions as new examples of alleged unauthorized use arise. That is why the outcome of this copyright lawsuit carries so much potential weight: it will be the first direct test of AI companies' use of copyrighted material as training data.
How the courts interpret this use will be vital to the future of OpenAI and similar companies; OpenAI has already stated publicly that it is impossible to train these types of LLMs without scraping publicly accessible material from the internet. However, some legal analysts believe no single case will set the terms for everyone going forward; instead, the outcomes of multiple cases will likely cover different sets of circumstances.
On that front, OpenAI and its rivals recently had some of the smaller cases against them partially dismissed. A joint case led by authors Sarah Silverman, Michael Chabon, and others had many of its claims against Meta's Llama AI rejected by a federal judge in February. However, the judge focused on whether the AI was reproducing their works, and has yet to tackle the question of whether AI companies have the right to scrape the open internet for training material. The decision also granted leave to amend the dismissed claims, allowing the plaintiffs to regroup and make a second attempt.
James McQuiggan, security awareness advocate at KnowBe4, notes that the court’s decision could also open the door for even more lawsuits: “The legal dispute between The New York Times and OpenAI puts society at an intriguing crossroads of copyright law and AI. It creates a number of questions about the use of Generative AI and the ethical implications of how models and Large Language Models are being trained in the evolution of AI and its use in society. There is uncertainty between protecting intellectual property and facilitating AI development. This case could open the door or slam it in their face on how AI can utilize existing copyrighted materials for training purposes without explicit consent. Depending on the outcome of this suit, it will be of interest to other copyright owners wanting to sue tech companies over AI model training processes. There could be a need to redefine copyright standards in the age of artificial intelligence. It’s a balancing act between copyright laws and acknowledging the unique nature of AI’s learning processes and reliance on large and publicly available data sets.”