There has been much excitement about ChatGPT since it launched in November 2022. A bellwether for the advance of generative AI, it can chat, create content, code, translate, brainstorm and more. It can even act as a personal assistant or therapist. Its use cases are almost endless.
The advance of AI raises some important questions. Is it the harbinger of the singularity? Will it replace all our jobs? We don’t aim to answer those questions here… Instead, we focus on the potential IP and data protection issues surrounding use of ChatGPT.
Let’s start at the beginning – how was ChatGPT trained?
Training generative AI tools such as ChatGPT involves taking vast quantities of data from online sources, processing it, and inputting it into a model. This inevitably involves copying of the underlying data. It is also possible that, either in its original training or through interactions with end users, ChatGPT will be provided with confidential information. We look at potential IP issues relating to each of these categories in turn.
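To make the copying point concrete, the short sketch below shows what a single corpus-collection step might look like. It is purely illustrative: the source list, scale and pipeline are hypothetical and say nothing about how OpenAI actually assembled ChatGPT’s training data.

```python
# Purely illustrative: a toy data-collection step, not OpenAI's pipeline.
# The source list is hypothetical; real training corpora span billions of pages.
import requests

SOURCES = ["https://example.com/"]  # hypothetical list of online sources

corpus = []
for url in SOURCES:
    response = requests.get(url, timeout=10)
    corpus.append(response.text)  # a literal copy of the work is now stored locally

# Preprocessing: the copied text is split into tokens the model will be trained on.
tokens = [token for document in corpus for token in document.split()]
print(f"Copied {len(corpus)} document(s), yielding {len(tokens)} tokens")
```

Even in this toy version, the works are reproduced in full before any training takes place, which is why the analysis below matters.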
Copyright and database right
Individual items of training data, e.g. books, articles and other written works, will qualify for copyright protection if they are original. As a general principle, you cannot copy copyright works without permission. As time has progressed, and technology has developed, a number of exceptions to this general principle have arisen. While the basics of copyright protection are largely consistent across most of the world (thanks to international treaties), these exceptions vary from country to country.
In the UK and EU there is also the sui generis database right. This protects a database if there has been substantial investment (financial, human or technical) in obtaining, verifying or presenting the contents. To infringe database rights in the UK or EU, there must be unauthorised extraction or re-utilisation of all or any substantial part of the database.
To the extent there is large-scale copying of third-party materials without permission in the training of ChatGPT (and other generative AI tools) and no defence or exception applies, this could infringe copyright and/or database rights.
The TDM exception – is it critical to the development of generative AI tools like ChatGPT?
So, copying of protected works is generally a no-no. But training AI tools such as ChatGPT requires copying enormous amounts of data. The two positions appear potentially irreconcilable. This is where the “text and data mining” (TDM) exception comes in. This is an exception to copyright and database rights which generally permits copying for the purpose of computational analysis.
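For a sense of what “computational analysis” means in practice, here is a minimal, hypothetical sketch: it mines a small in-memory corpus for term frequencies, the kind of statistical processing a TDM exception is designed to permit. The example documents are placeholders, not real training data.

```python
# Illustrative only: a toy text-and-data-mining step over copied texts.
from collections import Counter

corpus = [
    "the quick brown fox jumps over the lazy dog",
    "the dog barks at the quick fox",
]  # placeholder stand-ins for copied works

# Computational analysis: derive statistics from the texts rather than
# republishing them, e.g. term frequencies across the whole corpus.
term_frequencies = Counter(word for document in corpus for word in document.split())
print(term_frequencies.most_common(3))  # [('the', 4), ('quick', 2), ('fox', 2)]
```

The point of contention is not this analysis step itself but the copying that necessarily precedes it, and that is where national exceptions diverge.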
The position on TDM varies across the world, as legislators struggle to find the right balance between encouraging digital innovation and protecting the creative industries and their right to exploit and commercialise their works.
This leaves both AI companies and IP rightsholders in an uncertain position regarding the protection they are afforded, depending on where in the world such TDM activities take place. Examples of different approaches are set out below.
In Japan, copyright exceptions permit copying for the purposes of machine learning and data verification and to permit electronic incidental copies. The Japanese position is widely considered to be the most AI friendly. A similar although slightly more restrictive approach has been taken in Singapore, permitting TDM for “computational data analysis” of lawfully accessed works. In both cases use is permitted for commercial purposes and cannot be restricted by contract.
In Europe, TDM is permitted if works are lawfully accessed, unless the copyright/database owner has opted out. Interestingly, the proposed EU AI Act requires providers of generative AI technologies to provide a summary of any copyright works used to train their AI models. If this requirement remains in the final regulation, it could be in force by 2026. While this would provide transparency in respect of AI training data, the administrative burden of compliance for AI companies may be significant.
In the UK, the current position on TDM is more restrictive. TDM is not permitted in respect of either copyright works or sui generis database rights except for non-commercial purposes. Last year, the UK Government proposed introducing a new exception to permit commercial TDM, but this proved to be extremely unpopular with the creative industries and was ultimately scrapped. Currently, the intention is that the UK Intellectual Property Office will produce a code of practice, by “this summer”, to support AI firms in accessing protected works as input for their models through reasonable licences. It appears that the UK plans to improve the current licensing environment for AI companies, rather than introducing a broad new exception.
In the US, there is no specific TDM exception. The “fair use” defence to copyright infringement may apply, although this is by no means clear. AI companies generally argue that their models do not infringe copyright because their use of the original works is transformative. This is currently being tested in a number of AI-related lawsuits going through the US courts, so clarity on this issue should emerge in due course. Connected to this, in the recent US Supreme Court ruling in Andy Warhol Foundation for the Visual Arts v Goldsmith (which did not relate to use by AI), the court held that the fair use defence did not apply to the commercial licensing of the “transformed” work in that case. Time will tell whether this decision has narrowed the scope of the fair use doctrine in ways that are relevant to AI companies.
This clear divergence in approach means that companies such as OpenAI need to consider carefully where it is most appropriate to train their AI models before the fully trained products are rolled out internationally.
Reactions to this legal uncertainty vary. For example, Adobe’s recently launched generative AI tool, Firefly, is trained only on Adobe Stock images, openly licensed content and out-of-copyright images. Adobe is so confident that its outputs do not and will not infringe third-party copyright that it offers to indemnify end users against IP infringement claims!
Confidential information
OpenAI typically collects user inputs (prompts and responses) to enhance and refine ChatGPT. As a result, confidential information could be provided by users to ChatGPT which becomes part of its dataset and is used in outputs.
OpenAI acknowledges, in its T&Cs, that you may receive information through ChatGPT that is confidential. There is, however, no mechanism for you to assert that information you input should be kept confidential.
This has potentially significant implications. For example, an inventor may ask ChatGPT to assist in preparing patent claims for their new invention. In some jurisdictions, this act of describing the invention to ChatGPT without confidentiality provisions in place may be novelty-destroying, rendering that exciting new invention unpatentable.
Similarly, if an employee trying to resolve a knotty problem asks ChatGPT for suggestions, they may unwittingly reveal their employer’s trade secrets to ChatGPT.
OpenAI is clearly alive to these confidentiality issues. At the end of April, it introduced the ability to turn off chat history and announced “ChatGPT Business”, a more secure version of ChatGPT where end users’ data will not be used to train OpenAI’s models.
Personal data
Given the volume of data used to train ChatGPT, it is inevitable that some personal data will be included in the dataset. In accordance with Article 6 of the GDPR, the processing of this personal data requires a legal basis to be considered lawful. Given that few, if any, data subjects will have provided GDPR-standard consent, the only legal basis likely to apply is ‘legitimate interests’. Interestingly, at the end of March the Italian data protection regulator (the Garante) temporarily blocked ChatGPT due to its inability to establish a legal justification under Article 6 GDPR. However, less than a month later the Garante was satisfied that OpenAI had addressed its immediate concerns, including by improving its privacy notice and the rights it grants to individuals. It therefore restored access for the Italian public.
ChatGPT continues to raise concerns amongst regulators and academics regarding its data protection policies. The European Data Protection Board has established a taskforce to review ChatGPT’s GDPR compliance. Regulatory authorities elsewhere have also launched investigations into ChatGPT’s privacy compliance, notably in Canada.
What about ChatGPT’s outputs?
By their very nature, the workings of ChatGPT and other generative AI tools can be opaque (a “black box”). So, aside from any regulatory transparency requirements such as those currently proposed by the EU, it can be difficult to identify what training data has been used and in what way.
In some cases, training data may be identified because it is substantially reproduced in the AI tool’s outputs. For example, Getty Images has brought lawsuits against Stability AI in the US and UK with respect to its AI art generator, Stable Diffusion. Stable Diffusion has generated images that incorporate Getty’s watermark, indicating that Getty images were used, without permission, in the training of the AI tool. Getty licenses images and data to other AI companies, but Stability AI had not taken such a licence. Getty has asserted that Stability AI “unlawfully copied and processed millions of images protected by copyright” and in the US is seeking $1.8 trillion in damages! In the UK, it has also alleged database right infringement, trade mark infringement and passing off. AI companies and IP rightsholders alike will watch the outcome of these legal challenges with interest.
OpenAI is demonstrably working to ensure that ChatGPT’s outputs do not infringe copyright in the training data. When prompted to provide a specific extract of a copyright work, ChatGPT stated, “sharing copyrighted content…would be a violation of intellectual property rights. The text…is protected by copyright law and reproducing it without proper authorisation is not allowed. I can, however, provide a brief summary…”.
ChatGPT will, however, through less direct prompts, still reproduce certain copyright works in whole or in substantial part. We will see how OpenAI deals with this in further iterations of ChatGPT.
Who owns the output?
ChatGPT, as a generative AI tool, generates content in response to prompts. This may be entirely new content. It is possible in certain circumstances that copyright subsists in the output.
The UK copyright regime affords copyright protection to “computer-generated” works where there is no human author. This is not the case in various other jurisdictions where human authorship is required for copyright to subsist. For example, earlier this year, the US Copyright Office confirmed that images created using the AI system Midjourney could not be granted copyright protection.
To the extent copyright does subsist in outputs provided by ChatGPT, as a matter of contract, these are owned by the end user who can use the content for any purpose. OpenAI acknowledges, however, that “outputs may not be unique across users”, i.e. there may be duplication of content. Users should therefore be aware that although they have broad use rights, other users may have equivalent rights to the same content.
Can I trust ChatGPT?
One of the concerns regarding ChatGPT outputs is the danger of “hallucinations”. OpenAI admits outputs may be misleading, untruthful or inaccurate. For example, ChatGPT has confidently cited legal cases that do not exist. ChatGPT has also made assertions about individuals which have been shown to be untrue, leading to concerns about defamation and general misinformation. Brian Hood, an Australian mayor, has recently said he may bring a defamation case against OpenAI if it does not correct false claims about him (it named him as a guilty party in a foreign bribery scandal when in fact he was a whistleblower). This would be the first case of its kind.
A general takeaway for now is that users should take care before re-using or relying upon ChatGPT outputs, which can be simultaneously very convincing and very wrong.
Is there clarity ahead?
The approach to regulating AI varies from country to country. While the GDPR – as the “law of everything” – continues to perform a central role in the regulation of AI, the EU is proposing a new AI Act that will impose tiered regulation. That will ban some use cases of AI completely (such as subliminal manipulation causing harm), heavily regulate others (such as employee recruitment), require transparency in some cases (such as the creation of deepfakes) but leave other uses largely unregulated.
In contrast, the UK has taken a light-touch, pro-innovation approach, choosing not to apply specific regulation and instead asking only that regulators apply a set of five general principles when applying existing law to AI. However, the UK Competition and Markets Authority has called for a review of the AI technology sector, including ChatGPT. The position on TDM is yet to be confirmed, although we may see clarity through the UKIPO code of practice and judgments coming out of cases such as Getty Images v Stability AI.
As to the rest of the world, watch this space. A number of countries have banned ChatGPT, including Russia, China, Cuba, Iran and Syria. The CEO of OpenAI, Sam Altman, testifying before the US Congress in May 2023, said, “regulation of AI is essential”. Lawmakers worldwide appear to agree as they continue to scramble to regulate the use of generative and other AI. Various AI-related lawsuits are proceeding in the US and elsewhere. What is clear is that we will be talking about ChatGPT and other generative AI tools for some time to come!