Security researchers with Google DeepMind and a collection of universities have discovered a ChatGPT vulnerability that exposes seemingly random pieces of training data, triggered simply by telling the chatbot to repeat a particular word forever.
The researchers found that when ChatGPT is told to repeat a word like “poem” or “part” forever, it complies for a few hundred repetitions. It then breaks down and begins spewing apparent gibberish, and that random text at times contains identifiable data like email address signatures and contact information. The incident raises questions not only about the security of the chatbot, but also about where exactly it is getting all this personal information from.
ChatGPT vulnerability stems from easily induced glitch
The ChatGPT vulnerability is documented in a new report from about a dozen researchers with Google DeepMind, Cornell University, CMU, UC Berkeley, ETH Zurich and the University of Washington.
The choice of word seems to make a difference in how much personally identifiable training data the chatbot returns. For example, the word “company” causes it to return contact information at 164 times the rate of numerous other words, likely owing to how the language model associates individual words with its training data. The report documents exactly what is returned when the chatbot is asked to repeat “book” infinitely: it does so about 300 times, then pastes in a portion of a seemingly randomly chosen book review before spiraling into a series of book reviews and descriptions that appear to be taken from a single source.
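The reported failure mode — hundreds of clean repetitions followed by divergent text — is easy to picture with a small checker. This is an illustrative sketch only, not the researchers' tooling: the function counts how many times a response repeats the prompt word before any other text appears, and returns whatever follows.

```python
import re

def count_repetitions(response: str, word: str) -> tuple[int, str]:
    """Count leading repetitions of `word` in a model response,
    and return whatever divergent text follows them."""
    # Match the word repeated at the start, separated by spaces or commas.
    pattern = re.compile(rf"^(?:{re.escape(word)}[\s,]*)+", re.IGNORECASE)
    match = pattern.match(response)
    if not match:
        return 0, response
    prefix = match.group(0)
    reps = len(re.findall(re.escape(word), prefix, re.IGNORECASE))
    return reps, response[match.end():]

# Synthetic output mimicking the described behavior with "book":
output = "book " * 300 + "This novel follows a detective through 1920s Vienna..."
reps, divergent = count_repetitions(output, "book")
```

On this synthetic input, `reps` is 300 and `divergent` holds the book-review-like text, mirroring the pattern the report describes.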
The researchers say that testing of the ChatGPT vulnerability yielded personally identifiable information for dozens of people, such as user IDs and Bitcoin addresses. Explicit information from dating websites could also be fished out of the training data when a related word was used as the prompt. The researchers also found possible copyrighted or non-public information in the form of snippets of programming code and entire passages of books or poems. In total, they spent $200 on queries and extracted about 10,000 of these blocks of verbatim memorized training data.
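Finding identifiable data in divergent output can be approximated with simple pattern matching. A minimal sketch, assuming nothing about the researchers' actual methodology — just regular expressions for two of the data types mentioned (email addresses and legacy Bitcoin addresses):

```python
import re

# Illustrative patterns only; real PII detection is far more involved.
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.-]+\b")
# Legacy Base58 Bitcoin addresses start with 1 or 3 and avoid 0, O, I, l.
BTC_RE = re.compile(r"\b[13][1-9A-HJ-NP-Za-km-z]{24,33}\b")

def scan_for_pii(text: str) -> dict[str, list[str]]:
    """Return suspected emails and Bitcoin addresses found in `text`."""
    return {
        "emails": EMAIL_RE.findall(text),
        "bitcoin_addresses": BTC_RE.findall(text),
    }

# Synthetic divergent output containing PII-like strings:
sample = (
    "Contact: jane.doe@example.com | "
    "Donate: 1A1zP1eP5QGefi2DMPTfTL5SLmv7DivfNa"
)
hits = scan_for_pii(sample)
```

Running a scanner like this over thousands of divergent responses is one plausible way to tally how often memorized contact details surface, though the report does not specify its exact detection pipeline.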
The report notes that the attack does not work against other large language models, and media outlets that attempted to replicate the results saw either different output or the model functioning as normal. OpenAI has not issued a public statement about the report, so it is not clear whether the ChatGPT vulnerability has been fully addressed. The researchers say they responsibly disclosed it to OpenAI on August 30, and the issue has reportedly been patched, but the extent to which the patch mitigates it remains in question.
Emerging field of training data attacks breaks LLM guardrails
The report describes a new class of “divergence” attacks that extract memorized training data from LLMs. The researchers find that larger and more capable LLMs tend to be more vulnerable to these attacks, as do those trained on very large volumes of data. One caveat to the study is that the ChatGPT vulnerability was only found in version gpt-3.5-turbo; it does not appear to apply to GPT-4 or other production language models. GPT-3.5 is available to the public on a pay-per-query basis, while GPT-4 requires a paid subscription to access.
Similar attacks on unaligned models have already been documented, but the new ChatGPT vulnerability is unique as a successful attack on an aligned model. Aligned models are those that have been tuned toward specific goals, and usually have extensive “guardrails” in place to prevent undesirable outcomes.
The ChatGPT vulnerability appears far too random to use in a targeted way, but a scattershot approach by attackers could turn up unexpected pieces of private and valuable information. The report raises some broader concerns, one of which is exactly how often the model is “memorizing” whole chunks of training data and repeating them word-for-word. OpenAI is already facing numerous lawsuits and regulatory scrutiny over how it gathers training data, which has appeared to involve scraping websites and online services (without the knowledge or permission of the sites or users) and even books and other non-public material.
The issue also demonstrates that OpenAI’s present alignment techniques do not eliminate the possibility of a ChatGPT vulnerability involving memorization. Sample outputs published in the report indicate that the training data holds chunks from the CNN website, code from Stack Overflow, passages from assorted WordPress blogs and a casino equipment vendor’s website, among other assorted pieces of information.
What these models scrape and do not scrape on their own is one issue. Another is the data that clients enter into them. There have already been several high-profile cases of employees feeding protected personal information or company secrets into ChatGPT as they attempt to automate portions of their work, something that has led to mass workplace bans of the chatbot throughout the financial industry and at some tech leaders such as Amazon, Apple and Samsung.
As Anurag Gurtu, CPO of StrikeReady, observes: “The exposure of training data in ChatGPT and other generative AI platforms raises significant privacy and security concerns. This situation underscores the need for more stringent data handling and processing protocols in AI development, especially regarding the use of sensitive and personal information. It also highlights the importance of transparency in AI development and the potential risks associated with the use of large-scale data. Addressing these challenges is critical for maintaining user trust and ensuring the responsible use of AI technologies.”