A new Microsoft report documents the rise of “skeleton key” jailbreak attacks aimed at causing LLMs to answer questions they should be restricted from addressing. While people have developed many means of tricking AI models into slipping their guardrails, skeleton key attacks are noteworthy for working consistently across models from different companies and immediately “priming” the AI to directly answer any and all off-limits questions.
The general theme of a skeleton key attack is to convince the AI that the requester is some sort of trusted authority figure or credentialed researcher who has a need for uncensored output. Once the right statement of this nature primes the AI, it will then freely answer any and all questions without regard for usual content restrictions. Microsoft has found techniques that work consistently to jailbreak the biggest AI models in this way: ChatGPT, Meta Llama, Google Gemini, Mistral Large and more.
“Skeleton Key” jailbreak method consistently fools LLMs with little effort
All of the effort in skeleton key attacks goes into crafting the right statement to convince AI models to shed their guardrails entirely. Once a functional statement has been developed, it is essentially a “plug and play” method to jailbreak a variety of models.
The examples that the Microsoft team posts involve convincing the AI models to provide instructions for creating weapons, but the possibilities go far beyond that. A “root” jailbreak of this nature could also be used to convince the AI to develop or even deploy malicious code, for example. Another possibility is that it could convince the model to seek out and provide sensitive personal data that should be off-limits.
The Microsoft researchers say that they have found multiple “skeleton key” approaches and have privately briefed the providers of other AI models on them before disclosing. The lone detailed example that the researchers provide to the public is an “attack flow” technique that simply assures the AI that it is operating in a “safe educational context” with “researchers trained in ethics and safety” that require uncensored outputs. That statement of just several sentences is enough to convince the AI to spit out instructions for building homemade bombs that it had previously refused.
The researchers tested (and were able to jailbreak) the following specific models from April to May 2024: Anthropic Claude 3 Opus, Cohere Commander R Plus, Google Gemini Pro, Meta Llama3-70b-instruct, OpenAI GPT 3.5 Turbo, and OpenAI GPT 4o. ChatGPT 4 did rebuff the most basic form of the attack, but fell for it when it was passed as a user-defined system message (something that developers have API access to).
AI models need significant reinforcement against skeleton key attacks
The jailbreak issue highlights a broader problem with AI models: they can only function by scraping massive amounts of data relatively indiscriminately from sources that are usually not moderated by the same rules, and the data cannot simply be picked out or blocked from the training material. This has left developers scrambling to implement guardrails to limit everything from reproduction of copyrighted content to advocating for violence and self-harm.
This “arms race” between developers and hackers is one that the developers are largely losing thus far. The skeleton key approach is the most convenient of the jailbreak methods thus far, but it is hardly the only one. A recent research paper from the U.K. AI Safety Institute (AISI) found that the safeguards in five of the largest AI models are ineffective against even basic techniques and in some cases volunteer what should be private information in response to benign questions not meant to act as a jailbreak. The study also noted that the LLMs displayed “expert-level” knowledge of chemistry and biology but could only complete “simple” cybersecurity challenges designed for a high school audience.
There are a variety of other techniques that attackers use to jailbreak AI models. One example is the “token manipulation” approach, in which attackers alter tokens in an otherwise valid input text to give the AI something it semantically understands but that skirts around its defined guidelines due to the alternate spelling or presentation. In some cases, this has been as simple as adding a string of exclamation points to an otherwise forbidden request. Some attackers are even customizing their own AI models to probe LLMs for vulnerabilities and oversights in security policy.
Microsoft has updated its own AI offerings in response to these discoveries, and the security team provides some mitigation recommendations to others. These include implementing input and output filtering solutions, using “prompt engineering” defensively to reinforce the system prompts with instructions on what to stay away from, and even introducing a separate AI monitoring system dedicated to detecting adversarial attempts on LLMs.