ChatGPT on smartphone showing prompt injection attack on AI models

·3 min read

HiddenLayer Prompt Injection Attack Able to Break the Guardrails of All Major AI Models

Wei Chieh Lim·May 5, 2025

Researchers with AI security firm HiddenLayer have developed a single prompt injection attack that works across all of the major AI models currently in use. The attack breaks the safety guardrails of essentially any model it’s thrown at, convincing it to reveal its system prompt as well as engage in all manner of potentially harmful exchanges.

The prompt injection attack was tested successfully against the major commercial models from Anthropic, DeepSeek, Google, Meta, Microsoft, Mistral, OpenAI and Qwen. It reportedly can also be adapted to new models with some knowledge of their policy files; the attack essentially replaces their standing instructions about security.

All existing frontier AI models potentially vulnerable to new prompt injection attack

Called the “Policy Puppetry Attack,” the new prompt injection attack focuses on formatting requests to look like the contents of one of the policy files that AI models rely on for their security and safety guidelines. This could take the form of a JSON, XML, INI or several other types of files, but the researchers say there is no specific format for the actual policy language; the instructions must simply be structured in a way such that they look like something the target LLM usually views as policy.

The AI models can then be instructed to do essentially anything that was previously restricted. The researchers found they will reveal their hidden system prompts, as well as produce assistance and instruction for a variety of types of harmful content: self-harm, mass violence and chemical or biological attacks among other examples.

While numerous jailbreaks of this type exist, and some that work across multiple AI models, the unique quality of the “Policy Puppetry” approach is that one well-crafted prompt works across nearly all of the target models with minimal to no modification. The researchers say that certain AI models, such as Gemini 2.5 Pro and ChatGPT o1 and o3-mini, are resistant to the “stock” version of the prompt injection attack but it takes only minor adjustment to one particular section to compromise those. There is one quirk for some of the most “heinous” requests, however; they must be written in “l33tspeak” rather than in plain English for some of the AI models to accept them, for example rearranging “uranium” as “ur4n1um.”

The policy prompt can be compressed to cost fewer than 200 tokens to run, a negligible price for the listed models. Prices vary by model, but in all cases this is well below 1 cent USD per attack. With GPT-4o, for example, a penny would buy thousands of attacks.

Security of AI models once again called into question

The discovery of this universal prompt injection attack seems to have been a happy accident for the researchers. They had been focusing solely on jailbreaking ChatGPT 4 when initially developing the approach, and were surprised to find that it was equally successful when tried on other models.

Opening up AI models to any dangerous command is troubling, but one has an open door to jailbreaking if they can simply convince the model to give up its system prompt. These are supposed to be closely guarded from the outside world, as they describe the exact terms by which the model is regulated. Unlocking one is something like looking at source code, allowing attackers to plot out other ways to get around the guardrails.

The security of AI models is once again in the spotlight due not just to this prompt injection attack, but the new buzz about “Agentic AI” roles for AI assistants meant to take on more human-like duties as a labor solution. Right now, the primary concern about LLMs is preventing them from generating hate speech or instructions for carrying out dangerous acts; and much of that is driven by liability concerns for their creators, as these are largely things that could be found elsewhere with relatively simple searches. But with AI gaining a greater level of permission and direct interaction in companies, these jailbreaking prompts could present much greater threats. For example, medical AI could be convinced to provide bad advice or reveal personal details, or attacks on real-world industrial systems could be amplified by hijacking AI controllers.

The field of prompt injection attacks was also well-developed prior to this exploit, providing threat actors with many other options for massaging AI models into bad or unexpected behavior. Microsoft’s security team published its Context Compliance Attack (CCA) in March of this year, a technique that targets conversation history rather than crafting prompts. This attack type involves starting a normal conversation with the LLM, then injecting a fabricated assistant response at a key moment. This attack was also tested successfully against a broad range of leading AI models.

HiddenLayer Prompt Injection Attack Able to Break the Guardrails of All Major AI Models

All existing frontier AI models potentially vulnerable to new prompt injection attack

Security of AI models once again called into question

New US National Security Order Calls for Pre-Release Access and Assessment of AI Models

“Man in the Prompt”: New Class of Prompt Injection Attacks Pairs With Malicious Browser Extensions to Issue Secret Commands to LLMs

Lawsuit Accusing LinkedIn of Training AI Models With InMail Private Messages Dismissed

Microsoft: “Skeleton Key” Attacks Consistently Jailbreak AI Models, Allows Users to Directly Ask Forbidden Questions

Prompt Injection Vulnerability in Google Gemini Allows for Direct Content Manipulation

New X Privacy Policy Promises No Non-Public Personal Data Use in AI Models, Requires Consent for Biometric Info

“All Your Content Belongs to Us”: Google Privacy Policy Update Suggests It Plans to Scrape Everything on the Internet to Improve AI Models

ChatGPT Adds Ability to Turn Off Chat History, but “Grandfathered” Conversations Still Accessible to AI Models

Ransomware Attack at Coca-Cola’s Fairlife Dairy Company Halts U.S. Operations

Data Breach at Largest Indian Nuclear Power Plant Leaks Sensitive Files

Cyber Attack on Major Japanese Refrigerated Logistics Provider Disrupts KFC and Other Food Chains

Why Purple Teaming Is Becoming the Operating Model for Cyber Defense