A new report from cybersecurity firm HiddenLayer finds that Google Gemini is vulnerable to prompt injection attacks, which could be used in content manipulation that could further impact other users.
The researchers characterize the prompt injection attacks as being open to “profound misuse.” Among these possibilities are a jailbreaking attack that allows the AI to override its prohibition on creating fictional accounts, and the ability to access system prompt information that should only be available to developers.
Google Gemini Advanced, API vulnerable to prompt injection attacks
The researchers say that the prompt injection attacks impact Gemini Advanced accessed by users with Google Workspace, and organizations that are making use of the Gemini API. The content manipulation risk is also said to more generally apply to world governments as it could be used to output inaccurate or falsified information about elections. The risk is particularly acute as Google Gemini has been trained on audio, video, images and code in addition to text.
One of the central issues identified by the researchers is that it is relatively trivial to get Google Gemini to leak system prompt information. This is information about the “prime directives” of the AI model, so to speak, that should not be visible to service users. The researchers’ first prompt injection attack is to simply change the wording when asking the AI about this information, causing it to spit out its core rules when asked about its “foundational instructions” instead.
Another exploit involving the system prompt is a seeming state of confusion that the AI can be thrown into by peppering it with many uncommon tokens. In this case, it was as simple as repeating a word like “google” over and over again as an input. This can cause Google Gemini to output prior instructions; in this case, a secret passphrase that the researchers gave the AI and instructed it to not share under any circumstances.
The content manipulation vulnerability involves jailbreaking Google Gemini’s strict rules about not generating fictional information, meant to ensure that it does not output misinformation. This attack is similar to past prompt injections that compromised ChatGPT and other systems, in that the system appears to be vulnerable to being told it is now in “fictional state” mode and is allowed to tap into its creative capabilities. This is specifically reminiscent of the attack against ChatGPT last year that instructed it to enter “DAN” or “Do Anything Now” mode, enabling it to ignore all of its developer guardrails when responding to user queries.
Content manipulation attacks raise concerns about illicit tutorials, political propaganda
The system prompt injection attacks raise the obvious concern of unearthing means to execute more complex attacks against the AI. The fear with content manipulation is that Google Gemini’s advanced capabilities can be used for more convincing political misinformation during the election season, or for detailed tutorials on dangerous activities.
None of the techniques are at all sophisticated, and presently only require access to Google’s MakerSuite for developers. At minimum developers are advised to not include any sensitive information in the system prompts. Beyond that, Google will likely have to address the prompt injection and content manipulation issues with updates. Shared workspaces that include Gemini Ultra will also have to consider that the tool can be abused to gain access to the contents of files that should be off limits, and that malicious links could also be passed in this way.
Google has responded to the study by restricting Gemini’s ability to answer election-related questions temporarily. This comes shortly after the company had to temporarily suspend its image generation capabilities when it was found that it frequently refused to create images of white people, even going so far as to create fanciful versions of historical figures and peoples when asked for an accurate representation.
Google Gemini is far from the only LLM that has been cracked by researchers as of late. A recent study from a coalition of university researchers, OpenAI and Google Deepmind developed a model-stealing attack that can cause ChatGPT and other systems to spit back personally identifiable (and potentially private) information. Security researchers generally agree that prompt injection attacks of this sort can be found for just about any LLM, and that content manipulation is going to be a serious and recurring problem going forward.
Though you do not hear about it as often as ChatGPT, Google Gemini presently has over 100 million users. Google has issued a statement that addresses content manipulation and prompt injections, assuring users that it regularly runs red-teaming exercises to identify issues and is continually adding new safeguards.
Kelvin Lim (Senior Director, Security Engineering, APAC at Synopsys Software Integrity Group) notes that this is another indication that interested companies will likely have to build their own private LLMs at some point: “The emergence of public Large Language Models (LLMs) presents both opportunities and challenges. While providing powerful capabilities, these models also serve as a new platform for malicious actors to launch attacks. Consequently, companies must decide whether to allow or block staff access to these public LLM models. Should companies allow their staff to access public LLM models, it becomes imperative to establish and communicate clear policies and guidelines for their safe utilization. Among these guidelines, it’s important to stress that company-sensitive and Intellectual Property (IP) information should never be included in public LLM prompts. Additionally, information generated by public LLM models should also undergo rigorous fact-checking procedures to mitigate the risk of misinformation. However, for better security and control, it is advisable for companies to consider developing their own private LLMs.”