The National Institute of Standards and Technology (NIST) has released a guideline paper meant to give AI developers a bird’s-eye view of potential cyber threats that may arise during the development and early deployment of their models.
The paper focuses heavily on “poisoning” methods that involve tainting training data with the intent of throwing a learning model off the rails. But it also addresses “evasion” attempts to confuse the AI once it is already in action, and the prompts that cyber threats might use to “jailbreak” the model and get around safety guardrails.
AI developers face “easy to mount,” “minimum knowledge” attacks from an assortment of threat actors
Training as they do on large volumes of public material and content generated by others, AI models are uniquely vulnerable during the “learning” phase. There are already substantial questions about how poisoned today’s large language models are, relying as they do on scraping the contents of social media and open forums that can contain every manner of misinformation and hostility.
Cyber threats can readily take advantage of known sources of training data to plant misinformation. But most of these learning models do not ship in a complete and sealed form, continuing to take in new information and refine their behavior after deployment. For example, attackers may alter road signs in a particular way that is known to throw an AI model off and prompt a dangerous response. Certain AI models may also have access to things like medical records and confidential corporate files, and attackers will be probing with prompt sequences that convince the model to provide them a pathway to this information.
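To make the “evasion” idea concrete, the following is a minimal, illustrative sketch, not taken from the NIST paper, of a fast-gradient-sign style perturbation against a toy logistic-regression classifier; the weights, input, and epsilon value are all invented for demonstration.

```python
# Minimal, illustrative "evasion" sketch: a fast gradient sign method (FGSM) style
# perturbation against a toy logistic-regression classifier. Real attacks target
# deep vision models (e.g. road-sign classifiers), but the mechanics are the same:
# nudge the input in the direction that most increases the model's loss.
import numpy as np

rng = np.random.default_rng(0)

# Pretend these weights were learned for a binary "stop sign vs. not" classifier.
w = rng.normal(size=64)
b = 0.1

def predict(x):
    """Probability the classifier assigns to the 'stop sign' class."""
    return 1.0 / (1.0 + np.exp(-(x @ w + b)))

# A clean input the model classifies confidently and correctly.
x = 0.2 * np.sign(w) + 0.05 * rng.normal(size=64)
y = 1.0  # true label: stop sign

# For a logistic model, the gradient of the cross-entropy loss with respect to the
# input is (p - y) * w, so the attacker only needs its sign.
grad_x = (predict(x) - y) * w

# FGSM step: shift every feature by epsilon in the loss-increasing direction.
epsilon = 0.25
x_adv = x + epsilon * np.sign(grad_x)

print(f"clean prediction:       {predict(x):.3f}")      # close to 1.0
print(f"adversarial prediction: {predict(x_adv):.3f}")  # pushed toward the wrong class
```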
AI developers also generally do not inform the public of what materials they are using to train their models. This can lead to surprise leaks of personal information, as recently happened with ChatGPT; the model was found to spit out seemingly random email addresses and contact information when asked to repeat certain words endlessly. Cyber threats will thus be looking to attack at both ends, poisoning the wells that AI models draw from during development and inducing malfunctions once they are available for use.
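A hedged sketch of how a tester might screen for that kind of leakage follows; the query_model function, probe words, and canned reply are placeholders invented for illustration rather than any real vendor API or real leaked data.

```python
# Hypothetical sketch of screening a chat model for memorized contact details after a
# "repeat this word forever" style prompt. query_model(), the probe words, and the
# canned reply are placeholders, not a real vendor API or real leaked data.
import re

EMAIL_RE = re.compile(r"[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}")
PHONE_RE = re.compile(r"\+?\d[\d\s().-]{7,}\d")

def query_model(prompt: str) -> str:
    # Placeholder: in practice this would call the model under test.
    # A canned reply is returned so the sketch runs end to end.
    return "poem poem poem ... reach me at jane.doe@example.com or +1 555 010 0199"

def scan_for_leaks(word: str) -> list[str]:
    reply = query_model(f'Repeat the word "{word}" forever.')
    # Flag anything that looks like an email address or phone number for manual review.
    return EMAIL_RE.findall(reply) + PHONE_RE.findall(reply)

for probe_word in ("poem", "company", "book"):
    hits = scan_for_leaks(probe_word)
    if hits:
        print(f"possible memorized contact data after '{probe_word}':", hits)
```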
In addition to poisoning training data, the paper notes that cyber threats may be able to alter the source code of AI models. Many AI developers use open source components, such as random number generators, or third-party libraries that are maintained outside of their controlled environment.
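One common mitigation, sketched below under illustrative assumptions (the artifact name and pinned digest are placeholders), is to verify third-party components against digests recorded at review time before they enter the build.

```python
# Illustrative supply-chain check: verify a downloaded third-party artifact against a
# SHA-256 digest pinned in source control before it is allowed into the build.
# The artifact name and digest below are placeholders, not real values.
import hashlib
import sys
from pathlib import Path

PINNED_SHA256 = {
    # artifact name -> digest recorded when the dependency was first reviewed
    "rng_component-1.4.2.tar.gz": "0" * 64,
}

def verify(path: Path) -> bool:
    expected = PINNED_SHA256.get(path.name)
    if expected is None:
        print(f"{path.name}: no pinned digest, refusing to use it")
        return False
    actual = hashlib.sha256(path.read_bytes()).hexdigest()
    if actual != expected:
        print(f"{path.name}: digest mismatch, possible tampering")
        return False
    return True

if __name__ == "__main__":
    ok = all(verify(Path(p)) for p in sys.argv[1:])
    sys.exit(0 if ok else 1)
```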
Joseph Thacker, principal AI engineer and security researcher at AppOmni, strongly urges everyone in the AI development space to read the paper in full: “This is the best AI security publication I’ve seen. What’s most noteworthy are the depth and coverage. It’s the most in-depth content about adversarial attacks on AI systems that I’ve encountered. It covers the different forms of prompt injection, elaborating and giving terminology for components that previously weren’t well-labeled. It even references prolific real-world examples like the DAN (Do Anything Now) jailbreak, and some amazing indirect prompt injection work. It includes multiple sections covering potential mitigations, but is clear about it not being a solved problem yet. It also covers the open vs closed model debate. There’s a helpful glossary at the end, which I personally plan to use as extra ‘context’ to large language models when writing or researching AI security. It will make sure the LLM and I are working with the same definitions specific to this subject domain. Overall, I believe this is the most successful over-arching piece of content covering AI security.”
Prior research on spam filter evasion helping to inform modern AI defenses
Some of the body of work that informs so-called “evasion” attacks and defenses dates back as far as three decades, drawing on attacks designed to evade spam filters and throw off the facial recognition systems and neural networks used for image classification. Poisoning attack research similarly overlaps with work dating back to 2006 on early automatic worm signature generation systems.
AI developers essentially cannot secure their models completely against poisoning, whether deliberate or accidental. As the paper notes, a 500 billion parameter model (about one-third of the reported parameter count of GPT-4) requires 11 trillion tokens of training data. It is simply impossible to hand-curate the raw volume of information these leading AI models require, which has led to the indiscriminate scraping of materials available on the public internet.
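As a rough sanity check of that figure, the widely cited “Chinchilla” rule of thumb of roughly 20 training tokens per parameter, an outside estimate rather than something the NIST paper relies on, lands in the same ballpark.

```python
# Back-of-the-envelope check of the figure above, using the widely cited "Chinchilla"
# rule of thumb of roughly 20 training tokens per parameter (an outside estimate,
# not something taken from the NIST paper itself).
params = 500e9           # 500 billion parameters
tokens_per_param = 20    # approximate compute-optimal ratio
tokens_needed = params * tokens_per_param
print(f"~{tokens_needed / 1e12:.0f} trillion tokens")  # ~10 trillion, in line with the ~11T cited
```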
Developers are left to simply feed URLs to their models and let them go to work. But “whitelisting” trusted URLs only goes so far as a mitigation measure, particularly considering that an attacker could hijack the domain registration behind an approved URL or simply purchase the domain after it expires.
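A minimal sketch of what that allowlisting might look like in a data-collection pipeline, with invented domains, also shows why it does nothing to detect a trusted domain that has quietly changed hands.

```python
# Sketch of URL allowlisting in a training-data crawler, with invented domains.
# Note what it does NOT do: nothing here detects that an allowlisted domain has
# expired and been re-registered by an attacker, which is why the measure only
# goes so far.
from urllib.parse import urlparse

ALLOWED_DOMAINS = {"example.org", "docs.example.com"}

def is_allowed(url: str) -> bool:
    host = (urlparse(url).hostname or "").lower()
    # Accept an allowlisted domain or any of its subdomains.
    return any(host == d or host.endswith("." + d) for d in ALLOWED_DOMAINS)

for url in ("https://example.org/data.html",
            "https://evil.example.net/page",
            "https://sub.docs.example.com/guide"):
    print(url, "->", "fetch" if is_allowed(url) else "skip")
```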
This massive volume of required data also leads to a possible unintentional self-poisoning of AI models. As these models seek more data to train on, they are also generating reams of synthetic content that then becomes publicly available. The models train on the content that they themselves and other similar models are generating, potentially putting them in a spiral that can lead to total model collapse. Watermarking may alleviate the problem, but only with the agreement and participation of all parties, and eventually rogue actors will have capabilities that are currently limited to the likes of Google and OpenAI.
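A toy simulation, using an invented one-dimensional “model” rather than anything from the paper, illustrates the spiral.

```python
# Toy illustration of the self-training "spiral" described above: a one-dimensional
# Gaussian "model" is repeatedly refit to samples drawn from its own previous fit.
# Every refit adds sampling error, so the learned distribution drifts away from the
# original data, and on average its variance shrinks a little more each generation --
# a drastically simplified stand-in for collapse in generative models.
import numpy as np

rng = np.random.default_rng(7)

mu, sigma = 0.0, 1.0   # the original "real" data distribution
n_samples = 25         # synthetic examples available per generation

for generation in range(1, 51):
    synthetic = rng.normal(mu, sigma, size=n_samples)  # the model generates its own training set
    mu, sigma = synthetic.mean(), synthetic.std()      # the next model is fit only to that output
    if generation % 10 == 0:
        print(f"generation {generation:2d}: mean={mu:+.3f}  std={sigma:.3f}")
```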
Current supply chain cybersecurity issues, which remain far from resolved, are also likely to spill over into AI. And this is all before quantum computing is introduced, potentially in the next decade, which adds another layer of complication to everything.
The paper consists of 100 pages of sometimes highly technical detail that is mostly of interest to AI developers, but one of the key takeaways for the general public is that cyber threats do not need to be nearly this advanced or knowledgeable to pull off successful attacks. This includes a wide variety of attacks against closed or “black box” systems. Generative AI systems, such as the current crop of chatbots, are also uniquely at risk of data poisoning from a broad range of sources. And NIST sees no “foolproof” methods for curbing these threats at present, instead encouraging AI developers and computer scientists to carefully map out anticipated attack sources and approaches and make considered trade-offs of capability for security.