Generative AI has the potential to strengthen cybersecurity defenses significantly, but each tool’s ability to handle the job depends on vendors’ ability to overcome inherent limitations.
It feels like we are in the middle of a generative AI stampede. As we have seen so many times in cybersecurity, a new technology pops up, and suddenly it is everywhere – making it difficult to gauge the quality of the solution of the day. Given its ability to deliver powerful data and insights in seconds, generative AI can bring significant, positive change to cybersecurity – and more specifically to cyber threat intelligence (CTI), which requires analyzing and correlating massive amounts of structured and unstructured data. Before that happens, however, generative AI (genAI) vendors must address two critical concerns among users: data privacy and data trust.
There is no question that with genAI, CTI will take a quantum leap forward, as connecting all the threat exposure puzzle pieces to construct a complete CTI analysis is challenging and complex. GenAI can assist security teams with the daily challenges created by the overwhelming volume of data – including data management and organization, analysis, summarization, and reporting – and help organizational leaders more easily understand the company’s risk exposure and security posture. The benefits to CTI are even more significant when genAI’s large language model (LLM) – the layer responsible for processing the data and delivering outputs – factors in an organization’s business, industry, geography, and attack surface context.
As such, genAI is like a supercharged CTI assistant that can help answer questions such as: Are we susceptible to ransomware? How resilient are we against specific cyber attack campaigns? Are we doing our best to protect our intellectual property? What threats are we most vulnerable to, and how can we improve cyber defenses?
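To make this concrete, here is a minimal Python sketch of how an application might fold that organizational context into a prompt before passing a question to an LLM. It is an illustration under assumptions, not any vendor’s implementation: the OrgContext fields and the llm_complete helper are hypothetical placeholders for whatever data model and LLM API a real product would use.

```python
# Hypothetical sketch: folding organizational context into a CTI question.

from dataclasses import dataclass

@dataclass
class OrgContext:
    industry: str
    geography: str
    attack_surface: list[str]  # e.g. exposed services, cloud providers

def llm_complete(prompt: str) -> str:
    """Placeholder: swap in the actual LLM API call a vendor would use."""
    raise NotImplementedError

def build_cti_prompt(context: OrgContext, question: str) -> str:
    """Combine business, industry, geography, and attack-surface context
    with the user's question so answers are specific to the organization."""
    return (
        "You are a cyber threat intelligence assistant.\n"
        f"Industry: {context.industry}\n"
        f"Geography: {context.geography}\n"
        f"Attack surface: {', '.join(context.attack_surface)}\n\n"
        f"Question: {question}"
    )

def ask(context: OrgContext, question: str) -> str:
    return llm_complete(build_cti_prompt(context, question))

# Example usage:
# ctx = OrgContext("banking", "EU", ["VPN gateway", "customer web portal"])
# ask(ctx, "Are we susceptible to ransomware campaigns active this quarter?")
```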
While the technology holds great promise for cyber defenders, there is growing concern about how different offerings protect users’ and organizations’ data privacy. The wildly popular ChatGPT, launched in late 2022 by OpenAI, and other tools like it are trained on vast volumes of publicly available data. Before internal and customer-related data is also mixed in, CPOs and other business leaders must have a clear understanding of how the organization uses such platforms.
Another issue with genAI is data trust. The nature of the GPT model (upon which some generative AI tools are built) is that genAI creates text based on statistical patterns it has learned from the massive amount of data on which it was trained. As a result, some of the generated text may not reflect reality. Additionally, without access to the right data or the most up-to-date information, answers may be incorrect, or even made up – all of which creates trust issues.
While these issues are not intentional, privacy protections and data trust must be taken seriously by genAI providers if these tools are to live up to their full potential. Fortunately, vendors are continuously improving their models to reduce these undesired phenomena.
Vendors’ role in safeguarding your data privacy
Most genAI vendors either use publicly available models such as OpenAI’s, or they gradually build their own models on top of well-known open source models. For tools using public OpenAI-like models, sensitive data or prompts input to the LLM are stored online by the vendor, from where they could be sold, hacked, or leaked. Additionally, these LLMs ‘learn’ from new prompts and queries, so it’s possible that an organization’s sensitive data may show up in response to a similar query coming from an entirely different entity.
To overcome data privacy concerns, organizations should look for genAI vendors that provide full transparency about how their solutions protect each user’s data – sensitive or otherwise. Further, there are specific methods the vendor can offer to ensure adherence to data privacy regulations, as described in a recent blog post by IDC. The list below can serve as a guideline of features and precautionary measures to look for when evaluating genAI solutions, particularly in cybersecurity and CTI:
- The ability to opt out of sharing proprietary data with the LLM, or assurance that your data is used only to train your own model and is not shared with third parties.
- The ability to delete training and validation data, as well as your fine-tuned models.
- The ability to opt out of sharing your prompts and generated results. (A word of caution here: never submit proprietary data as a prompt to any public LLM.)
- Detail about how prompts and completions are stored and for how long. Ensure that stored data is encrypted and that the vendor’s data storage parameters match your corporate policies.
- A clear understanding of how the provider uses your data.
- Assurance that the provider’s usage log and traceability meet your organization’s compliance requirements.
Other approaches a genAI vendor can take include:
- Minimizing data transfer so that only the most essential, non-sensitive information is shared with the publicly available model (e.g. OpenAI). In some cases, this entails only sharing metadata, or the ‘data about the data.’ Metadata doesn’t include the actual content but contains details about it.
- Masking sensitive data before it’s shared with any other third party, which can entail replacing the actual data with randomized characters to keep the structure of the data intact for analysis while securing sensitive information (a brief sketch of this and the metadata-only approach above follows this list).
- Local data processing to limit the amount of data transferred over the Internet. This can include extracting features from the data, converting it into lower-dimensional representations, or using local models to anonymize the data before it’s sent to any third party.
- Developing proprietary machine learning models to train sensitive data on the vendor’s secure servers.
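To illustrate the masking and metadata-only approaches above, here is a minimal Python sketch. The regular expressions, masking rules, and field handling are assumptions chosen for demonstration; a production system would need far broader coverage of sensitive data types.

```python
# Illustrative sketch: mask sensitive values while preserving structure,
# and reduce records to metadata before anything leaves the organization.

import random
import re
import string

IP_RE = re.compile(r"\b(?:\d{1,3}\.){3}\d{1,3}\b")
EMAIL_RE = re.compile(r"\b[\w.+-]+@[\w-]+\.[\w.]+\b")

def _scramble(match: re.Match) -> str:
    """Replace each character with a random one of the same class,
    keeping the length and structure of the value intact."""
    out = []
    for ch in match.group(0):
        if ch.isdigit():
            out.append(random.choice(string.digits))
        elif ch.isalpha():
            out.append(random.choice(string.ascii_lowercase))
        else:
            out.append(ch)  # keep separators such as '.' and '@'
    return "".join(out)

def mask_sensitive(text: str) -> str:
    """Mask IP addresses and email addresses before the text is shared
    with any third party."""
    return EMAIL_RE.sub(_scramble, IP_RE.sub(_scramble, text))

def metadata_only(record: dict) -> dict:
    """Share only 'data about the data': field names and value sizes,
    never the content itself."""
    return {key: len(str(value)) for key, value in record.items()}

# Example usage:
# mask_sensitive("Beacon from 10.2.3.4 reported by admin@example.com")
# metadata_only({"indicator": "10.2.3.4", "note": "seen in phishing wave"})
```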
Establishing trust in generative AI
The question of trust in generative AI stems from the way some solutions create “hallucinations” – a term that refers to a genAI tool making up answers when it doesn’t know or lacks access to the right data. Hallucinations erode trust in any tool that adopts the genAI model: if answers to queries are fabricated, the credibility of the application that embeds this technology suffers.
For example, a generative AI prompt could be: “Name the top 10 Danish underground nationalist hacker groups.” While only five such groups exist on the dark web (or the underground), the genAI application may still produce ten names as requested – five of which would be pure fiction.
While GPT, Bard, and other state-of-the-art AI models are highly advanced, they are not immune to errors, biases, or hallucinations. Even general-purpose AI models may inadvertently produce false or misleading information when they lack real data or facts, due to the limitations of current statistical language modeling techniques. For example, ChatGPT only uses data through 2021, so answers to queries that must account for more recent events will likely contain limited or incorrect information.
Because genAI and the LLMs upon which these tools are built continue to learn and improve, it’s impossible to find a tool that delivers 100% accuracy 100% of the time. But there are guidelines you can follow when choosing a genAI tool to ensure the data and insights you receive are trustworthy and reliable, as outlined in the following list (a brief sketch of how several of these safeguards fit together follows it):
- The model’s data querying uses scoped data access, which limits the app’s access to user-related data, and prompt engineering – carefully constructed prompts that guide the AI model in generating desired responses and help ensure outputs are accurate, relevant, and contextually appropriate.
- LLM answers are processed in-house before being output to the end user to avoid common hallucinations. This way, the model’s fidelity increases over time as the vendor builds up and trains on a proprietary repository of common mistakes.
- Built-in robust feedback mechanisms engage users in a fast feedback loop to detect and mitigate any incorrect answers generated by the genAI model. This feedback loop can be as simple as offering a thumbs up or down on each generated response.
- The vendor’s data analytics team performs an ongoing, manual, in-house review of the model’s answers. This approach is the best way to review and improve the model.
- The genAI tool says: “I don’t know the answer” when it lacks the information to provide a response.
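Here is a brief Python sketch of how several of these safeguards could fit together. The prompt wording, the retrieve_cti_context helper, and the in-memory feedback log are hypothetical stand-ins used only to illustrate the pattern of scoped data access, a “say I don’t know” instruction, in-house post-processing, and a thumbs up/down feedback loop.

```python
# Hypothetical sketch combining safeguards from the list above.

from datetime import datetime, timezone

FEEDBACK_LOG: list[dict] = []  # stand-in for a real feedback store

SYSTEM_PROMPT = (
    "You are a CTI assistant. Answer ONLY from the context provided below. "
    "If the context does not contain the answer, reply exactly: "
    "'I don't know the answer.'"
)

def retrieve_cti_context(question: str, scope: list[str]) -> str:
    """Scoped data access placeholder: return only intelligence the
    caller is allowed to see (a real system would query a CTI store)."""
    raise NotImplementedError

def llm_complete(prompt: str) -> str:
    """Placeholder for the vendor's actual LLM API call."""
    raise NotImplementedError

def postprocess(raw_answer: str) -> str:
    """In-house review step: catch obviously unsupported or empty answers
    before they reach the end user (a real pipeline would do much more)."""
    return raw_answer.strip() or "I don't know the answer."

def answer_question(question: str, allowed_scope: list[str]) -> str:
    context = retrieve_cti_context(question, scope=allowed_scope)
    prompt = f"{SYSTEM_PROMPT}\n\nContext:\n{context}\n\nQuestion: {question}"
    return postprocess(llm_complete(prompt))

def record_feedback(question: str, answer: str, thumbs_up: bool) -> None:
    """Fast feedback loop: store a thumbs up/down so incorrect answers
    can be reviewed and the model improved over time."""
    FEEDBACK_LOG.append({
        "time": datetime.now(timezone.utc).isoformat(),
        "question": question,
        "answer": answer,
        "thumbs_up": thumbs_up,
    })
```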
While generative AI has the potential to advance cybersecurity and CTI to unprecedented levels, the need for more stringent policies around data protection is paramount to its success. Additionally, for a generative AI model to answer critical cybersecurity questions correctly, it must have access to a deep collection of wide-ranging threat intelligence. To respond to questions about underground cybercriminal networks and activities, it must have access to that type of CTI.
By removing the issues of data privacy and data trust, genAI can simplify complex cybersecurity tasks. More specifically, with threat intelligence, the technology can aid report generation, making intelligence faster and easier to access for teams and individuals at all security-maturity levels. And it can assist security teams in improving post-detection actions such as alert prioritization, augmented threat detection, playbook creation, and incident response. Ultimately, this combination of humans and computers will allow security teams to be more proactive and better equipped to deal with the most urgent threats targeting their business.