
Nearly 12,000 Live API Keys for AWS, MailChimp, and WalkScore Found in AI Training Dataset

Security researchers have discovered nearly 12,000 secrets and API keys in an open-source AI training dataset that could successfully authenticate across various services.

The secrets were found in the Common Crawl dataset, which stores petabytes of data collected from websites since 2008. Developers of large language models (LLMs) such as OpenAI, DeepSeek, Google, Meta, Anthropic, and Stability AI can use the data for training purposes.

Truffle Security researchers discovered the live secrets after analyzing 400 terabytes of data scraped from 2.67 billion web pages in December 2024.
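The article does not detail the researchers' scanning pipeline, but as a rough illustration (not their actual tooling), a naive regex scan for one well-known credential format, AWS access key IDs, could look like this; the pattern and function name are illustrative assumptions:

```python
import re

# Illustrative detector: AWS access key IDs begin with "AKIA" followed by
# 16 uppercase alphanumeric characters. Production scanners use hundreds
# of detectors plus live verification, not a single regex.
AWS_KEY_PATTERN = re.compile(r"\bAKIA[0-9A-Z]{16}\b")

def find_candidate_keys(text: str) -> list[str]:
    """Return substrings of `text` that look like AWS access key IDs."""
    return AWS_KEY_PATTERN.findall(text)
```

A real scanner would then attempt to verify each candidate against the service, which is how the researchers could distinguish secrets that "could authenticate successfully" from dead or example strings.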

They warned that leaking API keys into AI training data puts the affected vendors at risk of cyber attacks and could cause LLMs to behave unexpectedly and recommend bad security practices.

AWS, MailChimp, and WalkScore API keys found in an AI training dataset

The researchers found 219 distinct types of secrets from various online services, including AWS, MailChimp, and WalkScore, in the AI training dataset. In total, 11,908 of the secrets could successfully authenticate against live services.

The AI training dataset captured the API keys because software developers hardcoded them in frontend HTML and JavaScript snippets instead of passing them securely as server-side variables.
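The distinction can be sketched in a few lines of Python; the environment-variable name here is a hypothetical example, not one cited by the researchers:

```python
import os

# Anti-pattern (what the crawled sites did): a key baked into source that
# is served to every browser is also captured by crawlers like Common Crawl.
HARDCODED_KEY = "example-not-a-real-key"  # never ship a real key like this

def get_api_key() -> str:
    """Safer pattern: keep the key server-side in an environment variable
    (MAILCHIMP_API_KEY is a hypothetical name) so it is never embedded in
    HTML or JavaScript sent to clients."""
    return os.environ["MAILCHIMP_API_KEY"]
```

Any request that needs the key should then be made by the server on the client's behalf, so the credential never leaves the backend.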

Marketing automation giant MailChimp was heavily impacted, with nearly 1,500 unique API keys extracted from HTML forms and JavaScript variables found in the AI training dataset.

The researchers observed a high reuse frequency, with 63% of the exposed API keys reused across multiple pages. WalkScore was the most impacted, with its leaked API keys reused “57,029 times across 1,871 subdomains.”

The researchers also found 17 unique live Slack webhooks exposed, which could allow malicious third parties to post messages. They warned that the exposed secrets could enable malicious actors to carry out phishing campaigns, data exfiltration, and brand impersonation.
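A Slack incoming webhook is just a URL that accepts an unauthenticated POST, which is why a leaked one is enough to impersonate the workspace. A minimal sketch of the request such a webhook accepts (the URL below is a placeholder, and nothing is sent over the network here):

```python
import json

def build_webhook_post(webhook_url: str, text: str) -> tuple[str, bytes]:
    """Return the URL and JSON body for a single POST to a Slack incoming
    webhook. No token or login is required by the endpoint, so possession
    of the URL alone lets an attacker post messages."""
    body = json.dumps({"text": text}).encode("utf-8")
    return webhook_url, body
```

Sending `body` via an HTTP POST to the webhook URL would deliver the message into the target channel, with no further authentication.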

Exposed secrets in AI training dataset a real risk

The researchers further warned that training AI models on insecure code could encourage bad security practices such as hardcoding API keys and credentials, putting millions of organizations at risk.

“Given our experience finding exposed secrets on the public internet, we suspected that hardcoded credentials might be present in the training data, potentially influencing model behavior,” the researchers explained.

The researchers tested various AI coding assistants, including GitHub Copilot and ChatGPT, and found that they suggested hardcoding API keys and credentials in webpages.

“The real risk? Inexperienced (and non) coders might follow this advice blindly, unaware they’re introducing major security flaws,” they added. “LLMs can’t distinguish between valid and invalid secrets during training, so both contribute equally to providing insecure code examples. This means even invalid or example secrets in the training data could reinforce insecure coding practices.”

While AI training datasets undergo cleaning to remove sensitive information such as personally identifiable information, there are no guarantees the process catches everything.

Meanwhile, Truffle Security has contacted impacted vendors and helped them revoke and rotate thousands of live API keys exposed in the AI training dataset.

“This issue isn’t just about AI. These are real websites exposing these tokens that have been captured by Common Crawl,” warned Dr. Katie Paxton-Fear, Principal Security Research Engineer at Harness. “Ideally, any token, code, or key should only be valid for a short period of time and should be regularly rotated. This way, even if they are captured, they do not remain valid for long.”

In February, cybersecurity firm Lasso Security also warned that AI chatbots could access source code repositories even after they were made private using the Wayback Copilot attack method, exposing API keys and tokens to misuse.

The company discovered 20,580 exposed GitHub repositories belonging to 16,290 organizations, including Google, Microsoft, Intel, PayPal, IBM, Huawei, and Tencent, exposing over 300 secrets, keys, and tokens.