
Data Leak by Microsoft AI Researchers Exposes 38TB of Private Internal Data

A misconfigured Azure storage container has caused a potentially catastrophic data leak for Microsoft: an account used by AI researchers exposed some 38 terabytes of internal data, including employee credentials and private keys.

Microsoft’s investigation of the data leak indicates that customers were not impacted, but its blog post on the subject does not address whether third parties gained unauthorized access. Security researchers with Wiz Research uncovered the vulnerability on June 22 as part of a broader scan of the internet for similarly exposed storage accounts, and Microsoft addressed the issue on June 24. However, the Wiz research indicates that the data may have been accessible to unauthorized parties since October 2021.

Microsoft indicates workstations of two AI researchers were exposed

The data leak reportedly stems from the activity of two AI researchers, who had disk backups of their workstations exposed. This included some 30,000 internal Microsoft Teams messages exchanged with assorted Microsoft team members, in addition to private keys, login credentials and internal secrets related to open-source training data.

Misconfigured Azure SAS tokens are the culprit. Azure allows specific files stored in these accounts to be shared publicly, but the AI researchers reportedly set up a URL that shared the entirety of the storage account within the company. This URL was listed in a Microsoft GitHub repository used to share AI models and open-source code within the organization. The labeling of the URL indicates that it was meant to share just those files, but it not only exposed the full contents of the storage container but also allowed anyone who followed it to delete or overwrite files.
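The scope of such a link is visible in the SAS URL itself. As a minimal sketch (the URL below is entirely invented; the real leaked token is not reproduced here), the query parameters of an Account SAS link spell out what it grants: `sp` lists the permissions, `srt=sco` scopes the token to the whole account, and `se` sets the expiry.

```python
from urllib.parse import urlparse, parse_qs

# Hypothetical URL in the shape of an over-permissive Account SAS link;
# every value here is invented for illustration, none comes from the leak.
sas_url = (
    "https://examplestorage.blob.core.windows.net/models?"
    "sv=2020-08-04&ss=b&srt=sco&sp=rwdlac&se=2051-10-06T00:00:00Z&sig=REDACTED"
)

params = parse_qs(urlparse(sas_url).query)

# 'sp' holds the granted permissions as single letters:
# r=read, a=add, c=create, w=write, d=delete, l=list.
permissions = set(params["sp"][0])
print("can write:", "w" in permissions)    # write access for any link holder
print("can delete:", "d" in permissions)   # delete access too
# 'srt=sco' scopes an Account SAS to service, container, AND object level,
# i.e. the whole storage account rather than a single file.
print("account-wide:", params["srt"][0] == "sco")
print("expires:", params["se"][0])
```

Anyone holding a URL with `sp=rwdlac` and `srt=sco` can read, overwrite, or delete across the entire account, which is exactly the failure mode described above.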

The security ramifications of this are obvious and extremely worrisome. A malicious attacker would not be limited to data theft, but could also inject malware into any of the code in the container. However, it does appear that the GitHub repository was only shared within Microsoft. The AI researchers likely believed that the storage account was secure because it was set to private for most files, but were not aware that the URL granted access beyond what they intended to share.

Another issue contributing to the data leak is that security access tokens were set to never expire (or at least not until 2051). The container was secured with the simplest available type of token, Account SAS, which is the easiest key type to make but the most difficult for administrators to track. Microsoft says that it performs regular scans for just this sort of eventuality, but the offending link had been dismissed as a “false positive.” It has since updated this scanning system to pick up overly permissive SAS tokens of this nature.
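The scanning workflow Microsoft describes can be pictured with a toy sketch: search text (such as files in a repository) for SAS-shaped URLs and flag any whose `se` expiry lies implausibly far in the future. Everything here (the regexes, the 30-day threshold, and the sample string) is an illustrative assumption, not Microsoft's actual scanner.

```python
import re
from datetime import datetime, timedelta, timezone

# Toy version of a secret scan: find strings that look like SAS URLs in a
# blob of text and flag any whose 'se' (expiry) parameter is far in the
# future. Patterns and threshold are invented for illustration.
SAS_PATTERN = re.compile(r"https://\S+\?\S*sig=\S+")
EXPIRY_PATTERN = re.compile(r"[?&]se=(\d{4}-\d{2}-\d{2})")
MAX_LIFETIME = timedelta(days=30)

def flag_overly_permissive(text, now=None):
    now = now or datetime.now(timezone.utc)
    findings = []
    for url in SAS_PATTERN.findall(text):
        m = EXPIRY_PATTERN.search(url)
        if not m:
            continue
        expiry = datetime.fromisoformat(m.group(1)).replace(tzinfo=timezone.utc)
        if expiry - now > MAX_LIFETIME:
            findings.append((url, m.group(1)))
    return findings

# Invented sample text standing in for a repository file.
sample = ("model weights at https://examplestorage.blob.core.windows.net/"
          "models?sp=rwdlac&se=2051-10-06&sig=abc123")
for url, expiry in flag_overly_permissive(sample):
    print("flagged token expiring", expiry)
```

A real scanner would also have to judge which hits matter; as the incident shows, the hard part is not matching the pattern but avoiding the "false positive" dismissal that let this token slip through.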

Roger Grimes, Data-Driven Defense Evangelist at KnowBe4, feels that this is just the tip of the iceberg in terms of these types of stories: “This is one of the top risks of AI, one of your employees accidentally sharing confidential organization information. It’s happening far more than is being reported. In order to mitigate the risk of it occurring, organizations need to create and publish policies preventing the sharing of organization confidential data with AI and other external sources, and educate users about the risks and how to avoid it. Organizations can also use traditional data leak prevention tools and strategies to look for and prevent accidental AI leaks.”

Mohit Tiwari, Co-Founder and CEO at Symmetry Systems, observes that this issue is almost inevitable at organizations of Microsoft’s size: “Even security experts routinely get cloud access controls wrong. SAS tokens are risky because they are similar to shared links to folders that you hand out but have no good way to keep track of. AWS has a deep web of policies covering role, attribute, bucket, cross-account, service control, etc… to an extent that once you have 8000 S3 buckets and 900PB, you can’t say who is accessing what across your environment. A bigger challenge is that monitoring data access is like drinking from the firehose of billions of data activity logs per day. Microsoft has recently been in the news for making data events expensive; and even when free, almost no cloud customer monitors data events at scale. The key takeaway is that organizations have to understand what data you have, who can access it, and how it is being accessed. What Wiz has identified is not a cloud posture problem — this is a data inventory and access problem.”

Data leak may not have made it out of company

There is no indication as yet that third parties outside the company accessed the exposed data, with Microsoft stating that “no other internal services were put at risk because of this issue,” but the length of time that the AI researchers had their full work storage open to anyone who came along does leave lingering questions.

Microsoft is in the midst of something of a streak of data leaks and breaches, having recently reported a compromise of government Outlook email accounts by suspected Chinese spies who were able to steal a signing key. A trove of secret company information about its Xbox product line also appears to have been accidentally attached to public court filings, with the files revealing company plans for its next gaming console and discussions about acquiring gaming giants Nintendo and Valve.

Microsoft also appears to have been aware of the security risks of overly permissive SAS tokens long before the AI researchers made this high-profile mistake, and in fact included the issue as part of a recent blog post about known attack approaches on cloud services.

To combat this potential security issue, Wiz recommends that organizations use Service SAS or User Delegation SAS tokens backed by a stored access policy, which blocks the creation of free-ranging Account SAS tokens and provides a central means to track and revoke any problem tokens that appear. Wiz also recommends the fairly obvious step of creating task-specific external storage containers that don’t host anything else, and disabling SAS access for those that don’t require it.

Microsoft adds some recommendations for handling SAS sharing URLs: be sure to set near-term expiration times, only expose them to parties that need access to the files, use Azure Monitor and Azure Storage Logs to monitor requests to these storage accounts, and ensure that write permissions are not present if they are not needed.
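Those recommendations can be turned into a simple check run before a sharing URL is published. The sketch below is an illustration built on assumptions (the `check_sas_url` helper, the 30-day lifetime limit, and the sample URL are all invented), but the query keys it inspects (`sp`, `se`, and `spr`) are the documented SAS parameters for permissions, expiry, and protocol.

```python
from datetime import datetime, timedelta, timezone
from urllib.parse import parse_qs, urlparse

# Hypothetical pre-publication check reflecting the recommendations above:
# near-term expiry, no write permission unless required, and HTTPS-only
# transport ('spr=https'). The lifetime limit is an assumed policy value.
def check_sas_url(url, allow_write=False, max_lifetime=timedelta(days=30)):
    q = parse_qs(urlparse(url).query)
    problems = []
    perms = set(q.get("sp", [""])[0])
    if "w" in perms and not allow_write:
        problems.append("write permission granted but not required")
    expiry_raw = q.get("se", [""])[0][:10]   # date portion of the expiry
    try:
        expiry = datetime.fromisoformat(expiry_raw).replace(tzinfo=timezone.utc)
        if expiry - datetime.now(timezone.utc) > max_lifetime:
            problems.append("expiry too far in the future")
    except ValueError:
        problems.append("missing or unparsable expiry")
    if q.get("spr", [""])[0] != "https":
        problems.append("not restricted to HTTPS")
    return problems

# Invented URL combining all three failure modes for illustration.
bad = "https://example.blob.core.windows.net/c?sp=rwdl&se=2051-10-06&sig=x"
print(check_sas_url(bad))
```

A URL scoped to read-only access, expiring within the policy window, and restricted to HTTPS would pass cleanly; the overly permissive example above fails on all three counts.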

Patrick Tiquet, Vice President of Security & Architecture at Keeper Security, adds the following recommendation: “In some cases, organizations need assurances from AI providers that sensitive information will be kept isolated to their organization and not be used to cross-train AI products, potentially divulging sensitive information outside of the organization through AI collective knowledge. The implementation of AI-powered cybersecurity tools requires a comprehensive strategy that also includes supplementary technologies to boost limitations as well as human expertise to provide a layered defense against the evolving threat landscape.”