Over Half of Fortune 500 Companies Are Leaving Sensitive Information Open to Reconnaissance via Document Metadata

PDF files hosted by many organizations, including more than half of the companies listed on the Fortune 500, are leaking sensitive information. PDF document metadata can contain a variety of information that provides attackers with the reconnaissance details they need to execute a more targeted and sophisticated attack: employee names and positions, the software used to author the PDF file, web server version, physical location, and in some cases even employee ID numbers meant for internal use.

While most of the major PDF authoring tools allow one to turn off the recording of document metadata, it is generally on by default and most users do not manually disable it.

Document metadata provides puzzle pieces attackers seek

With the average Fortune 500 company hosting over 9,700 publicly accessible PDF and Word files, document metadata is one of the largest unprotected attack surfaces. However, the majority of these companies (51%) are not taking steps to secure this area by scrubbing metadata from these documents and configuring software not to include it automatically.

Document metadata is useful to attackers in terms of reconnaissance, as it reveals a fair amount about employee locations, personality and behavior. In some cases, it can also provide key pieces of inside information such as ID numbers. While attackers are unlikely to glean enough from document metadata alone to breach a system, these pieces of information can be invaluable in preparing more targeted types of attacks: spear phishing by email, social engineering attempts over the phone, and so on.

The report, prepared by information security firm UpGuard, uses the industry-standard “MITRE ATT&CK” framework to evaluate the risk presented by document metadata. Attacks listed in the MITRE catalog generally begin with a reconnaissance phase, and the exposed metadata contains a good deal of the type of information that attackers seek as they initially evaluate a target.

The study examined all document file types that can contain this sort of sensitive metadata, including Excel and PowerPoint files. However, it found that PDF files were by far the biggest vulnerability among Fortune 500 companies. PDFs account for 88.7% of the public-facing documents on company servers, largely because the format is generally treated as read-only.

How attackers use public documents for reconnaissance

So what exactly is in document metadata that can be exploited? The first major point of vulnerability is the “author” field. The majority of the document metadata reviewed in the study had some sort of identifying employee information in this field, ranging from the employee’s name to a city or business unit paired with an ID number or username. Security best practice is to fill this field with just the company name, which most organizations do not do.
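The author-field practice described above lends itself to a mechanical check. The following is a minimal sketch in Python, assuming the author value has already been extracted from a document as a string. The function name `classify_author` and the company name "Example Corp" are hypothetical and do not come from the UpGuard report.

```python
from typing import Optional


def classify_author(author: Optional[str], company_name: str) -> str:
    """Classify a document's author field against a company-name-only policy.

    Returns "empty" when nothing is recorded, "company-name" when the field
    holds only the company name (ignoring case and surrounding whitespace),
    and "review" for anything else, such as a personal name or username.
    """
    if author is None or not author.strip():
        return "empty"
    if author.strip().casefold() == company_name.strip().casefold():
        return "company-name"
    return "review"


# Hypothetical values for illustration only.
for value in ("Example Corp", "jsmith-4417", ""):
    print(repr(value), "->", classify_author(value, "Example Corp"))
```

Anything classified as "review" would still need a person to decide whether it exposes an employee name, ID number or software name; the sketch only narrows down which documents to look at.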

Vulnerabilities in the author field also seem to vary by industry. Education, government and media organizations most commonly made this mistake. It was least frequently seen among travel companies and non-profit agencies.

Document metadata is also leaking information about host hardware and software that attackers conducting reconnaissance are interested in. The author field can again be a vulnerability here; over 100 of the documents reviewed had the name of the authoring software in the author field. A point of particular concern is the use of free PDF file converters, something commonly done when the premium Adobe PDF writer software is not available. These converters often insert their names into the document metadata as part of their marketing strategy. If an attacker finds an employee name and the converter that employee uses, that creates an opening for a malware attack based on a fake version of the employee’s favored converter.

The “creator” field is a much more common source of reconnaissance-ready information. 98% of Office documents and 75% of PDF files reviewed listed the software used to author them in this field. Some of these include the specific software version, information an attacker could leverage if there is a known exploit for that particular version.
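To illustrate how little effort it takes to pull a version number out of a creator string, here is a minimal Python sketch. The function name `extract_version` and the sample strings are hypothetical, and the pattern is a simple assumption (a standalone run of digits, optionally dotted), not a parser for every vendor's format.

```python
import re
from typing import Optional

# A standalone number such as "11" or "4.2.1". The word boundaries keep
# digits embedded in a product name (for example "CS6") from matching.
VERSION_PATTERN = re.compile(r"\b\d+(?:\.\d+)*\b")


def extract_version(creator: str) -> Optional[str]:
    """Return the first version-like token in a creator string, or None."""
    match = VERSION_PATTERN.search(creator)
    return match.group(0) if match else None


# Hypothetical creator values for illustration only.
for creator in ("ExampleWriter 4.2.1", "Example PDF Converter"):
    print(creator, "->", extract_version(creator))
```

An auditor could run the same check across an organization's public documents to see which ones advertise a specific software version.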

875 of the reviewed files also listed information about the creator’s operating system, and 371 included some sort of identifiable information about the target company’s hardware. This is another area where known exploits could be leveraged after some very basic reconnaissance is conducted.

UpGuard’s suggested action items for remediating the reconnaissance threat include ensuring that employees have access to the proprietary software needed to author every type of file the organization uses (so that questionable third-party “shadow” software tools will not be used), training employees to review and scrub document metadata, and regularly removing old documents that are no longer necessary. Penetration tests generally do not examine these documents, so organizations should also implement a separate auditing process that focuses on reviewing metadata through the lens of potential attacker reconnaissance.
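As a rough idea of what one step of such an audit could look like, the sketch below reads the author and application fields from an Office Open XML file (.docx, .xlsx or .pptx) using only the Python standard library. It relies on these files being ZIP archives that store properties in `docProps/core.xml` and `docProps/app.xml`. The function name `read_ooxml_metadata` and the output keys are hypothetical, and this is not UpGuard's method. PDF files, which make up most of the documents in the study, are not covered; they would need a dedicated PDF parser.

```python
import xml.etree.ElementTree as ET
import zipfile

# XML namespaces used by the Office Open XML property parts.
NS = {
    "cp": "http://schemas.openxmlformats.org/package/2006/metadata/core-properties",
    "dc": "http://purl.org/dc/elements/1.1/",
    "ep": "http://schemas.openxmlformats.org/officeDocument/2006/extended-properties",
}

# Archive member -> {output key: element path}. Output keys are our own names.
FIELDS = {
    "docProps/core.xml": {
        "creator": "dc:creator",
        "last_modified_by": "cp:lastModifiedBy",
    },
    "docProps/app.xml": {
        "application": "ep:Application",
        "app_version": "ep:AppVersion",
    },
}


def read_ooxml_metadata(source):
    """Return author and application metadata found in an OOXML document.

    `source` may be a file path or a binary file-like object. Fields that
    are missing or empty are left out of the returned dict.
    """
    found = {}
    with zipfile.ZipFile(source) as archive:
        present = set(archive.namelist())
        for part, fields in FIELDS.items():
            if part not in present:
                continue
            root = ET.fromstring(archive.read(part))
            for key, path in fields.items():
                node = root.find(path, NS)
                if node is not None and node.text:
                    found[key] = node.text.strip()
    return found
```

Calling `read_ooxml_metadata("report.docx")` on a hypothetical file returns a dict of whichever fields are present, which can then be reviewed for employee names, usernames or software versions. The sketch is intended for an organization's own documents; XML from untrusted sources should be parsed with a hardened parser.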