The Linux Foundation and Harvard's Laboratory for Innovation Science have released the second in a series of studies of the most commonly used and most critical software packages in the regular operations of Linux servers. This second study focuses on what open source software is most commonly deployed in both private and public organizations, with an eye toward better evaluating potential vulnerabilities and deciding where security support should be concentrated.
The first census in the series was published in 2015 and focused on Debian Linux software packages. This second study draws on scans of codebases from thousands of companies, and one of its key security findings is that some 80% of the lines of code in the top 50 packages were the creation of a set of just 136 developers.
Linux open source software census identifies most frequently used packages, potential security issues
Free and Open Source Software (FOSS) was chosen as the subject of the second census in this series due to its ubiquity; as the report notes, tens of millions of FOSS projects now exist and organizations of all types and sizes regularly rely on them (with an estimated 98% of codebases now including some sort of FOSS element). However, decentralized distribution and freedom to modify make it difficult to track and measure the security status of these projects. The recent issue with Log4j is a clear illustration of this phenomenon.
The project begins with one simple metric that has not been adequately explored and documented before: which FOSS projects are the most widely used? Knowing which are most common means that security resources can be directed to them first. A preliminary report published in 2020 provided two unranked top 10 lists of the most commonly used open source software at package level, but this complete and final report includes eight ordered top 500 lists (with half of these at the package/version level).
The report stresses that it is not attempting to present any kind of security profile of open source software packages; it merely identifies which are the most commonly used so that they can be prioritized for further analysis. In addition to showing which packages should be scrutinized for security vulnerabilities, this data also helps to identify understaffed projects and ones in which outdated versions are commonly used.
Lessons taken from open source software census
One of the primary lessons the researchers took from this project is that there is a strong need for a standardized naming schema for software components, an issue that also emerged in the first census. This is one of the areas in which freedom to modify makes it seriously difficult to identify and catalog these pieces of software, and inspecting formats and naming standards added substantial time to the overall effort.
Documentation of package versions also proved to be a serious issue. The census relied heavily on data provided by survey respondents. In quite a few cases, the respondent named a package version that was far beyond the most recent version in the official repository. After some investigation, it was determined that this is often due to companies performing their own internal updates and not sharing them outside of the organization.
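The version mismatch described above can be illustrated with a small check. This is only a minimal sketch, not the method the census researchers used: the function names are hypothetical, and it assumes versions are plain dotted numbers such as "2.10.1" (real package ecosystems use richer version schemes).

```python
def parse_version(text):
    """Turn a dotted numeric version such as '2.10.1' into a tuple of ints.

    Assumption: versions contain only digits and dots. Anything else
    (for example '1.0-beta') raises ValueError from int().
    """
    return tuple(int(part) for part in text.strip().split("."))


def is_ahead_of_upstream(reported, latest_official):
    """Return True if the reported version is newer than the latest
    version found in the official repository."""
    reported_parts = parse_version(reported)
    latest_parts = parse_version(latest_official)

    # Pad the shorter tuple with zeros so that '1.0' and '1.0.0' compare equal.
    width = max(len(reported_parts), len(latest_parts))
    reported_parts += (0,) * (width - len(reported_parts))
    latest_parts += (0,) * (width - len(latest_parts))

    return reported_parts > latest_parts
```

Comparing integer tuples rather than raw strings matters here: as text, "2.9.0" sorts after "2.10.1", which would hide exactly the kind of internally updated version the researchers ran into.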
From a security perspective, perhaps the biggest finding was that a relative handful of developers are responsible for more than four-fifths of the code in the top 50 projects of each list. Just 136 developers were responsible for a little over 80% of all of this code, 23% of projects had one developer responsible for over 80% of that project’s code, and 94% of projects had fewer than ten developers contributing more than 90% of the code.
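To make the concentration figures concrete, the following sketch counts how few contributors are needed to account for a given share of a project's code. It is an illustration only: the function name and the sample line counts are invented for this example and are not taken from the report.

```python
def min_contributors_for_share(lines_by_developer, share):
    """Return the smallest number of developers whose combined lines of
    code reach at least `share` (a fraction such as 0.8) of the total."""
    total = sum(lines_by_developer.values())
    if total == 0:
        return 0

    running = 0
    # Count the largest contributors first until the target share is reached.
    for count, lines in enumerate(
        sorted(lines_by_developer.values(), reverse=True), start=1
    ):
        running += lines
        if running >= share * total:
            return count
    return len(lines_by_developer)


# Hypothetical project: one developer wrote most of the code.
example_project = {"alice": 8200, "bob": 900, "carol": 500, "dave": 400}

print(min_contributors_for_share(example_project, 0.8))  # 1
print(min_contributors_for_share(example_project, 0.9))  # 2
```

In this made-up project a single developer accounts for more than 80% of the code, which is the situation the report found in 23% of the projects it examined.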
Individual developer security is also a potentially overlooked issue, given that many of the packages that made the assorted top 500 lists are hosted under individual developer accounts. These accounts tend to have weaker security protections than organizational accounts do. The report notes that account takeovers on GitHub and other sites have been increasing of late, usually for the purpose of installing backdoors in the project. Developers can also simply “go rogue” for any number of reasons and unexpectedly pull access to their code or even intentionally corrupt it, as happened recently with the “colors.js” and “faker.js” libraries.
The study notes that government involvement might help the situation. For example, the EU established a FOSS strategy in 2014 (which was renewed in 2020), but very few other nations have made efforts of this nature in the open source software space. The United States has slowly built a campaign for a “Software Bill of Materials” that would require that the components of open source software in use in government systems be cataloged and kept up to date. A push for such a measure began in 2014, but it did not begin to become a federal requirement until a Biden administration executive order last year tasked the National Institute of Standards and Technology with the development of minimum elements.