Hot on the heels of high-profile data scraping incidents at Facebook and LinkedIn that compromised hundreds of millions of accounts, the personal information of about 1.3 million users of social media darling Clubhouse has been found posted to a hacker forum.
Clubhouse is taking an interesting tack, essentially defending the scrapers by saying that the harvesting was a permissible use of its API and that the company has no serious objection to its users' publicly available information being collected in this way. The incident highlights the fact that while scraping is not strictly illegal in most of the world, it is generally prohibited by platform terms of service, both in the interest of user privacy and to keep the information away from controversial projects such as Clearview AI’s facial recognition database.
Rash of data scraping hits major social media platforms
The core of Clubhouse’s argument is that incidents like these should not be characterized as “data breaches” given that the information was posted publicly by the platform users. However, platform users generally do not expect this information to be collated into unknown third-party databases when they post it, where it is often combined with other scraped public information and sometimes with private information as well. They believe it will only be visible to other platform users who view their account profile or page; the average person is most likely not even aware of the concept of “data scraping” or that it is a possibility.
In the case of the Clubhouse incident, the data scraping captured elements that do not pose a particular threat of identity theft or privacy violation when taken in isolation: name, user ID number, username, profile photo URL, Instagram and Twitter handles, number of followers and people followed, account creation date, and the name of the user that invited them. Reports say that this impacted about 1.3 million Clubhouse users; the app has been downloaded about 10 million times at present and has about two million active weekly users.
It appears that the Clubhouse SQL database used sequential numbering in the creation of user profiles, which allowed scrapers relatively easy access with basic tools: a simple script that increments the ID number in profile links by one on each request would be sufficient for mass data scraping. However, a statement from Clubhouse indicates that the company supports data access in this way via its API: “Clubhouse has not been breached or hacked. The data referred to is all public profile information from our app, which anyone can access via the app or our API.” The spokesperson additionally called reports that there was a personal data leak or that the app was breached or hacked “misleading and false.”
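To illustrate why sequential numbering makes this kind of harvesting easy, here is a minimal Python sketch run against an in-memory stand-in for a profile store. Everything in it is hypothetical: the store layout, the function names and the sample users are assumptions made for illustration and do not reflect Clubhouse's actual schema or API.

```python
import secrets

# Hypothetical in-memory stand-in for a public profile store.
# This is NOT Clubhouse's real schema or API.

def build_sequential_store(names):
    """Assign IDs 1, 2, 3, ... in signup order."""
    return {i: {"user_id": i, "name": n} for i, n in enumerate(names, start=1)}

def build_random_store(names):
    """Assign unguessable random identifiers instead of counters."""
    return {secrets.token_urlsafe(16): {"name": n} for n in names}

def enumerate_by_counting(store, max_id):
    """Try every integer ID from 1 to max_id, as an incrementing script would."""
    return [store[i] for i in range(1, max_id + 1) if i in store]

names = ["alice", "bob", "carol"]
sequential = build_sequential_store(names)
randomized = build_random_store(names)

# Counting upward finds every profile in the sequential store...
found_sequential = enumerate_by_counting(sequential, max_id=1000)
# ...and none in the store keyed by random tokens.
found_random = enumerate_by_counting(randomized, max_id=1000)
```

Random identifiers only remove the counting shortcut. Profiles that are deliberately public can still be reached by other routes, such as following links from one account to another.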
While Clubhouse sought to downplay the data scraping, other incidents of this nature have demonstrated negative impacts on users. For example, a user might later decide to remove some of the information they have publicly posted; once it has been scraped it is beyond reach in an unknown number of third-party databases. This type of information also gradually finds its way to both unscrupulous data brokers and underground “combo files” that pool enough information to perpetrate scams and identity fraud, in addition to feeding automated attacks run by bots. The case of Clearview AI also demonstrates how third parties might use this information for applications that the person who posted it never intended; for example, feeding photos into a facial recognition database.
While the Clubhouse data scraping is unlikely to require anyone to secure their bank account or sign up for credit monitoring, in incidents of this nature the platform user is usually advised to be on higher alert for suspicious messages and emails that may be attempts to leverage the captured personal information for a scam or a social engineering attack. Clubhouse users are not the only ones who need to be concerned about this possibility, however. Recent large-scale scraping incidents at Facebook and LinkedIn have left hundreds of millions with the same concern.
The Facebook leak, which surfaced in early April, compromised 533 million users. Unlike the Clubhouse data, however, it included some profile information that users had opted not to make public, such as phone numbers and email addresses. This breach also involved an API, but one that was abused in a way its engineers did not intend: phone numbers were fed into the API, which returned valid user details whenever a match was found. Security researcher Aidan Steele claims that he submitted this vulnerability to Facebook in 2014 and it was brushed off; the data scraping itself took place in late 2019.
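The phone-number technique described above can be sketched in a few lines of Python against a mock lookup service. This is a hypothetical illustration only: the class, the made-up numbers and the per-client quota are assumptions, not Facebook's actual API, and the quota is simply one common mitigation rather than the fix Facebook applied.

```python
from collections import defaultdict

# Hypothetical directory; the numbers and names are made up.
DIRECTORY = {"+15550000003": "alice", "+15550000007": "bob"}

class ContactLookup:
    """Mock 'find a user by phone number' service with a per-client quota."""

    def __init__(self, directory, max_lookups_per_client):
        self.directory = directory
        self.max_lookups = max_lookups_per_client
        self.used = defaultdict(int)

    def lookup(self, client_id, phone):
        """Return the matching name or None; raise once the quota is spent."""
        if self.used[client_id] >= self.max_lookups:
            raise PermissionError("lookup quota exceeded")
        self.used[client_id] += 1
        return self.directory.get(phone)

def guess_numbers(service, client_id, start, count):
    """Feed consecutive candidate numbers to the service and keep the hits."""
    matches = {}
    for n in range(start, start + count):
        phone = f"+1555{n:07d}"
        try:
            name = service.lookup(client_id, phone)
        except PermissionError:
            break  # the service has cut this client off
        if name is not None:
            matches[phone] = name
    return matches

unlimited = ContactLookup(DIRECTORY, max_lookups_per_client=10**9)
capped = ContactLookup(DIRECTORY, max_lookups_per_client=5)

all_matches = guess_numbers(unlimited, "scraper", start=0, count=10)
capped_matches = guess_numbers(capped, "scraper", start=0, count=10)
```

With an effectively unlimited quota, ten guesses recover both entries in the mock directory. With a quota of five lookups, the same run is cut off after finding only the first.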
A similar data scraping incident was reported by LinkedIn last week, also involving about half a billion user records. The records appeared for sale on an underground forum at a price of several thousand dollars for the full set. This breach also included items of personal information that users may not have intended to be public, such as email addresses and phone numbers. The business networking giant took a similar approach to the one that Clubhouse opted for, insisting that it was not a “data breach” and that everything that was leaked was “public information.” The company did not name a specific cause, but another case of API abuse is a reasonable assumption given the number of records involved.