Web Scraping on Alibaba’s Taobao Resulted in Data Leak of 1.1 Billion Records

Taobao, the shopping platform of Chinese ecommerce giant Alibaba, suffered a data leak that exposed over 1.1 billion pieces of user information, according to a Chinese court verdict. Taobao is Alibaba’s popular shopping platform in China, with more than 710 million customers every month in 2020.

The implicated marketing consultant illegally obtained user data through web scraping software from November 2019 to July 2020 before Alibaba discovered the illegal operation and alerted the police.

The People’s Court of Suiyang District in central China’s Henan Province sentenced the software developer and his employer to three years in prison and imposed a fine of 450,000 yuan (about $70,260).

Web scraping software illegally accessed non-public information

For several months, the marketer used web scraping software to access data that was not publicly available.

The Wall Street Journal reported that the developer began using the web crawling software in November 2019, gathering information including user IDs, mobile phone numbers, and customer comments.

Mobile phone numbers are sensitive because the Chinese government requires handset owners to register SIM cards with their official details.

A Taobao spokeswoman acknowledged the data leak, adding that the company had devoted substantial resources to combating web scraping and protecting its users’ data privacy and security.

She added that the company was working with law enforcement to protect its interests and users and prevent a similar incident.

Web scraping remains a controversial practice even in the United States. Tech giants such as Facebook, Twitter, and LinkedIn have fallen victim to web scraping, exposing hundreds of millions of users’ information.

Data leak did not financially affect Alibaba or its customers

The affiliate marketer did not sell the data or share it with third parties but used it to serve his clients.

Alibaba also noted that neither the company nor its customers incurred any financial losses from the web scraping data leak.

Additionally, the People’s Court of Suiyang District found that Alibaba and Taobao did not violate any laws in their conduct. However, the company risks additional sanctions for security laxity under China’s proposed cyber laws.

The Chinese government proposed a raft of measures to suppress the influence of private companies which collect colossal amounts of user information.

China’s government granted authorities unrestricted power to shut down tech companies that mishandle “core state data.” The rule, which goes into effect on Sept. 1, would also introduce personal information protection legislation to protect user information from similar data breaches.

The law allows the Communist Party to impose a vice-like grip on tech giants such as Alibaba and Tencent, which store extensive amounts of sensitive user data.

Commenting on Alibaba’s web scraping data leak, Chris Clements, VP of Solutions Architecture at Cerberus Sentinel, said:

“It’s unfortunate that we’ve basically come to the point where you more or less have to assume that all information you share online will either be leaked, stolen or purposefully sold to third parties without your knowledge. Privacy regulations like GDPR can have some effect in preventing organizations from misusing your information, but they are largely toothless in preventing information from being stolen by cybercriminals.”

He noted that even data leak notifications rely on organizations discovering data breaches, which usually happens too late.

“Most organizations only find this out if and when they are contacted by a third party, usually security researchers or law enforcement that has noticed data that appears to belong to them for sale on the dark web,” he added.

Clements noted that organizations must adopt a “true culture of security” that seriously prioritizes user data safety.

“This includes critical components like security education, secure software development lifecycles along with system and application hardening, regular penetration testing to identify potential risks and finally continuous monitoring for suspicious activity coupled with proactive threat hunting.”

James McQuiggan, Security Awareness Advocate at KnowBe4, concurred that most organizations only discover data breaches after criminals have had access to the data for a significant amount of time.

“Organizations should focus on protections if the cybercriminals are already in the network instead of reacting after the breach; especially as this relates to technology and processes in place to secure and protect sensitive information like names, email addresses, and phone numbers,” McQuiggan added. “A software developer may have already had access to the website or via a third-party site, which is a common attack vector for cybercriminals to leverage the supply chain for the website to gain access.”

David Stewart, CEO at Approov, said that although it is unclear how the data leak occurred, the developer likely exploited a Broken Object Level Authorization (BOLA) vulnerability to access the data.

“Recent security research into mHealth apps and APIs disclosed similar issues. The key lesson is understanding the importance of ensuring that the user getting the data is really authorized to do so. Vulnerabilities like these are hard to track down, and while enterprises are doing so it is good practice to shield APIs so that scripts intent on data scraping – or worse – are blocked.”
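
To illustrate the class of flaw Stewart describes, here is a minimal Python sketch of a Broken Object Level Authorization bug and its fix. It is not a reconstruction of Taobao's systems, whose details were not disclosed. The Profile type, the PROFILES store, and both function names are hypothetical and exist only for this example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Profile:
    owner_id: int
    phone: str


# Hypothetical in-memory store standing in for a real user database.
PROFILES = {
    101: Profile(owner_id=101, phone="555-0101"),
    102: Profile(owner_id=102, phone="555-0102"),
}


class Forbidden(Exception):
    """Raised when a caller asks for an object they do not own."""


def get_profile_vulnerable(requester_id: int, profile_id: int) -> Profile:
    # BOLA: the caller is authenticated, but nothing checks that the
    # requested object belongs to them. Changing profile_id is enough
    # to walk through every record.
    return PROFILES[profile_id]


def get_profile_checked(requester_id: int, profile_id: int) -> Profile:
    # Fix: authorize at the object level before returning any data.
    profile = PROFILES[profile_id]
    if profile.owner_id != requester_id:
        raise Forbidden(f"user {requester_id} may not read profile {profile_id}")
    return profile
```

In the vulnerable version, user 101 can read user 102's phone number simply by asking for ID 102. The checked version rejects the same request, which is the "is this user really authorized" test Stewart refers to.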

Saryu Nayyar, CEO at Gurucul, said billions of Chinese mobile numbers were at risk of exploitation in vishing and text-message scams, and of potential identity theft once paired with users’ names and identification.

“Eight months is an eternity in cyberspace and accounts for the software developer’s ability to gather that many mobile phone numbers. As always, cyber defenses should be deployed that are able to discover anomalous activity in real-time and prevent attackers from compromising your data.”

“Automated scraping of website data is difficult to detect when it is done slowly over a long period of time as appears to be the case here,” said Jorge Orchilles, CTO at SCYTHE. “Most high-volume sites can limit extremely fast (non-human speed) requests by monitoring the application layer traffic with a variety of tools including web application firewalls and CDN providers. It appears some of the data was not publicly accessible so the attack may have leveraged valid credentials or SQL injection.”
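
The gap Orchilles points to can be sketched in a few lines of Python. This is an illustrative assumption, not any vendor's product: a monitor that pairs a short-window burst check, which a patient scraper evades, with a cumulative count of distinct records each client has touched. The ScrapeMonitor class, its thresholds, and the flag names are all hypothetical.

```python
from collections import defaultdict, deque


class ScrapeMonitor:
    """Flags clients by request burst rate and by cumulative distinct records read."""

    def __init__(self, burst_limit=20, burst_window=1.0, distinct_limit=1000):
        self.burst_limit = burst_limit        # max requests per burst_window seconds
        self.burst_window = burst_window
        self.distinct_limit = distinct_limit  # max distinct records per client
        self._recent = defaultdict(deque)     # client -> recent request timestamps
        self._seen = defaultdict(set)         # client -> record IDs accessed so far

    def record(self, client, record_id, now):
        """Register one request and return the list of flags it triggers."""
        flags = []

        # Short-window check: catches non-human request speeds.
        recent = self._recent[client]
        recent.append(now)
        while recent and now - recent[0] > self.burst_window:
            recent.popleft()
        if len(recent) > self.burst_limit:
            flags.append("burst")

        # Long-horizon check: catches slow scrapers that never burst.
        # The set grows without bound here; a real system would cap or expire it.
        seen = self._seen[client]
        seen.add(record_id)
        if len(seen) > self.distinct_limit:
            flags.append("volume")

        return flags
```

A client that requests one new record per minute never trips the burst check, but it is flagged for volume once it has read more distinct records than an ordinary shopper plausibly would.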