Image: an army of toy robots representing web scraping and bad bots

Web Scraping Is Legal (For Now), but It May Be Hurting Your Business

The term “web scraping” has been around almost as long as the internet, but recent advancements in artificial intelligence (AI) have thrust it back into the spotlight. Web scraping is simple to explain: it refers to using automated software to obtain (or “scrape”) large amounts of data from across the web. What’s more challenging to explain is whether web scraping is ethical—or even legal. While web scraping plays a vital role in powering search engines and other critical web services, it is also used by cybercriminals and even legitimate businesses for morally dubious purposes, such as stealing content or compromising sensitive data.

The emergence of generative AI and large language model (LLM) tools has drawn significant attention back to web scraping and its unsettled legal standing. Web scraping is essentially the lifeblood of these solutions, which are trained on vast amounts of information pulled from locations throughout the public internet. This has added copyright concerns to the long list of legal considerations, reigniting the debate about the ethics of web scraping practices. Jurisdictions worldwide have been slow to act on this issue, leaving organizations that want to protect their information from would-be web scrapers to deal with the problem themselves.

How is web scraping used, and why should I care?

Information obtained via web scraping is generally stored in databases, which can be examined and analyzed for specific information, such as names, email addresses, price listings, and other potentially valuable data. In the early days of web scraping, the work was often done by manually copying and pasting text into spreadsheets and databases. Modern scrapers, however, use automated processes (aka bots), often extracting data directly from the HTML that underpins the website. Today’s web scrapers operate with a high degree of efficiency and can pull data quickly and at scale.
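To make those mechanics concrete, here is a minimal sketch of how an automated scraper works, assuming Python with the requests and beautifulsoup4 libraries. The URL and CSS selectors are hypothetical placeholders rather than any real site’s structure; production scrapers simply run the same loop across thousands of pages.

```python
# Minimal web scraping sketch. Assumes the requests and beautifulsoup4
# packages are installed; the URL and selectors below are hypothetical.
import csv

import requests
from bs4 import BeautifulSoup

URL = "https://example.com/products"  # hypothetical target page

# Fetch the raw HTML that underpins the page.
response = requests.get(URL, timeout=10)
response.raise_for_status()

# Parse the HTML and pull out the fields of interest.
soup = BeautifulSoup(response.text, "html.parser")
records = []
for product in soup.select("div.product"):      # hypothetical selector
    name = product.select_one("h2.name")        # hypothetical selector
    price = product.select_one("span.price")    # hypothetical selector
    if name and price:
        records.append({"name": name.get_text(strip=True),
                        "price": price.get_text(strip=True)})

# Store the scraped records for later analysis, here as a simple CSV file.
with open("prices.csv", "w", newline="") as f:
    writer = csv.DictWriter(f, fieldnames=["name", "price"])
    writer.writeheader()
    writer.writerows(records)
```

A real scraper would add scheduling, proxy rotation, and error handling, but the core pattern is the same: request a page, parse its HTML, extract the target fields, and store them for analysis.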

While search providers and generative AI developers are among the most prominent users of web scrapers, they aren’t the only ones leveraging web scraping. In fact, web scraping is used in a wide variety of ways. For example, web scraping tools can be used to collect large amounts of information from the wilderness of social media, helping researchers analyze public sentiment towards certain issues, industries, or individual products. The same solutions enable price comparison services, allowing customers to find the best deals on high-demand products.

Some bad actors use web scrapers to intentionally obtain personal information, credit card numbers, or login credentials for malicious purposes. This can lead to identity theft, privacy violations, and even data breaches.

The morality of web scraping is often a gray area. Most would agree that price comparison tools are a reasonable use case, but what about a business seeking to gain a competitive advantage by scraping the pricing data of rival businesses? Some web scrapers even directly reuse content they’ve obtained from other websites, confusing search providers and consumers alike. That confusion can hurt SEO rankings and damage a company’s brand reputation by associating it with redundant or low-quality web content. This is often done maliciously, with the intent of undermining a competitor.

Even when content isn’t reused in its entirety, important questions remain unanswered. Is it ethical for generative AI developers to train their models on copyrighted material or content they don’t otherwise own? The legality of these practices will be tested in the coming months and years.

How to stop web scrapers from stealing your information

A judicial ruling in 2022 reaffirmed that it is legal to scrape publicly available data from the internet. While it is technically possible to take legal action against web scrapers, doing so requires proving that verifiable harm was committed. That could include theft of intellectual property (IP), violation of terms of service, or other malicious actions. Unfortunately, how the law is applied and interpreted can vary widely between jurisdictions, making outcomes difficult to predict. This is particularly true for generative AI, as no commonly accepted standard governs how training data is obtained or used.

Ultimately, organizations that want to stop web scraping cannot afford to wait for lawmakers or regulators to address the problem. Even in a best-case scenario, legal recourse only provides remediation options after a breach or other harmful event has already occurred. Organizations can’t afford to wait until their data has already been stolen; they need to be proactive, addressing the root of the problem and ensuring that malicious actors cannot scrape information off their sites in the first place.

As automated bots are used to scrape information, stopping them starts with implementing an effective bot management solution. Organizations need the ability to reliably detect automated web traffic and mitigate its impact on their web pages. Today’s bot management tools can mitigate the threat of hackers, competitors, fraudsters, and other malicious actors using automated bots to abuse applications, overwhelm servers, or conduct unwanted web scraping. Addressing the problem of “bad” bots requires a holistic approach that protects not just websites but also mobile applications and APIs. Organizations that want to keep their data from being swept up by web crawlers with unknown intentions need complete visibility into, and control over, both human and automated traffic, along with the means to deflect and deter unwanted visitors.
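As a rough illustration of the kinds of checks a bot management layer performs, the sketch below applies two simple server-side controls, assuming a Python Flask application: a user-agent blocklist and a per-IP sliding-window rate limit. The blocklist keywords and thresholds are arbitrary examples; commercial bot management platforms rely on far richer signals, such as behavioral analysis and device fingerprinting, and extend the same protection to mobile apps and APIs.

```python
# Simplified illustration of server-side bot checks, assuming a Flask app.
# Real bot management uses far richer signals; this sketch only shows a
# user-agent blocklist plus a per-IP sliding-window rate limit.
import time
from collections import defaultdict, deque

from flask import Flask, abort, request

app = Flask(__name__)

# Arbitrary example values; real solutions tune these with better signals.
BLOCKED_AGENT_KEYWORDS = ("scrapy", "python-requests", "curl")
RATE_LIMIT = 30        # max requests allowed per client IP...
WINDOW_SECONDS = 60    # ...within a rolling 60-second window
request_log = defaultdict(deque)

@app.before_request
def screen_automated_traffic():
    # Block clients whose user agent matches an obvious automation tool.
    agent = (request.headers.get("User-Agent") or "").lower()
    if any(keyword in agent for keyword in BLOCKED_AGENT_KEYWORDS):
        abort(403)

    # Enforce a simple sliding-window rate limit per client IP.
    now = time.time()
    history = request_log[request.remote_addr]
    while history and now - history[0] > WINDOW_SECONDS:
        history.popleft()
    history.append(now)
    if len(history) > RATE_LIMIT:
        abort(429)

@app.route("/")
def index():
    return "Hello, human visitor."
```

Checks this simple are easy for a determined scraper to evade, for example by spoofing a browser user agent or rotating IP addresses, which is why dedicated bot management tooling goes well beyond them.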

To stop web scraping, first stop bad bots

While the questions surrounding the legality of web scraping and its uses may eventually be answered, organizations must address the problem proactively. Malicious automated traffic is a growing problem. In fact, research indicates that bad bots made up 30% of all web traffic in 2022, underscoring the scale of the issue. Implementing a solution to identify and manage automated traffic can help organizations more effectively address the problem of web scraping and the countless other challenges associated with bad bots.