
Navigating the Legal Landscape Surrounding Web Scraping

The arrival of the internet brought with it a flood of expansive and predominantly unstructured information. As the economic value of this unstructured information has grown, new technologies for collecting and synthesizing web data have developed alongside it, including the practice of web scraping. By definition, web scraping is the practice of using software programs (sometimes referred to as ‘bots’, ‘crawlers’ or ‘spiders’) to extract information and data from websites, which are then automatically downloaded and sorted. Today, there are approximately 44 trillion gigabytes (44 zettabytes) of data on the web, and web scraping accounts for approximately 52% of web traffic. Businesses across all industries use web scraping for a variety of purposes, from harvesting data for AI and machine learning to collecting competitor prices so that retail companies can adjust their own accordingly. In 2014, 22% of website visitors were identified as web scrapers, and scraping had risen 17% across all industries.
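To make the practice concrete, the following is a minimal sketch of what such a scraping bot might look like in Python. It is illustrative only: the URL, CSS selectors, and User-Agent string are hypothetical, and the sketch assumes the third-party requests and beautifulsoup4 libraries are installed; real scrapers target real markup and are typically far more elaborate.

```python
# A minimal, hypothetical scraper: fetch a page and extract structured data.
# Assumes the third-party `requests` and `beautifulsoup4` packages.
import requests
from bs4 import BeautifulSoup

# Fetch the page, identifying the bot via the User-Agent header.
response = requests.get(
    "https://example.com/products",  # hypothetical target page
    headers={"User-Agent": "example-price-monitor/1.0"},
    timeout=10,
)
response.raise_for_status()

# Parse the HTML and pull out the data of interest (here, hypothetical
# product names and prices marked up with `.product`, `.name`, `.price`).
soup = BeautifulSoup(response.text, "html.parser")
for item in soup.select(".product"):
    name = item.select_one(".name")
    price = item.select_one(".price")
    if name and price:
        print(name.get_text(strip=True), price.get_text(strip=True))
```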

Despite the ubiquitous nature of web scraping, the legality of scraping practices is not widely understood. Even those who work in the cybersecurity field have incorrectly concluded that web scraping is legal because the information on the internet is in the public domain. Although theories of liability regarding web scraping are still developing, various state and federal claims can be, and have been, levied against web scrapers. As the amount of web data being created increases exponentially, so too will the use of web scraping by businesses seeking to capitalize on data-driven insights. It is therefore important for the growing number of companies that conduct web scraping, or that do business with data harvesters, to understand the nuanced legality of the practice so they can better navigate the associated risks and protect themselves from liability.

Breach of contract

Breach of contract liability rests on the theory that a contract is formed between a website provider and a visitor through the website’s terms of use. The formation of such a contract hinges on whether the visitor has actual or constructive knowledge of those terms and agrees to them. Enforceability depends in part on whether the terms are presented as a “clickwrap” agreement (i.e., one the user must click through to obtain access to the website) or a “browsewrap” agreement (i.e., terms posted on the website that require no affirmative conduct from visitors): it is more difficult to prove that a visitor had notice of the terms of use when a website relies on a browsewrap agreement. Recent cases suggest that the enforceability of website terms of use can be contingent on the visitor having actual or constructive notice of prohibitive terms beyond mere passive access to the website, such as the receipt of a cease and desist letter outlining the terms the recipient violated.

Copyright infringement

Under the Copyright Act, copyright protection exists in original works of authorship fixed in any tangible medium of expression. Facts do not receive copyright protection because they are not original to an author, but the line between factual material (which cannot be protected) and creative material (which can be protected) can be ambiguous. Descriptions of facts, for example, can be original, so news articles may be entitled to copyright protection if they contain original expression. Only a “narrow category of works in which the creative spark is utterly lacking or so trivial as to be virtually nonexistent [is] incapable of sustaining a valid copyright.”

In Craigslist Inc. v. 3Taps, Craigslist, a website that enables users to post classified ads, brought suit against certain defendants who scraped housing listings posted on the Craigslist platform. The defendants moved to dismiss Craigslist’s copyright claims, arguing that the Craigslist website is a non-copyrightable compilation. The Court denied the motion, however, finding that the website “display[ed] some minimal level of creativity” in selecting which categories to include and under what names. If a scraper also scrapes photographs (which can be protected under a separate copyright), the owners of the scraped websites may be able to argue that the web scraper has committed copyright infringement. Such arguments are more persuasive when the web scraper makes the scraped information (including photographs) publicly available.

The more difficult question is whether web scraping copyrighted data for AI or machine learning purposes can also expose a business to copyright infringement liability. In Authors Guild v. Google, the Second Circuit found that Google’s digitization and annotation of some 20 million copyrighted books, including the use of those books in the training database for its Google Book Search algorithm, did not constitute infringement because its actions were protected by the fair use doctrine. The court held that “[t]he purpose of the copying is highly transformative, the public display of text is limited, and the revelations do not provide a significant market substitute for the protected aspects of the originals.” It also noted that the company’s commercial disposition and profit incentives do not justify denial of fair use, explaining that “[w]hat matters in such cases is not so much ‘the amount and substantiality of the portion used’ in making a copy, but … of what is thereby made accessible to a public for which it may serve as a competing substitute.”

While the Authors Guild v. Google holding has been cited by some scholars to support the claim that copyright law is meant only for humans and that reading performed by computers does not count as infringement, it is worth noting that the law remains unsettled on this point. The Supreme Court denied certiorari, and other circuits have not tackled the question. Interestingly, the U.S. Patent and Trademark Office (USPTO) recently published a notice in the Federal Register seeking information on this very topic. In the notice, the USPTO asks whether existing statutory language (e.g., the fair use doctrine) adequately addresses the legality of AI algorithms learning their functions by consuming copyrighted material. Until legal theories of infringement that account for the unique features of AI technology emerge, companies that use copyrighted material to train machine-learning algorithms appear to have favorable precedent to defend their scraping practices against potential infringement claims, particularly where the resulting AI work product is transformative.

The Computer Fraud and Abuse Act

The Computer Fraud and Abuse Act of 1986 (CFAA) was passed by Congress to address computer hacking. Because the rapid digitization of the world has outpaced the United States’ ability to enact responsive legislation, the application of the CFAA has steadily expanded to computer-related activities that were not originally within its purview, including web scraping. The CFAA is violated when someone “intentionally accesses a computer without authorization or exceeds authorized access, and thereby obtains … information from any protected computer.” A “protected computer” is defined to include a computer “used in or affecting interstate or foreign commerce or communication,” which has been interpreted to cover any computer connected to the internet.

In the web scraping context, claims based on the CFAA generally hinge on a factual analysis of the extent to which the scraper’s access and use of a website’s data was “unauthorized.” Courts have not applied the CFAA uniformly to web scraping cases, but current trends indicate favorable case law for scrapers, as reflected in the 2017 hiQ v. LinkedIn case. hiQ is a data science company that harvests user profiles from LinkedIn, analyzes them to produce workforce insights (for example, predicting when employees are likely to leave their jobs), and provides the scraped data to corporate HR departments. LinkedIn uses the robots.txt protocol to communicate that access to LinkedIn servers via automated bots is prohibited (although it grants express permission to certain entities, such as the Google search engine), and its User Agreement requires users to agree not to scrape data or information or to use bots or other automated methods to access its services. LinkedIn sent hiQ a cease and desist letter asserting that hiQ was violating the CFAA and requesting that it stop its web scraping activities, which prompted hiQ to seek an injunction barring LinkedIn from blocking its access. The Ninth Circuit Court of Appeals upheld the injunction, suggesting that courts may be leaning towards treating the scraping of public data freely shared on the web as permissible under the CFAA (although the hiQ court did note that even if the CFAA does not apply to the scraping of public information, other causes of action remain available). The Court explained that it “favor[s] a narrow interpretation of the CFAA’s ‘without authorization’ provision so as not to turn a criminal hacking statute into a ‘sweeping Internet-policing mandate.’” Recent cases brought against scrapers under the CFAA similarly look at whether websites employ authentication barriers, such as username and password requirements, that would make a visitor’s scraping more akin to hacking, rather than imposing liability for access to features configured to be readily accessible to the general public.
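For context, robots.txt is a plain-text file served at a website’s root that tells automated crawlers which parts of the site they may access. The sketch below, using only Python’s standard-library urllib.robotparser module, shows how a scraper can consult a site’s robots.txt before fetching a page; the site, file contents, and bot names are hypothetical stand-ins for the LinkedIn pattern described above, not LinkedIn’s actual file.

```python
# Checking a site's robots.txt before scraping, using only the standard
# library. The site and bot names here are hypothetical.
from urllib import robotparser

# A robots.txt following the pattern described above (allow a named search
# crawler, disallow all other bots) might read:
#
#   User-agent: Googlebot
#   Allow: /
#
#   User-agent: *
#   Disallow: /
#
parser = robotparser.RobotFileParser()
parser.set_url("https://example.com/robots.txt")  # hypothetical site
parser.read()  # downloads and parses the file

# can_fetch() reports whether a given user agent may access a given URL.
# Under the hypothetical file above, the first call would return True and
# the second False.
print(parser.can_fetch("Googlebot", "https://example.com/profiles"))
print(parser.can_fetch("example-scraper", "https://example.com/profiles"))
```

As the hiQ litigation illustrates, courts have not settled whether honoring robots.txt is legally required, but it remains the standard mechanism by which websites communicate their scraping preferences to bots.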

In contrast to the Ninth Circuit’s holding in hiQ, courts in other jurisdictions have construed violations of a website’s terms of use as violations of the CFAA. For example, when the First Circuit applied the CFAA to web scraping in 2003, it held that “[a] lack of authorization could be established by an explicit statement on the website restricting access.” The First Circuit found that public website providers can ban scrapers by stating the prohibition explicitly on their webpage or through a link denoting the restrictions.

In light of this conflict among the courts of appeals, LinkedIn is taking its case to the Supreme Court. In March 2020, LinkedIn filed a petition for certiorari challenging the Ninth Circuit’s ruling, giving the Supreme Court an opportunity to weigh in on the hiQ holding and, hopefully, to clarify how the CFAA should be applied in cases concerning the scraping of publicly available web information.

The takeaway for businesses

Businesses seeking to take advantage of the present data boom through web scraping should follow best practices when it comes to scraping and ensure that transactions with companies that use web scraping are carefully drafted.

To minimize concerns stemming from web scraping, companies should scrape discreetly, respect terms of service, check whether sites use the robots.txt protocol to communicate that scraping is prohibited (noting that courts remain split as to whether failure to use a robots.txt protocol, or poor implementation of such a protocol, constitutes an implied license for scrapers), abide by cease and desist letters, avoid scraping private or classified information, and scrape in a nonaggressive manner that does not burden web servers; a sketch of several of these practices follows below. Companies that web scrape should also put procedures in place for reviewing and honoring the terms of use of the websites they scrape. Web scrapers should further consider whether the owner of the information they are scraping will license or otherwise authorize the use of its content. Website owners are increasingly likely to recognize the value of their data in today’s digital economy and to view web scraping as a lost opportunity to derive revenue from that data.
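The sketch below pulls several of these practices together: identifying the bot honestly, honoring robots.txt (as in the earlier example), and rate limiting requests so the target’s servers are not burdened. It is a simplified illustration under stated assumptions, not legal advice; the host, paths, delay value, and contact address are hypothetical.

```python
# A polite-scraping sketch combining the best practices above: an honest
# User-Agent, a robots.txt check, and a delay between requests. All names,
# URLs, and the delay value are hypothetical. Assumes `requests` installed.
import time
from urllib import robotparser

import requests

BASE = "https://example.com"  # hypothetical target site
USER_AGENT = "example-research-bot/1.0 (ops@example.com)"  # hypothetical

robots = robotparser.RobotFileParser()
robots.set_url(f"{BASE}/robots.txt")
robots.read()


def polite_get(path: str, delay: float = 2.0):
    """Fetch a page only if robots.txt permits, pausing between requests."""
    url = f"{BASE}{path}"
    if not robots.can_fetch(USER_AGENT, url):
        return None  # the site has asked bots like this one to stay out
    time.sleep(delay)  # crude rate limit to avoid burdening the server
    return requests.get(url, headers={"User-Agent": USER_AGENT}, timeout=10)


response = polite_get("/public-listings")  # hypothetical public page
```

A fixed delay is the simplest approach; production crawlers often go further, honoring a site’s Crawl-delay directive where present and backing off when the server responds slowly or returns errors.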

Businesses should also be aware that in a 2018 study that analyzed 20 years of CFAA actions against web scrapers, a large number of claims were brought “by direct commercial competitors or companies in closely adjacent markets to each other.” Companies seeking to avoid litigation should therefore be especially cognizant of whether their data harvesting endeavors are aimed at commercial rivals.


In M&A transactions that involve the acquisition of a business or technology that utilizes web scraping, it is important to draft contracts in a manner that limits exposure to liability from such scraping activities. For example, the acquirer should ensure that the contract includes a covenant requiring the scraper to comply with all applicable laws, as well as an indemnity covering third-party claims brought against the acquirer, including potential web scraping causes of action.

Navigating the legal landscape surrounding web scraping is not easy, but businesses that are cognizant of the various theories of liability applicable to web scraping can better manage the risks. Businesses should continue to monitor the shifting standards of liability as case law on this subject continues to develop. A growing number of experts are questioning whether the existing laws governing web scraping are antiquated, and businesses should be cognizant of the possibility that new legislation may be introduced to better define the legal contours of the practice.

 

Partner, Intellectual Property Transactions Group at Shearman & Sterling
Associate, Intellectual Property Transactions Group at Shearman & Sterling