
“All Your Content Belongs to Us”: Google Privacy Policy Update Suggests It Plans to Scrape Everything on the Internet to Improve AI Models

Though the temptation of cost savings is great, many organizations are hesitant to adopt AI tools for a number of reasons. One of the big ones is legal uncertainty about whether these AI models actually have the lawful right to use all of the data they train on, and whether they might spit protected elements back out when generating content. Google appears to have blasted right through this debate like a bull in a china shop with its latest privacy policy update, the wording of which essentially declares that the entire internet is its domain to scrape for the creation of its AI products.

Privacy policy update stakes out extreme position on legally dicey issue

The new privacy policy is a small (but extremely meaningful) update to language that has existed for some years now. Google had previously declared that it used public internet sources to train “language models,” naming Google Translate as an example. That was inoffensive enough to not draw widespread attention.

That same privacy policy segment has now been changed from “language models” to “AI models,” and Google specifically names current projects like Bard and its Cloud AI as examples. One might reasonably expect that content posted to Google’s free services, such as Blogger and Sites, would be used in this way. But the privacy policy wording appears to indicate that Google feels everything it can reach on the public internet is fair game for it to use to enhance its own products.

This “right to scrape” debate has been playing out on multiple fronts for some years now. One of the big cases has been that of Clearview AI, which built a massive biometric facial recognition database by scraping publicly posted pictures without anyone’s knowledge or consent. The debate has not been going in the company’s favor thus far; Facebook and other social media companies have banned it from their platforms due to terms of service and privacy policy violations, some countries (Canada) and states (Illinois) have banned it entirely from their territory, and it has amassed large fines in the EU for privacy and data handling transgressions. The company is essentially hanging on due solely to the fact that the US does not yet have a federal-level data privacy law.

Biometric information does enjoy a higher level of legal protection than random blog posts and articles. However, the fundamental point of these cases is that users have some reasonable expectation of how their content will be used by private entities when they put it on the internet, and having it absorbed into a private for-profit database is not always among those reasonable expectations. The issue is even more complex with AI models involved, as they may spit some of this information back out in a legally actionable way.

Legal challenges along these lines are broadly expected over the next few years as AI models and products continue to roll out, and some are in courts already. OpenAI is facing a class action lawsuit in Northern California over its freewheeling scraping of the internet, and the ultimate result of that case will apply directly to Google’s plans. A separate and more targeted lawsuit was also filed against OpenAI by a collection of authors who say that the company specifically scraped their protected works. And Microsoft looks to be headed to court over its training of Github Copilot, which plaintiffs say helped itself to open source code without honoring the required licensing agreements.

Timothy Morris, Chief Security Advisor at Tanium, points out that legal issues related to “deepfakes” and privacy policy may also end up being a challenge for AI models: “From a privacy point of view, the ability of AI to create new works can test the definition of what is ‘public.’ Taking something that is public to create new works will create legal challenges and, I believe, cause a cry for better regulations. For example, the deepfakes that use public images and information.”

AI models already changing the internet landscape

While these court judgments and precedents will likely take years to shake out, and in turn prompt clearer regulation of AI models, the deployment of tools like ChatGPT is already causing major ripple effects across the public internet. AI scraping by competitors has been the cited reason for both Twitter’s and Reddit’s APIs suddenly going entirely pay-for-play. Over the July 4th weekend, Twitter took the extreme additional step of requiring users to log in to view any tweets at all, and then doubled down by throttling all users to viewing just several hundred posts per day.

While the issue of AI models scraping content is presently (and sympathetically) framed as the little guy having work stolen by AI models that then threaten to put them out of a job, the Twitter and Reddit moves demonstrate that the legal clashes over scraping might ultimately be more of a “Godzilla vs. King Kong” scenario, with tech’s biggest names accusing each other of violating their respective terms of service. Regulators might ultimately also treat scraping the same way they treat violations involving poorly secured APIs: as a privacy failing for which the content host bears responsibility.

AI models like Google’s and Bing’s new assisted search tools are also roiling the private industry that has been built around search traffic. These tools offer an “AI summary” often scraped from websites, diverting users from actually visiting the sources the information was taken from. Google in particular faces legal and regulatory threats along antitrust lines in this area, given that it holds an estimated 90% of the search advertising market. The beta version of Google’s “Search Generative Experience” tool indicates that AI-generated text will take up the entire first page of search results (combined with advertising links), with actual links to websites “below the fold,” requiring users to scroll down and click a “show more” button to view them.