Five Key Points from the Invasion of Privacy Lawsuit Against OpenAI

“Plaintiffs allege in the OpenAI lawsuit that, since commercialization, OpenAI used five different datasets to train ChatGPT and each of these datasets performs massive data collection, effectively scraping the whole internet.”

On September 6, OpenAI was hit with its second invasion of privacy lawsuit, filed in the U.S. District Court for the Northern District of California, for allegedly stealing private information from millions of internet users. While the Plaintiffs acknowledge in their complaint that Artificial Intelligence (AI) has the potential to create life-saving technologies and herald discoveries that could improve our daily lives, they claim OpenAI abandoned its altruistic approach to that objective when it abruptly restructured itself into a for-profit business. Following this restructuring, the Plaintiffs allege, OpenAI scraped private information from millions of users to train its large language models. Here are five key allegations from the privacy suit against OpenAI.

1. Plaintiffs Assert Invasion of Privacy Claims in California but Copyright Infringement in New York

California has some of the strongest privacy protection laws in the country. When the public first learned that OpenAI had scraped online information to train its large language models, plaintiffs sought privacy protections in the state. But as creative writers, including Sarah Silverman and the Authors Guild, learned that their copyrighted work had also been scraped, copyright claims followed in federal district court in New York.

The Plaintiffs seeking privacy protections in Northern California assert two provisions of the California Invasion of Privacy Act (CIPA) (not to be confused with the California Consumer Privacy Act). First, Plaintiffs assert Section 631(a), which generally prohibits “wiretapping.” But “Courts have consistently interpreted [section 631(a)] as applying only to communications over telephones and not through the internet.” Licea v. Cinmar, LLC, 2023 WL 2415592 (C.D. Cal. March 7, 2023). Nevertheless, the Plaintiffs here assert that Section 631(a) is not limited to phone lines but also applies to “new technologies” such as computers, the Internet, and email. See Matera v. Google Inc., No. 15-CV-04062-LHK, 2016 U.S. Dist. LEXIS 107918, at *61-*63 (N.D. Cal. Aug. 12, 2016).

Second, Plaintiffs assert Section 632(a), which prohibits “eavesdropping” by “a person who intentionally and without the consent of all parties to a confidential communication, uses an electronic amplifying or recording device to eavesdrop upon or record the confidential information… by means of a telephone, or other device…” CIPA contains an exemption for a party that wiretaps or eavesdrops on its own conversation. Cal. Penal Code § 630, et seq.

In this lawsuit and future privacy lawsuits like it, courts will attempt to resolve three issues: (1) whether the asserted confidential information qualifies as protected content; (2) whether a third party’s embedded code is merely a tool used by the website’s owner or an application provided by a separate vendor; and (3) the level of consent required if embedded code on a website is implemented by a separate vendor.

Here, district courts in California are struggling with the first two issues. One is whether third-party code embedded on websites in the form of browser-side scripts is merely a tool, in which case it falls under the exception for wiretapping or eavesdropping on one’s own conversation. In Byars v. Hot Topic, Inc. (C.D. Cal. Feb. 14, 2023), the court ruled that a third-party chat feature embedded on Hot Topic’s website was merely a “tool” and no more than an “extension” of the website provider. Even though this decision diverged from the Byars v. Goodyear decision (a case brought by the same Plaintiff), courts are moving toward applying the own-conversation exception if a third-party service was used to “record and analyze its own data in aid of Defendant’s business,” not for the “aggregation of data for resale.”

Plaintiffs allege in the OpenAI lawsuit that, since commercialization, OpenAI used five different datasets to train ChatGPT and each of these datasets performs massive data collection, effectively scraping the whole internet. One of these datasets, WebText2, collected all “outbound” data from social media sites such as Reddit, YouTube, Facebook, TikTok, Snapchat, and Instagram, without the consent of the original creators or the social media sites. Indeed, the co-founder and CEO of Reddit, Steve Huffman, commented, “The Reddit corpus of data is really valuable. But we don’t need to give all of that value to some of the largest companies in the world for free.”

Decisions from district courts in California are split with regard to whether CIPA’s protected content, or confidential information, includes IP addresses and other information necessary to conduct day-to-day operations. Courts look to the nature of the communications, such as a chat feature or keystrokes performed by the user, to determine whether the content is protected by statute. In Saleh v. Nike, Inc., the federal district court ruled that protected contents include “the date and time of the [plaintiff’s] visit [to the website], the duration of the visit, Plaintiff’s IP address, his location at the time of the visit, his browser type, and the operating system of his device.”

But the courts in Graham v. Noom, Inc., 533 F. Supp. 3d 823, 833 (N.D. Cal. 2021), and Yoon v. Lululemon USA, Inc., 549 F. Supp. 3d 1073, 1082-83 (C.D. Cal. 2021), explicitly held that allegations about “the date and time of the visit, the duration of the visit, Plaintiff’s IP address, her location at the time of the visit, her browser type, and the operating system of her device” are not “contents” for the purposes of CIPA.

Plaintiffs allege that, to train their AI products, Defendants collected private information, including contact details, login credentials, emails, payment information, IP addresses, transaction records, geolocation data, cookies, chat logs, analytics, keystrokes and more. Discovery will be needed to reveal whether the data scraping for OpenAI’s LLMs was performed on the server side or the client side. In the meantime, the split among district courts in California casts doubt on whether this type of customer information constitutes protected content under CIPA.

2. OpenAI Abandoned Its Original Mission

Fair use by nonprofit organizations for scholarly research, commentary, or teaching is often raised as an affirmative defense to copyright infringement claims. The plaintiffs in the California invasion of privacy lawsuit, however, are not claiming copyright infringement. Nevertheless, they emphasize that OpenAI’s original intent was to research a new technology in a responsible and safe manner, and that it later closed its code to peer review in pursuit of commercial profit.

Plaintiffs claim “OpenAI abandoned its original goals and principles…” Instead “it doubled down on strategy to secretly harvest massive amounts of personal data from the internet, including private information and private conversations, medical data…” One of the original investors of OpenAI, Elon Musk, commented: “I’m still confused as to how a non-profit to which I donated ~100M somehow became a $30B market cap for-profit.” He also noted, “OpenAI was created as an open source (which is why I named it ‘Open’ AI).”

3. Scraping without Consent

In the era of technology and information in which we live, personal information is becoming an essential driver for technology products. As a general matter, technology companies purchase and sell internet user data like any other company asset or property. But, the Plaintiffs argue, the large language models embedded within OpenAI’s products were developed by consuming large amounts of personal data without consent, for the purposes of training their AI.

The Plaintiffs claim that the products at issue, including ChatGPT-3.5 and ChatGPT-4.0 as well as Dall-E and Vall-E, “only reached the level of sophistication they have today due to training on stolen, misappropriated data, and Defendants continue to misappropriate data, scraping from the internet without any notice or consent, as well as taking personal information from the Products’ 100+ million registered users without their full knowledge and consent.”

OpenAI has yet to respond to the complaint, but it may assert that customers adequately consented to the use, reuse, sale and resale of their personal information when they agreed to the original website owner’s Terms of Use by checking a checkbox, or that the information scraped for the large language models was client-facing information available to the public. Whether the court will find that users adequately consented to OpenAI’s use of their personal information, or that the content scraped by OpenAI falls under CIPA’s exception, remains to be seen.

4. Plaintiffs Claim Defendants Violated the ECPA

Plaintiffs also brought suit under the Wiretap Act, which is part of the Electronic Communications Privacy Act (ECPA). The Wiretap Act prohibits the intentional interception of the contents of any wire, oral, or electronic communication through the use of a device.

It is unclear how the court will apply the Wiretap Act in this case. Exceptions to the Wiretap Act include interception during the ordinary course of business, consent, and when service providers divulge the contents of the communication with the consent of the originator (18 U.S.C. § 2511).

5. Massive Privacy Violations Present Risk to Consumers

The lawsuit also raises pressing concerns about the nefarious purposes for which the data collected by OpenAI may be used. OpenAI’s ability to build a complete profile of a user’s behavior pattern, “including but not limited to where they go, what they do, with whom they interact, and what their interests and habits are…. raises vital ethical and legal questions about privacy, consent, and the use of personal data,” says the complaint. Additionally, AI tools like ChatGPT are increasingly being integrated into healthcare systems, creating risks to the confidentiality of patients, including minors.

The Takeaway: Congress Must Set Limits and Create Protected Categories of Personal Information

Artificial Intelligence is a double-edged sword. Based on the profile created from the data it collects, AI can offer suggestions that improve our daily lives, or it can be used to manipulate us in ways that solely benefit its owners, including through social engineering (cyber attacks) and fraud. As such, we would be wise to implement the solutions currently at our disposal to prevent disastrous consequences.

The General Data Protection Regulation (GDPR) is an important law established by the European Union to protect individual privacy rights. In March 2023, Italy’s data protection authority temporarily banned ChatGPT on suspicion that OpenAI had breached GDPR rules.

California’s CCPA classifies several categories of personal information as the most sensitive identifiers, such as social security numbers, driver’s license numbers, and electronic records (IP addresses, purchase history, geolocation data). Congress must allow AI innovation to flourish while safeguarding consumer privacy by protecting these sensitive categories of personal information.

OpenAI has until October 5, 2023, to respond to the complaint.

Andy Yang is a patent attorney licensed in the State of California. Andy holds certifications from NCEES for FE Electrical and Computer Engineering and CompTIA Cyber Security+.