#DataPrivacyWeek: ChatGPT's Data-Scraping Model Under Scrutiny From Privacy Experts

One use of ChatGPT, the superstar chatbot created by generative AI firm OpenAI, is drafting privacy notices. Ironically, ChatGPT itself is under scrutiny from data protection experts.

While the various uses of ChatGPT – and other generative AI – can raise ethical and legal concerns regarding the violation of data privacy, as Infosecurity previously investigated, some experts are questioning the very existence of OpenAI’s chatbot for privacy reasons.

Addressing the Data-Scraping Method

First, OpenAI needs to fully disclose the method it used to collect the data ChatGPT is trained on, claims Alexander Hanff, a member of the European Data Protection Board's (EDPB) support pool of experts.

"If OpenAI obtained its training data through trawling the internet, it’s unlawful," Hanff, who hosts That Privacy Show on the streaming platform Twitch, told Infosecurity.

"Just because something is online doesn't mean it's legal to take it. Scraping billions or trillions of data points from sites with terms and conditions which, in themselves, said that the data couldn't be scraped by a third party, is a breach of the contract. Then, you also need to consider the rights of individuals to have their data protected under the EU’s GDPR, ePrivacy directive and Charter of Fundamental Rights," he added.

Dennis Hillemann, partner at the law firm Fieldfisher, added that, from a legal point of view, there are many tensions between the GDPR and foundation models: the large artificial intelligence models trained using self-supervised learning, such as the one behind ChatGPT.

"The way they work, by throwing a lot of unlabeled data and then defining the use cases, contradicts the EU GDPR’s principle of data minimization, which says that an organization should use a minimal amount of information useful for a predefined purpose," Hillemann said.

Fair Use Not Likely Applicable

In many jurisdictions, using copyrighted information without the owner's consent is allowed under the fair use doctrine in certain circumstances, including research, quotation, news reporting, teaching, satire and criticism.

Hillemann said it is likely OpenAI would lean on this defense regarding the data it used to build its ChatGPT model.

According to Hanff, however, none of the fair use conditions apply to OpenAI's models. "Fair use only gives you access to limited information, such as an excerpt from an article. You can't simply grab information from a whole site under fair use; that wouldn't be considered fair. Satire wouldn't be valid either: although you can use ChatGPT for parody and satire, OpenAI has not created the model for these specific purposes," he said.

Also, although OpenAI states on its privacy policy page that it "[does] not and will not sell your Personal Information," Hanff considers ChatGPT a commercial product.

"There is a big debate on the definition of ‘selling,’ but ChatGPT is a commercial activity without a doubt – the company could be valued at $29bn – so they can't get away with fair use. This AI model will create tremendous revenue and wealth for a small number of individuals off the back of everybody else's content and work," he said.

"We're not ready, as a society, to deal with these technologies."Alexander Hanff, member, European Data Protection Board's (EDPB) support pool of experts

Hillemann is more nuanced: "I think, at this point, it’s very hard to define whether ChatGPT is considered a commercial use of data. On the one hand, a lot of people are using it for free; on the other, the company is involved in a commercial business model."

However, the announcement of ChatGPT Pro, a premium version of the chatbot that users will need to pay for, prompts the lawyer to align more closely with Hanff.

"Even if they don’t sell personal data, they use it to perfect their product," Hillemann noted.

Interestingly, here is ChatGPT’s answer when asked, ‘How does ChatGPT deal with personally identifiable information?’: "As a language model, GPT-3 does not have the ability to collect or store any personal information. It can only process the information that is given to it as input and respond with generated text based on that input. When working with personal information, it is important to follow data protection laws and regulations such as the General Data Protection Regulation (GDPR) or California Consumer Privacy Act (CCPA) to ensure that the personal information is being handled appropriately."

Here, the chatbot focuses solely on the trained model and how it can be used, without referring to the data collection process. "It shows that, for OpenAI, the value is in the model, not in the training," Hanff argued.

Dealing With Inaccurate Data

Another issue with ChatGPT and other generative AI models is that, by scraping vast quantities of unlabeled data, they inevitably take in inaccurate information.

"This inaccurate data – be it fake news, misinformation or simple human errors – is not only used by people but also feeds the model itself. And, as I understand it, ChatGPT can judge whether a piece of information is correct. That means that, if the model initially has a lot of inaccurate data about someone or something, its judgment will be flawed. Then there’s a breach of EU GDPR, which requires organizations to verify a piece of information before processing it," Hillemann explained.

Hanff agreed: "I've looked at a few privacy policies created by ChatGPT. Because the material it's been trained on comes from the internet and a great deal of material around privacy policies on the internet is wrong, organizations using the chatbot to create their privacy policies put end users at risk of being misinformed about the way their data is being processed."

Foundation Models in a Legal Vacuum

The privacy advocate also noted that this data-scraping method has been used "by facial recognition company Clearview AI, which was fined by multiple supervisory authorities across the EU, Australia, the UK and the US."

Kohei Kurihara, CEO of the Privacy by Design Lab, agreed there is a good chance that scraping the internet to build a generative AI model like ChatGPT’s breaches contracts with platforms such as social media sites.

"Any massive scraping of data without the users’ consent is a privacy concern," he told Infosecurity.

"We don't allow cars to leave the factory without seatbelts, and the same should be true for AI models."Alexander Hanff, member, European Data Protection Board's (EDPB) support pool of experts

However, Kurihara also argued that the case of Clearview AI is different because the firm was specifically scraping biometric data, which is particularly sensitive, and was using it for law enforcement purposes.

No legal action has yet been taken against OpenAI's internet data-scraping model. However, Kurihara said, "time will tell if it is deemed unlawful as well."

Hillemann said another legal case that could set a precedent in judging foundational models is the Stable Diffusion lawsuit, in which artists have sued the UK-based firm Stability AI for using their work to train its image-generative AI model.

"What’s at stake here is copyright infringement and not privacy violations, but both cases are asking the same question: ‘Is it okay to scrape the internet, with its personal information and copyrighted work, and use it to generate new content and build a business model on it?’ We’ve never been in such a situation before, except perhaps with Google, where a company is scraping the whole internet and releasing it in the wild. Meta or Apple have used data in very concerning ways too, but for purposes that stayed within their own business."

Call for Ad-Hoc AI Regulators

The situation could become more concerning in the future, Hanff argues. "Microsoft, which announced it would invest $10bn in OpenAI over the coming years, now has an AI that can completely copy your voice from three seconds of audio. Not only have they used data illegally to train their AI – data that was biased – and are planning on using models that break fundamental privacy rights, but they're also potentially creating an antitrust issue where it becomes impossible for people to compete with machines."

"We're not ready, as a society, to deal with these technologies," Hanff also claimed.

"I would like to see specific AI regulators with technical knowledge capable of auditing OpenAI at a very deep level – ones that would be different from data protection and privacy regulators because the nuances of the technology are much more difficult to understand and regulate. If an AI is dangerous, it should be removed and retrained until we have a safe model for everyone to use. We don't allow cars to leave the factory without seatbelts, and the same should be true for AI models."

More optimistically, Hillemann argued: "While we have to look at the risks, we should not forget the benefits of these tools. And these are intrinsically linked to the business model generative AI firms have invented."

With the EU’s Artificial Intelligence Act underway and similar initiatives being proposed in other countries, the impact of generative AI models on privacy will probably be on the regulators’ agenda for years to come.

Meanwhile, OpenAI, Stability AI and similar firms could be increasingly challenged by privacy advocates to disclose the safeguards they have put in place to protect their users against data protection violations. In another article, Infosecurity will investigate some privacy experts' concerns that OpenAI's public privacy policy is not up to the standards of data protection regulations.

OpenAI did not respond directly to Infosecurity's requests for comment on its training model. The company provided a fact sheet, but none of the privacy concerns were directly addressed.
