#DataPrivacyWeek: ChatGPT's Data-Scraping Model Under Scrutiny From Privacy Experts

One use of ChatGPT, the superstar chatbot created by generative AI firm OpenAI, is drafting privacy notices. Ironically, ChatGPT itself is under scrutiny from data protection experts.

While the various uses of ChatGPT – and other generative AI – can raise ethical and legal concerns regarding the violation of data privacy, as Infosecurity previously investigated, some experts are questioning the very existence of OpenAI’s chatbot for privacy reasons.

Addressing the Data-Scraping Method

First, OpenAI needs to fully disclose the method it used to collect the data ChatGPT is trained on, claims Alexander Hanff, a member of the European Data Protection Board's (EDPB) support pool of experts.

"If OpenAI obtained its training data through trawling the internet, it’s unlawful," Hanff, who hosts That Privacy Show on the streaming platform Twitch, told Infosecurity.

"Just because something is online doesn't mean it's legal to take it. Scraping billions or trillions of data points from sites with terms and conditions which, in themselves, said that the data couldn't be scraped by a third party, is a breach of the contract. Then, you also need to consider the rights of individuals to have their data protected under the EU’s GDPR, ePrivacy directive and Charter of Fundamental Rights," he added.

Dennis Hillemann, partner at the law firm Fieldfisher, added that, from a legal point of view, there are many tensions between the GDPR and foundation models: the large artificial intelligence models trained using self-supervised learning, such as the one behind ChatGPT.

"The way they work, by throwing a lot of unlabeled data and then defining the use cases, contradicts the EU GDPR’s principle of data minimization, which says that an organization should use a minimal amount of information useful for a predefined purpose," Hillemann said.

Fair Use Not Likely Applicable

In many jurisdictions, using copyrighted information without the owner's consent is allowed under the fair use doctrine in certain circumstances, including research, quotation, news reporting, teaching, satire and criticism.

Hillemann said it is likely OpenAI would lean on this defense regarding the data it used to build its ChatGPT model.

According to Hanff, however, none of the fair use conditions apply to OpenAI's models. "Fair use only gives you access to limited information, such as an excerpt from an article. You can't simply grab information from a whole site under fair use; that wouldn't be considered fair. Satire wouldn't be valid either: although you can use ChatGPT for parody and satire, OpenAI has not created the model for these specific purposes," he said.

Also, although OpenAI states on its privacy policy page that it "[does] not and will not sell your Personal Information," Hanff considers ChatGPT a commercial product.

"There is a big debate on the definition of ‘selling,’ but ChatGPT is a commercial activity without a doubt – the company could be valued at $29bn – so they can't get away with fair use. This AI model will create tremendous revenue and wealth for a small number of individuals off the back of everybody else's content and work," he said.

"We're not ready, as a society, to deal with these technologies."Alexander Hanff, member, European Data Protection Board's (EDPB) support pool of experts

Hillemann is more nuanced: "I think, at this point, it’s very hard to define whether ChatGPT is considered a commercial use of data. On the one hand, a lot of people are using it for free; on the other, the company is involved in a commercial business model."

However, the announcement of ChatGPT Pro, a premium version of the chatbot that users will need to pay for, prompts the lawyer to align more closely with Hanff.

"Even if they don’t sell personal data, they use it to perfect their product," Hillemann noted.

Interestingly, here is ChatGPT’s answer when asked, ‘How does ChatGPT deal with personally identifiable information?’: "As a language model, GPT-3 does not have the ability to collect or store any personal information. It can only process the information that is given to it as input and respond with generated text based on that input. When working with personal information, it is important to follow data protection laws and regulations such as the General Data Protection Regulation (GDPR) or California Consumer Privacy Act (CCPA) to ensure that the personal information is being handled appropriately."

Here, the chatbot focuses solely on the trained model and how it can be used, without referring to the data collection process. "It shows that, for OpenAI, the value is in the model, not in the training," Hanff argued.

Dealing With Inaccurate Data

Another issue with ChatGPT and other generative AI models is that, by scraping vast quantities of unlabeled data, they inevitably take in inaccurate information.

"This inaccurate data – be it fake news, misinformation or simple human errors – is not only used by people but also feeds the model itself. And, as I understand it, ChatGPT can judge whether a piece of information is correct. That means that, if the model initially has a lot of inaccurate data about someone or something, its judgment will be flawed. Then there’s a breach of EU GDPR, which requires organizations to verify a piece of information before processing it," Hillemann explained.

Hanff agreed: "I've looked at a few privacy policies created by ChatGPT. Because the material it's been trained on comes from the internet and a great deal of material around privacy policies on the internet is wrong, organizations using the chatbot to create their privacy policies put end users at risk of being misinformed about the way their data is being processed."

Foundation Models in a Legal Vacuum

The privacy advocate also noted that this data-scraping method has been used "by facial recognition company Clearview AI, which was fined by multiple supervisory authorities across the EU, Australia, the UK and the US."

Kohei Kurihara, CEO of the Privacy by Design Lab, agreed there is a good chance that scraping the internet to build a generative AI model like ChatGPT’s breaches contracts with platforms such as social media sites.

"Any massive scraping of data without the users’ consent is a privacy concern," he told Infosecurity.

"We don't allow cars to leave the factory without seatbelts, and the same should be true for AI models."Alexander Hanff, member, European Data Protection Board's (EDPB) support pool of experts

However, Kurihara also argued that the case of Clearview AI is different because the firm was specifically scraping biometric data, which is particularly sensitive, and was using it for law enforcement purposes.

No legal action has yet been taken against OpenAI's internet data-scraping model. However, Kurihara said, "time will tell if it is deemed unlawful as well."

Hillemann said another legal case that could set a precedent in judging foundational models is the Stable Diffusion lawsuit, in which artists have sued the UK-based firm Stability AI for using their work to train its image-generative AI model.

"What’s at stake here is copyright infringement and not privacy violations, but both cases are asking the same question: ‘Is it okay to scrape the internet, with its personal information and copyrighted work, and use it to generate new content and build a business model on it?’ We’ve never been in such a situation before, except perhaps with Google, where a company is scraping the whole internet and releasing it in the wild. Meta or Apple have used data in very concerning ways too, but for purposes that stayed within their own business."

Call for Ad-Hoc AI Regulators

The situation could become more concerning in the future, Hanff argues. "Microsoft, which announced it would invest $10bn in OpenAI over the coming years, now has an AI that can completely copy your voice from three seconds of audio. Not only have they used data illegally to train their AI – data that was biased – and are planning on using models that break fundamental privacy rights, but they're also potentially creating an antitrust issue where it becomes impossible for people to compete with machines."

"We're not ready, as a society, to deal with these technologies," Hanff also claimed.

"I would like to see specific AI regulators with technical knowledge capable of auditing OpenAI at a very deep level – ones that would be different from data protection and privacy regulators because the nuances of the technology are much more difficult to understand and regulate. If an AI is dangerous, it should be removed and retrained until we have a safe model for everyone to use. We don't allow cars to leave the factory without seatbelts, and the same should be true for AI models."

More optimistically, Hillemann argued: "While we have to look at the risks, we should not forget the benefits of these tools. And these are intrinsically linked to the business model generative AI firms have invented."

With the EU’s Artificial Intelligence Act underway and similar initiatives being proposed in other countries, the impact of generative AI models on privacy will probably be on the regulators’ agenda for years to come.

Meanwhile, OpenAI, Stability AI and similar firms could be increasingly challenged by privacy advocates to disclose the safeguards they have put in place to protect their users against data protection violations. In another article, Infosecurity will investigate some privacy experts' concerns that OpenAI's public privacy policy is not up to the standards of data protection regulations.

OpenAI did not respond directly to Infosecurity's requests for comment on its training model. The company provided a fact sheet, but none of the privacy concerns were directly addressed.
