September 25, 2023

ChatGPT lives in the shadow of a big data scandal; understand why

Artificial intelligence (AI) has conquered the world in recent months thanks to advances in large language models (LLMs), which power popular services such as ChatGPT. At first glance, the technology may seem like magic, but behind it are vast amounts of data that feed its intelligent and eloquent responses. However, this model may be living in the shadow of a big data scandal.

Generative artificial intelligence systems, like ChatGPT, are large probability machines: they parse huge amounts of text and make connections between terms (known as parameters) to generate original text on demand. The more parameters, the more sophisticated the AI. The first version of ChatGPT, released last November, has 175 billion parameters.

What has begun to haunt authorities and experts alike is the nature of the data used to train these systems: it is hard to know where the information comes from and what exactly is feeding the machines. The scientific paper on GPT-3, the first version of the “brain” of ChatGPT, gives an idea of what was used. The training data included Common Crawl, WebText2 (text packages filtered from the Internet and social networks), Books1 and Books2 (packages of books available on the web), and the English version of Wikipedia.

Although the packages have been revealed, it is not known exactly what they are made of. No one can say, for example, whether a post from a personal blog or a social network fed the model. The Washington Post analyzed a package named C4, used to train the LLMs T5, from Google, and LLaMA, from Facebook. It found 15 million sites, which include news outlets, gaming forums, repositories of pirated books, and two databases containing voter information in the United States.

The origin of databases for large AI models raises concerns. Photo: Joel Saget/AFP

With stiff competition in the generative AI market, transparency around data usage has deteriorated. OpenAI did not disclose which databases it used to train GPT-4, the current brain of ChatGPT. As for Bard, the chatbot that recently arrived in Brazil, Google also adopted a vague statement, saying that it trains its models with “publicly available information on the Internet”.

Authorities take action

This has led to action by regulators in different countries. In March, Italy suspended ChatGPT over fears that it breached data protection laws. In May, Canadian regulators launched an investigation into OpenAI over its collection and use of data. This week, the Federal Trade Commission (FTC) in the United States began investigating whether the service caused harm to consumers and whether OpenAI engaged in “unfair or deceptive” privacy and data security practices. According to the agency, these practices may have caused “reputational damage to people”.

The Ibero-American Data Protection Network (RIPD), which includes 16 data authorities from 12 countries, including Brazil, also decided to investigate OpenAI’s practices. In Brazil, Estadão contacted the National Data Protection Authority (ANPD), which stated in a note that it is “conducting a preliminary study, although not exclusively dedicated to ChatGPT, aimed at supporting concepts related to generative models of artificial intelligence, as well as identifying potential risks to privacy and data protection.” Previously, the ANPD had published a document in which it indicated its desire to be the supervisory and regulatory authority on artificial intelligence.

“Things only change when there is a scandal. It is beginning to become clear that we have not learned from past mistakes. ChatGPT is very vague about the databases used.”

Luã Cruz, Communications Specialist at the Brazilian Institute for Consumer Defense (Idec)

Luca Belli, Professor of Law and Coordinator of the Center for Technology and Society at the Getulio Vargas Foundation (FGV) in Rio, has petitioned the ANPD about the use of data by large AI models. “As the owner of personal data, I have the right to know how OpenAI is producing responses about me. Obviously, ChatGPT generated results from a huge database that also includes my personal information,” he tells Estadão. “Is there consent for them to use my personal data? No. Is there a legal basis for my data to be used to train AI models? No.”

Belli says he has not received any response from the ANPD. When asked about the topic by Estadão, the agency did not respond, nor did it indicate whether it was working with the RIPD on the subject.

He recalls the turmoil leading up to the Cambridge Analytica scandal, in which the data of 87 million Facebook users was misused. Privacy and data protection experts had pointed to the problem of data usage on the big platforms, but the authorities’ actions did not address it.

“Things only change when there is a scandal. It is starting to become clear that we have not learned from the mistakes of the past. ChatGPT is very vague about the databases used,” says Luã Cruz, communications specialist at the Brazilian Institute for Consumer Defense (Idec).

However, unlike the Facebook case, misuse of data by LLMs can generate not only a privacy scandal, but also a copyright scandal. In the US, writers Mona Awad and Paul Tremblay sued OpenAI because they believe their books were used to train ChatGPT.

In addition, visual artists fear that their work is feeding image generators such as DALL-E 2, Midjourney, and Stable Diffusion. This week, OpenAI entered into an agreement with the Associated Press to use its news texts to train its models. It is a timid step given what the company has already built.

“In the future we will see a flood of collective lawsuits that push against the limits of data use. Privacy and copyright are very close ideas,” says Rafael Zanata, director of the Associação Data Privacy Brasil. For him, the copyright agenda has more appeal and should put more pressure on the tech giants.

Google has changed its terms of use to allow public data on the web to be used to train AI systems. Photo: Josh Adelson/AFP

Zanata argues that large AI models challenge the notion that public data on the Internet are resources available for use regardless of the context in which they are applied. “You have to respect the integrity of the context. For example, whoever posted a photo on Fotolog years ago would not have imagined, let alone allowed, their image being used to train an AI database.”

To try and gain some legal certainty, Google, for example, changed its terms of use on July 1st to indicate that data “available on the web” can be used to train AI systems.

“We may, for example, collect information that is publicly available online or from other public sources to help train Google’s artificial intelligence models and build features such as Google Translate, Bard, and Cloud AI capabilities,” the document says. “Or, if information about your activity appears on a website, we may index and display it via Google services.” Contacted by Estadão, the giant did not comment on the matter.

Until now, the AI giants have treated their databases almost like the “Coca-Cola recipe”: an industrial secret. However, for those who follow the topic, this cannot be an excuse for the lack of guarantees and transparency.

“Anvisa does not need to know the specific formula of Coca-Cola. It needs to know whether basic rules were followed in the construction and regulation of the product and whether or not the product causes any harm to the population. If it does harm, it should carry a warning. There are levels of transparency that can be respected without giving away the technology’s gold,” says Cruz.