When AI Plagiarizes AI

Jax Winterbone was recently playing with Grok, the new AI system launched by X (formerly Twitter) under the xAI label. However, when he prompted it to modify some malware, he got an unexpected response: Grok refused, saying the request went against OpenAI’s use case policy, a strange thing for a chatbot built by a different company to say.

This led him to question whether “Grok is really just ripping OpenAI’s code base” and prompted Jon Christian at Futurism to look into the matter. In his reporting, Christian found a post by xAI engineer Igor Babuschkin, who attempted to explain the situation.

According to Babuschkin, the issue isn’t that Grok is using any OpenAI code, but that “the web is full of ChatGPT outputs” and Grok was inadvertently trained on some of those outputs, prompting it to repeat the language.

He goes on to say that the problem is “rare” but that they are working to ensure future versions of Grok do not repeat it.

However, the story does point to a future challenge for AI. As AI-generated content makes up more and more of the web and remains difficult to detect automatically, it is inevitable that AI systems will be trained on the output of other AI systems.

This opens the door to what we saw here, where one AI plagiarized its answer from another AI system, simply because it ingested that system’s output.

It’s a problem that AI systems are going to have to deal with if they want to remain remotely up to date.

The Snake Eating Its Tail

The problem is fairly simple. AI systems require large amounts of content to train on. Obviously, this has become a point of contention for many human creators, who feel that AI was illegally trained on their copyright-protected work and are filing lawsuits over it.

But, while the legal implications are still being sorted out, the technical limitation remains. Without a large volume of training data, AI systems are useless. 

Initially, AI systems were trained almost wholly on human-created content. That’s simply because there was very little AI-generated content available. Now, however, AI content is being posted online in record amounts, in particular in the form of spam. 

However, since there is no clearly effective way to filter out AI writing, at least not at this time, AI systems are going to hoover up AI-generated content, including content from their competitors.

While it would be easy to limit an AI’s dataset to content published before a certain date, most likely November 30, 2022, the day ChatGPT launched to the public, that would mean the AI’s answers would become more and more dated as time goes on. The choice, then, is between training AI on other AI output and becoming less relevant over time.
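To make that trade-off concrete, here is a minimal sketch in Python of what a crude training-data filter might look like. Everything in it, the document fields, the cutoff date and the list of telltale phrases, is purely illustrative and is not a description of how xAI, OpenAI or any other company actually cleans its data.

from datetime import date

# Illustrative cutoff: the day ChatGPT launched to the public.
CUTOFF = date(2022, 11, 30)

# A few telltale phrases that often show up in AI-generated text.
# Real pipelines, if they filter at all, are far more sophisticated;
# this list exists only to show why crude filtering is unreliable.
TELLTALE_PHRASES = (
    "as an ai language model",
    "i cannot fulfill that request",
    "use case policy",
)


def looks_ai_generated(text: str) -> bool:
    """Flag documents containing obvious AI boilerplate phrases."""
    lowered = text.lower()
    return any(phrase in lowered for phrase in TELLTALE_PHRASES)


def filter_corpus(documents):
    """Yield documents fit for training under this toy policy.

    `documents` is assumed to be an iterable of dicts with hypothetical
    "text" (str) and "published" (datetime.date) fields.
    """
    for doc in documents:
        if doc["published"] < CUTOFF:
            # Published before ChatGPT existed, so it cannot contain its output.
            yield doc
        elif not looks_ai_generated(doc["text"]):
            # Newer content is kept only if it passes the weak phrase check,
            # which is exactly the trade-off described above: stay current,
            # but accept that some AI-generated text will slip through.
            yield doc

Even in this toy example the weakness is obvious: pre-cutoff content is safe but frozen in time, while the phrase check only catches the most blatant AI boilerplate and lets fluent machine-written prose sail through.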

From a purely technical standpoint, this is bad news for AI. AI-generated content has been widely criticized for being error-prone, subject to bias and plagued by other issues. Training AI systems on other AI output is likely to exacerbate these issues, making the output more problematic in multiple ways.

However, from a legal and ethical standpoint, things aren’t quite as bad for AI, at least not in the United States.

Legal and Ethical Questions 

All this raises a simple question: What happens, legally and ethically, when an AI system plagiarizes a competing AI system?

The answer, in the United States, is probably not much.

Here, the United States Copyright Office makes it very clear that it does not consider the output of a generative AI to be eligible for copyright protection. As such, AI-generated text and images do not qualify for protection, though any additional creativity layered on top of or around them could.

This means that one AI system regurgitating the output of a competitor doesn’t raise any copyright issues. As long as no code was copied and the system itself doesn’t violate a competitor’s rights, the output is fair game.

From an ethical standpoint, things are clearly more complicated. The response to Grok’s plagiarism of OpenAI was largely one of mockery. The story came amidst a busy post-launch weekend in which Grok insulted X owner Elon Musk and “went woke” despite promises it would be “anti-woke”, giving the media yet another reason to lampoon Grok, X and Musk.

However, the ethics of one AI plagiarizing another are not settled and may become more important as competition in the space increases. While this case, at least according to Babuschkin, was an accident, what happens when a company does it deliberately to speed up or improve training?

There’s not an easy answer for that. Ethical norms are established over time by those in the field, and there has been neither enough time nor enough similar incidents to know where this is headed.

Still, there may be a more immediate problem for AI systems: China.

The China Problem

Earlier this month, the Beijing Internet Court ruled that an AI-generated image was protected by copyright. This is in stark contrast to the United States, where such protection has been roundly denied.

On one hand, this sounds like a potential win for AI companies. Their systems’ output gaining copyright protection would allow users to exploit those works the same as they would anything created by humans.

However, it could also create new problems. If it is determined that generative AI infringes on the work of human authors by using it for training, that reasoning would, in theory, apply just as strongly to copyright-protected AI-generated work as it does to human-created work.

This could also put AI companies in the awkward position of not owning their own output. For example, in the Beijing case, the rightsholder was not StabilityAI, but the person who generated the image.

Obviously, there are ways around this. AI companies could attempt to use their terms of service to grant themselves significant rights over their output, but that only binds their own users and doesn’t change how users interact with other companies’ systems.

If AI output can be protected by copyright and users control that copyright, it could become a problem for AI systems if training on copyright-protected work is found to be unlawful.

That is a lot of “ifs” for one sentence, but at least one court in China has established the first two (for now, pending possible appeal), making the ongoing battle over AI training data even higher stakes than it was before.

Bottom Line

In the end, the real threat of training AI systems on AI output isn’t legal or ethical; it’s technical.

AI writing, at this stage, is generally lower-quality, more error-prone and less useful than human-written text. Training AI systems on AI writing may simply exacerbate these issues.

Unfortunately for AI systems, there’s no easy way to avoid it. With AI detection as unreliable as it is and datasets needing constant updates, it’s inevitable that AI systems will be trained on both their own and their competitors’ outputs.

Stories like this one, instances where AIs let the veil slip and show clearly what they were trained on, are likely to become more common. We’ve been finding similar instances of plagiarism of human-generated work for some time; now AI content is plentiful enough on the internet that we’re seeing it there too.

For AI companies, this amounts to a new set of legal, ethical and technical headaches that they need to address quickly, ideally before the problem gets too far out of hand. 
