Whether scraping data for generative AI training infringes copyright is a matter for debate

Data scraping on the Internet and copyright infringement

A thorny issue in the broad area of generative AI is copyright: whether and how it applies when images, texts and other content are gathered as data for use by AI publishers.

Specifically, the data gathered (with permission, if required) would serve as source material for training the algorithms that generate AI-produced content such as images, video and text.

It was one of the discussion topics in the March monthly episode of For Immediate Release, the business podcast I co-host with Shel Holtz that we published on March 20. That discussion, in which Shel and I appear to be on opposite sides of the copyright and permission debate, was prompted by an article last week in The Economist that warned a battle royal is brewing over copyright and AI.

The main issue here, The Economist argues, is the oceans of copyrighted data that bots have siphoned up while being trained to create human-like content. That information comes from everywhere: social media feeds, Internet searches, digital libraries, television, radio, banks of statistics and so on. Often, it is alleged, AI models plunder the databases without permission. Those responsible for the source material complain that their work is hoovered up without consent, credit or compensation.

In short, says The Economist, some AI platforms may be doing with other media what Napster did at the turn of the century with songs – ignoring copyright altogether. Napster, a platform for sharing mainly pirated songs, was ultimately brought down by copyright law.

Similar arguments and warnings are being made by others, too, who believe the issue of infringing or ignoring copyright hasn’t been addressed at all and is becoming urgent. I remember reading a really good assessment in The Verge last November, which argued that the scary truth about AI copyright is that nobody knows what will happen next.

Legal actions are already starting, the most dramatic being the lawsuit filed by Getty Images against Stability AI, the company behind the Stable Diffusion AI image generator. Getty claims that over 12 million of its copyrighted images, along with their descriptions and metadata, were used to train Stable Diffusion. The lawsuit seeks a staggering $1.8 trillion in compensation.

Overall, it’s a muddy picture. I think many people speak of generative AI and copyright in terms of the actual output, eg, a digital artwork or a text article. Others see the digital images or texts as the input as well: the reference images or articles that AI tools look at as part of the training and learning that enables them to generate content from the prompts users give.

It’s about the data

What we’re actually looking at here is the input in the form of data, eg, the metadata mentioned above in the Getty lawsuit against Stability AI, and whether it’s been acquired and used by AI firms with permission and in full recognition of copyright ownership.

Governments are starting to do more than just talk about the issue. In the UK, for instance, the government is creating a code of practice for generative AI firms, according to a report in Computer Weekly last week. The code aims to enable generative AI companies in the UK to mine data, text and images in a way that would attract investment, support company formation and growth, and show international leadership. And hopefully address the matter of copyright.

In the US, the Copyright Office has just published Copyright Registration Guidance: Works Containing Material Generated by Artificial Intelligence, a statement of policy that clarifies its practices for examining and registering works that contain material generated by the use of AI technology.

These are good moves undoubtedly. Yet they illustrate one of the biggest hurdles – the geographic nature of copyright law. Perhaps territoriality of copyright law is a better label for the predicament of applying ‘geographic law’ in a situation that is not defined by geography.

Let’s ask an AI

So what is the way forward? It’s clear to me that no one really has a credible and uniform answer yet. With the help of ChatGPT Plus and Google’s just-released Bard, I researched the topic. The input from these two chatbots was helpful: as research assistants they save a lot of time that I can then use on verification and some editing or rewriting.

A good starting point in looking for an answer to the big question of whether scraping data for generative AI training infringes copyright or not is this – it depends.

When it comes to generative AI, large datasets are typically used to train models to produce human-like text, images, or other content. Often, this data is scraped from publicly-accessible websites or platforms. However, the act of scraping data does not inherently infringe on copyright, as copyright law protects original works and not the raw data itself. That said, the way this data is used could potentially infringe on copyright.

Some factors to consider:

  1. Fair Use Doctrine

The fair use doctrine is a legal principle in the United States that allows the limited use of copyrighted material without obtaining permission from the copyright holder. It balances the interests of copyright holders with the public’s interest in the free flow of information. To determine if a particular use of copyrighted material falls under fair use, US courts consider four factors:

  1. The purpose and character of the use.
  2. The nature of the copyrighted work.
  3. The amount and substantiality of the portion used.
  4. The effect of the use on the potential market for or value of the copyrighted work.

Generative AI models that use scraped data for non-commercial research, education, or other transformative purposes may potentially be protected under the fair use doctrine. However, it is essential to note that fair use is determined on a case-by-case basis, and there is no guarantee that a specific use will be considered fair.

Note that this applies only in the US. In the UK, the comparable principle is known as fair dealing, which differs significantly from the US fair use doctrine.

  2. Public Domain

If the data being scraped consists of works in the public domain, then there is no copyright infringement. Public domain works are those whose copyright has expired or those that were never subject to copyright protection. Examples include works created by some governments and certain facts, ideas, or theories.

  3. License Agreements

Some websites or platforms provide access to their data through Application Programming Interfaces (APIs) or other means, subject to specific licensing agreements. These agreements may outline permitted and restricted uses of the data. Using data in accordance with such a license agreement would not constitute copyright infringement as permission will have been agreed with the data owners.

Does the above provide clear answers? Actually, I don’t think so, although it does help shine some light on the overall picture and illustrate its complexity. Never was there a more muddy phrase than “it depends.”

For me, there is one clear answer – do not use someone else’s data unless you are sure you have permission if that’s required. Just because something is easily available online and you could scrape it, that doesn’t mean that you should.

It’s worth mentioning that both ChatGPT Plus and Bard concluded their research in a similar manner:

The use of generative AI for transformative purposes is likely to fall within the fair use exception to copyright law. This means that you can scrape data from the Internet and use it to train a generative AI model without infringing copyright. However, it is important to note that fair use [and fair dealing] is a complex area of law, and there is no guarantee that a court will find a particular use to be fair. If you are unsure whether a particular use is fair, it is always best to consult with an attorney.

(Image at top created by Bing Image Creator in response to the simple prompt: “Data scraping on the Internet.” I’d imagined something Matrix-like, and this result isn’t bad from such a simple prompt.)

Neville Hobson

Social Strategist, Communicator, Writer, and Podcaster with a curiosity for tech and how people use it. Believer in an Internet for everyone. Early adopter (and leaver) and experimenter with social media. Occasional test pilot of shiny new objects. Avid tea drinker.