Two things happened this week that have caused me to think about how we navigate copyright issues in the era of generative AI. The first was OpenAI's announcement of their new web crawler, and the second was a response I got from ChatGPT when asking about a literary character by an author my wife works with.
GPTBot
GPTBot is the web crawler that OpenAI announced last week, which it will use going forward to collect data from the internet to train its models. What makes it interesting is that it can be clearly identified and very simply blocked (with two short lines of configuration!) by any publisher that doesn't want their content crawled and used by OpenAI to train future large language models. OpenAI also filters what GPTBot collects to "remove sources that require paywall access, are known to gather personally identifiable information (PII), or have text that violates (our) policies", and gives guidance on how publishers can allow access to some parts of their sites but not others.
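For reference, the two-line block is a standard robots.txt rule, as described in OpenAI's documentation:

```
# Tell OpenAI's crawler not to fetch anything from this site
User-agent: GPTBot
Disallow: /
```

OpenAI's guidance also covers partial access, using `Allow` and `Disallow` directives under the same `User-agent: GPTBot` entry to let the crawler into some paths but not others.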
To me, this represents good progress: OpenAI are being fully transparent and putting the power back into publishers' hands to decide whether or not they're happy for their content to be included in GPT's training data.
Transparency like this will build trust among generative AI companies, publishers and consumers, which will hopefully lead to more responsible use of AI in the future. Ensuring content owners are 100% in control of their content is a big step towards a more ethical approach to gathering data to train large language models and is hopefully an example that other generative AI companies will follow.
Copyrighted Text
The second thing that happened last week was that ChatGPT (using GPT-4) told me it couldn't respond to my request because it couldn't reproduce copyrighted material. At the time I was asking ChatGPT to give me some quotes from The Skull in Jonathan Stroud's Lockwood & Co. series for a little side project I'm working on 👀.
This is the first time I've seen a message like this from ChatGPT, and it isn't something I've seen widely discussed, so I'm not sure if it's a new feature. However, I think this is the responsible response: ChatGPT shouldn't be able to reproduce any copyrighted material, and whilst it can talk to me about the book series, summarise the plot and answer questions on the content, actually reproducing the text is the line it absolutely shouldn't cross.
Implications
So why are these two seemingly unrelated things so interesting? Well, I think they point to a good middle ground in the ongoing debate (and lawsuits) around how online content is used by generative AI companies to train their large language models.
Let's put this in human terms, and simplify things for clarity - there is currently a huge amount of content that people can access online without charge. This is all paid for by advertising (but that's another topic for another time - I have strong views!). From a consumer's perspective, they aren't paying for any of this content beyond an implicit sharing of their 'data'. This is one of the main things that has made the internet what it is today - the ability to freely share information, giving consumers the means to use the internet for a huge variety of day-to-day tasks.
There is also some 'premium' content that is only available online if you're willing to pay. That could be through more explicitly sharing your data (e.g. having to register an account to access it) or through paying a subscription or other charge. Premium, chargeable content has been a big growth area for many publishers over the past few years, and there are now lots of successful paid-for services online.
In both of these cases, we (society) are saying that it is ok for humans to consume this content - freely, if it's available freely, or for a charge if it's through a paid service. But society also says (in our laws etc.) that it's not ok for humans to reproduce this content without the express permission of the content owner. I'm aware that this is a simplification of some very complex laws that differ from country to country, but the broad principles are consistent.
Much like how humans read and internalise information from books or articles, large language models 'consume' content by analysing and storing the data they are trained on. This absorption process is analogous to our learning: both humans and generative AI models use the ingested information to inform future actions, responses, or decisions.
My provocation is therefore: why should this be any different for large language models? If generative AI companies only train their models on content that is freely available, or through commercial agreements with premium content providers, and agree that they won't reproduce copyrighted material, wouldn't this represent a good outcome for all? I certainly think so.
There are obviously a few caveats to this (e.g. this is easier to spot and enforce for text generation than for image generation), and I know the world isn't as simple as I've described above, but this seems like a sensible approach. This way society can benefit from large language models trained on the widest selection of material available, premium content can be gated and paid for, and content creators are safeguarded against their work being reproduced without permission.
Conclusion
To get to this proposed solution we have to conceptually separate the training of large language models from the outputs they generate, which I think is a very reasonable thing to do, and it's great to see OpenAI moving in this direction. I'm sure we'll be caught up in this debate for some time, and there are plenty more legal cases (and lawyers' fees!) to come, but I'd love to see us find some middle ground that benefits all parties and allows generative AI technology to continue to progress in a way that puts control back into publishers' hands and fairly rewards content creators.
It will be interesting to see how other people feel about this. How do you think we can find a way for generative AI to continue to develop whilst respecting creative rights?
Please feel free to sound off in the comments below!
"The future is already here, it's just not evenly distributed."
William Gibson
This article was researched and written with help from ChatGPT, but was lovingly reviewed, edited and fine-tuned by a human.
I think there's an interesting question about safe defaults around OpenAI's crawler bot. This is something that comes up in web standards quite a lot - ideally you want the choices made today to remain "safe" years into the future, so website owners don't have to constantly tweak things like their robots.txt (the Robots Exclusion Protocol, or REP). Until recently, the main users of REP were archivers like the Wayback Machine, search crawlers, and academic crawlers. But REP works at the user-agent level, not the purpose level - so while you might have rules to nudge the Wayback Machine away from something, unless you're prepared to block everything by default and then allow specific agents (which means that someone building a new search engine won't list your site, or at least won't list it as usefully), you are "wide open" to novel crawlers. (This isn't made any easier by search engine crawlers that may also be feeding training data for LLMs.)
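The "block everything by default, then allow specific agents" approach described above would look something like this in robots.txt (the agents listed are purely illustrative):

```
# Deny all crawlers by default...
User-agent: *
Disallow: /

# ...then explicitly allow the ones you trust
User-agent: Googlebot
Allow: /

User-agent: Bingbot
Allow: /
```

The trade-off is exactly as stated: any crawler not on the allowlist, including a brand-new search engine's, is shut out until the site owner hears about it and updates the file.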
Given that it took 25 years for REP to be standardised, I wouldn't bank on a change. And I can't see OpenAI voluntarily limiting its bot (say, only to sites that explicitly allowed it via REP). So right now I don't think this is a fair balance, despite it being *possible* for media owners to control OpenAI's use of their content. LLMs are still highly extractive - something that needs a better approach than a technical opt-out which, in any case, likely only represents one form of input for future training sets.