Discussion about this post

James Aylett

I think there's an interesting question about safe defaults around OpenAI's crawler bot. This comes up a lot in web standards: ideally, the choices made today should remain "safe" years into the future, so website owners don't have to constantly tweak things like their robots.txt (the Robots Exclusion Protocol, REP). Until recently, the main users of REP were archivers like the Wayback Machine, search crawlers, and academic crawlers. But REP works at the user-agent level, not the purpose level. So while you might have rules nudging Wayback away from something, unless you're prepared to block everything by default and then allow specific agents (which means that someone building a new search engine won't list your site, or at least won't list it as usefully), you are "wide open" to novel crawlers. (This isn't made any easier by search engine crawlers that may also be feeding training data for LLMs.)
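To make the two postures concrete, here is a minimal sketch of the two alternative robots.txt files, assuming OpenAI's published GPTBot user-agent token as the "known" crawler and Googlebot as a stand-in for an allowlisted search engine. First, the per-agent opt-out:

```
# Posture 1: opt out of a specific, known crawler.
# This only protects against agents you already know about; anything
# with a novel user-agent token falls through to the open default group.
User-agent: GPTBot
Disallow: /

# Everyone else: empty Disallow means "allow everything".
User-agent: *
Disallow:
```

And second, the default-deny allowlist:

```
# Posture 2: block everything by default. Forward-safe against novel
# crawlers, but a new search engine's bot stays blocked until you add
# it explicitly, so it won't index the site.
User-agent: *
Disallow: /

# Allowlist the specific agents you trust.
User-agent: Googlebot
Allow: /
```

Note that both files still speak only in user-agent tokens: neither can express "archive this, but don't train on it", which is the purpose-level gap described above.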

Given that it took 25 years for REP to be standardised, I wouldn't bank on a change. And I can't see OpenAI voluntarily limiting its bot (say, crawling only sites that explicitly allowed it via REP). So right now I don't think this is a fair balance, even though it makes it *possible* for media owners to control OpenAI's use of their content. LLMs are still highly extractive, and that needs a better approach than a technical opt-out, which in any case likely represents only one form of input for future training sets.

