A week in Generative AI: Autonomy, Audio & Search
News for the week ending 23rd March 2025
This was a week where, on the face of it, there were lots of small updates from Anthropic, OpenAI, and Google, so I’ve focused on a roundup of the little updates in the world of generative AI.
However, before we get into that, we have a great piece of research showing that the length of task frontier models can perform autonomously is doubling every 7 months, which is worth keeping an eye on.
The biggest of the small updates this week came from OpenAI, who released next-generation audio models for developers. I think they’re a big deal, and you can have a play around with them yourself over at OpenAI.fm - it’s a lot of fun! This week we also saw Anthropic give Claude search capabilities, and Google added Mindmaps to NotebookLM. Pika Labs also showed off a great new video editing feature that lets people manipulate individual objects and characters in a video.
There’s lots to cover in Ethics news too, with a Google model helping crack a 10-year-old superbug problem in just 2 days, hundreds of celebrities warning against the copyright petition from OpenAI and Google that I reported on last week, and Cloudflare luring web-scraping bots into an ‘AI Labyrinth’.
There are also some fantastic Long Reads - I highly recommend the one from Ethan Mollick (always!), which reports on research he and his colleagues ran with Procter & Gamble showing the benefits of using a generative model as a teammate.
Agent autonomy time is doubling every 7 months
I don’t normally share research in my newsletters as it can get quite technical, but this one is a doozy - a paper on how AI agents have been improving at performing longer and longer tasks autonomously over the last 5 years.
The chart above shows that the most capable model we have right now for performing autonomous tasks is Claude 3.7 Sonnet, across the domains of software engineering, cybersecurity, general reasoning, and machine learning. Claude 3.7 Sonnet has a 50%+ success rate (they also looked at an 80%+ success rate and the trends are the same) on tasks of around 50 minutes, and the chart compares it to other frontier models dating back to 2020.
The trend line shows that the length of time models are able to perform autonomous tasks in these domains is doubling every 7 months, which means they will be able to complete day-long tasks autonomously in 2028 and month-long tasks by 2030 🤯.
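To make that extrapolation concrete, here’s a rough back-of-envelope sketch in Python. The ~50-minute baseline and the 7-month doubling time come from the paper as described above; the conversion to 8-hour working days is my own assumption, purely for illustration.

```python
# Back-of-envelope extrapolation: autonomous task length (at a 50%
# success rate) doubles roughly every 7 months, per the paper.
BASELINE_MINUTES = 50   # Claude 3.7 Sonnet, early 2025
DOUBLING_MONTHS = 7

def projected_task_minutes(months_from_now: float) -> float:
    """Projected autonomous task length after a given number of months."""
    return BASELINE_MINUTES * 2 ** (months_from_now / DOUBLING_MONTHS)

for year, months in [("2026", 12), ("2028", 36), ("2030", 60)]:
    hours = projected_task_minutes(months) / 60
    working_days = hours / 8  # assumption: an 8-hour working day
    print(f"{year}: ~{hours:,.0f} hours (~{working_days:,.1f} working days)")
```

On this curve you get roughly 29 hours by 2028 (about a day of wall-clock time) and around 40 working days - a couple of working months - by 2030, which is where those headline claims come from.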
There’s a great write-up of this in a bit more detail by Azeem Azhar here. It seems it’s relatively easy to scale models for autonomy at a 50% or 80% success rate, but getting to 99%+ is going to take much longer and seems to be the biggest barrier to AI agents being more generally useful.
OpenAI introduces next-generation audio models
On the face of it, this is a relatively small release from OpenAI, purely focused on developers. However, once you’ve had a look at it, had a bit of a play, and thought about it, I think it’s really significant.
There are the obvious advancements in the quality of the voices now available to developers through the API, but the big thing for me is the separation of instructions for the voice, which lets you tell the model to speak in a specific way and customise how the voice sounds. For example, you can customise the identity, affect, emotion, pauses, pronunciation, punctuation, delivery, phrasing, tone - anything you can think of. These instructions are just a natural-language prompt, a little bit like how you build a custom GPT.
There’s a really cool demo site for these new capabilities at OpenAI.fm where you can play around with the different voices and vibes (which are essentially pre-canned instructions). It’s definitely worth checking out and having a play with to get an idea of what’s now possible with these new audio models.
One of my favourites I made was this one, where I set the Ballad voice, which is British, to use a Robot vibe. These are the instructions the voice had (there’s a quick API sketch after the list):
Identity: A robot
Affect: Monotone, mechanical, and neutral, reflecting the robotic nature of the customer service agent.
Tone: Efficient, direct, and formal, with a focus on delivering information clearly and without emotion.
Emotion: Neutral and impersonal, with no emotional inflection, as the robot voice is focused purely on functionality.
Pauses: Brief and purposeful, allowing for processing and separating key pieces of information, such as confirming the return and refund details.
Pronunciation: Clear, precise, and consistent, with each word spoken distinctly to ensure the customer can easily follow the automated process.
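For developers, here’s a minimal sketch of how you might pass instructions like these through the API using the OpenAI Python SDK. I’m assuming the gpt-4o-mini-tts model name and the instructions parameter from OpenAI’s announcement, and the sample input text is my own - check OpenAI’s docs for the current details.

```python
# Minimal sketch: steerable text-to-speech with the new audio models.
# Assumes the `gpt-4o-mini-tts` model and its `instructions` parameter
# from OpenAI's announcement; needs OPENAI_API_KEY set in the environment.
from openai import OpenAI

client = OpenAI()

robot_instructions = """\
Identity: A robot.
Affect: Monotone, mechanical, and neutral.
Tone: Efficient, direct, and formal, delivered without emotion.
Pauses: Brief and purposeful, separating key pieces of information.
Pronunciation: Clear, precise, and consistent.
"""

# Stream the synthesised speech straight to an MP3 file.
with client.audio.speech.with_streaming_response.create(
    model="gpt-4o-mini-tts",
    voice="ballad",  # the British voice from my demo above
    instructions=robot_instructions,  # the 'vibe', as a natural-language prompt
    input="Your return has been processed and the refund is on its way.",
) as response:
    response.stream_to_file("robot.mp3")
```

What I like about this design is that the voice, the script, and the delivery instructions are all independent, so you can reuse one ‘vibe’ across many voices and scripts.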
It’s very impressive how much you can control each of these voices with the instructions, and I think we’ll see some really cool products using these new capabilities later this year. I also think this is OpenAI laying the groundwork for more human-like voice interactions with the sophisticated AI agents they’ll be launching later this year. Exciting to see how this plays out!
Anthropic adds web search to its Claude chatbot
Ahhhhh, search in Claude, where have you been all my life?! I’ve made no secret of the fact that Claude is my go-to model for most things I use generative AI for, so it’s great to see search added to bring it closer to feature parity with ChatGPT.
Unfortunately this is only available as a preview feature in the US for now, so I haven’t been able to try it out, but the video demo looks great. They appear to be using Brave to power their web search, which is a nice privacy-first solution.
Looking forward to trying this out when I can get my hands on it!
Google shows off Mindmaps in NotebookLM
I know there are a lot of fans of NotebookLM out there - for me, it’s probably the best user interface we have for large language models, as it’s so much more feature-rich and useful for its specific use cases than a plain chat interface. So it’s great to see Google doubling down on this and adding the ability to create Mindmaps, letting users visually explore the knowledge base they have in the platform.
Pika shows off advanced video editing
I haven’t posted much about text-to-video models for a while, mostly because they all seem to be converging on the same problems and the same solutions, and I haven’t seen anything truly innovative from them recently. However, this upcoming feature from Pika did get my attention: not only does it look great, but I’m sure it’s going to be genuinely useful for people working with video models.
It’s very simple - once you’ve generated a video, you can ask Pika to manipulate any character or object in it, while keeping the rest of the video perfectly intact. Impressive!
Nvidia debuts GR00T N1
Nvidia held its biggest conference of the year, GTC, this week in San Jose and made a host of new announcements, from new chips and new personal AI supercomputers to, most importantly, robots!
The big robotics announcement was GR00T N1, a family of open ‘generalist’ humanoid robot models. They’ve been trained on both synthetic and real data and have what Nvidia calls a ‘dual-system architecture’, which allows them to think fast (movements) and slow (planning and reasoning).
They also showed off these cute little BDX Star Wars droids which Disney has built for their parks:
Atlas can walk, crawl & run
Just cool to see. Boston Dynamics taught Atlas these moves in simulation before transferring them to the real world. This is the future of robotics, and it’s arriving very quickly now!
AI Ethics News
Google's AI 'co-scientist' cracked 10-year superbug problem in just 2 days
People are using Google’s new AI model to remove watermarks from images
Hundreds of celebrities warn against letting OpenAI and Google ‘freely exploit’ Hollywood
Microsoft is exploring a way to credit contributors to AI training data
Cloudflare is luring web-scraping bots into an ‘AI Labyrinth’
ChatGPT hit with privacy complaint over defamatory hallucinations
Academics accuse AI startups of co-opting peer review for publicity
Long Reads
Ethan Mollick - The Cybernetic Teammate
Stratechery - An Interview with OpenAI CEO Sam Altman About Building a Consumer Tech Company
Stephanie Zhan - Dreaming of a daily life with super intelligent AI
Simon Willison - Not all AI-assisted programming is vibe coding (but vibe coding rocks)
Anthropic - Controlling Powerful AI
“The future is already here, it’s just not evenly distributed.”
William Gibson