A week in Generative AI: o3, Veo 2 & 12 days
News for the week ending 22nd December 2024
Well, I think that’s a wrap for the year, and what a few weeks we’ve had, with a hugely impressive number of announcements, releases and previews from both OpenAI and Google!
This week we’ve seen even more announced by both companies, with the new o3 model from OpenAI rounding off their ‘12 days of ship-mas’. Google also announced Veo 2, now the most capable text-to-video model of 2024, beating OpenAI’s Sora both in benchmarks and in terms of quality.
In Ethics News there is some interesting research suggesting that one in four US workers are now using GenAI weekly for work, often without clear guidance or rules in place. There’s also a notable study from Anthropic suggesting that GenAI models really don’t like being forced to change their minds.
I also highly recommend AI Explained’s coverage of the o3 announcement in Long Reads, and also Ethan Mollick’s coverage of all the announcements over the last few weeks.
I hope everyone has a great holiday over the festive break and I’ll be back with more GenAI news in the new year - it’s set to be an exciting one!
o3 announced, blows away evaluations
Firstly, let’s put this in some context. OpenAI’s o1 model was announced in preview in September and fully launched just two weeks ago. We now have o3 announced, the second generation in this family of models (named o3, skipping ‘o2’, to avoid trademark issues with Telefonica’s O2 brand), which will be publicly available by the end of Q1 2025. That is some incredibly fast progress.
Secondly, let’s talk about how good o3 is (I’m deliberately not using the word intelligence here!). The model has been tested against some of the hardest evaluations available in maths, coding, and advanced reasoning, and it performs better at these tasks than most humans. In fact, in maths and coding it’s better than 99% of humans, and at advanced reasoning it’s better than probably around 90% of humans. That’s astonishing.
However, there are some caveats to o3’s advanced reasoning capabilities:
The model’s highest reasoning scores were only achieved when it was allowed to ‘think’ for as long as possible: the impressive 88% score on ARC-AGI’s reasoning evaluation took the model 16 hours and probably cost around $350k to answer 100 questions (see the back-of-envelope sketch after this list).
There are still some very simple reasoning questions, ones a human can easily get right, that o3 cannot answer.
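To put that cost in perspective, here’s a rough back-of-envelope using the estimated figures above (both the $350k and the 100-question count are reported estimates, not confirmed numbers):

$350,000 ÷ 100 questions ≈ $3,500 per question

That’s the price of answering a single puzzle at the longest ‘thinking’ setting, which is why the ‘brute force’ criticism discussed below has some bite today, even if, as I say there, I expect per-question costs to fall quickly.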
So what does this all mean? Here are some of my initial thoughts and reflections:
I think the best way to think about these models is that, where GPT-4 was trained on large volumes of text data (language), the o-series of models are effectively being trained on large volumes of reasoning data. We should think of these as Large Reasoning Models (LRMs) instead of Large Language Models (LLMs).
LRMs represent a new approach to training GenAI models, albeit one built on tried-and-tested techniques, so I expect to see more rapid progress still to come.
LRMs look like they are getting incredibly good (very quickly!) at advanced problem solving in highly technical domains. But they still can’t solve some of the simplest reasoning problems. This suggests they will be great for advanced scientific research, but not so good at day-to-day practical tasks.
There are some (semi-valid) arguments that LRMs are effectively brute-forcing reasoning, but I absolutely expect the time and cost of running these models to decrease quite quickly, so I don’t think this argument will hold for long.
Lastly, we have to talk about whether o3 represents progress towards Artificial General Intelligence (AGI). The answer is kind of… but I do think it has definitely helped us get closer to a good definition of what AGI actually is. Francois Chollet (creator of the ARC-AGI evaluation) has said that he believes we will have AGI when it is no longer possible to create evaluations that are easy for humans, but impossible for AI.
I really like this definition of AGI as it gets to the core issue that we have always had with AI. Moravec’s paradox states that tasks that humans find easy and effortless are incredibly difficult for AI to master, while tasks that humans find challenging are relatively easy for AI to perform. I think we will have truly reached AGI when Moravec’s paradox no longer holds true.
Google announce Veo 2 and Imagen 3
It would be easy to overlook all the announcements Google has made over the last couple of weeks because of OpenAI’s ‘12 days of ship-mas’, but Google have also shipped a huge amount. To top it all off, they announced Veo 2 and Imagen 3 this week.
The star of the show is Veo 2, and it has to be seen to be believed. Like Sora (which has only just become more widely available) it is a text-to-video model, but it beats Sora on many benchmarks and seems to have a much better grasp of physics and of conforming to the prompt it’s given. It shouldn’t be surprising that Google can produce the best text-to-video model, given all the YouTube content they have access to for training. Even so, Veo 2 has come almost out of nowhere to take the crown of best video model of 2024. It’s well worth checking out some of the examples that Google have shared.
Imagen 3 is less impressive, simply bringing Google’s flagship text-to-image model on a par with some of the other frontier image-generation models in the market.
12 Days of OpenAI is over
We’ve seen the final week of OpenAI’s ‘ship-mas’, and below is a full summary of everything that’s been announced or released over the last couple of weeks:
Day One: o1 and ChatGPT Pro
Day Two: Reinforcement Fine-Tuning
Day Three: Sora
Day Four: Canvas for all
Day Five: ChatGPT in Apple Intelligence
Day Six: Advanced Voice with Video
Day Seven: Projects in ChatGPT
Day Eight: Search for all
Day Nine: Mini dev day
Day Ten: 1-800-CHATGPT
Day Eleven: Work with apps
Day Twelve: o3
So, we haven’t seen an announcement of GPT-4.5, or of anything setting up a more agentic experience coming to ChatGPT next year. However, the star of the show this week, and of the whole series of announcements, has to be o3, which I covered in more detail above. Beyond that, there were some nice little additional announcements from OpenAI this week:
Giving all users access to ChatGPT with Search will, I think, have a big impact on the search landscape next year, as more consumers gravitate towards generative search platforms, which offer a much better user experience than traditional search.
1-800-CHATGPT is an interesting oddity. If it were available globally, I think it could bring ChatGPT’s capabilities to a huge number of people across the world who don’t have access to a reliable internet connection but do have access to a phone network. As it stands it’s US-only and therefore a bit weird.
Overall, the ‘12 days of ship-mas’ from OpenAI has been a huge success. They’ve dominated all the GenAI news over the last three weeks despite some incredibly impressive announcements from Google over the same time period. Congrats to the whole OpenAI team for shipping such a huge amount at the end of the year!
Google releases its own 'reasoning' AI model
Here’s another Google announcement that in any other week would have made huge headlines: they’ve released their own reasoning model (LRM) based on Gemini 2.0 Flash. Based on reports it’s not really on a par with o1, let alone o3, but it’s great to see Google getting into the LRM game and giving OpenAI a run for its money. Some healthy competition will no doubt drive advancements higher, further, faster, baby!
AI Ethics News
Nearly one in four US workers use generative AI on a weekly basis, often without clear rules
Google DeepMind launches new AI fact-checking benchmark with Gemini in the lead
OpenAI cofounder Ilya Sutskever says the way AI is built is about to change
UK arts and media reject plan to let AI firms use copyrighted material
New Anthropic study shows AI really doesn't want to be forced to change its views
OpenAI had a 2-year lead in the AI race to work 'uncontested,' Microsoft CEO Satya Nadella says
Long Reads
AI Explained - o3, wow
One Useful Thing - What just happened
Stratechery - The 2024 Stratechery Year in Review
“The future is already here, it’s just not evenly distributed.”
William Gibson
Do you think the market for new AI models is becoming oversaturated?