O is for Omni, but A is for Agent
OpenAI announces GPT-4o, its new real-time, multimodal chat model
Introducing GPT-4o
Earlier today, OpenAI live-streamed their Spring Event and announced the arrival of their latest model - GPT-4o. The o stands for omni, as GPT-4o is a combined text-audio-vision model. As OpenAI state on the GPT-4o product page, it’s their “newest flagship model that provides GPT-4-level intelligence but is much faster and improves on its capabilities across text, voice, and vision.”
However, I don’t think that sentence does justice to what OpenAI announced, so let me summarise GPT-4o’s new features and then dive a little deeper into some of the more significant parts of the announcement. Below are the main GPT-4o features that OpenAI announced:
GPT-4o will be available to free users of ChatGPT (no ChatGPT Plus subscription required). This is the first time OpenAI has made GPT-4-level intelligence available to everyone.
Free users will also now be able to access many of the previously paid-only features of ChatGPT Plus, such as web browsing, data analysis and visualisation, image analysis, file uploads, GPTs and the recently announced memory features.
GPT-4o adds audio understanding to GPT-4, meaning it now natively processes text, vision and audio all in a single model (see the short API sketch just after this list).
GPT-4o has real-time conversational speech, powered by the native audio processing of the new model.
This brings a much more human-like responsiveness to voice interactions, allowing the model to do real-time translation and to pick up on, understand, and express emotions.
There’s a new desktop app that lets you run ChatGPT on your personal computer and allows it to see your screen and understand what you’re working on.
This will also be the app that gives ChatGPT access to more of your information with email, calendar and web browsing integrations. I suspect we will see these features added later this year.
GPT-4o comes with a simplified look and feel for ChatGPT, designed to be friendlier and more conversational.
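To make the “single model” point a little more concrete, here is a minimal sketch of a text-plus-image request using the OpenAI Python SDK. The prompt and image URL are illustrative placeholders, and since audio was demoed in ChatGPT’s voice mode rather than shown via the API, this only covers the text-and-vision part:

```python
# A minimal sketch (not an official example): one request carrying both text
# and an image to a single model, using the OpenAI Python SDK.
# Assumes OPENAI_API_KEY is set; the image URL is a placeholder.
from openai import OpenAI

client = OpenAI()

response = client.chat.completions.create(
    model="gpt-4o",  # the combined text-and-vision model
    messages=[
        {
            "role": "user",
            "content": [
                {"type": "text", "text": "Describe what's happening in this image."},
                {"type": "image_url", "image_url": {"url": "https://example.com/photo.jpg"}},
            ],
        }
    ],
)

print(response.choices[0].message.content)
```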
As I speculated in my newsletter yesterday, I think today’s announcements are laying the foundations for creating a more personalised ‘agent-like’ experience for ChatGPT users. This aligns with Sam Altman’s commitment to rolling out new capabilities more incrementally and helping society come to terms with access to new types of intelligence.
Performance
Let’s get a couple of interesting items out of the way before moving on to the more important things. The first is performance, along with the solution to a mystery!
It turns out that the mysterious ‘gpt2-chatbot’ that suddenly appeared in the LMSYS Chatbot Arena a few weeks ago, and that Sam Altman tweeted about, was GPT-4o all along. More importantly, GPT-4o is significantly better in the Chatbot Arena rankings, with a provisional Elo score of 1310. This is the highest score ever recorded in the Arena and puts it nearly 5% ahead of every other GenAI chatbot model in user ratings.
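For a rough sense of what an Elo lead like that means head-to-head, here’s a minimal sketch of the standard Elo expectation formula. The 1310 figure is from the announcement; the runner-up score used below is a hypothetical placeholder, since the exact gap isn’t quoted here:

```python
# A minimal sketch of the standard Elo expectation formula, not an official
# LMSYS calculation. 1310 is GPT-4o's provisional score; 1250 is a made-up
# placeholder for the second-placed model.
def elo_win_probability(rating_a: float, rating_b: float) -> float:
    """Expected probability that model A is preferred over model B."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400.0))

print(f"{elo_win_probability(1310, 1250):.2f}")  # ~0.58 under these assumed ratings
```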
Personally, I would rate GPT-4o as a ‘GPT-4.5-level’ model. We’ve had GPT-4-level models for a while and none of the newest models from Anthropic, Google or Meta have really moved the game on. I think GPT-4o does move the game on, but we’ll see what the industry consensus is as it’s tested more broadly over the next few weeks.
Real-Time
“Getting to human-level response times and expressiveness turns out to be a big change”
GPT-4o is the first widely available real-time GenAI chat model. This is a big deal. We’ve seen some interesting real-time chat demos from the likes of Groq, but this is the first time this kind of real-time interaction has been available in a mainstream model. And it’s not just text - it’s voice/audio chat, and this makes a huge difference. One of the biggest gripes about both the Humane AI Pin and Rabbit’s R1 devices was the lag in interacting with them. Interacting with GPT-4o is pretty much instant.
As Sam Altman says in his blog post, “Getting to human-level response times and expressiveness turns out to be a big change”. ChatGPT with voice was already very close to the experience we saw in the movie Her back in 2013, but GPT-4o really is that experience now, and it’s incredibly impressive!
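There’s no public real-time audio API to point at in this announcement, but as a loose, text-only analogue of why time-to-first-response matters so much, here’s a minimal streaming sketch with the OpenAI Python SDK (the prompt is a placeholder, and this is not how ChatGPT’s voice mode works internally):

```python
# A minimal sketch: print tokens as they arrive instead of waiting for the
# full reply. This is only a text-streaming analogue of low perceived latency,
# not the native audio pipeline described in the announcement.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

stream = client.chat.completions.create(
    model="gpt-4o",  # placeholder model id
    messages=[{"role": "user", "content": "Give me a one-sentence fun fact."}],
    stream=True,  # yield chunks as they are generated
)

for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
print()
```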
Emotions
There was a great demo in the live-stream where a Twitter user asked if GPT-4o could tell what a user is feeling just by looking at their face. The answer is yes, it can!
GPT-4o is able to infer emotions from your voice and your face (via video or image). It’s also able to express different emotions using its voice and can even sing and whisper 🤯
I can’t overstate how much of a game changer I think this will be for our interactions with technology. This is the first time that a piece of technology can understand our emotions and express an emotional response in return. I think this is what’s been missing from the voice interfaces we’ve had for over a decade now, and over time, emotional interactions will make voice interfaces the preferred way of interacting with technology for most people.
Assistants
Not a huge amount of time was given to ChatGPT’s new desktop app in either the live-stream or on the new product pages, but I think this is another big deal. The desktop app not only gives you access to all the GPT-4o features I’ve already covered, but it can also see what you’re doing on your desktop in real time. This means GPT-4o in the desktop app can help you with many more tasks, including things like meetings, which it can actively participate in and summarise for you.
As I’ve previously mentioned, I think ChatGPT’s desktop app will be the beachhead for a much more personalised experience beyond the recently announced memory features. Sam Altman alludes to this in his blog post when he writes, “As we add (optional) personalization, access to your information, the ability to take actions on your behalf, and more, I can really see an exciting future where we are able to use computers to do much more than ever before.”
This is Sam Altman saying that OpenAI are going to add more integrations to the desktop app, giving ChatGPT access to more (optional) personal information, and that this will be an important foundation for ChatGPT becoming a personal AI assistant.
Conclusions
As with OpenAI’s DevDay last year, there’s a lot to digest from today’s Spring Event. Over the coming days and weeks, as more people get access to GPT-4o and the new desktop app, I think we’ll see a lot of praise for the work OpenAI has done with their newest model.
On the surface, there are some fantastic new features and capabilities, but I think we’ll look back in a few years’ time (or maybe months 🤓) and see that the most significant aspects of today’s announcements are GPT-4o’s ability to conduct real-time conversations, its emotional understanding and expression, and the desktop app, which lays the foundations for ChatGPT to become a truly personal AI assistant.
Links
Below are all the links and information that OpenAI has shared following their Spring Event in case you want to go deeper on everything they announced:
The OpenAI Spring Event live-stream is here
Sam Altman’s tweets of the live-stream are here
The demo videos for GPT-4o are here
The GPT-4o product page is here
The OpenAI blog post about GPT-4o is here
Sam Altman’s blog post about GPT-4o is here
“The future is already here, it’s just not evenly distributed.”
William Gibson