A week in Generative AI: Gemini Live, Runway Turbo & Flux
News for the week ending 18th August 2024
The big news this week was the launch of Gemini Live at Google’s #MadeByGoogle ‘24 event. Gemini Live is Google’s response to GPT-4o’s Advanced Voice Mode and is another example of the voice capabilities that the next generation of voice assistants will have. We also saw the launch of Runway’s Gen-3 Turbo model, which brings us much closer to real-time video generation. Meanwhile Flux, a new text-to-image model, is getting a lot of praise for its high-quality images and for the ability to run a version of it locally on a well-equipped computer.
In ethics news, there’s an article about how X’s new image-generating model has no guardrails or safeguards, allowing it to create images of almost anything. OpenAI also announced that they have shut down accounts linked to an Iranian influence operation that was generating content about the US presidential election.
Gemini Live, Google’s answer to ChatGPT’s Advanced Voice Mode, launches
The big news of the week was the set of announcements Google made at their #MadeByGoogle ‘24 event, where they launched the new Pixel 9. As the Verge wrote, AI overshadowed Pixel at the Pixel event, and loads of new AI features were shared:
Pixel Screenshots lets users capture their screens so the captured information is searchable later
The on-device Gemini assistant will be much faster thanks to Gemini 1.5 Flash
You can ask Gemini about what’s on your screen at any given time
There are also lots of other little features, like call transcription, AI-generated weather summaries and a range of AI photo tools
However, the bigger announcement was Google’s Gemini Live, their answer to OpenAI’s GPT-4o Advanced Voice Mode. There is some great coverage of Gemini Live from Joanna Stern at the Wall Street Journal and Maxwell Zeff at TechCrunch. They both describe Gemini Live as letting them talk WITH their phone, not TO it, which I think is a great summary of how we should think about the next generation of generative AI powered voice assistants.
The next generation of generative AI powered voice assistants will allow us to talk WITH our phones, not just TO them.
The other interesting thing about Gemini Live is that it runs entirely in the cloud, which is very different from Apple’s approach with Apple Intelligence, which runs on device first and foremost and then uses the cloud for more complex use cases. This means that (currently) Gemini Live can’t really perform tasks on the Pixel phone, like setting alarms or timers. Google say they’re working on ways for Gemini Live to control phone functions, but there’s no news on when that might arrive.
Runway’s Gen-3 Alpha Turbo is here and can make AI videos faster than you can type
Runway debuted its third-generation video model last month, but this week showed off Gen-3’s Turbo variant, which it claims is seven times faster and half the cost. It’s the speed of the model that is the really big thing here, with Runway’s CEO claiming “it now takes me longer to type a sentence than to generate a video.”
Lag has been a big issue when generating video, and Runway seems to have largely solved it. We’re moving very close to the point of real-time video generation, similar to the real-time image generation we started seeing last year with Stability AI’s SDXL Turbo and Leonardo.ai’s Realtime Canvas.
Forget Midjourney — Flux is the new king of AI image generation
Flux, a text-to-image model created by a new startup called Black Forest Labs, has been getting a lot of attention since its launch a few weeks ago. Black Forest Labs was founded just a few months ago by engineers from Stability AI, and their first Flux model is available in three versions.
The Pro version (the largest, most capable model) is available via API; the Dev version (a medium-sized model) is an open-weights model that can be used for non-commercial applications; and the Schnell version (the smallest, fastest model) is small enough to be downloaded and run on a well-equipped local computer for personal use.
Lots of commentators have been testing the Pro version of Flux and have been very impressed with the quality, which in some cases exceeds that of Midjourney v6.1, which landed in July. The fact that versions of the model are open-weights and can run locally is also a big selling point. Black Forest Labs say they’re now working on an open-source text-to-video model, branding it “State-of-the-Art Text to Video for all.”
Purdue's UniT gives robots a more human-like sense of touch
This is a little bit left-field and a little bit technical, but I’ve always been fascinated by how we will give robots a sense of touch. I think this will be a really important feature for robots to gain mass adoption in the real world and to operate regularly in and amongst the human population. It will also give GenAI models a HUGE new dataset to be trained on, which is one of the reasons I don’t subscribe to the idea that we’re running out of data to train bigger and more sophisticated GenAI models.
AI Ethics News
ChatGPT unexpectedly began speaking in a user’s cloned voice during testing
OpenAI shuts down election influence operation using ChatGPT
Research AI model unexpectedly modified its own code to extend runtime
Long Reads
One Useful Thing - Change blindness
Stratechery - Integration and Android
MIT News - LLMs develop their own understanding of reality as their language abilities improve
NYMag - The Future Will Be Brief
“The future is already here, it’s just not evenly distributed.”
William Gibson