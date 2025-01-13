If you’re not familiar with the concept of “world models” just yet, a storm of activity at the start of 2025 gives every indication it may soon become a well-known term.

Jensen Huang, CEO of Nvidia, used his keynote presentation at CES to announce a new platform, Cosmos, for what the company is calling “world foundation models.” Cosmos is a generative AI tool that produces virtual-world-like videos. The next day, Google’s DeepMind revealed similar ambitions with a project led by a former OpenAI engineer. This all comes several months after an intriguing startup with the same goal, World Labs, achieved unicorn status (a valuation of $1 billion or more) within only four months.

To understand what world models are, it’s worth pointing out that we’re at an inflection point in the way we build and deploy intelligent machines like drones, robots, and autonomous vehicles. Rather than explicitly programming behavior, engineers are turning to 3D computer simulation and AI to let the machines teach themselves. This means physically accurate virtual worlds are becoming an essential source of training data to teach machines to perceive, understand, and navigate three-dimensional space.

What large language models are to systems like ChatGPT, world models are to the virtual world simulators needed to train robots. In other words, world models are a type of generative AI tool capable of producing 3D environments and simulating virtual worlds. Just as ChatGPT is built around an intuitive chat interface, world-model interfaces might allow more people, even those without technical game-development skills, to build 3D virtual worlds. They could also help robots better understand and navigate their surroundings and plan within them.

To be clear, most early world models, including those announced by Nvidia, generate spatial training data in a video format. There are, however, already models capable of producing fully immersive scenes. One tool, made by a startup called Odyssey, uses Gaussian splatting to create scenes that can be loaded into 3D software tools like Unreal Engine and Blender. Another startup, Decart, demoed its world model as a playable version of a game similar to Minecraft. DeepMind has similarly gone the video game route.

All this reflects the potential for changes in the way computer graphics work at a foundational level. In 2023, Huang predicted that in the future, “every single pixel will be generated, not rendered but generated.” He’s recently taken a more nuanced view by saying that traditional rendering systems aren’t likely to fully disappear. It’s clear, however, that generative AI predicting which pixels to show may soon encroach on the work that game engines do today.

The implications for robotics are potentially huge.

Nvidia is now working hard to establish “physical AI” as the label for the intelligent systems that will power warehouse autonomous mobile robots (AMRs), inventory drones, humanoid robots, autonomous vehicles, farmer-less tractors, delivery robots, and more. For these systems to perform their work effectively in the real world, especially in environments with humans, they must train in physically accurate simulations. World models could potentially produce synthetic training scenarios of any variety imaginable.