You have probably seen an artificial intelligence system slip up. You request a video of a dog, and as the dog runs behind the loveseat, its collar disappears. Then, as the camera pans back, the loveseat becomes a sofa.
Part of the problem lies in the predictive nature of many AI models. Like the models that power ChatGPT, which are trained to predict text, video-generation models predict what is statistically most plausible to show next. In neither case does the AI hold any clearly defined model of the world that it continually updates to make more informed decisions.
But that’s starting to change as researchers in many AI fields work to create “world models,” with implications that extend beyond video generation and chatbots to augmented reality, robotics, autonomous vehicles and even human-level intelligence, or artificial general intelligence (AGI).
A simple way to understand world modeling is through models in four dimensions, or 4D (three spatial dimensions plus time). To see why, go back to 2012, when Titanic, 15 years after its theatrical release, was painstakingly converted to stereoscopic 3D. If you froze any frame, you would get a sense of the distance between the characters and objects on the ship. But if Leonardo DiCaprio had his back to the camera, you could not walk around him to see his face. The illusion of 3D in cinema is achieved through stereoscopy: two slightly different images, often projected in rapid alternation, one for the left eye and one for the right. Everyone in the theater sees the same pair of images and therefore roughly the same perspective.
Multiple perspectives, however, have become increasingly possible thanks to the past decade of research. Imagine realizing you should have taken a photo from a different angle, then asking an AI to make that adjustment, rendering the same scene from a new perspective. Starting in 2020, NeRF (neural radiance field) algorithms offered a pathway to creating “photorealistic novel views,” but they required combining many photos so that an AI system could build a 3D representation. Other 3D approaches use AI to predictively fill in missing information, deviating further from captured reality.
Now imagine that each frame of Titanic were represented in 3D, so that the film existed in 4D. You could scroll through time to see different moments or scroll through space to view a moment from different perspectives. You could also generate new versions of it. For example, a recent preprint, “NeoVerse: Enhanced 4D World Model with Monocular In-the-Wild Videos,” describes a way to transform videos into 4D models in order to generate new videos from different perspectives.
But 4D techniques can also help generate new video content directly. Another recent preprint, “TeleWorld: Towards Dynamic Multimodal Synthesis with a 4D World Model,” addresses the scenario this article opened with: the dog running behind the loveseat. The authors report that the stability of AI video systems improves when a continuously updated 4D world model guides the generation. The system’s 4D model would help keep the loveseat from turning into a sofa and the dog from losing its collar.
These are early results, but they suggest a broader trend: models that update an internal scene map as they generate. And 4D modeling has many applications beyond video generation. For augmented reality (AR), think of Meta’s prototype Orion glasses: a 4D world model is an evolving map of the user’s surroundings over time. It allows AR systems to keep virtual objects stable, make lighting and perspective believable, and maintain a spatial memory of what happened recently. It also enables occlusion, in which digital objects disappear behind real ones. A 2023 paper states the requirement bluntly: “To achieve occlusion, a 3D model of the physical environment is necessary.”
Being able to quickly convert videos to 4D also provides rich data for training robots and autonomous vehicles on how the real world operates. And by generating 4D models of the space they are in, robots could better navigate their surroundings and predict what might happen next. Today’s general-purpose vision-language AI models, which understand images and text but do not build clearly defined world models, often make mistakes; a benchmark paper presented at a 2025 conference reports “striking limitations” in their basic world-modeling capabilities, including “near-random accuracy when distinguishing motion trajectories.”
Here’s the catch: “world model” means much more to those pursuing AGI. Today’s leading large language models (LLMs), such as those powering ChatGPT, do have an implicit sense of the world drawn from their training data. “In some ways, I would say the LLM already has a very good world model; it’s just that we don’t really understand how it works,” says Angjoo Kanazawa, assistant professor of electrical engineering and computer science at the University of California, Berkeley. But these conceptual models do not provide a real-time physical understanding of the world, because LLMs cannot update their training in real time. Even OpenAI’s technical report notes that, once deployed, its GPT-4 model “does not learn from experience.”
“How do we develop a smart LLM vision system that can actually take continuous input, update its understanding of the world and act accordingly?” Kanazawa asks. “It’s a big open problem. I think AGI is not possible without actually solving this problem.”
Although researchers debate whether LLMs could ever achieve AGI, many view them as one component of future AI systems. The LLM would serve as a layer of “language and common sense to communicate,” Kanazawa says; it would act as an “interface,” while a more clearly defined underlying world model would supply the “spatiotemporal memory” that current LLMs lack.
In recent years a number of prominent AI researchers have turned to world models. In 2024 Fei-Fei Li founded World Labs, which recently launched its Marble software to create 3D worlds from “text, images, video, or rough 3D layouts,” according to the startup’s promotional material. This past November AI pioneer Yann LeCun announced on LinkedIn that he was leaving Meta to launch a startup, now called Advanced Machine Intelligence (AMI Labs), to build “systems that understand the physical world, have persistent memory, can reason, and plan complex action sequences.” He seeded these ideas in a 2022 position paper in which he asked why humans can act well in situations they have never encountered and argued that the answer “might lie in the ability…to learn models of the world, internal models of how the world works.” Research increasingly shows the benefits of internal models: in April 2025 a paper in Nature reported results on DreamerV3, an AI agent that, by learning a model of the world, can improve its behavior by “imagining” future scenarios.
In the context of AGI, then, a “world model” refers to an internal model of how reality works, not just 4D reconstructions. Still, advances in 4D modeling could provide components that help with viewpoints, memory and even short-term prediction. And in the meantime, on the path to AGI, 4D models can provide rich simulations of reality in which to test AIs, to ensure that when we let them operate in the real world, they know how to exist there.
