World models will make rapid breakthroughs this year! Autonomous driving may reach its commercialization inflection point

“Driven by the joint momentum of a unified architecture, data ecosystem, and compute power support, world models will see rapid breakthroughs this year!”

Zhu Jun, founder of Shengshu Technology and deputy director of the Institute for Artificial Intelligence at Tsinghua University, made this point on March 29 at “AI Future Forum: Leapfrogging · Investing · Coexistence,” a special forum of the 2026 Zhongguancun Forum Annual Conference.

How to Build

Meanwhile, the definition of the world model is being stretched and blurred. “The definition of ‘world model’ needs to be further clarified,” Zhu Jun said, noting that much current research is incomplete. For example, some methods for generating interactive video are essentially limited to reconstructing digital space: they support only one-way interaction between humans and the system, and cannot learn or execute actions in real environments.

Wu Wei, founder of Manifold Space, divides world models into two categories: those for the digital world, which mainly build more real-time interactive interfaces, and those for the physical world, which serve as a predictive machine brain. “The capabilities that support the two types of world models are not the same. In the digital world you need to cater more to creators’ preferences, whereas in the physical world you need to faithfully reproduce real physics and robotic operations.”

Take autonomous driving and embodied intelligence as examples: autonomous driving closes its data loop by collecting real-vehicle data, while robots face a data cold-start problem. Wu Wei observed that many companies deploy robots with an approach borrowed from autonomous driving, teleoperating them in real environments to collect data. Although the data quality is very high, model performance then grows only in step with parameter scale or the amount of compute invested. “For world-model training, pretraining on first-person-perspective data can solve this problem.”
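
To make that idea concrete, here is a minimal sketch of pretraining a toy world model on first-person video with a next-frame prediction objective. The dataset class, model, and objective are all illustrative assumptions; the speakers described the approach only at a high level.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader, Dataset

# Hypothetical dataset of first-person (egocentric) video clips,
# each clip a tensor of shape (T, C, H, W).
class EgoClipDataset(Dataset):
    def __init__(self, clips):
        self.clips = clips

    def __len__(self):
        return len(self.clips)

    def __getitem__(self, idx):
        return self.clips[idx]

class TinyWorldModel(nn.Module):
    """Toy next-frame predictor: encode a frame, decode the next one."""
    def __init__(self, channels=3, hidden=64):
        super().__init__()
        self.encoder = nn.Conv2d(channels, hidden, 3, padding=1)
        self.decoder = nn.Conv2d(hidden, channels, 3, padding=1)

    def forward(self, frame):
        return self.decoder(torch.relu(self.encoder(frame)))

def pretrain(model, loader, epochs=1, lr=1e-4):
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    for _ in range(epochs):
        for clip in loader:                       # clip: (B, T, C, H, W)
            inputs, targets = clip[:, :-1], clip[:, 1:]
            pred = model(inputs.flatten(0, 1))    # predict frame t+1 from frame t
            loss = loss_fn(pred, targets.flatten(0, 1))
            opt.zero_grad()
            loss.backward()
            opt.step()

# Random stand-in data; real pretraining would stream large volumes of
# egocentric video rather than costly teleoperation trajectories.
clips = torch.randn(8, 16, 3, 64, 64)
pretrain(TinyWorldModel(), DataLoader(EgoClipDataset(clips), batch_size=4))
```

The point of the sketch is the data source, not the architecture: the supervision signal comes from abundant first-person video instead of per-robot teleoperation.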

Drawing on his company’s experience, Xu Huazhe, founder of PoPoKa Robotics and an assistant professor at Tsinghua University’s Institute for Interdisciplinary Information Sciences, pointed out that data collected in 100 households does not generalize to 10,000 households. Robot pretraining needs first-person video to deliver genuine, meaningful generalization. Specifically: first define what the robot should and should not do, then work backward to iterate on hardware, motion control, and so on. For example, PoPoKa’s robot hands cannot reach 21 degrees of freedom, but they can generalize to do 10 things well and then wait for hardware upgrades.

Zhu Jun proposed a “unified world model framework” that theoretically unifies cross-modal generation and action tasks. This unification is not a patchwork at the engineering level, but a unification at the structural level. Viewed more broadly, both the digital world and the physical world will ultimately be populated by intelligent agents of different forms. Agents in the physical world have “bodies,” and world models are their core “intelligent control center.”

Building a general world model means returning to the first principles of large models: an extensible architecture, large-scale data, and sufficient compute. Zhu Jun believes world models should adopt a unified architecture, whereas current mainstream methods are often modular and fragmented: some focus on fitting action trajectories, some on prediction, and some directly learn control policies.

Technological Breakthroughs

When discussing the technological possibilities of world models, Zhang Mingxing, an associate professor at Tsinghua University, said that many world-model routes build on language-model capabilities and then transfer them to more modalities. Whether language is sufficient to model the physical world, however, or whether another kind of latent-space representation is needed, is still theoretically contested. In addition, it remains unresolved whether capabilities such as “physical telemetry” or a “first-person view” can be achieved through data training alone or only through the physical space itself; the physical-space modality and its implementation still await a breakthrough.

Specifically, Wu Wei said world models should focus on two major technological breakthroughs in 2026: first, real-time manipulation and interaction capability; second, post-training of the world model. “Especially reinforcement learning and online learning,” Xu Huazhe elaborated: extending reinforcement learning to one hundred, one thousand, or ten thousand robots, reaching human-like speed without sacrificing success rate, and enabling embodied intelligence to learn unfamiliar tasks quickly online even after deployment.
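
As a rough illustration of what fleet-scale reinforcement-learning post-training could look like, the sketch below runs a toy policy-gradient loop over a “fleet” of stand-in environments. The environment, reward, and fleet size are placeholders, not anything the speakers specified.

```python
import torch
import torch.nn as nn

class ToyEnv:
    """Stand-in for one robot's task environment (4-dim state, 2 actions)."""
    def reset(self):
        self.t = 0
        return torch.randn(4)

    def step(self, action):
        self.t += 1
        reward = 1.0 if action == 0 else 0.0      # dummy reward
        return torch.randn(4), reward, self.t >= 10

policy = nn.Sequential(nn.Linear(4, 32), nn.Tanh(), nn.Linear(32, 2))
opt = torch.optim.Adam(policy.parameters(), lr=1e-3)
fleet = [ToyEnv() for _ in range(100)]            # "one hundred robots"

for _ in range(10):                               # post-training updates
    log_probs, returns = [], []
    for env in fleet:                             # in practice: async workers
        state, done, total = env.reset(), False, 0.0
        ep_log_probs = []
        while not done:
            dist = torch.distributions.Categorical(logits=policy(state))
            action = dist.sample()
            ep_log_probs.append(dist.log_prob(action))
            state, reward, done = env.step(action.item())
            total += reward
        log_probs.append(torch.stack(ep_log_probs).sum())
        returns.append(total)
    returns_t = torch.tensor(returns)
    advantage = returns_t - returns_t.mean()      # shared baseline cuts variance
    loss = -(torch.stack(log_probs) * advantage).mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```

The fleet structure is the point: every robot’s episodes feed one shared policy update, which is what makes scaling from one hundred to ten thousand units more than a data-collection exercise.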

Drawing on long-term accumulation in large video models, Zhu Jun laid out a clearer technical roadmap: at the bottom layer, a Diffusion Transformer (U-ViT) serves as the unified backbone architecture; decoding into pixel space, corresponding to the Vidu video generation model, serves digital content creation; decoding into action space serves embodied interaction in the physical world. The same backbone model can thus support both digital-world generation capabilities and physical-world action capabilities.
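
The “one backbone, two decoding spaces” idea can be sketched as a shared trunk with separate pixel and action heads, as below. The module names, the plain transformer trunk, and every dimension are illustrative stand-ins, not Shengshu’s actual U-ViT diffusion architecture.

```python
import torch
import torch.nn as nn

class UnifiedBackboneSketch(nn.Module):
    """One shared trunk, two decoding heads: pixels for generation,
    actions for embodied control. Sizes are arbitrary placeholders."""
    def __init__(self, dim=256, patch_dim=3 * 16 * 16, action_dim=7):
        super().__init__()
        self.embed = nn.Linear(patch_dim, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.trunk = nn.TransformerEncoder(layer, num_layers=4)
        self.pixel_head = nn.Linear(dim, patch_dim)    # decode back to patches
        self.action_head = nn.Linear(dim, action_dim)  # decode to robot actions

    def forward(self, patches, mode="pixels"):
        h = self.trunk(self.embed(patches))            # shared representation
        if mode == "pixels":
            return self.pixel_head(h)                  # video-generation path
        return self.action_head(h.mean(dim=1))         # control path (pooled)

model = UnifiedBackboneSketch()
patches = torch.randn(2, 64, 3 * 16 * 16)              # (batch, tokens, patch)
video_tokens = model(patches, mode="pixels")           # -> (2, 64, 768)
actions = model(patches, mode="actions")               # -> (2, 7)
```

Whatever the real architecture, the claim being illustrated is that the expensive part, the trunk, is trained once and shared, while the heads are cheap adapters into each decoding space.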

According to the company, Shengshu Technology has validated these capabilities across multiple task types: captcha-solving tasks, in which a robotic arm simulates human mouse control to recognize the screen and click precisely; game decision-making tasks, which involve long-horizon planning and multi-step reasoning and require perception, prediction, and decision-making to work together; and flexible-object manipulation, achieving stable grasps of complex, irregular objects.

A unified architecture opens a new development path. Zhu Jun said experiments show two key phenomena: first, compared with traditional Vision-Language-Action (VLA) routes, data utilization efficiency improves by an order of magnitude; second, multi-task generalization is enhanced. Under a unified model, efficient generalization is achieved across 50+ tasks, and performance rises rather than falls; by contrast, traditional VLA models (such as PI0.5) decline noticeably as the number of tasks increases.

At the implementation level, the two major tracks of autonomous driving and industrial vertical scenarios are expected to reach a commercialization and capitalization inflection point in 2026. Bai Zongyi, founding partner of Yaotu Capital, said bluntly that he is optimistic about a new opportunity of the embodied-intelligence era: the last-mile logistics track. Ivo Muth, deputy general manager for research and development at Audi China, believes that for spatial intelligence and world models, the most fundamental future changes, beyond improved driving safety, will show up in situational awareness and ride comfort.

(Editor: Wen Jing)
