World-R1 uses reinforcement learning to give text-to-video generation 3D geometric consistency, without changing the model architecture or requiring 3D datasets. Core idea: reconstruct the scene's 3D Gaussians with Depth Anything 3, render from new viewpoints, and compare with the original video; reconstruction error, trajectory deviation, and new-view semantic credibility serve as rewards, optimized via Flow-GRPO. The base model Wan 2.1 is trained into Small and Large versions on about 3,000 text prompts with no 3D assets, and a round of dynamic fine-tuning is inserted every 100 steps during training. On 3D PSNR, Large improves by 7.91 dB and Small by 10.23 dB; in blind tests the geometric consistency win rate is 92% and overall preference is 86%. Code is available on GitHub under CC BY-NC-SA 4.0.

MeNews

2026-04-28 10:00:20


AIMPACT News, April 28 (UTC+8): According to Beating monitoring, a team from Microsoft Research and Zhejiang University proposed World-R1. It uses reinforcement learning to help text-to-video models learn 3D geometric consistency, without modifying the model architecture or relying on 3D datasets.

Core idea: after generating a video, use a pre-trained 3D foundation model, Depth Anything 3, to reconstruct the scene's 3D Gaussians (3DGS), then render from a new viewpoint and compare with the original video. Reconstruction error, trajectory deviation, and new-view semantic credibility (rated by Qwen3-VL) are combined into a reward signal, which is fed back to the video model via Flow-GRPO (a reinforcement learning algorithm adapted for flow-matching models).
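The reward combination described above can be sketched as follows. This is a minimal illustration, not the released implementation: the `RewardTerms` class, the linear weighting, and the weight values are hypothetical, since the article does not say how the three terms are combined. The group-relative normalization (each sample's reward minus the group mean, divided by the group standard deviation) is the general GRPO recipe, applied here to a group of videos generated from one prompt.

```python
from dataclasses import dataclass
from statistics import mean, pstdev


@dataclass
class RewardTerms:
    """Per-video scores (hypothetical container, not from the source)."""
    reconstruction_error: float  # re-rendered view vs. original video; lower is better
    trajectory_deviation: float  # camera path mismatch; lower is better
    semantic_score: float        # new-view plausibility rating; higher is better


def combined_reward(t, w_recon=1.0, w_traj=1.0, w_sem=1.0):
    # Assumption: a simple weighted sum in which the two error terms act as penalties.
    return (w_sem * t.semantic_score
            - w_recon * t.reconstruction_error
            - w_traj * t.trajectory_deviation)


def group_advantages(rewards, eps=1e-8):
    # GRPO-style advantage: normalize each reward against its own sampling group.
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three videos sampled from the same prompt (made-up scores for illustration).
group = [
    RewardTerms(0.1, 0.1, 0.9),
    RewardTerms(0.5, 0.4, 0.3),
    RewardTerms(0.3, 0.2, 0.6),
]
advantages = group_advantages([combined_reward(t) for t in group])
```

Videos that reconstruct more consistently than their siblings receive a positive advantage and are reinforced; the others are pushed down.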

The base model is the open-source Wan 2.1 (1.3B and 14B); each size is trained separately, yielding World-R1-Small and World-R1-Large. The training data consists of only about 3,000 pure-text prompts generated by Gemini, with no 3D assets. Every 100 training steps, one round of "dynamic fine-tuning" is inserted, in which the 3D rewards are temporarily disabled and only the image-quality rewards are kept. This prevents the model from suppressing non-rigid dynamics, such as character motion, in pursuit of geometric rigidity.
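The alternating schedule can be sketched like this. It is only an illustration under stated assumptions: the function name `active_rewards` and the reward labels are hypothetical, and the article does not specify whether the first step counts as a dynamic round or how long one round lasts (a single step is assumed here).

```python
def active_rewards(step, period=100):
    """Return the reward groups used at a given training step.

    Assumption: every `period`-th step (100, 200, ...) is a dynamic
    fine-tuning round that drops the 3D rewards; all other steps use both.
    """
    if step > 0 and step % period == 0:
        return ("image_quality",)
    return ("geometry_3d", "image_quality")


# Which of the first 300 steps are dynamic fine-tuning rounds?
dynamic_steps = [s for s in range(1, 301) if active_rewards(s) == ("image_quality",)]
```

Interleaving the two regimes, rather than running them in separate phases, keeps the geometric objective dominant while regularly reminding the model that moving, deforming content is still rewarded.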

On 3D consistency metrics, World-R1-Large's 3D PSNR improves by 7.91 dB over the base Wan 2.1 14B, and the Small version improves by 10.23 dB. General video quality on VBench does not degrade and in fact improves.
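To put the decibel gains in perspective, the sketch below uses the standard PSNR definition, 10·log10(MAX² / MSE). It assumes the article's 3D PSNR follows that definition when comparing re-rendered views with the original frames; the exact evaluation protocol is not described in the article, and the function names are illustrative.

```python
import math


def psnr(reference, rendered, max_val=1.0):
    """Standard PSNR in dB between two equal-length pixel sequences."""
    if len(reference) != len(rendered):
        raise ValueError("inputs must have the same length")
    mse = sum((a - b) ** 2 for a, b in zip(reference, rendered)) / len(reference)
    if mse == 0:
        return math.inf  # identical signals
    return 10.0 * math.log10(max_val ** 2 / mse)


def mse_reduction_factor(gain_db):
    """Factor by which MSE shrinks for a given PSNR gain in dB."""
    return 10.0 ** (gain_db / 10.0)


large_factor = mse_reduction_factor(7.91)
small_factor = mse_reduction_factor(10.23)
```

Under that assumption, and treating each figure as a single PSNR difference, the reported gains correspond to mean squared reconstruction error falling by roughly a factor of 6 for the Large model and roughly a factor of 10 for the Small one.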

In a blind test with 25 participants, the geometric consistency win rate is 92%, and overall preference is 86%. The code has been open-sourced on GitHub, licensed under CC BY-NC-SA 4.0. (Source: BlockBeats)


Microsoft World-R1: Teaching video models to "understand" 3D with reinforcement learning, no architecture changes, PSNR increases by 10dB
