VGHuman is an embodied AI framework developed by a joint team from Peking University, CMU, Tongji University, UCLA, and the University of Michigan, and published on arXiv. It lets digital humans autonomously navigate unfamiliar 3D environments using vision alone. The framework consists of a World Layer (an occlusion-aware 3D Gaussian scene with semantic annotations and collision meshes, reconstructed from monocular video) and an Agent Layer (first-person RGB-D perception, planning through iterative reasoning, and a diffusion model that turns the plan into full-body motion). Across 200 test scenes, its success rate was about 30 percentage points higher than baselines such as NaVILA, with collision rates comparable or lower. It also supports movement styles such as running and jumping, as well as long-horizon planning. The code is planned to be open-sourced, and a GitHub repository has already been created.

MeNews

2026-05-01 03:10:18


ME News Report, April 14 (UTC+8): according to 1M AI News monitoring, a joint team from Peking University, Carnegie Mellon University, Tongji University, the University of California, Los Angeles, and the University of Michigan has released VGHuman on arXiv, an embodied AI framework that enables digital humans to autonomously navigate unfamiliar 3D environments based solely on visual perception. Earlier digital human systems were generally driven by preset scripts or privileged state information, whereas VGHuman’s starting point is to give digital humans real eyes, so that they can see the way, plan, and act on their own. The framework consists of two layers. The World Layer reconstructs a 3D Gaussian scene with semantic annotations and collision meshes from monocular video; its occlusion-aware design allows small, occluded objects to be recognized even in complex outdoor environments. The Agent Layer equips the digital human with first-person RGB-D (color + depth) perception and produces a plan through spatially aware visual cues and iterative reasoning, which a diffusion model then converts into full-body motion sequences that drive the character. In navigation benchmarks across 200 test scenes at three difficulty levels (simple path, obstacle avoidance, and dynamic pedestrians), VGHuman’s task success rate exceeds that of the strongest baselines, such as NaVILA, NaVid, and Uni-NaVid, by about 30 percentage points, with collision rates comparable or lower. The framework also supports various movement styles, such as running and jumping, as well as long-horizon planning for visiting multiple targets in succession. The code and models are planned to be open-sourced, and a GitHub repository has already been set up. (Source: BlockBeats)
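To make the headline figure concrete, the sketch below shows how a gap in "percentage points" differs from a relative improvement. It is only an illustration: the function names are hypothetical, and the episode outcomes are made-up toy values, not results from the VGHuman paper or its benchmark.

```python
# Illustrative only: toy outcomes, NOT data from the VGHuman paper.

def success_rate(outcomes):
    """Return the share of successful episodes as a percentage (0-100)."""
    if not outcomes:
        raise ValueError("need at least one episode")
    return 100.0 * sum(outcomes) / len(outcomes)


def percentage_point_gap(rate_a, rate_b):
    """Absolute difference between two rates, in percentage points."""
    return rate_a - rate_b


def relative_gain(rate_a, rate_b):
    """Improvement of rate_a over rate_b, as a percent of rate_b."""
    return 100.0 * (rate_a - rate_b) / rate_b


# Hypothetical episode results (True = target reached).
method_outcomes = [True] * 7 + [False] * 3    # 7 of 10 succeed
baseline_outcomes = [True] * 4 + [False] * 6  # 4 of 10 succeed

method_rate = success_rate(method_outcomes)
baseline_rate = success_rate(baseline_outcomes)

print(f"method:   {method_rate:.1f}%")
print(f"baseline: {baseline_rate:.1f}%")
print(f"gap:      {percentage_point_gap(method_rate, baseline_rate):.1f} percentage points")
print(f"relative: {relative_gain(method_rate, baseline_rate):.1f}% better than baseline")
```

In this toy case a 30-percentage-point gap corresponds to a 75% relative gain, which is why the two ways of reporting an improvement should not be confused.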

