Planning is added only at inference time; the base model does not need to be modified. If this plug-and-play optimization scales with more compute, long video generation could truly move from tinkering to engineering.

BlockBeatNews
Bringing AlphaGo-style search to video: a new MCTS video generation framework that produces longer videos than Sora
This paper proposes the Planning at Inference framework, which introduces multi-tree Monte Carlo Tree Search (MCTS) at inference time and treats long video generation as a sequential decision-making problem. By evaluating multiple candidate segments through lookahead, backtracking, and reward backpropagation, it significantly alleviates the semantic drift and error accumulation caused by block-wise generation. The multi-tree structure improves search efficiency, and the method works as a fully plug-and-play inference optimization with no fine-tuning of the base model. In experiments on Cosmos-Predict2, it generates high-quality, coherent videos longer than 20 seconds and outperforms greedy search, beam search, and Best-of-N on metrics such as object persistence, temporal coherence, and text alignment. Compared with Sora and Kling, video duration increases by 18% and 47%, respectively, with comparable visual quality. Computational costs are high, but as base models and hardware improve, this approach has the potential to move long video generation toward practical engineering applications.
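To make the search loop described above concrete, here is a minimal toy sketch in Python. It is not the paper's implementation: "multi-tree" is interpreted here simply as several independent search trees sharing one best-so-far result, video segments are stand-in integers, and the names `plan`, `toy_propose`, and `toy_reward` are hypothetical. A real system would call the frozen video model to propose segments and a learned scorer to reward them.

```python
import math
import random


class Node:
    """One partial video: the tuple of segment ids chosen so far."""

    def __init__(self, segments, parent=None):
        self.segments = segments
        self.parent = parent
        self.children = []
        self.visits = 0
        self.total_reward = 0.0


def ucb(child, parent_visits, c=1.4):
    # Unvisited children are tried first; otherwise mean reward plus exploration bonus.
    if child.visits == 0:
        return float("inf")
    mean = child.total_reward / child.visits
    return mean + c * math.sqrt(math.log(parent_visits) / child.visits)


def plan(propose, reward, horizon, n_trees=3, iters_per_tree=50, branch=3, seed=0):
    """Search for a good sequence of `horizon` segments (assumed interface)."""
    rng = random.Random(seed)
    best_seq, best_score = None, float("-inf")
    for _ in range(n_trees):  # assumption: trees are independent
        root = Node(())
        for _ in range(iters_per_tree):
            # Selection: descend by UCB until reaching a leaf.
            node = root
            while node.children:
                parent = node
                node = max(parent.children, key=lambda ch: ucb(ch, parent.visits))
            # Expansion: add candidate next segments unless the video is complete.
            if len(node.segments) < horizon:
                for seg in propose(node.segments, branch, rng):
                    node.children.append(Node(node.segments + (seg,), node))
                node = rng.choice(node.children)
            # Lookahead: finish the video with single proposals, then score it.
            seq = node.segments
            while len(seq) < horizon:
                seq = seq + (propose(seq, 1, rng)[0],)
            score = reward(seq)
            if score > best_score:
                best_seq, best_score = seq, score
            # Reward backpropagation: update every ancestor of the evaluated node.
            while node is not None:
                node.visits += 1
                node.total_reward += score
                node = node.parent
    return best_seq, best_score


def toy_propose(prefix, k, rng):
    # Stand-in for the frozen base model: k random "segments" (integers 0..9).
    return [rng.randrange(10) for _ in range(k)]


def toy_reward(seq):
    # Stand-in for a coherence score in [0, 1]: small jumps between segments are better.
    diffs = [abs(a - b) for a, b in zip(seq, seq[1:])]
    return 1.0 - sum(diffs) / (9 * len(diffs)) if diffs else 1.0


best_seq, best_score = plan(toy_propose, toy_reward, horizon=4)
print(best_seq, round(best_score, 3))
```

The point of the sketch is that the generator is only ever called through `propose`, so nothing about the base model changes; all of the extra cost sits in the number of trees, iterations, and lookahead rollouts.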