AlphaGo-style search comes to video generation: a new MCTS-based framework produces longer videos than Sora
This paper proposes the Planning at Inference framework, which applies multi-tree Monte Carlo Tree Search (MCTS) at inference time and treats long video generation as a sequential decision-making problem. Rather than committing to each segment as it is generated, the method uses lookahead, backtracking, and reward backpropagation to evaluate multiple candidate segments, which significantly reduces the semantic drift and error accumulation caused by block-wise generation. The multi-tree structure improves search efficiency, and the method works as a fully plug-and-play inference-time optimization that requires no fine-tuning of the base model. In experiments on Cosmos-Predict2, it generates high-quality, coherent videos longer than 20 seconds and outperforms greedy search, beam search, and Best-of-N on metrics such as object persistence, temporal coherence, and text alignment. Compared with Sora and Kling, video duration increases by 18% and 47%, respectively, with comparable visual quality. The computational cost is high, but as base models and hardware improve, the approach has the potential to move long video generation toward practical engineering use.
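To make the search loop concrete, here is a minimal sketch of MCTS over video segments, not the paper's implementation. Everything in it is an assumption made for illustration: `toy_generate` stands in for the base video model (a segment is just a number measuring drift), `toy_reward` stands in for the real reward, and "multi-tree" is read as several independently seeded trees from which the best plan is kept. The summary does not say how the paper's trees are built or combined.

```python
import math
import random


class Node:
    """One candidate video segment in a search tree (hypothetical structure)."""

    def __init__(self, segment=None, parent=None):
        self.segment = segment
        self.parent = parent
        self.children = []
        self.visits = 0
        self.value_sum = 0.0

    def path(self):
        """Segments from the root (exclusive) down to this node."""
        node, segments = self, []
        while node.parent is not None:
            segments.append(node.segment)
            node = node.parent
        return segments[::-1]


def toy_generate(prefix, rng):
    # ASSUMPTION: stand-in for the base model; a "segment" is a drift value.
    return rng.uniform(-1.0, 1.0)


def toy_reward(segments):
    # ASSUMPTION: stand-in reward in [0, 1]; lower average drift scores higher.
    if not segments:
        return 0.0
    return 1.0 - sum(abs(s) for s in segments) / len(segments)


def ucb(child, parent_visits, c=1.4):
    # Standard UCB1 score: average reward plus an exploration bonus.
    if child.visits == 0:
        return float("inf")
    bonus = c * math.sqrt(math.log(parent_visits) / child.visits)
    return child.value_sum / child.visits + bonus


def run_tree(root, horizon, branching, iterations, rng):
    for _ in range(iterations):
        node, depth = root, 0
        # Selection: descend through fully expanded nodes by UCB1.
        while len(node.children) == branching and depth < horizon:
            parent = node
            node = max(parent.children, key=lambda ch: ucb(ch, parent.visits))
            depth += 1
        # Expansion: sample one new candidate segment.
        if depth < horizon:
            child = Node(toy_generate(node.path(), rng), node)
            node.children.append(child)
            node = child
        # Lookahead: roll out to the horizon and score the whole plan.
        segments = node.path()
        while len(segments) < horizon:
            segments.append(toy_generate(segments, rng))
        reward = toy_reward(segments)
        # Backpropagation: push the reward up to the root.
        while node is not None:
            node.visits += 1
            node.value_sum += reward
            node = node.parent


def best_path(root):
    # Follow the most-visited child at each level.
    node = root
    while node.children:
        node = max(node.children, key=lambda ch: ch.visits)
    return node.path()


def plan(num_trees=3, horizon=4, branching=3, iterations=200, seed=0):
    # ASSUMPTION: "multi-tree" = independent trees, keep the best-scoring plan.
    best = None
    for t in range(num_trees):
        root = Node()
        run_tree(root, horizon, branching, iterations, random.Random(seed + t))
        segments = best_path(root)
        score = toy_reward(segments)
        if best is None or score > best[0]:
            best = (score, segments)
    return best
```

Calling `plan()` returns a `(score, segments)` pair. In a real system, the expensive calls would be the generator and the reward model, which is where the high computational cost mentioned above comes from.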