Sand.ai raises over 100 million USD, stays on the autoregressive video route, and plans to release an open-source MoE model in July

According to Beating Monitoring, Sand.ai, a video generation model company founded in January 2024, has announced the completion of two financing rounds totaling over 100 million USD. Investors include Look Capital, Lollapalooza Capital (Wang Huiwen’s family office), Jiukun Venture Capital, Matrix Partners China, MSA Capital, Innovation Works, Source Code Capital, IDG, Baidu Venture Capital, and several other leading institutions. Xinghan Capital served as financial advisor for the current round.

Sand.ai founder Cao Yue said in an interview that the team has long adhered to a non-consensus autoregressive approach to video generation rather than the mainstream diffusion route. Its previously released Magi-1 model has held the #1 spot on the leaderboard of Physics-IQ, Google DeepMind’s benchmark for physical realism.

To break through the “impossible triangle” of video generation (cost, speed, and quality), Sand.ai began exploring the MoE (Mixture of Experts) architecture last year. It plans to release a new-generation video generation model built on MoE in July 2026 (Q3), combining efficient inference with the largest parameter count currently available in the open-source field, and it will open-source the model.

On commercialization, Sand.ai pursues a dual-engine strategy of models and products. Its music agent product VidMuse, launched in January this year, reached 10 million USD in annual recurring revenue (ARR) within just two months. In addition, its open-source MagiAttention operator library is used by nearly all multimodal model teams in China and has been officially recommended by NVIDIA.

On the widely discussed concept of “world models,” Cao Yue believes the field is still in a pre-GPT era (before GPT-1 appeared), with neither the data nor the approaches having converged. He noted that video is the most important data modality on the path toward world models, and that models should learn physical laws on their own by predicting raw video observations (pixels/frames) rather than explicitly modeling state variables through human-imposed priors.
