With the technical report, weights, and demo all released, and results that compare favorably with Kling-Foley, the open-source community finally has a capable video sound-effects framework.

BlockBeatNews
Xiaomi has open-sourced ControlFoley, a video sound-effects model that lets users decide how the sound should be matched to the video.
According to monitoring by Beating, a Xiaomi team has open-sourced ControlFoley, a video sound-effects framework built around controllability: it generates sound conditioned on images, text, or reference audio, and can change the style of the sound while keeping audio and video synchronized. The underlying architecture uses a spatiotemporal audio-visual encoder adapted from CAV-MAE and decouples temporal information from timbre. In multi-task evaluation it achieves state-of-the-art results among open-source models and is competitive with Kling-Foley, though it still trails on some KL metrics on Kling-Audio-Eval and MovieGen-Audio-Bench. The project has released its technical report, code, weights, and demo.