Xiaomi open-sources ControlFoley, a video sound-effects model that lets users decide how the sound is matched
According to monitoring by Beating, a Xiaomi team has open-sourced ControlFoley, a sound-effects generation framework for video that emphasizes controllability: it generates sound from the video's visuals, text, or reference audio, and can change the style of the sound while keeping audio and video synchronized. The underlying architecture uses a spatiotemporal audio-visual encoder adapted from CAV-MAE and decouples timing from timbre. In multi-task evaluations it achieves state-of-the-art results among open-source models and is competitive with Kling-Foley, though it still trails on some KL metrics on Kling-Audio-Eval and MovieGen-Audio-Bench. The project has released its technical report, code, weights, and demos.