Switching from wav2vec2 to whisper-large-v3 has significantly improved the robustness of multilingual lip-sync; the boundary between academic and commercial use still needs to be clearly recognized.

CoinNetwork
Meituan open-sources the LongCat-Video-Avatar 1.5 digital human framework; inference reduced to 8 steps
Meituan's LongCat team has open-sourced LongCat-Video-Avatar 1.5, reworking audio and video generation to improve spatiotemporal stability and inference speed. It replaces wav2vec2 with whisper-large-v3 to improve lip-sync alignment and multilingual robustness, and it uses GRPO reinforcement learning to reduce hand artifacts and incorrect frames, strengthening identity consistency in long videos. Long videos are produced by multi-segment rolling inference conditioned on preceding context, and DMD2 distillation brings sampling down to 8 steps to balance efficiency and fidelity. The framework generalizes to anime and animal styles and supports single- and multi-channel audio. It is released under the MIT license and is aimed mainly at academic use; check the terms before any commercial use.
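The post does not say how the rolling inference is implemented. As a rough illustration only, the idea of generating a long video in segments, each conditioned on the tail of what has already been produced, might be sketched as below. The names rolling_generate, generate_segment and toy_generator are hypothetical and are not part of LongCat-Video-Avatar; integers stand in for video frames.

```python
from typing import Callable, List, Sequence

Frame = int  # placeholder; a real pipeline would use image tensors


def rolling_generate(
    num_frames: int,
    segment_len: int,
    context_len: int,
    generate_segment: Callable[[Sequence[Frame], int, int], List[Frame]],
) -> List[Frame]:
    """Generate num_frames frames one segment at a time.

    Each call to generate_segment (a hypothetical model wrapper) receives
    the last context_len frames produced so far (empty for the first
    segment), the start index of the new segment, and its length.
    """
    if segment_len <= 0:
        raise ValueError("segment_len must be positive")
    if context_len < 0:
        raise ValueError("context_len must be non-negative")

    frames: List[Frame] = []
    while len(frames) < num_frames:
        start = len(frames)
        length = min(segment_len, num_frames - start)
        # Guard against frames[-0:], which would return the whole list.
        context = frames[-context_len:] if context_len else []
        new_frames = generate_segment(context, start, length)
        if len(new_frames) != length:
            raise RuntimeError("generator returned an unexpected segment length")
        frames.extend(new_frames)
    return frames


def toy_generator(context: Sequence[Frame], start: int, length: int) -> List[Frame]:
    """Stand-in for a video model: continues counting from the last context frame."""
    base = context[-1] + 1 if context else 0
    return [base + i for i in range(length)]


if __name__ == "__main__":
    video = rolling_generate(
        num_frames=10, segment_len=4, context_len=2,
        generate_segment=toy_generator,
    )
    print(video)
```

In this toy setup the segments join seamlessly because each one continues from its context; in a real model the preceding frames would serve as conditioning to keep identity and motion consistent across segment boundaries.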