From Whisper to DMD2 distillation, the tech stack is solidly built, and the multilingual and anime-style generalization is very appealing to someone like me who makes derivative works.

MeNews
Meituan open-sources LongCat-Video-Avatar 1.5 digital human framework; inference reduced to 8 steps
Meituan's LongCat team has open-sourced LongCat-Video-Avatar 1.5, releasing both the code and the weights in full. The new version switches to Whisper-large-v3 to improve multilingual lip-sync and style generalization, and uses multi-segment rolling inference together with DMD2-based few-step distillation to cut inference to 8 steps, balancing speed and fidelity. In an evaluation covering 508 source data pairs, 13,240 judgments from 770 evaluators, and reviews by 10 experts, it is reported to significantly improve temporal stability, identity consistency, and the naturalness of lip movements, and it generalizes to anime and animal styles. It natively supports single-channel and multi-channel audio. It is released under the MIT License and is primarily intended for academic use; commercial use requires further review.