Meituan open-sources LongCat-Video-Avatar 1.5 digital human framework, cutting inference to 8 steps

CryptoWorld News: Meituan's LongCat team has open-sourced LongCat-Video-Avatar 1.5, a digital human generation framework that reworks its audio feature extraction and video generation algorithms with a focus on industrial-grade spatiotemporal stability and fast inference. The framework replaces the wav2vec2 audio encoder with whisper-large-v3, improving lip-sync accuracy and lip-shape dynamics and making lip generation more robust in multilingual and cross-lingual settings. The model is optimized with GRPO reinforcement learning, which reduces artifacts such as hand deformation and abnormal frame skipping and improves identity consistency in long videos. The framework uses multi-segment rolling inference, in which previously generated video segments provide global temporal context so that the character's identity stays coherent. Inference adopts DMD2 few-step distillation, compressing denoising to 8 steps while balancing inference efficiency and image fidelity. The evaluation is based on 508 image-audio pairs: 770 evaluators contributed 13,240 judgments, and 10 experts rated the results along multiple dimensions. The framework generalizes to anime and animal styles, supports mono and multi-channel audio input, and its model weights are released under the MIT license. The displayed content is for academic use only; commercial use requires verification of the relevant content.
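The multi-segment rolling inference described above can be illustrated with a minimal Python sketch. It is not the LongCat-Video-Avatar implementation: the function names, the string stand-in for frames, the choice to pass only the trailing frames of earlier output as context, and the `context_frames` parameter are all assumptions made for illustration. Only the idea of conditioning each segment on previously generated video and the 8-step denoising budget come from the report.

```python
from typing import Callable, List, Sequence

Frame = str  # hypothetical stand-in for an image tensor

# Hypothetical generator signature: (audio chunk, context frames, denoise steps) -> frames
SegmentGenerator = Callable[[Sequence[float], List[Frame], int], List[Frame]]


def rolling_inference(
    audio_chunks: Sequence[Sequence[float]],
    generate_segment: SegmentGenerator,
    context_frames: int = 4,   # assumed context length, not from the report
    denoise_steps: int = 8,    # the report states denoising is compressed to 8 steps
) -> List[Frame]:
    """Generate a long video segment by segment, feeding earlier frames back in."""
    video: List[Frame] = []
    context: List[Frame] = []  # the first segment has no previous video to condition on
    for chunk in audio_chunks:
        segment = generate_segment(chunk, context, denoise_steps)
        video.extend(segment)
        # Keep the trailing frames of everything generated so far as the next context.
        context = video[-context_frames:] if context_frames > 0 else []
    return video


def dummy_generator(chunk: Sequence[float], context: List[Frame], steps: int) -> List[Frame]:
    """Toy generator: one labeled frame per audio value, no real model involved."""
    return [f"frame(audio={a}, ctx={len(context)}, steps={steps})" for a in chunk]


if __name__ == "__main__":
    chunks = [[0.1, 0.2, 0.3], [0.4, 0.5], [0.6]]
    for frame in rolling_inference(chunks, dummy_generator, context_frames=2):
        print(frame)
```

In a real system the generator would be the distilled video model and the context would be latent or pixel frames; the loop structure is the only part this sketch is meant to convey.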
Comment
MoonlightColdWallet
· 05-22 10:09
It's interesting that GRPO targets hand details; mangled fingers are the classic problem with diffusion models.
BudgetValidator
· 05-22 07:58
Switching to whisper-large-v3 definitely aligns lip movements much better; wav2vec2 often failed to match correctly in multilingual scenarios.
GateUser-6319729f
· 05-22 07:31
Kudos for the MIT license, but the commercial-use terms need careful review to avoid pitfalls.
GateUser-af0ea0c9
· 05-22 07:26
Improving spatiotemporal stability is much more meaningful than simply chasing a better FID score; video generation is finally heading in the right direction.
SlippageSailor
· 05-22 07:19
Is the dataset included, given that the content is mainly for academic use? I'd like to try reproducing it.
GateUser-f4ae43e9
· 05-22 07:19
The multi-segment rolling inference design is clever; keeping the face from falling apart in long videos is crucial.
GotLiquidatedAgainLastNight.
· 05-22 07:10
Who came up with the name LongCat? Do Meituan engineers like petting cats too?
DeltaSmile
· 05-22 07:10
It supports both mono and multi-channel audio, which makes it very suitable as a dubbing tool.
SeaSaltAirdropNotes
· 05-22 07:10
Identity consistency has finally been taken seriously; previously, in face-swapping videos, the person often changed in the second half.
CrystalBallForSentiment
· 05-22 07:10
How much does DMD2 improve efficiency? Is there any latency data on the A100?