Meituan open-sources LongCat-Video-Avatar 1.5 digital human framework; inference reduced to 8 steps

ME AI News: according to monitoring by Data Observation Beating, Meituan's LongCat team has open-sourced LongCat-Video-Avatar 1.5, an audio-driven portrait video generation framework, releasing both the code and the model weights. The upgrade aims to provide stronger identity consistency in long videos and broader style generalization.

The framework replaces the Wav2Vec2 audio encoder with Whisper-large-v3 to improve lip-sync and lip-shape dynamics; the acoustic representations from Whisper-large-v3 significantly improve the stability of multilingual and cross-language lip generation. To improve temporal stability, long videos are generated with multi-segment rolling inference, which keeps the character's identity coherent across segments. On the inference side, DMD2-based few-step distillation compresses denoising to 8 steps (8 NFE, i.e. 8 function evaluations), balancing inference efficiency against image fidelity.

The model was evaluated on 508 image-audio source pairs. A crowdsourced study collected 13,240 judgments from 770 raters, and 10 experts scored outputs on physical plausibility, coordination, temporal stability, and identity consistency. The official demonstration compares the framework with HeyGen, Kling Avatar 2.0, and OmniHuman-1.5, highlighting temporal stability, identity consistency, and natural lip movements. Beyond realistic portraits, the framework generalizes to anime and animal styles, and it natively supports mono and multi-channel audio input.

The model weights are released under the MIT license. At the same time, the project's ethics statement notes that the generated content shown on the project page is for academic use only and may not be used commercially. Actual commercial deployment still requires separately verifying the usage boundaries of the weights, code, materials, and content. (Source: BlockBeats)
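The article does not describe how the multi-segment rolling inference is implemented, so the following is only a minimal sketch of the general idea: a long clip is generated in overlapping segments, and each new segment is conditioned on the last few frames of the previous one. The function names `plan_segments` and `rolling_generate`, the segment length, the overlap, and the stub generator are all hypothetical and are not taken from the LongCat-Video-Avatar code.

```python
from typing import Callable, List, Tuple


def plan_segments(total_frames: int, segment_len: int, overlap: int) -> List[Tuple[int, int]]:
    """Return (start, end) frame ranges covering the clip.

    Every range after the first begins `overlap` frames before the previous
    range ended, so those frames can serve as conditioning context.
    """
    if not 0 <= overlap < segment_len:
        raise ValueError("overlap must satisfy 0 <= overlap < segment_len")
    if total_frames <= 0:
        return []
    segments = []
    start = 0
    while True:
        end = min(start + segment_len, total_frames)
        segments.append((start, end))
        if end == total_frames:
            break
        start = end - overlap
    return segments


def rolling_generate(
    total_frames: int,
    segment_len: int,
    overlap: int,
    generate_segment: Callable[[int, int, list], list],
) -> list:
    """Stitch a long clip from overlapping segments.

    `generate_segment(start, end, context)` is a hypothetical stand-in for the
    model call; it must return one frame per index in [start, end).
    """
    video: list = []
    context: list = []  # frames carried over from the previous segment
    for start, end in plan_segments(total_frames, segment_len, overlap):
        frames = generate_segment(start, end, context)
        # The first len(context) frames repeat what is already in `video`.
        video.extend(frames[len(context):])
        context = frames[-overlap:] if overlap else []
    return video


def stub_generator(start: int, end: int, context: list) -> list:
    """Assumed placeholder: returns frame indices instead of real frames."""
    return list(range(start, end))


segments = plan_segments(total_frames=200, segment_len=81, overlap=13)
video = rolling_generate(200, 81, 13, stub_generator)
```

In this toy setup the overlap is what carries identity information forward; a real system would pass latents or reference features rather than frame indices, and the values 81 and 13 are illustrative assumptions only.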
Comment
HedgeHedgeBaby
· 2h ago
Native support for single and multi-channel audio; this is needed for podcast clipping.
LendingRateAnxiety
· 2h ago
What specifically did the 10 experts evaluate? Is it detailed in the paper?
TheWaveOfRasterization
· 3h ago
Big thumbs up for the MIT license; it's academic-friendly.
GlassBottleFeather
· 3h ago
Is DMD2 distillation now standard? It seems like everyone is using it.
ReboundAtTheStreetCornerAfter
· 3h ago
What on earth is "animal style"? Talking cats?
GateUser-dd8dffab
· 3h ago
Improving identity consistency is very important; previously, changing perspectives easily made it seem like a different person.
GateUser-c29c3db9
· 3h ago
770 evaluators, 13,240 judgments. Is this assessment scale serious?
BridgeTroll
· 3h ago
Anime-style generalization is an easter egg; the fan-creation community is going to get lively.
CandleAfterTheRain
· 3h ago
The rolling inference design is very clever; long videos won't fall apart anymore.
GateUser-deff9ed8
· 3h ago
Multilingual lip-syncing finally works now; previously, the English model always had weird lip movements for Chinese.