The Meituan LongCat team has open-sourced LongCat-Video-Avatar 1.5, releasing both the code and the model weights. The release switches the audio encoder to Whisper-large-v3 to improve multilingual lip-sync and style generalization, uses multi-segment rolling inference for long videos, and applies DMD2-based few-step distillation to cut inference to 8 steps, balancing speed and fidelity. It was evaluated on 508 image-audio source pairs, with 13,240 judgments from 770 crowdsourced raters plus scoring by 10 experts. The release focuses on temporal stability, identity consistency, and natural lip movement, and it generalizes to anime and animal styles. It natively supports single- and multi-channel audio. The weights are under the MIT License; the demo content is for academic use only, and commercial use requires further review.

MeNews

2026-05-22 08:04:01

ME AI News: according to Data Observation Beating Monitoring, the Meituan LongCat team has open-sourced the audio-driven portrait video generation framework LongCat-Video-Avatar 1.5, releasing both the code and the model weights. The upgrade aims to provide stronger identity consistency in long videos and broader style generalization.

The framework replaces the Wav2Vec2 audio encoder with Whisper-large-v3 to improve lip-sync and lip-shape dynamics. According to the release, the acoustic representations from Whisper-large-v3 significantly improve the stability of multilingual and cross-lingual lip generation. For temporal stability, the framework uses multi-segment rolling inference when generating long videos, which keeps the character's identity coherent across segments. On the inference side, DMD2-based few-step distillation compresses denoising to 8 steps (8 NFE, that is, 8 function evaluations of the denoiser), trading off inference efficiency against image fidelity.

The model was evaluated on 508 image-audio source pairs. The crowdsourced evaluation collected 13,240 judgments from 770 raters, and 10 experts scored the results on physical plausibility, coordination, temporal stability, and identity consistency. The official demonstration compares the framework with HeyGen, Kling Avatar 2.0, and OmniHuman-1.5, with a focus on temporal stability, identity consistency, and natural lip movement. Beyond realistic portraits, the framework generalizes to anime and animal styles, and it natively supports mono and multi-channel audio input.

The model weights are released under the MIT license. The project's ethics statement, however, notes that the generated content shown on the project page is for academic use only and may not be used commercially. Actual commercial deployment still requires separately verifying the terms that apply to the weights, code, materials, and content. (Source: BlockBeats)
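The report does not describe how multi-segment rolling inference is implemented. As a rough illustration only, the sketch below plans a long video as consecutive segments in which each segment after the first reuses the last few frames of its predecessor as conditioning, then counts denoiser calls at 8 NFE per segment. The names `Segment`, `plan_segments`, and `total_denoiser_calls`, and the values `segment_len=80` and `overlap=8`, are assumptions made for this example and are not taken from the LongCat code.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Segment:
    start: int           # first frame index of the segment (inclusive)
    end: int             # one past the last frame index (exclusive)
    context_frames: int  # frames shared with the previous segment


def plan_segments(total_frames: int, segment_len: int = 80, overlap: int = 8) -> List[Segment]:
    """Split a video into rolling segments that overlap by `overlap` frames.

    segment_len and overlap are hypothetical values chosen for illustration.
    """
    if overlap < 0 or segment_len <= overlap:
        raise ValueError("segment_len must be larger than a non-negative overlap")
    if total_frames <= 0:
        return []
    segments: List[Segment] = []
    start = 0
    while True:
        end = min(start + segment_len, total_frames)
        # The first segment has no predecessor, so it carries no context.
        context = 0 if start == 0 else overlap
        segments.append(Segment(start, end, context))
        if end == total_frames:
            break
        # Roll forward, keeping the tail of this segment as the next one's context.
        start = end - overlap
    return segments


def total_denoiser_calls(segments: List[Segment], nfe: int = 8) -> int:
    """Denoiser evaluations needed if every segment is sampled in `nfe` steps."""
    return len(segments) * nfe


if __name__ == "__main__":
    plan = plan_segments(total_frames=200)
    for seg in plan:
        print(seg)
    print("denoiser calls:", total_denoiser_calls(plan))
```

In this sketch, the shared frames are what would let a generator carry identity from one segment into the next. The actual conditioning mechanism in LongCat-Video-Avatar 1.5 may differ.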

Comment

HedgeHedgeBaby

· 2h ago

Native support for single and multi-channel audio; this is needed for podcast clipping.

LendingRateAnxiety

· 2h ago

What specifically did the 10 experts evaluate? Is it detailed in the paper?

TheWaveOfRasterization

· 3h ago

Kudos for the MIT License; it is academic-friendly.

GlassBottleFeather

· 3h ago

Is DMD2 distillation now standard? It seems like everyone is using it.

ReboundAtTheStreetCornerAfter

· 3h ago

What on earth is animal style? Talking cats?

GateUser-dd8dffab

· 3h ago

Improving identity consistency is very important; previously, changing perspectives easily made it seem like a different person.

GateUser-c29c3db9

· 3h ago

770 evaluators, 13,240 judgments. Is this assessment scale serious?

BridgeTroll

· 3h ago

Anime-style generalization is a nice bonus; the fan-creation community is going to get lively.

CandleAfterTheRain

· 3h ago

The rolling inference design is very clever; long videos won't fall apart anymore.

GateUser-deff9ed8

· 3h ago

Multilingual lip-syncing finally works now; previously, the English model always had weird lip movements for Chinese.
