CryptoWorld News: Meituan's LongCat team has open-sourced LongCat-Video-Avatar 1.5, a digital human generation framework. The release reworks the audio feature extraction and video generation algorithms, with a focus on industrial-grade spatiotemporal stability and fast inference.

The framework replaces the wav2vec2 audio encoder with whisper-large-v3, which improves lip-sync accuracy and lip-shape dynamics and makes lip generation more robust in multilingual and cross-lingual settings. The model is further optimized with GRPO reinforcement learning, which reduces artifacts such as hand deformation and abnormal frame skipping and improves identity consistency in long videos. For long videos, the framework uses multi-segment rolling inference: previously generated video segments provide global temporal context, which keeps the character's identity coherent across segments. Inference uses DMD2 few-step distillation to compress denoising to 8 steps, balancing inference efficiency against image fidelity.

The evaluation is based on 508 image-audio pairs: 770 evaluators contributed 13,240 judgments, and 10 experts rated the results along multiple dimensions. The framework generalizes to anime and animal styles and supports mono and multi-channel audio input. The model weights are released under the MIT license. The displayed content is for academic use only; commercial use requires verification of the relevant content.
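The multi-segment rolling inference described above can be pictured with a small sketch. This is not the LongCat-Video-Avatar code: the function names, the `context_frames` parameter, and the toy generator are all hypothetical, and frames are represented as plain integers so the example runs without any model. It only illustrates the general idea of conditioning each new segment on the tail of the video generated so far.

```python
from typing import Callable, List, Sequence

Frame = int  # stand-in for a video frame; a real system would use tensors


def rolling_inference(
    audio_chunks: Sequence[Sequence[float]],
    generate_segment: Callable[[List[Frame], Sequence[float]], List[Frame]],
    context_frames: int = 4,
) -> List[Frame]:
    """Generate a long video one segment at a time.

    Each segment is conditioned on the last `context_frames` frames that
    were already generated (hypothetical parameter, chosen for illustration).
    """
    video: List[Frame] = []
    for chunk in audio_chunks:
        # Guard against context_frames == 0, where video[-0:] would be the whole list.
        context = video[-context_frames:] if context_frames > 0 else []
        video.extend(generate_segment(context, chunk))
    return video


def toy_generator(context: List[Frame], chunk: Sequence[float]) -> List[Frame]:
    """Hypothetical stand-in for the distilled few-step generator.

    Emits one frame id per audio sample, continuing from the last context
    frame, so continuity across segment boundaries is easy to check.
    """
    start = context[-1] + 1 if context else 0
    return list(range(start, start + len(chunk)))


if __name__ == "__main__":
    chunks = [[0.0] * 3, [0.0] * 2, [0.0] * 4]
    print(rolling_inference(chunks, toy_generator))
```

In this toy setup the frame ids run continuously across the three segments because each call sees the end of the previous output. In a real pipeline the context would be video frames or latents, and the generator would be the few-step denoiser.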




Comment


MoonlightColdWallet

· 05-22 10:09

GRPO targeting hand details is quite interesting; mangled fingers are a long-standing problem with diffusion models.


BudgetValidator

· 05-22 07:58

After the switch, whisper-large-v3 definitely aligns much better with lip movements; previously, wav2vec2 often failed to match correctly in multilingual scenarios.


CurveWhaleWatcher

· 05-22 07:31

The MIT License deserves praise, but the commercial-use terms need careful review to avoid pitfalls.


MacroTide

· 05-22 07:26

Improving spatiotemporal stability is much more meaningful than simply chasing a better FID score; video generation is finally heading in the right direction.


SlippageSailor

· 05-22 07:19

Is the dataset included in the release, given the academic-use focus? I'd like to try reproducing the results.


ArbShuttle

· 05-22 07:19

The multi-segment rolling inference design is clever; keeping the face from falling apart in long videos is crucial.


GotLiquidatedAgainLastNight.

· 05-22 07:10

Who came up with the name LongCat? Do Meituan engineers like petting cats too?


DeltaSmile

· 05-22 07:10

It supports both mono and multi-channel audio, which makes it well suited as a dubbing tool.


SeaSaltAirdropNotes

· 05-22 07:10

Identity consistency has finally been taken seriously; previously, in face-swapping videos, the person often changed in the second half.


CrystalBallForSentiment

· 05-22 07:10

How much does DMD2 improve efficiency? Is there any latency data on the A100?


