Meituan releases the native multimodal large model LongCat-Next


Sina Tech reported on the morning of March 27 that Meituan has released and fully open-sourced LongCat-Next, a native multimodal large model, along with its core component: the discrete native-resolution vision tokenizer (dNaViT).

The model breaks with the "language-centered" patchwork architecture common in current large models, unifying images, speech, and text into homogeneous discrete tokens. By relying on a pure next-token-prediction (NTP) paradigm, LongCat-Next makes vision and speech the AI's "native mother tongue."
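The idea of a shared discrete token space can be illustrated with a minimal sketch. The vocabulary sizes, mapping functions, and layout below are hypothetical (the article does not disclose LongCat-Next's internals); the sketch only shows how separate modality codebooks can be packed into one flat ID space so a single autoregressive model predicts the next token regardless of modality.

```python
# Hypothetical sketch: unify text, image, and speech codebooks into one
# shared vocabulary for next-token prediction (NTP). All sizes are assumed.
TEXT_VOCAB = 32000    # assumed text vocabulary size
IMAGE_CODES = 8192    # assumed dNaViT codebook size
SPEECH_CODES = 4096   # assumed speech codebook size


def image_token(code: int) -> int:
    """Map an image codebook index into the shared flat ID space."""
    assert 0 <= code < IMAGE_CODES
    return TEXT_VOCAB + code


def speech_token(code: int) -> int:
    """Map a speech codebook index into the shared flat ID space."""
    assert 0 <= code < SPEECH_CODES
    return TEXT_VOCAB + IMAGE_CODES + code


# A mixed-modality sequence: text IDs, then image tokens, then a speech
# token, all drawn from the same ID space.
sequence = [11, 42, image_token(7), image_token(1030), speech_token(5)]

# NTP training pairs: predict sequence[t+1] from the prefix sequence[:t+1].
pairs = [(sequence[: t + 1], sequence[t + 1]) for t in range(len(sequence) - 1)]
```

Because every token lives in one vocabulary, the model needs no modality-specific prediction head: the same softmax over the flat ID space covers text, image, and speech continuations.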

According to the announcement, LongCat-Next achieves three key technological breakthroughs: first, a discrete native autoregressive architecture (DiNA) that breaks down the barriers between modalities; second, the discrete native-resolution vision tokenizer (dNaViT), which builds a "dictionary" for the visual world; and third, a semantically aligned encoder that addresses the industry-wide challenge of information loss during discretization.


Responsible editor: Jiang Yuhan
