Up to 3x speedup with zero quality loss: Google open-sources MTP speculative decoding models for the full Gemma 4 series

According to Beating Monitoring, Google has released and open-sourced draft models for multi-token prediction (MTP) covering the full Gemma 4 series. These are lightweight auxiliary models built on a speculative decoding architecture, delivering up to 3x inference speedup while the main model retains final verification authority, with no loss of output quality or logical reasoning ability.

Standard large language models generate one token at a time, a pattern that is bottlenecked by GPU memory bandwidth and leaves compute resources idle. The MTP scheme lets the lightweight draft model use that idle compute to predict several future tokens in advance, which the heavyweight target model (such as the 31B model) then verifies in a single parallel pass. Where the target model agrees with the draft, it accepts the whole run of tokens at once; at the first disagreement, the target's own token takes over and the remaining drafts are discarded. To improve efficiency further, the draft model directly shares activation states and KV caches (which store historical context to avoid redundant computation) with the target model. For edge models such as E2B and E4B, the team also introduced clustering techniques at the embedding layer.
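The draft-and-verify loop can be made concrete with a short sketch. This is a greedy-decoding toy, assuming hypothetical `draft_model` and `target_model` callables with a Hugging-Face-style interface (returning `.logits`) and batch size 1; the production scheme additionally shares activations and KV cache between the two models, and general speculative sampling uses probabilistic rejection rather than the exact-match acceptance shown here.

```python
import torch

@torch.no_grad()
def speculative_step(target_model, draft_model, input_ids, k=4):
    """Draft k tokens cheaply, then verify them in one target forward pass."""
    # 1. Draft phase: the small model proposes k future tokens autoregressively.
    draft_ids = input_ids
    for _ in range(k):
        logits = draft_model(draft_ids).logits[:, -1, :]
        next_id = logits.argmax(dim=-1, keepdim=True)
        draft_ids = torch.cat([draft_ids, next_id], dim=-1)

    # 2. Verify phase: ONE forward pass of the big model scores all k drafted
    #    positions in parallel -- this is where the speedup comes from.
    target_logits = target_model(draft_ids).logits

    # 3. Accept the longest prefix where the target's greedy choice matches
    #    the draft; the first mismatch is replaced by the target's own token.
    n_prompt = input_ids.shape[1]
    accepted = input_ids
    for i in range(k):
        pos = n_prompt + i                       # absolute index of drafted token i
        target_choice = target_logits[:, pos - 1, :].argmax(dim=-1, keepdim=True)
        accepted = torch.cat([accepted, target_choice], dim=-1)
        if not torch.equal(target_choice, draft_ids[:, pos:pos + 1]):
            break                                # disagreement: discard remaining drafts
    else:
        # All k drafts accepted: the same pass also yields one bonus token for free.
        bonus = target_logits[:, -1, :].argmax(dim=-1, keepdim=True)
        accepted = torch.cat([accepted, bonus], dim=-1)
    return accepted
```

Because the target model still scores every position, greedy output is token-for-token identical to running the target model alone; only the wall-clock time changes.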

The MTP models have been fully open-sourced under the same Apache 2.0 license as Gemma 4 and are natively supported by mainstream inference frameworks such as vLLM, SGLang, and Ollama. This speed optimization significantly lowers the barrier to deployment, letting developers run the 26B MoE and 31B dense models smoothly on consumer-grade graphics cards and support real-time AI interaction on mobile devices at lower power consumption.
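For illustration, this is roughly how a draft model plugs into vLLM's offline API. The checkpoint names below are placeholders (the actual Gemma 4 / MTP hub IDs may differ), and the speculative decoding parameter names vary across vLLM versions: recent releases take a `speculative_config` dict, while older ones used top-level `speculative_model` / `num_speculative_tokens` arguments.

```python
from vllm import LLM, SamplingParams

# Placeholder model IDs -- substitute the real Gemma 4 hub names.
llm = LLM(
    model="google/gemma-4-31b",                # heavyweight target model (assumed name)
    speculative_config={
        "model": "google/gemma-4-mtp-draft",   # lightweight MTP draft model (assumed name)
        "num_speculative_tokens": 4,           # tokens drafted per verify step
    },
)

out = llm.generate(
    ["Explain speculative decoding in one sentence."],
    SamplingParams(temperature=0.0, max_tokens=64),
)
print(out[0].outputs[0].text)
```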
