Google open-sources MTP speculative decoding draft models for the full Gemma 4 series, with up to 3x speedup

CoinWorld News reports that Google has released and open-sourced the draft model of the Gemma 4 series for multi-token prediction (MTP).
The draft model uses a speculative decoding architecture that delivers up to a 3x inference speedup; because the main model still performs the final verification with its own weights, output quality is not sacrificed.
The MTP scheme uses otherwise idle compute to predict multiple future tokens in advance, which the heavyweight target model then verifies in parallel.
If the target model agrees with the draft, the entire drafted sequence is accepted at once.
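The draft-then-verify loop described above can be sketched as follows. This is a minimal illustration of speculative decoding, not the actual Gemma 4 implementation; `draft_next` and `target_next` are hypothetical stand-ins for each model's greedy next-token step (real systems compare probability distributions and batch the verification into one forward pass).

```python
def speculative_step(target_next, draft_next, tokens, k=4):
    """One speculative decoding step: draft k tokens cheaply, verify with the target.

    target_next / draft_next: callables mapping a token list to the single
    next token each model would emit under greedy decoding.
    """
    # 1. Draft phase: the lightweight model proposes k future tokens.
    draft, ctx = [], list(tokens)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)

    # 2. Verify phase: the target model checks each drafted token in turn.
    accepted, ctx = [], list(tokens)
    for t in draft:
        if target_next(ctx) == t:
            accepted.append(t)   # target agrees: keep the drafted token
            ctx.append(t)
        else:
            break                # first mismatch: discard the rest of the draft

    # 3. The target model always contributes at least one token of its own,
    #    so the output is identical to running the target model alone.
    accepted.append(target_next(ctx))
    return tokens + accepted
```

When the draft agrees with the target, each heavyweight pass yields up to k+1 tokens; on a mismatch the step degrades gracefully to a single target-model token.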
The draft model shares the target model's activation states and KV caches; for the E2B and E4B models, the team introduced clustering techniques at the embedding layer.
The MTP models are now fully open-sourced, with support for mainstream inference frameworks such as vLLM, SGLang, and Ollama.
This optimization significantly lowers the barrier to deployment, letting developers run the 26B MoE and 31B dense models smoothly on ordinary consumer-grade graphics cards and support real-time AI interaction on mobile devices at lower power consumption.
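The "up to 3x" figure is consistent with the standard back-of-the-envelope analysis of speculative decoding: with draft length k and a per-token acceptance probability a, each target-model verification pass yields (1 − a^(k+1)) / (1 − a) tokens on average. The numbers below are illustrative assumptions, not published Gemma benchmarks.

```python
def expected_tokens_per_pass(a: float, k: int) -> float:
    """Expected tokens emitted per heavyweight forward pass, assuming the
    target accepts each drafted token independently with probability a."""
    if a == 1.0:
        return k + 1.0  # perfect agreement: every drafted token accepted
    return (1.0 - a ** (k + 1)) / (1.0 - a)

# If ~80% of drafted tokens are accepted and the draft looks 4 tokens
# ahead, each heavyweight pass emits ~3.4 tokens on average, in line
# with a roughly 3x speedup (ignoring the draft model's own cost).
print(round(expected_tokens_per_pass(0.8, 4), 2))  # → 3.36
```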
