Google open-sources MTP speculative decoding draft models for the full Gemma 4 series, with up to 3x speedup

CoinWorld News reports that Google has released and open-sourced the draft model of the Gemma 4 series for multi-token prediction (MTP).
The draft model uses a speculative decoding architecture that delivers up to a 3x inference speedup; because the main model still performs the final verification with its own weights, output quality is not sacrificed.
The MTP scheme uses otherwise idle compute to predict multiple future tokens in advance, which the heavyweight target model then verifies in parallel.
If the target model agrees with the draft, the entire drafted sequence is accepted at once.
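The draft-then-verify loop described above can be sketched as follows. This is a minimal illustration of speculative decoding, not the actual Gemma 4 implementation; `draft_next` and `target_next` are hypothetical stand-ins for each model's greedy next-token step (real systems compare probability distributions and batch the verification into one forward pass).

```python
def speculative_step(target_next, draft_next, tokens, k=4):
    """One speculative decoding step: draft k tokens cheaply, verify with the target.

    target_next / draft_next: callables mapping a token list to the single
    next token each model would emit under greedy decoding.
    """
    # 1. Draft phase: the lightweight model proposes k future tokens.
    draft, ctx = [], list(tokens)
    for _ in range(k):
        t = draft_next(ctx)
        draft.append(t)
        ctx.append(t)

    # 2. Verify phase: the target model checks each drafted token in turn.
    accepted, ctx = [], list(tokens)
    for t in draft:
        if target_next(ctx) == t:
            accepted.append(t)   # target agrees: keep the drafted token
            ctx.append(t)
        else:
            break                # first mismatch: discard the rest of the draft

    # 3. The target model always contributes at least one token of its own,
    #    so the output is identical to running the target model alone.
    accepted.append(target_next(ctx))
    return tokens + accepted
```

When the draft agrees with the target, each heavyweight pass yields up to k+1 tokens; on a mismatch the step degrades gracefully to a single target-model token.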
The draft model shares the target model's activation states and KV caches; for the E2B and E4B models, the team introduced clustering techniques at the embedding layer.
The MTP models are now fully open-sourced, with support for mainstream inference frameworks such as vLLM, SGLang, and Ollama.
This optimization significantly lowers the barrier to deployment, letting developers run the 26B MoE and 31B dense models smoothly on ordinary consumer-grade graphics cards and support real-time AI interaction on mobile devices at lower power consumption.
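The "up to 3x" figure is consistent with the standard back-of-the-envelope analysis of speculative decoding: with draft length k and a per-token acceptance probability a, each target-model verification pass yields (1 − a^(k+1)) / (1 − a) tokens on average. The numbers below are illustrative assumptions, not published Gemma benchmarks.

```python
def expected_tokens_per_pass(a: float, k: int) -> float:
    """Expected tokens emitted per heavyweight forward pass, assuming the
    target accepts each drafted token independently with probability a."""
    if a == 1.0:
        return k + 1.0  # perfect agreement: every drafted token accepted
    return (1.0 - a ** (k + 1)) / (1.0 - a)

# If ~80% of drafted tokens are accepted and the draft looks 4 tokens
# ahead, each heavyweight pass emits ~3.4 tokens on average, in line
# with a roughly 3x speedup (ignoring the draft model's own cost).
print(round(expected_tokens_per_pass(0.8, 4), 2))  # → 3.36
```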
