According to Crypto界网, Google has released and open-sourced multi-token prediction (MTP) draft models for the Gemma 4 series. These are lightweight auxiliary models built for speculative decoding: the draft model proposes tokens ahead of the main model, while the main model keeps final verification authority over every token. The result is an inference speedup of up to 3x with no loss of output quality or logical reasoning ability. The draft models are released under the same Apache 2.0 license as Gemma 4 and are natively supported by mainstream inference frameworks such as vLLM, SGLang, and Ollama. The speedup significantly lowers the barrier to deployment, allowing developers to run the 26B MoE and 31B dense models smoothly on ordinary consumer-grade graphics cards and supporting real-time AI interactions on mobile devices at lower power consumption.
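
To illustrate why verification by the main model preserves output quality, here is a minimal sketch of greedy speculative decoding. It is not Gemma 4 code: `target_next` and `draft_next` are hypothetical toy functions standing in for the main model and the MTP draft model, and the real release may use a different acceptance rule (for example, sampling-based verification).

```python
from typing import Callable, List, Tuple

NextToken = Callable[[List[int]], int]


def target_next(tokens: List[int]) -> int:
    """Hypothetical stand-in for the main model's greedy next token."""
    return (tokens[-1] * 3 + len(tokens)) % 11


def draft_next(tokens: List[int]) -> int:
    """Hypothetical cheap draft model: usually agrees with the target,
    but guesses wrong whenever the context length is a multiple of 4."""
    guess = target_next(tokens)
    return guess if len(tokens) % 4 else (guess + 1) % 11


def greedy_decode(model: NextToken, prompt: List[int], n_new: int) -> List[int]:
    """Baseline: one model call per generated token."""
    out = list(prompt)
    for _ in range(n_new):
        out.append(model(out))
    return out


def speculative_decode(
    target: NextToken, draft: NextToken, prompt: List[int], n_new: int, k: int = 4
) -> Tuple[List[int], int]:
    """Draft proposes up to k tokens; target keeps the matching prefix and
    substitutes its own token at the first mismatch."""
    out = list(prompt)
    goal = len(prompt) + n_new
    target_passes = 0
    while len(out) < goal:
        # 1. The draft model proposes a short run of tokens.
        proposal: List[int] = []
        for _ in range(min(k, goal - len(out))):
            proposal.append(draft(out + proposal))

        # 2. The target checks every proposed position. A real system does
        #    this in one batched forward pass, so we count it as one pass.
        target_passes += 1
        accepted: List[int] = []
        for tok in proposal:
            expected = target(out + accepted)
            accepted.append(expected)  # always the target's own choice
            if expected != tok:
                break  # discard the rest of the proposal
        else:
            # Every proposal matched: the same pass yields one bonus token.
            if len(out) + len(accepted) < goal:
                accepted.append(target(out + accepted))
        out.extend(accepted)
    return out, target_passes


prompt = [1, 2, 3]
spec_out, passes = speculative_decode(target_next, draft_next, prompt, n_new=12)
base_out = greedy_decode(target_next, prompt, n_new=12)
print("identical output:", spec_out == base_out)
print("target passes:", passes, "vs baseline calls:", 12)
```

Because every emitted token is the one the target itself would have chosen, the output matches plain greedy decoding exactly. Any speedup comes only from needing fewer target passes, so it depends on how often the draft's guesses are accepted.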
