Up to 3x speedup with zero quality loss: Google open-sources MTP speculative decoding models for the full Gemma 4 series
According to BlockBeats monitoring, Google has released and open-sourced draft models for the Gemma 4 series that perform multi-token prediction (MTP). These are lightweight auxiliary models built on a speculative decoding architecture, delivering up to 3x inference speedup; because the main model's weights still perform final verification, output quality and logical reasoning ability are unaffected.
Standard large language models generate only one token at a time, which leaves them bottlenecked on GPU memory bandwidth while compute resources sit idle. The MTP scheme lets the lightweight draft model use that idle compute to predict several future tokens in advance, which the heavyweight target model (such as the 31B model) then verifies in parallel. If the target model agrees with the draft, it accepts the entire run of tokens at once (a toy sketch of this draft-and-verify loop follows below). To improve efficiency further, the draft model directly shares activation states and KV caches (which store historical context to avoid redundant computation) with the target model. For the edge-side E2B and E4B models, the team also introduced clustering techniques at the embedding layer.
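To make the draft-and-verify mechanic concrete, here is a minimal, self-contained Python sketch of greedy speculative decoding. It is an illustration only, not Google's implementation: the `draft_next` and `target_next` callables stand in for the small MTP model and the large target model, and real systems score all k proposed positions in a single parallel forward pass rather than the loop shown here.

```python
# Toy sketch of greedy speculative decoding (illustration only).
# A small "draft" model proposes k future tokens cheaply; the large "target"
# model checks the whole proposal and keeps the longest prefix it agrees with,
# so the emitted text always matches what the target alone would produce.

from typing import Callable, List

def speculative_step(
    prefix: List[int],
    draft_next: Callable[[List[int]], int],   # cheap model: greedy next token
    target_next: Callable[[List[int]], int],  # large model: greedy next token
    k: int = 4,
) -> List[int]:
    """Extend `prefix` with draft proposals verified by the target model."""
    # 1) Draft phase: the small model proposes k tokens autoregressively.
    proposal = []
    ctx = list(prefix)
    for _ in range(k):
        t = draft_next(ctx)
        proposal.append(t)
        ctx.append(t)

    # 2) Verify phase: the target model checks each proposed position.
    #    In a real system all k positions are scored in ONE forward pass of
    #    the target model, which is where the speedup comes from; we loop
    #    here only for clarity. Real implementations also emit one "bonus"
    #    token from the target when all k draft tokens are accepted.
    accepted = []
    ctx = list(prefix)
    for t in proposal:
        expected = target_next(ctx)
        if expected == t:
            accepted.append(t)         # target agrees: keep the draft token
            ctx.append(t)
        else:
            accepted.append(expected)  # disagreement: take the target's token
            break                      # and discard the rest of the draft
    return prefix + accepted
```

The correctness guarantee comes from the verify phase: every emitted token is one the target model itself would have produced, so the draft model can only affect speed, never output quality.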
The MTP models are now fully open-sourced under the same Apache 2.0 license as Gemma 4 and natively supported by mainstream inference frameworks such as vLLM, SGLang, and Ollama. This speed optimization significantly lowers the barrier to deployment: developers can run the 26B MoE and 31B dense models smoothly on consumer-grade graphics cards, and mobile devices can support real-time AI interaction at lower power consumption.
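As a usage illustration, the sketch below pairs a large target model with a small draft model through vLLM's speculative decoding interface. The model IDs are hypothetical placeholders, not confirmed checkpoint names, and the exact engine arguments have varied across vLLM versions, so treat this as the shape of the API rather than a drop-in command.

```python
# Hedged sketch: target model + draft model via vLLM speculative decoding.
# Model IDs are placeholders; the speculative-decoding arguments follow
# vLLM's documented interface, which has changed across versions (newer
# releases take a `speculative_config` dict on the LLM constructor).
from vllm import LLM, SamplingParams

llm = LLM(
    model="google/gemma-4-31b",               # hypothetical target model ID
    speculative_config={
        "model": "google/gemma-4-mtp-draft",  # hypothetical MTP draft model ID
        "num_speculative_tokens": 4,          # draft tokens proposed per step
    },
)

out = llm.generate(
    ["Explain speculative decoding in one sentence."],
    SamplingParams(max_tokens=64),
)
print(out[0].outputs[0].text)
```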