CoinWorld news: Google has deployed a multi-token prediction (MTP) architecture on its Pixel 9 and Pixel 10 series devices to accelerate the built-in Gemini Nano v3 model. The architecture attaches a lightweight transformer prediction head to the end of the frozen main model, raising on-device inference speed by more than 50% while preserving the original safety alignment and output quality. To avoid the extra runtime memory that draft computation would otherwise add during autoregressive generation, Google designed a zero-copy mechanism that reuses the feature activations already computed by the main model, which also improves the prediction accuracy of candidate tokens. In practice, the model correctly predicts nearly 2 additional tokens on average per inference pass, so the main processor is woken for verification less often, which reduces system power consumption.
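
The draft-then-verify loop described above can be illustrated with a toy sketch. Everything here is an assumption made for illustration and not Google's implementation: `main_step` stands in for the frozen main model, `draft_head` stands in for the lightweight prediction head that reuses the main model's hidden state, and the integer "hidden state" and mixing function are invented so the example runs without any ML library. The point it shows is that accepted draft tokens never change the output, they only reduce the number of main-model passes.

```python
from typing import List, Tuple

VOCAB = 50  # toy vocabulary size (hypothetical)


def _mix(h: int, token: int) -> int:
    """Toy state update; stands in for one layer stack of the main model."""
    return (h * 31 + token + 1) % 997


def main_step(context: List[int]) -> Tuple[int, int]:
    """Hypothetical frozen main model: returns (greedy next token, hidden state)."""
    h = 7
    for t in context:
        h = _mix(h, t)
    return h % VOCAB, h


def draft_head(hidden: int, first_token: int, k: int) -> List[int]:
    """Hypothetical lightweight head: proposes k tokens that follow first_token.

    It starts from the hidden state the main model already computed instead of
    recomputing the context, and it is deliberately imperfect.
    """
    drafts = []
    h, t = hidden, first_token
    for _ in range(k):
        h = _mix(h, t)
        t = h % VOCAB
        if h % 5 == 0:  # deliberate error so that some drafts get rejected
            t = (t + 1) % VOCAB
        drafts.append(t)
    return drafts


def generate_plain(prompt: List[int], n: int) -> List[int]:
    """Baseline: one main-model pass per generated token."""
    out = list(prompt)
    for _ in range(n):
        tok, _ = main_step(out)
        out.append(tok)
    return out[len(prompt):]


def generate_mtp(prompt: List[int], n: int, k: int = 2) -> Tuple[List[int], int]:
    """Draft-then-verify generation; returns (tokens, counted main passes)."""
    out = list(prompt)
    passes = 0
    while len(out) - len(prompt) < n:
        tok, hidden = main_step(out)
        passes += 1  # one counted main pass per loop iteration
        out.append(tok)
        for d in draft_head(hidden, tok, k):
            if len(out) - len(prompt) >= n:
                break
            # Toy verification: a real system would check all drafts inside a
            # single batched main pass rather than one call per draft.
            expected, _ = main_step(out)
            if d != expected:
                break  # reject this draft and everything after it
            out.append(d)
    return out[len(prompt):], passes


if __name__ == "__main__":
    plain = generate_plain([1, 2, 3], 6)
    fast, passes = generate_mtp([1, 2, 3], 6, k=2)
    print(plain == fast, passes)
```

Because a draft token is kept only when it equals what the main model would have produced, the output is identical to plain greedy decoding; the saving comes from each counted pass yielding one token plus however many drafts were accepted.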

DewdropSapling
· 2h ago
Google's MTP architecture really has something - 50% speed boost and power savings, mobile AI is about to change the game.
AirdropCartographer
· 2h ago
The zero-copy mechanism is cleverly designed, reusing feature activations to avoid memory explosion, with solid engineering details.