According to Beating's monitoring, Google has deployed a Multi-Token Prediction (MTP) architecture on Pixel 9 and Pixel 10 series devices to accelerate the built-in Gemini Nano v3 model. The design attaches a lightweight Transformer prediction head to the tail of the frozen main model and is reported to raise on-device inference speed by more than 50% while fully retaining the original safety alignment and output quality.

Traditional speculative decoding runs a separate draft model to propose candidate tokens, which the main model then verifies. The draft model occupies additional runtime memory on the phone, and its prediction accuracy is limited because it cannot access the main model's internal hidden states.
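
The draft-and-verify loop described above can be sketched with toy stand-ins. This is a minimal illustration assuming greedy decoding; `target_model`, `draft_model`, and `speculative_step` are hypothetical names, and the two "models" are simple functions rather than anything from Gemini Nano.

```python
from typing import Callable, List

# Hypothetical toy "model": maps a token sequence to its greedy next token.
NextToken = Callable[[List[int]], int]


def speculative_step(target: NextToken, draft: NextToken,
                     prefix: List[int], k: int) -> List[int]:
    """Run one draft-and-verify step and return the newly accepted tokens."""
    # 1. The draft model proposes k candidate tokens autoregressively.
    candidates: List[int] = []
    for _ in range(k):
        candidates.append(draft(prefix + candidates))

    # 2. The target model checks the candidates in order. A real system
    #    scores all positions in one batched pass; this loop is for clarity.
    accepted: List[int] = []
    for cand in candidates:
        expected = target(prefix + accepted)
        if cand != expected:
            # First mismatch: keep the target's own token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(cand)

    # 3. Every candidate matched, so the target pass yields one bonus token.
    accepted.append(target(prefix + accepted))
    return accepted


def target_model(seq: List[int]) -> int:
    # Toy target: counts upward modulo 10.
    return (seq[-1] + 1) % 10


def draft_model(seq: List[int]) -> int:
    # Toy draft: agrees with the target except when the answer should be 5.
    nxt = (seq[-1] + 1) % 10
    return 0 if nxt == 5 else nxt


all_accepted = speculative_step(target_model, draft_model, [1], k=3)
early_reject = speculative_step(target_model, draft_model, [3], k=3)
```

In the first call all three drafted tokens match the target, so the step produces four tokens. In the second call the draft goes wrong at its second token, so only two tokens are produced.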

The new architecture instead places the MTP head at the tail of the frozen main model, where it reuses the feature activations the main model has already computed. This significantly improves the prediction accuracy of candidate tokens.
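
As a rough sketch of that idea, the example below feeds the main model's final hidden state into a small extra head that proposes the token after next. All sizes, weights, and function names are hypothetical, the weights are random rather than trained, and the real head is described as a lightweight Transformer rather than the single layer used here.

```python
import math
import random

random.seed(0)
VOCAB, HIDDEN = 12, 6  # hypothetical toy sizes


def rand_matrix(rows, cols, scale=1.0):
    return [[random.gauss(0.0, scale) for _ in range(cols)] for _ in range(rows)]


def vec_mat(vec, mat):
    """Multiply a row vector by a matrix stored as a list of rows."""
    return [sum(vec[i] * mat[i][j] for i in range(len(vec)))
            for j in range(len(mat[0]))]


def argmax(xs):
    return max(range(len(xs)), key=xs.__getitem__)


# Stand-ins for the frozen main model (never updated in this sketch).
embed = rand_matrix(VOCAB, HIDDEN)
w_trunk = rand_matrix(HIDDEN, HIDDEN)
w_lm_head = rand_matrix(HIDDEN, VOCAB)
# Small extra head: the only part that would be trained.
w_mtp = rand_matrix(2 * HIDDEN, HIDDEN, scale=0.1)


def main_forward(token):
    """Main model pass: returns the hidden state and next-token logits."""
    hidden = [math.tanh(x) for x in vec_mat(embed[token], w_trunk)]
    return hidden, vec_mat(hidden, w_lm_head)


def mtp_propose(hidden, next_token):
    """Propose the token after next from the already computed hidden state."""
    features = hidden + embed[next_token]  # list concatenation, no recompute
    h2 = [math.tanh(x) for x in vec_mat(features, w_mtp)]
    return argmax(vec_mat(h2, w_lm_head))


hidden, logits = main_forward(3)
first = argmax(logits)               # token chosen by the main model
second = mtp_propose(hidden, first)  # extra candidate from the MTP head
```

The point of the sketch is that `mtp_propose` never reruns the trunk: it only consumes `hidden`, which the main model had to compute anyway.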

To avoid redundant memory overhead from draft computation during autoregressive generation, Google designed a zero-copy mechanism. In traditional solutions, the draft model must maintain its own key-value (KV) cache while generating candidate tokens. With the zero-copy mechanism, the attached prediction head reads the main model's existing cache directly through cross-attention. This eliminates the startup latency of draft prediction and saves about 130 MB of runtime memory on the phone.
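
The sharing pattern can be illustrated as follows. This is a sketch under stated assumptions, not Google's implementation: `KVCache`, `MTPHead`, and `cross_attend` are hypothetical names, and Python object references stand in for shared device memory.

```python
import math
from dataclasses import dataclass, field
from typing import List


@dataclass
class KVCache:
    """Toy per-layer cache: one key vector and one value vector per position."""
    keys: List[List[float]] = field(default_factory=list)
    values: List[List[float]] = field(default_factory=list)

    def append(self, key: List[float], value: List[float]) -> None:
        self.keys.append(key)
        self.values.append(value)


def cross_attend(query: List[float], cache: KVCache) -> List[float]:
    """Scaled dot-product attention of one query over the cache (read-only)."""
    scale = 1.0 / math.sqrt(len(query))
    scores = [scale * sum(q * k for q, k in zip(query, key))
              for key in cache.keys]
    peak = max(scores)  # subtract the max for a numerically stable softmax
    weights = [math.exp(s - peak) for s in scores]
    total = sum(weights)
    dim = len(cache.values[0])
    return [sum(w * v[d] for w, v in zip(weights, cache.values)) / total
            for d in range(dim)]


class MTPHead:
    """Hypothetical prediction head that borrows the main model's cache."""

    def __init__(self, shared_cache: KVCache) -> None:
        self.cache = shared_cache  # a reference to the same object, not a copy

    def context(self, query: List[float]) -> List[float]:
        return cross_attend(query, self.cache)


main_cache = KVCache()
main_cache.append([1.0, 0.0], [10.0, 0.0])
main_cache.append([0.0, 1.0], [0.0, 10.0])

head = MTPHead(main_cache)
context = head.context([1.0, 1.0])

# A token appended by the main model is visible to the head immediately.
main_cache.append([1.0, 1.0], [5.0, 5.0])
positions_seen_by_head = len(head.cache.keys)
```

Because the head holds a reference rather than its own cache, nothing has to be filled before drafting starts, which corresponds to the startup latency and duplicate memory the paragraph above says are avoided.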

In practical Pixel tasks such as notification summaries and text proofreading, the MTP architecture lets the model accept nearly 2 additional predicted tokens per inference step on average. This reduces how often the main processor must be woken for verification, which saves system power. In highly structured text generation tasks such as smart replies, the token acceptance rate improves by up to 55%.
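
The effect on how often the main model runs can be estimated with simple arithmetic. The sketch below assumes that "per inference" means per main-model verification pass, rounds "nearly 2" up to 2.0, and uses a hypothetical 60-token output. The pass count is not the same as wall-clock speedup, since the prediction head adds its own cost.

```python
import math


def main_model_passes(total_tokens: int, extra_accepted_per_pass: float) -> int:
    """Estimate main-model passes needed to emit total_tokens.

    Each pass yields one token from the main model plus, on average,
    extra_accepted_per_pass accepted candidate tokens.
    """
    tokens_per_pass = 1.0 + extra_accepted_per_pass
    return math.ceil(total_tokens / tokens_per_pass)


SUMMARY_TOKENS = 60  # hypothetical output length, not from the report

baseline_passes = main_model_passes(SUMMARY_TOKENS, 0.0)  # no extra tokens
mtp_passes = main_model_passes(SUMMARY_TOKENS, 2.0)       # about 2 extra per pass
```

Under these assumptions the main model runs about a third as often for the same output, which is consistent with the reported reduction in wake-ups for verification.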

