Google's newly open-sourced DiffusionGemma model: 4x faster generation, but quality trails Gemma 4

Google DeepMind has released DiffusionGemma, a new member of the open-source Gemma 4 family. Official tests show it can reach about 700 tokens per second on an Nvidia RTX 5090, surpass 1,000 tokens per second on an H100, and is about 4 times faster than similarly sized autoregressive Gemma models.

Table of Contents

  • What does a non-sequential token generation model look like
  • Where does the speed advantage come from
  • The cost behind speed: quality lags on all benchmarks

This time, Google DeepMind has added an outlier to the open-source Gemma 4 family. Most language models generate text autoregressively: they pick one token at a time from left to right, with the probability of each new token conditioned on the tokens before it, and build the output sequentially.

DiffusionGemma does exactly the opposite: it first lays down placeholder tokens across a “canvas,” then repeatedly “denoises” the entire block over multiple passes, ultimately outputting the full finalized text in a single shot. This logic is closer to how Stable Diffusion generates images than to how GPT generates text.

Google states that this architecture provides a quantifiable speed advantage on local hardware, and it is open-sourced under the Apache 2.0 license for developers and researchers to use.

What does a non-sequential token generation model look like

DiffusionGemma adopts a “Mixture of Experts” (MoE) architecture.

The idea behind MoE is that the model contains a large number of “expert” sub-networks, but during each inference only a portion of them is activated, rather than mobilizing all parameters every time. Put plainly, even though the overall model is large, each computation only calls the necessary few experts. DiffusionGemma has a total of 26 billion parameters (26B), but only 3.8 billion (3.8B) are actually activated during inference. This allows it to run within the 18GB VRAM of high-end graphics cards, and the advantage becomes especially pronounced after quantization.
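The routing idea described above can be sketched in a few lines of Python. This is only a toy illustration of generic top-k expert routing, not DiffusionGemma's actual router: the eight scalar "experts", the router scores, and the choice of k = 2 are all hypothetical. Only the 3.8B and 26B figures come from the article.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def route_top_k(router_logits, k):
    """Return (expert_index, weight) pairs for the k highest-scoring experts."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    top = order[:k]
    weights = softmax([router_logits[i] for i in top])
    return list(zip(top, weights))

def moe_forward(x, experts, router_logits, k):
    """Weighted sum over the outputs of only the k selected experts."""
    return sum(w * experts[i](x) for i, w in route_top_k(router_logits, k))

# Hypothetical toy setup: 8 scalar "experts", 2 active per input.
experts = [lambda x, s=s: s * x for s in range(1, 9)]
router_logits = [0.1, 2.0, -1.0, 0.3, 1.5, 0.0, -0.5, 0.2]

chosen = route_top_k(router_logits, k=2)
output = moe_forward(3.0, experts, router_logits, k=2)

# Share of parameters active per step, using the figures quoted in the article.
active_fraction = 3.8 / 26
print(chosen, output, round(active_fraction, 3))
```

The other six experts are never evaluated, which is the source of the compute saving. With the article's figures, roughly 15% of the parameters take part in any single step.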

The generation process is even more worth breaking down. A standard autoregressive model is like a linear assembly line: after the first token is produced, the second can start calculating, and so on.

DiffusionGemma, on the other hand, first fills the entire output region with placeholder tokens, then performs multiple denoising passes. In each pass, tokens at all positions are updated simultaneously, correcting each other’s estimated values iteratively until the whole block converges to the final output. Up to 256 tokens can be processed in parallel at once.
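To make the multi-pass process more concrete, here is a minimal toy sketch of one common way to decode with a masked diffusion model: start from an all-placeholder canvas, have the model propose a token for every position in parallel, and commit the most confident proposals in each pass. Everything here is an assumption for illustration. `toy_denoiser` is a hypothetical stand-in that simply peeks at a known target string, and the confidence-based commit schedule is not confirmed to be the sampler DiffusionGemma uses. Unlike the process described above, this toy never revises a committed token.

```python
MASK = "_"

def toy_denoiser(canvas, target):
    """Hypothetical stand-in for the network: propose a token and a
    confidence for every position at once."""
    proposals = []
    for i, tok in enumerate(target):
        neighbours = [canvas[j] for j in (i - 1, i + 1) if 0 <= j < len(canvas)]
        known = sum(1 for n in neighbours if n != MASK)
        # Confidence rises as neighbouring positions get filled in;
        # the small position term only breaks ties deterministically.
        confidence = 0.2 + 0.4 * known - 0.001 * i
        proposals.append((tok, confidence))
    return proposals

def diffusion_decode(target, passes):
    """Fill a fully masked canvas in a fixed number of parallel passes."""
    canvas = [MASK] * len(target)
    history = ["".join(canvas)]
    for step in range(passes):
        masked = [i for i, t in enumerate(canvas) if t == MASK]
        if not masked:
            break
        proposals = toy_denoiser(canvas, target)
        # Commit an equal share of the remaining masked positions per pass.
        remaining_passes = passes - step
        n_commit = -(-len(masked) // remaining_passes)  # ceiling division
        best = sorted(masked, key=lambda i: proposals[i][1], reverse=True)[:n_commit]
        for i in best:
            canvas[i] = proposals[i][0]
        history.append("".join(canvas))
    return "".join(canvas), history

result, history = diffusion_decode("diffusion", passes=3)
for snapshot in history:
    print(snapshot)
```

The point of the sketch is the pass count: nine positions are filled in three passes, whereas a left-to-right decoder would need nine sequential steps.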

This design matters most for "non-linear tasks." Google's example is solving Sudoku: traditional autoregressive models perform only moderately well on such tasks, because filling a particular cell correctly often depends on other cells that have not yet been decided, and autoregression can only move forward step by step without going back. DiffusionGemma can keep correcting itself across the entire block of tokens, which is theoretically better suited to tasks with complex, entangled logical dependencies.

Other application scenarios mentioned by Google include: in-line editing, molecular sequence generation, and mathematical drawing/plotting.

Where does the speed advantage come from

From a hardware perspective, inference speed for autoregressive models is constrained by “memory bandwidth.” Each time it outputs a token, it must read the model’s weights from memory once, and the data-movement speed becomes the bottleneck. The bottleneck for diffusion models is different: they are “compute-intensive.” They compute a large batch of tokens at once, but each token requires far fewer memory reads.
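A rough back-of-the-envelope model shows why the bottleneck shifts. The sketch below counts only weight traffic and assumes the following: 8-bit weights (1 byte per parameter), a round 1 TB/s of memory bandwidth, and 32 denoising passes per block. All three are hypothetical values picked for illustration. Only the 3.8B active parameters and the 256-token block come from the article. It also ignores compute limits, cache traffic, and the fact that different tokens in a block may route to different experts and so touch more than 3.8B parameters, so treat the results as an idealized upper bound rather than a prediction.

```python
def sequential_tokens_per_sec_bound(active_params, bytes_per_param, bandwidth):
    """Memory-bound ceiling when every generated token re-reads the active weights."""
    bytes_per_token = active_params * bytes_per_param
    return bandwidth / bytes_per_token

def block_tokens_per_sec_bound(active_params, bytes_per_param, bandwidth,
                               block_size, passes):
    """Same ceiling when one weight read serves a whole block of positions,
    repeated once per denoising pass."""
    bytes_per_block = active_params * bytes_per_param * passes
    return block_size * bandwidth / bytes_per_block

ACTIVE_PARAMS = 3.8e9    # from the article
BLOCK_SIZE = 256         # from the article
BYTES_PER_PARAM = 1      # assumption: 8-bit weights
BANDWIDTH = 1e12         # assumption: 1 TB/s, a round illustrative figure
PASSES = 32              # assumption: hypothetical number of denoising passes

ar_bound = sequential_tokens_per_sec_bound(ACTIVE_PARAMS, BYTES_PER_PARAM, BANDWIDTH)
block_bound = block_tokens_per_sec_bound(ACTIVE_PARAMS, BYTES_PER_PARAM, BANDWIDTH,
                                         BLOCK_SIZE, PASSES)
print(round(ar_bound), round(block_bound), block_bound / ar_bound)
```

Under these assumptions the memory-bound ceiling rises by a factor of block size divided by pass count (256 / 32 = 8), at which point raw compute, not bandwidth, is more likely to be the limit.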

This shift in bottleneck has real economic implications. The compute capability of modern GPUs is usually far more abundant than memory bandwidth. With autoregressive “one token at a time” generation, the expensive compute units are essentially kept waiting for memory to feed them data, leaving them in a half-idle state for long periods.

Diffusion-based generation distributes the workload into large-scale parallel computation, allowing the GPU’s compute power to be utilized fully. For applications that need long-duration, large-batch output, this characteristic of “making good use of the hardware and maximizing utilization” can be more practical than simple speed numbers.

This difference is directly reflected in performance on modern GPUs. Google’s official test results are as follows: on the consumer-grade Nvidia RTX 5090, DiffusionGemma’s output speed is about 700 tokens per second; on a single data-center-grade Nvidia H100 AI accelerator, it reaches over 1,000 tokens per second. According to Google’s own estimate, this is about 4 times the speed of the same-sized standard autoregressive Gemma model.
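To put the reported throughput into wall-clock terms, the short calculation below converts tokens per second into generation time. It assumes the roughly 4x ratio applies on the same GPU as the 700 tokens-per-second figure, which the article does not state explicitly, so the implied baseline is an inference, not a measured number. The 1,000-token response length is an arbitrary example.

```python
def seconds_to_generate(n_tokens, tokens_per_sec):
    """Wall-clock time to emit n_tokens at a steady throughput."""
    return n_tokens / tokens_per_sec

REPORTED_RTX_5090 = 700.0   # tokens/s, Google's figure as quoted in the article
REPORTED_SPEEDUP = 4.0      # approximate ratio quoted in the article

# Assumption: the 4x ratio was measured on the same hardware.
implied_baseline = REPORTED_RTX_5090 / REPORTED_SPEEDUP

n_tokens = 1000             # arbitrary example length
diffusion_time = seconds_to_generate(n_tokens, REPORTED_RTX_5090)
baseline_time = seconds_to_generate(n_tokens, implied_baseline)
print(implied_baseline, round(diffusion_time, 2), round(baseline_time, 2))
```

On these numbers a 1,000-token response takes about 1.4 seconds instead of about 5.7 seconds.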

Dongqu stresses that all of the figures above come from Google's official tests and have not been independently verified by third parties. The actual speedup may vary with the scenario and the generation length.

The cost behind speed: quality lags on all benchmarks

However, across all publicly released benchmarks, DiffusionGemma scores consistently lower than standard Gemma 4. In other words, the 4x speed does not come for free: it is paid for with an across-the-board drop in generation quality.

This trade-off means something very different for different usage scenarios. If you care about tokens output per second—such as needing large-scale batch processing, running local inference on edge devices, or dealing with applications that are highly sensitive to latency—DiffusionGemma’s speed advantage is real. If your task has a higher requirement for answer quality, standard Gemma 4 is still more reliable for now.

For the local AI community, this model turns an abstract trade-off into something concrete: on limited local hardware, how much quality are you willing to give up for how much speed? There is now a reference point that anyone can run experiments on. The Apache 2.0 license means any developer can fine-tune it and build research on it, and the practical ceiling of diffusion-based language generation will next be determined by community testing.
