Zyphra releases the first diffusion language model trained in the AMD ecosystem, with decoding speedups of up to 7.7x


AIMPACT News, May 15 (UTC+8), according to Beating Monitoring, Zyphra released ZAYA1-8B-Diffusion-Preview, a mixture-of-experts (MoE) diffusion model converted from an autoregressive large language model. Although Zyphra's official announcement bills it as the "first" model to make this architectural conversion, teams such as SDAR and LLaDA 2.0 had already taken this approach at the end of last year. What actually sets ZAYA1 apart is that it is the first diffusion language model trained within the AMD hardware ecosystem.

Setting aside the marketing rhetoric, the model still demonstrates the engineering efficiency gains of the diffusion architecture. Traditional autoregressive models generate one token at a time, and every step must re-read the model weights and an ever-growing KV cache, so generation speed ends up capped by memory bandwidth rather than by compute. As the industry trend highlighted by ELF, the recent pure diffusion model from Kaiming He’s team, suggests, parallel denoising is the key to breaking this bottleneck.

ZAYA1 adopts the TiDAR scheme, which avoids pretraining from scratch, and denoises 16 candidate tokens simultaneously in a single forward pass, turning what was a VRAM-bandwidth bottleneck into a compute bottleneck.
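Why handling 16 candidates per pass helps can be sketched with a simple roofline-style estimate. This is a minimal illustration, not Zyphra's analysis: the function name `decode_step_time` and every hardware and model number below are hypothetical, chosen only to make the arithmetic easy to follow.

```python
def decode_step_time(weight_bytes, flops_per_token, tokens_per_pass,
                     mem_bandwidth, peak_flops):
    """Roofline-style estimate of one forward pass.

    The pass takes as long as the slower of two things: streaming the
    weights from memory once, or doing the arithmetic for every token
    processed in that pass.
    """
    memory_time = weight_bytes / mem_bandwidth
    compute_time = flops_per_token * tokens_per_pass / peak_flops
    return max(memory_time, compute_time)


# Hypothetical accelerator and model, NOT measurements of any real system.
HW = dict(mem_bandwidth=1e12, peak_flops=1e14)       # 1 TB/s, 100 TFLOP/s
MODEL = dict(weight_bytes=2e9, flops_per_token=2e9)  # 2 GB read, 2 GFLOP/token

for n in (1, 16):
    step = decode_step_time(tokens_per_pass=n, **MODEL, **HW)
    print(f"{n:>2} tokens/pass: {step * 1e3:.2f} ms per step, "
          f"{step / n * 1e3:.3f} ms per token")
```

With these made-up numbers the step time is set by the weight read in both cases, so processing 16 tokens costs about the same as processing one, and the otherwise idle compute units absorb the extra work. This is only an upper bound on the gain: not every candidate token is necessarily kept, which is consistent with the reported speedups being well below 16x.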

In testing, combined with ZAYA1’s dedicated CCA attention mechanism, a standard lossless sampler delivers a 4.6x decoding speedup without compromising generation quality. Switching to a hybrid logit sampler raises the speedup to 7.7x, a substantial cost reduction for large-scale, time-consuming inference workloads. (Source: BlockBeats)
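The article does not describe how the lossless sampler works. As a rough illustration of what "lossless" usually means in draft-and-verify decoding, here is the generic greedy acceptance rule: keep drafted tokens only while they match what the autoregressive check would have chosen. The function name `accept_longest_prefix` and the token values are hypothetical, and this is not claimed to be Zyphra's or TiDAR's exact procedure.

```python
def accept_longest_prefix(draft_tokens, verifier_tokens):
    """Generic greedy draft-and-verify acceptance (an assumed, simplified rule).

    draft_tokens:    tokens proposed in parallel for the next positions.
    verifier_tokens: the token the autoregressive check picks at each
                     position, given the drafted tokens before it.

    Drafted tokens are kept up to the first disagreement; at that position
    the verifier's token is used instead and the rest of the draft is dropped.
    The result matches what greedy one-token-at-a-time decoding would emit.
    """
    accepted = []
    for drafted, verified in zip(draft_tokens, verifier_tokens):
        if drafted != verified:
            accepted.append(verified)  # correct the first mismatch, then stop
            break
        accepted.append(drafted)
    return accepted


# Hypothetical token ids: the first two drafts agree, the third does not.
print(accept_longest_prefix([5, 9, 2, 7], [5, 9, 4, 1]))  # [5, 9, 4]
```

Under a rule like this, quality is unchanged because every emitted token is one the verifier would have produced anyway; the speedup comes from emitting several tokens per pass whenever the draft is right. A sampler that relaxes the exact-match check can accept more tokens per pass, which is one plausible reading of why the hybrid logit sampler is faster, though the article gives no details.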

Comments

GateUser-9008328f (7h ago): The pre-training cost saved by TiDAR is enough to train how many downstream tasks?

CrystalBallForSentiment (7h ago): The diffusion language model finally no longer has to worry about NV's attitude, which is a good thing.

GateUser-eccf92a1 (7h ago): TiDAR skipping pre-training is so cost-effective; AMD's ecosystem finally has a competitive diffusion model.

GateUser-4aa73916 (8h ago): Single forward pass can handle 16 tokens, making it perfect for latency-sensitive scenarios.

Semi-MeltedIceCream (8h ago): CCA Attention Lossless Sampling 4.6x, looking to write a technical blog with engineering details

MosaicButterfly (8h ago): 16 tokens denoise simultaneously, trading memory for computing power; this approach is very friendly to consumer-grade cards.

LookingAtTheCandlestickChart (8h ago): Training on AMD instead of porting, the discourse dominance in the ecosystem has started to change