DeepSeek open-sources inference acceleration codebase DeepSpec and launches DSpark, boosting V4 model speed by up to 85%

According to monitoring by Dongcha Beating, DeepSeek and Peking University have released the technical report for DSpark, a speculative sampling acceleration framework, and open-sourced the full-stack codebase DeepSpec. DSpark is already deployed in DeepSeek-V4's online services. With output kept lossless, it raises single-user generation speed by 60% to 85% on the Flash version and by 57% to 78% on the Pro version. Under strict latency constraints, DSpark outperforms the original MTP-1 baseline, which drafts a single token per step, and significantly increases overall system throughput.

Until now, multi-token speculative sampling has been hard to run in online production environments. An autoregressive draft model generates too slowly, while a parallel draft model predicts each position independently, so acceptance rates for the latter half of a long draft are extremely low. If multi-token drafts are verified blindly under high concurrency, the large model spends much of its compute verifying tokens that are bound to be rejected, and overall system throughput collapses. As a result, online production across the industry has mainly relied on single-token prediction (MTP-1).
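
The cost of blind verification can be made concrete with a small calculation. The sketch below is illustrative only and is not taken from the DSpark report: it assumes verification stops at the first rejected token and treats the per-position acceptance probabilities as independent, which is a simplification. The function names and the example probabilities are hypothetical.

```python
from itertools import accumulate
from operator import mul


def expected_accepted(accept_probs):
    """Expected number of draft tokens accepted per verification step.

    Assumes verification stops at the first rejection and that the
    per-position acceptance probabilities are independent (a simplification).
    """
    # The running product gives P(first k draft tokens are all accepted).
    prefix_survival = accumulate(accept_probs, mul)
    return sum(prefix_survival)


def wasted_fraction(accept_probs):
    """Expected share of verified draft positions that end up discarded."""
    k = len(accept_probs)
    return (k - expected_accepted(accept_probs)) / k


# Hypothetical draft whose acceptance rate decays toward the tail.
decaying = [0.9, 0.8, 0.6, 0.4, 0.2, 0.1]
print(round(expected_accepted(decaying), 3))
print(round(wasted_fraction(decaying), 3))
```

Under these assumptions, the later a position sits in the draft, the less it adds to the expected accepted length, while it still costs a full verification slot. That is the trade-off that makes long drafts expensive when many requests share the same verification budget.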

DSpark targets this throughput degradation under high concurrency. It first uses the DFlash parallel backbone network to generate hidden states, then adds an extremely lightweight Markov head. Through a table lookup and a single matrix multiplication, the Markov head serially injects dependencies between adjacent tokens at very low cost. The system also integrates a confidence prediction head and a posterior calibration algorithm. To stay compatible with production scheduling at zero extra cost and to prevent leakage of future information, the scheduler works asynchronously: it uses predictions from two steps earlier to dynamically set the truncation length of the candidate tokens, so that under heavy load the large model does not verify high-risk tokens at the tail of a draft.
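
One way to picture the delayed truncation described above is the minimal sketch below. It is not DSpark's actual scheduler: the class and function names, the `min_survival` threshold, and the rule of cutting the draft where the estimated chance of full acceptance drops below that threshold are all assumptions made for illustration. The only detail taken from the report summary is that the length decision relies on predictions from two steps earlier.

```python
from collections import deque


def truncation_length(confidences, min_survival):
    """Longest draft prefix whose estimated probability of being fully
    accepted stays at or above min_survival (hypothetical rule)."""
    survival = 1.0
    length = 0
    for c in confidences:
        survival *= c
        if survival < min_survival:
            break
        length += 1
    return length


class DelayedTruncationScheduler:
    """Picks the draft length for the current step from confidences
    recorded `delay` steps earlier, so the decision never waits on
    the confidences being produced right now."""

    def __init__(self, delay=2, default_length=1):
        self.delay = delay
        self.default_length = default_length
        self._pending = deque()

    def step(self, confidences, min_survival):
        if len(self._pending) >= self.delay:
            # Oldest buffered entry is exactly `delay` steps old.
            length = truncation_length(self._pending.popleft(), min_survival)
        else:
            # Not enough history yet: fall back to a short draft.
            length = self.default_length
        self._pending.append(list(confidences))
        return length


scheduler = DelayedTruncationScheduler(delay=2)
for conf in ([0.9, 0.8, 0.5, 0.4], [0.99] * 4, [0.1], [0.1]):
    print(scheduler.step(conf, min_survival=0.5))
```

In this sketch a heavier load would be modelled by raising `min_survival`, which shortens the verified draft and drops the low-confidence tail first.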

Beyond DSpark, the DeepSpec codebase open-sourced by DeepSeek also natively supports open-source large models such as Qwen3 and Gemma. It provides a complete Python toolchain covering prompt downloading, rebuilding large-model caches, draft model training, and benchmark evaluation. Developers can use the open-source scripts directly to customize and deploy dedicated acceleration modules locally for different open-source large models.
