Just now, DeepSeek V4 updated DSpark, with inference speed increased by 80%.


Just now, DeepSeek V4 received an update.

A new speculative decoding framework, DSpark, was introduced, and DeepSpec, the full-stack speculative decoding codebase that supports it, was open-sourced at the same time.

DeepSeek-V4-Pro-DSpark is not a model with a brand-new architecture; it adds a speculative decoding module on top of DeepSeek-V4-Pro. The focus of this update is engineering implementation, not an iteration of the model's own capabilities.

DSpark has been deployed in real production traffic for DeepSeek-V4 (Flash and Pro), significantly accelerating the inference speed of large language models (LLMs).

  • Technical Report: "DSpark: Confidence-Scheduled Speculative Decoding with Semi-Autoregressive Generation"

  • Technical Report Link: https://github.com/deepseek-ai/DeepSpec/blob/main/DSpark_paper.pdf

The core motivation behind DSpark is to address the latency and throughput bottlenecks faced by LLM inference in production environments, especially under high concurrency. In short, DSpark successfully combines high-throughput "parallel generation" with adaptive "load-aware verification."

Speculative decoding is a technique that accelerates LLM inference without altering the model output distribution. Its core idea is to introduce a lightweight "draft model" that pre-generates several candidate tokens, which are then batch-verified and accepted by the target model. This transforms serial token-by-token generation into parallel batch verification, significantly reducing end-to-end latency.
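The accept-or-resample step that keeps the output distribution unchanged can be sketched as follows. This is the standard speculative sampling rule in a minimal form, not DSpark's or DeepSpec's actual code; the function names and the use of plain dictionaries as toy probability distributions are assumptions made for illustration.

```python
import random


def sample(dist, rng):
    """Draw one token from a {token: weight} mapping (weights need not sum to 1)."""
    tokens = list(dist)
    weights = [dist[t] for t in tokens]
    return rng.choices(tokens, weights=weights, k=1)[0]


def verify_draft(draft_tokens, q_probs, p_probs, rng=random):
    """Verify drafted tokens against the target model (toy version).

    draft_tokens: token ids proposed by the draft model
    q_probs[i]:   draft distribution at position i, as {token: prob}
    p_probs[i]:   target distribution at position i, as {token: prob};
                  p_probs has one extra entry used for the bonus token
    Returns the tokens that are actually emitted this step.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p = p_probs[i].get(tok, 0.0)
        q = q_probs[i][tok]
        # Accept the drafted token with probability min(1, p/q).
        if rng.random() < min(1.0, p / q):
            out.append(tok)
            continue
        # Rejected: resample from the normalized residual max(0, p - q),
        # then stop, because later draft tokens depended on the rejected one.
        residual = {
            t: max(0.0, p_probs[i].get(t, 0.0) - q_probs[i].get(t, 0.0))
            for t in p_probs[i]
        }
        out.append(sample(residual, rng))
        return out
    # Every draft token was accepted: take one bonus token from the target.
    out.append(sample(p_probs[len(draft_tokens)], rng))
    return out
```

In a real engine the target probabilities for all draft positions come from a single batched forward pass, which is where the latency saving comes from.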

Building on this, DSpark's innovation lies in introducing a semi-autoregressive generation architecture: it retains the high throughput advantages of the parallel draft model while adding a lightweight serial module to model the dependencies between tokens within a block, mitigating the acceptance rate degradation that parallel draft models tend to suffer at later positions.

DSpark also adds hardware-aware, confidence-scheduled verification. Earlier speculative decoding schemes typically sent every drafted token for verification. Under high system load, the tail tokens, which are highly likely to be rejected, waste valuable batch compute. DSpark introduces a confidence head that estimates each token's survival probability. Combined with a hardware-aware prefix scheduler, the system dynamically chooses a verification length for each request based on real-time engine throughput characteristics, allocating computation only to the tokens with the highest expected return.
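One way to picture this scheduling idea is a greedy allocation over a batch. The sketch below is an illustration under stated assumptions, not DSpark's scheduler: it assumes each draft token comes with a confidence (the estimated chance it is accepted given that all earlier tokens were), and it stands in for the hardware-aware part with a single hypothetical `token_budget`, the number of draft tokens the engine can verify in this step. All names are invented for the example.

```python
import heapq


def survival_probs(confidences):
    """Probability that each position survives, i.e. it and all earlier tokens are accepted."""
    surv, running = [], 1.0
    for c in confidences:
        running *= c
        surv.append(running)
    return surv


def schedule_prefix_lengths(batch_confidences, token_budget):
    """Pick a verification prefix length for every request in the batch.

    batch_confidences: one list of per-token confidences per request
    token_budget:      hypothetical cap on draft tokens verified this step
    Returns a list with the number of draft tokens to verify per request.
    """
    survs = [survival_probs(c) for c in batch_confidences]
    lengths = [0] * len(survs)
    # Max-heap (via negated keys) over the next unscheduled token of each request.
    heap = [(-s[0], r) for r, s in enumerate(survs) if s]
    heapq.heapify(heap)
    while heap and token_budget > 0:
        _, r = heapq.heappop(heap)
        lengths[r] += 1
        token_budget -= 1
        nxt = lengths[r]
        if nxt < len(survs[r]):
            heapq.heappush(heap, (-survs[r][nxt], r))
    return lengths
```

Because survival probability never increases along a request's draft, always taking the token with the highest survival probability yields a contiguous prefix for each request, so low-confidence tails are the first thing dropped when the budget is tight.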

To run on real production infrastructure, DSpark's scheduler works asynchronously so that it remains compatible with zero-overhead scheduling (ZOS) and continuous CUDA graph replay. It uses historical predictions from the previous two steps to determine the current dynamic truncation length, which hides scheduling latency and avoids GPU pipeline stalls while preserving the target model's output distribution losslessly.

In tests covering multiple domains, including mathematical reasoning, code generation, and daily conversation, DSpark significantly outperforms the current state-of-the-art autoregressive draft model (Eagle3) and parallel draft model (DFlash). For example, on Qwen3 series target models (4B, 8B, 14B), its average acceptance length is 26.7% to 30.9% higher than Eagle3's and 16.3% to 18.4% higher than DFlash's.

Compared with the previous generation's single-token production baseline (MTP-1), and at the same overall throughput, DSpark improves per-user generation speed by 60%-85% on the Flash model and 57%-78% on the Pro model.

Also open-sourced alongside DSpark is DeepSpec, a full-stack codebase for training and evaluating speculative decoding draft models. It serves as the "open-source infrastructure" carrying this approach and implementations of other cutting-edge algorithms, including data preparation tools, draft model implementations, training code, and evaluation scripts.

DeepSpec divides the overall workflow into three phases: data preparation, training, and evaluation. These phases must be run sequentially, with the output of each phase serving as the input for the next.

In the data preparation phase, prompt data is downloaded, the target model regenerates the answers through the inference engine, and a target cache is built. Notably, with the default Qwen/Qwen3-4B configuration the target cache can reach approximately 38 TB, so storage capacity should be assessed before starting.
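A quick pre-flight check of free disk space can catch an undersized volume before the cache build starts. The helper below is a hypothetical sketch and not part of DeepSpec; the 38 TB default comes from the figure above, while the 10% headroom and the function name are assumptions.

```python
import shutil

TB = 10**12  # decimal terabyte, in bytes


def has_room_for_cache(path, required_tb=38.0, headroom=1.1):
    """Return True if the filesystem holding `path` has enough free space.

    required_tb: expected cache size in TB (38 TB for the default Qwen/Qwen3-4B setup)
    headroom:    safety factor on top of the expected size (assumed, not from DeepSpec)
    """
    free_bytes = shutil.disk_usage(path).free
    return free_bytes >= required_tb * headroom * TB
```

Calling `has_room_for_cache("/path/to/cache/volume")` before data preparation gives an early yes-or-no answer; the path here is a placeholder.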

The training phase is launched with bash scripts/train/train.sh. This script calls train.py and starts one worker per visible GPU. Users choose among the algorithm and target model configurations in the config/ directory by specifying config_path. Training settings can also be adjusted by overriding config_path and target_cache_dir, or by changing individual configuration fields via --opts.

Regarding hardware, DeepSpec's default configurations and scripts target a single-node, 8-GPU setup. If the number of GPUs is smaller, users need to correspondingly reduce the number of visible GPUs in CUDA_VISIBLE_DEVICES.
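On a machine with fewer than eight GPUs, the launch could be wrapped along the following lines. This is a sketch under assumptions: the script path is the one named in the article, CUDA_VISIBLE_DEVICES is the standard CUDA environment variable, and the helper name is invented. The exact way train.sh accepts config_path, target_cache_dir, and --opts is not shown in the article, so those are left out rather than guessed.

```python
import os


def training_launch(gpu_ids, script="scripts/train/train.sh"):
    """Build the command and environment for a training run on selected GPUs.

    gpu_ids: GPU indices to expose, e.g. [0, 1, 2, 3] on a 4-GPU machine
    Returns (cmd, env) without running anything.
    """
    env = dict(os.environ)
    # The script starts one worker per visible GPU, so limiting visibility
    # also limits the number of workers.
    env["CUDA_VISIBLE_DEVICES"] = ",".join(str(g) for g in gpu_ids)
    cmd = ["bash", script]
    return cmd, env
```

From the repository root, the pair can be passed to `subprocess.run(cmd, env=env, check=True)` to start training.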

The evaluation phase is launched via bash scripts/eval/eval.sh. The evaluation script uses the trained draft model checkpoint to measure acceptance on multiple speculative decoding benchmark tasks. The current evaluation datasets listed in the project include GSM8K, MATH500, AIME25, HumanEval, MBPP, LiveCodeBench, MT-Bench, Alpaca, and Arena-Hard-v2, covering different task types such as mathematical reasoning, code generation, conversational ability, and comprehensive Q&A.

In terms of algorithms, DeepSpec currently includes three built-in draft models: DSpark, DFlash, and Eagle3. For target model series, the project currently supports Qwen3 and Gemma.

The open-sourcing of DeepSpec consolidates speculative decoding engineering practices, which were previously scattered across various research teams, into a reproducible and scalable standardized toolchain. For researchers and engineers looking to accelerate inference for their own LLMs, this means they can directly train custom draft models on a mature framework, skipping a large amount of repetitive infrastructure setup work.

Source: Machine Heart
