Just now, DeepSeek V4 updated DSpark, with inference speed increased by 80%.
Just now, DeepSeek V4 received an update.
The update introduces a new speculative decoding framework called DSpark. DeepSpec, the full-stack speculative decoding codebase that supports this version, was open-sourced at the same time.
DeepSeek-V4-Pro-DSpark is not a model with a brand-new architecture. It adds a speculative decoding module on top of DeepSeek-V4-Pro. The focus of this update is engineering implementation, not iteration on the model’s capabilities.
DSpark has been deployed in real production traffic for DeepSeek-V4 (Flash and Pro), significantly accelerating the inference speed of large language models (LLMs).
Technical Report: "DSpark: Confidence-Scheduled Speculative Decoding with Semi-Autoregressive Generation"
Technical Report Link: https://github.com/deepseek-ai/DeepSpec/blob/main/DSpark_paper.pdf
The core motivation behind DSpark is to address the latency and throughput bottlenecks faced by LLM inference in production environments, especially under high concurrency. In short, DSpark successfully combines high-throughput "parallel generation" with adaptive "load-aware verification."
Speculative decoding is a technique that accelerates LLM inference without altering the model output distribution. Its core idea is to introduce a lightweight "draft model" that pre-generates several candidate tokens, which are then batch-verified and accepted by the target model. This transforms serial token-by-token generation into parallel batch verification, significantly reducing end-to-end latency.
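The accept-and-verify loop described above can be illustrated with the standard speculative sampling rule: accept a draft token with probability min(1, p/q), and on rejection resample from the normalized residual max(0, p - q). The following is a minimal sketch on toy distributions, not DSpark's or DeepSpec's implementation. The function names and the dictionary representation of probabilities are assumptions made for illustration.

```python
import random


def sample(weights, rng):
    """Draw one token from a dict of token -> non-negative weight."""
    tokens = list(weights)
    return rng.choices(tokens, weights=[weights[t] for t in tokens])[0]


def speculative_step(draft_probs, target_probs, draft_tokens, rng):
    """Verify one block of draft tokens against the target model.

    draft_probs[i]  : draft distribution at position i (token -> prob)
    target_probs[i] : target distribution at position i, as produced by a
                      single batched target pass over the draft tokens;
                      it has one extra entry for the position after the block
    draft_tokens[i] : token the draft model sampled at position i
    Returns the tokens emitted by this step.
    """
    out = []
    for i, tok in enumerate(draft_tokens):
        p = target_probs[i].get(tok, 0.0)
        q = draft_probs[i][tok]
        if rng.random() < min(1.0, p / q):
            out.append(tok)  # accepted: keep the draft token
            continue
        # Rejected: resample from the residual max(0, p - q) and stop.
        residual = {
            t: max(0.0, pt - draft_probs[i].get(t, 0.0))
            for t, pt in target_probs[i].items()
        }
        out.append(sample(residual, rng))
        return out
    # Every draft token was accepted: take one extra token from the target.
    out.append(sample(target_probs[len(draft_tokens)], rng))
    return out


if __name__ == "__main__":
    rng = random.Random(0)
    draft = [{"a": 0.6, "b": 0.4}, {"a": 0.5, "b": 0.5}]
    target = [{"a": 0.5, "b": 0.5}, {"a": 0.2, "b": 0.8}, {"a": 0.5, "b": 0.5}]
    print(speculative_step(draft, target, ["a", "a"], rng))
```

Each step emits at least one token, so the worst case matches ordinary decoding, while a fully accepted block emits the whole draft plus one extra token from a single target pass.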
Building on this, DSpark's innovation lies in introducing a semi-autoregressive generation architecture: it retains the high throughput advantages of the parallel draft model while adding a lightweight serial module to model the dependencies between tokens within a block, mitigating the acceptance rate degradation that parallel draft models tend to suffer at later positions.
DSpark also adds hardware-aware, confidence-scheduled verification. Earlier speculative decoding systems typically sent every generated draft token for verification. Under high system load, the tail tokens, which are highly likely to be rejected, waste valuable batch compute. DSpark introduces a confidence head that estimates each token's survival probability. Combined with a hardware-aware prefix scheduler, this lets the system choose a verification length for each request based on real-time engine throughput characteristics, spending compute only on the tokens with the highest expected return.
To run on real production infrastructure, DSpark's scheduler works asynchronously so that it stays compatible with zero-overhead scheduling (ZOS) and continuous CUDA graph replay. It uses historical predictions from the previous two steps to determine the current dynamic truncation length, which hides scheduling latency and avoids GPU pipeline stalls while preserving the target model's output distribution exactly.
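To make the scheduling idea concrete, the sketch below picks a verification length from per-token confidences. It assumes each confidence is the estimated probability that a token is accepted given that all earlier tokens in the block were accepted, so a token's survival probability is the running product. The cost function `step_cost` is a hypothetical stand-in for the engine's real-time throughput characteristics. This is an illustration of the principle, not the scheduler described in the technical report.

```python
def expected_accepted(confidences):
    """Return prefix sums of survival probabilities.

    confidences[i] is assumed to estimate
    P(token i accepted | tokens 0..i-1 accepted).
    result[k-1] is the expected number of accepted draft tokens
    when the first k draft tokens are sent for verification.
    """
    survive, total, result = 1.0, 0.0, []
    for c in confidences:
        survive *= c          # probability that token i survives
        total += survive
        result.append(total)
    return result


def choose_verify_length(confidences, step_cost):
    """Pick the prefix length k that maximizes expected tokens per unit cost.

    step_cost(k) is a hypothetical cost of one verification step over
    k draft tokens; under heavy load it grows faster with k.
    """
    best_k = 0
    best_rate = 1.0 / step_cost(0)  # k = 0: plain decoding, one token per step
    for k, gain in enumerate(expected_accepted(confidences), start=1):
        rate = (1.0 + gain) / step_cost(k)  # +1 for the token the target always emits
        if rate > best_rate:
            best_k, best_rate = k, rate
    return best_k


if __name__ == "__main__":
    conf = [0.9, 0.8, 0.3, 0.1]
    print(choose_verify_length(conf, lambda k: 1.0))            # light load
    print(choose_verify_length(conf, lambda k: 1.0 + 0.5 * k))  # heavy load
```

With a flat cost, verifying every draft token is free and the full block is chosen. Once each extra token adds cost, the low-confidence tail stops paying for itself and the chosen prefix shrinks.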
In tests covering multiple domains, including mathematical reasoning, code generation, and daily conversation, DSpark significantly outperforms the current state-of-the-art autoregressive draft model (Eagle3) and parallel draft model (DFlash). For example, on the Qwen3 series (4B, 8B, 14B) target models, its average acceptance length is 26.7% to 30.9% higher than Eagle3 and 16.3% to 18.4% higher than DFlash.
Compared with the previous generation's single-token production baseline (MTP-1), DSpark improves user generation speed by 60%-85% on the Flash model and 57%-78% on the Pro model while maintaining the same overall throughput.
Also open-sourced alongside DSpark is DeepSpec, a full-stack codebase for training and evaluating speculative decoding draft models. It serves as the open-source infrastructure for this approach and for implementations of other recent algorithms, and it includes data preparation tools, draft model implementations, training code, and evaluation scripts.
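For readers unfamiliar with the metric, the sketch below shows one common way to compute an average acceptance length and a relative gain between two drafters. It assumes the convention that each verification step emits the accepted draft tokens plus one token from the target model. The report may use a different convention, and the numbers in the example are made up for illustration, not taken from the benchmarks.

```python
def mean_acceptance_length(accepted_per_step):
    """Average tokens emitted per verification step.

    accepted_per_step[i] is the number of draft tokens accepted at step i.
    Assumed convention: each step also emits one token from the target.
    """
    return sum(a + 1 for a in accepted_per_step) / len(accepted_per_step)


def relative_gain(new, baseline):
    """Percentage improvement of new over baseline."""
    return (new - baseline) / baseline * 100.0


if __name__ == "__main__":
    # Hypothetical per-step acceptance counts for two drafters.
    baseline = mean_acceptance_length([2, 1, 3, 2])
    candidate = mean_acceptance_length([3, 2, 3, 3])
    print(f"{relative_gain(candidate, baseline):.1f}% longer on average")
```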
DeepSpec divides the overall workflow into three phases: data preparation, training, and evaluation. These phases must be run sequentially, with the output of each phase serving as the input for the next.
In the data preparation phase, prompt data is downloaded, the target model regenerates answers through the inference engine, and a target cache is built. Notably, with the default Qwen/Qwen3-4B configuration, the target cache can reach approximately 38 TB, so available storage should be assessed before starting.
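A quick free-space check before building the cache can be sketched with the standard library. The directory path, the 10% headroom, and the decimal reading of "TB" are all assumptions made for illustration. DeepSpec itself may or may not perform a check like this.

```python
import shutil

TB = 10**12  # decimal terabyte; assumed interpretation of "38 TB"


def has_room(path, required_bytes, headroom=0.10):
    """Return True if free space at path covers required_bytes plus headroom."""
    free = shutil.disk_usage(path).free
    return free >= required_bytes * (1.0 + headroom)


if __name__ == "__main__":
    cache_dir = "."  # hypothetical: replace with the intended target cache directory
    if has_room(cache_dir, 38 * TB):
        print("Enough free space for the target cache.")
    else:
        print("Not enough free space; choose another volume or a smaller config.")
```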
The training phase can be launched via bash scripts/train/train.sh. This script calls train.py and starts one worker per visible GPU. Users can select different algorithm and target model configurations in the config/ directory by specifying config_path. The project also supports adjusting training settings by overriding config_path, target_cache_dir, and modifying individual configuration fields via --opts.
Regarding hardware, DeepSpec's default configurations and scripts target a single-node, 8-GPU setup. If the number of GPUs is smaller, users need to correspondingly reduce the number of visible GPUs in CUDA_VISIBLE_DEVICES.
The evaluation phase is launched via bash scripts/eval/eval.sh. The evaluation script uses the trained draft model checkpoint to measure acceptance on multiple speculative decoding benchmark tasks. The current evaluation datasets listed in the project include GSM8K, MATH500, AIME25, HumanEval, MBPP, LiveCodeBench, MT-Bench, Alpaca, and Arena-Hard-v2, covering different task types such as mathematical reasoning, code generation, conversational ability, and comprehensive Q&A.
In terms of algorithms, DeepSpec currently includes three built-in draft models: DSpark, DFlash, and Eagle3. For target model series, the project currently supports Qwen3 and Gemma.
The open-sourcing of DeepSpec consolidates speculative decoding engineering practices, which were previously scattered across various research teams, into a reproducible and scalable standardized toolchain. For researchers and engineers looking to accelerate inference for their own LLMs, this means they can directly train custom draft models on a mature framework, skipping a large amount of repetitive infrastructure setup work.
Source: Machine Heart