DeepSeek's new technique ported to Apple chips: local large language models on Mac run 60% faster
DSpark had been open source for only a week when it was ported to the Mac.
The ported version is called mlx-dspark, running the Gemma-4 12B and Qwen3-4B models.
Once installed, the two models generate text at about 1.6x and 1.4x their original speed on a Mac, respectively.
More impressive still, it achieves something most ports cannot: output that is byte-for-byte identical to the original model's, without a single character of difference.
In other words, it gains speed without sacrificing quality at all.
The person behind this is Abdur Rahim, an engineer who tinkers with open-source projects in his spare time. He single-handedly created the first native Mac version since DSpark was open-sourced.
Running Large Models on Mac, 60% Faster
DeepSeek open-sourced DSpark on June 27, and its official figures claim a 60% to 85% speedup in server-side scenarios.
At release, however, the technique had only been implemented for data center GPUs; there was no version adapted for Apple Silicon.
mlx-dspark is the first native version of this technology for Apple Silicon.
The idea behind DSpark is to pair the target model with a smaller draft model. The small model first proposes several candidate tokens, and the target model verifies them in one pass. Accepted tokens are kept; rejected ones are discarded and drafted again.
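The draft-and-verify loop described above can be sketched in a few lines. This is a minimal illustration of greedy speculative decoding in general, not DSpark's or mlx-dspark's actual code. `toy_target` and `toy_draft` are hypothetical stand-ins for real models, and a real implementation would verify all candidates in one batched forward pass rather than in a Python loop.

```python
from typing import Callable, List

# A "model" here is any function that maps a token prefix to the next token.
NextToken = Callable[[List[int]], int]

def speculative_step(prefix: List[int], draft: NextToken,
                     target: NextToken, k: int) -> List[int]:
    """Run one draft-then-verify round and return the tokens to append."""
    # 1. The small model proposes k candidate tokens, one after another.
    candidates: List[int] = []
    for _ in range(k):
        candidates.append(draft(prefix + candidates))

    # 2. The target checks each position (batched in a real system).
    accepted: List[int] = []
    for tok in candidates:
        expected = target(prefix + accepted)
        if tok != expected:
            # First mismatch: keep the target's own token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(tok)

    # 3. Every candidate passed, so the target adds one extra token.
    accepted.append(target(prefix + accepted))
    return accepted

def generate(prefix: List[int], draft: NextToken, target: NextToken,
             k: int, max_new: int) -> List[int]:
    out = list(prefix)
    while len(out) - len(prefix) < max_new:
        out.extend(speculative_step(out, draft, target, k))
    return out[: len(prefix) + max_new]

# Hypothetical toy models: the draft agrees with the target most of the time.
def toy_target(prefix: List[int]) -> int:
    return (prefix[-1] * 3 + 1) % 11

def toy_draft(prefix: List[int]) -> int:
    t = toy_target(prefix)
    return t if len(prefix) % 4 else (t + 1) % 11

print(generate([0], toy_draft, toy_target, k=4, max_new=12))
```

Because every kept token is one the target itself would have picked, the speculative output matches plain greedy decoding token for token; only the number of target passes changes.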
The cost of this step differs between data centers and Mac computers.
On data center GPUs, verifying a batch of candidate tokens is more like chartering a bus — the price is flat regardless of how many passengers. Decoding is already memory-bound, so verifying a few more tokens hardly costs extra time.
On Apple Silicon, it's more like a metered taxi. The more candidate tokens you verify, the higher the meter goes.
Rahim measured that for Gemma-4 12B, each additional token verified costs about 14 milliseconds. He compiled this into a cost model and concluded that the speed ceiling on Apple Silicon is around 2.2x.
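To see why a per-token verification cost caps the speedup, here is a rough cost-model sketch. It is not Rahim's model and does not try to reproduce his 2.2x figure. It assumes each draft token is accepted independently with a fixed probability (a common simplification), borrows the 14 ms and 18.4 tok/s numbers from the article, and uses a made-up draft cost; the function names are illustrative.

```python
def expected_tokens_per_round(accept_rate: float, k: int) -> float:
    """Expected tokens emitted per round with k drafts and i.i.d. acceptance."""
    if accept_rate >= 1.0:
        return k + 1.0
    # Geometric series: 1 + a + a^2 + ... + a^k
    return (1.0 - accept_rate ** (k + 1)) / (1.0 - accept_rate)

def speedup(accept_rate: float, k: int, base_ms: float,
            verify_ms_per_token: float, draft_ms_per_token: float) -> float:
    """Tokens per millisecond with speculation, relative to plain decoding."""
    round_ms = base_ms + k * (verify_ms_per_token + draft_ms_per_token)
    return expected_tokens_per_round(accept_rate, k) * base_ms / round_ms

BASE_MS = 1000 / 18.4  # one plain decode step, from the 18.4 tok/s baseline
VERIFY_MS = 14.0       # cost per extra verified token, as reported
DRAFT_MS = 2.0         # hypothetical draft cost per token (not from the article)
ACCEPT = 0.82          # acceptance rate, treated as i.i.d. for illustration

for k in (1, 2, 4, 8, 16):
    flat = speedup(ACCEPT, k, BASE_MS, 0.0, DRAFT_MS)        # "chartered bus"
    metered = speedup(ACCEPT, k, BASE_MS, VERIFY_MS, DRAFT_MS)  # "metered taxi"
    print(k, round(flat, 2), round(metered, 2))
```

Under the flat-cost assumption, drafting more tokens keeps helping; once each verified token adds time, the speedup peaks at a small draft length and then falls off.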
Concretely, Rahim ported the smaller assistant (draft) model from its HuggingFace checkpoint and paired it with the two target models, Gemma-4 12B and Qwen3-4B.
He also reimplemented the verification process within the MLX framework and quantized the weights to 4-bit.
As a result, on the M4 Pro, compared to Apple's official MLX tools, the generation speed of Gemma-4 12B increased from 18.4 tok/s to approximately 30 tok/s, about 1.6x the original. Qwen3-4B increased from 52.9 tok/s to approximately 73 tok/s, about 1.4x the original.
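The quoted ratios follow directly from the throughput figures. A quick check, using only the numbers in the paragraph above (the "after" values are stated as approximate, so the ratios are approximate too):

```python
# Tokens per second on the M4 Pro, as quoted in the article.
baseline = {"Gemma-4 12B": 18.4, "Qwen3-4B": 52.9}
with_dspark = {"Gemma-4 12B": 30.0, "Qwen3-4B": 73.0}

ratios = {name: with_dspark[name] / baseline[name] for name in baseline}
for name, ratio in ratios.items():
    print(f"{name}: {ratio:.2f}x")
```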
Rahim also did something in mlx-dspark that most ports do not.
Ported Versions Can Also Achieve High-Precision Reproduction
Most ports that bring large models to local devices support only greedy decoding, which picks the highest-probability token at each step.
In mlx-dspark, Rahim implemented the temperature sampling method originally described in the DSpark paper. The draft model proposes candidate tokens, and each is accepted with probability min(1, p/q), where p is the target model's probability for the proposed token and q is the draft model's. A rejected token is resampled from the residual distribution.
He verified that output generated this way follows exactly the distribution the target model would produce at the same temperature, not an approximation of it.
Most speculative decoding implementations offer only a greedy version because greedy correctness is simple to verify: just compare token by token.
Rahim went a step further to verify the output distribution under sampling mode, confirming there was no distortion.
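The min(1, p/q) acceptance rule can be checked numerically on a toy vocabulary. The sketch below is the standard speculative sampling step for a single token, written from the general technique rather than from mlx-dspark's code; `p` is the target distribution, `q` is the draft distribution, and the residual is max(0, p - q) renormalized. The function names and the example distributions are illustrative.

```python
import random
from typing import List, Sequence

def speculative_sample(p: Sequence[float], q: Sequence[float],
                       rng: random.Random) -> int:
    """Draw one token distributed exactly as p, using a proposal from q."""
    vocab = range(len(p))
    # The draft proposes a token x sampled from q.
    x = rng.choices(vocab, weights=q, k=1)[0]
    # Accept it with probability min(1, p[x] / q[x]).
    if rng.random() < min(1.0, p[x] / q[x]):
        return x
    # Rejected: resample from the residual max(0, p - q).
    # random.choices normalizes the weights for us.
    residual = [max(0.0, pi - qi) for pi, qi in zip(p, q)]
    return rng.choices(vocab, weights=residual, k=1)[0]

def empirical_distribution(p: Sequence[float], q: Sequence[float],
                           n: int, seed: int = 0) -> List[float]:
    """Frequency of each token over n speculative draws."""
    rng = random.Random(seed)
    counts = [0] * len(p)
    for _ in range(n):
        counts[speculative_sample(p, q, rng)] += 1
    return [c / n for c in counts]

target_p = [0.6, 0.3, 0.1]   # illustrative target distribution
draft_q = [0.2, 0.5, 0.3]    # illustrative, deliberately mismatched draft
print(empirical_distribution(target_p, draft_q, n=100_000))
```

Even with a draft that disagrees badly with the target, the sampled frequencies settle on the target's distribution; a poor draft only lowers the acceptance rate, not the quality of the output.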
Which target model to verify against, and at what precision, was a pitfall he worked out through trial and error.
If the small model is paired with a base target model that hasn't been instruction-tuned, only 47% of the candidate tokens pass verification. Switching to the corresponding instruction-tuned version brought this ratio up to 82%.
He also tested using bf16 precision for the target model, but the increase in verification cost outweighs the gain in acceptance rate, making it slower. So keeping the target model at 8-bit is the most cost-effective default.
The small draft model, which proposes candidate tokens, uses a different precision.
Rahim compressed the draft model itself. After 4-bit quantization, it is only 1.8 GB, fits easily in memory, and runs losslessly.
As a result, DSpark not only achieves acceleration but also reproduces the 16% to 18% acceptance rate improvement mentioned in the paper on the device side.
DFlash Also Integrated, Code Tasks Faster
After the tweet was posted, a comment came in from Jian Chen, one of the authors of the DFlash paper, asking if Rahim could try their team's model.
DFlash is another speculative decoding scheme proposed in a paper published by z-lab in May. The team lead is Zhijian Liu, an assistant professor at UCSD and a research scientist at NVIDIA.
DFlash's approach differs from DSpark's. It uses parallel "block diffusion" to denoise an entire block of 16 tokens at once, rather than guessing with dependencies between positions as DSpark does.
Rahim acted quickly.
Using Jian's own porting script, he connected the gemma4-12B-it-DFlash released by z-lab to the Gemma-4 target model in mlx-vlm, and ran another head-to-head comparison on the same Mac against the DSpark he had just tested.
For code and math tasks, DFlash's block decoding achieved an acceptance length of 5.95 to 6.20, a speed of about 36 tok/s, and about 2.1x speedup, beating DSpark.
However, DFlash drafts an entire block of 16 tokens at once, and the target model may not approve all of them. The number of tokens that actually pass verification is what the industry calls "acceptance length," and it does not always reach the full 16.
So in scenarios like open-ended chat, where content is unpredictable, the acceptance length stays low, blocks go partly unused, and DFlash's advantage does not materialize.
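Acceptance length is easy to state in code. The sketch below counts how many drafted tokens at the start of a block match what the target would have produced; it is an illustration of the metric, not DFlash's implementation, and conventions differ on whether the target's own extra token is also counted (here it is not).

```python
from typing import Sequence

def acceptance_length(draft_block: Sequence[int],
                      target_tokens: Sequence[int]) -> int:
    """Number of leading draft tokens that the target agrees with."""
    n = 0
    for drafted, wanted in zip(draft_block, target_tokens):
        if drafted != wanted:
            break  # everything after the first mismatch is discarded
        n += 1
    return n

# A 16-token block where the draft goes wrong at position 6.
target = list(range(16))
draft = target[:6] + [99] + target[7:]
print(acceptance_length(draft, target))
```

Tokens after the first mismatch are wasted work even if they happen to match, which is why a long block pays off only when the content is predictable enough to keep the matching prefix long.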
DSpark's Markov head was designed to address exactly this issue. When a block of tokens is generated in parallel, later positions are computed independently and may not be coherent with each other. The Markov head adds a layer of dependency between these positions to correct this.
As a result, in chat scenarios, DSpark is actually faster than DFlash.
The updated mlx-dspark v0.0.3 integrates z-lab's original DFlash into the package and adds a parameter for manually shortening DFlash's effective block length: short blocks for chat, full 16-token blocks for code and math.
After this update, the same Mac and the same package can handle both chat and code/math tasks without needing to switch back and forth between DSpark and DFlash projects.
Rahim said in his tweet that the same method should also work on larger draft models like Qwen3-8B and 14B.
Source: Quantum Bit