DeepSeek's new technology has been ported to Apple chips! Local large language models on the Mac run up to 60% faster

DSpark had been open source for only a week when it was ported to the Mac.

The port is called mlx-dspark, and it runs the Gemma-4 12B and Qwen3-4B models.

With it installed, the two models generate text at about 1.6x and 1.4x their baseline speed on the Mac, respectively.

What's more impressive is that it achieves something most ports can't: output that is byte-for-byte identical to the original model's, without a single character of difference.

In other words, it gains speed without sacrificing quality at all.

The person behind this is Abdur Rahim, an engineer who tinkers with open-source projects in his spare time. He single-handedly created the first native Mac version since DSpark was open-sourced.

Running Large Models on Mac, 60% Faster

DeepSeek open-sourced DSpark on June 27, and the official figures claim a 60% to 85% speed boost in server-side scenarios.

However, this technology was only implemented on data center GPUs at the time, with no version adapted for Apple Silicon.

mlx-dspark is the first native version of this technology for Apple Silicon.

The idea behind DSpark is to pair a smaller model to assist the target model. The small model first spits out several candidate tokens, and the target model verifies them in one go. Accepted tokens are kept; rejected ones are sent back for re-speculation.
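The draft-and-verify loop described above can be sketched with toy models. This is a minimal greedy illustration, not mlx-dspark's code: `toy_target`, `toy_draft`, the block size `k`, and every function name are assumptions made up for the example, and a real system verifies all candidates in one batched forward pass rather than in a Python loop.

```python
from typing import Callable, List

# A toy "model" maps a token prefix to its greedy next token.
NextToken = Callable[[List[int]], int]


def plain_greedy(target: NextToken, prompt: List[int], max_new: int) -> List[int]:
    """Reference decoder: one target call per generated token."""
    out = list(prompt)
    for _ in range(max_new):
        out.append(target(out))
    return out


def speculative_greedy(target: NextToken, draft: NextToken,
                       prompt: List[int], max_new: int, k: int = 4) -> List[int]:
    """Draft k candidates, keep the prefix the target agrees with, then re-draft."""
    out = list(prompt)
    limit = len(prompt) + max_new
    while len(out) < limit:
        # 1. The small model proposes k candidate tokens.
        cand: List[int] = []
        for _ in range(k):
            cand.append(draft(out + cand))
        # 2. The target checks the candidates in order and stops at the first mismatch.
        accepted = 0
        while accepted < k and target(out + cand[:accepted]) == cand[accepted]:
            accepted += 1
        out.extend(cand[:accepted])
        # 3. The target supplies the next token itself (the correction after a
        #    mismatch, or a bonus token after a full accept).
        out.append(target(out))
    return out[:limit]


def toy_target(prefix: List[int]) -> int:
    return (sum(prefix) * 31 + 7) % 50


def toy_draft(prefix: List[int]) -> int:
    # Hypothetical draft: agrees with the target except at every third position.
    guess = toy_target(prefix)
    return guess if len(prefix) % 3 else (guess + 1) % 50


same = speculative_greedy(toy_target, toy_draft, [1, 2, 3], 20) == plain_greedy(toy_target, [1, 2, 3], 20)
print("identical to plain greedy decoding:", same)
```

Because every emitted token is either confirmed or produced by the target, the result matches plain greedy decoding token for token; the draft only changes how many target passes are needed.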

The cost of this step differs between data centers and Mac computers.

On data center GPUs, verifying a batch of candidate tokens is more like chartering a bus — the price is flat regardless of how many passengers. Decoding is already memory-bound, so verifying a few more tokens hardly costs extra time.

On Apple Silicon, it's more like a metered taxi. The more candidate tokens you verify, the higher the meter goes.

Rahim measured that for Gemma-4 12B, each additional token verified costs about 14 milliseconds. He compiled this into a cost model and concluded that the speed ceiling on Apple Silicon is around 2.2x.
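To see why a per-token verification cost caps the speedup, here is a rough sketch of such a cost model. It is not Rahim's model: the formula, the function name, and the `draft_ms` term are assumptions for illustration. Only the 14 ms per verified token and the 18.4 tok/s Gemma-4 12B baseline come from the article.

```python
def estimated_speedup(block_len: int, mean_accepted: float,
                      base_ms: float = 1000 / 18.4,
                      per_token_ms: float = 14.0,
                      draft_ms: float = 0.0) -> float:
    """Estimated speedup of one draft-and-verify cycle over plain decoding.

    Assumed model: a cycle costs one baseline target step, plus per_token_ms
    for each candidate verified, plus the time spent drafting. It yields the
    accepted candidates plus one token produced by the target itself.
    """
    if not 0 <= mean_accepted <= block_len:
        raise ValueError("mean_accepted must lie between 0 and block_len")
    cycle_ms = base_ms + block_len * per_token_ms + draft_ms
    tokens_per_cycle = mean_accepted + 1
    return tokens_per_cycle * base_ms / cycle_ms


# Longer blocks only pay off if enough of the extra candidates are accepted.
for block_len, accepted in [(4, 3.0), (8, 3.0), (8, 6.0)]:
    print(block_len, accepted, round(estimated_speedup(block_len, accepted), 2))
```

Under this assumed model, verifying more candidates at an unchanged acceptance length makes each cycle slower, which is the metered-taxi effect described above.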

Concretely, Rahim ported the smaller assistant model from the HuggingFace checkpoint and paired it with the two target models: Gemma-4 12B and Qwen3-4B.

He also reimplemented the verification process within the MLX framework and quantized the weights to 4-bit.

As a result, on the M4 Pro, compared to Apple's official MLX tools, the generation speed of Gemma-4 12B increased from 18.4 tok/s to approximately 30 tok/s, about 1.6x the original. Qwen3-4B increased from 52.9 tok/s to approximately 73 tok/s, about 1.4x the original.

Additionally, in mlx-dspark, Rahim did something most porting work doesn't.

Ported Versions Can Also Achieve High-Precision Reproduction

Most versions that bring large models to local devices only support greedy decoding, meaning they pick the token with the highest probability at each step.

In mlx-dspark, Rahim implemented the temperature sampling method originally described in the DSpark paper. The draft model proposes candidate tokens, and the acceptance probability is min(1, p/q). The unaccepted parts are resampled from the residual.

He verified that the output generated by this process strictly equals the exact distribution that the target model would produce at the same temperature — not a compromised approximation.

Most speculative decoding implementations ship only a greedy version because verifying greedy correctness is simple: just compare token by token.

Rahim went a step further to verify the output distribution under sampling mode, confirming there was no distortion.
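The claim that accept-or-resample reproduces the target distribution can be checked exactly for a single token. The sketch below is an illustration rather than Rahim's verification: the two distributions are made-up values, `p` stands for the target model's probabilities and `q` for the draft's, and the residual is taken to be max(p - q, 0) renormalized, as in standard speculative sampling.

```python
from fractions import Fraction
from typing import Dict

Dist = Dict[str, Fraction]


def speculative_output_dist(p: Dist, q: Dist) -> Dist:
    """Exact distribution of one token emitted by accept/reject plus residual resampling.

    p: target distribution, q: draft distribution, both over the same tokens.
    """
    # The draft proposes x with probability q(x) and it is accepted with
    # probability min(1, p(x)/q(x)); the product simplifies to min(p(x), q(x)).
    accepted = {x: min(p[x], q[x]) for x in p}
    reject_mass = Fraction(1) - sum(accepted.values())
    # On rejection, resample from the residual max(p - q, 0), renormalized.
    residual = {x: max(p[x] - q[x], Fraction(0)) for x in p}
    norm = sum(residual.values())
    out = {}
    for x in p:
        resampled = reject_mass * residual[x] / norm if norm else Fraction(0)
        out[x] = accepted[x] + resampled
    return out


# Made-up example distributions over three tokens.
p = {"a": Fraction(1, 2), "b": Fraction(3, 10), "c": Fraction(1, 5)}
q = {"a": Fraction(1, 5), "b": Fraction(1, 2), "c": Fraction(3, 10)}
print(speculative_output_dist(p, q) == p)
```

Using exact fractions avoids floating-point noise, so the equality holds exactly rather than within a tolerance.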

Which target model to use for verification, and at what precision, was a pitfall he discovered through trial and error.

If the small model is paired with a base target model that hasn't been instruction-tuned, only 47% of the candidate tokens pass verification. Switching to the corresponding instruction-tuned version brought this ratio up to 82%.
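A rough way to see what that gap means for throughput: if the figures are read as per-token acceptance probabilities and each candidate is assumed to be accepted independently, the expected number of tokens per draft-and-verify cycle follows the standard geometric-series estimate. The independence assumption, the block size `k = 4`, and the function name are choices made for this sketch, not details from the article.

```python
def expected_tokens_per_cycle(alpha: float, k: int) -> float:
    """Expected tokens emitted per cycle with k drafted candidates.

    Assumes each candidate is accepted independently with probability alpha
    and that the target always contributes one token of its own:
    1 + alpha + alpha**2 + ... + alpha**k.
    """
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must be a probability")
    if alpha == 1.0:
        return float(k + 1)
    return (1 - alpha ** (k + 1)) / (1 - alpha)


for alpha in (0.47, 0.82):
    print(alpha, round(expected_tokens_per_cycle(alpha, k=4), 2))
```

Under these assumptions the instruction-tuned pairing yields noticeably more tokens per target pass, which is why matching the draft to the right target matters more than the raw percentages suggest.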

He also tested using bf16 precision for the target model, but the increase in verification cost outweighs the gain in acceptance rate, making it slower. So keeping the target model at 8-bit is the most cost-effective default.

The small draft model, which proposes candidate tokens, uses a different precision.

Rahim compressed the draft model itself. After 4-bit quantization, it is only 1.8 GB, fits easily in memory, and runs losslessly.

As a result, the port not only achieves the acceleration but also reproduces, on-device, the 16% to 18% acceptance rate improvement reported in the paper.

DFlash Also Integrated, Code Tasks Faster

After Rahim tweeted about the port, Jian Chen, one of the authors of the DFlash paper, replied asking whether Rahim could try their team's model.

DFlash is another speculative decoding scheme proposed in a paper published by z-lab in May. The team lead is Zhijian Liu, an assistant professor at UCSD and a research scientist at NVIDIA.

DFlash's approach differs from DSpark's. It uses parallel "block diffusion" to denoise an entire block of 16 tokens at once, rather than guessing step by step with dependencies between positions, as DSpark does.

Rahim acted quickly.

Using Jian's own porting script, he connected the gemma4-12B-it-DFlash released by z-lab to the Gemma-4 target model in mlx-vlm, and ran another head-to-head comparison on the same Mac against the DSpark he had just tested.

For code and math tasks, DFlash's block decoding achieved an acceptance length of 5.95 to 6.20, a speed of about 36 tok/s, and about 2.1x speedup, beating DSpark.

However, although DFlash attempts to generate an entire block of 16 tokens at once, the target model may not approve all of them. Only a portion actually passes verification. The industry calls the number of tokens that pass the "acceptance length", and it does not always reach the full 16.

So in scenarios like open-ended chat, where content is unpredictable, the acceptance length stays low, blocks are rarely filled, and DFlash's advantage does not materialize.
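As a concrete reading of "acceptance length", the sketch below counts how many leading tokens of a drafted block match what the target would have produced, under greedy verification. The function names and token values are made up for illustration, and some papers also count the extra token the target contributes, which this sketch does not.

```python
from typing import List, Sequence, Tuple


def acceptance_length(draft_block: Sequence[int], target_tokens: Sequence[int]) -> int:
    """Number of leading draft tokens that match the target's own choices."""
    accepted = 0
    for drafted, wanted in zip(draft_block, target_tokens):
        if drafted != wanted:
            break  # everything after the first mismatch is discarded
        accepted += 1
    return accepted


def mean_acceptance_length(pairs: List[Tuple[Sequence[int], Sequence[int]]]) -> float:
    """Average acceptance length over several (draft block, target tokens) pairs."""
    lengths = [acceptance_length(draft, target) for draft, target in pairs]
    return sum(lengths) / len(lengths)


# A predictable block is accepted whole; an unpredictable one is cut short early.
predictable = (list(range(16)), list(range(16)))
unpredictable = ([1, 2, 3, 4], [1, 2, 9, 4])
print(acceptance_length(*predictable), acceptance_length(*unpredictable))
```

A token that matches after the first mismatch does not count, since the target's context has already diverged from the draft at that point.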

DSpark's Markov head was designed precisely to address this issue. When a block of tokens is generated in parallel, later positions are computed independently and may not be coherent with each other. The Markov head adds a layer of dependency between these positions to correct this.

As a result, in chat scenarios, DSpark is actually faster than DFlash.

The updated mlx-dspark v0.0.3 officially integrated z-lab's original DFlash into the package and added a parameter to manually shorten the effective block length of DFlash. Use short blocks for chat, and still use full 16-token blocks for code and math.

After this update, the same Mac and the same package can handle both chat and code/math tasks without needing to switch back and forth between DSpark and DFlash projects.

Rahim said in his tweet that the same method should also work on larger draft models like Qwen3-8B and 14B.

Source: Quantum Bit
