Ollama Rebuilds Apple Silicon Inference Engine with MLX: Decode Speed Nearly Doubles, Compatible with Claude Code

According to 1M AI News, Ollama has released preview version 0.19, which rebuilds its inference engine on Apple Silicon using Apple's machine learning framework MLX. The new engine takes advantage of the unified memory architecture and the neural accelerators built into the GPU cores of the M5/M5 Pro/M5 Max chips, improving both time to first token and generation speed.

In benchmarks run on March 29 with the Qwen3.5-35B-A3B model (NVIDIA NVFP4 quantization) on M5-series chips, prefill speed rose from 1154 tokens/s to 1810 tokens/s, and decode speed climbed from 58 tokens/s to 112 tokens/s, nearly doubling. Switching to int4 precision pushed prefill further to 1851 tokens/s, with decode reaching 134 tokens/s.

Version 0.19 also adds support for the NVIDIA NVFP4 quantization format. NVFP4 reduces memory bandwidth and storage requirements while preserving model accuracy; it is compatible with models optimized by NVIDIA Model Optimizer and matches the formats used by major cloud inference providers.

The caching system has been upgraded as well: it now supports cross-session reuse (with tools such as Claude Code, shared system prompts yield higher cache hit rates), stores snapshots at key positions within prompts to cut redundant processing, and applies smarter cache eviction policies.

The preview requires a Mac with more than 32GB of unified memory. The model tuned for programming tasks is Qwen3.5-35B-A3B, which can be used with Claude Code via 'ollama launch claude --model qwen3.5:35b-a3b-coding-nvfp4'.
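To make the NVFP4 idea concrete, here is a minimal, illustrative sketch of block-scaled 4-bit floating-point quantization in Python. This is not Ollama's or NVIDIA's actual kernel: the real format packs two FP4 (E2M1) values per byte and stores compact block scales, while this sketch only shows the core mechanism of snapping per-block scaled values onto the small set of FP4-representable magnitudes.

```python
# Illustrative sketch of NVFP4-style block quantization (not the real
# packed implementation). FP4 E2M1 can represent only these magnitudes:
FP4_GRID = [0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]
BLOCK = 16  # NVFP4 groups values into small blocks, each with its own scale

def quantize_block(block):
    """Scale the block so its max magnitude maps to 6.0 (the FP4 max),
    then snap each value to the nearest representable FP4 magnitude."""
    amax = max(abs(x) for x in block) or 1.0
    scale = amax / 6.0
    quantized = []
    for x in block:
        mag = min(FP4_GRID, key=lambda g: abs(abs(x) / scale - g))
        quantized.append(mag if x >= 0 else -mag)
    return scale, quantized

def dequantize_block(scale, quantized):
    """Recover approximate originals by re-applying the block scale."""
    return [scale * v for v in quantized]

weights = [0.11, -0.42, 0.93, 0.05, -1.20, 0.70, 0.33, -0.08,
           0.50, 0.27, -0.61, 1.02, -0.19, 0.44, 0.88, -0.95]
scale, q = quantize_block(weights)
recon = dequantize_block(scale, q)
max_err = max(abs(a - b) for a, b in zip(weights, recon))
```

Each original value needs only 4 bits plus a shared per-block scale, which is where the memory-bandwidth and storage savings come from; the reconstruction error stays bounded by the spacing of the FP4 grid at the block's scale.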
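The cross-session cache reuse described above can be sketched as a prefix cache: snapshots of computed state are keyed by a hash of the token prefix, so a new request that shares a system prompt only needs to process the unseen suffix. The class and method names below are invented for illustration and this is not Ollama's actual implementation; the eviction policy here is a plain LRU stand-in for the "smarter eviction strategies" the release notes mention.

```python
# Hypothetical sketch of prefix-based prompt caching with LRU eviction.
# Not Ollama's real code; it only demonstrates the reuse mechanism.
import hashlib

class PrefixCache:
    def __init__(self, max_entries=4):
        self.snapshots = {}   # prefix hash -> (prefix_len, cached state)
        self.order = []       # recency list, oldest first, for eviction
        self.max_entries = max_entries

    @staticmethod
    def _key(tokens):
        return hashlib.sha256(" ".join(map(str, tokens)).encode()).hexdigest()

    def store(self, tokens, state):
        """Snapshot the state computed after processing `tokens`."""
        k = self._key(tokens)
        if k not in self.snapshots:
            if len(self.snapshots) >= self.max_entries:
                oldest = self.order.pop(0)      # evict least recently used
                del self.snapshots[oldest]
            self.order.append(k)
        self.snapshots[k] = (len(tokens), state)

    def longest_prefix(self, tokens):
        """Find the longest cached prefix of `tokens`; the caller then
        only has to process tokens[prefix_len:]."""
        for n in range(len(tokens), 0, -1):
            k = self._key(tokens[:n])
            if k in self.snapshots:
                self.order.remove(k)            # refresh recency
                self.order.append(k)
                return self.snapshots[k]
        return 0, None
```

With a shared system prompt cached once, a second Claude Code session that starts with the same prompt gets a full prefix hit and skips straight to its new user turn, which is exactly the kind of reuse the upgraded cache targets.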
