llama.cpp officially supports WebGPU, cutting browser-side inference memory usage by about 30%


ME AI: According to Beating Monitoring, the official WebGPU backend for llama.cpp and ggml has been released, enabling large models in GGUF format to run directly in the browser with local GPU acceleration. The new backend removes the dependence on specific native clients or complex WebAssembly architectures and keeps inference on the device, so no data leaves the machine. For the web ecosystem, this opens a zero-configuration entry point to local compute power.
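As a rough illustration of what "running in the browser with local GPU acceleration" requires from a page, the sketch below checks whether WebGPU is usable before choosing a backend. `navigator.gpu.requestAdapter()` is the standard WebGPU entry point; the `pickBackend` helper, the `NavigatorLike` types, and the `"cpu"` fallback label are hypothetical names introduced here and are not part of llama.cpp or wllama.

```typescript
// Minimal structural types so the sketch compiles without @webgpu/types.
// They mirror only the one WebGPU call used below.
interface GpuLike {
  requestAdapter(): Promise<object | null>;
}
interface NavigatorLike {
  gpu?: GpuLike;
}

// Hypothetical backend labels for illustration only.
type Backend = "webgpu" | "cpu";

// Returns "webgpu" only when the browser exposes navigator.gpu AND
// hands back a non-null adapter; otherwise falls back to a CPU path.
async function pickBackend(nav: NavigatorLike): Promise<Backend> {
  if (!nav.gpu) return "cpu"; // browser has no WebGPU support at all
  try {
    const adapter = await nav.gpu.requestAdapter();
    return adapter ? "webgpu" : "cpu"; // null means no suitable GPU was found
  } catch (_err) {
    return "cpu"; // treat any failure as "no GPU available"
  }
}
```

In a real page the argument would be the global `navigator` object; passing it in as a parameter keeps the check easy to exercise outside a browser.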

A related paper published on May 20 reports that the WebGPU backend introduces static memory planning and an efficient model-loading mechanism, cutting runtime GPU memory overhead on the web by 29% to 33% compared with existing frameworks. On mainstream GPUs from Intel, Apple, and NVIDIA, average decoding throughput improves by 45% to 69%.
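The article does not describe how the paper's planner works, so the following is only a generic sketch of the idea behind static memory planning: if every tensor's size and lifetime are known before inference starts, tensors that are never alive at the same time can share the same bytes of one preallocated pool. The greedy first-fit strategy, the `TensorLife` record, and the example sizes are all assumptions made for illustration, not the backend's actual method.

```typescript
// Hypothetical tensor record: size in bytes plus the first and last
// graph step (inclusive) during which the tensor must stay resident.
interface TensorLife {
  name: string;
  bytes: number;
  firstUse: number;
  lastUse: number;
}

interface Placement {
  name: string;
  offset: number;
  bytes: number;
}

// Greedy first-fit planner: place larger tensors first, and let two
// tensors share bytes only if their lifetimes never overlap.
function planStatic(tensors: TensorLife[]): { placements: Placement[]; poolBytes: number } {
  const order = [...tensors].sort((a, b) => b.bytes - a.bytes);
  const placed: (TensorLife & { offset: number })[] = [];
  for (const t of order) {
    // Regions already claimed by tensors alive at the same time as t.
    const busy = placed
      .filter((p) => p.firstUse <= t.lastUse && t.firstUse <= p.lastUse)
      .sort((a, b) => a.offset - b.offset);
    let offset = 0;
    for (const p of busy) {
      if (offset + t.bytes <= p.offset) break; // the gap before p is big enough
      offset = Math.max(offset, p.offset + p.bytes); // otherwise skip past p
    }
    placed.push({ ...t, offset });
  }
  const poolBytes = placed.reduce((m, p) => Math.max(m, p.offset + p.bytes), 0);
  const placements = placed.map(({ name, offset, bytes }) => ({ name, offset, bytes }));
  return { placements, poolBytes };
}

// Made-up example: four activations, 1200 bytes if each gets its own buffer.
const plan = planStatic([
  { name: "a", bytes: 400, firstUse: 0, lastUse: 1 },
  { name: "b", bytes: 300, firstUse: 1, lastUse: 2 },
  { name: "c", bytes: 400, firstUse: 2, lastUse: 3 },
  { name: "d", bytes: 100, firstUse: 3, lastUse: 4 },
]);
console.log(plan.poolBytes); // 700
```

Because the whole layout is computed up front, the pool can be allocated once and reused for every decoding step, which is the general reason a planner like this can lower peak GPU memory.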

The web demo is built on the open-source library wllama. Recent low-level optimizations have achieved tighter GPU memory control than the paper describes. llama.cpp can also be compiled natively against Dawn, Google's C++ implementation of WebGPU, which provides a baseline for benchmarking Vulkan against WebGPU.

(Source: BlockBeats)

Comment
UnderTheGlassDome
· 8h ago
ggml's WebGPU adaptation is really detailed; a 29% reduction in VRAM usage is impressive.
StargazerInTheWoods
· 9h ago
Is Safari support the biggest obstacle to WebGPU adoption?
MountainBeforeTheStorm
· 9h ago
Pure client-side inference means I finally don't have to upload my chat history to the cloud.
OwlMarketMonitoringLamp
· 10h ago
Local large models can finally run in the browser. Privacy enthusiasts are ecstatic.
BridgeHopRanger
· 10h ago
From now on, Chrome will be my AI IDE.
APuppyInTheWarmSun
· 10h ago
A 45-69% throughput improvement is a qualitative change for the web experience.
LpGrandma
· 10h ago
GGUF format plus WebGPU: the llama.cpp ecosystem is becoming more complete.
AirdropArchivist
· 10h ago
With this release pace, the llama.cpp team really isn't sleeping.
RetroRadioEcho
· 10h ago
Static memory planning: the term alone sounds like it saves VRAM.
ReboundAtTheStreetCornerAfter
· 10h ago
Native compilation through Dawn leaves a back door open for hardcore users. Kudos.