The official WebGPU backend for llama.cpp and ggml has been released, letting browsers run GGUF-format large models with local GPU acceleration entirely on the client side, so data never leaves the device and private inference needs no configuration. A related paper reports that static memory planning and efficient model loading cut GPU memory usage on the web by 29% to 33% and raise average decoding throughput on Intel, Apple, and NVIDIA devices by 45% to 69%. The web demo is built on wllama, and recent low-level optimizations achieve better memory control than the paper describes. llama.cpp can also be compiled natively with Dawn, Google's C++ WebGPU implementation, which provides a baseline for benchmarking Vulkan against WebGPU.

MeNews

2026-05-22 13:03:46

ME AI According to monitoring by Beating, the official WebGPU backend for llama.cpp and ggml has been released, enabling GGUF-format large models to run directly in the browser with local GPU acceleration. The new backend removes the dependence on a specific native client or a complex WebAssembly setup and keeps inference on the device, so no data leaves it. This gives the web ecosystem a zero-configuration entry point to local compute power.
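
As a rough illustration of what "zero-configuration" could look like from a web page's side, the sketch below checks whether the browser exposes WebGPU and otherwise falls back to a CPU path. `navigator.gpu` and `requestAdapter()` are part of the standard WebGPU API. The `NavigatorLike` and `GpuLike` types, the `Backend` labels, and both function names are assumptions made for this example and are not taken from llama.cpp or wllama.

```typescript
// Minimal structural types so the sketch compiles without extra type packages.
// These are illustrative stand-ins, not the full WebGPU typings.
interface GpuLike {
  requestAdapter(): Promise<unknown | null>;
}
interface NavigatorLike {
  gpu?: GpuLike;
}

// Hypothetical labels for the two execution paths a page might choose between.
type Backend = "webgpu" | "cpu";

// Synchronous check: is the WebGPU entry point exposed at all?
function hasWebGPUApi(nav: NavigatorLike): boolean {
  return nav.gpu !== undefined && typeof nav.gpu.requestAdapter === "function";
}

// Asynchronous check: the API can exist while no usable adapter is available,
// in which case requestAdapter() resolves to null.
async function chooseBackend(nav: NavigatorLike): Promise<Backend> {
  if (!hasWebGPUApi(nav)) return "cpu";
  try {
    const adapter = await nav.gpu!.requestAdapter();
    return adapter ? "webgpu" : "cpu";
  } catch {
    // Treat any failure while probing the adapter as "no GPU path".
    return "cpu";
  }
}
```

In a browser, a page might call `chooseBackend(navigator as unknown as NavigatorLike)` before loading a model. Passing the navigator in as a parameter, rather than reading the global directly, keeps the check easy to exercise outside a browser.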

A related paper published on May 20 reports that the WebGPU backend introduces static memory planning and an efficient model-loading mechanism, reducing runtime GPU memory overhead on the web by 29% to 33% compared with existing frameworks. On GPUs from mainstream vendors such as Intel, Apple, and NVIDIA, average decoding throughput improves by 45% to 69%.
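
The article does not describe the paper's planner, so the following is only a generic sketch of the idea behind static memory planning: when the lifetime of every intermediate tensor is known before execution, tensors that are never live at the same time can share one region of a single preallocated pool. The `TensorLife` shape, the `planStatic` function, and the greedy first-fit strategy are assumptions chosen for illustration, not the backend's actual algorithm.

```typescript
// Hypothetical description of an intermediate tensor: its size and the first
// and last graph step (inclusive) during which it must stay in memory.
interface TensorLife {
  name: string;
  bytes: number;
  firstUse: number;
  lastUse: number;
}

interface Placement {
  name: string;
  offset: number;
  bytes: number;
}

// Greedy first-fit planner: place larger tensors first, each at the lowest
// offset that does not overlap an already placed tensor live at the same time.
function planStatic(tensors: TensorLife[]): { placements: Placement[]; poolBytes: number } {
  const order = [...tensors].sort((a, b) => b.bytes - a.bytes);
  const placed: (TensorLife & Placement)[] = [];
  for (const t of order) {
    // Regions held by tensors whose lifetimes intersect t's lifetime.
    const busy = placed
      .filter(p => p.firstUse <= t.lastUse && t.firstUse <= p.lastUse)
      .sort((a, b) => a.offset - b.offset);
    let offset = 0;
    for (const p of busy) {
      if (offset + t.bytes <= p.offset) break; // the gap before p is large enough
      offset = Math.max(offset, p.offset + p.bytes);
    }
    placed.push({ ...t, offset });
  }
  // The pool must reach the end of the highest placed region.
  const poolBytes = placed.reduce((max, p) => Math.max(max, p.offset + p.bytes), 0);
  const placements = placed.map(({ name, offset, bytes }) => ({ name, offset, bytes }));
  return { placements, poolBytes };
}

// Made-up sizes and lifetimes, purely to exercise the planner.
const demoTensors: TensorLife[] = [
  { name: "a", bytes: 400, firstUse: 0, lastUse: 1 },
  { name: "b", bytes: 300, firstUse: 1, lastUse: 2 },
  { name: "c", bytes: 400, firstUse: 2, lastUse: 3 },
  { name: "d", bytes: 100, firstUse: 3, lastUse: 4 },
];
const demoPlan = planStatic(demoTensors);
// Baseline for comparison: one separate buffer per tensor.
const demoNaiveBytes = demoTensors.reduce((sum, t) => sum + t.bytes, 0);
```

For these made-up tensors, separate buffers would total 1200 bytes while the planned pool needs 700, because `a` and `c` share one region and `b` and `d` share another. The numbers only demonstrate the mechanism and say nothing about the 29% to 33% figure reported in the paper.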

The web demo is built on the open-source library wllama. Recent low-level optimizations give it tighter GPU memory control than the paper describes. llama.cpp can also be compiled natively against Dawn, Google’s C++ WebGPU implementation, which provides a baseline for benchmarking Vulkan against WebGPU.

(Source: BlockBeats)

UnderTheGlassDome

· 8h ago

ggml's WebGPU adaptation is really detailed; a 29% reduction in VRAM usage is impressive.

StargazerInTheWoods

· 9h ago

Is the biggest obstacle to the popularization of WebGPU Safari support?

MountainBeforeTheStorm

· 9h ago

Pure client-side inference means I finally don't have to upload my chat history to the cloud.

OwlMarketMonitoringLamp

· 10h ago

Finally able to run local large models in the browser, privacy enthusiasts are ecstatic

BridgeHopRanger

· 10h ago

From now on, Chrome will be my AI IDE.

APuppyInTheWarmSun

· 10h ago

45-69% throughput improvement, a qualitative change in web experience

LpGrandma

· 10h ago

GGUF format + WebGPU, the llama.cpp ecosystem is becoming more complete

AirdropArchivist

· 10h ago

With this release pace, the llama.cpp team really isn't sleeping.

RetroRadioEcho

· 10h ago

Static memory planning, this technical term sounds like it saves VRAM.

ReboundAtTheStreetCornerAfter

· 10h ago

Dawn's compilation leaves a backdoor for hardcore players, praise.
