Running large models in the browser no longer depends on cloud providers; the local GPU can now do the work directly.

MeNews
llama.cpp officially supports WebGPU, cutting browser-side inference memory usage by about 30%
The official WebGPU backend for llama.cpp and ggml has been released. It lets browsers run GGUF large models with local GPU acceleration entirely on the client side, so data never leaves the device and private inference needs no setup. According to the paper, static memory planning and efficient loading reduce in-page VRAM usage by 29–33%, and decoding throughput improves by 45–69% across Intel, Apple, and NVIDIA devices. The demo is built on wllama, and its underlying optimizations reportedly go beyond what the paper anticipated. The backend can also be compiled natively with Dawn, Google's C++ WebGPU implementation, and benchmarks comparing Vulkan and WebGPU are provided.
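For readers who want to check whether their own browser can use a WebGPU path at all, here is a minimal TypeScript sketch of feature detection. It relies only on the standard `navigator.gpu.requestAdapter()` entry point. The names `NavigatorLike`, `hasWebGpuApi`, and `chooseBackend` are hypothetical helpers introduced for illustration, and the fall-back to a CPU path is an assumption, not a description of how the wllama demo or llama.cpp actually selects a backend.

```typescript
// Minimal structural types so the sketch compiles without @webgpu/types.
// These are illustrative stand-ins, not part of any library.
interface GpuLike {
  requestAdapter(): Promise<object | null>;
}
interface NavigatorLike {
  gpu?: GpuLike;
}

type Backend = "webgpu" | "cpu";

// Synchronous check: does the environment expose the WebGPU entry point?
function hasWebGpuApi(nav: NavigatorLike): boolean {
  return typeof nav.gpu?.requestAdapter === "function";
}

// Asynchronous check: the API can exist while no usable adapter is
// available, in which case requestAdapter() resolves to null.
async function chooseBackend(nav: NavigatorLike): Promise<Backend> {
  if (!hasWebGpuApi(nav)) return "cpu";
  try {
    const adapter = await nav.gpu!.requestAdapter();
    return adapter ? "webgpu" : "cpu";
  } catch {
    // Treat any failure while requesting an adapter as "no GPU path".
    return "cpu";
  }
}
```

In a page this could be called as `chooseBackend(navigator as unknown as NavigatorLike)`; passing the navigator object in as a parameter keeps the check easy to exercise outside a browser.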